style: format all files with prettier

Author: Seth Hobson
Date: 2026-01-19 17:07:03 -05:00
parent 8d37048deb
commit 56848874a2

355 changed files with 15215 additions and 10241 deletions

View File

@@ -20,12 +20,12 @@ Production-ready patterns for Apache Airflow including DAG design, operators, se
### 1. DAG Design Principles
-| Principle | Description |
-|-----------|-------------|
-| **Idempotent** | Running twice produces same result |
-| **Atomic** | Tasks succeed or fail completely |
-| **Incremental** | Process only new/changed data |
-| **Observable** | Logs, metrics, alerts at every step |
+| Principle       | Description                         |
+| --------------- | ----------------------------------- |
+| **Idempotent**  | Running twice produces same result  |
+| **Atomic**      | Tasks succeed or fail completely    |
+| **Incremental** | Process only new/changed data       |
+| **Observable**  | Logs, metrics, alerts at every step |
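
For a concrete sense of the first two principles, here is a minimal TaskFlow sketch of an idempotent, incremental daily load; the connection id and table names are hypothetical:

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def incremental_load():
    @task
    def load_partition(ds=None):  # Airflow injects the logical date as `ds`
        from airflow.providers.postgres.hooks.postgres import PostgresHook

        hook = PostgresHook(postgres_conn_id="warehouse")  # hypothetical conn id
        # Idempotent: delete-then-insert means a retry produces the same rows.
        hook.run("DELETE FROM events_daily WHERE event_date = %s", parameters=(ds,))
        # Incremental: only the partition for this logical date is processed.
        hook.run(
            "INSERT INTO events_daily SELECT * FROM events_raw WHERE event_date = %s",
            parameters=(ds,),
        )

    load_partition()

incremental_load()
```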
### 2. Task Dependencies
@@ -503,6 +503,7 @@ airflow/
## Best Practices
### Do's
- **Use TaskFlow API** - Cleaner code, automatic XCom
- **Set timeouts** - Prevent zombie tasks
- **Use `mode='reschedule'`** - For sensors, free up workers
@@ -510,6 +511,7 @@ airflow/
- **Idempotent tasks** - Safe to retry
### Don'ts
- **Don't use `depends_on_past=True`** - Creates bottlenecks
- **Don't hardcode dates** - Use `{{ ds }}` macros
- **Don't use global state** - Tasks should be stateless
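
As a sketch of the sensor guidance above, a file sensor in reschedule mode with an explicit timeout; the landing path is illustrative:

```python
from airflow.sensors.filesystem import FileSensor

# Inside a DAG definition:
wait_for_file = FileSensor(
    task_id="wait_for_landing_file",
    filepath="/data/landing/{{ ds }}/events.csv",  # illustrative path, templated via the ds macro
    mode="reschedule",    # frees the worker slot between pokes
    poke_interval=300,    # check every 5 minutes
    timeout=6 * 60 * 60,  # fail after 6 hours rather than poking forever
)
```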

View File

@@ -20,14 +20,14 @@ Production patterns for implementing data quality with Great Expectations, dbt t
### 1. Data Quality Dimensions
-| Dimension | Description | Example Check |
-|-----------|-------------|---------------|
-| **Completeness** | No missing values | `expect_column_values_to_not_be_null` |
-| **Uniqueness** | No duplicates | `expect_column_values_to_be_unique` |
-| **Validity** | Values in expected range | `expect_column_values_to_be_in_set` |
-| **Accuracy** | Data matches reality | Cross-reference validation |
-| **Consistency** | No contradictions | `expect_column_pair_values_A_to_be_greater_than_B` |
-| **Timeliness** | Data is recent | `expect_column_max_to_be_between` |
+| Dimension        | Description              | Example Check                                      |
+| ---------------- | ------------------------ | -------------------------------------------------- |
+| **Completeness** | No missing values        | `expect_column_values_to_not_be_null`              |
+| **Uniqueness**   | No duplicates            | `expect_column_values_to_be_unique`                |
+| **Validity**     | Values in expected range | `expect_column_values_to_be_in_set`                |
+| **Accuracy**     | Data matches reality     | Cross-reference validation                         |
+| **Consistency**  | No contradictions        | `expect_column_pair_values_A_to_be_greater_than_B` |
+| **Timeliness**   | Data is recent           | `expect_column_max_to_be_between`                  |
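
As a rough illustration of the first three dimensions, a sketch using Great Expectations' classic pandas-backed API (newer releases expose a different entry point); the source file is illustrative:

```python
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_csv("orders.csv"))  # illustrative source file

# Completeness and uniqueness on the primary key
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")

# Validity: status must come from a closed set
df.expect_column_values_to_be_in_set(
    "order_status", ["pending", "processing", "shipped", "delivered", "cancelled"]
)

results = df.validate()
print(results.success)
```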
### 2. Testing Pyramid for Data
@@ -191,7 +191,7 @@ validations:
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: orders
      data_connector_query:
-        index: -1  # Latest batch
+        index: -1 # Latest batch
    expectation_suite_name: orders_suite
    action_list:
@@ -270,7 +270,8 @@ models:
      - name: order_status
        tests:
          - accepted_values:
-              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
+              values:
+                ["pending", "processing", "shipped", "delivered", "cancelled"]
      - name: total_amount
        tests:
@@ -566,6 +567,7 @@ if not all(r.passed for r in results.values()):
## Best Practices
### Do's
- **Test early** - Validate source data before transformations
- **Test incrementally** - Add tests as you find issues
- **Document expectations** - Clear descriptions for each test
@@ -573,6 +575,7 @@ if not all(r.passed for r in results.values()):
- **Version contracts** - Track schema changes
### Don'ts
- **Don't test everything** - Focus on critical columns
- **Don't ignore warnings** - They often precede failures
- **Don't skip freshness** - Stale data is bad data
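
One way to honor both "test early" and "don't ignore warnings" is a gate that fails hard on errors while still surfacing warn-level results; the result shape here is a hypothetical stand-in:

```python
import logging

def enforce_quality_gate(results: dict) -> None:
    # `results` maps check name -> object with hypothetical .passed / .severity fields
    failures = [n for n, r in results.items() if r.severity == "error" and not r.passed]
    warnings = [n for n, r in results.items() if r.severity == "warn" and not r.passed]
    for name in warnings:
        logging.warning("quality check %s failed at warn severity", name)
    if failures:
        raise RuntimeError(f"quality gate failed: {failures}")
```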

View File

@@ -32,19 +32,19 @@ marts/ Final analytics tables
### 2. Naming Conventions
-| Layer | Prefix | Example |
-|-------|--------|---------|
-| Staging | `stg_` | `stg_stripe__payments` |
-| Intermediate | `int_` | `int_payments_pivoted` |
-| Marts | `dim_`, `fct_` | `dim_customers`, `fct_orders` |
+| Layer        | Prefix         | Example                       |
+| ------------ | -------------- | ----------------------------- |
+| Staging      | `stg_`         | `stg_stripe__payments`        |
+| Intermediate | `int_`         | `int_payments_pivoted`        |
+| Marts        | `dim_`, `fct_` | `dim_customers`, `fct_orders` |
## Quick Start
```yaml
# dbt_project.yml
-name: 'analytics'
-version: '1.0.0'
-profile: 'analytics'
+name: "analytics"
+version: "1.0.0"
+profile: "analytics"
model-paths: ["models"]
analysis-paths: ["analyses"]
@@ -53,7 +53,7 @@ seed-paths: ["seeds"]
macro-paths: ["macros"]
vars:
-  start_date: '2020-01-01'
+  start_date: "2020-01-01"
models:
analytics:
@@ -107,8 +107,8 @@ sources:
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
-      warn_after: {count: 12, period: hour}
-      error_after: {count: 24, period: hour}
+      warn_after: { count: 12, period: hour }
+      error_after: { count: 24, period: hour }
    tables:
      - name: customers
        description: Stripe customer records
@@ -409,7 +409,7 @@ models:
        description: Customer value tier based on lifetime value
        tests:
          - accepted_values:
-              values: ['high', 'medium', 'low']
+              values: ["high", "medium", "low"]
      - name: lifetime_value
        description: Total amount paid by customer
@@ -540,6 +540,7 @@ dbt ls --select tag:critical # List models by tag
## Best Practices
### Do's
- **Use staging layer** - Clean data once, use everywhere
- **Test aggressively** - Not null, unique, relationships
- **Document everything** - Column descriptions, model descriptions
@@ -547,6 +548,7 @@ dbt ls --select tag:critical # List models by tag
- **Version control** - dbt project in Git
### Don'ts
- **Don't skip staging** - Raw → mart is tech debt
- **Don't hardcode dates** - Use `{{ var('start_date') }}`
- **Don't repeat logic** - Extract to macros
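
The same selection syntax works programmatically; a sketch using the dbtRunner entry point that dbt-core ships as of 1.5 (assumed available here):

```python
from dbt.cli.main import dbtRunner

runner = dbtRunner()
# Equivalent to `dbt test --select tag:critical` on the CLI.
result = runner.invoke(["test", "--select", "tag:critical"])
if not result.success:
    raise SystemExit("critical dbt tests failed")
```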

View File

@@ -32,13 +32,13 @@ Tasks (one per partition)
### 2. Key Performance Factors
-| Factor | Impact | Solution |
-|--------|--------|----------|
-| **Shuffle** | Network I/O, disk I/O | Minimize wide transformations |
-| **Data Skew** | Uneven task duration | Salting, broadcast joins |
-| **Serialization** | CPU overhead | Use Kryo, columnar formats |
-| **Memory** | GC pressure, spills | Tune executor memory |
-| **Partitions** | Parallelism | Right-size partitions |
+| Factor            | Impact                | Solution                      |
+| ----------------- | --------------------- | ----------------------------- |
+| **Shuffle**       | Network I/O, disk I/O | Minimize wide transformations |
+| **Data Skew**     | Uneven task duration  | Salting, broadcast joins      |
+| **Serialization** | CPU overhead          | Use Kryo, columnar formats    |
+| **Memory**        | GC pressure, spills   | Tune executor memory          |
+| **Partitions**    | Parallelism           | Right-size partitions         |
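
Two of these levers in a minimal PySpark sketch: broadcasting a small dimension table so the fact table is never shuffled, and right-sizing partitions before a wide write; paths, column names, and the partition count are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("perf-sketch").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table
countries = spark.read.parquet("/data/countries")  # small dimension table

# Broadcast join: ships the small table to every executor,
# so the large side never moves across the network.
joined = orders.join(broadcast(countries), on="country_code")

# Right-size partitions explicitly instead of relying on defaults.
joined.repartition(200, "country_code").write.mode("overwrite").parquet("/data/enriched")
```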
## Quick Start
@@ -395,6 +395,7 @@ spark_configs = {
## Best Practices
### Do's
- **Enable AQE** - Adaptive query execution handles many issues
- **Use Parquet/Delta** - Columnar formats with compression
- **Broadcast small tables** - Avoid shuffle for small joins
@@ -402,6 +403,7 @@ spark_configs = {
- **Right-size partitions** - 128MB - 256MB per partition
### Don'ts
- **Don't collect large data** - Keep data distributed
- **Don't use UDFs unnecessarily** - Use built-in functions
- **Don't over-cache** - Memory is limited
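
A short sketch contrasting the last two don'ts: built-in functions run inside the JVM while a Python UDF serializes every row, and aggregating before anything leaves the cluster avoids a large collect. The path is illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/enriched")  # illustrative path

# Don't: a Python UDF for logic Spark already provides.
#   upper_udf = F.udf(lambda s: s.upper())  # serializes every row to Python
# Do: the built-in equivalent stays inside the JVM.
df = df.withColumn("country_code", F.upper("country_code"))

# Don't .collect() a large frame; reduce first and fetch a scalar.
print(df.count())
```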