mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 17:47:16 +00:00
style: format all files with prettier
@@ -20,12 +20,12 @@ Production-ready patterns for Apache Airflow including DAG design, operators, se
 
 ### 1. DAG Design Principles
 
-| Principle | Description |
-|-----------|-------------|
-| **Idempotent** | Running twice produces same result |
-| **Atomic** | Tasks succeed or fail completely |
-| **Incremental** | Process only new/changed data |
-| **Observable** | Logs, metrics, alerts at every step |
+| Principle       | Description                         |
+| --------------- | ----------------------------------- |
+| **Idempotent**  | Running twice produces same result  |
+| **Atomic**      | Tasks succeed or fail completely    |
+| **Incremental** | Process only new/changed data       |
+| **Observable**  | Logs, metrics, alerts at every step |
 
 ### 2. Task Dependencies
 
@@ -503,6 +503,7 @@ airflow/
 ## Best Practices
 
 ### Do's
+
 - **Use TaskFlow API** - Cleaner code, automatic XCom
 - **Set timeouts** - Prevent zombie tasks
 - **Use `mode='reschedule'`** - For sensors, free up workers
@@ -510,6 +511,7 @@ airflow/
 - **Idempotent tasks** - Safe to retry
 
 ### Don'ts
+
 - **Don't use `depends_on_past=True`** - Creates bottlenecks
 - **Don't hardcode dates** - Use `{{ ds }}` macros
 - **Don't use global state** - Tasks should be stateless
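The idempotent + atomic principles in the Airflow hunks above can be sketched without Airflow itself. This is a library-free illustration, not code from the repo: `process_partition` and its file layout are hypothetical, and in a real DAG the `ds` argument would come from the `{{ ds }}` macro rather than being passed by hand.

```python
import json
import tempfile
from pathlib import Path

def process_partition(ds: str, records: list, out_dir: Path) -> Path:
    """Idempotent: output is keyed by the logical date `ds`, so retrying
    the same interval overwrites one file instead of duplicating rows."""
    out = out_dir / f"orders_{ds}.json"
    # Atomic: write to a temp file first, then rename in one step, so a
    # failure mid-write never leaves a partial output behind.
    tmp = out.with_suffix(".tmp")
    tmp.write_text(json.dumps(records))
    tmp.replace(out)
    return out

out_dir = Path(tempfile.mkdtemp())
first = process_partition("2024-01-01", [{"id": 1}], out_dir)
second = process_partition("2024-01-01", [{"id": 1}], out_dir)  # safe retry
assert first == second
assert len(list(out_dir.iterdir())) == 1  # no duplicate output
```

Because the task is stateless and keyed by its interval, Airflow retries and backfills are safe by construction.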
@@ -20,14 +20,14 @@ Production patterns for implementing data quality with Great Expectations, dbt t
 
 ### 1. Data Quality Dimensions
 
-| Dimension | Description | Example Check |
-|-----------|-------------|---------------|
-| **Completeness** | No missing values | `expect_column_values_to_not_be_null` |
-| **Uniqueness** | No duplicates | `expect_column_values_to_be_unique` |
-| **Validity** | Values in expected range | `expect_column_values_to_be_in_set` |
-| **Accuracy** | Data matches reality | Cross-reference validation |
-| **Consistency** | No contradictions | `expect_column_pair_values_A_to_be_greater_than_B` |
-| **Timeliness** | Data is recent | `expect_column_max_to_be_between` |
+| Dimension        | Description              | Example Check                                      |
+| ---------------- | ------------------------ | -------------------------------------------------- |
+| **Completeness** | No missing values        | `expect_column_values_to_not_be_null`              |
+| **Uniqueness**   | No duplicates            | `expect_column_values_to_be_unique`                |
+| **Validity**     | Values in expected range | `expect_column_values_to_be_in_set`                |
+| **Accuracy**     | Data matches reality     | Cross-reference validation                         |
+| **Consistency**  | No contradictions        | `expect_column_pair_values_A_to_be_greater_than_B` |
+| **Timeliness**   | Data is recent           | `expect_column_max_to_be_between`                  |
 
 ### 2. Testing Pyramid for Data
 
@@ -191,7 +191,7 @@ validations:
       data_connector_name: default_inferred_data_connector_name
       data_asset_name: orders
       data_connector_query:
-        index: -1  # Latest batch
+        index: -1 # Latest batch
     expectation_suite_name: orders_suite
 
     action_list:
@@ -270,7 +270,8 @@ models:
       - name: order_status
         tests:
           - accepted_values:
-              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
+              values:
+                ["pending", "processing", "shipped", "delivered", "cancelled"]
 
       - name: total_amount
         tests:
@@ -566,6 +567,7 @@ if not all(r.passed for r in results.values()):
 ## Best Practices
 
 ### Do's
+
 - **Test early** - Validate source data before transformations
 - **Test incrementally** - Add tests as you find issues
 - **Document expectations** - Clear descriptions for each test
@@ -573,6 +575,7 @@ if not all(r.passed for r in results.values()):
 - **Version contracts** - Track schema changes
 
 ### Don'ts
+
 - **Don't test everything** - Focus on critical columns
 - **Don't ignore warnings** - They often precede failures
 - **Don't skip freshness** - Stale data is bad data
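The expectation names in the data-quality table above map onto simple predicates. A library-free sketch of the first three dimensions; the helper names and the toy `orders` rows are illustrative, not Great Expectations API:

```python
def expect_not_null(rows: list, col: str) -> bool:
    """Completeness: no missing values in `col`."""
    return all(r.get(col) is not None for r in rows)

def expect_unique(rows: list, col: str) -> bool:
    """Uniqueness: no duplicate values in `col`."""
    vals = [r[col] for r in rows]
    return len(vals) == len(set(vals))

def expect_in_set(rows: list, col: str, allowed: set) -> bool:
    """Validity: every value drawn from an allowed set."""
    return all(r[col] in allowed for r in rows)

orders = [
    {"order_id": 1, "status": "pending"},
    {"order_id": 2, "status": "shipped"},
]
assert expect_not_null(orders, "status")
assert expect_unique(orders, "order_id")
assert expect_in_set(
    orders, "status",
    {"pending", "processing", "shipped", "delivered", "cancelled"},
)
```

Great Expectations packages the same predicates with reporting, sampling, and the checkpoint machinery shown in the YAML hunks.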
@@ -32,19 +32,19 @@ marts/ Final analytics tables
 
 ### 2. Naming Conventions
 
-| Layer | Prefix | Example |
-|-------|--------|---------|
-| Staging | `stg_` | `stg_stripe__payments` |
-| Intermediate | `int_` | `int_payments_pivoted` |
-| Marts | `dim_`, `fct_` | `dim_customers`, `fct_orders` |
+| Layer        | Prefix         | Example                       |
+| ------------ | -------------- | ----------------------------- |
+| Staging      | `stg_`         | `stg_stripe__payments`        |
+| Intermediate | `int_`         | `int_payments_pivoted`        |
+| Marts        | `dim_`, `fct_` | `dim_customers`, `fct_orders` |
 
 ## Quick Start
 
 ```yaml
 # dbt_project.yml
-name: 'analytics'
-version: '1.0.0'
-profile: 'analytics'
+name: "analytics"
+version: "1.0.0"
+profile: "analytics"
 
 model-paths: ["models"]
 analysis-paths: ["analyses"]
@@ -53,7 +53,7 @@ seed-paths: ["seeds"]
 macro-paths: ["macros"]
 
 vars:
-  start_date: '2020-01-01'
+  start_date: "2020-01-01"
 
 models:
   analytics:
@@ -107,8 +107,8 @@ sources:
     loader: fivetran
     loaded_at_field: _fivetran_synced
     freshness:
-      warn_after: {count: 12, period: hour}
-      error_after: {count: 24, period: hour}
+      warn_after: { count: 12, period: hour }
+      error_after: { count: 24, period: hour }
     tables:
       - name: customers
         description: Stripe customer records
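The `warn_after` / `error_after` thresholds in the freshness hunk above reduce to comparing data age against two cutoffs. A plain-Python sketch of that logic; `freshness_status` is a hypothetical helper, not dbt's implementation:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(loaded_at: datetime, now: datetime,
                     warn_after: timedelta = timedelta(hours=12),
                     error_after: timedelta = timedelta(hours=24)) -> str:
    """Mirror of dbt's source freshness semantics: warn past the first
    threshold, error past the second, otherwise pass."""
    age = now - loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
assert freshness_status(now - timedelta(hours=6), now) == "pass"
assert freshness_status(now - timedelta(hours=18), now) == "warn"
assert freshness_status(now - timedelta(hours=30), now) == "error"
```

In dbt, `loaded_at_field` (here `_fivetran_synced`) supplies the `loaded_at` timestamp per source table.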
@@ -409,7 +409,7 @@ models:
         description: Customer value tier based on lifetime value
         tests:
           - accepted_values:
-              values: ['high', 'medium', 'low']
+              values: ["high", "medium", "low"]
 
       - name: lifetime_value
         description: Total amount paid by customer
@@ -540,6 +540,7 @@ dbt ls --select tag:critical # List models by tag
 ## Best Practices
 
 ### Do's
+
 - **Use staging layer** - Clean data once, use everywhere
 - **Test aggressively** - Not null, unique, relationships
 - **Document everything** - Column descriptions, model descriptions
@@ -547,6 +548,7 @@ dbt ls --select tag:critical # List models by tag
 - **Version control** - dbt project in Git
 
 ### Don'ts
+
 - **Don't skip staging** - Raw → mart is tech debt
 - **Don't hardcode dates** - Use `{{ var('start_date') }}`
 - **Don't repeat logic** - Extract to macros
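The naming convention table in the dbt hunks above lends itself to a mechanical check. A hypothetical lint helper, not part of dbt or this repo, sketching how the prefix-to-layer mapping could be enforced:

```python
from typing import Optional

# Layer prefixes taken from the naming-conventions table above.
PREFIXES = {
    "staging": ("stg_",),
    "intermediate": ("int_",),
    "marts": ("dim_", "fct_"),
}

def layer_for(model: str) -> Optional[str]:
    """Return the dbt layer implied by the model name's prefix,
    or None if the name follows no recognized convention."""
    for layer, prefixes in PREFIXES.items():
        if model.startswith(prefixes):
            return layer
    return None

assert layer_for("stg_stripe__payments") == "staging"
assert layer_for("fct_orders") == "marts"
assert layer_for("orders_raw") is None  # violates the convention
```

A check like this could run in CI over `models/` to catch raw-to-mart shortcuts before they become tech debt.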
@@ -32,13 +32,13 @@ Tasks (one per partition)
 
 ### 2. Key Performance Factors
 
-| Factor | Impact | Solution |
-|--------|--------|----------|
-| **Shuffle** | Network I/O, disk I/O | Minimize wide transformations |
-| **Data Skew** | Uneven task duration | Salting, broadcast joins |
-| **Serialization** | CPU overhead | Use Kryo, columnar formats |
-| **Memory** | GC pressure, spills | Tune executor memory |
-| **Partitions** | Parallelism | Right-size partitions |
+| Factor            | Impact                | Solution                      |
+| ----------------- | --------------------- | ----------------------------- |
+| **Shuffle**       | Network I/O, disk I/O | Minimize wide transformations |
+| **Data Skew**     | Uneven task duration  | Salting, broadcast joins      |
+| **Serialization** | CPU overhead          | Use Kryo, columnar formats    |
+| **Memory**        | GC pressure, spills   | Tune executor memory          |
+| **Partitions**    | Parallelism           | Right-size partitions         |
 
 ## Quick Start
 
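"Right-size partitions" from the table above is back-of-envelope arithmetic: divide total data size by a 128-256 MB target (the figure the skill file's Do's list uses). A library-free sketch; `num_partitions` is an illustrative helper, not a Spark API:

```python
def num_partitions(total_bytes: int, target_mb: int = 128) -> int:
    """Partitions needed so each holds at most `target_mb` megabytes."""
    target = target_mb * 1024 * 1024
    return max(1, -(-total_bytes // target))  # ceiling division

assert num_partitions(10 * 1024**3) == 80  # 10 GB at 128 MB each
assert num_partitions(50 * 1024**2) == 1   # small data: one partition
```

In Spark the result would feed `repartition()` or `spark.sql.shuffle.partitions`; with AQE enabled, post-shuffle coalescing does a similar calculation automatically.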
@@ -395,6 +395,7 @@ spark_configs = {
 ## Best Practices
 
 ### Do's
+
 - **Enable AQE** - Adaptive query execution handles many issues
 - **Use Parquet/Delta** - Columnar formats with compression
 - **Broadcast small tables** - Avoid shuffle for small joins
@@ -402,6 +403,7 @@ spark_configs = {
 - **Right-size partitions** - 128MB - 256MB per partition
 
 ### Don'ts
+
 - **Don't collect large data** - Keep data distributed
 - **Don't use UDFs unnecessarily** - Use built-in functions
 - **Don't over-cache** - Memory is limited
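The "Salting" remedy for data skew in the Spark table above splits one hot key across N salted sub-keys so its rows spread over N tasks. A library-free sketch of the idea; `salt_key` and the `hot_customer` data are illustrative (in Spark you would concatenate a random salt column before the shuffle and strip it after aggregation):

```python
import random
from collections import Counter

def salt_key(key: str, n_salts: int = 4) -> str:
    """Append a random salt so one hot key becomes n_salts sub-keys."""
    return f"{key}#{random.randrange(n_salts)}"

random.seed(0)  # deterministic for the assertions below
keys = ["hot_customer"] * 1000 + ["other"] * 10
buckets = Counter(salt_key(k) for k in keys)

hot = {k: v for k, v in buckets.items() if k.startswith("hot_customer#")}
assert sum(hot.values()) == 1000   # all hot rows accounted for
assert len(hot) == 4               # spread over 4 sub-keys, i.e. 4 tasks
```

Without salting, the straggler task handling `hot_customer` dominates stage runtime; with it, each sub-key carries roughly a quarter of the load.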