mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
feat: add 5 new specialized agents with 20 skills

Add domain expert agents with comprehensive skill sets:

- service-mesh-expert (cloud-infrastructure): Istio/Linkerd patterns, mTLS, observability
- event-sourcing-architect (backend-development): CQRS, event stores, projections, sagas
- vector-database-engineer (llm-application-dev): embeddings, similarity search, hybrid search
- monorepo-architect (developer-essentials): Nx, Turborepo, Bazel, pnpm workspaces
- threat-modeling-expert (security-scanning): STRIDE, attack trees, security requirements

Update all documentation to reflect correct counts: 67 plugins, 99 agents, 107 skills, 71 commands.
---
name: data-quality-frameworks
description: Implement data quality validation with Great Expectations, dbt tests, and data contracts. Use when building data quality pipelines, implementing validation rules, or establishing data contracts.
---

# Data Quality Frameworks

Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines.
## When to Use This Skill

- Implementing data quality checks in pipelines
- Setting up Great Expectations validation
- Building comprehensive dbt test suites
- Establishing data contracts between teams
- Monitoring data quality metrics
- Automating data validation in CI/CD
## Core Concepts

### 1. Data Quality Dimensions

| Dimension | Description | Example Check |
|-----------|-------------|---------------|
| **Completeness** | No missing values | `expect_column_values_to_not_be_null` |
| **Uniqueness** | No duplicates | `expect_column_values_to_be_unique` |
| **Validity** | Values in expected range | `expect_column_values_to_be_in_set` |
| **Accuracy** | Data matches reality | Cross-reference validation |
| **Consistency** | No contradictions | `expect_column_pair_values_A_to_be_greater_than_B` |
| **Timeliness** | Data is recent | `expect_column_max_to_be_between` |
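These dimensions map directly to executable checks. A minimal plain-Python sketch of the first three (the sample rows and column names are illustrative, not taken from the suites below):

```python
# Hypothetical order rows; keys mirror the columns discussed above
rows = [
    {"order_id": "a1", "status": "pending", "amount": 10.0},
    {"order_id": "a2", "status": "shipped", "amount": None},
    {"order_id": "a3", "status": "unknown", "amount": 25.0},
]

VALID_STATUSES = {"pending", "processing", "shipped", "delivered", "cancelled"}

# Completeness: no missing amounts
completeness_ok = all(r["amount"] is not None for r in rows)

# Uniqueness: order_id is a primary key
ids = [r["order_id"] for r in rows]
uniqueness_ok = len(ids) == len(set(ids))

# Validity: status drawn from the allowed set
validity_ok = all(r["status"] in VALID_STATUSES for r in rows)

print(completeness_ok, uniqueness_ok, validity_ok)  # False True False
```

In practice these become declarative expectations or dbt tests rather than hand-rolled loops, as the patterns below show.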
### 2. Testing Pyramid for Data

```
      /\
     /  \      Integration Tests (cross-table)
    /────\
   /      \    Unit Tests (single column)
  /────────\
 /          \  Schema Tests (structure)
/────────────\
```
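The three layers can be sketched as plain assertions before reaching for a framework (table contents and names here are hypothetical):

```python
# Illustrative tables as lists of dicts
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 5.0},
    {"order_id": 2, "customer_id": 20, "amount": 12.5},
]
customers = [{"customer_id": 10}, {"customer_id": 20}]

# Schema tests (base of the pyramid): structure only
assert all({"order_id", "customer_id", "amount"} <= set(r) for r in orders)

# Unit tests (middle): single-column rules
ids = [r["order_id"] for r in orders]
assert len(ids) == len(set(ids)), "order_id must be unique"
assert all(r["amount"] > 0 for r in orders), "amount must be positive"

# Integration tests (top): cross-table relationships
known = {c["customer_id"] for c in customers}
orphans = {r["customer_id"] for r in orders} - known
assert not orphans, f"orders reference unknown customers: {orphans}"

print("all three layers passed")
```

Cheap schema checks at the base run on every batch; the more expensive cross-table checks at the top run less often.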
## Quick Start

### Great Expectations Setup

```bash
# Install
pip install great_expectations

# Initialize project
great_expectations init

# Create datasource
great_expectations datasource new
```
```python
# validate_orders.py
import great_expectations as gx

# Create context
context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("orders_suite")

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# Validate
results = context.run_checkpoint(checkpoint_name="daily_orders")
```
## Patterns

### Pattern 1: Great Expectations Suite
```python
# expectations/orders_suite.py
import great_expectations as gx
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration

def build_orders_suite() -> ExpectationSuite:
    """Build comprehensive orders expectation suite"""

    suite = ExpectationSuite(expectation_suite_name="orders_suite")

    # Schema expectations
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_table_columns_to_match_set",
        kwargs={
            "column_set": ["order_id", "customer_id", "amount", "status", "created_at"],
            "exact_match": False  # Allow additional columns
        }
    ))

    # Primary key
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "order_id"}
    ))
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_unique",
        kwargs={"column": "order_id"}
    ))

    # Foreign key
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "customer_id"}
    ))

    # Categorical values
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "status",
            "value_set": ["pending", "processing", "shipped", "delivered", "cancelled"]
        }
    ))

    # Numeric ranges
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "amount",
            "min_value": 0,
            "max_value": 100000,
            "strict_min": True  # amount > 0
        }
    ))

    # Date validity
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_dateutil_parseable",
        kwargs={"column": "created_at"}
    ))

    # Freshness - data should be recent
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_max_to_be_between",
        kwargs={
            "column": "created_at",
            "min_value": {"$PARAMETER": "now - timedelta(days=1)"},
            "max_value": {"$PARAMETER": "now"}
        }
    ))

    # Row count sanity
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_table_row_count_to_be_between",
        kwargs={
            "min_value": 1000,  # Expect at least 1000 rows
            "max_value": 10000000
        }
    ))

    # Statistical expectations
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_mean_to_be_between",
        kwargs={
            "column": "amount",
            "min_value": 50,
            "max_value": 500
        }
    ))

    return suite
```
### Pattern 2: Great Expectations Checkpoint

```yaml
# great_expectations/checkpoints/orders_checkpoint.yml
name: orders_checkpoint
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-orders-validation"

validations:
  - batch_request:
      datasource_name: warehouse
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: orders
      data_connector_query:
        index: -1  # Latest batch
    expectation_suite_name: orders_suite

action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction

  - name: store_evaluation_parameters
    action:
      class_name: StoreEvaluationParametersAction

  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction

  # Slack notification on failure
  - name: send_slack_notification
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}
      notify_on: failure
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
```
```python
# Run checkpoint
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="orders_checkpoint")

if not result.success:
    # run_results values are dicts holding the validation result for each batch
    failed_validations = [
        r["validation_result"]
        for r in result.run_results.values()
        if not r["validation_result"].success
    ]
    raise ValueError(f"Data quality check failed: {failed_validations}")
```
### Pattern 3: dbt Data Tests

```yaml
# models/marts/core/_core__models.yml
version: 2

models:
  - name: fct_orders
    description: Order fact table
    tests:
      # Table-level tests
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
      - dbt_utils.expression_is_true:
          expression: "total_amount >= 0"

    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null
          - dbt_utils.at_least_one

      - name: customer_id
        description: Foreign key to dim_customers
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

      - name: created_at
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "<= current_timestamp"

  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

      - name: email
        tests:
          - unique
          - not_null
          # Custom regex test (Postgres ~ operator)
          - dbt_utils.expression_is_true:
              expression: "email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'"
```
### Pattern 4: Custom dbt Tests

```sql
-- tests/generic/test_row_count_in_range.sql
{% test row_count_in_range(model, min_count, max_count) %}

with row_count as (
    select count(*) as cnt from {{ model }}
)

select cnt
from row_count
where cnt < {{ min_count }} or cnt > {{ max_count }}

{% endtest %}

-- Usage in schema.yml:
-- tests:
--   - row_count_in_range:
--       min_count: 1000
--       max_count: 10000000
```
```sql
-- tests/generic/test_sequential_values.sql
{% test sequential_values(model, column_name, interval=1) %}

with lagged as (
    select
        {{ column_name }},
        lag({{ column_name }}) over (order by {{ column_name }}) as prev_value
    from {{ model }}
)

select *
from lagged
where {{ column_name }} - prev_value != {{ interval }}
    and prev_value is not null

{% endtest %}
```
```sql
-- tests/singular/assert_orders_customers_match.sql
-- Singular test: specific business rule

with orders_customers as (
    select distinct customer_id from {{ ref('fct_orders') }}
),

dim_customers as (
    select customer_id from {{ ref('dim_customers') }}
),

orphaned_orders as (
    select o.customer_id
    from orders_customers o
    left join dim_customers c using (customer_id)
    where c.customer_id is null
)

select * from orphaned_orders
-- Test passes if this returns 0 rows
```
### Pattern 5: Data Contracts

```yaml
# contracts/orders_contract.yaml
apiVersion: datacontract.com/v1.0.0
kind: DataContract
metadata:
  name: orders
  version: 1.0.0
  owner: data-platform-team
  contact: data-team@company.com

info:
  title: Orders Data Contract
  description: Contract for order event data from the ecommerce platform
  purpose: Analytics, reporting, and ML features

servers:
  production:
    type: snowflake
    account: company.us-east-1
    database: ANALYTICS
    schema: CORE

terms:
  usage: Internal analytics only
  limitations: PII must not be exposed in downstream marts
  billing: Charged per query TB scanned

schema:
  type: object
  properties:
    order_id:
      type: string
      format: uuid
      description: Unique order identifier
      required: true
      unique: true
      pii: false

    customer_id:
      type: string
      format: uuid
      description: Customer identifier
      required: true
      pii: true
      piiClassification: indirect

    total_amount:
      type: number
      minimum: 0
      maximum: 100000
      description: Order total in USD

    created_at:
      type: string
      format: date-time
      description: Order creation timestamp
      required: true

    status:
      type: string
      enum: [pending, processing, shipped, delivered, cancelled]
      description: Current order status

quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count > 0
      - missing_count(order_id) = 0
      - duplicate_count(order_id) = 0
      - invalid_count(status) = 0:
          valid values: [pending, processing, shipped, delivered, cancelled]
      - freshness(created_at) < 24h

sla:
  availability: 99.9%
  freshness: 1 hour
  latency: 5 minutes
```
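A contract like this is only useful if producers and consumers can enforce it in code. A minimal sketch of a validator with the field rules hard-coded to mirror the example contract (the `check_order` function and its rules are illustrative, not part of any contract tooling):

```python
from datetime import datetime, timedelta, timezone

VALID_STATUSES = {"pending", "processing", "shipped", "delivered", "cancelled"}

def check_order(record: dict) -> list[str]:
    """Return a list of contract violations for one order record."""
    errors = []
    if not record.get("order_id"):
        errors.append("order_id is required")
    amount = record.get("total_amount")
    if amount is not None and not (0 <= amount <= 100000):
        errors.append(f"total_amount {amount} outside [0, 100000]")
    if record.get("status") not in VALID_STATUSES:
        errors.append(f"invalid status: {record.get('status')}")
    created = record.get("created_at")
    if created and datetime.now(timezone.utc) - created > timedelta(hours=24):
        errors.append("created_at violates 24h freshness check")
    return errors

ok = check_order({
    "order_id": "o-1",
    "total_amount": 99.0,
    "status": "shipped",
    "created_at": datetime.now(timezone.utc),
})
bad = check_order({"order_id": "", "total_amount": -5, "status": "unknown"})
print(len(ok), len(bad))  # 0 3
```

Tools like datacontract-cli or Soda can generate equivalent checks from the YAML itself, keeping the contract as the single source of truth.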
### Pattern 6: Automated Quality Pipeline

```python
# quality_pipeline.py
from dataclasses import dataclass
from typing import List, Dict, Any
from datetime import datetime

import great_expectations as gx

@dataclass
class QualityResult:
    table: str
    passed: bool
    total_expectations: int
    failed_expectations: int
    details: List[Dict[str, Any]]
    timestamp: datetime

class DataQualityPipeline:
    """Orchestrate data quality checks across tables"""

    def __init__(self, context: gx.DataContext):
        self.context = context
        self.results: List[QualityResult] = []

    def validate_table(self, table: str, suite: str) -> QualityResult:
        """Validate a single table against an expectation suite"""

        checkpoint_config = {
            "name": f"{table}_validation",
            "config_version": 1.0,
            "class_name": "Checkpoint",
            "validations": [{
                "batch_request": {
                    "datasource_name": "warehouse",
                    "data_asset_name": table,
                },
                "expectation_suite_name": suite,
            }],
        }

        result = self.context.run_checkpoint(**checkpoint_config)

        # Parse results (run_results values are dicts keyed by artifact type)
        validation_result = list(result.run_results.values())[0]["validation_result"]
        results = validation_result.results

        failed = [r for r in results if not r.success]

        return QualityResult(
            table=table,
            passed=result.success,
            total_expectations=len(results),
            failed_expectations=len(failed),
            details=[{
                "expectation": r.expectation_config.expectation_type,
                "success": r.success,
                "observed_value": r.result.get("observed_value"),
            } for r in results],
            timestamp=datetime.now()
        )

    def run_all(self, tables: Dict[str, str]) -> Dict[str, QualityResult]:
        """Run validation for all tables"""
        results = {}

        for table, suite in tables.items():
            print(f"Validating {table}...")
            results[table] = self.validate_table(table, suite)

        return results

    def generate_report(self, results: Dict[str, QualityResult]) -> str:
        """Generate quality report"""
        report = ["# Data Quality Report", f"Generated: {datetime.now()}", ""]

        total_passed = sum(1 for r in results.values() if r.passed)
        total_tables = len(results)

        report.append(f"## Summary: {total_passed}/{total_tables} tables passed")
        report.append("")

        for table, result in results.items():
            status = "✅" if result.passed else "❌"
            report.append(f"### {status} {table}")
            report.append(f"- Expectations: {result.total_expectations}")
            report.append(f"- Failed: {result.failed_expectations}")

            if not result.passed:
                report.append("- Failed checks:")
                for detail in result.details:
                    if not detail["success"]:
                        report.append(f"  - {detail['expectation']}: {detail['observed_value']}")
            report.append("")

        return "\n".join(report)

# Usage
context = gx.get_context()
pipeline = DataQualityPipeline(context)

tables_to_validate = {
    "orders": "orders_suite",
    "customers": "customers_suite",
    "products": "products_suite",
}

results = pipeline.run_all(tables_to_validate)
report = pipeline.generate_report(results)

# Fail pipeline if any table failed
if not all(r.passed for r in results.values()):
    print(report)
    raise ValueError("Data quality checks failed!")
```
## Best Practices

### Do's

- **Test early** - Validate source data before transformations
- **Test incrementally** - Add tests as you find issues
- **Document expectations** - Clear descriptions for each test
- **Alert on failures** - Integrate with monitoring
- **Version contracts** - Track schema changes

### Don'ts

- **Don't test everything** - Focus on critical columns
- **Don't ignore warnings** - They often precede failures
- **Don't skip freshness** - Stale data is bad data
- **Don't hardcode thresholds** - Use dynamic baselines
- **Don't test in isolation** - Test relationships too
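"Don't hardcode thresholds" means deriving bounds from recent history instead of pinning constants like `min_value: 1000`. A minimal sketch (the `dynamic_bounds` helper and the daily counts are illustrative):

```python
from statistics import mean, stdev

def dynamic_bounds(history: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Derive row-count bounds from recent daily counts instead of hardcoding."""
    mu, sd = mean(history), stdev(history)
    return mu - sigmas * sd, mu + sigmas * sd

# Last week's daily row counts (illustrative)
daily_counts = [9800, 10050, 9900, 10200, 9950, 10100, 10000]
low, high = dynamic_bounds(daily_counts)

todays_count = 10120
print(low <= todays_count <= high)  # True
```

The computed bounds can then be fed into `expect_table_row_count_to_be_between` or a custom dbt test at run time, so the check tightens or loosens as the data evolves.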
## Resources

- [Great Expectations Documentation](https://docs.greatexpectations.io/)
- [dbt Testing Documentation](https://docs.getdbt.com/docs/build/tests)
- [Data Contract Specification](https://datacontract.com/)
- [Soda Core](https://docs.soda.io/soda-core/overview.html)