🎯

data-pipeline-engineer

🎯Skill

from erichowens/some_claude_skills

What it does

Designs and implements scalable data pipelines, transforming raw data into analytics-ready formats using modern ETL technologies and best practices.

📦

Part of

erichowens/some_claude_skills(148 items)

data-pipeline-engineer

Installation

Add MarketplaceAdd marketplace to Claude Code

/plugin marketplace add erichowens/some_claude_skills

Install PluginInstall plugin from marketplace

/plugin install adhd-design-expert@some-claude-skills

Install PluginInstall plugin from marketplace

/plugin install some-claude-skills@some-claude-skills

git cloneClone repository

git clone https://github.com/erichowens/some_claude_skills.git

Claude Desktop ConfigurationAdd this to your claude_desktop_config.json

{
  "mcpServers": {
    "prompt-learning": {
      "command": "npx",
      "args...

📖 Extracted from docs: erichowens/some_claude_skills

Need more details? View full documentation on GitHub →

14Installs

Last UpdatedJan 23, 2026

View on GitHub Back to Skills

Skill Details

SKILL.md

"Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills)."

Overview

# Data Pipeline Engineer

Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.

Quick Start

Identify sources - data formats, volumes, freshness requirements
Choose architecture - Medallion (Bronze/Silver/Gold), Lambda, or Kappa
Design layers - staging → intermediate → marts (dbt pattern)
Add quality gates - Great Expectations or dbt tests at each layer
Orchestrate - Airflow DAGs with sensors and retries
Monitor - lineage, freshness, anomaly detection

Core Capabilities

| Capability | Technologies | Key Patterns |

|------------|--------------|--------------|

| Batch Processing | Spark, dbt, Databricks | Incremental, partitioning, Delta/Iceberg |

| Stream Processing | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing |

| Orchestration | Airflow, Dagster, Prefect | DAG design, sensors, task groups |

| Data Modeling | dbt, SQL | Kimball, Data Vault, SCD |

| Data Quality | Great Expectations, dbt tests | Validation suites, freshness |

Architecture Patterns

Medallion Architecture (Recommended)

```

BRONZE (Raw) → Exact source copy, schema-on-read, partitioned by ingestion

↓ Cleaning, Deduplication

SILVER (Cleansed) → Validated, standardized, business logic applied

↓ Aggregation, Enrichment

GOLD (Business) → Dimensional models, aggregates, ready for BI/ML

```

Lambda vs Kappa

Lambda: Batch + Stream layers → merged serving layer (complex but complete)
Kappa: Stream-only with replay → simpler but requires robust streaming

Reference Examples

Full implementation examples in ./references/:

| File | Description |

|------|-------------|

| dbt-project-structure.md | Complete dbt layout with staging, intermediate, marts |

| airflow-dag.py | Production DAG with sensors, task groups, quality checks |

| spark-streaming.py | Kafka-to-Delta processor with windowing |

| great-expectations-suite.json | Comprehensive data quality expectation suite |

Anti-Patterns (10 Critical Mistakes)

1. Full Table Refreshes

Symptom: Truncate and rebuild entire tables every run

Fix: Use incremental models with is_incremental(), partition by date

2. Tight Coupling to Source Schemas

Symptom: Pipeline breaks when upstream adds/removes columns

Fix: Explicit source contracts, select only needed columns in staging

3. Monolithic DAGs

Symptom: One 200-task DAG running 8 hours

Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies

4. No Data Quality Gates

Symptom: Bad data reaches production before detection

Fix: Great Expectations or dbt tests at each layer, block on failures

5. Processing Before Archiving

Symptom: Raw data transformed without preserving original

Fix: Always land raw in Bronze first, make transformations reproducible

6. Hardcoded Dates in Queries

Symptom: Manual updates needed for date filters

Fix: Use Airflow templating (e.g., ds variable) or dynamic date functions

7. Missing Watermarks in Streaming

Symptom: Unbounded state growth, OOM in long-running jobs

Fix: Add withWatermark() to handle late-arriving data

8. No Retry/Backoff Strategy

Symptom: Transient failures cause DAG failures

Fix: retries=3, retry_exponential_backoff=True, max_retry_delay

9. Undocumented Data Lineage

Symptom: No one knows where data comes from or who uses it

Fix: dbt docs, data catalog integration, column-level lineage

10. Testing Only in Production

Symptom: Bugs discovered by stakeholders, not engineers

Fix: dbt --target dev, sample datasets, CI/CD for models

Quality Checklist

Pipeline Design:

[ ] Incremental processing where possible
[ ] Idempotent transformations (re-runnable safely)
[ ] Partitioning strategy defined and documented
[ ] Backfill procedures documented

Data Quality:

[ ] Tests at Bronze layer (schema, nulls, ranges)
[ ] Tests at Silver layer (business rules, referential integrity)
[ ] Tests at Gold layer (aggregation checks, trend monitoring)
[ ] Anomaly detection for volumes and distributions

Orchestration:

[ ] Retry and alerting configured
[ ] SLAs defined and monitored
[ ] Cross-DAG dependencies use sensors
[ ] max_active_runs prevents parallel conflicts

Operations:

[ ] Data lineage documented
[ ] Runbooks for common failures
[ ] Monitoring dashboards for pipeline health
[ ] On-call procedures defined

Validation Script

Run ./scripts/validate-pipeline.sh to check:

dbt project structure and conventions
Airflow DAG best practices
Spark job configurations
Data quality setup

External Resources

[dbt Best Practices](https://docs.getdbt.com/guides/best-practices)
[Airflow Best Practices](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html)
[Great Expectations Docs](https://docs.greatexpectations.io/)
[Delta Lake Guide](https://docs.delta.io/latest/index.html)
[Kafka Streams](https://kafka.apache.org/documentation/streams/)

More from this repository10

🎯

cv-creator🎯Skill

cv-creator skill from erichowens/some_claude_skills

🎯

metal-shader-expert🎯Skill

metal-shader-expert skill from erichowens/some_claude_skills

🎯

video-processing-editing🎯Skill

video-processing-editing skill from erichowens/some_claude_skills

🎯

ai-video-production-master🎯Skill

ai-video-production-master skill from erichowens/some_claude_skills

🎯

sound-engineer🎯Skill

sound-engineer skill from erichowens/some_claude_skills

🎯

pwa-expert🎯Skill

pwa-expert skill from erichowens/some_claude_skills

🎯

mdx-sanitizer🎯Skill

Sanitizes MDX content by escaping angle brackets, generics, and JSX-conflicting patterns to prevent build failures and parsing errors.

🎯

indie-monetization-strategist🎯Skill

indie-monetization-strategist skill from erichowens/some_claude_skills

🎯

mobile-ux-optimizer🎯Skill

Optimizes mobile user interfaces by implementing touch-friendly, responsive design patterns with proper viewport handling and performance considerations.

🎯

job-application-optimizer🎯Skill

job-application-optimizer skill from erichowens/some_claude_skills