diagram.mmd — flowchart
ETL Workflow flowchart diagram

An ETL (Extract, Transform, Load) workflow is a three-phase data integration process that extracts raw data from source systems, applies transformations to make it consistent and useful, and loads the result into a target store for analysis.

ETL is one of the most foundational patterns in data engineering. Before a single analyst can run a query or a dashboard can render a chart, raw data from production systems must be moved, reshaped, and quality-checked. The ETL workflow defines exactly how that happens.

Extract is the first phase. The job connects to one or more source systems — databases, APIs, flat files, or streaming topics — and pulls data. Depending on the source, extraction may be a full snapshot (all rows every run) or incremental (only rows changed since the last watermark). Incremental extraction is faster and cheaper but requires the source to expose a reliable change indicator such as an updated_at timestamp or a CDC log.
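The incremental pattern can be sketched in a few lines. This is a minimal illustration, not a specific connector's API: `fetch_rows` stands in for a real source query (e.g. `SELECT * FROM orders WHERE updated_at > :since`), and the sample rows are invented.

```python
from datetime import datetime, timezone

def fetch_rows(since):
    # Stand-in for a real source query; returns only rows changed after `since`.
    sample = [
        {"id": 1, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
        {"id": 2, "updated_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    ]
    return [r for r in sample if r["updated_at"] > since]

def extract_incremental(last_watermark):
    rows = fetch_rows(last_watermark)
    # Advance the watermark to the newest updated_at seen, so the next run
    # picks up only rows changed after this one.
    new_watermark = max((r["updated_at"] for r in rows), default=last_watermark)
    return rows, new_watermark

rows, wm = extract_incremental(datetime(2024, 1, 3, tzinfo=timezone.utc))
# Only id=2 qualifies; the watermark advances to its updated_at.
```

The key design point is that the watermark is only advanced from data actually extracted, so a failed run naturally retries the same window.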

Validate follows immediately after extraction. Before any transformation is applied, the pipeline confirms that the incoming data conforms to expected types and constraints: Are required fields present? Do numeric columns contain numeric values? Are foreign key references resolvable? Records that fail validation are quarantined rather than silently dropped, so data quality issues surface as alerts rather than as silent gaps in reports.
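A quarantine-style validator might look like the following sketch. The field names and constraints are illustrative assumptions, not part of any particular schema:

```python
def validate(records, required=("id", "amount")):
    """Split records into (valid, quarantined) instead of dropping failures."""
    valid, quarantined = [], []
    for rec in records:
        errors = [f"missing field: {f}" for f in required if rec.get(f) is None]
        if not isinstance(rec.get("amount"), (int, float)):
            errors.append("amount is not numeric")
        if errors:
            # Keep the failing record alongside its errors so it can be
            # inspected and alerted on, rather than silently discarded.
            quarantined.append({"record": rec, "errors": errors})
        else:
            valid.append(rec)
    return valid, quarantined
```

Because quarantined records carry their error reasons, a monitoring job can alert on quarantine volume and an engineer can replay fixed records later.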

Transform is the most complex phase. It encompasses cleaning (stripping whitespace, normalizing casing), deduplication (removing records with identical business keys), type casting (parsing ISO timestamps into date objects), business logic application (categorizing orders by size tier), and enrichment (joining in reference data like country names from a geo lookup table). Well-designed transformations are idempotent — running the same job twice produces the same result — which makes reruns safe.
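An idempotent transform combining these steps might be sketched as below. The order fields, tier thresholds, and business key are invented for illustration:

```python
from datetime import datetime

# Illustrative size tiers, checked from largest threshold down.
SIZE_TIERS = [(1000, "large"), (100, "medium"), (0, "small")]

def transform(records):
    seen = set()
    out = []
    for rec in records:
        key = rec["order_id"]          # business key for deduplication
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "order_id": key,
            "customer": rec["customer"].strip().lower(),          # clean + normalize
            "ordered_at": datetime.fromisoformat(rec["ordered_at"]),  # type cast
            "tier": next(t for limit, t in SIZE_TIERS if rec["amount"] >= limit),
        })
    return out
```

Because `transform` is a pure function of its input, running it twice on the same extract yields identical output, which is exactly the idempotence property that makes reruns safe.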

Load writes the transformed records into the target: a Data Warehouse Pipeline, a Data Lake Architecture, or a downstream API. A post-load verification step checks row counts and key metrics against expected ranges. If any check fails, the load is rolled back and the job is marked as failed rather than silently corrupting the destination. Data Ingestion Pipeline covers the broader infrastructure that schedules and orchestrates ETL jobs across many sources.
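The load-verify-rollback pattern can be sketched with an in-memory list standing in for a transactional target; real warehouses would use a staging table or transaction instead. The check thresholds are illustrative:

```python
def load_with_verification(rows, target, expected_min_rows=1):
    snapshot = list(target)            # stand-in for a transaction/staging table
    target.extend(rows)
    # Post-load checks: row count and a key metric within expected range.
    loaded = len(target) - len(snapshot)
    total_amount = sum(r["amount"] for r in rows)
    if loaded < expected_min_rows or total_amount < 0:
        target[:] = snapshot           # roll back to the pre-load state
        raise RuntimeError("post-load verification failed; load rolled back")
    return loaded
```

Failing loudly here is deliberate: a failed job with an intact target is far cheaper to fix than a "successful" job that quietly corrupted the destination.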


Frequently asked questions

What is an ETL workflow?
An ETL workflow is a three-phase data integration process — Extract, Transform, Load — that pulls raw data from source systems, applies cleaning and business logic transformations, and writes the result into a target store ready for analysis.

How does an ETL workflow work?
The extract phase pulls data from sources using full snapshots or incremental watermarks. A validation step quarantines non-conforming records before transformation applies cleaning, deduplication, type casting, and business logic. The load phase writes results to the target, and a post-load verification confirms row counts and key metrics before the job is marked successful.

When should I use an ETL workflow?
Use ETL when your target system has strict schema requirements that make it impractical to load raw data first, when transformation logic is complex enough to warrant a dedicated job, or when you need auditability through each phase — including the ability to inspect quarantined records.

What are common mistakes in ETL workflows?
Common mistakes include writing non-idempotent transformations (so reruns produce different results), silently dropping records that fail validation rather than quarantining them, and skipping the post-load verification step (letting silent corruption reach downstream consumers).

What is the difference between ETL and ELT?
ETL transforms data before loading it into the target. ELT loads raw data into the target first (typically a cloud data warehouse with abundant compute) and transforms it there using SQL. ELT is now more common for cloud warehouse workloads because warehouses like BigQuery and Snowflake can run transformations at scale cheaply, eliminating the need for a separate transform compute layer.
```mermaid
flowchart TD
    Sources[Data Sources\nDatabases, APIs, Files] --> Extract[Extract raw data\nFull or incremental load]
    Extract --> Validate[Validate schema and types\nNull checks, type conformance]
    Validate -->|Pass| Clean[Clean and deduplicate\nNormalize, strip whitespace]
    Validate -->|Fail| Quarantine[Quarantine failed records\nAlert and log]
    Clean --> Transform[Apply business transformations\nType casting, categorization]
    Transform --> Enrich[Enrich with lookup data\nGeo tables, reference data]
    Enrich --> Format[Format for target schema\nColumn mapping, partitioning]
    Format --> Load[Load into target store\nData warehouse or data lake]
    Load --> Verify[Post-load verification\nRow counts, metric checks]
    Verify -->|Pass| Done[Data available for queries]
    Verify -->|Fail| Rollback[Rollback and alert]
```