diagram.mmd — flowchart
Data Lake Architecture flowchart diagram

A data lake architecture is a storage and processing system that centralizes raw data from all sources in its native format, then applies structure progressively as data moves toward serving layers — deferring schema decisions until the data is actually queried.

The key difference between a data lake and a data warehouse is the point at which schema is enforced. A warehouse enforces schema on write: data must be cleaned and structured before it is loaded. A data lake enforces schema on read: raw data lands in the lake immediately, and structure is applied when a job or query reads it. This makes the lake far more flexible — you can store data before you know exactly how you will use it.
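Schema on read can be sketched in a few lines of Python. This is a minimal illustration, not a specific library's API: raw JSON lines are stored untouched, and a hypothetical `read_with_schema` helper applies types and drops unknown fields only when the data is read.

```python
import json
from datetime import datetime

# Raw events land exactly as received -- no validation at write time.
raw_lines = [
    '{"user_id": "42", "ts": "2024-01-15T09:30:00", "amount": "19.99"}',
    '{"user_id": "43", "ts": "2024-01-15T09:31:00", "amount": "5.00", "extra": "x"}',
]

def read_with_schema(lines):
    """Schema on read: cast types and keep only known fields at query time."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user_id": int(rec["user_id"]),
            "ts": datetime.fromisoformat(rec["ts"]),
            "amount": float(rec["amount"]),
        }

events = list(read_with_schema(raw_lines))
```

Note that the unexpected `extra` field is simply ignored at read time; in a schema-on-write warehouse, the same record would have been rejected or cleaned at load time.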

Modern data lake architectures are organized into zones that represent progressive levels of curation. The raw zone (also called the landing or bronze zone) is an append-only store of source data exactly as it arrived — JSON blobs, CSV exports, Avro records, application logs. Nothing is deleted or modified here; this is the system of record for replay.
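The append-only discipline of the raw zone can be sketched with plain filesystem writes. This is a simplified local stand-in for object storage (the path layout and the `land_raw` helper are illustrative, not a real tool's API): each batch gets a fresh file under a date partition, and nothing already landed is ever touched.

```python
import tempfile
from datetime import date
from pathlib import Path

lake_root = Path(tempfile.mkdtemp())  # stand-in for s3://my-lake/ or similar

def land_raw(source: str, payload: bytes, ingest_date: date) -> Path:
    """Append-only landing: a new uniquely named file per batch."""
    partition = lake_root / "raw" / source / f"dt={ingest_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    # A monotonically numbered filename means earlier landings are never overwritten.
    out = partition / f"batch-{len(list(partition.iterdir())):05d}.json"
    out.write_bytes(payload)
    return out

first = land_raw("orders", b'{"id": 1}', date(2024, 1, 15))
second = land_raw("orders", b'{"id": 2}', date(2024, 1, 15))
```

Because every batch is preserved verbatim, a downstream transformation bug can be fixed by simply re-running the job over the raw partitions.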

The curated zone (silver zone) contains data that has been cleaned, validated, and conformed by a transformation job. Schemas are enforced at this layer, duplicate records are removed, and source-specific quirks are normalized. Data here is stored in an efficient columnar format such as Parquet or ORC, partitioned by date and entity type to minimize query scan costs.
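The promotion from raw to curated boils down to validation plus deduplication. The sketch below shows that logic with the standard library only; in a real pipeline the output would be written as partitioned Parquet via Spark or pyarrow rather than kept in memory, and the field names here are hypothetical.

```python
raw = [
    {"order_id": "1", "ts": "2024-01-15T09:00:00", "amount": "10.0"},
    {"order_id": "1", "ts": "2024-01-15T09:05:00", "amount": "12.0"},  # later duplicate
    {"order_id": "2", "ts": "2024-01-15T09:01:00", "amount": "oops"},  # fails validation
]

def curate(records):
    """Enforce the schema, reject bad rows, and keep the latest row per key."""
    latest, rejects = {}, []
    for rec in records:
        try:
            row = {
                "order_id": int(rec["order_id"]),
                "ts": rec["ts"],  # ISO-8601 strings compare chronologically
                "amount": float(rec["amount"]),
            }
        except (KeyError, ValueError):
            rejects.append(rec)
            continue
        key = row["order_id"]
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    return list(latest.values()), rejects

curated, rejects = curate(raw)
```

Rejected rows are typically quarantined for inspection rather than dropped silently, so data-quality regressions at the source are visible.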

The serving zone (gold zone) contains purpose-built aggregates and dimensional models optimized for specific use cases: a user activity summary table for the product team, a revenue rollup for finance, a feature table for the machine learning platform. These are derived from the curated zone and are rebuilt on a schedule via a transformation framework like Spark or dbt.
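A serving-zone build is essentially a group-by over curated rows. The sketch below shows a daily revenue rollup of the kind mentioned above, using only the standard library (the table shape and `build_revenue_rollup` name are illustrative; production jobs would express this in Spark SQL or dbt).

```python
from collections import defaultdict

curated = [
    {"user_id": 1, "dt": "2024-01-15", "amount": 10.0},
    {"user_id": 1, "dt": "2024-01-15", "amount": 5.0},
    {"user_id": 2, "dt": "2024-01-15", "amount": 7.5},
]

def build_revenue_rollup(rows):
    """Daily revenue per user -- one purpose-built serving table."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["dt"], r["user_id"])] += r["amount"]
    return [
        {"dt": d, "user_id": u, "revenue": v}
        for (d, u), v in sorted(totals.items())
    ]

rollup = build_revenue_rollup(curated)
```

Because rollups are derived, they can be dropped and rebuilt from the curated zone at any time; the serving zone holds no data that exists nowhere else.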

A catalog and governance layer runs across all zones, tracking dataset lineage, ownership, and access policies. See Data Warehouse Pipeline for how a warehouse complements the lake as the primary analytical query engine, and ETL Workflow for the transformation jobs that promote data between zones.
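A minimal picture of what the catalog tracks: each dataset records its zone, owner, and upstream inputs, and lineage is a walk over those edges. This is a toy in-memory sketch with invented dataset names, not any catalog product's API.

```python
catalog = {}

def register(dataset, zone, owner, upstream):
    """Record a dataset's zone, owner, and the datasets it is derived from."""
    catalog[dataset] = {"zone": zone, "owner": owner, "upstream": upstream}

register("raw.orders", "raw", "ingest-team", [])
register("curated.orders", "curated", "data-eng", ["raw.orders"])
register("serving.revenue_daily", "serving", "finance", ["curated.orders"])

def lineage(dataset):
    """Walk upstream edges to collect every ancestor of a dataset."""
    seen, stack = [], list(catalog[dataset]["upstream"])
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.append(d)
            stack.extend(catalog[d]["upstream"])
    return seen

ancestors = lineage("serving.revenue_daily")
```

With lineage recorded, an upstream schema change or bad raw batch can be traced forward to every serving table it affects.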


Frequently asked questions

What is a data lake architecture?
A data lake architecture is a storage and processing system that centralizes raw data from all sources in its native format, then progressively applies structure as data moves from a raw landing zone through curated and serving layers.

How does data flow through a data lake?
Data lands in the raw (bronze) zone exactly as it arrives from source systems. Transformation jobs clean and conform it into the curated (silver) zone, storing it in columnar formats like Parquet. Further aggregation jobs produce purpose-built tables in the serving (gold) zone optimized for specific teams or use cases.

When should you use a data lake?
Use a data lake when you need to store data before you know how it will be used, when your data includes unstructured or semi-structured formats (logs, JSON, images), or when you need a low-cost archive layer for historical replay alongside a warehouse for fast analytical queries.

What are common data lake mistakes?
Common mistakes include letting the raw zone become a "data swamp" with no catalog or governance, failing to enforce schemas at the curated layer (negating its value), and not partitioning tables by date and entity (causing full-table scans on every query).

How does a data lake differ from a data warehouse?
A data warehouse enforces schema on write — data must be structured before loading — making queries fast but ingestion rigid. A data lake enforces schema on read — raw data lands immediately in any format — making ingestion flexible but placing the structuring burden on query-time transformation jobs.
```mermaid
flowchart TD
    Sources[Data Sources\nDatabases, streams, APIs, files] --> Ingest[Ingestion Layer\nBatch loaders and stream connectors]
    Ingest --> RawZone[Raw Zone\nNative format, append-only\nJSON, CSV, Avro, logs]
    RawZone --> Transform1[Transformation Job\nSpark, dbt, Glue]
    Transform1 --> CuratedZone[Curated Zone\nCleaned, validated, conformed\nParquet, partitioned by date]
    CuratedZone --> Transform2[Aggregation Job\nPurpose-built models]
    Transform2 --> ServingZone[Serving Zone\nDomain aggregates\nML features, finance rollups]
    ServingZone --> Warehouse[Data Warehouse\nSQL analytics engine]
    ServingZone --> MLPlatform[ML Platform\nFeature store, training data]
    ServingZone --> BI[BI Tools and Dashboards]
    RawZone --> Catalog[Data Catalog\nLineage, ownership, access policies]
    CuratedZone --> Catalog
    ServingZone --> Catalog
```