diagram.mmd — flowchart
Feature Engineering Pipeline flowchart diagram

A feature engineering pipeline transforms raw, heterogeneous data sources into clean, normalized, model-ready feature vectors and stores them in a feature store for consistent use during both training and inference.

What the diagram shows

This flowchart traces the data transformation stages from raw sources to a served feature vector:

1. Raw data sources: the pipeline ingests from multiple origins — event streams (clickstreams, transactions), relational databases, third-party APIs, and unstructured logs.
2. Data extraction: connectors pull batches or streams of raw records for the target entities (users, items, sessions).
3. Data cleaning: null values are imputed, outliers are clipped or flagged, duplicate records are removed, and timestamps are normalized to UTC.
4. Feature transformation: domain-specific features are computed — rolling aggregates (7-day purchase count), ratio features (clicks / impressions), lag features, text embeddings, or geospatial encodings.
5. Feature validation: the computed features are checked against predefined expectations: value ranges, distribution bounds, and null rate thresholds. Failures here block writes to the feature store.
6. Feature store write: validated features are written to the feature store under a versioned feature group, indexed by entity key and timestamp.
7. Online store sync: a low-latency online store (Redis, DynamoDB) is updated for real-time serving during inference (see Inference Pipeline).
8. Offline store: the full historical feature table is written to the offline store (S3, BigQuery) for training data retrieval (see Model Training Pipeline).
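The cleaning, transformation, and validation stages above can be sketched in a few lines of pandas. This is a minimal illustration, not a production pipeline: the column names (`user_id`, `ts`, `amount`), the 99th-percentile outlier clip, and the validation thresholds are all illustrative assumptions.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 3: dedupe, normalize timestamps to UTC, impute nulls, clip outliers."""
    df = df.drop_duplicates(subset=["user_id", "ts"])
    df["ts"] = pd.to_datetime(df["ts"], utc=True)          # normalize to UTC
    df["amount"] = df["amount"].fillna(0.0)                # impute nulls
    df["amount"] = df["amount"].clip(upper=df["amount"].quantile(0.99))  # clip outliers
    return df


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 4: the 7-day rolling purchase count mentioned in the text."""
    df = df.sort_values("ts").set_index("ts")
    feats = (
        df.groupby("user_id")["amount"]
        .rolling("7D")
        .count()
        .rename("purchase_count_7d")
        .reset_index()
    )
    return feats


def validate(feats: pd.DataFrame) -> None:
    """Stage 5: range and null-rate checks; a failure blocks the store write."""
    assert feats["purchase_count_7d"].between(0, 10_000).all(), "range check failed"
    assert feats["purchase_count_7d"].isna().mean() < 0.01, "null-rate check failed"
```

In a real pipeline each stage would be an orchestrated task (Airflow, Dagster) with the validation step wired to alerting, but the data flow is the same.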

Why this matters

Consistent feature computation between training and serving — the "training-serving skew" problem — is one of the most common sources of ML performance degradation in production. A shared feature store largely eliminates this class of bug by having both paths read the same precomputed values instead of reimplementing the feature logic twice.


Frequently asked questions

What is a feature engineering pipeline?

A feature engineering pipeline is the automated data transformation workflow that converts raw, heterogeneous data sources into clean, normalized, model-ready feature vectors and stores them in a feature store for consistent retrieval during both model training and real-time inference.
How does a feature engineering pipeline work?

The pipeline ingests raw records from event streams, databases, and APIs; cleans and normalizes the data; computes domain-specific transformations (rolling aggregates, ratios, embeddings); validates the output against expected distributions and null rates; and writes the results to a versioned feature store with separate online (low-latency) and offline (high-throughput) storage tiers.
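The dual-write to the two storage tiers can be sketched as follows. The in-memory dicts stand in for the real backends (S3/BigQuery offline, Redis/DynamoDB online), and the row schema (`entity_key`, `ts`) is an illustrative assumption:

```python
from collections import defaultdict

# Offline tier: append-only history, keyed by feature group (stand-in for S3/BigQuery).
offline_store = defaultdict(list)
# Online tier: latest row per entity, keyed by (feature_group, entity_key)
# (stand-in for Redis/DynamoDB).
online_store = {}


def write_features(feature_group: str, rows: list[dict]) -> None:
    """Write validated rows to both tiers of a versioned feature group."""
    for row in rows:
        # Offline: keep full history for point-in-time training retrieval.
        offline_store[feature_group].append(row)
        # Online: keep only the freshest value per entity for low-latency serving.
        key = (feature_group, row["entity_key"])
        current = online_store.get(key)
        if current is None or row["ts"] >= current["ts"]:
            online_store[key] = row
```

The key property is that training reads the full offline history while serving reads only the latest online value, yet both originate from the same validated write.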
When should you invest in a shared feature store?

Invest in a shared feature store once you have multiple models that need the same features, or once you identify discrepancies between how features are computed in training versus serving. Without a shared store, each model recomputes the same features independently, creating both duplication and the risk of training-serving skew.
What are common feature engineering pipeline problems?

Frequent problems include data leakage (future data visible to features computed for training samples), inconsistent time-zone handling (features computed in UTC in training but local time in serving), feature drift (raw data distributions shift without triggering recomputation), and validation gaps (new feature groups added without range or null-rate checks).
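The data-leakage failure mode above is usually prevented with a point-in-time ("as-of") join when building training sets: each label row may only see feature values computed at or before the label's timestamp. A minimal sketch with `pandas.merge_asof` (column names are assumptions):

```python
import pandas as pd


def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """For each label row, attach the latest feature row with feature ts <= label ts.

    direction="backward" is what prevents leakage: a feature value computed
    after the label's timestamp can never be joined onto it.
    """
    labels = labels.sort_values("ts")      # merge_asof requires sorted keys
    features = features.sort_values("ts")
    return pd.merge_asof(labels, features, on="ts", by="user_id", direction="backward")
```

A naive equi-join or a join on the latest feature value would silently leak future information into training samples, inflating offline metrics.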
```mermaid
flowchart TD
    subgraph Sources["Raw Data Sources"]
        S1([Event stream])
        S2([Relational DB])
        S3([Third-party API])
    end
    S1 --> Extract[Extract raw records]
    S2 --> Extract
    S3 --> Extract
    Extract --> Clean["Clean: impute nulls, remove duplicates, normalize timestamps"]
    Clean --> Transform["Compute derived features: aggregates, ratios, embeddings"]
    Transform --> Validate{Feature validation}
    Validate -- Fail --> Alert(["Alert: feature quality failure"])
    Validate -- Pass --> Write[Write to feature store as versioned feature group]
    Write --> Online[("Online store: Redis or DynamoDB")]
    Write --> Offline[("Offline store: S3 or BigQuery")]
    Online --> InferencePipeline([Serve to inference pipeline])
    Offline --> TrainingPipeline([Serve to training pipeline])
```