Data Replication Strategy: Mermaid Diagram

Data Replication Strategy flowchart diagram

About Source

A data replication strategy defines how copies of data are maintained across multiple nodes in a distributed system, balancing the trade-offs between consistency (all replicas agree), durability (data survives failures), and performance (write and read latency).

Replication is fundamental to fault tolerance: if one node fails, another holds a copy. But replication immediately raises the consistency question — what happens when a write reaches some replicas but not others before a failure?

Single-Leader Replication (Primary-Replica): All writes go to a designated primary node, which applies the write and forwards it to replicas. Reads can be served by any replica (risking stale reads) or only by the primary (consistent reads). MySQL's binlog replication, PostgreSQL's streaming replication, and Kafka's ISR model follow this pattern. See Raft Consensus Algorithm for how the leader position is maintained durably.

Synchronous vs. Asynchronous Replication: In synchronous mode, the primary waits for acknowledgment from a quorum (or all) replicas before confirming the write to the client. This guarantees durability but adds write latency proportional to the slowest replica. Asynchronous replication confirms immediately and replicates in the background — lower latency but risk of data loss on primary failure.

Quorum Writes and Reads: Systems like Cassandra, DynamoDB, and Riak let operators configure W (write quorum) and R (read quorum). The rule W + R > N (where N is replication factor) guarantees that at least one node in every read quorum has the latest write. Common settings: N=3, W=2, R=2.

Conflict Resolution: When network partitions allow two primaries to accept writes (split-brain), conflicts arise. Resolution strategies include last-write-wins (timestamp ordering), vector clocks (Riak, Dynamo), CRDTs (conflict-free replicated data types), or explicit application-level merges. The Cluster Coordination Architecture shows how distributed locks and fencing tokens prevent split-brain in strict consistency systems.

Frequently asked questions

A data replication strategy defines how and when copies of data are propagated to multiple nodes in a distributed system. The strategy determines the trade-off between write latency (how long a client waits for confirmation), read freshness (how stale a read can be), and durability (how many node failures the system can survive without data loss).

In single-leader replication all writes go to a primary node that forwards changes to replicas either synchronously (waiting for quorum acknowledgement before confirming to the client) or asynchronously (confirming immediately and replicating in the background). Leaderless systems like Cassandra and DynamoDB use quorum reads and writes — if W + R > N, at least one node in every read quorum has the latest write.

Use synchronous replication when data loss is unacceptable — financial transactions, user-facing writes that must be durable. Accept the added write latency. Use asynchronous replication when throughput and low latency matter more than strict durability — analytics pipelines, read replicas for reporting, or cross-region disaster-recovery copies where some lag is tolerable.

Split-brain (two nodes both accepting writes after a partition) is the most dangerous failure mode. Replication lag causing stale reads, conflicting writes when using last-write-wins with skewed clocks, and replica divergence after a network partition are also common. Fencing tokens, vector clocks, and CRDTs each address different facets of these problems.

Synchronous replication propagates a single write to replica nodes and waits for acknowledgements, but each node applies it independently. Two-phase commit (2PC) is a distributed transaction protocol that coordinates an atomic commit or abort across multiple independent participants — for example, updating two different databases in one transaction. Synchronous replication is a durability mechanism; 2PC is an atomicity mechanism.

mermaid

flowchart TD
    A([Client Write Request\nkey=user:42 value=data]) --> B{Replication\nMode}

    B -->|Single-Leader Sync| C[Primary Node\nappend to write-ahead log]
    C --> D[Replicate to Replica 1]
    C --> E[Replicate to Replica 2]
    D --> F{Both replicas\nacknowledged?}
    E --> F
    F -->|Yes| G[Confirm write to client\ndurable — sync mode]
    F -->|No — timeout| H[Return error\nwrite not durable]

    B -->|Single-Leader Async| I[Primary Node\nwrite and confirm immediately]
    I --> J([Client receives ACK])
    I --> K[Async replicate to Replica 1]
    I --> L[Async replicate to Replica 2]
    K --> M[Replicas eventually\nconsistent]
    L --> M

    B -->|Quorum — N=3 W=2 R=2| N[Send write to all 3 nodes\nin parallel]
    N --> O[Node A — accept]
    N --> P[Node B — accept]
    N --> Q[Node C — unreachable]
    O --> R{W=2 acks\nreceived?}
    P --> R
    R -->|Yes| S[Confirm write to client\nW+R is greater than N — consistent reads]
    R -->|No| T[Write failed — below quorum]

    subgraph Conflict ["Conflict Resolution — Last Write Wins"]
        CW1[Write at t=100 from Region A] --> CM[Merge — compare timestamps]
        CW2[Write at t=105 from Region B] --> CM
        CM --> CV[Keep t=105 value\ndiscard t=100]
    end

    style G fill:#d1fae5,stroke:#10b981
    style S fill:#d1fae5,stroke:#10b981
    style H fill:#fee2e2,stroke:#ef4444
    style Conflict fill:#fef3c7,stroke:#f59e0b