Database Failover: Mermaid Sequence Diagram


Database failover is the automatic (or manual) process of promoting a replica to become the new primary when the original primary node becomes unavailable, restoring write availability with minimal downtime.

This sequence diagram traces the events from the moment the primary stops responding through to the point where the application is routing writes to the new primary. The health monitor (typically a tool like Patroni, Orchestrator, AWS RDS Multi-AZ, or a Kubernetes operator) continuously sends heartbeat probes to the primary. When the primary fails to respond within a configured timeout, the monitor starts a failure confirmation round — checking multiple times from multiple vantage points to avoid a false positive caused by a transient network hiccup.
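The confirmation round described above can be sketched in a few lines of Python. This is a minimal illustration, not how Patroni, Orchestrator, or any other specific tool implements it: `confirm_failure`, the `probes` callables, and the majority-quorum rule are all assumptions made for the example. Each probe stands in for a heartbeat check from one vantage point and returns `True` when the primary answered.

```python
from typing import Callable, Optional, Sequence


def confirm_failure(
    probes: Sequence[Callable[[], bool]],
    attempts: int = 3,
    quorum: Optional[int] = None,
) -> bool:
    """Return True only if enough vantage points see every attempt fail.

    Each probe is a hypothetical callable that returns True when the primary
    answered a heartbeat from that vantage point and False on timeout.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    if quorum is None:
        quorum = len(probes) // 2 + 1  # assumed rule: simple majority

    failed_vantage_points = 0
    for probe in probes:
        # A vantage point counts as failed only if all of its attempts fail;
        # a single success means the primary is still reachable from there.
        if not any(probe() for _ in range(attempts)):
            failed_vantage_points += 1
    return failed_vantage_points >= quorum
```

A real monitor would also wait between attempts and bound each probe with its own timeout; both are omitted here to keep the sketch short.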

Once failure is confirmed, the monitor identifies the most advanced replica: the one with the highest applied log sequence number (LSN), which minimizes data loss. In a synchronous replication setup, the synchronous standby has received every committed transaction and is the natural choice.
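As a rough sketch of the selection step, the following Python picks the replica with the highest LSN. It assumes LSNs are reported in the PostgreSQL text format (two hexadecimal numbers separated by a slash, such as `0/3000148`); the function names and replica names are hypothetical. Parsing matters because comparing the raw strings gives wrong answers: `0/A0` sorts after `0/9FF` as text but is the smaller position.

```python
from typing import Mapping


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN such as '0/3000060' to an integer."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after it the
    # lower 32 bits, both written in hexadecimal.
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(replica_lsns: Mapping[str, str]) -> str:
    """Return the name of the replica with the highest reported LSN."""
    if not replica_lsns:
        raise ValueError("no replicas available for promotion")
    return max(replica_lsns, key=lambda name: parse_lsn(replica_lsns[name]))


# Usage with made-up replica names and positions.
candidates = {
    "replica-a": "0/3000060",
    "replica-b": "0/3000148",
    "replica-c": "0/2FFFFF0",
}
best = pick_promotion_candidate(candidates)
```

Here `best` is `"replica-b"`, since `0x3000148` is the largest position of the three.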

The monitor issues a PROMOTE command to the chosen replica. The replica completes applying any remaining WAL from its standby queue, then transitions to primary mode: it begins accepting write connections and stops consuming replication messages. The old primary is fenced — its network access is revoked or its process is killed — to prevent a split-brain scenario where two nodes both believe they are the primary.
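The ordering of fencing and promotion is what prevents split-brain, so it is worth making explicit. The sketch below follows the order used in the sequence diagram (fence first, then promote) and refuses to promote when fencing cannot be confirmed. `fail_over`, `FencingError`, and the two callables are hypothetical names for this example; real tools have their own fencing mechanisms.

```python
from typing import Callable


class FencingError(RuntimeError):
    """Raised when the old primary cannot be confirmed as fenced."""


def fail_over(
    fence_old_primary: Callable[[], bool],
    promote_replica: Callable[[], None],
) -> None:
    """Fence the old primary, then promote the chosen replica.

    fence_old_primary is assumed to return True only when the old primary
    is confirmed unable to accept writes.
    """
    # Fence first: if the old primary might still accept writes, promoting
    # a replica would leave two writable nodes (split-brain).
    if not fence_old_primary():
        raise FencingError("old primary not fenced; refusing to promote")
    promote_replica()
```

Aborting on a failed fence trades availability for safety: writes stay unavailable until an operator or a retry resolves the fencing problem, but the data set cannot diverge.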

Finally, the DNS record or virtual IP for the primary endpoint is updated to point to the new primary. The application's connection pooling layer detects the connection failure, reconnects to the endpoint, and resumes normal operation. Total failover time in well-configured systems typically ranges from 10 to 60 seconds.
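On the application side, reconnecting usually means retrying for long enough to outlast the failover window. The following is a minimal sketch of retry with exponential backoff; `connect_with_retry`, its parameters, and the default delays are assumptions for illustration, and `connect` stands in for whatever the driver or pool uses to open a connection to the primary endpoint.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    raise AssertionError("unreachable")
```

The attempt count and delays should be tuned so the total wait covers the expected failover time. Because the endpoint is re-resolved on each attempt, a short DNS TTL on the primary record helps clients reach the new primary sooner.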

Frequently asked questions

What is database failover?

Database failover is the process of automatically or manually promoting a replica to become the new primary when the original primary becomes unavailable. The goal is to restore write availability with minimal downtime and data loss.

How does database failover work?

A health monitor continuously probes the primary with heartbeat checks. When the primary fails to respond within a timeout, the monitor runs a multi-point failure confirmation to rule out false positives. It then selects the most advanced replica (highest LSN), issues a PROMOTE command, fences the old primary to prevent split-brain, and updates the DNS or virtual IP so applications reconnect to the new primary.

When should you use synchronous versus asynchronous replication?

Use synchronous replication when zero data loss on failover is required: financial systems, payment processors, or any workload where losing even one committed write is unacceptable. Use asynchronous replication when write latency is a priority and a small data-loss window is tolerable. Tools like Patroni (with synchronous mode enabled) or AWS RDS Multi-AZ can enforce synchronous replication to at least one standby to make failover lossless.

mermaid

sequenceDiagram
    participant Monitor as Health Monitor
    participant Primary as Primary DB
    participant Replica as Replica DB
    participant DNS as DNS / VIP
    participant App as Application

    Monitor->>Primary: Heartbeat probe
    Primary--xMonitor: No response (timeout)
    Monitor->>Primary: Retry probe x3
    Primary--xMonitor: Still no response

    Monitor->>Monitor: Confirm primary failure
    Monitor->>Replica: Check replication LSN
    Replica-->>Monitor: LSN=98432 (fully caught up)

    Monitor->>Primary: Fence primary (revoke network access)
    Monitor->>Replica: PROMOTE to primary
    Replica->>Replica: Apply remaining WAL
    Replica->>Replica: Switch to read-write mode
    Replica-->>Monitor: Promotion complete

    Monitor->>DNS: Update primary endpoint to new primary
    DNS-->>Monitor: DNS updated

    App->>DNS: Reconnect to primary endpoint
    DNS-->>App: Resolves to new primary
    App->>Replica: Write query (new primary)
    Replica-->>App: Write acknowledged
    Note over App,Replica: Failover complete, normal operation resumed