diagram.mmd — sequence
Database Failover sequence diagram

Database failover is the automatic (or manual) process of promoting a replica to become the new primary when the original primary node becomes unavailable, restoring write availability with minimal downtime.

This sequence diagram traces the events from the moment the primary stops responding through to the point where the application is routing writes to the new primary. The health monitor (typically a tool like Patroni, Orchestrator, AWS RDS Multi-AZ, or a Kubernetes operator) continuously sends heartbeat probes to the primary. When the primary fails to respond within a configured timeout, the monitor starts a failure confirmation round — checking multiple times from multiple vantage points to avoid a false positive caused by a transient network hiccup.
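The confirmation round described above can be sketched as a vote across vantage points. This is a minimal illustration, not how any particular tool implements it: the function name `confirm_primary_failure`, the `attempts` and `quorum` parameters, and the idea of passing each vantage point in as a callable are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def _safe_probe(probe: Callable[[], bool]) -> bool:
    """Run one heartbeat probe; treat any exception (timeout, refused) as a failure."""
    try:
        return bool(probe())
    except Exception:
        return False


def confirm_primary_failure(
    probes: Sequence[Callable[[], bool]],
    attempts: int = 3,
    quorum: Optional[int] = None,
) -> bool:
    """Return True only if enough vantage points agree the primary is down.

    Each entry in `probes` is a hypothetical heartbeat check run from one
    vantage point. A vantage point votes "down" only when every one of its
    attempts fails, so a single transient hiccup does not count.
    """
    if quorum is None:
        quorum = len(probes) // 2 + 1  # simple majority (assumed policy)
    down_votes = 0
    for probe in probes:
        # any() stops at the first successful attempt for this vantage point.
        if not any(_safe_probe(probe) for _ in range(attempts)):
            down_votes += 1
    return down_votes >= quorum
```

Requiring a majority rather than a single failed probe is one way to keep a network partition that affects only the monitor from triggering an unnecessary failover.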

Once failure is confirmed, the monitor identifies the most advanced replica: the one with the highest applied log sequence number (LSN), which means promoting it loses the fewest committed transactions. In a synchronous replication setup, the synchronous standby has already received every committed transaction and is the natural choice.
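Choosing the replica with the highest LSN can be sketched as follows. The example assumes PostgreSQL's textual LSN format (two hexadecimal numbers separated by a slash, such as `0/16B3748`); the diagram on this page shows a plain integer instead, and other databases use other formats. The helper names `parse_lsn` and `pick_promotion_candidate` are hypothetical.

```python
from typing import Dict


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN such as '0/16B3748' to a 64-bit integer.

    The part before the slash is the high 32 bits and the part after it is
    the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(replica_lsns: Dict[str, str]) -> str:
    """Return the name of the replica with the highest applied LSN."""
    if not replica_lsns:
        raise ValueError("no replicas available for promotion")
    return max(replica_lsns, key=lambda name: parse_lsn(replica_lsns[name]))


if __name__ == "__main__":
    # Replica names and LSN values below are made up for illustration.
    lsns = {"replica-a": "0/A000000", "replica-b": "0/10000000"}
    print(pick_promotion_candidate(lsns))
```

Comparing the parsed integers matters: as plain strings, `0/A000000` sorts after `0/10000000`, even though the second value is the larger position in the log.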

The monitor first fences the old primary (its network access is revoked or its process is killed) to prevent a split-brain scenario in which two nodes both believe they are the primary. It then issues a PROMOTE command to the chosen replica. The replica finishes applying any remaining WAL from its standby queue, then transitions to primary mode: it stops consuming replication messages and begins accepting write connections.
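The diagram on this page fences the old primary before promoting the replica. A minimal sketch of that ordering is shown below, assuming fencing and promotion are supplied as callables that report success; `run_failover` and its parameters are hypothetical names, not part of any tool's API.

```python
from typing import Callable, List


def run_failover(
    fence_old_primary: Callable[[], bool],
    promote_replica: Callable[[], bool],
    log: List[str],
) -> bool:
    """Fence first, then promote; abort without promoting if fencing fails.

    Promoting while the old primary might still accept writes is what
    creates a split-brain, so promotion is never attempted unless fencing
    has reported success.
    """
    log.append("fence")
    if not fence_old_primary():
        log.append("abort: fencing failed")
        return False

    log.append("promote")
    if not promote_replica():
        log.append("abort: promotion failed")
        return False

    log.append("done")
    return True
```

Aborting when fencing fails trades availability for safety: the cluster stays without a writable primary until an operator intervenes, but it cannot end up with two.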

Finally, the DNS record or virtual IP for the primary endpoint is updated to point to the new primary. The application's connection pooling layer detects the connection failure, reconnects to the endpoint, and resumes normal operation. Total failover time in well-configured systems typically ranges from 10 to 60 seconds.
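On the application side, reconnecting usually means retrying until the endpoint resolves to the new primary. The sketch below shows one way to do that with exponential backoff. The function name, the delay values, and the 60-second budget (chosen to match the upper end of the failover window mentioned above) are assumptions; `connect` stands in for whatever call the application's driver or pool uses to open a connection.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def reconnect_with_backoff(
    connect: Callable[[], T],
    max_wait_s: float = 60.0,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `connect` with exponential backoff until it succeeds.

    Re-raises the last ConnectionError once the next delay would push the
    total time slept past `max_wait_s`.
    """
    waited = 0.0
    delay = base_delay_s
    while True:
        try:
            return connect()
        except ConnectionError:
            if waited + delay > max_wait_s:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)  # double the delay, capped
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.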


Frequently asked questions

What is database failover?
Database failover is the process of automatically or manually promoting a replica to become the new primary when the original primary becomes unavailable. The goal is to restore write availability with minimal downtime and data loss.

How does database failover work?
A health monitor continuously probes the primary with heartbeat checks. When the primary fails to respond within a timeout, the monitor runs a multi-point failure confirmation to rule out false positives. It then selects the most advanced replica (highest LSN), fences the old primary to prevent split-brain, issues a PROMOTE command, and updates the DNS record or virtual IP so applications reconnect to the new primary.

When should I use synchronous rather than asynchronous replication?
Use synchronous replication when zero data loss on failover is required, for example in financial systems, payment processors, or any workload where losing even one committed write is unacceptable. Use asynchronous replication when write latency is a priority and a small data-loss window is tolerable. AWS RDS Multi-AZ replicates synchronously to its standby, and Patroni can be configured (through its synchronous_mode setting) to require at least one synchronous standby, so that failover does not lose committed writes.
mermaid
sequenceDiagram
    participant Monitor as Health Monitor
    participant Primary as Primary DB
    participant Replica as Replica DB
    participant DNS as DNS / VIP
    participant App as Application
    Monitor->>Primary: Heartbeat probe
    Primary--xMonitor: No response (timeout)
    Monitor->>Primary: Retry probe x3
    Primary--xMonitor: Still no response
    Monitor->>Monitor: Confirm primary failure
    Monitor->>Replica: Check replication LSN
    Replica-->>Monitor: LSN=98432 (fully caught up)
    Monitor->>Primary: Fence primary (revoke network access)
    Monitor->>Replica: PROMOTE to primary
    Replica->>Replica: Apply remaining WAL
    Replica->>Replica: Switch to read-write mode
    Replica-->>Monitor: Promotion complete
    Monitor->>DNS: Update primary endpoint to new primary
    DNS-->>Monitor: DNS updated
    App->>DNS: Reconnect to primary endpoint
    DNS-->>App: Resolves to new primary
    App->>Replica: Write query (new primary)
    Replica-->>App: Write acknowledged
    Note over App,Replica: Failover complete, normal operation resumed