diagram.mmd — flowchart
Rollback Deployment flowchart diagram

A rollback deployment is the emergency process of reverting a production environment from a failing new version back to the last known-good version, restoring service health as quickly as possible when post-deployment monitoring detects problems.

How the rollback works

A rollback is triggered either automatically — when monitoring thresholds are breached (error rate spike, latency increase, failed health checks) — or manually by an on-call engineer who observes degradation that automated alerts have not yet caught.
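The automatic trigger can be sketched as a simple threshold check over the monitored signals named above. The metric names and threshold values here are illustrative assumptions, not part of any real monitoring system:

```python
# Hypothetical thresholds -- real values would come from your
# monitoring system's alerting rules, not from this sketch.
ERROR_RATE_THRESHOLD = 0.05      # rollback if >5% of requests fail
LATENCY_P99_THRESHOLD_MS = 800   # rollback if p99 latency exceeds 800 ms

def should_trigger_rollback(metrics: dict) -> bool:
    """Return True when any monitored signal breaches its threshold:
    error rate spike, latency increase, or failed health checks."""
    return (
        metrics.get("error_rate", 0.0) > ERROR_RATE_THRESHOLD
        or metrics.get("latency_p99_ms", 0.0) > LATENCY_P99_THRESHOLD_MS
        or not metrics.get("health_check_passing", True)
    )
```

A manual rollback bypasses this check entirely: the on-call engineer invokes the rollback procedure directly, regardless of what the automated thresholds report.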

The first action is to identify the last stable artifact version. This is retrieved from the release inventory in the Artifact Storage Pipeline, which records the version tag and digest of every previously deployed artifact. The inventory is queried for the deployment immediately preceding the current one.
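The inventory lookup amounts to walking the deployment history newest-first and taking the most recent record that is neither the failing version nor already flagged unstable. The record schema (version, digest, status fields) is an assumption for illustration; a real Artifact Storage Pipeline will have its own API:

```python
def last_stable_artifact(inventory: list, current_version: str) -> dict:
    """Return the most recent deployment record that precedes the
    failing version and is still marked stable.

    `inventory` is assumed to be ordered oldest -> newest, with each
    record shaped like {"version": ..., "digest": ..., "status": ...}.
    """
    for record in reversed(inventory):
        if record["version"] != current_version and record["status"] == "stable":
            return record
    raise LookupError("no stable artifact available to roll back to")
```

Skipping records already marked unstable matters: if the previous deployment was itself rolled back, the target must be the one before it, not the last one chronologically.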

With the rollback target identified, the deployment system begins shifting traffic back to the previous version. The mechanism depends on the deployment strategy in use: in a blue/green setup, the load balancer route is switched back to the blue environment in seconds. In a canary deployment, canary traffic weight is reduced to zero. In a rolling deployment, pods are progressively replaced with instances of the prior version.
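The strategy-dependent branch can be expressed as a simple dispatch. The three handlers below return descriptive strings as stand-ins for real infrastructure calls (load balancer API, canary traffic weights, orchestrator rollout), which are outside the scope of this sketch:

```python
def shift_traffic_back(strategy: str, target_version: str) -> str:
    """Dispatch the rollback mechanism by deployment strategy.
    Returned strings are placeholders for real infrastructure actions."""
    if strategy == "blue_green":
        # Near-instant: flip the load balancer route back.
        return f"switched load balancer to environment running {target_version}"
    if strategy == "canary":
        # All traffic returns to the stable version.
        return "reduced canary traffic weight to 0%"
    if strategy == "rolling":
        # Slower, but needs no pre-provisioned second environment.
        return f"progressively replacing pods with {target_version}"
    raise ValueError(f"unknown deployment strategy: {strategy}")
```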

During rollback, monitoring continues in real time. Once the error rate returns to baseline and health checks pass consistently, the rollback is considered complete. The incident remains open — the root cause of the failure must be investigated (see Incident Management Flow) before the new version is re-deployed. The failed artifact is marked as unstable in the release inventory to prevent accidental re-promotion.
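The completion check and the unstable-artifact flag can be sketched as follows. Requiring several consecutive healthy samples (the window size here is an arbitrary assumption) guards against declaring success on a single lucky reading:

```python
def rollback_complete(error_rates: list, baseline: float, window: int = 3) -> bool:
    """True once the last `window` error-rate samples are all at or
    below baseline -- i.e. health checks pass consistently."""
    if len(error_rates) < window:
        return False
    return all(rate <= baseline for rate in error_rates[-window:])

def mark_unstable(inventory: list, version: str) -> None:
    """Flag the failed artifact in the release inventory so it
    cannot be accidentally re-promoted."""
    for record in inventory:
        if record["version"] == version:
            record["status"] = "unstable"
```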


Frequently asked questions

What is a rollback deployment?
A rollback deployment is the emergency process of reverting production from a failing new version back to the last known-good version. It restores service health as quickly as possible when post-deployment monitoring detects degradation.

How does a rollback work?
The rollback target — the previously deployed artifact version — is retrieved from the release inventory. Depending on the deployment strategy, traffic is shifted back via load balancer switch (blue/green), weight reduction (canary), or progressive pod replacement (rolling). Monitoring confirms recovery before the rollback is declared complete.

How does rollback differ across deployment strategies?
In a blue/green setup, rollback is near-instant — the load balancer flips back to the prior environment. In a canary deployment, rollback means reducing the canary traffic weight to zero. In a rolling deployment, pods are progressively replaced with instances of the prior version, which is slower but requires no pre-provisioned second environment.

What are common rollback mistakes?
The most common mistakes are not testing the rollback path before it is needed, failing to mark the unstable artifact in the registry (risking re-promotion), and closing the incident before the root cause is understood — leading to the same failure on the next deploy.
```mermaid
flowchart TD
    Trigger[Rollback triggered by alert or engineer] --> IdentifyStable[Identify last stable artifact version]
    IdentifyStable --> FetchArtifact[Fetch previous artifact from registry]
    FetchArtifact --> Strategy{Deployment strategy?}
    Strategy -->|Blue/Green| SwitchRoute[Switch load balancer to previous environment]
    Strategy -->|Canary| ZeroCanary[Reduce canary traffic weight to zero]
    Strategy -->|Rolling| RollingReplace[Replace pods with previous version]
    SwitchRoute --> HealthCheck[Run health checks]
    ZeroCanary --> HealthCheck
    RollingReplace --> HealthCheck
    HealthCheck --> Recovered{Error rate back to baseline?}
    Recovered -->|No| Escalate[Escalate to incident commander]
    Recovered -->|Yes| ConfirmStable[Confirm service stability]
    ConfirmStable --> MarkUnstable[Mark failed artifact as unstable]
    MarkUnstable --> OpenIncident[Open incident for root cause analysis]
    OpenIncident --> Done[Rollback complete]
```