Alerting Workflow
An alerting workflow defines the end-to-end path from a detected anomaly in metrics or logs through notification, acknowledgement, and resolution — ensuring the right engineer is paged at the right time with enough context to act quickly.
How the workflow works
The workflow begins with alert rules defined against a metrics or log data source. Each rule specifies a threshold condition (e.g., "HTTP error rate above 5% for 5 consecutive minutes"), a severity level, and the routing labels that determine who gets paged. The alerting engine evaluates every rule on each data polling cycle.
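The rule evaluation step can be sketched as follows. This is a minimal illustration, not any real engine's API: the `AlertRule` shape and `evaluate` helper are hypothetical, standing in for the YAML rule files most systems use.

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    # Hypothetical rule shape: threshold condition, severity, routing labels.
    name: str
    threshold: float          # e.g. 0.05 for "error rate above 5%"
    duration_cycles: int      # consecutive polling cycles the condition must hold
    severity: str             # e.g. "warning", "critical"
    labels: dict = field(default_factory=dict)  # routing labels for paging

def evaluate(rule: AlertRule, recent_values: list[float]) -> bool:
    """Fire only if the condition held for the required consecutive cycles."""
    window = recent_values[-rule.duration_cycles:]
    return (len(window) == rule.duration_cycles
            and all(v > rule.threshold for v in window))

rule = AlertRule("HighErrorRate", 0.05, 5, "critical", {"team": "payments"})
print(evaluate(rule, [0.06, 0.07, 0.08, 0.09, 0.10]))  # held for all 5 cycles -> True
print(evaluate(rule, [0.01, 0.07, 0.08, 0.09, 0.10]))  # only 4 cycles -> False
```

The key detail is the "for 5 consecutive minutes" clause: a single spike above the threshold does not fire the alert, which filters out transient noise.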
When a rule condition is met, the engine marks the alert as firing and sends it to an alert manager such as Prometheus Alertmanager, PagerDuty, or OpsGenie. The alert manager applies deduplication to suppress redundant copies of the same alert, groups related alerts by service or region to reduce noise, and applies inhibition rules to suppress lower-severity alerts when a higher-severity parent alert is already firing.
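The three filtering stages can be sketched together. This is an illustrative simplification, assuming alerts are dicts with "fingerprint", "severity", and "service" keys, with one hardcoded inhibition rule (critical suppresses warnings for the same service); real alert managers make all of this configurable.

```python
from collections import defaultdict

def process(alerts: list[dict], active_fingerprints: set) -> dict:
    # 1. Deduplicate: drop alerts whose fingerprint is already firing.
    fresh = [a for a in alerts if a["fingerprint"] not in active_fingerprints]

    # 2. Inhibit: suppress lower-severity alerts for any service that
    #    already has a critical alert in this batch (assumed rule).
    critical_services = {a["service"] for a in fresh if a["severity"] == "critical"}
    kept = [a for a in fresh
            if a["severity"] == "critical" or a["service"] not in critical_services]

    # 3. Group by service so one notification covers related alerts.
    groups = defaultdict(list)
    for a in kept:
        groups[a["service"]].append(a)
    return dict(groups)

alerts = [
    {"fingerprint": "f1", "severity": "critical", "service": "api"},
    {"fingerprint": "f2", "severity": "warning",  "service": "api"},  # inhibited by f1
    {"fingerprint": "f3", "severity": "warning",  "service": "db"},
    {"fingerprint": "f4", "severity": "warning",  "service": "api"},  # already active
]
groups = process(alerts, active_fingerprints={"f4"})
print(sorted(groups))  # ['api', 'db']
```

After this pass, the "api" group contains only the critical alert and the "db" warning survives untouched, so responders see two concise notifications instead of four.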
After grouping and deduplication, the alert manager routes the alert to the appropriate responder — the on-call engineer for the affected service — via configured channels: PagerDuty page, Slack message, or email. The on-call engineer receives the notification with a link to the relevant dashboard and runbook.
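Routing can be modeled as an ordered list of label matchers, loosely inspired by Alertmanager's routing tree. The route table and channel names below are hypothetical; the point is the first-match-wins lookup from alert labels to notification channels.

```python
def route(alert: dict, routes: list, default_channel: str = "email") -> list:
    """Return notification channels for an alert by matching its labels.

    `routes` is a list of (match_labels, channels) pairs; the first
    entry whose labels are all present on the alert wins.
    """
    for match_labels, channels in routes:
        if all(alert["labels"].get(k) == v for k, v in match_labels.items()):
            return channels
    return [default_channel]  # fallback when nothing matches

routes = [
    ({"severity": "critical"}, ["pagerduty", "slack"]),  # page for criticals
    ({"team": "payments"},     ["slack"]),               # team channel otherwise
]

print(route({"labels": {"severity": "critical", "team": "payments"}}, routes))
print(route({"labels": {"severity": "warning", "team": "payments"}}, routes))
print(route({"labels": {"severity": "info"}}, routes))
```

Ordering matters here: the severity matcher comes first so a critical payments alert pages rather than quietly landing in Slack.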
The engineer acknowledges the alert to stop escalation. Acknowledgement starts the incident clock. The engineer investigates, applies a fix or mitigation, and confirms the service is healthy. Once the underlying metrics return to normal, the alert transitions to resolved and the alert manager sends a resolution notification. The full timeline is recorded for the post-incident review.
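The lifecycle described above is effectively a small state machine. The state and event names here are illustrative rather than any specific tool's vocabulary:

```python
# Sketch of the alert lifecycle: firing -> acknowledged -> resolved.
TRANSITIONS = {
    ("firing", "acknowledge"): "acknowledged",      # stops escalation
    ("acknowledged", "metrics_normal"): "resolved", # underlying metrics recover
    ("firing", "metrics_normal"): "resolved",       # auto-resolve without an ack
}

def transition(state: str, event: str) -> str:
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"invalid transition: {state} -> {event}")
    return TRANSITIONS[(state, event)]

state = transition("firing", "acknowledge")   # "acknowledged"
state = transition(state, "metrics_normal")   # "resolved"
print(state)
```

Recording the timestamp of each transition is what produces the incident timeline used in the post-incident review.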
See Incident Management Flow for how a firing alert escalates into a formal incident.