Metrics Collection: Mermaid Flowchart Diagram

Metrics collection is the process of gathering numeric measurements from applications, runtimes, operating systems, and cloud infrastructure at regular intervals and storing them in a time-series database for querying, dashboarding, and alerting.

How metrics collection works

Metrics originate from two sources: instrumented application code and infrastructure exporters. Application code uses a metrics library (Prometheus client, StatsD, OpenTelemetry) to define and increment counters, gauges, and histograms. Infrastructure exporters (node_exporter, kube-state-metrics, cloud provider integrations) surface OS-level and platform-level metrics without code changes.
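To make the three metric types concrete, here is a minimal pure-Python sketch. The class names and methods are hypothetical stand-ins written for illustration; they are not the API of the Prometheus client, StatsD, or OpenTelemetry, and the metric names and bucket bounds are assumptions.

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. requests currently in flight."""

    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        # One slot per bound plus a final catch-all (+Inf) slot.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Cumulative counts: how many observations were <= each bound."""
        running, out = 0, []
        for n in self.bucket_counts:
            running += n
            out.append(running)
        return out


# Hypothetical metrics for an HTTP service.
requests_total = Counter("http_requests_total")
in_flight = Gauge("http_requests_in_flight")
latency = Histogram("http_request_duration_seconds", bounds=[0.1, 0.5, 1.0])

requests_total.inc()
requests_total.inc()
in_flight.set(3)
for seconds in (0.05, 0.1, 0.7, 2.0):
    latency.observe(seconds)
```

A real metrics library adds thread safety, labels, and an exposition format on top of these ideas; the sketch only shows how the three types differ in what they record.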

There are two collection models. In the pull model (used by Prometheus), a central scraper polls each service's /metrics HTTP endpoint on a configurable interval (e.g., every 15 seconds), fetching the current metric snapshot. In the push model (used by StatsD, InfluxDB line protocol), services emit metric events to an aggregation daemon, which batches and forwards them.
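The difference between the two models can be sketched without any network code. In the example below, the `fetch` callable stands in for an HTTP GET against a target's /metrics endpoint, the parser handles only unlabelled `name value` lines, and the target names and metric names are made up for illustration. The push side formats a datagram in the StatsD line style (`name:value|type`).

```python
def parse_exposition(text):
    """Parse a simplified text snapshot into {metric_name: value}.

    Only unlabelled `name value` lines are handled; blank lines and
    comment lines starting with '#' are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape(fetch, targets):
    """Pull model: the collector asks each target for its current snapshot."""
    return {target: parse_exposition(fetch(target)) for target in targets}


def statsd_line(name, value, kind):
    """Push model: the service formats one event and sends it to a daemon."""
    codes = {"counter": "c", "gauge": "g", "timer": "ms"}
    return f"{name}:{value}|{codes[kind]}"


# Hypothetical targets; a dict lookup replaces the real HTTP request.
fake_endpoints = {
    "api:8080": "# TYPE http_requests_total counter\nhttp_requests_total 42\n",
    "worker:8080": "queue_depth 7\n",
}
snapshot = scrape(fake_endpoints.__getitem__, fake_endpoints)
pushed = statsd_line("jobs_done", 1, "counter")
```

In the pull case the collector decides when data is read; in the push case the service decides when data is sent, which is why the function needs no knowledge of a schedule.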

Regardless of model, raw metrics pass through a processing stage that attaches labels — environment name, service version, host, region — enriching each data point with dimensional context. These labels are critical for filtering and aggregating data in dashboards (e.g., error rate per service per region).
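As a rough illustration of label enrichment and label-based aggregation, the sketch below merges static labels into each sample and then sums a metric by chosen label keys. The sample layout (a dict with `name`, `labels`, and `value`), the helper names, and the label values are all assumptions made for the example.

```python
from collections import defaultdict


def attach_labels(sample, static_labels):
    """Return a copy of the sample with static labels merged in.

    Labels already present on the sample take precedence.
    """
    merged = dict(static_labels)
    merged.update(sample["labels"])
    return {"name": sample["name"], "labels": merged, "value": sample["value"]}


def sum_by(samples, name, keys):
    """Sum the values of one metric, grouped by the given label keys."""
    totals = defaultdict(float)
    for s in samples:
        if s["name"] == name:
            totals[tuple(s["labels"].get(k, "") for k in keys)] += s["value"]
    return dict(totals)


def error_sample(service, host, value):
    """Build a raw sample as an application might emit it (hypothetical)."""
    return {
        "name": "http_errors_total",
        "labels": {"service": service, "host": host},
        "value": value,
    }


eu = {"environment": "prod", "region": "eu-west"}
us = {"environment": "prod", "region": "us-east"}

samples = [
    attach_labels(error_sample("checkout", "host-a", 5.0), eu),
    attach_labels(error_sample("checkout", "host-b", 3.0), eu),
    attach_labels(error_sample("checkout", "host-c", 1.0), us),
    attach_labels(error_sample("search", "host-d", 2.0), eu),
]

# Errors per service per region, ignoring the host dimension.
errors = sum_by(samples, "http_errors_total", ["service", "region"])
```

Grouping by `service` and `region` collapses the per-host series, which is the kind of query a dashboard panel would run.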

Processed metrics are written to a time-series database. Prometheus uses a local TSDB; cloud-hosted setups often use managed services such as Amazon CloudWatch or Google Cloud Monitoring. Depending on the backend, older data is downsampled or expired according to retention policies.
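Retention and downsampling can be sketched on a plain list of `(timestamp, value)` points. This is a simplified illustration, assuming integer Unix-style timestamps and averaging as the downsampling function; real backends differ in how and whether they do either step.

```python
def expire(points, now, retention_seconds):
    """Drop points older than the retention window."""
    cutoff = now - retention_seconds
    return [(ts, v) for ts, v in points if ts >= cutoff]


def downsample(points, bucket_seconds):
    """Average points into fixed-width buckets keyed by bucket start time."""
    buckets = {}
    for ts, v in points:
        start = ts - ts % bucket_seconds
        buckets.setdefault(start, []).append(v)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


# One point every 15 seconds (illustrative values).
points = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 9.0), (75, 11.0)]

per_minute = downsample(points, bucket_seconds=60)
recent = expire(points, now=90, retention_seconds=60)
```

Averaging hides short spikes inside a bucket, so some systems also keep the minimum and maximum per bucket when they downsample.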

The stored metrics feed two consumers: visualization dashboards that render time-series graphs, and the alerting engine that evaluates threshold rules (see Alerting Workflow) and fires pages when SLOs are breached.
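The alerting side can be sketched as a rule that fires only after several consecutive breaching values, which avoids paging on a single noisy point. The `ThresholdRule` type, its fields, and the example numbers are hypothetical; they do not mirror the rule format of any particular alerting engine.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    name: str
    threshold: float
    for_points: int  # consecutive breaching points required before firing

    def __post_init__(self):
        if self.for_points < 1:
            raise ValueError("for_points must be at least 1")


def evaluate(rule, values):
    """Return True if the last `for_points` values all exceed the threshold."""
    if len(values) < rule.for_points:
        return False
    return all(v > rule.threshold for v in values[-rule.for_points:])


# Hypothetical rule: error ratio above 5% for three evaluations in a row.
rule = ThresholdRule(name="HighErrorRate", threshold=0.05, for_points=3)

sustained = evaluate(rule, [0.01, 0.06, 0.07, 0.08])
recovered = evaluate(rule, [0.06, 0.07, 0.01])
too_short = evaluate(rule, [0.9])
```

Here `sustained` is true because the last three values all breach the threshold, while `recovered` and `too_short` are false.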

Frequently asked questions

What is metrics collection?

Metrics collection is the process of gathering numeric measurements from applications, runtimes, and infrastructure at regular intervals and storing them in a time-series database. These measurements (counters, gauges, and histograms) form the basis of dashboards and SLO-based alerting.

How does Prometheus collect metrics?

Prometheus uses a pull model: a central scraper polls each service's `/metrics` HTTP endpoint on a configurable interval (15 seconds is a common choice; the built-in default is 1 minute), fetches the current metric snapshot, and stores it in its local time-series database. Labels attached to each metric enable filtering and aggregation in queries.

What is the difference between the pull and push models?

In a pull model (Prometheus), the metrics server scrapes endpoints on a schedule. This is simple to operate, but services must expose an HTTP endpoint. In a push model (StatsD, InfluxDB), services emit events to a collector, which suits short-lived batch jobs that may not be alive when a scraper polls.

What are the most common mistakes in metrics collection?

The most common mistakes are high-cardinality labels (using user IDs or request IDs as label values can create millions of unique time series and overload the TSDB), missing retention or downsampling policies (which leads to unbounded storage growth), and scrape intervals that are too coarse to detect short spikes.
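The cardinality problem is easy to see by counting distinct label sets. The sketch below is illustrative only: the metric name, label names, and the figure of 1,000 users are assumptions, and a series is modelled simply as a metric name plus its sorted label pairs.

```python
def distinct_series(samples):
    """Count unique time series, identified by metric name plus label set."""
    return len({(name, tuple(sorted(labels.items()))) for name, labels in samples})


def build_samples(include_user_id):
    """Simulate one request from each of 1,000 users across two regions."""
    samples = []
    for user_id in range(1000):
        labels = {
            "service": "checkout",
            "region": ("eu-west", "us-east")[user_id % 2],
        }
        if include_user_id:
            # Unbounded label value: every user creates a new series.
            labels["user_id"] = str(user_id)
        samples.append(("http_requests_total", labels))
    return samples


bounded = distinct_series(build_samples(include_user_id=False))
unbounded = distinct_series(build_samples(include_user_id=True))
```

Without the `user_id` label the metric produces 2 series (one per region); with it, the same traffic produces 1,000, and the count keeps growing with every new user.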

mermaid

flowchart TD
    AppCode[Instrumented application code] --> AppMetrics[Expose metrics endpoint]
    Infra[Infrastructure exporters] --> InfraMetrics[Expose host and platform metrics]
    AppMetrics --> Collector[Metrics collector or scraper]
    InfraMetrics --> Collector
    Collector --> Model{Collection model}
    Model -->|Pull| Scrape[Scraper polls metrics endpoint]
    Model -->|Push| Aggregator[Push to aggregation daemon]
    Scrape --> AttachLabels[Attach dimensional labels]
    Aggregator --> AttachLabels
    AttachLabels --> Process[Process and normalize metrics]
    Process --> WriteTSDB[Write to time-series database]
    WriteTSDB --> Retention[Apply retention and downsampling policy]
    WriteTSDB --> Dashboards[Render in visualization dashboards]
    WriteTSDB --> AlertEngine[Evaluate alerting rules]
    AlertEngine --> Alert{Threshold breached?}
    Alert -->|Yes| FireAlert[Fire alert]
    Alert -->|No| Continue[Continue collection cycle]