Stream Analytics Architecture flowchart diagram

A stream analytics architecture is a system designed to continuously ingest, process, and react to data in motion — applying queries, aggregations, and enrichments to records as they arrive rather than waiting for a batch window to close.

Traditional batch analytics operates on data at rest: files or tables that accumulate over time before a job processes them. Stream analytics inverts this model. Data is processed the moment it enters the system, enabling use cases that batch processing cannot support — fraud detection within milliseconds of a transaction, live sports scoreboards, operational dashboards that reflect the current state of a running system, and dynamic pricing that adjusts to real-time demand.

The architecture begins at event producers: applications, sensors, microservices, or log shippers that emit records continuously. These records flow into a distributed streaming platform such as Apache Kafka or AWS Kinesis, which buffers the stream durably and allows multiple downstream consumers to read from it independently at their own pace.
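The key property of the streaming platform is that each consumer tracks its own read position, so a slow consumer never holds back a fast one. A minimal Python sketch of that decoupling, using an in-memory list to stand in for a durable topic (the `Topic`/`Consumer` names are illustrative, not a real Kafka API):

```python
class Topic:
    """In-memory stand-in for a durable, append-only stream partition."""
    def __init__(self):
        self.records = []  # the buffered stream, retained for all consumers

    def publish(self, record):
        self.records.append(record)


class Consumer:
    """Each consumer keeps its own offset, so it reads at its own pace."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # independent read position into the shared log

    def poll(self, max_records=10):
        batch = self.topic.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


topic = Topic()
for i in range(5):
    topic.publish({"event_id": i})

fast, slow = Consumer(topic), Consumer(topic)
fast.poll(5)   # the fast consumer reads all five records immediately
slow.poll(2)   # the slow consumer has only read two; nothing is lost
```

Because records are retained rather than handed off, the slow consumer can catch up later from its own offset without affecting the fast one.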

A stream processing engine — Flink, Spark Streaming, Kafka Streams, or a managed service like Google Dataflow — consumes from the platform and applies the analytics logic. The core operations are stateless transforms (filtering, projection, type coercion applied per record) and stateful windowed aggregations (counting events per user in the last 60 seconds, summing revenue per product per minute). Windowing strategies — tumbling, sliding, and session windows — determine how records are grouped in time.
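The two operation classes can be sketched in plain Python. This is not engine code, just an illustration of the distinction: the stateless step touches each record in isolation, while the tumbling-window aggregation accumulates state keyed by (window, user):

```python
from collections import defaultdict

# Stateless transforms: per-record filter, projection, and type coercion.
raw = [
    {"user": "a", "ts": "5"},
    {"user": "a", "ts": "30"},
    {"user": None, "ts": "40"},  # dropped by the filter
    {"user": "a", "ts": "61"},
    {"user": "b", "ts": "62"},
]
events = [(int(r["ts"]), r["user"]) for r in raw if r["user"]]

# Stateful windowed aggregation: count events per user per tumbling window.
def tumbling_window_counts(events, window_size_s=60):
    counts = defaultdict(int)
    for ts, user in events:
        # Each timestamp falls into exactly one non-overlapping window.
        window_start = (ts // window_size_s) * window_size_s
        counts[(window_start, user)] += 1
    return dict(counts)

counts = tumbling_window_counts(events)
# {(0, 'a'): 2, (60, 'a'): 1, (60, 'b'): 1}
```

A sliding window would assign each record to several overlapping windows, and a session window would group records by gaps in activity rather than fixed boundaries.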

Processed results are written to one or more output sinks appropriate to the latency and access pattern of each consumer: a time-series database for operational dashboards (see Realtime Metrics Pipeline), a key-value store for low-latency feature serving, or an object store for downstream batch jobs. A dead-letter topic captures records that fail processing so they can be inspected and replayed without blocking the main pipeline. The Clickstream Processing diagram shows how this architecture is applied specifically to web session data.
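The dead-letter pattern above reduces to a simple rule: catch per-record failures, capture the record plus its error, and keep the main pipeline moving. A hedged sketch (the `process_stream` helper is hypothetical, for illustration only):

```python
def process_stream(records, transform):
    """Apply `transform` per record; route failures to a dead-letter list
    instead of blocking or silently dropping the main pipeline."""
    results, dead_letter = [], []
    for record in records:
        try:
            results.append(transform(record))
        except Exception as exc:
            # Capture the original record and the reason it failed,
            # so it can be inspected and replayed later.
            dead_letter.append({"record": record, "error": str(exc)})
    return results, dead_letter


records = [{"amount": "10"}, {"amount": "oops"}, {"amount": "3"}]
good, dlq = process_stream(records, lambda r: int(r["amount"]))
# good == [10, 3]; dlq holds the malformed record with its error message
```

In a real deployment the dead-letter list would be another topic on the streaming platform, so replays use the same ingestion path as live traffic.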


Frequently asked questions

What is a stream analytics architecture?
A stream analytics architecture is a system designed to continuously ingest, process, and react to data as it arrives — applying stateless transforms and stateful windowed aggregations to records in motion rather than waiting for a batch window to close.
How does a stream analytics architecture work?
Event producers publish records to a distributed streaming platform like Kafka. A stream processing engine (Flink, Spark Streaming, or Kafka Streams) consumes the stream and applies stateless transforms per record and stateful windowed aggregations over configurable time windows. Results are written to output sinks matched to the latency needs of each consumer.
When should you use stream analytics instead of batch processing?
Use stream analytics when your use case requires latency measured in seconds rather than minutes — fraud detection, live operational dashboards, dynamic pricing, or any scenario where reacting to batch-delayed data would degrade user experience or business outcomes.
What are common mistakes when building a stream analytics architecture?
Common mistakes include choosing the wrong windowing strategy for the use case (tumbling windows for sessions that need session windows), neglecting the dead-letter topic (so processing failures block or silently drop records), and underestimating the complexity of managing stateful operators at scale.
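The windowing mistake is easiest to see in code. A session window closes after a gap of inactivity rather than at a fixed boundary, so a user session that straddles a tumbling-window edge stays intact. A minimal sketch (the `session_windows` helper is illustrative, not an engine API):

```python
def session_windows(timestamps, gap_s=30):
    """Group sorted event times into sessions: a new session starts
    whenever the gap since the previous event exceeds gap_s."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_s:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap exceeded: open a new session
    return sessions


# Events at 50s, 70s, 90s form one continuous session. A 60-second
# tumbling window would split it at the t=60 boundary; a session
# window keeps it together.
one_session = session_windows([50, 70, 90], gap_s=30)    # [[50, 70, 90]]
two_sessions = session_windows([50, 70, 150], gap_s=30)  # [[50, 70], [150]]
```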
How does stream analytics differ from batch analytics?
Batch analytics processes data at rest in scheduled jobs — all data accumulates first, then a query runs over the complete dataset. Stream analytics processes data in motion as it arrives, enabling sub-second latency. Most production systems use both: stream analytics for operational and alerting use cases, and batch analytics for complex historical aggregations.
mermaid
flowchart LR
    Producers[Event Producers\nApps, sensors, microservices] --> Platform[Streaming Platform\nKafka / Kinesis topics]
    Platform --> Engine[Stream Processing Engine\nFlink, Spark Streaming, Kafka Streams]
    Engine --> Stateless[Stateless Transforms\nFilter, project, type coerce]
    Stateless --> Windowed[Windowed Aggregations\nTumbling, sliding, session windows]
    Windowed --> StateStore[State Store\nRocksDB or managed state backend]
    Engine --> DeadLetter[Dead-Letter Topic\nFailed record capture]
    Windowed --> Sinks[Output Sinks]
    Sinks --> TSDB[Time-Series DB\nInfluxDB, Prometheus]
    Sinks --> KVStore[Key-Value Store\nRedis feature cache]
    Sinks --> ObjectStore[Object Store\nParquet for batch jobs]
    TSDB --> Dashboard[Operational Dashboard]
    KVStore --> APILayer[Feature Serving API]