diagram.mmd — flowchart
Search Indexing Pipeline flowchart diagram

A search indexing pipeline is the sequence of processing stages that transforms raw content — web pages, product listings, documents — into an inverted index that can be queried in milliseconds.

How the pipeline works

Data ingestion is the entry point. Content arrives from multiple sources: a web crawler fetching URLs from a frontier queue, upstream APIs pushing new records, or direct file uploads from content teams. Each source produces raw bytes that the pipeline must normalize before any linguistic processing can happen.

Fetch and parse extracts structured text from the raw payload. An HTML page is stripped of markup, scripts, and boilerplate navigation. A PDF is converted to plain text. JSON payloads are field-mapped. The goal is a clean document object with a URL or ID, a body, and optional structured fields like title, author, and publication date.
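The parse step above can be sketched with Python's standard-library `HTMLParser`. This is a minimal illustration, not production boilerplate removal: the `TextExtractor` class and `parse_html` helper are hypothetical names, and real pipelines use more sophisticated readability/boilerplate heuristics.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <nav> contents."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def parse_html(url, raw_html):
    """Return a clean document object: an ID and a plain-text body."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    return {"id": url, "body": " ".join(extractor.parts)}
```

For example, `parse_html("https://example.com", "<body><script>x=1</script><p>Hello world</p></body>")` yields a document whose body is just `"Hello world"`, with the script contents dropped.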

Normalize and tokenize applies text processing: Unicode normalization, lowercasing, whitespace collapsing, and splitting the body into individual tokens (words or sub-word pieces). A tokenizer configured for English will split on whitespace and punctuation; a multilingual tokenizer may apply language detection first and then use a language-appropriate segmenter.
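A minimal whitespace-and-punctuation tokenizer for the English case described above might look like this (a sketch only; production tokenizers handle compounds, CJK segmentation, and sub-word pieces):

```python
import re
import unicodedata

def tokenize(text):
    """Unicode-normalize (NFKC), lowercase, collapse whitespace,
    then split on any run of non-word characters."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return [t for t in re.split(r"[^\w]+", text) if t]
```

Note that Python's `\w` is Unicode-aware, so accented characters like `é` survive tokenization rather than splitting the word.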

Enrichment augments the token stream with signals that improve ranking later. Common enrichments include stemming or lemmatization (mapping "running" → "run"), stop-word removal, synonym expansion, entity extraction (identifying product names, locations, or people), and embedding generation for semantic search.
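Two of those enrichments, stop-word removal and stemming, can be sketched as below. The suffix-stripping stemmer here is deliberately crude (a real pipeline would use Porter or Snowball, which maps "running" → "run"; this toy version yields "runn"):

```python
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}
# Crude suffix stripping for illustration only; use Porter/Snowball in practice.
SUFFIXES = ("ing", "ed", "es", "s")

def enrich(tokens):
    """Drop stop words, then strip one common suffix per token."""
    out = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        for suf in SUFFIXES:
            if tok.endswith(suf) and len(tok) - len(suf) >= 3:
                tok = tok[: -len(suf)]
                break
        out.append(tok)
    return out
```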

Deduplication checks a content hash or near-duplicate fingerprint (e.g., a SimHash or MinHash value) against the existing index. Duplicate or near-duplicate documents are discarded or merged to avoid cluttering result pages with copies.
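The SimHash idea mentioned above can be implemented in a few lines: hash each token, sum per-bit votes, and keep the sign of each bit. Near-duplicate documents produce fingerprints with a small Hamming distance. This sketch uses MD5 as the per-token hash purely for convenience; production systems typically use a faster non-cryptographic hash.

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash fingerprint of a token list."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Dedup then reduces to checking whether any previously indexed fingerprint is within some Hamming-distance threshold (commonly 3 bits for 64-bit SimHash) of the new document's fingerprint.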

Write to inverted index posts each token to the index as a posting list entry: the term, the document ID, the position in the document, and optional frequency data. Modern search engines like Elasticsearch and Solr use Lucene segments for this; each segment is an immutable mini-index that gets merged periodically.
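In-memory, the posting-list structure described above is essentially a map from term to a list of (document ID, position) pairs. A minimal sketch (real engines store these as compressed on-disk segments, not Python dicts):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> list of (doc_id, position) postings.

    `docs` maps a document ID to its token list.
    """
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok].append((doc_id, pos))
    return index
```

For example, indexing `{"d1": ["search", "index"], "d2": ["search", "query"]}` gives `index["search"] == [("d1", 0), ("d2", 0)]`.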

Shard distribution assigns documents to shards based on a routing key. The Search Sharding Architecture diagram shows how this distribution works. Once a document lands on a shard and the shard's segment is refreshed or committed, the document becomes visible to the Search Query Processing path. Indexing latency — the time between content creation and searchability — is a key operational SLO for any real-time search system.
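Routing-key assignment is a stable hash modulo the shard count, as in this sketch (MD5 stands in for the hash; Elasticsearch, for instance, actually uses murmur3 over the routing key):

```python
import hashlib

def route_to_shard(routing_key, num_shards):
    """Stable routing: the same key always maps to the same shard."""
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

Because the mapping is deterministic, updates and deletes for a document reach the same shard that holds its postings, which is why changing the shard count of an existing index generally requires reindexing.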


Frequently asked questions

What is a search indexing pipeline?
A search indexing pipeline is the sequence of processing stages that converts raw content — web pages, documents, product listings — into an inverted index ready for query execution. Stages typically include ingestion, parsing, normalization, tokenization, enrichment, deduplication, and writing posting lists to distributed index shards.

How does an inverted index work?
An inverted index maps each unique token to a posting list: a sorted list of document IDs (and optionally positions and frequencies) in which that token appears. At query time, the engine looks up the posting lists for all query tokens and intersects or merges them to identify matching documents, making full-text search fast regardless of corpus size.

When should embedding generation be added to the pipeline?
Add embedding generation when lexical keyword matching alone produces poor recall for natural-language queries — for example, when users phrase questions in ways that don't share exact vocabulary with the indexed documents. Embedding enrichment enables approximate nearest-neighbor (ANN) retrieval alongside the inverted index, supporting hybrid search.

What are common mistakes when building an indexing pipeline?
Frequent mistakes include using a different tokenizer at index time versus query time (causing term mismatches), not running deduplication before indexing (bloating the index with near-duplicate content), and setting segment refresh intervals too long (increasing the lag before newly indexed documents become searchable).

What is the difference between a forward index and an inverted index?
A forward index maps each document to its list of tokens — useful for retrieving all terms in a document. An inverted index maps each token to the documents that contain it — essential for answering queries. Search engines build the inverted index from a forward index during the indexing pipeline and use the inverted index for query execution.
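The posting-list intersection used at query time can be sketched as a two-pointer merge over sorted document-ID lists, which runs in linear time in the combined list lengths:

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted posting lists of document IDs."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result
```

For example, `intersect([1, 3, 5, 7], [2, 3, 7, 9])` returns `[3, 7]` — the documents containing both query terms.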
mermaid
flowchart TD
    Sources[Data Sources\nWeb crawl, APIs, Uploads] --> Fetch[Fetch raw content]
    Fetch --> Parse[Parse and extract text]
    Parse --> Normalize[Normalize and tokenize]
    Normalize --> Stem[Stem and remove stop words]
    Stem --> Enrich[Enrich with metadata\nentities, embeddings]
    Enrich --> Dedupe{Duplicate\ncontent?}
    Dedupe -->|Yes| Discard[Discard duplicate]
    Dedupe -->|No| Score[Compute quality score]
    Score --> Index[Write to inverted index\nposting lists]
    Index --> Commit[Commit segment]
    Commit --> Shard[Distribute to shards]
    Shard --> Searchable[Document searchable]