diagram.mmd — flowchart
AI Moderation Pipeline flowchart diagram

An AI moderation pipeline is a multi-stage content safety system that screens both user inputs and model outputs for policy violations — including hate speech, self-harm content, PII, and prompt injection attacks — before content is processed or returned.

What the diagram shows

This flowchart illustrates a dual-path moderation architecture that runs on both the input and output sides of an LLM inference call:

Input moderation:

1. User input received: raw user text arrives at the application.
2. Rule-based filters: fast, deterministic rules catch obvious violations — blocked keywords, known jailbreak patterns, excessive repetition.
3. Classifier screening: a lightweight moderation classifier (e.g., OpenAI Moderation API, a fine-tuned BERT model) scores the input across harm categories.
4. PII detection: a named entity recognizer scans for personally identifiable information (email addresses, phone numbers, SSNs) and either blocks or redacts it.
5. Block or continue: if any check flags the input, a policy refusal is returned immediately. Otherwise the sanitized input proceeds to the LLM.
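The input-side steps can be sketched as a single function. This is a minimal illustration, not a production filter: the keyword list, the PII regexes, and the pluggable `classify` callable are all hypothetical stand-ins for curated rule sets and a real moderation classifier.

```python
import re

# Hypothetical pattern lists -- real deployments curate and update these.
BLOCKED_KEYWORDS = {"ignore previous instructions", "disregard your system prompt"}
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def moderate_input(text: str, classify=None, threshold: float = 0.8):
    """Return (allowed, sanitized_text). `classify` stands in for a hosted
    or fine-tuned moderation classifier returning a harm score in [0, 1]."""
    lowered = text.lower()
    # Step 2: rule-based filters -- cheap, deterministic checks run first.
    if any(kw in lowered for kw in BLOCKED_KEYWORDS):
        return False, None
    # Step 3: classifier screening (skipped when no classifier is wired in).
    if classify is not None and classify(text) >= threshold:
        return False, None
    # Step 4: PII detection -- here we redact rather than block.
    sanitized = text
    for label, pattern in PII_PATTERNS.items():
        sanitized = pattern.sub(f"[{label.upper()}_REDACTED]", sanitized)
    # Step 5: all checks passed; forward the sanitized input to the LLM.
    return True, sanitized
```

Ordering matters for latency: the deterministic rules reject the cheapest cases before the (comparatively slow) classifier is ever invoked.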

Output moderation:

1. LLM generates response: the model produces a draft response.
2. Output classifier: the generated text is screened by the same or a separate moderation classifier.
3. Factual grounding check (optional): for RAG systems, citations are verified against retrieved sources to reduce hallucination risk.
4. Policy decision: clean output is returned to the user; flagged output triggers a fallback response or escalation to human review.
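The output-side policy decision can be sketched as a small dispatcher. `ModerationResult`, the severity labels, and the fallback string are assumptions invented for this sketch, not a standard API; `screen` and `grounding_check` stand in for the classifier and the RAG citation verifier.

```python
from dataclasses import dataclass

@dataclass
class ModerationResult:
    flagged: bool
    severity: str = "none"  # assumed labels: "none" | "low" | "high"

# Hypothetical fallback message; real systems template these per policy.
FALLBACK = "I can't help with that request."

def moderate_output(draft: str, screen, grounding_check=None):
    """Decide what reaches the user. `screen` returns a ModerationResult;
    `grounding_check` (optional) returns True when citations are verified."""
    result = screen(draft)
    if result.flagged:
        if result.severity == "high":
            return FALLBACK                       # hard refusal
        return ("ESCALATE", draft)                # human review queue
    if grounding_check is not None and not grounding_check(draft):
        return f"[Unverified citations] {draft}"  # grounding warning
    return draft                                  # clean path
```

Splitting flagged output by severity mirrors the diagram: only high-severity content is refused outright, while borderline cases are escalated rather than silently dropped.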

Why this matters

Running moderation on both input and output creates defense in depth. Input moderation blocks attempts to manipulate the model; output moderation catches cases where the model produces unsafe content despite a benign-looking input. See AI Content Generation Pipeline for the broader generation context.


Frequently asked questions

What is an AI moderation pipeline?
An AI moderation pipeline is a multi-stage content safety system that screens both user inputs and model outputs for policy violations — including hate speech, self-harm content, PII, and prompt injection — before content is processed by or returned from a language model.

How do input and output moderation differ?
Input moderation runs before the LLM: fast rule-based filters catch known patterns, a classifier scores across harm categories, and a PII detector redacts sensitive data. If any check flags the input, a refusal is returned immediately. Output moderation runs after the LLM generates a draft response, screening it through a classifier and optionally a grounding check before it reaches the user.

Should I use a hosted moderation API or build a custom pipeline?
Use a hosted moderation API (such as OpenAI Moderation) for general-purpose consumer harm categories — it is fast, maintained, and sufficient for most applications. Build a custom pipeline when your application has domain-specific policies (e.g., regulated financial advice, age-gated content, proprietary brand guidelines) that off-the-shelf classifiers do not cover.

What are common failure modes?
Common failures include over-blocking (legitimate queries refused due to superficial keyword matches), under-blocking (adversarial inputs that bypass classifiers through obfuscation or encoding tricks), high latency from sequential moderation steps (addressable by running checks in parallel), and false confidence in output moderation alone without input-side protection.
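The latency fix mentioned above — running independent checks in parallel instead of sequentially — can be sketched with a thread pool. The three stub checks (50 ms each) are placeholders for real rule, classifier, and PII services.

```python
import concurrent.futures
import time

# Stub checks standing in for real moderation services; each simulates
# ~50 ms of network or inference latency.
def rule_check(text: str) -> bool:
    time.sleep(0.05)
    return True

def classifier_check(text: str) -> bool:
    time.sleep(0.05)
    return True

def pii_check(text: str) -> bool:
    time.sleep(0.05)
    return True

def moderate_parallel(text: str) -> bool:
    """Run independent checks concurrently: total latency tracks the
    slowest single check rather than the sum of all three."""
    checks = (rule_check, classifier_check, pii_check)
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check, text) for check in checks]
        return all(f.result() for f in futures)
```

This only works when the checks are truly independent; if one check's result gates another (e.g., redacted text feeding the classifier), those two must stay sequential.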
```mermaid
flowchart TD
    A([User input]) --> B[Rule-based filter: blocked keywords and jailbreak patterns]
    B --> C{Rule violation?}
    C -- Yes --> D([Return policy refusal])
    C -- No --> E[Moderation classifier: score harm categories]
    E --> F{Score above threshold?}
    F -- Yes --> D
    F -- No --> G[PII detection: scan for emails, phone numbers, SSNs]
    G --> H{PII found?}
    H -- Block --> D
    H -- Redact --> I[Redact PII from input]
    H -- Clean --> J[Forward sanitized input to LLM]
    I --> J
    J --> K[LLM generates draft response]
    K --> L[Output moderation classifier]
    L --> M{Output flagged?}
    M -- Yes --> N{Severity level?}
    N -- High --> O([Return fallback refusal response])
    N -- Low --> P([Escalate to human review queue])
    M -- No --> Q{Grounding check required?}
    Q -- Yes --> R[Verify citations against retrieved sources]
    R --> S{Grounded?}
    S -- No --> T([Return grounding warning with response])
    S -- Yes --> U([Return response to user])
    Q -- No --> U
```