Chatbot Content Moderation: Input, Output, and Agent Guardrails

EvoLink Team
Product Team
April 29, 2026
8 min read

Adding content moderation to a chatbot or AI agent is not just about blocking bad words. A production workflow needs to check user inputs, model outputs, and the actions an agent is allowed to take.

The safest mental model is:

input moderation
  -> model or agent reasoning
  -> tool-call validation
  -> output moderation
  -> allow / review / block
Chatbot moderation three-layer architecture
This guide shows where to place each layer, what moderation can and cannot solve, and how to use EvoLink Moderation 1.0 for simple risk-based routing.

TL;DR

  • Moderate user inputs before they reach your chatbot or agent.
  • Moderate model outputs before users see them.
  • Validate tool calls before an agent executes actions.
  • Do not treat content moderation as a complete prompt-injection defense.
  • Use human review for medium-risk or high-impact cases.
  • EvoLink's evolink_summary.risk_level can simplify allow / review / block logic.

Content moderation is not the same as agent security

Content moderation helps identify harmful text or images. It can reduce the risk of unsafe prompts, harmful outputs, and policy-violating user content.

Agent security is broader. It includes prompt injection, tool permissions, data access, human approvals, structured outputs, evals, logging, and workflow design.

OpenAI's agent safety guide recommends combining techniques such as guardrails, tool approvals, trace grading, evals, and structured workflow design. OWASP also treats prompt injection as a distinct LLM security risk that can lead to data leakage, unintended actions, or tool misuse.

The takeaway: moderation is one guardrail. It should not be the only guardrail.

Layer 1: input moderation

Input moderation checks the user's message before it reaches your model.

Use it to catch:

  • clearly harmful requests
  • policy-violating user messages
  • abusive or harassing text
  • unsafe image uploads
  • prompts asking for disallowed content

Input moderation is useful because it prevents obviously unsafe requests from entering the model workflow at all.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_EVOLINK_API_KEY",
    base_url="https://direct.evolink.ai/v1"
)

def handle_user_message(user_input: str):
    input_check = client.moderations.create(
        model="evolink-moderation-1.0",
        input=user_input
    )
    summary = input_check.model_extra["evolink_summary"]

    if summary["risk_level"] == "high":
        return {
            "status": "blocked",
            "message": "I can't help with that request."
        }

    return run_chatbot(user_input)

Important limitation: input moderation is not a complete prompt-injection defense. Attackers can use indirect prompts, encoded text, images, external documents, or tool outputs to influence agents later in the workflow.
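If you do rely on moderation here, apply it to every channel, not only direct user input. Retrieved documents and tool outputs that flow back into the agent's context can be screened with the same call, though this still does not detect injection itself. A minimal sketch, reusing the client above; how you fetch the external content is up to your application:

def is_safe_for_context(text: str) -> bool:
    # Screen external content (retrieved documents, tool outputs) with the
    # same moderation model before it enters the agent's context. This only
    # filters unsafe content; it is not a prompt-injection detector.
    result = client.moderations.create(
        model="evolink-moderation-1.0",
        input=text
    )
    return result.model_extra["evolink_summary"]["risk_level"] != "high"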

Layer 2: output moderation

Output moderation checks the model response before it reaches the user.

Use it to catch:

  • unsafe generated content
  • policy-violating responses
  • harmful advice in sensitive domains
  • responses that slipped through despite safe-looking input

def safe_chat(user_input: str):
    input_check = client.moderations.create(
        model="evolink-moderation-1.0",
        input=user_input
    )
    if input_check.model_extra["evolink_summary"]["risk_level"] == "high":
        return "I can't help with that request."

    answer = llm.generate(user_input)

    output_check = client.moderations.create(
        model="evolink-moderation-1.0",
        input=answer
    )
    if output_check.model_extra["evolink_summary"]["risk_level"] == "high":
        return "I generated a response that needs to be revised. Please try rephrasing."

    return answer

Output moderation is especially useful for public chatbots, AI copilots, and generated-content workflows where the model response becomes user-visible.

Layer 3: tool-call validation

For AI agents, tool calls are often the highest-risk layer.

If an agent can send emails, query databases, update tickets, call APIs, or write files, the tool call should be validated before execution. This is not just moderation. It is authorization and policy enforcement.

Validate:

  • tool name
  • allowed user or account scope
  • required human approval
  • allowed parameter values
  • read vs write operations
  • external recipients
  • destructive actions
  • data exfiltration risk
# Simplified example — production systems should use parameterized queries
# and proper authorization layers, not string matching.
def validate_tool_call(tool_name: str, args: dict, user_context: dict):
    if tool_name == "send_email":
        if not args["to"].endswith("@company.com"):
            return False, "External email requires approval."

    if tool_name == "database_query":
        query = args.get("query", "")
        forbidden = ["DROP ", "DELETE ", "TRUNCATE "]
        if any(word in query.upper() for word in forbidden):
            return False, "Destructive database operation blocked."

    if tool_name == "refund_customer":
        if args["amount"] > user_context["approval_limit"]:
            return False, "Refund amount requires human approval."

    return True, None

OpenAI's agent safety guide also recommends keeping tool approvals on for MCP tools so end users can review and confirm operations.
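One way to implement that approval step outside the model is a pending queue: high-impact tool calls wait until a human confirms them. A minimal sketch; notify_reviewer and execute_tool_call are illustrative placeholders, not SDK functions:

import uuid

APPROVAL_REQUIRED = {"send_email", "refund_customer"}
pending_approvals = {}

def gate_tool_call(tool_name: str, args: dict):
    # Auto-approve low-impact tools; park high-impact calls for human review.
    if tool_name not in APPROVAL_REQUIRED:
        return {"status": "approved"}
    approval_id = str(uuid.uuid4())
    pending_approvals[approval_id] = {"tool": tool_name, "args": args}
    notify_reviewer(approval_id)  # e.g. ping a review dashboard or Slack
    return {"status": "pending", "approval_id": approval_id}

def resolve_approval(approval_id: str, approved: bool):
    # Called by your review UI once a human decides.
    call = pending_approvals.pop(approval_id)
    if not approved:
        return {"status": "rejected"}
    return execute_tool_call(call["tool"], call["args"])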

Layer 4: allow / review / block routing

Binary safe/unsafe moderation is usually too blunt. A more useful production pattern is:

Risk      Action
Low       allow
Medium    queue for review, warn, or limit distribution
High      block or require appeal

EvoLink Moderation returns an evolink_summary object with a risk_level field designed for this kind of routing.

def route_by_risk(content: str):
    result = client.moderations.create(
        model="evolink-moderation-1.0",
        input=content
    )
    summary = result.model_extra["evolink_summary"]

    if summary["risk_level"] == "high":
        log_block(summary)
        return {"action": "block"}

    if summary["risk_level"] == "medium":
        queue_for_review(content, summary)
        return {"action": "review"}

    return {"action": "allow"}
For request details, see the EvoLink Moderation API docs.

Full chatbot moderation pattern

Here is a practical end-to-end shape:

def handle_chatbot_turn(user_input: str, user_context: dict):
    # 1. Moderate input
    input_route = route_by_risk(user_input)
    if input_route["action"] == "block":
        return "I can't help with that request."
    if input_route["action"] == "review":
        log_review_event("input", user_input)

    # 2. Generate model response and proposed tool calls
    agent_result = agent.run(user_input, user_context)

    # 3. Validate tool calls before execution
    for call in agent_result.tool_calls:
        ok, reason = validate_tool_call(call.name, call.args, user_context)
        if not ok:
            log_security_event(call, reason)
            return "That action requires additional review."

    # 4. Execute approved tools
    tool_outputs = execute_tool_calls(agent_result.tool_calls)

    # 5. Generate final response
    final_answer = agent.format_response(tool_outputs)

    # 6. Moderate output
    output_route = route_by_risk(final_answer)
    if output_route["action"] == "block":
        return "I need to revise that response."
    if output_route["action"] == "review":
        queue_for_review(final_answer, output_route)

    return final_answer

The exact implementation will vary, but the shape is stable: check input, constrain actions, check output, and log decisions.

Handling false positives

False positives are unavoidable. The goal is to reduce harm without blocking legitimate users too often.

Use a review queue for:

  • medium-risk content
  • unclear context
  • policy-sensitive categories
  • high-value users or high-impact actions
  • appeals

Log each moderation decision with enough context to debug later:

  • content hash
  • user id or account id
  • risk level
  • triggered categories
  • action taken
  • reviewer override
  • timestamp

Avoid logging raw sensitive content unless you have a clear retention and privacy policy.
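A minimal sketch of such a decision record, hashing the content instead of storing it raw. The field names mirror the list above; the categories key and store_log are illustrative placeholders:

import hashlib
from datetime import datetime, timezone

def log_moderation_decision(content: str, user_id: str, summary: dict,
                            action: str, reviewer_override=None):
    record = {
        # Hash instead of raw text, so sensitive content never lands in logs.
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "user_id": user_id,
        "risk_level": summary.get("risk_level"),
        "categories": summary.get("categories", []),  # illustrative field name
        "action": action,
        "reviewer_override": reviewer_override,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    store_log(record)  # e.g. append to an audit table or events stream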

What to promise internally

Do not promise that moderation makes a chatbot "100% safe."

A better promise is:

  • harmful inputs are checked before model execution
  • model outputs are checked before user display
  • agent tool calls are validated before execution
  • high-risk or ambiguous cases can be routed to humans
  • logs and review outcomes are used to improve thresholds

That is a realistic production posture.

Common mistakes

Mistake 1: only moderating inputs

Input checks are useful, but the model can still generate unsafe output. Moderate outputs before showing them to users.

Mistake 2: treating prompts as security boundaries

System prompts guide behavior, but they should not be the only security layer. Use tool permissions, structured workflows, and approvals.

Mistake 3: letting agents execute tools without validation

Always validate tool calls outside the model. The model should propose actions; your application should decide whether the action is allowed.

Mistake 4: blocking every uncertain case

Overly strict systems create user frustration. Send uncertain cases to review instead of always blocking them.

Mistake 5: keeping no logs

Without logs, you cannot tune thresholds, investigate incidents, or understand false positives.

Implementation checklist

  • Moderate user inputs before model calls.
  • Moderate model outputs before user display.
  • Validate agent tool calls before execution.
  • Use allow / review / block routing.
  • Queue medium-risk content for review.
  • Add user reporting and appeals where needed.
  • Log risk levels, categories, actions, and reviewer overrides.
  • Avoid storing raw sensitive content unless required.
  • Test with real examples from your product.
  • Revisit thresholds after production data arrives.

FAQ

Should I moderate chatbot inputs or outputs?

Both. Input moderation can reduce unsafe requests before they reach the model. Output moderation can catch unsafe generated content before users see it.

Does content moderation stop prompt injection?

No. Moderation can catch some unsafe or suspicious content, but prompt injection requires broader defenses such as tool validation, structured outputs, least privilege, human approvals, and workflow isolation.

Where should I place moderation in an AI agent?

Place moderation before the model for user input, before display for model output, and combine it with tool-call validation before any agent action is executed.

EvoLink Moderation 1.0 provides an OpenAI-compatible moderation endpoint with evolink_summary.risk_level, making it easier to route chatbot content into allow, review, or block workflows.

Do I still need human review?

For production systems, yes. Human review is important for appeals, medium-risk cases, policy changes, and high-impact decisions.

Explore EvoLink Moderation 1.0
