Product Spotlight

Agent Guardrails: Real-Time Safety for Your AI Agents

Connic Guardrails intercept agent inputs and outputs in real time to block prompt injection, redact PII, and enforce topic restrictions.

March 3, 2026 · 9 min read

AI agents are powerful. They can draft emails, summarize documents, call APIs, and make decisions. But they also inherit every risk that comes with running a language model in production: prompt injection, PII leakage, off-topic responses, system prompt exposure, and outputs that violate your content policies.

You can't ship an agent to production and hope it behaves. You need runtime safety checks that sit between your users and your agent, inspecting every input and every output before they cause harm. That's what Connic Guardrails does.

Why Agents Need Guardrails

Traditional software is deterministic. If you write a function that adds two numbers, it adds two numbers. Language models are different. The same prompt can produce wildly different outputs depending on context, temperature, and how creatively a user phrases their request.

This unpredictability creates real risks in production:

Prompt Injection
A user crafts input that overrides your system prompt. Your helpful customer support agent suddenly starts ignoring its instructions and doing whatever the user asks. OWASP lists prompt injection as the #1 risk for LLM applications.
PII Exposure
Users paste sensitive data into prompts: email addresses, phone numbers, social security numbers, credit card details. Without guardrails, that data flows straight into your model provider's API and potentially into your logs.
System Prompt Leakage
A cleverly worded request tricks your agent into revealing its system prompt, including internal instructions, tool configurations, and business logic you intended to keep private.
Off-Topic or Harmful Output
Your billing support agent starts giving medical advice. Your internal assistant generates toxic content. Without output checks, there's no safety net between the model's response and the end user.

Connic Guardrails address all of these. They run as a configurable layer around your agent, inspecting content in real time and taking action before damage is done.

How Guardrails Work

Guardrails sit in the execution pipeline of every agent run. They check content at two points: before the agent processes the input, and after the agent produces a response.

User Input → Input Guardrails → Agent → Output Guardrails → Response

Input guardrails evaluate the raw user message before the agent sees it. If a guardrail detects a prompt injection attempt, the message is blocked and the agent never executes. If PII is detected, it can be redacted in place so the agent receives a sanitized version.

Output guardrails evaluate the agent's response before it reaches the user. If the response contains system prompt fragments, toxic content, or data exfiltration patterns, the guardrail intercepts it. The user gets a safe rejection message instead.
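The two checkpoints can be sketched as a simple wrapper around the agent call. Everything below is illustrative: the function names, the verdict dictionaries, and the toy rules are assumptions for the sketch, not the actual Connic API.

```python
# Illustrative sketch of the guardrail pipeline; names and result
# shapes here are hypothetical, not the actual Connic API.
import re

def run_pipeline(user_input, input_rules, agent_fn, output_rules,
                 rejection="Sorry, I can't help with that."):
    # Input guardrails run before the agent ever sees the message.
    for rule in input_rules:
        verdict = rule(user_input)
        if not verdict["passed"]:
            return rejection  # blocked: the agent never executes
        user_input = verdict.get("content", user_input)  # e.g. redacted text

    response = agent_fn(user_input)

    # Output guardrails run before the response reaches the user.
    for rule in output_rules:
        verdict = rule(response)
        if not verdict["passed"]:
            return rejection  # intercepted: user sees a safe message
        response = verdict.get("content", response)
    return response

# Toy rules for illustration only
def no_injection(text):
    return {"passed": "ignore previous instructions" not in text.lower()}

def redact_email(text):
    return {"passed": True,
            "content": re.sub(r"\S+@\S+\.\S+", "[EMAIL]", text)}

print(run_pipeline("Contact me at jo@example.com",
                   [no_injection, redact_email],
                   lambda x: f"Echo: {x}", []))
# → Echo: Contact me at [EMAIL]
```

The key property the sketch captures: a blocking input rule short-circuits before the agent runs, while a redacting rule rewrites the content and lets the run continue.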

Three Modes of Action

Every guardrail rule operates in one of three modes. This gives you fine-grained control over how aggressively each check should respond:

Block
Stop processing entirely. The user receives a configurable rejection message. The agent never runs (input) or the response is replaced (output). Use this for hard safety boundaries.
Warn
Log the violation as a trace span and continue. Processing is not interrupted. Use this when you want visibility into potential issues without blocking legitimate requests.
Redact
Replace sensitive content with placeholders and continue processing. Available for PII guardrails. The agent receives sanitized input, or the user receives a sanitized response, without the run being interrupted.
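The three modes reduce to a small dispatch on each rule's result. This is a sketch of the idea, not Connic's internal implementation; the `apply_rule` helper and result dictionary are made up for illustration.

```python
# Sketch of block / warn / redact dispatch; the structure is
# illustrative, not Connic's internal implementation.
import re

def apply_rule(content, detect, mode, redact_with="[REDACTED]"):
    """detect returns a list of matched strings (empty = clean)."""
    matches = detect(content)
    if not matches:
        return {"action": "pass", "content": content}
    if mode == "block":
        return {"action": "block", "content": None}    # stop processing entirely
    if mode == "warn":
        return {"action": "warn", "content": content}  # log and continue
    if mode == "redact":
        for m in matches:                              # sanitize in place
            content = content.replace(m, redact_with)
        return {"action": "redact", "content": content}
    raise ValueError(f"unknown mode: {mode}")

find_ssn = lambda text: re.findall(r"\b\d{3}-\d{2}-\d{4}\b", text)

print(apply_rule("My SSN is 123-45-6789", find_ssn, "redact"))
# → {'action': 'redact', 'content': 'My SSN is [REDACTED]'}
```

Note that the same detector serves all three modes; only the action taken on a match differs.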

10 Built-In Guardrail Types

Connic ships a set of guardrail types that cover the most common safety needs for production agents. Each can run on input, output, or both.

Prompt Injection Detection
OWASP-style detection that catches instruction override attempts, typoglycemia attacks, encoding tricks, and structural manipulation. Supports Lakera as an external provider.
PII Detection (Input)
Detects personally identifiable information in user input: emails, phone numbers, SSNs, credit cards, and more. Configurable entity types. Supports block, warn, and redact modes.
PII Leakage (Output)
Catches PII that shows up in agent responses, even if it wasn't in the original input. Prevents your agent from surfacing sensitive data from its context or tools.
Content Moderation
Toxicity and harmful content detection. Uses OpenAI Moderation or Perspective API as external providers to catch hate speech, harassment, violence, and other policy violations.
Topic Restriction
Restrict your agent to specific topics. Define an allowed topics list and a custom off-topic message. Requests outside the allowed scope are blocked before the agent runs.
Regex Pattern Matching
Define custom regex patterns to catch specific strings, formats, or keywords. Useful for catching internal identifiers, proprietary terms, or domain-specific patterns.
System Prompt Leakage
Detects when an agent's response contains fragments of its system prompt. Prevents attackers from extracting your internal instructions, tool schemas, or business logic.
Output Relevance
Checks whether the agent's response is actually relevant to the original question. Catches hallucinated tangents, off-track reasoning, and responses that drift from the task.
Data Exfiltration Detection
Detects patterns that indicate an attempt to extract data through the agent, such as encoding payloads, URL smuggling, or structured extraction of private context.
Custom Guardrails
Write your own guardrail logic in Python. Drop a module into your guardrails/ directory with a check() function. Supports both sync and async execution.
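To make the regex pattern type concrete, here is roughly what such a check does under the hood. The pattern names, example patterns, and result shape below are hypothetical, chosen only to show the technique.

```python
# Hypothetical illustration of a regex-pattern guardrail; the pattern
# names and result shape are made up for this example.
import re

INTERNAL_ID = re.compile(r"\bINT-\d{6}\b")             # e.g. internal ticket IDs
API_KEY_LIKE = re.compile(r"\bsk-[A-Za-z0-9]{20,}\b")  # key-shaped strings

def regex_check(content):
    hits = []
    for name, pattern in [("internal_id", INTERNAL_ID),
                          ("api_key_like", API_KEY_LIKE)]:
        hits += [(name, match) for match in pattern.findall(content)]
    return {"passed": not hits, "matches": hits}

print(regex_check("Ticket INT-004211 is resolved."))
# → {'passed': False, 'matches': [('internal_id', 'INT-004211')]}
```

Regex checks are cheap and deterministic, which is why they make good first-line filters ahead of model-based or external-provider checks.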

Configuration in YAML

Guardrails are defined in your agent's YAML configuration. Each rule specifies its type, mode, and optional parameters. Input and output guardrails are configured separately, so you can apply different checks at each stage.

agent.yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn]
    - type: topic_restriction
      mode: block
      config:
        allowed_topics: [support, billing]
        off_topic_message: "I can only help with support and billing questions."
  output:
    - type: moderation
      mode: block
    - type: system_prompt_leakage
      mode: block
    - type: pii_leakage
      mode: redact
    - type: relevance
      mode: warn

This configuration blocks prompt injection on input, redacts PII from user messages, restricts the agent to support and billing topics, and then checks the output for moderation violations, system prompt leakage, PII in the response, and relevance drift.

Tip: Order Matters
Guardrails run in the order you define them. Place cheaper, faster checks first (like regex and prompt injection) and more expensive checks (like moderation with external providers) later. If an early check blocks, the later ones never run.

Writing Custom Guardrails

When the built-in types aren't enough, you can write custom guardrails in Python. Create a module in your agent's guardrails/ directory that exports a check() function. It receives the content being checked and a context dictionary with metadata about the current run.

guardrails/competitor_mentions.py
from connic import GuardrailResult

COMPETITORS = ["acme corp", "rival inc", "other platform"]

def check(content: str, context: dict) -> GuardrailResult:
    content_lower = content.lower()
    for name in COMPETITORS:
        if name in content_lower:
            return GuardrailResult(
                passed=False,
                message="I'm not able to discuss other platforms.",
                details={"matched": name},
            )
    return GuardrailResult(passed=True)

Then reference it in your YAML configuration:

agent.yaml
guardrails:
  output:
    - type: custom
      name: competitor_mentions
      mode: block
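Because a custom guardrail is plain Python, you can sanity-check it locally before deploying. The real GuardrailResult comes from the connic package; the stand-in dataclass below is an assumption about its shape, used only so the check runs outside the platform.

```python
# Local sanity check for the competitor_mentions guardrail. The
# GuardrailResult stand-in is an assumed shape for offline testing;
# on the platform you would import it from connic instead.
from dataclasses import dataclass, field

@dataclass
class GuardrailResult:
    passed: bool
    message: str = ""
    details: dict = field(default_factory=dict)

COMPETITORS = ["acme corp", "rival inc", "other platform"]

def check(content: str, context: dict) -> GuardrailResult:
    content_lower = content.lower()
    for name in COMPETITORS:
        if name in content_lower:
            return GuardrailResult(
                passed=False,
                message="I'm not able to discuss other platforms.",
                details={"matched": name},
            )
    return GuardrailResult(passed=True)

result = check("How do you compare to Acme Corp?", context={})
print(result.passed, result.details)
# → False {'matched': 'acme corp'}
```

A few asserts like this in your repo catch logic mistakes (case sensitivity, missing names) long before a blocked production run does.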

Full Observability with Traces

Every guardrail evaluation is captured as an OpenTelemetry trace span. You get complete visibility into what was checked, what passed, and what was blocked or redacted.

Trace Spans
Each guardrail check creates a child span under guardrails:input or guardrails:output. Attributes include the rule type, mode, direction, and pass/fail status.
Run-Level Detail
Open any run in the dashboard to see exactly which guardrails fired, whether they passed or blocked, and what content triggered them. Blocked runs show the rejection reason directly in the run detail view.

That means you can answer questions like: How often is prompt injection being attempted? Which agents trigger the most PII redactions? Are topic restrictions too aggressive? The data is there for every run.
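Since each check is a span carrying rule type, mode, direction, and pass/fail attributes, those questions reduce to simple aggregations over exported spans. The span dictionaries below are toy data in a hypothetical shape, not Connic's actual export format.

```python
# Illustrative aggregation over guardrail trace spans; the span shape
# and data here are hypothetical, not Connic's actual export format.
from collections import Counter

spans = [  # toy data for illustration
    {"rule": "prompt_injection", "direction": "input",  "passed": False},
    {"rule": "pii",              "direction": "input",  "passed": True},
    {"rule": "prompt_injection", "direction": "input",  "passed": True},
    {"rule": "pii",              "direction": "input",  "passed": False},
    {"rule": "relevance",        "direction": "output", "passed": True},
]

# How often each guardrail fired (i.e. failed a check)
failures = Counter(s["rule"] for s in spans if not s["passed"])
print(failures)
```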

External Providers

Several built-in guardrail types support external providers for more accurate detection. You can swap the default detection engine for a specialized service without changing your guardrail configuration:

Lakera (Prompt Injection): Purpose-built for injection detection with continuously updated models.
OpenAI Moderation (Moderation, PII Leakage): High-quality toxicity and category-level content classification.
Perspective API (Moderation, PII Leakage): Google-backed toxicity scoring with fine-grained attribute breakdown.

Real-World Examples

A few guardrail configurations teams run in production:

Customer Support Agent
A SaaS company runs a customer-facing support agent. They use prompt injection detection on input (block mode) to prevent manipulation, topic restriction to keep conversations about their product, PII redaction on input so customer emails and phone numbers are never sent to the model, and content moderation on output to ensure responses stay professional.
Internal Knowledge Assistant
An enterprise deploys an internal agent that queries their knowledge base. System prompt leakage detection on output prevents the agent from revealing its retrieval configuration. Data exfiltration detection catches attempts to extract internal documents through crafted prompts. Relevance checking (warn mode) flags when the agent starts generating tangential content.
Regulated Industry Agent
A healthcare company uses PII detection on both input and output with redact mode to ensure patient data never persists in logs. Topic restriction limits the agent to approved medical information topics. A custom guardrail validates that every response includes a required disclaimer. Guardrail trace spans provide a complete record of every check for compliance reviews.

Getting Started

Adding guardrails to an existing agent takes minutes:

  1. Open your agent's YAML configuration and add a guardrails section with the rules you need.
  2. Deploy your agent. Guardrails activate automatically on the next run.
  3. Check the Traces tab in the Connic dashboard to see guardrail spans for each run.
  4. Open individual runs to inspect which guardrails fired and drill into blocked requests.

Start with prompt injection and PII detection. Those two cover the most common attack vectors. Then layer on topic restriction, moderation, and custom checks as you understand your traffic patterns.

For the full configuration reference and all available options, check the Guardrails documentation. New to Connic? Start with the quickstart guide to deploy your first agent, then come back here to add safety layers around it.
