Product Spotlight

Agent Guardrails: Real-Time Safety for Your AI Agents

Connic Guardrails intercept agent inputs and outputs in real time to block prompt injection, redact PII, and enforce topic restrictions.

March 3, 2026 · 9 min read

AI agents are powerful. They can draft emails, summarize documents, call APIs, and make decisions. But they also inherit every risk that comes with running a language model in production: prompt injection, PII leakage, off-topic responses, system prompt exposure, and outputs that violate your content policies.

You cannot ship an agent to production and hope it behaves. You need runtime safety checks that sit between your users and your agent, inspecting every input and every output before they cause harm. That is what Connic Guardrails do.

Why Agents Need Guardrails

Traditional software is deterministic. If you write a function that adds two numbers, it adds two numbers. Language models are different. The same prompt can produce wildly different outputs depending on context, temperature, and how creatively a user phrases their request.

This unpredictability creates real risks in production:

Prompt Injection

A user crafts input that overrides your system prompt. Your helpful customer support agent suddenly starts ignoring its instructions and doing whatever the user asks. OWASP lists prompt injection as the #1 risk for LLM applications.

PII Exposure

Users paste sensitive data into prompts: email addresses, phone numbers, social security numbers, credit card details. Without guardrails, that data flows straight into your model provider's API and potentially into your logs.

System Prompt Leakage

A cleverly worded request tricks your agent into revealing its system prompt, including internal instructions, tool configurations, and business logic you intended to keep private.

Off-Topic or Harmful Output

Your billing support agent starts giving medical advice. Your internal assistant generates toxic content. Without output checks, there is no safety net between the model's response and the end user.

Connic Guardrails address all of these. They run as a configurable layer around your agent, inspecting content in real time and taking action before damage is done.

How Guardrails Work

Guardrails sit in the execution pipeline of every agent run. They check content at two points: before the agent processes the input, and after the agent produces a response. The flow looks like this:

User Input → Input Guardrails → Agent → Output Guardrails → Response

Input guardrails evaluate the raw user message before the agent sees it. If a guardrail detects a prompt injection attempt, for example, the message is blocked immediately and the agent never executes. If PII is detected, it can be redacted in place so the agent receives a sanitized version.

Output guardrails evaluate the agent's response before it reaches the user. If the response contains system prompt fragments, toxic content, or data exfiltration patterns, the guardrail intercepts it. The user receives a safe rejection message instead.
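The two-stage flow can be sketched in a few lines of Python. This is an illustrative model only, not Connic's actual API: the `Guardrail` class, the `(passed, text)` check contract, and the rejection strings are all assumptions made to keep the sketch self-contained.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical guardrail: a check function plus a mode ("block", "warn", "redact").
# check returns (passed, possibly-sanitized text) so redaction can rewrite in place.
@dataclass
class Guardrail:
    name: str
    check: Callable[[str], tuple[bool, str]]
    mode: str = "block"

def run_with_guardrails(user_input, input_rules, output_rules, agent):
    # Input stage: every rule sees the (possibly sanitized) message before the agent does.
    for rule in input_rules:
        passed, user_input = rule.check(user_input)
        if not passed and rule.mode == "block":
            return f"Blocked by {rule.name}"  # the agent never executes
    response = agent(user_input)
    # Output stage: same pattern, applied to the agent's response.
    for rule in output_rules:
        passed, response = rule.check(response)
        if not passed and rule.mode == "block":
            return f"Response withheld ({rule.name})"
    return response
```

Note the short-circuit: a blocking input rule returns before the agent (or any later rule) runs at all.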

Three Modes of Action

Every guardrail rule operates in one of three modes. This gives you fine-grained control over how aggressively each check should respond:

Block

Stop processing entirely. The user receives a configurable rejection message. The agent never runs (input) or the response is replaced (output). Use this for hard safety boundaries.

Warn

Log the violation as a trace span and continue. Processing is not interrupted. Use this when you want visibility into potential issues without blocking legitimate requests.

Redact

Replace sensitive content with placeholders and continue processing. Available for PII guardrails. The agent receives sanitized input, or the user receives a sanitized response, without the run being interrupted.
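The difference between the three modes is easiest to see side by side. Here is a minimal sketch using email redaction as the example; the regex, the `[EMAIL]` placeholder, and the `(action, text)` return shape are illustrative assumptions, not Connic internals.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def apply_pii_rule(text: str, mode: str):
    """Return (action, text) according to the rule's mode."""
    if not EMAIL.search(text):
        return ("pass", text)
    if mode == "block":
        # Hard stop: the caller substitutes a configurable rejection message.
        return ("block", "Your message contained restricted content.")
    if mode == "warn":
        # Log the violation (as a trace span) and continue unchanged.
        return ("warn", text)
    if mode == "redact":
        # Sanitize in place and continue processing.
        return ("redact", EMAIL.sub("[EMAIL]", text))
    raise ValueError(f"unknown mode: {mode}")
```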

10 Built-In Guardrail Types

Connic ships with a comprehensive set of guardrail types that cover the most common safety requirements for production agents. Each can run on input, output, or both.

Prompt Injection Detection

OWASP-style detection that catches instruction override attempts, typoglycemia attacks, encoding tricks, and structural manipulation. Supports Lakera as an external provider.

PII Detection (Input)

Detects personally identifiable information in user input: emails, phone numbers, SSNs, credit cards, and more. Configurable entity types. Supports block, warn, and redact modes.

PII Leakage (Output)

Catches PII that appears in agent responses, even if it was not in the original input. Prevents your agent from surfacing sensitive data from its context or tools.

Content Moderation

Toxicity and harmful content detection. Uses OpenAI Moderation or Perspective API as external providers to catch hate speech, harassment, violence, and other policy violations.

Topic Restriction

Restrict your agent to specific topics. Define an allowed topics list and a custom off-topic message. Requests outside the allowed scope are blocked before the agent runs.

Regex Pattern Matching

Define custom regex patterns to catch specific strings, formats, or keywords. Useful for catching internal identifiers, proprietary terms, or domain-specific patterns.
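In plain Python, the core of a regex guardrail is just a named pattern list checked against the content. The pattern names and formats below (an internal ticket ID, an API-key shape) are hypothetical examples, not defaults that ship with Connic.

```python
import re

# Hypothetical patterns a team might define; names and formats are illustrative only.
PATTERNS = {
    "internal_ticket": re.compile(r"\bINT-\d{4,}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}

def regex_guardrail(content: str) -> list[str]:
    """Return the names of all patterns that matched; empty list means clean."""
    return [name for name, pat in PATTERNS.items() if pat.search(content)]
```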

System Prompt Leakage

Detects when an agent's response contains fragments of its system prompt. Prevents attackers from extracting your internal instructions, tool schemas, or business logic.

Output Relevance

Checks whether the agent's response is actually relevant to the original question. Catches hallucinated tangents, off-track reasoning, and responses that drift from the task.

Data Exfiltration Detection

Detects patterns that indicate an attempt to extract data through the agent, such as encoding payloads, URL smuggling, or structured extraction of private context.

Custom Guardrails

Write your own guardrail logic in Python. Drop a module into your guardrails/ directory with a check() function. Supports both sync and async execution.

Configuration in YAML

Guardrails are defined in your agent's YAML configuration. Each rule specifies its type, mode, and optional parameters. Input and output guardrails are configured separately, so you can apply different checks at each stage.

agent.yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn]
    - type: topic_restriction
      mode: block
      config:
        allowed_topics: [support, billing]
        off_topic_message: "I can only help with support and billing questions."
  output:
    - type: moderation
      mode: block
    - type: system_prompt_leakage
      mode: block
    - type: pii_leakage
      mode: redact
    - type: relevance
      mode: warn

This configuration blocks prompt injection on input, redacts PII from user messages, restricts the agent to support and billing topics, and then checks the output for moderation violations, system prompt leakage, PII in the response, and relevance drift.

Tip: Order Matters

Guardrails run in the order you define them. Place cheaper, faster checks first (like regex and prompt injection) and more expensive checks (like moderation with external providers) later. If an early check blocks, the later ones never run.

Writing Custom Guardrails

When the built-in types are not enough, you can write custom guardrails in Python. Create a module in your agent's guardrails/ directory that exports a check() function. It receives the content being checked and a context dictionary with metadata about the current run.

guardrails/competitor_mentions.py
from connic import GuardrailResult

COMPETITORS = ["acme corp", "rival inc", "other platform"]

def check(content: str, context: dict) -> GuardrailResult:
    content_lower = content.lower()
    for name in COMPETITORS:
        if name in content_lower:
            return GuardrailResult(
                passed=False,
                message="I'm not able to discuss other platforms.",
                details={"matched": name},
            )
    return GuardrailResult(passed=True)

Then reference it in your YAML configuration:

agent.yaml
guardrails:
  output:
    - type: custom
      name: competitor_mentions
      mode: block
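Custom guardrails also support async execution, which matters when your check calls an external classifier. Assuming the same check() contract, an async version might look like the sketch below; the GuardrailResult dataclass is a minimal stand-in for the connic import so the example runs on its own, and the awaited sleep stands in for a real network call.

```python
import asyncio
from dataclasses import dataclass, field

# Minimal stand-in for connic's GuardrailResult, for illustration only.
@dataclass
class GuardrailResult:
    passed: bool
    message: str = ""
    details: dict = field(default_factory=dict)

async def check(content: str, context: dict) -> GuardrailResult:
    # A real guardrail would await an external classifier or API here.
    await asyncio.sleep(0)
    if "forbidden" in content.lower():
        return GuardrailResult(passed=False, message="Request not allowed.")
    return GuardrailResult(passed=True)
```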

Full Observability with Traces

Every guardrail evaluation is captured as an OpenTelemetry trace span. You get complete visibility into what was checked, what passed, and what was blocked or redacted.

Trace Spans

Each guardrail check creates a child span under guardrails:input or guardrails:output. Attributes include the rule type, mode, direction, and pass/fail status.

Run-Level Detail

Open any run in the dashboard to see exactly which guardrails fired, whether they passed or blocked, and what content triggered them. Blocked runs show the rejection reason directly in the run detail view.

This means you can answer questions like: How often is prompt injection being attempted? Which agents trigger the most PII redactions? Are topic restrictions too aggressive? The data is there for every run.

External Providers

Several built-in guardrail types support external providers for more accurate detection. You can swap the default detection engine for a specialized service without changing your guardrail configuration:

| Provider | Guardrail Types | Strength |
|----------|-----------------|----------|
| Lakera | Prompt Injection | Purpose-built for injection detection with continuously updated models |
| OpenAI Moderation | Moderation, PII Leakage | High-quality toxicity and category-level content classification |
| Perspective API | Moderation, PII Leakage | Google-backed toxicity scoring with fine-grained attribute breakdown |

Real-World Examples

Here are some guardrail configurations we see teams deploying in production:

Customer Support Agent

A SaaS company runs a customer-facing support agent. They use prompt injection detection on input (block mode) to prevent manipulation, topic restriction to keep conversations about their product, PII redaction on input so customer emails and phone numbers are never sent to the model, and content moderation on output to ensure responses stay professional.

Internal Knowledge Assistant

An enterprise deploys an internal agent that queries their knowledge base. System prompt leakage detection on output prevents the agent from revealing its retrieval configuration. Data exfiltration detection catches attempts to extract internal documents through crafted prompts. Relevance checking (warn mode) flags when the agent starts generating tangential content.

Regulated Industry Agent

A healthcare company uses PII detection on both input and output with redact mode to ensure patient data never persists in logs. Topic restriction limits the agent to approved medical information topics. A custom guardrail validates that every response includes a required disclaimer. Guardrail trace spans provide a complete record of every check for compliance reviews.

Getting Started

Adding guardrails to an existing agent takes minutes. Here is how:

1. Open your agent's YAML configuration and add a guardrails section with the rules you need.
2. Deploy your agent. Guardrails activate automatically on the next run.
3. Check the Traces tab in the Connic dashboard to see guardrail spans for each run.
4. Open individual runs to inspect which guardrails fired and drill into blocked requests.

Start with prompt injection and PII detection. Those two cover the most common attack vectors. Then layer on topic restriction, moderation, and custom checks as you understand your traffic patterns.

For the full configuration reference and all available options, check the Guardrails documentation. If you are new to Connic, start with the quickstart guide to deploy your first agent, then come back here to add safety layers around it.
