You wouldn't ship a web application without authentication, input validation, and rate limiting. But many teams ship AI agents with none of the equivalent safeguards. The agent works in the demo, so it goes to production. Then the first creative user discovers they can make it do things it was never supposed to do.
AI agent security isn't theoretical. OWASP ranks prompt injection as the #1 risk for LLM applications. PII leakage creates real compliance liability. System prompt extraction exposes your business logic. These aren't edge cases. They're what happens when you expose a language model to real users without guardrails.
This is a practical checklist for securing AI agents before they go live. No theory, no hypotheticals, just the controls you need and how to implement them.
## The Threat Model for AI Agents
Before you can secure an agent, you need to understand what you're securing it against. The attacks that actually happen in production are the ones this checklist addresses: prompt injection, PII exposure, system prompt extraction, off-topic abuse, and data exfiltration.
## The Production Safety Checklist
These are the security controls you should have in place before your agent faces real users. Each one comes with specific implementation guidance.
### 1. Block Prompt Injection on Input
This is your first line of defense. Every user message should be checked for injection attempts before the agent processes it.
Effective prompt injection detection goes beyond keyword matching. It needs to catch:
- **Direct instruction overrides**: "Ignore previous instructions" and variants
- **Encoding attacks**: Base64-encoded instructions, hex-encoded payloads
- **Character manipulation**: typoglycemia (scrambled characters), Unicode substitution
- **Structural attacks**: delimiter injection, context manipulation through formatting
```yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
      config:
        sensitivity: medium  # low, medium, or high
```

When a prompt injection is detected, the agent never executes. The user receives a rejection message, and the attempt is logged as a trace span for security review.
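To make the encoding problem concrete, here is a toy sketch, not the product's detector, showing why detection has to normalize base64 payloads before pattern matching. All names and patterns are illustrative:

```python
import base64
import re

# Two example override patterns; a real detector uses many more signals.
OVERRIDE_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (the )?system prompt",
]

def normalize(text: str) -> str:
    """Decode base64-looking tokens so hidden payloads face the same checks."""
    parts = [text]
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            parts.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except Exception:
            continue  # not valid base64 / not valid UTF-8, skip
    return " ".join(parts).lower()

def looks_like_injection(text: str) -> bool:
    normalized = normalize(text)
    return any(re.search(p, normalized) for p in OVERRIDE_PATTERNS)
```

A keyword filter that only scans the raw input would pass the base64 variant straight through; normalizing first closes that gap.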
The detection backend is configurable via the `provider` field in the guardrail config.

### 2. Handle PII Before It Reaches the Model
Users will paste sensitive data into your agent; that's a given. The question is whether the data reaches your model provider's API or gets intercepted first.
You have three options:
- **Block**: reject any message that contains PII. Safest, but frustrating when users legitimately need to share details.
- **Redact**: replace each detected entity with a placeholder like [EMAIL_REDACTED]. The agent processes the sanitized input. Best balance of safety and usability.
- **Warn**: let the message through, but flag it for review.

```yaml
guardrails:
  input:
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn, credit_card, api_key]
  output:
    - type: pii_leakage
      mode: redact
```

Notice the output guardrail. PII can also appear in agent responses, even if it wasn't in the original input. If your agent has access to customer data through tools or knowledge bases, output PII detection keeps it from surfacing sensitive information in responses.
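For intuition on what redact mode does mechanically, here is a minimal sketch using regexes. This is illustrative only: production PII detection relies on trained NER models, and the patterns and placeholder format here are assumptions:

```python
import re

# Typed placeholders, so the agent still sees *that* an email was present,
# just not its value. Regexes are deliberately simple for illustration.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected entity with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text
```

The key design point is that redaction happens before the text leaves your infrastructure, so the raw values never reach the model provider.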
### 3. Prevent System Prompt Leakage
Your system prompt contains your agent's personality, business rules, tool schemas, and operating instructions. If an attacker extracts it, they have a blueprint for manipulating your agent.
```yaml
guardrails:
  output:
    - type: system_prompt_leakage
      mode: block
```

This checks every agent response for fragments that match the system prompt. If the agent starts revealing its instructions, the response is blocked and replaced with a safe message.
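As a rough illustration of how such a check can work, here is a sketch that flags verbatim overlap between a response and the system prompt using n-gram comparison. This is an assumption about one possible mechanism; production checks also handle paraphrased leaks:

```python
def ngrams(text: str, n: int = 6) -> set:
    """All lowercase word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_system_prompt(response: str, system_prompt: str, n: int = 6) -> bool:
    """Flag if the response reproduces any n consecutive words of the prompt."""
    return bool(ngrams(response, n) & ngrams(system_prompt, n))
```

A longer `n` reduces false positives on common phrases; a shorter `n` catches partial leaks. Tuning that trade-off is exactly the kind of thing trace data (covered below) helps with.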
### 4. Restrict Your Agent to Its Lane
Language models will happily answer questions about anything. Your billing support agent shouldn't be giving cooking advice or medical recommendations. Topic restriction keeps the agent focused on what it's supposed to do.
```yaml
guardrails:
  input:
    - type: topic_restriction
      mode: block
      config:
        allowed_topics:
          - billing and payments
          - subscription management
          - account settings
        off_topic_message: "I can only help with billing and account questions."
        model: openai/gpt-5-mini
```

You can define allowed topics (a whitelist) or blocked topics (a blacklist), depending on what makes more sense for your use case. The check uses a lightweight LLM call for classification, which is why it takes a `model` parameter. Use a fast, cheap model to keep latency and cost low.
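The control flow can be sketched as follows, with the LLM classification step stubbed out by keyword matching so the snippet is self-contained. All names here are illustrative, not a real API:

```python
from typing import Optional

ALLOWED_TOPICS = {
    "billing and payments": ["invoice", "charge", "payment", "refund"],
    "subscription management": ["subscription", "plan", "upgrade", "cancel"],
    "account settings": ["password", "email address", "profile"],
}
OFF_TOPIC_MESSAGE = "I can only help with billing and account questions."

def classify(message: str) -> Optional[str]:
    """Stand-in for the cheap LLM classifier: return a matched topic or None."""
    lowered = message.lower()
    for topic, keywords in ALLOWED_TOPICS.items():
        if any(k in lowered for k in keywords):
            return topic
    return None

def gate(message: str) -> Optional[str]:
    """Return the rejection message, or None if the agent may proceed."""
    return None if classify(message) else OFF_TOPIC_MESSAGE
```

In the real guardrail the classifier is the configured model, which is why latency and cost of that model matter.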
### 5. Moderate Output Content
Even with a good system prompt, language models can generate inappropriate content. Content moderation on output catches toxicity, harassment, hate speech, and other policy violations before they reach the user.
```yaml
guardrails:
  output:
    - type: moderation
      mode: block
      config:
        categories:
          - hate
          - harassment
          - violence
          - self_harm
```

### 6. Detect Data Exfiltration Attempts
A subtle but dangerous attack: prompts that cause the agent to encode sensitive data into URLs, markdown images, or structured output that gets sent to external servers. This is especially dangerous for agents with access to internal data through tools.
```yaml
guardrails:
  output:
    - type: data_exfiltration
      mode: block
      config:
        allowed_domains:
          - yourdomain.com
          - docs.yourdomain.com
```

This catches suspicious URLs with encoded data in query parameters, markdown image tags pointing to external domains, and base64-encoded data blocks in the response. The allowed-domains whitelist ensures legitimate references are not blocked.
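As a rough sketch of the URL part of this check, assuming a simple hostname allowlist (the real guardrail inspects additional signals, such as base64 blocks and query-parameter entropy):

```python
import re
from urllib.parse import urlparse

# Domains the agent is allowed to reference; subdomains of the primary
# domain are also accepted. Values here mirror the config example above.
ALLOWED_DOMAINS = {"yourdomain.com", "docs.yourdomain.com"}

URL_PATTERN = re.compile(r"https?://[^\s)\"']+")

def has_exfiltration_risk(response: str) -> bool:
    """Flag any URL whose host falls outside the allowlist."""
    for url in URL_PATTERN.findall(response):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_DOMAINS and not host.endswith(".yourdomain.com"):
            return True
    return False
```

Markdown image URLs are the classic channel here: many clients fetch them automatically, so an attacker-controlled `<img>` URL with data in its query string exfiltrates without any user click.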
### 7. Add Custom Business Rules
Every business has unique safety requirements that built-in guardrails can't cover. Custom guardrails let you write arbitrary validation logic in Python.
```python
from connic import GuardrailResult

REQUIRED_DISCLAIMERS = [
    "not financial advice",
    "consult a professional",
]

def check(content: str, context: dict) -> GuardrailResult:
    """Ensure financial responses include disclaimers."""
    agent_name = context.get("agent_name", "")
    if "finance" not in agent_name:
        return GuardrailResult(passed=True)

    content_lower = content.lower()
    has_disclaimer = any(d in content_lower for d in REQUIRED_DISCLAIMERS)
    if not has_disclaimer:
        return GuardrailResult(
            passed=False,
            message="Response must include a disclaimer.",
        )
    return GuardrailResult(passed=True)
```

Common custom guardrails include competitor mention blocking, regulatory compliance disclaimers, internal terminology filters, and domain-specific validation rules.
### 8. Check Output Relevance
Sometimes an agent responds without saying anything harmful, but also without answering the actual question. Relevance checking catches goal hijacking (where an attacker subtly redirects the agent's purpose) and hallucinated tangents.
```yaml
guardrails:
  output:
    - type: relevance
      mode: warn
      config:
        model: openai/gpt-5-mini
```

Start with warn mode to understand how often irrelevant responses occur, then escalate to block mode if the rate is unacceptable.
## Putting It All Together
A complete guardrail configuration for a production customer support agent:
```yaml
version: "1.0"
name: customer-support
model: openai/gpt-4o

system_prompt: |
  You are a customer support agent for Acme Corp.
  Help customers with orders, billing, and product questions.

tools:
  - support.search_orders
  - support.search_knowledge_base
  - support.create_ticket

guardrails:
  input:
    - type: prompt_injection
      mode: block
      config:
        sensitivity: medium
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn, credit_card]
    - type: topic_restriction
      mode: block
      config:
        allowed_topics: [orders, billing, products, returns]
        off_topic_message: "I can only help with Acme Corp products and orders."
        model: openai/gpt-5-mini
  output:
    - type: moderation
      mode: block
    - type: system_prompt_leakage
      mode: block
    - type: pii_leakage
      mode: redact
    - type: data_exfiltration
      mode: block
      config:
        allowed_domains: [acmecorp.com]
    - type: relevance
      mode: warn
      config:
        model: openai/gpt-5-mini
```

## Observability: See What Is Being Caught
Security controls without visibility are a black box. Every guardrail evaluation is recorded as a trace span, giving you a complete audit trail.
This data lets you answer questions like: How often is prompt injection attempted? Which agents trigger the most PII redactions? Are topic restrictions too aggressive? You can tune your guardrails based on real traffic patterns instead of guesswork.
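For example, if guardrail spans were exported as plain dicts — an assumption for illustration; the actual span schema may differ — a quick aggregation answers the "what fires most" question:

```python
from collections import Counter

# Assumed span shape: {"type": <guardrail type>, "action": <what it did>}.
def guardrail_stats(spans: list) -> Counter:
    """Count how often each guardrail actually intervened."""
    return Counter(
        s["type"] for s in spans if s.get("action") in {"block", "redact"}
    )
```

Run over a day of traffic, a table like this immediately shows whether, say, topic restriction is firing far more often than prompt injection, which usually means the allowlist is too narrow rather than that you are under attack.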
## Beyond Guardrails: Defense in Depth
Guardrails are the runtime safety layer, but a complete security posture also includes the controls you would apply to any production service: authentication, input validation, and rate limiting at the API boundary, least-privilege access for agent tools, and regular review of logs and traces.
## Getting Started
If you're starting from zero, here's the recommended approach:
1. **Start with the baseline.** Add `prompt_injection` and `pii` on input, `moderation` and `system_prompt_leakage` on output. This covers the most common attack vectors.
2. **Use warn mode first.** Monitor what gets flagged before switching to block. This prevents legitimate requests from being rejected while you tune sensitivity.
3. **Layer up based on traffic.** Once you see real usage patterns, add topic restriction, relevance checking, and custom guardrails based on the actual risks you observe.
4. **Review trace data regularly.** Check which guardrails fire most often and investigate patterns. Adjust sensitivity and modes as your understanding of the traffic improves.
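Written out as a config fragment, the step-1 baseline in warn-first form looks like this (guardrail types are from this article; treat the exact schema as illustrative):

```yaml
guardrails:
  input:
    - type: prompt_injection
      mode: warn   # switch to block after tuning
    - type: pii
      mode: redact
  output:
    - type: moderation
      mode: warn   # switch to block after tuning
    - type: system_prompt_leakage
      mode: block
```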
For the full configuration reference, see the Guardrails documentation. For a deeper look at the built-in guardrail types, read Agent Guardrails: Real-Time Safety for Your AI Agents.