You would not ship a web application without authentication, input validation, and rate limiting. But many teams ship AI agents with none of the equivalent safeguards. The agent works in the demo, so it goes to production — and the first creative user discovers they can make it do things it was never supposed to do.
AI agent security is not theoretical. OWASP ranks prompt injection as the #1 risk for LLM applications. PII leakage creates real compliance liability. System prompt extraction exposes your business logic. These are not edge cases — they are what happens when you expose a language model to real users without guardrails.
This is a practical checklist for securing AI agents before they go live. No theory, no hypotheticals — just the controls you need and how to implement them.
The Threat Model for AI Agents
Before you can secure an agent, you need to understand what you are securing it against. Here are the attacks that actually happen in production:
Prompt Injection
A user crafts input that overrides your system prompt. "Ignore all previous instructions and..." is the obvious version, but sophisticated attacks use encoding tricks, scrambled characters, and structural manipulation that bypass simple keyword filters. OWASP LLM Top 10 risk #1.
PII Exposure
Users paste sensitive data into prompts: email addresses, credit card numbers, social security numbers, phone numbers. Without guardrails, this data flows into model provider APIs and potentially into your logs. Under GDPR and CCPA, this creates real compliance liability.
System Prompt Extraction
Attackers trick the agent into revealing its system prompt, including internal instructions, tool configurations, API schemas, and business rules. Once exposed, they know exactly how to manipulate the agent.
Data Exfiltration
Crafted prompts cause the agent to encode sensitive context data into URLs, markdown images, or structured output — silently sending internal data to external servers.
Off-Topic and Harmful Output
Without content controls, your billing support agent might start giving medical advice. Your internal assistant might generate toxic or inappropriate content. The model does not understand your content policy unless you enforce it.
The Production Safety Checklist
Here are the security controls you should have in place before your agent faces real users. We will walk through each one with specific implementation guidance.
1. Block Prompt Injection on Input
This is your first line of defense. Every user message should be checked for injection attempts before the agent processes it.
Effective prompt injection detection goes beyond keyword matching. It needs to catch:
- →Direct instruction overrides — "Ignore previous instructions" and variants
- →Encoding attacks — Base64-encoded instructions, hex-encoded payloads
- →Character manipulation — Typoglycemia (scrambled characters), Unicode substitution
- →Structural attacks — Delimiter injection, context manipulation through formatting
guardrails:
input:
- type: prompt_injection
mode: block
config:
sensitivity: medium # low, medium, or highWhen a prompt injection is detected, the agent never executes. The user receives a rejection message, and the attempt is logged as a trace span for security review.
For High-Risk Applications
Connic supports specialized injection detection providers like Lakera, which offer continuously updated detection models and support for 100+ languages. Add them by setting the provider field in the guardrail config.
2. Handle PII Before It Reaches the Model
Users will paste sensitive data into your agent. That is a given. The question is whether that data reaches your model provider's API or gets intercepted first.
You have three options:
Block
Reject the entire message if PII is detected. Strictest option. Good for environments where PII should never enter the system at all.
Redact
Replace detected PII with placeholders like [EMAIL_REDACTED]. The agent processes the sanitized input. Best balance of safety and usability.
Warn
Log the detection but continue processing. Use this to measure how often PII appears before deciding whether to block or redact.
guardrails:
input:
- type: pii
mode: redact
config:
entities: [email, phone, ssn, credit_card, api_key]
output:
- type: pii_leakage
mode: redactNotice the output guardrail. PII can also appear in agent responses — even if it was not in the original input. If your agent has access to customer data through tools or knowledge bases, output PII detection prevents it from surfacing sensitive information in responses.
3. Prevent System Prompt Leakage
Your system prompt contains your agent's personality, business rules, tool schemas, and operating instructions. If an attacker extracts it, they have a blueprint for manipulating your agent.
guardrails:
output:
- type: system_prompt_leakage
mode: blockThis checks every agent response for fragments that match the system prompt. If the agent starts revealing its instructions, the response is blocked and replaced with a safe message.
4. Restrict Your Agent to Its Lane
Language models will happily answer questions about anything. Your billing support agent should not be giving cooking advice or medical recommendations. Topic restriction keeps the agent focused on what it is supposed to do.
guardrails:
input:
- type: topic_restriction
mode: block
config:
allowed_topics:
- billing and payments
- subscription management
- account settings
off_topic_message: "I can only help with billing and account questions."
model: openai/gpt-4o-miniYou can define allowed topics (whitelist) or blocked topics (blacklist) depending on what makes more sense for your use case. The check uses a lightweight LLM call for classification, which is why it takes a model parameter — use a fast, cheap model to keep latency and cost low.
5. Moderate Output Content
Even with a good system prompt, language models can generate inappropriate content. Content moderation on output catches toxicity, harassment, hate speech, and other policy violations before they reach the user.
guardrails:
output:
- type: moderation
mode: block
config:
categories:
- hate
- harassment
- violence
- self_harm6. Detect Data Exfiltration Attempts
A subtle but dangerous attack: prompts that cause the agent to encode sensitive data into URLs, markdown images, or structured output that gets sent to external servers. This is especially dangerous for agents with access to internal data through tools.
guardrails:
output:
- type: data_exfiltration
mode: block
config:
allowed_domains:
- yourdomain.com
- docs.yourdomain.comThis catches suspicious URLs with encoded data in query parameters, markdown image tags pointing to external domains, and base64-encoded data blocks in the response. The allowed domains whitelist ensures legitimate references are not blocked.
7. Add Custom Business Rules
Every business has unique safety requirements that built-in guardrails cannot cover. Custom guardrails let you write arbitrary validation logic in Python.
from connic import GuardrailResult
REQUIRED_DISCLAIMERS = [
"not financial advice",
"consult a professional",
]
def check(content: str, context: dict) -> GuardrailResult:
"""Ensure financial responses include disclaimers."""
agent_name = context.get("agent_name", "")
if "finance" not in agent_name:
return GuardrailResult(passed=True)
content_lower = content.lower()
has_disclaimer = any(d in content_lower for d in REQUIRED_DISCLAIMERS)
if not has_disclaimer:
return GuardrailResult(
passed=False,
message="Response must include a disclaimer.",
)
return GuardrailResult(passed=True)Common custom guardrails include: competitor mention blocking, regulatory compliance disclaimers, internal terminology filters, and domain-specific validation rules.
8. Check Output Relevance
Sometimes an agent responds without saying anything harmful — but also without answering the actual question. Relevance checking catches goal hijacking (where an attacker subtly redirects the agent's purpose) and hallucinated tangents.
guardrails:
output:
- type: relevance
mode: warn
config:
model: openai/gpt-4o-miniStart with warn mode to understand how often irrelevant responses occur, then escalate to block mode if the rate is unacceptable.
Putting It All Together
Here is a complete guardrail configuration for a production customer support agent:
version: "1.0"
name: customer-support
model: openai/gpt-4o
system_prompt: |
You are a customer support agent for Acme Corp.
Help customers with orders, billing, and product questions.
tools:
- support.search_orders
- support.search_knowledge_base
- support.create_ticket
guardrails:
input:
- type: prompt_injection
mode: block
config:
sensitivity: medium
- type: pii
mode: redact
config:
entities: [email, phone, ssn, credit_card]
- type: topic_restriction
mode: block
config:
allowed_topics: [orders, billing, products, returns]
off_topic_message: "I can only help with Acme Corp products and orders."
model: openai/gpt-4o-mini
output:
- type: moderation
mode: block
- type: system_prompt_leakage
mode: block
- type: pii_leakage
mode: redact
- type: data_exfiltration
mode: block
config:
allowed_domains: [acmecorp.com]
- type: relevance
mode: warn
config:
model: openai/gpt-4o-miniOrdering Matters
Guardrails execute in the order you define them. Place cheap, fast checks first (prompt injection, PII, regex) and expensive checks last (topic restriction, relevance). If an early check blocks the request, the expensive ones never run.
Observability: See What Is Being Caught
Security controls without visibility are a black box. Every guardrail evaluation is recorded as a trace span, giving you a complete audit trail.
Trace Spans
Each guardrail check creates a span with the rule type, mode, direction, pass/fail status, and detection details. See exactly what triggered and why.
Audit Trail
For compliance teams: every blocked request, every redaction, and every warning is logged with timestamps and full context. Ready for regulatory review.
This data lets you answer questions like: How often is prompt injection attempted? Which agents trigger the most PII redactions? Are topic restrictions too aggressive? You can tune your guardrails based on real traffic patterns instead of guesswork.
Beyond Guardrails: Defense in Depth
Guardrails are the runtime safety layer. But a complete security posture includes additional controls:
Iteration Limits
Cap the number of LLM calls per agent run. Prevents infinite loops, reduces cost exposure from runaway agents, and catches bugs where agents get stuck in cycles.
Concurrency Control
Key-based concurrency ensures only one agent run per unique key is active at a time. Prevents duplicate processing and reduces exposure to replay attacks.
Database Access Controls
If your agent has database tools, restrict which collections it can read and write. Apply the principle of least privilege so agents only access the data they need.
Knowledge Base Namespacing
Scope agent access to specific knowledge namespaces. Prevent one agent from accessing another agent's sensitive data in shared knowledge bases.
Automated Quality Evaluation
Use LLM judges to continuously score agent quality. Catch regressions in accuracy, safety, or compliance before they become incidents.
Getting Started
If you are starting from zero, here is the recommended approach:
- 1.Start with the baseline. Add
prompt_injectionandpiion input,moderationandsystem_prompt_leakageon output. This covers the most common attack vectors. - 2.Use warn mode first. Monitor what gets flagged before switching to block. This prevents legitimate requests from being rejected while you tune sensitivity.
- 3.Layer up based on traffic. Once you see real usage patterns, add topic restriction, relevance checking, and custom guardrails based on the actual risks you observe.
- 4.Review trace data regularly. Check which guardrails fire most often and investigate patterns. Adjust sensitivity and modes as your understanding of the traffic improves.
For the full configuration reference, see the Guardrails documentation. For a deeper look at the built-in guardrail types, read Agent Guardrails: Real-Time Safety for Your AI Agents.