Connic
Back to BlogProduct Spotlight

Secure AI Agents: A Production Safety Checklist

Shipping AI agents without a security strategy is a liability. A practical checklist covering prompt injection, PII handling, output validation, and the guardrails you need before go-live.

March 21, 202612 min read

You would not ship a web application without authentication, input validation, and rate limiting. But many teams ship AI agents with none of the equivalent safeguards. The agent works in the demo, so it goes to production — and the first creative user discovers they can make it do things it was never supposed to do.

AI agent security is not theoretical. OWASP ranks prompt injection as the #1 risk for LLM applications. PII leakage creates real compliance liability. System prompt extraction exposes your business logic. These are not edge cases — they are what happens when you expose a language model to real users without guardrails.

This is a practical checklist for securing AI agents before they go live. No theory, no hypotheticals — just the controls you need and how to implement them.

The Threat Model for AI Agents

Before you can secure an agent, you need to understand what you are securing it against. Here are the attacks that actually happen in production:

Prompt Injection

A user crafts input that overrides your system prompt. "Ignore all previous instructions and..." is the obvious version, but sophisticated attacks use encoding tricks, scrambled characters, and structural manipulation that bypass simple keyword filters. OWASP LLM Top 10 risk #1.

PII Exposure

Users paste sensitive data into prompts: email addresses, credit card numbers, social security numbers, phone numbers. Without guardrails, this data flows into model provider APIs and potentially into your logs. Under GDPR and CCPA, this creates real compliance liability.

System Prompt Extraction

Attackers trick the agent into revealing its system prompt, including internal instructions, tool configurations, API schemas, and business rules. Once exposed, they know exactly how to manipulate the agent.

Data Exfiltration

Crafted prompts cause the agent to encode sensitive context data into URLs, markdown images, or structured output — silently sending internal data to external servers.

Off-Topic and Harmful Output

Without content controls, your billing support agent might start giving medical advice. Your internal assistant might generate toxic or inappropriate content. The model does not understand your content policy unless you enforce it.

The Production Safety Checklist

Here are the security controls you should have in place before your agent faces real users. We will walk through each one with specific implementation guidance.

1. Block Prompt Injection on Input

This is your first line of defense. Every user message should be checked for injection attempts before the agent processes it.

Effective prompt injection detection goes beyond keyword matching. It needs to catch:

  • Direct instruction overrides — "Ignore previous instructions" and variants
  • Encoding attacks — Base64-encoded instructions, hex-encoded payloads
  • Character manipulation — Typoglycemia (scrambled characters), Unicode substitution
  • Structural attacks — Delimiter injection, context manipulation through formatting
agent.yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
      config:
        sensitivity: medium  # low, medium, or high

When a prompt injection is detected, the agent never executes. The user receives a rejection message, and the attempt is logged as a trace span for security review.

For High-Risk Applications

Connic supports specialized injection detection providers like Lakera, which offer continuously updated detection models and support for 100+ languages. Add them by setting the provider field in the guardrail config.

2. Handle PII Before It Reaches the Model

Users will paste sensitive data into your agent. That is a given. The question is whether that data reaches your model provider's API or gets intercepted first.

You have three options:

Block

Reject the entire message if PII is detected. Strictest option. Good for environments where PII should never enter the system at all.

Redact

Replace detected PII with placeholders like [EMAIL_REDACTED]. The agent processes the sanitized input. Best balance of safety and usability.

Warn

Log the detection but continue processing. Use this to measure how often PII appears before deciding whether to block or redact.

agent.yaml
guardrails:
  input:
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn, credit_card, api_key]
  output:
    - type: pii_leakage
      mode: redact

Notice the output guardrail. PII can also appear in agent responses — even if it was not in the original input. If your agent has access to customer data through tools or knowledge bases, output PII detection prevents it from surfacing sensitive information in responses.

3. Prevent System Prompt Leakage

Your system prompt contains your agent's personality, business rules, tool schemas, and operating instructions. If an attacker extracts it, they have a blueprint for manipulating your agent.

agent.yaml
guardrails:
  output:
    - type: system_prompt_leakage
      mode: block

This checks every agent response for fragments that match the system prompt. If the agent starts revealing its instructions, the response is blocked and replaced with a safe message.

4. Restrict Your Agent to Its Lane

Language models will happily answer questions about anything. Your billing support agent should not be giving cooking advice or medical recommendations. Topic restriction keeps the agent focused on what it is supposed to do.

agent.yaml
guardrails:
  input:
    - type: topic_restriction
      mode: block
      config:
        allowed_topics:
          - billing and payments
          - subscription management
          - account settings
        off_topic_message: "I can only help with billing and account questions."
        model: openai/gpt-4o-mini

You can define allowed topics (whitelist) or blocked topics (blacklist) depending on what makes more sense for your use case. The check uses a lightweight LLM call for classification, which is why it takes a model parameter — use a fast, cheap model to keep latency and cost low.

5. Moderate Output Content

Even with a good system prompt, language models can generate inappropriate content. Content moderation on output catches toxicity, harassment, hate speech, and other policy violations before they reach the user.

agent.yaml
guardrails:
  output:
    - type: moderation
      mode: block
      config:
        categories:
          - hate
          - harassment
          - violence
          - self_harm

6. Detect Data Exfiltration Attempts

A subtle but dangerous attack: prompts that cause the agent to encode sensitive data into URLs, markdown images, or structured output that gets sent to external servers. This is especially dangerous for agents with access to internal data through tools.

agent.yaml
guardrails:
  output:
    - type: data_exfiltration
      mode: block
      config:
        allowed_domains:
          - yourdomain.com
          - docs.yourdomain.com

This catches suspicious URLs with encoded data in query parameters, markdown image tags pointing to external domains, and base64-encoded data blocks in the response. The allowed domains whitelist ensures legitimate references are not blocked.

7. Add Custom Business Rules

Every business has unique safety requirements that built-in guardrails cannot cover. Custom guardrails let you write arbitrary validation logic in Python.

guardrails/compliance_check.py
from connic import GuardrailResult

REQUIRED_DISCLAIMERS = [
    "not financial advice",
    "consult a professional",
]

def check(content: str, context: dict) -> GuardrailResult:
    """Ensure financial responses include disclaimers."""
    agent_name = context.get("agent_name", "")
    if "finance" not in agent_name:
        return GuardrailResult(passed=True)

    content_lower = content.lower()
    has_disclaimer = any(d in content_lower for d in REQUIRED_DISCLAIMERS)

    if not has_disclaimer:
        return GuardrailResult(
            passed=False,
            message="Response must include a disclaimer.",
        )
    return GuardrailResult(passed=True)

Common custom guardrails include: competitor mention blocking, regulatory compliance disclaimers, internal terminology filters, and domain-specific validation rules.

8. Check Output Relevance

Sometimes an agent responds without saying anything harmful — but also without answering the actual question. Relevance checking catches goal hijacking (where an attacker subtly redirects the agent's purpose) and hallucinated tangents.

agent.yaml
guardrails:
  output:
    - type: relevance
      mode: warn
      config:
        model: openai/gpt-4o-mini

Start with warn mode to understand how often irrelevant responses occur, then escalate to block mode if the rate is unacceptable.

Putting It All Together

Here is a complete guardrail configuration for a production customer support agent:

agents/support.yaml
version: "1.0"
name: customer-support
model: openai/gpt-4o
system_prompt: |
  You are a customer support agent for Acme Corp.
  Help customers with orders, billing, and product questions.
tools:
  - support.search_orders
  - support.search_knowledge_base
  - support.create_ticket

guardrails:
  input:
    - type: prompt_injection
      mode: block
      config:
        sensitivity: medium
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn, credit_card]
    - type: topic_restriction
      mode: block
      config:
        allowed_topics: [orders, billing, products, returns]
        off_topic_message: "I can only help with Acme Corp products and orders."
        model: openai/gpt-4o-mini
  output:
    - type: moderation
      mode: block
    - type: system_prompt_leakage
      mode: block
    - type: pii_leakage
      mode: redact
    - type: data_exfiltration
      mode: block
      config:
        allowed_domains: [acmecorp.com]
    - type: relevance
      mode: warn
      config:
        model: openai/gpt-4o-mini

Ordering Matters

Guardrails execute in the order you define them. Place cheap, fast checks first (prompt injection, PII, regex) and expensive checks last (topic restriction, relevance). If an early check blocks the request, the expensive ones never run.

Observability: See What Is Being Caught

Security controls without visibility are a black box. Every guardrail evaluation is recorded as a trace span, giving you a complete audit trail.

Trace Spans

Each guardrail check creates a span with the rule type, mode, direction, pass/fail status, and detection details. See exactly what triggered and why.

Audit Trail

For compliance teams: every blocked request, every redaction, and every warning is logged with timestamps and full context. Ready for regulatory review.

This data lets you answer questions like: How often is prompt injection attempted? Which agents trigger the most PII redactions? Are topic restrictions too aggressive? You can tune your guardrails based on real traffic patterns instead of guesswork.

Beyond Guardrails: Defense in Depth

Guardrails are the runtime safety layer. But a complete security posture includes additional controls:

Iteration Limits

Cap the number of LLM calls per agent run. Prevents infinite loops, reduces cost exposure from runaway agents, and catches bugs where agents get stuck in cycles.

Concurrency Control

Key-based concurrency ensures only one agent run per unique key is active at a time. Prevents duplicate processing and reduces exposure to replay attacks.

Database Access Controls

If your agent has database tools, restrict which collections it can read and write. Apply the principle of least privilege so agents only access the data they need.

Knowledge Base Namespacing

Scope agent access to specific knowledge namespaces. Prevent one agent from accessing another agent's sensitive data in shared knowledge bases.

Automated Quality Evaluation

Use LLM judges to continuously score agent quality. Catch regressions in accuracy, safety, or compliance before they become incidents.

Getting Started

If you are starting from zero, here is the recommended approach:

  • 1.Start with the baseline. Add prompt_injection and pii on input, moderation and system_prompt_leakage on output. This covers the most common attack vectors.
  • 2.Use warn mode first. Monitor what gets flagged before switching to block. This prevents legitimate requests from being rejected while you tune sensitivity.
  • 3.Layer up based on traffic. Once you see real usage patterns, add topic restriction, relevance checking, and custom guardrails based on the actual risks you observe.
  • 4.Review trace data regularly. Check which guardrails fire most often and investigate patterns. Adjust sensitivity and modes as your understanding of the traffic improves.

For the full configuration reference, see the Guardrails documentation. For a deeper look at the built-in guardrail types, read Agent Guardrails: Real-Time Safety for Your AI Agents.