# Guardrails
Add configurable safety guardrails to your agents for input validation, output filtering, PII protection, prompt injection detection, and more. Guardrails wrap agent execution and are configured directly in your agent YAML.
## Overview
Guardrails are safety checks that run before and after agent execution. They protect against prompt injection attacks, PII leakage, toxic content, off-topic requests, and more. Each guardrail rule has a mode that determines what happens when a violation is detected.
## Modes
| Mode | Behavior | Run Status |
|---|---|---|
| `block` | Stop processing. Return a rejection message (configurable via `rejection_message`). The agent never executes (input) or the response is replaced (output). | `completed` (set `fail_run: true` in config to mark the run as `failed`) |
| `warn` | Log the violation as a trace span. Continue processing normally. | `completed` |
| `redact` | Replace detected content with placeholders (e.g., `[EMAIL_REDACTED]`). Only available for `pii` and `pii_leakage`. | `completed` |
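As a sketch, a `block` rule that customizes the rejection text and marks blocked runs as failed might look like this (both fields are assumed to live under `config`, per the table above; the message text is illustrative):

```yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
      config:
        rejection_message: "This request was blocked by a safety check."
        fail_run: true   # report the run as failed instead of completed
```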
## Quick Start
Add a security baseline to any agent with just a few lines:
```yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
    - type: pii
      mode: redact
  output:
    - type: moderation
      mode: block
    - type: system_prompt_leakage
      mode: block
```

## Full Example
A comprehensive configuration demonstrating all guardrail types:
```yaml
version: "1.0"
name: support-agent
model: gemini/gemini-2.5-pro
system_prompt: "You are a customer support agent."

guardrails:
  input:
    - type: prompt_injection
      mode: block
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn, credit_card, iban]
    - type: moderation
      mode: block
      config:
        categories: [hate, self_harm, violence, sexual]
    - type: topic_restriction
      mode: block
      config:
        allowed_topics: [product support, billing, account help]
        off_topic_message: "I can only help with product support, billing, and account questions."
        model: openai/gpt-4o-mini
    - type: regex
      mode: block
      config:
        patterns:
          - pattern: "(?i)\\b(drop|delete|truncate)\\s+table\\b"
            message: "SQL commands are not allowed"
    - type: custom
      name: validate-ticket-id
      mode: block
  output:
    - type: moderation
      mode: block
    - type: pii_leakage
      mode: block
      config:
        entities: [ssn, credit_card, api_key]
    - type: system_prompt_leakage
      mode: block
    - type: relevance
      mode: warn
      config:
        model: openai/gpt-4o-mini
    - type: regex
      mode: warn
      config:
        patterns:
          - pattern: "(?i)internal use only|confidential"
            message: "Output contains internal-only content"
    - type: custom
      name: brand-voice-check
      mode: warn
```

## Input Guardrails
### prompt_injection
Multi-layered detection for prompt injection, the #1 risk for agentic AI. Built-in detection includes:
- **Heuristic pattern matching**: detects known injection phrases
- **Typoglycemia defence**: catches scrambled-letter variants (e.g., "ignroe prevoius insturctoins")
- **Encoding detection**: decodes and scans Base64/hex/unicode content
- **Structural analysis**: detects delimiter abuse and instruction boundary violations
| Config | Description |
|---|---|
| `sensitivity` | `low`, `medium` (default), or `high` |
| `provider` | Optional. Set to `lakera` to use the Lakera Guard API (100+ languages) |

Modes: `block`, `warn`
### pii
PII detection and redaction. Detects emails, phone numbers, SSNs, credit cards, IBANs, IP addresses, API keys, and more.
```yaml
# Input:           "My email is john@example.com and SSN is 123-45-6789"
# After redaction: "My email is [EMAIL_REDACTED] and SSN is [SSN_REDACTED]"
guardrails:
  input:
    - type: pii
      mode: redact
      config:
        entities: [email, phone, ssn, credit_card]
```

| Config | Description |
|---|---|
| `entities` | List of entity types to detect (default: all). Available: `email`, `phone`, `ssn`, `credit_card`, `iban`, `ip_address`, `api_key`, and more |

Modes: `block`, `warn`, `redact`
### moderation
Content moderation and safety classification. Works on both input and output.
| Config | Description |
|---|---|
| `categories` | `hate`, `harassment`, `self_harm`, `violence`, `sexual`, `illegal_activity`, `dangerous_instructions` |
| `threshold` | 0.0–1.0 confidence threshold (default: 0.7, external providers only) |
| `provider` | Optional. `openai` (OpenAI Moderation API) or `perspective` (Google Perspective API) |

Modes: `block`, `warn`
### topic_restriction
Keep your agent on-topic using an LLM classifier call.
Note: this adds one LLM call per request (latency and token cost). Use a small, cheap model via the `model` field.
| Config | Description |
|---|---|
| `allowed_topics` | List of allowed topic descriptions |
| `blocked_topics` | List of explicitly blocked topics (alternative to `allowed_topics`) |
| `off_topic_message` | Custom rejection message |
| `model` | Model for classification (e.g., `openai/gpt-4o-mini`). Defaults to the agent model. |

Modes: `block`, `warn`
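The full example earlier uses `allowed_topics`; the `blocked_topics` alternative can be sketched like this (topic values and message are illustrative):

```yaml
guardrails:
  input:
    - type: topic_restriction
      mode: block
      config:
        blocked_topics: [legal advice, medical advice]
        off_topic_message: "I can't help with that topic."
        model: openai/gpt-4o-mini   # cheap classifier keeps latency and cost down
```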
### regex
Custom regex pattern matching for business rules. Works on both input and output.
| Config | Description |
|---|---|
| `patterns` | List of `{pattern, message}` objects |

Modes: `block`, `warn`
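Since the patterns are ordinary regular expressions, they can be sanity-checked locally before shipping. A quick sketch in plain Python (outside the framework), using the SQL pattern from the full example above:

```python
import re

# Same regex as the YAML pattern "(?i)\\b(drop|delete|truncate)\\s+table\\b";
# YAML escapes the backslashes, a Python raw string does not need to.
pattern = re.compile(r"(?i)\b(drop|delete|truncate)\s+table\b")

assert pattern.search("please DROP TABLE users;")       # matched: would be blocked
assert pattern.search("truncate   table logs")          # any run of whitespace matches
assert not pattern.search("drop the tablecloth")        # word boundaries avoid false positives
```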
## Output Guardrails
In addition to moderation and regex (which work on both input and output), these guardrails are specific to output:
### pii_leakage
Detect PII leaking in agent responses. Critical for GDPR/CCPA compliance. Uses the same detection engine as input pii.
| Config | Description |
|---|---|
| `entities` | Entity types to detect (default: `ssn`, `credit_card`, `api_key`) |

Modes: `block`, `warn`, `redact`
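A sketch of `redact` mode, which rewrites the response instead of blocking it (the entity list is illustrative):

```yaml
guardrails:
  output:
    - type: pii_leakage
      mode: redact   # replace leaked values with placeholders
      config:
        entities: [ssn, credit_card, api_key]
```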
### system_prompt_leakage
Detect when the agent leaks its system prompt in the output. Uses pattern matching and similarity comparison against the actual system prompt.
| Config | Description |
|---|---|
| `similarity_threshold` | 0.0–1.0 (default: 0.6). Minimum similarity between the output and the system prompt required to trigger. |

Modes: `block`, `warn`
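A sketch that tightens the threshold so only near-verbatim leaks trigger (the 0.8 value is illustrative; the default is 0.6):

```yaml
guardrails:
  output:
    - type: system_prompt_leakage
      mode: block
      config:
        similarity_threshold: 0.8
```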
### relevance
Detect off-topic or irrelevant responses using an LLM classifier. Catches goal hijacking attacks.
Note: this adds one LLM call per request (latency and token cost). Use a small, cheap model via the `model` field.
| Config | Description |
|---|---|
| `context` | Additional description of what "relevant" means |
| `model` | Model for classification. Defaults to the agent model. |

Modes: `block`, `warn`
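A sketch with an explicit relevance `context` (the wording is illustrative):

```yaml
guardrails:
  output:
    - type: relevance
      mode: warn
      config:
        context: "Responses should relate to customer support for our product."
        model: openai/gpt-4o-mini
```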
### data_exfiltration
Detect data exfiltration attempts in output: suspicious URLs with encoded data, markdown images to external domains, and more.
| Config | Description |
|---|---|
| `allowed_domains` | List of allowed external domains (default: none, so all external domains are flagged) |

Modes: `block`, `warn`
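A sketch that allow-lists your own domains (the domain values are illustrative):

```yaml
guardrails:
  output:
    - type: data_exfiltration
      mode: block
      config:
        allowed_domains: [example.com, docs.example.com]   # all other external domains are flagged
```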
## Custom Guardrails
Write Python files in a `guardrails/` directory (following the middleware pattern):
```python
from connic import GuardrailResult
import re

def check(content: str, context: dict) -> GuardrailResult:
    """Verify the input contains a valid ticket ID format."""
    if not re.search(r'TICKET-\d{4,8}', content):
        return GuardrailResult(
            passed=False,
            message="Please include a valid ticket ID (e.g., TICKET-12345)"
        )
    return GuardrailResult(passed=True)
```

### Contract

- The file name matches the `name` in config (e.g., `guardrails/validate-ticket-id.py` for `name: validate-ticket-id`)
- Must export a `check(content: str, context: dict) -> GuardrailResult` function (sync or async)
- `content` is the text being checked (input or output)
- `context` is the same context dict available to middleware (`run_id`, `agent_name`, `connector_id`, `timestamp`, and any user-set values)
- Returns `GuardrailResult(passed=True)` or `GuardrailResult(passed=False, message="...")`
### Async Example
```python
from connic import GuardrailResult
import httpx

async def check(content: str, context: dict) -> GuardrailResult:
    """Check if the user has permission to use this agent."""
    user_id = context.get("user_id")
    if not user_id:
        return GuardrailResult(passed=False, message="User not authenticated")
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/users/{user_id}/permissions")
        if resp.status_code != 200:
            return GuardrailResult(passed=False, message="Could not verify permissions")
        data = resp.json()
        if not data.get("allowed"):
            return GuardrailResult(passed=False, message="Insufficient permissions")
    return GuardrailResult(passed=True)
```

## External Providers
Each built-in guardrail ships with a lightweight default. For production accuracy, swap in an external provider via `config.provider`:
| Provider | For | Setup |
|---|---|---|
| `openai` | `moderation` | Uses your configured OpenAI provider credentials (no extra setup needed). The moderation endpoint is free to use. |
| `lakera` | `prompt_injection` | Add `LAKERA_API_KEY` as an environment variable. Get your key from platform.lakera.ai. |
| `perspective` | `moderation` | Add `GOOGLE_PERSPECTIVE_API_KEY` as an environment variable. Enable the API in your Google Cloud Console and create an API key. |
### OpenAI Moderation Example
```yaml
guardrails:
  input:
    - type: moderation
      mode: block
      config:
        provider: openai
        categories: [hate, self_harm, violence, sexual]
        threshold: 0.7
```

### Lakera Guard Example
```yaml
guardrails:
  input:
    - type: prompt_injection
      mode: block
      config:
        provider: lakera
```

### Perspective API Example
```yaml
guardrails:
  input:
    - type: moderation
      mode: block
      config:
        provider: perspective
        threshold: 0.7
        languages: [en]   # configurable language list
```

## Observability
Every guardrail evaluation is automatically recorded as a trace span. Violations appear in the traces timeline with full details:
- Span name: `guardrail:prompt_injection`, `guardrail:pii`, etc.
- Status: `passed`, `blocked`, `warned`, or `redacted`
- Direction: `input` or `output`
- Provider used and detection details
## Best Practices

- Start with a security baseline: `prompt_injection` + `pii` (input), `moderation` + `system_prompt_leakage` (output)
- Use `warn` mode first to monitor violations before switching to `block`
- For `topic_restriction` and `relevance`, use a cheap model like `openai/gpt-4o-mini` to minimize latency and cost
- Use external providers (`openai`, `lakera`) for production workloads requiring higher accuracy
- Order guardrail rules by cost: cheap checks first (regex, heuristics), expensive LLM-based checks last