You deployed your AI agents. They're handling real traffic. But how do you know they're actually doing a good job? Without measurement, you're flying blind, shipping features based on vibes instead of data.
Manual review doesn't scale. You can't read every agent response when you're processing hundreds or thousands of runs per day. What you need is automated evaluation: an LLM that scores every agent run against criteria you define, surfaces quality trends, and alerts you when things degrade.
That's the LLM-as-a-judge pattern. Connic calls them Judges, and they turn agent quality from a guessing game into a measured, tracked metric.
Why You Need Automated Evaluation
AI agents are non-deterministic. The same input can produce different outputs across runs. A prompt change that improves one use case might break another. Without continuous evaluation, you only discover quality problems when users complain.
How LLM Judges Work
The concept is straightforward. Use a language model to evaluate another language model's output. You define a scoring rubric (the specific criteria that matter for your use case) and the judge LLM scores every agent run against that rubric.
The judge receives the full run context: the input, the output, the execution traces, token usage, and any context metadata. It evaluates each criterion independently and returns a score with reasoning for every one.
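To make the loop concrete, here is a minimal sketch of the pattern, assuming a generic OpenAI-style chat client; the run fields, model name, and JSON response shape are illustrative rather than Connic's internals.

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative rubric: each criterion has a name, a description, and a max score.
RUBRIC = [
    {"name": "Extracted all invoice line items",
     "description": "Every line item on the invoice appears in the output.",
     "max_score": 10},
]

def judge_run(run: dict) -> dict:
    """Score one agent run against each rubric criterion independently."""
    prompt = (
        "You are an evaluator. Score the agent run below against each "
        "criterion independently. Respond with JSON of the form "
        '{"scores": [{"criterion": ..., "score": ..., "reasoning": ...}]}.\n\n'
        f"Criteria: {json.dumps(RUBRIC)}\n\n"
        f"Input: {run['input']}\n"
        f"Output: {run['output']}\n"
        f"Traces: {run['traces']}\n"
        f"Token usage: {run['token_usage']}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # any strong-reasoning model can act as the judge
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```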
Setting Up a Judge
A judge has four components: the agent it evaluates, the model it uses for scoring, the scoring criteria, and the trigger configuration.
1. Define Your Scoring Criteria
This is the most important step. Vague criteria produce inconsistent scores. Specific criteria produce reliable, actionable evaluations.
- • "Quality" (max 10)
- • "Helpfulness" (max 10)
- • "Good response" (max 10)
- • "Extracted all invoice line items" (max 10)
- • "Used correct tool for the task" (max 5)
- • "Response under 200 words" (max 5)
Each criterion gets a name, a description that tells the judge exactly what to evaluate, and a maximum score. The judge scores each one independently, so you can see which aspects of performance are strong and which need work.
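In code form, a criterion is just that triple. The dict shape below is a hypothetical illustration, not Connic's schema:

```python
# Each criterion: a name, a description telling the judge exactly what
# to evaluate, and a maximum score. Field names are illustrative.
criteria = [
    {
        "name": "Used correct tool for the task",
        "description": "The agent called the tool appropriate to the "
                       "request instead of answering from memory.",
        "max_score": 5,
    },
    {
        "name": "Response under 200 words",
        "description": "The final response is at most 200 words long.",
        "max_score": 5,
    },
]
```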
2. Choose a Trigger Mode

Judges run in one of two modes. In manual mode, you trigger evaluations yourself on selected runs, which is useful while you calibrate a new rubric. In automatic mode, the judge scores runs as they complete, either every run or a configured sample rate for high-volume agents.
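A minimal sketch of the sampling decision under those two modes, assuming a simple config dict; the field names are illustrative, not Connic's configuration schema.

```python
import random

# Assumed config shape: automatically evaluate ~15% of completed runs.
trigger = {"mode": "automatic", "sample_rate": 0.15}

def should_evaluate(trigger: dict) -> bool:
    """Decide whether a freshly completed run gets sent to the judge."""
    if trigger["mode"] == "manual":
        return False  # manual mode: runs are scored only when you trigger them
    return random.random() < trigger["sample_rate"]
```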
3. Add a System Prompt (Optional but Recommended)
Give the judge domain context. If your agent processes medical insurance claims, tell the judge what a correct claim determination looks like. If your agent writes marketing copy, describe your brand voice and quality bar.
```
You are evaluating a customer support agent for an e-commerce
platform. A high-quality response should:
- Directly address the customer's specific question
- Reference the correct order, product, or policy
- Provide actionable next steps (not generic advice)
- Maintain a professional, empathetic tone
- Avoid making promises the company cannot keep
If the agent used tools, verify it queried the correct data
before responding.
```

You can also include the evaluated agent's own system prompt in the judge's context. This helps the judge understand what the agent was supposed to do, not just what it actually did.
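One way to picture that assembly, as a hypothetical sketch (file names and variables are illustrative, not Connic's API):

```python
# Build the judge's system prompt from your domain context plus the
# evaluated agent's own system prompt, so the judge knows both the
# quality bar and what the agent was instructed to do.
domain_context = open("judge_context.txt").read()        # e.g., the rubric above
agent_system_prompt = open("agent_prompt.txt").read()    # the evaluated agent's prompt

judge_system_prompt = (
    f"{domain_context}\n\n"
    "For reference, the agent under evaluation was given this system prompt:\n"
    f"---\n{agent_system_prompt}\n---\n"
    "Judge the output against what the agent was instructed to do."
)
```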
A Complete Example
A practical example: evaluating a document extraction agent that processes invoices.
Model: openai/gpt-5.4 (or any model with strong reasoning)

Trigger: Automatic with a 100% sample rate (low volume, so evaluate every run)

Criteria:

- "Extraction accuracy" (max 10): amounts, dates, and vendor fields match the source invoice
- "Completeness" (max 10): every line item on the invoice appears in the output
- "Uncertainty handling" (max 10): ambiguous or unreadable fields are flagged rather than guessed
With this setup, every invoice extraction run is scored out of 30. You can see at a glance whether accuracy is high but completeness is dropping, or whether the agent is guessing instead of flagging uncertainty. Each score comes with reasoning explaining why the judge assigned that number.
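Written out as a single configuration object, the whole judge might look like this; the shape is an assumption for illustration, not Connic's actual schema.

```python
# The invoice-extraction judge from the example above, as one
# hypothetical config object. Field names are illustrative.
invoice_judge = {
    "agent": "invoice-extractor",
    "model": "openai/gpt-5.4",
    "trigger": {"mode": "automatic", "sample_rate": 1.0},  # low volume: score every run
    "criteria": [
        {"name": "Extraction accuracy",
         "description": "Amounts, dates, and vendor fields match the source invoice.",
         "max_score": 10},
        {"name": "Completeness",
         "description": "Every line item on the invoice appears in the output.",
         "max_score": 10},
        {"name": "Uncertainty handling",
         "description": "Ambiguous or unreadable fields are flagged, not guessed.",
         "max_score": 10},
    ],
}
# Maximum total: 10 + 10 + 10 = 30, matching the rollup described above.
```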
Run Filters: Focus Evaluations Where They Matter
Not every run needs evaluation. Run filters let you target evaluations at specific slices of traffic and skip the rest. That lets you create separate judges for different scenarios: a support agent handling both billing and technical questions might need two judges with different criteria and filters instead of one blended evaluation.
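A sketch of that billing/technical split, assuming filters match on run metadata (the filter shape is an assumption for illustration):

```python
def matches(run: dict, flt: dict) -> bool:
    """Evaluate a run only if every filter key matches its metadata."""
    meta = run.get("metadata", {})
    return all(meta.get(key) == value for key, value in flt.items())

billing_filter = {"category": "billing"}      # feeds the billing judge
technical_filter = {"category": "technical"}  # feeds the technical judge

run = {"metadata": {"category": "billing"}, "output": "..."}
assert matches(run, billing_filter) and not matches(run, technical_filter)
```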
Quality Alerts
Scores are useful. Alerts are actionable. Configure a score threshold and get notified when quality drops below it.
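A minimal version of that check, assuming a rolling window over recent scores (threshold, window size, and delivery mechanism are all placeholders):

```python
from statistics import mean

ALERT_THRESHOLD = 21   # alert when the average score falls below 21/30
WINDOW = 50            # rolling window over the last 50 evaluated runs

def check_alert(recent_scores: list[float]) -> None:
    """Fire a notification when the rolling average dips below the floor."""
    window = recent_scores[-WINDOW:]
    if len(window) == WINDOW and mean(window) < ALERT_THRESHOLD:
        notify(f"Judge score fell to {mean(window):.1f}/30 over the last {WINDOW} runs")

def notify(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for Slack, email, or webhook delivery
```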
What You See in the Dashboard
The judge detail page gives you a comprehensive view of quality over time: score trends, per-criterion breakdowns, and the judge's reasoning for each evaluated run.
Practical Patterns
A few evaluation setups work well in production: start a new judge in manual mode to calibrate the rubric, move to sampled automatic evaluation once scores match your expectations, and split mixed traffic across separate filtered judges with criteria tuned to each segment.
Cost Considerations
Each judge evaluation is one additional LLM call. A few ways to keep costs reasonable:
- Use sample rates for high-volume agents. 10-20% gives statistically meaningful quality signals without evaluating every run (see the cost sketch after this list).
- Choose efficient judge models. You don't need the most expensive model for evaluation; a mid-range model with strong reasoning often scores just as reliably.
- Keep criteria focused. Five specific criteria beat twenty vague ones, and fewer criteria mean shorter judge prompts and lower token usage.
- Use filters to exclude noise. Don't evaluate test runs, internal debugging, or low-value traffic.
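Putting rough numbers on the sample-rate point, here is a back-of-the-envelope estimate; every figure is a placeholder to swap for your own volumes and prices.

```python
# All numbers below are assumptions for illustration.
runs_per_day = 10_000
sample_rate = 0.15            # evaluate 15% of runs
tokens_per_eval = 3_000       # judge prompt + response, assumed
price_per_1k_tokens = 0.002   # assumed blended USD price

daily_cost = runs_per_day * sample_rate * tokens_per_eval / 1_000 * price_per_1k_tokens
print(f"~${daily_cost:.2f}/day for judge evaluations")  # ~$9.00/day at these numbers
```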
Getting Started
Adding a judge to an existing agent takes five minutes:
1. Navigate to Judges in your project and create a new judge
2. Select the agent to evaluate and choose a model for the judge
3. Define 3-5 specific scoring criteria with clear descriptions
4. Set to manual mode and test on a few recent runs
5. Verify scores match your expectations, then switch to automatic
6. Set a score alert threshold so you know when quality drops
The hardest part is writing good criteria descriptions. Spend time on those. Everything else is configuration.
For the full setup guide, check the Judges documentation. New to Connic? Start with the quickstart guide to deploy your first agent, then add a judge to start measuring quality.