You deployed your AI agents. They are handling real traffic. But how do you know they are actually doing a good job? Without measurement, you are flying blind — shipping features based on vibes instead of data.
Manual review does not scale. You cannot read every agent response when you are processing hundreds or thousands of runs per day. What you need is automated evaluation: an LLM that scores every agent run against criteria you define, surfaces quality trends, and alerts you when things degrade.
This is the LLM-as-a-judge pattern. Connic calls these evaluators Judges, and they turn agent quality from a guessing game into a measured, tracked metric.
Why You Need Automated Evaluation
AI agents are non-deterministic. The same input can produce different outputs across runs. A prompt change that improves one use case might break another. Without continuous evaluation, you only discover quality problems when users complain.
The Vibe Check Trap
"It seems to be working well" is not a quality metric. Teams that rely on spot-checking miss systematic failures. An agent might nail 90% of queries but consistently fail on one category — and you would never know without measurement.
The Silent Regression
You update a system prompt, swap a model, or change a tool. The agent still responds. But the quality dropped 20%. Without automated scoring, this regression sits in production for weeks until someone notices the customer satisfaction dip.
The ROI Question
Your CTO asks: "How are the agents performing?" Without evaluation data, you have anecdotes. With it, you have a dashboard showing accuracy trends, per-criteria scores, and quality comparisons across deployments.
How LLM Judges Work
The concept is straightforward: use a language model to evaluate another language model's output. You define a scoring rubric — the specific criteria that matter for your use case — and the judge LLM scores every agent run against that rubric.
The judge receives the full run context: the input, the output, the execution traces, token usage, and any context metadata. It evaluates each criterion independently and returns a score with reasoning for every one.
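Conceptually, the judge's input is just a prompt assembled from your rubric and the run context. A minimal sketch of that assembly step in Python — the type and function names here are illustrative, not Connic's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str          # short label shown in dashboards
    description: str   # tells the judge exactly what to evaluate
    max_score: int

@dataclass
class RunContext:
    input_text: str
    output_text: str
    metadata: dict = field(default_factory=dict)

def build_judge_prompt(run: RunContext, criteria: list[Criterion]) -> str:
    """Assemble the evaluation prompt sent to the judge model.

    Each criterion is listed with its own score range so the judge
    can score them independently, with reasoning per criterion.
    """
    rubric = "\n".join(
        f"- {c.name} (0-{c.max_score}): {c.description}" for c in criteria
    )
    return (
        "Evaluate the agent run below. Score each criterion independently "
        "and give one-sentence reasoning per score.\n\n"
        f"Criteria:\n{rubric}\n\n"
        f"Agent input:\n{run.input_text}\n\n"
        f"Agent output:\n{run.output_text}\n"
    )
```

The key design point is that the rubric travels with every evaluation, so the judge never scores against an implicit, drifting notion of "good."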
Setting Up a Judge
A judge has four components: the agent it evaluates, the model it uses for scoring, the scoring criteria, and the trigger configuration.
1. Define Your Scoring Criteria
This is the most important step. Vague criteria produce inconsistent scores. Specific criteria produce reliable, actionable evaluations.
Vague Criteria
- "Quality" (max 10)
- "Helpfulness" (max 10)
- "Good response" (max 10)
Specific Criteria
- "Extracted all invoice line items" (max 10)
- "Used correct tool for the task" (max 5)
- "Response under 200 words" (max 5)
Each criterion gets a name, a description that tells the judge exactly what to evaluate, and a maximum score. The judge scores each one independently, so you can see exactly which aspects of performance are strong and which need work.
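In code form, a rubric like the specific one above is just structured data. A hypothetical representation (this is not Connic's actual schema, just an illustration of name, description, and maximum score):

```python
criteria = [
    {
        "name": "Extracted all invoice line items",
        "description": "Every line item in the source document appears in the output.",
        "max_score": 10,
    },
    {
        "name": "Used correct tool for the task",
        "description": "The agent called the document lookup tool rather than answering from memory.",
        "max_score": 5,
    },
    {
        "name": "Response under 200 words",
        "description": "The final response body contains fewer than 200 words.",
        "max_score": 5,
    },
]

# Criteria are scored independently, so the maximum total is simply the sum.
max_total = sum(c["max_score"] for c in criteria)
```

Note how each description is a checkable statement, not an adjective — that is what makes the judge's scores consistent across runs.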
2. Choose a Trigger Mode
Automatic
Evaluates runs as they complete. Set a sample rate (1-100%) to control costs. At 10% sample rate on a high-volume agent, you get statistically meaningful data without evaluating every single run.
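One common way to implement a sample rate is to hash the run ID into a bucket, which makes the decision deterministic per run — the same run is always either in or out of the sample. A sketch of that approach (Connic's internal mechanism may differ):

```python
import hashlib

def should_evaluate(run_id: str, sample_rate: float) -> bool:
    """Deterministically sample runs for evaluation.

    Hashing the run ID gives a stable, roughly uniform value in [0, 1);
    runs below the sample rate get evaluated.
    """
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Deterministic sampling also means re-running an analysis later sees the same subset of runs, which keeps quality trends comparable over time.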
Manual
You explicitly trigger evaluation on specific runs. Useful when testing a new judge configuration, re-evaluating after criteria changes, or spot-checking suspicious runs.
Tip: Start Manual, Go Automatic
Test your judge on a handful of runs in manual mode first. Verify the scores match your expectations. Tweak the criteria descriptions until the judge evaluates consistently. Then switch to automatic mode for continuous monitoring.
3. Add a System Prompt (Optional but Recommended)
Give the judge domain context. If your agent processes medical insurance claims, tell the judge what a correct claim determination looks like. If your agent writes marketing copy, describe your brand voice and quality bar.
You are evaluating a customer support agent for an e-commerce
platform. A high-quality response should:
- Directly address the customer's specific question
- Reference the correct order, product, or policy
- Provide actionable next steps (not generic advice)
- Maintain a professional, empathetic tone
- Avoid making promises the company cannot keep
If the agent used tools, verify it queried the correct data
before responding.

You can also include the evaluated agent's own system prompt in the judge's context. This helps the judge understand what the agent was supposed to do, not just what it actually did.
A Complete Example
Here is a practical example: evaluating a document extraction agent that processes invoices.
Invoice Extraction Judge
Model
openai/gpt-5.4 (or any model with strong reasoning)
Trigger
Automatic — 100% sample rate (low volume, evaluate every run)
Criteria
- "Extracted every field accurately" (max 10)
- "Captured all line items" (max 10)
- "Flagged uncertain values instead of guessing" (max 10)
With this setup, every invoice extraction run is scored out of 30. You can see at a glance whether accuracy is high but completeness is dropping, or whether the agent is guessing instead of flagging uncertainty. Each score comes with reasoning explaining why the judge assigned that number.
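Per-criterion results roll up naturally into a run-level percentage. A minimal sketch of that aggregation (illustrative, not Connic internals — criterion names here match the invoice example):

```python
def overall_score(results: dict[str, tuple[int, int]]) -> float:
    """results maps criterion name -> (awarded, max_score).

    Returns the run's overall quality as a percentage.
    """
    awarded = sum(a for a, _ in results.values())
    maximum = sum(m for _, m in results.values())
    return 100.0 * awarded / maximum

run_results = {
    "accuracy": (9, 10),
    "completeness": (7, 10),
    "uncertainty flagging": (8, 10),
}
```

Because the per-criterion scores are kept alongside the total, a drop in the overall number is immediately attributable to a specific criterion.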
Run Filters: Focus Evaluations Where They Matter
Not every run needs evaluation. Filters let you target specific traffic:
Status Filters
Only evaluate completed runs (skip failures). Or only evaluate failed runs to understand what went wrong.
Context Filters
Filter on custom context values set by your middleware. Only evaluate production traffic, premium customers, or specific use case categories.
This lets you create separate judges for different scenarios. A support agent handling both billing and technical questions might need two judges with different criteria and filters, rather than one blended evaluation.
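A run filter is conceptually just a predicate over the run's status and context values. A sketch of what that matching logic might look like (hypothetical names, not Connic's API):

```python
def matches_filters(run: dict, status=None, context=None) -> bool:
    """Decide whether a run falls within this judge's scope.

    status:  required run status, e.g. "completed" (None = any status)
    context: required context key/value pairs set by middleware
    """
    if status is not None and run.get("status") != status:
        return False
    for key, expected in (context or {}).items():
        if run.get("context", {}).get(key) != expected:
            return False
    return True
```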
Quality Alerts
Scores are useful. Alerts are actionable. Configure a score threshold and get notified when quality drops below it.
Threshold
Set a minimum score percentage. When the average drops below it, you get an alert.
Average Window
Average over the last 1, 10, 50, or 100 evaluations. Smaller windows react faster. Larger windows smooth out outliers.
Low Score Filter
Filter evaluations to show only runs below a configurable percentage. Quickly find the worst performers for investigation.
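Putting threshold and window together: the alert fires when the rolling average of the last N evaluation scores drops below the threshold. A minimal sketch of that logic (illustrative, not Connic's implementation):

```python
from collections import deque

class ScoreAlert:
    """Fire when the rolling average of recent scores drops below a threshold."""

    def __init__(self, threshold_pct: float, window: int):
        self.threshold_pct = threshold_pct
        self.recent = deque(maxlen=window)  # keeps only the last `window` scores

    def record(self, score_pct: float) -> bool:
        """Record one evaluation score; return True if an alert should fire."""
        self.recent.append(score_pct)
        average = sum(self.recent) / len(self.recent)
        return average < self.threshold_pct
```

The window size is the sensitivity dial: a window of 10 catches a bad deploy within minutes, while a window of 100 ignores a single unlucky run.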
What You See in the Dashboard
The judge detail page gives you a comprehensive view of quality over time:
Average Score
Overall quality metric with trend indicator. Color-coded: green (80%+), amber (50-80%), red (below 50%).
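The color coding is a straightforward threshold mapping over the average score — sketched here with the bands listed above:

```python
def score_color(average_pct: float) -> str:
    """Map an average score percentage to the dashboard's color bands."""
    if average_pct >= 80:
        return "green"   # healthy
    if average_pct >= 50:
        return "amber"   # degraded, worth investigating
    return "red"         # failing
```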
Criteria Averages
Per-criterion breakdown showing exactly which aspects are strong and which are weak. If accuracy is 95% but tool usage is 60%, you know where to focus.
Evaluation Detail
Click any evaluation to see the full reasoning. The judge explains why it assigned each score, so you can verify the evaluation logic and refine criteria.
Cost Tracking
Total tokens, average tokens per evaluation, and cumulative cost of running the judge. Know exactly what quality monitoring costs you.
Practical Patterns
Here are evaluation setups we see working well in production:
Deployment Canary
Set the alert window to "last 10 runs" and deploy a new agent version. If the average score drops below your threshold within the first 10 runs, you get an immediate alert. Roll back before the regression affects more users.
Per-Use-Case Quality Tracking
A single agent handles multiple use cases. Create separate judges with context filters for each one. Track billing question quality independently from technical question quality. Each gets its own criteria, thresholds, and alerts.
Model Comparison
Thinking about switching from GPT to Claude or Gemini? Run both models in parallel (using a fallback model configuration), then compare judge scores across model versions. Make the decision with data, not assumptions.
Prompt Engineering Feedback Loop
Update a system prompt. Filter evaluations by low scores. Read the judge's reasoning for the worst runs. Refine the prompt based on specific failure patterns. Repeat until the criteria averages stabilize.
Cost Considerations
Each judge evaluation is one additional LLM call. Here is how to keep costs reasonable:
- Use sample rates for high-volume agents. 10-20% gives statistically meaningful quality signals without evaluating every run.
- Choose efficient judge models. You do not need the most expensive model for evaluation. A mid-range model with strong reasoning often scores just as reliably.
- Keep criteria focused. Five specific criteria beat twenty vague ones. Fewer criteria mean shorter judge prompts and lower token usage.
- Use filters to exclude noise. Do not evaluate test runs, internal debugging, or low-value traffic.
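A back-of-envelope cost model makes the sample-rate trade-off concrete. The function below is a rough estimate with hypothetical inputs, not Connic's billing logic:

```python
def monthly_judge_cost(runs_per_day: int, sample_rate: float,
                       tokens_per_eval: int, usd_per_million_tokens: float) -> float:
    """Rough monthly cost of a judge, assuming a 30-day month.

    tokens_per_eval should cover the full judge prompt (rubric + run
    context) plus the judge's scored response.
    """
    evaluations = runs_per_day * 30 * sample_rate
    total_tokens = evaluations * tokens_per_eval
    return total_tokens / 1_000_000 * usd_per_million_tokens
```

For example, 1,000 runs per day at a 10% sample rate and ~2,000 tokens per evaluation works out to 6M judge tokens a month — the sample rate scales the bill linearly.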
Getting Started
Adding a judge to an existing agent takes five minutes:
1. Navigate to Judges in your project and create a new judge
2. Select the agent to evaluate and choose a model for the judge
3. Define 3-5 specific scoring criteria with clear descriptions
4. Set to manual mode and test on a few recent runs
5. Verify scores match your expectations, then switch to automatic
6. Set a score alert threshold so you know when quality drops
The hardest part is writing good criteria descriptions. Spend time on those. Everything else is configuration.
For the full setup guide, check the Judges documentation. If you are new to Connic, start with the quickstart guide to deploy your first agent, then add a judge to start measuring quality.