You deployed your AI agents. They are handling real traffic. But how do you know they are actually doing a good job? Without measurement, you are flying blind — shipping features based on vibes instead of data.
Manual review does not scale. You cannot read every agent response when you are processing hundreds or thousands of runs per day. What you need is automated evaluation: an LLM that scores every agent run against criteria you define, surfaces quality trends, and alerts you when things degrade.
This is the LLM-as-a-judge pattern. Connic calls these evaluators Judges, and they turn agent quality from a guessing game into a measured, tracked metric.
Why You Need Automated Evaluation
AI agents are non-deterministic. The same input can produce different outputs across runs. A prompt change that improves one use case might break another. Without continuous evaluation, you only discover quality problems when users complain.
The Vibe Check Trap
"It seems to be working well" is not a quality metric. Teams that rely on spot-checking miss systematic failures. An agent might nail 90% of queries but consistently fail on one category — and you would never know without measurement.
The Silent Regression
You update a system prompt, swap a model, or change a tool. The agent still responds. But the quality dropped 20%. Without automated scoring, this regression sits in production for weeks until someone notices the customer satisfaction dip.
The ROI Question
Your CTO asks: "How are the agents performing?" Without evaluation data, you have anecdotes. With it, you have a dashboard showing accuracy trends, per-criteria scores, and quality comparisons across deployments.
How LLM Judges Work
The concept is straightforward: use a language model to evaluate another language model's output. You define a scoring rubric — the specific criteria that matter for your use case — and the judge LLM scores every agent run against that rubric.
The judge receives the full run context: the input, the output, the execution traces, token usage, and any context metadata. It evaluates each criterion independently and returns a score with reasoning for every one.
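Conceptually, the judge's input is just a prompt assembled from your rubric and the run context. A minimal sketch of that assembly step in Python — the type and function names here are illustrative, not Connic's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str          # short label shown in dashboards
    description: str   # tells the judge exactly what to evaluate
    max_score: int

@dataclass
class RunContext:
    input_text: str
    output_text: str
    metadata: dict = field(default_factory=dict)

def build_judge_prompt(run: RunContext, criteria: list[Criterion]) -> str:
    """Assemble the evaluation prompt sent to the judge model.

    Each criterion is listed with its own score range so the judge
    can score them independently, with reasoning per criterion.
    """
    rubric = "\n".join(
        f"- {c.name} (0-{c.max_score}): {c.description}" for c in criteria
    )
    return (
        "Evaluate the agent run below. Score each criterion independently "
        "and give one-sentence reasoning per score.\n\n"
        f"Criteria:\n{rubric}\n\n"
        f"Agent input:\n{run.input_text}\n\n"
        f"Agent output:\n{run.output_text}\n"
    )
```

The key design point is that the rubric travels with every evaluation, so the judge never scores against an implicit, drifting notion of "good."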
Setting Up a Judge
A judge has four components: the agent it evaluates, the model it uses for scoring, the scoring criteria, and the trigger configuration.
1. Define Your Scoring Criteria
This is the most important step. Vague criteria produce inconsistent scores. Specific criteria produce reliable, actionable evaluations.
Vague Criteria
- "Quality" (max 10)
- "Helpfulness" (max 10)
- "Good response" (max 10)
Specific Criteria
- "Extracted all invoice line items" (max 10)
- "Used correct tool for the task" (max 5)
- "Response under 200 words" (max 5)
Each criterion gets a name, a description that tells the judge exactly what to evaluate, and a maximum score. The judge scores each one independently, so you can see exactly which aspects of performance are strong and which need work.
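In code form, a rubric like the specific one above is just structured data. A hypothetical representation (this is not Connic's actual schema, just an illustration of name, description, and maximum score):

```python
criteria = [
    {
        "name": "Extracted all invoice line items",
        "description": "Every line item in the source document appears in the output.",
        "max_score": 10,
    },
    {
        "name": "Used correct tool for the task",
        "description": "The agent called the document lookup tool rather than answering from memory.",
        "max_score": 5,
    },
    {
        "name": "Response under 200 words",
        "description": "The final response body contains fewer than 200 words.",
        "max_score": 5,
    },
]

# Criteria are scored independently, so the maximum total is simply the sum.
max_total = sum(c["max_score"] for c in criteria)
```

Note how each description is a checkable statement, not an adjective — that is what makes the judge's scores consistent across runs.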
2. Choose a Trigger Mode
Automatic
Evaluates runs as they complete. Set a sample rate (1-100%) to control costs. At 10% sample rate on a high-volume agent, you get statistically meaningful data without evaluating every single run.
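One common way to implement a sample rate is to hash the run ID into a bucket, which makes the decision deterministic per run — the same run is always either in or out of the sample. A sketch of that approach (Connic's internal mechanism may differ):

```python
import hashlib

def should_evaluate(run_id: str, sample_rate: float) -> bool:
    """Deterministically sample runs for evaluation.

    Hashing the run ID gives a stable, roughly uniform value in [0, 1);
    runs below the sample rate get evaluated.
    """
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Deterministic sampling also means re-running an analysis later sees the same subset of runs, which keeps quality trends comparable over time.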
Manual
You explicitly trigger evaluation on specific runs. Useful when testing a new judge configuration, re-evaluating after criteria changes, or spot-checking suspicious runs.
Tip: Start Manual, Go Automatic
Test your judge on a handful of runs in manual mode first. Verify the scores match your expectations. Tweak the criteria descriptions until the judge evaluates consistently. Then switch to automatic mode for continuous monitoring.
3. Add a System Prompt (Optional but Recommended)
Give the judge domain context. If your agent processes medical insurance claims, tell the judge what a correct claim determination looks like. If your agent writes marketing copy, describe your brand voice and quality bar.
You are evaluating a customer support agent for an e-commerce
platform. A high-quality response should:
- Directly address the customer's specific question
- Reference the correct order, product, or policy
- Provide actionable next steps (not generic advice)
- Maintain a professional, empathetic tone
- Avoid making promises the company cannot keep
If the agent used tools, verify it queried the correct data
before responding.

You can also include the evaluated agent's own system prompt in the judge's context. This helps the judge understand what the agent was supposed to do, not just what it actually did.
A Complete Example
Here is a practical example: evaluating a document extraction agent that processes invoices.
Invoice Extraction Judge
Model
openai/gpt-5.4 (or any model with strong reasoning)
Trigger
Automatic — 100% sample rate (low volume, evaluate every run)
Criteria
- "Extracted every field accurately" (max 10)
- "Captured all line items" (max 10)
- "Flagged uncertain values instead of guessing" (max 10)
With this setup, every invoice extraction run is scored out of 30. You can see at a glance whether accuracy is high but completeness is dropping, or whether the agent is guessing instead of flagging uncertainty. Each score comes with reasoning explaining why the judge assigned that number.
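Per-criterion results roll up naturally into a run-level percentage. A minimal sketch of that aggregation (illustrative, not Connic internals — criterion names here match the invoice example):

```python
def overall_score(results: dict[str, tuple[int, int]]) -> float:
    """results maps criterion name -> (awarded, max_score).

    Returns the run's overall quality as a percentage.
    """
    awarded = sum(a for a, _ in results.values())
    maximum = sum(m for _, m in results.values())
    return 100.0 * awarded / maximum

run_results = {
    "accuracy": (9, 10),
    "completeness": (7, 10),
    "uncertainty flagging": (8, 10),
}
```

Because the per-criterion scores are kept alongside the total, a drop in the overall number is immediately attributable to a specific criterion.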
Run Filters: Focus Evaluations Where They Matter
Not every run needs evaluation. Filters let you target specific traffic:
Status Filters
Only evaluate completed runs (skip failures). Or only evaluate failed runs to understand what went wrong.
Context Filters
Filter on custom context values set by your middleware. Only evaluate production traffic, premium customers, or specific use case categories.
This lets you create separate judges for different scenarios. A support agent handling both billing and technical questions might need two judges with different criteria and filters, rather than one blended evaluation.
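A run filter is conceptually just a predicate over the run's status and context values. A sketch of what that matching logic might look like (hypothetical names, not Connic's API):

```python
def matches_filters(run: dict, status=None, context=None) -> bool:
    """Decide whether a run falls within this judge's scope.

    status:  required run status, e.g. "completed" (None = any status)
    context: required context key/value pairs set by middleware
    """
    if status is not None and run.get("status") != status:
        return False
    for key, expected in (context or {}).items():
        if run.get("context", {}).get(key) != expected:
            return False
    return True
```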
Quality Alerts
Scores are useful. Alerts are actionable. Configure a score threshold and get notified when quality drops below it.
Threshold
Set a minimum score percentage. When the average drops below it, you get an alert.
Average Window
Average over the last 1, 10, 50, or 100 evaluations. Smaller windows react faster. Larger windows smooth out outliers.
Low Score Filter
Filter evaluations to show only runs below a configurable percentage. Quickly find the worst performers for investigation.
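Putting threshold and window together: the alert fires when the rolling average of the last N evaluation scores drops below the threshold. A minimal sketch of that logic (illustrative, not Connic's implementation):

```python
from collections import deque

class ScoreAlert:
    """Fire when the rolling average of recent scores drops below a threshold."""

    def __init__(self, threshold_pct: float, window: int):
        self.threshold_pct = threshold_pct
        self.recent = deque(maxlen=window)  # keeps only the last `window` scores

    def record(self, score_pct: float) -> bool:
        """Record one evaluation score; return True if an alert should fire."""
        self.recent.append(score_pct)
        average = sum(self.recent) / len(self.recent)
        return average < self.threshold_pct
```

The window size is the sensitivity dial: a window of 10 catches a bad deploy within minutes, while a window of 100 ignores a single unlucky run.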
What You See in the Dashboard
The judge detail page gives you a comprehensive view of quality over time:
Average Score
Overall quality metric with trend indicator. Color-coded: green (80%+), amber (50-80%), red (below 50%).
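The color coding is a straightforward threshold mapping over the average score — sketched here with the bands listed above:

```python
def score_color(average_pct: float) -> str:
    """Map an average score percentage to the dashboard's color bands."""
    if average_pct >= 80:
        return "green"   # healthy
    if average_pct >= 50:
        return "amber"   # degraded, worth investigating
    return "red"         # failing
```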
Criteria Averages
Per-criterion breakdown showing exactly which aspects are strong and which are weak. If accuracy is 95% but tool usage is 60%, you know where to focus.
Evaluation Detail
Click any evaluation to see the full reasoning. The judge explains why it assigned each score, so you can verify the evaluation logic and refine criteria.
Cost Tracking
Total tokens, average tokens per evaluation, and cumulative cost of running the judge. Know exactly what quality monitoring costs you.
Practical Patterns
Here are evaluation setups we see working well in production:
Deployment Canary
Set the alert window to "last 10 runs" and deploy a new agent version. If the average score drops below your threshold within the first 10 runs, you get an immediate alert. Roll back before the regression affects more users.
Per-Use-Case Quality Tracking
A single agent handles multiple use cases. Create separate judges with context filters for each one. Track billing question quality independently from technical question quality. Each gets its own criteria, thresholds, and alerts.
Model Comparison
Thinking about switching from GPT to Claude or Gemini? Run both models in parallel (using a fallback model configuration), then compare judge scores across model versions. Make the decision with data, not assumptions.
Prompt Engineering Feedback Loop
Update a system prompt. Filter evaluations by low scores. Read the judge's reasoning for the worst runs. Refine the prompt based on specific failure patterns. Repeat until the criteria averages stabilize.
Cost Considerations
Each judge evaluation is one additional LLM call. Here is how to keep costs reasonable:
- Use sample rates for high-volume agents. 10-20% gives statistically meaningful quality signals without evaluating every run.
- Choose efficient judge models. You do not need the most expensive model for evaluation. A mid-range model with strong reasoning often scores just as reliably.
- Keep criteria focused. Five specific criteria beat twenty vague ones. Fewer criteria mean shorter judge prompts and lower token usage.
- Use filters to exclude noise. Do not evaluate test runs, internal debugging, or low-value traffic.
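A back-of-envelope cost model makes the sample-rate trade-off concrete. The function below is a rough estimate with hypothetical inputs, not Connic's billing logic:

```python
def monthly_judge_cost(runs_per_day: int, sample_rate: float,
                       tokens_per_eval: int, usd_per_million_tokens: float) -> float:
    """Rough monthly cost of a judge, assuming a 30-day month.

    tokens_per_eval should cover the full judge prompt (rubric + run
    context) plus the judge's scored response.
    """
    evaluations = runs_per_day * 30 * sample_rate
    total_tokens = evaluations * tokens_per_eval
    return total_tokens / 1_000_000 * usd_per_million_tokens
```

For example, 1,000 runs per day at a 10% sample rate and ~2,000 tokens per evaluation works out to 6M judge tokens a month — the sample rate scales the bill linearly.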
Getting Started
Adding a judge to an existing agent takes five minutes:
1. Navigate to Judges in your project and create a new judge
2. Select the agent to evaluate and choose a model for the judge
3. Define 3-5 specific scoring criteria with clear descriptions
4. Set to manual mode and test on a few recent runs
5. Verify scores match your expectations, then switch to automatic
6. Set a score alert threshold so you know when quality drops
The hardest part is writing good criteria descriptions. Spend time on those. Everything else is configuration.
For the full setup guide, check the Judges documentation. If you are new to Connic, start with the quickstart guide to deploy your first agent, then add a judge to start measuring quality.