Judges
Automatically evaluate agent runs using LLMs. Define structured scoring rubrics, configure sampling rates, and track evaluation quality over time.
Overview
Judges let you use an LLM to evaluate how well your agents perform. Instead of manually reviewing runs, you define a scoring rubric with criteria and point values, assign it to an agent, and let Connic automatically score runs against that rubric.
Each judge evaluation receives the full agent run data (input, output, error, traces, context, token usage, status, and duration) and returns per-criteria scores with reasoning.
Structured Scoring
Define named criteria with descriptions and max scores. Get repeatable, comparable evaluations across all your runs.
Sample Rate
Control costs by evaluating a percentage of runs automatically, or trigger evaluations manually for specific runs.
Run Filters
Only evaluate runs matching specific conditions like status, agent name, or context properties.
Creating a Judge
Navigate to the Judges tab in your project and click New Judge. Each judge is configured with the following:
| Field | Description |
|---|---|
| Name | A descriptive name for the judge, e.g. "Invoice Quality Check" or "Response Accuracy". |
| Agent | The agent whose runs this judge will evaluate. A judge is always scoped to a single agent. |
| Model | The LLM model used for evaluation, in the same provider/model-name format as your agent configuration. Evaluations use your project's API keys. |
| System Prompt | Optional additional instructions for the judge. Use this to provide domain-specific context, examples of good/bad responses, or special evaluation rules. |
| Scoring Criteria | One or more named criteria, each with a description and max score. The judge evaluates each criterion independently. |
| Trigger Mode | Automatic evaluates runs when they complete. Manual requires explicit triggering per run. |
| Sample Rate | For automatic judges, the percentage of matching runs to evaluate (1-100%). Set to 100% to evaluate every run, or lower to control costs. |
| Filters | Optional conditions to narrow which runs are eligible. Filter on fields like status, agent_name, or context properties like context.tier. |
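Putting the fields together, a judge configuration can be pictured as a plain data structure. This is an illustrative sketch only (the dictionary keys mirror the table above and the model string is an example; this is not the actual Connic API):

```python
# Hypothetical judge configuration mirroring the fields described above.
# Field names and values are illustrative, not the actual Connic API.
judge = {
    "name": "Invoice Quality Check",
    "agent": "invoice-processor",      # a judge is scoped to a single agent
    "model": "openai/gpt-4o",          # provider/model-name format (example model)
    "system_prompt": "This agent processes invoices. A correct response "
                     "must list every line item with quantity and price.",
    "criteria": [
        {"name": "Accuracy", "description": "Factually correct and complete?",
         "max_score": 10},
        {"name": "Formatting", "description": "Well-structured and clear?",
         "max_score": 5},
    ],
    "trigger_mode": "automatic",       # or "manual"
    "sample_rate": 20,                 # evaluate ~20% of matching runs
    "filters": [
        {"field": "status", "operator": "equals", "value": "completed"},
    ],
}
```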
Scoring Criteria
Criteria are the core of a judge configuration. Each criterion defines what aspect of the agent's performance to evaluate. The judge LLM scores each criterion independently and provides reasoning for each score.
For example, a judge might define three criteria (the names and max scores below are illustrative):

- Accuracy (max 10): Did the agent produce a factually correct and complete response based on the input?
- Tool Usage (max 5): Did the agent use the appropriate tools and interpret their results correctly?
- Formatting (max 5): Is the response well-structured, clear, and appropriately formatted?
The overall score is the sum of all criteria scores. In the example above, a perfect score would be 20/20. Average scores are tracked over time on the judge detail page and shown as criteria breakdowns with progress bars.
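The score arithmetic can be sketched in a few lines (the function name and score values are ours, for illustration only):

```python
def overall_score(criteria_scores):
    """Sum per-criterion scores into an overall result plus a percentage."""
    total = sum(c["score"] for c in criteria_scores)
    maximum = sum(c["max_score"] for c in criteria_scores)
    return total, maximum, round(100 * total / maximum)

# Example scores for a three-criterion rubric with a 20-point maximum.
scores = [
    {"name": "Accuracy", "score": 8, "max_score": 10},
    {"name": "Tool Usage", "score": 5, "max_score": 5},
    {"name": "Formatting", "score": 4, "max_score": 5},
]
print(overall_score(scores))  # (17, 20, 85)
```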
How It Works
When an agent run completes, the backend checks all active automatic judges assigned to that agent. For each matching judge (after applying filters and sample rate), an evaluation task is queued.
Run completes
The agent finishes processing and the run status is set to completed.
Filters and sample rate applied
Each active judge's filters are checked. If the run matches, the sample rate determines if it gets evaluated.
Evaluation queued
A judge run record is created with status "queued" and pushed to the processing queue.
LLM evaluates
The judge worker sends all run data to the configured LLM with the scoring rubric. The model returns per-criteria scores and reasoning.
Results stored
Scores, reasoning, and token usage are saved. Results appear on the judge detail page and in the run detail dialog.
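The gating in steps 2 and 3 amounts to three checks before a run is queued: agent scope, filters, and a random draw against the sample rate. A minimal sketch (helper and field names are hypothetical; only the `equals` operator is shown here, and full operator semantics are covered under Run Filters):

```python
import random

def should_evaluate(judge, run, rng=random.random):
    """Decide whether a completed run is queued for one automatic judge
    (illustrative sketch, not Connic's actual implementation)."""
    if judge["trigger_mode"] != "automatic":
        return False
    if run["agent_name"] != judge["agent"]:
        return False                          # judge is scoped to one agent
    for f in judge["filters"]:                # all filters must pass (AND logic)
        if run.get(f["field"]) != f["value"]:
            return False
    # Sample rate: evaluate roughly sample_rate% of eligible runs.
    return rng() * 100 < judge["sample_rate"]

judge = {"trigger_mode": "automatic", "agent": "invoice-processor",
         "filters": [{"field": "status", "value": "completed"}],
         "sample_rate": 100}
run = {"agent_name": "invoice-processor", "status": "completed"}
print(should_evaluate(judge, run))  # True (sample rate 100 always passes)
```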
Run Filters
Filters let you control exactly which runs get evaluated. Each filter has a field, an operator, and a value. All filters must match for a run to be eligible (AND logic).
| Operator | Example Field | Description |
|---|---|---|
| equals | status | Exact string match. E.g. only evaluate runs with status = completed. |
| not equals | context.tier | Exclude specific values. E.g. only evaluate runs where context.tier is not test, skipping test-tier runs. |
| contains | context.source | Substring match. E.g. evaluate runs where context.source contains production. |
| exists | context.customer_id | Check if a field is present and not empty. Useful for only judging runs that have specific context values. |
Fields prefixed with context. are resolved against the run context dictionary. This works well with middleware that tags runs with metadata like customer tier, request source, or feature flags.
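The operator semantics and the `context.` prefix resolution can be sketched as follows (hypothetical helpers; the operator names mirror the table above, not a documented Connic interface):

```python
def resolve(field, run):
    """Resolve a filter field, looking up 'context.*' fields in the
    run's context dictionary and everything else on the run itself."""
    if field.startswith("context."):
        return run.get("context", {}).get(field[len("context."):])
    return run.get(field)

def filter_matches(flt, run):
    """Apply one filter (equals / not_equals / contains / exists) to a run."""
    value = resolve(flt["field"], run)
    op, expected = flt["operator"], flt.get("value")
    if op == "equals":
        return value == expected
    if op == "not_equals":
        return value != expected
    if op == "contains":
        return value is not None and expected in str(value)
    if op == "exists":
        return value not in (None, "")
    return False

run = {"status": "completed",
       "context": {"tier": "pro", "source": "production-eu"}}
print(filter_matches({"field": "context.source", "operator": "contains",
                      "value": "production"}, run))  # True
print(filter_matches({"field": "context.tier", "operator": "not_equals",
                      "value": "test"}, run))        # True
```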
Manual Triggering
You can trigger a judge evaluation for any specific agent run, regardless of the judge's trigger mode or sample rate. This is useful for:
- Evaluating a run that was not automatically sampled
- Re-evaluating a run after adjusting the judge's criteria or system prompt
- Testing a new judge configuration on existing runs before enabling automatic mode
- Spot-checking runs that look suspicious
On the judge detail page, click Trigger Manually and select a run from the dropdown. The dropdown shows recent completed runs for the judge's agent. You can also paste a run ID directly for older runs.
Agent Validation
When triggering manually, the backend validates that the selected run belongs to the judge's configured agent. You cannot evaluate a run from a different agent.
Viewing Results
Judge results are visible in two places:
Judge Detail Page
Click any judge in the overview to see its detail page. The page follows the same layout as agent and connector detail pages:
- Statistics: Average score with trend, total evaluated, sample rate, and failed count with completed/failed breakdown
- Criteria Averages: Per-criterion average scores with progress bars, so you can see which criteria agents struggle with most
- Evaluations list: All judge runs with status, run ID, score, and timestamp. Click any evaluation to open the agent run detail dialog
- Configuration: Agent, model, trigger mode, sample rate, system prompt, criteria, and filters
Run Detail Dialog
When viewing any agent run that has been evaluated, judge results appear at the top of the dialog (before input/output). Each evaluation shows:
- The judge name and evaluation status
- Overall score as a percentage with the raw score
- Per-criteria scores with progress bars and reasoning
- An overall assessment summarizing the evaluation
A score pill also appears in the run header next to the status badge, giving you a quick quality signal at a glance.
Evaluation Statuses
Each judge evaluation goes through these lifecycle states:
| Status | Description |
|---|---|
| Queued | The evaluation is waiting to be processed. |
| Running | The judge LLM is actively evaluating the run. |
| Completed | Evaluation finished successfully. Scores and reasoning are available. |
| Failed | The evaluation failed. Common causes: invalid model name, authentication error, or rate limiting. The error message is shown on the evaluation. |
API Keys
Judges use your project's LLM provider API keys, the same keys configured in your project settings. The model field uses the same provider/model-name format as your agent YAML configuration.
All providers supported for agent execution are also supported for judges, including OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Vertex AI, and OpenRouter.
Billing
Each successful judge evaluation counts as one additional billable run on the agent run it evaluated. This means a run that gets evaluated by one judge counts as 2 billable runs (1 for the original execution + 1 for the evaluation). A run evaluated by two judges counts as 3 billable runs, and so on.
Failed evaluations are not billed. Only completed evaluations increment the billable run count.
| Scenario | Billable Runs |
|---|---|
| Agent run, no judge | 1 |
| Agent run + 1 judge evaluation | 2 |
| Agent run + 2 judge evaluations | 3 |
| Agent run + 1 failed evaluation | 1 |
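The billing rule in the table reduces to simple arithmetic: one billable run for the execution plus one per completed evaluation, with failed evaluations free. As a sketch (the function is ours, for illustration):

```python
def billable_runs(completed_evaluations, failed_evaluations=0):
    """1 for the agent run itself + 1 per *completed* judge evaluation.
    Failed evaluations are not billed."""
    return 1 + completed_evaluations

print(billable_runs(0))     # 1 - no judge
print(billable_runs(1))     # 2 - one evaluation
print(billable_runs(2))     # 3 - two evaluations
print(billable_runs(0, 1))  # 1 - one failed evaluation
```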
Cost Considerations
Judge evaluations have two cost components: the Connic billable run and the LLM API cost charged by your provider. Use the sample rate to control both. For high-volume agents, a 10-20% sample rate is often sufficient to track quality trends without significantly increasing costs.
Notifications
Judges support score-based notifications that alert you when the overall average evaluation score drops below a configurable threshold. Notification channels (in-app, email) depend on each member's preferences.
Score Alert Threshold
Each judge has an optional Score Alert setting. When enabled, you set a threshold percentage (e.g. 60%). A notification is triggered when the judge's average score drops below that threshold.
The alert only fires on the transition: the specific evaluation that causes the average to drop below the threshold triggers the notification. Subsequent evaluations that keep the average below the threshold do not generate additional alerts. If the average recovers above the threshold and drops below again, a new alert will fire.
Average Window
By default, the score alert uses the average of the last 10 completed runs to decide whether to fire. You can change this window to suit your use case:
- Every run (1): Alerts whenever a single evaluation scores below the threshold — useful for catching every bad run.
- Last 10 / 50 / 100 runs: Smooths out outliers by averaging over a window of recent evaluations.
- All time: Uses the average across all completed evaluations.
A smaller window reacts faster to quality drops but may be noisier. A larger window or all-time average is more stable but slower to alert.
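The windowed average and transition-only firing described above can be sketched as a small state machine (a hypothetical class, not Connic internals; `window=None` would model the all-time average):

```python
from collections import deque

class ScoreAlert:
    """Fire once when the windowed average crosses below the threshold,
    then re-arm only after the average recovers above it (illustrative)."""

    def __init__(self, threshold_pct, window=10):
        self.threshold = threshold_pct
        self.scores = deque(maxlen=window)  # maxlen=None keeps all scores
        self.below = False                  # currently below threshold?

    def record(self, score_pct):
        """Record one completed evaluation; return True if an alert fires."""
        self.scores.append(score_pct)
        avg = sum(self.scores) / len(self.scores)
        fired = avg < self.threshold and not self.below
        self.below = avg < self.threshold
        return fired

alert = ScoreAlert(threshold_pct=60, window=3)
print(alert.record(80))  # False - avg 80, above threshold
print(alert.record(40))  # False - avg 60, not below
print(alert.record(30))  # True  - avg 50 drops below 60, alert fires
print(alert.record(30))  # False - still below, no duplicate alert
```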
Notification Channels
Judge score alerts support the same two channels as all other Connic notifications:
- In-App: Appears in the notification bell in the top navigation bar. Clicking the notification takes you directly to the judge detail page.
- Email: Sends a styled email with the judge name, current average score, threshold, agent name, total evaluations, and a direct link to the judge.
Managing Preferences
Each project member can independently control whether they receive judge score alerts via in-app, email, or both. Go to Settings > Notifications in your project to toggle the Judge Score Low event type. By default, both in-app and email are enabled.
Threshold vs. Sample Rate
The notification threshold is based on the average within the configured window, not individual evaluations (unless the window is set to 1). The sample rate affects which runs are evaluated and therefore contribute to the average. A lower sample rate means fewer data points, so the average may fluctuate more. For the most accurate quality signal, use a higher sample rate.
Best Practices
Start with manual triggering
Create a judge in manual mode first and evaluate a handful of runs to verify your criteria and system prompt produce useful scores. Once you are happy with the results, switch to automatic mode.
Write specific criteria descriptions
Vague criteria like "Quality" produce inconsistent scores. Be specific: "Did the response correctly extract all invoice line items including quantity, unit price, and total?" gives the judge LLM clear evaluation guidelines.
Use the system prompt for domain context
If your agent handles domain-specific tasks, provide context in the system prompt. For example: "This agent processes medical insurance claims. A correct response must include the claim number, patient name, and determination." This helps the judge LLM evaluate accuracy in context.
Use filters to focus evaluations
If your agent handles multiple use cases, use context filters to create separate judges for each. This gives you targeted quality metrics per use case rather than blended averages.
Monitor criteria averages for regressions
The criteria averages on the judge detail page show which aspects of your agent's performance are strong and which need work. After deploying a new agent version, compare these averages to catch regressions in specific criteria.
Automated quality assurance for your agents
With judges, you can continuously monitor agent quality at scale. Define what good performance looks like, and let Connic track it across every run.