Judges
Automatically evaluate agent runs using LLMs. Define structured scoring rubrics, configure sampling rates, and track evaluation quality over time.
Overview
Judges let you use an LLM to evaluate how well your agents perform. Instead of manually reviewing runs, you define a scoring rubric with criteria and point values, assign it to an agent, and let Connic automatically score runs against that rubric.
Each judge evaluation receives the full agent run data (input, output, error, traces, context, token usage, status, and duration) and returns per-criteria scores with reasoning.
Structured Scoring
Define named criteria with descriptions and max scores. Get repeatable, comparable evaluations across all your runs.
Sample Rate
Control costs by evaluating a percentage of runs automatically, or trigger evaluations manually for specific runs.
Run Filters
Only evaluate runs matching specific conditions like status, agent name, or context properties.
Creating a Judge
Navigate to the Judges tab in your project and click New Judge. Each judge is configured with the following:
| Field | Description |
|---|---|
| Name | A descriptive name for the judge, e.g. "Invoice Quality Check" or "Response Accuracy". |
| Agent | The agent whose runs this judge will evaluate. A judge is always scoped to a single agent. |
| Model | The LLM model used for evaluation, in the same provider/model-name format as your agent configuration. Evaluations use your project's API keys. |
| System Prompt | Optional additional instructions for the judge. Use this to provide domain-specific context, examples of good/bad responses, or special evaluation rules. |
| Scoring Criteria | One or more named criteria, each with a description and max score. The judge evaluates each criterion independently. |
| Trigger Mode | Automatic evaluates runs when they complete. Manual requires explicit triggering per run. |
| Sample Rate | For automatic judges, the percentage of matching runs to evaluate (1-100%). Set to 100% to evaluate every run, or lower to control costs. |
| Filters | Optional conditions to narrow which runs are eligible. Filter on fields like status, agent_name, or context properties like context.tier. |
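Putting the fields together, a judge configuration can be pictured as a plain data structure. This is an illustrative sketch only (the dictionary keys mirror the table above and the model string is an example; this is not the actual Connic API):

```python
# Hypothetical judge configuration mirroring the fields described above.
# Field names and values are illustrative, not the actual Connic API.
judge = {
    "name": "Invoice Quality Check",
    "agent": "invoice-processor",      # a judge is scoped to a single agent
    "model": "openai/gpt-4o",          # provider/model-name format (example model)
    "system_prompt": "This agent processes invoices. A correct response "
                     "must list every line item with quantity and price.",
    "criteria": [
        {"name": "Accuracy", "description": "Factually correct and complete?",
         "max_score": 10},
        {"name": "Formatting", "description": "Well-structured and clear?",
         "max_score": 5},
    ],
    "trigger_mode": "automatic",       # or "manual"
    "sample_rate": 20,                 # evaluate ~20% of matching runs
    "filters": [
        {"field": "status", "operator": "equals", "value": "completed"},
    ],
}
```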
Scoring Criteria
Criteria are the core of a judge configuration. Each criterion defines what aspect of the agent's performance to evaluate. The judge LLM scores each criterion independently and provides reasoning for each score.
For example, a judge might define three criteria (the names and max scores below are illustrative):

- Accuracy (max 10): Did the agent produce a factually correct and complete response based on the input?
- Tool Usage (max 5): Did the agent use the appropriate tools and interpret their results correctly?
- Formatting (max 5): Is the response well-structured, clear, and appropriately formatted?
The overall score is the sum of all criteria scores. In the example above, a perfect score would be 20/20. Average scores are tracked over time on the judge detail page and shown as criteria breakdowns with progress bars.
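The score arithmetic can be sketched in a few lines (the function name and score values are ours, for illustration only):

```python
def overall_score(criteria_scores):
    """Sum per-criterion scores into an overall result plus a percentage."""
    total = sum(c["score"] for c in criteria_scores)
    maximum = sum(c["max_score"] for c in criteria_scores)
    return total, maximum, round(100 * total / maximum)

# Example scores for a three-criterion rubric with a 20-point maximum.
scores = [
    {"name": "Accuracy", "score": 8, "max_score": 10},
    {"name": "Tool Usage", "score": 5, "max_score": 5},
    {"name": "Formatting", "score": 4, "max_score": 5},
]
print(overall_score(scores))  # (17, 20, 85)
```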
How It Works
When an agent run completes, the backend checks all active automatic judges assigned to that agent. For each matching judge (after applying filters and sample rate), an evaluation task is queued.
Run completes
The agent finishes processing and the run status is set to completed.
Filters and sample rate applied
Each active judge's filters are checked. If the run matches, the sample rate determines if it gets evaluated.
Evaluation queued
A judge run record is created with status "queued" and pushed to the processing queue.
LLM evaluates
The judge worker sends all run data to the configured LLM with the scoring rubric. The model returns per-criteria scores and reasoning.
Results stored
Scores, reasoning, and token usage are saved. Results appear on the judge detail page and in the run detail dialog.
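The gating in steps 2 and 3 amounts to three checks before a run is queued: agent scope, filters, and a random draw against the sample rate. A minimal sketch (helper and field names are hypothetical; only the `equals` operator is shown here, and full operator semantics are covered under Run Filters):

```python
import random

def should_evaluate(judge, run, rng=random.random):
    """Decide whether a completed run is queued for one automatic judge
    (illustrative sketch, not Connic's actual implementation)."""
    if judge["trigger_mode"] != "automatic":
        return False
    if run["agent_name"] != judge["agent"]:
        return False                          # judge is scoped to one agent
    for f in judge["filters"]:                # all filters must pass (AND logic)
        if run.get(f["field"]) != f["value"]:
            return False
    # Sample rate: evaluate roughly sample_rate% of eligible runs.
    return rng() * 100 < judge["sample_rate"]

judge = {"trigger_mode": "automatic", "agent": "invoice-processor",
         "filters": [{"field": "status", "value": "completed"}],
         "sample_rate": 100}
run = {"agent_name": "invoice-processor", "status": "completed"}
print(should_evaluate(judge, run))  # True (sample rate 100 always passes)
```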
Run Filters
Filters let you control exactly which runs get evaluated. Each filter has a field, an operator, and a value. All filters must match for a run to be eligible (AND logic).
| Operator | Example Field | Description |
|---|---|---|
| equals | status | Exact string match. E.g. only evaluate runs with status = completed. |
| not equals | context.tier | Exclude specific values. E.g. only evaluate runs where context.tier is not test, skipping test-tier runs. |
| contains | context.source | Substring match. E.g. evaluate runs where context.source contains production. |
| exists | context.customer_id | Check if a field is present and not empty. Useful for only judging runs that have specific context values. |
Fields prefixed with context. are resolved against the run context dictionary. This works well with middleware that tags runs with metadata like customer tier, request source, or feature flags.
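The operator semantics and the `context.` prefix resolution can be sketched as follows (hypothetical helpers; the operator names mirror the table above, not a documented Connic interface):

```python
def resolve(field, run):
    """Resolve a filter field, looking up 'context.*' fields in the
    run's context dictionary and everything else on the run itself."""
    if field.startswith("context."):
        return run.get("context", {}).get(field[len("context."):])
    return run.get(field)

def filter_matches(flt, run):
    """Apply one filter (equals / not_equals / contains / exists) to a run."""
    value = resolve(flt["field"], run)
    op, expected = flt["operator"], flt.get("value")
    if op == "equals":
        return value == expected
    if op == "not_equals":
        return value != expected
    if op == "contains":
        return value is not None and expected in str(value)
    if op == "exists":
        return value not in (None, "")
    return False

run = {"status": "completed",
       "context": {"tier": "pro", "source": "production-eu"}}
print(filter_matches({"field": "context.source", "operator": "contains",
                      "value": "production"}, run))  # True
print(filter_matches({"field": "context.tier", "operator": "not_equals",
                      "value": "test"}, run))        # True
```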
Manual Triggering
You can trigger a judge evaluation for any specific agent run, regardless of the judge's trigger mode or sample rate. This is useful for:
- Evaluating a run that was not automatically sampled
- Re-evaluating a run after adjusting the judge's criteria or system prompt
- Testing a new judge configuration on existing runs before enabling automatic mode
- Spot-checking runs that look suspicious
On the judge detail page, click Trigger Manually and select a run from the dropdown. The dropdown shows recent completed runs for the judge's agent. You can also paste a run ID directly for older runs.
Agent Validation
When triggering manually, the backend validates that the selected run belongs to the judge's configured agent. You cannot evaluate a run from a different agent.
Viewing Results
Judge results are visible in two places:
Judge Detail Page
Click any judge in the overview to see its detail page. The page follows the same layout as agent and connector detail pages:
- Statistics: Average score with trend, total evaluated, sample rate, and failed count with completed/failed breakdown
- Criteria Averages: Per-criterion average scores with progress bars, so you can see which criteria agents struggle with most
- Evaluations list: All judge runs with status, run ID, score, and timestamp. Click any evaluation to open the agent run detail dialog
- Configuration: Agent, model, trigger mode, sample rate, system prompt, criteria, and filters
Run Detail Dialog
When viewing any agent run that has been evaluated, judge results appear at the top of the dialog (before input/output). Each evaluation shows:
- The judge name and evaluation status
- Overall score as a percentage with the raw score
- Per-criteria scores with progress bars and reasoning
- An overall assessment summarizing the evaluation
A score pill also appears in the run header next to the status badge, giving you a quick quality signal at a glance.
Evaluation Statuses
Each judge evaluation goes through these lifecycle states:
| Status | Description |
|---|---|
| Queued | The evaluation is waiting to be processed. |
| Running | The judge LLM is actively evaluating the run. |
| Completed | Evaluation finished successfully. Scores and reasoning are available. |
| Failed | The evaluation failed. Common causes: invalid model name, authentication error, or rate limiting. The error message is shown on the evaluation. |
API Keys
Judges use your project's LLM provider API keys, the same keys configured in your project settings. The model field uses the same provider/model-name format as your agent YAML configuration.
All providers supported for agent execution are also supported for judges, including OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Vertex AI, and OpenRouter.
Billing
Each successful judge evaluation counts as one additional billable run on the agent run it evaluated. This means a run that gets evaluated by one judge counts as 2 billable runs (1 for the original execution + 1 for the evaluation). A run evaluated by two judges counts as 3 billable runs, and so on.
Failed evaluations are not billed. Only completed evaluations increment the billable run count.
| Scenario | Billable Runs |
|---|---|
| Agent run, no judge | 1 |
| Agent run + 1 judge evaluation | 2 |
| Agent run + 2 judge evaluations | 3 |
| Agent run + 1 failed evaluation | 1 |
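The billing rule in the table reduces to simple arithmetic: one billable run for the execution plus one per completed evaluation, with failed evaluations free. As a sketch (the function is ours, for illustration):

```python
def billable_runs(completed_evaluations, failed_evaluations=0):
    """1 for the agent run itself + 1 per *completed* judge evaluation.
    Failed evaluations are not billed."""
    return 1 + completed_evaluations

print(billable_runs(0))     # 1 - no judge
print(billable_runs(1))     # 2 - one evaluation
print(billable_runs(2))     # 3 - two evaluations
print(billable_runs(0, 1))  # 1 - one failed evaluation
```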
Cost Considerations
Judge evaluations have two cost components: the Connic billable run and the LLM API cost charged by your provider. Use the sample rate to control both. For high-volume agents, a 10-20% sample rate is often sufficient to track quality trends without significantly increasing costs.
Notifications
Judges support score-based notifications that alert you when the overall average evaluation score drops below a configurable threshold. Notification channels (in-app, email) depend on each member's preferences.
Score Alert Threshold
Each judge has an optional Score Alert setting. When enabled, you set a threshold percentage (e.g. 60%). A notification is triggered when the judge's average score drops below that threshold.
The alert only fires on the transition: the specific evaluation that causes the average to drop below the threshold triggers the notification. Subsequent evaluations that keep the average below the threshold do not generate additional alerts. If the average recovers above the threshold and drops below again, a new alert will fire.
Average Window
By default, the score alert uses the average of the last 10 completed runs to decide whether to fire. You can change this window to suit your use case:
- Every run (1): Alerts whenever a single evaluation scores below the threshold — useful for catching every bad run.
- Last 10 / 50 / 100 runs: Smooths out outliers by averaging over a window of recent evaluations.
- All time: Uses the average across all completed evaluations.
A smaller window reacts faster to quality drops but may be noisier. A larger window or all-time average is more stable but slower to alert.
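The windowed average and transition-only firing described above can be sketched as a small state machine (a hypothetical class, not Connic internals; `window=None` would model the all-time average):

```python
from collections import deque

class ScoreAlert:
    """Fire once when the windowed average crosses below the threshold,
    then re-arm only after the average recovers above it (illustrative)."""

    def __init__(self, threshold_pct, window=10):
        self.threshold = threshold_pct
        self.scores = deque(maxlen=window)  # maxlen=None keeps all scores
        self.below = False                  # currently below threshold?

    def record(self, score_pct):
        """Record one completed evaluation; return True if an alert fires."""
        self.scores.append(score_pct)
        avg = sum(self.scores) / len(self.scores)
        fired = avg < self.threshold and not self.below
        self.below = avg < self.threshold
        return fired

alert = ScoreAlert(threshold_pct=60, window=3)
print(alert.record(80))  # False - avg 80, above threshold
print(alert.record(40))  # False - avg 60, not below
print(alert.record(30))  # True  - avg 50 drops below 60, alert fires
print(alert.record(30))  # False - still below, no duplicate alert
```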
Notification Channels
Judge score alerts support the same two channels as all other Connic notifications:
- In-App: Appears in the notification bell in the top navigation bar. Clicking the notification takes you directly to the judge detail page.
- Email: Sends a styled email with the judge name, current average score, threshold, agent name, total evaluations, and a direct link to the judge.
Managing Preferences
Each project member can independently control whether they receive judge score alerts via in-app, email, or both. Go to Settings > Notifications in your project to toggle the Judge Score Low event type. By default, both in-app and email are enabled.
Threshold vs. Sample Rate
The notification threshold is based on the average within the configured window, not individual evaluations (unless the window is set to 1). The sample rate affects which runs are evaluated and therefore contribute to the average. A lower sample rate means fewer data points, so the average may fluctuate more. For the most accurate quality signal, use a higher sample rate.
Best Practices
Start with manual triggering
Create a judge in manual mode first and evaluate a handful of runs to verify your criteria and system prompt produce useful scores. Once you are happy with the results, switch to automatic mode.
Write specific criteria descriptions
Vague criteria like "Quality" produce inconsistent scores. Be specific: "Did the response correctly extract all invoice line items including quantity, unit price, and total?" gives the judge LLM clear evaluation guidelines.
Use the system prompt for domain context
If your agent handles domain-specific tasks, provide context in the system prompt. For example: "This agent processes medical insurance claims. A correct response must include the claim number, patient name, and determination." This helps the judge LLM evaluate accuracy in context.
Use filters to focus evaluations
If your agent handles multiple use cases, use context filters to create separate judges for each. This gives you targeted quality metrics per use case rather than blended averages.
Monitor criteria averages for regressions
The criteria averages on the judge detail page show which aspects of your agent's performance are strong and which need work. After deploying a new agent version, compare these averages to catch regressions in specific criteria.
Automated quality assurance for your agents
With judges, you can continuously monitor agent quality at scale. Define what good performance looks like, and let Connic track it across every run.