
Judges

Automatically evaluate agent runs using LLMs. Define structured scoring rubrics, configure sampling rates, and track evaluation quality over time.

Overview

Judges let you use an LLM to evaluate how well your agents perform. Instead of manually reviewing runs, you define a scoring rubric with criteria and point values, assign it to an agent, and let Connic automatically score runs against that rubric.

Each judge evaluation receives the full agent run data (input, output, error, traces, context, token usage, status, and duration) and returns per-criteria scores with reasoning.

Structured Scoring

Define named criteria with descriptions and max scores. Get repeatable, comparable evaluations across all your runs.

Sample Rate

Control costs by evaluating a percentage of runs automatically, or trigger evaluations manually for specific runs.

Run Filters

Only evaluate runs matching specific conditions like status, agent name, or context properties.

Creating a Judge

Navigate to the Judges tab in your project and click New Judge. Each judge is configured with the following:

| Field | Description |
| --- | --- |
| Name | A descriptive name for the judge, e.g. "Invoice Quality Check" or "Response Accuracy". |
| Agent | The agent whose runs this judge will evaluate. A judge is always scoped to a single agent. |
| Model | The LLM model used for evaluation. Uses the same provider/model-name format as your agent configuration, and your project's API keys. |
| System Prompt | Optional additional instructions for the judge. Use this to provide domain-specific context, examples of good/bad responses, or special evaluation rules. |
| Scoring Criteria | One or more named criteria, each with a description and max score. The judge evaluates each criterion independently. |
| Trigger Mode | Automatic evaluates runs when they complete. Manual requires explicit triggering per run. |
| Sample Rate | For automatic judges, the percentage of matching runs to evaluate (1-100%). Set to 100% to evaluate every run, or lower to control costs. |
| Filters | Optional conditions to narrow which runs are eligible. Filter on fields like status, agent_name, or context properties like context.tier. |
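To make the fields above concrete, here is an illustrative judge configuration expressed as plain data. This is only a sketch of what the New Judge form collects: the key names, agent name, and model string are hypothetical, not an actual Connic API schema.

```python
# Illustrative only: the pieces of a judge configuration as described above.
# Key names and values are hypothetical, not a real Connic API payload.
judge = {
    "name": "Invoice Quality Check",
    "agent": "invoice-processor",          # hypothetical agent name
    "model": "openai/gpt-4o",              # provider/model-name format
    "system_prompt": "This agent extracts structured data from invoices.",
    "criteria": [
        {"name": "Accuracy", "max_score": 10,
         "description": "Did the agent extract all invoice fields correctly?"},
        {"name": "Completeness", "max_score": 5,
         "description": "Were all line items included?"},
    ],
    "trigger_mode": "automatic",           # or "manual"
    "sample_rate": 20,                     # evaluate ~20% of matching runs
    "filters": [
        {"field": "status", "operator": "equals", "value": "completed"},
        {"field": "context.tier", "operator": "not equals", "value": "test"},
    ],
}

# The maximum possible overall score is the sum of the criteria max scores.
max_total = sum(c["max_score"] for c in judge["criteria"])
print(max_total)  # 15
```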

Scoring Criteria

Criteria are the core of a judge configuration. Each criterion defines what aspect of the agent's performance to evaluate. The judge LLM scores each criterion independently and provides reasoning for each score.

  • Accuracy (max 10): Did the agent produce a factually correct and complete response based on the input?
  • Tool Usage (max 5): Did the agent use the appropriate tools and interpret their results correctly?
  • Response Quality (max 5): Is the response well-structured, clear, and appropriately formatted?

The overall score is the sum of all criteria scores. In the example above, a perfect score would be 20/20. Average scores are tracked over time on the judge detail page and shown as criteria breakdowns with progress bars.
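The summing rule can be sketched with the three example criteria. The individual scores here are made up for illustration; only the max scores come from the example above.

```python
# Sketch: overall score = sum of per-criterion scores (example values).
scores = {"Accuracy": 8, "Tool Usage": 5, "Response Quality": 4}
max_scores = {"Accuracy": 10, "Tool Usage": 5, "Response Quality": 5}

overall = sum(scores.values())
max_total = sum(max_scores.values())
print(f"{overall}/{max_total} ({100 * overall / max_total:.0f}%)")  # 17/20 (85%)
```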

How It Works

When an agent run completes, the backend checks all active automatic judges assigned to that agent. For each matching judge (after applying filters and sample rate), an evaluation task is queued.

1. Run completes: The agent finishes processing and the run status is set to completed.
2. Filters and sample rate applied: Each active judge's filters are checked. If the run matches, the sample rate determines if it gets evaluated.
3. Evaluation queued: A judge run record is created with status "queued" and pushed to the processing queue.
4. LLM evaluates: The judge worker sends all run data to the configured LLM with the scoring rubric. The model returns per-criteria scores and reasoning.
5. Results stored: Scores, reasoning, and token usage are saved. Results appear on the judge detail page and in the run detail dialog.
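The sampling decision in step 2 amounts to a probabilistic gate. A minimal sketch, assuming the backend simply draws a random number per matching run (the actual implementation is not specified here):

```python
import random

def sampled(sample_rate_percent: int, rng=random.random) -> bool:
    """Hypothetical step-2 gate: a run that passes the filters is
    evaluated with probability sample_rate_percent / 100."""
    return rng() * 100 < sample_rate_percent

# At 100% every matching run is evaluated; at 0% none are.
print(sampled(100), sampled(0))  # True False
```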

Run Filters

Filters let you control exactly which runs get evaluated. Each filter has a field, an operator, and a value. All filters must match for a run to be eligible (AND logic).

| Operator | Example Field | Description |
| --- | --- | --- |
| equals | status | Exact string match. E.g. only evaluate runs with status = completed. |
| not equals | context.tier | Exclude specific values. E.g. skip runs where context.tier = test. |
| contains | context.source | Substring match. E.g. evaluate runs where context.source contains production. |
| exists | context.customer_id | Check if a field is present and not empty. Useful for only judging runs that have specific context values. |

Fields prefixed with context. are resolved against the run context dictionary. This works well with middleware that tags runs with metadata like customer tier, request source, or feature flags.
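The filter semantics described above can be sketched as follows. The helper names and run structure are hypothetical; only the operators and the context.-prefix resolution rule come from the docs.

```python
def resolve_field(run: dict, field: str):
    """Fields prefixed with 'context.' are looked up in the run's
    context dictionary; everything else is a top-level run field."""
    if field.startswith("context."):
        return run.get("context", {}).get(field[len("context."):])
    return run.get(field)

def matches(run: dict, field: str, op: str, value=None) -> bool:
    """One filter condition; a judge requires ALL filters to match (AND)."""
    actual = resolve_field(run, field)
    if op == "equals":
        return actual == value
    if op == "not equals":
        return actual != value
    if op == "contains":
        return isinstance(actual, str) and value in actual
    if op == "exists":
        return actual is not None and actual != ""
    return False

run = {"status": "completed", "agent_name": "invoicer",
       "context": {"tier": "pro", "source": "production-eu"}}
print(matches(run, "status", "equals", "completed"))        # True
print(matches(run, "context.tier", "not equals", "test"))   # True
print(matches(run, "context.source", "contains", "production"))  # True
print(matches(run, "context.customer_id", "exists"))        # False
```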

Manual Triggering

You can trigger a judge evaluation for any specific agent run, regardless of the judge's trigger mode or sample rate. This is useful for:

  • Evaluating a run that was not automatically sampled
  • Re-evaluating a run after adjusting the judge's criteria or system prompt
  • Testing a new judge configuration on existing runs before enabling automatic mode
  • Spot-checking runs that look suspicious

On the judge detail page, click Trigger Manually and select a run from the dropdown. The dropdown shows recent completed runs for the judge's agent. You can also paste a run ID directly for older runs.

Agent Validation

When triggering manually, the backend validates that the selected run belongs to the judge's configured agent. You cannot evaluate a run from a different agent.

Viewing Results

Judge results are visible in two places:

Judge Detail Page

Click any judge in the overview to see its detail page. The page follows the same layout as agent and connector detail pages:

  • Statistics: Average score with trend, total evaluated, sample rate, and failed count with completed/failed breakdown
  • Criteria Averages: Per-criterion average scores with progress bars, so you can see which criteria agents struggle with most
  • Evaluations list: All judge runs with status, run ID, score, and timestamp. Click any evaluation to open the agent run detail dialog
  • Configuration: Agent, model, trigger mode, sample rate, system prompt, criteria, and filters

Run Detail Dialog

When viewing any agent run that has been evaluated, judge results appear at the top of the dialog (before input/output). Each evaluation shows:

  • The judge name and evaluation status
  • Overall score as a percentage with the raw score
  • Per-criteria scores with progress bars and reasoning
  • An overall assessment summarizing the evaluation

A score pill also appears in the run header next to the status badge, giving you a quick quality signal at a glance.

Evaluation Statuses

Each judge evaluation goes through these lifecycle states:

| Status | Description |
| --- | --- |
| Queued | The evaluation is waiting to be processed. |
| Running | The judge LLM is actively evaluating the run. |
| Completed | Evaluation finished successfully. Scores and reasoning are available. |
| Failed | The evaluation failed. Common causes: invalid model name, authentication error, or rate limiting. The error message is shown on the evaluation. |

API Keys

Judges use your project's LLM provider API keys, the same keys configured in your project settings. The model field uses the same provider/model-name format as your agent YAML configuration.

All providers supported for agent execution are also supported for judges, including OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Vertex AI, and OpenRouter.

Billing

Each successful judge evaluation counts as one additional billable run on the agent run it evaluated. This means a run that gets evaluated by one judge counts as 2 billable runs (1 for the original execution + 1 for the evaluation). A run evaluated by two judges counts as 3 billable runs, and so on.

Failed evaluations are not billed. Only completed evaluations increment the billable run count.

| Scenario | Billable Runs |
| --- | --- |
| Agent run, no judge | 1 |
| Agent run + 1 judge evaluation | 2 |
| Agent run + 2 judge evaluations | 3 |
| Agent run + 1 failed evaluation | 1 |
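The billing rule reduces to a one-line formula: the original execution plus one billable run per completed (not failed) evaluation.

```python
def billable_runs(completed_evaluations: int) -> int:
    """Billing rule sketch: 1 for the original agent run, plus 1 per
    completed judge evaluation. Failed evaluations are not billed."""
    return 1 + completed_evaluations

print(billable_runs(0), billable_runs(1), billable_runs(2))  # 1 2 3
```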

Cost Considerations

Judge evaluations have two cost components: the Connic billable run and the LLM API cost charged by your provider. Use the sample rate to control both. For high-volume agents, a 10-20% sample rate is often sufficient to track quality trends without significantly increasing costs.

Notifications

Judges support score-based notifications that alert you when the overall average evaluation score drops below a configurable threshold. Notification channels (in-app, email) depend on each member's preferences.

Score Alert Threshold

Each judge has an optional Score Alert setting. When enabled, you set a threshold percentage (e.g. 60%). A notification is triggered when the judge's average score drops below that threshold.

The alert only fires on the transition: the specific evaluation that causes the average to drop below the threshold triggers the notification. Subsequent evaluations that keep the average below the threshold do not generate additional alerts. If the average recovers above the threshold and drops below again, a new alert will fire.

Average Window

By default, the score alert uses the average of the last 10 completed runs to decide whether to fire. You can change this window to suit your use case:

  • Every run (1): Alerts whenever a single evaluation scores below the threshold — useful for catching every bad run.
  • Last 10 / 50 / 100 runs: Smooths out outliers by averaging over a window of recent evaluations.
  • All time: Uses the average across all completed evaluations.

A smaller window reacts faster to quality drops but may be noisier. A larger window or all-time average is more stable but slower to alert.
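The transition-only alert combined with a rolling window can be sketched like this. The class and its internals are hypothetical, not Connic's implementation; it only mirrors the behavior described above: fire when the windowed average crosses below the threshold, stay quiet while it remains below, and arm again after recovery.

```python
from collections import deque

class ScoreAlert:
    """Hypothetical sketch of the score alert: windowed average,
    fires only on the below-threshold transition."""
    def __init__(self, threshold: float, window: int = 10):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # maxlen=None would be "all time"
        self.below = False  # was the average already below the threshold?

    def record(self, score_percent: float) -> bool:
        """Add one completed evaluation; return True if an alert fires."""
        self.scores.append(score_percent)
        avg = sum(self.scores) / len(self.scores)
        fire = avg < self.threshold and not self.below
        self.below = avg < self.threshold
        return fire

alert = ScoreAlert(threshold=60, window=3)
print(alert.record(80))  # False: average 80
print(alert.record(50))  # False: average 65
print(alert.record(40))  # True: average drops to ~56.7, crosses threshold
print(alert.record(40))  # False: still below, no repeat alert
```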

Notification Channels

Judge score alerts support the same two channels as all other Connic notifications:

  • In-App: Appears in the notification bell in the top navigation bar. Clicking the notification takes you directly to the judge detail page.
  • Email: Sends a styled email with the judge name, current average score, threshold, agent name, total evaluations, and a direct link to the judge.

Managing Preferences

Each project member can independently control whether they receive judge score alerts via in-app, email, or both. Go to Settings > Notifications in your project to toggle the Judge Score Low event type. By default, both in-app and email are enabled.

Threshold vs. Sample Rate

The notification threshold is based on the average within the configured window, not individual evaluations (unless the window is set to 1). The sample rate affects which runs are evaluated and therefore contribute to the average. A lower sample rate means fewer data points, so the average may fluctuate more. For the most accurate quality signal, use a higher sample rate.

Best Practices

Start with manual triggering

Create a judge in manual mode first and evaluate a handful of runs to verify your criteria and system prompt produce useful scores. Once you are happy with the results, switch to automatic mode.

Write specific criteria descriptions

Vague criteria like "Quality" produce inconsistent scores. Be specific: "Did the response correctly extract all invoice line items including quantity, unit price, and total?" gives the judge LLM clear evaluation guidelines.

Use the system prompt for domain context

If your agent handles domain-specific tasks, provide context in the system prompt. For example: "This agent processes medical insurance claims. A correct response must include the claim number, patient name, and determination." This helps the judge LLM evaluate accuracy in context.

Use filters to focus evaluations

If your agent handles multiple use cases, use context filters to create separate judges for each. This gives you targeted quality metrics per use case rather than blended averages.

Monitor criteria averages for regressions

The criteria averages on the judge detail page show which aspects of your agent's performance are strong and which need work. After deploying a new agent version, compare these averages to catch regressions in specific criteria.

Automated quality assurance for your agents

With judges, you can continuously monitor agent quality at scale. Define what good performance looks like, and let Connic track it across every run.