Connic

Score every run.
Catch drift early.

Define what 'good' looks like as a structured rubric, then let an LLM score every production run against it. Configured per agent in the dashboard.

Read the judges docs

Judges

Last 24h

Status    Judge                                     Avg Score    Evaluated
Active    answer_accuracy on support_agent          86%          412 runs
Active    tool_usage on support_agent               91%          412 runs
Active    tone_match on invoice-processor           64%          218 runs
Active    completeness on support_agent             89%          412 runs
Active    response_quality on invoice-processor     97%          218 runs
Anatomy of a judge

A scoring rubric, configured in the dashboard

Each judge is scoped to one agent. Define named criteria with descriptions and max scores. The judge LLM scores every criterion independently and shows its reasoning.

New Judge

  Name: Invoice Quality Check
  Agent: invoice-processor
  Model: anthropic/claude-sonnet-4-20250514
  Trigger: Automatic
  Sample rate: 20%
  Filters: status equals completed

Scoring criteria: 17 / 20

  Accuracy: 9 / 10
  Did the agent produce a factually correct and complete response based on the input?

  Tool Usage: 4 / 5
  Did the agent use the appropriate tools and interpret their results correctly?

  Response Quality: 4 / 5
  Is the response well-structured, clear, and appropriately formatted?
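
The total above is simply the sum of per-criterion scores over the sum of their max scores, and each criterion also carries the judge LLM's reasoning. A minimal sketch of that rollup; the data shape and field names here are illustrative, not Connic's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str        # e.g. "Accuracy"
    score: int       # what the judge LLM awarded
    max_score: int   # the ceiling you defined for this criterion
    reasoning: str   # the judge's explanation for the score

def overall(scores: list[CriterionScore]) -> tuple[int, int]:
    """Roll per-criterion scores up into the judge's total."""
    return sum(s.score for s in scores), sum(s.max_score for s in scores)

scores = [
    CriterionScore("Accuracy", 9, 10, "Claim number and determination are correct."),
    CriterionScore("Tool Usage", 4, 5, "Used the lookup tool; one redundant call."),
    CriterionScore("Response Quality", 4, 5, "Clear, but missing a summary line."),
]
print(overall(scores))  # (17, 20) -> the "17 / 20" shown above
```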

Configuration

Everything you control on a judge

Connic doesn't ship a fixed judge library. You write the rubric for your agent. These are the levers each judge gives you.

Custom criteria

Define one or more named criteria, each with a description and a max score. The judge LLM scores every criterion independently, with reasoning per score.

Sample rate

Pick the percentage of matching runs to evaluate (1-100%). Set 100% to score every run, or lower to control cost on high-volume agents.

Run filters

Narrow which runs are eligible using equals, not equals, contains, and exists operators on status, agent name, or context fields.
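
Conceptually, each filter is a field, an operator, and a value, and a run must pass every filter to be eligible. A rough sketch of that matching, assuming run fields flattened into a dict; this is an illustration, not Connic's implementation:

```python
def passes(run: dict, field: str, op: str, value=None) -> bool:
    """Evaluate one filter against a run's fields (e.g. status, agent_name, context.tier)."""
    actual = run.get(field)
    if op == "equals":
        return actual == value
    if op == "not equals":
        return actual != value
    if op == "contains":
        return value in str(actual or "")
    if op == "exists":
        return actual is not None
    raise ValueError(f"unknown operator: {op}")

run = {"status": "completed", "agent_name": "invoice-processor", "context.tier": "enterprise"}
filters = [("status", "equals", "completed"), ("context.tier", "exists", None)]
eligible = all(passes(run, f, op, v) for f, op, v in filters)  # True
```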

Automatic or manual

Automatic judges score runs as they complete. Manual judges score only on demand, which is useful for spot-checks or re-scoring after editing the rubric.

Score alerts

Get notified in-app or by email when the judge's average score drops below a threshold over the last 1, 10, 50, or 100 runs, or over all runs to date.
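
Put differently, the alert watches the mean score over a trailing window of evaluations and fires when it falls below your threshold. A simplified sketch of that check; this is a hypothetical helper, not the actual alerting code:

```python
def alert_triggered(recent_scores: list[float], threshold: float, window: int | None) -> bool:
    """recent_scores: percentages for this judge, newest last. window: 1, 10, 50, 100, or None for all-time."""
    considered = recent_scores if window is None else recent_scores[-window:]
    if not considered:
        return False
    return sum(considered) / len(considered) < threshold

# The last 10 evaluations average 75.6% against an 80% threshold -> notify.
print(alert_triggered([88, 85, 90, 70, 68, 71, 66, 74, 69, 75], threshold=80, window=10))  # True
```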

Per-criterion averages

Each criterion is tracked over time on the judge detail page, so you can see which aspects of the agent are slipping.

See the docs for the full reference.

Where judges fit

From completed runs to actionable signal

When a run finishes, the backend checks every active judge for the agent, applies filters and sample rate, and queues an evaluation. Scores attach to the run and show up in A/B tests, alerts, and run detail.

  • Filters and sample rate are applied; eligible runs are queued for evaluation.
  • The judge LLM scores each criterion; scores, reasoning, and token usage are saved against the run.
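
Put together, the flow for one completed run looks roughly like this. The judge and run shapes and the enqueue_evaluation callback are illustrative stand-ins, not Connic internals:

```python
import random

def on_run_completed(run: dict, judges: list[dict], enqueue_evaluation) -> None:
    """Roughly what happens when a run finishes: filter, sample, queue."""
    for judge in judges:
        if not judge["active"] or judge["agent"] != run["agent_name"]:
            continue  # judges are scoped to a single agent
        if judge["trigger"] != "automatic":
            continue  # manual judges score only on demand
        if not all(matches(run) for matches in judge["filters"]):
            continue  # filters: predicates built from equals / not equals / contains / exists
        if random.random() * 100 >= judge["sample_rate"]:
            continue  # a sample rate of 20 keeps roughly 20% of eligible runs
        # A worker later asks the judge LLM to score each criterion and saves the
        # scores, reasoning, and token usage against the run.
        enqueue_evaluation(judge, run)
```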

  • A/B tests: average judge scores show up in the side-by-side comparison between control and variant.
  • Score alerts: get an in-app or email notification when a judge's rolling average drops below your threshold.
  • Run detail: per-criterion scores and reasoning appear at the top of the run detail dialog, with a score pill in the header.

Tune as you learn

Iterate on the rubric, re-score the runs you care about

Start a judge in manual mode. Score a handful of real runs, refine the system prompt and criteria descriptions, then flip it to automatic when you trust the scores.

System prompt
Domain context for the judge LLM
This agent processes medical insurance claims.
A correct response must include the claim number,
the patient name, and the determination
(approved, denied, or pending review).

Be strict on missing fields and lenient on
phrasing — the response goes to a downstream
system, not directly to the patient.

Optional. Use it to give the judge LLM the context it needs to evaluate accuracy in your domain.

Trigger Manually

  Run: run_a1b2c3 - invoice-processor - completed
  Status: queued

Score any run on demand, regardless of trigger mode or sample rate. Useful for spot-checks, re-scoring after editing criteria, or testing a new rubric on past runs before flipping to automatic.

  • Pick from recent completed runs in the dropdown
  • Or paste a run ID directly for older runs
  • The agent on the run must match the judge's agent

Frequently Asked Questions

How much do evaluations cost?

Each evaluation is one LLM call (billed by your provider) plus one Connic billable run. The per-judge sample rate (1-100%) controls both. For high-volume agents, 10-20% is usually enough to track quality trends.

Where do I create a judge?

In the Judges tab of your project. Click New Judge, pick the agent, choose a model, and write a system prompt and one or more named criteria with descriptions and max scores. Then set the trigger mode (Automatic or Manual), sample rate, and optional run filters.

How do I know my rubric is good?

Start in Manual mode and trigger evaluations on a handful of recent runs. Inspect the per-criterion scores and reasoning, refine the system prompt or criteria descriptions, then switch to Automatic when you trust the scores.

Can I limit which runs get evaluated?

Yes. Each judge has filters with equals, not equals, contains, and exists operators against fields like status, agent_name, or any context.* property your middleware sets. For example: only evaluate runs where context.tier equals enterprise.

How do score alerts work?

Each judge has an optional Score Alert with a threshold percentage and an averaging window (1, 10, 50, 100, or all-time). The alert fires once when the average crosses below the threshold. Delivery is in-app and by email, based on each member's notification preferences.

Which models can judges use?

Any provider you've already configured for agents: OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Vertex AI, OpenRouter, or any custom OpenAI-compatible provider. Judges use the same provider/model-name format and the same project API keys as your agents.