Judges
Score every run.
Catch drift early.
Define what 'good' looks like as a structured rubric, then let an LLM score every production run against it. Configured per agent in the dashboard.
Read the judges docs

Judges, last 24h
- Active: answer_accuracy on support_agent, 86% across 412 runs
- Active: tool_usage on support_agent, 91% across 412 runs
- Active: tone_match on invoice-processor, 64% across 218 runs
- Active: completeness on support_agent, 89% across 412 runs
- Active: response_quality on invoice-processor, 97% across 218 runs
A scoring rubric, configured in the dashboard
Each judge is scoped to one agent. Define named criteria with descriptions and max scores. The judge LLM scores every criterion independently and shows its reasoning.
- Answer accuracy: Did the agent produce a factually correct and complete response based on the input?
- Tool usage: Did the agent use the appropriate tools and interpret their results correctly?
- Response quality: Is the response well-structured, clear, and appropriately formatted?
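To make the shape concrete, here is a minimal sketch of a rubric like the one above as data, plus a helper that renders it into a judge prompt. Everything in it is illustrative: the Criterion fields, the judge_prompt helper, and the max scores of 10 are assumptions, not Connic's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str          # e.g. "answer_accuracy"
    description: str   # what the judge LLM should check
    max_score: int     # upper bound for this criterion (assumed scale)

# The three example criteria from above, with assumed max scores of 10.
RUBRIC = [
    Criterion("answer_accuracy",
              "Did the agent produce a factually correct and complete "
              "response based on the input?", 10),
    Criterion("tool_usage",
              "Did the agent use the appropriate tools and interpret "
              "their results correctly?", 10),
    Criterion("response_quality",
              "Is the response well-structured, clear, and appropriately "
              "formatted?", 10),
]

def judge_prompt(run_input: str, run_output: str) -> str:
    """Render the rubric into one prompt so the judge LLM scores each
    criterion independently and explains every score."""
    lines = [
        "Score the agent run against each criterion independently.",
        "For each criterion return: name, score (0..max), reasoning.",
        "",
        f"Input:\n{run_input}",
        f"Output:\n{run_output}",
        "",
        "Criteria:",
    ]
    lines += [f"- {c.name} (max {c.max_score}): {c.description}" for c in RUBRIC]
    return "\n".join(lines)
```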
Everything you control on a judge
Connic doesn't ship a fixed judge library. You write the rubric for your agent. These are the levers each judge gives you.
- Criteria: Define one or more named criteria, each with a description and a max score. The judge LLM scores every criterion independently, with reasoning per score.
- Sample rate: Pick the percentage of matching runs to evaluate (1-100%). Set 100% to score every run, or lower to control cost on high-volume agents.
- Filters: Narrow which runs are eligible using equals, not equals, contains, and exists operators on status, agent name, or context fields.
- Trigger mode: Automatic judges score runs as they complete. Manual judges only score on demand, which is useful for spot-checks or re-scoring after editing the rubric.
- Alerts: Get notified in-app or by email when the judge's average score drops below a threshold over the last 1, 10, 50, 100, or all-time runs.
- Per-criterion tracking: Each criterion is tracked over time on the judge detail page, so you can see which aspects of the agent are slipping.
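Pulled together, those levers amount to one configuration per judge. The sketch below is a hypothetical JSON-style rendering; judges are configured in the dashboard, and every field name here is an assumption for illustration.

```python
# Hypothetical, JSON-style judge configuration covering the levers above.
# Every field name is an assumption; real judges are set up in the dashboard.
judge_config = {
    "agent": "support_agent",              # each judge is scoped to one agent
    "criteria": [                          # named criteria with max scores
        {"name": "answer_accuracy",
         "description": "Factually correct and complete response?",
         "max_score": 10},
        {"name": "tool_usage",
         "description": "Appropriate tools, results interpreted correctly?",
         "max_score": 10},
    ],
    "sample_rate": 25,                     # evaluate 25% of matching runs (1-100)
    "filters": [                           # equals / not equals / contains / exists
        {"field": "status", "op": "equals", "value": "completed"},
        {"field": "context.plan", "op": "exists"},
    ],
    "trigger": "automatic",                # or "manual": score only on demand
    "alert": {
        "channels": ["in_app", "email"],   # where to notify
        "threshold": 80,                   # alert below this average (%)
        "window": 50,                      # over the last 1/10/50/100/all runs
    },
}
```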
From completed runs to actionable signal
When a run finishes, the backend checks every active judge for the agent, applies filters and sample rate, and queues an evaluation. Scores attach to the run and show up in A/B tests, alerts, and run detail.
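As a mental model, that dispatch step could look like the sketch below. The run and judge shapes, the snake_case operator names, and enqueue_evaluation are all assumptions for illustration, not the actual backend.

```python
import random

def enqueue_evaluation(run: dict, judge: dict) -> None:
    # Stand-in for the real queue; the resulting scores attach to the run.
    print(f"queued judge {judge['name']} for run {run['id']}")

def matches(run: dict, f: dict) -> bool:
    """Evaluate one filter; operators mirror equals, not equals,
    contains, and exists from the levers above."""
    value = run.get(f["field"])
    if f["op"] == "exists":
        return value is not None
    if f["op"] == "equals":
        return value == f["value"]
    if f["op"] == "not_equals":
        return value != f["value"]
    if f["op"] == "contains":
        return value is not None and f["value"] in value
    return False

def dispatch_judges(run: dict, judges: list[dict]) -> None:
    """For each active, automatic judge on the run's agent: apply
    filters, then the sample rate, then queue an evaluation."""
    for judge in judges:
        if not judge["active"] or judge["agent"] != run["agent"]:
            continue
        if judge["trigger"] != "automatic":
            continue  # manual judges only score on demand
        if not all(matches(run, f) for f in judge["filters"]):
            continue
        if random.random() * 100 >= judge["sample_rate"]:
            continue  # sampled out to control cost on high-volume agents
        enqueue_evaluation(run, judge)
```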
A/B tests: Average judge scores show up in the side-by-side comparison between control and variant.
Alerts: Get an in-app or email notification when a judge's rolling average drops below your threshold.
Run detail: Per-criterion scores and reasoning appear at the top of the run detail dialog, with a score pill in the header.
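Of these, the alert is the one with logic worth sketching: it reduces to a rolling average over the configured window. A minimal sketch, assuming scores arrive as percentages and that should_alert is a hypothetical helper, not Connic's API:

```python
def should_alert(scores: list[float], threshold: float,
                 window: int | None = None) -> bool:
    """True when the rolling average over the configured window
    (1, 10, 50, 100, or None for all-time) drops below the threshold.
    Scores are assumed to be percentages, newest last."""
    recent = scores if window is None else scores[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) < threshold

# A judge averaging 64% trips an 80% threshold over the last 10 runs.
assert should_alert([70, 60, 62, 65], threshold=80, window=10)
```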
Iterate on the rubric, re-score the runs you care about
Start a judge in manual mode. Score a handful of real runs, refine the system prompt and criteria descriptions, then flip it to automatic when you trust the scores.
The system prompt is optional; use it to give the judge LLM the context it needs to evaluate accuracy in your domain. For example:

This agent processes medical insurance claims. A correct response must include the claim number, the patient name, and the determination (approved, denied, or pending review). Be strict on missing fields and lenient on phrasing; the response goes to a downstream system, not directly to the patient.
Score any run on demand, regardless of trigger mode or sample rate. Useful for spot-checks, re-scoring after editing criteria, or testing a new rubric on past runs before flipping to automatic.
- Pick from recent completed runs in the dropdown
- Or paste a run ID directly for older runs
- The agent on the run must match the judge's agent
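In code form, those constraints could look like the sketch below. The score_run_on_demand entry point and the run/judge shapes are hypothetical, since the flow above is driven from the dashboard.

```python
def enqueue_evaluation(run: dict, judge: dict) -> None:
    # Same stand-in queue as automatic scoring.
    print(f"queued judge {judge['name']} for run {run['id']}")

def score_run_on_demand(run: dict, judge: dict) -> None:
    """Manual scoring ignores trigger mode and sample rate on purpose;
    the one hard rule is that the run's agent matches the judge's."""
    if run["agent"] != judge["agent"]:
        raise ValueError(f"run belongs to {run['agent']!r}, "
                         f"but the judge is scoped to {judge['agent']!r}")
    if run["status"] != "completed":
        raise ValueError("only completed runs can be scored")
    enqueue_evaluation(run, judge)
```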