Connic

Score every run.
Catch drift early.

Define what 'good' looks like as a structured rubric, then let an LLM score every production run against it. Configured per agent in the dashboard.

Read the judges docs

Judges

Last 24h

Status    Judge                                     Avg Score    Evaluated
Active    answer_accuracy on support_agent          86%          412 runs
Active    tool_usage on support_agent               91%          412 runs
Active    tone_match on invoice-processor           64%          218 runs
Active    completeness on support_agent             89%          412 runs
Active    response_quality on invoice-processor     97%          218 runs
Anatomy of a judge

A scoring rubric, configured in the dashboard

Each judge is scoped to one agent. Define named criteria with descriptions and max scores. The judge LLM scores every criterion independently and shows its reasoning.

New Judge

  Name: Invoice Quality Check
  Agent: invoice-processor
  Model: anthropic/claude-sonnet-4-20250514
  Trigger: Automatic
  Sample rate: 20%
  Filters: status equals completed

Scoring criteria: 17 / 20

  Accuracy: 9 / 10
  Did the agent produce a factually correct and complete response based on the input?

  Tool Usage: 4 / 5
  Did the agent use the appropriate tools and interpret their results correctly?

  Response Quality: 4 / 5
  Is the response well-structured, clear, and appropriately formatted?
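
The total above is simply the sum of per-criterion scores over the sum of their max scores, and each criterion also carries the judge LLM's reasoning. A minimal sketch of that rollup; the data shape and field names here are illustrative, not Connic's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CriterionScore:
    name: str        # e.g. "Accuracy"
    score: int       # what the judge LLM awarded
    max_score: int   # the ceiling you defined for this criterion
    reasoning: str   # the judge's explanation for the score

def overall(scores: list[CriterionScore]) -> tuple[int, int]:
    """Roll per-criterion scores up into the judge's total."""
    return sum(s.score for s in scores), sum(s.max_score for s in scores)

scores = [
    CriterionScore("Accuracy", 9, 10, "Claim number and determination are correct."),
    CriterionScore("Tool Usage", 4, 5, "Used the lookup tool; one redundant call."),
    CriterionScore("Response Quality", 4, 5, "Clear, but missing a summary line."),
]
print(overall(scores))  # (17, 20) -> the "17 / 20" shown above
```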

Configuration

Everything you control on a judge

Connic doesn't ship a fixed judge library. You write the rubric for your agent. These are the levers each judge gives you.

Custom criteria

Define one or more named criteria, each with a description and a max score. The judge LLM scores every criterion independently, with reasoning per score.

Sample rate

Pick the percentage of matching runs to evaluate (1-100%). Set 100% to score every run, or lower to control cost on high-volume agents.

Run filters

Narrow which runs are eligible using equals, not equals, contains, and exists operators on status, agent name, or context fields.
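
Conceptually, each filter is a field, an operator, and a value, and a run must pass every filter to be eligible. A rough sketch of that matching, assuming run fields flattened into a dict; this is an illustration, not Connic's implementation:

```python
def passes(run: dict, field: str, op: str, value=None) -> bool:
    """Evaluate one filter against a run's fields (e.g. status, agent_name, context.tier)."""
    actual = run.get(field)
    if op == "equals":
        return actual == value
    if op == "not equals":
        return actual != value
    if op == "contains":
        return value in str(actual or "")
    if op == "exists":
        return actual is not None
    raise ValueError(f"unknown operator: {op}")

run = {"status": "completed", "agent_name": "invoice-processor", "context.tier": "enterprise"}
filters = [("status", "equals", "completed"), ("context.tier", "exists", None)]
eligible = all(passes(run, f, op, v) for f, op, v in filters)  # True
```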

Automatic or manual

Automatic judges score runs as they complete. Manual judges score only on demand, which is useful for spot-checks or re-scoring after editing the rubric.

Score alerts

Get notified in-app or by email when the judge's average score drops below a threshold over the last 1, 10, 50, or 100 runs, or over all runs to date.
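
Put differently, the alert watches the mean score over a trailing window of evaluations and fires when it falls below your threshold. A simplified sketch of that check; this is a hypothetical helper, not the actual alerting code:

```python
def alert_triggered(recent_scores: list[float], threshold: float, window: int | None) -> bool:
    """recent_scores: percentages for this judge, newest last. window: 1, 10, 50, 100, or None for all-time."""
    considered = recent_scores if window is None else recent_scores[-window:]
    if not considered:
        return False
    return sum(considered) / len(considered) < threshold

# The last 10 evaluations average 75.6% against an 80% threshold -> notify.
print(alert_triggered([88, 85, 90, 70, 68, 71, 66, 74, 69, 75], threshold=80, window=10))  # True
```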

Per-criterion averages

Each criterion is tracked over time on the judge detail page, so you can see which aspects of the agent are slipping.

See the docs for the full reference.

Where judges fit

From completed runs to actionable signal

When a run finishes, the backend checks every active judge for the agent, applies filters and sample rate, and queues an evaluation. Scores attach to the run and show up in A/B tests, alerts, and run detail.

  • Filters and sample rate are applied; eligible runs are queued for evaluation.
  • The judge LLM scores each criterion; scores, reasoning, and token usage are saved against the run.
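
Put together, the flow for one completed run looks roughly like this. The judge and run shapes and the enqueue_evaluation callback are illustrative stand-ins, not Connic internals:

```python
import random

def on_run_completed(run: dict, judges: list[dict], enqueue_evaluation) -> None:
    """Roughly what happens when a run finishes: filter, sample, queue."""
    for judge in judges:
        if not judge["active"] or judge["agent"] != run["agent_name"]:
            continue  # judges are scoped to a single agent
        if judge["trigger"] != "automatic":
            continue  # manual judges score only on demand
        if not all(matches(run) for matches in judge["filters"]):
            continue  # filters: predicates built from equals / not equals / contains / exists
        if random.random() * 100 >= judge["sample_rate"]:
            continue  # a sample rate of 20 keeps roughly 20% of eligible runs
        # A worker later asks the judge LLM to score each criterion and saves the
        # scores, reasoning, and token usage against the run.
        enqueue_evaluation(judge, run)
```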

  • A/B tests: average judge scores show up in the side-by-side comparison between control and variant.
  • Score alerts: get an in-app or email notification when a judge's rolling average drops below your threshold.
  • Run detail: per-criterion scores and reasoning appear at the top of the run detail dialog, with a score pill in the header.

Tune as you learn

Iterate on the rubric, re-score the runs you care about

Start a judge in manual mode. Score a handful of real runs, refine the system prompt and criteria descriptions, then flip it to automatic when you trust the scores.

System prompt
Domain context for the judge LLM
This agent processes medical insurance claims.
A correct response must include the claim number,
the patient name, and the determination
(approved, denied, or pending review).

Be strict on missing fields and lenient on
phrasing — the response goes to a downstream
system, not directly to the patient.

Optional. Use it to give the judge LLM the context it needs to evaluate accuracy in your domain.

Trigger Manually

  Run: run_a1b2c3 - invoice-processor - completed
  Status: queued

Score any run on demand, regardless of trigger mode or sample rate. Useful for spot-checks, re-scoring after editing criteria, or testing a new rubric on past runs before flipping to automatic.

  • Pick from recent completed runs in the dropdown
  • Or paste a run ID directly for older runs
  • The agent on the run must match the judge's agent

Frequently Asked Questions

How much do evaluations cost?

Each evaluation is one LLM call (billed by your provider) plus one Connic billable run. The per-judge sample rate (1-100%) controls both. For high-volume agents, 10-20% is usually enough to track quality trends.

Where do I create a judge?

In the Judges tab of your project. Click New Judge, pick the agent, choose a model, and write a system prompt and one or more named criteria with descriptions and max scores. Then set the trigger mode (Automatic or Manual), sample rate, and optional run filters.

How do I know my rubric is good?

Start in Manual mode and trigger evaluations on a handful of recent runs. Inspect the per-criterion scores and reasoning, refine the system prompt or criteria descriptions, then switch to Automatic when you trust the scores.

Can I limit which runs get evaluated?

Yes. Each judge has filters with equals, not equals, contains, and exists operators against fields like status, agent_name, or any context.* property your middleware sets. For example: only evaluate runs where context.tier equals enterprise.

How do score alerts work?

Each judge has an optional Score Alert with a threshold percentage and an averaging window (1, 10, 50, 100, or all-time). The alert fires once when the average crosses below the threshold. Delivery is in-app and by email, based on each member's notification preferences.

Which models can judges use?

Any provider you've already configured for agents: OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Vertex AI, OpenRouter, or any custom OpenAI-compatible provider. Judges use the same provider/model-name format and the same project API keys as your agents.