Connic

Ship the variant
that actually works

Route a percentage of production traffic to a test variant. Compare runs, cost, duration, success rate, and judge scores side-by-side. Auto-rollback when failure rate spikes.

Read the A/B testing docs

order-handler

A/B test · 4,128 runs · running

control · 50% traffic
49% success rate

test-friendly-tone · 50% traffic
68% success rate · +19pp vs control
Define a test variant

A variant is just another agent

Test variants follow the naming convention {base-agent}-test-{name}. Deploy the variant alongside the base agent. Then configure traffic split, minimum sample size, and auto-rollback in the dashboard.

agents/order-handler-test-friendly-tone.yaml
name: order-handler-test-friendly-tone
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders with a warm,
  friendly tone...
tools:
  - orders.process
  - inventory.check
order-handler · Manage A/B Tests
running · 4,128 runs
order-handler (base) · 50%
test-friendly-tone · 50%
started 2 days ago
Side-by-side comparison

Runs, cost, duration, quality

Connic tracks run count, average cost, duration (average plus P50/P95), success rate, and judge scores for both control and variant. Quality comes from the judges you've already configured on the base agent.

test-friendly-tone vs control
Success rate: 68% vs 49% (+19pp)
Avg duration (P95): 2.3s vs 2.1s (+0.2s)
Avg cost per run: $0.021 vs $0.018 (+$0.003)
Judge score: 18/20 vs 16/20 (+2)
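
For intuition, here's a rough Python sketch of how those side-by-side numbers could be computed from raw run records. The field names and sample data (group, ok, duration_s, cost_usd) are made up for illustration and aren't Connic's schema.

# Illustrative only: deriving the side-by-side numbers from raw run records.
# Field names and sample values are assumptions, not Connic's schema.
import math

runs = [
    {"group": "control", "ok": True,  "duration_s": 1.9, "cost_usd": 0.017},
    {"group": "control", "ok": False, "duration_s": 2.1, "cost_usd": 0.018},
    {"group": "variant", "ok": True,  "duration_s": 2.2, "cost_usd": 0.021},
    {"group": "variant", "ok": True,  "duration_s": 2.4, "cost_usd": 0.022},
]

def summarize(group):
    rows = [r for r in runs if r["group"] == group]
    durations = sorted(r["duration_s"] for r in rows)
    return {
        "success_rate": sum(r["ok"] for r in rows) / len(rows),
        "p95_duration_s": durations[math.ceil(0.95 * len(durations)) - 1],  # nearest-rank P95
        "avg_cost_usd": sum(r["cost_usd"] for r in rows) / len(rows),
    }

control, variant = summarize("control"), summarize("variant")
delta_pp = (variant["success_rate"] - control["success_rate"]) * 100
print(f"Success rate: {variant['success_rate']:.0%} vs {control['success_rate']:.0%} ({delta_pp:+.0f}pp)")
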
Lifecycle and decisions

Draft, run, conclude — or auto-pause

Tests start in Draft, route traffic once you click Start, and run until you conclude them. Auto-rollback pauses any variant whose failure rate exceeds the threshold within a rolling window, before users notice.

Recent A/B tests
  • test-friendly-tone
    on order-handler · 4,128 runs · winner declared · success rate 68% vs 49%
    concluded
  • test-haiku
    on support-triage · 8,901 runs · 10% traffic · judge scores within 1pt of base
    running
  • test-shorter-prompt
    on invoice-processor · 0 runs · ready to start · 20% traffic configured
    draft
  • test-aggressive-retries
    on fraud-detector · 612 runs · auto-rollback · failure rate exceeded threshold
    paused

Set a minimum sample size so results aren't declared meaningful too early. Configure auto-rollback so a failing variant pauses itself. When you've got enough data, conclude the test and optionally declare a winner.
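
As a rough Python sketch of that lifecycle and the minimum-sample-size guard (the state names and helper below are illustrative, not Connic's internals):

# Sketch of the test lifecycle described above; names are illustrative.
from enum import Enum

class TestState(Enum):
    DRAFT = "draft"          # configured, not routing traffic yet
    RUNNING = "running"      # routing its configured traffic percentage
    PAUSED = "paused"        # manually paused or auto-rollback tripped
    CONCLUDED = "concluded"  # finished, optionally with a declared winner

def results_are_meaningful(control_runs, variant_runs, min_sample_size):
    # Results stay "not yet meaningful" until BOTH groups hit the minimum.
    return min(control_runs, variant_runs) >= min_sample_size

# With a min sample size of 100, a lopsided 140/60 split isn't enough yet:
assert results_are_meaningful(140, 60, min_sample_size=100) is False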

Operating an experiment

The buttons you reach for at 2am

A/B tests are operational, not just analytical. Auto-rollback, concurrent tests, and sticky sessions are the controls you reach for when a variant is live in production.

Auto-rollback

Set a failure rate threshold. If the variant exceeds it within a rolling window of runs, the test pauses itself before more users hit a broken variant.
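
Conceptually, that's a rolling-window check like this Python sketch (threshold, window size, and names are illustrative defaults, not Connic's implementation):

# Sketch of a rolling-window failure-rate check; values are illustrative.
from collections import deque

class AutoRollback:
    def __init__(self, failure_rate_threshold=0.25, window_size=50):
        self.threshold = failure_rate_threshold
        self.window = deque(maxlen=window_size)  # True = failed variant run

    def record(self, failed):
        """Record one variant run; return True if the test should pause itself."""
        self.window.append(failed)
        if len(self.window) < self.window.maxlen:
            return False  # window not full yet, keep routing traffic
        return sum(self.window) / len(self.window) > self.threshold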

Multiple tests, one agent

Run several tests against the same base agent at once. Their traffic percentages are summed and must total 100% or less. The remainder always routes to control.
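
A rough Python sketch of how that split could be resolved per request (the variant names and shares are hypothetical; this isn't Connic's router):

# Sketch of resolving a traffic split across concurrent tests on one agent.
import random

def pick_variant(active_tests):
    """active_tests maps variant name -> traffic share; shares must total <= 1.0."""
    assert sum(active_tests.values()) <= 1.0, "active tests must total 100% or less"
    r = random.random()
    cumulative = 0.0
    for variant, share in active_tests.items():
        cumulative += share
        if r < cumulative:
            return variant
    return "base"  # the remainder always routes to control

pick_variant({"test-friendly-tone": 0.5, "test-shorter-prompt": 0.2})  # remaining 30% goes to base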

Sticky sessions

When sessions are configured, every request in the same session sees the same variant. If a test is paused or concluded, sticky sessions fall back to the base agent.
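
A minimal Python sketch of the idea, assuming assignment by hashing the session id (the hashing scheme is an assumption, not Connic's implementation):

# Sketch of sticky-session assignment: the same session id always lands in the
# same bucket while the test is running.
import hashlib

def variant_for_session(session_id, variant, traffic_share, test_running):
    if not test_running:
        return "base"  # paused or concluded tests fall back to the base agent
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return variant if bucket < traffic_share * 10_000 else "base"

# Every request in session "sess-42" gets the same answer while the test runs:
variant_for_session("sess-42", "test-friendly-tone", 0.5, test_running=True)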

Frequently Asked Questions

How many runs do I need before results are meaningful?
Configure a Min sample size on the test. Connic doesn't surface results as meaningful until each group has that many completed runs. The docs suggest 50–100 runs per group as a starting point for reliable comparisons.

Can I run more than one test on the same agent at once?
Yes. Each test gets its own traffic percentage. The percentages of all active tests on an agent must sum to 100% or less; the remainder routes to the base agent.

Will a user see the same variant for their whole session?
Yes, when sessions are configured. Every request in the same session sees the same variant. If a test is paused or concluded mid-session, the session falls back to the base agent.

How do I stop a variant that's misbehaving?
Two paths. Automatic: enable auto-rollback with a failure rate threshold, and the test pauses itself if the variant exceeds it within a rolling window. Manual: pause or conclude the test from the dashboard at any time.

How is this different from a feature flag?
Feature flags route traffic. A/B testing routes traffic and gives you a side-by-side comparison of cost, duration, success rate, and judge scores between control and variant. Auto-rollback on failure rate means a bad variant pauses itself.

How do judge scores factor into a test?
Judge scores are reported alongside cost, duration, and success rate in the side-by-side comparison. Configure judges on the base agent and they evaluate runs from both groups, so you can see how each variant scores against your own rubric.