- Ship
- A/B testing
Ship the variant
that actually works
Route a percentage of production traffic to a test variant. Compare runs, cost, duration, success rate, and judge scores side-by-side. Auto-rollback when failure rate spikes.
Read the A/B testing docsorder-handler
A/B test · 4,128 runs
A variant is just another agent
Test variants follow the naming convention {base-agent}-test-{name}. Deploy the variant alongside the base agent. Then configure traffic split, minimum sample size, and auto-rollback in the dashboard.
name: order-handler-test-friendly-tone
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
You process incoming orders with a warm,
friendly tone...
tools:
- orders.process
- inventory.checkRuns, cost, duration, quality
Connic tracks runs, average cost, average duration with P50/P95, success rate, and judge scores for both control and variant. Quality comes from the judges you've configured on the base agent.
Draft, run, conclude — or auto-pause
Tests start in Draft, route traffic once you click Start, and run until you conclude them. Auto-rollback pauses any variant whose failure rate exceeds the threshold within a rolling window, before users notice.
- concludedtest-friendly-toneon order-handler · 4,128 runs · winner declared · success rate 68% vs 49%
- runningtest-haikuon support-triage · 8,901 runs · 10% traffic · judge scores within 1pt of base
- drafttest-shorter-prompton invoice-processor · 0 runs · ready to start · 20% traffic configured
- pausedtest-aggressive-retrieson fraud-detector · 612 runs · auto-rollback · failure rate exceeded threshold
Set a minimum sample size so results aren't declared meaningful too early. Configure auto-rollback so a failing variant pauses itself. When you've got enough data, conclude the test and optionally declare a winner.
The buttons you reach for at 2am
A/B tests are operational, not just analytical. Auto-rollback, concurrent tests, and sticky sessions are the controls you reach for when a variant is live in production.
Set a failure rate threshold. If the variant exceeds it within a rolling window of runs, the test pauses itself before more users hit a broken variant.
Run several tests against the same base agent at once. Their traffic percentages are summed and must total 100% or less. The remainder always routes to control.
When sessions are configured, every request in the same session sees the same variant. If a test is paused or concluded, sticky sessions fall back to the base agent.