Connic

Ship the variant
that actually works

Route a percentage of production traffic to a test variant. Compare runs, cost, duration, success rate, and judge scores side-by-side. Auto-rollback when failure rate spikes.

Read the A/B testing docs

order-handler

A/B test · 4,128 runs · running

control · 50% traffic
49% success rate

test-friendly-tone · 50% traffic
68% success rate · +19pp vs control
Define a test variant

A variant is just another agent

Test variants follow the naming convention {base-agent}-test-{name}. Deploy the variant alongside the base agent. Then configure traffic split, minimum sample size, and auto-rollback in the dashboard.

agents/order-handler-test-friendly-tone.yaml
name: order-handler-test-friendly-tone
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders with a warm,
  friendly tone...
tools:
  - orders.process
  - inventory.check
order-handler · Manage A/B Tests
running · 4,128 runs
order-handler (base) · 50%
test-friendly-tone · 50%
started 2 days ago
Side-by-side comparison

Runs, cost, duration, quality

Connic tracks run count, average cost, duration (average plus P50/P95), success rate, and judge scores for both control and variant. Quality comes from the judges you've already configured on the base agent.

test-friendly-tone vs control
Success rate: 68% vs 49% (+19pp)
Avg duration (P95): 2.3s vs 2.1s (+0.2s)
Avg cost per run: $0.021 vs $0.018 (+$0.003)
Judge score: 18/20 vs 16/20 (+2)
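
For intuition, here's a rough Python sketch of how those side-by-side numbers could be computed from raw run records. The field names and sample data (group, ok, duration_s, cost_usd) are made up for illustration and aren't Connic's schema.

# Illustrative only: deriving the side-by-side numbers from raw run records.
# Field names and sample values are assumptions, not Connic's schema.
import math

runs = [
    {"group": "control", "ok": True,  "duration_s": 1.9, "cost_usd": 0.017},
    {"group": "control", "ok": False, "duration_s": 2.1, "cost_usd": 0.018},
    {"group": "variant", "ok": True,  "duration_s": 2.2, "cost_usd": 0.021},
    {"group": "variant", "ok": True,  "duration_s": 2.4, "cost_usd": 0.022},
]

def summarize(group):
    rows = [r for r in runs if r["group"] == group]
    durations = sorted(r["duration_s"] for r in rows)
    return {
        "success_rate": sum(r["ok"] for r in rows) / len(rows),
        "p95_duration_s": durations[math.ceil(0.95 * len(durations)) - 1],  # nearest-rank P95
        "avg_cost_usd": sum(r["cost_usd"] for r in rows) / len(rows),
    }

control, variant = summarize("control"), summarize("variant")
delta_pp = (variant["success_rate"] - control["success_rate"]) * 100
print(f"Success rate: {variant['success_rate']:.0%} vs {control['success_rate']:.0%} ({delta_pp:+.0f}pp)")
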
Lifecycle and decisions

Draft, run, conclude — or auto-pause

Tests start in Draft, route traffic once you click Start, and run until you conclude them. Auto-rollback pauses any variant whose failure rate exceeds the threshold within a rolling window, before users notice.

Recent A/B tests
  • test-friendly-tone
    on order-handler · 4,128 runs · winner declared · success rate 68% vs 49%
    concluded
  • test-haiku
    on support-triage · 8,901 runs · 10% traffic · judge scores within 1pt of base
    running
  • test-shorter-prompt
    on invoice-processor · 0 runs · ready to start · 20% traffic configured
    draft
  • test-aggressive-retries
    on fraud-detector · 612 runs · auto-rollback · failure rate exceeded threshold
    paused

Set a minimum sample size so results aren't declared meaningful too early. Configure auto-rollback so a failing variant pauses itself. When you've got enough data, conclude the test and optionally declare a winner.
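
As a rough Python sketch of that lifecycle and the minimum-sample-size guard (the state names and helper below are illustrative, not Connic's internals):

# Sketch of the test lifecycle described above; names are illustrative.
from enum import Enum

class TestState(Enum):
    DRAFT = "draft"          # configured, not routing traffic yet
    RUNNING = "running"      # routing its configured traffic percentage
    PAUSED = "paused"        # manually paused or auto-rollback tripped
    CONCLUDED = "concluded"  # finished, optionally with a declared winner

def results_are_meaningful(control_runs, variant_runs, min_sample_size):
    # Results stay "not yet meaningful" until BOTH groups hit the minimum.
    return min(control_runs, variant_runs) >= min_sample_size

# With a min sample size of 100, a lopsided 140/60 split isn't enough yet:
assert results_are_meaningful(140, 60, min_sample_size=100) is False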

Operating an experiment

The buttons you reach for at 2am

A/B tests are operational, not just analytical. Auto-rollback, concurrent tests, and sticky sessions are the controls you reach for when a variant is live in production.

Auto-rollback

Set a failure rate threshold. If the variant exceeds it within a rolling window of runs, the test pauses itself before more users hit a broken variant.
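
Conceptually, that's a rolling-window check like this Python sketch (threshold, window size, and names are illustrative defaults, not Connic's implementation):

# Sketch of a rolling-window failure-rate check; values are illustrative.
from collections import deque

class AutoRollback:
    def __init__(self, failure_rate_threshold=0.25, window_size=50):
        self.threshold = failure_rate_threshold
        self.window = deque(maxlen=window_size)  # True = failed variant run

    def record(self, failed):
        """Record one variant run; return True if the test should pause itself."""
        self.window.append(failed)
        if len(self.window) < self.window.maxlen:
            return False  # window not full yet, keep routing traffic
        return sum(self.window) / len(self.window) > self.threshold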

Multiple tests, one agent

Run several tests against the same base agent at once. Their traffic percentages are summed and must total 100% or less. The remainder always routes to control.
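
A rough Python sketch of how that split could be resolved per request (the variant names and shares are hypothetical; this isn't Connic's router):

# Sketch of resolving a traffic split across concurrent tests on one agent.
import random

def pick_variant(active_tests):
    """active_tests maps variant name -> traffic share; shares must total <= 1.0."""
    assert sum(active_tests.values()) <= 1.0, "active tests must total 100% or less"
    r = random.random()
    cumulative = 0.0
    for variant, share in active_tests.items():
        cumulative += share
        if r < cumulative:
            return variant
    return "base"  # the remainder always routes to control

pick_variant({"test-friendly-tone": 0.5, "test-shorter-prompt": 0.2})  # remaining 30% goes to base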

Sticky sessions

When sessions are configured, every request in the same session sees the same variant. If a test is paused or concluded, sticky sessions fall back to the base agent.
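
A minimal Python sketch of the idea, assuming assignment by hashing the session id (the hashing scheme is an assumption, not Connic's implementation):

# Sketch of sticky-session assignment: the same session id always lands in the
# same bucket while the test is running.
import hashlib

def variant_for_session(session_id, variant, traffic_share, test_running):
    if not test_running:
        return "base"  # paused or concluded tests fall back to the base agent
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return variant if bucket < traffic_share * 10_000 else "base"

# Every request in session "sess-42" gets the same answer while the test runs:
variant_for_session("sess-42", "test-friendly-tone", 0.5, test_running=True)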

Frequently Asked Questions

How many runs do I need before results are meaningful?
Configure a Min sample size on the test. Connic doesn't surface results as meaningful until each group has that many completed runs. The docs suggest 50–100 runs per group as a starting point for reliable comparisons.

Can I run more than one test on the same agent at once?
Yes. Each test gets its own traffic percentage. The percentages of all active tests on an agent must sum to 100% or less; the remainder routes to the base agent.

Will a user see the same variant for their whole session?
Yes, when sessions are configured. Every request in the same session sees the same variant. If a test is paused or concluded mid-session, the session falls back to the base agent.

How do I stop a variant that's misbehaving?
Two paths. Automatic: enable auto-rollback with a failure rate threshold, and the test pauses itself if the variant exceeds it within a rolling window. Manual: pause or conclude the test from the dashboard at any time.

How is this different from a feature flag?
Feature flags route traffic. A/B testing routes traffic and gives you a side-by-side comparison of cost, duration, success rate, and judge scores between control and variant. Auto-rollback on failure rate means a bad variant pauses itself.

How do judge scores factor into a test?
Judge scores are reported alongside cost, duration, and success rate in the side-by-side comparison. Configure judges on the base agent and they evaluate runs from both groups, so you can see how each variant scores against your own rubric.