A/B Testing
Compare agent variants side-by-side by routing a configurable percentage of traffic to test agents, then measure cost, latency, and quality differences.
Overview
A/B testing lets you run two versions of an agent simultaneously and compare their performance. A percentage of incoming traffic is routed to a test variant while the rest goes to the original (control). Connic tracks cost, latency, success rate, and judge scores for both, so you can make informed decisions about which version to keep.
Traffic Splitting
Route a configurable percentage of requests to each variant. Run multiple tests concurrently with independent splits.
Side-by-Side Comparison
Compare cost, latency, success rate, and judge scores between control and variant in real time.
Auto-Rollback
Set failure rate thresholds to automatically pause underperforming variants before they impact users.
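The rollback rule described above can be sketched in a few lines. This is an illustrative model, not Connic's implementation; the `threshold` and `min_runs` defaults are assumptions for the example.

```python
# Hypothetical sketch of the auto-rollback rule: pause a variant once its
# failure rate exceeds a configured threshold, after a minimum number of runs.
from dataclasses import dataclass

@dataclass
class VariantStats:
    runs: int
    failures: int

def should_rollback(stats: VariantStats, threshold: float = 0.2, min_runs: int = 20) -> bool:
    """Return True when the variant should be paused.

    `threshold` and `min_runs` are illustrative defaults, not Connic's.
    """
    if stats.runs < min_runs:  # not enough data to judge yet
        return False
    return stats.failures / stats.runs > threshold

print(should_rollback(VariantStats(runs=50, failures=15)))  # 0.3 > 0.2 -> True
```

Requiring a minimum run count before acting keeps a single early failure from pausing a variant that has barely received traffic.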
Creating Test Agents
Test variants are regular agent YAML files that follow a naming convention: `{base-agent}-test-{name}`. The part before `-test-` must match the name of an existing base agent; the part after it is the test identifier.
```
agents/
├── order-processor.yaml                    # base agent
├── order-processor-test-faster-model.yaml  # variant: "faster-model"
├── order-processor-test-new-prompt.yaml    # variant: "new-prompt"
└── support-agent.yaml                      # unrelated agent (not affected)
```

The base agent stays exactly the same. The variant can change anything: model, instructions, tools, temperature, and so on.
```yaml
# order-processor.yaml (base agent)
name: order-processor
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders.process
  - inventory.check
```

```yaml
# order-processor-test-faster-model.yaml (variant)
name: order-processor-test-faster-model
model: gemini/gemini-2.5-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders.process
  - inventory.check
```

If an agent name contains `-test-` but no matching base agent exists, the deployment will fail with an error.
Tool Versioning
Since each agent references tools by module path, you can point a variant at a different tool module to test new implementations. Just create a new tool file and reference it in the variant.
```
tools/
├── orders.py      # current implementation
├── orders_v2.py   # experimental implementation
└── inventory.py
```

```yaml
# order-processor-test-new-tools.yaml
name: order-processor-test-new-tools
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders_v2.process  # different tool module
  - inventory.check
```

When a request comes in for an agent with active A/B tests, Connic randomly assigns it to the control or a variant based on the configured traffic percentages. When sessions are configured, the assignment is sticky: all requests in the same session see the same version. If a test is paused or concluded, sticky sessions gracefully fall back to the base agent.
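Sticky assignment is typically achieved by hashing the session ID into a fixed bucket range, so the same session always lands on the same side of the split. The sketch below illustrates the idea under that assumption; it is not Connic's implementation.

```python
# Illustrative sketch (not Connic's implementation) of sticky traffic splitting:
# hash the session ID into [0, 100) so the same session always gets the same
# assignment, and fall back to control for paused or concluded variants.
import hashlib

def assign_variant(session_id: str, splits: dict[str, float], active: set[str]) -> str:
    """Pick "control" or a variant name. `splits` maps variant -> percentage."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    cumulative = 0.0
    for variant, pct in sorted(splits.items()):
        cumulative += pct
        if bucket < cumulative:
            # Paused/concluded tests fall back to the base agent (control).
            return variant if variant in active else "control"
    return "control"

choice = assign_variant("session-42", {"faster-model": 10.0}, active={"faster-model"})
# Sticky: repeating the call for the same session gives the same answer.
assert choice == assign_variant("session-42", {"faster-model": 10.0}, {"faster-model"})
```

Hashing rather than storing per-session state keeps the router stateless: any instance computes the same assignment for a given session ID.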
Configuring Tests
After deploying your test variant, open the base agent's detail page and click Manage A/B Tests in the header.
1. Deploy your test variant: push the variant agent YAML alongside the base agent. After deployment, it will appear as an available variant.
2. Create a new test: click New Test, select a deployed variant from the dropdown, and configure the traffic percentage, minimum sample size, and auto-rollback.
3. Start the test: tests are created in Draft status. Click Start to begin routing traffic to the variant.
4. Monitor and conclude: click on a test to see side-by-side comparison metrics. When you have enough data, conclude the test and optionally declare a winner.
Configuration options
Reading Results
The test detail view shows a side-by-side comparison of metrics for control vs. variant.
Runs in the history table show the test variant name as a pill next to the status badge, so you can quickly tell which runs were part of an A/B test.
Deployment & Test Lifecycle
When a new deployment activates, Connic checks all running and paused A/B tests. If a test's variant agent is no longer present in the new deployment, the test is automatically marked as Failed. This means you can safely remove a variant from your codebase and deploy — the test will be cleaned up automatically.
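The cleanup pass described above can be sketched as a simple check of each test's variant against the set of deployed agents. This is an illustrative model, not Connic's internal code; the field names are assumptions.

```python
# Hypothetical sketch of the post-deployment cleanup pass: mark any running or
# paused test as failed when its variant agent is absent from the deployment.
def clean_up_tests(tests: list[dict], deployed_agents: set[str]) -> list[dict]:
    for test in tests:
        if test["status"] in {"running", "paused"} and test["variant"] not in deployed_agents:
            test["status"] = "failed"
    return tests

tests = [
    {"variant": "order-processor-test-faster-model", "status": "running"},
    {"variant": "order-processor-test-new-prompt", "status": "running"},
]
deployed = {"order-processor", "order-processor-test-faster-model"}
clean_up_tests(tests, deployed)
# The "new-prompt" test is marked failed; the "faster-model" test keeps running.
```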
Best Practices
- Start with low traffic: Begin with 5–10% to catch obvious issues before scaling up
- Use judges: Configure judges on the base agent to get automated quality scores for both groups
- Set a minimum sample size: 50–100 runs per group gives more reliable comparisons
- Change one thing at a time: For the clearest signal, each variant should differ in one dimension (model, prompt, or tools)
- Enable auto-rollback in production: Set a failure rate threshold to automatically pause problematic variants
- Test in staging first: Validate that the variant works before running a production A/B test