
A/B Testing

Compare agent variants side-by-side by routing a configurable percentage of traffic to test agents, then measure cost, latency, and quality differences.


Overview

A/B testing lets you run two versions of an agent simultaneously and compare their performance. A percentage of incoming traffic is routed to a test variant while the rest goes to the original (control). Connic tracks cost, latency, success rate, and judge scores for both, so you can make informed decisions about which version to keep.

Traffic Splitting
Route a configurable percentage of requests to each variant. Run multiple tests concurrently with independent splits.
Side-by-Side Comparison
Compare cost, latency, success rate, and judge scores between control and variant in real time.
Auto-Rollback
Set failure rate thresholds to automatically pause underperforming variants before they impact users.

Creating Test Agents

Test variants are regular agent YAML files that follow a naming convention: {base-agent}-test-{name}. The part before -test- must match an existing base agent name. The part after is the test identifier.

agents/
  order-processor.yaml                      # base agent
  order-processor-test-faster-model.yaml    # variant: "faster-model"
  order-processor-test-new-prompt.yaml      # variant: "new-prompt"
  support-agent.yaml                        # unrelated agent (not affected)

The base agent stays exactly the same. The variant can change anything: model, instructions, tools, temperature, etc.

agents/order-processor.yaml
name: order-processor
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders.process
  - inventory.check
agents/order-processor-test-faster-model.yaml
name: order-processor-test-faster-model
model: gemini/gemini-2.5-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders.process
  - inventory.check
Validation
If an agent name contains -test- but no matching base agent exists, the deployment will fail with an error.
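The naming rule and the deploy-time check can be expressed mechanically. A minimal sketch in Python (the function names are illustrative, not part of Connic's API):

```python
def split_test_name(agent_name):
    """Split "{base}-test-{id}" into (base, test_id); plain agents map to (name, None)."""
    base, sep, test_id = agent_name.partition("-test-")
    return (base, test_id) if sep and base and test_id else (agent_name, None)

def validate(agent_names):
    """Reject variants whose base agent is missing, mirroring the deploy-time check."""
    names = set(agent_names)
    for name in agent_names:
        base, test_id = split_test_name(name)
        if test_id is not None and base not in names:
            raise ValueError(f"{name}: no base agent named {base}")

validate(["order-processor", "order-processor-test-faster-model"])  # passes
```

Note that the split happens at the first `-test-`, so everything before it must exactly match a deployed base agent's name.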

Tool Versioning

Since each agent references tools by module path, you can point a variant at a different tool module to test new implementations. Create a new tool file and reference it in the variant.

tools/
  orders.py         # current implementation
  orders_v2.py      # experimental implementation
  inventory.py
agents/order-processor-test-new-tools.yaml
name: order-processor-test-new-tools
model: gemini/gemini-2.0-flash
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders_v2.process    # different tool module
  - inventory.check
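For the module swap to be a drop-in replacement, the experimental module should expose the same entry points as the original. A hypothetical sketch of `tools/orders_v2.py` (the `process` signature and return shape here are illustrative; match whatever your real `orders.py` defines):

```python
# tools/orders_v2.py: experimental implementation of the orders tool.
# Exposes the same `process` entry point as tools/orders.py, so the
# variant agent can reference `orders_v2.process` with no other changes.

def process(order_id: str) -> dict:
    """Process an order using the experimental pipeline."""
    # ... new implementation goes here ...
    return {"order_id": order_id, "status": "processed", "pipeline": "v2"}
```

Because both modules share the same interface, concluding the test is just a matter of either deleting `orders_v2.py` or promoting its contents into `orders.py`.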
How Traffic Routing Works
When a request comes in for an agent with active A/B tests, Connic randomly assigns it to the control or a variant based on the configured traffic percentages. When sessions are configured, the assignment is sticky: all requests in the same session see the same version. If a test is paused or concluded, sticky sessions fall back to the base agent.
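Sticky assignment of this kind is commonly implemented by hashing the session ID into a bucket in [0, 100), so every request in a session deterministically lands in the same group without storing assignments. The sketch below illustrates that general technique; it is not Connic's actual implementation:

```python
import hashlib

def assign_variant(session_id, tests):
    """tests: list of (variant_name, traffic_pct); remaining traffic goes to control."""
    # Hash the session ID into a stable bucket in [0, 100).
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    cumulative = 0
    for variant, pct in tests:
        cumulative += pct
        if bucket < cumulative:
            return variant
    return "control"

# The same session always lands in the same group:
print(assign_variant("session-42", [("faster-model", 10), ("new-prompt", 20)]))
```

With this scheme, pausing a test simply removes its entry from the list, and those sessions' buckets fall through to control, matching the fallback behavior described above.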

Configuring Tests

After deploying your test variant, open the base agent's detail page and click Manage A/B Tests in the header.

1. Deploy your test variant
   Push the variant agent YAML alongside the base agent. After deployment, it will appear as an available variant.

2. Create a new test
   Click New Test, select a deployed variant from the dropdown, and configure traffic percentage, minimum sample size, and auto-rollback.

3. Start the test
   Tests are created in Draft status. Click Start to begin routing traffic to the variant.

4. Monitor and conclude
   Click on a test to see side-by-side comparison metrics. When you have enough data, conclude the test and optionally declare a winner.

Configuration options
  • Traffic %: Percentage of requests routed to the variant (0–100). When multiple tests are active, their percentages are summed and must total ≤100%.
  • Min sample size: Minimum number of completed runs per group before results are considered meaningful.
  • Auto-rollback: Automatically pause the test if the variant's failure rate exceeds a configurable threshold within a rolling window of runs.
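Because concurrent tests share one traffic budget, a quick sanity check before starting a new test is to sum the active percentages. A minimal sketch (function and argument names are illustrative):

```python
def can_start(active_pcts, new_pct):
    """True if adding new_pct keeps total variant traffic within 100%."""
    return 0 <= new_pct <= 100 and sum(active_pcts) + new_pct <= 100

print(can_start([10, 20], 30))  # True: 60% total, 40% stays on control
print(can_start([50, 40], 20))  # False: would exceed 100%
```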

Reading Results

The test detail view shows a side-by-side comparison of control vs. variant:

  • Runs: Total completed runs for each group.
  • Avg Cost: Average token cost per run.
  • Avg Duration: Average end-to-end execution time.
  • P50 / P95: Median and 95th percentile duration. Useful for detecting tail latency.
  • Judge Scores: Average scores from configured judges on the base agent.
  • Success Rate: Percentage of runs that completed without errors.
Runs in the history table show the test variant name as a pill next to the status badge, so you can quickly tell which runs were part of an A/B test.

Deployment & Test Lifecycle

When a new deployment activates, Connic checks all running and paused A/B tests. If a test's variant agent is no longer present in the new deployment, the test is automatically marked as Failed. This means you can safely remove a variant from your codebase and deploy. The test will be cleaned up automatically.
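This lifecycle rule amounts to a reconciliation pass over tests at deploy time. A sketch of the behavior (field names are illustrative, not Connic's schema):

```python
def reconcile_tests(tests, deployed_agents):
    """Mark running/paused tests as failed when their variant agent is gone."""
    for test in tests:
        if test["status"] in ("running", "paused") and test["variant"] not in deployed_agents:
            test["status"] = "failed"
    return tests

tests = [
    {"status": "running", "variant": "order-processor-test-faster-model"},
    {"status": "paused",  "variant": "order-processor-test-new-prompt"},
]
deployed = {"order-processor", "order-processor-test-new-prompt"}
reconcile_tests(tests, deployed)
print([t["status"] for t in tests])  # → ['failed', 'paused']
```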

Best Practices
  • Start with low traffic: Begin with 5–10% to catch obvious issues before scaling up
  • Use judges: Configure judges on the base agent to get automated quality scores for both groups
  • Set a minimum sample size: 50–100 runs per group gives more reliable comparisons
  • Change one thing at a time: For the clearest signal, each variant should differ in one dimension (model, prompt, or tools)
  • Enable auto-rollback in production: Set a failure rate threshold to automatically pause problematic variants
  • Test in staging first: Validate the variant in a separate staging environment before running a production A/B test