
A/B Testing for AI Agents: Ship Better Prompts with Confidence

You changed the prompt. It feels better. But is it actually better? Learn how to run controlled experiments on your AI agents and let real traffic decide.

March 27, 2026 · 9 min read

You rewrote the system prompt. You swapped from Gemini 2.5 to Sonnet 4.6. The agent feels faster, maybe a little sharper. So you ship it. Two days later, your support queue fills up. Turns out the new version handles edge cases worse. You just ran an uncontrolled experiment on your users.

AI agents are non-deterministic. The same input can produce different outputs across runs. A prompt tweak that improves one category of requests might break another. Unlike traditional software, you can't write unit tests that cover every possible interaction. What you need is a way to try changes on real traffic, measure the impact, and decide based on data.

That's what A/B testing does. Connic now lets you run controlled experiments on your agents. Split traffic between the current version and a variant, compare cost, latency, success rate, and quality scores side by side, then pick the winner with confidence.

Why A/B Test Your Agents

Every change to an AI agent is a hypothesis. "This prompt will be more accurate." "This model will be cheaper without losing quality." "This new tool will speed up responses." A/B testing turns hypotheses into experiments with measurable outcomes.

Quality
Did the prompt rewrite actually improve response quality? Compare judge scores between the original and the variant to find out.
Cost
Switching to a cheaper model? See the exact cost difference per run while monitoring whether quality holds up under real traffic.
Latency
Is the new model faster? Compare P50 and P95 response times. A lower average means nothing if the tail latency spikes.

How It Works

The concept is simple. You have a base agent (the control), and you create a variant with the change you want to test. Connic routes a percentage of live traffic to the variant, while the rest continues going to the control. Both versions run in parallel on real requests, and every run is tracked and attributed to its group.

Incoming Request → Traffic Split → Control (90%) / Variant (10%) → Compare Results
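Conceptually, the split is a weighted coin flip per request. Here's a minimal Python sketch of percentage-based routing; the function and group names are illustrative, not Connic's actual implementation:

```python
import random

def route_request(variant_pct: float) -> str:
    """Send a request to the variant with probability variant_pct,
    otherwise to the control. Illustrative only."""
    return "variant" if random.random() * 100 < variant_pct else "control"

# With a 10% split, roughly one request in ten goes to the variant.
counts = {"control": 0, "variant": 0}
for _ in range(10_000):
    counts[route_request(10)] += 1
```

In practice, deterministic bucketing (hashing a stable ID instead of drawing a random number) produces the same split while keeping assignments reproducible.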

Variant agents are regular agent configurations. They can change anything: the model, the system prompt, the tools, the temperature. A naming convention keeps it obvious which agent is the base and which is the variant:

agents/
order-processor.yaml                    # base agent (control)
order-processor-test-faster-model.yaml  # variant: testing a cheaper model
order-processor-test-new-prompt.yaml    # variant: testing a rewritten prompt

Deploy both alongside each other. The variant sits dormant until you create a test and start routing traffic to it.

What You Can Test

Since variants are full agent configurations, the experiments you can run are wide open.

Model Swaps
Gemini 2.5 Flash vs Claude Sonnet 4.6. Claude vs GPT. Test whether a cheaper or faster model delivers comparable quality for your specific use case. Stop guessing based on benchmarks and measure on your actual traffic.
Prompt Iterations
You rewrote the system prompt to be more concise. Or added few-shot examples. Or changed the output format. Run the new prompt on a slice of traffic and compare judge scores before rolling it out to everyone.
Tool Versions
You built a v2 of your data processing tool. Does it actually perform better? Point the variant at the new tool implementation and compare success rates and latency with the original.
agents/order-processor-test-faster-model.yaml
name: order-processor-test-faster-model
model: gemini/gemini-2.5-flash        # cheaper, faster model
description: "Processes incoming customer orders"
system_prompt: |
  You process incoming orders...
tools:
  - orders.process
  - inventory.check

Everything else stays the same. The variant inherits the exact same workflow. The only difference is the variable you're testing.

Running a Test

Setting up a test takes less than a minute once your variant is deployed.

  1. Deploy your variant. Push the variant YAML alongside your base agent. After deployment, it shows up as an available test variant.
  2. Open the base agent and click Manage A/B Tests in the header.
  3. Create a new test. Pick the variant, set the traffic percentage, and configure a minimum sample size.
  4. Start the test. Tests begin in Draft status so you can review the configuration before going live.
  5. Monitor and conclude. Watch the comparison metrics fill in as runs come through. When you have enough data, conclude the test and declare a winner.
Run Multiple Tests at Once
You can run several A/B tests on the same agent simultaneously, each with its own traffic split. Testing a new model and a new prompt at the same time? Give each variant 10% and keep 80% on the control. Traffic percentages across all active tests cannot exceed 100%.
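The 100% constraint is easy to sanity-check yourself before creating a test. A hypothetical helper (not part of Connic's API) that sums the splits of all active tests:

```python
def validate_splits(active_tests: dict) -> float:
    """Return the remaining control share, or raise if the
    variant splits across active tests exceed 100%."""
    total = sum(active_tests.values())
    if total > 100:
        raise ValueError(f"variant traffic totals {total}%, exceeds 100%")
    return 100 - total

# Two concurrent tests at 10% each leave 80% on the control.
control_share = validate_splits({
    "test-faster-model": 10,
    "test-new-prompt": 10,
})
```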

Reading the Results

The test detail view gives you a side-by-side comparison across every metric that matters:

Run Counts
Completed runs for control vs variant. Check that both groups have enough data for meaningful comparison. The minimum sample size setting helps here.
Avg Cost & Duration
Average token cost and execution time per run. The variant might be 40% cheaper but 200ms slower. Now you can make that tradeoff consciously.
P50 / P95 Latency
Median and 95th percentile response times. Averages hide outliers. If your variant has a great P50 but a terrible P95, some users are having a bad time.
Judge Scores & Success Rate
Average quality scores from your configured judges, plus the percentage of runs that completed without errors. The numbers that tell you if the change is actually better.
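If you want to verify these numbers against your own run logs, P50 and P95 are straightforward to compute. A small nearest-rank percentile sketch in Python (the latency data is made up for illustration):

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile; good enough for dashboard-style metrics."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked))
    return ranked[min(k, len(ranked)) - 1]

# One slow outlier barely moves the median but dominates the tail.
latencies_ms = [120, 135, 140, 150, 155, 160, 170, 210, 220, 1800]
p50 = percentile(latencies_ms, 50)  # median looks healthy
p95 = percentile(latencies_ms, 95)  # the tail tells another story
```

This is exactly why the comparison view shows both: a variant can win on P50 and still lose on P95.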

Every run in your history shows a variant badge so you can quickly spot which requests went to which version. Filter the runs table by variant name to drill into specific results.

Auto-Rollback: The Safety Net

Experiments shouldn't break production. When you enable auto-rollback, Connic watches the variant's failure rate within a rolling window of recent runs. If it crosses your configured threshold, the test pauses and traffic stops going to the variant.

Failure Threshold
Set the maximum acceptable failure rate. If 20% of variant runs are failing while the control sits at 2%, something is wrong and the test pauses automatically.
Rolling Window
The number of recent runs to evaluate. A window of 50 means the system checks the failure rate across the last 50 variant runs, smoothing out isolated hiccups.

When a rollback triggers, all traffic immediately returns to the control agent. The test stays paused with a clear error message explaining what happened, so you can investigate, fix the variant, and try again.
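The rollback logic described above amounts to a failure counter over a sliding window. A simplified sketch, with hypothetical class and method names (not Connic's internals):

```python
from collections import deque

class RollbackMonitor:
    """Pause a test when the variant's failure rate over the last
    `window` runs reaches `threshold`. Illustrative sketch."""

    def __init__(self, window: int = 50, threshold: float = 0.20):
        self.runs = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.paused = False

    def record(self, success: bool) -> None:
        self.runs.append(success)
        # Only evaluate once the window is full, so one early
        # failure doesn't trip the rollback.
        if len(self.runs) == self.runs.maxlen:
            failure_rate = self.runs.count(False) / len(self.runs)
            if failure_rate >= self.threshold:
                self.paused = True  # stop routing traffic to the variant
```

The window smooths out isolated hiccups; the threshold decides how much sustained failure you tolerate before pulling the plug.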

Pair It with Judges

A/B testing tells you which version is better. Judges tell you why. When you configure a judge on the base agent, it automatically evaluates both control and variant runs. The average judge score shows up in your A/B test comparison, giving you a quality signal alongside cost and latency.

Without Judges
"The variant is 30% cheaper and has the same success rate." Sounds great. But are the responses actually as good? Success rate only tells you the agent did not crash, not that the output was useful.
With Judges
"The variant is 30% cheaper, same success rate, and the average judge score dropped from 8.5 to 7.2 on accuracy." Now you know: the savings come at a quality cost. You can decide if that tradeoff is acceptable.

Sticky Sessions

If your agents handle multi-turn conversations, you don't want a user bouncing between the control and variant mid-session. When sessions are configured, Connic keeps the same user on the same version for the entire conversation. If the test ends or pauses, sessions fall back to the base agent.
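Sticky assignment is typically implemented by hashing a stable session identifier into a bucket, so the same session always resolves to the same group without storing any state. A sketch under that assumption (names are illustrative):

```python
import hashlib

def assign_variant(session_id: str, variant_pct: float) -> str:
    """Deterministically map a session to a group so every turn in
    the conversation hits the same version."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "variant" if bucket < variant_pct else "control"

# The same session always lands in the same group.
assert assign_variant("sess-42", 10) == assign_variant("sess-42", 10)
```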

Practical Scenarios

A few experiments that work well in practice:

"Can We Use a Cheaper Model?"
Your agent runs on Opus 4.6 at $0.04 per run. Sonnet 4.6 costs a fraction of that. Create a variant that swaps the model, route 10% of traffic, and compare cost alongside judge scores over 200+ runs. If quality holds, you just cut your bill significantly.
"Is the New Prompt Better?"
You rewrote the system prompt with more specific instructions and added few-shot examples. Instead of replacing the current prompt and hoping for the best, run both prompts in parallel. Let the judge scores decide which one your users actually get better answers from.
"Did the New Tool Version Improve Reliability?"
You rebuilt the data processing tool to handle edge cases better. Point the variant at the new version and watch the success rate. If failures drop from 8% to 2%, you have the evidence to ship it to everyone.
"What Happens If We Remove RAG Context?"
Your knowledge base adds cost and latency. Is it actually improving responses? Create a stripped-down variant without RAG retrieval and compare. Maybe the base model handles 80% of queries fine on its own, and you only need RAG for the remaining 20%.

Best Practices

Start with Low Traffic
Begin with 5-10% routed to the variant. This catches obvious failures early while limiting the blast radius. Scale up to 25-50% once the variant looks stable.
Change One Thing at a Time
Swap the model or rewrite the prompt, not both. When the variant performs differently, you want to know exactly which change caused it. If you need to test multiple changes, run separate tests.
Set a Minimum Sample Size
50-100 completed runs per group gives you statistically meaningful comparisons. Don't conclude a test after 10 runs. LLMs are noisy, and small samples produce unreliable results.
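If you want more than a rule of thumb, a two-proportion z-test shows why small samples mislead. The sketch below is standard statistics, not a Connic feature; it compares the same success rates at two different sample sizes:

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-proportion z-test on success rates. A |z| above ~1.96
    suggests the difference is unlikely to be noise (95% level)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# 9/10 vs 7/10 looks like a big gap but is not significant...
small = two_proportion_z(9, 10, 7, 10)
# ...while the same rates over 100 runs per group are.
large = two_proportion_z(90, 100, 70, 100)
```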
Always Enable Auto-Rollback in Production
Experiments are exciting until they break things for real users. Set a failure threshold and let Connic pause the test if things go wrong. You can always investigate and restart.
Use Judges for Quality Signals
Success rate tells you the agent did not error. Judge scores tell you the output was good. Configure judges on the base agent so both groups get evaluated by the same criteria.

Getting Started

Ready to stop guessing? Here's how to run your first experiment:

  1. Pick one thing you want to test: a different model, a new prompt, or an updated tool
  2. Create the variant agent YAML with the change and deploy it alongside the base
  3. Open the base agent, click Manage A/B Tests, and create a test with 10% traffic
  4. Enable auto-rollback, set a minimum sample size of 100, and start the test
  5. Wait for data, compare the results, and ship the winner

No more "deploy and pray," no more reverting changes because something feels off. Let real traffic tell you which version is better.

For the complete setup guide, check the A/B Testing documentation. New to Connic? Start with the quickstart guide to deploy your first agent, then come back here and run your first experiment.
