Product Spotlight

Connic Tests: Catch Agent Regressions Before They Reach Production

A YAML-driven testing framework built for non-deterministic AI agents. Statistical pass thresholds, expression-based assertions, tool-call checks, multimodal fixtures, and a deployment gate that blocks broken builds.

May 6, 2026 · 8 min read

You changed a prompt, swapped a model, or refactored a tool. Did anything break? With traditional code, your test suite gives you an answer in seconds. With AI agents, most teams ship a change, watch the dashboard, and hope for the best. That's not engineering. That's gambling against a non-deterministic system.

Today we're spotlighting Connic Tests, a testing framework built specifically for the way agents actually behave. It treats non-determinism as a first-class citizen, lets you assert on the entire trace (not just the final string), and becomes the gate between a green commit and a production deploy.

Why Agents Break Traditional Test Frameworks

Pytest, Jest, JUnit. They all assume the same input always produces the same output. That assumption is the foundation of every assertion you've ever written. Agents shred that assumption. The same prompt can return different tokens, take different tool paths, or surface a flaky failure that disappears on retry. Test suites that pretend the output is deterministic end up either useless (passing every red build) or unusable (failing every green one).

Failure modes ordinary test frameworks miss:

Flaky-by-Design Behavior
The agent gets the right answer 7 out of 10 times. A unit test assertion either always passes or always fails. Both are wrong. You need a way to express “this is good enough at scale.”
Silent Tool-Call Drift
Your refactor still produces a plausible-looking final answer, but the agent quietly stopped calling the calculator and started guessing. The output looks fine. The behavior is broken.
Forbidden Side Effects
A subtle prompt change makes the agent helpfully send an email it should never have sent. You need to assert on what the agent did not do, not just what it did.
Live-Data Dependencies
Tests that need a real Stripe charge, a real database row, or a real S3 object can't be shoehorned into a static fixture file. You need fixtures that build state and tear it down.

Connic Tests was designed around exactly these problems.

The 30-Second Tour

A test suite is just a YAML file at tests/<agent-name>.yaml. Same flat layout as middleware/, no scaffolding, no boilerplate. Each file declares one or more test cases for the agent of the same name.

tests/stress-tester.yaml
version: "1.0"
agent: stress-tester        # optional, defaults to the filename stem

defaults:
  runs: 10                  # invoke the agent 10 times per case
  success_threshold: 90     # 9 out of 10 must pass
  timeout_s: 60

tests:
  - name: returns_id_10
    payload: '{"a": 4, "b": 6}'
    expected_result: output.id == 10
    expected_tool_calls:
      - math.calculator.add                  # called at least once
      - math.calculator.add: invocations >= 5  # ...or with an expression
    expected_no_tool_calls:
      - email.send                           # must NOT be called

That's a complete, production-ready test. Ten invocations, a 90% pass threshold, an output assertion, two positive tool-call expectations, and a forbidden side effect. No fixtures, no runners, no glue code. Just the behavior you care about.

Built for Non-Determinism

The first thing that makes Connic Tests feel different is runs and success_threshold. Together they let you encode statistical expectations directly in the test contract.

runs (1–100)
How many independent agent invocations to execute for this case. Pick 1 for cheap deterministic checks. Pick 50 when measuring flakiness on a model that runs hot.
success_threshold (1–100)
Percent of those runs that must pass. The classic 100 means “this should never fail.” A 90 means “at scale, this is good enough.”
The result: confidence on demand
The same case can act as a smoke test (runs: 1) or a regression sentinel (runs: 50, success_threshold: 95). Tune per case so cheap cases stay cheap and critical cases get statistical rigor. Every individual invocation is recorded with its agent run ID and pass/fail outcome, so a case that fails on 3 of 50 runs takes you straight to the three offending traces. No log spelunking required.

Assertions That Match How Agents Actually Work

A string-equality check on the agent's final output is almost never the assertion you want. Connic Tests gives you four assertion shapes, and each one is an expression, not a hardcoded comparator.

expected_result — output expressions
Evaluated against the bindings output, error, and status. Write output.total > 0, output.status == "refunded", or any other safe expression. JSON payloads are parsed automatically.
expected_tool_calls — positive trace assertions
A bare tool name asserts the tool was called at least once. A mapping form like math.calculator.add: invocations >= 5 asserts on the call count. Catches the silent regression where an agent stops using a tool you depend on.
expected_no_tool_calls — negative trace assertions
A list of tools that must not be called. The most underrated assertion in agent testing. This is how you prove your refund flow never sends a confirmation email, or your reader agent never tried to write.
Terminal status (implicit)
If you omit expected_result, the run still has to reach the completed status to pass. A free, baseline “did it crash?” check on every case.
Why expressions, not matchers?
Hardcoded matchers (assertEquals, assertContains) cap the questions you can ask. Expressions don't. Anything you can compute over the output object (numeric comparisons, length checks, nested keys, set membership) is already a valid assertion. Same engine that powers Connic's trace filters, evaluated safely server-side.

Multimodal Fixtures Without the Setup

Half of the agents people deploy today read PDFs, parse invoices, OCR receipts, or describe images. Their tests should too. Drop a binary in tests/files/ and reference it by name. The runner base64-encodes it and delivers a multimodal payload to the agent.

tests/invoice-extractor.yaml
tests:
  - name: extract_invoice_total
    payload: "Extract the total amount from this invoice."
    files:
      - invoice_a.pdf
      - invoice_b.pdf
    runs: 5
    success_threshold: 100
    expected_result: output.total > 0 and output.currency == "EUR"

Five runs, each receiving both PDFs alongside the prompt, with a strict expectation that every single one returns a positive Euro total. No mocking. No stubs. Real model calls against real fixtures.

Dynamic Builders for Stateful Tests

Some tests can't be static. To test a refund agent properly, you have to create a real charge first. To test an updater, you have to seed a real row. Dynamic builders are small Python modules in tests/builders/ that produce the test payload at run time and tear down the fixture afterwards.

tests/builders/create_charge_then_refund.py
import os

import stripe

# Builders run in the same sandbox as the agent under test, so the Stripe secret
# is assumed to arrive as an environment variable (the name here is illustrative).
stripe.api_key = os.environ["STRIPE_API_KEY"]

def build(context, builder_args, test_name, payload, files):
    """Create a real Stripe charge, then ask the agent to refund it."""
    charge = stripe.Charge.create(
        amount=builder_args["amount_cents"],
        currency="usd",
        source="tok_visa",
    )
    # stash state for cleanup() to read back
    context["charge_id"] = charge.id
    return f"Please refund charge {charge.id} for the customer."

def cleanup(run, context, builder_args):
    """Runs after the agent finishes (pass OR fail)."""
    charge_id = context.get("charge_id")
    if charge_id:
        # ensure the fixture is gone even if the agent forgot
        try:
            stripe.Refund.create(charge=charge_id)
        except stripe.error.InvalidRequestError:
            pass  # already refunded by the agent under test
    return None  # returning False would mark the case failed
tests/billing-agent.yaml
tests:
  - name: refunds_a_real_charge
    builder: create_charge_then_refund
    builder_args:
      amount_cents: 4200
    expected_result: output.status == "refunded"
    expected_tool_calls:
      - stripe.refund.create
    expected_no_tool_calls:
      - email.send

The builder creates state, the agent operates on it, the assertion verifies behavior, and the cleanup hook tears down the fixture, even if the test failed. Builders run inside the same sandbox as your agent, so they see the same environment, the same secrets, and the same tools.

Cleanup runs unconditionally
Returning False from cleanup() additionally fails the case. Useful when teardown itself reveals a bug (e.g. you discover the agent didn't actually refund the charge it claimed to). Every other return value is treated as success.

Tests Are a Deployment Gate, Not a Vibe Check

Tests that run on a developer's laptop and nowhere else are a comfort, not a control. Connic Tests is wired directly into the deployment pipeline:

git push → Build → Test Phase → Promote → Production

During every deployment, Connic discovers the test suites in your project, expands each case according to its runs count, and executes them in an isolated runner that mirrors the production environment. Same tools, same secrets, same model providers. If any case fails its threshold, the deployment stops. Production never sees the broken build.

Wired into every deploy
No CI plumbing to write. The test phase is a built-in step of the deployment pipeline, and every test run is associated with the deployment that triggered it for full audit history.
Per-case visibility
Each test case becomes its own row in the deployment timeline, with the count of runs attempted, runs passed, and the underlying agent run IDs. Click any failing case to land directly on the failed traces.
Run on demand from the CLI
Need to verify a fix without redeploying? connic test executes the same framework against your current environment and streams results back to your terminal.
Flake budgets, not flake denial
A 90% threshold isn't lowering the bar. It's the bar, stated honestly. The deployment passes when the agent meets your real production-quality target, and fails the moment the underlying success rate drifts down.

Real-World Patterns

Regression Sentinel for a Prompt Change
A team rewrites the system prompt of their support agent. They add a sentinel case with 50 runs and a 95% threshold against a tricky historical ticket. The deploy succeeds only if the new prompt is at least as reliable as the old one on the ticket that originally motivated the rewrite.
Tool-Call Contract for a New Model
When upgrading the underlying model, an analytics agent suddenly “answers” from memory instead of querying the warehouse. The team adds a warehouse.query: invocations >= 1 expectation to every case's expected_tool_calls (see the sketch after this list). Future model swaps that re-introduce the bug fail the deployment automatically.
Negative Assertions on a Refund Agent
A billing agent must never email customers without an explicit instruction. Every refund test case lists expected_no_tool_calls: email.send. A regression that adds a “helpful” confirmation email is caught at deploy time, not in a customer complaint.
Multimodal Coverage for an Invoice Parser
An invoice extraction agent has 12 fixture PDFs covering vendor formats. Each case runs 3 times at 100% threshold. Any model regression that breaks one vendor's layout shows up as a single red row in the deployment view, not a quiet drop in extraction accuracy.
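
A minimal sketch of the tool-call-contract pattern above (the analytics agent, its payload, and the output field are invented; warehouse.query and email.send are the tool names from the scenarios):

tests/analytics-agent.yaml (illustrative)
defaults:
  runs: 10
  success_threshold: 90

tests:
  - name: always_queries_the_warehouse
    payload: '{"question": "How many signups did we get last week?"}'
    expected_result: output.count >= 0
    expected_tool_calls:
      - warehouse.query: invocations >= 1   # never answer from memory
    expected_no_tool_calls:
      - email.send                          # and never notify anyone as a side effect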

How It Fits the Rest of the Platform

Connic Tests is the deterministic, pre-production gate. It sits alongside two other quality systems on the platform:

System | When It Runs | What It Answers
Connic Tests | During deployment | Does this build pass the contract I wrote?
LLM Judges | After every (or sampled) production run | How is quality trending against my rubric?
A/B Testing | Across two live variants | Which version performs better on real traffic?

Tests catch regressions before they ship. Judges score the runs that do ship. A/B testing decides between two versions that both passed. Use all three and you have a proper feedback loop instead of a hope-driven release process.

Getting Started

Adding tests to an existing agent is a five-minute exercise:

1. Create a tests/ directory in your project root, next to agents/
2. Add tests/<agent-name>.yaml with one or two cases. Start with runs: 1 for fast feedback (a minimal starter file is sketched below)
3. Run connic test locally and watch the cases stream in
4. Push and deploy. The same suite now gates production. Crank runs up on the cases that matter most
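
A minimal starter file might look like this (the agent name and payload are placeholders for your own):

tests/my-agent.yaml (starter)
version: "1.0"

tests:
  - name: smoke_test
    payload: '{"message": "hello"}'
    runs: 1    # fast feedback while iterating
    # no expected_result yet: the implicit "reached completed status" check still applies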

For the full schema reference (every field, every default, every expression binding) see the Testing documentation. New to Connic? Start with the quickstart guide to deploy your first agent, then come back here and gate it with a test that fails before your customers do.