Test agents like you test code
Declarative YAML test suites under tests/. Same runtime, real environment, real connectors. Run them ad hoc with connic test, or let the deploy gate run them before every release.
Read the testing docs.

```
tests/invoice-processor.yaml
  - extracts invoice total        312ms
  - amount > 0                      4ms
  - currency in [EUR, USD]          6ms
  - vat is correctly calculated    18ms
  - vendor name extracted          11ms
```
A YAML file in tests/, run by the same runtime as production
Each case invokes the agent N times with a fixed payload and asserts on the output and the tools it called.
```yaml
version: "1.0"

defaults:
  runs: 5               # invoke the agent 5 times per case
  success_threshold: 80 # 4/5 must pass for the case to pass
  timeout_s: 60         # per-invocation wall clock

tests:
  - name: extracts_invoice_total
    payload: '{"message": "extract total", "doc_id": "INV-7821"}'
    expected_result: output.total > 0 and output.currency in ("EUR", "USD")
    expected_tool_calls:
      - invoices.extract: invocations >= 1
    expected_no_tool_calls:
      - notifications.send
```

```
tests/invoice-processor.yaml
  - extracts_invoice_total             312ms
    - expected_result passed (5/5)       4ms
    - invoices.extract called 5/5        6ms
    - notifications.send not called      4ms
    - success threshold 80 met           2ms
```
Expression DSL on output, plus tool-call shape
Assertions use the same safe evaluator as agent tool conditions and approval rules. Python-like syntax against the run's output, error, status, and recorded tool invocations.
- `expected_result`: a Python-like expression over the bindings `output`, `error`, and `status`. Supports attribute and subscript access, comparisons, boolean operators, and membership tests.
- `expected_tool_calls`: bare tool names (called at least once) or one-key mappings like `{tool: invocations >= 5}`. Mixed entries are allowed; see the sketch after this list.
- `expected_no_tool_calls`: tool names that must NOT be called during the run. Catches the case where the agent should have skipped a tool but didn't.
- `runs` + `success_threshold`: invoke the case N times (1–100); the case passes if at least the threshold percentage of runs succeed. Lets you keep enforcing assertions on stochastic agents without flaky failures killing the suite.
- `timeout_s`: per-invocation wall-clock timeout in seconds (1–3600). A timeout counts as a failed run against the threshold.
- Payload builders: dynamic payload builders can return `False` from `cleanup()` to fail the case. The result is AND-ed with the YAML-defined checks, for assertions you can't express as an expression.
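How these fields combine in a single case: a minimal sketch, assuming per-case overrides of the `defaults` block are allowed. The tool names, payload, and `status` value are illustrative, not from a real suite.

```yaml
tests:
  - name: refund_path_skips_notification
    runs: 10              # hypothetical per-case override of the suite default
    success_threshold: 90 # 9/10 runs must pass
    payload: '{"message": "refund order", "order_id": "ORD-1042"}'
    # Attribute access, subscript access, and a comparison in one expression;
    # "completed" is an illustrative status value.
    expected_result: status == "completed" and output["refund"].amount > 0
    expected_tool_calls:
      - orders.lookup                      # bare name: called at least once
      - payments.refund: invocations == 1  # one-key mapping with a condition
    expected_no_tool_calls:
      - notifications.send                 # agent must skip this tool
```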
For LLM-graded quality scoring on production runs, configure Judges separately in the dashboard.
Ad hoc with the CLI, automatic on every deploy
connic test exits 0 when every case passes, 1 on failure, 2 on infrastructure error. Drop it into any CI runner, or skip CI entirely: every connic deploy and git auto-deploy already runs the suite as a deploy gate before the new image ships.
```yaml
name: agent tests
on:
  pull_request:
    paths:
      - "agents/**"
      - "tools/**"
      - "tests/**"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install connic
      - run: connic test --env ${{ secrets.CONNIC_CI_ENV_ID }} --json
        env:
          CONNIC_API_KEY: ${{ secrets.CONNIC_API_KEY }}
```
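Since the gate is just an exit code, any CI system works. A minimal sketch of the same job in GitLab CI, assuming the identical connic test invocation; the job name and image are illustrative, and `CONNIC_API_KEY` / `CONNIC_CI_ENV_ID` would be set as masked CI/CD variables:

```yaml
agent-tests:
  image: python:3.12
  rules:
    - changes: # mirror the pull_request paths filter above
        - agents/**
        - tools/**
        - tests/**
  script:
    - pip install connic
    # exit 0 = all cases pass; 1 = test failure; 2 = infrastructure error
    - connic test --env "$CONNIC_CI_ENV_ID" --json
```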
Why not a spreadsheet of prompts?
What separates Connic testing from the patterns most teams outgrow
| Feature | Connic | Spreadsheet eval | LangSmith eval | DIY pytest |
|---|---|---|---|---|
| Versioned with the agent project | Included | Not included | Partial | Included |
| Runs in CI | Included | Not included | Partial | Included |
| Expression DSL on output / status / error | Included | Not included | Partial | Included |
| Tool-call assertions (positive and negative) | Included | Not included | Partial | Partial |
| Repeats per case + success threshold for stochastic models | Included | Not included | Partial | Partial |
| Dynamic Python payload builders with cleanup | Included | Not included | Not included | Partial |
| Same runtime, real environment, real connectors | Included | Not included | Partial | Not included |
| Automatic deploy gate (no CI config needed) | Included | Not included | Not included | Not included |