Test agents like you test code
Declarative YAML test suites under tests/. Same runtime, real environment, real connectors. Run them ad hoc with connic test, or let the deploy gate run them before every release.
Read the testing docs.

```
tests/invoice-processor.yaml
  - extracts invoice total        312ms
  - amount > 0                      4ms
  - currency in [EUR, USD]          6ms
  - vat is correctly calculated    18ms
  - vendor name extracted          11ms
```
A YAML file in tests/, run by the same runtime as production
Each case invokes the agent N times with a fixed payload and asserts on the output and the tools it called.
```yaml
version: "1.0"

defaults:
  runs: 5               # invoke the agent 5 times per case
  success_threshold: 80 # 4/5 must pass for the case to pass
  timeout_s: 60         # per-invocation wall clock

tests:
  - name: extracts_invoice_total
    payload: '{"message": "extract total", "doc_id": "INV-7821"}'
    expected_result: output.total > 0 and output.currency in ("EUR", "USD")
    expected_tool_calls:
      - invoices.extract: invocations >= 1
    expected_no_tool_calls:
      - notifications.send
```

```
tests/invoice-processor.yaml
  - extracts_invoice_total             312ms
    - expected_result passed (5/5)       4ms
    - invoices.extract called 5/5        6ms
    - notifications.send not called      4ms
    - success threshold 80 met           2ms
```
Expression DSL on output, plus tool-call shape
Assertions use the same safe evaluator as agent tool conditions and approval rules. Python-like syntax against the run's output, error, status, and recorded tool invocations.
- `expected_result`: a Python-like expression over the bindings `output`, `error`, and `status`. Supports attribute and subscript access, comparisons, boolean operators, and membership tests.
- `expected_tool_calls`: bare tool names (called at least once) or one-key mappings like `{tool: invocations >= 5}`. Mixed entries are allowed; see the sketch after this list.
- `expected_no_tool_calls`: tool names that must NOT be called during the run. Catches the case where the agent should have skipped a tool but didn't.
- `runs` + `success_threshold`: invoke the case N times (1–100); the case passes if at least the threshold percentage of runs succeed. Lets you keep enforcing assertions on stochastic agents without flaky failures killing the suite.
- `timeout_s`: per-invocation wall-clock timeout in seconds (1–3600). A timeout counts as a failed run against the threshold.
- Payload builders: dynamic payload builders can return `False` from `cleanup()` to fail the case. The result is AND-ed with the YAML-defined checks, for assertions you can't express as an expression.
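How these fields combine in a single case: a minimal sketch, assuming per-case overrides of the `defaults` block are allowed. The tool names, payload, and `status` value are illustrative, not from a real suite.

```yaml
tests:
  - name: refund_path_skips_notification
    runs: 10              # hypothetical per-case override of the suite default
    success_threshold: 90 # 9/10 runs must pass
    payload: '{"message": "refund order", "order_id": "ORD-1042"}'
    # Attribute access, subscript access, and a comparison in one expression;
    # "completed" is an illustrative status value.
    expected_result: status == "completed" and output["refund"].amount > 0
    expected_tool_calls:
      - orders.lookup                      # bare name: called at least once
      - payments.refund: invocations == 1  # one-key mapping with a condition
    expected_no_tool_calls:
      - notifications.send                 # agent must skip this tool
```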
For LLM-graded quality scoring on production runs, configure Judges separately in the dashboard.
Ad hoc with the CLI, automatic on every deploy
connic test exits 0 when every case passes, 1 on failure, 2 on infrastructure error. Drop it into any CI runner, or skip CI entirely: every connic deploy and git auto-deploy already runs the suite as a deploy gate before the new image ships.
```yaml
name: agent tests
on:
  pull_request:
    paths:
      - "agents/**"
      - "tools/**"
      - "tests/**"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install connic
      - run: connic test --env ${{ secrets.CONNIC_CI_ENV_ID }} --json
        env:
          CONNIC_API_KEY: ${{ secrets.CONNIC_API_KEY }}
```
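Since the gate is just an exit code, any CI system works. A minimal sketch of the same job in GitLab CI, assuming the identical connic test invocation; the job name and image are illustrative, and `CONNIC_API_KEY` / `CONNIC_CI_ENV_ID` would be set as masked CI/CD variables:

```yaml
agent-tests:
  image: python:3.12
  rules:
    - changes: # mirror the pull_request paths filter above
        - agents/**
        - tools/**
        - tests/**
  script:
    - pip install connic
    # exit 0 = all cases pass; 1 = test failure; 2 = infrastructure error
    - connic test --env "$CONNIC_CI_ENV_ID" --json
```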
Why not a spreadsheet of prompts?
What separates Connic testing from the patterns most teams outgrow
| Feature | Connic | Spreadsheet eval | LangSmith eval | DIY pytest |
|---|---|---|---|---|
| Versioned with the agent project | Included | Not included | Partial | Included |
| Runs in CI | Included | Not included | Partial | Included |
| Expression DSL on output / status / error | Included | Not included | Partial | Included |
| Tool-call assertions (positive and negative) | Included | Not included | Partial | Partial |
| Repeats per case + success threshold for stochastic models | Included | Not included | Partial | Partial |
| Dynamic Python payload builders with cleanup | Included | Not included | Not included | Partial |
| Same runtime, real environment, real connectors | Included | Not included | Partial | Not included |
| Automatic deploy gate (no CI config needed) | Included | Not included | Not included | Not included |