Testing Framework
Declarative test suites that live in tests/. Run them ad-hoc with connic test, or let the deploy gate run them automatically before every release.
Overview
Tests as YAML, run by the same runtime as production
A test is a YAML file under tests/ that invokes an agent N times with a fixed payload and asserts on the output and the tools it called. Tests run in a real runner container against a real Connic environment, the same code path your production traffic hits, so a passing suite is meaningful, not a stubbed simulation.
Two ways to run them:
- Ad-hoc: connic test from your terminal. Useful while iterating.
- Deploy gate: every connic deploy and git auto-deploy runs the suite before the new image is shipped. A failing case aborts the deploy.
Both paths share the same discovery, the same execution model, and the same dashboard surfacing. The only difference is whether a successful run also promotes a deployment to active.
File Layout
Test files live at the project root in a flat tests/ directory, the same convention as middleware/. The filename stem is the agent the suite targets.
So tests/stress-tester.yaml contains tests for the agent named stress-tester. If you want to split a large suite for one agent across multiple files (e.g. smoke vs. load tests), set the top-level agent: field on each file so they all target the same agent. See below.
YAML Format
Each file declares one or more test cases. File-level defaults apply to every case; per-case fields override them.
version: "1.0"

# File-level defaults applied to every case.
# Per-case fields override these.
defaults:
  runs: 5                 # invoke the agent 5 times per case
  success_threshold: 80   # 4/5 must pass for the case to pass
  timeout_s: 60           # per-invocation wall clock

tests:
  - name: adds_two_numbers
    payload: '{"message": "add 4 and 6", "a": 4, "b": 6}'
    expected_result: status == "completed"
    expected_tool_calls:
      - math.calculator.add: invocations >= 1

  - name: plain_message_no_tools
    payload: "say hello"
    expected_result: status == "completed"
    expected_no_tool_calls:
      - math.calculator.add

  - name: high_concurrency_smoke
    payload: '{"message": "stress ping"}'
    runs: 20              # override file default for this case
    success_threshold: 95
    expected_result: status == "completed"

When you need more than one suite for the same agent, override the filename-derived default with the agent: field. The two files here both target stress-tester:
# A second suite for the stress-tester agent. The filename can't also be
# stress-tester.yaml (it's already taken), so we point at the agent
# explicitly and give the file a more descriptive name.
agent: stress-tester

defaults:
  runs: 50
  success_threshold: 90

tests:
  - name: sustained_burst
    payload: '{"message": "stress ping"}'
    expected_result: status == "completed"

Field reference
| Field | Type | Status | Description |
|---|---|---|---|
| version | string | Optional | Schema version. Currently only "1.0". Default: "1.0" |
| agent | string | Optional | Agent the suite targets. Defaults to the filename stem (e.g. tests/foo.yaml → foo). Set explicitly when you want to split a large suite for one agent across multiple files. |
| defaults | object | Optional | File-level defaults applied to every case. Per-case fields override these. |
| runs | integer | Optional | How many times each case invokes the agent. Range: 1–100. Default: 1 |
| success_threshold | integer | Optional | Percent of runs that must pass for the case to pass overall. Range: 1–100. Default: 100 |
| timeout_s | integer | Optional | Per-invocation wall-clock timeout in seconds. Range: 1–3600. Default: 120 |
| tests | object[] | Required | Test cases. At least one required. |
| name | string | Required | Stable identifier within the file. Must be unique. Surfaces as the row title in the dashboard pipeline panel. |
| payload | string | Optional | Agent input as a string, same shape as a normal Connic payload. If the string parses as JSON it's converted before output is evaluated, so output.id == 10 works on a JSON reply. Required unless builder is set. |
| files | string[] | Optional | Bare filenames found in tests/files/. The runner reads each file, base64-encodes it, and attaches it under files. If payload is a JSON object (or comes from a builder returning a dict), its keys sit at the top level of context["payload"] next to files; otherwise the string is delivered as {message: payload}. See File Attachments. Default: [] |
| builder | string | Optional | Name of a Python module under tests/builders/ (with or without the .py suffix). Replaces the static payload with whatever build() returns. See Dynamic Payload Builders. |
| builder_args | object | Optional | Arbitrary kwargs forwarded to the builder as the builder_args argument of build() and cleanup(). Use it to vary fixtures without writing one builder per case. |
| runs | integer | Optional | Per-case override for defaults.runs. |
| success_threshold | integer | Optional | Per-case override for defaults.success_threshold. |
| timeout_s | integer | Optional | Per-case override for defaults.timeout_s. |
| expected_result | string | Optional | Expression evaluated against bindings output, error, status, context. If omitted, the case passes whenever the run reaches completed. See the Expression DSL section below. |
| expected_tool_calls | list | Optional | Either bare tool names (called at least once) or one-key mappings {tool: <expr on invocations, params, and/or context>}. Mixed entries are allowed in the same list, and the same tool may appear in multiple entries to lock down distinct argument sets independently. Default: [] |
| expected_no_tool_calls | string[] | Optional | Tool names that must NOT be called during the run. Useful for locking down the negative branch of a conditional tool selection. Default: [] |
| expected_child_agents | object | Optional | Map of triggered agent name → assertions for that child run. Each entry takes the same expected_result / expected_tool_calls / expected_no_tool_calls fields as the parent, plus its own nested expected_child_agents for deeper trigger chains. See Asserting on triggered agents. Default: null |
| expected_triggered | integer | Optional | (Inside an expected_child_agents entry.) Minimum number of times the named child agent must be triggered. Useful when the only thing the parent can assert is that a fire-and-forget trigger happened. Default: 1 |
| expected_payload | string | Optional | (Inside an expected_child_agents entry.) Expression evaluated against the input the parent passed to trigger_agent. Bindings: payload (JSON-parsed when the parent passed a JSON string, else the raw value), payload_raw (string form, "" when N/A), context. Works on fire-and-forget triggers too, since the payload is captured at call time. |
| expected_result | string | Optional | (Inside an expected_child_agents entry.) Same expression grammar as the top-level field, evaluated against the child run's output. Requires at least one wait_for_response=True trigger. |
| expected_tool_calls | list | Optional | (Inside an expected_child_agents entry.) Same grammar as the top-level field, evaluated against the child's tool calls. Default: [] |
| expected_no_tool_calls | string[] | Optional | (Inside an expected_child_agents entry.) Tools the child must NOT call. Default: [] |
| expected_child_agents | object | Optional | (Inside an expected_child_agents entry.) Recursive: assertions for agents this child triggers in turn. Stack as deep as the trigger chain goes. Default: null |
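
How runs and success_threshold combine can be sketched in a few lines of Python. This is an illustration rather than the runner's code: the function name is hypothetical, and the inclusive comparison is an assumption taken from the "80 means 4/5 must pass" comment in the example suite.

```python
def case_passes(passed_runs: int, runs: int, success_threshold: int) -> bool:
    """Return True when the share of passing runs meets the threshold.

    Assumption: the threshold is inclusive (>=), matching the
    "success_threshold: 80 -> 4/5 must pass" comment in the example.
    """
    if not 1 <= runs <= 100:
        raise ValueError("runs must be in 1..100")
    if not 1 <= success_threshold <= 100:
        raise ValueError("success_threshold must be in 1..100")
    # Integer arithmetic avoids float rounding right at the boundary.
    return passed_runs * 100 >= success_threshold * runs


print(case_passes(4, 5, 80))    # True
print(case_passes(3, 5, 80))    # False
print(case_passes(19, 20, 95))  # True
```

With the defaults (runs: 1, success_threshold: 100) the single invocation has to pass.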
Expression DSL
Assertion expressions use the same safe evaluator as tool conditions and approval rules.
Python-like syntax: and, or, not; comparisons == != > < >= <=; membership in, not in; parentheses for grouping; string literals in single or double quotes. Reach into nested objects with dot-paths like context.user.role. A bare path like context.active is a truthy check: it passes when the value is set and not empty, zero, or false. Missing fields make the surrounding predicate fail rather than raising.
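
The dot-path and missing-field rules above can be pictured with a small Python sketch. It is not Connic's evaluator: the helper names are hypothetical, only dict traversal is modeled (no list indexing), and it exists to show why a missing field fails a predicate instead of raising.

```python
_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def resolve(path: str, bindings: dict):
    """Walk a dot-path such as "context.user.role" through nested dicts."""
    value = bindings
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def is_truthy(path: str, bindings: dict) -> bool:
    """Bare-path check: the value is set and not empty, zero, or false."""
    value = resolve(path, bindings)
    return value is not _MISSING and bool(value)


def equals(path: str, expected, bindings: dict) -> bool:
    """A comparison on a missing field fails rather than raising."""
    value = resolve(path, bindings)
    return value is not _MISSING and value == expected


bindings = {"context": {"user": {"role": "admin"}, "active": 0}}
print(equals("context.user.role", "admin", bindings))  # True
print(is_truthy("context.active", bindings))           # False (zero)
print(equals("context.user.team", "core", bindings))   # False (missing)
```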
expected_result
- output: the agent's output. A JSON reply is parsed, so output.id == 10 and "hi" in output both work.
- error: the run's error message, or None when the run did not error.
- status: one of "completed", "failed", "cancelled", "blocked", "awaiting_approval".
- context.<key>: the builder's context dict, the same one build() mutated. Empty for tests with no builder. Use this to compare agent output against fixture state the builder just provisioned (e.g. output.id == context.row_uuid). See Dynamic Payload Builders.
expected_tool_calls
- invocations: the number of calls to the tool that match the entry's params.* filters.
- params.<key>: an argument of an individual call.
- context.<key>: the builder's context dict, available alongside params and invocations so a tool-call assertion can pin params to a fixture id (e.g. params.uuid == context.test_uuid).
A top-level and splits a tool-call expression into params.* filters (per-invocation) and invocations predicates (over the filtered count). context.* may appear on either side. If only params.* conjuncts are given, invocations >= 1 is implied. Repeat the same tool name across entries to lock down distinct argument sets independently. Tool names match either the local function name or the qualified ref.
expected_result examples
tests:
  # Status check (the most common case)
  - name: completes_cleanly
    payload: "ping"
    expected_result: status == "completed"

  # JSON output via attribute access
  - name: returns_id_10
    payload: '{"a": 4, "b": 6}'
    expected_result: output.id == 10

  # Substring match on a plain-text reply
  - name: greets_user
    payload: "hi"
    expected_result: '"hello" in output'

  # Numeric comparison + boolean composition
  - name: high_confidence_only
    payload: "classify this"
    expected_result: output.confidence >= 0.8 and output.label != "unknown"

  # Negative case: a failure is the expected outcome
  - name: rejects_invalid_input
    payload: '{"vendor": ""}'
    expected_result: status == "failed" and "missing vendor" in error

expected_tool_calls examples
tests:
  # Bare name -- the tool must be called at least once
  - name: uses_calculator
    payload: '{"a": 4, "b": 6}'
    expected_tool_calls:
      - math.calculator.add

  # Mapping form -- expression on invocations
  - name: calls_add_at_least_five_times
    payload: '{"sum_many": [1,2,3,4,5,6]}'
    expected_tool_calls:
      - math.calculator.add: invocations >= 5

  # Exactly-once enforcement
  - name: calls_send_exactly_once
    payload: "send a digest"
    expected_tool_calls:
      - notifications.send: invocations == 1

  # Filter by call arguments via params.* -- asserts the agent
  # actually used the operands from the payload, not invented ones.
  # When invocations is omitted, "at least one matching call" is implied.
  - name: calls_add_with_payload_args
    payload: '{"a": 4, "b": 6}'
    expected_tool_calls:
      - math.calculator.add: params.a == 4 and params.b == 6

  # Repeat the same tool to lock down each argument set independently.
  # Each entry is its own assertion -- this passes when the agent
  # calls add(4, ...) once AND add(7, ...) once, in any order.
  - name: calls_add_for_each_pair
    payload: "compute 4+6 and 7+8 separately"
    expected_tool_calls:
      - math.calculator.add: invocations == 1 and params.a == 4
      - math.calculator.add: invocations == 1 and params.a == 7

  # Pin params against builder context. The builder inserts a row,
  # stashes its uuid in context["test_uuid"], and the agent receives
  # the uuid in its prompt. The assertion fails if the agent fetches
  # any row other than the one the builder provisioned.
  - name: fetches_the_row_we_just_inserted
    builder: insert_then_query
    expected_result: output.row.id == context.test_uuid
    expected_tool_calls:
      - db.fetch_row: params.uuid == context.test_uuid and invocations == 1

  # Negative assertion: tool must NOT be called
  - name: plain_chat_no_tools
    payload: "say hi"
    expected_no_tool_calls:
      - math.calculator.add
      - notifications.send

Asserting on Triggered Agents
When the agent under test calls trigger_agent (see trigger_agent), the test container runs the child agent in-process instead of dispatching to the live deployment. That gives you the same execution model as the parent for any agent the trigger reaches, so expected_child_agents can assert on output, tool calls, and further triggers exactly the way the top-level fields do.
The assertion stacks: each entry is keyed by the triggered agent's name and can carry its own expected_child_agents for whatever that child triggers in turn.
tests:
  # The dispatcher agent calls trigger_agent("summarizer", ...) with
  # wait_for_response=True. In the deploy-gate container the child runs
  # in-process, so its output and tool calls are captured here.
  - name: dispatches_to_summarizer
    payload: '{"text": "..."}'
    expected_child_agents:
      summarizer:
        expected_payload: payload.text != ""
        expected_result: output.summary != ""
        expected_tool_calls:
          - llm.complete: invocations >= 1
        expected_no_tool_calls:
          - email.send

  # Pin the trigger payload against builder context, so the test fails if
  # the agent forwards the wrong fixture id instead of the one it was
  # given. Works whether the parent passed a dict (payload.field) or a
  # string (substring via payload_raw).
  - name: forwards_charge_id_unchanged
    builder: create_charge_then_refund
    builder_args:
      amount_cents: 4200
    expected_child_agents:
      billing-refunder:
        expected_payload: payload.charge_id == context.charge_id

  # Recursive: assert on a grandchild that summarizer triggers in turn.
  # Same shape repeats at every depth -- agent name keys mapping to the
  # same assertion fields, plus its own expected_child_agents.
  - name: dispatches_summarizer_then_publisher
    payload: '{"text": "..."}'
    expected_child_agents:
      summarizer:
        expected_result: output.summary != ""
        expected_child_agents:
          publisher:
            expected_tool_calls:
              - kafka.publish: params.topic == "summaries"

  # Fire-and-forget triggers (wait_for_response=False) cannot have their
  # result inspected, but the payload is recorded at call time -- so
  # expected_payload still applies.
  - name: fans_out_telemetry
    payload: '{"event": "checkout"}'
    expected_child_agents:
      telemetry-writer:
        expected_triggered: 1
        expected_payload: payload.event == "checkout"

Two evaluation paths
- wait_for_response=True: the child runs synchronously inside the test container with its own tool-call collector, so expected_result, expected_tool_calls, expected_no_tool_calls, and nested expected_child_agents all apply.
- wait_for_response=False: fire-and-forget. The framework only knows the call happened and what payload it carried; use expected_triggered and expected_payload here. If a fire-and-forget trigger is the only match and the spec carries result / tool / nested assertions, the case fails with a clear reason telling you to wait for the response.
Asserting on the trigger payload
expected_payload uses the same expression grammar as expected_result, just with input-side bindings. Use payload.<key> when the parent passed a dict or a JSON string, and payload_raw for substring checks against a free-form string trigger. context.<key> is bound the same way the other assertions bind it, so you can pin a forwarded fixture id with payload.charge_id == context.charge_id. Because the payload is captured at call time, this assertion works on fire-and-forget triggers too — it's the one piece of every trigger record that's always observable.
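
As a rough picture of those input-side bindings, the sketch below derives payload and payload_raw from whatever the parent passed to trigger_agent. The function is hypothetical, and two details are assumptions rather than documented behavior: a dict's payload_raw is shown as its JSON dump, and a missing payload maps to "".

```python
import json


def payload_bindings(trigger_input, context: dict) -> dict:
    """Build the names an expected_payload expression could see.

    Assumptions (not confirmed by the docs): a dict's payload_raw is its
    JSON dump, and a missing payload (None) yields payload_raw == "".
    """
    if isinstance(trigger_input, str):
        raw = trigger_input
        try:
            parsed = json.loads(trigger_input)  # JSON string -> parsed value
        except json.JSONDecodeError:
            parsed = trigger_input              # free-form string stays raw
    elif trigger_input is None:
        raw, parsed = "", None
    else:
        raw, parsed = json.dumps(trigger_input), trigger_input
    return {"payload": parsed, "payload_raw": raw, "context": context}


b = payload_bindings('{"charge_id": "ch_1"}', {"charge_id": "ch_1"})
print(b["payload"]["charge_id"] == b["context"]["charge_id"])  # True

b = payload_bindings("please refund ch_1", {})
print("ch_1" in b["payload_raw"])  # True
```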
trigger_agent_at in test mode is treated as fire-and-forget (the test container never waits for the scheduled time), so its triggered agents are matched by name and count too.
Matching semantics
- Per-trigger. Each trigger_agent call gets its own record, with its own captured tool calls and grandchildren; they don't leak back into the parent.
- At-least-one-must-pass. When the parent triggered the same child more than once, the assertion passes as soon as one waited trigger satisfies the spec.
- Builder context is shared. context.<key> in a child's expected_result or expected_tool_calls reads the same builder dict the top-level case uses, so a fixture id stashed in build() is reachable at every depth.
- Only inside the deploy-gate container. Production trigger_agent calls still route via the normal API path. The in-process dispatch is exclusive to tests so a deploy gate can never side-effect the live deployment.
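
The per-trigger and at-least-one-must-pass rules can be sketched as follows. The record type and its field names are hypothetical, and the failure messages are illustrative wording, not the framework's.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TriggerRecord:
    """One trigger_agent call. Field names here are hypothetical."""
    child: str
    waited: bool           # True when wait_for_response=True
    output: object = None  # only meaningful for waited triggers


def child_result_assertion(
    records: list[TriggerRecord],
    child: str,
    result_check: Callable[[object], bool],
) -> tuple[bool, str]:
    matches = [r for r in records if r.child == child]
    if not matches:
        return False, f"{child} was never triggered"
    waited = [r for r in matches if r.waited]
    if not waited:
        # Fire-and-forget only: there is no output to evaluate.
        return False, f"{child} was only triggered fire-and-forget; wait for the response"
    # At-least-one-must-pass across the waited triggers.
    if any(result_check(r.output) for r in waited):
        return True, "ok"
    return False, f"no waited {child} trigger satisfied the spec"


records = [
    TriggerRecord("summarizer", waited=True, output={"summary": ""}),
    TriggerRecord("summarizer", waited=True, output={"summary": "short text"}),
]
print(child_result_assertion(records, "summarizer", lambda o: o["summary"] != ""))
# (True, 'ok')
```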
File Attachments
Drop binary fixtures (PDFs, images, audio, anything the agent will receive in production) into tests/files/ and reference them by bare filename in the case's files: list. The runner reads each file, base64-encodes it, and attaches it under files. If payload is a JSON object (or comes from a builder returning a dict), its keys sit at the top level of context["payload"] next to files; otherwise the string is delivered as {message: payload}.
tests:
  - name: extracts_invoice_total
    payload: "extract the total amount as JSON"
    files:
      - invoice_acme.pdf
      - invoice_globex.pdf
    expected_result: output.total > 0

  # Files combine with a static payload (the prompt). They can also be
  # used with a builder -- attached files are merged with whatever the
  # builder returns.
  - name: classifies_receipt
    payload: "is this a meal or travel expense?"
    files:
      - receipt.jpg
    expected_result: 'output.category in ("meal", "travel")'

A few constraints worth knowing:
- Bare filenames only. No path separators, no .. segments. The schema rejects anything that looks like a path.
- 25 MB upload budget. Code/config (everything outside tests/files/) is still capped at 5 MB; fixtures get the remaining headroom.
- Mime type is auto-detected from the extension via Python's mimetypes, falling back to application/octet-stream.
- Missing files fail fast. If a referenced fixture isn't in the upload, the deploy gate aborts before building the test image.
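
To make the constraints above concrete, here is a minimal Python sketch of what attaching one fixture might involve. The function and the keys of the returned dict are hypothetical; only the bare-filename rule, the base64 step, and the mimetypes fallback come from the text.

```python
import base64
import mimetypes
from pathlib import Path


def encode_fixture(name: str, fixtures_dir: Path) -> dict:
    """Read <fixtures_dir>/<name> and return an attachment dict.

    The dict keys are hypothetical; the real attachment shape may differ.
    """
    # Bare filenames only: reject anything that looks like a path.
    if name in ("", ".", "..") or "/" in name or "\\" in name:
        raise ValueError(f"not a bare filename: {name!r}")
    path = fixtures_dir / name
    if not path.is_file():
        raise FileNotFoundError(f"missing fixture: {name}")
    # Guess the mime type from the extension, with the documented fallback.
    mime, _encoding = mimetypes.guess_type(name)
    return {
        "name": name,
        "mime_type": mime or "application/octet-stream",
        "content_base64": base64.b64encode(path.read_bytes()).decode("ascii"),
    }
```

Called as encode_fixture("invoice_acme.pdf", Path("tests/files")), this would yield an application/pdf attachment; a name such as ../secret is rejected before any file is read.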
Dynamic Payload Builders
Some tests can't run against a static payload. You need a fresh fixture, a freshly minted database row, or a payload that references an ID that didn't exist a moment ago. Drop a Python module under tests/builders/ exposing build(context, builder_args, test_name, payload, files) (and optionally cleanup(run, context, builder_args)), point the case at it via the builder: field, and the runner will execute the pair once per invocation. build()'s return value (string or dict) becomes the agent input, replacing any static payload.
# tests/builders/create_charge_then_refund.py
import os

import requests


def build(context, builder_args, test_name, payload, files):
    """Provision a fixture in your own API, then return the agent payload.

    context       dict        -- mutate to pass state to cleanup()
                                 AND to expected_result / expected_tool_calls
    builder_args  dict        -- the yaml `builder_args`
    test_name     str         -- the yaml `name`
    payload       str | None  -- the yaml `payload` (if any)
    files         list[str]   -- the yaml `files`
    """
    charge = requests.post(
        f"{os.environ['BILLING_API']}/charges",
        json={"amount_cents": builder_args["amount_cents"], "currency": "eur"},
    ).json()
    # Stash the id so cleanup() can tear the fixture down AND so the
    # yaml expressions can reference it via `context.charge_id`.
    context["charge_id"] = charge["id"]
    return {"charge_id": charge["id"], "instruction": "refund this charge"}


def cleanup(run, context, builder_args):
    """Tear down the fixture. Optionally add Python-level checks.

    Runs after every agent invocation -- pass, fail, or timeout --
    so external resources are always released.

    run["input"]    -- what the agent saw
    run["output"]   -- the agent's parsed output
    run["context"]  -- the run's run_context dict (run_id, agent_name,
                       connector_id, timestamp, plus anything middleware
                       or hooks added during the run)
    context         -- the dict you populated in build()
    builder_args    -- same dict that was passed to build()
    """
    requests.delete(f"{os.environ['BILLING_API']}/charges/{context['charge_id']}")
    # Return False to fail the case (in addition to yaml checks);
    # True/None to pass. The output may not be a dict when the agent
    # timed out or crashed, so guard before calling .get().
    output = run.get("output")
    return isinstance(output, dict) and output.get("refund_id") is not None

tests:
  - name: refunds_a_real_charge
    builder: create_charge_then_refund
    builder_args:
      amount_cents: 4200
    # `context` is the same dict build() mutated -- here it pins the
    # tool call to the exact charge_id the builder provisioned, so the
    # test fails if the agent invents an id or refunds the wrong charge.
    expected_result: output.status == "refunded" and output.charge_id == context.charge_id
    expected_tool_calls:
      - billing.refund: params.charge_id == context.charge_id and invocations == 1

Passing state from build to cleanup
Mutate the context dict inside build(). The same dict is delivered to cleanup() as its second argument. The canonical pattern is to stash an id you provisioned in build, then DELETE that resource in cleanup so the test never leaves residue behind.
Referencing context from yaml assertions
The same context dict is also bound as context in expected_result and expected_tool_calls expressions. That closes the loop on dynamic fixtures: the builder mints an id, the agent receives it via the payload, and the assertion can confirm the agent passed that exact id to the tool call instead of inventing one or grabbing the wrong row.
# tests/builders/insert_then_query.py
import os
import uuid

import requests


def build(context, builder_args, test_name, payload, files):
    test_uuid = str(uuid.uuid4())
    requests.post(
        f"{os.environ['DB_API']}/rows",
        json={"id": test_uuid, "value": "hello"},
    )
    context["test_uuid"] = test_uuid
    return f"Fetch the row with id {test_uuid} and tell me its value."


def cleanup(run, context, builder_args):
    requests.delete(f"{os.environ['DB_API']}/rows/{context['test_uuid']}")

tests:
  - name: fetches_the_row_we_just_inserted
    builder: insert_then_query
    expected_result: output.row.id == context.test_uuid
    expected_tool_calls:
      - db.fetch_row: params.uuid == context.test_uuid and invocations == 1

A few notes:
- Empty for builder-less cases. Tests with no builder get context = {}, so a missing key fails the predicate rather than raising.
- Read after build, before cleanup. Assertions evaluate against the dict as it stands when build() returns. cleanup() still sees the same reference and can mutate it for its own bookkeeping, but those changes can't reach the assertions, since cleanup runs afterwards.
- Same syntax both sides. context.foo.bar[0] works identically in expected_result and inside a params or invocations conjunct.
cleanup() return contract
- Return True or None to pass. Use this for plain teardown that has no opinion on the agent output.
- Return False to fail the case. Useful for Python-level assertions that aren't expressible as a yaml expression. The result is AND-ed with the yaml-defined checks (expected_result, expected_tool_calls, expected_no_tool_calls).
- Always runs. cleanup fires even when the agent timed out or crashed, so teardown is reliable. Exceptions raised by cleanup are caught and reported as a failure with the traceback in failure_reason.
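
Read together, the contract above amounts to something like the following sketch. The helper is hypothetical and the reason strings are illustrative; it only shows how a cleanup() return value or exception could be folded into the case verdict.

```python
import traceback


def final_verdict(yaml_checks_passed: bool, cleanup, run: dict,
                  context: dict, builder_args: dict) -> tuple[bool, str]:
    """Combine the yaml checks with cleanup()'s return value.

    Hypothetical helper: True/None from cleanup passes, False fails, and
    an exception is reported as a failure carrying its traceback.
    """
    try:
        result = cleanup(run, context, builder_args)  # always called
    except Exception:
        return False, traceback.format_exc()
    if result is False:
        return False, "cleanup() returned False"
    if not yaml_checks_passed:
        return False, "yaml assertions failed"
    return True, ""


def teardown_only(run, context, builder_args):
    return None  # plain teardown with no opinion on the output


print(final_verdict(True, teardown_only, {}, {}, {}))   # (True, '')
print(final_verdict(False, teardown_only, {}, {}, {}))  # (False, 'yaml assertions failed')
```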
Other notes
- One re-import per invocation. Builders are reloaded fresh every time, so module-level state (caches, counters) resets between calls. Use this for builders that POST a fixture, since you don't want stale IDs leaking between cases.
- Sync or async. If build or cleanup returns a coroutine, the runner awaits it.
- Same env as the agent. Builders run inside the test container, with the same environment variables, connectors, and network reachability. Authenticating to your own API works exactly as it does from a tool.
- Combine with files. If both builder and files are set, attached fixtures are merged into the builder's output. If the builder returns a dict with its own files key, both lists concatenate.
- Missing builders fail fast. Same pre-build check as files: a typo aborts the deploy gate before the test container even starts.
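
For the sync-or-async note above, an async builder might look like the sketch below, next to one plausible way a caller could handle both styles. call_hook is a hypothetical stand-in for whatever the runner does internally, and the fixture id is made up for illustration.

```python
import asyncio
import inspect


async def _await(awaitable):
    return await awaitable


def call_hook(hook, *args):
    """Call build() or cleanup(); await the result if it is awaitable.

    Hypothetical helper. Assumes no event loop is already running in
    the calling thread, since asyncio.run() starts a fresh one.
    """
    result = hook(*args)
    if inspect.isawaitable(result):
        return asyncio.run(_await(result))
    return result


# An async builder using the documented build() signature.
async def build(context, builder_args, test_name, payload, files):
    await asyncio.sleep(0)  # stands in for an async HTTP call
    context["fixture_id"] = "fx-1"  # made-up id for illustration
    return {"fixture_id": context["fixture_id"]}


context = {}
print(call_hook(build, context, {}, "demo_case", None, []))  # {'fixture_id': 'fx-1'}
```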
Running Tests Ad-Hoc
# Discover tests/*.yaml, build a one-shot test image,
# run every case in your default standard env, and exit
# with status 0 if all cases passed.
connic test

connic test picks the default standard environment's test_environment_id if one is set (see Environments), and falls back to the env itself otherwise. Override or filter as needed:
# Run only cases whose name contains the substring
connic test --filter adds_two_numbers
# Pick a specific environment to execute against
connic test --env <environment-id>
# Machine-readable output for CI
connic test --json
# Print local per-agent coverage and exit (no backend call)
connic test --coverage

As cases finish, the CLI prints them in a results table and surfaces a clickable dashboard link to the throwaway deployment that backed the run. Drill in there for per-case agent runs, traces, tool calls, and outputs.
Exit code is 0 when every case passed, 1 on failure, 2 on infrastructure error. Drop connic test straight into your CI pipeline.
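
If your CI steps are scripted in Python, a thin wrapper can keep the three exit codes distinct, for instance to retry on an infrastructure error but never on a test failure. This is a sketch: the wrapper functions are hypothetical, and only the exit-code meanings come from the line above.

```python
import subprocess

EXIT_MEANINGS = {
    0: "all cases passed",
    1: "at least one case failed",
    2: "infrastructure error",
}


def describe_exit(code: int) -> str:
    """Map a connic test exit code to a human-readable meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_suite(extra_args: tuple[str, ...] = ()) -> int:
    """Run `connic test` and return its exit code unchanged."""
    completed = subprocess.run(["connic", "test", *extra_args])
    print(describe_exit(completed.returncode))
    return completed.returncode
```

A CI step could then call run_suite(("--filter", "adds_two_numbers")) and pass the returned code to sys.exit.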
Coverage Report
connic test --coverage is a static, offline report. It reads agents/ and tests/ from disk and tells you which agents have tests and which tools those tests actually exercise. No backend call, no credentials, no test container. Safe to wire into a pre-commit hook or a doc-style CI job on every PR.
The model is intentionally simple:
- Every agent counts equally. One of ten agents fully covered is 10% overall, regardless of tool count. This keeps the headline number honest when one agent has 20 tools and another has 2.
- Per-agent score = covered tools / total tools. A tool counts as covered if it appears at least once in any of that agent's expected_tool_calls entries (bare or mapping form). To hit 100% on an agent, every one of its tools needs to show up in expected_tool_calls in at least one case.
- No test file → 0%. An agent without a corresponding tests/<agent>.yaml contributes 0 to the average.
- Tool-less agents → 100% if a test file exists. Sequential agents and orchestrators have nothing to cover at the tool level, so a single test file is enough.
- A/B variants are skipped. Test variants like support-test-fast share the base agent's tools and are excluded from the count.
- Discoverable tools count too. Both tools and discoverable_tools are part of the denominator. The auto-injected search_tools / use_tool markers are not.
| Agent | Type | Tools covered | Coverage |
|---|---|---|---|
| stress-tester | llm | 1 / 1 | 100.0% |
| search-agent | llm | 1 / 3 | 33.3% |
| billing-bot | llm | no tests | 0.0% |
| Overall (3 agents) | | | 44.4% |
Uncovered tools are listed beneath the table. For search-agent above: web.fetch, web.summarize.
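
The arithmetic behind the table can be reproduced in a few lines. This is a sketch of the scoring model as described, not the CLI's implementation; the dict keys are hypothetical, and billing-bot's tool count is a placeholder because the table does not show it (it does not affect the result, since an agent with no tests scores 0).

```python
def agent_percent(total_tools: int, covered_tools: int, has_tests: bool) -> float:
    if not has_tests:
        return 0.0    # no test file -> 0%
    if total_tools == 0:
        return 100.0  # tool-less agent with a test file -> 100%
    return 100.0 * covered_tools / total_tools


def overall_percent(agents: list[dict]) -> float:
    """Unweighted mean of per-agent scores: every agent counts equally."""
    scores = [agent_percent(a["total"], a["covered"], a["has_tests"]) for a in agents]
    return sum(scores) / len(scores) if scores else 0.0


agents = [
    {"name": "stress-tester", "total": 1, "covered": 1, "has_tests": True},
    {"name": "search-agent", "total": 3, "covered": 1, "has_tests": True},
    # Placeholder tool count: the table only says "no tests".
    {"name": "billing-bot", "total": 2, "covered": 0, "has_tests": False},
]
print(round(overall_percent(agents), 1))  # 44.4
```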
Pair it with --json to get a machine-readable report ({overall, agents: [{name, type, has_tests, tools_total, tools_covered, uncovered_tools, percent}]}) you can pipe into a CI gate. For example, fail the build if overall coverage drops below a threshold. Unlike connic test, --coverage always exits 0; it's a report, not a gate.
The Deploy Gate
Every deploy, whether triggered by connic deploy or by a git push to your connected branch, runs as a three-step pipeline:
1. Build image: package your project into a runner image.
2. Run tests: spin up a one-shot test container from the just-built image, execute every case in tests/, capture results.
3. Deploy to {env name}: only if every case passed. Otherwise the deployment is marked FAILED and nothing ships.
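
The control flow of those steps can be sketched as below. Everything here is illustrative: the callables stand in for the real build, test, and deploy stages, the "ACTIVE" label is an assumed name for a promoted deployment, and the behavior with an empty suite is not modeled.

```python
from typing import Callable


def run_pipeline(
    build: Callable[[], str],
    run_tests: Callable[[str], list[bool]],
    deploy: Callable[[str], None],
    skip_tests: bool = False,
) -> str:
    """Build, test, then deploy only when every case passed."""
    image = build()
    if not skip_tests:
        results = run_tests(image)  # one bool per test case
        if not all(results):
            return "FAILED"         # nothing ships
    deploy(image)
    return "ACTIVE"                 # assumed status name


shipped: list[str] = []
print(run_pipeline(lambda: "image-1", lambda img: [True, False], shipped.append))  # FAILED
print(shipped)  # []
```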
You can watch this happen live on the deployment detail page in the dashboard. Each step shows pending → in progress → done; the test step expands to show the per-case list with per-invocation pills you can click to open the run drawer.
Test environment override
By default tests run in the deploy environment. To isolate them so a release can't hit real billing APIs, point Settings → Git & Environments → Test environment at a sibling environment with stub credentials and stage-only connectors. The deploy gate will use that env's vars and connectors for the test phase, then promote the image into the real environment if everything passes.
Skipping the gate
# Force a deploy through even if tests fail (or you have none yet).
# Available only on the CLI -- git auto-deploys never set this.
connic deploy --skip-tests

--skip-tests is CLI-only and intended as an escape hatch, for example to get a hotfix out while a flaky test is being debugged. Git auto-deploys never expose it.
Where Results Show Up
- Deployments list: every row carries a Tests column with the suite's pass/fail/skipped status. Ad-hoc connic test runs appear with a purple Test run badge so they don't clutter your real release history.
- Deployment detail → Pipeline: the Build → Tests → Deploy timeline with live per-step status.
- Tests step (expanded): one row per case, with status, success ratio, threshold, and a clickable list of agent run IDs.
- Run history: every test invocation lands in the runs table for the env it executed in, tagged with a purple badge so you can filter them out (or drill in).
Best Practices
Patterns that hold up once a suite has more than a couple of cases: how to set up the test env, keep external state from leaking, and write assertions that fail when something regresses.
Set up a dedicated test environment
Create a sibling environment in Settings → Git & Environments (e.g. staging-test) and point your standard env's Test environment dropdown at it. The deploy gate picks it up automatically; ad-hoc connic test resolves it via the standard env's test_environment_id. Each environment is a fresh slate. DB rows, agent sessions, knowledge base content, env vars, and connectors are all keyed by environment, so a test run can't see or mutate production state. See Configure environments.
For one-off CLI runs, pass --env <environment-id> explicitly rather than relying on the default. The fallback is fine for CI but easy to forget when iterating locally. Default new cases to runs: 1 so deploys stay quick, and only raise runs (with a sub-100 success_threshold) on cases where stochastic LLM behavior actually matters.
Real tools, scoped credentials
Connic agents call real tools at test time. Nothing is auto-mocked, by design, so a passing suite is meaningful rather than a stubbed simulation. Point the test env's env vars at sandbox credentials for every external service the agent touches (Stripe test keys, sandboxed email, throwaway S3 bucket), and swap connectors to stage-only instances so production data is never read or written during a run.
For state that needs to exist before the agent runs (a row in your own API, an inbox message, a webhook fixture), use a dynamic payload builder. build() provisions the fixture and returns the agent input; the runner re-imports the module per invocation so fixtures don't bleed across cases. See Builder reference.
Clean up after dynamic builders
Per-env isolation covers Connic-side state (DB, sessions, KB), but anything the builder creates in your own API or a third-party sandbox is your responsibility to delete. Always implement cleanup(run, context, builder_args) alongside build() and DELETE the resource using the id you stashed in context. cleanup() always runs, even on agent timeout or crash, so it's a safe place for teardown. Returning False also fails the case, useful for Python-level checks that don't fit the YAML expression DSL.
Thread fixture state through to assertions
The context dict the builder mutates is also bound as context.<key> inside expected_result and expected_tool_calls. Use it to verify the agent threaded the exact id the builder minted, not just some id of the right shape. Without that check, the agent can hallucinate a real-looking uuid and the case still passes. The canonical pattern is to stash an id in build, reference it in the payload, and pin both sides of the round-trip:
tests:
  - name: fetches_the_row_we_just_inserted
    builder: insert_then_query
    expected_result: output.row.id == context.test_uuid
    expected_tool_calls:
      - db.fetch_row: params.uuid == context.test_uuid and invocations == 1

See Referencing context from yaml assertions for the matching builder.
Pin both branches of conditional tool selection
When an agent picks tool A or tool B based on input, write two cases: one asserting expected_tool_calls: [A] with expected_no_tool_calls: [B], and the mirror. Without the negative side, each case only checks that its expected tool was called, so a regression where the agent calls both A and B on every input goes undetected. The same pattern works for approval rules, conditional middleware, or any branch where you need to assert which path ran, not just that the agent called something.
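
A toy model shows what the negative side buys you. The checker below is hypothetical and covers bare tool names only; it considers a regression where the agent calls both tools on an input that should select A alone.

```python
def tool_checks_pass(called: set[str], expected: list[str], forbidden: list[str]) -> bool:
    """Bare-name semantics only: expected tools called, forbidden ones not."""
    return all(t in called for t in expected) and not any(t in called for t in forbidden)


# Regression: the agent now calls BOTH tools on an input meant for A only.
called = {"tool_a", "tool_b"}

print(tool_checks_pass(called, ["tool_a"], []))          # True: regression slips through
print(tool_checks_pass(called, ["tool_a"], ["tool_b"]))  # False: negative side catches it
```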
Treat coverage as a separate PR check
connic test --coverage --json is offline, has no backend dependency, and always exits 0, so wiring it into the deploy gate is pointless. Run it as a separate PR-only CI job that parses the overall field and fails if it drops below a threshold (e.g. 60%). That keeps you from merging untested new agents or tools, without slowing every PR down with the full test container path. See Read the coverage report shape.
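
Such a PR check might be as small as the sketch below. It assumes the overall field is a percentage on a 0 to 100 scale, which the report shape does not spell out, and the function name is hypothetical.

```python
import json
import sys


def coverage_gate(report_json: str, threshold: float) -> int:
    """Return 0 when overall coverage meets the threshold, else 1.

    Assumes report["overall"] is a percentage on a 0-100 scale.
    """
    report = json.loads(report_json)
    overall = float(report["overall"])
    if overall >= threshold:
        return 0
    # Point reviewers at what is missing before failing the job.
    for agent in report.get("agents", []):
        if agent.get("uncovered_tools"):
            print(f"{agent['name']}: uncovered {agent['uncovered_tools']}", file=sys.stderr)
    return 1


sample = json.dumps({
    "overall": 44.4,
    "agents": [{"name": "search-agent", "uncovered_tools": ["web.fetch", "web.summarize"]}],
})
print(coverage_gate(sample, 60.0))  # 1
```

In a CI job you would pipe the output of connic test --coverage --json into a script that ends with sys.exit(coverage_gate(sys.stdin.read(), 60.0)).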
Once tests/ exists in your project, the deploy gate is automatic. No CI configuration, no extra commands, no separate runners. Every push to your connected branch passes through the same suite you ran locally.