Connic Composer SDK

Testing Framework

Declarative test suites that live in tests/. Run them ad-hoc with connic test, or let the deploy gate run them automatically before every release.

Overview

Tests as YAML, run by the same runtime as production

A test is a YAML file under tests/ that invokes an agent N times with a fixed payload and asserts on the output and the tools it called. Tests run in a real runner container against a real Connic environment, the same code path your production traffic hits, so a passing suite is meaningful, not a stubbed simulation.

Two ways to run them:

  • Ad-hoc: connic test from your terminal. Useful while iterating.
  • Deploy gate: every connic deploy and git auto-deploy runs the suite before the new image is shipped. A failing case aborts the deploy.

Both paths share the same discovery, the same execution model, and the same dashboard surfacing. The only difference is whether a successful run also promotes a deployment to active.

File Layout

Test files live at the project root in a flat tests/ directory, the same convention as middleware/. The filename stem is the agent the suite targets.

Project structure
my-agent-project/
  agents/
    stress-tester.yaml
  tools/
    math/
      calculator.py
  tests/                    # test suites live here, flat
    stress-tester.yaml      # tests for the stress-tester agent
    search.yaml             # tests for the search agent
    files/                  # binary fixtures referenced by files:
      invoice.pdf
      receipt.jpg
    builders/               # dynamic-payload builders referenced by builder:
      create_charge.py
  requirements.txt
  .connic

So tests/stress-tester.yaml contains tests for the agent named stress-tester. If you want to split a large suite for one agent across multiple files (e.g. smoke vs. load tests), set the top-level agent: field on each file so they all target the same agent. The YAML Format section below shows an example.
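
The resolution rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Connic's implementation: resolve_target_agent is a hypothetical helper, and the suite is assumed to be already parsed into a dict.

```python
from pathlib import PurePosixPath


def resolve_target_agent(test_file: str, suite: dict) -> str:
    """Return the agent a suite targets (hypothetical helper, not a Connic API).

    An explicit top-level `agent:` wins; otherwise the filename stem is used.
    """
    explicit = suite.get("agent")
    if explicit:
        return explicit
    return PurePosixPath(test_file).stem


# Both files resolve to the same agent.
print(resolve_target_agent("tests/stress-tester.yaml", {}))
print(resolve_target_agent("tests/stress-tester-load.yaml", {"agent": "stress-tester"}))
```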

YAML Format

Each file declares one or more test cases. File-level defaults apply to every case; per-case fields override them.

tests/stress-tester.yaml
version: "1.0"

# File-level defaults applied to every case.
# Per-case fields override these.
defaults:
  runs: 5                     # invoke the agent 5 times per case
  success_threshold: 80       # 4/5 must pass for the case to pass
  timeout_s: 60               # per-invocation wall clock

tests:
  - name: adds_two_numbers
    payload: '{"message": "add 4 and 6", "a": 4, "b": 6}'
    expected_result: status == "completed"
    expected_tool_calls:
      - math.calculator.add: invocations >= 1

  - name: plain_message_no_tools
    payload: "say hello"
    expected_result: status == "completed"
    expected_no_tool_calls:
      - math.calculator.add

  - name: high_concurrency_smoke
    payload: '{"message": "stress ping"}'
    runs: 20                  # override file default for this case
    success_threshold: 95
    expected_result: status == "completed"
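
To make the override and threshold arithmetic concrete, here is a minimal Python sketch. It assumes the threshold is inclusive and that a fractional requirement rounds up, which matches the "4/5 must pass" comment above but is otherwise an assumption; effective_settings and required_passes are hypothetical names, not part of the SDK.

```python
# Documented fallbacks when neither the case nor `defaults` sets a field.
BUILT_IN = {"runs": 1, "success_threshold": 100, "timeout_s": 120}


def effective_settings(defaults: dict, case: dict) -> dict:
    """Per-case fields override file-level defaults, which override built-ins."""
    return {
        key: case.get(key, defaults.get(key, fallback))
        for key, fallback in BUILT_IN.items()
    }


def required_passes(runs: int, success_threshold: int) -> int:
    """Smallest number of passing runs that reaches the threshold percentage.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    return (runs * success_threshold + 99) // 100


defaults = {"runs": 5, "success_threshold": 80, "timeout_s": 60}
smoke = {"name": "high_concurrency_smoke", "runs": 20, "success_threshold": 95}

print(effective_settings(defaults, smoke))
# {'runs': 20, 'success_threshold': 95, 'timeout_s': 60}
print(required_passes(5, 80))    # 4
print(required_passes(20, 95))   # 19
```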

When you need more than one suite for the same agent, override the filename-derived default with the agent: field. The two files here both target stress-tester:

tests/stress-tester-load.yaml
# A second suite for the stress-tester agent. The filename can't also be
# stress-tester.yaml (it's already taken), so we point at the agent
# explicitly and give the file a more descriptive name.
agent: stress-tester

defaults:
  runs: 50
  success_threshold: 90

tests:
  - name: sustained_burst
    payload: '{"message": "stress ping"}'
    expected_result: status == "completed"

Field reference

| Field | Type | Status | Description | Default |
| --- | --- | --- | --- | --- |
| version | string | Optional | Schema version. Currently only "1.0". | "1.0" |
| agent | string | Optional | Agent the suite targets. Defaults to the filename stem (e.g. tests/foo.yaml → foo). Set explicitly when you want to split a large suite for one agent across multiple files. | |
| defaults | object | Optional | File-level defaults applied to every case. Per-case fields override these. | |
| defaults.runs | integer | Optional | How many times each case invokes the agent. Range: 1–100. | 1 |
| defaults.success_threshold | integer | Optional | Percent of runs that must pass for the case to pass overall. Range: 1–100. | 100 |
| defaults.timeout_s | integer | Optional | Per-invocation wall-clock timeout in seconds. Range: 1–3600. | 120 |
| tests | object[] | Required | Test cases. At least one required. | |
| name | string | Required | Stable identifier within the file. Must be unique. Surfaces as the row title in the dashboard pipeline panel. | |
| payload | string | Optional | Agent input as a string, same shape as a normal Connic payload. If the string parses as JSON it's converted before output is evaluated, so output.id == 10 works on a JSON reply. Required unless builder is set. | |
| files | string[] | Optional | Bare filenames found in tests/files/. The runner reads each file, base64-encodes it, and attaches it under files. If payload is a JSON object (or comes from a builder returning a dict), its keys sit at the top level of context["payload"] next to files; otherwise the string is delivered as {message: payload}. See File Attachments. | [] |
| builder | string | Optional | Name of a Python module under tests/builders/ (with or without the .py suffix). Replaces the static payload with whatever the module's build() function returns. See Dynamic Payload Builders. | |
| builder_args | object | Optional | Arbitrary kwargs forwarded to the builder as the builder_args argument of build() and cleanup(). Use it to vary fixtures without writing one builder per case. | |
| runs | integer | Optional | Per-case override for defaults.runs. | |
| success_threshold | integer | Optional | Per-case override for defaults.success_threshold. | |
| timeout_s | integer | Optional | Per-case override for defaults.timeout_s. | |
| expected_result | string | Optional | Expression evaluated against bindings output, error, status, context. If omitted, the case passes whenever the run reaches completed. See the Expression DSL section below. | |
| expected_tool_calls | list | Optional | Either bare tool names (called at least once) or one-key mappings {tool: <expr on invocations, params, and/or context>}. Mixed entries allowed in the same list, and the same tool may appear in multiple entries to lock down distinct argument sets independently. | [] |
| expected_no_tool_calls | string[] | Optional | Tool names that must NOT be called during the run. Useful for locking down the negative branch of a conditional tool selection. | [] |
| expected_child_agents | object | Optional | Map of triggered agent name → assertions for that child run. Each entry takes the same expected_result / expected_tool_calls / expected_no_tool_calls fields as the parent, plus its own nested expected_child_agents for deeper trigger chains. See Asserting on triggered agents. | null |
| expected_triggered | integer | Optional | (Inside an expected_child_agents entry.) Minimum number of times the named child agent must be triggered. Useful when the only thing the parent can assert is that a fire-and-forget trigger happened. | 1 |
| expected_payload | string | Optional | (Inside an expected_child_agents entry.) Expression evaluated against the input the parent passed to trigger_agent. Bindings: payload (JSON-parsed when the parent passed a JSON string, else the raw value), payload_raw (string form, "" when N/A), context. Works on fire-and-forget triggers too, since the payload is captured at call time. | |
| expected_result | string | Optional | (Inside an expected_child_agents entry.) Same expression grammar as the top-level field, evaluated against the child run's output. Requires at least one wait_for_response=True trigger. | |
| expected_tool_calls | list | Optional | (Inside an expected_child_agents entry.) Same grammar as the top-level field, evaluated against the child's tool calls. | [] |
| expected_no_tool_calls | string[] | Optional | (Inside an expected_child_agents entry.) Tools the child must NOT call. | [] |
| expected_child_agents | object | Optional | (Inside an expected_child_agents entry.) Recursive: assertions for agents this child triggers in turn. Stack as deep as the trigger chain goes. | null |

Expression DSL

Assertion expressions use the same safe evaluator as tool conditions and approval rules.

Expression Syntax

Python-like syntax: and, or, not; comparisons == != > < >= <=; membership in, not in; parentheses for grouping; string literals in single or double quotes. Reach into nested objects with dot-paths like context.user.role. A bare path like context.active is a truthy check: it passes when the value is set and not empty, zero, or false. Missing fields make the surrounding predicate fail rather than raising.
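
The dot-path and missing-field rules can be illustrated with a small Python sketch. It is not the evaluator Connic ships; resolve_path, is_truthy, and path_equals are hypothetical helpers that only mimic the behavior described above for nested dicts.

```python
_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def resolve_path(bindings: dict, path: str):
    """Walk a dot-path such as 'context.user.role' through nested dicts."""
    current = bindings
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def is_truthy(bindings: dict, path: str) -> bool:
    """Bare-path check: passes when the value is set and not empty, zero, or false."""
    value = resolve_path(bindings, path)
    return value is not _MISSING and bool(value)


def path_equals(bindings: dict, path: str, expected) -> bool:
    """A comparison on a missing field fails instead of raising."""
    value = resolve_path(bindings, path)
    return value is not _MISSING and value == expected
```

With bindings such as {"output": {"id": 10}, "context": {"active": 0}}, path_equals(bindings, "output.id", 10) holds, while is_truthy(bindings, "context.active") and any lookup under a missing key both fail quietly.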

expected_result

output
The agent's output. JSON-parsed when valid JSON, otherwise the raw string. So output.id == 10 and "hi" in output both work.
error
The run's error string, or None.
status
One of "completed", "failed", "cancelled", "blocked", "awaiting_approval".
context.<key>
The builder's context dict, the same one build() mutated. Empty for tests with no builder. Use this to compare agent output against fixture state the builder just provisioned (e.g. output.id == context.row_uuid). See Dynamic Payload Builders.

expected_tool_calls

invocations
Count of calls to the named tool that match the params filter (or all calls, when no filter is given).
params.<key>
Keyword arguments of a single tool invocation. Use to filter calls down to a specific argument set.
context.<key>
Same builder dict as above; available alongside params and invocations so a tool-call assertion can pin params to a fixture id (e.g. params.uuid == context.test_uuid).

The top-level and operator splits a tool-call expression into two kinds of conjunct: params.* filters, applied to each invocation, and invocations predicates, applied to the count of calls that survive the filter. context.* may appear on either side. If only params.* conjuncts are given, invocations >= 1 is implied. Repeat the same tool name across entries to lock down distinct argument sets independently. Tool names match either the local function name or the qualified ref.
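
A rough Python sketch of the filter-then-count evaluation follows. The record shape (a dict with tool and params keys) and the check_tool_calls helper are assumptions made for illustration, and tool names are compared exactly here, without the local-name matching the real runner supports.

```python
def check_tool_calls(calls, tool, params_filter=None, invocations_check=None) -> bool:
    """Evaluate one expected_tool_calls entry against recorded calls.

    params_filter     -- predicate over one call's keyword arguments
    invocations_check -- predicate over the number of calls that matched
    """
    matching = [
        call for call in calls
        if call["tool"] == tool
        and (params_filter is None or params_filter(call["params"]))
    ]
    if invocations_check is None:
        # Only params.* conjuncts were given: "invocations >= 1" is implied.
        return len(matching) >= 1
    return invocations_check(len(matching))


# Hypothetical recording for the "compute 4+6 and 7+8 separately" case.
calls = [
    {"tool": "math.calculator.add", "params": {"a": 4, "b": 6}},
    {"tool": "math.calculator.add", "params": {"a": 7, "b": 8}},
]

# math.calculator.add: invocations == 1 and params.a == 4
first_pair = check_tool_calls(
    calls, "math.calculator.add",
    params_filter=lambda p: p.get("a") == 4,
    invocations_check=lambda n: n == 1,
)
# math.calculator.add: invocations == 1   (no filter, so both calls count)
exactly_once = check_tool_calls(
    calls, "math.calculator.add", invocations_check=lambda n: n == 1
)
print(first_pair, exactly_once)  # True False
```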

expected_result examples

examples
tests:
  # Status check (the most common case)
  - name: completes_cleanly
    payload: "ping"
    expected_result: status == "completed"

  # JSON output via attribute access
  - name: returns_id_10
    payload: '{"a": 4, "b": 6}'
    expected_result: output.id == 10

  # Substring match on a plain-text reply
  - name: greets_user
    payload: "hi"
    expected_result: '"hello" in output'

  # Numeric comparison + boolean composition
  - name: high_confidence_only
    payload: "classify this"
    expected_result: output.confidence >= 0.8 and output.label != "unknown"

  # Negative case: a failure is the expected outcome
  - name: rejects_invalid_input
    payload: '{"vendor": ""}'
    expected_result: status == "failed" and "missing vendor" in error

expected_tool_calls examples

examples
tests:
  # Bare name -- the tool must be called at least once
  - name: uses_calculator
    payload: '{"a": 4, "b": 6}'
    expected_tool_calls:
      - math.calculator.add

  # Mapping form -- expression on invocations
  - name: calls_add_at_least_five_times
    payload: '{"sum_many": [1,2,3,4,5,6]}'
    expected_tool_calls:
      - math.calculator.add: invocations >= 5

  # Exactly-once enforcement
  - name: calls_send_exactly_once
    payload: "send a digest"
    expected_tool_calls:
      - notifications.send: invocations == 1

  # Filter by call arguments via params.* -- asserts the agent
  # actually used the operands from the payload, not invented ones.
  # When invocations is omitted, "at least one matching call" is implied.
  - name: calls_add_with_payload_args
    payload: '{"a": 4, "b": 6}'
    expected_tool_calls:
      - math.calculator.add: params.a == 4 and params.b == 6

  # Repeat the same tool to lock down each argument set independently.
  # Each entry is its own assertion -- this passes when the agent
  # calls add(4, ...) once AND add(7, ...) once, in any order.
  - name: calls_add_for_each_pair
    payload: "compute 4+6 and 7+8 separately"
    expected_tool_calls:
      - math.calculator.add: invocations == 1 and params.a == 4
      - math.calculator.add: invocations == 1 and params.a == 7

  # Pin params against builder context. The builder inserts a row,
  # stashes its uuid in context["test_uuid"], and the agent receives
  # the uuid in its prompt. The assertion fails if the agent fetches
  # any row other than the one the builder provisioned.
  - name: fetches_the_row_we_just_inserted
    builder: insert_then_query
    expected_result: output.row.id == context.test_uuid
    expected_tool_calls:
      - db.fetch_row: params.uuid == context.test_uuid and invocations == 1

  # Negative assertion: tool must NOT be called
  - name: plain_chat_no_tools
    payload: "say hi"
    expected_no_tool_calls:
      - math.calculator.add
      - notifications.send

Asserting on Triggered Agents

When the agent under test calls trigger_agent (see trigger_agent), the test container runs the child agent in-process instead of dispatching to the live deployment. That gives you the same execution model as the parent for any agent the trigger reaches, so expected_child_agents can assert on output, tool calls, and further triggers exactly the way the top-level fields do.

The assertion stacks: each entry is keyed by the triggered agent's name and can carry its own expected_child_agents for whatever that child triggers in turn.

examples
tests:
  # The dispatcher agent calls trigger_agent("summarizer", ...) with
  # wait_for_response=True. In the deploy-gate container the child runs
  # in-process, so its output and tool calls are captured here.
  - name: dispatches_to_summarizer
    payload: '{"text": "..."}'
    expected_child_agents:
      summarizer:
        expected_payload: payload.text != ""
        expected_result: output.summary != ""
        expected_tool_calls:
          - llm.complete: invocations >= 1
        expected_no_tool_calls:
          - email.send

  # Pin the trigger payload against builder context, so the test fails if
  # the agent forwards the wrong fixture id instead of the one it was
  # given. Works whether the parent passed a dict (payload.field) or a
  # string (substring via payload_raw).
  - name: forwards_charge_id_unchanged
    builder: create_charge_then_refund
    builder_args:
      amount_cents: 4200
    expected_child_agents:
      billing-refunder:
        expected_payload: payload.charge_id == context.charge_id

  # Recursive: assert on a grandchild that summarizer triggers in turn.
  # Same shape repeats at every depth -- agent name keys mapping to the
  # same assertion fields, plus its own expected_child_agents.
  - name: dispatches_summarizer_then_publisher
    payload: '{"text": "..."}'
    expected_child_agents:
      summarizer:
        expected_result: output.summary != ""
        expected_child_agents:
          publisher:
            expected_tool_calls:
              - kafka.publish: params.topic == "summaries"

  # Fire-and-forget triggers (wait_for_response=False) cannot have their
  # result inspected, but the payload is recorded at call time -- so
  # expected_payload still applies.
  - name: fans_out_telemetry
    payload: '{"event": "checkout"}'
    expected_child_agents:
      telemetry-writer:
        expected_triggered: 1
        expected_payload: payload.event == "checkout"

Two evaluation paths

  • wait_for_response=True: the child runs synchronously inside the test container with its own tool-call collector, so expected_result, expected_tool_calls, expected_no_tool_calls, and nested expected_child_agents all apply.
  • wait_for_response=False: fire-and-forget. The framework only knows the call happened and what payload it carried; use expected_triggered and expected_payload here. If a fire-and-forget trigger is the only match and the spec carries result / tool / nested assertions, the case fails with a clear reason telling you to wait for the response.

Asserting on the trigger payload

expected_payload uses the same expression grammar as expected_result, just with input-side bindings. Use payload.<key> when the parent passed a dict or a JSON string, and payload_raw for substring checks against a free-form string trigger. context.<key> is bound the same way the other assertions bind it, so you can pin a forwarded fixture id with payload.charge_id == context.charge_id. Because the payload is captured at call time, this assertion works on fire-and-forget triggers too — it's the one piece of every trigger record that's always observable.
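
One way to picture the two input-side bindings is the Python sketch below. It is an interpretation rather than the framework's code: payload_bindings is a hypothetical helper, and treating a non-string input as the "N/A" case for payload_raw is an assumption.

```python
import json


def payload_bindings(trigger_input) -> dict:
    """Derive the payload / payload_raw bindings for one trigger record."""
    if isinstance(trigger_input, str):
        try:
            parsed = json.loads(trigger_input)
        except json.JSONDecodeError:
            parsed = trigger_input  # not JSON: expose the raw string as-is
        return {"payload": parsed, "payload_raw": trigger_input}
    # Assumption: dicts and other non-string inputs have no raw string form.
    return {"payload": trigger_input, "payload_raw": ""}


as_json = payload_bindings('{"charge_id": "ch_1"}')
as_text = payload_bindings("please refund ch_1")
as_dict = payload_bindings({"charge_id": "ch_1"})
```

Here as_json["payload"]["charge_id"] and as_dict["payload"]["charge_id"] both support a payload.charge_id comparison, while the free-form string is best checked with a substring test on payload_raw.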

trigger_agent_at in test mode is treated as fire-and-forget (the test container never waits for the scheduled time), so its triggered agents are matched by name and count too.

Matching semantics

  • Per-trigger. Each trigger_agent call gets its own record, with its own captured tool calls and grandchildren — they don't leak back into the parent.
  • At-least-one-must-pass. When the parent triggered the same child more than once, the assertion passes as soon as one waited trigger satisfies the spec.
  • Builder context is shared. context.<key> in a child's expected_result or expected_tool_calls reads the same builder dict the top-level case uses, so a fixture id stashed in build() is reachable at every depth.
  • Only inside the deploy-gate container. Production trigger_agent calls still route via the normal API path. The in-process dispatch is exclusive to tests so a deploy gate can never side-effect the live deployment.
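
The matching rules above can be summarized in a short Python sketch. The trigger-record shape (waited and output keys), the failure messages, and child_assertion_passes itself are illustrative assumptions, not the framework's internals.

```python
def child_assertion_passes(records, result_check=None, min_triggered=1):
    """Evaluate one expected_child_agents entry against its trigger records.

    records       -- one dict per trigger_agent call to this child
    result_check  -- predicate over a waited trigger's output, or None
    min_triggered -- the entry's expected_triggered value
    Returns (passed, reason).
    """
    if len(records) < min_triggered:
        return False, f"triggered {len(records)} time(s), expected at least {min_triggered}"
    if result_check is None:
        return True, ""
    waited = [record for record in records if record["waited"]]
    if not waited:
        return False, "only fire-and-forget triggers matched; wait for the response to assert on results"
    # At-least-one-must-pass: a single satisfying waited trigger is enough.
    if any(result_check(record["output"]) for record in waited):
        return True, ""
    return False, "no waited trigger satisfied the spec"


records = [
    {"waited": False, "output": None},
    {"waited": True, "output": {"summary": ""}},
    {"waited": True, "output": {"summary": "three bullet points"}},
]
print(child_assertion_passes(records, lambda out: out["summary"] != ""))  # (True, '')
```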

File Attachments

Drop binary fixtures (PDFs, images, audio, anything the agent will receive in production) into tests/files/ and reference them by bare filename in the case's files: list. The runner reads each file, base64-encodes it, and attaches it under files. If payload is a JSON object (or comes from a builder returning a dict), its keys sit at the top level of context["payload"] next to files; otherwise the string is delivered as {message: payload}.

tests/invoice-agent.yaml
tests:
  - name: extracts_invoice_total
    payload: "extract the total amount as JSON"
    files:
      - invoice_acme.pdf
      - invoice_globex.pdf
    expected_result: output.total > 0

  # Files combine with a static payload (the prompt). They can also be
  # used with a builder -- attached files are merged with whatever the
  # builder returns.
  - name: classifies_receipt
    payload: "is this a meal or travel expense?"
    files:
      - receipt.jpg
    expected_result: 'output.category in ("meal", "travel")'

A few constraints worth knowing:

  • Bare filenames only. No path separators and no ".." segments. The schema rejects anything that looks like a path.
  • 25 MB upload budget. Code/config (everything outside tests/files/) is still capped at 5 MB; fixtures get the remaining headroom.
  • Mime type is auto-detected from the extension via Python's mimetypes, falling back to application/octet-stream.
  • Missing files fail fast. If a referenced fixture isn't in the upload, the deploy gate aborts before building the test image.
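
For intuition, the attachment handling can be approximated in Python as follows. The attachment keys (name, mime_type, data) and both helper names are assumptions chosen for illustration; only the base64 encoding, the mimetypes lookup with its fallback, and the {message: payload} wrapping come from the description above.

```python
import base64
import json
import mimetypes


def encode_attachment(name: str, data: bytes) -> dict:
    """Build one attachment record from a bare filename and its bytes."""
    if "/" in name or "\\" in name or name in ("", ".", ".."):
        raise ValueError(f"files entries must be bare filenames, got {name!r}")
    mime_type, _encoding = mimetypes.guess_type(name)
    return {
        "name": name,  # key names here are illustrative
        "mime_type": mime_type or "application/octet-stream",
        "data": base64.b64encode(data).decode("ascii"),
    }


def shape_input(payload: str, attachments: list) -> dict:
    """Place a JSON object's keys next to files; wrap anything else as message."""
    try:
        parsed = json.loads(payload)
    except json.JSONDecodeError:
        parsed = None
    body = dict(parsed) if isinstance(parsed, dict) else {"message": payload}
    body["files"] = attachments
    return body
```

A fixture on disk would be read with something like Path("tests/files/invoice.pdf").read_bytes() before being passed to encode_attachment.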

Dynamic Payload Builders

Some tests can't run against a static payload. You need a fresh fixture, a freshly minted database row, or a payload that references an ID that didn't exist a moment ago. Drop a Python module under tests/builders/ exposing build(context, builder_args, test_name, payload, files) (and optionally cleanup(run, context, builder_args)), point the case at it via the builder: field, and the runner will execute the pair once per invocation. build()'s return value (string or dict) becomes the agent input, replacing any static payload.

tests/builders/create_charge_then_refund.py
# tests/builders/create_charge_then_refund.py
import os

import requests

def build(context, builder_args, test_name, payload, files):
    """Provision a fixture in your own API, then return the agent payload.

    context       dict       -- mutate to pass state to cleanup()
                                AND to expected_result / expected_tool_calls
    builder_args  dict       -- the yaml `builder_args`
    test_name     str        -- the yaml `name`
    payload       str | None -- the yaml `payload` (if any)
    files         list[str]  -- the yaml `files`
    """
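    response = requests.post(
        f"{os.environ['BILLING_API']}/charges",
        json={"amount_cents": builder_args["amount_cents"], "currency": "eur"},
        timeout=30,
    )
    # Fail loudly if the fixture could not be created, instead of
    # hitting a confusing KeyError on charge["id"] below.
    response.raise_for_status()
    charge = response.json()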
    # Stash the id so cleanup() can tear the fixture down AND so the
    # yaml expressions can reference it via `context.charge_id`.
    context["charge_id"] = charge["id"]
    return {"charge_id": charge["id"], "instruction": "refund this charge"}


def cleanup(run, context, builder_args):
    """Tear down the fixture. Optionally add Python-level checks.

    Runs after every agent invocation -- pass, fail, or timeout --
    so external resources are always released.

    run["input"]    -- what the agent saw
    run["output"]   -- the agent's parsed output
    run["context"]  -- the run's run_context dict (run_id, agent_name,
                       connector_id, timestamp, plus anything middleware
                       or hooks added during the run)
    context         -- the dict you populated in build()
    builder_args    -- same dict that was passed to build()
    """
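    # Guard against a missing id so teardown never raises KeyError.
    charge_id = context.get("charge_id")
    if charge_id is not None:
        requests.delete(f"{os.environ['BILLING_API']}/charges/{charge_id}")
    # Return False to fail the case (in addition to yaml checks);
    # True/None to pass. cleanup() also runs after a timeout or crash,
    # when the output may not be a dict, so check before calling .get().
    output = run.get("output")
    return isinstance(output, dict) and output.get("refund_id") is not None

tests/billing-agent.yaml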
tests:
  - name: refunds_a_real_charge
    builder: create_charge_then_refund
    builder_args:
      amount_cents: 4200
    # `context` is the same dict build() mutated -- here it pins the
    # tool call to the exact charge_id the builder provisioned, so the
    # test fails if the agent invents an id or refunds the wrong charge.
    expected_result: output.status == "refunded" and output.charge_id == context.charge_id
    expected_tool_calls:
      - billing.refund: params.charge_id == context.charge_id and invocations == 1

Passing state from build to cleanup

Mutate the context dict inside build(). The same dict is delivered to cleanup() as its second argument. The canonical pattern is to stash an id you provisioned in build, then DELETE that resource in cleanup so the test never leaves residue behind.

Referencing context from yaml assertions

The same context dict is also bound as context in expected_result and expected_tool_calls expressions. That closes the loop on dynamic fixtures: the builder mints an id, the agent receives it via the payload, and the assertion can confirm the agent passed that exact id to the tool call instead of inventing one or grabbing the wrong row.

tests/builders/insert_then_query.py
# tests/builders/insert_then_query.py
import os
import uuid

import requests

def build(context, builder_args, test_name, payload, files):
    test_uuid = str(uuid.uuid4())
    requests.post(
        f"{os.environ['DB_API']}/rows",
        json={"id": test_uuid, "value": "hello"},
    )
    context["test_uuid"] = test_uuid
    return f"Fetch the row with id {test_uuid} and tell me its value."

def cleanup(run, context, builder_args):
    requests.delete(f"{os.environ['DB_API']}/rows/{context['test_uuid']}")

tests/db-agent.yaml
tests:
  - name: fetches_the_row_we_just_inserted
    builder: insert_then_query
    expected_result: output.row.id == context.test_uuid
    expected_tool_calls:
      - db.fetch_row: params.uuid == context.test_uuid and invocations == 1

A few notes:

  • Empty for builder-less cases. Tests with no builder get context = {}, so a missing key fails the predicate rather than raising.
  • Read after build, before cleanup. Assertions evaluate against the dict as it stands when build() returns. cleanup() still sees the same reference and can mutate it for its own bookkeeping, but those changes can't reach the assertions, since cleanup runs afterwards.
  • Same syntax both sides. context.foo.bar[0] works identically in expected_result and inside a params or invocations conjunct.

cleanup() return contract

  • Return True or None to pass. Use this for plain teardown that has no opinion on the agent output.
  • Return False to fail the case. Useful for Python-level assertions that aren't expressible as a yaml expression. The result is AND-ed with the yaml-defined checks (expected_result, expected_tool_calls, expected_no_tool_calls).
  • Always runs. cleanup fires even when the agent timed out or crashed, so teardown is reliable. Exceptions raised by cleanup are caught and reported as a failure with the traceback in failure_reason.
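
The return contract can be read as the following Python sketch. apply_cleanup is a hypothetical wrapper, the reason strings are invented for illustration, and treating only a literal False as a failure is an assumption drawn from the wording above.

```python
import traceback


def apply_cleanup(cleanup, run, context, builder_args, yaml_checks_passed: bool):
    """Combine cleanup()'s verdict with the yaml-defined checks.

    Returns (case_passed, failure_reason) for one invocation.
    """
    try:
        verdict = cleanup(run, context, builder_args)
    except Exception:
        # Exceptions are reported as a failure that carries the traceback.
        return False, traceback.format_exc()
    if verdict is False:
        return False, "cleanup() returned False"
    # True or None: cleanup has no objection, so the yaml checks decide.
    if yaml_checks_passed:
        return True, ""
    return False, "a yaml assertion failed"


def teardown_only(run, context, builder_args):
    return None  # plain teardown with no opinion on the output


def broken(run, context, builder_args):
    raise RuntimeError("sandbox unreachable")


print(apply_cleanup(teardown_only, {}, {}, {}, yaml_checks_passed=True))  # (True, '')
```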

Other notes

  • One re-import per invocation. Builders are reloaded fresh every time, so module-level state (caches, counters) resets between calls. This matters for builders that POST a fixture, since you don't want stale IDs leaking between cases.
  • Sync or async. If build or cleanup returns a coroutine the runner awaits it.
  • Same env as the agent. Builders run inside the test container, with the same environment variables, connectors, and network reachability. Authenticating to your own API works exactly as it does from a tool.
  • Combine with files. If both builder and files are set, attached fixtures are merged into the builder's output. If the builder returns a dict with its own files key, both lists concatenate.
  • Missing builders fail fast. Same pre-build check as files: a typo aborts the deploy gate before the test container even starts.
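
Supporting both sync and async builders usually comes down to awaiting the return value only when it is awaitable. The sketch below shows that pattern in Python; call_maybe_async and the two sample builders are hypothetical, and the real runner may be structured differently.

```python
import asyncio
import inspect


async def call_maybe_async(func, *args):
    """Call func and await the result only if it is awaitable."""
    result = func(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_build(context, builder_args, test_name, payload, files):
    context["mode"] = "sync"
    return "payload from a sync builder"


async def async_build(context, builder_args, test_name, payload, files):
    await asyncio.sleep(0)  # stands in for an async HTTP call
    context["mode"] = "async"
    return {"instruction": "payload from an async builder"}


sync_result = asyncio.run(call_maybe_async(sync_build, {}, {}, "case", None, []))
async_result = asyncio.run(call_maybe_async(async_build, {}, {}, "case", None, []))
```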

Running Tests Ad-Hoc

terminal
# Discover tests/*.yaml, build a one-shot test image,
# run every case in your default standard env, and exit
# with status 0 if all cases passed.
connic test

connic test picks the default standard environment's test_environment_id if one is set (see Environments), and falls back to the env itself otherwise. Override or filter as needed:

terminal
# Run only cases whose name contains the substring
connic test --filter adds_two_numbers

# Pick a specific environment to execute against
connic test --env <environment-id>

# Machine-readable output for CI
connic test --json

# Print local per-agent coverage and exit (no backend call)
connic test --coverage

As cases finish, the CLI prints them in a results table and surfaces a clickable dashboard link to the throwaway deployment that backed the run. Drill in there for per-case agent runs, traces, tool calls, and outputs.

Exit code is 0 when every case passed, 1 on failure, 2 on infrastructure error. Drop connic test straight into your CI pipeline.
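
If your CI steps are scripted in Python, a thin wrapper along these lines can turn the exit code into a readable message. It assumes the connic CLI is on PATH; describe_exit and run_suite are hypothetical helpers, and only the three exit codes come from the description above.

```python
import subprocess

# Exit codes documented for `connic test`.
EXIT_MEANINGS = {
    0: "every case passed",
    1: "at least one case failed",
    2: "infrastructure error",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into a human-readable summary."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_suite(extra_args=()) -> int:
    """Run `connic test` and return its exit code (needs the Connic CLI installed)."""
    completed = subprocess.run(["connic", "test", *extra_args])
    print(describe_exit(completed.returncode))
    return completed.returncode
```

A CI entry point could then end with sys.exit(run_suite(["--filter", "adds_two_numbers"])) so the pipeline inherits the CLI's exit status.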

Coverage Report

connic test --coverage is a static, offline report. It reads agents/ and tests/ from disk and tells you which agents have tests and which tools those tests actually exercise. No backend call, no credentials, no test container. Safe to wire into a pre-commit hook or a doc-style CI job on every PR.

The model is intentionally simple:

  • Every agent counts equally. One of ten agents fully covered is 10% overall, regardless of tool count. This keeps the headline number honest when one agent has 20 tools and another has 2.
  • Per-agent score = covered tools / total tools. A tool counts as covered if it appears at least once in any of that agent's expected_tool_calls entries (bare or mapping form). To hit 100% on an agent, every one of its tools needs to show up in expected_tool_calls in at least one case.
  • No test file → 0%. An agent without a corresponding tests/<agent>.yaml contributes 0 to the average.
  • Tool-less agents → 100% if a test file exists. Sequential agents and orchestrators have nothing to cover at the tool level, so a single test file is enough.
  • A/B variants are skipped. Test variants like support-test-fast share the base agent's tools and are excluded from the count.
  • Discoverable tools count too. Both tools and discoverable_tools are part of the denominator. The auto-injected search_tools / use_tool markers are not.

| Agent | Type | Tools covered | Coverage |
| --- | --- | --- | --- |
| stress-tester | llm | 1 / 1 | 100.0% |
| search-agent | llm | 1 / 3 | 33.3% |
| billing-bot | llm | no tests | 0.0% |
| Overall (3 agents) | | | 44.4% |

Uncovered tools are listed beneath the table. For search-agent above: web.fetch, web.summarize.
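
The scoring model is simple enough to reproduce by hand. The Python sketch below recomputes the example report; agent_percent and overall_percent are hypothetical names, and the tool count used for billing-bot is a placeholder because an agent without tests scores 0 regardless.

```python
def agent_percent(has_tests: bool, tools_total: int, tools_covered: int) -> float:
    """Per-agent score under the rules listed above."""
    if not has_tests:
        return 0.0      # no test file: contributes 0 to the average
    if tools_total == 0:
        return 100.0    # tool-less agent with a test file
    return 100.0 * tools_covered / tools_total


def overall_percent(agents) -> float:
    """Every agent counts equally, regardless of tool count."""
    scores = [agent_percent(*agent) for agent in agents]
    return sum(scores) / len(scores) if scores else 0.0


# (has_tests, tools_total, tools_covered)
example = [
    (True, 1, 1),    # stress-tester
    (True, 3, 1),    # search-agent
    (False, 0, 0),   # billing-bot (tool count is a placeholder)
]
print(round(overall_percent(example), 1))  # 44.4
```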

Pair it with --json to get a machine-readable report ({overall, agents: [{name, type, has_tests, tools_total, tools_covered, uncovered_tools, percent}]}) you can pipe into a CI gate. For example, fail the build if overall coverage drops below a threshold. Unlike connic test, --coverage always exits 0; it's a report, not a gate.

The Deploy Gate

Every deploy, whether triggered by connic deploy or by a git push to your connected branch, runs as a three-step pipeline:

  1. Build image: package your project into a runner image.
  2. Run tests: spin up a one-shot test container from the just-built image, execute every case in tests/, capture results.
  3. Deploy to {env name}: only if every case passed. Otherwise the deployment is marked FAILED and nothing ships.

You can watch this happen live on the deployment detail page in the dashboard. Each step shows pending → in progress → done; the test step expands to show the per-case list with per-invocation pills you can click to open the run drawer.
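
As a mental model, the gate behaves like the Python sketch below: the promote step is reached only when no case failed. The callables, the (name, passed) result shape, and the "ACTIVE" label are assumptions for illustration; only the FAILED outcome and the step order come from the description above.

```python
def run_pipeline(build_image, run_tests, promote):
    """Build, test, then deploy only if every case passed."""
    image = build_image()
    results = run_tests(image)  # list of (case_name, passed) pairs
    failed = [name for name, passed in results if not passed]
    if failed:
        return "FAILED", failed  # nothing ships
    promote(image)
    return "ACTIVE", []


shipped = []
status, failed_cases = run_pipeline(
    build_image=lambda: "runner-image",
    run_tests=lambda image: [("adds_two_numbers", True), ("sustained_burst", False)],
    promote=shipped.append,
)
print(status, failed_cases, shipped)  # FAILED ['sustained_burst'] []
```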

Test environment override

By default tests run in the deploy environment. To isolate them so a release can't hit real billing APIs, point Settings → Git & Environments → Test environment at a sibling environment with stub credentials and stage-only connectors. The deploy gate will use that env's vars and connectors for the test phase, then promote the image into the real environment if everything passes.

Skipping the gate

terminal
# Force a deploy through even if tests fail (or you have none yet).
# Available only on the CLI -- git auto-deploys never set this.
connic deploy --skip-tests

--skip-tests is CLI-only and intended as an escape hatch: for example, getting a hotfix out while a flaky test is being debugged. Git auto-deploys never expose it.

Where Results Show Up

  • Deployments list: every row carries a Tests column with the suite's pass/fail/skipped status. Ad-hoc connic test runs appear with a purple Test run badge so they don't clutter your real release history.
  • Deployment detail → Pipeline: the Build → Tests → Deploy timeline with live per-step status.
  • Tests step (expanded): one row per case, with status, success ratio, threshold, and a clickable list of agent run IDs.
  • Run history: every test invocation lands in the runs table for the env it executed in, tagged with a purple badge so you can filter them out (or drill in).

Best Practices

Patterns that hold up once a suite has more than a couple of cases: how to set up the test env, keep external state from leaking, and write assertions that fail when something regresses.

Set up a dedicated test environment

Create a sibling environment in Settings → Git & Environments (e.g. staging-test) and point your standard env's Test environment dropdown at it. The deploy gate picks it up automatically; ad-hoc connic test resolves it via the standard env's test_environment_id. Each environment is a fresh slate. DB rows, agent sessions, knowledge base content, env vars, and connectors are all keyed by environment, so a test run can't see or mutate production state. See Configure environments.

For one-off CLI runs, pass --env <environment-id> explicitly rather than relying on the default. The fallback is fine for CI but easy to forget when iterating locally. Default new cases to runs: 1 so deploys stay quick, and only raise runs (with a sub-100 success_threshold) on cases where stochastic LLM behavior actually matters.

Real tools, scoped credentials

Connic agents call real tools at test time. Nothing is auto-mocked, by design, so a passing suite is meaningful rather than a stubbed simulation. Point the test env's env vars at sandbox credentials for every external service the agent touches (Stripe test keys, sandboxed email, throwaway S3 bucket), and swap connectors to stage-only instances so production data is never read or written during a run.

For state that needs to exist before the agent runs (a row in your own API, an inbox message, a webhook fixture), use a dynamic payload builder. build() provisions the fixture and returns the agent input; the runner re-imports the module per invocation so fixtures don't bleed across cases. See Builder reference.

Clean up after dynamic builders

Per-env isolation covers Connic-side state (DB, sessions, KB), but anything the builder creates in your own API or a third-party sandbox is your responsibility to delete. Always implement cleanup(run, context, builder_args) alongside build() and DELETE the resource using the id you stashed in context. cleanup() always runs, even on agent timeout or crash, so it's a safe place for teardown. Returning False also fails the case, useful for Python-level checks that don't fit the YAML expression DSL.

Thread fixture state through to assertions

The context dict the builder mutates is also bound as context.<key> inside expected_result and expected_tool_calls. Use it to verify the agent threaded the exact id the builder minted, not just some id of the right shape. Without that check, the agent can hallucinate a real-looking uuid and the case still passes. The canonical pattern is to stash an id in build, reference it in the payload, and pin both sides of the round-trip:

tests/db-agent.yaml
tests:
  - name: fetches_the_row_we_just_inserted
    builder: insert_then_query
    expected_result: output.row.id == context.test_uuid
    expected_tool_calls:
      - db.fetch_row: params.uuid == context.test_uuid and invocations == 1

See Referencing context from yaml assertions for the matching builder.

Pin both branches of conditional tool selection

When an agent picks tool A or tool B based on input, write two cases: one asserting expected_tool_calls: [A] with expected_no_tool_calls: [B], and the mirror. Without the negative side, both cases pass as long as some tool got called, so a regression that swaps A and B goes undetected. The same pattern works for approval rules, conditional middleware, or any branch where you need to assert which path ran, not just that the agent called something.

Treat coverage as a separate PR check

connic test --coverage --json is offline, has no backend dependency, and always exits 0, so wiring it into the deploy gate is pointless. Run it as a separate PR-only CI job that parses the overall field and fails if it drops below a threshold (e.g. 60%). That keeps you from merging untested new agents or tools, without slowing every PR down with the full test container path. See Coverage Report for the shape of the JSON report.
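
A PR check along those lines might look like the Python sketch below. It assumes the report is printed to stdout and that overall is a percentage between 0 and 100; neither detail is confirmed above, and coverage_ok, uncovered_summary, and main are hypothetical names.

```python
import json
import subprocess

THRESHOLD = 60.0  # percent; the example threshold mentioned above


def coverage_ok(report: dict, threshold: float = THRESHOLD) -> bool:
    """True when overall coverage meets the threshold."""
    return float(report["overall"]) >= threshold


def uncovered_summary(report: dict) -> list:
    """One line per agent that still has uncovered tools."""
    return [
        f"{agent['name']}: {', '.join(agent['uncovered_tools'])}"
        for agent in report["agents"]
        if agent["uncovered_tools"]
    ]


def main() -> int:
    """Run the offline report and return a CI exit code."""
    completed = subprocess.run(
        ["connic", "test", "--coverage", "--json"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(completed.stdout)
    if coverage_ok(report):
        return 0
    print(f"coverage {report['overall']}% is below {THRESHOLD}%")
    for line in uncovered_summary(report):
        print("  " + line)
    return 1
```

In the CI job, call sys.exit(main()) so a drop below the threshold fails the check.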

Tests as part of the deploy

Once tests/ exists in your project, the deploy gate is automatic. No CI configuration, no extra commands, no separate runners. Every push to your connected branch passes through the same suite you ran locally.