Connic

Test agents like you test code

Declarative YAML test suites under tests/. Same runtime, real environment, real connectors. Run them ad-hoc with connic test, or let the deploy gate run them before every release.

Read the testing docs

$ connic test

tests/invoice-processor.yaml
4 passed, 1 failed
  • extracts invoice total         312ms
  • amount > 0                       4ms
  • currency in [EUR, USD]           6ms
  • vat is correctly calculated     18ms
  • vendor name extracted           11ms

351ms total
Anatomy of a test case

A YAML file in tests/, run by the same runtime as production

Each case invokes the agent N times with a fixed payload and asserts on the output and the tools it called.

tests/invoice-processor.yaml
version: "1.0"

defaults:
  runs: 5                       # invoke the agent 5 times per case
  success_threshold: 80         # 4/5 must pass for the case to pass
  timeout_s: 60                 # per-invocation wall clock

tests:
  - name: extracts_invoice_total
    payload: '{"message": "extract total", "doc_id": "INV-7821"}'
    expected_result: output.total > 0 and output.currency in ("EUR", "USD")
    expected_tool_calls:
      - invoices.extract: invocations >= 1
    expected_no_tool_calls:
      - notifications.send

$ connic test

tests/invoice-processor.yaml
5 passed, 0 failed
  • extracts_invoice_total           312ms
  • expected_result passed (5/5)       4ms
  • invoices.extract called 5/5        6ms
  • notifications.send not called      4ms
  • success threshold 80 met           2ms

328ms total
What you can assert

Expression DSL on output, plus tool-call shape

Assertions use the same safe evaluator as agent tool conditions and approval rules. Python-like syntax against the run's output, error, status, and recorded tool invocations.

expected_result

Python-like expression over the bindings output, error, and status. Supports attribute and subscript access, comparisons, boolean operators, and membership tests.
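
A few more expressions within that grammar, for illustration (the retries field and the "completed" status value are assumptions, not documented names):

expected_result: output.total > 0 and output.total < 10000       # comparisons + boolean operators
expected_result: output["currency"] in ("EUR", "USD")            # subscript access + membership test
expected_result: status == "completed" and output.retries < 3    # assumed status value and field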

expected_tool_calls

Bare tool names (called at least once) or one-key mappings like {tool: invocations >= 5}. Mixed entries allowed.
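
Both forms can sit in the same list, as in this sketch (vendors.lookup is a hypothetical tool name):

expected_tool_calls:
  - invoices.extract                    # bare name: called at least once
  - vendors.lookup: invocations >= 2    # one-key mapping with an explicit count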

expected_no_tool_calls

Tool names that must NOT be called during the run. Catches the case where the agent should have skipped a tool but didn't.

runs + success_threshold

Invoke the case N times (1–100); the case passes if at least the threshold percentage of runs succeed. Lets you keep enforcing strict assertions on stochastic agents without one flaky run sinking the suite.

timeout_s

Per-invocation wall-clock timeout in seconds (1–3600). A timeout counts as a failed run against the threshold.

Python-level checks via cleanup()

Dynamic payload builders can return False from cleanup() to fail the case. The result is AND-ed with the YAML-defined checks, covering assertions you can't express in the expression DSL.
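
A minimal sketch of such a builder, assuming the build()/cleanup() module contract described in the FAQ below; the temp-file fixture and the payload shape build() returns are illustrative assumptions:

tests/builders/invoice_fixture.py
import json
import os
import tempfile

_fixture_path = None

def build():
    # Illustrative: write a throwaway fixture file, then return the payload.
    global _fixture_path
    fd, _fixture_path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump({"doc_id": "INV-7821", "total": 1234.50}, f)
    return json.dumps({"message": "extract total", "doc_id": "INV-7821"})

def cleanup():
    # Always runs, even on timeout or crash. Returning False fails the case;
    # the result is AND-ed with the YAML-defined checks.
    still_there = _fixture_path is not None and os.path.exists(_fixture_path)
    if still_there:
        os.remove(_fixture_path)          # tear the fixture down reliably
    return still_there                    # a check the expression DSL can't make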

For LLM-graded quality scoring on production runs, configure Judges separately in the dashboard.

Two ways to run tests

Ad-hoc with the CLI, automatic on every deploy

connic test exits 0 when every case passes, 1 on failure, 2 on infrastructure error. Drop it into any CI runner, or skip CI entirely: every connic deploy and git auto-deploy already runs the suite as a deploy gate before the new image ships.

.github/workflows/agent-tests.yml
name: agent tests

on:
  pull_request:
    paths:
      - "agents/**"
      - "tools/**"
      - "tests/**"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install connic
      - run: connic test --env ${{ secrets.CONNIC_CI_ENV_ID }} --json
        env:
          CONNIC_API_KEY: ${{ secrets.CONNIC_API_KEY }}

Why not a spreadsheet of prompts?

What separates Connic testing from the patterns most teams outgrow

Feature                                                      Connic     Spreadsheet eval   LangSmith eval   DIY pytest
Versioned with the agent project                             Included   Not included       Partial          Included
Runs in CI                                                   Included   Not included       Partial          Included
Expression DSL on output / status / error                    Included   Not included       Partial          Included
Tool-call assertions (positive and negative)                 Included   Not included       Partial          Partial
Repeats per case + success threshold for stochastic models   Included   Not included       Partial          Partial
Dynamic Python payload builders with cleanup                 Included   Not included       Not included     Partial
Same runtime, real environment, real connectors              Included   Not included       Partial          Not included
Automatic deploy gate (no CI config needed)                  Included   Not included       Not included     Not included

Frequently Asked Questions

How do runs and success_threshold interact?

Each case sets runs and success_threshold. runs invokes the agent N times (1–100) per case; success_threshold is the percentage of those runs that must pass for the case to pass overall. So runs: 5, success_threshold: 80 means 4 of 5 invocations must pass.

How do I test with file inputs?

Drop fixtures into tests/files/ and reference them by bare filename in the case's files: list. The runner reads each file, base64-encodes it, and delivers a multimodal payload of the shape {message, files: [{name, mime_type, data}]}. That's the same wire format webhook multipart uploads produce.
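
A sketch of that wiring (the case name, the filename, and the way payload sits alongside files: are assumptions here):

tests:
  - name: extracts_total_from_pdf
    payload: '{"message": "extract total"}'
    files:
      - invoice-7821.pdf        # read from tests/files/, base64-encoded by the runner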

What if a fixed payload isn't enough?

Use a dynamic payload builder. Drop a Python module under tests/builders/ exposing build() and optionally cleanup(), point the case at it via the builder: field, and the runner executes the pair once per invocation. cleanup() always runs, even on timeout or crash, so external fixtures get torn down reliably.
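
On the YAML side, a sketch assuming builder: takes the module name under tests/builders/:

tests:
  - name: invoice_roundtrip
    builder: invoice_fixture    # hypothetical module, tests/builders/invoice_fixture.py
    expected_result: output.total > 0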

Do tests hit my production environment?

Tests run inside a one-shot test container against a real Connic environment, same code path as production. By default that's the deploy environment. Set the test_environment_id on your standard environment to isolate test traffic to a sibling env with stub credentials and stage-only connectors.

Do I need to set up CI for the deploy gate?

No. Once tests/ exists, the deploy gate is automatic: every connic deploy and every git auto-deploy runs the suite before the new image ships. A failing case aborts the deployment. You can still run connic test locally or in CI for faster feedback before pushing.

Can I skip the tests for an urgent deploy?

connic deploy --skip-tests is a CLI-only escape hatch, useful for getting a hotfix out while a flaky test is being debugged. Git auto-deploys never expose it.