Connic

Agent Observability: Track Costs, Tokens & Runs

Deploying AI agents without visibility is flying blind. Build custom dashboards, track LLM costs per model, and catch failures before users do.

January 23, 2026 (last updated: May 20, 2026) · 8 min read

You deployed your first AI agent. It processed 500 requests yesterday. Great news, right? Except you don't know how many tokens it consumed, what it cost you, or why 12% of those requests failed silently. Welcome to the observability problem.

Traditional APM tools were built for request-response patterns: latency percentiles, error rates, throughput. AI agents don't fit that mold. They make multiple LLM calls per request, token usage varies wildly based on context, and costs can spike 10x when users send longer inputs. You need observability built specifically for agentic workloads.

What Makes Agent Observability Different

When a user sends a message to your agent, a lot happens behind the scenes. The agent might:

  1. Query a knowledge base for context (RAG retrieval)
  2. Make an initial LLM call to reason about the request
  3. Execute 2-3 tool calls (API requests, database queries)
  4. Make another LLM call to synthesize results
  5. Optionally call another agent for specialized tasks

Each step consumes tokens, and each step can fail. Traditional metrics like "average response time" hide all that. You need granular visibility into each phase.
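
To make "granular visibility into each phase" concrete, here is a minimal sketch of a per-step trace record. The `Step` and `RunTrace` classes, their field names, and the sample values are hypothetical and only illustrate the idea; they are not Connic's actual trace schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One phase of an agent run (hypothetical schema, for illustration)."""
    name: str
    kind: str  # e.g. "retrieval", "llm", "tool", "agent"
    input_tokens: int = 0
    output_tokens: int = 0
    duration_ms: float = 0.0
    error: Optional[str] = None  # None means the step succeeded


@dataclass
class RunTrace:
    """All steps recorded for a single agent run."""
    run_id: str
    steps: List[Step] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.steps)

    @property
    def failed_steps(self) -> List[Step]:
        return [s for s in self.steps if s.error is not None]

    @property
    def succeeded(self) -> bool:
        return not self.failed_steps


# Sample run: one tool call fails mid-chain, but the run still returns an answer.
trace = RunTrace("run-001", [
    Step("kb_lookup", "retrieval", duration_ms=120),
    Step("plan", "llm", input_tokens=1800, output_tokens=150, duration_ms=900),
    Step("fetch_order", "tool", duration_ms=300, error="HTTP 503 from orders API"),
    Step("synthesize", "llm", input_tokens=2400, output_tokens=400, duration_ms=1400),
])

for step in trace.failed_steps:
    print(f"{trace.run_id}: step '{step.name}' failed: {step.error}")
```

An average response time for this run would look unremarkable. The per-step record is what shows that the `fetch_order` tool call failed and that the two LLM calls account for all of the token usage.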

The Four Pillars of Agent Observability

Run Tracking
Total runs, success rates, failures. The baseline health metrics for your agent fleet.
Token Usage
Input vs output tokens, cached tokens, reasoning tokens. Understand where context is going.
Cost Attribution
Per-model pricing, volume tiers, input vs output costs. Know exactly where money goes.
Real-time Logs
Live run history, duration, status. Debug issues as they happen, not hours later.

Building Your First Dashboard

Connic creates a default dashboard when you start your first project, but the real value comes from customizing it. Here's how to build a dashboard for what you actually look at:

Step 1: Navigate to Observability

In your project sidebar, click Observability. You'll see the default dashboard with pre-configured widgets for total runs, success rate, token usage, and costs.

Step 2: Enter Edit Mode

Click the Edit button in the top right. This unlocks drag-and-drop arrangement and the ability to add, remove, or configure widgets.

Step 3: Add Widgets

Click Add Widget to choose from three widget types:

Stat Cards

Single metric displays. Choose from: Total Runs, Success Rate, Failed Runs, Tool Calls, Total Tokens, Input/Output Tokens, Total Cost, Input/Output Cost, Avg Cost per Run, Avg Tokens per Run.

Area Charts

Time-series visualizations. Track agent runs (completed vs failed), token usage (input vs output over time), or token cost trends.

Logs Lists

Recent activity feeds. Show agent runs or connector runs with status, duration, and direct links to detailed traces.
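
As a rough illustration of what the stat cards summarize, the sketch below derives a few of the listed metrics from a list of run records. The `Run` fields and the `stat_cards` helper are assumptions made for this example, not Connic's data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """Hypothetical run record, for illustration only."""
    status: str  # "completed" or "failed"
    input_tokens: int
    output_tokens: int
    cost_usd: float


def stat_cards(runs: List[Run]) -> Dict[str, float]:
    """Aggregate run records into stat-card style metrics."""
    total = len(runs)
    failed = sum(1 for r in runs if r.status == "failed")
    tokens = sum(r.input_tokens + r.output_tokens for r in runs)
    cost = sum(r.cost_usd for r in runs)
    return {
        "total_runs": total,
        "failed_runs": failed,
        "success_rate": (total - failed) / total if total else 0.0,
        "total_tokens": tokens,
        "total_cost": cost,
        "avg_tokens_per_run": tokens / total if total else 0.0,
        "avg_cost_per_run": cost / total if total else 0.0,
    }


runs = [
    Run("completed", 2000, 500, 0.01),
    Run("completed", 3000, 700, 0.02),
    Run("failed", 1000, 0, 0.0),
    Run("completed", 4000, 800, 0.03),
]
cards = stat_cards(runs)
print(f"Success rate: {cards['success_rate']:.0%}")
print(f"Avg tokens per run: {cards['avg_tokens_per_run']:.0f}")
```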

Step 4: Filter by Agent

Most widgets support filtering by agent. If you run multiple agents for different purposes, create separate widgets per agent, or compare them side-by-side in the same chart.

Understanding Token Economics

Token usage drives your LLM costs, but not all tokens are equal:

Input tokens: What you send to the model: system prompt, user message, RAG context
Output tokens: What the model generates: responses, tool calls, reasoning
Cached tokens: Input tokens that hit provider caching, often 10x cheaper
Thinking tokens: Reasoning tokens from models like o1 or Claude with extended thinking

Output tokens typically cost 3–8x as much as input tokens across major providers; check each provider's current pricing page (for example, OpenAI's API pricing) for up-to-date ratios across model tiers. If your costs look high, check output usage first. Long, verbose responses are usually the culprit.
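
The arithmetic behind that advice can be sketched as follows. The rates are placeholders rather than any provider's real prices, and the example assumes cached tokens are reported separately from uncached input tokens (providers differ on this).

```python
PER_MILLION = 1_000_000


def run_cost(uncached_input, cached_input, output, rates):
    """Return the cost breakdown of one run in USD.

    `rates` maps token type to a price per 1 million tokens.
    """
    breakdown = {
        "input": uncached_input * rates["input"] / PER_MILLION,
        "cached": cached_input * rates["cached"] / PER_MILLION,
        "output": output * rates["output"] / PER_MILLION,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder rates: output at 5x input, cached input at one tenth of input.
rates = {"input": 2.00, "cached": 0.20, "output": 10.00}

cost = run_cost(uncached_input=10_000, cached_input=40_000, output=2_000, rates=rates)
output_share = cost["output"] / cost["total"]
print(f"Total: ${cost['total']:.4f}, output share of cost: {output_share:.0%}")
```

With these placeholder numbers, output tokens make up under 4% of the tokens in the run but about 42% of its cost, which is why output usage is the first place to look.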

Setting Up Model Pricing

Token counts are useful, but dollar amounts are actionable. To convert tokens to costs, configure pricing for the models your agents use.

Global Defaults

Connic includes global pricing for popular models out of the box. These appear with a "global" badge in your pricing settings, so you don't need to configure anything to track costs for common models like GPT, Claude Sonnet, or Gemini.

Custom Model Pricing

Using a fine-tuned model, self-hosting, or just need different pricing than the defaults? Navigate to Settings > Observability and click Add Pricing.

Model Pattern Examples
# Exact model match
openai/gpt-5-mini
anthropic/claude-haiku-4-5
gemini/gemini-2.5-flash

# Regex pattern for model families
openai/gpt-5.*          # Matches all GPT-5 variants
anthropic/claude-.*      # All Claude models
gemini/gemini-.*         # All Gemini models

All pricing is per 1 million tokens. Project-level pricing overrides global defaults, so you can customize costs for specific use cases without affecting other projects.
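
One way to picture how pattern-based pricing with project overrides might resolve is sketched below. The matching rules here (full-string regex match, project entries checked before global ones, first match wins) and all rates are assumptions for illustration. The source does not specify how Connic resolves overlapping patterns.

```python
import re

# Placeholder rates in USD per 1 million tokens, not real provider prices.
GLOBAL_PRICING = [
    ("openai/gpt-5.*", {"input": 1.00, "output": 8.00}),
    ("anthropic/claude-.*", {"input": 3.00, "output": 15.00}),
]
PROJECT_PRICING = [
    ("openai/gpt-5-mini", {"input": 0.25, "output": 2.00}),
]


def resolve_pricing(model, project=PROJECT_PRICING, global_=GLOBAL_PRICING):
    """Return the first matching rate entry, checking project rules first."""
    for rules in (project, global_):
        for pattern, model_rates in rules:
            # fullmatch: the pattern must cover the entire model name.
            if re.fullmatch(pattern, model):
                return model_rates
    return None  # No pricing configured for this model


print(resolve_pricing("openai/gpt-5-mini"))           # project override
print(resolve_pricing("openai/gpt-5"))                # global family pattern
print(resolve_pricing("anthropic/claude-haiku-4-5"))  # global family pattern
print(resolve_pricing("example/unknown-model"))       # no match
```

Note that in a regex, `.` matches any character, so `openai/gpt-5.*` matches every model name that starts with `openai/gpt-5`.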

Volume-Based Pricing Tiers

Some providers offer tiered pricing for high-volume usage. Configure volume tiers to accurately track costs when your token counts exceed certain thresholds:

Base rate (0-200K tokens): $2.50 / 1M
Above 200K tokens: $2.00 / 1M
Above 1M tokens: $1.50 / 1M
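
To see how tiers like these change the total, here is a small sketch that prices a token count against the example thresholds above. It assumes marginal tiers, where each rate applies only to the tokens that fall inside its band (like tax brackets). Some providers instead apply the higher-tier rate to the whole request, so treat this as one possible interpretation.

```python
# (tier start in tokens, USD per 1 million tokens), taken from the example tiers above.
TIERS = [
    (0, 2.50),
    (200_000, 2.00),
    (1_000_000, 1.50),
]


def tiered_cost(tokens, tiers=TIERS):
    """Price `tokens` with marginal tiers (an assumption, see the lead-in)."""
    cost = 0.0
    for i, (start, rate) in enumerate(tiers):
        # The band ends where the next tier starts; the last band is open-ended.
        end = tiers[i + 1][0] if i + 1 < len(tiers) else float("inf")
        tokens_in_band = max(0, min(tokens, end) - start)
        cost += tokens_in_band * rate / 1_000_000
    return cost


print(f"${tiered_cost(100_000):.2f}")    # entirely within the base band
print(f"${tiered_cost(1_500_000):.2f}")  # spans all three bands
```

Under this assumption, 1.5M tokens cost $2.85 (200K at $2.50, 800K at $2.00, 500K at $1.50, all per 1M) instead of the $3.75 a flat base rate would give.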

Multi-Dashboard Workflows

One dashboard rarely fits all needs. Create multiple dashboards for different perspectives:

  • Executive Overview: High-level cost and success metrics for weekly reviews
  • Debugging Dashboard: Recent runs, failure rates, and logs for on-call engineers
  • Cost Optimization: Token breakdowns and cost trends for budget planning
  • Agent Comparison: Side-by-side metrics for A/B testing different agent configurations

Set a default dashboard that loads on entry, and configure a default time range per dashboard. Your executive overview might default to 30 days while the debugging dashboard shows the last hour.

Real-Time Monitoring

Dashboards auto-refresh every 10 seconds, and a "Last updated" indicator shows how fresh the data is. For incident response, you can watch failures happen live without hitting refresh.

Pro Tip: Environment Isolation
Each environment (development, staging, production) has isolated observability data. Use the environment selector to switch contexts. Production dashboards stay clean even while you're running thousands of test runs in development.

Common Patterns and Anti-Patterns

DO: Track cost per agent
Different agents have different cost profiles. Your research agent might use GPT-5 while your simple FAQ bot uses Gemini Flash. Track them separately.
DON'T: Ignore success rate drops
A drop from 95% to 85% success sounds minor, but the failure rate triples, from 5% to 15%. Set alerting thresholds based on percentages, not raw counts.
DO: Compare input vs output ratios
Healthy agents typically have 2-5x more input than output tokens (context + RAG retrieval). An inverted ratio often indicates runaway generation or inefficient prompts.
DON'T: Rely on averages alone
Average cost per run hides outliers. One 50K token conversation can skew your daily average. Use time-series charts to spot anomalies.
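
The numbers behind these patterns can be checked with a few lines of arithmetic. The helper names and sample values below are made up for illustration, and the "inverted ratio" check simply flags runs with more output than input tokens.

```python
from statistics import mean, median


def failure_multiplier(old_success, new_success):
    """How many times more failures occur after a success-rate change."""
    return (1 - new_success) / (1 - old_success)


def io_ratio(input_tokens, output_tokens):
    """Input-to-output token ratio; below 1.0 means the ratio is inverted."""
    return input_tokens / output_tokens


# A 95% -> 85% success rate means the failure rate goes from 5% to 15%.
print(f"Failures multiplied by {failure_multiplier(0.95, 0.85):.1f}")

# Healthy run (ratio in the 2-5x range) vs. a run with runaway generation.
print(f"Healthy ratio: {io_ratio(9_000, 3_000):.2f}")
print(f"Inverted ratio: {io_ratio(1_000, 4_000):.2f}")

# Nine ordinary runs plus one 50K-token conversation.
tokens_per_run = [2_000] * 9 + [50_000]
print(f"Mean: {mean(tokens_per_run):.0f}, median: {median(tokens_per_run):.0f}")
```

In the last example the mean is 6,800 tokens per run while the median is 2,000. A single outlier more than triples the average, which is the reason to look at time-series charts rather than averages alone.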

Getting Started

Observability is available in all Connic projects. To start:

  1. Deploy an agent and run a few requests to generate data
  2. Navigate to Observability in your project
  3. Review the default dashboard, then customize for your needs
  4. Configure model pricing in Settings > Observability for accurate cost tracking

Running agents without observability is like driving without a dashboard. You might get where you're going, but you won't know you're out of gas until it's too late.

Check out the quickstart guide to deploy your first agent, or explore the agent documentation to learn about advanced configurations.

Frequently Asked Questions

What is AI agent observability and why is it different from traditional APM?

AI agent observability tracks the full internal execution of an agent run (each LLM call, tool invocation, token count, cost, and failure reason) as a hierarchical trace. Traditional APM treats a request as a single span with latency and status. Agents make multiple LLM calls per request, token costs can vary 10x based on input length, and failures often happen silently mid-chain, so they require per-step visibility that general APM tools do not provide.

What metrics should I track for AI agents in production?

The four essential metrics are: run tracking (total runs, success rate, failure rate), token usage (input vs output vs cached tokens per run), cost attribution (per-model, per-agent, per-run dollar amounts), and real-time logs (run duration, status, tool calls). Beyond these basics, per-step traces let you pinpoint exactly where failures and cost spikes occur.

How do I reduce AI agent LLM costs using observability data?

Start by checking the input-to-output token ratio. A healthy agent typically uses 2–5x more input than output tokens. An inverted ratio often signals runaway generation or inefficient prompts. Use time-series charts to spot cost spikes, since a single long conversation can skew daily averages. Enable prompt caching for system prompts and recurring RAG context, which can make those cached input tokens roughly 10x cheaper on providers that support it.

How do I track AI agent costs per model across multiple providers?

Configure model pricing in your observability settings with exact per-million-token rates for each model your agents use. Most providers price output tokens at 3–8x the input token rate. For volume-based pricing tiers, configure threshold breakpoints so cost tracking remains accurate as usage scales. Project-level pricing overrides let you customize rates per use case.

What is a good AI agent success rate to target?

There is no universal threshold; it depends on use case risk and business impact. As a starting point, note that a drop from 95% to 85% success rate triples the failure rate, from 5% to 15% of runs. Set alerting thresholds based on percentage points, not raw failure counts, and track the rate over time rather than treating any single number as a pass/fail target.
