Agent Observability: Track Costs, Tokens & Runs

You deployed your first AI agent. It processed 500 requests yesterday. Great news, right? Except you have no idea how many tokens it consumed, what it cost you, or why 12% of those requests failed silently. Welcome to the observability problem.

Traditional APM tools were built for request-response patterns: latency percentiles, error rates, throughput. AI agents are different. They make multiple LLM calls per request. Token usage varies wildly based on context. Costs can spike 10x when users send longer inputs. You need observability built specifically for agentic workloads.

What Makes Agent Observability Different

When a user sends a message to your agent, a lot happens behind the scenes. The agent might:

1. Query a knowledge base for context (RAG retrieval)
2. Make an initial LLM call to reason about the request
3. Execute 2-3 tool calls (API requests, database queries)
4. Make another LLM call to synthesize results
5. Optionally call another agent for specialized tasks

Each step consumes tokens. Each step can fail. Traditional metrics like "average response time" hide all this complexity. You need granular visibility into each phase.
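To make per-phase visibility concrete, here is a minimal sketch of a run trace that records each step separately. The `Step` and `RunTrace` names, the step labels, and the token counts are all hypothetical illustrations, not Connic's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One phase of an agent run (hypothetical structure for illustration)."""
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    ok: bool = True


@dataclass
class RunTrace:
    """Collects the steps of a single agent run in order."""
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def total_tokens(self) -> int:
        # Sum input and output tokens across every recorded step.
        return sum(s.input_tokens + s.output_tokens for s in self.steps)

    def failed_steps(self) -> list[str]:
        # Names of the steps that did not complete successfully.
        return [s.name for s in self.steps if not s.ok]


trace = RunTrace()
trace.record(Step("rag_retrieval"))
trace.record(Step("llm_reason", input_tokens=1200, output_tokens=150))
trace.record(Step("tool_call:crm_lookup", ok=False))
trace.record(Step("llm_synthesize", input_tokens=1800, output_tokens=400))

print(trace.total_tokens())  # 3550
print(trace.failed_steps())  # ['tool_call:crm_lookup']
```

A single "average response time" number would hide the failed tool call in this run; the per-step record surfaces it.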

The Four Pillars of Agent Observability

Run Tracking

Total runs, success rates, failures. The baseline health metrics for your agent fleet.

Token Usage

Input vs output tokens, cached tokens, reasoning tokens. Understand where context is going.

Cost Attribution

Per-model pricing, volume tiers, input vs output costs. Know exactly where money goes.

Real-time Logs

Live run history, duration, status. Debug issues as they happen, not hours later.

Building Your First Dashboard

Connic creates a default dashboard when you start your first project. But the real power comes from customization. Here is how to build a dashboard tailored to your needs:

Step 1: Navigate to Observability

In your project sidebar, click Observability. You will see the default dashboard with pre-configured widgets showing total runs, success rate, token usage, and costs.

Step 2: Enter Edit Mode

Click the Edit button in the top right. This unlocks drag-and-drop arrangement and the ability to add, remove, or configure widgets.

Step 3: Add Widgets

Click Add Widget to choose from three widget types:

Stat Cards

Single metric displays. Choose from: Total Runs, Success Rate, Failed Runs, Tool Calls, Total Tokens, Input/Output Tokens, Total Cost, Input/Output Cost, Avg Cost per Run, Avg Tokens per Run.

Area Charts

Time-series visualizations. Track agent runs (completed vs failed), token usage (input vs output over time), or token cost trends.

Logs Lists

Recent activity feeds. Show agent runs or connector runs with status, duration, and direct links to detailed traces.
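For intuition about what the stat cards report, the sketch below derives a few of those metrics from a list of run records. The `Run` record and the `stat_cards` function are hypothetical; the dashboard computes these for you, and its exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A finished agent run (hypothetical record for illustration)."""
    succeeded: bool
    tokens: int
    cost_usd: float


def stat_cards(runs: list[Run]) -> dict[str, float]:
    """Aggregate run records into stat-card style metrics."""
    total = len(runs)
    if total == 0:
        # Avoid dividing by zero when there is no data yet.
        return {"total_runs": 0, "failed_runs": 0, "success_rate": 0.0,
                "avg_cost_per_run": 0.0, "avg_tokens_per_run": 0.0}
    failed = sum(1 for r in runs if not r.succeeded)
    return {
        "total_runs": total,
        "failed_runs": failed,
        "success_rate": (total - failed) / total,
        "avg_cost_per_run": sum(r.cost_usd for r in runs) / total,
        "avg_tokens_per_run": sum(r.tokens for r in runs) / total,
    }


runs = [
    Run(True, 2000, 0.01),
    Run(True, 3000, 0.02),
    Run(False, 1000, 0.0),
    Run(True, 2000, 0.01),
]
cards = stat_cards(runs)
print(cards["success_rate"])        # 0.75
print(cards["avg_tokens_per_run"])  # 2000.0
```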

Step 4: Filter by Agent

Most widgets support filtering by specific agents. Running multiple agents for different purposes? Create separate widgets to track each one, or compare them side-by-side in the same chart.

Understanding Token Economics

Token usage drives your LLM costs. But not all tokens are equal:

Input tokens: what you send to the model (system prompt, user message, RAG context)

Output tokens: what the model generates (responses, tool calls, reasoning)

Cached tokens: input tokens that hit provider caching, often 10x cheaper

Thinking tokens: reasoning tokens from models like o1 or Claude with extended thinking

Output tokens typically cost several times more than input tokens: commonly 3-5x, and more for some models. If your costs seem high, check your output token usage first. Long, verbose responses are often the culprit.
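The arithmetic behind a single run's cost can be sketched as follows. The `Pricing` class, the `run_cost` function, and the rates are placeholders chosen for illustration, and the sketch assumes cached tokens are a subset of input tokens billed at a separate, lower rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1 million tokens. The values used below are placeholders."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: float


def run_cost(p: Pricing, input_tokens: int, output_tokens: int,
             cached_tokens: int = 0) -> float:
    """Cost of one run in USD.

    Assumption: cached_tokens is the part of input_tokens that was served
    from the provider cache and is billed at the cached rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_tokens
    total = (uncached * p.input_per_m
             + cached_tokens * p.cached_input_per_m
             + output_tokens * p.output_per_m)
    return total / 1_000_000


pricing = Pricing(input_per_m=2.50, output_per_m=10.00, cached_input_per_m=0.25)
cost = run_cost(pricing, input_tokens=10_000, output_tokens=2_000,
                cached_tokens=4_000)
print(round(cost, 4))  # 0.036
```

With these placeholder rates, the 2,000 output tokens account for $0.020 of the $0.036 total, more than half the cost from a fifth as many tokens as the input.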

Setting Up Model Pricing

Token counts are useful. Dollar amounts are actionable. To convert tokens to costs, you need to configure pricing for the models your agents use.

Global Defaults

Connic includes global pricing for popular models out of the box. These appear with a "global" badge in your pricing settings. You do not need to configure anything to start tracking costs for common models like GPT-4o, Claude Sonnet, or Gemini 2.5.

Custom Model Pricing

Using a fine-tuned model? Self-hosting? Or just need different pricing than the defaults? Navigate to Settings > Observability and click Add Pricing.

Model Pattern Examples

# Exact model match
openai/gpt-4o
anthropic/claude-sonnet-4-5-20250514
gemini/gemini-2.5-pro

# Regex pattern for model families
openai/gpt-4o.*          # Matches all GPT-4o variants
anthropic/claude-.*      # All Claude models
gemini/gemini-2.*        # Gemini 2.x family

All pricing is per 1 million tokens. Project-level pricing overrides global defaults, so you can customize costs for specific use cases without affecting other projects.
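One way to picture how a model name could resolve to a price is the sketch below. It assumes patterns are treated as regular expressions that must match the whole model name, and that project-level entries are checked before global ones. The matching rules Connic actually applies are not specified here, and the tables and prices are hypothetical.

```python
import re
from typing import Optional

# pattern -> (input, output) in USD per 1M tokens; placeholder values
GLOBAL_PRICING = {
    r"openai/gpt-4o.*": (2.50, 10.00),
    r"anthropic/claude-.*": (3.00, 15.00),
}
PROJECT_PRICING = {
    r"openai/gpt-4o-mini": (0.15, 0.60),
}


def resolve_pricing(model: str) -> Optional[tuple[float, float]]:
    """Return the first matching price, checking project entries first."""
    for table in (PROJECT_PRICING, GLOBAL_PRICING):
        for pattern, price in table.items():
            # fullmatch requires the pattern to cover the entire model name.
            if re.fullmatch(pattern, model):
                return price
    return None  # no pricing configured for this model


print(resolve_pricing("openai/gpt-4o-mini"))        # (0.15, 0.6)
print(resolve_pricing("openai/gpt-4o-2024-08-06"))  # (2.5, 10.0)
print(resolve_pricing("mistral/mistral-large"))     # None
```

Note that an unescaped `.` in a regular expression matches any character, so `gemini/gemini-2.*` also matches a name such as `gemini/gemini-20`. Write `gemini/gemini-2\..*` if the literal dot matters.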

Volume-Based Pricing Tiers

Some providers offer tiered pricing for high-volume usage. Configure volume tiers to accurately track costs when your token counts exceed certain thresholds:

Base rate (0-200K tokens): $2.50 / 1M

Above 200K tokens: $2.00 / 1M

Above 1M tokens: $1.50 / 1M
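As a worked example of the tiers above, the sketch below assumes graduated pricing, where each rate applies only to the tokens that fall inside its band. Some providers instead apply one rate to the entire volume once a threshold is crossed, so check how your provider bills. `TIERS` and `tiered_cost` are hypothetical names.

```python
# (upper bound of the band in tokens, USD per 1M tokens), in ascending order
TIERS = [
    (200_000, 2.50),
    (1_000_000, 2.00),
    (float("inf"), 1.50),
]


def tiered_cost(tokens: int, tiers=TIERS) -> float:
    """Graduated cost: each band's rate applies only to tokens in that band."""
    cost = 0.0
    lower = 0
    for upper, rate in tiers:
        if tokens <= lower:
            break  # no tokens reach this band
        in_band = min(tokens, upper) - lower
        cost += in_band * rate / 1_000_000
        lower = upper
    return cost


# 200K at $2.50 + 800K at $2.00 + 500K at $1.50 (all per 1M tokens)
print(round(tiered_cost(1_500_000), 2))  # 2.85
print(round(tiered_cost(100_000), 2))    # 0.25
```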

Multi-Dashboard Workflows

One dashboard rarely fits all needs. Create multiple dashboards for different perspectives:

- Executive Overview: High-level cost and success metrics for weekly reviews
- Debugging Dashboard: Recent runs, failure rates, and logs for on-call engineers
- Cost Optimization: Token breakdowns and cost trends for budget planning
- Agent Comparison: Side-by-side metrics for A/B testing different agent configurations

Set a default dashboard that loads automatically. Configure default time ranges per dashboard: your executive overview might default to 30 days while your debugging dashboard shows the last hour.

Real-Time Monitoring

Dashboards auto-refresh every 10 seconds. The "Last updated" indicator shows you exactly how fresh the data is. For incident response, this means you can watch failures happen live without manual refreshes.

Pro Tip: Environment Isolation

Each environment (development, staging, production) has isolated observability data. Use the environment selector to switch contexts. Production dashboards stay clean even when you are running thousands of test runs in development.

Common Patterns and Anti-Patterns

DO: Track cost per agent

Different agents have different cost profiles. Your research agent might use GPT-4o while your simple FAQ bot uses Flash. Track them separately.

DON'T: Ignore success rate drops

A drop in success rate from 95% to 85% might seem minor, but the failure rate triples, from 5% to 15%. Set up alerting thresholds based on percentages, not raw counts.
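The arithmetic is easy to check with a short sketch. The function names and the alert threshold of 2x are hypothetical choices for illustration.

```python
def failure_multiplier(baseline_success: float, current_success: float) -> float:
    """How many times higher the failure rate is compared with the baseline."""
    baseline_fail = 1.0 - baseline_success
    current_fail = 1.0 - current_success
    if baseline_fail <= 0:
        raise ValueError("baseline has no failures; multiplier is undefined")
    return current_fail / baseline_fail


def should_alert(baseline_success: float, current_success: float,
                 max_multiplier: float = 2.0) -> bool:
    """Alert when failures grow by at least max_multiplier over the baseline."""
    return failure_multiplier(baseline_success, current_success) >= max_multiplier


print(round(failure_multiplier(0.95, 0.85), 2))  # 3.0
print(should_alert(0.95, 0.85))                  # True
print(should_alert(0.95, 0.94))                  # False
```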

DO: Compare input vs output ratios

Healthy agents typically have 2-5x more input than output tokens (context + RAG retrieval). An inverted ratio often indicates runaway generation or inefficient prompts.
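A minimal sketch of that check, with hypothetical function names, treating any ratio below 1.0 as inverted:

```python
def input_output_ratio(input_tokens: int, output_tokens: int) -> float:
    """Input tokens per output token; infinite if nothing was generated."""
    if output_tokens == 0:
        return float("inf")
    return input_tokens / output_tokens


def is_inverted(input_tokens: int, output_tokens: int) -> bool:
    """True when the run generated more tokens than it was sent."""
    return input_output_ratio(input_tokens, output_tokens) < 1.0


print(input_output_ratio(6_000, 2_000))  # 3.0
print(is_inverted(6_000, 2_000))         # False
print(is_inverted(500, 4_000))           # True
```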

DON'T: Rely on averages alone

Average cost per run hides outliers. One 50K token conversation can skew your daily average. Use time-series charts to spot anomalies.
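To see how much one outlier distorts the mean, the sketch below compares it with the median for ten runs. The token counts are invented for illustration, and the "5x the median" cutoff is an arbitrary choice.

```python
import statistics

# Nine typical runs and one 50K-token conversation (illustrative numbers)
tokens_per_run = [2_000] * 9 + [50_000]

mean = statistics.mean(tokens_per_run)
median = statistics.median(tokens_per_run)

# Flag runs far above the typical run
outliers = [t for t in tokens_per_run if t > 5 * median]

print(mean)      # 6800
print(median)    # 2000.0
print(outliers)  # [50000]
```

The mean suggests a typical run uses 6,800 tokens, yet nine of the ten runs used 2,000. The median and the outlier list tell the real story.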

Getting Started

Observability is available in all Connic projects. Here is how to start:

1. Deploy an agent and run a few requests to generate data
2. Navigate to Observability in your project
3. Review the default dashboard, then customize for your needs
4. Configure model pricing in Settings > Observability for accurate cost tracking

Running agents without observability is like driving without a dashboard. You might get where you are going, but you will not know if you are running out of gas until it is too late.

Check out the quickstart guide to deploy your first agent, or explore the agent documentation to learn about advanced configurations.