Your agent works in development. It passes your test suite. You deploy it to production. Three days later, a customer reports that the agent recommended a product that was discontinued two years ago. Your logs show 200 OK on every API call. Nothing failed. The agent just quietly produced wrong answers while every traditional monitoring metric stayed green.
This is the fundamental monitoring problem with AI agents: traditional infrastructure monitoring (uptime, latency, error rates) catches less than half of production failures. The rest are silent quality degradations that no dashboard will flag unless you build agent-specific observability.
This guide covers what to monitor, which tools do it well, and how to build a monitoring stack that catches quality failures, not just infrastructure failures.
Why traditional monitoring is not enough
Traditional application monitoring answers one question: “Is the system up and responding?” Agent monitoring needs to answer a different question: “Is the system producing correct, useful results?”
A web application that returns a 200 response with valid JSON is working correctly. An agent that returns a 200 response with valid JSON might be hallucinating, ignoring tool results, repeating itself in a loop, or producing answers based on outdated information. None of these failures trigger alerts in standard monitoring systems.
The gap between infrastructure health and output quality is where agent-specific monitoring lives.
The six metrics that matter
Across production agent systems, six metrics consistently provide the most signal about system health. Track these before adding anything else.
1. Task success rate
The percentage of tasks where the agent produces a correct, complete result. This requires defining what “correct” means for your use case and implementing automated quality checks.
How to measure: Run model-based graders on a sample of production outputs. Score each output against your quality rubric. Track the pass rate over time.
Alert threshold: Any sustained drop of 5% or more warrants investigation. A sudden drop of 10% warrants immediate response.
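One way to operationalize this is random sampling plus a pass-rate rollup. The sketch below assumes a hypothetical grading pipeline: `sample_for_grading` picks the subset of tasks to send to a model-based grader (the grader call itself is out of scope here), and `pass_rate` computes the metric you track and alert on.

```python
import random

def sample_for_grading(task_ids, rate=0.10, seed=None):
    """Randomly select a fraction of production tasks for model-based grading."""
    rng = random.Random(seed)
    return [t for t in task_ids if rng.random() < rate]

def pass_rate(grades):
    """Fraction of graded outputs that passed the quality rubric.

    grades: list of "pass" / "fail" labels returned by the grader.
    Returns None when there is nothing to grade yet.
    """
    if not grades:
        return None
    return sum(1 for g in grades if g == "pass") / len(grades)
```

Feed the daily `pass_rate` values into your dashboard and compare each day against the trailing average to catch the 5% and 10% drops described above.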
2. Step count per task
The number of model calls, tool invocations, and reasoning steps the agent takes to complete each task. This metric catches stuck loops, inefficient reasoning paths, and regression in agent capability.
How to measure: Instrument each step in the agent loop with a counter. Record the final count as structured metadata on each task.
What normal looks like: For most agent tasks, step count follows a tight distribution. A research task that normally takes 5-8 steps suddenly taking 15 is a strong signal that something changed.
Alert threshold: Step count exceeding 2x the rolling 7-day median for that task type.
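A minimal sketch of this instrumentation, assuming you already know the rolling median for the task type (the class name and hard-limit factor are illustrative, not from any specific framework):

```python
class StepBudgetExceeded(Exception):
    """Raised when an agent run blows past its hard step limit."""

class StepCounter:
    """Counts agent steps and flags runs exceeding 2x the rolling median."""

    def __init__(self, median_steps, hard_limit_factor=4):
        self.median = median_steps
        # Hard stop well beyond the alert threshold, enforced in code
        self.hard_limit = int(median_steps * hard_limit_factor)
        self.count = 0

    def record_step(self):
        self.count += 1
        if self.count > self.hard_limit:
            raise StepBudgetExceeded(
                f"{self.count} steps exceeds hard limit {self.hard_limit}"
            )

    @property
    def alert(self):
        # Alert threshold from the text: count exceeding 2x the median
        return self.count > 2 * self.median
```

Record `counter.count` as structured metadata when the task finishes, and page on `counter.alert` only when it stays elevated across many tasks.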
3. Token consumption per task
Total input and output tokens consumed for each task. This metric catches prompt bloat, context window inefficiency, and cost anomalies.
How to measure: Sum input_tokens and output_tokens from every model call within a task. Track per-task totals and daily aggregates.
Alert threshold: Per-task token consumption exceeding 1.5x the 7-day median. Daily aggregate reaching 80% of the daily budget.
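The per-task rollup is a straightforward sum over the model calls in a task. This sketch assumes each call is recorded as a dict with `input_tokens` and `output_tokens` keys (names chosen for illustration; map them to whatever your SDK returns):

```python
def task_token_totals(model_calls):
    """Sum input/output tokens across every model call within one task."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for call in model_calls:
        totals["input_tokens"] += call.get("input_tokens", 0)
        totals["output_tokens"] += call.get("output_tokens", 0)
    totals["total_tokens"] = totals["input_tokens"] + totals["output_tokens"]
    return totals

def token_alert(task_total, seven_day_median, factor=1.5):
    """Alert when per-task consumption exceeds 1.5x the 7-day median."""
    return task_total > factor * seven_day_median
```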
4. Tool call failure rate
The percentage of tool calls that fail, time out, or return invalid responses. Broken down by tool.
How to measure: Log the outcome of every tool call: success, failure (with error type), timeout, or validation failure. Calculate per-tool failure rates over rolling windows.
Alert threshold: Any tool exceeding 5% failure rate over a 1-hour window. A tool that normally has 0.1% failures suddenly hitting 3% indicates an upstream API issue.
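Per-tool failure rates can be computed from a stream of outcome events. A minimal sketch, assuming each event is a `(tool_name, outcome)` pair where any outcome other than `"success"` counts as a failure:

```python
from collections import defaultdict

def tool_failure_rates(events):
    """Compute per-tool failure rates from (tool_name, outcome) events.

    Outcomes: 'success', 'failure', 'timeout', or 'validation_failure'.
    """
    stats = defaultdict(lambda: {"total": 0, "failed": 0})
    for tool, outcome in events:
        stats[tool]["total"] += 1
        if outcome != "success":
            stats[tool]["failed"] += 1
    return {tool: s["failed"] / s["total"] for tool, s in stats.items()}

def tools_over_threshold(rates, threshold=0.05):
    """Tools whose failure rate exceeds the alert threshold (5% default)."""
    return [tool for tool, rate in rates.items() if rate > threshold]
```

Run this over a rolling 1-hour window of events and alert on the output of `tools_over_threshold`.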
5. Verification pass rate
The percentage of agent outputs that pass your automated verification checks. This is distinct from task success rate because verification runs at the step level, not the task level.
How to measure: Implement schema validation, reasonableness checks, and output quality checks at each verification point. Log pass/fail for each check.
Alert threshold: Verification pass rate dropping below 90% over a 1-hour window. Investigate immediately if it drops below 80%.
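What a step-level verification point looks like in practice depends entirely on your output schema; the checks below are hypothetical examples of the three categories (schema, reasonableness, quality), not a prescribed rubric:

```python
def verify_output(output):
    """Run step-level checks; return (passed, list of failed check names)."""
    failures = []

    # Schema check: required field present, non-empty, correct type
    if not isinstance(output.get("answer"), str) or not output["answer"].strip():
        failures.append("schema:answer")

    # Reasonableness check: confidence must be a valid probability
    conf = output.get("confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        failures.append("range:confidence")

    # Quality check (illustrative rubric): answer should cite a source
    if not output.get("sources"):
        failures.append("quality:sources")

    return (not failures, failures)
```

Log both the boolean and the failed check names; the pass rate drives the 90%/80% alerts, and the check names tell you which category is regressing.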
6. Latency percentiles
End-to-end task completion time and per-step latency, measured at p50, p95, and p99. Latency spikes indicate model API degradation, tool slowdowns, or agent logic changes.
How to measure: Timestamp the start and end of each task and each step. Calculate percentile distributions.
Alert threshold: p95 latency exceeding 2x the 7-day baseline. p99 exceeding 3x warrants investigation into specific slow paths.
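If you are not using a metrics backend that computes percentiles for you, a nearest-rank percentile over the raw timing samples is enough to get started:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples
    at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_alert(p95_now, p95_baseline, factor=2.0):
    """Alert threshold from the text: p95 exceeding 2x the 7-day baseline."""
    return p95_now > factor * p95_baseline
```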
Agent monitoring tools compared
The agent observability space has matured significantly in 2026. Here’s how the major tools compare for agent-specific monitoring.
LangSmith
LangSmith is the most mature agent monitoring platform, built by the LangChain team and tightly integrated with the LangChain and LangGraph ecosystem.
Strengths: Detailed trace visualization showing every model call, tool invocation, and intermediate state. Annotation queues for human review of flagged outputs. Dataset management for regression testing. The trace waterfall view is the best in class for debugging multi-step agent flows.
Limitations: Tight coupling with LangChain ecosystem. Pricing scales with trace volume, which can become expensive at high throughput. Self-hosting is not available; you must send data to LangSmith’s cloud.
Best for: Teams using LangChain or LangGraph who want comprehensive tracing without building custom infrastructure.
Langfuse
Langfuse is an open-source LLM observability platform that works with any framework, not just LangChain.
Strengths: Framework-agnostic integration. Self-hosting option for data-sensitive environments. Cost tracking built in. The open-source model means you can inspect and extend the codebase.
Limitations: Smaller community than LangSmith. Fewer built-in analysis tools. The self-hosted option requires infrastructure management.
Best for: Teams that need framework flexibility, want self-hosting capability, or are budget-conscious.
Helicone
Helicone provides a proxy-based approach: route your LLM API calls through Helicone and it captures traces, costs, and performance metrics automatically.
Strengths: Zero-code integration (just change the API base URL). Cost tracking across multiple model providers. Rate limiting and caching built in.
Limitations: Proxy architecture adds latency (typically 10-30ms). Limited agent-specific features; it’s primarily an LLM call monitor rather than an agent workflow monitor.
Best for: Teams that want quick, low-effort monitoring without code changes. Good as a first monitoring layer before investing in more comprehensive tools.
Arize Phoenix
Arize Phoenix is an open-source observability tool focused on evaluation and experimentation alongside monitoring.
Strengths: Strong evaluation capabilities. Embedding visualization for understanding retrieval quality. Open-source with an active community. Good integration with the evaluation workflow (monitor, identify issues, create test cases, evaluate fixes).
Limitations: More focused on evaluation than real-time alerting. Requires more setup for production alerting workflows.
Best for: Teams that want monitoring and evaluation in the same platform, especially those building RAG-based agents.
Comparison summary
| Tool | Framework Lock-in | Self-Hosting | Cost Tracking | Real-Time Alerts | Best For |
|---|---|---|---|---|---|
| LangSmith | LangChain | No | Yes | Yes | LangChain teams |
| Langfuse | None | Yes | Yes | Limited | Framework flexibility |
| Helicone | None | No | Yes | Yes | Quick setup |
| Arize Phoenix | None | Yes | Limited | Limited | Eval-focused teams |
Building your monitoring stack
No single tool covers everything. Here’s a practical architecture for production agent monitoring.
Layer 1: Infrastructure monitoring. Use your existing infrastructure monitoring (Datadog, Grafana, Prometheus) for uptime, API latency, error rates, and resource utilization. This is table stakes, not agent-specific.
Layer 2: LLM call monitoring. Use LangSmith, Langfuse, or Helicone to capture every model call with inputs, outputs, token counts, and latency. This gives you the trace data you need for debugging.
Layer 3: Agent workflow monitoring. Build custom instrumentation that tracks the six metrics above at the task level. Log step counts, tool call outcomes, verification results, and task completion status. This is the layer most teams skip and most failures hide in.
Layer 4: Quality monitoring. Run model-based graders on a sample of production outputs (5-10% is usually sufficient). Track quality scores over time. Alert on quality regressions. This catches the “quietly producing wrong answers” failure mode that nothing else detects.
Five monitoring best practices
1. Monitor quality, not just availability
An agent that’s up and responding but producing wrong answers is worse than an agent that’s down, because users trust the wrong answers. Implement quality monitoring from day one, not after the first production incident.
2. Set budgets and enforce them
Step budgets, token budgets, and time budgets should be hard limits, not suggestions. A stuck agent loop at 3 AM will consume your entire monthly API budget before anyone notices. Enforce limits in code, not just in dashboards.
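Enforcing limits in code can be as simple as a budget object the agent loop charges on every step. The limits and class name below are illustrative defaults, not recommendations for your workload:

```python
import time

class BudgetExceeded(Exception):
    """Raised when a task exhausts its step, token, or time budget."""

class TaskBudget:
    """Hard limits enforced in code, not just surfaced in dashboards."""

    def __init__(self, max_steps=25, max_tokens=200_000, max_seconds=300):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.steps = 0
        self.tokens = 0
        self.start = time.monotonic()

    def charge(self, tokens_used):
        """Call once per agent step; raises instead of letting a loop run on."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget: {self.steps} > {self.max_steps}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget: {self.tokens} > {self.max_tokens}")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded("time budget exhausted")
```

The agent loop wraps its body in `budget.charge(tokens)` so a 3 AM stuck loop dies after a bounded number of steps instead of running until the invoice arrives.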
3. Log structured data, not strings
Every agent step should produce a structured JSON log entry with: step number, step type, token counts, tool call details, verification results, and timing data. String logs (“Agent called search tool”) are useless for automated analysis. Structured logs enable alerting, dashboards, and trend analysis.
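A sketch of one such structured log emitter (the field names are one reasonable schema, not a standard):

```python
import json
import time

def log_step(step_number, step_type, *, input_tokens=0, output_tokens=0,
             tool=None, tool_outcome=None, verification_passed=None,
             duration_ms=None):
    """Emit one structured JSON log line per agent step."""
    entry = {
        "ts": time.time(),
        "step": step_number,
        "type": step_type,               # e.g. "model_call" or "tool_call"
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "tool": tool,
        "tool_outcome": tool_outcome,
        "verification_passed": verification_passed,
        "duration_ms": duration_ms,
    }
    print(json.dumps(entry))             # one JSON object per line
    return entry
```

Because each line is valid JSON with fixed keys, any log pipeline can aggregate step counts, token totals, and verification pass rates without parsing free text.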
4. Compare against baselines, not fixed thresholds
Agent behavior changes as models update, tools evolve, and prompts are refined. Fixed alert thresholds (latency > 5 seconds) quickly become stale. Compare against rolling baselines (latency > 2x the 7-day median) and your alerts stay relevant as the system evolves.
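The baseline comparison itself is a one-liner over the recent history of the metric. A minimal sketch, applicable to latency, step counts, or token totals alike:

```python
import statistics

def rolling_baseline_alert(current, history, factor=2.0):
    """Alert when the current value exceeds factor x the median of recent
    history, instead of comparing against a fixed threshold."""
    if not history:
        return False  # not enough data to judge yet
    baseline = statistics.median(history)
    return current > factor * baseline
```

Feed it the last seven days of per-task medians and the threshold tracks the system as models, tools, and prompts change.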
5. Build runbooks before you need them
Document the response procedure for each alert type before your first production incident. When task success rate drops below 85%, what’s the investigation procedure? When token consumption spikes, what’s the first diagnostic step? Runbooks written during an incident are worse than runbooks written before one.
Common monitoring mistakes
Monitoring only the happy path. If you only track successful task completions, you miss the failures entirely. Track failure modes explicitly: timeouts, verification failures, budget exhaustion, and stuck loops.
Averaging away problems. Average latency of 2 seconds sounds fine. But if p99 is 45 seconds, 1% of your users are having a terrible experience. Always use percentile metrics, not averages.
Alerting on symptoms, not causes. “High latency” is a symptom. The cause might be model API degradation, a tool timeout, or an agent taking too many steps. Structure your alerts so they point toward the cause, not just flag the symptom.
Ignoring cost as a monitoring metric. Cost anomalies are often the first signal of a problem. A sudden spike in token consumption usually means the agent is stuck in a loop, the context window is bloating, or a prompt change increased verbosity. Track cost in real time, not just in monthly invoices.
Frequently asked questions
How much does agent monitoring cost?
LangSmith pricing starts at roughly $1 per 1,000 traces for production use. Langfuse is free to self-host (infrastructure costs only). Helicone offers a free tier for low volumes. Budget 5-15% of your agent API costs for monitoring infrastructure, which typically pays for itself by catching cost anomalies and quality issues early.
Can I use my existing APM tool for agent monitoring?
Your existing APM (Datadog, New Relic, Dynatrace) handles infrastructure monitoring well but doesn’t understand agent-specific concerns. Use it as Layer 1, then add agent-specific tooling on top. Some APM vendors are adding LLM monitoring features, but they’re typically less mature than dedicated tools.
When should I add monitoring to my agent system?
Before you deploy to production. Not after. The monitoring infrastructure should be part of your initial deployment, not something you bolt on after the first incident. Start with the six core metrics and a basic alerting setup. Expand as you learn what your system needs.
How do I monitor agent quality at scale?
Sample-based evaluation. Run model-based graders on 5-10% of production outputs, selected randomly. Track pass rates over time and set alerts for regressions. Full evaluation of every output is rarely cost-effective; sampling provides sufficient signal at manageable cost.
For more on building the infrastructure around your monitoring stack, read our guide to what harness engineering is and how monitoring fits into the broader harness architecture.