Observability for Multi-Agent Systems

Your multi-agent system posted a wrong review comment. 6 agents. 73 LLM calls. Which one broke it?

The System Nobody Can Debug

Here’s an architecture that’s increasingly common: a PR gets submitted, and instead of one model reviewing it, a team of specialized agents kicks in.

Six agents. Fifty to eighty LLM calls per PR. Decisions that block or approve real merges into production.

Now something goes wrong. The system approves a PR that introduces a race condition. Or it flags a false positive that wastes an engineer’s afternoon. You open your monitoring dashboard and see… total token count and average latency. Completely useless for answering the question that actually matters: which agent broke the decision, and why?

This is the observability gap in multi-agent systems. We wouldn’t run six microservices without distributed tracing. But we’re running six AI agents with zero visibility into how decisions flow between them.

The failure mode is different too. Microservices fail loudly — 500 errors, timeouts, stack traces. Agents fail quietly. They return confident, well-formatted wrong answers. By the time you notice, the damage is downstream and the causal chain is buried across dozens of LLM calls.

Here are the questions you can’t answer without purpose-built observability:

Why did the Security Agent flag a false positive on this PR but not on an identical pattern last week?
Why did the Synthesizer drop a valid finding from the Performance Agent?
Why did this PR take 4 minutes when similar PRs take 40 seconds?
Why did the system cost $2.30 for this review but $0.15 for a comparable one?

In traditional systems, you trace a request. In multi-agent systems, you trace a decision. That distinction changes everything about what you instrument and what you put on a dashboard.

The Three Pillars, Reframed

The classic observability pillars — logs, metrics, traces — still apply. But their meaning shifts when the system is making judgments instead of processing data.

Traditional	Multi-Agent Equivalent
Logs	Agent transcripts: prompts sent, completions received, tool calls made
Metrics	Token usage, latency per agent, confidence scores, hallucination rate, drop rate
Traces	Decision lineage: which agent claimed what → what downstream agent consumed it → what final output resulted

Let’s make this concrete. Here’s what a trace looks like for a real PR review:

[Figure: PR review decision trace showing agent outputs and the dropped performance finding]

This trace tells a story you can act on. The Performance Agent found something real — a cache invalidation race condition. It cited evidence from two tool calls. But the Synthesizer dropped it. Without this trace, the PR merges with a race condition nobody catches until it hits production.

Notice what’s visible here that traditional metrics would miss:

The Performance Agent was slow because one of its tools timed out, not because the model was slow
The Synthesizer made a judgment call to drop a finding — that’s not an error, it’s a decision with consequences
The Orchestrator chose to skip the Style Agent, which is fine here but might not be in other cases

Each of these is a different class of problem requiring different instrumentation.

What to Instrument

Instrumentation for multi-agent systems works in three layers. Each layer answers progressively harder questions.

Layer 1: Per-Call Telemetry

This is table stakes. Every LLM call and every tool call gets basic instrumentation:

For LLM calls:

Model name and version
Input tokens and output tokens
Latency (time to first token, time to completion)
Temperature and other generation parameters
Stop reason (max tokens, stop sequence, tool call)
Cost

For tool calls:

Tool name and arguments passed
Result size (bytes/tokens)
Latency
Success or failure
Error message if failed

This layer answers: “What happened?” It’s necessary but not sufficient. Knowing that simulate_load timed out after 2.1 seconds is useful. But it doesn’t tell you whether that timeout affected the final decision.
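As a rough illustration of Layer 1, here is a minimal sketch of a wrapper that records the tool-call fields listed above. The `ToolCallRecord` type, the `record_tool_call` helper, and the `simulate_load` signature are all hypothetical names chosen for this example, not part of any particular framework.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCallRecord:
    """One row of per-call telemetry for a tool call (hypothetical schema)."""
    tool_name: str
    arguments: dict
    latency_s: float = 0.0
    success: bool = False
    result_size_bytes: int = 0
    error: Optional[str] = None


def record_tool_call(tool_name, fn, sink, **arguments):
    """Run fn(**arguments), append a ToolCallRecord to sink, return the result.

    Failures are recorded and then re-raised so the calling agent still sees them.
    """
    rec = ToolCallRecord(tool_name=tool_name, arguments=arguments)
    start = time.perf_counter()
    try:
        result = fn(**arguments)
        rec.success = True
        rec.result_size_bytes = len(repr(result).encode("utf-8"))
        return result
    except Exception as exc:
        rec.error = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        # Latency is captured on both the success and the failure path.
        rec.latency_s = time.perf_counter() - start
        sink.append(rec)


def simulate_load(requests_per_second: int) -> dict:
    """Stand-in for a real tool; here it always times out."""
    raise TimeoutError("no response from load simulator")


records = []
try:
    record_tool_call("simulate_load", simulate_load, records, requests_per_second=500)
except TimeoutError:
    pass  # the agent decides how to proceed; the record is already in the sink
```

The record captures that the call failed and how long it took, but nothing about whether the failure changed the review outcome. That gap is what the next two layers address.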

Layer 2: Per-Agent Telemetry

This is where multi-agent observability diverges from single-agent observability. You’re not just tracking individual calls — you’re tracking agent-level behavior and output quality.

Key metrics:

Rounds/retries per agent: Is an agent looping? Retrying failed tool calls? This is your early warning for runaway cost.
Confidence scores on outputs: If your agents produce confidence estimates, track their distribution. A sudden shift means something changed upstream.
Findings produced vs. findings consumed downstream: This is the critical one.
Drop rate: What percentage of an agent’s findings get discarded by downstream agents?

agent.performance.findings_produced: 2
agent.synthesizer.findings_consumed_from_performance: 1
agent.performance.drop_rate: 50%   ← investigate if >30%

A consistently high drop rate means one of two things: the agent is producing low-quality findings (calibration problem), or the downstream consumer has a threshold that’s too aggressive (synthesis problem). Either way, you need to know.
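A drop rate like the one above can be computed directly from finding records. The sketch below assumes each finding is a dict with a `source_agent` name and a boolean `consumed` flag; that record shape is an assumption made for illustration.

```python
from collections import Counter

DROP_RATE_ALERT = 0.30  # the "investigate if >30%" threshold from the example


def drop_rates(findings):
    """Return {source_agent: fraction of its findings not consumed downstream}."""
    produced = Counter()
    dropped = Counter()
    for f in findings:
        produced[f["source_agent"]] += 1
        if not f["consumed"]:
            dropped[f["source_agent"]] += 1
    return {agent: dropped[agent] / produced[agent] for agent in produced}


findings = [
    {"source_agent": "performance", "consumed": True},
    {"source_agent": "performance", "consumed": False},
    {"source_agent": "security", "consumed": True},
]
rates = drop_rates(findings)
to_investigate = sorted(a for a, r in rates.items() if r > DROP_RATE_ALERT)
print(rates)           # {'performance': 0.5, 'security': 0.0}
print(to_investigate)  # ['performance']
```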

Layer 3: Per-Investigation Telemetry

This is the hardest layer because it requires ground truth.

End-to-end latency and cost: Straightforward to measure.
Decision accuracy: Did the system’s final call turn out to be correct? This requires a feedback signal — a human override, a production incident, a follow-up test.
Agent agreement rate: How often do multiple agents contradict each other on the same PR? Low agreement might be healthy (different perspectives) or pathological (inconsistent reasoning).
Routing accuracy: Did the Orchestrator pick the right set of agents? You can measure this retroactively: re-run skipped agents on a sample of PRs and check whether they would have produced relevant findings.

Layer 3 metrics are lagging indicators. You won’t have them in real-time. But they’re what you use to detect whether decision quality is degrading over weeks and months.
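The retroactive routing check can be reduced to a small calculation. In this sketch, each sampled review is represented as a mapping from a skipped agent to the number of relevant findings it produced when re-run after the fact. Both the data shape and the rule "any relevant finding from a skipped agent counts as a routing miss" are assumptions for illustration.

```python
def routing_accuracy(sampled_reviews):
    """Fraction of sampled reviews where no skipped agent had anything relevant to add.

    Each item maps a skipped agent's name to the count of relevant findings
    it produced when re-run. An empty mapping means no agent was skipped.
    """
    if not sampled_reviews:
        raise ValueError("need at least one sampled review")
    correct = sum(
        1 for rerun in sampled_reviews
        if all(count == 0 for count in rerun.values())
    )
    return correct / len(sampled_reviews)


sample = [
    {"style": 0},                 # skipped agent had nothing to add: routed correctly
    {"style": 0, "security": 1},  # the skipped Security Agent would have found something
    {},                           # no agent skipped
    {"style": 2},                 # the skipped Style Agent would have found something
]
accuracy = routing_accuracy(sample)  # 2 of the 4 sampled reviews were routed correctly
```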

Tracing Decisions, Not Requests

Here’s where multi-agent observability fundamentally differs from traditional distributed tracing.

In a microservice architecture, a trace follows a request through a linear (or branching) path of services. Each service transforms the data and passes it along. The trace shows you what happened.

In a multi-agent system, the trace needs to show you what was believed and why. Data doesn’t just flow forward — it gets judged at each hop. An agent doesn’t just process input and produce output. It makes claims, assigns confidence, and sometimes rejects upstream claims entirely.

A trace schema that captures this looks different from OpenTelemetry spans:
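One possible shape for such a schema, sketched as Python dataclasses. The field names `consumed_by_synthesizer`, `drop_reason`, `skip_reason`, `confidence`, and `agents_skipped` come from this article; the class names, the remaining fields, the tool-call IDs, and the Style Agent skip reason are assumptions added so the example is complete.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Finding:
    finding_id: str
    source_agent: str
    claim: str
    confidence: float                      # the agent's self-assessed certainty
    evidence_tool_calls: list = field(default_factory=list)
    consumed_by_synthesizer: bool = False  # did it reach the final output?
    drop_reason: Optional[str] = None      # the Synthesizer's rationale, if dropped


@dataclass
class SkippedAgent:
    agent: str
    skip_reason: str                       # why the Orchestrator did not invoke it


@dataclass
class DecisionTrace:
    investigation_ref: str
    findings: list = field(default_factory=list)
    agents_skipped: list = field(default_factory=list)
    final_decision: Optional[str] = None

    def dropped(self):
        """Findings that were produced but never reached the final output."""
        return [f for f in self.findings if not f.consumed_by_synthesizer]


trace = DecisionTrace(
    investigation_ref="PR-4821",
    findings=[
        Finding(
            finding_id="f-001",
            source_agent="performance",
            claim="cache invalidation race condition",
            confidence=0.71,
            evidence_tool_calls=["tc-1", "tc-2"],  # hypothetical tool-call IDs
            consumed_by_synthesizer=False,
            drop_reason="below threshold 0.75",
        )
    ],
    # The skip reason below is invented for the example.
    agents_skipped=[SkippedAgent(agent="style", skip_reason="no style-relevant changes")],
    final_decision="LGTM",
)
serialized = json.dumps(asdict(trace), indent=2)  # ready to write to a trace store
```

Unlike a span, which records that an operation ran and how long it took, each `Finding` records a claim and what happened to it downstream.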

The key fields that traditional tracing doesn’t give you:

consumed_by_synthesizer — Did this finding make it to the final output? If not, why?
drop_reason — The Synthesizer’s rationale for ignoring a finding. This is debuggable.
skip_reason — Why an agent wasn’t invoked. Routing errors hide here.
confidence — The agent’s self-assessed certainty. Track this against ground truth over time to measure calibration.

You need to be able to query this structure. “Show me all findings that were produced but dropped in the last week where the source PR later had a production incident.” That query tells you your system’s miss rate — and exactly where in the pipeline the miss happened.
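To make that query concrete, here is a sketch using Python's built-in `sqlite3` module. The `findings` and `incidents` tables, their columns, and every PR reference other than PR-4821 are hypothetical; a real trace store may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE findings (
    finding_id        TEXT PRIMARY KEY,
    investigation_ref TEXT,
    source_agent      TEXT,
    confidence        REAL,
    consumed          INTEGER,   -- 1 if it reached the final output, else 0
    drop_reason       TEXT,
    produced_at       TEXT       -- ISO date
);
CREATE TABLE incidents (
    incident_id       TEXT PRIMARY KEY,
    investigation_ref TEXT       -- the PR the incident was traced back to
);
INSERT INTO findings VALUES
  ('f-001', 'PR-4821', 'performance', 0.71, 0, 'below threshold 0.75', date('now', '-1 day')),
  ('f-002', 'PR-4821', 'security',    0.90, 1, NULL,                   date('now', '-1 day')),
  ('f-003', 'PR-4790', 'performance', 0.40, 0, 'below threshold 0.75', date('now', '-1 day')),
  ('f-004', 'PR-4102', 'security',    0.60, 0, 'below threshold 0.75', date('now', '-30 days'));
INSERT INTO incidents VALUES ('inc-1', 'PR-4821'), ('inc-2', 'PR-4102');
""")

# Findings dropped in the last week whose PR later had a production incident.
rows = conn.execute("""
    SELECT f.finding_id, f.source_agent, f.confidence, f.drop_reason
    FROM findings AS f
    WHERE f.consumed = 0
      AND f.produced_at >= date('now', '-7 days')
      AND EXISTS (SELECT 1 FROM incidents AS i
                  WHERE i.investigation_ref = f.investigation_ref)
    ORDER BY f.finding_id
""").fetchall()
```

`EXISTS` is used instead of a join so that a PR linked to several incidents does not produce duplicate rows.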

Dashboards That Actually Help

The default dashboard for most AI systems shows average latency and total token count per day. This tells you almost nothing about decision quality. It’s the equivalent of monitoring a recommendation engine by tracking HTTP response time instead of click-through rate.

Here’s what to build instead:

The Decision Funnel

[Figure: Decision funnel dashboard showing produced, reviewed, and output findings]

Where’s the biggest drop? If agents produce findings that never reach the final output, you have either a quality problem (agents producing noise) or a synthesis problem (downstream agent too aggressive in filtering). The funnel tells you which layer to investigate.

The query that produces this funnel from the trace schema we defined earlier:

[Figure: Decision funnel SQL results by source agent]
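A per-agent funnel query along these lines might look like the following sketch. It assumes a hypothetical `findings` table with a `source_agent` column and a `consumed` flag, and it covers only the first and last funnel stages (produced and reached the final output); an intermediate "reviewed" stage would need an additional column that is not reconstructed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE findings (
    finding_id   TEXT PRIMARY KEY,
    source_agent TEXT,
    consumed     INTEGER   -- 1 if the finding reached the final output, else 0
);
INSERT INTO findings VALUES
  ('f-001', 'performance', 0),
  ('f-002', 'performance', 1),
  ('f-003', 'security',    1),
  ('f-004', 'security',    1),
  ('f-005', 'security',    1);
""")

# One row per source agent, worst drop rate first.
funnel = conn.execute("""
    SELECT source_agent,
           COUNT(*)      AS produced,
           SUM(consumed) AS in_final_output,
           ROUND(1.0 - 1.0 * SUM(consumed) / COUNT(*), 2) AS drop_rate
    FROM findings
    GROUP BY source_agent
    ORDER BY drop_rate DESC, source_agent
""").fetchall()

for agent, produced, in_output, drop_rate in funnel:
    print(f"{agent:<12} produced={produced} output={in_output} drop_rate={drop_rate:.0%}")
```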

Cost Anomaly Detection

Most PRs cost $0.10–$0.30 to review. When one costs $2.30, something went wrong. The dashboard should surface outliers and let you drill into which agent consumed the excess budget.

[Figure: Cost distribution dashboard showing review cost outliers]

Drill-down on the $2.30 outlier:

Symptom	Likely Cause	Trace Field to Query
One agent 10x cost	Tool retry loop	`agent.tool_calls[].retry_count`
All agents 2x cost	Large diff bloating all prompts	`investigation.diff_size_tokens`
Normal agents + expensive orchestrator	Orchestrator invoking all agents unnecessarily	`agents_skipped = []`
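A simple way to surface such outliers is to compare each review's cost to the median. In the sketch below, the 5x multiplier, the PR identifiers, and the per-agent cost breakdown are illustrative assumptions; only the $2.30 and $0.15 figures come from the text.

```python
from statistics import median


def cost_outliers(cost_by_pr, multiple=5.0):
    """Return the PRs whose review cost exceeds `multiple` times the median cost."""
    typical = median(cost_by_pr.values())
    return {pr: cost for pr, cost in cost_by_pr.items() if cost > multiple * typical}


def costliest_agent(cost_by_agent):
    """Name of the agent that consumed the largest share of a review's cost."""
    return max(cost_by_agent, key=cost_by_agent.get)


# Hypothetical review costs in dollars.
review_costs = {"PR-A": 0.15, "PR-B": 0.12, "PR-C": 0.22, "PR-D": 0.30, "PR-E": 2.30}
outliers = cost_outliers(review_costs)  # median is 0.22, so the cutoff is 1.10

# Hypothetical per-agent breakdown for the outlier review.
breakdown = {"security": 0.10, "performance": 2.05, "synthesizer": 0.15}
suspect = costliest_agent(breakdown)
```

The median is used rather than the mean so that a single expensive review does not raise the baseline it is being compared against.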

The Feedback Loop

Observability that doesn’t feed back into the system is a spectator sport. Deciding what to change in response to the data is system improvement, a separate discipline. The observability question comes first: what signals do I capture, and how do I query them later?

Production Incidents Linked to Traces

A bug hits production. You trace it back to a specific PR. The diagnostic query:
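A sketch of what that query could look like, again with `sqlite3`. The table layout, the `claim` column, the keyword match used to decide whether a finding is related to the incident, and the reference PR-5000 are assumptions; in practice, matching a finding to an incident may need human judgment.

```python
import sqlite3


def diagnose(conn, investigation_ref, keyword):
    """Look up findings on one PR that mention the keyword, then classify the failure."""
    rows = conn.execute(
        "SELECT finding_id, source_agent, confidence, consumed, drop_reason "
        "FROM findings WHERE investigation_ref = ? AND claim LIKE ?",
        (investigation_ref, f"%{keyword}%"),
    ).fetchall()
    if not rows:
        return "B: finding never produced", rows
    if any(consumed == 0 for _, _, _, consumed, _ in rows):
        return "A: finding produced but dropped", rows
    return "finding produced and consumed", rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE findings (
    finding_id        TEXT PRIMARY KEY,
    investigation_ref TEXT,
    source_agent      TEXT,
    claim             TEXT,
    confidence        REAL,
    consumed          INTEGER,
    drop_reason       TEXT
);
INSERT INTO findings VALUES
  ('f-001', 'PR-4821', 'performance', 'cache invalidation race condition',
   0.71, 0, 'below threshold 0.75');
""")

case_a, rows_a = diagnose(conn, "PR-4821", "race condition")
case_b, rows_b = diagnose(conn, "PR-5000", "race condition")  # PR with no findings
```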

Two outcomes, two completely different failure classes:

Case A: Finding Produced But Dropped

The trace query returns a matching finding:

finding_id	source_agent	confidence	consumed	drop_reason
f-001	performance	0.71	false	below threshold 0.75

The system saw the problem, but the finding never reached the final review. The observability question is: how often do findings dropped for this reason later correlate with production incidents?

Case B: Finding Never Produced

The trace query returns no matching findings.

The system was blind to the problem. The observability questions are:

What tools did the relevant agent run?
Was the right agent invoked at all?
Was the finding missed because evidence was missing, routing was wrong, or the agent failed to inspect the right source?

Observability as a Prerequisite, Not a Feature

There’s a tendency to treat observability as something you add after the system works. With multi-agent systems, that’s backwards.

Here’s the same failure scenario, with and without observability:

WITHOUT OBSERVABILITY:

  Day 0:  PR #4821 merges. Review says "LGTM."
  Day 3:  Race condition hits production.
  Day 3:  Engineer investigates prod bug. Finds the cache code.
  Day 4:  Engineer asks: "Did the AI reviewer catch this?"
  Day 4:  Nobody knows. No record of what agents found vs. what shipped.
  Day 5:  Team loses trust in AI review. Starts ignoring it.

  Time to diagnosis: 5 days. Root cause of system failure: unknown.

─────────────────────────────────────────────────────────

WITH OBSERVABILITY:

  Day 0:  PR #4821 merges. Review says "LGTM."
  Day 3:  Race condition hits production.
  Day 3:  Engineer queries trace store:
          SELECT * FROM findings WHERE investigation_ref = 'PR-4821'
  Day 3:  Finds: Performance Agent flagged it (f-001, confidence 0.71).
          Synthesizer dropped it (threshold 0.75).
  Day 3:  Diagnosis: a valid finding was dropped for falling just below the Synthesizer’s confidence threshold.

  Time to diagnosis: 4 hours. Root cause: Synthesizer threshold too aggressive.

The core insight: microservices fail loudly; agents fail silently. A 500 error wakes someone up at 2 AM. A confidently wrong review comment ships to production and nobody notices until the bug report arrives three days later.

That makes observability not just harder to build, but more important to have. If you’re running multiple LLM calls in sequence — even just two — you already have a distributed decision system. The only question is whether you’re observing it on purpose or discovering its failures by accident.

Build the traces first. Then build the system.