The data quality illusion
Why "good enough" observability breaks AI debugging
👋 Hi, I'm Thomas. Welcome to a new edition of Beyond Runtime, where I dive into the messy, fascinating world of distributed systems, debugging, AI, and system design, all through the lens of a CTO with 20+ years in the backend trenches.
QUOTE OF THE WEEK:
"Remember kids, the only difference between screwing around and science is writing it down." – Adam Savage, Mythbusters host
Everyone in the room nods when you say "garbage in, garbage out." It's the first thing developers say when you ask them about AI and data quality.
It's also, apparently, the last thing they think about when they actually wire up their observability stack to a debugging agent.
There's a reason for this gap between knowing and doing: the failure mode is invisible. Let me explain.
When you send traditional observability data to an AI debugging agent, it doesn't return an error. It doesn't say "I don't have enough context." It reasons over what it has, fills the gaps with plausible inference, and hands you back a confident, well-structured, completely believable root cause analysis. The code change looks reasonable. The PR description is coherent. Nothing obviously smells wrong.
So you ship it.
And then you spend the next two hours figuring out why production broke.
The fundamental mismatch
Here's the uncomfortable truth: observability data was never designed for AI agents. It was designed for humans staring at dashboards.
When a senior SRE is the end consumer, it makes sense to have:
Sampling. It gives you information about system behavior while respecting cost and storage constraints. For a human trying to spot patterns across millions of events, that economic compromise is reasonable. For an agent trying to reconstruct a specific failure sequence across service boundaries, it's not helpful. The agent can't reason over gaps and, crucially, it doesn't know the gaps are there.
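A toy sketch makes the point concrete (the data and the 1-in-100 keep rule are invented for illustration): head-based sampling decides a trace's fate at ingest time, before anyone knows it will contain the failure, and the surviving dataset carries no marker of what was dropped.

```python
# Toy dataset: 1,000 traces, one of which (trace 550) contains the failure.
traces = [{"trace_id": i, "error": (i == 550)} for i in range(1000)]

# Head-based sampling: keep every 100th trace, decided before the
# outcome of the request is known.
sampled = [t for t in traces if t["trace_id"] % 100 == 0]

# The failing trace was dropped, and nothing in `sampled` tells an
# agent that a gap exists at all -- it looks like a complete dataset.
failure_visible = any(t["error"] for t in sampled)
print(len(sampled), failure_visible)  # 10 False
```

A human SRE knows sampling is in play and mentally discounts the data; an agent handed `sampled` reasons over it as the whole truth.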
Aggregation. For a human, a p99 latency metric on a dashboard is more actionable than a million individual data points. But an agent debugging a specific failure needs the individual events, in sequence, with their full payloads. Averages obscure (or erase completely) the outlier that caused the bug.
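To see how aggregates erase the outlier, consider an invented workload of 999 fast requests and one pathological one. Both the mean and even the p99 look healthy; only the raw event stream still knows which request broke.

```python
# Invented numbers: 999 requests at 20 ms, one outlier at 30 seconds.
latencies_ms = [20] * 999 + [30_000]

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 49.98 ms: looks fine
p99_ms = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]  # 20 ms

# A 1-in-1000 failure hides below the p99 cut-off entirely, and the
# mean smears it into noise. The aggregate can never tell you which
# request it was or what payload it carried.
print(mean_ms, p99_ms)  # 49.98 20
```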
Siloed collection. When different teams own different layers, you often end up with frontend errors in one tool and backend traces in another. To be fair, this was never actually a good solution, but developers can manually correlate information and fill in the gaps with tribal and tacit knowledge. An agent debugging a full-stack issue needs the correlation pre-built: it can't intuitively know that the user click at 14:32:01 caused the cascade that showed up in the backend logs at 14:32:04.
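"Pre-built correlation" usually means an identifier carried in the data itself, e.g. a trace id propagated from the browser click into every backend log line. A minimal sketch, with invented records and timestamps echoing the example above:

```python
# Invented records: the same trace id is propagated from the frontend
# event into the backend log lines it caused.
frontend_events = [
    {"trace_id": "abc123", "event": "click:checkout", "ts": "14:32:01"},
]
backend_logs = [
    {"trace_id": "abc123", "msg": "payment timeout", "ts": "14:32:04"},
    {"trace_id": "zzz999", "msg": "cache miss", "ts": "14:32:04"},
]

# Join on the propagated id, not on approximate timestamps.
by_trace = {}
for record in frontend_events + backend_logs:
    by_trace.setdefault(record["trace_id"], []).append(record)

print(len(by_trace["abc123"]))  # 2: the click and the timeout it caused
```

Note that timestamp-based joining would have pulled in the unrelated cache miss at 14:32:04; the shared id is what makes the causal chain machine-readable.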
Metadata inconsistency is another gap that humans paper over without realizing it. Developers learn through osmosis that "payment-service," "svc-payments," and "payments_v2" are the same thing: it comes up in a Slack thread, gets mentioned in onboarding, lives in someone's head. An agent has none of that context. It treats them as three distinct entities.
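The fix is mundane but has to be explicit: write the tribal knowledge down as data. A sketch, with a hypothetical alias map (the names come from the example above):

```python
# Hypothetical alias map: the tribal knowledge "these are all the
# payments service," made machine-readable.
SERVICE_ALIASES = {
    "payment-service": "payments",
    "svc-payments": "payments",
    "payments_v2": "payments",
}

def canonical_service(name: str) -> str:
    """Map any historical alias to one canonical service name."""
    return SERVICE_ALIASES.get(name, name)

# Three labels collapse into one entity instead of three.
names = {canonical_service(n) for n in ["payment-service", "svc-payments", "payments_v2"]}
print(names)  # {'payments'}
```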
The mistake is assuming that an AI agent, with completely different constraints and no human intuition to fill the gaps, can just slot in as the new consumer of observability data, without anything changing underneath.
In short, things like sampling and aggregation were reasonable solutions to human constraints: cost, cognitive load, screen real estate, etc.
Agents have different constraints entirely: they need completeness over a bounded window rather than coverage over a broad one.
The witness vs. the summary
Imagine you're a detective investigating an incident. You have two options.
Option One: a single witness
You interview a witness who was present for the entire event, start to finish, uninterrupted. They describe exactly what they saw, in sequence, with specific details: what was said, what was picked up, what happened next. The account is complete. Nothing is paraphrased. Nothing is omitted because it seemed unimportant at the time.
(I know human memory is fallible, but stay with this metaphor for a second.)
Option Two: a summary from a committee
You receive a summary prepared by a committee. Each department contributed the parts they considered relevant. Legal redacted some details. Finance summarized the numbers into totals. The timestamps are approximate. Some sections reference events that happened in a different document you don't have access to. The summary is coherent, professional, and reads like it covers everything.
A human detective will look at Option Two and immediately start pulling threads. Why are these timestamps approximate? What's in the other document? Why did Legal redact this section?
An AI agent has no such instinct. It reads Option Two and reasons over it as if it were complete. It produces a confident conclusion. The conclusion is logically consistent with the summary it was given, but it's missing crucial facts.
Traditional observability data is Option Two. It was assembled by a committee of tools, each capturing what it was built to capture, sampling what it couldn't afford to store, and aggregating what it thought you'd want to see. It reads like a complete picture, but it's not.
What AI agents actually need is the witness. Unsampled. Unredacted. Present for the whole thing. With full request and response payloads, correlated across every service boundary, from the first user click to the final error state.
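One way to picture the "witness" as data: a single correlated record per request, end to end, with full payloads and every hop sharing the same trace id. The shape below is purely hypothetical, not any particular vendor's format; every field name is invented.

```python
# Hypothetical "witness" record: one request, start to finish,
# nothing sampled away, full payloads kept at every hop.
witness = {
    "trace_id": "abc123",
    "origin": {"event": "click:checkout", "ts": "14:32:01.112"},
    "hops": [
        {
            "service": "frontend",
            "request": {"path": "/checkout", "body": "<full payload>"},
            "response": {"status": 502, "body": "<full payload>"},
            "ts": "14:32:04.009",
        },
    ],
}
print(witness["trace_id"], witness["hops"][0]["response"]["status"])
```

The point isn't this particular schema; it's that completeness and correlation are properties of the record itself, not something the consumer has to reconstruct.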
Conclusion
Traditional observability tools were built to answer a fundamentally different question from the one debugging agents face. Dashboards answer: is the system healthy? AI debugging agents need to answer: exactly what happened, and exactly where in the code?
Those questions require different data. The sooner that distinction becomes a first-class engineering concern, the sooner AI debugging agents will actually deliver on what everyone hopes they can do.
📣 This newsletter is sponsored by Multiplayer.app, the debugging agent for developers.
📚 Interesting Articles & Resources
AI Tooling for Software Engineers in 2026 - Gergely Orosz and Elin Nilsson
Based on a survey of 900+ respondents (median 11–15 years of experience), the article maps the current state of AI tooling across the engineering profession. As an aside, the fact that the most experienced engineers are both the heaviest agent users and the most enthusiastic about AI isn't coincidental, in my opinion: they're the ones with enough system-level context to actually direct agents well.
How to Kill the Code Review - Ankit Jain
This is a great articulation of the very real tension between PRs created and PR review times. However, the conclusion is to "ship fast, observe everything, revert faster"… and there's almost no treatment of what happens after code ships. Who debugs it? How do you observe and understand failures when the code was written by an agent you didn't fully read?

