Where AI coding tools systematically fail
They can dramatically accelerate routine development, but they’re not silver bullets
👋 Hi, I’m Thomas. Welcome to a new edition of Beyond Runtime, where I dive into the messy, fascinating world of distributed systems, debugging, AI, and system design. All through the lens of a CTO with 20+ years in the backend trenches.
QUOTE OF THE WEEK:
“Of all my programming bugs, 80 percent are syntax errors. Of the remaining 20 percent, 80 percent are trivial logical errors. Of the remaining 4 percent, 80 percent are pointer errors. And the remaining 0.8 percent are hard.” - Marc Donner
In my previous blog post I reviewed the four AI tools my team and I investigated while developing our own AI product roadmap for Multiplayer.
My main takeaway: these assistants can dramatically accelerate routine development, but they’re not silver bullets. Their weaknesses become most apparent when projects transition from static code to real-time runtime behavior and architectural decision-making.
Here’s a deep dive into their current limitations:
1. Limited runtime visibility
The problem: AI assistants reason primarily from static code. They lack native runtime observability.
Some can integrate with CLI tools (gcloud, aws) to fetch logs or system state, but these are workarounds rather than built-in runtime awareness. When a bug emerges only under specific input conditions or concurrency states, the AI often proposes generic fixes rather than identifying root causes.
Runtime context includes:
Frontend behavior: User actions, UI state, browser console logs
Network layer: Request/response pairs, headers, timing
Backend execution: Distributed traces, service call graphs, timing
Data layer: Database queries, cache hits/misses, transaction logs
Infrastructure: CPU/memory usage, container restarts, queue depths
Without runtime context:
AI sees: Function X in service Y
AI suggests: "Add null checks" or "Increase timeout values"
Result: Generic advice that might not address the root cause
With runtime context:
AI sees:
- Frontend: User clicked "Submit Payment" at 14:23:45
- Backend: Service A called Service B at 14:23:46
- Service B: Connection pool exhausted, request queued
- Service C: Timeout triggered after 5s, returned 503
- Database: No transaction committed
- User experience: Payment appeared to succeed but order wasn't created
AI suggests: "Service B's connection pool (size: 10) is undersized for
current load (30 req/s). Either increase pool size or implement request
throttling in Service A. The 5s timeout in Service C is too short given
the queuing delay—consider circuit breaker pattern instead."Reality check: AI tools can’t replace APM, distributed tracing, or proper logging infrastructure. However, collecting this data across layers is tedious even with proper instrumentation. That’s why I’m focusing on full stack session recordings to automatically correlate data across layers and ensure engineers have access to all relevant signals in one place.
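To make that concrete, here’s a minimal TypeScript sketch of what a correlated runtime-context bundle could look like when handed to an assistant. The type and field names are hypothetical (this isn’t any particular tool’s API); the point is that structure like this is what lets an AI move from “add a null check” to “your pool is undersized.”

// Hypothetical shape for a correlated runtime-context bundle.
// All type and field names are illustrative, not a real product API.
interface RuntimeContext {
  frontend: { action: string; timestamp: string; consoleErrors: string[] };
  network: { method: string; url: string; status: number; durationMs: number }[];
  backend: { service: string; span: string; durationMs: number; error?: string }[];
  data: { query: string; committed: boolean }[];
  infra: { service: string; cpuPct: number; memMb: number; queueDepth: number }[];
}

// The payment example above, expressed as structured data an assistant
// could actually reason over instead of guessing from the code alone.
const failedCheckout: RuntimeContext = {
  frontend: {
    action: "Clicked 'Submit Payment'",
    timestamp: "2025-01-15T14:23:45Z",
    consoleErrors: [],
  },
  network: [
    { method: "POST", url: "/api/payments", status: 503, durationMs: 5012 },
  ],
  backend: [
    { service: "service-a", span: "create-payment", durationMs: 5010 },
    { service: "service-b", span: "charge-card", durationMs: 4800, error: "connection pool exhausted" },
    { service: "service-c", span: "create-order", durationMs: 5001, error: "timeout after 5s" },
  ],
  data: [{ query: "INSERT INTO orders ...", committed: false }],
  infra: [{ service: "service-b", cpuPct: 35, memMb: 512, queueDepth: 42 }],
};

console.log(JSON.stringify(failedCheckout, null, 2));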
2. Hallucinations and plausible-but-broken code
Even sophisticated assistants generate code that compiles but fails in practice. Modern agents mitigate this by running tests or executing code locally, but hallucinations remain common enough to make human review essential.
The pattern: AI excels at syntactic correctness but can miss semantic correctness, security implications, or edge cases. Treat AI output like a junior teammate’s contribution: review, test, validate.
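Here’s a hypothetical example of what plausible-but-broken looks like in practice. The bug pattern is real and common; the code is my illustration, not output from any specific tool.

// Compiles cleanly and "works" in a quick manual test, but the semantics are wrong.
async function notifyUsers(userIds: string[], send: (id: string) => Promise<void>) {
  // map() produces an array of pending promises that nobody awaits, so the
  // function returns before any notification is actually sent, and failures
  // surface as unhandled rejections instead of errors the caller can handle.
  userIds.map(async (id) => {
    await send(id);
  });
}

async function notifyUsersFixed(userIds: string[], send: (id: string) => Promise<void>) {
  // Await every send and propagate the first failure to the caller.
  await Promise.all(userIds.map((id) => send(id)));
}

Both versions pass the type checker; only the second one does what the name promises. That gap between syntactic and semantic correctness is exactly where review and tests earn their keep.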
3. Narrow debugging context
When debugging, experienced engineers triangulate across logs, traces, metrics, database state, and git history. AI tools traditionally see only the immediate code snippet or file, missing correlations across services or layers.
This is changing: Some assistants now integrate with observability tools and runtime data (via APIs or protocols like MCP), but the gap remains significant. They’re still weak at diagnosing complex distributed systems issues or performance-related bugs.
Example: A microservice experiencing intermittent timeouts under load. The AI might suggest adding retries or increasing timeout values. That’s generic advice that misses the actual problem: a downstream service’s connection pool exhaustion triggering cascading failures.
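For that scenario, here’s a bare-bones sketch of the kind of fix the runtime data actually points to: fail fast while the downstream pool recovers instead of piling on retries. This is a dependency-free illustration, not production-ready code (no jitter, no metrics, no shared state across instances).

// Minimal circuit breaker: after N consecutive failures, reject calls
// immediately for a cool-down period so the downstream service can drain.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,    // consecutive failures before opening
    private readonly resetTimeoutMs = 10_000  // how long to stay open before probing again
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.failureThreshold) {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        // Fail fast: the downstream service gets breathing room instead of more load.
        throw new Error("circuit open: downstream unavailable");
      }
      // Half-open: let a single probe request through.
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap calls to the flaky downstream service.
// const breaker = new CircuitBreaker();
// const order = await breaker.call(() => callServiceB(payload));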
4. Performance blind spots
AI-generated code is rarely optimized for efficiency. These tools cannot evaluate:
Memory usage patterns
Concurrency behavior
Scaling bottlenecks
Cache hit rates
Database query efficiency
Reality: AI can help implement optimizations once you’ve identified them, but it won’t discover them proactively.
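A hedged example of the query-efficiency point: both functions below return the same data and both would pass a glance-level review, but only profiling (or query counts pulled from runtime data) reveals that the first one scales linearly in database round trips. The db interface is a hypothetical stand-in for whatever client you use.

interface Db {
  query<T>(sql: string, params?: unknown[]): Promise<T[]>;
}

interface Order { id: string; userId: string }
interface User { id: string; email: string }

// Correct and readable... and issues two queries per order (the classic N+1 pattern).
async function emailsForOrdersNaive(db: Db, orderIds: string[]): Promise<string[]> {
  const emails: string[] = [];
  for (const id of orderIds) {
    const [order] = await db.query<Order>("SELECT * FROM orders WHERE id = $1", [id]);
    const [user] = await db.query<User>("SELECT * FROM users WHERE id = $1", [order.userId]);
    emails.push(user.email);
  }
  return emails;
}

// Same result in two round trips, no matter how many orders there are.
async function emailsForOrdersBatched(db: Db, orderIds: string[]): Promise<string[]> {
  const orders = await db.query<Order>("SELECT * FROM orders WHERE id = ANY($1)", [orderIds]);
  const users = await db.query<User>("SELECT * FROM users WHERE id = ANY($1)", [orders.map((o) => o.userId)]);
  const emailById = new Map<string, string>();
  for (const u of users) emailById.set(u.id, u.email);
  return orders.map((o) => emailById.get(o.userId) ?? "");
}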
5. Weak at system design and architecture
AI tools are strong within existing code structures. With integrations (e.g., via MCP), they can ingest Jira tickets, PRDs, and design docs to ground suggestions. But they lack the tacit context and judgment that architects apply:
SLO/latency and cost budgets
Compliance nuances (GDPR, HIPAA, SOC2)
Vendor/infrastructure constraints
Team topology (Conway’s Law isn’t in the training data)
Long-term maintainability (technical debt tradeoffs, migration paths, deprecation strategies)
The limitation: These tools can accelerate coding and assist with design exploration, but high-stakes architectural decisions still require human judgment informed by organizational context, budget realities, and strategic direction.
Final thoughts
AI coding tools represent a fundamental shift in how we develop software, but they’re not a replacement for experienced engineers. They still struggle with:
Runtime behavior and distributed systems debugging
Architectural tradeoffs requiring organizational context
Security nuances and edge cases
Long-term maintainability decisions
The most effective pattern I’ve seen: senior engineers using AI to handle the mundane, freeing cognitive bandwidth for the problems that actually require human judgment: system design, security review, performance optimization, and strategic technical decisions.
The tools are here. The question, in my opinion, isn’t whether to use them; it’s how to integrate them thoughtfully into your workflow.
💜 This newsletter is sponsored by Multiplayer.app.
Full stack session recording. End-to-end visibility in a single click.
For a deep dive on this topic, check out my original article here:
You’ll find more information about:
Summary of top AI coding tools
GitHub Copilot
Cursor
Claude Code
OpenAI Codex
Where AI coding tools fall short
Practical tips for engineers working with AI coding tools
Adding runtime context to enhance AI workflows
📚 Interesting Articles & Resources
Cursor vs Claude Code: Why Top Developers Are Using Both - Eric Roby
This article argues (with reason) that professional devs aren’t choosing one AI coding tool: they use Cursor for fast in-IDE coding and Claude Code for deeper, terminal-based tasks. What combination of AI tools are you using?
What Makes an AI Agent Different From ChatGPT? - Neo Kim
Agents break goals into steps, use tools, act on the world, adapt when things fail, and follow a reason-act cycle, making them more autonomous than traditional LLMs. This is a good overview of the difference between AI agents and chatbots.