
The LLM is not the bug

The scariest category of AI bug is the one where nothing throws. The output just quietly drifts. Here is how to catch it.

The scariest bugs I have seen in AI systems share one feature: nothing crashes. The code runs. The model returns. The JSON parses. The tool call succeeds. The log says "OK." And the output is subtly, confidently, invisibly wrong.

This is a different category of bug than the ones software engineers are trained to find. Traditional bugs have stack traces: you read the trace, you find the line, you fix it. AI bugs often have no trace at all, because the system did exactly what it was told to do. It is just that what it was told to do was not what you wanted.

Let me give you three real examples, sanitized. The pattern is only visible once you have seen it in the wild.

§ 01

Silent bugs I have seen in the wild

  1. An agent extracts dates from customer emails and adds them to a calendar. It works great for "meeting on Thursday at 3pm." Then a customer writes "let's move this to next Tuesday." The model, trying to be helpful, interprets "next Tuesday" as nine days from now instead of two. The email thread started the previous week, and the model is anchoring on the thread date rather than the message date. The meeting goes on the wrong day. Nobody notices until the customer fails to show up to a meeting that is, from the calendar's point of view, still in the future.
  2. An agent summarizes support tickets and tags them with a category from a list of twelve valid options. Under light load, it is 99% accurate. Under heavy load, when the prompt context is larger, the system is under pressure, and the model is being asked to do more in a single call, it occasionally invents a thirteenth category. "Billing/Other." That category does not exist in the downstream routing system, so those tickets silently fail to route. Nobody sees them for three days. By then the customer has already escalated through a different channel and your support team has a bad Monday.
  3. An agent writes SQL queries from natural language. The queries are correct. They run. They return data. What you do not notice, because you are only spot-checking, is that the agent has started aliasing one of the tables inconsistently. The JOIN is matching on the wrong column. The numbers in the weekly report are within 5% of correct, which is inside normal week-over-week variation, so nobody questions it. You find out six weeks later when a finance person runs a different query against the same data and the totals disagree by a number nobody can explain.

In all three cases, there was no error, no exception, no red light anywhere. The log said "success." The bug was in the output, the output looked plausible, and nobody had built a way to notice the difference between plausible and correct.

§ 02

Loud failures versus quiet failures

This is the fundamental asymmetry of LLM bugs. It is why you cannot ship AI systems the way you ship regular software.

Regular software fails loudly. Exceptions, 500s, crashes, things you can page on. LLMs fail quietly. The cost of catching a loud failure is low; the cost of catching a quiet failure is enormous, because the failure has already propagated into your customer's world by the time you notice. A regular-software bug costs you an hour. An LLM quiet-drift bug costs you a customer, and you find out three months after the fact when they cancel and mention it in the exit survey.

The goal of everything that follows is not to eliminate bugs. It is to move bugs from the "silent drift" category into the "loud crash" category, where your existing engineering instincts can catch them. Loud crashes you can fix. Silent drift you lose customers to.

§ 03

How to break the silence

Write assertions at the semantic layer, not just the structural layer. Structural assertions check that the output is valid JSON with the right fields. Those are cheap and everyone does them. Semantic assertions check that the values actually make sense. If your agent extracted a date, assert that the date is in the future when the context implies it should be. If it picked a category, assert that the category is in the known list instead of just passing it through. If it wrote SQL, assert that the query returns a row count within an expected range. Most quiet bugs can be caught with three or four assertions that you write once and then forget about for the rest of the project's life.
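As a concrete sketch of what the semantic layer looks like, here are the three assertions from the examples above wired into one check. The field names, the category list, and the row-count bounds are all made up for illustration; substitute your own.

```python
from datetime import date

# Hypothetical known-good category list -- "Billing/Other" is deliberately absent.
VALID_CATEGORIES = {"Billing", "Shipping", "Returns"}

def check_extraction(result: dict, today: date) -> None:
    """Semantic assertions on one agent output. Structural checks would pass
    a plausible-but-wrong output; these fail loudly on the actual values."""
    # A scheduled meeting implied to be upcoming should not land in the past.
    meeting = date.fromisoformat(result["meeting_date"])
    assert meeting >= today, f"extracted date {meeting} is in the past"

    # A picked category must come from the fixed list, not be invented.
    category = result["category"]
    assert category in VALID_CATEGORIES, f"unknown category: {category!r}"

    # A generated query's result size should fall in an expected band.
    row_count = result["row_count"]
    assert 100 <= row_count <= 100_000, f"row count {row_count} out of range"
```

The point is not these particular checks; it is that each one turns a quiet drift into an `AssertionError` you can page on.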

Sample aggressively, and have a human review the sample. Not "all outputs," which is impossible. Not "zero outputs," which is what most people do. A small, random, human-reviewed sample, every week, forever.

I review ten runs a week on every system I operate. Ten is enough to catch patterns, and small enough that I actually do it instead of promising myself I will do it later. The bugs this practice catches are almost never the ones I would have caught any other way, because they are the bugs that look right to every automated check and only become obvious when a human glances at them.

Measure drift, not just accuracy. Accuracy is "did it get this one right." Drift is "is the distribution of outputs today different from the distribution last month." If the agent used to pick "Billing" for 20% of tickets and suddenly picks it for 35%, something has changed, even if each individual decision, examined in isolation, looks defensible. Drift is the early warning signal for quiet bugs. Almost nobody measures it on their own systems, because nobody tells you to.
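A drift check can be as simple as comparing category shares against a baseline window. This is a minimal sketch using an absolute-share threshold; the 10-point threshold is an arbitrary starting value, not a recommendation.

```python
from collections import Counter

def category_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of outputs per category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def drift_alerts(baseline: list[str], current: list[str],
                 threshold: float = 0.10) -> list[str]:
    """Flag categories whose share moved more than `threshold`
    (absolute) versus the baseline window."""
    base = category_shares(baseline)
    now = category_shares(current)
    return sorted(cat for cat in set(base) | set(now)
                  if abs(now.get(cat, 0.0) - base.get(cat, 0.0)) > threshold)
```

In the "Billing" example above, a jump from 20% to 35% trips the alert even though every individual label looks defensible.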

And the most important one: build the smallest amount of structure that lets the model fail loudly instead of quietly. Every time you can replace "the model decides" with "the model picks from a fixed list," do it. Every time you can replace "the model produces a number" with "the model picks a bucket," do it. Every time you can force the output through a schema that will reject anything unexpected, do it.

The model will resist. You will feel like you are constraining its creativity. You are. That is the point. A constrained model that fails loudly is a hundred times more useful in production than an unconstrained one that fails quietly, and "creativity" is not what you were paying for in the first place.
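One way to build that structure in Python is an enum at the parsing boundary, so anything outside the fixed list is rejected before it reaches routing. The categories here are invented for illustration.

```python
from enum import Enum

class Category(str, Enum):
    """The only categories the downstream router knows about."""
    BILLING = "Billing"
    SHIPPING = "Shipping"
    RETURNS = "Returns"

def parse_category(raw: str) -> Category:
    """Force the model's output through the fixed list.
    An invented thirteenth category fails loudly here, not
    three days later in a routing queue nobody watches."""
    try:
        return Category(raw.strip())
    except ValueError:
        raise ValueError(f"model returned unknown category {raw!r}; refusing to route")
```

The same move works for buckets instead of numbers and schemas instead of free text: the boundary, not the model, decides what counts as a valid output.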

§ 04

Actionable conclusions

Audit your system for every place it returns free-form text where it could return a constrained choice, and constrain it. Budget a day; it will save you a quarter.

Pick three semantic assertions per agent run and wire them in this week. Not next sprint. This week.

Start a weekly ten-run human review ritual, with a timer. Fifteen minutes, every Friday, forever. The bugs it catches will pay for the time by an order of magnitude.

Build a drift chart for your top three output categories. Just a chart. Look at it weekly. The first time it catches something you would have missed, you will never stop looking.

When you review your own system, do not ask "does it work." Everyone's system "works" by the only metric they are measuring. Ask "how would I know if it stopped working," and be specific. If your answer is "the customer would tell me," you have not shipped a production system. You have shipped a tripwire that is slower than the customer.

§ 05

**The LLM is not the bug. The silence around the LLM is the bug. Break the silence.**