How to audit an AI workflow you didn't build
Someone built a thing, left, and now it's your problem. No docs, no tests, a 400-node canvas. Here is the exact sequence I use to go from 'I have no idea what this does' to 'I know exactly what's wrong.'
The message you do not want to receive at 10am on a Monday reads something like this: "Our AI system has been giving wrong answers for two weeks. The person who built it left. Can you help?"
I have received this message eight times in the last two years. The systems in question ranged from a simple Zapier workflow with four steps to a 400-node n8n canvas with no documentation, three active sub-workflows, and a prompt engineering layer that had been edited by four different people over eighteen months. Every audit starts from the same place: I have no idea what this does, and I need to figure it out before I can fix it.
The good news is that the auditing sequence is the same every time. The specific findings change, but the method does not. I call it the Five Questions—and every AI practitioner who charges money will eventually need it, because inheriting a system you did not build is not a special case. It is a normal part of this job.
Question 1: What is this thing supposed to do?
This sounds obvious. It is not. The answer is almost never in the documentation, because there usually is no documentation. The answer lives in three places: the original brief or scope document if it exists, the person who commissioned the system if they are still there, and the execution logs.
Prioritize the logs over the brief. The brief tells you what was intended. The logs tell you what is actually happening. These are often different. A system built to "qualify inbound leads" may actually be spending sixty percent of its executions on email-parsing logic that runs before the qualifying step, because someone added it three months ago and never updated the brief.
Read at least thirty execution logs before you open the canvas. Not to find bugs—to understand what inputs the system is actually receiving and what outputs it is actually producing. This step takes about an hour and prevents about four hours of wrong-direction debugging.
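A minimal sketch of this log triage, assuming you have exported executions as a JSON-lines file; the field names (`status`, `input`, `output`) are illustrative, not any particular platform's schema:

```python
import json
from collections import Counter

def triage_logs(path, limit=30):
    """Summarize the most recent executions: what fields arrive,
    what fields leave, and how often runs succeed."""
    with open(path) as f:
        runs = [json.loads(line) for line in f]
    recent = runs[-limit:]

    # Tally statuses and the field names actually seen in practice.
    statuses = Counter(r.get("status", "unknown") for r in recent)
    input_fields = Counter(k for r in recent for k in r.get("input", {}))
    output_fields = Counter(k for r in recent for k in r.get("output", {}))

    print(f"Last {len(recent)} runs: {dict(statuses)}")
    print("Input fields seen:", dict(input_fields))
    print("Output fields seen:", dict(output_fields))
    return statuses, input_fields, output_fields
```

The point is not the code; it is that a field appearing in three of thirty runs, or a status split of twenty errors to ten successes, tells you more in five minutes than the brief will.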
Question 2: Where does the data come from, and where does it go?
Draw the data map before you read a single node. Every AI system has inputs—webhooks, API polls, scheduled queries, manual triggers—and outputs—CRM writes, emails, Slack messages, database rows. The inputs and outputs are usually visible from the highest-level view of the canvas without reading any of the logic.
Identify every input and every output, and write them down. Then ask: are there any outputs happening right now that nobody intended? And are there any inputs that are no longer sending data?
The second question is the more useful one. Systems break not because someone changed the logic—though that happens—but because an upstream data source changed its format, rotated its credentials, or started sending a field as null that used to always be present. The break is not in the system; it is in the data feeding the system. If you audit the logic without auditing the inputs, you will spend half a day fixing a workflow that is not broken.
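Checking for that kind of input drift can be mechanical. A sketch, assuming you have the field list the system was originally built for and a sample of recent payloads (both names here are hypothetical):

```python
def detect_input_drift(baseline_fields, recent_payloads):
    """Compare recent input payloads against the field set the system
    was built for. Flags fields that have vanished or gone null."""
    missing = set(baseline_fields)
    went_null = set()
    for payload in recent_payloads:
        for field in baseline_fields:
            if field in payload:
                missing.discard(field)  # seen at least once
                if payload[field] is None:
                    went_null.add(field)
    return {"never_present": sorted(missing),
            "sometimes_null": sorted(went_null)}
```

Run it over the last few dozen payloads before you touch a single node. A field in `never_present` or `sometimes_null` is your prime suspect.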
Question 3: What are the LLM calls doing, and are the prompts still accurate?
This is the step that takes the longest and is skipped most often. Every call to an LLM in a workflow has a prompt. That prompt was written at a specific moment in time, for a specific version of the data it was going to receive, by a specific person with a specific mental model of what the system was supposed to do. If any of those three things have changed—and after six months of production use, all three usually have—the prompt may be producing outputs that are technically valid but contextually wrong.
I pull every prompt in the system and read it as if I am the LLM receiving it. Does this prompt accurately describe what the data looks like right now? Does it accurately describe what a good output looks like? Is there anything in the prompt clearly written for a previous version of the system—references to fields that no longer exist, examples that no longer apply, instructions that contradict how the system actually works today?
This step surfaces between forty and sixty percent of the "wrong answers" issues I am brought in to fix. The code is fine. The logic is fine. The prompt is describing a system that no longer exists.
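Part of the prompt check can be automated. A sketch, assuming the prompts use `{placeholder}` interpolation, which many workflow tools do; if yours uses a different convention, swap the pattern:

```python
import re

def stale_prompt_references(prompt, current_fields):
    """Find {placeholder} references in a prompt that no longer
    exist in the data the system actually receives."""
    referenced = set(re.findall(r"\{(\w+)\}", prompt))
    return sorted(referenced - set(current_fields))
```

This only catches the mechanical staleness, fields that no longer exist. The contextual staleness, examples and instructions written for a system that has since changed, still requires reading every prompt yourself.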
Question 4: What happens when it fails?
The answer in most systems is: nothing good. The failure behavior usually follows one of three patterns: the system errors out and notifies no one; it errors out and sends a notification to an email address nobody checks; or it does not error out at all, and instead produces an output that looks correct but is wrong in a way that is only visible if you already know what the right answer looks like.
The third pattern is the most expensive. It is also the one that produces the "two weeks of wrong answers" message.
For each failure mode you identify, document what the current behavior is and what it should be. This becomes your remediation list.
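One common remediation for the silent-failure pattern is to wrap each critical step with an explicit sanity check that alerts loudly instead of passing a suspicious output downstream. A sketch; `produce`, `validate`, and `notify` are hypothetical callbacks standing in for the workflow step, your check, and whatever alerting channel the team actually watches:

```python
def guarded_output(produce, validate, notify):
    """Run a workflow step; if it crashes or its output fails a
    sanity check, alert loudly instead of passing it downstream."""
    try:
        result = produce()
    except Exception as exc:
        notify(f"Step crashed: {exc!r}")
        raise
    if not validate(result):
        notify(f"Step produced a suspicious output: {result!r}")
        raise ValueError("output failed sanity check")
    return result
```

The validation does not need to be clever. Even "the appointment time field is present and parseable" would have turned two weeks of wrong answers into a same-day alert.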
Question 5: What is the real priority?
After running the first four questions, you will have a list of issues. Some of them are bugs. Some are design choices you would not have made but that are not actually wrong. Some are things the current owner considers problems but the original builder considered features.
Before you fix anything, have a thirty-minute call with the system's current owner and show them the list. Ask them to rank the issues by business impact, not by technical severity. A prompt that produces slightly verbose CRM notes is a low-priority problem. A prompt that is occasionally telling customers the wrong appointment time is a high-priority problem. The two look the same in the audit; only the owner can tell you which is which.
Once you have the ranked list, fix in order of business impact. Do not refactor. Do not redesign. Fix the things that are actually causing harm, document what you found and what you changed, and leave the system in better shape than you found it without rewriting it into something the owner does not recognize.
That is what a professional audit looks like. It is not glamorous. It does not require state-of-the-art tools. It requires the discipline to read before you touch, to map before you fix, and to ask the business question before the technical one.
If you are sitting in front of a system someone else built and it is not working right, this is where to start. If you want to run through it with someone who has done it eight times before, that is also what I do on consulting calls.