Build the dashboard before the agent
You cannot fix what you cannot see. Observability is not the last thing you bolt on. It is the first thing you build.
Here is a rule I did not believe when I first heard it, and now will not start a project without: build the dashboard before the agent. Not "alongside." Not "in the second sprint." Before. Before you write the prompt, before you wire the first tool call, before the first model token comes out of the API. The first thing you build on an AI project is the thing that lets you see what the AI is doing.
I know how this sounds. You have a clear idea in your head and you just want to prove the thing works. Spinning up a trace UI before there is anything to trace feels like building a stadium before anyone has learned the rules of the sport. It is the opposite of "move fast." It looks like yak-shaving of the highest order. Every instinct you have honed as a startup person is going to tell you to skip it and come back to it later, once the real thing is working.
Skip it, and you will pay the cost I paid.
The lesson that stuck
I have spent real stretches of time, days that bled into nights, debugging agents that "worked" most of the time. No structured trace data, no way to tell which runs were failing or why. Thousands of JSON blobs in a log file and a customer asking when it would be fixed. The honest answer was "I do not know, because I cannot see." That debugging time was more expensive than building the dashboard first would have been. (Repeat that last sentence until you fully internalize it. It is the whole argument.)
The reason people do not build the dashboard first is not effort. It is psychology. Building a dashboard before the agent feels like admitting the agent is going to have bugs, and nobody wants to admit that before they have written a single line. They want to believe, for the first forty-eight hours, that this one will be different. That this one will just work.
It will not just work. It has never just worked, not once, not for anyone. Every AI system has a bug-finding phase. The only variable is whether that phase is cheap, where you see the bug in ten seconds, or expensive, where you lose a customer before you know anything was wrong. The dashboard is the thing that decides which one you get.
What "the dashboard" actually is
Let me be specific about what "the dashboard" means. It is not Datadog. It is not console.log. It is not the default traces view in LangSmith, though all of those are better than nothing.
It is one row per agent run, logged forever. Input, output, wall-clock time, model used, token counts, and (this is the part people skip) every single tool call with its arguments and its return value, in order, nested under the parent run. If your agent calls a sub-agent, that sub-agent's full trace is nested under the parent call. If it retries, each attempt is its own entry. Nothing is summarized. Nothing is truncated. You can click any row and see exactly what the model saw and exactly what it did, byte for byte.
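One way to picture that record is as a single nested structure. The sketch below is illustrative, not a prescribed schema: the field names, the tools, and the sub-run shape are my assumptions, but it shows the properties the paragraph above insists on, every tool call in order with arguments and return values intact, and a sub-agent's trace nested under the call that spawned it.

```python
import json

# Illustrative shape of one run record. Field names and tool names
# are assumptions for the example, not a required schema.
run = {
    "id": "run_0001",
    "customer_id": "acme",
    "model": "example-model",
    "status": "success",
    "duration_ms": 4210,
    "token_counts": {"input": 1800, "output": 240},
    "input": {"task": "book a meeting room for Friday"},
    "output": {"status": "booked", "room": "4B"},
    # Every tool call, in order. Nothing summarized, nothing truncated.
    "steps": [
        {"tool": "calendar.search", "args": {"day": "friday"},
         "returned": ["4B", "6A"], "ts": "2024-05-03T11:02:07Z"},
        {"tool": "calendar.book", "args": {"room": "4B"},
         "returned": {"ok": True}, "ts": "2024-05-03T11:02:09Z",
         # A sub-agent's full trace nests under the parent call.
         "sub_run": {"id": "run_0001.1", "steps": []}},
    ],
}

print(json.dumps(run, indent=2))
```

A retry would simply be another entry in `steps`, so the record stays an honest, ordered history of what the model actually did.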
It is searchable. Not "grep through JSON" searchable, actually searchable, by customer, by date, by outcome, by tool name, by the presence of specific strings in the output. The bug you are trying to find at 11pm on a Thursday is "show me every run for this customer where the booking tool returned an empty string." If that takes you more than thirty seconds to answer, you do not have observability; you have log files.
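That 11pm question maps to one query if the tool calls are stored as JSON. A minimal sketch using SQLite's JSON functions so it runs anywhere (the post's setup is Postgres, where you would use jsonb operators instead; the table and field names are assumptions carried over from the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE runs (
        id TEXT PRIMARY KEY,
        customer_id TEXT,
        created_at TEXT,
        steps TEXT  -- JSON array of tool calls
    )
""")
conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
             ("r1", "acme", "2024-05-02",
              '[{"tool": "booking", "returned": ""}]'))
conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
             ("r2", "acme", "2024-05-03",
              '[{"tool": "booking", "returned": "confirmed"}]'))

# "Every run for this customer where the booking tool returned
# an empty string" -- json_each unnests the steps array.
rows = conn.execute("""
    SELECT r.id
    FROM runs AS r, json_each(r.steps) AS step
    WHERE r.customer_id = ?
      AND json_extract(step.value, '$.tool') = 'booking'
      AND json_extract(step.value, '$.returned') = ''
""", ("acme",)).fetchall()
```

Thirty seconds is generous: with an index on `customer_id`, this answers in milliseconds.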
And it is visible to the customer, or at least the relevant subset is. This is the part every engineer resists and every customer thanks you for. The single best thing you can do for trust with an AI system is let the customer see what it did on their behalf. Not just "success" or "failure," but the actual steps, the actual reasoning trace where appropriate, and the actual output, in a view they can open without having to email you. When something goes wrong, and it will, the customer is almost never angry that it broke. They are angry that they cannot tell what broke. Show them, and most of the anger disappears before it reaches your inbox.
This is a lot of work for a prototype
It is not. This is the part people do not believe until they build it, so I am going to spell it out and remove the excuse.
A bare-bones version of the dashboard is a Postgres table with a JSON column, a single web route that lists the last 200 rows with three filters (customer, status, date), and a detail view that pretty-prints the nested steps with syntax highlighting. That is it. You can write it in a language you already know, against a database you already have, in an afternoon. I have built it from scratch in under two hours more than once, and I am not a fast coder.
The fully featured version (alerts, per-customer views, assertion results, drift charts, a customer login so they can see their own runs) takes longer. The version that pays for itself the first time an agent misbehaves is small enough to build on a Sunday morning before your kid wakes up.
The first hour of your next AI project
On your next project, before you do anything else, stand up exactly this.
A table called runs, with columns: id, created_at, customer_id, run_type, input (JSON), output (JSON), status, duration_ms, model, token_counts, error, steps (a JSON array of tool calls with timestamps, arguments, and return values). Primary key on id, index on customer_id and created_at. Done.
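That table, sketched as DDL. SQLite is used here as a runnable stand-in for the Postgres table the post describes; in Postgres the JSON columns would be jsonb and created_at a timestamptz, but the shape is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite stand-in for the Postgres `runs` table described above.
conn.executescript("""
    CREATE TABLE runs (
        id           TEXT PRIMARY KEY,
        created_at   TEXT NOT NULL,
        customer_id  TEXT NOT NULL,
        run_type     TEXT,
        input        TEXT,     -- JSON
        output       TEXT,     -- JSON
        status       TEXT,
        duration_ms  INTEGER,
        model        TEXT,
        token_counts TEXT,     -- JSON
        error        TEXT,
        steps        TEXT      -- JSON array of tool calls
    );
    CREATE INDEX idx_runs_customer ON runs (customer_id);
    CREATE INDEX idx_runs_created  ON runs (created_at);
""")
```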
A single page at /runs that lists the last 200 rows with dropdown filters for customer, status, and date. Done.
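The query behind that page is the only mildly fiddly part: each dropdown adds one parameterized WHERE clause. A sketch (function and parameter names are my own, not from the post):

```python
def runs_query(customer=None, status=None, date=None, limit=200):
    """Build the parameterized SQL behind the /runs listing page.

    Each optional filter contributes one WHERE clause; columns
    match the runs table described above. Returns (sql, params).
    """
    clauses, params = [], []
    if customer:
        clauses.append("customer_id = ?")
        params.append(customer)
    if status:
        clauses.append("status = ?")
        params.append(status)
    if date:
        clauses.append("date(created_at) = ?")
        params.append(date)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = f"SELECT * FROM runs{where} ORDER BY created_at DESC LIMIT ?"
    params.append(limit)
    return sql, params
```

Placeholders rather than string interpolation matter even on an internal page: the moment customers can see their own runs, this query is handling untrusted input.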
A detail page at /runs/:id that pretty-prints the steps array, nested, with syntax highlighting on the JSON. Collapsible by tool call. Done.
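Stripped of the HTML, the detail view is a recursive walk over the steps array. A plain-text sketch of that walk, with a hypothetical nested trace as input (the tool names are invented for the example):

```python
import json

def render_steps(steps, depth=0):
    """Plain-text stand-in for the detail view: one line per tool
    call, with a sub-agent's trace indented under its parent call."""
    lines = []
    pad = "  " * depth
    for step in steps:
        lines.append(f"{pad}{step['tool']}({json.dumps(step.get('args', {}))})"
                     f" -> {json.dumps(step.get('returned'))}")
        if "sub_run" in step:  # recurse into the nested trace
            lines.extend(render_steps(step["sub_run"].get("steps", []),
                                      depth + 1))
    return lines

demo = [
    {"tool": "calendar.search", "args": {"day": "friday"}, "returned": ["4B"]},
    {"tool": "delegate", "args": {}, "returned": "done",
     "sub_run": {"steps": [
         {"tool": "email.send", "args": {"to": "x"}, "returned": {"ok": True}},
     ]}},
]
for line in render_steps(demo):
    print(line)
```

The web version is the same walk emitting `<details>` elements instead of indented lines, which is all "collapsible by tool call" requires.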
One helper function in your code called log_run(...) that your agent calls at the start of every run and updates at the end. Done.
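A sketch of that helper, split here into three small functions for clarity (start, step, finish) rather than the single log_run the post names; the SQLite table is a trimmed stand-in and every name is illustrative:

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# In-memory stand-in for the runs table described above (trimmed).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE runs (
    id TEXT PRIMARY KEY, created_at TEXT, customer_id TEXT,
    run_type TEXT, input TEXT, output TEXT, status TEXT,
    duration_ms INTEGER, error TEXT, steps TEXT)""")

def log_run(customer_id, run_type, input_data):
    """Insert a 'running' row at the start of a run; return its id."""
    run_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO runs (id, created_at, customer_id, run_type,"
        " input, status, steps) VALUES (?, ?, ?, ?, ?, 'running', '[]')",
        (run_id, datetime.now(timezone.utc).isoformat(),
         customer_id, run_type, json.dumps(input_data)))
    return run_id

def log_step(run_id, tool, args, returned):
    """Append one tool call, arguments and return value intact."""
    (raw,) = conn.execute("SELECT steps FROM runs WHERE id = ?",
                          (run_id,)).fetchone()
    steps = json.loads(raw)
    steps.append({"tool": tool, "args": args, "returned": returned,
                  "ts": datetime.now(timezone.utc).isoformat()})
    conn.execute("UPDATE runs SET steps = ? WHERE id = ?",
                 (json.dumps(steps), run_id))

def finish_run(run_id, output, status, duration_ms, error=None):
    """Fill in the outcome fields when the run ends."""
    conn.execute(
        "UPDATE runs SET output = ?, status = ?, duration_ms = ?,"
        " error = ? WHERE id = ?",
        (json.dumps(output), status, duration_ms, error, run_id))

# The agent's whole obligation: one call at the start, one per tool
# call, one at the end.
rid = log_run("acme", "booking", {"task": "book room"})
log_step(rid, "calendar.book", {"room": "4B"}, {"ok": True})
finish_run(rid, {"room": "4B"}, "success", 1200)
```

Writing the 'running' row first, rather than logging only at the end, is the design choice that matters: a run that crashes mid-flight still leaves a row behind, which is exactly the run you will want to find.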
That is the whole thing. It is an afternoon. Everything you build after it is faster because of it, and the first time the agent misbehaves you will not have to write a single grep command to find out why.
Then, and only then, write the agent.
You will save yourself the debugging time and the customer emails I have gotten. You will also skip the quiet shame of realizing that the part of the system you should have built first was the part you had been treating as optional.
So: what does your current trace view look like? If the answer is "I grep the logs," you are not observing. You are hoping, and hope is not a strategy. Build the dashboard first.