Explain yourself right now!

Observability engineering is pretty simple, but these days it seems like it’s becoming an impossibility. Back in the before times, when we had to write the code ourselves, it was easier to understand the thing under construction because it was just us building it. When you are the one building the thing, the knowledge belongs to you. People mistake the internet’s knowledge for their own. In the age of agentic tomfoolery, it’s pretty easy to forget that the agent’s knowledge, like the internet’s, doesn’t belong to you either.

Back in the day it was ours, though, and the rate of added complexity correlated very strongly with the rate of hair loss in engineers. People were praised for their ability to hold a mental image of these systems and begged to explain how they hadn’t forgotten it all over the weekend. Now that so many folks are offloading that knowledge to the agents, a number of things are being lost. One of them is the craft of observability. Observability is not dashboards and alerts and overpriced software with questionable functionality; it’s taking the model of a system and pressing it into the code so that the thing can explain itself when mom and dad are gone. You have to be familiar with your own universe before you can get there, though.

So, how do we go about building these models? You build them backwards by asking your system a few questions. But what are the questions!? That’s a good question… are you ready for a numbered list!?

1. Is it working? - availability and health
2. If not, where? - localization
3. Why? - root cause analysis
4. Can I make it stop? - control surface
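
To make those four questions a little more concrete, here is a toy sketch in Python. Everything in it is made up for illustration (`Component`, `System`, and the method names are hypothetical, not from any library), and a real system would answer these questions with real telemetry rather than an in-memory list. The point is only that each question maps to something the code itself has to carry.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical unit of a system that remembers its own failures."""
    name: str
    enabled: bool = True                      # control surface: a kill switch
    errors: list = field(default_factory=list)

    def call(self, work):
        if not self.enabled:
            return None                       # switched off, do nothing
        try:
            return work()
        except Exception as exc:
            # Toy behavior: swallow the failure but keep the "why" around.
            self.errors.append(repr(exc))
            return None


class System:
    def __init__(self, components):
        self.components = {c.name: c for c in components}

    def is_working(self):
        # 1. Is it working? True only if no component has recorded an error.
        return all(not c.errors for c in self.components.values())

    def where(self):
        # 2. If not, where? Names of the components that have failed.
        return [name for name, c in self.components.items() if c.errors]

    def why(self, name):
        # 3. Why? The most recent error recorded by that component.
        return self.components[name].errors[-1]

    def stop(self, name):
        # 4. Can I make it stop? Flip the kill switch.
        self.components[name].enabled = False
```

If you can't fill in those four methods for your own system without asking the agent, that's the gap this post is about.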

Observability has three pillars: traces, metrics, and logs. On their own, each one is half of a clue. A metric yelling at you that p99 latency tripled over the past 5 minutes is cool and all, but it gives you nowhere to look. A trace with no logs is a map with no labels: you can get a general idea, but it’s not quite enough. Logs on their own provide a great level of detail, but they won’t tell you that p99 latency just tripled. The good stuff these days isn’t in having just one, it’s in having all three at once through the magic of a correlation ID.
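
Here is a minimal sketch of that threading, using only the Python standard library. It is an assumption-heavy toy: `SPANS` and `METRICS` are in-memory lists standing in for real trace and metric backends, and `handle_request`, `span`, and the `checkout` logger are hypothetical names. What it shows is one ID, minted per request, landing on the log line, the span, and the metric sample.

```python
import contextvars
import logging
import time
import uuid
from contextlib import contextmanager

# Hypothetical names throughout; the lists stand in for real backends.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

SPANS = []    # trace signal: (correlation_id, span name, duration in seconds)
METRICS = []  # metric signal: (metric name, value, correlation_id)


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s")
)
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


@contextmanager
def span(name):
    """Time a block and record it as both a span and a metric sample."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        cid = correlation_id.get()
        SPANS.append((cid, name, elapsed))
        METRICS.append((f"{name}.duration_seconds", elapsed, cid))


def handle_request(order_id):
    """Mint one ID per request so all three signals can be joined on it."""
    cid = uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        with span("charge_card"):
            logger.info("charging card for order %s", order_id)
    finally:
        correlation_id.reset(token)
    return cid
```

Calling `handle_request("o-1")` emits a log line carrying `cid=<some hex id>`, and the same ID shows up in `SPANS` and `METRICS`, so a slow metric sample can be walked back to its span and its logs.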

The pillars are the words, the correlation ID is our way of threading them together, and we’re on the hook for turning them into a story. I’ve written plenty of bad ones, omitting details and leaving folks lost, and plenty of overshares where the plot gets lost to the noise. The craft, the thing that requires us to remain close to the system, is knowing which details are load-bearing and encoding exactly those: enough that the story reads at the worst possible moment but not so much that important details drown.

You can’t offload that and you can’t write it holding only the prompt.