The story of a request

Observability is taking the model in your head and pressing it into the code so the thing can explain itself.

I know that’s catchy but I’m going to now put in a level of effort to make it concrete starting with the most boring system on Earth: a REST endpoint. And not just any endpoint, but a contrived endpoint!

A REST endpoint has exactly one protagonist, the request, and it’s respectful enough to live and die in front of you, synchronously, in a short amount of time. It’s a great place to learn the moves because it’s familiar and it’s right out in the open.

tl;dr: Identify the call, the caller, the parameters, the actions and resources involved, and the result. Do so with a strict sense of security and compliance in mind and you’ll be well on your way to building a strong observability posture.

For all forms of request, a correlation ID is going to be critical in building the full story of what’s happening. A modern observability stack will do that on your behalf but for the sake of the article, we’re going to make it an explicit action. The first thing that should happen with any request is to identify it by attaching a correlation ID that can be stamped onto every other artifact we produce for the sake of observation and operation. This will allow us to correlate (see what I did there, that’s actually not a good thing to do in writing) all of those artifacts to produce the story of a single request from start to finish.

After we have identified the request, we have to identify the caller. This is particularly important in multi-tenant systems but generally speaking, it’s important to be able to identify who is making the request or who the request is being made for. There are auth/auth implications here as well but we’re going to just keep this scoped to the story of observability rather than platform security.

Next up is exposing the request parameters in a safe way. A REST request is being made to perform some kind of an action on some kind of resources, it’s the plot of the story so to speak, and the parameters used in the call are the first act. This is where a good number of bugs are going to be resolved, determining not only if the system is operating as expected but that it’s being used as designed. Plenty of system-related issues in my life have been resolved by demonstrating errors to be poorly formed requests simply by correlating customer calls with logs showing errors in parameter validation.

The second act of this story is the combination of action and resource - understanding what this caller is asking the system to do. At this point in a request, we’re going to be focusing on providing the details of service internals: business logic, computation, internal system I/O, etc… At this point, when things go bump, it’s usually due to bugs in our code so be detailed. If there is going to be extra effort given to describing a system, let it be here, where we make the black box more of a grey. Within the confines of the external boundaries of a system are the often opaque boundaries of an internal system that are necessary for us as engineers to instrument and expose. This is also where it’s common to see a number of data leaks, specifically PHI/PII; security and compliance must always be considered when building any kind of a system.

The final piece here will be the results. Not just the response object we create, but understanding the effects that this request had on the state of the system or data. This is going to combine emitting the result of any I/O that happens (DB changes, files stored, etc…) and what we tell the caller the result of those actions was. We can pretty easily find ourselves responding with 200s but failing to observe the expected state change because we smothered an exception somewhere in the call stack. Here be the land of the sneaky bugs so spare no expense.

Something I’d like to quickly point out: All of these things overlap with each other. There are no distinct or hard boundaries anywhere and that’s an important concept here. We aren’t writing individual sentences for the sake of it, we’re writing a comprehensive story and that story is:

identity → source → inputs → work → outcome

A REST APIs are a fairly simple scenario here since they’re pretty linear and synchronous. I’ll be writing about some of the more complex topics in the space soon. For now, some code:

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from pydantic import BaseModel
import structlog

app = FastAPI()
# server span + context propagation = free correlation
FastAPIInstrumentor.instrument_app(app)
tracer = trace.get_tracer(__name__)


class TransferRequest(BaseModel):
    account_id: str
    amount_cents: int
    # free-form text → likely PII. never log the contents.
    note: str | None = None


@app.post("/transfers")
async def create_transfer(body: TransferRequest, request: Request):
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, "032x")

    # 1. identity + 2. source
    # the correlation ID, and who it's for, on every artifact
    log = structlog.get_logger().bind(
        trace_id=trace_id,
        tenant_id=request.headers.get("x-tenant-id"),
        actor_id=request.headers.get("x-actor-id"),
    )
    span.set_attribute("enduser.tenant", request.headers.get("x-tenant-id", "unknown"))

    # 3. inputs
    # expose the params
    # redact the payload (note → shape, not contents)
    span.set_attributes({
        "transfer.account_id": body.account_id,
        "transfer.amount_cents": body.amount_cents,
        "transfer.has_note": body.note is not None,
    })
    log.info("transfer.received")

    # 4. work
    # make the middle legible
    # the computation gets its own span and its own story
    with tracer.start_as_current_span("ledger.post") as work:
        result = await ledger.post(body.account_id, body.amount_cents)
        work.set_attribute("ledger.rows_written", result.rows)
        work.set_attribute("ledger.balance_after_cents", result.balance)

    # 5. outcome
    # not just the response
    # but whether the ending is coherent with the setup
    span.set_attribute("transfer.committed", result.rows == 1)
    if result.rows != 1: # a 200 that didn't do the thing
        log.error("transfer.no_state_change", rows=result.rows)

    log.info("transfer.done", http_status=200)
    return {"id": result.id, "balance_cents": result.balance}