production-engineering

The Problem With Chasing GPU Utilization

Walk into any AI infrastructure discussion and you’ll hear the same question: What’s your GPU utilization? It’s become the infrastructure equivalent of asking a web service for its CPU utilization. The assumption is simple: higher utilization is better. After all, GPUs are expensive, and a cluster running at 90% utilization sounds far more impressive than one running at 50%. For a long time, I believed that too. Then I spent more time working on GPU scheduling and multi-tenant AI workloads....

Retries, Timeouts, and Idempotency: The Trio That Defines Production Reliability

Distributed systems rarely fail in clean, obvious ways. They degrade. They stall. They partially succeed. They retry half a request, lose the response, and leave you wondering whether the operation happened once, twice, or not at all. In production, reliability is rarely about whether the code works on the happy path. It is about how the system behaves when dependencies are slow, networks are unreliable, and clients do not get a clear answer....

Building Boring, Reliable Go Services in Production

The software industry has a habit of celebrating novelty. New frameworks, new abstractions, new patterns, and new promises of developer productivity show up every few months. Production systems, however, rarely fail because they were not modern enough. They fail because they were difficult to reason about, fragile under stress, and painful to operate. Over time, I have become much less interested in clever backend services and much more interested in boring ones....