I've seen too many engineers waste hours (or even days) on bugs they could have solved in 20 minutes. Their mistake? Jumping straight into the code, setting random breakpoints, sprinkling console.log everywhere, and hoping to stumble upon the problem.
Spoiler: it doesn't work. Or rather, it sometimes works, but it's inefficient and frustrating.
Today, I want to share an approach that has saved me countless hours: following the chain, from the outside in.
When something breaks, most developers instinctively:
- open the code of the service they suspect,
- set breakpoints or add logs where they think the problem is,
- and poke around until something looks off.
The problem with this approach? You're starting from assumptions. You think you know where the problem is. And most of the time, you're wrong.
Result: you spend hours debugging the wrong place.
The idea is simple: before touching any code, analyze what's happening from the outside in. Start with what's farthest from the core of your system and work your way back.
Concretely, this means:
- start with the symptoms: what exactly is broken, for whom, and since when;
- follow the traces to see where in the system things go wrong;
- read the logs around the failure;
- check the metrics for the bigger picture;
- and only then open the code.
This approach has a huge advantage: each step narrows the scope of your investigation. You're no longer looking for a needle in a haystack — you know exactly where to look.
It seems obvious, but I still see people debugging without having clearly defined the problem.
Before anything else, ask yourself:
- What exactly is failing, and for whom?
- Since when? What changed around that time?
- Can you reproduce it?
If you can't clearly answer these questions, you're not ready to debug. You'll go in circles.
If your system uses distributed tracing (Jaeger, Zipkin, Tempo, Datadog APM...), this is your best weapon.
A trace shows you exactly the journey of a request through all your services. You can see which services were called, in what order, how long each step took, and where it failed.
It's incredibly powerful. In seconds, you can identify the service where the error actually originates, whether the problem is a failure or just slowness, and where the time is going.
I've solved production bugs in under 5 minutes thanks to traces, where without them I would have spent hours guessing.
Don't have tracing yet? Now's the time to set it up. Seriously. The initial investment pays off quickly.
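To give you an idea of how small that investment can be, here's a minimal sketch for a Node service using the OpenTelemetry SDK with auto-instrumentation; the exporter URL and service name are placeholders for your own setup.

```typescript
// Minimal tracing bootstrap for a Node service (sketch, not a full setup).
// The exporter URL and service name are placeholders.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "inventory-service",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
  // Auto-instruments HTTP, Express, pg, and friends without touching your code.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```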
In the meantime, you can fall back on correlation IDs in your logs (if you have them), or manually reconstruct the journey — but it's tedious.
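For illustration, that fallback can be as simple as a little Express middleware that reuses or generates a correlation ID and stamps it on every log line. This is a hypothetical sketch; the x-request-id header, the AsyncLocalStorage context, and the log helper are my assumptions, not a standard.

```typescript
// Hypothetical correlation-ID middleware for an Express service.
import express from "express";
import { randomUUID } from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";

const requestContext = new AsyncLocalStorage<{ correlationId: string }>();
const app = express();

app.use((req, res, next) => {
  // Reuse the caller's ID if it sent one, otherwise generate our own.
  const correlationId = req.header("x-request-id") ?? randomUUID();
  res.setHeader("x-request-id", correlationId);
  requestContext.run({ correlationId }, next);
});

// Every log line carries the correlation ID, so you can reconstruct a request's journey later.
function log(level: string, message: string, extra: Record<string, unknown> = {}) {
  const correlationId = requestContext.getStore()?.correlationId;
  console.log(JSON.stringify({ level, message, correlationId, ...extra }));
}

app.get("/api/orders", (_req, res) => {
  log("info", "fetching orders");
  res.json({ ok: true });
});

app.listen(3000);
```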
Logs are the daily bread of debugging. But you need to know how to use them correctly.
Don't read all the logs in your stack. It's a monumental waste of time. Use your search tools: filter by time window, by service, by severity, and by correlation or trace ID when you have one.
Logs should be read chronologically. You want to understand what happened before the error, not just the error itself.
Often, the real problem is a few lines above: a refused connection, a timeout, an unexpected value...
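To make "read around the error" concrete, here's a hypothetical helper that pulls only the lines for one trace ID in a window around the failure, in chronological order. The timestamp and traceId field names are assumptions about your log format, and the trace ID and date in the example call are made up.

```typescript
// Hypothetical helper: only the log lines for one trace, in a window around the failure.
import { readFileSync } from "node:fs";

interface LogLine {
  timestamp: string; // ISO 8601, assumed
  level: string;
  traceId?: string;
  message: string;
}

function logsAround(file: string, traceId: string, errorTime: Date, windowMs = 60_000): LogLine[] {
  return readFileSync(file, "utf8")
    .split("\n")
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line) as LogLine)
    .filter(l => l.traceId === traceId)
    .filter(l => Math.abs(new Date(l.timestamp).getTime() - errorTime.getTime()) <= windowMs)
    // Chronological order: you want what happened *before* the error, not just the error.
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}

// Example: everything this request logged in the minute around the failure.
console.log(logsAround("service.log", "9f3c2a71", new Date("2024-03-14T07:58:00Z")));
```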
Sometimes, the absence of a log is more telling than the logs themselves. If you expect to see a log and it's not there, it means the code path was never reached, the process died before it could write, or the logs are being dropped somewhere along the way.
Metrics give you an overview of your system's state. They answer different questions than logs: How many requests per second? What's the error rate? How slow are responses? How saturated are your resources?
This is the famous RED/USE framework. If you don't know it, I recommend looking it up.
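As a rough idea of what RED instrumentation looks like in code, here's a minimal sketch using prom-client; the metric names, labels, and handler are illustrative, not prescriptive.

```typescript
// Minimal RED-style instrumentation sketch with prom-client.
import { Counter, Histogram } from "prom-client";

const requestsTotal = new Counter({
  name: "http_requests_total",
  help: "Request count; Rate and Errors come from this via the status label",
  labelNames: ["route", "status"],
});

const requestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Request latency (Duration)",
  labelNames: ["route"],
});

async function handleOrder(): Promise<void> {
  const stopTimer = requestDuration.startTimer({ route: "/api/orders" });
  try {
    // ... actual handler logic ...
    requestsTotal.inc({ route: "/api/orders", status: "200" });
  } catch (err) {
    requestsTotal.inc({ route: "/api/orders", status: "500" });
    throw err;
  } finally {
    stopTimer(); // records the duration whatever happened
  }
}
```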
Overlay your metrics with deployments, config changes, external incidents... Often, the culprit becomes obvious: "Look, the error rate exploded exactly 5 minutes after the 2:32 PM deploy."
It's only after doing all of this that you should open your IDE.
At this point, you normally have: a clear symptom, the service involved, a precise time window, and one or two solid hypotheses.
Now, you can read the code with a precise objective. You're looking for something specific, not "the bug somewhere in 100k lines."
The code lets you validate (or invalidate) your hypotheses. If the logs tell you "null value received," the code tells you why that value can be null and what happens when it is.
It's tempting to "refactor while you're at it" or "fix another thing you noticed." Resist. You're there to solve a specific problem. Note other stuff for later.
Situation: a customer reports that some of their requests fail with a 500 error.
Step 1 — Symptoms: 500 errors on the /api/orders endpoint, but only for some requests.
Step 2 — Traces: I grab a trace ID from the customer's logs. The trace shows me the call failing inside the Inventory Service; everything upstream looks healthy.
Step 3 — Logs: I filter the Inventory Service logs around the time of the error:
ERROR: Connection refused to database replica-2
WARN: Falling back to primary database
ERROR: Query timeout after 30000ms
Step 4 — Metrics: I look at the Inventory Service metrics: the DB connection pool is saturated, and the saturation starts right after the 7:55 AM deployment.
Conclusion: The 7:55 AM deployment introduced a bug that doesn't properly close DB connections in certain cases. The pool gets saturated and new requests time out.
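I won't paste the real code, but the pattern behind this kind of bug usually looks like the hypothetical node-postgres example below: the client is only returned to the pool on the happy path, so every failed query slowly eats a connection.

```typescript
// Hypothetical version of the bug: the pool client leaks on every failed query.
import { Pool } from "pg";

const pool = new Pool({ max: 10 });

// Buggy: release() is never reached if the query throws, so errors slowly drain the pool.
async function getStockBuggy(productId: string) {
  const client = await pool.connect();
  const result = await client.query("SELECT stock FROM inventory WHERE id = $1", [productId]);
  client.release();
  return result.rows[0];
}

// Fixed: the client always goes back to the pool, error or not.
async function getStock(productId: string) {
  const client = await pool.connect();
  try {
    const result = await client.query("SELECT stock FROM inventory WHERE id = $1", [productId]);
    return result.rows[0];
  } finally {
    client.release();
  }
}
```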
Total time: 15 minutes. Without this methodical approach, I probably would have spent hours reading code randomly.
Debugging based on your assumptions is THE biggest mistake. You waste massive amounts of time searching in the wrong place.
"When did this start?" is a crucial question. Always correlate with deployments, config changes, load spikes...
Traces without logs give you an incomplete picture. Logs without metrics give you tunnel vision. Use everything you have.
If you can't reproduce the problem, you can't validate your solution. Invest time to find a reproducible case.
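It doesn't need to be fancy. Even a throwaway script that hammers the failing endpoint until it breaks is enough to validate a fix later. A hypothetical sketch (the URL and attempt count are made up):

```typescript
// Throwaway repro script: hit the failing endpoint until the bug shows up.
// The URL and attempt count are placeholders; adjust to whatever triggers your case.
async function reproduce(): Promise<void> {
  for (let attempt = 1; attempt <= 50; attempt++) {
    const res = await fetch("http://localhost:3000/api/orders");
    if (res.status === 500) {
      console.log(`reproduced on attempt ${attempt}`);
      return;
    }
  }
  console.log("not reproduced; tighten the conditions (payload, load, timing)");
}

reproduce();
```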
"I added a try/catch and it works now." No. You hid the problem, not solved it. Understand the root cause.
Debugging smartly is a skill that can be learned. And the key is methodology.
Start from the outside: symptoms, traces, logs, metrics. Each step narrows the scope. When you get to the code, you know exactly what you're looking for.
This approach has saved me hundreds of hours. It will do the same for you.
Next time something breaks, resist the urge to open your IDE immediately. Take a breath, open your dashboards, and follow the chain.