How to Debug Anything

Over the years, while helping hundreds of people debug their code or server architectures, I’ve noticed a lot of patterns. Sometimes the advice I give can literally be reduced to a standard laundry list of “Have you tried X?” questions, many of which aren’t even specific to the domain or problem at hand.

Here are some of those generic suggestions, along with extra details about how to apply them to general errors. Not all of these are possible in all circumstances, but some apply to almost any situation, so if one really won’t work, skip to the next. I’ve used each probably hundreds of times, though.

I’m sure most senior developers know this entire list and more implicitly. But I know developers still make these mistakes, because I’ve been in the position of “debugger of last resort” at a company. So I hope they’ll be of help to developers out there who are feeling stuck and looking for inspiration.

Edit: Updated 2019-02-20 with more strategies.

Question Your Assumptions

If the code isn’t working the way you think it should, then something about your assumptions is wrong. Look through code to ensure that all the pieces work as you expect them to. Step through the code with a debugger if you can and look for any surprises.

Part of this is to step back and determine what you’re assuming. You think a function works a particular way? Verify it. Check the docs. Check its actual behavior. Run tests in isolation if necessary to verify that it is actually working as expected.

This also applies to assumptions about how various asynchronous operations can interact with each other. Ask yourself “What if?” for absolutely anything that can go wrong in the behavior of your code.

Anything that you can come up with might shed light on the topic.
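As a small illustration of checking a function’s actual behavior instead of trusting your memory of it, here is a minimal sketch in Python (picked only for illustration; the advice itself isn’t tied to any language). The assumption under test is the common belief that round() always rounds a trailing .5 upward.

```python
# Assumption under test: "round() always rounds .5 up".
# Run the function in isolation and look at what it really returns.
samples = [0.5, 1.5, 2.5, 3.5]
observed = {x: round(x) for x in samples}

for x, result in observed.items():
    print(f"round({x}) = {result}")

# Python rounds exact ties to the nearest even integer, so the
# "always up" assumption holds for 1.5 and 3.5 but not for 0.5 or 2.5.
surprises = [x for x in samples if observed[x] != int(x) + 1]
print("assumption violated for:", surprises)
```

Two of the four inputs contradict the assumption, which is exactly the kind of surprise worth finding before blaming the rest of the code.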

Collect More Information

If you’re still not sure what’s going on, if you’re absolutely stumped, and you don’t know where to even look for the problem, start logging. Log all the things. Look at the output of those logs to see what is really going on in your code.

Those assumptions you questioned in the last point? You should be logging and/or asserting on just about every remotely relevant or related bit of data to ensure that it matches what you are assuming. You think that this variable is always an integer and defined? Assert! You think this value should only be from 1-n? Assert! If it’s too hard to log, then that’s a symptom of a deeper issue, and you should solve that underlying problem and then work on the problem at hand.
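A minimal sketch of what “Assert!” can look like in practice, in Python. The function average_score and its parameters are hypothetical, invented for this example; the point is only that each assumption (defined, integer, within 1 to n) gets its own check with a message that says what was actually seen.

```python
def average_score(scores, max_score):
    """Average a list of scores, asserting every assumption on the way in."""
    # Hypothetical example function; the names are not from any real codebase.
    assert isinstance(max_score, int) and max_score >= 1, f"bad max_score: {max_score!r}"
    assert scores, "expected at least one score"
    for position, score in enumerate(scores):
        # "Always an integer and defined"
        assert isinstance(score, int), f"scores[{position}] is {score!r}, not an int"
        # "Only from 1 to n"
        assert 1 <= score <= max_score, f"scores[{position}] = {score} is outside 1..{max_score}"
    return sum(scores) / len(scores)

print(average_score([1, 2, 3], max_score=5))
```

Note that Python drops assert statements when run with the -O flag, so this style suits debugging sessions and tests more than input validation in production.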

In particular, if you have asynchronous events interacting with your code, log the sequence of firing and receiving those events, and make sure those logs can be interleaved by time with the logs that trace other aspects of your app/server. Then you can better understand the behavior of the entire system, and when the failure you’re searching for gets logged, you can look backwards up the combined log and see what sequence of events led to the issue.
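One way to make logs “interleavable by time” is to give every record a timestamp from the same clock and merge the streams afterwards. The sketch below assumes each log is already sorted by time and stored as (timestamp, source, message) tuples; the log contents are made up for illustration.

```python
import heapq

# Two hypothetical log streams, each already in time order.
app_log = [
    (0.10, "app", "request received"),
    (0.35, "app", "render start"),
    (0.50, "app", "ERROR: widget list is empty"),
]
event_log = [
    (0.20, "events", "fired load_widgets"),
    (0.40, "events", "received load_widgets"),
]

def interleave(*logs):
    """Merge time-sorted logs into one combined, time-sorted list."""
    return list(heapq.merge(*logs, key=lambda record: record[0]))

combined = interleave(app_log, event_log)
for timestamp, source, message in combined:
    print(f"{timestamp:5.2f} [{source}] {message}")
```

Reading the combined log upward from the error shows that rendering started before the load_widgets event was received, a sequence neither log reveals on its own.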

Use Your Debugger

It’s right there in the name: de-bug-ger. It’s a tool designed to help you understand how code works and to watch exactly where and why a bug occurs. Stepping through problem code is an excellent way to Collect More Information (as above).

And by use, I mean really use. Learn your tools. Sometimes a bug might happen only the 500th time through a loop, only when the situation is exactly right, and you certainly don’t want to hit “resume” 500 times waiting for the right scenario to be present. Most debuggers today allow you to add conditions to breakpoints: Use them! If you think the bug happens with a certain configuration of flags, set the breakpoint to only fire when those flags match the troubling configuration.
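Here is a rough sketch of a conditional breakpoint written directly in code, using Python’s built-in breakpoint(). The process function, the record fields, and the on_hit parameter are all hypothetical; on_hit exists only so the example can run non-interactively with a stand-in for the debugger.

```python
def process(records, on_hit=breakpoint):
    """Sum record sizes, stopping in the debugger only for the suspect flag combination."""
    total = 0
    for record in records:
        # Fire only when the troubling configuration of flags is present,
        # instead of stopping on every pass through the loop.
        if record["retry"] and record["cached"]:
            on_hit()
        total += record["size"]
    return total

# 499 ordinary records followed by the one suspicious record.
records = [{"retry": False, "cached": True, "size": 1} for _ in range(499)]
records.append({"retry": True, "cached": True, "size": 2})

# Stand-in for the debugger so this demo does not block waiting for input.
hits = []
total = process(records, on_hit=lambda: hits.append("hit"))
print(total, len(hits))
```

Calling process(records) with the default would drop into the debugger once, on the 500th record. The same effect is available without touching the code: in pdb, a condition can follow the location, as in `break myfile.py:42, record["retry"] and record["cached"]` (the file name and line number here are placeholders).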

Enforce More Isolation

While I haven’t drunk the functional programming Kool-Aid, its advocates have one thing right: the more random side effects you have spread through your code, the harder it is to reason about. If you’re working with a pure functional language, you can skip this point, because you’re probably already doing this to the extreme.

Most languages let you modify global variables from deep within functions. If any part of the code you’re looking at does this, see if you can restructure it to be free of side effects, so that its result depends only on its inputs. You want to do this with every function you can.

So instead of trying to track down who is setting the DisableWidgets flag when they shouldn’t be, instead banish the DisableWidgets flag entirely and rewrite all the logic associated with displaying Widgets to be in one place. This is getting specific, but one pattern I’ve seen for this, in a UI context, is to use a stack of “current Widgets”, so instead of manually disabling and enabling widgets as different windows become active, they get their “active” state implicitly by being on top of the stack.
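A minimal sketch of the stack idea, in Python. The WidgetStack class and the window names are hypothetical; a real UI toolkit would have its own types. What matters is that nothing ever sets an “enabled” flag: being active is derived from position on the stack.

```python
class WidgetStack:
    """Tracks open windows; only the top one counts as active."""

    def __init__(self):
        self._stack = []

    def push(self, name):
        self._stack.append(name)

    def pop(self):
        return self._stack.pop()

    def is_active(self, name):
        # Active state is computed, never stored, so it cannot go stale.
        return bool(self._stack) and self._stack[-1] == name

windows = WidgetStack()
windows.push("main")
windows.push("settings_dialog")
print(windows.is_active("main"))             # the dialog covers it
windows.pop()
print(windows.is_active("main"))             # active again, with no flag to reset
```

Because there is no flag, there is also no code path that can forget to flip it back.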

Another approach is to use state-changing functions rather than letting any code just mutate state willy-nilly. Then at least you can trace when the states are being changed.
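As a sketch of a state-changing function, assuming Python and a hypothetical AppState class: the flag is readable everywhere, but the only way to change it is through one method that records who changed it and why.

```python
import inspect

class AppState:
    """Holds the widgets_disabled flag and a trace of every change to it."""

    def __init__(self):
        self._widgets_disabled = False
        self.history = []

    @property
    def widgets_disabled(self):
        return self._widgets_disabled

    def set_widgets_disabled(self, value, reason):
        # Record the calling function's name so the change can be traced later.
        caller = inspect.stack()[1].function
        self.history.append((caller, reason, self._widgets_disabled, value))
        self._widgets_disabled = value

# Hypothetical callers, named only for this example.
def open_dialog(state):
    state.set_widgets_disabled(True, reason="dialog opened")

def close_dialog(state):
    state.set_widgets_disabled(False, reason="dialog closed")

state = AppState()
open_dialog(state)
close_dialog(state)
for caller, reason, old, new in state.history:
    print(f"{caller}: {old} -> {new} ({reason})")
```

With every change funneled through one place, a single log line or breakpoint there covers all of them.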

Look for solutions like my examples above: Have more of it be implicit and/or managed in a central way.

Make The Bug Happen More Often

The hardest bug to fix is the one that’s intermittent. Sometimes it’s possible to tweak things so that the bug will happen more frequently. That can be a huge hint as to what’s causing it.

Your natural instinct might be to tweak things the opposite way, so that the bug happens less often. Making small changes (especially if it’s a timing-related bug) might make your code work for a bit, but if you haven’t dug down to find the underlying issue, the bug is still waiting there, and it will likely strike at the least opportune time.

So if you can engineer a situation where the bug happens every single time, then you can use that to your advantage in searching for the bug (see below) or even stepping through code with a debugger.

And yes, if you can, you should always try stepping through code with a debugger.
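To illustrate engineering a situation where a timing bug happens every time, here is a sketch in Python of a lost-update race. The Counter class and the pause hook are hypothetical, added only for this demo; the barrier forces both threads to read before either one writes, which is the unlucky interleaving that normally shows up only occasionally.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0

def unsafe_increment(counter, pause):
    current = counter.value        # read
    pause()                        # demo hook: hold here between read and write
    counter.value = current + 1    # write back a possibly stale value

def run_two_increments(pause):
    """Run two unsynchronized increments and return the final count."""
    counter = Counter()
    workers = [
        threading.Thread(target=unsafe_increment, args=(counter, pause))
        for _ in range(2)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value

# Each thread waits at the barrier until both have read the old value.
barrier = threading.Barrier(2)
forced_result = run_two_increments(barrier.wait)
print(forced_result)  # 1 rather than 2: one increment is lost on every run
```

Once the failure is this reliable, it becomes practical to step through it or to verify that a fix (a lock around the read and write, for instance) really removes it.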

Search Deterministically

Sometimes you have a huge area of code to search for a bug; you see in the output that a value is wrong, but you need to figure out where it goes wrong.

Trace the calculation of the value back in time. As you’re searching, look for ways to cut the search space in half each time, like a binary search: At any natural boundary in the code roughly in the middle of the area you’re searching, look closely at the code and see if the intermediate state at that point makes sense.
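The halving idea can be sketched as a binary search over the stages of a calculation. This Python example uses a made-up pricing pipeline with a deliberate bug, and it assumes two things: the final output is known to be wrong, and once the state goes bad it stays bad through later stages.

```python
def add_tax(prices):
    return [p * 1.2 for p in prices]

def apply_discount(prices):
    # Deliberate bug for the demo: subtracts 50 instead of halving.
    return [p - 50 for p in prices]

def round_cents(prices):
    return [round(p, 2) for p in prices]

def add_shipping(prices):
    return [p + 5 for p in prices]

def first_bad_stage(stages, initial, looks_valid):
    """Return the index of the first stage whose output fails looks_valid."""
    # Capture the intermediate state after every stage. In real debugging
    # the costly part is inspecting a state, so only those checks are halved.
    states = []
    value = initial
    for stage in stages:
        value = stage(value)
        states.append(value)

    lo, hi = 0, len(states) - 1   # assumes the final state is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_valid(states[mid]):
            lo = mid + 1          # still sane here: the bug is later
        else:
            hi = mid              # already wrong: the bug is here or earlier
    return lo

stages = [add_tax, apply_discount, round_cents, add_shipping]
culprit = first_bad_stage(
    stages, [10.0, 20.0, 30.0], lambda prices: all(p >= 0 for p in prices)
)
print(stages[culprit].__name__)
```

With four stages the search needs two checks; with a thousand it would need about ten.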

Other times you’re looking for a seemingly random crash or corruption; this is more common in lower level languages like C and C++, where memory corruption is a normal kind of failure, but it can happen in “safer” languages as well when a variable is changed from the wrong scope (this is far more likely in a dynamic language) or there is a faulty bit of logic.

Similarly, you need to cut your search space in half (or otherwise slice it up) in order to narrow down where the problem is occurring. If you can disable large chunks of source code, and the problem still occurs, then you can eliminate those blocks as potential causes. Keep eliminating different parts until your corruption or crash goes away; the part you disabled last is where to look next.
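The same elimination can be sketched as a loop, here in Python. The component names and the fails_with check are hypothetical stand-ins for “build and run with only these parts enabled”, and the sketch assumes a single culprit whose failure reproduces whenever it is enabled.

```python
def find_culprit(components, fails_with):
    """Halve the suspect list until one component remains."""
    suspects = list(components)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if fails_with(half):
            suspects = half                   # failure persists: culprit is in this half
        else:
            suspects = suspects[len(half):]   # failure gone: culprit is in the other half
    return suspects[0]

components = ["audio", "physics", "network", "ui", "save"]

# Stand-in for actually running the program with only some parts enabled.
def fails_with(enabled):
    return "network" in enabled

print(find_culprit(components, fails_with))
```

If more than one component contributes to the failure, this simple version can mislead, so treat its answer as a lead to verify, not a verdict.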

Always Initialize Your Variables

This may seem obvious, but it clearly isn’t. I once helped a team that was working on a game. They had a bug where, when you started playing the game, sometimes a certain feature would be available, and sometimes it wouldn’t.

This was in C++, so my first thought was “uninitialized variable”.[1] They had been fighting the bug for weeks, and couldn’t find it. I went into the relevant classes, found all the member variables, and just initialized all of them to zero-equivalents. Took me about 10 minutes, and after that the bug was fixed.

Moral: Sometimes technical debt can actually be the bug.

Refactor

Sometimes the code as written is just too hard to understand. If you know what it’s supposed to do, and it won’t take more than a day or two to rewrite in a cleaner manner, just do it.

I’ve occasionally found bugs because the code was so convoluted that I just couldn’t find the problem in a reasonable period of time. I’m sure I would have found it eventually, but after a short look at the tangled garbage that the previous developer had left for me, I was certain I could rewrite it a piece at a time, restructuring the code as I went, and I’d end up with something better and more maintainable when I was done.

Sure enough, took me about four hours to rewrite the code. In the process I found the bug, which became obvious as I was rewriting a crucial portion of code: The developer had made an incorrect assumption about the logic at a crucial point. But because the code was so difficult to understand, it really wasn’t obvious. I don’t blame the original developer for the mistake, though I do blame the developer for the Rube Goldberg architecture.

This can be interpreted to mean “Make it SOLID”, but it’s more than that. Typically there are dozens of ways to accomplish the same thing; think about other options and whether the code for those options would simply be cleaner to implement than what you’re looking at.

This is not so much a case of NIH as a case of cleaning up technical debt, and this time the technical debt was just hiding a bug.

Don’t Blame The Tool (Prematurely)

If you’re using a really popular tool[2], or a language that’s in the top 20 of popularity, don’t blame the tool, or the language or its standard libraries, until you’ve been able to reproduce a failure case outside of your app.

I say this as someone who has submitted patches, since accepted, that fixed actual bugs in very popular tools with a reputation for being bulletproof. The thing is, statistically speaking, the bug you’re running into is almost certainly not a bug in the tool, and it’s on you to prove otherwise. Even with 30 years of experience I can still fall into this trap.

If you think you’ve found a bug in the V8 JavaScript engine, Clang, or GCC, the odds are pretty strongly against you being correct. If you have less than 10 years of experience in software development, the odds against you are close to 100%.

A huge percentage of people will throw up their hands because of mistaken assumptions and simply blame the tool. This almost qualifies as giving up before you start.
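Reproducing a failure outside your app can be as small as a few lines. The sketch below, in Python, takes a hypothetical complaint (“the JSON library corrupts my dictionary keys”) and strips it down to a standalone script with no application code involved.

```python
import json

# Minimal standalone reproduction: no app code, only the suspected tool.
original = {1: "one", 2: "two"}
round_tripped = json.loads(json.dumps(original))

print(original)
print(round_tripped)
print(round_tripped == original)
```

The keys do come back as strings, but that is documented behavior: JSON object keys are always strings, so the encoder converts integer keys on the way out. The reproduction settles the question in a minute and points back at the assumption, not the tool.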

Don’t Give Up

Speaking of giving up, a huge cause of bugs not being fixed is simply giving up before you’ve tried hard enough to fix them. The list of tools above can get you closer to figuring out what is going on in your code. Cycle through this list trying the various suggestions long enough, and eventually the bug will get resolved. If you give up, it’s guaranteed not to be resolved. At least not by you.[3]

Being a software engineer means taking responsibility for getting things done. If you want to earn the big bucks, don’t throw up your hands just because it takes longer than you want to solve a problem.


  1. If you have unpredictable behavior between runs of a program (or between debug and release builds), and you’re using a language that doesn’t initialize all variables (or you’re using a language that relies a lot on globals, and you could be accidentally using a leaked global), this is a primary cause.
  2. More than 500 stars on GitHub, for example.
  3. This is where a lot of my consulting comes from!
