Over the years I’ve noticed a lot of patterns when I go to help people debug their code or server architectures. And I’ve helped hundreds of people debug their code. Sometimes the advice I give can literally be reduced to a standard laundry list of “Have you tried X?” questions, many of which aren’t even specific to the domain or problem at hand.
Here are some of those generic suggestions, along with extra details about how to apply them to general errors. Not all of these are possible in all circumstances, but some apply to almost any situation, so if one really won’t work, skip to the next. I’ve used each probably hundreds of times, though.
I’m sure most senior developers know this entire list and more implicitly. But I know developers still make these mistakes, because I’ve been in the position of “debugger of last resort” at a company. So I hope they’ll be of help to developers out there who are feeling stuck and looking for inspiration.
Question Your Assumptions
If the code isn’t working the way you think it should, then something about your assumptions is wrong. Look through code to ensure that all the pieces work as you expect them to. Step through the code with a debugger if you can and look for any surprises.
Part of this is to step back and determine what you’re assuming. You think a function works a particular way? Verify it. Check the docs. Check its actual behavior. Run tests in isolation if necessary to verify that it is actually working as expected.
This also applies to assumptions about how various asynchronous operations can interact with each other. Ask yourself “What if?” for absolutely anything that can go wrong in the behavior of your code.
- What if this call fails?
- What if a connection is attempted before the code is done initializing?
- What if a connection comes in during initialization?
- What if this call takes a very long time to execute, and other things are allowed to happen in the interim?
Anything that you can come up with might shed light on the topic.
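As a small illustration of verifying an assumption in isolation, here is a sketch in Python. The assumption being tested ("round() always rounds halves up") and the function name are hypothetical examples of mine; the point is only that the check runs by itself, away from the rest of the application.

```python
# Assumption under test: "round() always rounds halves up."
# Checked in isolation, with no application code involved.

def check_rounding_assumption():
    """Return the cases where the 'halves round up' assumption does NOT hold."""
    surprises = []
    for value, assumed in [(0.5, 1), (1.5, 2), (2.5, 3), (3.5, 4)]:
        actual = round(value)
        if actual != assumed:
            surprises.append((value, assumed, actual))
    return surprises


surprises = check_rounding_assumption()
for value, assumed, actual in surprises:
    print(f"round({value}) returned {actual}, but the code assumed {assumed}")
```

Python's built-in round() rounds halves to the nearest even number, so two of the four cases come back as surprises. Five minutes with the docs or a throwaway script like this can save hours of staring at the surrounding code.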
Collect More Information
If you’re still not sure what’s going on, if you’re absolutely stumped, and you don’t know where to even look for the problem, start logging. Log all the things. Look at the output of those logs to see what is really going on in your code.
Those assumptions you questioned in the last point? You should be logging and/or asserting on just about every remotely relevant or related bit of data to ensure that it matches what you are assuming. You think that this variable is always an integer and defined? Assert! You think this value should only be in the range 1 to n? Assert! If it’s too hard to log, then that’s a symptom of a deeper issue, and you should solve that underlying problem first and then return to the problem at hand.
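Here is a minimal sketch of what that can look like in Python, using the standard logging module and plain assert statements. The function pick_slot and its parameters are hypothetical, made up only to show the pattern.

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("debugging")


def pick_slot(index, slot_count):
    """Hypothetical function: validate a 1-based slot number and return it."""
    # Log the inputs so the log shows what was actually passed in.
    log.debug("pick_slot(index=%r, slot_count=%r)", index, slot_count)

    # "This variable is always an integer and defined": assert it.
    assert isinstance(index, int), f"index should be an int, got {type(index).__name__}"
    # "This value should only be from 1 to n": assert it.
    assert 1 <= index <= slot_count, f"index {index} is outside 1..{slot_count}"

    return index
```

One caveat: Python skips assert statements when run with the -O flag, so these checks are a debugging aid, not a replacement for real input validation.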
In particular, if you have asynchronous events interacting with your code, log the sequence of firing and receiving those events, and make sure those logs can be interleaved by time with the logs that trace other aspects of your app/server. Then you can better understand the behavior of the entire system, and when the failure you’re searching for gets logged, you can look backwards up the combined log and see what sequence of events led to the issue.
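A rough sketch of the interleaving idea, assuming each log is already a time-ordered list of (timestamp, source, message) records. The record layout and the sample entries are invented for illustration; real logs would need parsing first.

```python
import heapq

# Hypothetical records: (timestamp_seconds, source, message).
# Each list is already in time order, as a log file normally is.
server_log = [
    (10.00, "server", "init started"),
    (10.25, "server", "init finished"),
    (10.40, "server", "ERROR: request handled with default config"),
]
event_log = [
    (10.10, "events", "connection received"),
    (10.30, "events", "connection dispatched"),
]


def interleave(*logs):
    """Merge several time-ordered logs into one time-ordered timeline."""
    return list(heapq.merge(*logs, key=lambda record: record[0]))


timeline = interleave(server_log, event_log)
for timestamp, source, message in timeline:
    print(f"{timestamp:8.2f} [{source}] {message}")
```

Reading the merged timeline backwards from the error, the connection that arrived before "init finished" stands out, which is exactly the kind of sequence that is invisible when the two logs are read separately.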
Enforce More Isolation
While I haven’t drunk the functional programming Kool-Aid, its proponents have one thing right: The more you have random side effects spread through your code, the harder it is to reason about. If you’re working with a pure functional language, you can skip this point, because you’re probably already doing this to the extreme.
Most languages let you modify global variables from deep within functions. If any part of the code you’re looking at does this, see if you can restructure it so that the function is free of side effects, with its result depending only on its inputs. You want to do this with every function you can.
So instead of trying to track down who is setting the DisableWidgets flag when they shouldn’t be, remove the DisableWidgets flag entirely and rewrite all the logic associated with displaying Widgets to be in one place. This is getting specific, but one pattern I’ve seen for this, in a UI context, is to use a stack of “current Widgets”, so instead of manually disabling and enabling widgets as different windows become active, they get their “active” state implicitly by being on top of the stack.
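A minimal sketch of the stack idea, in Python. The class names Widget and WidgetStack are hypothetical; a real UI toolkit will have its own way of expressing this.

```python
class Widget:
    """Hypothetical widget: has a name and no 'disabled' flag of its own."""

    def __init__(self, name):
        self.name = name


class WidgetStack:
    """Only the widget on top of the stack counts as active."""

    def __init__(self):
        self._stack = []

    def push(self, widget):
        self._stack.append(widget)

    def pop(self):
        return self._stack.pop()

    def is_active(self, widget):
        # Active state is derived from position, never stored or toggled.
        return bool(self._stack) and self._stack[-1] is widget


stack = WidgetStack()
main_window = Widget("main window")
dialog = Widget("settings dialog")

stack.push(main_window)
stack.push(dialog)       # main_window is now implicitly inactive
stack.pop()              # closing the dialog reactivates main_window
```

Because nothing ever sets an “active” flag, there is no flag for some distant piece of code to set incorrectly.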
Another approach is to use state-changing functions rather than letting any code just mutate state willy-nilly. Then at least you can trace when the states are being changed.
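For example, a sketch of a state-changing function in Python, assuming a hypothetical WidgetState class. Every change has to go through set_disabled, which records who asked for it and why.

```python
class WidgetState:
    """Hypothetical state holder: changes go through one traced function."""

    def __init__(self):
        self._disabled = False
        self.history = []  # list of (old_value, new_value, reason)

    @property
    def disabled(self):
        # Read-only property: assigning to .disabled raises AttributeError.
        return self._disabled

    def set_disabled(self, value, reason):
        """The single place where the state is allowed to change."""
        self.history.append((self._disabled, value, reason))
        self._disabled = value


state = WidgetState()
state.set_disabled(True, "modal dialog opened")
state.set_disabled(False, "modal dialog closed")
for old, new, reason in state.history:
    print(f"disabled: {old} -> {new} ({reason})")
```

A breakpoint or log line inside set_disabled now catches every change, which is much easier than hunting for every assignment scattered through the code.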
Look for solutions like my examples above: Have more of it be implicit and/or managed in a central way.
Always Initialize Your Variables
This may seem obvious, but it clearly isn’t. I once helped a team that was working on a game. They had a bug where, when you started playing the game, sometimes a certain feature would be available, and sometimes it wouldn’t.
This was in C++, so my first thought was “uninitialized variable”.[1] They had been fighting the bug for weeks, and couldn’t find it. I went into the relevant classes, found all the member variables, and just initialized all of them to zero-equivalents. Took me about 10 minutes, and after that the bug was fixed.
Moral: Sometimes technical debt can actually be the bug.
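The bug in that story was in C++, where an uninitialized member holds whatever happened to be in memory. Python has no direct equivalent, but a loosely similar failure is an attribute that only gets assigned on some code paths. As a rough analog rather than a reconstruction of that team's fix, here is a sketch with hypothetical field names where every field gets an explicit starting value:

```python
from dataclasses import dataclass, field


@dataclass
class PlayerState:
    """Hypothetical game state: every field has a known default."""

    score: int = 0
    bonus_feature_unlocked: bool = False
    # default_factory gives each instance its own list instead of a shared one.
    inventory: list = field(default_factory=list)


player = PlayerState()
print(player)
```

With defaults declared in one place, a fresh object always starts in the same state, so behavior cannot vary from run to run because of a field nobody set.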
Sometimes the code as written is just too hard to understand. If you know what it’s supposed to do, and it won’t take more than a day or two to rewrite in a cleaner manner, just do it.
I’ve occasionally found bugs because the code was so convoluted that I just couldn’t find the problem in a reasonable period of time. I’m sure I would have found it eventually, but after a short look at the tangled garbage that the previous developer had left for me, I was certain I could rewrite it a piece at a time, restructuring the code as I went, and I’d end up with something better and more maintainable when I was done.
Sure enough, took me about four hours to rewrite the code. In the process I found the bug, which became obvious as I was rewriting a crucial portion of code: The developer had made an incorrect assumption about the logic at a crucial point. But because the code was so difficult to understand, it really wasn’t obvious. I don’t blame the original developer for the mistake, though I do blame the developer for the Rube Goldberg architecture.
This can be interpreted to mean “Make it SOLID”, but it’s more than that. Typically there are dozens of ways to accomplish the same thing; think about other options and whether the code for those options would simply be cleaner to implement than what you’re looking at.
This is not so much a case of NIH as a case of cleaning up technical debt, and this time the technical debt was just hiding a bug.
Don’t Blame The Tool (Prematurely)
If you’re using a really popular tool[2], or a language that’s in the top 20 of popularity, don’t blame the tool, or the language or its standard libraries, until you’ve been able to reproduce a failure case outside of your app.
I say this as someone who has had patches accepted that fixed actual bugs in very popular tools with a reputation for being bulletproof. The thing is, statistically speaking, the bug you’re running into is almost certainly not a bug in the tool, and it’s on you to prove otherwise. Even with 30 years of experience I can still fall into this trap.
A huge percentage of people will throw up their hands because of mistaken assumptions and simply blame the tool. This almost qualifies as giving up before you start.
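To make “reproduce it outside of your app” concrete, here is a sketch of a standalone reproduction script in Python. The scenario is invented: suppose the app seems to “lose” dictionary entries after saving and reloading them as JSON, and the suspicion falls on the json module.

```python
import json


def roundtrip(data):
    """Serialize to JSON and back, with nothing from the app involved."""
    return json.loads(json.dumps(data))


original = {1: "one", 2: "two"}
restored = roundtrip(original)

print("original:", original)
print("restored:", restored)
print("lookup by int key after round trip:", restored.get(1))
```

The script reproduces the symptom in a dozen lines, and it also shows that the tool is doing what it documents: JSON object keys are always strings, so the integer keys come back as "1" and "2". The mistaken assumption was mine, not a bug in the library.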
Don’t Give Up
Speaking of giving up, a huge cause of bugs not being fixed is simply giving up before you’ve tried hard enough to fix them. The list of tools above can get you closer to figuring out what is going on in your code. Cycle through this list trying the various suggestions long enough, and eventually the bug will get resolved. If you give up, it’s guaranteed not to be resolved. At least not by you.[3]
Being a software engineer means taking responsibility for getting things done. If you want to earn the big bucks, don’t throw up your hands just because it takes longer than you want to solve a problem.
[1] If you have unpredictable behavior between runs of a program (or between debug and release builds), and you’re using a language that doesn’t initialize all variables (or you’re using a language that relies a lot on globals, and you could be accidentally using a leaked global), this is a primary cause.
[2] More than 500 stars on GitHub, for example.
[3] This is where a lot of my consulting comes from!