How To Solve Any Crash
(Note: This article was written with iOS development in mind, but the concepts apply to all frontend platforms.)
Closed: Cannot Reproduce
Ever had a crash in which you had absolutely no idea what was going on, and no amount of testing allowed you to reproduce the issue? If so, you've come to the right place!
Well, sort of. As you'll see in this article, the ability to debug complex crashes is not something immediate. Keep this out of your expectations: there's no magical instrument that you forgot to run that will give you the output you're expecting. When it comes to complex crashes, what we need to do instead is prepare our environment so that these issues are better understood when they arrive, making them more actionable. Let's see how to do that!
What are complex crashes?
One thing that I find helpful is to rationalize the issue. It's easy to look at a weird issue and just dismiss it as something magical that will never again, but that makes no sense. There's always a perfectly logical reason the issue happened (most likely your fault), and the more users are affected by it, the more likely it's that this is not a freak accident. So how come you might be looking right now at an issue that affects a high amount of users, and still you have no idea what's going on or how to reproduce it?
In my experience, the inability to understand a crash will always boil down to a lack of information. The problem is never that the issue is "too complicated", but that you don't have enough data. Think about the weirdest crash you ever had to look at: wouldn't it be a lot less complicated if the crash report told you exactly what the issue was and how to solve it? It doesn't matter how bizarre a crash is, it's your ability to understand and reproduce the issue which dictates how likely it's to be solved.
Thus, if you want to be able to solve any crash ever, you need to enhance the data that accompanies them. Let's take a look at a couple of ways do to that!
You might have noticed that crash platforms like Firebase will always include some useful pieces of device-related metadata on your crashes, such as the most common iOS version causing the crash, if the users were in foreground or background, if the devices are jailbroken, how much disk space each user had left when the crash happened, and so on. These are extremely useful, but are not nearly enough. What you truly need here is to include metadata of your app which helps you pinpoint what the user was doing at the moment of the crash. Some examples of things you should add are:
- The screen the user was looking at
- The "type" of the user, if applicable (free? premium? logged out?)
- The last action the user did (did they try to navigate somewhere?)
- Did the app finish launching correctly?
- Did the user receive a memory warning?
- Is the app shutting down?
- Does the user have an active internet connection?
- Which language is the user seeing?
It's hard to provide a complete list given that this will be completely different from app to app, but what you need to do here is essentially assemble everything that you can think about your app that can make a difference in its execution and include it to your crash reports.
You can add this information to Firebase through its SDK's key/value pairs API, but in order to see this information as percentages you will probably have to abstract Firebase under your own crash reporting backend.
In addition to metadata, another critical component is to have a solid analytics infrastructure in your app. This is something that most apps might already include, though you might require some changes to make it useable for crash reporting purposes.
The point here is that if you have a solid analytics implementation, you should be able to use it to replay a users steps to the crash. Thus, for this to happen, you need to make sure your analytics SDK is receiving as much information as possible regarding user interactions like:
- "User touched button X"
- "User saw banner Y"
- "User navigated to screen Z"
For the replayability itself, most third-party SDKs nowadays include a "timeline" feature that shows you all the events sent by a particular user around a specific time.
Using this information to solve crashes
Finally, with your crashes receiving as much information as possible about the state of the app at the moment of the crash, you can now follow this step-by-step guide I made that should help you track down and solve the great majority of cases!
Check the crashed thread
If you're reading this guide it probably means that you already tried this and it didn't work, but it's good to mention anyway that for the huge majority of cases the answer lies directly in the crashed trace. By checking the path the code took, you may be able to locate and reproduce the issue.
Check the metadata for the crash
If the trace is vague, then looking at your added metadata may reveal the issue. When looking at the metadata, pay attention to values that are close to either 100% or 0%. This may reveal that the crash is tied to a very specific device or condition inside the app.
Check the background threads of the crash
If the metadata is also vague, then it may mean that the crashed code is not the problem itself, but more of an indirect consequence of a problem that happened asynchronously somewhere else. In this situation, you may be able to locate the issue by looking at what's happening in the other threads of the crash. Try grabbing many occurrences of the crash and compare their threads with each other. Do they all have something in common that you don't see in other issues? If so, that could be the cause of the problem.
Match the environment of the crashing users
If after deeply analyzing the trace and metadata you still can't figure out what's going on, it may be the case that the crash is tied to a specific device and/or AB test. If that's the case, then you should be able to reproduce the crash by matching the user's environment. Besides making sure to use the exact phone/OS version that the user experienced the crash on, make sure that you're also matching the user's AB testing flags (if your app has them).
Regarding flags, one very useful thing to do is to compare the flags of a list of users with the issue against those of a list of users without the issue. If the issue is connected to a flag, then compiling a list of common flags in these groups will reveal which flag (or lack of, if the issue was caused by removing an experiment) is causing the problem.
EDIT: Dave Verwer also mentioned something important that I forgot to add -- make sure to also run the exact build of the app that the users are crashing on! It's not unlikely for the changes on your branch to affect the conditions for the crash, so always make sure you're on the exact commit the build was archived on. You can gain this ability by making your CI create a git tag every time it uploads a new build -- by naming the tag with the correspondent build number, you'll have the power to rollback to any release you've ever created.
Retrace the user's steps
If everything proved to be useless, you should be able to reproduce the issue by mimicking what the users are doing before the crash happens. This can sometimes be something absurdly specific like opening/closing the app a couple of times, turning it upside-down, opening a playlist and then throwing the device against a wall, and if that's the case, then you should be able to find these steps by looking at your analytics SDK's timeline for that user.
It's important to note that it's possible that the conditions to trigger the issue can span multiple sessions, like an issue that involves content that was downloaded a couple of days ago. In cases like that, understanding the issue requires looking not only at the data of the session where the crash happened, but also of the sessions that came before it.
Instrument for thread / memory safety issues
If you still can't figure out what's happening, then you may be dealing with a non-deterministic issue caused by either thread safety issues such as race conditions or memory issues like heap corruption. There's unfortunately no easy way of figuring these out, and you'll need to have a deep understanding of the code to catch them. In iOS, the Zombies instrument and the thread/memory sanitizers can be of some help.
The best thing you can do here is to prevent these from being possible to happen in the first place. If you're working with asynchronous code, always be 100% sure that your implementation is thread-safe for all its usage scenarios before merging it. While thread-related issues are very easy to introduce, they're extremely hard to debug. Choose to always be on the safer side to avoid issues like this in the future.
Check what was introduced in the release the crash started
In some cases, especially very old issues, it can be helpful to track down the exact version the issue started happening and hop into GitHub to see what exactly was introduced in that release. If nothing worked, then reverting suspicious pull requests could do the trick.
Add more logs
In the event that you have absolutely no clue what's happening, then adding additional logs could provide some relief. Firebase allows you to attach generic logs to a crash report, and one way you can use them is to log information about the state of the user's app around the place the crash happens. Try to think of anything weird or unintentional that can happen around the code that is crashing and log it to Firebase -- in the next release, you'll be able to see them alongside the crashes. They can also be useful even before you push a new feature; if you think a new feature could cause issues, you can already safeguard it with logs before it's even released. In the event that it does cause an issue, you'll already have the additional information you need to debug it.
If you reach this far, then it's possible for your initial thoughts might be true: you're dealing with some bizarre hardware problem caused by the sun's radiation at a specific time of the day for a specific user in Latvia.
To avoid situations like this, I personally try to completely avoid looking into issues until they are consistently happening for a sufficiently large amount of users. It's always possible for issues to be caused by situational things, but unless they are consistent or of high impact, it's probably best to ignore them to avoid the possibility of wasting your time looking into something that turns out to not be your fault.