The quick and dirty deadlock solution to get those threads off your back
What exactly is a code hack? We asked developers: What's the most useful debugging trick you use, what do you do that most developers aren't aware of, and how did you solve that issue that was bugging you for way too long? Basically, any non-straightforward solution using a piece of gum and a paperclip.
This post is the first story in the series, kicking it off with Uri Shamay, Principal Lead System Software Engineer at Akamai, who shares a story of deadlock madness.
At a company I previously worked at, we were developing a system that dramatically decreased the lead time on some business logic at a bank. Operations that used to take 30 days to complete now take only a single day. The flow starts with a validation process in which the system receives a customer and runs all kinds of checks against external government APIs and internal APIs. To get hold of this data, the old implementation spawned a native thread for each of these API fetches.
A tech lead decided this model for API calls couldn't scale, so he moved to a newer implementation built on non-blocking I/O with Java NIO, which let him create only a few threads, with the count based on the CPU architecture. The new, more complex flow also used JNI and DLLs to receive file system notifications whenever something changed on a local path belonging to a message queue system.
The unexpected behavior showed up with the new deployment, but the tech lead who had been in charge had quit, so no one knew where the source code was or what it actually did. The bank's authorization policy required us to use the existing API library to get the data from the government's systems. Every few hours everything would stop functioning, and jstack told us there was a deadlock. Actually, there were a lot of deadlocks when we used those libraries, but we never fully understood why, because we couldn't see any pattern in the jstack output. To top it all off, some of the code was just Jars, JNI, and DLLs with no source files and no documentation, so I used DJ Java Decompiler for the Jars and WinDbg to inspect the DLLs.
Deadlock investigations are hard, and deadlocks in code you're not familiar with are even harder. But deadlocks that happen because other developers didn't understand concurrency patterns are the hardest! When I started to investigate the deadlocks in production, I was sure it was just a small issue that I could fix quickly. That optimism soon became a joke in the R&D group, with many developers going through the jstack output and no one really understanding the pattern or the root cause of the deadlocks.
We looked for open-source tools that would give us more information, and tried some tracing tools, but none of them helped. We tried to capture the state of the system and reproduce it in a QA environment, but the data we simulated wasn't diverse enough, so the deadlocks didn't reappear. Since real production data couldn't be provided due to a regulation policy, we reached a dead end.
After months of problems and manual restarts in the middle of the night, DevOps jokes 24x7, budget concerns and a new strict deadline for the project, we decided to rewrite the problematic libraries. They contained so many problems that we first had to fall back to the old version, with a native thread for each call. After a deep analysis of the code, we understood that this was not the behavior we needed either. There was no running away from rewriting the massive codebase, and right now we're working on rewriting everything from scratch.
So, we decided to rewrite everything, but in the meantime we didn't want DevOps to kill us, so we wrote some JMX code that polls 'findDeadlockedThreads' and automatically restarts the app when a deadlock happens. The DeadLockDetector thread polls findDeadlockedThreads through JMX and exits brutally with System.exit($CODE) when a deadlock is detected.
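The original listing is not reproduced here, so what follows is only a minimal sketch of what such a detector thread could look like, not the code that ran at the bank. It relies on the standard `ThreadMXBean.findDeadlockedThreads()` call, which returns `null` when nothing is deadlocked. The exit code value, the polling interval parameter and the helper method name are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Watchdog thread: polls the JVM's ThreadMXBean and exits the process on a deadlock. */
class DeadLockDetector extends Thread {
    // Hypothetical exit code; the outer restart loop must agree on the same value.
    static final int DEADLOCK_EXIT_CODE = 66;

    private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    private final long pollIntervalMillis;

    DeadLockDetector(long pollIntervalMillis) {
        super("DeadLockDetector");
        setDaemon(true); // the watchdog alone should never keep the JVM alive
        this.pollIntervalMillis = pollIntervalMillis;
    }

    /** Returns the IDs of deadlocked threads, or an empty array when there are none. */
    static long[] deadlockedThreadIds(ThreadMXBean bean) {
        long[] ids = bean.findDeadlockedThreads(); // null means "no deadlock found"
        return ids == null ? new long[0] : ids;
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            long[] ids = deadlockedThreadIds(threadBean);
            if (ids.length > 0) {
                System.err.println("Deadlock detected in " + ids.length + " threads, exiting");
                System.exit(DEADLOCK_EXIT_CODE); // brutal exit; the launcher restarts the app
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                return; // stop polling when interrupted
            }
        }
    }
}
```

Starting it is a one-liner early in the application's `main`, for example `new DeadLockDetector(5000).start();`. One caveat worth knowing: `System.exit` runs shutdown hooks, so if a hook needs a lock held by one of the deadlocked threads the exit itself can hang; `Runtime.getRuntime().halt(code)` skips the hooks and may be the safer choice in that situation.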
The error code is then returned to the main loop, and if it is the deadlock code, the app is executed again from scratch. Note that you can also call the same JMX operation from outside your program, using Remote Management, in case you don't have the source or you want to do it on a live system.
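Both ideas can be sketched in a few lines, again as an illustration under stated assumptions rather than the original code. The restart loop below launches the app as a child process and relaunches it only when it exits with the deadlock code. The remote check connects with the standard `javax.management.remote` API and assumes the target JVM was started with remote JMX enabled (for example via the `com.sun.management.jmxremote.port` system property). The class name, exit code, host and port are hypothetical.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class DeadlockSupervisor {
    // Hypothetical exit code; must match the one the in-process detector exits with.
    static final int DEADLOCK_EXIT_CODE = 66;

    /** Only a deadlock exit triggers a restart; any other exit code ends the loop. */
    static boolean shouldRestart(int exitCode) {
        return exitCode == DEADLOCK_EXIT_CODE;
    }

    /** Builds the conventional RMI-based JMX service URL for a host and port. */
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    /** Main loop: run the app, and run it again from scratch after a deadlock exit. */
    static int superviseLoop(List<String> command) throws IOException, InterruptedException {
        while (true) {
            Process app = new ProcessBuilder(command).inheritIO().start();
            int exitCode = app.waitFor();
            if (!shouldRestart(exitCode)) {
                return exitCode; // normal shutdown or an unrelated failure
            }
            System.err.println("App exited with the deadlock code, restarting");
        }
    }

    /** Asks a remote JVM for its deadlocked thread IDs; empty array means none. */
    static long[] remoteDeadlockedThreadIds(String host, int port) throws IOException {
        JMXServiceURL url = new JMXServiceURL(jmxUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ThreadMXBean remoteThreads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            long[] ids = remoteThreads.findDeadlockedThreads(); // null means "no deadlock"
            return ids == null ? new long[0] : ids;
        }
    }
}
```

The remote variant only reports the deadlock; restarting a process you did not launch is left to whatever tooling manages that system.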
** The full code example with a simulated deadlock is available on GitHub.