We love hearing from people who use Time Travel Debugging (TTD) to dramatically reduce the time spent diagnosing issues. Our security team recently worked with the compiler team to solve a nasty deadlock when using ASAN and shared their journey.
In the narrative below, they use the reproducibility that trace files offer, along with hardware breakpoints, to solve the problem. If you already know how to debug with WinDbg, the barrier to entry for TTD is low.
As a bonus tip, you can use the timeline view to construct call queries or memory queries instead of setting breakpoints and stepping through execution. Or you can use “dx -g” with a call or memory query to generate a table of information about call parameters, memory values, and so on (my favorite!).
Do you have a good story to tell? Please share as inspiration for all of us!
-----
When rolling ASAN out across Windows, we hit an interesting bug on a particular build target. Without ASAN the program worked as expected; with ASAN turned on, it deadlocked every time.
I threw the program into TTD and the problem reproduced immediately.
From the trace, I could see that all threads were blocked on a lock. I rewound the trace to the start, set a breakpoint on the locking functions that was conditional on the specific lock everything was blocked on, and then executed forward. Very quickly I could see that one location acquired the lock, but its epilog ended up calling unlock with a different pointer.
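The matching step described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the tooling used in the investigation: the event tuples, trace positions, and lock addresses are hypothetical stand-ins for what you might note down from breakpoint hits on the lock and unlock routines.

```python
from collections import defaultdict


def find_unbalanced_locks(events):
    """Pair lock/unlock events by address.

    events: iterable of (position, op, address) tuples, where op is
    "lock" or "unlock". Returns (leaked, stray_unlocks):
      leaked        -> {address: [positions of acquires never released]}
      stray_unlocks -> [(position, address)] for unlocks on a lock not held
    """
    held = defaultdict(list)  # address -> positions of outstanding acquires
    stray_unlocks = []
    for position, op, address in events:
        if op == "lock":
            held[address].append(position)
        elif op == "unlock":
            if held[address]:
                held[address].pop()
            else:
                stray_unlocks.append((position, address))
        else:
            raise ValueError(f"unknown op: {op!r}")
    # Drop addresses whose acquires were all released.
    leaked = {addr: pos for addr, pos in held.items() if pos}
    return leaked, stray_unlocks


# Hypothetical events: the second acquire of 0x1000 is "released"
# with a different (clobbered) pointer, so 0x1000 stays held.
events = [
    (10, "lock", 0x1000),
    (20, "unlock", 0x1000),
    (30, "lock", 0x1000),
    (40, "unlock", 0x2F00),
]
leaked, stray_unlocks = find_unbalanced_locks(events)
```

Here `leaked` reports the acquire at position 30 that was never released, and `stray_unlocks` reports the unlock at position 40 on an address nobody held, which is the same mismatch the conditional breakpoints surfaced in the trace.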
Returning to the trace with the compiler team later, we narrowed the issue down to a specific place where, under ASAN combined with a tail call, the stack was not adjusted appropriately, causing the saved registers to be clobbered. These clobbered registers were then fed into the unlock routine, leading to the inconsistent lock state and eventual deadlock. The fix for this issue is in VS 2022 17.10 Preview 3.
Start to finish, going from reproducing the issue outside the debugger to identifying the offending codegen took around four hours, with another hour spent later with the compiler team to identify the specific behavior at fault.
Without a TTD trace, root-causing this would have taken much, much longer. I find TTD valuable in day-to-day debugging, and for toolchain issues in particular the time saved by replayability is incredible.