Custom Kernel Debugging is Faster

When you start to write a post you always get a problem with the headline for the post. You need to find something that will, in a few words, sum it up for the reader. I was wondering which one is better, “Boosting WinDbg”, “Faster Kernel Debugging in WinDbg”, “Hacking WinDbg” and so on. But they might be not accurate, and once you will read the post you won’t find them appropriate. But instead of talking about meta-post issues, let’s get going.

Two posts ago, I was talking about hunting a specific race condition bug we had in some software I work on. At last, I have free time to write this post and get into some interesting details about Windows Kernel and Debugging.

First I want to say that I got really pissed off that I couldn’t hunt the bug we had in the software like a normal human being, that Jond and I had to do it the lame old school way, which takes more time, lots of time. What really bothered me is that computers are fast and so is debugging, at least, should be. Why the heck do I have to sit down in front of the computer, not mentioning – trying to dupe the damned bug, and only then manage to debug it and see what’s going on wrong. Unacceptable. You might say, write a better code in the first place, I agree, but even then people have bugs, and will have, forever, and I was called to simply help.

Suppose we want to set a breakpoint on memory access this time, but something more complicated with conditions. The reason we need a condition, rather than a normal breakpoint is because the memory we want to monitor gets accessed thousands times per second, in my case with the race condition, for instance.
You’re even welcome to make the following test locally on your computer, fire up Visual Studio, and test the following code: unsigned int counter = 1; while (counter < 99999999+1) { counter++; }, set a memory access breakpoint on counter which stops when hit count reach 99999999, and time the whole process, and then time it without the bp set, and compare the result, what's the ratio you got? Isn't that just crazy? Here's an example in WinDbg's syntax, would be something like this: ba w4 0x491004 "j (poi(0x491004)==0) 'gc'" Which reads: break on write access for an integer at address 0x491004 only if its value is 0, otherwise continue execution. It will be tens-thousands times faster without the bp set, hence the debugging infrastructure, even locally (usermode), is slowing things down seriously. And think that you want to debug something similar on a remote machine, it's impossible, you are going to wait years in vain for something to happen on that machine. Think of all the COM/Pipe/USB/whatever-protocol messages that have to be transmitted back and forth the debugged machine to the debugger. And add to that the conditional breakpoint we set, someone has to see whether the condition is true or false and continue execution accordingly. And even if you use great tools like VirtualKD. Suppose you set a breakpoint on a given address, what really happens once the processor executes the instruction at that address? Obviously a lot, but I am going to talk about Windows Kernel point of view. Let's start bottom up, Interrupt #3 is being raised by the processor which ran that thread, which halts execution of the thread and transfers control _KiTrap3 in ntoskrnl. _KiTrap3 will build a context for the trapped thread, with all registers and this likely info and call CommonDispatchException with code 0x80000003 (to denote a breakpoint exception). Since the 'exception-raising' is common, everybody uses it, in other exceptions as well. CommonDispatchException calls _KiDispatchException. And _KiDispatchException is really the brain behind all the Windows-Exception mechanism. I'm not going to cover normal exception handling in Windows, which is very interesting in its own. So far nothing is new here. But we're getting to this function because it has something to do with debugging, it checks whether the _KdDebuggerEnabled is set and eventually it will call _KiDebugRoutine if it's set as well. Note that _KiDebugRoutine is a pointer to a function that gets set when the machine is debug-enabled. This is where we are going to get into business later, so as you can see the kernel has some minimal infrastructure to support kernel debugging with lots of functionality, many functions in ntoskrnl which start in "kdp", like KdpReadPhysicalMemory, KdpSetContext and many others. Eventually the controlling machine that uses WinDbg, has to speak to the remote machine using some protocol named KdCom, there's a KDCOM.DLL which is responsible for all of it. Now, once we set a breakpoint in WinDbg, I don't know exactly what happens, but I guess it’s something like this: it stores the bp in some internal table locally, then sends it to the debugged machine using this KdCom protocol, the other machine receives the command and sets the breakpoint locally. Then when the bp occurs, eventually WinDbg gets an event that describes the debug event from the other machine. Then it needs to know what to do with this bp according to the dude who debugs the machine. So much going on for what looks like a simple breakpoint. The process is very similar for single stepping as well, though sending a different exception code.

The problem with conditional breakpoints is that they are being tested for the condition locally, on the WinDbg machine, not on the server, so to speak. I agree it’s a fine design for Windows, after all, Windows wasn’t meant to be an uber debugging infrastructure, but an operating system. So having a kernel debugging builtin we should say thanks… So no complaints on the design, and yet something has to be done.

Custom Debugging to our call!

That’s the reason I decided to describe above how the debugging mechanism works in the kernel, so we know where we can intervene that process and do something useful. Since we want to do smart debugging, we have to use conditional breakpoints, otherwise in critical variables that get touched every now and then, we will have to hit F5 (‘go’) all the time, and the application we are debugging won’t get time to process. That’s clear. Next thing we realized is that the condition tests are being done locally on our machine, the one that runs WinDbg. That’s not ok, here’s the trick:
I wrote a driver that replaces (hooks) the _KiDebugRoutine with my own function, which checks for the exception code, then examines the context according to my condition and only then sends the event to WinDbg on the other machine, or simply “continues-execution”, thus the whole technique happens on the debugged machine without sending a single message outside (regarding the bp we set), unless that condition is true, and that’s why everything is thousands of times or so faster, which is now acceptable and usable. Luckily, we only need to replace a pointer to a function and using very simple tests we get the ability to filter exceptions on spot. Although we need to get our hands dirty with touching Debug-Registers and the context of the trapping thread, but that’s a win, after all.

Here’s the debug routine I used to experiment this issue (using constants tough):

int __stdcall my_debug(IN PVOID TrapFrame,
	IN PVOID Reserved,
	IN PEXCEPTION_RECORD ExceptionRecord,
	IN PCONTEXT Context,
	IN KPROCESSOR_MODE PreviousMode,
	IN UCHAR LastChance)
{
	ULONG _dr6, _dr0;
	__asm {
		mov eax, dr6
		mov _dr6, eax
		mov eax, dr0
		mov _dr0, eax
	};
	if ((ExceptionRecord->ExceptionCode == 0x80000003) &&
		(_dr6 & 0xf) &&
		(_dr0 == MY_WANTED_POINTER) &&
		(ExceptionRecord->ExceptionAddress != MY_WANTED_EIP))
	{
		return 1;
	}
	return old_debug_routine(TrapFrame, Reserved, ExceptionRecord, Context, PreviousMode, LastChance);
}

This routine checks when a breakpoint interrupt happened and stops the thread only if the pointer I wanted to monitor was accessed from a given address, else it would resume running that thread. This is where you go custom, and write whatever crazy condition you are up to. Using up to 4 breakpoints, that’s the processor limit for hardware breakpoints. Also checking out which thread or process trapped, etc. using the Kernel APIs… It just reminds me “compiled sprites” :)

I was assuming that there’s only one bp set on the machine which is the one I set through WinDbg, though this time, there was no necessity to set a conditional breakpoint in WinDbg itself, since we filter them using our own routine, and once WinDbg gets the event it will stop and let us act.

For some reason I had a problem with accessing the DRs from the Context structure, I didn’t try too hard, so I just backed to use them directly because I can.

Of course, doing what I did is not anything close to production quality, it was only a proof of concept, and it worked well. Next time that I will find myself in a weird bug hunting, I will know that I can draw this weapon.
I’m not sure how many people are interested in such things, but I thought it might help someone out there, I wish one day someone would write an open source WinDbg plugin that injects kernel code through WinDbg to the debugged machine that sets this routine with its custom runtime conditional breakpoints :)

I really wanted to paint some stupid pictures that show what’s going on between the two machines and everything, but my capabilities at doing that are aweful, so it’s up to you to imagine that, sorry.

For more related information you can see:
http://uninformed.org/index.cgi?v=8&a=2&p=16
http://www.vsj.co.uk/articles/display.asp?id=265

This entry was posted on Tuesday, July 20th, 2010 at 9:00 am and is filed under Assembly, Code Analysis, Debugging, Hardware, OS Dev, Patching, Reversing, Software. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

5 Responses to “Custom Kernel Debugging is Faster”

jduck says:

July 20, 2010 at 3:39 pm

Great post!

I think that the existing tool chains are generally less than optimal for non-trivial debugging situations. The problem is that what you want to do may vary from case to case. The situation you write about is one example. Examining code coverage comes to mind as well.

There are well-known solutions to many of the problems that exist. Unfortunately, afaik, no tool or tool chain has implemented the solutions in a (re-)usable, user-friendly way.

Hopefully the future will bring better tools!
grander says:

July 21, 2010 at 3:10 pm

Great post indeed!

any chance you remember how much slower was the solution from the original (no bp at all) run? just curious about the timings.

Thanks!
arkon says:

July 22, 2010 at 8:25 am

Without a bp set on the counter++ line it runs in no time, ~0 sec (gross timing is good enough).
Setting an execution bp on that line with a hit count of 9999999, takes minutes to fully run, i halted it cause i didn’t wait for it to finish. This test was done using visual studio. where the counter variable was defined volatile.

doing same experiment with windbg yielded same results.

anyway, using what my post talked about, gained unbelievable speed, from something that doesn’t let the system get rendered to a very usable and fast case.
anonymous says:

August 6, 2010 at 4:46 pm

In response to:
“I’m not sure how many people are interested in such things, but I thought it might help someone out there, I wish one day someone would write an open source WinDbg plugin that injects kernel code through WinDbg to the debugged machine that sets this routine with its custom runtime conditional breakpoints ”

It’s already available. It was written by flierlu combined with boost-python. It uses Alex Ionescu’s “trick” he talked about at Recon that windbg.exe uses in order to .AttachKernel to interact with kernelspace on the local machine. It’s also not threadsafe if you’re injecting your interpreter into another usespace process, but it’s not too hard to wrap all the python entry points in the module with an acquire/release of the GIL.

That’s still that dependency on python, but it’s still useful for instrumenting code/data through IDebugClient. So then just tie it together with your disassembler, and then a stub, and you can hook different points of execution without suffering the cost of a context switch.

ftw.
Yuhong Bao says:

August 9, 2010 at 10:52 pm

I’d use the __readdr intrinsic instead of inline asm.

Insanely Low-Level

Custom Kernel Debugging is Faster

5 Responses to “Custom Kernel Debugging is Faster”

Leave a Reply