Archive for the ‘Assembly’ Category

Calling System Service APIs in Kernel

Wednesday, January 26th, 2011

In this post I am not going to shed any new light about this topic, but I didn’t find anything like this organized in one place, so I decided to write it down, hope you will find it useful.

Sometimes when you develop a kernel driver you need to use some internal API that cannot be accessed normally through the DDK. Though you may say “but it’s not an API if it’s not officially exported and supported by MS”. Well that’s kinda true, the point is that some functions like that which are not accessible from the kernel, are really accessible from usermode, hence they are called API. After all, if you can call NtCreateFile from usermode, eventually you’re supposed to be able to do that from kernel, cause it really happens in kernel, right? Obviously, NtCreateFile is an official API in the kernel too.

When I mean using system service APIs, I really mean by doing it platform/version independent, so it will work on all versions of Windows. Except when MS changes the interface (number of parameters for instance, or their type) to the services themselves, but that rarely happens.

I am not going to explain how the architecture of the SSDT and the transitions from user to kernel or how syscalls, etc work. Just how to use it to our advantage. It is clear that MS doesn’t want you to use some of its APIs in the kernel. But sometimes it’s unavoidable, and using undocumented API is fine with me, even in production(!) if you know how to do it well and as robust as possible, but that’s another story. We know that MS doesn’t want you to use some of these APIs because a) they just don’t export it in kernel on purpose, that is. b) starting with 64 bits versions of Windows they made it harder on purpose to use or manipulate the kernel, by removing previously exported symbols from kernel, we will get to that later on.

Specifically I needed ZwProtectVirtualMemory, because I wanted to change the protection of some page in the user address space. And that function isn’t exported by the DDK, bummer. Now remember that it is accessible to usermode (as VirtualProtectMemory through kernel32.dll syscall…), therefore there ought to be a way to get it (the address of the function in kernel) in a reliable manner inside a kernel mode driver in order to use it too. And this is what I’m going to talk about in this post. I’m going to assume that you already run code in the kernel and that you are a legitimate driver because it’s really going to help us with some exported symbols, not talking about shellcodes here, although shellcodes can use this technique by changing it a bit.

We have a few major tasks in order to achieve our goal: Map the usermode equivalent .dll file. We need to get the index number of the service we want to call. Then we need to get the base address of ntos and the address of the (service) table of pointers (the SSDT itself) to the functions in the kernel. And voila…

The first one is easy both in 32 and 64 bits systems. There are mainly 3 files which make the syscalls in usermode, such as: ntdll, kernel32 and user32 (for GDI calls). For each API you want to call in kernel, you have to know its prototype and in which file you will find it (MSDN supplies some of this or just Google it). The idea is to map the file to the address space as an (executable) image. Note that the cool thing about this mapping is that you will get the address of the required file in usermode. Remember that these files are physically shared among all processes after boot time (For instance, addresses might change because of ASLR but stay consistent as long as the machine is up). Following that we will use a similar functionality to GetProcAddress, but one that you have to write yourself in kernel, which is really easy for PE and PE+ (64 bits).

Alright, so we got the image mapped, we can now get some usermode API function’s address using our GetProcAddress, now what? Well, now we have to get the index number of the syscall we want. Before I continue, this is the right place to say that I’ve seen so many approaches to this problem, disassemblers, binary patterns matching, etc. And I decided to come up with something really simple and maybe new. You take two functions that you know for sure that are going to be inside kernel32.dll (for instance), say, CreateFile and CloseHandle. And then simply compare byte after byte from both functions to find the first different byte, that byte contains the index number of the syscall (or the low byte out of the 4 bytes integer really). Probably you have no idea what I’m talking about, let me show you some usermode API’s that directly do syscalls:

XP SP3 ntdll.dll
B8 25 00 00 00                    mov     eax, 25h        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 19 00 00 00                    mov     eax, 19h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP1 32 bits ntdll.dll

B8 3C 00 00 00                    mov     eax, 3Ch        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 30 00 00 00                    mov     eax, 30h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

2008 sp2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

Win7 64bits syswow64 ntdll.dll

B8 52 00 00 00                    mov     eax, 52h        ; NtCreateFile
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 2C 00                          retn    2Ch

B8 0C 00 00 00                    mov     eax, 0Ch        ; NtClose
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 04 00                          retn    4

These are a few snippets to show you how the syscall function templates look like. They are generated automatically by some tool MS wrote and they don’t change a lot as you can see from the various architectures I gathered here. Anyway, if you take a look at the bytes block of each function, you will see that you can easily spot the correct place where you can read the index of the syscall we are going to use. That’s why doing a diff on two functions from the same .dll would work well and reliably. Needless to say that we are going to use the index number we get with the table inside the kernel in order to get the corresponding function in the kernel.

This technique gives us the index number of the syscall of any exported function in any one of the .dlls mentioned above. This is valid both for 32 and 64 bits. And by the way, notice that the operand type (=immediate) that represents the index number is always a 4 bytes integer (dword) in the ‘mov’ instruction, just makes life easier.

To the next task, in order to find the base address of the service table or what is known as the system service descriptor table (in short SSDT), we will have to get the base address of the ntoskrnl.exe image first. There might be different kernel image loaded in the system (with or without PAE, uni-processor or multi-processor), but it doesn’t matter in the following technique I’m going to use, because it’s based on memory and not files… This task is really easy when you are a driver, means that if you want some exported symbol from the kernel that the DDK supplies – the PE loader will get it for you. So it means we get, without any work, the address of any function like NtClose or NtCreateFile, etc. Both are inside ntos, obviously. Starting with that address we will round down the address to the nearest page and scan downwards to find an ‘MZ’ signature, which will mark the base address of the whole image in memory. If you’re afraid from false positives using this technique you’re welcome to go further and check for a ‘PE’ signature, or use other techniques.

This should do the trick:

PVOID FindNtoskrnlBase(PVOID Addr)
{
    /// Scandown from a given symbol's address.
    Addr = (PVOID)((ULONG_PTR)Addr & ~0xfff);
    __try {
        while ((*(PUSHORT)Addr != IMAGE_DOS_SIGNATURE)) {
            Addr = (PVOID) ((ULONG_PTR)Addr - PAGE_SIZE);
        }
        return Addr;
    }
    __except(1) { }
    return NULL;
}

And you can call it with a parameter like FindNtoskrnlBase(ZwClose). This is what I meant that you know the address of ZwClose or any other symbol in the image which will give you some “anchor”.

After we got the base address of ntos, we need to retrieve the address of the service table in kernel. That can be done using the same GetProcAddress we used earlier on the mapped user mode .dll files. But this time we will be looking for the “KeServiceDescriptorTable” exported symbol.

So far you can see that we got anchors (what I call for a reliable way to get an address of anything in memory) and we are good to go, this will work in production without the need to worry. If you wanna start the flame war about the unlegitimate use of undocumented APIs, etc. I’m clearly not interested. :)
Anyway, in Windows 32 bits, the latter symbol is exported, but it is not exported in 64 bits! This is part of the PatchGuard system, to make life harder for rootkits, 3rd party drivers doing exactly what I’m talking about, etc. I’m not going to cover how to get that address in 64 bits in this post.

The KeServiceDescriptorTable is a table that holds a few pointers to other service tables which contain the real addresses of the service functions the OS supplies to usermode. So a simple dereference to the table and you get the pointer to the first table which is the one you are looking for. Using that pointer, which is really the base address of the pointers table, you use the index we read earlier from the required function and you got, at last, the pointer to that function in kernel, which you can now use.

The bottom line is that now you can use any API that is given to usermode also in kernelmode and you’re not limited to a specific Windows version, nor updates, etc. and you can do it in a reliable manner which is the most important thing. Also we didn’t require any special algorithms nor disassemblers (as much as I like diStorm…). Doing so in shellcodes make life a bit harder, because we had the assumption that we got some reliable way to find the ntos base address. But every kid around the block knows it’s easy to do it anyway.

Happy coding :)

References I found interesting about this topic:
http://j00ru.vexillium.org/?p=222
http://alter.org.ua/docs/nt_kernel/procaddr/

http://uninformed.org/index.cgi?v=3&a=4&p=5

And how to do it in 64 bits:

http://www.gamedeception.net/threads/20349-X64-Syscall-Index

New Project – ReviveR

Saturday, September 25th, 2010

Hey all,

long time haven’t posted. I’m kinda busy with lots of stuff.
Anyway I just wanted to let you know that I’m starting to work on the sequel of diStorm, you guessed it right… A reversing studio!
Unlike what many people said, the core is going to be written in C++, the GUI is going to be written per OS. No thanks, QT. Top goals are performance, scripting, good UI and most important good analysis capabilities. Obviously it’s going to be open source, cross platform. For a start, it will support only x86 and AMD64 and PE file format, maybe ELF too, though not my priority. I’m not sure about a debugger yet, but it will probably be implemented later. GUI is going to be written using WPF under C#, just to give you an idea.

My main interests are performance and binary code analysis algorithms.

If there are highly skilled programmers who wish to help, please contact me.
For now it seems we are a group of 4 coders, I’m still not going to publish their names, until everything is settled.

Anyway, design is taking place nowadays. This is your time for suggesting new features and ideas.

Big good luck

diStorm3 is Ready

Monday, August 16th, 2010

diStorm3 is ready for the masses! :)
– if you want to maximize the information you get from a single instruction; Structure output rather than text, flow control analysis support and more!

Check it out now at its new google page.

Good luck!

Heapos Forever

Friday, August 6th, 2010

There are still hippos around us, beware:
heapo

Kernel heap overflow.

DEVMODE dm = {0};
dm.dmSize  = sizeof(DEVMODE);
dm.dmBitsPerPel = 8;
dm.dmPelsWidth = 800;
dm.dmPelsHeight = 600;
dm.dmFields = DM_PELSWIDTH | DM_PELSHEIGHT | DM_BITSPERPEL;
ChangeDisplaySettings(&dm, 0);

BITMAPINFOHEADER bmih = {0};
bmih.biClrUsed = 0x200;

HGLOBAL h = GlobalAlloc(GMEM_FIXED, 0x1000);
memcpy((PVOID)GlobalLock(h), &bmih, sizeof(bmih));
GlobalUnlock(h);

OpenClipboard(NULL);
SetClipboardData(CF_DIBV5, (HANDLE)h);
CloseClipboard();

OpenClipboard(NULL);
GetClipboardData(CF_PALETTE);


[Update, 11th Aug]: Here is MSRC response.

Custom Kernel Debugging is Faster

Tuesday, July 20th, 2010

When you start to write a post you always get a problem with the headline for the post. You need to find something that will, in a few words, sum it up for the reader. I was wondering which one is better, “Boosting WinDbg”, “Faster Kernel Debugging in WinDbg”, “Hacking WinDbg” and so on. But they might be not accurate, and once you will read the post you won’t find them appropriate. But instead of talking about meta-post issues, let’s get going.

Two posts ago, I was talking about hunting a specific race condition bug we had in some software I work on. At last, I have free time to write this post and get into some interesting details about Windows Kernel and Debugging.

First I want to say that I got really pissed off that I couldn’t hunt the bug we had in the software like a normal human being, that Jond and I had to do it the lame old school way, which takes more time, lots of time. What really bothered me is that computers are fast and so is debugging, at least, should be. Why the heck do I have to sit down in front of the computer, not mentioning – trying to dupe the damned bug, and only then manage to debug it and see what’s going on wrong. Unacceptable. You might say, write a better code in the first place, I agree, but even then people have bugs, and will have, forever, and I was called to simply help.

Suppose we want to set a breakpoint on memory access this time, but something more complicated with conditions. The reason we need a condition, rather than a normal breakpoint is because the memory we want to monitor gets accessed thousands times per second, in my case with the race condition, for instance.
You’re even welcome to make the following test locally on your computer, fire up Visual Studio, and test the following code: unsigned int counter = 1; while (counter < 99999999+1) { counter++; }, set a memory access breakpoint on counter which stops when hit count reach 99999999, and time the whole process, and then time it without the bp set, and compare the result, what's the ratio you got? Isn't that just crazy? Here's an example in WinDbg's syntax, would be something like this: ba w4 0x491004 "j (poi(0x491004)==0) 'gc'" Which reads: break on write access for an integer at address 0x491004 only if its value is 0, otherwise continue execution. It will be tens-thousands times faster without the bp set, hence the debugging infrastructure, even locally (usermode), is slowing things down seriously. And think that you want to debug something similar on a remote machine, it's impossible, you are going to wait years in vain for something to happen on that machine. Think of all the COM/Pipe/USB/whatever-protocol messages that have to be transmitted back and forth the debugged machine to the debugger. And add to that the conditional breakpoint we set, someone has to see whether the condition is true or false and continue execution accordingly. And even if you use great tools like VirtualKD. Suppose you set a breakpoint on a given address, what really happens once the processor executes the instruction at that address? Obviously a lot, but I am going to talk about Windows Kernel point of view. Let's start bottom up, Interrupt #3 is being raised by the processor which ran that thread, which halts execution of the thread and transfers control _KiTrap3 in ntoskrnl. _KiTrap3 will build a context for the trapped thread, with all registers and this likely info and call CommonDispatchException with code 0x80000003 (to denote a breakpoint exception). Since the 'exception-raising' is common, everybody uses it, in other exceptions as well. CommonDispatchException calls _KiDispatchException. And _KiDispatchException is really the brain behind all the Windows-Exception mechanism. I'm not going to cover normal exception handling in Windows, which is very interesting in its own. So far nothing is new here. But we're getting to this function because it has something to do with debugging, it checks whether the _KdDebuggerEnabled is set and eventually it will call _KiDebugRoutine if it's set as well. Note that _KiDebugRoutine is a pointer to a function that gets set when the machine is debug-enabled. This is where we are going to get into business later, so as you can see the kernel has some minimal infrastructure to support kernel debugging with lots of functionality, many functions in ntoskrnl which start in "kdp", like KdpReadPhysicalMemory, KdpSetContext and many others. Eventually the controlling machine that uses WinDbg, has to speak to the remote machine using some protocol named KdCom, there's a KDCOM.DLL which is responsible for all of it. Now, once we set a breakpoint in WinDbg, I don't know exactly what happens, but I guess it’s something like this: it stores the bp in some internal table locally, then sends it to the debugged machine using this KdCom protocol, the other machine receives the command and sets the breakpoint locally. Then when the bp occurs, eventually WinDbg gets an event that describes the debug event from the other machine. Then it needs to know what to do with this bp according to the dude who debugs the machine. So much going on for what looks like a simple breakpoint. The process is very similar for single stepping as well, though sending a different exception code.

The problem with conditional breakpoints is that they are being tested for the condition locally, on the WinDbg machine, not on the server, so to speak. I agree it’s a fine design for Windows, after all, Windows wasn’t meant to be an uber debugging infrastructure, but an operating system. So having a kernel debugging builtin we should say thanks… So no complaints on the design, and yet something has to be done.

Custom Debugging to our call!

That’s the reason I decided to describe above how the debugging mechanism works in the kernel, so we know where we can intervene that process and do something useful. Since we want to do smart debugging, we have to use conditional breakpoints, otherwise in critical variables that get touched every now and then, we will have to hit F5 (‘go’) all the time, and the application we are debugging won’t get time to process. That’s clear. Next thing we realized is that the condition tests are being done locally on our machine, the one that runs WinDbg. That’s not ok, here’s the trick:
I wrote a driver that replaces (hooks) the _KiDebugRoutine with my own function, which checks for the exception code, then examines the context according to my condition and only then sends the event to WinDbg on the other machine, or simply “continues-execution”, thus the whole technique happens on the debugged machine without sending a single message outside (regarding the bp we set), unless that condition is true, and that’s why everything is thousands of times or so faster, which is now acceptable and usable. Luckily, we only need to replace a pointer to a function and using very simple tests we get the ability to filter exceptions on spot. Although we need to get our hands dirty with touching Debug-Registers and the context of the trapping thread, but that’s a win, after all.

Here’s the debug routine I used to experiment this issue (using constants tough):

int __stdcall my_debug(IN PVOID TrapFrame,
	IN PVOID Reserved,
	IN PEXCEPTION_RECORD ExceptionRecord,
	IN PCONTEXT Context,
	IN KPROCESSOR_MODE PreviousMode,
	IN UCHAR LastChance)
{
	ULONG _dr6, _dr0;
	__asm {
		mov eax, dr6
		mov _dr6, eax
		mov eax, dr0
		mov _dr0, eax
	};
	if ((ExceptionRecord->ExceptionCode == 0x80000003) &&
		(_dr6 & 0xf) &&
		(_dr0 == MY_WANTED_POINTER) &&
		(ExceptionRecord->ExceptionAddress != MY_WANTED_EIP))
	{
		return 1;
	}
	return old_debug_routine(TrapFrame, Reserved, ExceptionRecord, Context, PreviousMode, LastChance);
}

This routine checks when a breakpoint interrupt happened and stops the thread only if the pointer I wanted to monitor was accessed from a given address, else it would resume running that thread. This is where you go custom, and write whatever crazy condition you are up to. Using up to 4 breakpoints, that’s the processor limit for hardware breakpoints. Also checking out which thread or process trapped, etc. using the Kernel APIs… It just reminds me “compiled sprites” :)

I was assuming that there’s only one bp set on the machine which is the one I set through WinDbg, though this time, there was no necessity to set a conditional breakpoint in WinDbg itself, since we filter them using our own routine, and once WinDbg gets the event it will stop and let us act.

For some reason I had a problem with accessing the DRs from the Context structure, I didn’t try too hard, so I just backed to use them directly because I can.

Of course, doing what I did is not anything close to production quality, it was only a proof of concept, and it worked well. Next time that I will find myself in a weird bug hunting, I will know that I can draw this weapon.
I’m not sure how many people are interested in such things, but I thought it might help someone out there, I wish one day someone would write an open source WinDbg plugin that injects kernel code through WinDbg to the debugged machine that sets this routine with its custom runtime conditional breakpoints :)

I really wanted to paint some stupid pictures that show what’s going on between the two machines and everything, but my capabilities at doing that are aweful, so it’s up to you to imagine that, sorry.

For more related information you can see:
http://uninformed.org/index.cgi?v=8&a=2&p=16
http://www.vsj.co.uk/articles/display.asp?id=265

Cracking for Fun and Non-Profit

Saturday, May 22nd, 2010

One of the fun things to do with applications is to bypass their copy-protection mechanisms. So I want to share my experience about some iPad application, though the application is targeted for the Jailbroken devices. It all began a few days ago, when a friend was challenging me to crack some application. I had my motives, and I’m not going to talk about them. However, that’s why the title says non-profit. Or maybe when they always say “for profit” they mean the technical-knowledge profit.

So before you start to crack some application, what you should do is see how it works, what happens when you run it, what GUI related stuff you can see, like dialog boxes or messages that popup, upon some event you fire. There are so many techniques to approach application-cracking, but I’m not here to write a tutorial, just to talk a bit about what I did.

So I fired IDA with the app loaded, the app was quite small, around 35kb. First thing I was doing was to see the imported functions. This is how I know what I’m going to fight with in one glare. I saw MD5/RSA imported from the crypto library, and that was like “oh uh”, but no drama. Thing is, my friend purchased the app and gave me the license file. Obviously it’s easier with a license file, otherwise, sometimes it’s proved that it’s impossible to crack software without critical info that is encrypted in the license file, that was the issue in my case too. Of course, there’s no point in a license file that only checks the serial-number or something like that, because it’s not enough. So without the license file, there wasn’t much to do.

For some reason IDA didn’t like to parse the app well, so I had to recall how to use this ugly API of IDC (the internal scripting language of IDA), yes, I know IDA Python, but didn’t want to use it. So my script was fixing all LDR instructions, cause the code is PICy so with the strings revealed I could easily follow all those ugly objc_msgSend calls. For Apple’s credit, the messages are text based, so it’s easy to understand what’s going on, once you manage to get to that string. For performance’s sake, this is so lame, I rather use integers than strings, com’on.

Luckily the developer of that app didn’t bother to hide the exported list of functions, he was busy with pure protection algorithm in Objective-C, good for me.
So eventually the way the app worked (license perspective) was to check if the license file exists, if so, parse it. Otherwise, ask for a permission to connect to the Internet and send the UDID (unique device ID) of the device to the app’s server, get a response, and if the status code was success, write it to a file, then run the license validator again.

The license validator was quite cool, it was calling dladdr on itself to get the full path of the executable itself, then calculating the MD5 of the binary. Can you see why? So if you thought you could easily tamper with the file, you were wrong. Taking the MD5 hash, and xoring it in some pattern with the data from the license file; Then decrypting the result with the public key that was in the static segment, though I didn’t care much about it. Since the MD5 of the binary itself was used, this dependency is a very clever trick of the developer, though expected. So I tried to learn more about how the protection works.

Suppose the license was legit, the app would take that buffer and strtok() it to tokens, to check that the UDID was correct. The developer was nice enough to call the lockdownd APIs directly, so in one second I knew where and what was going on around it. In the beginning I wanted to create a proxy dylib for this lockdownd library, but it would require me to patch the header of the mach-o so the imported function will be through my new file – but it still requires a change to the file, no good. So the way it worked with the decrypted string – it kept on tokenizing the string, but this time, it checked for some string match, as if someone tampered with the binary, the decryption would go wrong and the string wouldn’t compare well. And then it did some manipulation on some object, adding methods to it in runtime, with the names from the tokenized string, thus if you don’t have a license file to begin with, you don’t know the names of the new methods that were added. One star for the developer, yipi.

All in all, I have to say that I wasn’t using any debugger or runtime tricks, everything was static reversing, yikes. Therefore, after I was convinced that I can’t ignore the protection because I lack of the names of the new methods, and I can’t use a debugger to phish the names easily. I was left with one solution, as I said before – faking the UDID and fixing the MD5.

What I really cared about for a start, was how the app calculates the MD5 of itself:
Since the developer retrieved the name of the binary using dladdr, I couldn’t just change some path to point to the original copy of the binary, so when it hashes it, it would get the expected hash. That was a bammer, I had to do something else, but similar idea… I decided to patch the file-open function. The library functions are called in ARM mode and it’s very clear. The app itself was in THUMB, so it transitions to ARM using a BX instruction and calls a thunk, that in order will call the imported function. So the thunk function is in ARM mode, thus 4 bytes per instruction, very wasteful IMHO.

The goal of my patches was to patch those thunks, rather than all the callers to those thunks. Cause I could end up with a dozen of different places to patch. So I was limited in the patches I could do in a way. So eventually I extended the thunk of the file-open and made R0 register point to my controlled path, where I could guarantee an original copy of the binary, so when it calculated the MD5 of it, it would be the expected hash. Again, I could do so many other things, like planting a new MD5 value in the binary and copy it in the MD5-Final API call, but that required too much code changes. And oh yes, I’m such a jackass that I didn’t even use an Arm-assembler. Pfft, hex-editing FTW :( Oh also, I have to comment that it was safe to patch the thunk of file-open, cause all the callers were related to the MD5 hashing…

Ok, so now I got the MD5 good and I could patch the file however I saw fit. Patching the UDID-strcmp’s wasn’t enough, since the license wasn’t a “yes/no” check, it had essential data I needed, otherwise I could finish with the protection in 1 minute patch (without going to the MD5 hassle). So I didn’t even touch those strcmp’s.

RSA encryption then? Ahhh not so fast, the developer was decrypting the xored license with the resulted MD5 hash, then comparing the UDID, so I got the license decrypted well with the MD5 patch, but now the UDID that was returned from the lockdownd was wrong, wrong because it wasn’t corresponding to the purchased license. So I had to change it as well. The problem with that UDID and the lockdownd API, is that it returns a CFSTR, so I had to wrap it with that annoying structure. That done, I patched the thunk of the lockdown API to simply return my CFSTR of the needed UDID string.

And guess what?? it crashed :) I put my extra code in a __ustring segment, in the beginning I thought the segment wasn’t executable, because it’s a data. But I tried to run something very basic that would work for sure, and it did, so I understood the problem was with my patch. So I had to double check it. Then I found out that I was piggy-backing on the wrong (existing) CFSTR, because I changed its type. Probably some code that was using the patched CFSTR was expecting a different type and therefore crashed, so I piggy-backed a different CFSTR that wouldn’t harm the application and was a similar type to what I needed (Just a string, 0x7c8). What don’t we do when we don’t have segment slacks for our patch code. :)

And then it worked… how surprising, NOT. But it required lots of trial and errors, mind you, because lack of tools mostly.
End of story.
It’s really hard to say how I would design it better, when I had my chance, I was crazy about obfuscation, to make the reverser desperate, so he can’t see a single API call, no strings, nothing. Plant decoy strings, code, functionality, so he wastes more time. Since it’s always possible to bypass the protections, if the CPU can do it, I can do it too, right? (as long as I’m on the same ring).

Race Condition From Hell, aren’t they all?

Monday, April 19th, 2010

Actually I had a trouble to come up with a good title for this post, at least one that I was satisfied with. Therefore I will start with a background story, as always.
The problem started when I had to debug a huge software which was mostly in Kernel mode. And there was this critical section (critsec from now on) synchronization object that wasn’t held always correctly. And eventually after 20 mins of trying to replicate the bug, we managed to crash the system with a NULL dereference. This variable was a global that everybody who after acquiring the critsec was its owner. Then how come we got a crash ? Simple, someone was touching the global out of it critsec scope. That’s why it was also very hard to replicate, or took very long.

The pseudo code was something like this:
Acquire Crit-Sec
g_ptr = “some structure we use”
do safe task with g_ptr

g_ptr = NULL
Release Crit-Sec

So you see, before the critsec was released the global pointer was NULLed again. Obvisouly this is totally fine, because it’s still in the scope of the acquired crit, so we can access it safely.

Looking at the crash dumps, we saw a very weird thing, but nothing surprising for those race conditions bugs. Also if you ask me, I think I would prefer dead-lock bugs to race conditions, since in dead lock, everything gets stuck and then you can examine which locks are held, and see why some thread (out of the two) is trying to acquire the lock, when it surely can’t… Not saying it’s easier, though.
Anyway, back to the crash dump, we saw that the g_ptr variable was accessed in some internal function after the critsec was acquired. So far so good. Then after a few instructions, in an inner function that referenced the variable again, suddenly it crashed. Traversing back to the point where we know by the disassembly listing of the function, where the g_ptr was touched first, we knew it worked there. Cause otherwise, it would have crashed there and then, before going on, right? I have to mention that between first time reading the variable and the second one where it crashed, we didn’t see any function calls.
This really freaked me out, because the conclusion was one – somebody else is tempering with our g_ptr in a different thread without locking the crit. If there were any function calls, might be that some of them, caused our thread to be in a Waitable state, which means we could accept APCs or other events, and then it could lead to a whole new execution path, that was hidden from the crash dump, which somehow zeroed the g_ptr variable. Also at the time of the crash, it’s important to note that the owner of the critsec was the crashing thread, no leads then to other problematic threads…

Next thing was to see that everybody touches the g_ptr only when the critsec is acquired. We surely know for now that someone is doing something very badly and we need to track the biatch down. Also we know the value that is written to the g_ptr variable is zero, so it limits the number of occurrences of such instruction (expression), which lead to two spots. Looking at both spots, everything looked fine. Of course, it looked fine, otherwise I would have spotted the bug easily, besides, we got a crash, which means, nothing is fine. Also, it’s time to admit, that part of the code was Windows itself, which made the problem a few times harder, because I couldn’t do whatever I wanted with it.

I don’t know how you guys would approach such a problem in order to solve it. But I had three ideas. Sometimes just like printf/OutputDebugPrint is your best friend, print logs when the critsec is acquired and released, who is waiting for it and just every piece of information we can gather about it. Mind you that part of it was Windows kernel itself, so we had to patch those functions too, to see, who’s acquiring the critsec and when. Luckily in debug mode, patchguard is down :) Otherwise, it would be bloody around the kernel. So looking at the log, everything was fine, again, damn. You can stare at the god damned thing for hours and tracking the acquiring and releasing pairs of the critsec, and nothing is wrong. So it means, this is not going to be the savior.

The second idea, was to comment out some code portions with #if 0 surrouding the potential problematic code. And starting to eliminate the possibilities of which function is the cause of this bug. This is not such a great idea. Since a race condition can happen in a few places, finding one of them is not enough usually. Though it can teach you something about the original bug’s characteristics, then you can look at the rest of the code to fix that same thing. It’s really old school technique but sometimes it is of a help as bad as it sounds. So guess what we did? Patched the g_ptr = NULL of the kernel and then everything went smooth, no crashes and nothing. But the problem still was around, now we knew for sure it’s our bug and not MS, duh. And there were only a few places in our code which set this g_ptr. Looking at all of them, again, seemed fine. This is where I started going crazy, seriously.

While you were reading the above ideas, didn’t you come up with the most banal idea, to put a dumb breakpoint – on memory access, on g_ptr with a condition of “who writes zero”. Of course you did, that what you should have done in the first place. I hope you know that. Why we couldn’t do that?
Because the breakpoint was fired tens of thousands times in a single second. Rendering the whole system almost to freeze. Assuming it took us 20 mins to replicate the bug, when we heavily loaded the system. Doing that with such a breakpoint set, would take days or so, no kidding. Which is out of question.

This will lead me to the next post. Stay tuned.

Trying to Pwn Stuff my way

Saturday, January 30th, 2010

I have been playing CS since 2001 :) Kinda addicted I can say. Like, after I had been in South America for half a year, suddenly I caught myself thinking “ohhh I wish I could play CS”… So I think it means I’m addicted. Anyway I really like that game. A few days ago I was playing on some server and suddenly hl2 crashed. How good is that they generate a crash dump automatically, so I fired up WinDbg and took a look what happened, I found out that some pointer was set to 1, not NULL, mind you. Looking around the crash area I found a buffer overflow on the stack, but only for booleans, so I don’t know what was the point and how it was triggered or who sent it (server or another player). Anyway, since I like this game so much, there is only one thing I don’t like it, the stupid children you play with/against, they curse and TK (team-kill) like noobs. One day I promised to myself that I will pwn those little bastards. Therefore I started to investigate this area of crash, which I won’t say anything about the technical details here, so you won’t be able to replicate it, except that I found a stack buffer overflow. The way from there to pwn the clients who connect to a server I set up is really easy. The down side is that they have to connect to a server I control, which is quite lame, the point is to pwn other players on a remote server, so I still work on that. For me pwning would be to find a way to kick them from the server for instance, I don’t need to execute code on their machines. Besides since I do everything for fun, and I’m not a criminal, I have to mention that it’s for eductional purposes only :) Being the good guy I am, in ZERT and stuff. I just wanted to add that the protocol used to be really hole-y before CS: Source came out, everything was vulnerable, really, you could tell the server that you wanted to upload a file to it (your spray-decal file) with a name longer than 256 characters, and bam, you own the server through a stupid strcpy to a buffer on the stack. But after CSS came out, the guys did a great job and I could hardly find stuff. What I found is in some isoteric parser that the input comes from the server… What was weird is that some functions were protected with a security cookie and some weren’t. I don’t know what configuration those guys use to compile the game, but they surely need to work it out better.

Another thing I’ve been trying to pwn for a long time now, without much success, I have to say, is NTVDM. This piece of software is huge, though most of it is mostly in user-mode, there are lots of related code in kernel. Recently a very crazy bug was found there (which can lead to a privilege escalation), something in the design, of how the kernel transfers control to BIOS code and returns. You can read more here to get a better clue. So it gave me some idea what to do about some potential buggy code I found. Suppose I found a code in the kernel that takes DS:SI and changes it to a flat pointer, the calculation is (DS << 4) + SI. The thing is that DS is 16 bits only. The thing I thought is that with some wizardy I will be able to change DS to have some value above 0xffff. For some of you it might sound impossible, but in 32 bits and playing with pop ds, mov ds, ax and the like, I managed to put random values in the high 16 bits of DS (say it’s a 32 bit segment register). Though I don’t know if WinDbg showed me garbage or how it really worked, or what happened there, I surely saw big values in DS. So since I couldn’t reproduce this behavior in 16 bits under NTVDM, I tried to think of a way to set DS in the VDM Context itself. If you look at the exports of NTVDM you will see a function named “SetDS”, so taking a look of how it works I tried to use it inside my 16 bits code (exploiting some Escape bug I found myself and posted on this blog earlier), I could set DS to whatever arbitary value I wanted. Mind you, I set DS for the VM itself, not the DS of the usermode application of ntvdm.exe. And then I tried to trigger the other part in the kernel which takes my raw pointer and tries to write to it, but DS high 16 bits were zeros. Damn it. Then I gave to it more thought, and understood that what I did is not good enough. This is because once I set DS to some value, then I get to code to execute on the processor for real and then it enters kernel’s trap handler, DS high half gets truncated once again and I lost in the game. So I’m still thinking if it’s spossible. Maybe next step I should try is to invoke the kernel’s trap handler directly with DS set to whatever value I want, but that’s probably not possible since I can’t control the trap frame myself… or maybe I can ;)

diStorm3 – News

Tuesday, December 29th, 2009

Yo yo yo… forgot to say happy xmas last time, never too late, ah? :)

This time I wanted to update you about diStorm3 once again. Yesterday I had a good coding session and I added some of the new features regarding flow control. The decode function gets a new parameter called ‘features’. Which is a bit field flag that lets you ask the disassembler to do some new stuff such as:

  1. Stop on INT instructions [INT, INT1, INT3, INTO]
  2. Stop on CALL instructions [CALL, CALL FAR]
  3. Stop on RET instructions [RET, RETF, IRET]
  4. Stop on JMP instructions [JMP, JMP FAR]
  5. Stop on any conditional branch instructions [JXXX, JXCX, LOOPXX]
  6. Stop on any flow control (all of the above)

I wasn’t sure about SYSCALL and the like and UD2, for now I left them out. So what we got now is the ability to instruct the disassembler to stop decoding after it encounters one of the above conditions. This makes the higer disassembler layer more efficient, because now you can disassemble code by basic blocks. Also building a call-graph or branches-graph faster.

Note that now you will be able to ask the disassembler to return a single instruction. I know it sounds stupid, but I talked about it already, and I had some reasons to avoid this behavior. Anyway, now you’re free to ask how many instructions you want, as long as the disassembler can read them from the stream you supply.

Another feature added is the ability to filter non-flow-control instructions. Suppose you are interested in building a call-graph only, there’s no reason that you will get all the data-control instructions, because they are probably useless for the case. Mixing this flag with ‘Stop on RET’ and ‘Stop on CALL’, you can do nice stuff.

Another thing is that I separated the memory-indirection description of an operand into two forms. First of all, memory indirection operand is when an instruction reads/writes from/to memory. Usually in Assembly text, you will see the brackets characters surrounding some expressions. Something like: MOV [EDX], EAX. Means we write a DWORD to EDX pointer. If you followed me ’till here, you should know exactly what I’m talking about anyway.

When you get the result of such instruction from diStorm3, the type of the operand will be SMEM (stands for simple-memory), which hints there’s only one register in the memory-indirection operand. Although it doesn’t hint anything about the displacement, that’s that offset you usually see in the brackets. Like MOV [EDX+0x12345678], EAX. So you will have to test if the displacement exists in both forms. The other form is MEM (Normal memory indirection, or probably should be called ‘complex’) since it supports the full memory indirection operand, like: MOV [EAX*4 + ESI + 0x12345678], EAX. Then you will have to read another register that supplies the base register, in addition to the index register and scale. Note that this applies for 16 bits mode addressing as well, when you can have a mix of [‘BX+SI]’ or only ‘[BX]’. Also note that sometimes in 32/64 bits mode, you can have a SIB byte, that sets only the base register and the index register is unused, but diStorm3 will return it as an SMEM, to simplify matters. This way it’s really easy to read the instruction’s parameters.

Another feature for text formatting is the ability to tell the disassembler to limit the address to 16 or 32 bits. This is good since the offsets are 64 bits today. And if you encounter an instruction that jumps backwards, you will get a huge negative value, which won’t make much sense if you disassemble 16 bits code…

diStorm3 still supplies the bad old interface. And now it supports two new additional functions. The decompose function, which returns the structures for each instruction read. And another function that formats a given structure into text, which is pretty straight forward. The text format is not an accurate behavior of diStorm64, it’s more simplified, but good enough. Besides I have never heard any special comments about the formatting of diStorm64, so I guess it doesn’t matter much to you guys. And maybe maybe I will add AT&T syntax later on.

Another field that is returned now, unlike diStorm64, is the instruction-set-class type of the instruction, with very broad categories, like Integer instructions, FPU instructions, SSE instructions, and so on. Still might be handy. And the hint about the flow-control type of the instruction.

Also I changed tons of code, and I really mean it, the skeleton is still the same, but the prefixes engine works totally different now. Trying to imitate a real processor this time. By including the last prefix found of that prefix-type. You can read more about this, here. I made the code way more optimized and eliminated double code and it’s still readable, if not for the better. Also I changed the way instruction are fetched, so the locate-instruction function is much smaller and better.

I’m pertty satisfied with the new version of diStorm and hopefully I will be able to share it with you guys soon. Still I got tons of tests to do, maybe I will add that unit-test module in Python to the proejct so you can enjoy it too, not sure yet.

Also I got a word from Mario Vilas, that he is going to help with compiling diStorm for different platforms, and I’m going to integrate his new Python wrappers that use ctypes, so you don’t need the Python extension anymore. Thanks Mario! ;) However, diStorm3 has its own Python module for the new structure output.

If you have more ideas, comments, complaints or you just hate me, this is the time to say so.
Cheers, happy new year soon!
Gil

Terminate Process @ 1 byte

Monday, December 28th, 2009

In the DOS days there was this lovely trick that you could just ‘ret’ anytime you wanted and the process would be terminated, I’m talking about .com files. That’s true that there were no real processes back then, more like a big chaotic jungle. Anyways, how it worked? Apparently the RET instruction, popped a zero word from the stack, and jumped to address 0. At address 0, there was this PSP (Program Segment Prefix) which had lots of interesting stuff, like command line buffer, and the like. The first word in this PSP (at address 0 of the whole segment too) was 0x20CD (instruction INT 0x20), or ‘Terminate Process’ interrupt. So branching to address 0 would run this INT 0x20 and close the program. Of course, you could execute INT 0x20 on your own, but then it would cost another byte :) But as long as the stack was balanced it seemed you could totally rely on popping a zero word in the stack for this usage. In other times I used this value to zero a register, by simply pop ax, for instance…

Time passed, and now we all use Windows (almost all of us anyway) and I came with a similar trick when I wrote Tiny PE. What I did was to put a last byte in my code with the value of 0xBC. Now usually there is some extra bytes following your code, either because you have data there or page alignment when the code section was allocated by the PE loader. This byte really means “MOV ESP, ???”. Though we don’t know what comes next to fill ESP with (some DWORD value), probably some junk, which is great.
At the cost of 1 byte, we caused ESP to get some uncontrolled value, which probably doesn’t map to anywhere valid. After this MOV ESP instruction was executed, nothing happens, alas, the next instruction is getting executed too and so on. When the process gets to terminate you ask? That’s the point we have to wait for one of two conditions to occur. Either by executing some junk instruction that will cause an access violation because it touched some random address which is unmapped. Or the second option is that we run out of bytes to execute and hit an unmapped page, but this time because of the EIP pointer itself.
This is where the SEH mechanism comes in, the system sees there’s an AV exception and tries to call the safe-exception handler because it’s just raised an exception. Now since ESP is a junk really it’s in an unrecoverable state already, so the system terminated the process. 1 byte FTW.