Archive for the ‘Code Analysis’ Category

Finding Kernel32 Base Address Shellcode

Thursday, July 7th, 2011

Yet another one…
This time, smaller, more correct, and still null-free.
I looked through some shellcodes on exploit-db and googled too, to see whether anyone had found a smaller way, to no avail.

I based my code on:
http://skypher.com/index.php/2009/07/22/shellcode-finding-kernel32-in-windows-7/
who, AFAIK, based his post on:
http://blog.harmonysecurity.com/2009_06_01_archive.html

And this is my version:

00000000 (02) 6a30                     PUSH 0x30
00000002 (01) 5e                       POP ESI
; Use DB 0x64; LODSD
00000003 (02) 64ad                     LODS EAX, [FS:ESI]
00000005 (03) 8b700c                   MOV ESI, [EAX+0xc]
00000008 (03) 8b761c                   MOV ESI, [ESI+0x1c]
0000000b (03) 8b5608                   MOV EDX, [ESI+0x8]
0000000e (04) 807e1c18                 CMP BYTE [ESI+0x1c], 0x18
00000012 (02) 8b36                     MOV ESI, [ESI]
00000014 (02) 75f5                     JNZ 0xb

The tricky part was how to read from FS:0x30, and the way I use is the smallest one, at least from what I checked.
Another issue that was fixed is the check for kernel32.dll. The usual variation of this shellcode checks for a null byte, but that turned out to be bogus on W2k machines, so it was changed to check for a null word, making the shellcode a byte or two longer.

This way, it’s only 22 bytes, and it doesn’t assume that kernel32.dll is the second/third entry in the list; it actually loops till it finds the correct module name length (len of ‘kernel32.dll’ * 2 bytes = 0x18). That matters because kernelbase.dll can come first, which renders lots of implementations of this technique unusable.
And obviously the resulting base address of kernel32.dll is in EDX.
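For readers who prefer C to opcodes, here is a minimal sketch of the same loop. It assumes a hypothetical mock_entry structure that only mirrors the three fields the shellcode touches (the offsets in the comments are the ones used in the listing above); it is not the real loader structure. Unlike the shellcode, it compares the whole 16-bit length rather than only its low byte, and it stops after one lap around the list instead of looping forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields the shellcode reads, relative to
   the InInitializationOrder link it keeps in ESI. */
typedef struct mock_entry {
    struct mock_entry *flink;  /* [ESI]       next entry in the list      */
    void *dll_base;            /* [ESI+0x8]   module base address         */
    uint16_t name_len;         /* [ESI+0x1c]  module name length in bytes */
} mock_entry;

#define KERNEL32_NAME_BYTES 0x18  /* 12 UTF-16 characters * 2 bytes */

/* Walk a circular list once and return the base of the first module
   whose name is exactly 0x18 bytes long, or NULL if there is none. */
void *find_kernel32_base(const mock_entry *head)
{
    assert(head != NULL);
    const mock_entry *e = head;
    do {
        if (e->name_len == KERNEL32_NAME_BYTES)
            return e->dll_base;
        e = e->flink;
    } while (e != NULL && e != head);
    return NULL;
}
```

Note that the length test matches any module whose name happens to be 12 characters long, which is the same trade-off the shellcode makes.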

Enjoy

[Update July 9th:]
Here’s a link to an explanation about PEB/LDR lists.
See first comment for a better version which is only 17 bytes.

Private Symbols Look Up by Binary Signatures

Friday, July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to try and keep it as short as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it with juicy details, or at all, so I decided to show you a real-life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions, which are not exported. Don’t start with the “this is not documented” story, I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want. Even in legitimate code, it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, at whatever level it might be, you had better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires a lot of good reversing before injecting more code and messing with the process.
The example of a signature I’m going to talk about is again about getting the function ZwProtectVirtualMemory address in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy and it does, but that’s another story.

I don’t wanna turn this post into a guideline on how to write a good signature, though I have lots of experience with it, even for various archs. I will just say that you never wanna put binary code (I am talking about the actual opcode bytes) as part of your signature, except in extreme cases, simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it is as generic as possible, so it will (hopefully) survive the updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee that a signature will stay compatible with future updates. But always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL, then try to get as many versions of that DLL file as possible and make a script to try it out on all of them, that’s a must. The more DLLs the signature works on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is that they change a lot, almost in every compilation of the code in a different version: the compiler will allocate different registers for the instructions and thus change the instructions. Also, code might get compiled to many variations which effectively do the same thing, e.g. MOV EAX, 0 and XOR EAX, EAX.
One more note, a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run on the listing it generated. Anyway, caching is always a good idea, and if you have many passes to do for many signatures to find different things, you can always cache lots of stuff and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time, so try to avoid doing it per lookup. It should be cached instead: one pass scanning the whole code section of the file builds a dictionary of “target:source” pairs, with false positives (another long story), that can be looked up for a range of addresses…
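To make the XREF cache idea a bit more concrete, here is a rough sketch of the single pass, assuming 32-bit x86 code and looking only at CALL rel32 (opcode E8). The xref struct and collect_call_xrefs are names made up for this illustration. Every E8 byte is treated as a call without disassembling, which is exactly where the false positives come from; the only filter is that the target must land inside the scanned section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t target;  /* address the (possible) call lands on */
    uint32_t source;  /* address of the E8 opcode byte        */
} xref;

/* One linear pass over a code section mapped at 'base'. Records a
   target:source pair for every E8 byte whose rel32 operand points back
   into the section. Returns the number of pairs written to 'out'. */
size_t collect_call_xrefs(const uint8_t *code, size_t size, uint32_t base,
                          xref *out, size_t max_out)
{
    size_t count = 0;
    assert(code != NULL && out != NULL);
    for (size_t i = 0; i + 5 <= size && count < max_out; i++) {
        if (code[i] != 0xE8)
            continue;
        /* Decode the little-endian rel32 that follows the opcode. */
        uint32_t rel = (uint32_t)code[i + 1]
                     | ((uint32_t)code[i + 2] << 8)
                     | ((uint32_t)code[i + 3] << 16)
                     | ((uint32_t)code[i + 4] << 24);
        /* Target is relative to the end of the 5-byte instruction;
           unsigned wrap-around takes care of negative displacements. */
        uint32_t target = base + (uint32_t)i + 5u + rel;
        if ((uint32_t)(target - base) < size) {
            out[count].target = target;
            out[count].source = base + (uint32_t)i;
            count++;
        }
    }
    return count;
}
```

Sorting the resulting array by target would then allow a binary search for any range of addresses; that part is left out here.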

I almost forgot to mention, I used such a binary signature inside the patch I wrote as a member of ZERT for closing a vulnerability in Internet Explorer. I needed to find the weak function and patch it in memory, so you can grab the source code and see for yourself. The example does use opcodes (and lots of them) as part of the signature, but there’s a special reason for it. Long story short: the signature won’t match once the function gets officially patched by MS (recall that we published that solution before MS acted), so the patcher will know that it didn’t find the signature, that the function was probably already patched well, and that we don’t need to patch it on top of the new patch… confusing shit.

The reason I find signatures amazing is that only reversers can do them well, and it takes lots of skill to generate good ones.
Happy signaturing :)

And surprisingly I found the following link which is interesting: http://wiki.amxmodx.org/Signature_Scanning

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions…though always expect it might be broken.

Calling System Service APIs in Kernel

Wednesday, January 26th, 2011

In this post I am not going to shed any new light about this topic, but I didn’t find anything like this organized in one place, so I decided to write it down, hope you will find it useful.

Sometimes when you develop a kernel driver you need to use some internal API that cannot be accessed normally through the DDK. Though you may say “but it’s not an API if it’s not officially exported and supported by MS”. Well that’s kinda true, the point is that some functions like that which are not accessible from the kernel, are really accessible from usermode, hence they are called API. After all, if you can call NtCreateFile from usermode, eventually you’re supposed to be able to do that from kernel, cause it really happens in kernel, right? Obviously, NtCreateFile is an official API in the kernel too.

When I say using system service APIs, I really mean doing it in a platform/version independent way, so it will work on all versions of Windows. Except when MS changes the interface to the services themselves (the number of parameters, for instance, or their types), but that rarely happens.

I am not going to explain the architecture of the SSDT, the transitions from user to kernel, or how syscalls etc. work. Just how to use it to our advantage. It is clear that MS doesn’t want you to use some of its APIs in the kernel. But sometimes it’s unavoidable, and using undocumented APIs is fine with me, even in production(!), if you know how to do it well and as robustly as possible, but that’s another story. We know that MS doesn’t want you to use some of these APIs because a) they don’t export them in the kernel, on purpose, and b) starting with the 64 bits versions of Windows they made it harder on purpose to use or manipulate the kernel by removing previously exported symbols from it; we will get to that later on.

Specifically I needed ZwProtectVirtualMemory, because I wanted to change the protection of some page in the user address space. And that function isn’t exported by the DDK, bummer. Now remember that it is accessible to usermode (as VirtualProtect in kernel32.dll, which ends up in the NtProtectVirtualMemory syscall…), therefore there ought to be a way to get it (the address of the function in kernel) in a reliable manner inside a kernel mode driver in order to use it too. And this is what I’m going to talk about in this post. I’m going to assume that you already run code in the kernel and that you are a legitimate driver, because it’s really going to help us with some exported symbols. I’m not talking about shellcodes here, although shellcodes can use this technique by changing it a bit.

We have a few major tasks in order to achieve our goal: Map the usermode equivalent .dll file. We need to get the index number of the service we want to call. Then we need to get the base address of ntos and the address of the (service) table of pointers (the SSDT itself) to the functions in the kernel. And voila…

The first one is easy both in 32 and 64 bits systems. There are mainly 3 files which make the syscalls in usermode, such as: ntdll, kernel32 and user32 (for GDI calls). For each API you want to call in kernel, you have to know its prototype and in which file you will find it (MSDN supplies some of this or just Google it). The idea is to map the file to the address space as an (executable) image. Note that the cool thing about this mapping is that you will get the address of the required file in usermode. Remember that these files are physically shared among all processes after boot time (For instance, addresses might change because of ASLR but stay consistent as long as the machine is up). Following that we will use a similar functionality to GetProcAddress, but one that you have to write yourself in kernel, which is really easy for PE and PE+ (64 bits).

Alright, so we got the image mapped, we can now get some usermode API function’s address using our GetProcAddress, now what? Well, now we have to get the index number of the syscall we want. Before I continue, this is the right place to say that I’ve seen so many approaches to this problem: disassemblers, binary pattern matching, etc. I decided to come up with something really simple and maybe new. You take two functions that you know for sure are going to be inside ntdll.dll (for instance), say, NtCreateFile and NtClose. Then simply compare both functions byte by byte to find the first different byte; that byte contains the index number of the syscall (or really the low byte of the 4-byte integer). Probably you have no idea what I’m talking about, so let me show you some usermode APIs that directly do syscalls:

XP SP3 ntdll.dll
B8 25 00 00 00                    mov     eax, 25h        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 19 00 00 00                    mov     eax, 19h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP1 32 bits ntdll.dll

B8 3C 00 00 00                    mov     eax, 3Ch        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 30 00 00 00                    mov     eax, 30h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

2008 sp2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

Win7 64bits syswow64 ntdll.dll

B8 52 00 00 00                    mov     eax, 52h        ; NtCreateFile
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 2C 00                          retn    2Ch

B8 0C 00 00 00                    mov     eax, 0Ch        ; NtClose
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 04 00                          retn    4

These are a few snippets to show you what the syscall function templates look like. They are generated automatically by some tool MS wrote and they don’t change a lot, as you can see from the various architectures I gathered here. Anyway, if you take a look at the block of bytes of each function, you will see that you can easily spot the place where you can read the index of the syscall we are going to use. That’s why doing a diff on two functions from the same .dll works well and reliably. Needless to say, we are going to use the index number we get with the table inside the kernel in order to get the corresponding function in the kernel.

This technique gives us the index number of the syscall of any exported function in any one of the .dlls mentioned above. This is valid both for 32 and 64 bits. And by the way, notice that the operand type (=immediate) that represents the index number is always a 4 bytes integer (dword) in the ‘mov’ instruction, just makes life easier.
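A small sketch of the diff in C, run on plain byte buffers instead of a mapped image. first_diff and syscall_index_by_diff are names invented for this example. It assumes the two stubs come from the same template and that their indices differ in the low byte; if both indices shared the same low byte, the first difference would land on a higher byte of the dword and the value read would be wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the first byte that differs, or len if the buffers match. */
size_t first_diff(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t i = 0;
    assert(a != NULL && b != NULL);
    while (i < len && a[i] == b[i])
        i++;
    return i;
}

/* Reads the little-endian dword of 'stub' that starts at the first byte
   where 'stub' and 'other' differ. Returns 1 on success, 0 if the stubs
   are identical or there is no room for a full dword. */
int syscall_index_by_diff(const uint8_t *stub, const uint8_t *other,
                          size_t len, uint32_t *index)
{
    size_t off = first_diff(stub, other, len);
    if (len < 4 || off > len - 4)
        return 0;
    *index = (uint32_t)stub[off]
           | ((uint32_t)stub[off + 1] << 8)
           | ((uint32_t)stub[off + 2] << 16)
           | ((uint32_t)stub[off + 3] << 24);
    return 1;
}
```

Fed with the XP SP3 bytes listed above, the diff lands at offset 1 and yields 0x25 for NtCreateFile; with the 64 bits stubs it lands at offset 4 and yields 0x52.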

To the next task: in order to find the base address of the service table, or what is known as the system service descriptor table (SSDT for short), we will have to get the base address of the ntoskrnl.exe image first. There might be a different kernel image loaded in the system (with or without PAE, uni-processor or multi-processor), but it doesn’t matter for the following technique, because it’s based on memory and not files… This task is really easy when you are a driver, meaning that if you want some exported symbol from the kernel that the DDK supplies, the PE loader will get it for you. So we get, without any work, the address of any function like NtClose or NtCreateFile, etc. Both are inside ntos, obviously. Starting with that address we round it down to the nearest page boundary and scan downwards, page by page, to find an ‘MZ’ signature, which marks the base address of the whole image in memory. If you’re afraid of false positives using this technique, you’re welcome to go further and check for a ‘PE’ signature, or use other techniques.

This should do the trick:

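PVOID FindNtoskrnlBase(PVOID Addr)
{
    /// Scan down from a given symbol's address, one page at a time.
    /// Start from the beginning of the page that contains Addr.
    Addr = (PVOID)((ULONG_PTR)Addr & ~(ULONG_PTR)(PAGE_SIZE - 1));
    __try {
        /// Stop at the first page that begins with the 'MZ' signature.
        while ((*(PUSHORT)Addr != IMAGE_DOS_SIGNATURE)) {
            Addr = (PVOID) ((ULONG_PTR)Addr - PAGE_SIZE);
        }
        return Addr;
    }
    __except(EXCEPTION_EXECUTE_HANDLER) { }
    /// Reached only if reading a page raised an exception.
    return NULL;
}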

And you can call it with a parameter like FindNtoskrnlBase(ZwClose). This is what I meant that you know the address of ZwClose or any other symbol in the image which will give you some “anchor”.
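If you want the extra ‘PE’ check mentioned above, a minimal sketch could look like this. It works on a plain buffer with a known size, which is an assumption made for the sake of a runnable example (a driver scanning kernel memory has no such size at hand), and looks_like_pe_image is a made-up name. The only format facts it relies on are that the DOS header stores the offset of the PE header at 0x3c and that the PE header starts with ‘PE\0\0’.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian dword regardless of host byte order. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the buffer starts with 'MZ' and the offset stored at
   0x3c (e_lfanew) points at a 'PE\0\0' signature inside the buffer. */
int looks_like_pe_image(const uint8_t *base, size_t size)
{
    assert(base != NULL);
    if (size < 0x40 || base[0] != 'M' || base[1] != 'Z')
        return 0;
    uint32_t e_lfanew = le32(base + 0x3c);
    if (e_lfanew > size || size - e_lfanew < 4)
        return 0;
    return memcmp(base + e_lfanew, "PE\0\0", 4) == 0;
}
```

In the scan loop this would replace the bare ‘MZ’ comparison, so a stray page that merely starts with those two bytes is skipped.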

After we got the base address of ntos, we need to retrieve the address of the service table in kernel. That can be done using the same GetProcAddress we used earlier on the mapped user mode .dll files. But this time we will be looking for the “KeServiceDescriptorTable” exported symbol.

So far you can see that we got anchors (what I call a reliable way to get the address of anything in memory) and we are good to go, this will work in production without the need to worry. If you wanna start a flame war about the illegitimate use of undocumented APIs, etc., I’m clearly not interested. :)
Anyway, in Windows 32 bits, the latter symbol is exported, but it is not exported in 64 bits! This is part of the PatchGuard system, to make life harder for rootkits, 3rd party drivers doing exactly what I’m talking about, etc. I’m not going to cover how to get that address in 64 bits in this post.

The KeServiceDescriptorTable is a table that holds a few pointers to other service tables which contain the real addresses of the service functions the OS supplies to usermode. So a simple dereference to the table and you get the pointer to the first table which is the one you are looking for. Using that pointer, which is really the base address of the pointers table, you use the index we read earlier from the required function and you got, at last, the pointer to that function in kernel, which you can now use.
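The final lookup is little more than an array access. The sketch below uses a mock descriptor whose field names and layout follow what is commonly described for 32 bits Windows; they are assumptions for illustration, not something taken from an official header, and lookup_service is a made-up helper. The bounds check against the number of services is an addition that costs nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of the first descriptor in KeServiceDescriptorTable.
   Field names and order are assumed, for illustration only. */
typedef struct {
    uintptr_t *service_table;  /* array of service routine addresses */
    uint32_t  *counter_table;  /* not used here                      */
    uint32_t   service_count;  /* number of entries in the array     */
    uint8_t   *param_table;    /* not used here                      */
} mock_service_descriptor;

/* Returns the kernel routine address for a syscall index, or 0 if the
   index is out of range. */
uintptr_t lookup_service(const mock_service_descriptor *sdt, uint32_t index)
{
    assert(sdt != NULL);
    if (sdt->service_table == NULL || index >= sdt->service_count)
        return 0;
    return sdt->service_table[index];
}
```

The index passed in is the one read from the usermode stub earlier, and the returned address is what you would cast to the prototype of the function you need.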

The bottom line is that now you can use any API that is given to usermode in kernelmode too, and you’re not limited to a specific Windows version, nor updates, etc., and you can do it in a reliable manner, which is the most important thing. Also we didn’t require any special algorithms nor disassemblers (as much as I like diStorm…). Doing so in shellcodes makes life a bit harder, because we had the assumption that we got some reliable way to find the ntos base address. But every kid around the block knows it’s easy to do it anyway.

Happy coding :)

References I found interesting about this topic:
http://j00ru.vexillium.org/?p=222
http://alter.org.ua/docs/nt_kernel/procaddr/

http://uninformed.org/index.cgi?v=3&a=4&p=5

And how to do it in 64 bits:

http://www.gamedeception.net/threads/20349-X64-Syscall-Index

New Project – ReviveR

Saturday, September 25th, 2010

Hey all,

Haven’t posted in a long time, I’m kinda busy with lots of stuff.
Anyway I just wanted to let you know that I’m starting to work on the sequel of diStorm, you guessed it right… A reversing studio!
Unlike what many people said, the core is going to be written in C++, and the GUI is going to be written per OS. No thanks, Qt. Top goals are performance, scripting, good UI and, most important, good analysis capabilities. Obviously it’s going to be open source and cross platform. For a start, it will support only x86 and AMD64 and the PE file format, maybe ELF too, though that’s not my priority. I’m not sure about a debugger yet, but it will probably be implemented later. The GUI is going to be written using WPF under C#, just to give you an idea.

My main interests are performance and binary code analysis algorithms.

If there are highly skilled programmers who wish to help, please contact me.
For now it seems we are a group of 4 coders, I’m still not going to publish their names, until everything is settled.

Anyway, design is taking place nowadays. This is your time for suggesting new features and ideas.

Big good luck

diStorm3 is Ready

Monday, August 16th, 2010

diStorm3 is ready for the masses! :)
– if you want to maximize the information you get from a single instruction; Structure output rather than text, flow control analysis support and more!

Check it out now at its new google page.

Good luck!

Custom Kernel Debugging is Faster

Tuesday, July 20th, 2010

When you start to write a post you always get a problem with the headline for the post. You need to find something that will, in a few words, sum it up for the reader. I was wondering which one is better, “Boosting WinDbg”, “Faster Kernel Debugging in WinDbg”, “Hacking WinDbg” and so on. But they might not be accurate, and once you read the post you won’t find them appropriate. But instead of talking about meta-post issues, let’s get going.

Two posts ago, I was talking about hunting a specific race condition bug we had in some software I work on. At last, I have free time to write this post and get into some interesting details about Windows Kernel and Debugging.

First I want to say that I got really pissed off that I couldn’t hunt the bug we had in the software like a normal human being, that Jond and I had to do it the lame old school way, which takes more time, lots of time. What really bothered me is that computers are fast and so is debugging, or at least it should be. Why the heck do I have to sit down in front of the computer, trying to reproduce the damned bug, and only then manage to debug it and see what’s going wrong? Unacceptable. You might say, write better code in the first place. I agree, but even then people have bugs, and will have, forever, and I was called to simply help.

Suppose we want to set a breakpoint on memory access this time, but something more complicated, with conditions. The reason we need a condition rather than a normal breakpoint is that the memory we want to monitor gets accessed thousands of times per second, as in my case with the race condition, for instance.

You’re even welcome to make the following test locally on your computer. Fire up Visual Studio and test the following code:

unsigned int counter = 1;
while (counter < 99999999+1) { counter++; }

Set a memory access breakpoint on counter which stops when the hit count reaches 99999999, and time the whole process. Then time it without the bp set and compare the results. What’s the ratio you got? Isn’t that just crazy?

Here’s an example in WinDbg’s syntax, it would be something like this:

ba w4 0x491004 "j (poi(0x491004)==0) ''; 'gc'"

Which reads: break on write access for an integer at address 0x491004 only if its value is 0, otherwise continue execution. It will be tens of thousands of times faster without the bp set, hence the debugging infrastructure, even locally (usermode), is slowing things down seriously. And think that you want to debug something similar on a remote machine. It’s impossible, you are going to wait years in vain for something to happen on that machine. Think of all the COM/Pipe/USB/whatever-protocol messages that have to be transmitted back and forth between the debugged machine and the debugger. Add to that the conditional breakpoint we set: someone has to see whether the condition is true or false and continue execution accordingly. And that holds even if you use great tools like VirtualKD.

Suppose you set a breakpoint on a given address, what really happens once the processor executes the instruction at that address? Obviously a lot, but I am going to talk about it from the Windows Kernel point of view. Let’s start bottom up. Interrupt #3 is raised by the processor which ran that thread, which halts execution of the thread and transfers control to _KiTrap03 in ntoskrnl. _KiTrap03 will build a context for the trapped thread, with all registers and similar info, and call CommonDispatchException with code 0x80000003 (to denote a breakpoint exception). The ‘exception-raising’ part is common, so it is used for other exceptions as well.

CommonDispatchException calls _KiDispatchException, and _KiDispatchException is really the brain behind the whole Windows exception mechanism. I’m not going to cover normal exception handling in Windows, which is very interesting on its own. So far nothing is new here. But we’re getting to this function because it has something to do with debugging: it checks whether _KdDebuggerEnabled is set, and eventually it will call _KiDebugRoutine if that is set as well. Note that _KiDebugRoutine is a pointer to a function that gets set when the machine is debug-enabled. This is where we are going to get into business later.

So as you can see, the kernel has some minimal infrastructure to support kernel debugging, with lots of functionality: many functions in ntoskrnl start with “Kdp”, like KdpReadPhysicalMemory, KdpSetContext and many others. Eventually the controlling machine that uses WinDbg has to speak to the remote machine using some protocol named KdCom; there’s a KDCOM.DLL which is responsible for all of it.

Now, once we set a breakpoint in WinDbg, I don’t know exactly what happens, but I guess it’s something like this: it stores the bp in some internal table locally, then sends it to the debugged machine using this KdCom protocol, and the other machine receives the command and sets the breakpoint locally. Then when the bp occurs, eventually WinDbg gets an event that describes the debug event from the other machine. Then it needs to know what to do with this bp according to the dude who debugs the machine. So much going on for what looks like a simple breakpoint. The process is very similar for single stepping as well, though with a different exception code.

The problem with conditional breakpoints is that the condition is tested locally, on the WinDbg machine, not on the server, so to speak. I agree it’s a fine design for Windows; after all, Windows wasn’t meant to be an uber debugging infrastructure, but an operating system. So for having kernel debugging built in we should say thanks… No complaints on the design, and yet something has to be done.

Custom Debugging to our call!

That’s the reason I decided to describe above how the debugging mechanism works in the kernel, so we know where we can intervene in that process and do something useful. Since we want to do smart debugging, we have to use conditional breakpoints; otherwise, with critical variables that get touched every now and then, we will have to hit F5 (‘go’) all the time, and the application we are debugging won’t get time to run. That’s clear. The next thing we realized is that the condition tests are being done locally on our machine, the one that runs WinDbg. That’s not ok, so here’s the trick:
I wrote a driver that replaces (hooks) the _KiDebugRoutine with my own function, which checks the exception code, then examines the context according to my condition, and only then either sends the event to WinDbg on the other machine or simply “continues execution”. Thus the whole technique happens on the debugged machine without sending a single message outside (regarding the bp we set) unless the condition is true, and that’s why everything is thousands of times or so faster, which is now acceptable and usable. Luckily, we only need to replace a pointer to a function, and using very simple tests we get the ability to filter exceptions on the spot. We do need to get our hands dirty with touching the debug registers and the context of the trapping thread, but that’s a win, after all.

Here’s the debug routine I used to experiment with this issue (using constants though):

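int __stdcall my_debug(IN PVOID TrapFrame,
	IN PVOID Reserved,
	IN PEXCEPTION_RECORD ExceptionRecord,
	IN PCONTEXT Context,
	IN KPROCESSOR_MODE PreviousMode,
	IN UCHAR LastChance)
{
	ULONG _dr6, _dr0;
	// Read the debug registers directly: DR6 is the status register,
	// DR0 holds the address of the first hardware breakpoint.
	__asm {
		mov eax, dr6
		mov _dr6, eax
		mov eax, dr0
		mov _dr0, eax
	};
	// Swallow the event when one of the 4 hardware breakpoints fired
	// (low 4 bits of DR6) on the monitored pointer, but the access did
	// not come from the address we are interested in.
	if ((ExceptionRecord->ExceptionCode == 0x80000003) &&
		(_dr6 & 0xf) &&
		(_dr0 == MY_WANTED_POINTER) &&
		(ExceptionRecord->ExceptionAddress != MY_WANTED_EIP))
	{
		// Report it as handled, nothing is sent to WinDbg.
		return 1;
	}
	// Anything else goes to the original routine, and from there to WinDbg.
	return old_debug_routine(TrapFrame, Reserved, ExceptionRecord, Context, PreviousMode, LastChance);
}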

This routine checks whether a breakpoint interrupt happened and stops the thread only if the pointer I wanted to monitor was accessed from a given address, else it resumes running that thread. This is where you go custom, and write whatever crazy condition you are up to, using up to 4 breakpoints, which is the processor limit for hardware breakpoints. You can also check which thread or process trapped, etc., using the Kernel APIs… It just reminds me of “compiled sprites” :)
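Since the routine above only tests (_dr6 & 0xf) as a whole, here is a tiny sketch of how the same status bits can tell you which of the four breakpoints fired, in case you use more than one. Bits 0 to 3 of DR6 correspond to DR0 to DR3; slot_hit and first_hit_slot are hypothetical helper names, and the DR6 value is passed in as a plain integer so the example runs anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Non-zero if the hardware breakpoint in the given slot (0..3, i.e.
   DR0..DR3) is flagged in the DR6 status value. */
int slot_hit(uint32_t dr6, int slot)
{
    assert(slot >= 0 && slot < 4);
    return (dr6 >> slot) & 1u;
}

/* Lowest flagged slot, or -1 if none of the 4 breakpoints is flagged. */
int first_hit_slot(uint32_t dr6)
{
    for (int slot = 0; slot < 4; slot++) {
        if (slot_hit(dr6, slot))
            return slot;
    }
    return -1;
}
```

A custom routine could switch on the returned slot and apply a different condition per breakpoint.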

I was assuming that there’s only one bp set on the machine which is the one I set through WinDbg, though this time, there was no necessity to set a conditional breakpoint in WinDbg itself, since we filter them using our own routine, and once WinDbg gets the event it will stop and let us act.

For some reason I had a problem with accessing the DRs from the Context structure. I didn’t try too hard, so I just fell back to reading them directly, because I can.

Of course, what I did is not anything close to production quality. It was only a proof of concept, and it worked well. Next time I find myself in a weird bug hunt, I will know that I can draw this weapon.
I’m not sure how many people are interested in such things, but I thought it might help someone out there, I wish one day someone would write an open source WinDbg plugin that injects kernel code through WinDbg to the debugged machine that sets this routine with its custom runtime conditional breakpoints :)

I really wanted to paint some stupid pictures that show what’s going on between the two machines and everything, but my capabilities at doing that are awful, so it’s up to you to imagine that, sorry.

For more related information you can see:
http://uninformed.org/index.cgi?v=8&a=2&p=16
http://www.vsj.co.uk/articles/display.asp?id=265

Cracking for Fun and Non-Profit

Saturday, May 22nd, 2010

One of the fun things to do with applications is to bypass their copy-protection mechanisms. So I want to share my experience with some iPad application, though the application is targeted at Jailbroken devices. It all began a few days ago, when a friend challenged me to crack some application. I had my motives, and I’m not going to talk about them. However, that’s why the title says non-profit. Or maybe when they always say “for profit” they mean the technical-knowledge profit.

So before you start to crack some application, what you should do is see how it works, what happens when you run it, what GUI related stuff you can see, like dialog boxes or messages that popup, upon some event you fire. There are so many techniques to approach application-cracking, but I’m not here to write a tutorial, just to talk a bit about what I did.

So I fired up IDA with the app loaded. The app was quite small, around 35kb. The first thing I did was look at the imported functions. This is how I know what I’m going to fight with at a glance. I saw MD5/RSA imported from the crypto library, and that was like “uh oh”, but no drama. Thing is, my friend purchased the app and gave me the license file. Obviously it’s easier with a license file; otherwise it sometimes proves impossible to crack software without critical info that is encrypted in the license file, and that was the issue in my case too. Of course, there’s no point in a license file that only checks the serial number or something like that, because it’s not enough. So without the license file, there wasn’t much to do.

For some reason IDA didn’t parse the app well, so I had to recall how to use the ugly API of IDC (IDA’s internal scripting language). Yes, I know about IDAPython, but I didn’t want to use it. My script fixed up all the LDR instructions, because the code is PIC-y, so with the strings revealed I could easily follow all those ugly objc_msgSend calls. To Apple’s credit, the messages are text based, so it’s easy to understand what’s going on once you manage to get to that string. Performance-wise, though, this is so lame; I’d rather use integers than strings, c’mon.

Luckily the developer of that app didn’t bother to hide the exported list of functions; he was busy with a pure protection algorithm in Objective-C, good for me.
So eventually the way the app worked (from the license perspective) was to check whether the license file exists and, if so, parse it. Otherwise, ask for permission to connect to the Internet and send the UDID (unique device ID) of the device to the app’s server, get a response, and if the status code was success, write it to a file, then run the license validator again.

The license validator was quite cool: it called dladdr on itself to get the full path of the executable, then calculated the MD5 of the binary. Can you see why? So if you thought you could easily tamper with the file, you were wrong. It took the MD5 hash and XORed it in some pattern with the data from the license file, then decrypted the result with the public key that was in the static segment, though I didn’t care much about that part. Making the license depend on the MD5 of the binary itself is a very clever trick by the developer, though expected. So I tried to learn more about how the protection works.
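To make the self-hash dependency concrete, here is a rough Python sketch. It is not the app’s code: the XOR pattern (the 16-byte digest simply repeated over the blob) is my assumption, since all I can say is that it was “some pattern”, the RSA step is left out entirely, and every name in it is made up for illustration.

```python
import hashlib

def unmask_license(binary_bytes: bytes, license_blob: bytes) -> bytes:
    """XOR the license blob with the MD5 digest of the executable.

    Assumption: the digest is repeated across the blob. The real app
    used some other, unspecified pattern.
    """
    digest = hashlib.md5(binary_bytes).digest()  # 16 bytes
    return bytes(b ^ digest[i % len(digest)] for i, b in enumerate(license_blob))

# Hypothetical stand-ins for the executable and the RSA-encrypted payload.
original_binary = b"\x7fAPP original code"
patched_binary = b"\x7fAPP patched  code"
rsa_payload = b"pretend this is RSA ciphertext"

# The vendor side masks the payload with the hash of the untouched binary
# (XOR is its own inverse, so the same function does both directions).
license_blob = unmask_license(original_binary, rsa_payload)

# Untouched binary: the XOR cancels out and the RSA input is recovered.
recovered_ok = unmask_license(original_binary, license_blob)
# Patched binary: different MD5, so the RSA step would be fed garbage.
recovered_bad = unmask_license(patched_binary, license_blob)
```

The point of the design is that any patch to the file changes the digest, so the data handed to the RSA step is already wrong before any comparison happens.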

Suppose the license was legit: the app would take that buffer and strtok() it into tokens, to check that the UDID was correct. The developer was nice enough to call the lockdownd APIs directly, so in one second I knew where and what was going on around it. In the beginning I wanted to create a proxy dylib for this lockdownd library, but that would require patching the Mach-O header so the imported function would be resolved through my new file, and that is still a change to the file, no good. So here is how it worked with the decrypted string: it kept on tokenizing the string, but this time it checked for some string match, because if someone had tampered with the binary, the decryption would go wrong and the string wouldn’t compare well. And then it did some manipulation on some object, adding methods to it at runtime, with the names taken from the tokenized string. Thus if you don’t have a license file to begin with, you don’t know the names of the new methods that were added. One star for the developer, yippee.

All in all, I have to say that I wasn’t using any debugger or runtime tricks; everything was static reversing, yikes. So once I was convinced that I couldn’t ignore the protection because I lacked the names of the new methods, and that I couldn’t use a debugger to fish out the names easily, I was left with one solution, as I said before: faking the UDID and fixing the MD5.

What I really cared about for a start, was how the app calculates the MD5 of itself:
Since the developer retrieved the name of the binary using dladdr, I couldn’t just change some path to point to the original copy of the binary so that hashing it would give the expected hash. That was a bummer; I had to do something else, but along a similar idea… I decided to patch the file-open function. The library functions are called in ARM mode and it’s very clear. The app itself was in THUMB, so it transitions to ARM using a BX instruction and calls a thunk, which in turn calls the imported function. So the thunk function is in ARM mode, thus 4 bytes per instruction, very wasteful IMHO.

The goal of my patches was to patch those thunks, rather than all the callers of those thunks, because I could end up with a dozen different places to patch. So I was limited, in a way, in the patches I could do. Eventually I extended the thunk of file-open and made the R0 register point to a path I controlled, where I could guarantee an original copy of the binary, so when the app calculated the MD5 of it, it would be the expected hash. Again, I could have done so many other things, like planting a new MD5 value in the binary and copying it in the MD5-Final API call, but that required too many code changes. And oh yes, I’m such a jackass that I didn’t even use an ARM assembler. Pfft, hex-editing FTW :( Oh, I also have to mention that it was safe to patch the thunk of file-open, because all the callers were related to the MD5 hashing…

Ok, so now I got the MD5 right and I could patch the file however I saw fit. Patching the UDID strcmp’s wasn’t enough, since the license wasn’t a “yes/no” check; it had essential data I needed. Otherwise I could have finished off the protection with a one-minute patch (without going through the MD5 hassle). So I didn’t even touch those strcmp’s.

RSA encryption then? Ahhh, not so fast. The developer was decrypting the license XORed with the resulting MD5 hash, then comparing the UDID. So I got the license decrypted well with the MD5 patch, but now the UDID returned from lockdownd was wrong, wrong because it didn’t correspond to the purchased license. So I had to change it as well. The problem with that UDID and the lockdownd API is that it returns a CFSTR, so I had to wrap my string in that annoying structure. That done, I patched the thunk of the lockdown API to simply return my CFSTR of the needed UDID string.

And guess what?? It crashed :) I put my extra code in a __ustring segment, and in the beginning I thought the segment wasn’t executable, because it’s a data segment. But I tried to run something very basic that would work for sure, and it did, so I understood the problem was with my patch and had to double check it. Then I found out that I was piggy-backing on the wrong (existing) CFSTR, because I had changed its type. Probably some code that was using the patched CFSTR expected a different type and therefore crashed, so I piggy-backed on a different CFSTR that wouldn’t harm the application and was of a similar type to what I needed (just a string, 0x7c8). What don’t we do when we don’t have segment slack for our patch code. :)

And then it worked… how surprising, NOT. But it required lots of trial and error, mind you, mostly because of the lack of tools.
End of story.
It’s really hard to say how I would design it better. When I had my chance, I was crazy about obfuscation, to make the reverser desperate, so he can’t see a single API call, no strings, nothing. Plant decoy strings, code, functionality, so he wastes more time. It’s always possible to bypass the protections anyway: if the CPU can do it, I can do it too, right? (As long as I’m on the same ring.)

Ending The Race (Condition)

Friday, April 23rd, 2010

After talking to my co-worker, Jond, he agreed that I could write about him too. Actually we were working on solving that race condition together.
So everything I told you in the last post happened over around 15 hours, almost consecutive, in which Jond and I were debugging the system and trying to track down the bass-turd. It was around 6 in the morning, after we had put a few hooks on the critsec acquire and leave functions in the kernel. But the log looked fine, and this is where I decided to call it a night and went home to sleep a bit. Jond decided to continue; the problem with us is that we take bugs personally. So he improved the logs and wrote a Python script to analyze them. I was too lazy to do that earlier. I had decided to analyze manually, once, with the excuse that if we do it only once, writing a script might take longer. I was wrong. Pity. Then, according to Jond’s story, he actually saw something wrong in the log, at f@cking last. I’m not sure about the small details, but he noticed that the critsec was entered twice or something imaginary like that, from different threads, obviously. And at that point he knew he had nailed the guy down.

There are not many options once you see that the other ‘waiters’ don’t wait when some guy holds it, right? So he looked at the code again, and yet it looked fine! Now he decided it was time to act upon “WTF is going on”, and he did the following experiment: trying to acquire the critsec in a loop (he didn’t really need a loop, but after you’re going insane… he had to write something that totally looks like “I GOT THE CRIT”, or not). And to his surprise other threads continued to work normally as if there was no lock. As if, huh. Soooo, this is going to be a bit embarrassing. He then found out that the call to the critsec acquire function wasn’t correct. It was missing a dereference of a pointer. A single character, you got it right. To make it clearer, he saw something like Enter-Crit(m_ptr) instead of Enter-Crit(*m_ptr), where m_ptr is a pointer to a pointer to an ERESOURCE.
So obviously, the lock wasn’t acquired at all. For some odd reason it aligned well in the logs we analyzed together, until he improved the logs and found a quirk. A question I asked myself after we knew what the bug was: we gave the function some garbage pointer instead of an ERESOURCE, so it obviously failed every time we called it. But how come we didn’t think of testing the return value, even though we knew the lock didn’t work? I guess it has something to do with the fact that nobody ever checks the return value of “acquire” crit-sec, even in MS code… Bad practice? Not sure. What can you do if you want the lock and can’t get it? It means one thing, that you have a bug; otherwise it should wait on the lock… So it’s the kind of stuff nobody checks anyway, but maybe a line of ASSERT could help. Oh well, next time.
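Here is a small Python analogy of that bug, a sketch only and nothing like the real kernel code. Resource and Holder are hypothetical stand-ins for the ERESOURCE and for the extra level of indirection, and enter_crit uses a non-blocking acquire just to keep the demo deterministic (the real thing would wait).

```python
import threading

class Resource:
    """Hypothetical stand-in for an ERESOURCE."""
    def __init__(self):
        self._lock = threading.Lock()

class Holder:
    """Stand-in for the extra level of indirection (the pointer to the pointer)."""
    def __init__(self, resource):
        self.resource = resource

def enter_crit(obj) -> bool:
    """Try to take the lock; return False when handed something that is not a Resource."""
    if not isinstance(obj, Resource):
        return False  # garbage in: nothing gets locked
    return obj._lock.acquire(blocking=False)

m_ptr = Holder(Resource())

# Buggy call sites: one level of indirection short, return value ignored.
first_buggy = enter_crit(m_ptr)
second_buggy = enter_crit(m_ptr)

# Correct call sites: the second caller is refused while the first holds the lock.
first_ok = enter_crit(m_ptr.resource)
second_ok = enter_crit(m_ptr.resource)
```

Both buggy calls return False and nobody is ever excluded, which is exactly the “as if there was no lock” behavior. A single assert on the return value of the first buggy call would have flagged it on the first run.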

That was it, kinda nasty. It always comes down to something stupid at the end, no? :(
Now it leaves me totally with that breakpoint we couldn’t do because the system was too slow with it, and I will write about it next week.
See you then.

Race Condition From Hell, aren’t they all?

Monday, April 19th, 2010

Actually I had trouble coming up with a good title for this post, at least one that I was satisfied with. Therefore I will start with a background story, as always.
The problem started when I had to debug a huge piece of software which ran mostly in kernel mode. And there was this critical section (critsec from now on) synchronization object that wasn’t always held correctly. Eventually, after 20 minutes of trying to replicate the bug, we managed to crash the system with a NULL dereference. The variable was a global that belonged to whoever had acquired the critsec. Then how come we got a crash? Simple, someone was touching the global outside of its critsec scope. That’s why it was also very hard to replicate, or took very long.

The pseudo code was something like this:
Acquire Crit-Sec
g_ptr = “some structure we use”
do safe task with g_ptr

g_ptr = NULL
Release Crit-Sec

So you see, before the critsec was released the global pointer was NULLed again. Obviously this is totally fine, because it’s still in the scope of the acquired crit, so we can access it safely.
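For the record, the same pattern as a runnable Python sketch, with threading.Lock playing the critsec. The worker function and the violations list are made up for the demo; the point is only that when every thread honors the lock, nobody ever sees g_ptr in a state it didn’t set itself.

```python
import threading

crit = threading.Lock()
g_ptr = None        # global, only meaningful while crit is held
violations = []     # filled in only if a thread finds g_ptr changed under it

def worker(n: int) -> None:
    global g_ptr
    for _ in range(1000):
        with crit:                      # Acquire Crit-Sec
            g_ptr = {"owner": n}        # g_ptr = "some structure we use"
            # "do safe task with g_ptr": here we just verify it is still ours
            if g_ptr is None or g_ptr["owner"] != n:
                violations.append(n)
            g_ptr = None                # g_ptr = NULL
        # leaving the with-block is the Release Crit-Sec

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With correct locking the violations list stays empty. A thread that writes g_ptr = None without taking crit is the kind of thing that breaks it.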

Looking at the crash dumps, we saw a very weird thing, but nothing surprising for race condition bugs. Also, if you ask me, I think I prefer deadlock bugs to race conditions, since in a deadlock everything gets stuck and then you can examine which locks are held, and see why some thread (out of the two) is trying to acquire the lock when it surely can’t… Not saying it’s easier, though.
Anyway, back to the crash dump: we saw that the g_ptr variable was accessed in some internal function after the critsec was acquired. So far so good. Then, a few instructions later, in an inner function that referenced the variable again, it suddenly crashed. Tracing back through the disassembly listing of the function to the point where g_ptr was touched first, we knew it worked there, because otherwise it would have crashed there and then, before going on, right? I have to mention that between the first read of the variable and the second one, where it crashed, we didn’t see any function calls.
This really freaked me out, because there was only one conclusion: somebody else was tampering with our g_ptr in a different thread without locking the crit. Had there been any function calls, one of them might have put our thread in a waitable state, which means we could accept APCs or other events, and that could lead to a whole new execution path, hidden from the crash dump, which somehow zeroed the g_ptr variable. Also, it’s important to note that at the time of the crash the owner of the critsec was the crashing thread, so no leads to other problematic threads…

Next thing was to verify that everybody touches g_ptr only when the critsec is acquired. By now we knew for sure that someone was doing something very bad and we needed to track the biatch down. We also knew the value written to the g_ptr variable is zero, which limits the number of occurrences of such an instruction (expression), and that led to two spots. Looking at both spots, everything looked fine. Of course it looked fine, otherwise I would have spotted the bug easily; besides, we got a crash, which means nothing is fine. Also, it’s time to admit that part of the code was Windows itself, which made the problem a few times harder, because I couldn’t do whatever I wanted with it.

I don’t know how you guys would approach such a problem in order to solve it, but I had three ideas. The first: since sometimes printf/OutputDebugString is your best friend, print logs when the critsec is acquired and released, who is waiting for it, and just every piece of information we can gather about it. Mind you that part of it was the Windows kernel itself, so we had to patch those functions too, to see who’s acquiring the critsec and when. Luckily in debug mode PatchGuard is down :) Otherwise, it would be bloody around the kernel. So looking at the log, everything was fine, again, damn. You can stare at the god damned thing for hours, tracking the acquire and release pairs of the critsec, and nothing is wrong. So it means this is not going to be the savior.
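Staring at acquire/release pairs is exactly the kind of job a script does better. Below is a minimal Python sketch of such a check. The log format (a list of thread id and event pairs) is made up for illustration, and it assumes a plain exclusive lock with no recursive or shared acquisition.

```python
from typing import List, Optional, Tuple

# Hypothetical log format: (thread id, event), in the order the events were recorded.
Event = Tuple[int, str]

def find_violations(log: List[Event]) -> List[str]:
    """Replay acquire/release events and report anything that breaks mutual exclusion."""
    owner: Optional[int] = None
    problems = []
    for lineno, (tid, event) in enumerate(log, start=1):
        if event == "acquire":
            if owner is not None:
                problems.append(
                    f"line {lineno}: thread {tid} acquired while thread {owner} holds it")
            owner = tid
        elif event == "release":
            if owner != tid:
                # Report it, but keep the recorded owner as it was.
                problems.append(
                    f"line {lineno}: thread {tid} released but owner is {owner}")
            else:
                owner = None
    return problems

clean_log = [(1, "acquire"), (1, "release"), (2, "acquire"), (2, "release")]
broken_log = [(1, "acquire"), (2, "acquire"), (1, "release"), (2, "release")]

clean_report = find_violations(clean_log)
broken_report = find_violations(broken_log)
```

On the clean log the report is empty. On the broken one it flags the second acquire (line 2) and the release by a thread that is no longer the recorded owner (line 3).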

The second idea was to comment out some code portions with #if 0 surrounding the potentially problematic code, and start eliminating the possibilities of which function is the cause of this bug. This is not such a great idea, since a race condition can happen in a few places, and finding one of them is usually not enough. Though it can teach you something about the original bug’s characteristics, and then you can look at the rest of the code to fix that same thing. It’s a really old-school technique, but sometimes it helps, as bad as it sounds. So guess what we did? We patched out the g_ptr = NULL of the kernel and then everything went smooth, no crashes and nothing. But the problem was still around; now we knew for sure it was our bug and not MS’s, duh. And there were only a few places in our code which set this g_ptr. Looking at all of them, again, everything seemed fine. This is where I started going crazy, seriously.

While you were reading the above ideas, didn’t you come up with the most banal idea, to put a dumb breakpoint on memory access, on g_ptr, with a condition of “who writes zero”? Of course you did, that’s what you should have done in the first place. I hope you know that. Why couldn’t we do that?
Because the breakpoint fired tens of thousands of times in a single second, rendering the whole system almost frozen. Given that it took us 20 minutes to replicate the bug when we heavily loaded the system, doing that with such a breakpoint set would take days or so, no kidding. Which is out of the question.

This will lead me to the next post. Stay tuned.

diStorm3 – News

Tuesday, December 29th, 2009

Yo yo yo… forgot to say happy xmas last time, never too late, ah? :)

This time I wanted to update you about diStorm3 once again. Yesterday I had a good coding session and I added some of the new features regarding flow control. The decode function gets a new parameter called ‘features’, which is a bit-field flag that lets you ask the disassembler to do some new stuff such as:

  1. Stop on INT instructions [INT, INT1, INT3, INTO]
  2. Stop on CALL instructions [CALL, CALL FAR]
  3. Stop on RET instructions [RET, RETF, IRET]
  4. Stop on JMP instructions [JMP, JMP FAR]
  5. Stop on any conditional branch instructions [JXXX, JXCX, LOOPXX]
  6. Stop on any flow control (all of the above)

I wasn’t sure about SYSCALL and the like, or UD2, so for now I left them out. So what we have now is the ability to instruct the disassembler to stop decoding after it encounters one of the above conditions. This makes the higher disassembler layer more efficient, because now you can disassemble code by basic blocks. It also makes building a call graph or branch graph faster.
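To show the idea of the ‘features’ bit field without pretending to be the real interface, here is a toy Python sketch. The flag names, the mnemonic table and both functions are hypothetical, and it works on a list of mnemonics rather than on bytes; the real diStorm3 constants and signatures may well differ.

```python
from enum import IntFlag
from typing import List

class Stop(IntFlag):
    """Hypothetical flag names, not diStorm3's actual constants."""
    NONE = 0
    INT = 1
    CALL = 2
    RET = 4
    JMP = 8
    COND = 16
    FLOW_CONTROL = INT | CALL | RET | JMP | COND  # all of the above

# Toy classification by mnemonic, just enough for the demo.
KIND = {
    "INT": Stop.INT, "INT3": Stop.INT, "INTO": Stop.INT,
    "CALL": Stop.CALL,
    "RET": Stop.RET, "RETF": Stop.RET, "IRET": Stop.RET,
    "JMP": Stop.JMP,
    "JZ": Stop.COND, "JNZ": Stop.COND, "LOOP": Stop.COND,
}

def decode_until(mnemonics: List[str], features: Stop) -> List[str]:
    """Return instructions up to and including the first one matching a stop flag."""
    out = []
    for m in mnemonics:
        out.append(m)
        if KIND.get(m, Stop.NONE) & features:
            break
    return out

def basic_blocks(mnemonics: List[str], features: Stop = Stop.FLOW_CONTROL) -> List[List[str]]:
    """Split a linear stream into chunks that each end at a flow-control instruction."""
    blocks, pos = [], 0
    while pos < len(mnemonics):
        block = decode_until(mnemonics[pos:], features)
        blocks.append(block)
        pos += len(block)
    return blocks

stream = ["PUSH", "MOV", "CALL", "TEST", "JNZ", "XOR", "RET"]
blocks = basic_blocks(stream)
```

Note that this only cuts blocks where they end. A real basic-block builder also has to start a new block at every branch target, which is the job of the layer above the disassembler.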

Note that now you will be able to ask the disassembler to return a single instruction. I know it sounds stupid, but I talked about it already, and I had some reasons to avoid this behavior. Anyway, now you’re free to ask how many instructions you want, as long as the disassembler can read them from the stream you supply.

Another feature added is the ability to filter non-flow-control instructions. Suppose you are interested in building a call-graph only, there’s no reason that you will get all the data-control instructions, because they are probably useless for the case. Mixing this flag with ‘Stop on RET’ and ‘Stop on CALL’, you can do nice stuff.

Another thing is that I separated the memory-indirection description of an operand into two forms. First of all, a memory-indirection operand is when an instruction reads/writes from/to memory. Usually in Assembly text you will see bracket characters surrounding some expression. Something like: MOV [EDX], EAX, which means we write a DWORD to the address EDX points to. If you followed me ’till here, you should know exactly what I’m talking about anyway.

When you get the result of such an instruction from diStorm3, the type of the operand will be SMEM (stands for simple-memory), which hints that there’s only one register in the memory-indirection operand. It doesn’t hint anything about the displacement, though, that’s the offset you usually see in the brackets, like MOV [EDX+0x12345678], EAX. So you will have to test whether the displacement exists in both forms. The other form is MEM (normal memory indirection, or probably should be called ‘complex’), since it supports the full memory-indirection operand, like: MOV [EAX*4 + ESI + 0x12345678], EAX. Then you will have to read another register that supplies the base register, in addition to the index register and scale. Note that this applies to 16-bit addressing mode as well, where you can have a mix such as ‘[BX+SI]’ or only ‘[BX]’. Also note that sometimes in 32/64-bit mode you can have a SIB byte that sets only the base register while the index register is unused; diStorm3 will return that as an SMEM, to simplify matters. This way it’s really easy to read the instruction’s parameters.
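A sketch of how a consumer might treat the two forms, in Python. The MemOperand layout and its field names are my own invention for illustration and are not diStorm3’s actual structure; only the SMEM/MEM split and the two example operands come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class MemOperand:
    """Hypothetical operand layout, not diStorm3's real structure."""
    kind: str                     # "SMEM" or "MEM"
    base: Optional[str] = None    # for SMEM this is the only register
    index: Optional[str] = None   # MEM only
    scale: int = 1                # MEM only
    disp: int = 0                 # may be present in both forms

def effective_address(op: MemOperand, regs: Dict[str, int], bits: int = 32) -> int:
    """Compute base + index*scale + disp, wrapped to the address size."""
    addr = op.disp
    if op.base is not None:
        addr += regs[op.base]
    if op.kind == "MEM" and op.index is not None:
        addr += regs[op.index] * op.scale
    return addr & ((1 << bits) - 1)

regs = {"EAX": 0x10, "EDX": 0x1000, "ESI": 0x2000}

# MOV [EDX+0x12345678], EAX
smem = MemOperand("SMEM", base="EDX", disp=0x12345678)
# MOV [EAX*4 + ESI + 0x12345678], EAX
mem = MemOperand("MEM", base="ESI", index="EAX", scale=4, disp=0x12345678)

smem_addr = effective_address(smem, regs)
mem_addr = effective_address(mem, regs)
```

The SMEM branch never has to look at index or scale, which is the whole simplification.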

Another feature for text formatting is the ability to tell the disassembler to limit the address to 16 or 32 bits. This is good since the offsets are 64 bits today. And if you encounter an instruction that jumps backwards, you will get a huge negative value, which won’t make much sense if you disassemble 16 bits code…
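The effect is easy to see with a few lines of Python. The function name and parameters are made up for the demo; it only shows the masking, not diStorm3’s formatter.

```python
def format_target(ip: int, rel: int, addr_bits: int) -> str:
    """Format a branch target, wrapped to the given address width."""
    mask = (1 << addr_bits) - 1
    return hex((ip + rel) & mask)

# A jump backwards by 0x20 from offset 0x10 goes below zero.
wide = format_target(0x10, -0x20, 64)    # huge 64-bit value
narrow = format_target(0x10, -0x20, 16)  # wraps within 16 bits
```

With a 16-bit mask the target comes out as 0xfff0, which is what you expect to read when disassembling 16-bit code.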

diStorm3 still supplies the bad old interface. And now it supports two additional functions: the decompose function, which returns the structures for each instruction read, and another function that formats a given structure into text, which is pretty straightforward. The text format doesn’t exactly match diStorm64’s output; it’s more simplified, but good enough. Besides, I have never heard any special comments about the formatting of diStorm64, so I guess it doesn’t matter much to you guys. And maybe, just maybe, I will add AT&T syntax later on.

Another field that is returned now, unlike diStorm64, is the instruction-set-class type of the instruction, with very broad categories, like Integer instructions, FPU instructions, SSE instructions, and so on. Still might be handy. And the hint about the flow-control type of the instruction.

Also I changed tons of code, and I really mean it. The skeleton is still the same, but the prefixes engine works totally differently now, trying to imitate a real processor this time by honoring the last prefix found of each prefix type. You can read more about this, here. I made the code way more optimized and eliminated duplicate code, and it’s still readable, if not more so. Also I changed the way instructions are fetched, so the locate-instruction function is much smaller and better.
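The “last prefix of each type wins” rule can be sketched in a few lines of Python. This is not diStorm3’s engine, just an illustration: the group names are mine, and it ignores REX and other 64-bit details.

```python
from typing import Dict, Tuple

# Legacy x86 prefix bytes grouped by type (group names are made up for the sketch).
PREFIX_GROUP = {
    0xF0: "lock_rep", 0xF2: "lock_rep", 0xF3: "lock_rep",
    0x2E: "segment", 0x36: "segment", 0x3E: "segment",
    0x26: "segment", 0x64: "segment", 0x65: "segment",
    0x66: "operand_size",
    0x67: "address_size",
}

def effective_prefixes(code: bytes) -> Tuple[Dict[str, int], int]:
    """Scan leading prefix bytes; a later prefix replaces an earlier one of the same group.

    Returns the surviving prefix per group and the offset where the opcode starts.
    """
    chosen: Dict[str, int] = {}
    pos = 0
    while pos < len(code) and code[pos] in PREFIX_GROUP:
        chosen[PREFIX_GROUP[code[pos]]] = code[pos]
        pos += 1
    return chosen, pos

# CS override, then FS override, then operand-size, then opcode 0x8B with its ModRM byte.
prefixes, opcode_at = effective_prefixes(bytes([0x2E, 0x64, 0x66, 0x8B, 0x00]))
```

Here the CS override (0x2E) is dropped in favor of the later FS override (0x64), and the opcode is found at offset 3.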

I’m pretty satisfied with the new version of diStorm and hopefully I will be able to share it with you guys soon. I still have tons of tests to do; maybe I will add that unit-test module in Python to the project so you can enjoy it too, not sure yet.

Also I got word from Mario Vilas that he is going to help with compiling diStorm for different platforms, and I’m going to integrate his new Python wrappers that use ctypes, so you don’t need the Python extension anymore. Thanks Mario! ;) However, diStorm3 has its own Python module for the new structure output.

If you have more ideas, comments, complaints or you just hate me, this is the time to say so.
Cheers, happy new year soon!
Gil