Binary Hooking Problems

May 14th, 2011

Most binary hooking engines write a detour in the entry point of the target function. Other hooking engines patch the IAT table, and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction, usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy, they use a dissasembler to query the size of each instruction they scan, so if the total size of the instructions that were scanned is more than 5 bytes, they’re done.
As an example, usually, functions start with these two instructions:
PUSH EBP
MOV EBP, ESP

which take only 3 bytes. And we already said that we need 5 bytes total, in order to replace the first instructions with our JMP instruction. Hence the scan will have to continue to the next instruction or so, till we got at least 5 bytes.

So 5 bytes in x86 could contain from one instruction to 5 instructions (where each takes a single byte, obviously). Or even a single instruction whose size is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, and it only makes matter worse).

It is clear why we need to know the size of instructions, since we overwrite the first 5 bytes, we need to relocate them to another location, the trampoline. There we want to continue the execution of the original function we hooked and therefore we need to continue from the next instruction that we haven’t override. And it is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.

Lame hooking engines don’t use disassemblers, they just have a predefined table of popular prologue instructions. Come a different compiled code, they won’t be able to hook a function. Anyway, we also need a disassembler for another reason, to tell whether we hit a dead end instruction, such as: RET, INT 3, JMP, etc. These are hooking spoilers, because if the first instruction of the target function is a simple RET (thus the function doesn’t do anything, leave aside cache side effects for now), or even a “return 0” function, which usually translates into “xor eax, eax; ret”, still takes only 3 bytes and we can’t plant a detour. So we find ourselves trying to override 5 bytes where the whole function takes several bytes (< 5 bytes), and we cannot override past that instruction since we don’t know what’s there. It might be another function’s entry point, data, NOP slide, or what not. The point is that we are not allowed to do that and eventually cannot hook the function, fail.

Another problem is relative-offset instructions. Suppose any of the first 5 bytes is a conditional branch instruction, we will have to relocate that instruction. Usually conditional branch instruction are only 2 bytes. And if we copy them to the trampoline, we will have to convert them into the longer variation which is 6 bytes and fix the offset. And that would work well. In x64, RIP-relative instructions are also pain in the butt, as well as any other relative-offset instruction which requires a fix. So there’s quiet a long list of those and a good hooking engine has to support them all, especially in x64 where there’s no standard prologue for a function.

I noticed a specific case where WaitForSingleObject is being compiled to:
XOR R8D, R8D
JMP short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. And so you got 5 bytes total and should be able to hook it (it’s totally legal), but the hook engine I use sucked and didn’t allow that.

So you might say, ok, I got a generic solution for that, let’s follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you got to the dreaded entry points problem. You might get called for a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers to WaitForSingleObject get to you, that’s true. In addition, now all callers to WaitForSingleObjectEx get to you too. That’s a big problem. And you can’t ever realize on whose behalf you were called (with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it was followed immediately by an alignment NOP slide, which is never executed really, but the hooking engine can’t make a use of it, because it doesn’t know what we know. So unconditional branches screw up hooking engines if they show up before 5 bytes from the entry point, and we just saw that following the unconditional branch might screw us up as well with wrong context. So if you do that, it’s ill.

What would you do then in such cases? Cause I got some solutions, although nothing is perfect.

Executing .PYC Files in Python

April 29th, 2011

I got this .PYC (compiled Python script) file that I used with command line Python, ala: “python.exe script.pyc”. And that would run the script.pyc file and let it do its job. The problem was that I wanted to run a few Python lines before running script.pyc itself. But apparently, it’s not really possible.

Suppose script.pyc does the familiar name check:
if __name__ == “__main__”:
main()

I guess everyone who wrote a module or two in Python knows this trick. This way you can tell in your script whether it was imported or executed and do whatever you are up to correspondingly. Thus, if it was executed, you will probably want to run a test case for the module, hence calling main() usually, and you could even pass command line arguments to it the normal way.

Since I wanted to do a few things before the script gets to run in Python, it means I had to open Python myself, do my stuff, then execute the script. Unfortunately, it is not possible to execfile() a .PYC file, beat me. Again, I couldn’t just import the file since then that if statement I presented above would fail and won’t call the main() function and nothing would happen, fail. execfile() also doesn’t work, simply because it runs only pure Python source code and not a compiled script.

What I eventually came up with was to import the file myself, but that required fooling a bit :)
You can __import__ file in the code in runtime. What I mean is that when you don’t have the filename to import in static time, you can use that function which receives a filename and dynamically loads the module in runtime.
I tried that on the script.pyc file and obviously it didn’t work as well, because the __name__ was wrong, so the main() didn’t get executed. That made me realize that I need to do the __import__’s internals on my own and only then I will be able to change the __name__ for the module (if that’s possible, but it has to be, since the distinction exists, right?)

Then after a bit of googling around I sumbled upon: imp.
Which shows how to import a file ourselves, then I changed it to:
import imp
fp, pathname, description = imp.find_module(“script”)
imp.load_module(“__main__”, fp, pathname, description)

Notice how I pass __main__ instead of the module’s real name, then the check for main() inside script.py would really work and execute the main() function and I’m all happy once again.

JavaScript Once Again

April 22nd, 2011

I stumbled upon this JavaScript Garden page.
It’s one of the best resources about in-depth JS knowledge IMHO. And if you wanna be a JS guru it’s a must read. It shows all the more why JS sucks though :) So many fugly syntax problems, non-intuitive language. Weird scope management. And what not… Just read it anyway.

Uh Ah! I Happened To Use POP ESP

April 15th, 2011

I was telling the story to a friend of mine about me using POP ESP in some code I wrote, and then he noted how special it is to use such an instruction and probably I’m the first one whom he’s heard of that used it. So I decided to share. I’m sorry to be mystical about my recent posts, it’s just that they are connected to the place I work at, and I can’t talk really elaborate about everything.

Here we go.
I had to call a C++ function from my Assembly code and keep the return value untouched so the caller will get it. Usually return values are passed on EAX, in x86 that is. But that’s not the whole truth, they might be passed on EDX:EAX, if you want to return 64 bits integer, for instance.
My Assembly code was a wrapper to the C++ function, so once the C++ function returned, it got back to me, and so I couldn’t touch both EDX and EAX. The problem was that I had to clean the stack, as my wrapper function acted as STDCALL calling convention. Cleaning the stack is pretty easy, after you popped EBP and the stack pointer points to the return address, you still have to do POPs as the number of arguments your function receives. The calling convention also specifies which registers are to be preserved between calls, and which registers are scratch. Therefore I decided to use ECX for my part, because it’s a scratch register, and I didn’t want to dirty any other register. Note that by the time you need to return to the caller and both clean the arguments on the stack, it’s pretty hard to use push and pop instructions to back up a register so you can freely use it. Again, because you’re in the middle of cleaning the stack, so by the time you POP that register, the ESP moved already. Therefore I got stuck with ECX only, but that’s fine with me. After the C++ function returned to me, I read from some structure the number of arguments to clean. Suppose I had the pointer to that structure in my frame and it was easily accessible as a local variable. Then I cleaned my own stack frame, mov esp, ebp and pop ebp. Then ESP pointed the return address.
This is where it gets tricky:

Assume ECX holds the number of arguments to clean:
lea ecx, [esp + ecx*4 + 4]

That calculation gets the fixed stack address, like the ESP that a RET N instruction would get it to. So it needs to skip the number of arguments multiplied by 4, 4 bytes per argument, and add to that the return value itself.

Going on with:
xchg [esp], ecx

Which puts on the stack the fixed stack address, and getting ECX with the return address. This is where usually people get confused, take your time. I’m waiting 😉

And then the almighty:
pop esp
jmp ecx

We actually popped the fixed stack pointer from the stack itself into the stack pointer. LOL
And since we got ECX loaded with the return address, we just have to branch to it.

What I was actually doing is to simulate the RET N instruction, using only ECX. And ESP should be used anyway. Now the function I was returning to, could access both the optional EDX and EAX as return values from the C++ function.

It seems that the solution begged a SMC (self modifying code) so I could just patch the N, in the RET N instruction, which is a 16 bits immediate value. But SMC is bad for performance, and obviously for multi threading…

Also note that I could just clean the stack, and then branched to something like: jmp [esp – argsCount*4 – 4],
but I don’t like reading off my stack pointer, that’s a bad practice (mostly from the days of interrupts…).

POP ESP FTW

Getting RAX Register in C/C++ or coders that should be killed ;)

April 13th, 2011

Just a weird story, suppose I need to get the value of RAX register in x64 inside some function I wrote.
I was using Visual Studio and as you might know in x64 you are not allowed to use the declspec(naked) or the inline __asm keyword anymore, what a shame. So obviously, I could write some .asm file and link it in. But I prefered to come up with more elegant idea. Anyway, I just wanna show you the solution.

if I set up a function such as:
uint64_t getRax() { }
An empty function which doesn’t do anything. The compiler will shout at me that it cannot compile such a function because there’s no any return statement. What a shame. But suppose we could compile that function, we could then just call it and it would immediately return to the caller without doing anything, then we could read the return value, which wasn’t changed cause the function is empty, thus we could get RAX. Following so far?

A single cast should do the trick. So first, we will have to change the function into:
void getRax() { }
Now the compiler will actually compile it. And now we will add a new pointer to a function that returns an actual uint64_t.

Defining a pointer to a function as follows:
uint64_t (*_getRax)()) = getRax;
But that wouldn’t compile as well, because the compiler is smart enough to know that we are messing up with types here. We will either end up with a warning or an error, too bad. As we know everything can be casted to void* and that’s why we need to cast through void* for success, such as:
uint64_t (*_getRax)() = (void*) getRax;

This way we got rid off the warnings when using W3/4 or treat warning as errors, the way I usually work. Though I could also disable the warning for the region of that fugly code. However, this is one of the reasons C/C++ is probably one of the strongest programming languages, this flexibility…
Surprisingly, I just found that GCC is more permissive in this case than VS.

And then we can simply use it:
uint64_t rax = _getRax();

I wish this could also work:
unsigned char getRax[] = { 0xc3 };
uint64_t (*_getRax)() = (void*) getRax;
_getRax();
But obviously since DEP is enabled nowadays, it will fail with an awful death :(

It really bothers me that I have to CALL to a RET in order to get RAX accessible in C, LOL. Anyone aware of any intrinsic to do similar things with registers? For some reason I remember something like __EAX in VS, but I couldn’t find it.

Calling System Service APIs in Kernel

January 26th, 2011

In this post I am not going to shed any new light about this topic, but I didn’t find anything like this organized in one place, so I decided to write it down, hope you will find it useful.

Sometimes when you develop a kernel driver you need to use some internal API that cannot be accessed normally through the DDK. Though you may say “but it’s not an API if it’s not officially exported and supported by MS”. Well that’s kinda true, the point is that some functions like that which are not accessible from the kernel, are really accessible from usermode, hence they are called API. After all, if you can call NtCreateFile from usermode, eventually you’re supposed to be able to do that from kernel, cause it really happens in kernel, right? Obviously, NtCreateFile is an official API in the kernel too.

When I mean using system service APIs, I really mean by doing it platform/version independent, so it will work on all versions of Windows. Except when MS changes the interface (number of parameters for instance, or their type) to the services themselves, but that rarely happens.

I am not going to explain how the architecture of the SSDT and the transitions from user to kernel or how syscalls, etc work. Just how to use it to our advantage. It is clear that MS doesn’t want you to use some of its APIs in the kernel. But sometimes it’s unavoidable, and using undocumented API is fine with me, even in production(!) if you know how to do it well and as robust as possible, but that’s another story. We know that MS doesn’t want you to use some of these APIs because a) they just don’t export it in kernel on purpose, that is. b) starting with 64 bits versions of Windows they made it harder on purpose to use or manipulate the kernel, by removing previously exported symbols from kernel, we will get to that later on.

Specifically I needed ZwProtectVirtualMemory, because I wanted to change the protection of some page in the user address space. And that function isn’t exported by the DDK, bummer. Now remember that it is accessible to usermode (as VirtualProtectMemory through kernel32.dll syscall…), therefore there ought to be a way to get it (the address of the function in kernel) in a reliable manner inside a kernel mode driver in order to use it too. And this is what I’m going to talk about in this post. I’m going to assume that you already run code in the kernel and that you are a legitimate driver because it’s really going to help us with some exported symbols, not talking about shellcodes here, although shellcodes can use this technique by changing it a bit.

We have a few major tasks in order to achieve our goal: Map the usermode equivalent .dll file. We need to get the index number of the service we want to call. Then we need to get the base address of ntos and the address of the (service) table of pointers (the SSDT itself) to the functions in the kernel. And voila…

The first one is easy both in 32 and 64 bits systems. There are mainly 3 files which make the syscalls in usermode, such as: ntdll, kernel32 and user32 (for GDI calls). For each API you want to call in kernel, you have to know its prototype and in which file you will find it (MSDN supplies some of this or just Google it). The idea is to map the file to the address space as an (executable) image. Note that the cool thing about this mapping is that you will get the address of the required file in usermode. Remember that these files are physically shared among all processes after boot time (For instance, addresses might change because of ASLR but stay consistent as long as the machine is up). Following that we will use a similar functionality to GetProcAddress, but one that you have to write yourself in kernel, which is really easy for PE and PE+ (64 bits).

Alright, so we got the image mapped, we can now get some usermode API function’s address using our GetProcAddress, now what? Well, now we have to get the index number of the syscall we want. Before I continue, this is the right place to say that I’ve seen so many approaches to this problem, disassemblers, binary patterns matching, etc. And I decided to come up with something really simple and maybe new. You take two functions that you know for sure that are going to be inside kernel32.dll (for instance), say, CreateFile and CloseHandle. And then simply compare byte after byte from both functions to find the first different byte, that byte contains the index number of the syscall (or the low byte out of the 4 bytes integer really). Probably you have no idea what I’m talking about, let me show you some usermode API’s that directly do syscalls:

XP SP3 ntdll.dll
B8 25 00 00 00                    mov     eax, 25h        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 19 00 00 00                    mov     eax, 19h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP1 32 bits ntdll.dll

B8 3C 00 00 00                    mov     eax, 3Ch        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 30 00 00 00                    mov     eax, 30h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

2008 sp2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

Win7 64bits syswow64 ntdll.dll

B8 52 00 00 00                    mov     eax, 52h        ; NtCreateFile
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 2C 00                          retn    2Ch

B8 0C 00 00 00                    mov     eax, 0Ch        ; NtClose
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 04 00                          retn    4

These are a few snippets to show you how the syscall function templates look like. They are generated automatically by some tool MS wrote and they don’t change a lot as you can see from the various architectures I gathered here. Anyway, if you take a look at the bytes block of each function, you will see that you can easily spot the correct place where you can read the index of the syscall we are going to use. That’s why doing a diff on two functions from the same .dll would work well and reliably. Needless to say that we are going to use the index number we get with the table inside the kernel in order to get the corresponding function in the kernel.

This technique gives us the index number of the syscall of any exported function in any one of the .dlls mentioned above. This is valid both for 32 and 64 bits. And by the way, notice that the operand type (=immediate) that represents the index number is always a 4 bytes integer (dword) in the ‘mov’ instruction, just makes life easier.

To the next task, in order to find the base address of the service table or what is known as the system service descriptor table (in short SSDT), we will have to get the base address of the ntoskrnl.exe image first. There might be different kernel image loaded in the system (with or without PAE, uni-processor or multi-processor), but it doesn’t matter in the following technique I’m going to use, because it’s based on memory and not files… This task is really easy when you are a driver, means that if you want some exported symbol from the kernel that the DDK supplies – the PE loader will get it for you. So it means we get, without any work, the address of any function like NtClose or NtCreateFile, etc. Both are inside ntos, obviously. Starting with that address we will round down the address to the nearest page and scan downwards to find an ‘MZ’ signature, which will mark the base address of the whole image in memory. If you’re afraid from false positives using this technique you’re welcome to go further and check for a ‘PE’ signature, or use other techniques.

This should do the trick:

PVOID FindNtoskrnlBase(PVOID Addr)
{
    /// Scandown from a given symbol’s address.
    Addr = (PVOID)((ULONG_PTR)Addr &amp; ~0xfff);
    __try {
        while ((*(PUSHORT)Addr != IMAGE_DOS_SIGNATURE)) {
            Addr = (PVOID) ((ULONG_PTR)Addr – PAGE_SIZE);
        }
        return Addr;
    }
    __except(1) { }
    return NULL;
}

And you can call it with a parameter like FindNtoskrnlBase(ZwClose). This is what I meant that you know the address of ZwClose or any other symbol in the image which will give you some “anchor”.

After we got the base address of ntos, we need to retrieve the address of the service table in kernel. That can be done using the same GetProcAddress we used earlier on the mapped user mode .dll files. But this time we will be looking for the “KeServiceDescriptorTable” exported symbol.

So far you can see that we got anchors (what I call for a reliable way to get an address of anything in memory) and we are good to go, this will work in production without the need to worry. If you wanna start the flame war about the unlegitimate use of undocumented APIs, etc. I’m clearly not interested. :)
Anyway, in Windows 32 bits, the latter symbol is exported, but it is not exported in 64 bits! This is part of the PatchGuard system, to make life harder for rootkits, 3rd party drivers doing exactly what I’m talking about, etc. I’m not going to cover how to get that address in 64 bits in this post.

The KeServiceDescriptorTable is a table that holds a few pointers to other service tables which contain the real addresses of the service functions the OS supplies to usermode. So a simple dereference to the table and you get the pointer to the first table which is the one you are looking for. Using that pointer, which is really the base address of the pointers table, you use the index we read earlier from the required function and you got, at last, the pointer to that function in kernel, which you can now use.

The bottom line is that now you can use any API that is given to usermode also in kernelmode and you’re not limited to a specific Windows version, nor updates, etc. and you can do it in a reliable manner which is the most important thing. Also we didn’t require any special algorithms nor disassemblers (as much as I like diStorm…). Doing so in shellcodes make life a bit harder, because we had the assumption that we got some reliable way to find the ntos base address. But every kid around the block knows it’s easy to do it anyway.

Happy coding :)

References I found interesting about this topic:
http://j00ru.vexillium.org/?p=222
http://alter.org.ua/docs/nt_kernel/procaddr/

http://uninformed.org/index.cgi?v=3&a=4&p=5

And how to do it in 64 bits:

http://www.gamedeception.net/threads/20349-X64-Syscall-Index

Memory Management

December 7th, 2010

In this post I decided to write about a few things you have to keep on mind while writing or designing a big application regarding memory management, I will also try to give you a new way for looking at memory allocations.

The problem is that one day you wake up and you have either memory leaks or memory fragmentation, you don’t know which or why. To eliminate the possibility of memory fragmentation, you will have to rule out memory leaks first.

Trying to find a solution to such a case you find that your dozens of DLLs work (normally but-)sharing the same CRT, so if you pass an allocated object from one DLL to be used and freed by another DLL, you’re screwed. Also, you can’t separate the CRTs, etc. The root of the problem is because each CRT has its own heap, thus if you statically link a DLL with its own CRT, it will have its own heap. If all DLLs use the same CRT, they share the same heap, or even the application’s global heap. Why is it bad? Because then if you have memory leaks in one DLL, it will dirty the whole global heap. I will talk about it soon. An advantage for separating heaps is that each thread that belongs to some DLL can access its own heap faster, without congestion on the global heap lock which all DLLs/threads need.
The idea is to separate the applications into logical groups of components, and each logical group will use its own heap, so you can pass pointers around. This is really hard to do if you have existing code and never thought about this issue. So you’re saying, “oh yeah, but if I don’t have leaks, and my code is doing alright, I don’t need it anyway”. Well I wouldn’t be so quick about thinking that. Keep on reading.

In C++ for instance, overriding new and delete for your class, is really useful, because even if it gets used in a different component, C++ is responsible to bring the “deletion” to the appropriate heap.

Now I wanna go back a bit to some background on how memory allocators work in general, which is usually in a naive fashion. It really means that when you ask for a block of memory, the memory manager (malloc for you) allocates some chunk and returns a pointer to you. Once another request for memory is being done, the next memory address is getting used to supply the memory request… (suppose the first chunk requested is still busy), and so on. What you can see here, is that the memory allocator works linearly (allocates a block after the other) and chronologically (depends on the time you asked for the allocation).

Let me give you some example so you understand what I’m talking about, in this example there are two threads: Thread A and Thread B.
Thread A allocates chunks of 0.5kb in a loop, and at some point shuts down after it’s done with its task.
Thread B allocates chunks of 0.5kb in a loop, simultaneously, and keeps on working.
And suppose they are using the same CRT, thus the same heap.

What we will see from the memory manager perspective is something like:
<0.5kb busy> <0.5kb busy> <0.5kb busy>…………………….and then something like ….

Each chunk might belong to either Thread A or Thread B. Don’t be picky about it, it’s really up to so many environmental issues (processors, locks, etc).

Suppose Thread A is finished with its memory because it was only up to do some initialization job(!) and decides to free everything it has allocated. The memory picture would be something like this now:
<0.5kb free><0.5kb busy><0.5kb free><0.5kb busy> …………………lots of <0.5kb busy belongs to ThreadB> and then ….

What we got now is a hole-y memory map, some blocks are free, the others are busy. Now if we want to allocate another 0.5kb, no problem, we can use the first free block, it will match perfectly, so everybody’s happy so far. As long as the block we want to allocate in such a memory state is smaller or equal to 0.5kb, we will get them for sure, right?
But what happens if we ask to allocate a block of 10kb, ha ah, now we have to scan the whole memory to find a free chunk whose size is bigger than 10kb, right? Can you see what’s wrong here?
What really is wrong is that you have a total free amount of X kbytes, which surely can contain 10kb, but they are all chunked, so the memory allocator has to find the next block which can satisfy such a request and it so happens that all these free kbytes are useless to us.
In short, this example is a classic memory fragmentation scenario. To check how bad a memory fragmentation is, you have to measure it by the total free memory, and largest free block size. This way, if you have 2 megs that are fragmented in tiny chunks over 200mb, and the biggest free block is 2 megs – it means you can’t even allocate a 3mb block while the total free size is 198mb. Memory is cruel.

This brings me to say that each memory-consumer has a specific behavior. A memory-consumer like Thread A, that was doing some initialization job and then got down, and freeing all its blocks; is just an init phase memory user.
A memory-consumer like Thread B, that still works after Thread A is done and uses its allocated memory, is an application life-time memory user.
Basically each memory-consumer can be classified into some behavior! Once you mix between those behaviors on the same heap, you’re sooner or later screwed.

The rule of thumb I would suggest you should follow is:
You should distinguish between each memory-consumer type, if you know something consumes memory and it really caches some objects, make sure it works as early as possible in your application. And only then allocate your application life-time objects. This causes its memory to be allocated first and not cause fragmentations to other consumers with different behaviors.

If you know you have to cache some objects while other application life-time objects have to be allocated from the same heap at the same time(!), this is where you should see the red light and create another heap for each behavior.
A practical example is when you open some application that has a workspace, now while you open a workspace in the first time, the application is designed to init its UI stuff only at that moment. Therefore while you’re loading and parsing the data from the file into memory, you’re also setting up the UI and do lots of allocations, the UI is going to stay there until you shutdown the application, and the data itself is going to be de-allocated once you close the workspace. This will create fragmentations if you don’t split the heaps. Can you see why? Think of loading and unloading workspaces and what happens to the memory. Don’t forget that each time you open a workspace, the memory chunks that need to get allocated have a different size. And that the UI still uses its memory which is probably fragmented, so it only gets worse to some extent. Another alternative would be to make sure the UI allocations happen first, no matter what. And only then you can allocate and de-allocate lots of memory blocks without worrying for fragmentations.
Even if you got the UI allocations done first, then there still might be a problem. Suppose you load the file’s data to memory and some logic works in a way that some of the data stays cached for the application life-time even after the workspace is going down. This is a big no no, it will leave you with fragmentations again, hence you should split your heap again.

That’s it mostly, I hope I managed to make it clear.
I didn’t want to talk about the implementation of the heaps, whether they are (LFH) low-fragmentation-heap or whatever, if you mix your memory-consumers, nothing will save you.

One note to mention, that if your cached-memory leaks, for whatever reason, with the new strategy and a standalone heap, you won’t hurt anybody with your leaks, and it will even be easier to track leaks per heap rather than dozens of DLLs on the same global heap. And if you already have a big application with a jungle of one heap and various memory-consumers types, there are some techniques to save the day…

Unsigned in Java

October 26th, 2010

Hello everyone,

as you may know we (ReviveR team) chose to use Java as the main language for the framework and maybe the UI too (they are totally separated for now).
We started by converting diStorm to Java using JNI, and converting diSlib64, my robust PE file parser in Python.

While we were doing the conversion we found out that there is no unsigned keyword in Java. Yes, I gotta admit, we are noobs in Java, but we are professional coders using other languages, as a matter of fact. On the other side, everyone knows it’s all about new syntax and the benefits of the language itself that once you’re used to, then you rock with the language. So syntax is easy. And it’s gonna take a short while ’till we learn the benefits of Java. After all we chose Java because it is widely cross platform, and C# is ten times better, I can just claim it, not going to prove it. In this post, however, I’m going to talk about the disadvantages of Java, to name one, unsigned numbers.

When you parse a PE+ file (that’s for AMD64), you need to read some 64 bit integers. Therefore we needed a way to hold an address in 64 bits, usually addresses are unsigned, in contrast to RVAs. The problem was that there is no way to define an unsigned long in Java. This is a really unpleasant welcome to Java, seriously. Wtf did the designer think? And I looked for his stupid comment, it read something like: “ahh, most coders don’t need ‘unsigned’, it only complicates stuff”. What a douche. Now, this is a denial to reality. Looking for other alternatives on the net I found that most people use a bigger size for their integer, suppose they need an unsigned 8 bits integer, then they will use the next bigger size that could hold such an integer as unsigned, which is short… This is so lame, you can even get unsigned 32 bit integers, by using longs, right? But what about using unsigned 64 bit integers? No bigger size, no way.
Others say, you can use BigIntegers, the moment I heard about that I wanted to cry out loud. My guess is that the implementation is a bit vector. So using BigInteger only for representing unsigned longs, that’s useless, if you ask me. Oh and I almost forgot to mention that it accepts the byte[] in big endian only, blee.

I really got pissed off, there were moments I wanted to go back to C++. Although I knew that I’m going to waste time on auto-pointers, data structures, and shit like that, but C++ has unsigned. How cool is that.

I consulted with a friend and he referred me to this link: Unsigned arithmetic in Java.
That seemed a bit helpful, and I liked the general idea. I think there are errors in the code snippets (didn’t check them though). Anyway, the guy suggests to use an “isLessThanUnsigned” comparison, I didn’t want to limit my unsigned long’s interface in such a way.

Therefore I took a look at the interface of BigInteger, saw that they use a compareTo method, and did the same on a new class I wrote, named ULong. The class can accept, byte array, bytebuffer, longs, and also as big endian if necessary.

The compareTo was written from scratch:

public long compareTo(ULong rh)
{
        // If both numbers have the same sign, it’s up to their real values.
        if (((mValue ^ rh.mValue) >> 63) == 0) return mValue – rh.mValue;
        // Here they have different signs, if mValue has the MSB set, it’s negative _in Java_, thus bigger.
        if (mValue < 0) return 1;
        // Else, the rh.mValue is bigger.
        return -1;
}

Very basic arithmetic operations, and it’s pretty quick relative to BigInteger’s, mine is 8.5 times faster, on my machine…
The point is that I couldn’t accept all the extra stuff it needed to do in order to represent an unsigned long. It bugged me. I’m not going to stop and take my time again (hopefully) on issues like this, but since I don’t know Java this well, I was curious to see how things work.

Another issue that I didn’t like is that you cannot define global functions (or am I wrong here?), everything has to be in classes, this is annoying sometimes, but I guess the rational was to force a kind of ‘namespaces’, so it’s fine eventually – but let me decide what to do, I know what I’m doing.

Last one, the separation to files based on public classes, it really forces one to divide all his classes into lots of files. Or dump them one after the other as inner classes. And then if you have a third inner enum, for instance, the compiler shouts at you that the outer class has to be static, etc. Consequently, it forces you to move it out, and then you find yourself dividing your code again, and now it’s out of context of the class you wanted to put it in…

Oh dear Java, a love begins :(

P.S – I think that the beauty is that I know to use high level languages when I have to, with all due respect to me and low level.

Deleting a New

October 24th, 2010

Recently I’ve been working on some software to find memory leaks and fix its fragmentation too. I wanted to start with some example of a leak. And next time I will talk about the fragmentations also.

The leak happens when you allocate objects with new [] and delete (scalar) it. If you’re a c++ savvy, you might say “WTF, of course it leaks”. But I want to be honest for a moment and say that except from not calling the destructors of the instances in the array, I really did not except it to leak. After all, it’s a simple memory pointer we are talking about. But I was wrong and thought it’s worth a post to share this with you.

Suppose we got the following code:

class A {
 public:
  A() { }
  ~A() { }
 private:
  int x, y;
};


A * a = new A[3];

delete a; // BUGBUG

What really happens in the code, how does the constructors and destructors get called? How does the compiler know whether to destroy a single objects, or how many objects to destroy when you’re finished working with the array ?

Well, first of all, I have to warn you that some of it is implementation specific (per compiler). I’m going to focus on Visual Studio though.

The first thing to understand is that the compiler knows which object to construct and destroy. All this information is available in compilation time (in our case since it’s all static). But if the object was dynamic, the compiler would have called the destructor dynamically, but I don’t care about that case now…

So allocating memory for the objects, it will eventually do malloc(3 * sizeof(A)) and return a pointer to assign in the variable ‘a’. Here’s the catch, the compiler can’t know how many destructors to call when it wants to delete the array, right? It has to bookkeep the count somehow. But Where??
Therefore the call to the memory allocation has more to it. The way MSVC does it is as following (some pseudo code):

int* tmpA = (int*)malloc(sizeof(A) * 3 + sizeof(int)); // Notice the extra space for the integer!
*tmpA = 3; // This is where it stores the count of instances! ta da
A* a = (A*)(tmpA + 1); // Skip that integer, really assigns the pointer allocated + 4 bytes in ‘a’.

Now all it has to do is calling the constructors on the array for each entry, easy.
When you work with ‘a’ the implementation is hidden to you, eventually you get a pointer you should only use for accessing the array and nothing else.
At the end you’re supposed to delete the array. The right way to delete this array is to do ‘delete []a;’. The compiler then understands you ask to delete a number of instances rather than a single instance. So it will loop on all the entries and call a destructor for each instance, and at last free the whole memory block. One of the questions I asked in the beginning is how would the compiler know how many objects to destroy? We already know the answer by now. Simple, it stored the count before the pointer you hold.

So deleting the array in a correct manner (and reading the counter) would be as easy as:

int* tmpA = (int*)a – 1; // Notice we take the pointer the compiler handed to you, and get the ‘count’ from it.
for (int i = 0; i < *tmpA; i++) a[i].~a();
free (tmpA); // Notice we call free() with the original pointer that got allocated!

So far so good, right? But the leak happens if you call a scalar delete on the pointer you get from allocating a new array. And that’s the problem even if you have an array of primitive types (like integers, chars, etc) that don’t require to call a destructor you still leak memory. Why’s that?
Since the new array, as we saw in this implementation returns you a pointer, which does not point to the beginning of the allocated block. And then eventually you call delete upon it, will make the memory manager not find the start of the allocated block (cause you feed it with an offset into the allocated block) and then it has a few options. Either ignore your call, and leak that block. Crash your application or give you a notification, aka Debug mode. Or maybe in extreme cases cause a security breach…

In some forum I read that there are many unexpected behaviors in our case, one of them made me laugh so hard, I feel I need to share it with you:
“A* a = new A[3]; delete a; Might get the first object destroyed and released, but keep the rest in memory”.

Well it doesn’t take a genius to understand that the compiler prefers to bulk allocate all objects in the same block…and yet, funny.
The point the guy tries to make is that you cannot know what the compiler implementation is, as weird as it might be, don’t ever rely on it. And I totally agree.

So in our case a leak happens in the following code:
(wrong:)

int*a = new int[100];

delete a;

The point is that when you new[], you should must call a corresponding delete [].
Except from the need to make your code readable and correct, it won’t be broken, and never trust the compiler, just code right in the first place.

And now you can imagine what happens if you alloc a single object and tries to delete[] it. Not healthy, to say the least.

diStorm for Java, JNI

October 4th, 2010

Since we decided to use Java for the reverse engineering studio, ReviveR, I had to wrap diStorm for Java. For now we decided that the core framework is going to be written in Java, and probably the UI too, although we haven’t concluded that yet. Anyway, now we are thinking about the design of the whole system, and I’m so excited about how things start to look. I will save a whole post to tell you about the design once it’s ready.

I wanted to talk a bit about the JNI, that’s the Java Native Interface. Since diStorm is written in C, I had to use JNI to use it inside Java now. It might remind P/Invoke to people, or Python extensions, etc.

The first thing I had to do is to define the same C structures of diStorm’s API, but in Java. And this time they are classes, encapsulated obviously. After I had this classes ready, and stupid Java, I had to put each public class in a separate file… Eventually I had like 10 files for all definitions and then next step was to compile the whole thing and use the javah tool to get the definitions for the native functions. I didn’t like the way it worked, for instance, any time you rename the package name, add/remove a package the name of the exported C function, of the native .DLL file, changes as well, big lose.
Once I decided on the names of the packages and classes finally I could move on to implement the native C functions that correspond to the native definitions in the Java class that I wrote earlier. If you’re familiar a bit with JNI, you probably know well jobject and its friends. And because I use classes rather than a few primitive type arguments, I had to work hard to wrap them, not mentioning arrays of the instructions I want to return to the caller.

The Java definition looks as such:

public static native void Decompose(CodeInfo ci, DecomposedResult dr);
 

The corresponding C function looks as such:

JNIEXPORT void JNICALL Java_distorm3_Distorm3_Decompose
  (JNIEnv *env, jobject thiz, jobject jciObj, jobject jdrObj);

Since the method is static, there’s no use for the thiz (equivalent of class’s this) argument. And then the two objects of input and output.
Now, the way we treat the jobjects is dependent on the classes we set in Java. I separated them in such a way that one class, CodeInfo, is used for the input of the disassembler. And the other class, DecomposedResult, is used for output, this one would contain an array to return the instructions that were disassembled.

Since we are now messing with arrays, we don’t need to use another out-argument to indicate the number of entries we returned in the array, right? Because now we can use something like array.length… As opposed to C function: void f(int a[], int n). So I found myself having to change the classes a bit to take this into account, nothing’s special though. Just need to get advantages of high level languages.

Moving on, we have to access the fields of the classes, this is where I got really irritated by the way the JNI works. I wish it were as easy as cTypes for Python, of course they are not parallel exactly, but they solve the same problem after all. Or a different approach like parsing a tuple in Embedded Python, PyArg_ParseTuple, which eases this process so much.

For each field in the class, you need to know both its type and its Id. The type is something you know at compile time, it’s predefined and simply depends on the way you defined your classes in Java, easy. The ugly part now begins, Ids – You have to know to which field you want to access, either for read or write access. The idea behind those Ids was to make the code more flexible, in the way that if you inherit a class, then the field you want to access probably moved to a new slot in the raw structure that contains it.
Think of something like this:

struct A {
int x, y;
};

struct B {
 int z, color;
};

struct AB {
 A;
 B;
};

Suddenly, accessing to AB::B.z has a different index than accessing to B.z. Can you see that?
So they guys who designed JNI came with the idea of querying the class, by using internal reflection to get this Id (or really an index to the variable in the struct, I take a guess). But this reflection thingy is really slow, obviously you need to do string comparisons on all members of the class, and all classes in the derived class… No good. So you might say, “but wait a sec, the class’s hierarchy is not going to change in the lifetime of the application, so why not reuse its value?”. This is where the JNI documentation talks about caching-ids. Now seriously, why don’t you guys do it for us internally, why I need to implement my own caching. Don’t give me this ‘more-control’ bullshit. I don’t want control, I want to access the field as quickly as possible and get on to other tasks.

Well, since the facts are different, and we have to do things the way we do, now we have to cache the stupid Ids for good. While I read how people did it and why they got mysterious crashes, I solved the problem quickly, but I want to elaborate on it.

In order to cache the Ids of the fields you want to have access to, you do the following:

if (g_ID_CodeOffset == NULL) {
    g_ID_CodeOffset = (*env)->GetFieldID(env, jObj, "mCodeOffset", "J");
    // Assume the field exists, otherwise your interfaces are broken anyway.
}
// Now we can use it…

Great right? Well, not so fast. The problem is that if you have a few functions that each accesses this same class and its members, you will need to have this few lines of code everywhere for each use. No go. Therefore the common solution is to have another native static InitIDs function and invoke it right after loading the native library in your Java code, for instance:

static {
        System.loadLibrary("distorm3");
        InitIDs();
}

Another option would be to use the JNI_OnLoad exported function to initialize all global Ids before the rest of the functions get ever called. I like that option more than the InitIDs, which is less artificial in my opinion.

Once we got the Id ready we can use it, for instance:

codeOffset = (*env)->GetLongField(env, jciObj, g_ID_CodeOffset);

Note that I use the C interface of the JNI API, just so you are aware to it. And jciObj is the argument we got from Java calling us in the Decompose function.

When calling the GetField function we have to pass a jclass, that’s a Java-class object’s pointer kinda. In contrast to the class instance, I hope you know the difference. Now since we cache the Ids for the rest of the application life time, we have to keep a reference to this Java-class, otherwise weird problems and nice crashes should (un)surprise you. This is crucial since we use the same Ids for the same classes along the code. So when we call the GetFieldID we should hold a reference to that class, by calling:

(*env)->NewWeakGlobalRef(env, jCls);

Note that jCls was retrieved using:

jCls = (*env)->FindClass(env, "distorm3/CodeInfo");

Of course, don’t forget to remove the reference to those classes you used in your code, by calling DeleteGlobalRef in JNI_OnUnload to avoid leaks…

The FindClass function is very good once you know how to use it. It took me a while to figure out the syntax and naming convention. For example, the String which seems to be a primitive type in Java, is really not, it’s just a normal class, therefore you will have to use “java/lang/String” if you want to access a string member.
Suppose you got a class “CodeInfo” in the “distorm3” package, then “distorm3/CodeInfo” is the package-name/class-name.
Suppose you got an inner class (inside another class), then “distorm3/Outer$Inner” is the package-name/outer-class-name$inner-class-name.
And probably there are a bit more to it, but that’s a good start.

About returning new objects to the caller. We said already that we don’t use out-arguments in Java.
Think of:

void f(int *n)
{
 *n = 5;
}

That’s exactly what an out-argument is, to return some value rather than using the return keyword…
When you want to return lots of info, it’s not a good idea, you will have to pass lots of arguments as well, pretty ugly.
The idea is to pass a structure/class that will hold this information, and even have some encapsulation to it.
The problem at hand is whether to use a constructor of the class, or just create the object and set each of its values manually.
Also, I wonder which method is faster, letting the JVM do it on its own in a constructor, or doing everything using JNI.
Unfortunately I don’t have an answer to this question. I can only say that I used the latter method of creating the raw object and setting its fields. I thought it would be better.
It looks like this:

jobject jOperand = (*env)->AllocObject(env, g_OperandIds.jCls);
if (jOperand == NULL) // Handle error!
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Type, insts[i].ops[j].type);
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Index, insts[i].ops[j].index);
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Size, insts[i].ops[j].size);

(*env)->SetObjectArrayElement(env, jOperands, j, jOperand);

This is real piece of code taken from the wrapper code. It constructs an Operand class from the Operand structure in C. Notice the way the AllocObject is used, using that jCls we hold a reference to, instead of calling FindClass again… Then setting the fields and setting this object in the array of Operands.

What I didn’t like much in the JNI is that I had to call SetField, GetField and those variations. On one hand, I understand they wanted you to know which type of field you access to. But on the other hand, when I queried the Id of the field, I specified its type, so I pretty much know what type-value I’m setting, so… Well, unless you have bugs in your code, but that will always cause problems.

To another issue, one of the members of the CodeInfo structure in diStorm is a pointer to the binary code that you want to disassemble. It means that we have to get some buffer from Java as well. But apparently, sometimes the JVM decided to make a copy of the buffer/array that is being passed to the native function. In the beginning I used a straight forward byte[] member in the class. This sucks hard. We don’t want to waste time on copying and freeing buffers that are read-only. Performance, if can be better, should be better by default, if you ask me. So reading the documentation there’s an extension to the JNI, to use java.nio.ByteBuffer, which gives you a direct access to the Java buffer without the extra efforts of copying. Note that it requires the caller to the native API to use this class specifically and sometimes you’re limited…

The bottom line is that it takes a short while to understand how to use JNI and then you get going with it. I found it cumbersome a bit… The most annoying part is all the extra preparations you have to do in order to access a class or a field, etc. Unless you don’t care at all about performance but then your code is less readable for sure. We don’t have any information about performance of allocating new objects and array usage. We can’t base our ways of coding on anything. I wish it could be more user friendly or parts of it eliminated somehow.