Archive for the ‘Algorithms’ Category

Armstorm – ARM Disassembler

Wednesday, December 12th, 2012


It seems the distorm-arm project is going to be called Armstorm, after asking the guys on Twitter.
Anyway, as I’m working on this project, I just came up with the following cool snippet and wanted to submit it as a riddle:

int base = 0;
int runLength = 0;
while (mask && (~mask & 1)) base++, mask >>= 1;
while (mask && (mask & 1)) runLength++, mask >>= 1;
if (!mask && runLength > 2) {

Any idea what does it do or why I need it ?

Private Symbols Look Up by Binary Signatures

Friday, July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to try and keep it small as much as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it with juicy details or at all, and decided to show you a real life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions, which are not exported. Don’t start with “this is not documented story”, I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want. Even in legitimate code, it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, in whatever level it might be, you would better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires lot of good reversing, before injecting more code and mess up with the process.
The example of a signature I’m going to talk about is again about getting the function ZwProtectVirtualMemory address in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy and it does, but that’s another story.

I don’t wanna convert this post into a guideline of how to write a good signature, though I have lots of experience with it, even for various archs, though I will just say that you never wanna put binary code as part of your signature, only in extreme cases (I am talking about the actual opcode bytes), simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it will be as generic as possible so it will survive (hopefully) the updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee a signature is to be future compatible with new updates. But always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL, then try to get as many versions of that DLL file as possible and make a script to try it out on all of them, a must. The more DLLs the signature is able to work on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is because they get changed many times, almost in every compilation of the code in a different version, the compiler will allocate new registers for the instructions and thus change the instructions. Or since code might get compiled to many variations which effectively do the same thing, I.E: MOV EAX, 0 and XOR EAX, EAX.
One more note, a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run on the listing it generated. Anyway, caching is always a good idea and if you have many passes to do for many signatures to find different things, you can always cache lots of stuff, and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time, try to avoid that, that should be cached, in one pass of scanning the whole code section of the file, into a dictionary of “target:source” pairs, with false positives (another long story) that can be looked up for a range of addresses…

I almost forgot to mention, I used such a binary signature inside the patch I wrote as a member of ZERT, for closing a vulnerability in Internet Explorer, I needed to find the weak function and patch it in memory, so you can both grab the source code and see for yourself. Though the example does use opcodes (and lots of them) as part of the signature, but there’s special reason for it. Long story made short: The signature won’t match once the function will get officially patched by MS (recall that we published that solution before MS acted), and then this way the patcher will know that it didn’t find the signature and probably the function was already patched well, so we don’t need to patch it on top of the new patch.. confusing shit.

The reason I find signatures amazing is because only reversers can do them well and it takes lots of skills to generate good ones,
happy signaturing :)

And surprisingly I found the following link which is interesting:

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions…though always expect it might be broken.

Binary Hooking Problems

Saturday, May 14th, 2011

Most binary hooking engines write a detour in the entry point of the target function. Other hooking engines patch the IAT table, and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction, usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy, they use a dissasembler to query the size of each instruction they scan, so if the total size of the instructions that were scanned is more than 5 bytes, they’re done.
As an example, usually, functions start with these two instructions:

which take only 3 bytes. And we already said that we need 5 bytes total, in order to replace the first instructions with our JMP instruction. Hence the scan will have to continue to the next instruction or so, till we got at least 5 bytes.

So 5 bytes in x86 could contain from one instruction to 5 instructions (where each takes a single byte, obviously). Or even a single instruction whose size is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, and it only makes matter worse).

It is clear why we need to know the size of instructions, since we overwrite the first 5 bytes, we need to relocate them to another location, the trampoline. There we want to continue the execution of the original function we hooked and therefore we need to continue from the next instruction that we haven’t override. And it is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.

Lame hooking engines don’t use disassemblers, they just have a predefined table of popular prologue instructions. Come a different compiled code, they won’t be able to hook a function. Anyway, we also need a disassembler for another reason, to tell whether we hit a dead end instruction, such as: RET, INT 3, JMP, etc. These are hooking spoilers, because if the first instruction of the target function is a simple RET (thus the function doesn’t do anything, leave aside cache side effects for now), or even a “return 0” function, which usually translates into “xor eax, eax; ret”, still takes only 3 bytes and we can’t plant a detour. So we find ourselves trying to override 5 bytes where the whole function takes several bytes (< 5 bytes), and we cannot override past that instruction since we don’t know what’s there. It might be another function’s entry point, data, NOP slide, or what not. The point is that we are not allowed to do that and eventually cannot hook the function, fail.

Another problem is relative-offset instructions. Suppose any of the first 5 bytes is a conditional branch instruction, we will have to relocate that instruction. Usually conditional branch instruction are only 2 bytes. And if we copy them to the trampoline, we will have to convert them into the longer variation which is 6 bytes and fix the offset. And that would work well. In x64, RIP-relative instructions are also pain in the butt, as well as any other relative-offset instruction which requires a fix. So there’s quiet a long list of those and a good hooking engine has to support them all, especially in x64 where there’s no standard prologue for a function.

I noticed a specific case where WaitForSingleObject is being compiled to:
JMP short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. And so you got 5 bytes total and should be able to hook it (it’s totally legal), but the hook engine I use sucked and didn’t allow that.

So you might say, ok, I got a generic solution for that, let’s follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you got to the dreaded entry points problem. You might get called for a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers to WaitForSingleObject get to you, that’s true. In addition, now all callers to WaitForSingleObjectEx get to you too. That’s a big problem. And you can’t ever realize on whose behalf you were called (with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it was followed immediately by an alignment NOP slide, which is never executed really, but the hooking engine can’t make a use of it, because it doesn’t know what we know. So unconditional branches screw up hooking engines if they show up before 5 bytes from the entry point, and we just saw that following the unconditional branch might screw us up as well with wrong context. So if you do that, it’s ill.

What would you do then in such cases? Cause I got some solutions, although nothing is perfect.

Calling System Service APIs in Kernel

Wednesday, January 26th, 2011

In this post I am not going to shed any new light about this topic, but I didn’t find anything like this organized in one place, so I decided to write it down, hope you will find it useful.

Sometimes when you develop a kernel driver you need to use some internal API that cannot be accessed normally through the DDK. Though you may say “but it’s not an API if it’s not officially exported and supported by MS”. Well that’s kinda true, the point is that some functions like that which are not accessible from the kernel, are really accessible from usermode, hence they are called API. After all, if you can call NtCreateFile from usermode, eventually you’re supposed to be able to do that from kernel, cause it really happens in kernel, right? Obviously, NtCreateFile is an official API in the kernel too.

When I mean using system service APIs, I really mean by doing it platform/version independent, so it will work on all versions of Windows. Except when MS changes the interface (number of parameters for instance, or their type) to the services themselves, but that rarely happens.

I am not going to explain how the architecture of the SSDT and the transitions from user to kernel or how syscalls, etc work. Just how to use it to our advantage. It is clear that MS doesn’t want you to use some of its APIs in the kernel. But sometimes it’s unavoidable, and using undocumented API is fine with me, even in production(!) if you know how to do it well and as robust as possible, but that’s another story. We know that MS doesn’t want you to use some of these APIs because a) they just don’t export it in kernel on purpose, that is. b) starting with 64 bits versions of Windows they made it harder on purpose to use or manipulate the kernel, by removing previously exported symbols from kernel, we will get to that later on.

Specifically I needed ZwProtectVirtualMemory, because I wanted to change the protection of some page in the user address space. And that function isn’t exported by the DDK, bummer. Now remember that it is accessible to usermode (as VirtualProtectMemory through kernel32.dll syscall…), therefore there ought to be a way to get it (the address of the function in kernel) in a reliable manner inside a kernel mode driver in order to use it too. And this is what I’m going to talk about in this post. I’m going to assume that you already run code in the kernel and that you are a legitimate driver because it’s really going to help us with some exported symbols, not talking about shellcodes here, although shellcodes can use this technique by changing it a bit.

We have a few major tasks in order to achieve our goal: Map the usermode equivalent .dll file. We need to get the index number of the service we want to call. Then we need to get the base address of ntos and the address of the (service) table of pointers (the SSDT itself) to the functions in the kernel. And voila…

The first one is easy both in 32 and 64 bits systems. There are mainly 3 files which make the syscalls in usermode, such as: ntdll, kernel32 and user32 (for GDI calls). For each API you want to call in kernel, you have to know its prototype and in which file you will find it (MSDN supplies some of this or just Google it). The idea is to map the file to the address space as an (executable) image. Note that the cool thing about this mapping is that you will get the address of the required file in usermode. Remember that these files are physically shared among all processes after boot time (For instance, addresses might change because of ASLR but stay consistent as long as the machine is up). Following that we will use a similar functionality to GetProcAddress, but one that you have to write yourself in kernel, which is really easy for PE and PE+ (64 bits).

Alright, so we got the image mapped, we can now get some usermode API function’s address using our GetProcAddress, now what? Well, now we have to get the index number of the syscall we want. Before I continue, this is the right place to say that I’ve seen so many approaches to this problem, disassemblers, binary patterns matching, etc. And I decided to come up with something really simple and maybe new. You take two functions that you know for sure that are going to be inside kernel32.dll (for instance), say, CreateFile and CloseHandle. And then simply compare byte after byte from both functions to find the first different byte, that byte contains the index number of the syscall (or the low byte out of the 4 bytes integer really). Probably you have no idea what I’m talking about, let me show you some usermode API’s that directly do syscalls:

XP SP3 ntdll.dll
B8 25 00 00 00                    mov     eax, 25h        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 19 00 00 00                    mov     eax, 19h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP1 32 bits ntdll.dll

B8 3C 00 00 00                    mov     eax, 3Ch        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 30 00 00 00                    mov     eax, 30h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

2008 sp2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

Win7 64bits syswow64 ntdll.dll

B8 52 00 00 00                    mov     eax, 52h        ; NtCreateFile
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 2C 00                          retn    2Ch

B8 0C 00 00 00                    mov     eax, 0Ch        ; NtClose
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 04 00                          retn    4

These are a few snippets to show you how the syscall function templates look like. They are generated automatically by some tool MS wrote and they don’t change a lot as you can see from the various architectures I gathered here. Anyway, if you take a look at the bytes block of each function, you will see that you can easily spot the correct place where you can read the index of the syscall we are going to use. That’s why doing a diff on two functions from the same .dll would work well and reliably. Needless to say that we are going to use the index number we get with the table inside the kernel in order to get the corresponding function in the kernel.

This technique gives us the index number of the syscall of any exported function in any one of the .dlls mentioned above. This is valid both for 32 and 64 bits. And by the way, notice that the operand type (=immediate) that represents the index number is always a 4 bytes integer (dword) in the ‘mov’ instruction, just makes life easier.

To the next task, in order to find the base address of the service table or what is known as the system service descriptor table (in short SSDT), we will have to get the base address of the ntoskrnl.exe image first. There might be different kernel image loaded in the system (with or without PAE, uni-processor or multi-processor), but it doesn’t matter in the following technique I’m going to use, because it’s based on memory and not files… This task is really easy when you are a driver, means that if you want some exported symbol from the kernel that the DDK supplies – the PE loader will get it for you. So it means we get, without any work, the address of any function like NtClose or NtCreateFile, etc. Both are inside ntos, obviously. Starting with that address we will round down the address to the nearest page and scan downwards to find an ‘MZ’ signature, which will mark the base address of the whole image in memory. If you’re afraid from false positives using this technique you’re welcome to go further and check for a ‘PE’ signature, or use other techniques.

This should do the trick:

PVOID FindNtoskrnlBase(PVOID Addr)
    /// Scandown from a given symbol’s address.
    Addr = (PVOID)((ULONG_PTR)Addr &amp; ~0xfff);
    __try {
        while ((*(PUSHORT)Addr != IMAGE_DOS_SIGNATURE)) {
            Addr = (PVOID) ((ULONG_PTR)Addr – PAGE_SIZE);
        return Addr;
    __except(1) { }
    return NULL;

And you can call it with a parameter like FindNtoskrnlBase(ZwClose). This is what I meant that you know the address of ZwClose or any other symbol in the image which will give you some “anchor”.

After we got the base address of ntos, we need to retrieve the address of the service table in kernel. That can be done using the same GetProcAddress we used earlier on the mapped user mode .dll files. But this time we will be looking for the “KeServiceDescriptorTable” exported symbol.

So far you can see that we got anchors (what I call for a reliable way to get an address of anything in memory) and we are good to go, this will work in production without the need to worry. If you wanna start the flame war about the unlegitimate use of undocumented APIs, etc. I’m clearly not interested. :)
Anyway, in Windows 32 bits, the latter symbol is exported, but it is not exported in 64 bits! This is part of the PatchGuard system, to make life harder for rootkits, 3rd party drivers doing exactly what I’m talking about, etc. I’m not going to cover how to get that address in 64 bits in this post.

The KeServiceDescriptorTable is a table that holds a few pointers to other service tables which contain the real addresses of the service functions the OS supplies to usermode. So a simple dereference to the table and you get the pointer to the first table which is the one you are looking for. Using that pointer, which is really the base address of the pointers table, you use the index we read earlier from the required function and you got, at last, the pointer to that function in kernel, which you can now use.

The bottom line is that now you can use any API that is given to usermode also in kernelmode and you’re not limited to a specific Windows version, nor updates, etc. and you can do it in a reliable manner which is the most important thing. Also we didn’t require any special algorithms nor disassemblers (as much as I like diStorm…). Doing so in shellcodes make life a bit harder, because we had the assumption that we got some reliable way to find the ntos base address. But every kid around the block knows it’s easy to do it anyway.

Happy coding :)

References I found interesting about this topic:

And how to do it in 64 bits:

Memory Management

Tuesday, December 7th, 2010

In this post I decided to write about a few things you have to keep on mind while writing or designing a big application regarding memory management, I will also try to give you a new way for looking at memory allocations.

The problem is that one day you wake up and you have either memory leaks or memory fragmentation, you don’t know which or why. To eliminate the possibility of memory fragmentation, you will have to rule out memory leaks first.

Trying to find a solution to such a case you find that your dozens of DLLs work (normally but-)sharing the same CRT, so if you pass an allocated object from one DLL to be used and freed by another DLL, you’re screwed. Also, you can’t separate the CRTs, etc. The root of the problem is because each CRT has its own heap, thus if you statically link a DLL with its own CRT, it will have its own heap. If all DLLs use the same CRT, they share the same heap, or even the application’s global heap. Why is it bad? Because then if you have memory leaks in one DLL, it will dirty the whole global heap. I will talk about it soon. An advantage for separating heaps is that each thread that belongs to some DLL can access its own heap faster, without congestion on the global heap lock which all DLLs/threads need.
The idea is to separate the applications into logical groups of components, and each logical group will use its own heap, so you can pass pointers around. This is really hard to do if you have existing code and never thought about this issue. So you’re saying, “oh yeah, but if I don’t have leaks, and my code is doing alright, I don’t need it anyway”. Well I wouldn’t be so quick about thinking that. Keep on reading.

In C++ for instance, overriding new and delete for your class, is really useful, because even if it gets used in a different component, C++ is responsible to bring the “deletion” to the appropriate heap.

Now I wanna go back a bit to some background on how memory allocators work in general, which is usually in a naive fashion. It really means that when you ask for a block of memory, the memory manager (malloc for you) allocates some chunk and returns a pointer to you. Once another request for memory is being done, the next memory address is getting used to supply the memory request… (suppose the first chunk requested is still busy), and so on. What you can see here, is that the memory allocator works linearly (allocates a block after the other) and chronologically (depends on the time you asked for the allocation).

Let me give you some example so you understand what I’m talking about, in this example there are two threads: Thread A and Thread B.
Thread A allocates chunks of 0.5kb in a loop, and at some point shuts down after it’s done with its task.
Thread B allocates chunks of 0.5kb in a loop, simultaneously, and keeps on working.
And suppose they are using the same CRT, thus the same heap.

What we will see from the memory manager perspective is something like:
<0.5kb busy> <0.5kb busy> <0.5kb busy>…………………….and then something like ….

Each chunk might belong to either Thread A or Thread B. Don’t be picky about it, it’s really up to so many environmental issues (processors, locks, etc).

Suppose Thread A is finished with its memory because it was only up to do some initialization job(!) and decides to free everything it has allocated. The memory picture would be something like this now:
<0.5kb free><0.5kb busy><0.5kb free><0.5kb busy> …………………lots of <0.5kb busy belongs to ThreadB> and then ….

What we got now is a hole-y memory map, some blocks are free, the others are busy. Now if we want to allocate another 0.5kb, no problem, we can use the first free block, it will match perfectly, so everybody’s happy so far. As long as the block we want to allocate in such a memory state is smaller or equal to 0.5kb, we will get them for sure, right?
But what happens if we ask to allocate a block of 10kb, ha ah, now we have to scan the whole memory to find a free chunk whose size is bigger than 10kb, right? Can you see what’s wrong here?
What really is wrong is that you have a total free amount of X kbytes, which surely can contain 10kb, but they are all chunked, so the memory allocator has to find the next block which can satisfy such a request and it so happens that all these free kbytes are useless to us.
In short, this example is a classic memory fragmentation scenario. To check how bad a memory fragmentation is, you have to measure it by the total free memory, and largest free block size. This way, if you have 2 megs that are fragmented in tiny chunks over 200mb, and the biggest free block is 2 megs – it means you can’t even allocate a 3mb block while the total free size is 198mb. Memory is cruel.

This brings me to say that each memory-consumer has a specific behavior. A memory-consumer like Thread A, that was doing some initialization job and then got down, and freeing all its blocks; is just an init phase memory user.
A memory-consumer like Thread B, that still works after Thread A is done and uses its allocated memory, is an application life-time memory user.
Basically each memory-consumer can be classified into some behavior! Once you mix between those behaviors on the same heap, you’re sooner or later screwed.

The rule of thumb I would suggest you should follow is:
You should distinguish between each memory-consumer type, if you know something consumes memory and it really caches some objects, make sure it works as early as possible in your application. And only then allocate your application life-time objects. This causes its memory to be allocated first and not cause fragmentations to other consumers with different behaviors.

If you know you have to cache some objects while other application life-time objects have to be allocated from the same heap at the same time(!), this is where you should see the red light and create another heap for each behavior.
A practical example is when you open some application that has a workspace, now while you open a workspace in the first time, the application is designed to init its UI stuff only at that moment. Therefore while you’re loading and parsing the data from the file into memory, you’re also setting up the UI and do lots of allocations, the UI is going to stay there until you shutdown the application, and the data itself is going to be de-allocated once you close the workspace. This will create fragmentations if you don’t split the heaps. Can you see why? Think of loading and unloading workspaces and what happens to the memory. Don’t forget that each time you open a workspace, the memory chunks that need to get allocated have a different size. And that the UI still uses its memory which is probably fragmented, so it only gets worse to some extent. Another alternative would be to make sure the UI allocations happen first, no matter what. And only then you can allocate and de-allocate lots of memory blocks without worrying for fragmentations.
Even if you got the UI allocations done first, then there still might be a problem. Suppose you load the file’s data to memory and some logic works in a way that some of the data stays cached for the application life-time even after the workspace is going down. This is a big no no, it will leave you with fragmentations again, hence you should split your heap again.

That’s it mostly, I hope I managed to make it clear.
I didn’t want to talk about the implementation of the heaps, whether they are (LFH) low-fragmentation-heap or whatever, if you mix your memory-consumers, nothing will save you.

One note to mention, that if your cached-memory leaks, for whatever reason, with the new strategy and a standalone heap, you won’t hurt anybody with your leaks, and it will even be easier to track leaks per heap rather than dozens of DLLs on the same global heap. And if you already have a big application with a jungle of one heap and various memory-consumers types, there are some techniques to save the day…

Unsigned in Java

Tuesday, October 26th, 2010

Hello everyone,

as you may know we (ReviveR team) chose to use Java as the main language for the framework and maybe the UI too (they are totally separated for now).
We started by converting diStorm to Java using JNI, and converting diSlib64, my robust PE file parser in Python.

While we were doing the conversion we found out that there is no unsigned keyword in Java. Yes, I gotta admit, we are noobs in Java, but we are professional coders using other languages, as a matter of fact. On the other side, everyone knows it’s all about new syntax and the benefits of the language itself that once you’re used to, then you rock with the language. So syntax is easy. And it’s gonna take a short while ’till we learn the benefits of Java. After all we chose Java because it is widely cross platform, and C# is ten times better, I can just claim it, not going to prove it. In this post, however, I’m going to talk about the disadvantages of Java, to name one, unsigned numbers.

When you parse a PE+ file (that’s for AMD64), you need to read some 64 bit integers. Therefore we needed a way to hold an address in 64 bits, usually addresses are unsigned, in contrast to RVAs. The problem was that there is no way to define an unsigned long in Java. This is a really unpleasant welcome to Java, seriously. Wtf did the designer think? And I looked for his stupid comment, it read something like: “ahh, most coders don’t need ‘unsigned’, it only complicates stuff”. What a douche. Now, this is a denial to reality. Looking for other alternatives on the net I found that most people use a bigger size for their integer, suppose they need an unsigned 8 bits integer, then they will use the next bigger size that could hold such an integer as unsigned, which is short… This is so lame, you can even get unsigned 32 bit integers, by using longs, right? But what about using unsigned 64 bit integers? No bigger size, no way.
Others say, you can use BigIntegers, the moment I heard about that I wanted to cry out loud. My guess is that the implementation is a bit vector. So using BigInteger only for representing unsigned longs, that’s useless, if you ask me. Oh and I almost forgot to mention that it accepts the byte[] in big endian only, blee.

I really got pissed off, there were moments I wanted to go back to C++. Although I knew that I’m going to waste time on auto-pointers, data structures, and shit like that, but C++ has unsigned. How cool is that.

I consulted with a friend and he referred me to this link: Unsigned arithmetic in Java.
That seemed a bit helpful, and I liked the general idea. I think there are errors in the code snippets (didn’t check them though). Anyway, the guy suggests to use an “isLessThanUnsigned” comparison, I didn’t want to limit my unsigned long’s interface in such a way.

Therefore I took a look at the interface of BigInteger, saw that they use a compareTo method, and did the same on a new class I wrote, named ULong. The class can accept, byte array, bytebuffer, longs, and also as big endian if necessary.

The compareTo was written from scratch:

public long compareTo(ULong rh)
        // If both numbers have the same sign, it’s up to their real values.
        if (((mValue ^ rh.mValue) >> 63) == 0) return mValue – rh.mValue;
        // Here they have different signs, if mValue has the MSB set, it’s negative _in Java_, thus bigger.
        if (mValue < 0) return 1;
        // Else, the rh.mValue is bigger.
        return -1;

Very basic arithmetic operations, and it’s pretty quick relative to BigInteger’s, mine is 8.5 times faster, on my machine…
The point is that I couldn’t accept all the extra stuff it needed to do in order to represent an unsigned long. It bugged me. I’m not going to stop and take my time again (hopefully) on issues like this, but since I don’t know Java this well, I was curious to see how things work.

Another issue that I didn’t like is that you cannot define global functions (or am I wrong here?), everything has to be in classes, this is annoying sometimes, but I guess the rational was to force a kind of ‘namespaces’, so it’s fine eventually – but let me decide what to do, I know what I’m doing.

Last one, the separation to files based on public classes, it really forces one to divide all his classes into lots of files. Or dump them one after the other as inner classes. And then if you have a third inner enum, for instance, the compiler shouts at you that the outer class has to be static, etc. Consequently, it forces you to move it out, and then you find yourself dividing your code again, and now it’s out of context of the class you wanted to put it in…

Oh dear Java, a love begins :(

P.S – I think that the beauty is that I know to use high level languages when I have to, with all due respect to me and low level.

New Project – ReviveR

Saturday, September 25th, 2010

Hey all,

long time haven’t posted. I’m kinda busy with lots of stuff.
Anyway I just wanted to let you know that I’m starting to work on the sequel of diStorm, you guessed it right… A reversing studio!
Unlike what many people said, the core is going to be written in C++, the GUI is going to be written per OS. No thanks, QT. Top goals are performance, scripting, good UI and most important good analysis capabilities. Obviously it’s going to be open source, cross platform. For a start, it will support only x86 and AMD64 and PE file format, maybe ELF too, though not my priority. I’m not sure about a debugger yet, but it will probably be implemented later. GUI is going to be written using WPF under C#, just to give you an idea.

My main interests are performance and binary code analysis algorithms.

If there are highly skilled programmers who wish to help, please contact me.
For now it seems we are a group of 4 coders, I’m still not going to publish their names, until everything is settled.

Anyway, design is taking place nowadays. This is your time for suggesting new features and ideas.

Big good luck

SmartPointer In C++

Wednesday, March 3rd, 2010

Smart pointers, the way I see it, are there to help you with, eventually, two things: saving memory and auto-destruction. There are plenty kinds of smart pointers and only one type of a dumb pointer 😉 I am going talk about the one that keeps a reference count to the data. To me they are one of the most important and useful classes I have used in my code. Also the AutoResource class I posted about, here, is another type of a smart pointer. I fell in love with smart pointers as soon as I learnt about them long time ago. However I only happened to write the implementation for this concept only once, in some real product code. Most of the times I got to use libraries that supply them, like ATL and stuff. Of course, when we write code in high level languages like Python, C#, Java, etc. We are not even aware to the internal use of them, mostly anyway.

This topic is not new or anything, it is covered widely on the net, but I felt the need to share a small code snippet with my own implementation, which I wrote from scratch. It seems that in order to write this class you don’t need high skills in C++, not at all. Though if you wanna get dirty with some end cases, like the ones described in ‘More Effective c++’, you need to know the language pretty well.

As I said earlier, the smart pointer concept I’m talking about here is the one that keeps the number of references to the real instance of the object and eventually when all references are gone, it will simply delete the only real instance. Another requirement from this class is to behave like a dumb pointer (that’s just the normal pointer the language supplies), my implementation is not as perfect as the dumb pointer, in the essence of operators and the operations you can apply on the pointer. But I think for the most code usages, it will be just enough. It can be always extended, and besides if you really need a crazy ultra generic smart pointer, Boost is waiting for you.

In order to keep a reference count for the instance, we need to allocate that variable, also the instance itself, and to make sure they won’t go anywhere as long as somebody else still points to it. The catch is that if it will be a member of the SmartPointer class itself, it will die when the SmartPointer instance goes out of scope. Therefore it has to be a pointer to another object, which will hold the number of references and the real instance. Then a few smart pointers will be able to point to this core object that holds the real stuff. I think this was the only challenge in understanding how it works. The rest is a few more lines to add functionality to get the pointer, copy constructor, assignment operator and stuff.

Of course, it requires a template class, I didn’t even mention that once, because I think it’s obvious.
Here are the classes:

template <class T> class SmartPtr {
  SmartPtr(T o)
    // Notice we create a DataObject that gets an object of type T.
    m_Obj = new DataObj(o);
  // … A few of additional small methods are absent from this snippet, check link below
  // Now, here we define an internal class, which holds the reference count and the real object’s instance.
  class DataObj {
    DataObj(T o) : m_ReferenceCount(0)
      m_Ptr = new T(o); // First allocate, this time the real deal
      AddRef(); // And only then add the first reference count
    unsigned int AddRef()
    {  return m_ReferenceCount++;  }
    void Release()
      if (–m_ReferenceCount == 0) {
        delete m_Ptr; // Delete the instance
        delete this; // Delete the DataObj instance too
  T* m_Ptr; // Pointer to a single instance of T
  unsigned int m_ReferenceCount; // Number of references to the instance

// This is now part of the SmartPointer class itself, you see? It points the DataObj and not T !
DataObj* m_Obj;

To see the full source code get it SmartPointer.txt.

I didn’t show it in the snippet above but the assignment operator or copy constructor which get a right hand of a smart pointer class, will simply copy the m_Ptr from it and add a reference to it. And by that, the ‘magic’ was done.

To support multi-thread accesses to the class, you simply need to change the AddRef method to use InterlockedAdd. And to change the Release to use InterlockedSub, ahh of course, use InterlockedAdd with -1.
And then you would be fully thread safe. Also note that you will need to use the returned value of the InterlockedAdd in the Release, rather than compare the value directly after calling the function on it. This is a common bug when writing multi-thread code. Note that if the type object you want to create using the SmartPointer doesn’t support multi-threading in the first place, nothing you can do in the smart pointer method themselves is going to solve it, of course.

I didn’t show it in the snippet again but the code supports the comparison to NULL on the SmartPointer variable. Though you won’t be able to check something like:
if (!MySmartPtr) fail… It will shout at you that the operator ! is not supported. It takes exactly 3 lines to add it.

The only problem with this implementation is that you can write back to the data directly after getting the pointer to it. For me this is not a problem cause I never do that. But if you feel it’s not good enough for you, for some reason. Check out other implementations or just read the book I mentioned earlier.

Overall it’s really a small class that gives a lot. Joy

Optimize My Index Yo

Thursday, November 26th, 2009

I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to pascal strings in a way, you got the length and the string doesn’t have to be null terminated, the length though, is stored in bytes. Normally I don’t look at the assembly listing of the application I compile, but when you get to debug it you get to see the code the compiler generated. Since some of my functions use strings for input but as null terminated ones, I had to copy the original string to my own copy and add the null character myself. And now that I think of it, I will rewrite everything to use lengths, I don’t like extra wcslen’s. :)

Here is a simple usage case:

p = (PWCHAR)ExAllocatePool(SomePool, Str->Length + sizeof(WCHAR));
if (p == NULL) return STATUS_NO_MEMORY;
memcpy(p, Str->buffer, Str->Length);
p[Str->Length / sizeof(WCHAR)] = UNICODE_NULL;

I will show you the resulting assembly code, so you can judge yourself:

shr    esi,1
xor    ecx,ecx
mov  word ptr [edi+esi*2],cx

One time the compiler converts the length to WCHAR units, as I asked. Then it realizes it should take that value and use it as an index into the unicode string, thus it has to multiply the index by two, to get to the correct offset. It’s a waste-y.
This is the output of a fully optimized code by VS08, shame.

It’s silly, but this would generate what we really want:

*(PWCHAR)((PWCHAR)p + Str->Length) = UNICODE_NULL;

With this fix, this time without the extra div/mul. I just did a few more tests and it seems the dead-code removal and the simplifier algorithms are not perfect with doing some divisions inside the indexing for pointers.

Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler does this extra shr/mul. The reason is that the compiler can’t know whether the length is odd, thus it has to round it.

diStorm3 – Call for Features

Friday, October 2nd, 2009

[Update diStorm3 News]

I have been working more and more on diStorm3 recently. The core code is already written, and it works so great. I am still not going to talk about the structure itself that diStorm uses to format the instructions. There are two API’s now, the old one, which takes a stream and formats it to text and a newer one, which takes a stream and formats it into structures. This one is much faster. Unlike diStorm64, where the text-formatting was coupled in the decoding code, it’s totally separated. For example, if you want to support AT&T syntax, you can do it in a couple of hours or less, really. I don’t like AT&T syntax, hence I am not going to implement it. I bet still many people don’t know how to read it without confusing…

Hereby, I am asking you guys to come up with ideas for diStorm3. So far I got some new ideas from people, which I am going to implement. Such as:
1) You will be able to tell the decoder to stop on any flow control instruction.
2) Instructions are going to be categorized, such as, flow-control, data-control, string instructions, io, etc. (To be honest, I am still not totally sure about this one).
3) Helper macros to extract data references. Since diStorm3 outputs structures, it’s really easy to know if there’s a data reference and its address. Therefore some macros will aid to do this work.
4) Code reference, – continues to next instruction, continues to a target according to a condition, or jump-always and call-always.

I am looking to hear more suggestions from you guys. Please be sure you are talking about disassembler features, and not other layers which use the disassembler.

Just wanted to let you know that diStorm3 is going to be dual licensed with GPL and commercial. diStorm64 is deprecated and I am not going to touch it anymore, though it’s still licensed as BSD, of course.