Armstorm – ARM Disassembler

December 12th, 2012

Heya

After asking the guys on Twitter, it seems the distorm-arm project is going to be called Armstorm.
Anyway, as I’m working on this project, I just came up with the following cool snippet and wanted to submit it as a riddle:

int base = 0;
int runLength = 0;
while (mask && (~mask & 1)) base++, mask >>= 1;
while (mask && (mask & 1)) runLength++, mask >>= 1;
if (!mask && runLength > 2) {
...
}

Any idea what it does, or why I need it?

diStorm-ARM

November 11th, 2012

Hey guys

I’ve been pretty busy in the last few months, working, and working.

I am thinking about coming up with a diStorm-ARM version, it’s pretty self-explanatory. ;)
For the ones who are not familiar with diStorm, it is a disassembler that returns a binary structure that describes an x86/x64 instruction, instead of textual output.

Would love to hear your thoughts, as always.
Thanks,
Dabah

Appfront

December 17th, 2011

Hey everyone,
as you can see I am pretty occupied with other projects and I don’t have time for the blog, but I promise I will come back with lots of good stuff.

Anyway, in the last few months I’ve been working hard on Appfront, a tweak for jailbroken iPhones that lets you use Skype, Viber and WhatsApp directly from the original phone application by adding fast buttons. A better explanation can be found at Redmond Pie. :)

Let me know what you think, cause I’ve done the technical part, reversing, hooking, coding, etc…
And now I’m working on supporting iOS 5. Many private symbols were changed in the phone application, so I have to do some more research and fix the broken code to work with the new stuff too.

Kernel Exploits

November 21st, 2011

Hey

I’m uploading a presentation by a good friend, Gilad Bakas, who has just spoken at Ruxcon in Australia.

Get it now: Kernel Exploits

Enjoy

IsDebuggerPresent – When To Attach a Debugger

October 12th, 2011

This API is really helpful sometimes. And no, I’m not talking about using it for anti-debugging, com’on.
Suppose you have a complicated application that you would like to debug on special occasions.
Two concerns arise:
1 – you don’t want to always DebugBreak() at a certain point, which will nuke the application every time the code runs through that point (because 99% of the time you don’t have a debugger attached, or it’s release code, obviously).
2 – on the other hand, you don’t want to miss that point in execution if you decide you want to debug it.

An example would be, to set a key in the registry that each time it will be checked and if it is set (no matter the value), the code will DebugBreak().
A similar one would be to set a timeout, that on points of your interest inside the code, it will be read and wait for that amount of time, thus giving you enough time for attaching a debugger to the process.
Or setting an environment variable to indicate the need for a DebugBreak, but that might be a pain as well, cause environment blocks are inherited from parent process, and if you set a system one, it doesn’t mean your process will be affected, etc.
Another idea I can think of is pretty obvious, to create a file in some directory, say, c:\debugme, that the application will check for existence, and if so it will wait for attaching a debugger.

What’s in common for all the approaches above? Eventually they will DebugBreak or get stuck waiting for you to do the work (attaching a debugger).

But here I’m suggesting a different flow, check that a debugger is currently present, using IsDebuggerPresent (or thousands of other tricks, why bother?) and only then fire the DebugBreak. This way you can extend it to wait in certain points for a debugger-attach.

The algorithm would be:

read timeout from registry (or check for the existence of a file, or whatever is most convenient for you)
if exists, while (timeout not passed)
    if IsDebuggerPresent, DebugBreak()
    sleep(100) – or just wait a bit so as not to hog the CPU

That’s it. The application always runs normally unless some value is set to hint that you would like to attach a debugger at certain points, and if you don’t attach one, it will time out and continue normally. Of course, it’s possible to add some log messages so you will know it’s time to attach a debugger, in case you haven’t attached it earlier…
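
Here is a minimal sketch of that loop in C. It assumes the timeout was already read from wherever you keep it (registry, file, whatever) and is passed in as a parameter. The helper names (debugger_attached, break_into_debugger, sleep_ms, wait_for_debugger) are made up for illustration. On Windows they wrap IsDebuggerPresent, DebugBreak and Sleep; elsewhere they are do-nothing stand-ins so the sketch still compiles.

```c
#include <stdbool.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
static bool debugger_attached(void) { return IsDebuggerPresent() != 0; }
static void break_into_debugger(void) { DebugBreak(); }
static void sleep_ms(uint32_t ms) { Sleep(ms); }
#else
/* Stand-ins for non-Windows builds (an assumption of this sketch):
 * they never report a debugger and never actually wait. */
static bool debugger_attached(void) { return false; }
static void break_into_debugger(void) { }
static void sleep_ms(uint32_t ms) { (void)ms; }
#endif

/* Polls for a debugger for up to timeout_ms milliseconds.
 * Breaks into the debugger and returns true as soon as one is attached,
 * returns false if the window passes with no debugger.
 * A timeout of 0 means "no hint was set": nothing is polled at all. */
static bool wait_for_debugger(uint32_t timeout_ms)
{
    const uint32_t step_ms = 100;   /* short naps, so we don't hog the CPU */
    uint64_t waited_ms = 0;         /* 64-bit so a huge timeout can't wrap */

    while (waited_ms < timeout_ms) {
        if (debugger_attached()) {
            break_into_debugger();
            return true;
        }
        sleep_ms(step_ms);
        waited_ms += step_ms;
    }
    return false;
}
```

You would call wait_for_debugger(timeout) at each point of interest; with no timeout configured the call returns immediately and the application runs as usual.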

It’s always funny to see people call MessageBox, and then they attach a debugger, they then want to set a breakpoint at some function or instruction or even straight away at the caller itself, but can’t find that place easily without symbols or expertise. Instead, put a breakpoint at the end of the MessageBox function and step out of it.

Thanks to Yuval Kokhavi for this great idea. If you have a better idea or implementation please share it with us ;)

isX64 Gem

July 13th, 2011

I needed a multi-arch shellcode for both x86 and x64 in the same code. Suppose you want to attack a platform, which can either be x86 or x64 where you don’t know in advance which it is. The problem is which version you really need to use at runtime then, right?

This is a tiny trick I’ve been using for a long while now which tells whether you run on x64 or not:

XOR EAX, EAX
INC EAX ; = DB 0x40
NOP
JZ x64_code
x86_code:
bits 32
.
.
.
RET
x64_code:
bits 64
.
.
 

The idea is very simple. x64 and x86 share most opcode values, but there is a small dissimilarity in the range 0x40-0x4F: in x86 it is used for the one-byte INC and DEC opcodes. Since there are 8 GPRs (General Purpose Registers) and 2 opcodes, they span the whole range of 0x40-0x4F.
Now when AMD64’s ISA (Instruction Set Architecture) was designed, they added another set of 8 GPRs, making it a whopping total of 16 GPRs. In a world where x86 ruled, you only needed 3 bits in the ModRM byte (a byte in the instruction that tells the processor how to read its operands) to access a specific register from 0 to 7. With the new ISA, an extra bit was required in order to be able to address all 16 registers. Therefore, a new prefix (called the REX prefix) was added to supply that extra bit (and there’s more to it, not relevant for now). The new prefix took over the range of 0x40-0x4F, thus eliminating the old one-byte INC/DEC (no worries however, compilers now use the existing 2-byte variation of these instructions).

Back to our assembly code: it depends on the fact that in x86, INC EAX really increments EAX by one, so EAX becomes 1, the zero flag is cleared, and the JZ is not taken. When the code runs on x64, the same byte becomes a prefix to the NOP instruction, which doesn’t do anything anyway. Hence EAX stays zero, the zero flag set by the XOR is still set, and the JZ jumps to the x64 code. Just a final note for the inexperienced: in x64, writing to a 32-bit register zero-extends the result into the full 64-bit register, so RAX is also 0.
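
To make the double meaning of that byte range concrete, here is a small C sketch that reads a byte in 0x40-0x4F both ways: as a one-byte INC/DEC in 32-bit mode, and as a REX prefix (bit layout 0100WRXB) in 64-bit mode. The type and function names are made up for this illustration, they are not taken from diStorm.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four flag bits of a REX prefix byte, laid out as 0100WRXB. */
typedef struct {
    bool w; /* 64-bit operand size */
    bool r; /* extra bit for the ModRM reg field */
    bool x; /* extra bit for the SIB index field */
    bool b; /* extra bit for ModRM r/m, SIB base, or the opcode's register */
} RexBits;

/* True for every byte in 0x40-0x4F, the range that changed meaning. */
static bool in_inc_dec_or_rex_range(uint8_t op)
{
    return (op & 0xF0) == 0x40;
}

/* 32-bit reading: 0x40-0x47 is INC r32, 0x48-0x4F is DEC r32. */
static bool x86_is_dec(uint8_t op) { return (op & 0x08) != 0; }

/* 32-bit reading: the low 3 bits select the register (0 = EAX ... 7 = EDI). */
static unsigned x86_reg_index(uint8_t op) { return op & 0x07u; }

/* 64-bit reading: the low 4 bits are the W, R, X and B flags. */
static RexBits x64_rex_bits(uint8_t op)
{
    RexBits bits = {
        (op & 0x08) != 0,
        (op & 0x04) != 0,
        (op & 0x02) != 0,
        (op & 0x01) != 0
    };
    return bits;
}
```

For the 0x40 byte used in the trick, the 32-bit reading is INC on register 0 (EAX), while the 64-bit reading is a REX prefix with none of its bits set, which is why it leaves the following NOP alone.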

Finding Kernel32 Base Address Shellcode

July 7th, 2011

Yet another one…
This time, smaller, more correct, and still null-free.
I looked a bit at some shellcodes at exploit-db and googled too, to see whether anyone had a smaller way, to no avail.

I based my code on:
http://skypher.com/index.php/2009/07/22/shellcode-finding-kernel32-in-windows-7/
whose author, AFAIK, based his post on:
http://blog.harmonysecurity.com/2009_06_01_archive.html

And this is my version:

00000000 (02) 6a30                     PUSH 0x30
00000002 (01) 5e                       POP ESI
; Use DB 0x64; LODSD
00000003 (02) 64ad                     LODS EAX, [FS:ESI]
00000005 (03) 8b700c                   MOV ESI, [EAX+0xc]
00000008 (03) 8b761c                   MOV ESI, [ESI+0x1c]
0000000b (03) 8b5608                   MOV EDX, [ESI+0x8]
0000000e (04) 807e1c18                 CMP BYTE [ESI+0x1c], 0x18
00000012 (02) 8b36                     MOV ESI, [ESI]
00000014 (02) 75f5                     JNZ 0xb
 

The tricky part was how to read from FS:0x30, and the way I use is the smallest one, at least from what I checked.
Another issue that was fixed is the check for kernel32.dll. The usual variation of this shellcode checks for a null byte, but that turned out to be bogus on W2k machines, so it was changed to check for a null word, making the shellcode a byte or two longer.

This way, it’s only 22 bytes, and it doesn’t assume that kernel32.dll is the second/third entry in the list; it actually loops until it finds the correct module name length (len of ‘kernel32.dll’ * 2 bytes). That matters because kernelbase.dll can come first, which renders lots of implementations of this technique unusable.
And obviously the resulting base address of kernel32.dll is in EDX.
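
If you want to double check the numbers, here is a small C sketch that holds the bytes from the listing above and verifies the claims: 22 bytes, no null bytes, and a compare constant equal to the byte length of ‘kernel32.dll’ as a 2-bytes-per-character string. The array and helper names are made up for this check; only the bytes come from the listing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The shellcode bytes, copied from the listing above. */
static const uint8_t FIND_KERNEL32[] = {
    0x6a, 0x30,             /* PUSH 0x30                 */
    0x5e,                   /* POP ESI                   */
    0x64, 0xad,             /* LODS EAX, [FS:ESI]        */
    0x8b, 0x70, 0x0c,       /* MOV ESI, [EAX+0xc]        */
    0x8b, 0x76, 0x1c,       /* MOV ESI, [ESI+0x1c]       */
    0x8b, 0x56, 0x08,       /* MOV EDX, [ESI+0x8]        */
    0x80, 0x7e, 0x1c, 0x18, /* CMP BYTE [ESI+0x1c], 0x18 */
    0x8b, 0x36,             /* MOV ESI, [ESI]            */
    0x75, 0xf5              /* JNZ 0xb                   */
};

/* True if the buffer contains no 0x00 byte. */
static bool is_null_free(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00) {
            return false;
        }
    }
    return true;
}

/* Byte length of an ASCII name once stored with 2 bytes per character. */
static size_t wide_byte_length(const char *ascii_name)
{
    return strlen(ascii_name) * 2;
}
```

wide_byte_length("kernel32.dll") gives 24, which is the 0x18 that the CMP instruction tests for.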

Enjoy

[Update July 9th:]
Here’s a link to an explanation about PEB/LDR lists.
See first comment for a better version which is only 17 bytes.

Private Symbols Look Up by Binary Signatures

July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to try and keep it as short as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it in juicy detail, or at all, so I decided to show you a real-life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions, which are not exported. Don’t start with the “this is not documented” story, I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want. Even in legitimate code, it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, at whatever level it might be, you had better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires a lot of good reversing before injecting more code and messing with the process.
The example of a signature I’m going to talk about is again about getting the function ZwProtectVirtualMemory address in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy and it does, but that’s another story.

I don’t wanna turn this post into a guideline on how to write a good signature, though I have lots of experience with it, even on various archs. I will just say that you never wanna put binary code (I am talking about the actual opcode bytes) as part of your signature, except in extreme cases, simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it is as generic as possible, so it will (hopefully) survive updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee that a signature will stay compatible with future updates, but always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL: try to get as many versions of that DLL file as possible and write a script to try the signature on all of them, it’s a must. The more DLLs the signature works on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is that they change all the time: in almost every compilation of a different version of the code, the compiler will allocate registers differently and thus change the instructions. The same code might also get compiled into different variations which effectively do the same thing, e.g. MOV EAX, 0 and XOR EAX, EAX.
One more note, a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run over the listing it generated. Anyway, caching is always a good idea, and if you have many passes to do for many signatures to find different things, you can always cache lots of stuff and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time, so try to avoid doing that per lookup. Instead, it should be cached in one pass of scanning the whole code section of the file, into a dictionary of “target:source” pairs, with false positives (another long story), that can be looked up for a range of addresses…
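
As a rough illustration of that one-pass XREF cache, here is a C sketch that scans a code buffer for the x86 CALL rel32 opcode (0xE8) and records “target:source” pairs as offsets into the buffer. It is deliberately naive: it looks at raw bytes without disassembling, so any 0xE8 that happens to sit inside another instruction or in data yields a false positive, and it only keeps targets that land inside the buffer. The Xref type and the function name are made up for this sketch, and a flat array stands in for the dictionary.

```c
#include <stddef.h>
#include <stdint.h>

/* One cached cross-reference, both fields are offsets into the buffer. */
typedef struct {
    uint32_t target; /* where the CALL lands */
    uint32_t source; /* where the CALL instruction starts */
} Xref;

/* Scans code[0..size) once and stores up to cap candidate CALL xrefs in out.
 * Returns the number of entries stored. */
static size_t collect_call_xrefs(const uint8_t *code, size_t size,
                                 Xref *out, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i + 5 <= size && count < cap; i++) {
        if (code[i] != 0xE8) {
            continue; /* not a CALL rel32 opcode byte */
        }

        /* Assemble the little-endian 32-bit displacement byte by byte. */
        uint32_t raw = (uint32_t)code[i + 1]
                     | ((uint32_t)code[i + 2] << 8)
                     | ((uint32_t)code[i + 3] << 16)
                     | ((uint32_t)code[i + 4] << 24);
        int64_t target = (int64_t)i + 5 + (int32_t)raw;

        /* Keep only targets that fall inside the scanned section. */
        if (target < 0 || target >= (int64_t)size) {
            continue;
        }

        out[count].target = (uint32_t)target;
        out[count].source = (uint32_t)i;
        count++;
    }
    return count;
}
```

Sorting the result by target (or feeding it into a real hash table) gives you the range look-up; a signature can then ask “who calls into this address range?” without rescanning the file.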

I almost forgot to mention, I used such a binary signature inside the patch I wrote as a member of ZERT for closing a vulnerability in Internet Explorer. I needed to find the weak function and patch it in memory, so you can grab the source code and see for yourself. The example does use opcodes (and lots of them) as part of the signature, but there’s a special reason for it. Long story short: the signature won’t match once the function gets officially patched by MS (recall that we published that solution before MS acted), so the patcher will know that it didn’t find the signature, which probably means the function was already patched properly, and we don’t need to patch on top of the new patch.. confusing shit.

The reason I find signatures amazing is that only reversers can do them well, and it takes a lot of skill to generate good ones.
Happy signaturing :)

And surprisingly I found the following link which is interesting: http://wiki.amxmodx.org/Signature_Scanning

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions…though always expect it might be broken.

diStorm Goes on Diet

June 11th, 2011

I just wanted to share my happiness with you guys. After long, hard work (over a month in my free time, which ain’t much these days), I managed to refactor all the data structures of the instructions DB in diStorm3. As the title says, I saved around 40kb in data! The original distorm3.dll file took around 130kb and currently it takes 90kb. I then went ahead and reconfigured the settings of the project in Visual Studio and instructed the compiler not to include the CRT shits. Then it bitched about “static constructors won’t be called” and the like, well duh. But since diStorm is written in C and I don’t have anything static to initialize (which is based on code) before the program starts, I didn’t mind it at all. And eventually I got the .dll file size down to ~65kb. That’s really 50% of the original file size. This is sick.

I really don’t want to elaborate on the details of what I did, it’s really deep shit into diStorm. Hey, actually I can give you a simple example. Suppose I have around 900 mnemonics. Do not confuse mnemonics with opcodes – some opcodes share the same mnemonic although they are completely different in their behavior. You have so many variations of the instruction ‘ADD’, for instance. Just to clarify: mnemonic = the display name of an opcode; opcode = the binary byte code which identifies the operation to do; instruction = all the bytes which represent the opcode and the operands, the whole.
Anyway, so there are 900 mnemonics, and the longest mnemonic by length takes 17 characters, some AVX mofo. Now since we want a quick look up in the mnemonics table, it was an array of [900][19], which means 900 mnemonics X 19 characters per mnemonic. Why 19? An extra character for the null terminating char, right? And another one for the Pascal string style – means there’s a leading length byte in front of the string data. Now you ask why I need them both: C string and Pascal string together. That’s because in diStorm all strings are concatenated very fast by using Pascal strings. And also because the guy who uses diStorm wants to use printf to display the mnemonic too, which uses C string, he will need a null terminating character at the end of the string, right?
So back to business, remember we have to allocate 19 bytes per mnemonic, even if the mnemonic is as short as ‘OR’, or ‘JZ’, we waste tons of space, right? 900×19=~17kb. And this is where you get the CPU vs. MEMORY issue once again, you get random access into the mnemonic, which is very important but it takes lots of space. Fortunately I came up with a cooler idea. I packed all the strings into a very long string, which looks something like this (copied from the source):

"\x09" "UNDEFINED\0" "\x03" "ADD\0" "\x04" "PUSH\0" "\x03" "POP\0" "\x02" "OR\0" \
"\x03" "ADC\0" "\x03" "SBB\0" "\x03" "AND\0" "\x03" "DAA\0" "\x03" "SUB\0"
and so on…

 

You can see the leading length and the extra null terminating character for each mnemonic, followed by the next mnemonic. And now it seems like we’ve lost random access, cause each string has a varying length and we can never get to the one we want… but lo and behold! Each instruction in the DB contains a field ‘opcodeId’ which used to denote the index in the mnemonics array, and now holds the offset into the new mnemonics uber string. So if you use the macro that mnemonics.h supplies, you will get to the same mnemonic nevertheless. And all in all I saved around 10kb on mnemonic strings alone!

FYI the macro is:

#define GET_MNEMONIC_NAME(m) ((_WMnemonic*)&_MNEMONICS[(m)])->p

As you can see, I index the mnemonics string with the OpcodeId field taken from the decoded instruction and get back a _WMnemonic structure, which is a Pascal string (char length; char bytes[1])…
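
Here is a tiny self-contained C sketch of the same packing idea, using only the first five mnemonics from the string above. The MNEMONICS array, the M_* offset constants and the helper functions are names I made up for this illustration (in the real thing the offsets are generated and stored in the DB), and it uses plain pointer arithmetic instead of the structure cast.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Each entry is: one length byte, the characters, one null terminator.
 * The literals are split so the hex escape can't swallow the next letters. */
static const unsigned char MNEMONICS[] =
    "\x09" "UNDEFINED\0" "\x03" "ADD\0" "\x04" "PUSH\0"
    "\x03" "POP\0" "\x02" "OR\0";

/* Byte offsets of each entry: every entry occupies length + 2 bytes. */
enum {
    M_UNDEFINED = 0,
    M_ADD       = 11,  /* 0  + 9 + 2 */
    M_PUSH      = 16,  /* 11 + 3 + 2 */
    M_POP       = 22,  /* 16 + 4 + 2 */
    M_OR        = 27   /* 22 + 3 + 2 */
};

/* Pascal-style view: the length byte sitting in front of the text. */
static unsigned mnemonic_length(size_t offset)
{
    return MNEMONICS[offset];
}

/* C-style view: a null terminated string, usable with printf("%s"). */
static const char *mnemonic_name(size_t offset)
{
    return (const char *)&MNEMONICS[offset + 1];
}

/* Sanity check that both views of an entry agree with each other. */
static bool mnemonic_is_consistent(size_t offset)
{
    return strlen(mnemonic_name(offset)) == mnemonic_length(offset);
}
```

So mnemonic_name(M_PUSH) is "PUSH" and mnemonic_length(M_PUSH) is 4, with no per-entry padding: these five entries take 31 bytes instead of 5 x 19 = 95.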

The DB was much harder to compact, but one thing I can tell you is that when you serialize trees you can (and should) use integer indices rather than pointers! In x64, each pointer takes 8 bytes, for crying out loud! Now in the new layout, each node in the tree is a 16-bit integer: the index takes only 13 bits, and the remaining 3 bits describe the type, i.e. where/what the index really points to… And it indirectly means that the DB now takes the same size in both x86 and x64 images, since it is not based on pointers.
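
A sketch of what such an index-based node can look like in C, assuming a 16-bit node whose low 13 bits are the index and whose remaining high bits are the type tag. The Node type and the helper names are made up for this illustration and are not the actual diStorm definitions.

```c
#include <stdint.h>

/* A tree link stored as a small integer instead of a pointer. */
typedef uint16_t Node;

enum {
    NODE_INDEX_BITS = 13,
    NODE_INDEX_MASK = (1 << NODE_INDEX_BITS) - 1 /* 0x1FFF */
};

/* Packs a type tag (high bits) and a table index (low 13 bits). */
static Node node_make(unsigned type, unsigned index)
{
    return (Node)((type << NODE_INDEX_BITS) | (index & NODE_INDEX_MASK));
}

/* Index into whichever table the type says this node refers to. */
static unsigned node_index(Node n)
{
    return n & NODE_INDEX_MASK;
}

/* The tag: tells where/what the index really points to. */
static unsigned node_type(Node n)
{
    return (unsigned)n >> NODE_INDEX_BITS;
}
```

A node like this is 2 bytes on every platform, compared to 4 or 8 bytes for a pointer, and since it holds no addresses the whole table can live in read-only data without any relocations.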

Thanks for your time, I surely had pure fun :)

Binary Hooking Problems

May 14th, 2011

Most binary hooking engines write a detour in the entry point of the target function. Other hooking engines patch the IAT table, and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction, usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy, they use a disassembler to query the size of each instruction they scan, and once the total size of the scanned instructions is at least 5 bytes, they’re done.
As an example, usually, functions start with these two instructions:
PUSH EBP
MOV EBP, ESP

which take only 3 bytes. And we already said that we need 5 bytes in total in order to replace the first instructions with our JMP instruction. Hence the scan will have to continue to the next instruction or so, until we have at least 5 bytes.

So 5 bytes in x86 could contain anything from one instruction to 5 instructions (where each takes a single byte, obviously). Or even a single instruction whose size is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, which only makes matters worse).

It is clear why we need to know the size of instructions: since we overwrite the first 5 bytes, we need to relocate them to another location, the trampoline. There we want to continue the execution of the original function we hooked, and therefore we need to continue from the next instruction that we haven’t overwritten. And that is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.

Lame hooking engines don’t use disassemblers, they just have a predefined table of popular prologue instructions. Give them code that was compiled differently and they won’t be able to hook the function. Anyway, we also need a disassembler for another reason, to tell whether we hit a dead-end instruction, such as RET, INT 3, JMP, etc. These are hooking spoilers. The first instruction of the target function might be a simple RET (so the function doesn’t do anything, leave aside cache side effects for now), and even a “return 0” function, which usually translates into “xor eax, eax; ret”, takes only 3 bytes, so we can’t plant a detour. We find ourselves trying to overwrite 5 bytes where the whole function takes fewer than 5 bytes, and we cannot overwrite past that instruction since we don’t know what’s there. It might be another function’s entry point, data, a NOP slide, or whatnot. The point is that we are not allowed to do that, and eventually we cannot hook the function, fail.
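
The scan described so far can be sketched in C like this. To keep it self-contained it doesn’t call a real disassembler: it assumes something else already decoded the instructions and handed us their lengths plus a flag for dead-end instructions. InstInfo and bytes_to_relocate are made-up names for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* What the scan needs to know about each decoded instruction. */
typedef struct {
    unsigned length;   /* instruction size in bytes, from a disassembler */
    bool is_dead_end;  /* RET, INT 3, unconditional JMP and friends */
} InstInfo;

/* Walks the instructions at the function entry and returns how many bytes
 * must be moved to the trampoline so that at least `needed` bytes are free
 * and the cut falls on an instruction boundary.
 * Returns 0 when a dead end (or the end of the input) is reached before
 * enough bytes were collected, meaning the function can't be hooked. */
static unsigned bytes_to_relocate(const InstInfo *insts, size_t count,
                                  unsigned needed)
{
    unsigned total = 0;

    for (size_t i = 0; i < count; i++) {
        total += insts[i].length;
        if (total >= needed) {
            return total; /* enough room, even if this one is a dead end */
        }
        if (insts[i].is_dead_end) {
            return 0;     /* can't look past it: we don't know what's there */
        }
    }
    return 0;
}
```

With lengths 1, 2, 3 (PUSH EBP; MOV EBP, ESP; then some 3-byte instruction) and needed = 5, the answer is 6 bytes. With “xor eax, eax; ret” (2 + 1 bytes, the RET being a dead end) the answer is 0. Note that a dead end which itself completes the 5 bytes is still accepted.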

Another problem is relative-offset instructions. Suppose one of the instructions in the first 5 bytes is a conditional branch; we will have to relocate that instruction. Usually conditional branch instructions are only 2 bytes. If we copy them to the trampoline, we will have to convert them into the longer variation, which is 6 bytes, and fix the offset. And that would work well. In x64, RIP-relative instructions are also a pain in the butt, as well as any other relative-offset instruction which requires a fix. So there’s quite a long list of those, and a good hooking engine has to support them all, especially in x64 where there’s no standard prologue for a function.
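
For the conditional branch case, the conversion looks roughly like the following C sketch. It handles only the short Jcc opcodes 0x70-0x7F (2 bytes: opcode, rel8) and rewrites them into the near form 0x0F 0x80-0x8F (6 bytes: two opcode bytes, rel32), recomputing the displacement for the new location. It assumes 32-bit addresses, and relocate_short_jcc is a made-up name for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Converts a 2-byte short Jcc located at src_addr into the 6-byte near Jcc
 * that reaches the same target when placed at dst_addr (the trampoline).
 * Returns false if src is not a short Jcc (opcode 0x70-0x7F). */
static bool relocate_short_jcc(const uint8_t src[2], uint32_t src_addr,
                               uint32_t dst_addr, uint8_t out[6])
{
    if ((src[0] & 0xF0) != 0x70) {
        return false;
    }

    /* Branch targets are relative to the end of the instruction. */
    uint32_t target = src_addr + 2u + (uint32_t)(int32_t)(int8_t)src[1];
    uint32_t rel32  = target - (dst_addr + 6u);

    out[0] = 0x0F;
    out[1] = (uint8_t)(0x80 | (src[0] & 0x0F)); /* keep the condition code */
    out[2] = (uint8_t)(rel32 & 0xFF);           /* little-endian rel32 */
    out[3] = (uint8_t)((rel32 >> 8) & 0xFF);
    out[4] = (uint8_t)((rel32 >> 16) & 0xFF);
    out[5] = (uint8_t)((rel32 >> 24) & 0xFF);
    return true;
}
```

For example, JZ +5 (74 05) sitting at 0x1000 targets 0x1007; relocated to a trampoline at 0x5000 it becomes 0F 84 01 C0 FF FF. Other relative instructions (short JMP, CALL, LOOP, RIP-relative operands) each need their own handling, which is exactly the long list mentioned above.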

I noticed a specific case where WaitForSingleObject is being compiled to:
XOR R8D, R8D
JMP short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. And so you got 5 bytes total and should be able to hook it (it’s totally legal), but the hook engine I use sucked and didn’t allow that.

So you might say, ok, I got a generic solution for that, let’s follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you got to the dreaded entry points problem. You might get called for a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers to WaitForSingleObject get to you, that’s true. In addition, now all callers to WaitForSingleObjectEx get to you too. That’s a big problem. And you can’t ever tell on whose behalf you were called (with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it was followed immediately by an alignment NOP slide, which is never really executed, but the hooking engine can’t make use of it, because it doesn’t know what we know. So unconditional branches screw up hooking engines if they show up within the first 5 bytes from the entry point, and we just saw that following the unconditional branch might screw us up as well with the wrong context. So if you do that, it’s ill.

What would you do then in such cases? Cause I got some solutions, although nothing is perfect.