IsDebuggerPresent – When To Attach a Debugger

October 12th, 2011

This API is really helpful sometimes. And no, I’m not talking about using it for anti-debugging, come on.
Suppose you have a complicated application that you would like to debug on special occasions.
Two concerns arise:
1 – you don’t want to always DebugBreak() at a certain point, which will nuke the application every time execution reaches that point (because 99% of the time you don’t have a debugger attached, or it’s a release build, obviously).
2 – on the other hand, you don’t want to miss that point in execution when you do decide to debug it.

One example would be to set a key in the registry that is checked every time, and if it is set (no matter the value), the code will DebugBreak().
A similar one would be to store a timeout that is read at the points of interest inside the code, which then wait for that amount of time, giving you enough time to attach a debugger to the process.
Or you could set an environment variable to indicate the need for a DebugBreak, but that might be a pain as well, because environment blocks are inherited from the parent process, so setting a system-wide variable doesn’t mean your process will be affected, etc.
Another pretty obvious idea is to create a file in some directory, say, c:\debugme, whose existence the application checks, and if it exists the application waits for a debugger to attach.

What do all the approaches above have in common? Eventually they will either DebugBreak or get stuck waiting for you to do the work (attaching a debugger).

But here I’m suggesting a different flow: check whether a debugger is currently present, using IsDebuggerPresent (or one of thousands of other tricks, but why bother?), and only then fire the DebugBreak. This way you can extend it to wait at certain points for a debugger to attach.

The algorithm would be:

timeout = read timeout from registry (or check for the existence of a file, or whatever is most convenient for you)
if timeout exists:
    while (timeout not passed):
        if IsDebuggerPresent():
            DebugBreak()
            break
        Sleep(100) – or just wait a bit so you don’t hog the CPU
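
For the curious, here is a minimal Python sketch of that loop. It is only an illustration under my own assumptions: the names wait_for_debugger, is_debugger_present and on_attach are made up for this example, and the two callables are injected so the loop can be tried without Windows. On Windows you could pass win32_is_debugger_present, which calls the real kernel32 IsDebuggerPresent through ctypes, and a lambda that calls kernel32 DebugBreak.

```python
import time


def wait_for_debugger(timeout_s, is_debugger_present, on_attach,
                      poll_s=0.1, clock=time.monotonic, sleep=time.sleep):
    """Poll until a debugger shows up or the timeout passes.

    Returns True if on_attach() was fired, False if we timed out.
    All parameter names are hypothetical, chosen for this sketch.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_debugger_present():
            on_attach()      # stands in for DebugBreak()
            return True
        sleep(poll_s)        # wait a bit so we don't hog the CPU
    return False


def win32_is_debugger_present():
    """Windows-only helper: ask kernel32 whether a debugger is attached."""
    import ctypes
    return bool(ctypes.windll.kernel32.IsDebuggerPresent())
```

Reading the timeout (registry, file, whatever) is left out on purpose; a missing value simply means you never call wait_for_debugger and the application runs as usual.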

That’s it. The application always runs normally unless some value is set to hint that you would like to attach a debugger at certain points, and if you don’t attach one, it will time out and continue normally. Of course, it’s possible to add some log messages so you will know it’s time to attach a debugger, in case you haven’t attached one earlier…

It’s always funny to see people call MessageBox and then attach a debugger. They then want to set a breakpoint at some function or instruction, or even straight at the caller itself, but can’t find that place easily without symbols or expertise. Instead, put a breakpoint at the end of the MessageBox function and step out of it.

Thanks to Yuval Kokhavi for this great idea. If you have a better idea or implementation please share it with us ;)

isX64 Gem

July 13th, 2011

I needed a multi-arch shellcode for both x86 and x64 in the same code. Suppose you want to attack a platform that can be either x86 or x64, and you don’t know in advance which one it is. The problem then is deciding at runtime which version to use, right?

This is a tiny trick I’ve been using for a long while now which tells whether you run on x64 or not:

XOR EAX, EAX
INC EAX ; = DB 0x40
NOP
JZ x64_code
x86_code:
bits 32
.
.
.
RET
x64_code:
bits 64
.
.

The idea is very simple. x64 and x86 share most opcode values, but there is a small difference in the range 0x40-0x4F. In x86 that range is used for the one-byte INC and DEC opcodes: there are 8 GPRs (General Purpose Registers) and 2 opcodes, so they span all 16 values of 0x40-0x4F.
Now when AMD64’s ISA (Instruction Set Architecture) was designed, they added another set of 8 GPRs, making it a whopping total of 16 GPRs. In a world where x86 ruled, you only needed 3 bits in the ModRM byte (a byte in the instruction that tells the processor how to read its operands) to access a specific register from 0 to 7. With the new ISA, an extra bit was required in order to address all 16 registers. Therefore, a new prefix (called the REX prefix) was added to supply that extra bit (there’s more to it, but it’s not relevant for now). The new prefix took over the range 0x40-0x4F, thus eliminating the old one-byte INC/DEC in 64-bit mode (no worries, compilers now use the existing two-byte encoding of these instructions).

Back to our assembly code: it depends on the fact that in x86 the INC EAX really increments EAX by one, so EAX becomes 1 (and the zero flag is cleared) if the code runs on x86. When it runs on x64, the same byte becomes a prefix to the NOP instruction, which doesn’t do anything anyway. Hence EAX stays zero, the zero flag set by the XOR survives, and the JZ is taken. Just a final note for the inexperienced: in x64, writing to a 32-bit register zeroes the upper 32 bits of the full 64-bit register, so RAX is also 0.
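
To make the trick concrete, here is a toy Python model of how the four probe bytes behave in each mode. It is a sketch, not a real decoder: run_probe and is_x64 are names I made up for this example, and only the three opcodes of the probe are modeled.

```python
PROBE = bytes([0x31, 0xC0,   # XOR EAX, EAX
               0x40,         # INC EAX on x86, REX prefix on x64
               0x90])        # NOP


def run_probe(code, mode):
    """Interpret only the opcodes used by the probe; returns (eax, zf)."""
    if mode not in (32, 64):
        raise ValueError("mode must be 32 or 64")
    eax, zf = 0xDEADBEEF, False          # arbitrary garbage before the probe
    i = 0
    while i < len(code):
        op = code[i]
        if code[i:i + 2] == b"\x31\xC0":  # XOR EAX, EAX: result 0, ZF set
            eax, zf = 0, True
            i += 2
        elif op == 0x40 and mode == 32:   # one-byte INC EAX, updates ZF
            eax = (eax + 1) & 0xFFFFFFFF
            zf = eax == 0
            i += 1
        elif op == 0x40 and mode == 64:   # REX prefix: no effect of its own
            i += 1
        elif op == 0x90:                  # NOP
            i += 1
        else:
            raise ValueError("opcode not modeled: %#x" % op)
    return eax, zf


def is_x64(mode):
    """JZ x64_code is taken only if ZF is still set after the probe."""
    _, zf = run_probe(PROBE, mode)
    return zf
```

Running it for both modes gives EAX=1 with ZF clear for 32, and EAX=0 with ZF set for 64, which is exactly what the JZ keys on.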

Finding Kernel32 Base Address Shellcode

July 7th, 2011

Yet another one…
This time, smaller, more correct, and still null-free.
I looked a bit at some shellcodes at exploit-db and googled too, to see whether anyone had a smaller way, to no avail.

I based my code on:
http://skypher.com/index.php/2009/07/22/shellcode-finding-kernel32-in-windows-7/
which, AFAIK, is based on:
http://blog.harmonysecurity.com/2009_06_01_archive.html

And this is my version:

00000000 (02) 6a30                     PUSH 0x30
00000002 (01) 5e                       POP ESI
; Use DB 0x64; LODSD
00000003 (02) 64ad                     LODS EAX, [FS:ESI]
00000005 (03) 8b700c                   MOV ESI, [EAX+0xc]
00000008 (03) 8b761c                   MOV ESI, [ESI+0x1c]
0000000b (03) 8b5608                   MOV EDX, [ESI+0x8]
0000000e (04) 807e1c18                 CMP BYTE [ESI+0x1c], 0x18
00000012 (02) 8b36                     MOV ESI, [ESI]
00000014 (02) 75f5                     JNZ 0xb

The tricky part was how to read from FS:0x30, and the way I use is the smallest one, at least from what I checked.
Another issue that was fixed is the check for kernel32.dll. The usual variation of this shellcode checks for a null byte, but that turned out to be bogus on W2k machines, so it was changed to check for a null word, making the shellcode a byte or two longer.

This way it’s only 22 bytes, and it doesn’t assume that kernel32.dll is the second/third entry in the list. It actually loops until it finds the correct module name length (len of ‘kernel32.dll’ * 2 bytes = 0x18). That matters because kernelbase.dll can come first, which renders lots of implementations of this technique unusable.
And obviously the resulting base address of kernel32.dll is in EDX.
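
If you wonder where the 0x18 in the CMP comes from, this little Python sketch redoes the arithmetic. The helper names are mine, and I’m assuming the compared byte is the low byte of the module name’s length in bytes, with the name stored as UTF-16 and without the terminator.

```python
def base_dll_name_length(name):
    """Byte length of a module name stored as UTF-16LE, no terminator."""
    return len(name.encode("utf-16-le"))


def matches_shellcode_check(name, expected=0x18):
    """Mimic CMP BYTE [ESI+0x1c], 0x18: only the low byte is compared."""
    return (base_dll_name_length(name) & 0xFF) == expected
```

Note that the check is on the length alone, so any other 12-character module name (advapi32.dll, for instance) passes it too; the loop simply takes the first match in the list.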

Enjoy

[Update July 9th:]
Here’s a link to an explanation about PEB/LDR lists.
See first comment for a better version which is only 17 bytes.

Private Symbols Look Up by Binary Signatures

July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to keep it as small as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it with juicy details, or at all, so I decided to show you a real life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions which are not exported. Don’t start with the “this is not documented” story, I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want, even in legitimate code; it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, at whatever level it might be, you had better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires a lot of good reversing before injecting more code and messing with the process.
The example signature I’m going to talk about is, again, about getting the address of the function ZwProtectVirtualMemory in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy, and it does, but that’s another story.

I don’t wanna convert this post into a guideline on how to write a good signature, though I have lots of experience with it, even for various archs. I will just say that you never wanna put binary code (I am talking about the actual opcode bytes) in your signature, except in extreme cases, simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it is as generic as possible, so it will (hopefully) survive updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee that a signature will stay compatible with future updates. But always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL: try to get as many versions of that DLL file as possible and make a script to try the signature on all of them, it’s a must. The more DLLs the signature works on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is that they change a lot. In almost every compilation of a different version of the code, the compiler will allocate different registers and thus change the instructions. The same code might also get compiled into many variations which effectively do the same thing, e.g. MOV EAX, 0 and XOR EAX, EAX.
One more note: a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run over the listing it generated. Caching is always a good idea, and if you have many passes to do for many signatures that find different things, you can cache lots of stuff and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time, so try to avoid doing it per lookup. It should be cached, in one pass that scans the whole code section of the file, into a dictionary of “target:source” pairs, with false positives (another long story), that can be looked up for a range of addresses…
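
Here is a rough Python sketch of such an XREF cache. It is my own simplification for illustration: build_xref_cache and xrefs_to_range are hypothetical names, only CALL rel32 (0xE8) and JMP rel32 (0xE9) are considered, and every byte offset is tried without disassembling, which is exactly why false positives show up.

```python
import struct
from collections import defaultdict


def build_xref_cache(code, base):
    """One pass over a code section: map branch target -> list of sources.

    code is the raw bytes of the section, base is its load address.
    Any 0xE8/0xE9 byte is treated as a possible rel32 branch, so some
    entries will be false positives (bytes that aren't really branches).
    """
    xrefs = defaultdict(list)
    end = base + len(code)
    for off in range(len(code) - 4):
        if code[off] in (0xE8, 0xE9):
            (rel,) = struct.unpack_from("<i", code, off + 1)
            target = base + off + 5 + rel       # relative to next instruction
            if base <= target < end:            # keep in-section targets only
                xrefs[target].append(base + off)
    return dict(xrefs)


def xrefs_to_range(cache, lo, hi):
    """All (target, sources) pairs whose target falls in [lo, hi)."""
    return sorted((t, s) for t, s in cache.items() if lo <= t < hi)
```

Once the dictionary is built, asking “who branches into this function?” is a cheap lookup instead of another scan of the whole section.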

I almost forgot to mention: I used such a binary signature inside the patch I wrote as a member of ZERT for closing a vulnerability in Internet Explorer. I needed to find the weak function and patch it in memory, so you can grab the source code and see for yourself. That example does use opcodes (and lots of them) as part of the signature, but there’s a special reason for it. Long story short: the signature won’t match once the function gets officially patched by MS (recall that we published that solution before MS acted), and this way the patcher will know that it didn’t find the signature, so the function was probably already patched properly and we don’t need to patch it on top of the new patch… confusing shit.

The reason I find signatures amazing is that only reversers can do them well, and it takes lots of skill to generate good ones.
Happy signaturing :)

And surprisingly I found the following link which is interesting: http://wiki.amxmodx.org/Signature_Scanning

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions… though always expect that it might break.

diStorm Goes on Diet

June 11th, 2011

I just wanted to share my happiness with you guys. After long, hard work (over a month of my free time, which ain’t much these days), I managed to refactor all the data structures of the instructions DB in diStorm3. As the title says, I saved around 40kb of data! The original distorm3.dll file took around 130kb and currently it takes 90kb. I then went ahead and reconfigured the project settings in Visual Studio and instructed the compiler not to include the CRT shits. Then it bitched that “static constructors won’t be called” and the like, well duh. But since diStorm is written in C and I don’t have anything static that needs code to initialize it before the program starts, I didn’t mind at all. And eventually I got the .dll file size down to ~65kb. That’s really 50% of the original file size. This is sick.

I really don’t want to elaborate on the details of what I did, it’s really deep shit inside diStorm. Hey, actually I can give you a simple example. Suppose I have around 900 mnemonics. Do not confuse mnemonics with opcodes: some opcodes share the same mnemonic although they are completely different in their behavior. You have so many variations of the instruction ‘ADD’, for instance. Just to clarify: mnemonic = the display name of an opcode; opcode = the binary byte code which identifies the operation to do; instruction = all the bytes which represent the opcode and the operands, the whole thing.
Anyway, so there are 900 mnemonics, and the longest mnemonic takes 17 characters, some AVX mofo. Now since we want a quick look up in the mnemonics table, it was an array of [900][19], which means 900 mnemonics X 19 characters per mnemonic. Why 19? An extra character for the null terminating char, right? And another one for the Pascal string style, which means there’s a leading length byte in front of the string data. Now you ask why I need them both, C string and Pascal string together. That’s because in diStorm all strings are concatenated very fast by using Pascal strings. And also because the guy who uses diStorm wants to use printf to display the mnemonic too, and printf takes a C string, so he needs a null terminating character at the end of the string, right?
So back to business. Remember we have to allocate 19 bytes per mnemonic, so even if the mnemonic is as short as ‘OR’ or ‘JZ’, we waste tons of space, right? 900×19=~17kb. And this is where you get the CPU vs. MEMORY issue once again: you get random access to any mnemonic, which is very important, but it takes lots of space. Fortunately I came up with a cooler idea. I packed all the strings into a very long string, which looks something like this (copied from the source):

"\x09" "UNDEFINED\0" "\x03" "ADD\0" "\x04" "PUSH\0" "\x03" "POP\0" "\x02" "OR\0" \
"\x03" "ADC\0" "\x03" "SBB\0" "\x03" "AND\0" "\x03" "DAA\0" "\x03" "SUB\0"
and so on…

You can see the leading length and the extra null terminating character for each mnemonic, which is then followed by the next mnemonic. And now it seems like we’ve lost random access, because each string has a varying length and we can never get to the one we want… but lo and behold! Each instruction in the DB contains a field ‘opcodeId’, which used to denote the index in the mnemonics array and now holds the offset into the new mnemonics uber string. So if you use the macro mnemonics.h supplies, you still get to the same mnemonic. And all in all I saved around 10kb on mnemonic strings alone!

FYI the macro is:

#define GET_MNEMONIC_NAME(m) ((_WMnemonic*)&_MNEMONICS[(m)])->p

As you can see, I index the mnemonics string with the given OpcodeId field, which is taken from the decoded instruction, and get back a WMnemonic structure, which is a Pascal string (char length; char bytes[1])…
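
Here is the same packing idea as a small Python sketch, just to show the layout and the lookup by offset. pack_mnemonics and get_mnemonic are names I invented for this example; they are not part of diStorm.

```python
def pack_mnemonics(names):
    """Pack names into one blob: length byte, characters, null terminator.

    Returns (blob, offsets), where offsets[name] is where its length byte
    sits. That offset plays the role of the opcodeId in this sketch.
    """
    blob = bytearray()
    offsets = {}
    for name in names:
        raw = name.encode("ascii")
        offsets[name] = len(blob)
        blob.append(len(raw))    # Pascal-style leading length
        blob += raw
        blob.append(0)           # C-style terminator for printf users
    return bytes(blob), offsets


def get_mnemonic(blob, offset):
    """Random access: read the length byte, then slice out the characters."""
    length = blob[offset]
    return blob[offset + 1:offset + 1 + length].decode("ascii")
```

For the first five mnemonics the packed blob takes 31 bytes, while the fixed [5][19] layout would take 95, and the lookup is still a single index plus a slice.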

The DB was much harder to compact, but one thing I can tell you is that when you serialize trees you can (and should) use integer indices rather than pointers! In x64, each pointer takes 8 bytes, for crying out loud! Now in the new layout, each index in the tree takes only 13 bits, and the remaining 5 bits describe the type, where/what the index really points to… And it indirectly means that the DB now takes the same size in both x86 and x64 images, since it is not based on pointers.

Thanks for your time, I surely had pure fun :)

Binary Hooking Problems

May 14th, 2011

Most binary hooking engines write a detour at the entry point of the target function. Other hooking engines patch the IAT, and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction; usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy: they use a disassembler to query the size of each instruction they scan, and once the total size of the scanned instructions is at least 5 bytes, they’re done.
As an example, usually, functions start with these two instructions:
PUSH EBP
MOV EBP, ESP

which take only 3 bytes. And we already said that we need 5 bytes in total in order to replace the first instructions with our JMP instruction. Hence the scan has to continue to the next instruction or so, until we have at least 5 bytes.

So 5 bytes in x86 could contain anything from one instruction to 5 instructions (where each takes a single byte, obviously). The scan might even end on a single instruction that is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, which only makes matters worse.)

It is clear why we need to know the size of instructions: since we overwrite the first 5 bytes, we need to relocate the instructions they belong to to another location, the trampoline. From there we want to continue the execution of the original function we hooked, and therefore we need to continue from the next instruction that we haven’t overwritten. And that is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.
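
The sizing logic boils down to a few lines. This Python sketch assumes some disassembler has already handed you the instruction lengths from the entry point onward; bytes_to_relocate is a hypothetical helper name, not taken from any real hooking engine.

```python
def bytes_to_relocate(insn_lengths, needed=5):
    """Sum whole-instruction lengths until at least `needed` bytes are covered.

    insn_lengths: lengths of consecutive instructions starting at the
    function entry point (as reported by a disassembler).
    Returns how many bytes must be copied to the trampoline, which is also
    the offset where the original function resumes.
    """
    total = 0
    for length in insn_lengths:
        if total >= needed:
            break
        total += length
    if total < needed:
        raise ValueError("function too short to hold the detour")
    return total
```

With PUSH EBP (1 byte), MOV EBP, ESP (2 bytes) and a 3-byte instruction after them, the answer is 6, not 5, so the trampoline jumps back to offset 6.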

Lame hooking engines don’t use disassemblers, they just have a predefined table of popular prologue instructions. Come differently compiled code, they won’t be able to hook a function. Anyway, we also need a disassembler for another reason: to tell whether we hit a dead-end instruction, such as RET, INT 3, JMP, etc. These are hooking spoilers. If the first instruction of the target function is a simple RET (so the function doesn’t do anything, leaving aside cache side effects for now), we can’t plant a detour, and even a “return 0” function, which usually translates into “xor eax, eax; ret”, still takes only 3 bytes. So we find ourselves trying to overwrite 5 bytes where the whole function takes fewer than 5 bytes, and we cannot overwrite past the last instruction since we don’t know what’s there. It might be another function’s entry point, data, a NOP slide, or whatnot. The point is that we are not allowed to do that and eventually cannot hook the function, fail.

Another problem is relative-offset instructions. Suppose one of the instructions in the first 5 bytes is a conditional branch; we will have to relocate that instruction. Usually conditional branch instructions are only 2 bytes, so when we copy them to the trampoline we have to convert them into the longer variation, which is 6 bytes, and fix the offset. And that works well. In x64, RIP-relative instructions are also a pain in the butt, as is any other relative-offset instruction that requires a fix. So there’s quite a long list of those and a good hooking engine has to support them all, especially in x64 where there’s no standard prologue for a function.
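
As an illustration of the 2-byte to 6-byte conversion, here is a Python sketch for 32-bit code. relocate_short_jcc is a name I made up; it handles only the short Jcc opcodes 0x70-0x7F and rewrites them as the near form 0x0F 0x80-0x8F with a recomputed rel32.

```python
import struct


def relocate_short_jcc(insn, old_addr, new_addr):
    """Convert a short Jcc (opcode 0x7x, rel8) into a near Jcc (0F 8x, rel32).

    insn is the 2-byte instruction found at old_addr; the returned 6 bytes
    are meant to be placed at new_addr (the trampoline) and branch to the
    same absolute target. Assumes 32-bit mode, so offsets wrap at 2**32.
    """
    if len(insn) != 2 or not 0x70 <= insn[0] <= 0x7F:
        raise ValueError("not a short conditional branch")
    opcode, rel8 = insn[0], insn[1]
    if rel8 >= 0x80:                      # sign-extend the 8-bit offset
        rel8 -= 0x100
    target = old_addr + 2 + rel8          # relative to the next instruction
    rel32 = (target - (new_addr + 6)) & 0xFFFFFFFF
    return bytes([0x0F, opcode + 0x10]) + struct.pack("<I", rel32)
```

The condition code lives in the low nibble of the opcode in both encodings, which is why adding 0x10 to the short opcode gives the second byte of the near one.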

I noticed a specific case where WaitForSingleObject is being compiled to:
XOR R8D, R8D
JMP short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. So you’ve got 5 bytes in total and should be able to hook it (it’s totally legal), but the hooking engine I use sucked and didn’t allow that.

So you might say, ok, I’ve got a generic solution for that: let’s follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you’ve hit the dreaded entry points problem. You might get called through a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers of WaitForSingleObject get to you, that’s true. But now all callers of WaitForSingleObjectEx get to you too. That’s a big problem. And you can’t ever tell on whose behalf you were called (not with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it was followed immediately by an alignment NOP slide, which is never really executed, but the hooking engine can’t make use of it, because it doesn’t know what we know. So unconditional branches screw up hooking engines if they show up within the first 5 bytes from the entry point, and we just saw that following the unconditional branch might screw us up as well, with the wrong context. So if you do that, it’s ill.

What would you do then in such cases? Cause I got some solutions, although nothing is perfect.

Executing .PYC Files in Python

April 29th, 2011

I got this .PYC (compiled Python script) file that I used with command line Python, ala: “python.exe script.pyc”. And that would run the script.pyc file and let it do its job. The problem was that I wanted to run a few Python lines before running script.pyc itself. But apparently, it’s not really possible.

Suppose script.pyc does the familiar name check:
if __name__ == "__main__":
    main()

I guess everyone who wrote a module or two in Python knows this trick. This way you can tell in your script whether it was imported or executed and do whatever you are up to correspondingly. Thus, if it was executed, you will probably want to run a test case for the module, hence calling main() usually, and you could even pass command line arguments to it the normal way.

Since I wanted to do a few things before the script gets to run in Python, I had to open Python myself, do my stuff, and then execute the script. Unfortunately, it is not possible to execfile() a .PYC file, beats me; it runs only pure Python source code and not a compiled script. And I couldn’t just import the file either, since then the if statement I presented above would be false, main() wouldn’t be called and nothing would happen, fail.

What I eventually came up with was to import the file myself, but that required fooling a bit :)
You can __import__ a file in code at runtime. What I mean is that when you don’t know the name of the module to import statically, you can use that function, which receives a module name and dynamically loads the module at runtime.
I tried that on the script.pyc file and obviously it didn’t work either, because the __name__ was wrong, so main() didn’t get executed. That made me realize that I needed to do __import__’s internals on my own, and only then would I be able to change the __name__ of the module (if that’s possible, but it has to be, since the distinction exists, right?)

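Then after a bit of googling around I stumbled upon: imp.
Its documentation shows how to import a file ourselves, which I then changed to:
import imp
# Locate script.py / script.pyc on the module search path
fp, pathname, description = imp.find_module("script")
# Load it under the name "__main__" instead of "script"
imp.load_module("__main__", fp, pathname, description)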

Notice how I pass __main__ instead of the module’s real name. This way the __name__ check inside script.pyc really passes, main() gets executed, and I’m all happy once again.
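
A side note for newer Pythons: the imp module was deprecated in Python 3.4 and removed in 3.12, so the snippet above won’t run there. The sketch below uses a different technique, the standard runpy module, whose run_path accepts a compiled bytecode file and lets you pick the module name. run_pyc_as_main is just a wrapper name I chose for this example.

```python
import runpy


def run_pyc_as_main(pyc_path, init_globals=None):
    """Execute a .pyc (or .py) file with __name__ set to "__main__".

    Anything you want to do first can simply happen before this call;
    init_globals optionally pre-populates the script's global namespace.
    Returns the globals dictionary the script ended up with.
    """
    return runpy.run_path(pyc_path, init_globals=init_globals,
                          run_name="__main__")
```

Usage is a one-liner such as run_pyc_as_main("script.pyc"), after your own setup lines have run.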

JavaScript Once Again

April 22nd, 2011

I stumbled upon this JavaScript Garden page.
It’s one of the best resources about in-depth JS knowledge IMHO. And if you wanna be a JS guru it’s a must read. It shows all the more why JS sucks though :) So many fugly syntax problems, non-intuitive language. Weird scope management. And what not… Just read it anyway.

Uh Ah! I Happened To Use POP ESP

April 15th, 2011

I was telling a friend of mine the story about me using POP ESP in some code I wrote, and he noted how special it is to use such an instruction, and that I’m probably the first one he’s heard of who has used it. So I decided to share. I’m sorry to be mystical in my recent posts, it’s just that they are connected to the place I work at, and I can’t really elaborate on everything.

Here we go.
I had to call a C++ function from my Assembly code and keep the return value untouched so the caller would get it. Usually return values are passed in EAX, in x86 that is. But that’s not the whole truth, they might be passed in EDX:EAX, if you want to return a 64-bit integer, for instance.
My Assembly code was a wrapper around the C++ function, so once the C++ function returned, control got back to me, and so I couldn’t touch either EDX or EAX. The problem was that I had to clean the stack, as my wrapper function used the STDCALL calling convention. Cleaning the stack is pretty easy: after you have popped EBP and the stack pointer points to the return address, you still have to do as many POPs as the number of arguments your function receives. The calling convention also specifies which registers are to be preserved between calls and which registers are scratch. Therefore I decided to use ECX for my part, because it’s a scratch register and I didn’t want to dirty any other register. Note that by the time you need to both return to the caller and clean the arguments off the stack, it’s pretty hard to use push and pop instructions to back up a register so you can freely use it. Again, that’s because you’re in the middle of cleaning the stack, so by the time you POP that register, ESP has already moved. Therefore I was stuck with ECX only, but that’s fine with me. After the C++ function returned to me, I read the number of arguments to clean from some structure. Suppose I had the pointer to that structure in my frame and it was easily accessible as a local variable. Then I cleaned my own stack frame, with mov esp, ebp and pop ebp, after which ESP pointed to the return address.
This is where it gets tricky:

Assume ECX holds the number of arguments to clean:
lea ecx, [esp + ecx*4 + 4]

That calculation gets the fixed stack address, the value ESP would have after a RET N instruction. So it needs to skip the number of arguments multiplied by 4 (4 bytes per argument), plus 4 more for the return address itself.

Going on with:
xchg [esp], ecx

Which puts the fixed stack address on the stack and leaves ECX holding the return address. This is where people usually get confused, take your time. I’m waiting ;)

And then the almighty:
pop esp
jmp ecx

We actually popped the fixed stack pointer from the stack itself into the stack pointer. LOL
And since we got ECX loaded with the return address, we just have to branch to it.

What I was actually doing was simulating the RET N instruction using only ECX (ESP has to be used anyway). Now the function I was returning to could access both EAX and the optional EDX as return values from the C++ function.
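
If you want to convince yourself, here is a tiny Python model of the four instructions. It is only a sketch: simulate_ret_n is a made-up name, the stack is a dictionary of address to 32-bit value, and ESP is assumed to point at the return address on entry.

```python
def simulate_ret_n(stack, esp, n_args):
    """Model lea/xchg/pop esp/jmp ecx; returns (final_esp, jump_target)."""
    mem = dict(stack)            # work on a copy of the stack contents
    ecx = n_args

    # lea ecx, [esp + ecx*4 + 4]   -> where ESP should end up
    ecx = (esp + ecx * 4 + 4) & 0xFFFFFFFF

    # xchg [esp], ecx              -> fixed ESP goes on the stack,
    #                                 return address goes into ECX
    mem[esp], ecx = ecx, mem[esp]

    # pop esp                      -> ESP is loaded with the popped value
    esp = mem[esp]

    # jmp ecx                      -> branch to the return address
    return esp, ecx
```

With the return address at 0x1000 and three arguments above it, the model ends with ESP at 0x1010 and a jump to the return address, the same state a RET 12 would leave.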

It seems that the solution begged for SMC (self modifying code), so I could just patch the N in the RET N instruction, which is a 16-bit immediate value. But SMC is bad for performance, and obviously for multi-threading…

Also note that I could have just cleaned the stack and then branched with something like: jmp [esp – argsCount*4 – 4],
but I don’t like reading below my stack pointer, that’s bad practice (mostly from the days of interrupts…).

POP ESP FTW

Getting RAX Register in C/C++ or coders that should be killed ;)

April 13th, 2011

Just a weird story, suppose I need to get the value of RAX register in x64 inside some function I wrote.
I was using Visual Studio, and as you might know, in x64 you are not allowed to use __declspec(naked) or the inline __asm keyword anymore, what a shame. So obviously, I could write some .asm file and link it in. But I preferred to come up with a more elegant idea. Anyway, I just wanna show you the solution.

If I set up a function such as:
uint64_t getRax() { }
that is, an empty function which doesn’t do anything, the compiler will shout at me that it cannot compile such a function because there isn’t any return statement. What a shame. But suppose we could compile that function: we could then just call it and it would immediately return to the caller without doing anything, and then we could read the return value, which wasn’t changed because the function is empty, and thus we could get RAX. Following so far?

A single cast should do the trick. So first, we will have to change the function into:
void getRax() { }
Now the compiler will actually compile it. And now we will add a new pointer to a function that returns an actual uint64_t.

Defining a pointer to a function as follows:
uint64_t (*_getRax)() = getRax;
But that wouldn’t compile either, because the compiler is smart enough to know that we are messing with types here. We will end up with either a warning or an error, too bad. As we know, everything can be cast to void*, and that’s why we need to cast through void* for success, such as:
uint64_t (*_getRax)() = (void*) getRax;

This way we got rid of the warnings when using W3/4 or treating warnings as errors, the way I usually work. Though I could also disable the warning for the region of that fugly code. However, this is one of the reasons C/C++ is probably one of the strongest programming languages, this flexibility…
Surprisingly, I just found that GCC is more permissive in this case than VS.

And then we can simply use it:
uint64_t rax = _getRax();

I wish this could also work:
unsigned char getRax[] = { 0xc3 };
uint64_t (*_getRax)() = (void*) getRax;
_getRax();
But obviously, since DEP is enabled nowadays and the array lives in non-executable data, it will die an awful death :(

It really bothers me that I have to CALL a RET in order to get RAX accessible in C, LOL. Is anyone aware of an intrinsic that does similar things with registers? For some reason I remember something like __EAX in VS, but I couldn’t find it.