Archive for the ‘Virii’ Category

isX64 Gem

Wednesday, July 13th, 2011

I needed a multi-arch shellcode for both x86 and x64 in the same code. Suppose you want to attack a platform, which can either be x86 or x64 where you don’t know in advance which it is. The problem is which version you really need to use at runtime then, right?

This is a tiny trick I’ve been using for a long while now which tells whether you run on x64 or not:

INC EAX ; = DB 0x40
JZ x64_code
bits 32
bits 64

The idea is very simple, since x64 and x86 share most opcodes’ values, there is a small in-similarity with the range of 0x40-0x50, in x86 it used for one byte INC and DEC opcodes. Since there’re 8 GPRs (General Purpose Register), and 2 opcodes, it spans over the whole range of 0x40-0x50.
Now when AMD64’s ISA (Instruction Set Architecture) was designed, they added another set of 8 GPRs, making it a total of whopping 16 GPRs. In a world where x86 ruled, you only needed 3 bits in the ModRM byte (some byte in the instruction that tells the processor how to read its operands) to access a specific register from 0 to 8. With the new ISA, an extra bit was required in order to be able to address all 16 registers. Therefore, a new prefix (called the REX prefix) was added to solve this problem with an extra bit (and there’s more to it, not relevant for now). The new prefix used the range of 0x40-0x50, thus eliminating old one byte INC/DEC (no worries however, now compilers use the 2 bytes existent variation for these instructions).

Back to our assembly code, it depends on the fact that in x86 the INC EAX, really increments EAX by one, and so it will become 1 if the code runs on x86. And when it’s run on x64, it becomes a prefix to the NOP instruction, which doesn’t do anything anyway. And hence, EAX stays zero. Just a final note for the inexperienced that in x64, operations on 32 bit registers are automatically promoted to 64 bit registers, so RAX is also 0.

Finding Kernel32 Base Address Shellcode

Thursday, July 7th, 2011

Yet another one…
This time, smaller, more correct, and still null-free.
I looked a bit at some shellcodes at exploit-db and googled too, to see whether anyone got a smaller way to no avail.

I based my code on:
AFAIK, who based his post on:

And this is my version:

00000000 (02) 6a30                     PUSH 0x30
00000002 (01) 5e                       POP ESI
; Use DB 0x64; LODSD
00000003 (02) 64ad                     LODS EAX, [FS:ESI]
00000005 (03) 8b700c                   MOV ESI, [EAX+0xc]
00000008 (03) 8b761c                   MOV ESI, [ESI+0x1c]
0000000b (03) 8b5608                   MOV EDX, [ESI+0x8]
0000000e (04) 807e1c18                 CMP BYTE [ESI+0x1c], 0x18
00000012 (02) 8b36                     MOV ESI, [ESI]
00000014 (02) 75f5                     JNZ 0xb

The tricky part was how to read from FS:0x30, and the way I use is the smallest one, at least from what I checked.
Another issue that was fixed is the check for kernel32.dll, usually the variation of this shellcode checks for a null byte, but it turned out to be bogous on W2k machines, so it was changed to check for a null word. Getting the shellcode by a byte or two longer.

This way, it’s only 22 bytes, it doesn’t assume that kernel32.dll is the second/third entry in the list, it actually loops till it finds the correct module length (len of ‘kernel32.dll’ * 2 bytes). Also since kernelbase.dll can come first and that renders lots of implementations of this technique unusable.
And obviously the resulting base address of kernel32.dll is in EDX.


[Update July 9th:]
Here’s a link to an explanation about PEB/LDR lists.
See first comment for a better version which is only 17 bytes.

Binary Hooking Problems

Saturday, May 14th, 2011

Most binary hooking engines write a detour in the entry point of the target function. Other hooking engines patch the IAT table, and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction, usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy, they use a dissasembler to query the size of each instruction they scan, so if the total size of the instructions that were scanned is more than 5 bytes, they’re done.
As an example, usually, functions start with these two instructions:

which take only 3 bytes. And we already said that we need 5 bytes total, in order to replace the first instructions with our JMP instruction. Hence the scan will have to continue to the next instruction or so, till we got at least 5 bytes.

So 5 bytes in x86 could contain from one instruction to 5 instructions (where each takes a single byte, obviously). Or even a single instruction whose size is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, and it only makes matter worse).

It is clear why we need to know the size of instructions, since we overwrite the first 5 bytes, we need to relocate them to another location, the trampoline. There we want to continue the execution of the original function we hooked and therefore we need to continue from the next instruction that we haven’t override. And it is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.

Lame hooking engines don’t use disassemblers, they just have a predefined table of popular prologue instructions. Come a different compiled code, they won’t be able to hook a function. Anyway, we also need a disassembler for another reason, to tell whether we hit a dead end instruction, such as: RET, INT 3, JMP, etc. These are hooking spoilers, because if the first instruction of the target function is a simple RET (thus the function doesn’t do anything, leave aside cache side effects for now), or even a “return 0” function, which usually translates into “xor eax, eax; ret”, still takes only 3 bytes and we can’t plant a detour. So we find ourselves trying to override 5 bytes where the whole function takes several bytes (< 5 bytes), and we cannot override past that instruction since we don’t know what’s there. It might be another function’s entry point, data, NOP slide, or what not. The point is that we are not allowed to do that and eventually cannot hook the function, fail.

Another problem is relative-offset instructions. Suppose any of the first 5 bytes is a conditional branch instruction, we will have to relocate that instruction. Usually conditional branch instruction are only 2 bytes. And if we copy them to the trampoline, we will have to convert them into the longer variation which is 6 bytes and fix the offset. And that would work well. In x64, RIP-relative instructions are also pain in the butt, as well as any other relative-offset instruction which requires a fix. So there’s quiet a long list of those and a good hooking engine has to support them all, especially in x64 where there’s no standard prologue for a function.

I noticed a specific case where WaitForSingleObject is being compiled to:
JMP short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. And so you got 5 bytes total and should be able to hook it (it’s totally legal), but the hook engine I use sucked and didn’t allow that.

So you might say, ok, I got a generic solution for that, let’s follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you got to the dreaded entry points problem. You might get called for a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers to WaitForSingleObject get to you, that’s true. In addition, now all callers to WaitForSingleObjectEx get to you too. That’s a big problem. And you can’t ever realize on whose behalf you were called (with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it was followed immediately by an alignment NOP slide, which is never executed really, but the hooking engine can’t make a use of it, because it doesn’t know what we know. So unconditional branches screw up hooking engines if they show up before 5 bytes from the entry point, and we just saw that following the unconditional branch might screw us up as well with wrong context. So if you do that, it’s ill.

What would you do then in such cases? Cause I got some solutions, although nothing is perfect.

Calling System Service APIs in Kernel

Wednesday, January 26th, 2011

In this post I am not going to shed any new light about this topic, but I didn’t find anything like this organized in one place, so I decided to write it down, hope you will find it useful.

Sometimes when you develop a kernel driver you need to use some internal API that cannot be accessed normally through the DDK. Though you may say “but it’s not an API if it’s not officially exported and supported by MS”. Well that’s kinda true, the point is that some functions like that which are not accessible from the kernel, are really accessible from usermode, hence they are called API. After all, if you can call NtCreateFile from usermode, eventually you’re supposed to be able to do that from kernel, cause it really happens in kernel, right? Obviously, NtCreateFile is an official API in the kernel too.

When I mean using system service APIs, I really mean by doing it platform/version independent, so it will work on all versions of Windows. Except when MS changes the interface (number of parameters for instance, or their type) to the services themselves, but that rarely happens.

I am not going to explain how the architecture of the SSDT and the transitions from user to kernel or how syscalls, etc work. Just how to use it to our advantage. It is clear that MS doesn’t want you to use some of its APIs in the kernel. But sometimes it’s unavoidable, and using undocumented API is fine with me, even in production(!) if you know how to do it well and as robust as possible, but that’s another story. We know that MS doesn’t want you to use some of these APIs because a) they just don’t export it in kernel on purpose, that is. b) starting with 64 bits versions of Windows they made it harder on purpose to use or manipulate the kernel, by removing previously exported symbols from kernel, we will get to that later on.

Specifically I needed ZwProtectVirtualMemory, because I wanted to change the protection of some page in the user address space. And that function isn’t exported by the DDK, bummer. Now remember that it is accessible to usermode (as VirtualProtectMemory through kernel32.dll syscall…), therefore there ought to be a way to get it (the address of the function in kernel) in a reliable manner inside a kernel mode driver in order to use it too. And this is what I’m going to talk about in this post. I’m going to assume that you already run code in the kernel and that you are a legitimate driver because it’s really going to help us with some exported symbols, not talking about shellcodes here, although shellcodes can use this technique by changing it a bit.

We have a few major tasks in order to achieve our goal: Map the usermode equivalent .dll file. We need to get the index number of the service we want to call. Then we need to get the base address of ntos and the address of the (service) table of pointers (the SSDT itself) to the functions in the kernel. And voila…

The first one is easy both in 32 and 64 bits systems. There are mainly 3 files which make the syscalls in usermode, such as: ntdll, kernel32 and user32 (for GDI calls). For each API you want to call in kernel, you have to know its prototype and in which file you will find it (MSDN supplies some of this or just Google it). The idea is to map the file to the address space as an (executable) image. Note that the cool thing about this mapping is that you will get the address of the required file in usermode. Remember that these files are physically shared among all processes after boot time (For instance, addresses might change because of ASLR but stay consistent as long as the machine is up). Following that we will use a similar functionality to GetProcAddress, but one that you have to write yourself in kernel, which is really easy for PE and PE+ (64 bits).

Alright, so we got the image mapped, we can now get some usermode API function’s address using our GetProcAddress, now what? Well, now we have to get the index number of the syscall we want. Before I continue, this is the right place to say that I’ve seen so many approaches to this problem, disassemblers, binary patterns matching, etc. And I decided to come up with something really simple and maybe new. You take two functions that you know for sure that are going to be inside kernel32.dll (for instance), say, CreateFile and CloseHandle. And then simply compare byte after byte from both functions to find the first different byte, that byte contains the index number of the syscall (or the low byte out of the 4 bytes integer really). Probably you have no idea what I’m talking about, let me show you some usermode API’s that directly do syscalls:

XP SP3 ntdll.dll
B8 25 00 00 00                    mov     eax, 25h        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 19 00 00 00                    mov     eax, 19h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP1 32 bits ntdll.dll

B8 3C 00 00 00                    mov     eax, 3Ch        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 30 00 00 00                    mov     eax, 30h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

2008 sp2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

Win7 64bits syswow64 ntdll.dll

B8 52 00 00 00                    mov     eax, 52h        ; NtCreateFile
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 2C 00                          retn    2Ch

B8 0C 00 00 00                    mov     eax, 0Ch        ; NtClose
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 04 00                          retn    4

These are a few snippets to show you how the syscall function templates look like. They are generated automatically by some tool MS wrote and they don’t change a lot as you can see from the various architectures I gathered here. Anyway, if you take a look at the bytes block of each function, you will see that you can easily spot the correct place where you can read the index of the syscall we are going to use. That’s why doing a diff on two functions from the same .dll would work well and reliably. Needless to say that we are going to use the index number we get with the table inside the kernel in order to get the corresponding function in the kernel.

This technique gives us the index number of the syscall of any exported function in any one of the .dlls mentioned above. This is valid both for 32 and 64 bits. And by the way, notice that the operand type (=immediate) that represents the index number is always a 4 bytes integer (dword) in the ‘mov’ instruction, just makes life easier.

To the next task, in order to find the base address of the service table or what is known as the system service descriptor table (in short SSDT), we will have to get the base address of the ntoskrnl.exe image first. There might be different kernel image loaded in the system (with or without PAE, uni-processor or multi-processor), but it doesn’t matter in the following technique I’m going to use, because it’s based on memory and not files… This task is really easy when you are a driver, means that if you want some exported symbol from the kernel that the DDK supplies – the PE loader will get it for you. So it means we get, without any work, the address of any function like NtClose or NtCreateFile, etc. Both are inside ntos, obviously. Starting with that address we will round down the address to the nearest page and scan downwards to find an ‘MZ’ signature, which will mark the base address of the whole image in memory. If you’re afraid from false positives using this technique you’re welcome to go further and check for a ‘PE’ signature, or use other techniques.

This should do the trick:

PVOID FindNtoskrnlBase(PVOID Addr)
    /// Scandown from a given symbol’s address.
    Addr = (PVOID)((ULONG_PTR)Addr &amp; ~0xfff);
    __try {
        while ((*(PUSHORT)Addr != IMAGE_DOS_SIGNATURE)) {
            Addr = (PVOID) ((ULONG_PTR)Addr – PAGE_SIZE);
        return Addr;
    __except(1) { }
    return NULL;

And you can call it with a parameter like FindNtoskrnlBase(ZwClose). This is what I meant that you know the address of ZwClose or any other symbol in the image which will give you some “anchor”.

After we got the base address of ntos, we need to retrieve the address of the service table in kernel. That can be done using the same GetProcAddress we used earlier on the mapped user mode .dll files. But this time we will be looking for the “KeServiceDescriptorTable” exported symbol.

So far you can see that we got anchors (what I call for a reliable way to get an address of anything in memory) and we are good to go, this will work in production without the need to worry. If you wanna start the flame war about the unlegitimate use of undocumented APIs, etc. I’m clearly not interested. :)
Anyway, in Windows 32 bits, the latter symbol is exported, but it is not exported in 64 bits! This is part of the PatchGuard system, to make life harder for rootkits, 3rd party drivers doing exactly what I’m talking about, etc. I’m not going to cover how to get that address in 64 bits in this post.

The KeServiceDescriptorTable is a table that holds a few pointers to other service tables which contain the real addresses of the service functions the OS supplies to usermode. So a simple dereference to the table and you get the pointer to the first table which is the one you are looking for. Using that pointer, which is really the base address of the pointers table, you use the index we read earlier from the required function and you got, at last, the pointer to that function in kernel, which you can now use.

The bottom line is that now you can use any API that is given to usermode also in kernelmode and you’re not limited to a specific Windows version, nor updates, etc. and you can do it in a reliable manner which is the most important thing. Also we didn’t require any special algorithms nor disassemblers (as much as I like diStorm…). Doing so in shellcodes make life a bit harder, because we had the assumption that we got some reliable way to find the ntos base address. But every kid around the block knows it’s easy to do it anyway.

Happy coding :)

References I found interesting about this topic:

And how to do it in 64 bits:

VML + ANI ZERT Patches

Tuesday, February 3rd, 2009

It is time to release an old presentation about the VML and ANI vulnerabilities that were patched by ZERT. It explains the vulnerabilities and how they were closed. It is somewhat very technical, Assembly is required if you wanna really enjoy it. I also gave a talk using this presentation in CCC 2007. It so happened that I wrote the patches, with the extensive help of the team, of course.

ZERT Patches.ppt

Anti-Unpacker Tricks

Friday, July 18th, 2008

Peter Ferrie, a former employee of Symantec, who now works for MS wrote a paper about Anti Unpacker tricks. I was really fascinated reading that paper. There were so many examples in there for tricks that still work nowadays. Some I knew already some were new to me, he covers so many tricks. The useful thing is that every trick has a thorough description and a code snippet (mostly Assembly). So now it becomes one of the most valueable papers in the subject and you should really read it to get up to date. The paper can be found here.

One idea I that I really like from the paper, is something that Peter himself found, that you can use ReadFile (or WriteProcessMemory) to override a memory block so no software breakpoints will be raised when you execute it. But on a second thought, why a simple memcpy won’t do the same trick?

If you guys remember the Tiny PE challenge I posted 2 years ago in, then Peter was the only one who kicked my ass with a version of 232 byts,  where I came with 274 bytes. But no worries, after a long while I came back with a version of 213(!) bytes (over here) and used some new tricks. Today I still wait for Peter’s last word…

Have fun

Anti Debugging

Monday, January 14th, 2008

I found this nice page about Anti-Debugging tricks. It covers so many of them and if you know the techniques it’s really fun to read it quickly one by one. You can take a look yourself here: Window Anti-Debug Reference. One of the tricks really attracted my focus and it was soemthing like this:

push ss
pop ss

What really happens is that you write to SS and the processor has a protection mechanism, so you can safely update rSP immediately as well. Because it could have led to catastrophic results if an interrupt would occur precisely after only SS is updated but rSP wasn’t yet. Therefore the processor locks all interrupts until the end of the next instruction, whatever it is. However, it locks interrupts only once during the next instruction no matter what, and it won’t work if you pop ss and then do it again… This issue means that if you are under a debugger or a tracer, the above code will push onto the stack the real flags of the processor’s current execution context.

Thus doing this:
pop eax
and eax, 0x100
jnz under_debugging

Anding the flags we just popped with 0x100 actually examines the trap flag which if you simply try to pushf and then pop eax, will show that the trap flag is clear and you’re not being debugged, which is a potential lie. So even the trap flag is getting pended or just stalled ’till next instruction and then the debugger engine can’t get to recognize a pushf instruction and fix it. How lovely.

I really agree with some other posts I saw that claim that an anti-debugging trick is just like a zero-day, if you’re the first to use it – you will win and use it well, until it is taken care of and gets known. Although, to be honest, a zero-day is way cooler and another different story, but oh well… Besides anti-debugging can’t really harm, just waste some time for the reverser.

Since I wrote diStorm and read the specs of both Intel and AMD regarding most instructions upside down, I immediately knew about “mov ss” too. Even the docs state about this special behavior. But it never occurred to me to use this trick. Anyway, another way to do the same is:

mov eax, ss
mov ss, eax

A weird issue was that the mov ss, eax, must really be mov ss, ax. Although all disassemblers will show them all as mov ss, ax (as if it were in 16 bits). In truth you will need a db 0x66 to make this mov to work… You can do also lots of fooling around with this instruction, like mov ss, ax; jmp $-2; and if you single step that, without seeing the next instruction you might get crazy before you realize what’s going on. :)

I even went further and tried to use a priviliged instruction like CLI after the writing to SS in the hope that the processor is executing in a special mode and there might be a weird bug. And guess what? It didn’t work and an exception was raised, of course. Probably otherwise I won’t have written about it here :). It seems the processors’ logic have a kind of an internal flag to pend interrupts till end of next instruction and that’s all. To find bugs you need to be creative…never harm to try even if it sounds stupid. Maybe with another privileged instruction in different rings and modes (like pmode/realmode/etc) it can lead to something weird, but I doubt it, and I’m too lazy to check it out myself. But imagine you can run a privileged instruction from ring3…now stop.

TinyPE Made The World a Safer Place, did it?

Saturday, August 25th, 2007

It’s pretty cool to see after a long while since I’ve started that project that many AV’s now find the concept of Tiny PE as a virus or a risky application. On the other hand, it’s not a virus, so why do you alert about it? But most people think of the Tiny PE project, specifically what I started – was to download a file from the Internet and execute it. So it came out that the PE header was really fragile and yet it worked for Windows. So most AV’s and disassemblers didn’t even manage to parse it. That was only a side effect, later on, it was used with WebDAV to download the file directly by the Windows Loader using the name of a DLL as a URL(!), a real ownage.

So now I see that the link to the file my proof of concept code downloads is “censored” by some AV’s. My code is really inocent, will open a mere message box. But I guess you can imagine where it can end. Here’s the output of some AV:    # void
Where the original file URL is:

So it seems like it really made the world, or to be accurate the Internet, a safer place…although it wasn’t my real intention, because it was all started as a small bet with a friend and now see where it ended. Respect.

PS: to be really accurate when I say AV I mean malware scanning systems.

Common PE-Parsing Pitfalls

Sunday, June 3rd, 2007

PE, or Portable-Executable is Windows’ executable format. Looking only at the PE , as opposed to the code inside, can teach you alot about the application. But sometimes the tools you use to parse the file don’t do their work well. I, hereby, want to show a few problems about this kind of tools. As a matter of fact, .DLL and .SYS are also PE files under Windows, so I consider them the same when talking about PE files.

  1. If the Export-Directory offset points to a garbage pointer, Windows will still load the PE and run it successfully. It will get crazy and probably crash only when you try to GetProcAddress on this file. You can use this to render some tools unuseful but the file is still valid and runnable. Don’t confuse with Import-Directory which is necessary for loading the PE.
  2. Another annoying thing, whether the so-called “PE-Standard” states it or not, is the way of RVA (relative-virtual -address) offsets. RVA offset is the distance from the beginning of the file in memory to some specific field or section. Most people treat these RVA’s as if they must point into a section. But in reality RVA’s can point anywhere in the memory mapped file. They can even be negative numbers, (at least, as long as they still point to valid memory of the file). The problem is, most tools try to find the pointee field/section by scanning the sections list, but alas, it might be a valid RVA, which points to no section, but another part in the memory mapped file, for example, the MZ header… While Windows load these files well, some tools cannot parse them.
  3.  The most interesting problem that I found myself, not sure if anyone used it before, was changing the section’s File-Offset to be mis-aligned. The File-Offset is actually rounded down to a sector size (that’s 512 bytes) no matter what. So adding some number to the original valid File-Offset of the code section will fool some tools to read the code from that offset, instead of the rounded offset. Guess what happens? You disassemble the code from a wrong address and everything messes up. Even the mighty IDA had this bug. I introduced this technique in my Tiny PE Challenge. It seems most anti-virus software couldn’t chew up this file back then when I released it…Not sure about nowadays.
  4.  While researching for Tiny PE, Matthew Murphy hinted out that you can load files from the Internet with feeding it with a raw URL of the PE file. Later on it was extended such that Windows’ loader will use WebDAV to load an imported .DLL from the Internet! Think of an imported library with the following name \\\my.dll inside the PE file itself. This one seemed to be a real blow to the AV industry. It means you can write an empty PE file which will have this special import thingy and gets it off the Internet. For samples you can check it out here, which covers Tiny PE (not my ones) very nicely.

The bottom line is that the Windows’ loader is kinda robust and yet very permissive. It seems as virii can exploit many features the PE format has to offer while AV’s still don’t support some. I guess some of the tools (some are really popular) will get better with time. As for now, my PE parser library for Python, diSlib64 seems to do the job quite well.