Binary Hooking Problems

Most binary hooking engines write a detour at the entry point of the target function. Other hooking engines patch the IAT (Import Address Table), and so on. One of the problems with overwriting the entry point with a JMP instruction is that you need enough room for that instruction; usually a mere 5 bytes will suffice.

How do the hooking algorithms decide how much is “enough”?

Pretty easy: they use a disassembler to query the size of each instruction they scan, and once the total size of the scanned instructions is at least 5 bytes, they're done.
As an example, functions usually start with these two instructions:

push ebp
mov ebp, esp

which take only 3 bytes. And we already said that we need 5 bytes total in order to replace the first instructions with our JMP instruction. Hence the scan will have to continue to the next instruction or so, until we have at least 5 bytes.

So 5 bytes in x86 could contain anything from one instruction up to 5 instructions (where each takes a single byte, obviously), or even a single instruction whose size is longer than 5 bytes. (In x64 you might need 12-14 bytes for a whole-address-space JMP, which only makes matters worse.)

It is clear why we need to know the size of instructions: since we overwrite the first 5 bytes, we need to relocate them to another location, the trampoline. There we want to continue the execution of the original function we hooked, and therefore we need to continue from the first instruction that we haven't overwritten. And that is not necessarily the instruction at offset 5… otherwise we might continue execution in the middle of an instruction, which is pretty bad.

Lame hooking engines don't use disassemblers; they just have a predefined table of popular prologue instructions, so given differently compiled code, they won't be able to hook a function. Anyway, we also need a disassembler for another reason: to tell whether we hit a dead-end instruction, such as RET, INT 3, JMP, etc. These are hooking spoilers. If the first instruction of the target function is a simple RET (thus the function doesn't do anything; leave cache side effects aside for now), or even a "return 0" function, which usually translates into "xor eax, eax; ret" and still takes only 3 bytes, we can't plant a detour. So we find ourselves trying to overwrite 5 bytes where the whole function takes fewer than 5 bytes, and we cannot overwrite past its last instruction since we don't know what's there. It might be another function's entry point, data, a NOP slide, or whatnot. The point is that we are not allowed to do that and eventually cannot hook the function. Fail.

Another problem is relative-offset instructions. Suppose any of the first 5 bytes belongs to a conditional branch instruction; we will have to relocate that instruction. Usually conditional branch instructions are only 2 bytes, and if we copy them to the trampoline, we will have to convert them into the longer variation, which is 6 bytes, and fix the offset. Then it works well. In x64, RIP-relative instructions are also a pain in the butt, as is any other relative-offset instruction which requires a fix. So there's quite a long list of those, and a good hooking engine has to support them all, especially in x64, where there's no standard prologue for a function.

I noticed a specific case where WaitForSingleObject is being compiled to:

xor r8d, r8d
jmp short WaitForSingleObjectEx

in x64 of course; the xor takes 3 bytes and the jmp is a short one, which takes 2 bytes. So you've got 5 bytes total and should be able to hook it (it's totally legal), but the hook engine I used sucked and didn't allow that.

So you might say, OK, I've got a generic solution for that: let's follow the unconditional branch and hook that point. So I will hook WaitForSingleObjectEx instead, right? But now you've hit the dreaded entry-points problem: you might get called through a different entry point that you never meant to hook. You wanted to hook WaitForSingleObject and now you end up hooking WaitForSingleObjectEx, so all callers to WaitForSingleObject get to you, that's true. But in addition, all callers to WaitForSingleObjectEx get to you too. That's a big problem, and you can't tell on whose behalf you were called (at least not with a quick and legitimate solution).

The funny thing about the implementation of WaitForSingleObject is that it is followed immediately by an alignment NOP slide, which is never really executed, but the hooking engine can't make use of it, because it doesn't know what we know. So unconditional branches screw up hooking engines if they show up within the first 5 bytes of the entry point, and we just saw that following the unconditional branch might screw us up as well with the wrong context. So if you do that, you're asking for trouble.

What would you do in such cases? Because I've got some solutions, although nothing is perfect.

7 Responses to “Binary Hooking Problems”

  1. cketti says:

    In addition to the problem(s) you describe the entry point patching method also has the drawback that there is no way of knowing whether the first five bytes contain more than one basic block (short of correctly[!] disassembling the whole module). There could be a branch somewhere in the module, e.g. to the instruction at offset 2. So if you overwrite the first five bytes with your hooking JMP and this branch is taken at some point it lands inside the inserted JMP instruction. And now you’ve unintentionally altered the program logic.

    There are so many things that have to be considered when patching the entry point. So I’d take the IAT hooking route if possible.

  2. Anonymous says:

    When I did this, I would (1) push the hook id and the address of my logging code, then return; (2) copy the original code back into the function; (3) push a return address and call the original function; (4) copy the hook back over the function.

    Unfortunately the hook code was something like 11 bytes; if the function was smaller than that, I was screwed.

  3. Gal Diskin says:

    Speaking from a dynamic binary instrumentation perspective, we've seen this often. In Pin (which is a DBI engine) we have a mode (probe) that does what you described, and we implemented a function to check the safety of putting a probe == jmp.
    I will argue that even in x64 you can find creative ways to save bytes and not require the full long jmp.
    Our final conclusion was that there is no perfect way to be sure, short of doing full binary translation (which is what the JIT mode in Pin does). But then the performance penalty becomes significant.

    BTW if you are doing this manually or writing your own engine I recommend trying Pin.

    – Gal

  4. og says:

    @anonymous: the method you were using does not support concurrent entrance into the function. If you do not need to worry about that, then you can restore the original code and put your own back on exit. In practice, that is only seldom allowed. In other words, you must be getting old :-)

  5. Ron says:

    My solution to this has been doing it all by hand. Look at what I need to hook, and decide the best place/method. It always works!

  6. anonymous says:

    If you are hooking Windows APIs, you will notice that most of them are designed to be hooked. They are even designed to be hooked safely while code is executing. There are 5 NOP instructions before the function, and the first instruction is at least 2 bytes long. This is why many 32-bit functions begin with a 2-byte no-op such as MOV EDI, EDI.

  7. Sirmabus says:

    I have a related post on my blog here:
    “Knowing if and when you can fit a JMP5 binary hook.”

    Like you I use a disassembler/code analyzer for proper hooking.

    Some comments on other comments:
    IAT hooking has its usefulness, in particular where you want to control hooking at a module level.
    At the same time this is also its problem: you have to do it for every loaded module (the existing ones), and then use an internal hook to catch newly loaded modules. And even then this doesn't cover the cases where the API is accessed via GetProcAddress().
    Although if you replace the EAT vector too, that will at least cover the second case.

    This is what madCodeHook does for problem API cases.
    IMHO, if you want to see an excellent API hooking engine in practice, learn how MCH does it.
    It's been around a long time, and is very complete and robust.

    I don't claim to know all the answers, but I feel people should look at each technique for its usefulness. Add them to your "toolbox" to use as fits the situation. The main thing is to have an understanding of what they do, what their side effects are, etc.
    As in most things in science, it's not always "black and white". There are grey areas where there is more than one solution, and some things fit better than others depending on your choice of paradigms, etc.
    For instance, if your target is a game and it has an anti-cheat, then maybe you have to hook the main exit point/block of a function (if that will still work for you). Or maybe you can hook the IAT vector because it's not being checked (when the entry point is).

    I prefer the binary hook method as it is the most complete for general purpose use. And the same paradigm applies to general purpose, other then API hooking, too.

    From my blog post you will see I made a little program (that has since been expanded) to walk through a single DLL or many DLLs to get some statistics and metrics on some common hooking considerations.

    The problem cases are actually pretty rare. So rare that for general purposes you probably don't even need to compensate for them.
    Out of all the DLL exports in an XP 32-bit "system32" folder, less than 5% were a problem.
    With the size increased to 6 bytes (for absolute jumps, etc.) it's still only about 10%.
    And these percentages are actually lower due to various error and special cases.

    Just 32-bit so far; I haven't had the need to do actual native 64-bit hooks, since even on a 64-bit OS most of my target applications are still 32-bit (using the OS WoW64 emulation internally).

    Those NOPs (or the single-byte int3/0xCC equivalent) are actually alignment padding bytes, there to get the big performance benefit of having function entry points 16-byte (partial cache line) aligned.
    Incidentally, you want to have your bridge/trampoline code stubs aligned too. The difference is big: a few cycles when aligned versus some ~100 cycles more when not. So use "_aligned_malloc(size, 16)" or similar when you allocate those.

    What I found in my tests is that if you see at least two of these NOPs in series, it's fine to assume they are alignment bytes and overwrite them for hooking.
    For example, as you analyze a function entry for hooking and step over a "retn" and see these alignment bytes, you can just overwrite them to place your JMP (or whatever) hook there.
    It might not always be the case; there is some chance of error here (which will probably result in a crash). But again, we are talking about a small percentage, and you should have exception handling in your hooks as well as some exception monitoring set up in your engine.

    Furthermore, what I added to my engine is to take advantage of this NOP padding as a "code cave": place a relative short (2-byte) JMP from the entry point into the NOP space, where I can fit my JMP5 hook. It's an extra level of indirection, although relatively few extra cycles compared to many alternatives.

    Also I have an option to place a single-byte int3 instead, to form an exception hook for more problematic cases (this requires your engine to handle exceptions, of course).
    If there is an actual function there, you should be able to put a one-byte hook on it.
    You can also use a HWBP as well (although you're limited to four per thread).
    It's not ideal, though, as there is some ~100-cycle overhead per exception. Maybe that's only an issue if the hook is on some very commonly called function. An exception-handling kernel driver could reduce the overhead quite a bit too.

    Finally, what my engine tries to do in every case where possible and feasible is to just copy the whole function (or at least the first visible block of it; another technique borrowed from MCH). This works particularly well for those short functions that are usually just a "call xxx" followed by a "retn" et al.
    This has a performance advantage in that your bridge/trampoline code doesn't need to branch back to the original code space to continue.

    These things will get the already low 5% problem area down even smaller, to just a few percent. Plus they add to the toolbox techniques that can be used as needed.
