Trampolines In x64

We got a few nice features from the new x64 architecture: larger memory addressing, more registers (so a fastcall-style convention is now the standard, with the first four arguments passed in registers on Windows and the rest on the stack), and of course 64-bit wide registers. AMD had a once-in-a-lifetime opportunity to change the ISA (instruction set architecture) a bit and make it much better, but instead they added only a few new instructions, removed a lot, and left the decoding as hard as before. They were probably in a crazy rush, so that time Intel had to catch up with them!

The problem we face when hooking a function is how many bytes we need to overwrite. I already talked about Hot Patching and branching in x86, but I have never talked at length about x64. Most hooking engines use the JMP relative instruction (0xE9), which works on x64 as well. But once again we are limited to a range of 2GB from the JMP instruction’s address, and in an x64 address space 2GB is really not much. I also searched the internet for more info and found some interesting approaches, so I decided to talk about them here and describe how they work.

1)
The JMP relative instruction is a very good method when you know in advance that the distance from the hooked function to the target trampoline is less than 2GB: it takes only 5 bytes.
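
To make the 5 bytes concrete, here is a minimal Python sketch that builds the instruction bytes. The function name `jmp_rel32` and its parameters are made up for illustration; a real hooking engine would also have to write the bytes into the target and deal with page protection.

```python
import struct

def jmp_rel32(src: int, dst: int) -> bytes:
    """Encode `E9 rel32` for a JMP placed at address `src` that lands on `dst`."""
    # The displacement is measured from the end of the 5-byte instruction.
    rel = dst - (src + 5)
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is outside the +-2GB range of rel32")
    # Opcode 0xE9 followed by the signed 32-bit displacement, little-endian.
    return b"\xE9" + struct.pack("<i", rel)
```

For example, `jmp_rel32(0x1000, 0x2000).hex()` gives `e9fb0f0000`, and a target more than 2GB away raises `ValueError`, which is the cue to fall back to one of the longer methods.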

2)

MOV RAX, <Address>
JMP RAX

This one is almost optimal: you can branch anywhere in the address space and it takes only 12 bytes. The cost is that it destroys a register. Of course, the ABI (Application Binary Interface) that the compiler implements defines some registers as volatile, meaning you can use them almost any time without worrying about restoring them. By analyzing the function (using a disassembler) you may be able to tell which register you can use safely. That’s a big headache, though.
Note that you could replace JMP RAX with PUSH RAX; RET, but it’s still the same size.
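
The 12 bytes break down as 10 for the MOV and 2 for the JMP. A small Python sketch of the encoding, assuming RAX is the scratch register (the function name `mov_rax_jmp` is hypothetical):

```python
import struct

def mov_rax_jmp(dst: int) -> bytes:
    """Encode `mov rax, imm64` followed by `jmp rax`."""
    # REX.W (0x48) + opcode 0xB8 loads a full 64-bit immediate into RAX.
    code = b"\x48\xB8" + struct.pack("<Q", dst)
    # FF /4 with ModRM 0xE0 is `jmp rax`.
    code += b"\xFF\xE0"
    return code
```

`mov_rax_jmp(0x1122334455667788).hex()` yields `48b88877665544332211ffe0`.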

3)

PUSH <Low DWORD>
; Only if required (when the high half differs from the sign extension of the low half):
MOV [RSP+4], DWORD <High DWORD>
RET

This one was found on Nikolay Igotti’s blog. You split the QWORD address value into two DWORDs. The PUSH, although it takes a 32-bit immediate, really allocates a 64-bit slot on the stack and sign-extends the immediate into it. Then, if the high half of the address is not what the sign extension left there (in the common case, when bit 31 of the low half is clear and the high half is non-zero), you have to write it directly to the stack. The stack then holds the full QWORD, which you branch to by RETting. It takes 14 (=5+8+1) bytes and doesn’t dirty any register.
Note that it’s possible to shave some bytes in the case where the high half is -1, by using OR DWORD [RSP+4], -1 and the like.
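
Here is a hedged Python sketch of the byte layout, including the decision whether the MOV is needed at all. The function name `push_ret` is invented for this example, and it deliberately leaves out the OR shortcut.

```python
import struct

def push_ret(dst: int) -> bytes:
    """Encode `push imm32` / optional `mov dword [rsp+4], imm32` / `ret`."""
    lo = dst & 0xFFFFFFFF
    hi = (dst >> 32) & 0xFFFFFFFF
    # push imm32: the CPU sign-extends the immediate into a 64-bit stack slot.
    code = b"\x68" + struct.pack("<I", lo)
    # What the upper DWORD of the slot holds right after the push.
    sign_ext_hi = 0xFFFFFFFF if lo & 0x80000000 else 0
    if hi != sign_ext_hi:
        # mov dword [rsp+4], imm32 patches the upper half of the slot.
        code += b"\xC7\x44\x24\x04" + struct.pack("<I", hi)
    code += b"\xC3"  # ret pops the QWORD and branches to it
    return code
```

With a full 64-bit target such as `0x1122334455667788` this produces the 14-byte form; for `0x00401000` the MOV is skipped and only 6 bytes remain.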

4)
This one takes advantage of RIP relative addressing mode, check it out, yo

JMP [RIP+0]
DQ <Address>

This one is cool: it branches to the full address stored in the QWORD right after the instruction, so it takes 14 bytes as well. On the other hand, I don’t like to mix code and data. In the world of firewalls (which do tons of hooking), for instance, hooking this function twice with two different engines will probably lead to a crash: most hooking engines disassemble the instructions they overwrite, and here they will get garbage beginning with the second instruction.
This method leads to an interesting idea: you can move the address value anywhere within a range of +-2GB, and then you need only 6 bytes at the hook site. At the instruction level, any time you mention RIP it costs 4 bytes for a mandatory 32-bit displacement, even if it’s 0, which sucks.
An unlikely but sometimes possible method is JMP <Abs Addr>: some OS APIs let you choose the allocation address, so you can make sure it’s in the first 2GB.
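
Both flavors of the RIP-relative jump can be sketched in Python as follows. The function names are hypothetical; `slot` stands for the address where the 8-byte target pointer has been stored.

```python
import struct

def jmp_rip_inline(dst: int) -> bytes:
    """`jmp [rip+0]` immediately followed by the 8-byte target (14 bytes)."""
    # FF 25 disp32 is an indirect jump through a RIP-relative QWORD.
    return b"\xFF\x25" + struct.pack("<i", 0) + struct.pack("<Q", dst)

def jmp_rip_slot(src: int, slot: int) -> bytes:
    """6-byte `jmp [rip+disp32]` at `src` reading the target from `slot`."""
    # The displacement is relative to the end of the 6-byte instruction.
    disp = slot - (src + 6)
    if not -2**31 <= disp < 2**31:
        raise ValueError("pointer slot is outside the +-2GB range")
    return b"\xFF\x25" + struct.pack("<i", disp)
```

`jmp_rip_inline(0x1122334455667788).hex()` gives `ff25000000008877665544332211`, and `jmp_rip_slot(0x1000, 0x2000).hex()` gives `ff25fa0f0000`.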

5)

MOV RAX, [PointerToAddress64]
JMP RAX
;;;
PointerToAddress64:
DQ <Address>

This one is similar to the former, but it can read the QWORD from anywhere, because this addressing mode really supports 64-bit absolute addresses. It takes 12 bytes, not counting the QWORD itself. Needless to say, it dirties a register.
Note that this MOV RAX is a special form of the instruction (opcode 0xA1 with a REX.W prefix) that takes a full 64-bit memory offset; it exists only for the accumulator register.
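
A minimal Python sketch of this encoding, where the hypothetical parameter `ptr_addr` is the address of the QWORD that holds the target:

```python
import struct

def mov_rax_moffs_jmp(ptr_addr: int) -> bytes:
    """Encode `mov rax, [moffs64]` followed by `jmp rax`."""
    # REX.W (0x48) + opcode 0xA1 reads RAX from a 64-bit absolute address.
    code = b"\x48\xA1" + struct.pack("<Q", ptr_addr)
    code += b"\xFF\xE0"  # jmp rax
    return code
```

`mov_rax_moffs_jmp(0x0807060504030201).hex()` gives `48a10102030405060708ffe0`, 12 bytes in total.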

Each method has its own pros and cons. It seems you do best by choosing a specific method according to the distance from the hooked function to the target trampoline address.
One crazy idea, which I haven’t seen before, is to use a simple JMP relative (0xE9) instruction to jump into the MIDDLE of some nearby function. That function is hooked in the middle so that its own execution skips over the hole (like a transparent proxy), and the hole is filled with the full jump to any address.
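
Choosing a method by distance could look roughly like the Python sketch below. This is only one possible policy, picked for illustration: it prefers the shortest encoding and never dirties a register. The name `pick_method` and the returned labels are invented here; the sizes are the ones quoted above.

```python
def pick_method(src: int, dst: int) -> tuple[str, int]:
    """Return (method, size in bytes) for a hook at `src` jumping to `dst`."""
    # Method 1: the target is reachable with a rel32 displacement.
    rel = dst - (src + 5)
    if -2**31 <= rel < 2**31:
        return ("jmp rel32", 5)
    # Method 3 without the MOV: the address is a sign-extended 32-bit value.
    if dst < 2**31 or dst >= 2**64 - 2**31:
        return ("push imm32; ret", 6)
    # Otherwise fall back to a 14-byte form that preserves all registers.
    return ("jmp [rip+0]; dq", 14)
```

For instance, a hook at `0x401000` targeting `0x7FF000000000` ends up with the 14-byte form, while the reverse direction gets away with the 6-byte PUSH/RET.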

Any other tricks you wanna share with us? Leave a comment.

17 Responses to “Trampolines In x64”

  1. Yuhong Bao says:

    “Note that it’s possible to shave some bytes in a case where the high half is -1, by OR [RSP+4], -1 and the like.”
    Or if bit 31 is one, take advantage of the fact that the PUSH imm instruction *sign*-extends.

  2. serg says:

    I am about 2 method –
    May be available save rax to memory then restore it ?

  3. arkon says:

    serg – well, it would take a few more bytes, a total of 16.
    push rax ; 1 byte
    mov rax, < addr > ; 10 bytes
    xchg rax, [rsp] ; 4 bytes
    ret ; 1 byte

    then you’re better with method # 3, since it takes only 14…

  4. serg says:

    arcon
    I guess push rax ; 1 byte
    not so good , because it does the stack wrong for the replaced function .
    Probably need store rax to memory.
    And why are you so worried about the number of writable memory?
    And yet – what is kosher opcode for “MOV RAX, “?
    Thanks serge.

  5. arkon says:

    serg – the stack is totally balanced, let me show you.

    1) you push the original rax value on stack
    2) then you load the full addr into rax
    3) now you exchange the current memory of what rsp points to – to contain the address, and since you do an xchg, it puts back the original value of rax into rax itself, right? so you didn’t dirty rax and you got the address pushed on stack…
    4) now you ret, thus it pops the address from the stack and branches to it.

    didn’t understand your mov rax question…

  6. serg says:

    ok , probably you are right , i am not enough mastered in assembler :)
    Thanks for answer

  7. hc says:

    Nice tips!

    Well, my idea is to allocate a block of memory for each hook. The first slice of the block contains the far jmp table (support for +-2GB); the next slice contains the stolen bytes, with relative [conditional] jumps converted to [conditional] jumps into the table with the correct absolute far jump, using only 2 bytes (ex: 0xEB XX) since the stolen bytes are only a few bytes :) (oh, JCXZ is the exception, it uses 3 bytes)

    ——————————————–
    HOOK 1 – FAR JUMP TABLE (this must be near stolen bytes)
    1 – JMP FAR
    2 – JMP FAR
    ….
    ——————————————–
    HOOK 1 – STOLEN BYTES (some relative jumps)
    test rax, rax <- example of stolen code from original API
    jz @1 <- jump to far jump table at index 1
    ..
    jnz @2 <- jump to far jump table at index 2

    etc

    But I've a question, how to write the DQ: ?

    Regards

  8. Sönke says:

    Hi guys,

    I’m currently experimenting with the subject and found that 2) can be cured from the destruction of the register if you do the following:

    push rax;
    mov rax, <addr>;
    xchg rax, [rsp];
    ret;

    unfortunately the price is a 4 bytes longer code but seems to work fine.

    Greetings

  9. Thierry says:

    What are the opcodes for method 5 ? I end up with 12 bytes and not 11 :-(

    Regards

  10. arkon says:

    0000 (10) 48a10102030405060708 MOV RAX, [0x807060504030201]
    000a (02) ffe0 JMP RAX

    This is 12 bytes indeed!
    At the time I thought we didn’t need the 48 prefix byte; as the diStorm fix says:
    “MOV MEM-OFFSET instructions are NOT automatically promoted to 64bits, only with a REX.”
    Sorry, haven’t updated this post since…
    Thanks

  11. Jo says:

    Note that a RET without a corresponding CALL; i.e.

    Mov [RSP+4], ..
    RET

    Is an extremely BAD idea, because it is guaranteed to generate a mispredicted branch. All x86 processors have a Call-RET table, so that not too deeply nested subroutines (8 should be ok) are always correctly predicted.

    This so called trick will mess up that mechanism and cost you 50 cycles. (instead of 1 or 0).

  12. prazetto says:

    Register RCX, RDX, R8, R9 are in some case for param.

    So use R10, R11, R12, R13, R14, R15 should be safe.

    MOV R15, [ PointerToAddress64 ]
    JMP R15
    ;;;
    PointerToAddress64:
    DQ

  13. Sergiy says:

    I can’t help but comment…

    Intel already tried a fresh start with Itanium IA64. It totally faceplanted. I mean, talk about crash and burn. AMD64 is incremental move, and even then it was an extremely risky business proposition. AMD bosses either had nuts of steel or something forced their hand.

    If AMD tried to change the ISA significantly (again, like Intel tried with Itanium/IA64 before AMD64, and then again with Larrabee after AMD64), it would be playing Russian roulette. It sucks for programmers, but business considerations are always first and foremost, without them there’s no industry.

  14. Daniel says:

    It appears to me that

    MOV RAX,…
    JMP RAX

    does not jump to the address stored in RAX. Instead, it looks in the memory at address RAX and from there, it reads the destination address to jump to.

    Sorry for the AT&T-syntax. I’m more familiar with this:

    mov $0x1122334455667788,%rax # 48b88877665544332211
    jmp *%rax # ffe0

    VS

    mov $0x1122334455667788,%rax # 48b88877665544332211
    mov %rax,-0x8(%rsp) # 48894424f8
    jmp *-0x8(%rsp) # ff6424f8

  15. zela says:

    Hi guys,

    a quick question: what is the corresponding opcode sequence for the instruction “xchg rax, [rsp];”? I just cannot find such an instruction after disassembling a bunch of binaries.

  16. […] of up to 128TB. When searching for alternatives, we bumped into a post by Gil Dabah listing a few possible options. After disqualifying every option that “dirties” a register, we were left with only a couple of […]

  17. Sirmabus says:

    Love this post, saved it in my hook engine design pages set.

    Note in “minhook” github dot com/TsudaKageyu/minhook Tsuda uses the good old JMP5 hooks for 64bit just fine. What his engine does is find a trampoline page within 2GB of the hook target.
    Then on return from the trampoline uses an absolute 64bit JMP.

    In theory this will break if, say, an executable is 2GB or larger in size, but has anyone ever run into such a size on a desktop PC?
    I think the largest I’ve ever seen, recently on Windows, is a mapped PE file of the inordinate size of ~350MB.
    Have seen loaded executable sizes of 4GB’ish but it’s data size, not mapped exe size.

    In real-world terms, in practicality, JMP5 hooks will work just fine 99.999% of the time.
