Archive for the ‘Optimization’ Category

Optimizing diStorm for 2x

Wednesday, May 27th, 2020

I always loved code optimization, because it’s easy to see the goal getting closer every iteration in an objective way, and it’s a hard challenge. You set a bar and then you start to sweat getting there. I decided I wanted to take the latest diStorm version and make it 2x faster in processing time. I find it deeply fascinating that the same mechanism with the same input/output can run much faster.

diStorm was designed to be fast from day one, and that was pretty much the case for many (~15) years when comparing diStorm to other disassemblers. Until recently (the last few years), when a new one claimed to be faster, according to some benchmarking I saw, although I’ve never validated it on my own. Anyway, getting 2x faster wasn’t a small feat.

By designed-for-speed I mean:
1) single threaded!
2) diStorm doesn’t allocate memory at all (it takes buffers from the user)
3) doesn’t call any syscalls
4) it’s very simple in terms of calling-graph/functions
5) works with length’ed strings (as opposed to C strings)
6) parsing/decoding binary is fully separated from formatting the output strings
7) no initialization code is required (imagine distorm_init)

The test case consists of running the disasm tool (only commenting out the printf call) on some input file to decode it in 32 bits. The disasm itself is compiled with full optimizations on, for the x64 arch. My CPU is old: Intel SandyBridge Core i7-2600 @ 3.4 GHz. OS: Windows 10. The input is a dummy ~120MB file of code (c:\windows\system32\mrt.exe), which is good enough for my test. Starting above 5000 milliseconds and going down all the way to 2500, for a single thread!

The process of researching the code was done in two big steps (mind you, I had lotsa free time):
1) Research – Quick and dirty changes to the code, testing only once in a while for validity of the output. Seeing progress fast. Eventually reaching the goal (1 month, almost full time).
2) Development – Starting coding all over again, verifying every new piece of code, adding more tests/coverage and making it production grade.

Code optimization falls under two categories, the way I see it: the obvious stuff that a profiler points out to you and that you can easily fix (normally not a great boost in performance, a very specific touch), and the non-obvious stuff that can be refactored to get better performance, where nothing will guide you toward it. And this is why I believe code optimization is really hard and more of an art form, as there’s no method directing you how to do it. Normally, one should make sure their code is tight at the algorithmic level and only then try to exhaust all the obvious fixes the profiler hints about. In reality, you go back and forth, and after a while, when you believe there’s nothing more to squeeze or re-write, you find new creative ways for the code to work faster, and these are the big boosts.

The tools I used for my performance research were Intel vTune, which is now free, and Linux’s perf tool. Note that flamegraph wasn’t helping, as the major code loop has a call stack only 2 levels deep. The advantage of using perf on Linux is that you can query for the exact HW event, unlike the free version of vTune, which queries all events altogether, so its statistics are a bit noisier as a result.

In the rest of this post I’m going to show a few of the best optimizations that had the biggest impact on performance (not in research order):

  1. PGO – Profile Guided Optimization (the feature is courtesy of Visual Studio). This one easily gets your code 10-20% faster, and it’s ridiculous that I had never heard of it before, and neither had many developers I know. The drawback is that it requires certain input to ‘learn’ how to prioritize/lay out the code in order to make it faster; in a disassembler’s case the input is pretty obvious anyway, so nothing to worry about. Basically, what it does is order basic blocks and instructions into hot paths (the instructions that mostly happen in a loop, for instance) and cold paths (like error handling that doesn’t happen often).
    For example, for an “if-else” ladder (or switch-case), it’s preferred that the more likely condition appears first, so that getting to it takes fewer branches. This is one of the optimizations done automatically by PGO if it has sufficient data.
    Now a stream of instructions can run consecutively one after the other without branching, and that’s amazing. See this picture of the decode_inst disassembly before-PGO here, and after-PGO here; notice the code layout, and that the latter function’s core fits within one page.

    Note that in order to use this feature it’s important to copy some DLL to the release directory, run the tool with the wanted input, and then re-compile the code with ‘optimizing for pgo’ in VS.
    (exemplary path:
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64\pgort140.dll).

    A tip: I created a compilation target that copies this file, compiles for PGO and runs the input file, and then compiles again with the created profile, this saved a lot of time.
  2. Textual formatting of a binary structure, from the _DInst structure to a textual structure (_DecodedInst), requires lots of string copying of both mnemonics (ADD, MOV, XOR, …), registers (AL, EAX, XMM0, …) and other constant string literals.

    The main trick that boosted performance around 10% is to change the memcpy calls from a varied length to a fixed length.
    Instead of calling
    memcpy(output->mnemonicText, mnemonicsTable[mnemonicId].buffer, mnemonicsTable[mnemonicId].length);
    We better do:
    memcpy(output->mnemonicText, mnemonicsTable[mnemonicId].buffer, 16);

    Now the compiler can fully omit the call to memcpy and just do a single 16-byte copy instruction!
    The reason it always copies 16 is that the longest mnemonic length is 16, for the instruction VAESKEYGENASSIST.
    Caveats: 1) Adding padding in the table to avoid out-of-bounds reads is crucial. 2) All mnemonics are also zero terminated, so it’s okay to copy a full 16 bytes even if the mnemonic is ‘MOV’; it’s really ‘MOV\0’, and yes, it will copy other stuff in too (but it’s ignored). 3) All final strings also have an integer length accompanying them!

    For copying registers, making sure the table is as tight as possible and also copying a fixed length did the trick. Meaning that if there’s a table of registers in the form of:
    char registersTable[100][10] = {"eax", "ecx", ...};
    then the access to each entry is a multiplication by 10. Making it a power of 2 is easier for the compiler to optimize, and it also saves a lot of memory space. The biggest register string literal length is 5 (YMM15) and a null termination char is needed, so we could use 6, but we’d better use 8 already. Then the memcpy of a register always copies 8 bytes, no matter what.
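    As a minimal sketch of that layout (the table name, contents and helper here are illustrative, not diStorm’s actual definitions):

    #include <string.h>

    /* Each entry is padded to 8 bytes, so indexing is a shift and the copy is a single 8-byte move. */
    static const char registersTable[][8] = { "eax", "ecx", "edx", "ebx", "ymm15" };

    static void strCopyRegister(char* output, unsigned regIndex)
    {
        /* Fixed-size copy; 'output' must have at least 8 bytes of room. */
        memcpy(output, registersTable[regIndex], 8);
    }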

    Since x86 instructions have both string literal prefixes (rep, repnz, lock) and suffixes (e.g. CMPSB, CMPSW, CMPSD, etc.), using lookup tables also sped things up.

    See code diff here (scroll down all the way).

    Another trick that, surprisingly, helped a lot: I noticed string concatenation could work even faster. Note that all diStorm strings are already length’ed, meaning each is a structure with a length and a buffer. And when a proprietary version of strcat is used, it appends to the string using that known length instead of searching for the null terminator. But the surprise was that appending to the string with
    memcpy(&str->buf[str->length], inputStr, inputLen);
    was very slow compared to just using a running pointer, e.g.:
    memcpy(str, inputStr, inputLen); str += inputLen;
    So I had to change all of diStorm’s string functionality to work with a single pointer, and only at the end calculate the length and stuff in a null termination character.
    Also note that in the same breath I figured that diStorm’s code has fixed-size string literals, like the DWORD in ‘mov dword [eax], 8’, etc. So I forced diStorm to use fixed-size memcpy’s too, which helped the compiler eliminate them all and turn them into simple qword moves. If you open diStorm’s disasm code now you won’t see a single call to a memcpy function. This also saved a lot of time, 5-10%.
    See relevant code diff here.
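    Putting the last two tricks together, here is a minimal sketch of the single-pointer, fixed-size-copy style (the function and literals are illustrative, not actual diStorm code):

    #include <string.h>

    /* Builds "mov dword " into 'output' (which must be large enough) and returns its length. */
    static unsigned formatExampleToken(char* output)
    {
        char* p = output;                  /* one running pointer, no per-append length bookkeeping */
        memcpy(p, "mov", 4); p += 3;       /* fixed-size copies the compiler turns into plain moves  */
        memcpy(p, " dword ", 8); p += 7;
        *p = 0;                            /* null terminate and compute the length only once        */
        return (unsigned)(p - output);
    }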
  3. Another set of optimizations that shaved off a further 10% was re-arranging if statements and code blocks that happen to run per instruction, but in reality aren’t always needed.

    The obvious guideline is that if the code does less stuff per iteration it’s likely to run faster.

    Imagine the code checks something in the instruction. For example, do something for privileged instructions only. The first issue is that such instructions are input dependent (i.e. out of our hands); the second is that 99% of the instructions in the instruction set aren’t privileged to begin with. This means the code should avoid doing checks for them, if possible. So I managed to omit a conditional statement…

    One thing was to move checks to their origin: for example, if something was dependent on the optional operand of an instruction, and the test was done in the per-instruction code, then obviously if we move the test to the origin (inside the operand parser), things will run faster.

    Another was that some extra features were done naively (diStorm tracks all used/unused prefixes and generates relevant masks), so optimizing these boosted the performance too. Sometimes a profiler makes it clear where it’s worth optimizing.

    Sometimes it’s better to do a=x; if (y) a=z; instead of if (y) a=z; else a=x; which might also help performance.

    Assuming the happy flow wins, and optimizing for it, can also help drastically. Sometimes you can check for faults at the very end of a step instead of at every sub-step (if the mechanism allows it).

    Taking into account data cache locality (which I had never really dealt with before now): if you touch some memory, be sure to exhaust that piece of data until you don’t need it anymore, instead of re-visiting that memory again and again. Re-ordering code accordingly saves a lot of time loading stuff into the cache.

    Trying to re-order the multiple conditions in an if statement might also help; perhaps PGO does it on its own. But always start with the condition least likely to hold, so you quit checking further, sooner.

    Another example is that sometimes the code used to dereference some pointer without always using that value, so postponing the dereference to the site where it’s really needed can also boost performance.
    Many small things like that helped to increase performance of the decode_inst function, you can check out the diff here, but it’s spanned across a few commits.
  4. Optimizing operand extraction also saved around 10%. For example, some operands need to read further information from the binary stream, and reading from the stream can vary in size: 1, 2, 4 or 8 bytes. The original function is something like this:
    int read_stream(const char* codeBuf, unsigned size, uint64_t* out)
    {
        switch (size) {
        case 1: *out = *(const uint8_t*)codeBuf; break;
        case 2: *out = *(const uint16_t*)codeBuf; break;
        case 4: *out = *(const uint32_t*)codeBuf; break;
        case 8: *out = *(const uint64_t*)codeBuf; break;
        default: return 0; /* unsupported size */
        }
        return 1;
    }

    Now, given that the compiler knows how much to read from the stream, because all callers to read_stream specify ‘size’ as a constant, it can really optimize the whole switch statement away and just read what’s needed from the stream, no biggy.
    The problem is that the function wasn’t inlined at all, and thus each stream reading took a lot of time just because of the branches. Making this function inline saved around 5% by itself, for free! :)

    Another issue was that a few things always happened at the beginning or end of an operand extraction function that could be done lazily or postponed to the exact right moment, as described in the previous point, and that also saved significant time.

    The operand_extract function originally took around 11 arguments; trying to reduce these and work with fewer arguments is normally better. Some values were calculated again and again (like the operand array entry, &ops[index]) even though they were sometimes already calculated once, so passing them along instead can also save some time. It sounds like small stuff, and it is, but eventually, when you stop coding naively and start to see this stuff in the profiler and also by looking at the code, it adds up to a lot.

    Since there are some 100 internal types of operands in diStorm, they were classified into a few categories. Now the original code did something like: switch (type) case 1, case 2, case… and then another switch (type) case 10, case 11, case 12… and then yet another switch (type) case 20, case 21, case 22, and so on. Meaning that the case of 22 gets evaluated in two switch tables even though it will surely run code only in the third switch. This can generate branch prediction noise (as these switch statements eventually become jump tables) and waste time, also because it’s not hitting a target basic block, just falling through to the next one. So I chose to add a corresponding range check before each switch statement to spare the wasteful lookups. Alternatively it was possible to coalesce all the switch cases under a single switch, but I didn’t like that form of code in my case.
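    A hedged sketch of the range-check idea (the operand types and categories here are made up for illustration, not diStorm’s real ones; the return values just keep the example compilable):

    /* Illustrative classification only. */
    enum { OT_REG8, OT_REG32, OT_REG_LAST = OT_REG32,
           OT_MEM, OT_MEM_DISP, OT_MEM_LAST = OT_MEM_DISP,
           OT_IMM8, OT_IMM32 };

    static int extractOperand(int type)
    {
        /* A cheap range check spares walking through switch tables that cannot match. */
        if (type <= OT_REG_LAST) {
            switch (type) {
            case OT_REG8:  return 1;
            case OT_REG32: return 2;
            }
        } else if (type <= OT_MEM_LAST) {
            switch (type) {
            case OT_MEM:      return 3;
            case OT_MEM_DISP: return 4;
            }
        } else {
            switch (type) {
            case OT_IMM8:  return 5;
            case OT_IMM32: return 6;
            }
        }
        return 0;
    }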
  5. diStorm eventually has two APIs, one for textual disassembling and another for actual binary structure output; internally the latter functionality runs first, and then the binary is formatted into text.
    But since the prototypes of the functions take different input and output structures, it changes how the core engine works with the output memory.
    In one of the main loops of converting the binary structure into textual output, diStorm used to copy the binary structure (from the user’s memory) into a temporary on-stack buffer, and then feed that as input into the formatting function, so that eventually the user’s memory is the output. In the end I decided to eliminate the temporary buffer and make sure the formatting functionality can work with the same input/output buffer in-place!
    Well better see it in pseudo code:
    for (i = 0; i < instsCount; i++) {
        // insts[i] holds the binary form of the instruction.
        memcpy(&tmp, &insts[i], sizeof(tmp));
        format_inst(&tmp, &insts[i]); // in: tmp, out: insts[i]
        // Now insts[i] is a different struct holding the textual output.
    }
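    And the in-place version, as a hedged sketch (the casts and the function signature are illustrative, not the real prototypes):

    for (i = 0; i < instsCount; i++) {
        // No temporary copy: the formatter reads the binary fields and writes the
        // textual fields into the very same entry, touching them in a careful order.
        format_inst((_DInst*)&insts[i], (_DecodedInst*)&insts[i]);
    }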

    This in-place memory trick saved around 3%, and it required refactoring the formatting function: the shared memory fields had to be touched carefully, so we don’t overwrite with text some binary fields that we still need to convert…
    See code diff here.
  6. diStorm has some extra features that require extra CPU time to get done. One of them is a very simple feature that returns the relevant eflags of each decoded instruction (inside _DInst these are the modified, undefined and tested eflags masks). For example, a JZ instruction takes ZF as input. An ADD instruction writes to a few flags, such as AF, ZF, PF, OF, CF, etc. And some instructions even change some flags into an undefined state.

    This feature used to run all the time, even in the textual mode of the disassembler, which is the common case, and it wasted time. So I turned this code into a feature that the user can opt in to, and this saved ~3%.

    Moral of the story: don’t calculate stuff nobody uses!

    Note that what takes time in this is copying the compacted eflags to the extracted eflags compatible masks in the final output buffer…
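    A hedged sketch of the opt-in check (the flag name follows the DF_FILL_FLAGS entry in the log below, but the helper and field names are illustrative):

    if (ci->features & DF_FILL_FLAGS) {
        /* Only now pay for expanding the compacted internal eflags into the public output masks. */
        di->modifiedFlagsMask  = expandCompactedFlags(internalFlags.modified);
        di->testedFlagsMask    = expandCompactedFlags(internalFlags.tested);
        di->undefinedFlagsMask = expandCompactedFlags(internalFlags.undefined);
    }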
    See code diff here, nothing interesting.

There are many more tricks, but these had the biggest performance impact.
Eventually this research taught me a lot about the modern processor pipeline, which is fascinating and worth reading about.
Sometimes it was frustrating to move one C code line from one place to another, which at times made things faster and sometimes slower. Sometimes eliminating a whole line made things slower, really. And other times putting in a new line, or even an access to a memory address, made things faster in the right place. Modern pipelines are somewhat voodoo, so be careful not to optimize for a specific one. At some point I decided to keep the code in readable C; playing with single lines is not a great idea, unless you can reason about it and a profiler yells at you to change it.


Cache was always an undervalued thing for me; even though I’ve been coding for over 2 decades now, I always took it for granted, but in reality, once you understand the cache, you can get your code to run faster. And many naive things I did (and I didn’t consider myself a naive coder in terms of speed) were also bad. So in this exercise I learned a lot, and I tried to share some useful and practical points with you rather than the research itself. Hence I avoided talking about the processor itself (for instance, icache vs. DSB) and the tiresome process of benchmarking instructions, branches and uops, etc.
I guess looking at the code of the original version and the final version and comparing them can teach one a lot, in files such as decoder.c, distorm.c, operands.c, prefix.c and textdefs.c.

Here’s the log I kept of the second phase, (headers: timestamp, measurement time in MS, comment):
4:42 PM 4/21/2020 5170 BASELINE # MRT.exe (120MB) decoded in 32 bits. disasm compiled for x64.
5:35 PM 4/21/2020 4964 DF_FILL_FLAGS feature.
11:02 PM 4/21/2020 4832 In-place formatting.
11:58 PM 4/21/2020 4482 Fixed size register & mnemonic copying.
8:58 AM 4/22/2020 4393 str_hex optimized.
9:35 AM 4/22/2020 4386 textdefs.c cosmetics.
5:01 PM 4/22/2020 4130 wstring and textdefs optimizations.
5:00 PM 4/23/2020 3969 format optimizations.
5:02 PM 4/23/2020 3562 PGO enabled for mrt.exe.
4:56 PM 4/25/2020 3300 prefixes and decode_internal optimizations.
10:12 AM 4/26/2020 2890 optimized inst_decode.
3:33 PM 5/4/2020 2690 prefixes and operands optimizations.
11:44 PM 5/7/2020 2610 more prefixes and operands optimizations.
9:53 AM 5/14/2020 2500 prefixes, operands, decoder optimizations.

And these are the vTune profiler views showing the functions and timings (not fully matching my original tests, TBH) for the formatting functionality (down from 1474 to 741 MS – 1.98x) and the decoding functionality (down from 3084 to 1818 – 1.7x); you will see fewer symbols in the optimized functionality thanks to in-lining.

Future challenges for diStorm – there are still 2 functions that drive me nuts, which I managed to optimize a lot but are still slow and probably deserve tons of attention: a proprietary version of itoa for converting integers to hex ASCII numbers (str_int_impl), and converting binary bytes to hex ASCII (str_hex). Any optimization there is super helpful.


Private Symbols Look Up by Binary Signatures

Friday, July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to try to keep it as small as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it with juicy details, or at all, so I decided to show you a real life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions which are not exported. Don’t start with the “this is not documented” story; I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want. Even in legitimate code, it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, at whatever level it might be, you’d better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires a lot of good reversing before injecting more code and messing with the process.
The example of a signature I’m going to talk about is again about getting the address of the function ZwProtectVirtualMemory in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy and it does, but that’s another story.

I don’t wanna convert this post into a guideline on how to write a good signature, though I have lots of experience with it, even for various archs. I will just say that you never wanna put binary code as part of your signature, except in extreme cases (I am talking about the actual opcode bytes), simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it will be as generic as possible so it will survive (hopefully) the updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee a signature will be future compatible with new updates. But always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL, then try to get as many versions of that DLL file as possible and make a script to try it out on all of them – a must. The more DLLs the signature works on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is that they change many times, almost on every compilation of the code in a different version: the compiler will allocate new registers for the instructions and thus change the instructions. Or the code might get compiled to many variations which effectively do the same thing, e.g. MOV EAX, 0 and XOR EAX, EAX.
One more note: a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run over the listing it generated. Anyway, caching is always a good idea, and if you have many passes to do for many signatures to find different things, you can always cache lots of stuff and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time; try to avoid that. That should be cached in one pass of scanning the whole code section of the file, into a dictionary of “target:source” pairs, with false positives (another long story) that can be looked up for a range of addresses…

I almost forgot to mention that I used such a binary signature inside the patch I wrote as a member of ZERT, for closing a vulnerability in Internet Explorer. I needed to find the weak function and patch it in memory, so you can grab the source code and see for yourself. The example does use opcodes (and lots of them) as part of the signature, but there’s a special reason for it. Long story made short: the signature won’t match once the function gets officially patched by MS (recall that we published that solution before MS acted), and this way the patcher will know that it didn’t find the signature and that the function was probably already patched well, so we don’t need to patch it on top of the new patch… confusing shit.

The reason I find signatures amazing is that only reversers can do them well and it takes lots of skill to generate good ones.
Happy signaturing :)

And surprisingly I found the following link which is interesting: http://wiki.amxmodx.org/Signature_Scanning
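To make the scanning machinery concrete, here is a minimal sketch of a masked byte scan in the spirit of that link (the pattern below is made up; it is not the real ZwProtectVirtualMemory signature, and as said above, prefer higher-level anchors over raw opcode bytes whenever you can):

#include <stddef.h>
#include <stdint.h>

/* 0xFF in the mask means the byte must match; 0x00 means wildcard (a register or immediate that may change). */
static const uint8_t sigBytes[] = { 0x8B, 0x00, 0x85, 0xC0, 0x74 };
static const uint8_t sigMask[]  = { 0xFF, 0x00, 0xFF, 0xFF, 0xFF };

static const uint8_t* findSignature(const uint8_t* code, size_t size)
{
    for (size_t i = 0; i + sizeof(sigBytes) <= size; i++) {
        size_t j = 0;
        while (j < sizeof(sigBytes) && ((code[i + j] ^ sigBytes[j]) & sigMask[j]) == 0)
            j++;
        if (j == sizeof(sigBytes))
            return &code[i]; /* first match; a real scanner must also deal with false positives */
    }
    return NULL;
}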

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions… though always expect that it might break.

diStorm Goes on Diet

Saturday, June 11th, 2011

I just wanted to share my happiness with you guys. After long, hard work (over a month in my free time, which ain’t much these days), I managed to refactor all the data structures of the instructions DB in diStorm3. As the title says, I spared around 40kb of data! The original distorm3.dll file took around 130kb and currently it takes 90kb. I then went ahead and reconfigured the settings of the project in Visual Studio and instructed the compiler not to include the CRT shits. Then it bitched about “static constructors won’t be called” and the like – well, duh. But since diStorm is written in C and I don’t have anything static to initialize (which is based on code) before the program starts, I didn’t mind it at all. And eventually I got the .dll file size to ~65kb. That’s really 50% of the original file size. This is sick.

I really don’t want to elaborate on the details of what I did, it’s really deep shit inside diStorm. Hey, actually I can give you a simple example. Suppose I have around 900 mnemonics. Do not confuse mnemonics with opcodes – some opcodes share the same mnemonic although they are completely different in their behavior. You have so many variations of the instruction ‘ADD’, for instance. Just to clarify: mnemonic = the display name of an opcode; opcode = the binary byte code which identifies the operation to do; instruction = all the bytes which represent the opcode and the operands, the whole thing.
Anyway, so there are 900 mnemonics, and the longest mnemonic takes 17 characters, some AVX mofo. Now since we want a quick lookup in the mnemonics table, it was an array of [900][19], which means 900 mnemonics X 19 characters per mnemonic. Why 19? An extra character for the null terminating char, right? And another one for the Pascal string style – meaning there’s a leading length byte in front of the string data. Now you ask why I need them both, C string and Pascal string together. That’s because in diStorm all strings are concatenated very fast by using Pascal strings. And also because the guy who uses diStorm wants to use printf to display the mnemonic too, which uses C strings, so he will need a null terminating character at the end of the string, right?
So back to business. Remember we have to allocate 19 bytes per mnemonic: even if the mnemonic is as short as ‘OR’ or ‘JZ’, we waste tons of space, right? 900×19=~17kb. And this is where you get the CPU vs. MEMORY issue once again: you get random access into the mnemonics table, which is very important, but it takes lots of space. Fortunately I came up with a cooler idea. I packed all the strings into one very long string, which looks something like this (copied from the source):

“\x09” “UNDEFINED\0” “\x03” “ADD\0” “\x04” “PUSH\0” “\x03” “POP\0” “\x02” “OR\0” \
“\x03” “ADC\0” “\x03” “SBB\0” “\x03” “AND\0” “\x03” “DAA\0” “\x03” “SUB\0”
and so on…

You can see the leading length and the extra null terminating character for each mnemonic, and then it’s followed by the next mnemonic. And now it seems like we’ve lost random access, since each string has a varying length and we can never get to the one we want… but lo and behold! Each instruction in the DB contains a field ‘opcodeId’ which used to denote the index in the mnemonics array and now denotes the offset into the new mnemonics uber-string. And if you use the macro mnemonics.h supplies, you will get to the same mnemonic nevertheless. All in all I spared around 10kb on mnemonic strings alone!

FYI the macro is:

#define GET_MNEMONIC_NAME(m) ((_WMnemonic*)&_MNEMONICS[(m)])->p

As you can see, I access the mnemonics string with the given opcodeId field, which is taken from the decoded instruction, and get back a WMnemonic structure, which is a Pascal string (char length; char bytes[1])…
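For instance, printing a decoded instruction’s mnemonic both ways would look roughly like this (a sketch; it assumes the usual diStorm3 headers and that the instruction came from distorm_decompose):

#include <stdio.h>
#include "distorm.h"
#include "mnemonics.h"

void printMnemonic(const _DInst* di)
{
    const _WMnemonic* wm = (const _WMnemonic*)&_MNEMONICS[di->opcode];
    printf("%.*s\n", (int)wm->length, wm->p);       /* Pascal style: leading length byte + characters */
    printf("%s\n", GET_MNEMONIC_NAME(di->opcode));  /* C string style, thanks to the extra null       */
}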

The DB was much harder to compact, but one thing I can tell you when you serialize trees is that you can (and should) use integer indices rather than pointers! In x64, each pointer takes 8 bytes, for crying out loud! Now in the new layout, each index in the tree takes only 13 bits, and the remaining 5 bits describe the type – where/what the index really points to… And it indirectly means that the DB now takes the same size for both x86 and x64 images, since it is not based on pointers.
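A hedged sketch of what such a packed node can look like (the widths follow the 13+5 split mentioned above; the actual type name, macros and layout in diStorm differ):

#include <stdint.h>

typedef uint32_t InstNode;                    /* a serialized tree node: index + node type   */
#define NODE_INDEX(n)  ((n) & 0x1FFF)         /* low 13 bits: index into the flat DB arrays  */
#define NODE_TYPE(n)   (((n) >> 13) & 0x1F)   /* next 5 bits: what kind of table/leaf it is  */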

Thanks for your time, I surely had pure fun :)

diStorm for Java, JNI

Monday, October 4th, 2010

Since we decided to use Java for the reverse engineering studio, ReviveR, I had to wrap diStorm for Java. For now we decided that the core framework is going to be written in Java, and probably the UI too, although we haven’t concluded that yet. Anyway, now we are thinking about the design of the whole system, and I’m so excited about how things start to look. I will save a whole post to tell you about the design once it’s ready.

I wanted to talk a bit about JNI, the Java Native Interface. Since diStorm is written in C, I had to use JNI to use it inside Java. It might remind people of P/Invoke, or Python extensions, etc.

The first thing I had to do was define the same C structures of diStorm’s API, but in Java. And this time they are classes, encapsulated obviously. After I had these classes ready (and, stupid Java, I had to put each public class in a separate file…), I eventually had like 10 files for all the definitions, and the next step was to compile the whole thing and use the javah tool to get the definitions for the native functions. I didn’t like the way it worked; for instance, any time you rename the package name, or add/remove a package, the name of the exported C function in the native .DLL file changes as well – big loss.
Once I finally decided on the names of the packages and classes, I could move on to implement the native C functions that correspond to the native definitions in the Java class I wrote earlier. If you’re a bit familiar with JNI, you probably know jobject and its friends well. And because I use classes rather than a few primitive-type arguments, I had to work hard to wrap them, not to mention the arrays of instructions I want to return to the caller.

The Java definition looks as such:

public static native void Decompose(CodeInfo ci, DecomposedResult dr);

The corresponding C function looks as such:

JNIEXPORT void JNICALL Java_distorm3_Distorm3_Decompose
  (JNIEnv *env, jobject thiz, jobject jciObj, jobject jdrObj);

Since the method is static, there’s no use for the thiz (equivalent of class’s this) argument. And then the two objects of input and output.
Now, the way we treat the jobjects is dependent on the classes we set in Java. I separated them in such a way that one class, CodeInfo, is used for the input of the disassembler. And the other class, DecomposedResult, is used for output, this one would contain an array to return the instructions that were disassembled.

Since we are now messing with arrays, we don’t need to use another out-argument to indicate the number of entries we returned in the array, right? Because now we can use something like array.length… as opposed to the C function: void f(int a[], int n). So I found myself having to change the classes a bit to take this into account – nothing special though, just taking advantage of high level languages.

Moving on, we have to access the fields of the classes, and this is where I got really irritated by the way JNI works. I wish it were as easy as ctypes for Python; of course they are not exactly parallel, but they solve the same problem after all. Or a different approach, like parsing a tuple in embedded Python, PyArg_ParseTuple, which eases this process so much.

For each field in the class, you need to know both its type and its Id. The type is something you know at compile time; it’s predefined and simply depends on the way you defined your classes in Java – easy. The ugly part now begins: Ids. You have to know the Id of the field you want to access, either for read or write access. The idea behind those Ids was to make the code more flexible, in the sense that if you inherit a class, then the field you want to access has probably moved to a new slot in the raw structure that contains it.
Think of something like this:

struct A {
int x, y;
};

struct B {
 int z, color;
};

struct AB {
 struct A a;
 struct B b;
};

Suddenly, accessing AB’s b.z uses a different offset than accessing a plain B’s z. Can you see that?
So the guys who designed JNI came up with the idea of querying the class, using internal reflection, to get this Id (or really an index to the variable in the struct, I’d guess). But this reflection thingy is really slow: obviously you need to do string comparisons on all members of the class, and on all classes in the inheritance chain… no good. So you might say, “but wait a sec, the class’s hierarchy is not going to change in the lifetime of the application, so why not reuse its value?”. This is where the JNI documentation talks about caching Ids. Now seriously, why don’t you guys do it for us internally? Why do I need to implement my own caching? Don’t give me this ‘more-control’ bullshit. I don’t want control, I want to access the field as quickly as possible and get on to other tasks.

Well, since the facts are what they are, and we have to do things the way we do, we now have to cache the stupid Ids for good. While reading how other people did it and why they got mysterious crashes, I solved the problem quickly, but I want to elaborate on it.

In order to cache the Ids of the fields you want to have access to, you do the following:

if (g_ID_CodeOffset == NULL) {
    g_ID_CodeOffset = (*env)->GetFieldID(env, jObj, "mCodeOffset", "J");
    // Assume the field exists, otherwise your interfaces are broken anyway.
}
// Now we can use it...

Great, right? Well, not so fast. The problem is that if you have a few functions that each access this same class and its members, you will need to have these few lines of code everywhere, for each use. No go. Therefore the common solution is to have another native static InitIDs function and invoke it right after loading the native library in your Java code, for instance:

static {
	System.loadLibrary("distorm3");
	InitIDs();
}

Another option would be to use the exported JNI_OnLoad function to initialize all the global Ids before the rest of the functions ever get called. I like that option more than InitIDs, since it’s less artificial in my opinion.
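A minimal sketch of that JNI_OnLoad approach (the class and field names follow the examples in this post, but treat the exact strings and the JNI version as illustrative):

#include <jni.h>

static jclass g_CodeInfoCls = NULL;
static jfieldID g_ID_CodeOffset = NULL;

JNIEXPORT jint JNICALL JNI_OnLoad(JavaVM* vm, void* reserved)
{
    JNIEnv* env = NULL;
    if ((*vm)->GetEnv(vm, (void**)&env, JNI_VERSION_1_6) != JNI_OK) return JNI_ERR;

    jclass cls = (*env)->FindClass(env, "distorm3/CodeInfo");
    if (cls == NULL) return JNI_ERR;

    /* Keep a (weak) global reference so the cached Ids stay valid for the library's lifetime. */
    g_CodeInfoCls = (jclass)(*env)->NewWeakGlobalRef(env, cls);
    g_ID_CodeOffset = (*env)->GetFieldID(env, cls, "mCodeOffset", "J");
    if (g_ID_CodeOffset == NULL) return JNI_ERR;

    return JNI_VERSION_1_6;
}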

Once we got the Id ready we can use it, for instance:

codeOffset = (*env)->GetLongField(env, jciObj, g_ID_CodeOffset);

Note that I use the C interface of the JNI API, just so you are aware of it. And jciObj is the argument we got from Java calling us in the Decompose function.

When calling the GetFieldID function we have to pass a jclass – that’s a pointer to a Java class object, kinda, in contrast to a class instance; I hope you know the difference. Now since we cache the Ids for the rest of the application’s lifetime, we have to keep a reference to this Java class, otherwise weird problems and nice crashes should (un)surprise you. This is crucial since we use the same Ids for the same classes along the code. So when we call GetFieldID we should hold a reference to that class, by calling:

(*env)->NewWeakGlobalRef(env, jCls);

Note that jCls was retrieved using:

jCls = (*env)->FindClass(env, "distorm3/CodeInfo");

Of course, don’t forget to remove the reference to those classes you used in your code, by calling DeleteGlobalRef in JNI_OnUnload to avoid leaks…

The FindClass function is very good once you know how to use it. It took me a while to figure out the syntax and naming convention. For example, String, which seems to be a primitive type in Java, is really not; it’s just a normal class, therefore you will have to use “java/lang/String” if you want to access a string member.
Suppose you got a class “CodeInfo” in the “distorm3” package, then “distorm3/CodeInfo” is the package-name/class-name.
Suppose you got an inner class (inside another class), then “distorm3/Outer$Inner” is the package-name/outer-class-name$inner-class-name.
And there is probably a bit more to it, but that’s a good start.

About returning new objects to the caller. We said already that we don’t use out-arguments in Java.
Think of:

void f(int *n)
{
 *n = 5;
}

That’s exactly what an out-argument is, to return some value rather than using the return keyword…
When you want to return lots of info, it’s not a good idea, you will have to pass lots of arguments as well, pretty ugly.
The idea is to pass a structure/class that will hold this information, and even have some encapsulation to it.
The problem at hand is whether to use a constructor of the class, or just create the object and set each of its values manually.
Also, I wonder which method is faster, letting the JVM do it on its own in a constructor, or doing everything using JNI.
Unfortunately I don’t have an answer to this question. I can only say that I used the latter method of creating the raw object and setting its fields. I thought it would be better.
It looks like this:

jobject jOperand = (*env)->AllocObject(env, g_OperandIds.jCls);
if (jOperand == NULL) { /* Handle error! */ }
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Type, insts[i].ops[j].type);
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Index, insts[i].ops[j].index);
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Size, insts[i].ops[j].size);

(*env)->SetObjectArrayElement(env, jOperands, j, jOperand);

This is a real piece of code taken from the wrapper. It constructs an Operand class from the Operand structure in C. Notice the way AllocObject is used, with that jCls we hold a reference to, instead of calling FindClass again… Then it sets the fields and puts this object in the array of Operands.

On to another thing I didn’t like much in JNI: having to call SetField, GetField and those variations. On one hand, I understand they wanted you to know which type of field you access. But on the other hand, when I queried the Id of the field, I already specified its type, so I pretty much know what type of value I’m setting… Well, unless you have bugs in your code, but that will always cause problems.

On to another issue: one of the members of the CodeInfo structure in diStorm is a pointer to the binary code that you want to disassemble. It means that we have to get some buffer from Java as well. But apparently, sometimes the JVM decides to make a copy of the buffer/array that is being passed to the native function. In the beginning I used a straightforward byte[] member in the class. This sucks hard. We don’t want to waste time on copying and freeing buffers that are read-only. Performance, if it can be better, should be better by default, if you ask me. So reading the documentation, there’s an extension to JNI to use java.nio.ByteBuffer, which gives you direct access to the Java buffer without the extra effort of copying. Note that it requires the caller of the native API to use this class specifically, and sometimes you’re limited…
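A hedged sketch of the direct-buffer access on the native side (the argument names are illustrative; it assumes the Java side passes a ByteBuffer created with ByteBuffer.allocateDirect()):

#include <jni.h>

static const unsigned char* getCodeBuffer(JNIEnv* env, jobject jCodeBuffer, jlong* outLen)
{
    const unsigned char* code = (const unsigned char*)(*env)->GetDirectBufferAddress(env, jCodeBuffer);
    *outLen = (*env)->GetDirectBufferCapacity(env, jCodeBuffer);
    if (code == NULL || *outLen == -1)
        return NULL; /* not a direct buffer, or direct access isn't supported */
    return code;     /* no copy was made: this points straight into the Java buffer */
}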

The bottom line is that it takes a short while to understand how to use JNI and then you get going with it. I found it a bit cumbersome… The most annoying part is all the extra preparation you have to do in order to access a class or a field, etc. – unless you don’t care at all about performance, but then your code is less readable for sure. We don’t have any information about the performance of allocating new objects and array usage, so we can’t base our way of coding on anything. I wish it could be more user friendly, or that parts of it were eliminated somehow.

Optimize My Index Yo

Thursday, November 26th, 2009

I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to Pascal strings in a way: you’ve got the length, and the string doesn’t have to be null terminated; the length, though, is stored in bytes. Normally I don’t look at the assembly listing of the application I compile, but when you get to debug it you get to see the code the compiler generated. Since some of my functions take strings for input, but as null terminated ones, I had to copy the original string to my own copy and add the null character myself. And now that I think of it, I will rewrite everything to use lengths – I don’t like extra wcslen’s. :)

Here is a simple usage case:

p = (PWCHAR)ExAllocatePool(SomePool, Str->Length + sizeof(WCHAR));
if (p == NULL) return STATUS_NO_MEMORY;
memcpy(p, Str->Buffer, Str->Length);
p[Str->Length / sizeof(WCHAR)] = UNICODE_NULL;

I will show you the resulting assembly code, so you can judge yourself:

shr    esi,1 
xor    ecx,ecx 
mov  word ptr [edi+esi*2],cx 

First the compiler converts the length to WCHAR units, as I asked. Then it realizes it should take that value and use it as an index into the unicode string, thus it has to multiply the index by two to get to the correct offset. That’s wasteful.
This is the output of a fully optimized code by VS08, shame.

It’s silly, but this would generate what we really want:

*(PWCHAR)((PCHAR)p + Str->Length) = UNICODE_NULL;

With this fix, we get the store without the extra div/mul. I just did a few more tests and it seems the dead-code removal and the simplifier algorithms are not perfect when there are divisions inside pointer indexing.

Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler does this extra shr/mul. The reason is that the compiler can’t know whether the length is odd, thus it has to round it.

It’s Vexed :)

Tuesday, November 17th, 2009

In the last few weeks I’ve been working on diStorm to add the new instruction sets: AVX and FMA. You can find lots of information about them on the net. In brief, the big advantages are the support for 256-bit registers, now called YMM’s (their low halves are the XMM’s), and also built-in support for AES: you have a few instructions to do small block encryption and decryption, really sweet. I guess it will help some security companies out there to boost stuff. The other main feature behind these instruction sets is three-register operands. So now you are not stuck with 2 registers per instruction; you can sometimes have up to four. This is good because you have two source operands and a destination operand, which means it saves you other instructions (to move or back up registers) and you don’t have to ruin your dest-src operand like in the old sets. Almost forgot to mention FMA itself, the fused multiply-add instructions, so you can do two operations at once, like A*B+C, etc.

I wanted to talk about the VEX prefix itself. It’s really a new design and approach to prefixes that was never seen before. And the title of this post says it all, it’s really annoying.
The VEX (Vector Extension) prefix is a multi-byte prefix for a change. It can be either 2 or 3 bytes. If you take a look at the one-byte opcode map, all of the entries are taken. Intel was in need of a new unused byte, which didn’t really exist. What they did instead was to share two existing opcodes, one for each prefix. The sharing works in a special way that lets the processor know whether you meant the original instruction or the new prefix. The chosen instructions are LDS (0xc4) and LES (0xc5). When you examine the second byte of these instructions, the byte from which the new information is extracted, you can learn that its two most significant bits can’t be set together (i.e. a value of 0xc0 or higher); if they are set, the processor will raise an illegal instruction exception. This is where the VEX prefix enters the game: instead of raising an exception, those bytes are decoded as this special multi-byte prefix. Note that in 64-bit mode all those load-segment instructions are invalid, so there is no need to share the opcode – when you encounter 0xc4 or 0xc5, you know it’s a VEX prefix, as simple as that. Unfortunately this is not the case in 32-bit mode, and since the second byte has to have a value of 0xc0 or higher (because the two most significant bits have to be set in 32 bits), the field in those corresponding bits is actually stored inverted, which means you will have to extract a few bits that represent some fields and bitwise-not them. This is seriously gross, but it seems Intel didn’t have much of a choice here. If it were up to me, I would do the same eventually, for the sake of backward compatibility, but it doesn’t make it any prettier, to be honest. And for your information, AMD pulled the same trick, but with the POP instruction (0x8f), for their new instruction sets (XOP, etc.), without full backward compatibility.

To make some order in the bits let’s have a look at the following figure:
[Figure: VEX prefix bit layout – cited from Intel]
Well, I am not going to talk about all the fields (which are somewhat similar to REX for 64 bits), just about one interesting feature. Since the prefix is now 2/3 bytes and an SSE instruction is usually at least 3 bytes, this would explode the code segment with huge instructions and certainly make the processor cry a lot while fetching them. The trick Intel used (and AMD too) was to have a field that implies which prefix byte to put virtually before the VEX prefix itself; this way we save one byte. And the same idea was used again to spare the 0x0f escape byte, or even the two bytes 0x0f 0x38 or 0x0f 0x3a, which are very common basic opcodes for SSE instructions. So if we had to encode an SSE instruction, for instance:
66 0f 38 17 c0; PTEST XMM0, XMM0 – has its first 3 bytes implied in the VEX prefix, thus the VEX version stays the same size! Kawabanga
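A hedged sketch of how a decoder can tell a VEX prefix from LDS/LES, and of the inverted field extraction (bit layouts simplified; this is not diStorm’s actual decoder code):

#include <stdint.h>

/* Returns 1 if code[0] starts a VEX prefix (rather than LDS/LES) in the given mode. */
static int isVexPrefix(const uint8_t* code, int is64bits)
{
    if (code[0] != 0xC4 && code[0] != 0xC5) return 0;
    if (is64bits) return 1;              /* LDS/LES don't exist in 64-bit mode, so it's always VEX. */
    return (code[1] & 0xC0) == 0xC0;     /* In 32-bit mode it's VEX only if both top bits are set.  */
}

/* In the 2-byte form (0xC5), the vvvv field (the extra source register) is stored inverted. */
static uint8_t vex2vvvv(const uint8_t* code)
{
    return (uint8_t)((~code[1] >> 3) & 0xF);
}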

I said in an earlier post that I am not going to support SSE5 for now, in the hope it’s gonna die. I believe CPU wars (or instruction set architecture wars, to be accurate) are bad for coders and even for end users, who can’t really enjoy those great technologies in the end.

Don’t Wait, Shoot. (KeSetEvent)

Tuesday, November 3rd, 2009

Apparently when you call down to a driver with IoCallDriver you can either wait ’till the operation is finished or not. If you wait, you will need somebody to tell you “hey dude, you can stop waiting”. But that’s trivial: you set up a completion routine that will be called once the lower driver is finished with your IRP. The problem is if you don’t check the return code from that driver and assume you should always wait. So you wait. Now what? The lower driver happened to return immediately, but did you know that? Probably not, because you’re not blocked forever – otherwise you would have noticed it immediately. And seemingly there’s no problem, because the lower driver will call your completion routine anyway, and there you will signal “hey dude, you can stop waiting”, right? It just turns out you waited for nothing and consumed some resources (locks) along the way.
That’s why you will usually see a simple test of whether the return code from IoCallDriver is STATUS_PENDING, and only then wait ’till the operation is finished, in order to make it synchronous – that’s what all the talk is about. The thing is that you still need to do that same check in the completion routine you supplied. It seems that if you simply call SetEvent (and remember, by now we know nobody is waiting on the event anymore, so why signal it in the first place), you still pay some performance penalty. And when you’re in a filter driver, for instance, you shouldn’t. And it’s bad programming practice anyway.
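A hedged sketch of the usual pattern (standard WDM boilerplate, not code from any specific driver; ‘event’, ‘Irp’ and ‘LowerDeviceObject’ are assumed to be set up by the surrounding dispatch routine):

#include <ntddk.h>

/* Completion routine: only signal if the lower driver actually returned pending. */
NTSTATUS SyncCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    UNREFERENCED_PARAMETER(DeviceObject);
    if (Irp->PendingReturned)
        KeSetEvent((PKEVENT)Context, IO_NO_INCREMENT, FALSE);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

/* Caller: wait only when IoCallDriver says the request is still pending. */
KeInitializeEvent(&event, NotificationEvent, FALSE);
IoSetCompletionRoutine(Irp, SyncCompletion, &event, TRUE, TRUE, TRUE);
status = IoCallDriver(LowerDeviceObject, Irp);
if (status == STATUS_PENDING) {
    KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
    status = Irp->IoStatus.Status;
}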

I think it’s quite clear why WaitForSingleObject is “slow”, though in our case it would be satisfied immediately, and yet… But at first I didn’t realize SetEvent is also problematic. I thought it was a matter of flagging a boolean. In some sense that’s true, but there’s more to it. You see, since somebody might be waiting on the event, you have to wake up the waiting thread, and for that you need to acquire the dispatcher lock, yield execution, etc. Now suddenly it becomes a pain, huh?

Actually, the way KeSetEvent works is quite interesting. They knew they had to satisfy waiters, so they acquire the dispatcher lock in the first place, and then they can also safely touch the event state.

Moral of the story, don’t wait if you don’t have to, just shoot!

Proxy Functions – The Right Way

Thursday, August 21st, 2008

As much as I am an Assembly freak, I try to avoid it whenever possible. It’s just something like “pick the right language for your project” and don’t use overqualified stuff. Actually, in the beginning, when I started my patch on the IPhone, I compiled a simple stub for my proxy, then fixed it manually, and only then used that code for the patch. Just to be sure we’re on the same page – a proxy function is a function that gets called instead of the original function, and once control belongs to the proxy function it may or may not call the original function.

The way most people do this proxy function technique is detour patching, which simply means we patch the first instruction (or a few, depending on the architecture) and change it to branch into our code. Now mind you that I’m messing with ARM here – iphone… The most important difference is that the return address of a function is stored in a register rather than on the stack, which, if you’re not used to it, will easily get you confused and experiencing some crashes.

So suppose my target function begins with something like:

SUB SP, SP, #4
STMFD SP!, {R4-R7,LR}
ADD R7, SP, #0xC

This prologue is very equivalent to push ebp; mov ebp, esp thing on x86, plus storing a few registers so we can change their values without harming the caller, of course. And the last thing, we also store LR (link-register), the register which stores the return address of the caller.

Anyhow, in my case, I override (detour) the first instruction to branch into my code, wherever it is. Therefore, in order for my proxy function to continue execution of the original function, I have to somehow emulate that overridden instruction and only then continue from the next instruction, as if the original patched function wasn’t touched. There are rare times when you cannot override some specific instructions, but then it only means you have to work harder and change the way your detour works (instructions that use the program counter as an operand, or branches, etc.).

Since the return address of the caller is stored in a register, we can’t override the first instruction with a branch-link (‘call’ equivalent on x86), because then we would lose the original caller’s return address. Give it a thought for a second; it’s confusing the first time, I know. An interesting point to note is that if a function doesn’t call other functions internally, it doesn’t have to store LR on the stack and later pop the PC (program counter, the IP register) off the stack, because nobody touched that register – unless the function needs around 14 registers for optimizations, instead of using local stack variables… This way you can tell which of the functions are leaves on the call graph, although it is not guaranteed.

Once we understand how the ARM architecture works we can move on. However, I have to mention that the first 4 parameters are passed in registers (R0 to R3) and the rest on the stack, so in the proxy we will have to treat the parameters accordingly. The good thing is that this ABI (Application Binary Interface) is something known to the compiler (LLVM with the GCC front-end in my case), so you don’t have to worry about it, unless you write the proxy function manually yourself.

My proxy function can be written fully in C, although it’s possible to use C++ as well, but then you can’t use all features…

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
 return b;
}

That’s my sample foo proxy function, which doesn’t do anything useful or interesting, but usually in proxies we want to change the arguments before moving on to the original function.

Once it is compiled, we can rip the code from the object or executable file, it doesn’t really matter which, and put it inside our patched file, but we are still missing the glue code. The glue code is a sequence of manually crafted instructions that allow you to use your C code within the rest of the binary file. And to be honest, this is what I really wanted to avoid in the first place. Of course, you say, “but you could write it once and then copy-paste that glue code and voila”. In a way you’re right, I can do that. But it’s bothersome and takes too much time, even that simple copy-paste. And besides, it’s enough to have one or more data objects stored following your function to force you to relocate all the references to them. For instance, you might have a string that you use in the proxy function. Now, the way ARM works, it all gets compiled as PIC (Position Independent Code), for better or worse – probably better, in our case. But then, if you want to put your glue code inside the function and before the string itself, you will have to change the offset from the current PC register to the string… Sometimes it’s just easier to see some code:

stmfd sp!, {lr} 
mov r0, #0
add r0, pc, r0
bl _strlen
ldmfd sp!, {pc}
db “this function returns my length :)”, 0

When you read the current PC, you get the current instruction’s address + 8, because of the way the pipeline works in ARM. So that’s why the offset to the string is 0. If you try to put another instruction at the end of the function, for the sake of glue code, you will have to change the offset to 4. This really gets complicated if you have more than one resource to read. Even 32-bit values are stored after the end of the function, rather than in the operand of the instruction itself, as we know it on x86.

So to complete our proxy code in C, it will have to be:

int foo(int a, int b)
{

 int (*orig_foo)(int, int) = (int (*)(int, int))<addr of original foo + 4>;
// +4 = We skip the first instruction which branches into this code!
 if (a == 1000) b /= 2;
// Emulate the real instruction we overrode, so stack is balanced before we continue with original function.
 asm("sub sp, sp, #4");
 return orig_foo(a, b);
}

This code looks more complete than before but contains a potential bug, can you spot it? Ok, I will give you a hint, if you were to use this code for x86, it would blow, though for ARM it would work well to some extent.

The bug lies in the number of arguments the original function receives. Since on ARM only the 5th argument onwards is passed on the stack, our “sub sp, sp, #4” will make some things go wrong. The stack of the original function should look as if it were running without us ever touching that function. This means that we want to push the arguments on the stack, ONLY then do the stack fix by 4, and afterwards branch to the second instruction of the original function. Sounds good, but this is not possible in C. :( It means we have to run ‘user-defined’ code between the ‘pushing-arguments’ phase and the ‘calling-function’ phase, which is actually not possible in any language I’m aware of. Correct me if I’m wrong though. So my next sentence is going to be “except Assembly”. Saved again ;)

Since I don’t want to dirty my hands with editing the binary of my new proxy function after I compile it, we have to fix the problem I just described above. This is the way to do it, ladies and gentlemen:

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
 return orig_foo(a, b);
}

void __attribute__((naked)) orig_foo(int a, int b)
{
// Emulate the real instruction we overrode, so stack is balanced before we continue with original function.
 asm("sub sp, sp, #4\nldr r12, [pc]\n bx r12\n.long <FOO ADDR + 4>");
}

The code simply fixes the stack, reads the absolute address of the original foo (again skipping its first instruction), and branches into that code. It won’t change the return address in LR, therefore when the original function is done it will return straight to the caller of orig_foo, which is our proxy function; that way we can still control the return values, if we wish to do so.

We had to use the naked attribute (__declspec(naked) in VC) so that the compiler won’t emit a prologue that would unbalance our stack again. The epilogue wouldn’t get to run anyway…

This technique will work on x86 the same way, though for branching into an absolute address, one should use: push <addr>; ret.

Bottom line, I don’t mind paying the price of a few lines of Assembly, that’s perfectly ok with me. The problem was that I had to edit the binary after compilation in order to make it ready to be put in the original binary as a patch. Besides, the Assembly code is a must if you wish to compile it without further ado, and as long as the first instruction of the function hasn’t changed, your code is good to go.

This code works well and does just what I wanted, so I thought I'd share it with you guys, as a better "infrastructure" for making proxy-function patches.

However, it could have been perfect if the compiler stored the functions in the same order you write them in the source code, so the first instruction of the patch block would be the first instruction you have to run. As it is, you might need to add another branch at the beginning of the code so it skips the non-entry code. This is really compiler dependent. GCC seems to be the best at preserving the functions' order; VC and LLVM are more problematic when optimizations are enabled. I believe I will cover this topic in the future.

One last thing: if you use -O3, or functions get inlined, the naked orig_foo function becomes part of the foo function, and then our assumption that the original function returns into our foo proxy no longer holds. So just be sure to peek at the generated code and check that everything is fine ;)
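
If you'd rather not rely on eyeballing it every time, one option (just a sketch of a suggestion, assuming GCC-style attributes) is to forbid inlining of the trampoline explicitly:

// Keep the naked trampoline out of foo even at -O3.
int __attribute__((naked, noinline)) orig_foo(int a, int b);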

My Turn on the IPhone

Friday, July 25th, 2008

I tried adding some feature to the IPhone, and I decided I would do it my way: an in-memory patch rather than an on-disk patch. The reason I went for memory patching is simple, version 2 of the IPSW contains code signing. Why should I smash my head against the wall trying to remove those code signing checks when I can easily do anything I want at runtime? Although I read somewhere that you can syscall something and disable the checks, they also said it makes the system a bit shaky… and later on I learnt there is a way to sign your own code, or existing patched files, using ldid.

The thing is, I started my coding on beta 3 of version 2, where everything worked well, and it wasn't long before my code worked as expected. Then I updated my IPhone to the final second version, and the first thing I tried was my patch, only to find out it didn't work. Now, debugging on the IPhone is really problematic; except for GDB, the other two debuggers I've heard of are not free (DataRescue's and DebuggerX).

OK, to be honest, debuggers won't even work here because you can't attach them to some of the tasks. This is because of a feature in OSX called PT_DENY_ATTACH, which simply means no debugger can attach to this task, and if something is currently attached, it gets detached… It's implemented in the ptrace function, which otherwise lacks other important features, like reading and writing to and from memory or getting registers' values, etc. Of course, there are ways to bypass those problems, and if you look well, you can find some resources about it.
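
For reference, the anti-debugging call itself is a one-liner from the process's side; a minimal sketch (standard OSX headers, not code from my patch):

#include <sys/types.h>
#include <sys/ptrace.h>

// A process calls this on itself; afterwards no debugger can attach,
// and an already attached one gets detached.
static void deny_debuggers(void)
{
 ptrace(PT_DENY_ATTACH, 0, 0, 0);
}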

Anyway, back to the story, I had to spot what was wrong in my (long long) code. After a short while I found that vm_protect failed. Another frustrating thing was spotting that failure in the first place, because I didn't have any way to print debug output (a la DebugView), no printf, nothing. I was kinda doomed; I had to crash the task in order to know that my code reached some point, and each time move the crash further along the code, which is really lame, I know. Maybe if I had more knowledge of Linux/BSD/OSX I could have tracked it down quicker. Hey, if you've got an idea how to do it better next time, please drop a line, heh?

So once I knew which code was failing, I tried to fix it, but I didn't actually know what was wrong. vm_protect returned something like 'protection error', and hell, that doesn't say much. At some point I got really desperate and used that same call on my own code block, and that failed too. I didn't know what to do, I kept playing with that shit for hours :( and nothing came to mind. Then I left it and went to sleep in a bad mood (I hate when that happens, usually I keep on trying until I make it, but it was 6am already…). Later the next day I decided to read the man page more carefully, and there's nothing special there, it only shows the parameters, the return value, bla bla, etc. By that time I was sure that Apple had touched something related to vm_protect, that was my hunch. The idea to RE the kernel and this function to see what changed from beta 3 to the final version crossed my mind more than once. But I knew I was missing something simple and I shouldn't have to go that far, after all it's a usermode API.

As stuck as I was, I googled for as much information as possible on vm_protect and other OSX code snippets. Eventually I hit something interesting that used vm_protect with another flag I didn't know existed in OSX: VM_PROT_COPY. This might finish the story for you if you already know it. Otherwise, it means that when you try to write to a page, the kernel will make a copy of that page just for the requesting task and then let you write to it. This is used in many operating systems: when some file (usually code) is loaded from disk and the OS wants to save memory, it maps the same physical page of that code/data into all the tasks which loaded that file. If you then just try to write to that page, you are forbidden, of course. Here comes COW (copy on write) to save us.
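
So the fix boiled down to adding that flag to the protection mask. A minimal sketch of the idea (not my original code; I'm assuming page-aligned addr/len and the usual Mach headers):

#include <mach/mach.h>

// Ask Mach for a private, writable copy of the (shared) page before patching it.
// VM_PROT_COPY is the flag discussed above; plain VM_PROT_WRITE on its own is
// what kept failing with the protection error.
kern_return_t make_writable_copy(vm_address_t addr, vm_size_t len)
{
 return vm_protect(mach_task_self(), addr, len, FALSE,
                   VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY);
}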

The annoying thing is that since the documentation sucks, I didn't find this VM_PROT_COPY anywhere. I even took a look in the header files, where VM_PROT_EXECUTE, for instance, is defined, and didn't see this extra flag. Only after I knew the solution did I come back to that file and find the flag declared almost at the end of it, LOL. The cool thing, which came too late, is that Cydia had some notes regarding 'how to port applications from earlier versions to the final version', and they wrote something about NX and protections, though they didn't say anything directly about this COW thingy…

It was kind of a surprise to see that I had to specify such a low level flag. I come mostly from the Windows world, where you don't have to specify anything like that when you change a page's protection and write to it, so I didn't expect it to be the case in other OSes.

Just wanted to share this frustrating story and the experience of how fun (or not) it is to code for the IPhone.

Crunching 3 More Bytes

Tuesday, February 12th, 2008

I already uploaded the new version, whose size is 213 bytes. Download here.

The trick I used this time is not very popular and I doubt any of you know it at all. A friend showed me this trick a year ago, and it only occurred to me last night to use it in Tiny PE itself.

C:\Documents and Settings\arkon>ping ragestorm.net
Pinging ragestorm.net [63.247.129.107] with 32 bytes of data:
Ctrl^c
python
>>> 107+129*256+247*256**2+63*256**3
1073185131
>>> len(str(_))
10
>>> len(“ragestorm.net”)
13
>>>

I guess you still didn’t get it :) So check this link out:
http://1073185131
The IP is converted to a decimal integer, which the networking API happily parses…
Whooooo’s the biatch now?! http://0x3ff7816b/ also works… but it’s still 10 bytes. heh
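
And if you want the conversion in one self-contained place, here's a tiny snippet (just an illustration, using the IP from the ping above):

#include <stdio.h>

int main(void)
{
 // Pack the dotted quad into one big-endian 32-bit integer, the same math as the python above.
 unsigned char ip[4] = { 63, 247, 129, 107 };
 unsigned int n = ((unsigned int)ip[0] << 24) | (ip[1] << 16) | (ip[2] << 8) | ip[3];
 printf("http://%u/\n", n); // prints http://1073185131/
 return 0;
}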

BTW – I updated diSlib64 to be able to parse Tiny PE style files well… It's the only one of all the PE tools I have that parses them. Though it still doesn't parse the forwarded export table. Will fix that in the next update ;)