Archive for the ‘Optimization’ Category

Optimize My Index Yo

Thursday, November 26th, 2009

I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to pascal strings in a way, you got the length and the string doesn’t have to be null terminated, the length though, is stored in bytes. Normally I don’t look at the assembly listing of the application I compile, but when you get to debug it you get to see the code the compiler generated. Since some of my functions use strings for input but as null terminated ones, I had to copy the original string to my own copy and add the null character myself. And now that I think of it, I will rewrite everything to use lengths, I don’t like extra wcslen’s. :)

Here is a simple usage case:

p = (PWCHAR)ExAllocatePool(SomePool, Str->Length + sizeof(WCHAR));
if (p == NULL) return STATUS_NO_MEMORY;
memcpy(p, Str->buffer, Str->Length);
p[Str->Length / sizeof(WCHAR)] = UNICODE_NULL;

I will show you the resulting assembly code, so you can judge yourself:

shr    esi,1
xor    ecx,ecx
mov  word ptr [edi+esi*2],cx

One time the compiler converts the length to WCHAR units, as I asked. Then it realizes it should take that value and use it as an index into the unicode string, thus it has to multiply the index by two, to get to the correct offset. It’s a waste-y.
This is the output of a fully optimized code by VS08, shame.

It’s silly, but this would generate what we really want:

*(PWCHAR)((PWCHAR)p + Str->Length) = UNICODE_NULL;

With this fix, this time without the extra div/mul. I just did a few more tests and it seems the dead-code removal and the simplifier algorithms are not perfect with doing some divisions inside the indexing for pointers.

Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler does this extra shr/mul. The reason is that the compiler can’t know whether the length is odd, thus it has to round it.

It’s Vexed :)

Tuesday, November 17th, 2009

In the last few week I’ve been working on diStorm to add the new instruction sets: AVX and FMA. You can find lots of information about them on the inet. In a brief, the big advantages, are support of 256 bit registers, called now YMM’s (their low halves are XMM’s) and also support for AES built in, you have a few instruction to do small block encryption and decryption, really sweet. I guess it will help some security companies out there to boost stuff. Also the main feature behind these instruction sets is the 3 registers operands. So now you are not stuck with 2 registers per instruction, you can have up to four sometimes. This is good because you have two source operands and a destination operand, which means it saves you other instructions (to move or backup registers) and you don’t have to ruin your dest-src operand like in the old sets. Almost forgot to mention FMA itself, which is fused multiply-add instructions, so you can do two operations at once, like A*B+C, etc.

I wanted to talk about the VEX prefix itself. It’s really a new design and approach to prefixes that was never seen before. And the title of this post says it all, it’s really annoying.
The VEX (Vector Extension) prefix, is a multi-byte prefix for a change. It can be either 2 or 3 bytes. If you take a look at the one byte opcodes map, all of them are taken. Intel was in the need of a new unused byte, which didn’t really exist. What they did instead, was to share two existing opcodes, for each prefix. The sharing works in a special way, that let them know if you meant to use the original instruction or the new prefix. The chosen instructions are LDS (0xc4) and LES (0xc5). When you examine the second byte of these instructions, the byte upon which the new information is extracted from, you can learn that the most significant two bits can’t be set together (I.E: the value of 0xc0 or higher). However, if they are set, the processor will raise an illegal instruction exception. This is where the VEX prefix enters into the game. Instead of raising an exception they will be decoded as this special multi-byte prefix. Note that in 64 bits mode, all those Load-Segment instruction are invalid, so there is no need for sharing the opcode. When you encounter 0xc4 or 0xc5, you know it’s a VEX prefix, as simple as that. Unfortunately this is not the case in 32 bits mode, and since the second byte has to be with a value higher than 0xc0 (because the two most significant bits have to be set in 32 bits), the field in these corresponding bits is inverted actually, which means you will have to extract a few bits that represent some fields and bitwise-not them. This is seriously gross, but it seems Intel didn’t have much of a choice here. If it were up to me, I would do the same eventually, for the sake of backward compatibility, but it doesn’t make it any prettier to be honest. And for your information, AMD pulled the same trick but with the POP instruction (0×8f) for their new instruction sets (XOP, etc), without full backward compatibility.

To make some order in the bits let’s have a look at the following figure:
vexprefix
Cited Intel
Well, I am not going to talk about all fields, (which somewhat are similar to REX for 64 bits) but just about one interesting feature. Since now the prefix is 2/3 bytes and usually an SSE instruction is at least 3 bytes, this will explode the code segment with huge instruction, and certainly gonna make the processor cry a lot to fetch instructions. The trick that was used by Intel (and AMD too) was to have a field that will imply which prefix byte to put virtually before the VEX prefix itself, so this way we saved one byte. And the same idea was used again to spare the 0×0f escape byte or even two bytes of 0×0f, 0×38 or 0×0f, 0×3a which are very common basic opcodes for SSE instructions. So if we had to use an SSE instruction, for instance:
66 0f 38 17 c0; PTEST XMM0, XMM0 – has first 3 bytes that can be implied in the VEX prefix, thus it stays the same size! Kawabanga

I talked in an earlier post that I am not going to support SSE5 as for now, in the hope it’s gonna die. I believe CPU (or instruction set architectures, to be accurate) wars are bad for the coders and even for end users who can’t really enjoy those great technologies eventually.

Don’t Wait, Shoot. (KeSetEvent)

Tuesday, November 3rd, 2009

Apparently when you call down a driver with IoCallDriver you can either wait ’till the operation is finished or not. If you wait, you will need somebody to tell you “hey dude, you can stop waiting”. But that’s trivial, you set up a completion routine that will be called once the lower driver is finished with your IRP. The problem is if in some cases you don’t check the return code from that driver, and you assume you should always wait. So you Wait. Now what? Now the lower driver suddenly returns immediately, but did you know that? Probably not, cause you’re not blocked forever, otherwise you would have noticed it immediately. However, there’s seemingly no problem, because the lower driver will call your completion routine anyway indirectly and there you will signal “hey dude, you can stop waiting”, right? Therefore, it turns out you waited for nothing and just consumed some resources (locks).
That’s why usually you will see a simple test to see if the return code from the IoCallDriver is STATUS_PENDING, and only then you will wait ’till the operation is finished, in order to make it synchronized, that’s all the talk about. The thing is that you still need to do that same check in the completion routine you supplied. It seems that if you simply call SetEvent (and remebmer, now we know nobody is waiting on the event anymore, so why signaling it from the beginning), you still cause some performance penalty. And when you’re in a filter driver, for instance, you shouldn’t. And it’s a bad practice programming anyway.

I think it’s quite clear why WaitForSingleObject is “slow”, though in our case it will be immediately satisfied and yet… But I didn’t realize SetEvent is also problematic at a first thought. I thought it was a matter of flagging a boolean. In some sense, it’s true, there’s more to it. You see, since somebody might be waiting on the event, you will have to wake up the waiting thread and for that you need to lock the dispatcher-lock, and to yield execution, etc. Now suddenly it becomes a pain, huh?

Actually it’s quite interesting the way the KeSetEvent works. They knew they had to satisfy waiters, so they acquire the dispatcher-lock in the first place, and then they can also safely touch the event-state.

Moral of the story, don’t wait if you don’t have to, just shoot!

Proxy Functions – The Right Way

Thursday, August 21st, 2008

As much as I am an Assembly freak, I try to avoid it whenever possible. It’s just something like “pick the right language for your project” and don’t use overqualified stuff. Actually, in the beginning, when I started my patch on the IPhone, I compiled a simple stub for my proxy and then fixed it manually and only then used that code for the patch. Just to be sure about something here – a proxy function is a function that gets called instead of the original function, and then when the control belongs to the proxy function it might call the original function or not.

The way most people do this proxy function technique is using detour patching, which simply means, that we patch the first instruction (or a few, depends on the architecture) and change it to branch into our code. Now mind you that I’m messing with ARM here - iphone… However, the most important difference is that the return address of a function is stored on a register rather than in the stack, which if you’re not used to it – will get you confused easily and experiencing some crashes.

So suppose my target function begins with something like:

SUB SP, SP, #4
STMFD SP!, {R4-R7,LR}
ADD R7, SP, #0xC

This prologue is very equivalent to push ebp; mov ebp, esp thing on x86, plus storing a few registers so we can change their values without harming the caller, of course. And the last thing, we also store LR (link-register), the register which stores the return address of the caller.

Anyhow, in my case, I override (detour) the first instruction to branch into my code, wherever it is. Therefore, in order my proxy function to continue execution on the original function, I have to somehow emulate that overriden instruction and only then continue from the next instruction as if the original patched function wasn’t touched. Although, there are rare times when you cannot override some specific instructions, but then it means you only have to work harder and change the way your detour works (instructions that use the program counter as an operand or branches, etc).

Since the return address of the caller is stored onto a register, we can’t override the first instruction with a branch-link (‘call’ equivalent on x86). Because then we would have lost the original caller’s return address. Give it a thought for a second, it’s confusing in the first time, I know. Just an interesting point to note that it so happens that if there’s a function which don’t call internally to other functions, it doesn’t have to store LR on the stack and later pop the PC (program-counter, IP register) off the stack, because nobody touched that register, unless the function needs around 14 registers for optimizations, instead of using local stack variables… This way you can tell which of the functions are leaves on the call graph, although it is not guaranteed.

Once we understand how the ARM architecture works we can move on. However, I have to mention that the 4 first parameters are passed on registers (R0 to R3) and the rest on the stack, so in the proxy we will have to treat the parameters accordingly. The good thing is that this ABI (Application-Binary-Interface) is something known to the compiler (LLVM with GCC front-end in my case), so you don’t have to worry about it, unless you manually write the proxy function yourself.

My proxy function can be written fully in C, although it’s possible to use C++ as well, but then you can’t use all features…

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
}

That’s my sample foo proxy function, which doesn’t do anything useful nor interesting, but usually in proxies, we want to change the arguments, before moving on to the original function.

Once it is compiled, we can rip the code from the object or executable file, doesn’t really matter, and put it inside our patched file, but we are still missing the glue code. The glue code is a sequence of manually crafted instructions that will allow you to use your C code within the rest of the binary file. And to be honest, this is what I really wanted to avoid in first place. Of course, you say, “but you could write it once and then copy paste that glue code and voila”. So in a way you’re right, I can do it. But it’s bothersome and takes too much time, even that simple copy paste. And besides it is enough that you have one or more data objects stored following your function that you have to relocate all the references to them. For instance, you might have a string that you use in the proxy function. Now the way ARM works it is all get compiled as PIC (Position-Independent-Code) for the good and bad of it, probably the good of it, in our case. But then if you want to put your glue code inside the function and before the string itself, you will have to change the offset from the current PC register to the string… Sometimes it’s just easier to see some code:

stmfd sp!, {lr} 
mov r0, #0
add r0, pc, r0
bl _strlen
ldmfd sp! {pc}
db “this function returns my length :) ”, 0

 When you read the current PC, you get that current instruction’s address + 8, because of the way the pipeline works in ARM. So that’s why the offset to the string is 0. Trying to put another instruction at the end of the function, for the sake of glue code, you will have to change the offset to 4. This really gets complicated if you have more than one resource to read. Even 32 bits values are stored after the end of the function, rather than in the operand of the instruction itself, as we know it on the x86.

So to complete our proxy code in C, it will have to be:

int foo(int a, int b)
{

 int (*orig_code)(int, int) = (int (*)(int, int))<addr of orig_foo + 4>; 
// +4 = We skip the first instruction which branches into this code!
 if (a == 1000) b /= 2;
// Emulate the real instruction we overrode, so stack is balanced before we continue with original function.
 asm(“sub sp, sp, #4″);
 return orig_foo(a, b);
}

This code looks more complete than before but contains a potential bug, can you spot it? Ok, I will give you a hint, if you were to use this code for x86, it would blow, though for ARM it would work well to some extent.

The bug lies in the number of arguments the original function receives. And since on ARM, only the 5th argument is passed through the stack, our “sub sp, sp, #4″ will make some things go wrong. The stack of the original function should be as if it were running without we touched that function. This means that we want to push the arguments on the stack, ONLY then, do the stack fix by 4, and afterwards branch to the second instruction of the original function. Sounds good, but this is not possible in C. :( cause it means we have to run ‘user-defined’ code between the ‘pushing-arguments’ phase and the ‘calling-function’ phase. Which is actually not possible in any language I’m aware of. Correct me if I’m wrong though. So my next sentence is going to be “except Assembly”. Saved again ;)

Since I don’t want to dirty my hands with editing the binary of my new proxy function after I compile it, we have to fix that problem I just desribed above. This is the way to do it, ladies and gentlemen:

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
 return orig_foo(a, b);
}

void __attribute__((naked)) orig_foo(int a, int b)
{
// Emulate the real instruction we overrode, so stack is balanced before we continue with original function.
 asm(“sub sp, sp, #4\nldr r12, [pc]\n bx r12\n.long <FOO ADDR + 4>”);
}

The code simply fixes the stack, reads the address of the original absolute foo address, again skipping the first instruction, and branches into that code. Though, it won’t change the return address in LR, therefore when the original function is over, it will return straight to the caller of orig_foo, which is our proxy function, that way we can still control the return values, if we wish to do so.

We had to use the naked attribute (__declspec(naked) in VC) so that the compiler won’t put a prologue that will unbalance our stack again. In any way the epilogue wouldn’t get to run…

This technique will work on x86 the same way, though for branching into an absolute address, one should use: push <addr>; ret.

In the bottom line, I don’t mind to pay the price for a few code lines in Assembly, that’s perfectly ok with me. The problem was that I had to edit the binary after compilation in order to fix it so it’s becoming ready to be put in the original binary as a patch. Besides, the Assembly code is a must, if you wish to compile it without further a do, and as long as the first instruction of the function hasn’t changed, your code is good to go.

This code works well and just as I really wanted, so I thought so share it with you guys, for a better “infrastructure” to make proxy function patches.

However, it could have been perfect if the compiler would have stored the functions in the same order you write them in the source code, thus the first instruction of the block would be the first instruction you have to run. Now you might need to add another branch in the beginning of the code so it skips the non-entry code. This is really compiler dependent. GCC seems to be the best in preserving the functions’ order. VC and LLVM are more problematic when optimizations are enabled. I believe I will cover this topic in the future.

One last thing, if you use -O3, or functions inline, the orig_foo naked function gets to be part of the foo function, and then the way we assume the original function returns to our foo proxy function, won’t happen. So just be sure to peek at the code so everything is fine ;)

My Turn on the IPhone

Friday, July 25th, 2008

I tried adding some feature to the IPhone, and therefore I decided that I will do it my way, an in-memory patch, rather than on-disk patch. The reason I went for memory patching is simple, version 2 of IPSW contains code signing. Why should I smash my head against the wall trying to remove that code signing checks where I can easily do anything I want in runtime? Although, I read somewhere that you can sys-call something and disable the checks, they also said it makes the system a bit shaky… and later on I learnt there is a way to sign your own code, or existing patched files using ldid.

The thing is, I started my coding on beta 3 of version 2, and there everything worked well and wasn’t long time before my code work as expected. Then I updated my IPhone to the final second version and first thing I tried was my patch to know it doesn’t work. Now since there are no debuggers for the IPhone are really problematic, and except GDB, the other two I heard of are not free (DataRescue’s and DebuggerX).

Ok, to be honest, debuggers won’t even work because you can’t attach them to some of the tasks, this is since there is a new feature in OSX called ‘PT_DENY_ATTACH’, which simply means, no debugger can attach to this task, and if there is something currently attached, then detach it… This feature is implemented in the ptrace function, which lacks other important features, like reading and writing fro and to memory or getting registers’ values, etc. Of course, there are other ways to bypass those problems, and if you look well, you can find some resources about it.

Anyway, back to the story, I had to spot in my (long long) code what was wrong. After some short time I found that vm_protect failed. Another frustrating thing was to spot that failure in my code, because I didn’t have any way to print to debug (ala DebugView) or printf, or anything. I was kinda doomed, I had to crash the task in order to know that my code reached some point, and each time move the crash further a long the code, this is really lame, I know. Maybe if I had more knowledge with Linux/BSD/OSX I could track it down quicker. Hey, if you still got an idea how to do it next time, please drop a line, heh?

So once I knew the failing code, I tried to fix it, but actually I didn’t know what was wrong. The vm_protect returned something like ‘protection error’, and hell this doesn’t say much. I got really crazy at some point, that I used that same call on my own code block, and that failed too. I didn’t know what to do, I kept playing with that shit for hours :( and nothing came up to my mind. Then I left it and went to sleep with a bad mood (I hate when it happens, usually I keep on trying until I make it, but it was 6am in the morning already…) So later the next day, I decided I will read more in the MAN, and nothing special there, it only shows the parameters, the return value, bla bla, etc. By that time, I was sure that Apple touched something in the code related to vm_protect somehow, that was my hunch. The idea to RE the kernel and this function to see what was changed from beta 3 to the final version crossed my mind, not once. But I knew that I was missing something simple and I should not go that far, after all it’s a usermode API.

As stuck as I was, I googled for as much as possible information on this vm_protect and other OSX code snippets. Eventually I hit something interesting that used vm_protect and used another flag that I didn’t know that exists in OSX. The flag name is ’vm_prot_copy’. This might finish the story for you if you know it. Otherwise, it means that when you try to write to a page, it will make a copy of that same page particulary for the requesting task and then let you write to it. This is used in many operating systems, when some file (code usually) is being loaded from disk, and the OS wants to optimize memory, it maps the same physical page of that code/data to all tasks which loaded that file. Then if you just want to write to that page, you are forbidden, of course. Here comes the COW (copy on write ) to save us.

The annoying thing is that since the documentation sucks I didn’t find this vm_prot_copy anywhere. I even took a look in the header files, where the ‘vm_prot_execute’ for instance, was defined, and didn’t see this extra flag. Only after I knew the solution I came back to that file again and I found this flag to be declared almost in the end of the file, LOL. The cool thing, which came too late, that Cydia had some notes regard ‘how to port applications from earlier versions to final version’ and they wrote something about NX and protections, though they didn’t say anything directly about this COW thingy…

It was kinda a surprise to see that I had to specify such a low level flag. As I come from the Windows world mostly, there you don’t have to specify such a thing when you try to change some page’s protection and write to it. Therefore I didn’t expect it to be the case in other OS’s.

Just wanted to share this frustrating story and the experience of how fun (or not) it is to code for the IPhone.

Crunching 3 More Bytes

Tuesday, February 12th, 2008

I already uploaded the new version, whose size is 213 bytes. Download here.

The trick I used this time is very not popular and I doubt if any of you know it at all. A friend showed me this trick a year a go and it only occurred to me yester night to use it in Tiny PE itself.

C:\Documents and Settings\arkon>ping ragestorm.net
Pinging ragestorm.net [63.247.129.107] with 32 bytes of data:
Ctrl^c
python
>>> 107+129*256+247*256**2+63*256**3
1073185131
>>> len(str(_))
10
>>> len(“ragestorm.net”)
13
>>>

I guess you still didn’t get it :) So check this link out:
http://1073185131
The IP is converted to a decimal integer which the Net API knows and parses it well…
Whooooo’s the biatch now?! http://0×3ff7816b/ also works… but it’s still 10 bytes. heh

 BTW – I updated diSlib64 to be able to parse Tiny PE style files well…It’s the only thing that parses them from all the PE tools I got. Though it still doesn’t parse the forwarded export table. Will fix that in next update ;)

TinyPE NG

Tuesday, January 15th, 2008

I rewrote TinyPE and just got to 240 bytes!!!!!1!!11!!11!!1! * 9**9**9

Downloading a file from the .net (specifically my own site) and running it, while the strings in the .exe must be encrypted in someway. You can find more information by googling the TinyPE Challenge.

Holy shit, I’m kinda excited myself, it was my goal and I have just reached it after a few days of hard work. Now I gotta shave one more byte and then I’m done :)

 I will release the source once I am finished. I do new stuff and again no tools can read my .exe file, not even my own diSlib64… bammer.

Stay Tuned :) blee

SQL: Passing Array

Friday, January 4th, 2008

Many times you find yourself with this damned task, to pass an array to your SQL code (whether you supply the code, or whether it’s a stored procedure doesn’t really matter). You need to pass an array of elements, usually Id’s, but it’s absolutely up to you. In my app I had to pass to my stored procedure an array of Id’s. Each ID was a binary(20). Since I use MSSQL 2005, I could use XML to accomplish this task. So my stored procedure receieves a parameter of Xml:

CREATE PROCEDURE usp_MyProc
   @ids XML
AS
BEGIN
    SELECT f1, f2 FROM @ids.nodes(‘/id’) T(c) JOIN MyTable ON ID = T.c.value(‘.’, ‘MyIdType’);
END

Invoking it will be:

DECLARE @x AS XML;
SET @x = “<id>aa</id><id>bb</id>”;
EXEC usp_MyProc @x

Now the Ids in @x must be encoded in Base64, because MyIdType is Binary(20). So everything is cool now, except one thing, performance sucks big time. Now I was wondering what is the difference between attributed values and the node itself:
<id value=”id goes here”/>
( SELECT f1, f2 FROM @ids.nodes(‘/id’) T(c) JOIN MyTable ON ID = T.c.value(‘value’, ‘MyIdType’); )
and
<id>id goes here</id>

Apparently when only doing a test of parsing the Xmls input, they are almost running at the same speed, although the attributed way is a bit faster. But when you take the result (that’s the converted id as binary) and insert it into a temp table, such as: DECLARE @tmp AS TABLE (ID MyIdType), the attributed Xml seems to perform 2X faster than the nodes format. Interesting ah?

Even so, the performance was still bad, and I had to do something better, I decided to get rid off the Xml and Base64 stuff, they take too much time for SQL to evaluate. My first attempt was to convert from an input of VARCHAR directly to Binary, but without success, the conversion won’t be done well, I got my binary stream as 0×4142 if I were to supply an input of “ab” instead of a right value of 0xab…Then I looked into this substring builtin function and found out (thanks to a friend) that it can operate on varbinary types as well! Imagine that. This meant that from now on I can supply the ids array as one long binary input as varbinary…

The code is now:

DECLARE @P AS INT;
SET @P = 1;
DECLARE @L AS INT;
SET @L = LEN(@ids)+1;
WHILE @P < @L
  SELECT f1, f2 FROM MyTable WHERE ID = SUBSTRING(@ids, @P, 20);
  SET @P = @P + 20;
END

Voila. This seemed to be twice faster than attributed Xml, thus we saved string parsing and Base64 decoding and we even eliminated the use of JOIN. Yet, we do the parsing of the binary stream ourself, and it a bit blows the code with more lines, but it’s still simple code and the performance is much better.

In the beginning I didn’t realize why the above code was faster than a single select query with Xml, but I then understood that SQL scans the tables using indexing, of course. So even if it parses the Xml each time for the next node, the select query happens in reality more than a single time (as much as there is Id nodes to read in), so breaking the single select and putting it in the while statement didn’t degrade performance, but only boosted it with binary parsing. However, binary parsing is not the right phrase to use, because we only “slice” (as in Python :) ) the binary input stream to smaller chunks which doesn’t do any parsing anyways. The only disadvantage I found to this way of passing arrays, was that I cannot pass records/structures as in Xml… It might require more work then and ugly parsing for integers etc…

Pop Count

Tuesday, June 5th, 2007

Population Count is counting the number of bits set to one in a word (or any sized integer). I found pop counting as one of the most fascinating small algorithms in the bit twiddling category. There are so many uses for pop counting. Though, rarely, I find myself use them, to make stuff compact, like bit-arrays, special bit-flags which represent collections, etc… There are also popular algorithms which use pop counting in error-correcting codes and sound/image processing.

Here’s my favorite implementation:

int popcount(unsigned int x)
{
int c = 0;
for (; x > 0; x &= x -1) c++;
return c;
}

The whole concept is behind the mere formula: x &= x -1. But I will leave it for you to figure it out.

Speaking of pop count, Imri and I even used this same implementation above in the NULL Terminated Strip. I even hinted for this algorithm. And here’s a real 32 bits bugfree implementation (that strip has the bugs on purpose):

XCHG EBX, EAX
XOR EAX, EAX
OR EBX, EBX
JZ END
A:
LEA ECX, [EBX-1]
INC EAX
AND EBX, ECX
JNZ A
END:

Of course, most implementation are branch-free code, for the sake of speed. However - Lately, I added the SSE4 support to diStorm, and among the new instructions I found out there is a single instruction for it all: popcnt. :)