Archive for the ‘Code Analysis’ Category

Optimize My Index Yo

Thursday, November 26th, 2009

I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to Pascal strings in a way: you get the length, and the string doesn't have to be null-terminated; the length, though, is stored in bytes. Normally I don't look at the assembly listing of the applications I compile, but when you debug them you get to see the code the compiler generated. Since some of my functions take strings for input, but null-terminated ones, I had to copy the original string and append the null character myself. And now that I think of it, I will rewrite everything to use lengths; I don't like extra wcslen's. :)

Here is a simple usage case:

p = (PWCHAR)ExAllocatePool(SomePool, Str->Length + sizeof(WCHAR)); // Length is in bytes
if (p == NULL) return STATUS_NO_MEMORY;
memcpy(p, Str->Buffer, Str->Length);
p[Str->Length / sizeof(WCHAR)] = UNICODE_NULL; // index back in WCHAR units

I will show you the resulting assembly code, so you can judge yourself:

shr    esi, 1
xor    ecx, ecx
mov    word ptr [edi+esi*2], cx

First the compiler converts the length to WCHAR units, as I asked. Then, to use that value as an index into the Unicode string, it has to multiply it by two again to get back to the byte offset. What a waste.
This is the output of fully optimized code from VS08. Shame.

It’s silly, but this would generate what we really want:

*(PWCHAR)((PCHAR)p + Str->Length) = UNICODE_NULL;

With this fix, the extra shift and multiply are gone. I did a few more tests and it seems the dead-code removal and simplifier algorithms are not perfect when divisions appear inside pointer indexing.
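For completeness, here is roughly how the whole copy-and-terminate helper could look with the byte-offset trick. This is just a minimal sketch; the function name and pool tag are made up, and it assumes a normal kernel-mode context:

#include <ntddk.h>

// DupUnicodeZ and the pool tag are made-up names for this sketch.
PWCHAR DupUnicodeZ(PCUNICODE_STRING Str)
{
    // Length is in bytes, and is always even for a valid UNICODE_STRING.
    PWCHAR p = (PWCHAR)ExAllocatePoolWithTag(NonPagedPool,
                                             Str->Length + sizeof(WCHAR),
                                             'ZcuD');
    if (p == NULL)
        return NULL;
    memcpy(p, Str->Buffer, Str->Length);
    // Add the byte length to a byte pointer: no shr/mul pair is emitted.
    *(PWCHAR)((PCHAR)p + Str->Length) = UNICODE_NULL;
    return p;
}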

Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler emits this extra shr/mul: it can't know whether the length is odd, so it has to round it. Length >> 1 truncates an odd byte count, and multiplying back by two therefore always yields an even, WCHAR-aligned offset, whereas adding the length directly could place the null at an odd address.

Integer Promotion is Dodgy & Dangerous

Wednesday, October 28th, 2009

I know this subject, and yet it surprises me every time. Even if you know the rules of how it works and you have read K&R, it will still confuse you, and you will end up being wrong in some cases. At least, that's what happened to me. So I decided to cover the subject and give a couple of examples along the way.

Integer promotions probably happen in your code many times, and most of us are not even aware of that fact or of the way they work. For those of you who have no idea what integer promotion is, to make a long story short: "Objects of an integral type can be converted to another wider integral type (that is, a type that can represent a larger set of values). This widening type of conversion is called "integral promotion."", as MSDN puts it. Why? So that calculations can sometimes be faster, so that seamless conversions happen between different types, etc. There are exact rules in the standard for how and when it works; you should check them out on your own.

typedef enum {foo_a = -1, foo_b} foo_t;

unsigned int a = -1;
printf("%s", a == foo_a ? "true" : "false");

Can you tell what it prints?
It will print "true". Nothing special, right? It just works as we expect.
Check the next one out:

unsigned char a = -1;
printf("%s", a == foo_a ? "true" : "false");

And this time? This one results in "false", and only because the type of 'a' is unsigned char. 'a' is promoted to an int with the value 0x000000ff (the promotion is value-preserving, and there is no sign bit to extend), which is then compared to 0xffffffff (-1), yielding false, of course.
If 'a' were defined as signed char, it would be fine, since the integer promotion would sign-extend it.
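Here is a minimal standalone version of that signed case, just to illustrate:

#include <stdio.h>

enum {foo_a = -1, foo_b};

int main(void)
{
    signed char a = -1;
    // Promotion sign-extends 'a' to int -1, which equals foo_a.
    printf("%s", a == foo_a ? "true" : "false");  // prints "true"
    return 0;
}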

Another simple case:

unsigned char a = 5, b = 200;
unsigned int c = a * b;
printf("%d", c);

Any idea what the result is? I would expect it to be (200*5) & 0xff – that is, the low byte of the result, since we multiply uchars here. Wouldn't you? But then I would be wrong as well. The result is 1000. You know why? … Integer promotions, ta-da. It's not as if c = (unsigned char)(a * b); and that is what's confusing sometimes.
Let’s see some Assembly then:

movzx       eax,byte ptr [a]
movzx       ecx,byte ptr [b]
imul        eax,ecx
mov         dword ptr [c],eax

Nasty: the unsigned char variables are zero-extended into 32-bit registers (the promotion to int), the multiplication happens in 32-bit operand size, and the result is stored as-is, never truncated back to unsigned char.

Why is it dangerous? I think the answer is obvious: you trivially expect one result when you read or write the code, but in reality something different happens. You end up with a small piece of code that doesn't do what you expect, and then with an integer overflow vulnerability without even noticing. Ouch.
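To make the danger concrete, here is a classic promotion pitfall (my own extra example, assuming a 32-bit int):

#include <stdio.h>

int main(void)
{
    unsigned short a = 0xffff, b = 0xffff;
    // Both operands are promoted to *signed* int, so the multiplication
    // is 65535 * 65535 = 4294836225, which exceeds INT_MAX: signed
    // overflow, i.e. undefined behavior, although every declared type
    // in sight is unsigned.
    unsigned int c = a * b;
    printf("%u\n", c);
    return 0;
}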

Update: Thanks to Daniel, I changed my erroneous (second) example to what I really had in mind when I wrote this post.

diStorm3 – Call for Features

Friday, October 2nd, 2009

[Update diStorm3 News]

I have been working more and more on diStorm3 recently. The core code is already written, and it works great. I am still not going to talk about the structure itself that diStorm3 uses to describe instructions. There are two APIs now: the old one, which takes a stream and formats it to text, and a newer one, which takes a stream and decodes it into structures. The latter is much faster. Unlike diStorm64, where the text formatting was coupled with the decoding code, here it is totally separated. For example, if you want to support AT&T syntax, you can do it in a couple of hours or less, really. I don't like AT&T syntax, hence I am not going to implement it. I bet many people still can't read it without getting confused…

Hereby, I am asking you guys to come up with ideas for diStorm3. So far I have received some new ideas from people, which I am going to implement, such as:
1) You will be able to tell the decoder to stop on any flow-control instruction.
2) Instructions are going to be categorized: flow control, data control, string instructions, I/O, etc. (To be honest, I am still not totally sure about this one.)
3) Helper macros to extract data references. Since diStorm3 outputs structures, it's really easy to know whether there's a data reference and what its address is, so some macros will help with that work (see the sketch after this list).
4) Code references: whether an instruction continues to the next instruction, continues to a target according to a condition, or always jumps/calls.
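
To give a feel for items 1, 3 and 4, here is a purely hypothetical sketch of how such a structure-output API with helper macros might be used. None of these names are the real diStorm3 interface (which isn't released yet); they are placeholders only:

#include <stdio.h>

/* Hypothetical decoded-instruction record; not the real diStorm3 type. */
typedef struct {
    unsigned long addr;     /* instruction address */
    int isFlowControl;      /* items 1/4: jmp/jcc/call/ret/etc. */
    unsigned long dataRef;  /* item 3: absolute data reference, 0 if none */
} INSN;

/* Item 3: helper macros over the output structures. */
#define INSN_HAS_DATA_REF(i) ((i)->dataRef != 0)
#define INSN_DATA_REF(i)     ((i)->dataRef)

/* Walk decoded instructions, dumping data references and stopping at
 * the first flow-control instruction (item 1). */
void walk(const INSN *insns, int count)
{
    for (int n = 0; n < count; n++) {
        if (INSN_HAS_DATA_REF(&insns[n]))
            printf("%08lx -> data %08lx\n",
                   insns[n].addr, INSN_DATA_REF(&insns[n]));
        if (insns[n].isFlowControl)
            break; /* the decoder itself could also stop here */
    }
}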

I am looking forward to hearing more suggestions from you guys. Please make sure you are talking about disassembler features, not about other layers which use the disassembler.

Just wanted to let you know that diStorm3 is going to be dual-licensed, GPL and commercial. diStorm64 is deprecated and I am not going to touch it anymore, though it's still licensed as BSD, of course.

VML + ANI ZERT Patches

Tuesday, February 3rd, 2009

It is time to release an old presentation about the VML and ANI vulnerabilities that were patched by ZERT. It explains the vulnerabilities and how they were closed. It is very technical; Assembly knowledge is required if you want to really enjoy it. I also gave a talk using this presentation at CCC 2007. It so happened that I wrote the patches, with the extensive help of the team, of course.

ZERT Patches.ppt

Code Analysis #1 – Oh, Please Return

Sunday, September 30th, 2007

The first thing you do when you write a high-level disassembler (in contrast with diStorm, which is a flat-stream disassembler) is start scanning from the entry point of the binary, assuming you have one. For instance, in .COM files it will be the very first byte in the binary file; for MZ or PE it's a more complicated story, yet easily achievable.

So what is this scanning really?

Well, as I have never looked at any high-level disassembler's source code, my answer is from scratch only. The way I did it was to disassemble from the entry point, follow the control flow (branches, such as jmp/jcc/loop, etc.) and recursively add newly encountered functions' addresses (upon hitting a call instruction) to a list. That list is processed until it is exhausted. So there's some algorithm that first inserts the entry point into the list, then pops the first address and starts analyzing it. Every time it stumbles upon a new function address, it adds it to that same list. And once it has finished analyzing the current function (for example, by hitting the ret instruction), it halts the inner loop, pops the next address off the list (if one exists) and continues again. The disassembled instruction info is stored in your favorite collection; in my case it was a dictionary (address -> instruction info), which you can later walk easily and print, or do anything you wish with. A minimal sketch of this loop follows.
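
Here is that sketch. decode_insn() and store() are placeholders standing in for a real decoder (diStorm, say) and for the address->info dictionary; conditional branches are simplified away:

/* Assumed decoder output; the fields are placeholders, not a real API. */
typedef struct {
    int size;                       /* instruction length in bytes */
    int is_call, is_ret, is_jmp;    /* classification flags */
    unsigned long target;           /* branch/call target, if any */
} INSN;

INSN decode_insn(unsigned long addr);              /* assumed decoder */
void store(unsigned long addr, const INSN *insn);  /* assumed addr->info map */

#define MAX_FUNCS 1024
static unsigned long worklist[MAX_FUNCS], seen[MAX_FUNCS];
static int wl_count, seen_count;

static void push_func(unsigned long addr)
{
    for (int i = 0; i < seen_count; i++)
        if (seen[i] == addr) return;  /* already queued or analyzed */
    seen[seen_count++] = addr;
    worklist[wl_count++] = addr;
}

void scan(unsigned long entry_point)
{
    push_func(entry_point);
    while (wl_count > 0) {                 /* until the list is exhausted */
        unsigned long addr = worklist[--wl_count];
        for (;;) {                         /* analyze one function */
            INSN in = decode_insn(addr);
            store(addr, &in);
            if (in.is_call)
                push_func(in.target);      /* new function discovered */
            if (in.is_ret)
                break;                     /* end of this function */
            addr = in.is_jmp ? in.target : addr + in.size;
        }
    }
}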

The thing is that some functions (generated by compilers, for the sake of conversation) are not always terminated by the RET instruction. It might be an IRET, and then you immediately know it's an ISR. But that's a simple case. Some functions end with INT3. Even that is OK. When do things get uglier? When they end with a CALL to ExitProcess (for the Win32-minded); then your naive scanning algorithm can't determine where the function ends, because now it also has to 'know' the IAT and determine whether the callee was ExitProcess, ExitThread or whatever Exit API there is. So before you have even made your first move analyzing binary code, you have to make the algorithm smarter. And that's a bummer. My goal was to try to decide where a function starts (usually easy) and where a function ends. Parsing the PE and getting the IAT is no biggie, but it means that if you wanted to write a generic x86 disassembler, you're screwed. You will have to write plugins or add-ons (whatever you name them) to extend the disassembler's capabilities for different systems…
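
For the ExitProcess case, that 'smarter' check boils down to something like the following sketch, where iat_name_for() is an assumed helper that maps a call target to an imported symbol name through the parsed IAT:

#include <string.h>

const char *iat_name_for(unsigned long call_target);  /* assumed PE helper */

/* Treat a CALL to a known terminating API as an end of function. */
int call_never_returns(unsigned long call_target)
{
    static const char *noreturn_apis[] = {
        "ExitProcess", "ExitThread", NULL
    };
    const char *name = iat_name_for(call_target);
    if (name == NULL)
        return 0;  /* not an import we know about */
    for (int i = 0; noreturn_apis[i] != NULL; i++)
        if (strcmp(name, noreturn_apis[i]) == 0)
            return 1;
    return 0;
}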

But even that's OK, because the job is the same, although the project is now much bigger. And again, it all depends on how accurate you wish to be. Personally, I try to be 99% accurate; with heuristics you can't ask for 100%, right? :P

So tell me, you smart-aleck compiler engineers out there, why the heck do you generate function code in a way that it NEVER ends?

You all know the noreturn keyword or compiler extension, which states that the function doesn't return. Yes, that's good for functions where the (invisible) system takes control from that point on, like ExitProcess, etc. I never really understood why a programmer would want to state such behaviour for a function. So what? Now your generated code will be optimized? To omit the RET instruction? Wow, you rock! NOT.

To be honest, ExitProcess is not really the case here; to be more accurate, I was talking about this Linux code:

00000da6 (03) 8b40 14                  MOV EAX, [EAX+0x14] 
00000da9 (05) a3 ec7347c0              MOV [0xc04773ec], EAX 
00000dae (01) c3                       RET 
00000daf (05) 68 dcb834c0              PUSH 0xc034b8dc 
00000db4 (05) e8 8b09d1ff              CALL 0xffffffffffd11744 
00000db9 (05) 68 c5a034c0              PUSH 0xc034a0c5 
00000dbe (05) e8 8109d1ff              CALL 0xffffffffffd11744 
00000dc3 (03) 0068 00                  ADD [EAX+0x0], CH 
00000dc6 (05) 0d 39c06888              OR EAX, 0x8868c039

This is some disassembled code that I got from a good friend, Saul Tamari, while he was researching some stuff in the Linux kernel. He noticed that the panic() function never returns, but this time for real. The problem is that while flatly disassembling the stream, you fall out of synchronization and start disassembling real code at the wrong offset. You can see in the snippet above the second CALL, which is followed by a zero byte. That single byte is the end-of-function marker here. How nice, huh? The next real instruction, PUSH (68 00 …), is now out of sync and is actually considered part of a new, different function.

So now tell me, how should you find this kind of noreturn function when you want to solve this puzzle using static analysis only? It is definitely not an easy question. We (Saul and I) had some ideas, but nothing 100% reliable. Performance was also an issue, which made things harder. And there are some crazy ideas, which I will cover next time.

Meanwhile, if you have any ideas, you're more than welcome to write them here.