Archive for September, 2007

Code Analysis #1 – Oh, Please Return

Sunday, September 30th, 2007

The first thing you do when you write a high level disassembler (in contrast with diStorm, which is a flat stream disassembler) is to start scanning from the entry point of the binary, assuming it has one. For instance, in .COM files it will be the very first byte of the binary file. For MZ or PE it’s a slightly more complicated story, yet easily achievable.

So what is this scanning really?

Well, as I have never taken a look at any high level disassembler’s source code, my answer is from scratch only. The way I did it was to disassemble from the entry point, following control flow (branches such as jmp/jcc/loop, etc.) and adding new functions’ addresses (upon encountering a call instruction) to a list. This list is processed until it is exhausted. So the algo first inserts the entry point into that list, then pops the first address and starts analyzing it. Every time it stumbles upon a new function address, it adds it to that same list. Once it has finished analyzing the current function (for example, by hitting a ret instruction), it halts the inner loop, pops the next address off the list (if one exists) and continues. The disassembled instructions and their info are stored in your favorite collection; in my case it was a dictionary (address -> instruction’s info), which you can later walk easily to print it or do anything you wish with it.
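Here is a minimal Python sketch of that scanning loop, just to make it concrete. It is not the code I actually wrote: decode() is a hypothetical toy that understands only a handful of opcodes (RET, NOP, PUSH imm32, CALL/JMP rel32, JMP/Jcc rel8) and has nothing to do with diStorm’s API, and the names scan, functions and pending are made up for the example.

```python
from collections import deque

def decode(code, off):
    """Toy decoder (hypothetical): returns (length, mnemonic, branch_target)."""
    op = code[off]
    if op == 0xC3:
        return 1, "RET", None
    if op == 0x90:
        return 1, "NOP", None
    if op == 0x68:                          # PUSH imm32
        return 5, "PUSH", None
    if op in (0xE8, 0xE9):                  # CALL rel32 / JMP rel32
        rel = int.from_bytes(code[off + 1:off + 5], "little", signed=True)
        return 5, ("CALL" if op == 0xE8 else "JMP"), off + 5 + rel
    if op == 0xEB or 0x70 <= op <= 0x7F:    # JMP rel8 / Jcc rel8
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        return 2, ("JMP" if op == 0xEB else "JCC"), off + 2 + rel
    raise ValueError(f"opcode {op:#04x} not handled by this toy decoder")

def scan(code, entry):
    instructions = {}              # address -> (length, mnemonic, target)
    functions = deque([entry])     # function addresses waiting to be analyzed
    known = {entry}                # every function address seen so far
    while functions:
        pending = [functions.popleft()]     # code paths inside the current function
        while pending:
            addr = pending.pop()
            # Follow one path until it ends or reaches already-decoded code.
            while 0 <= addr < len(code) and addr not in instructions:
                length, mnemonic, target = decode(code, addr)
                instructions[addr] = (length, mnemonic, target)
                if mnemonic == "RET":
                    break                   # end of this path
                if mnemonic == "CALL" and target not in known:
                    known.add(target)
                    functions.append(target)
                if mnemonic == "JCC":
                    pending.append(target)  # taken path, visited later
                if mnemonic == "JMP":
                    addr = target           # unconditional: continue at the target
                else:
                    addr += length          # fall through to the next instruction
    return instructions, sorted(known)

# CALL +1 ; RET ; NOP ; RET  (the call lands on the NOP at offset 6)
sample = bytes.fromhex("e801000000c390c3")
instructions, functions = scan(sample, 0)
```

Note that the only thing ending a path here is RET, which is exactly the weakness the rest of this post complains about.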

The thing is that some functions (generated by compilers, for the sake of conversation) are not always terminated by a RET instruction. It might be an IRET, and then you immediately know it’s an ISR. But that’s a simple case. Some functions end with INT3. Even that is OK. When do things get uglier? When they end with a CALL to ExitProcess (for the Win32 minded), because then your stupid scanning algo can’t determine where the function ends; now it also has to ‘know’ the IAT and determine whether the callee is ExitProcess, ExitThread or whatever other Exit API exists. So before you have even made your first move in analyzing binary code, you have to make the scanner smarter. And that’s a bummer. My goal was to try to decide where a function starts (usually easy) and where it ends. Parsing the PE and getting the IAT is no biggie, but it means that if you wanted to write a generic x86 disassembler, you’re screwed. You will have to write plugins or add-ons (whatever you name them) to extend the disassembler’s capabilities for different systems…
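To illustrate what "making it smarter" means, here is a rough Python sketch of an end-of-function check. Everything in it is an assumption for the sake of the example: the mnemonic labels, the NORETURN_APIS list, and the imports dictionary (IAT slot address -> API name), which is presumed to come from a PE parser that is not shown.

```python
# Mnemonics that end a function on their own (labels chosen for this sketch).
TERMINATORS = {"RET", "IRET", "INT3"}

# Hypothetical list of APIs that never return; a real plugin would ship
# one such list per platform.
NORETURN_APIS = {"ExitProcess", "ExitThread"}

def ends_function(mnemonic, target=None, imports=None):
    """Return True if this instruction should stop the scan of the current function.

    imports maps an IAT slot address to the imported API name; without it
    the scanner has no way to tell a noreturn call from an ordinary one.
    """
    if mnemonic in TERMINATORS:
        return True
    if mnemonic == "CALL" and imports is not None:
        return imports.get(target) in NORETURN_APIS
    return False

# Made-up IAT contents for the example.
imports = {0x402000: "ExitProcess", 0x402004: "CreateFileA"}
print(ends_function("CALL", 0x402000, imports))
print(ends_function("CALL", 0x402004, imports))
```

The imports argument is the platform-specific part, and it is the reason a generic scanner ends up needing plugins.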

But even that’s OK, because the job is the same, although the project is now much bigger. And again, it all depends on how accurate you wish to be. Personally, I try to be 99% accurate. With heuristics you cannot ask for 100%, right? :P

So tell me, you smart aleck compiler engineers out there, why the heck do you generate function code in a way that it NEVER ends?

You all know the noreturn keyword or compiler extension, which states that a function doesn’t return. Yes, that’s good for functions where the (invisible) system takes control from that point, like ExitProcess, etc. I never really understood why a programmer would want to state such a behaviour for a function. So what? Now your generated code will be optimized? To omit the RET instruction? Wow, you rock! NOT.

To be honest, ExitProcess is not really the case here; to be more accurate, I was talking about this Linux code:

00000da6 (03) 8b40 14                  MOV EAX, [EAX+0x14] 
00000da9 (05) a3 ec7347c0              MOV [0xc04773ec], EAX 
00000dae (01) c3                       RET 
00000daf (05) 68 dcb834c0              PUSH 0xc034b8dc 
00000db4 (05) e8 8b09d1ff              CALL 0xffffffffffd11744 
00000db9 (05) 68 c5a034c0              PUSH 0xc034a0c5 
00000dbe (05) e8 8109d1ff              CALL 0xffffffffffd11744 
00000dc3 (03) 0068 00                  ADD [EAX+0x0], CH 
00000dc6 (05) 0d 39c06888              OR EAX, 0x8868c039

This is some disassembled code that I got from a good friend, Saul Tamari, while he was researching some stuff in the Linux kernel. He noticed that the panic() function never returns, but this time for real. So the problem now is that while flatly disassembling the stream, you go out of synchronization and start to disassemble real code at the wrong offset. You can see in the above snippet the second call, which is followed by a zero byte. That single byte is the end-of-function marker. How nice, huh? The next instruction, a PUSH (68 00 …), is now out of sync, and it is actually the start of a new, different function.
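The desync is easy to reproduce with the eight bytes that follow the second call in the listing (starting at 0xdc3). The Python sketch below uses a hypothetical toy decoder that knows only the three encodings involved; it is not diStorm, and sweep is a made-up name for a flat, instruction-after-instruction pass.

```python
def decode(code, off):
    """Toy decoder: returns (length, text) or None if it cannot decode."""
    op = code[off]
    if op == 0x68 and off + 5 <= len(code):          # PUSH imm32
        imm = int.from_bytes(code[off + 1:off + 5], "little")
        return 5, f"PUSH 0x{imm:x}"
    if op == 0x0D and off + 5 <= len(code):          # OR EAX, imm32
        imm = int.from_bytes(code[off + 1:off + 5], "little")
        return 5, f"OR EAX, 0x{imm:x}"
    # ADD r/m8, r8 with ModRM 0x68: [EAX+disp8], CH (the only form handled here)
    if op == 0x00 and off + 3 <= len(code) and code[off + 1] == 0x68:
        return 3, f"ADD [EAX+0x{code[off + 2]:x}], CH"
    return None

def sweep(code, start, base=0):
    """Flat sweep: decode one instruction after another from 'start'."""
    out = []
    off = start
    while off < len(code):
        step = decode(code, off)
        if step is None:          # unknown or truncated instruction, stop
            break
        length, text = step
        out.append((base + off, text))
        off += length
    return out

tail = bytes.fromhex("0068000d39c06888")   # bytes at 0xdc3 in the listing

for start in (0, 1):
    print(f"starting at 0x{0xdc3 + start:x}:")
    for addr, text in sweep(tail, start, base=0xdc3):
        print(f"  {addr:08x} {text}")
```

Starting at the zero byte reproduces the bogus ADD/OR pair from the listing, while starting one byte later decodes the PUSH. The second sweep stops after that PUSH only because the sample bytes run out.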

So now tell me, how should you find this kind of noreturn function when you want to solve this puzzle with static analysis only? It is definitely not an easy question. We (Saul and I) had some ideas, but nothing 100% reliable. Performance was also an issue, which made things harder. And there are some crazy ideas, which I will cover next time.

Meanwhile, if you have any ideas, you’re more than welcome to write them here.

Some Rants

Sunday, September 23rd, 2007

Aaarg, the most annoying thing is to visit your own site and see that it is down. Then you check on register.com what’s wrong and see that the domain has expired… Afterwards, you check your inbox for the invoices from the hosting company and see that you were charged last month for the domain registration renewal. Next thing, you find yourself writing an email from some anonymous address (because mine at ragestorm wasn’t working at the time), trying to sound polite and removing the swear words on the second pass over the email before sending it.

Damn, there are some things that piss me off. Like the fact that I really want to write in this blog more frequently. There are many excuses for this. In the end I suck, and nothing else I say can make it any better. So I really should try harder. And yet, sometimes I have weird ideas to write about, and I’m not sure that my audience follows my posts, so why write them then? They are too low level, technical, or some might say boring. But hey, it IS the insanely low level blog, no? So I made up my mind that I will write about just anything (computer related) that I have on my mind. So you should expect some weird stuff. Usually I’m inspired by ideas from my daily work at my company, and sometimes by the stuff I do in my free time. Thing is that at work (please don’t laugh at me) I do SQL and .Net stuff… OK, you can really laugh now, I deserve it. ;) But that’s only temporary (“that’s what they all say”, haha).

So SQL or not, believe me, you can get your hands dirty with some of the stuff there, like transactions where you suddenly realize that there might be deadlocks because of your queries. And you start thinking about your code as if it were some multi-threaded application whose synchronization you implement on your own. It really gave me the impression that most people who write SQL belong to one of three types: 1. They don’t know shit, and therefore their queries tend to not work well or to be inefficient. 2. They do know something (I wouldn’t call it a programming language, with all due respect to the L in SQL) and manage to get their stuff to work. 3. People who really know the internals and algorithms of SQL, understand how things tend to work together, and write something good.

What I’m saying is that even in SQL, there might be some decent ‘coders’. But how many – I can’t tell. A few prolly. With all the classification I just did, you still gain experience only by sitting down and trying on your own. But that’s true regarding everything, I guess.

And about .Net, it’s really awesome. I like the way that everything is ready for use; you don’t have to waste time writing your own collections/containers algorithms (like in C, for example). That you can speak Sockets and COM with the same ease. That security is part of the system. Now I really wanna start a flame war about Java. But noooo. I just think C# is much better and more permissive; they took all of Java’s advantages, fixed the broken stuff and created a whole new, better language. Genericness? Well, less, and yet it is VMed… so screw it.

Happy new year & Hatima Tova