diStorm3 is ready for the masses! ![]()
- if you want to maximize the information you get from a single instruction; Structure output rather than text, flow control analysis support and more!
Check it out now at its new google page.
Good luck!
diStorm3 is ready for the masses! ![]()
- if you want to maximize the information you get from a single instruction; Structure output rather than text, flow control analysis support and more!
Check it out now at its new google page.
Good luck!
There are still hippos around us, beware:

Kernel heap overflow.
BITMAPINFOHEADER bmih = {0};
bmih.biClrUsed = 0×200;
HGLOBAL h = GlobalAlloc(GMEM_FIXED, 0×1000);
memcpy((PVOID)GlobalLock(h), &bmih, sizeof(bmih));
GlobalUnlock(h);
OpenClipboard(NULL);
SetClipboardData(CF_DIBV5, (HANDLE)h);
CloseClipboard();
OpenClipboard(NULL);
GetClipboardData(CF_PALETTE);
[Update, 11th Aug]: Here is MSRC response.
When you start to write a post you always get a problem with the headline for the post. You need to find something that will, in a few words, sum it up for the reader. I was wondering which one is better, “Boosting WinDbg”, “Faster Kernel Debugging in WinDbg”, “Hacking WinDbg” and so on. But they might be not accurate, and once you will read the post you won’t find them appropriate. But instead of talking about meta-post issues, let’s get going.
Two posts ago, I was talking about hunting a specific race condition bug we had in some software I work on. At last, I have free time to write this post and get into some interesting details about Windows Kernel and Debugging.
First I want to say that I got really pissed off that I couldn’t hunt the bug we had in the software like a normal human being, that Jond and I had to do it the lame old school way, which takes more time, lots of time. What really bothered me is that computers are fast and so is debugging, at least, should be. Why the heck do I have to sit down in front of the computer, not mentioning – trying to dupe the damned bug, and only then manage to debug it and see what’s going on wrong. Unacceptable. You might say, write a better code in the first place, I agree, but even then people have bugs, and will have, forever, and I was called to simply help.
Suppose we want to set a breakpoint on memory access this time, but something more complicated with conditions. The reason we need a condition, rather than a normal breakpoint is because the memory we want to monitor gets accessed thousands times per second, in my case with the race condition, for instance.
You’re even welcome to make the following test locally on your computer, fire up Visual Studio, and test the following code: unsigned int counter = 1; while (counter < 99999999+1) { counter++; }, set a memory access breakpoint on counter which stops when hit count reach 99999999, and time the whole process, and then time it without the bp set, and compare the result, what's the ratio you got? Isn't that just crazy?
Here's an example in WinDbg's syntax, would be something like this:
ba w4 0x491004 "j (poi(0x491004)==0) 'gc'"
Which reads: break on write access for an integer at address 0x491004 only if its value is 0, otherwise continue execution.
It will be tens-thousands times faster without the bp set, hence the debugging infrastructure, even locally (usermode), is slowing things down seriously.
And think that you want to debug something similar on a remote machine, it's impossible, you are going to wait years in vain for something to happen on that machine. Think of all the COM/Pipe/USB/whatever-protocol messages that have to be transmitted back and forth the debugged machine to the debugger. And add to that the conditional breakpoint we set, someone has to see whether the condition is true or false and continue execution accordingly. And even if you use great tools like VirtualKD.
Suppose you set a breakpoint on a given address, what really happens once the processor executes the instruction at that address? Obviously a lot, but I am going to talk about Windows Kernel point of view.
Let's start bottom up, Interrupt #3 is being raised by the processor which ran that thread, which halts execution of the thread and transfers control _KiTrap3 in ntoskrnl. _KiTrap3 will build a context for the trapped thread, with all registers and this likely info and call CommonDispatchException with code 0x80000003 (to denote a breakpoint exception). Since the 'exception-raising' is common, everybody uses it, in other exceptions as well. CommonDispatchException calls _KiDispatchException. And _KiDispatchException is really the brain behind all the Windows-Exception mechanism. I'm not going to cover normal exception handling in Windows, which is very interesting in its own. So far nothing is new here. But we're getting to this function because it has something to do with debugging, it checks whether the _KdDebuggerEnabled is set and eventually it will call _KiDebugRoutine if it's set as well. Note that _KiDebugRoutine is a pointer to a function that gets set when the machine is debug-enabled. This is where we are going to get into business later, so as you can see the kernel has some minimal infrastructure to support kernel debugging with lots of functionality, many functions in ntoskrnl which start in "kdp", like KdpReadPhysicalMemory, KdpSetContext and many others. Eventually the controlling machine that uses WinDbg, has to speak to the remote machine using some protocol named KdCom, there's a KDCOM.DLL which is responsible for all of it.
Now, once we set a breakpoint in WinDbg, I don't know exactly what happens, but I guess it’s something like this: it stores the bp in some internal table locally, then sends it to the debugged machine using this KdCom protocol, the other machine receives the command and sets the breakpoint locally. Then when the bp occurs, eventually WinDbg gets an event that describes the debug event from the other machine. Then it needs to know what to do with this bp according to the dude who debugs the machine. So much going on for what looks like a simple breakpoint. The process is very similar for single stepping as well, though sending a different exception code.
The problem with conditional breakpoints is that they are being tested for the condition locally, on the WinDbg machine, not on the server, so to speak. I agree it’s a fine design for Windows, after all, Windows wasn’t meant to be an uber debugging infrastructure, but an operating system. So having a kernel debugging builtin we should say thanks… So no complaints on the design, and yet something has to be done.
Custom Debugging to our call!
That’s the reason I decided to describe above how the debugging mechanism works in the kernel, so we know where we can intervene that process and do something useful. Since we want to do smart debugging, we have to use conditional breakpoints, otherwise in critical variables that get touched every now and then, we will have to hit F5 (‘go’) all the time, and the application we are debugging won’t get time to process. That’s clear. Next thing we realized is that the condition tests are being done locally on our machine, the one that runs WinDbg. That’s not ok, here’s the trick:
I wrote a driver that replaces (hooks) the _KiDebugRoutine with my own function, which checks for the exception code, then examines the context according to my condition and only then sends the event to WinDbg on the other machine, or simply “continues-execution”, thus the whole technique happens on the debugged machine without sending a single message outside (regarding the bp we set), unless that condition is true, and that’s why everything is thousands of times or so faster, which is now acceptable and usable. Luckily, we only need to replace a pointer to a function and using very simple tests we get the ability to filter exceptions on spot. Although we need to get our hands dirty with touching Debug-Registers and the context of the trapping thread, but that’s a win, after all.
Here’s the debug routine I used to experiment this issue (using constants tough):
This routine checks when a breakpoint interrupt happened and stops the thread only if the pointer I wanted to monitor was accessed from a given address, else it would resume running that thread. This is where you go custom, and write whatever crazy condition you are up to. Using up to 4 breakpoints, that’s the processor limit for hardware breakpoints. Also checking out which thread or process trapped, etc. using the Kernel APIs… It just reminds me “compiled sprites”
I was assuming that there’s only one bp set on the machine which is the one I set through WinDbg, though this time, there was no necessity to set a conditional breakpoint in WinDbg itself, since we filter them using our own routine, and once WinDbg gets the event it will stop and let us act.
For some reason I had a problem with accessing the DRs from the Context structure, I didn’t try too hard, so I just backed to use them directly because I can.
Of course, doing what I did is not anything close to production quality, it was only a proof of concept, and it worked well. Next time that I will find myself in a weird bug hunting, I will know that I can draw this weapon.
I’m not sure how many people are interested in such things, but I thought it might help someone out there, I wish one day someone would write an open source WinDbg plugin that injects kernel code through WinDbg to the debugged machine that sets this routine with its custom runtime conditional breakpoints
I really wanted to paint some stupid pictures that show what’s going on between the two machines and everything, but my capabilities at doing that are aweful, so it’s up to you to imagine that, sorry.
For more related information you can see:
http://uninformed.org/index.cgi?v=8&a=2&p=16
http://www.vsj.co.uk/articles/display.asp?id=265
One of the fun things to do with applications is to bypass their copy-protection mechanisms. So I want to share my experience about some iPad application, though the application is targeted for the Jailbroken devices. It all began a few days ago, when a friend was challenging me to crack some application. I had my motives, and I’m not going to talk about them. However, that’s why the title says non-profit. Or maybe when they always say “for profit” they mean the technical-knowledge profit.
So before you start to crack some application, what you should do is see how it works, what happens when you run it, what GUI related stuff you can see, like dialog boxes or messages that popup, upon some event you fire. There are so many techniques to approach application-cracking, but I’m not here to write a tutorial, just to talk a bit about what I did.
So I fired IDA with the app loaded, the app was quite small, around 35kb. First thing I was doing was to see the imported functions. This is how I know what I’m going to fight with in one glare. I saw MD5/RSA imported from the crypto library, and that was like “oh uh”, but no drama. Thing is, my friend purchased the app and gave me the license file. Obviously it’s easier with a license file, otherwise, sometimes it’s proved that it’s impossible to crack software without critical info that is encrypted in the license file, that was the issue in my case too. Of course, there’s no point in a license file that only checks the serial-number or something like that, because it’s not enough. So without the license file, there wasn’t much to do.
For some reason IDA didn’t like to parse the app well, so I had to recall how to use this ugly API of IDC (the internal scripting language of IDA), yes, I know IDA Python, but didn’t want to use it. So my script was fixing all LDR instructions, cause the code is PICy so with the strings revealed I could easily follow all those ugly objc_msgSend calls. For Apple’s credit, the messages are text based, so it’s easy to understand what’s going on, once you manage to get to that string. For performance’s sake, this is so lame, I rather use integers than strings, com’on.
Luckily the developer of that app didn’t bother to hide the exported list of functions, he was busy with pure protection algorithm in Objective-C, good for me.
So eventually the way the app worked (license perspective) was to check if the license file exists, if so, parse it. Otherwise, ask for a permission to connect to the Internet and send the UDID (unique device ID) of the device to the app’s server, get a response, and if the status code was success, write it to a file, then run the license validator again.
The license validator was quite cool, it was calling dladdr on itself to get the full path of the executable itself, then calculating the MD5 of the binary. Can you see why? So if you thought you could easily tamper with the file, you were wrong. Taking the MD5 hash, and xoring it in some pattern with the data from the license file; Then decrypting the result with the public key that was in the static segment, though I didn’t care much about it. Since the MD5 of the binary itself was used, this dependency is a very clever trick of the developer, though expected. So I tried to learn more about how the protection works.
Suppose the license was legit, the app would take that buffer and strtok() it to tokens, to check that the UDID was correct. The developer was nice enough to call the lockdownd APIs directly, so in one second I knew where and what was going on around it. In the beginning I wanted to create a proxy dylib for this lockdownd library, but it would require me to patch the header of the mach-o so the imported function will be through my new file – but it still requires a change to the file, no good. So the way it worked with the decrypted string – it kept on tokenizing the string, but this time, it checked for some string match, as if someone tampered with the binary, the decryption would go wrong and the string wouldn’t compare well. And then it did some manipulation on some object, adding methods to it in runtime, with the names from the tokenized string, thus if you don’t have a license file to begin with, you don’t know the names of the new methods that were added. One star for the developer, yipi.
All in all, I have to say that I wasn’t using any debugger or runtime tricks, everything was static reversing, yikes. Therefore, after I was convinced that I can’t ignore the protection because I lack of the names of the new methods, and I can’t use a debugger to phish the names easily. I was left with one solution, as I said before – faking the UDID and fixing the MD5.
What I really cared about for a start, was how the app calculates the MD5 of itself:
Since the developer retrieved the name of the binary using dladdr, I couldn’t just change some path to point to the original copy of the binary, so when it hashes it, it would get the expected hash. That was a bammer, I had to do something else, but similar idea… I decided to patch the file-open function. The library functions are called in ARM mode and it’s very clear. The app itself was in THUMB, so it transitions to ARM using a BX instruction and calls a thunk, that in order will call the imported function. So the thunk function is in ARM mode, thus 4 bytes per instruction, very wasteful IMHO.
The goal of my patches was to patch those thunks, rather than all the callers to those thunks. Cause I could end up with a dozen of different places to patch. So I was limited in the patches I could do in a way. So eventually I extended the thunk of the file-open and made R0 register point to my controlled path, where I could guarantee an original copy of the binary, so when it calculated the MD5 of it, it would be the expected hash. Again, I could do so many other things, like planting a new MD5 value in the binary and copy it in the MD5-Final API call, but that required too much code changes. And oh yes, I’m such a jackass that I didn’t even use an Arm-assembler. Pfft, hex-editing FTW
Oh also, I have to comment that it was safe to patch the thunk of file-open, cause all the callers were related to the MD5 hashing…
Ok, so now I got the MD5 good and I could patch the file however I saw fit. Patching the UDID-strcmp’s wasn’t enough, since the license wasn’t a “yes/no” check, it had essential data I needed, otherwise I could finish with the protection in 1 minute patch (without going to the MD5 hassle). So I didn’t even touch those strcmp’s.
RSA encryption then? Ahhh not so fast, the developer was decrypting the xored license with the resulted MD5 hash, then comparing the UDID, so I got the license decrypted well with the MD5 patch, but now the UDID that was returned from the lockdownd was wrong, wrong because it wasn’t corresponding to the purchased license. So I had to change it as well. The problem with that UDID and the lockdownd API, is that it returns a CFSTR, so I had to wrap it with that annoying structure. That done, I patched the thunk of the lockdown API to simply return my CFSTR of the needed UDID string.
And guess what?? it crashed
I put my extra code in a __ustring segment, in the beginning I thought the segment wasn’t executable, because it’s a data. But I tried to run something very basic that would work for sure, and it did, so I understood the problem was with my patch. So I had to double check it. Then I found out that I was piggy-backing on the wrong (existing) CFSTR, because I changed its type. Probably some code that was using the patched CFSTR was expecting a different type and therefore crashed, so I piggy-backed a different CFSTR that wouldn’t harm the application and was a similar type to what I needed (Just a string, 0×7c8). What don’t we do when we don’t have segment slacks for our patch code.
And then it worked… how surprising, NOT. But it required lots of trial and errors, mind you, because lack of tools mostly.
End of story.
It’s really hard to say how I would design it better, when I had my chance, I was crazy about obfuscation, to make the reverser desperate, so he can’t see a single API call, no strings, nothing. Plant decoy strings, code, functionality, so he wastes more time. Since it’s always possible to bypass the protections, if the CPU can do it, I can do it too, right? (as long as I’m on the same ring).
Actually I had a trouble to come up with a good title for this post, at least one that I was satisfied with. Therefore I will start with a background story, as always.
The problem started when I had to debug a huge software which was mostly in Kernel mode. And there was this critical section (critsec from now on) synchronization object that wasn’t held always correctly. And eventually after 20 mins of trying to replicate the bug, we managed to crash the system with a NULL dereference. This variable was a global that everybody who after acquiring the critsec was its owner. Then how come we got a crash ? Simple, someone was touching the global out of it critsec scope. That’s why it was also very hard to replicate, or took very long.
The pseudo code was something like this:
Acquire Crit-Sec
g_ptr = “some structure we use”
do safe task with g_ptr
…
g_ptr = NULL
Release Crit-Sec
So you see, before the critsec was released the global pointer was NULLed again. Obvisouly this is totally fine, because it’s still in the scope of the acquired crit, so we can access it safely.
Looking at the crash dumps, we saw a very weird thing, but nothing surprising for those race conditions bugs. Also if you ask me, I think I would prefer dead-lock bugs to race conditions, since in dead lock, everything gets stuck and then you can examine which locks are held, and see why some thread (out of the two) is trying to acquire the lock, when it surely can’t… Not saying it’s easier, though.
Anyway, back to the crash dump, we saw that the g_ptr variable was accessed in some internal function after the critsec was acquired. So far so good. Then after a few instructions, in an inner function that referenced the variable again, suddenly it crashed. Traversing back to the point where we know by the disassembly listing of the function, where the g_ptr was touched first, we knew it worked there. Cause otherwise, it would have crashed there and then, before going on, right? I have to mention that between first time reading the variable and the second one where it crashed, we didn’t see any function calls.
This really freaked me out, because the conclusion was one – somebody else is tempering with our g_ptr in a different thread without locking the crit. If there were any function calls, might be that some of them, caused our thread to be in a Waitable state, which means we could accept APCs or other events, and then it could lead to a whole new execution path, that was hidden from the crash dump, which somehow zeroed the g_ptr variable. Also at the time of the crash, it’s important to note that the owner of the critsec was the crashing thread, no leads then to other problematic threads…
Next thing was to see that everybody touches the g_ptr only when the critsec is acquired. We surely know for now that someone is doing something very badly and we need to track the biatch down. Also we know the value that is written to the g_ptr variable is zero, so it limits the number of occurrences of such instruction (expression), which lead to two spots. Looking at both spots, everything looked fine. Of course, it looked fine, otherwise I would have spotted the bug easily, besides, we got a crash, which means, nothing is fine. Also, it’s time to admit, that part of the code was Windows itself, which made the problem a few times harder, because I couldn’t do whatever I wanted with it.
I don’t know how you guys would approach such a problem in order to solve it. But I had three ideas. Sometimes just like printf/OutputDebugPrint is your best friend, print logs when the critsec is acquired and released, who is waiting for it and just every piece of information we can gather about it. Mind you that part of it was Windows kernel itself, so we had to patch those functions too, to see, who’s acquiring the critsec and when. Luckily in debug mode, patchguard is down
Otherwise, it would be bloody around the kernel. So looking at the log, everything was fine, again, damn. You can stare at the god damned thing for hours and tracking the acquiring and releasing pairs of the critsec, and nothing is wrong. So it means, this is not going to be the savior.
The second idea, was to comment out some code portions with #if 0 surrouding the potential problematic code. And starting to eliminate the possibilities of which function is the cause of this bug. This is not such a great idea. Since a race condition can happen in a few places, finding one of them is not enough usually. Though it can teach you something about the original bug’s characteristics, then you can look at the rest of the code to fix that same thing. It’s really old school technique but sometimes it is of a help as bad as it sounds. So guess what we did? Patched the g_ptr = NULL of the kernel and then everything went smooth, no crashes and nothing. But the problem still was around, now we knew for sure it’s our bug and not MS, duh. And there were only a few places in our code which set this g_ptr. Looking at all of them, again, seemed fine. This is where I started going crazy, seriously.
While you were reading the above ideas, didn’t you come up with the most banal idea, to put a dumb breakpoint – on memory access, on g_ptr with a condition of “who writes zero”. Of course you did, that what you should have done in the first place. I hope you know that. Why we couldn’t do that?
Because the breakpoint was fired tens of thousands times in a single second. Rendering the whole system almost to freeze. Assuming it took us 20 mins to replicate the bug, when we heavily loaded the system. Doing that with such a breakpoint set, would take days or so, no kidding. Which is out of question.
This will lead me to the next post. Stay tuned.
I have been playing CS since 2001
Kinda addicted I can say. Like, after I had been in South America for half a year, suddenly I caught myself thinking “ohhh I wish I could play CS”… So I think it means I’m addicted. Anyway I really like that game. A few days ago I was playing on some server and suddenly hl2 crashed. How good is that they generate a crash dump automatically, so I fired up WinDbg and took a look what happened, I found out that some pointer was set to 1, not NULL, mind you. Looking around the crash area I found a buffer overflow on the stack, but only for booleans, so I don’t know what was the point and how it was triggered or who sent it (server or another player). Anyway, since I like this game so much, there is only one thing I don’t like it, the stupid children you play with/against, they curse and TK (team-kill) like noobs. One day I promised to myself that I will pwn those little bastards. Therefore I started to investigate this area of crash, which I won’t say anything about the technical details here, so you won’t be able to replicate it, except that I found a stack buffer overflow. The way from there to pwn the clients who connect to a server I set up is really easy. The down side is that they have to connect to a server I control, which is quite lame, the point is to pwn other players on a remote server, so I still work on that. For me pwning would be to find a way to kick them from the server for instance, I don’t need to execute code on their machines. Besides since I do everything for fun, and I’m not a criminal, I have to mention that it’s for eductional purposes only
Being the good guy I am, in ZERT and stuff. I just wanted to add that the protocol used to be really hole-y before CS: Source came out, everything was vulnerable, really, you could tell the server that you wanted to upload a file to it (your spray-decal file) with a name longer than 256 characters, and bam, you own the server through a stupid strcpy to a buffer on the stack. But after CSS came out, the guys did a great job and I could hardly find stuff. What I found is in some isoteric parser that the input comes from the server… What was weird is that some functions were protected with a security cookie and some weren’t. I don’t know what configuration those guys use to compile the game, but they surely need to work it out better.
Another thing I’ve been trying to pwn for a long time now, without much success, I have to say, is NTVDM. This piece of software is huge, though most of it is mostly in user-mode, there are lots of related code in kernel. Recently a very crazy bug was found there (which can lead to a privilege escalation), something in the design, of how the kernel transfers control to BIOS code and returns. You can read more here to get a better clue. So it gave me some idea what to do about some potential buggy code I found. Suppose I found a code in the kernel that takes DS:SI and changes it to a flat pointer, the calculation is (DS << 4) + SI. The thing is that DS is 16 bits only. The thing I thought is that with some wizardy I will be able to change DS to have some value above 0xffff. For some of you it might sound impossible, but in 32 bits and playing with pop ds, mov ds, ax and the like, I managed to put random values in the high 16 bits of DS (say it’s a 32 bit segment register). Though I don’t know if WinDbg showed me garbage or how it really worked, or what happened there, I surely saw big values in DS. So since I couldn’t reproduce this behavior in 16 bits under NTVDM, I tried to think of a way to set DS in the VDM Context itself. If you look at the exports of NTVDM you will see a function named “SetDS”, so taking a look of how it works I tried to use it inside my 16 bits code (exploiting some Escape bug I found myself and posted on this blog earlier), I could set DS to whatever arbitary value I wanted. Mind you, I set DS for the VM itself, not the DS of the usermode application of ntvdm.exe. And then I tried to trigger the other part in the kernel which takes my raw pointer and tries to write to it, but DS high 16 bits were zeros. Damn it. Then I gave to it more thought, and understood that what I did is not good enough. This is because once I set DS to some value, then I get to code to execute on the processor for real and then it enters kernel’s trap handler, DS high half gets truncated once again and I lost in the game. So I’m still thinking if it’s spossible. Maybe next step I should try is to invoke the kernel’s trap handler directly with DS set to whatever value I want, but that’s probably not possible since I can’t control the trap frame myself… or maybe I can
Yo yo yo… forgot to say happy xmas last time, never too late, ah?
This time I wanted to update you about diStorm3 once again. Yesterday I had a good coding session and I added some of the new features regarding flow control. The decode function gets a new parameter called ‘features’. Which is a bit field flag that lets you ask the disassembler to do some new stuff such as:
I wasn’t sure about SYSCALL and the like and UD2, for now I left them out. So what we got now is the ability to instruct the disassembler to stop decoding after it encounters one of the above conditions. This makes the higer disassembler layer more efficient, because now you can disassemble code by basic blocks. Also building a call-graph or branches-graph faster.
Note that now you will be able to ask the disassembler to return a single instruction. I know it sounds stupid, but I talked about it already, and I had some reasons to avoid this behavior. Anyway, now you’re free to ask how many instructions you want, as long as the disassembler can read them from the stream you supply.
Another feature added is the ability to filter non-flow-control instructions. Suppose you are interested in building a call-graph only, there’s no reason that you will get all the data-control instructions, because they are probably useless for the case. Mixing this flag with ‘Stop on RET’ and ‘Stop on CALL’, you can do nice stuff.
Another thing is that I separated the memory-indirection description of an operand into two forms. First of all, memory indirection operand is when an instruction reads/writes from/to memory. Usually in Assembly text, you will see the brackets characters surrounding some expressions. Something like: MOV [EDX], EAX. Means we write a DWORD to EDX pointer. If you followed me ’till here, you should know exactly what I’m talking about anyway.
When you get the result of such instruction from diStorm3, the type of the operand will be SMEM (stands for simple-memory), which hints there’s only one register in the memory-indirection operand. Although it doesn’t hint anything about the displacement, that’s that offset you usually see in the brackets. Like MOV [EDX+0x12345678], EAX. So you will have to test if the displacement exists in both forms. The other form is MEM (Normal memory indirection, or probably should be called ‘complex’) since it supports the full memory indirection operand, like: MOV [EAX*4 + ESI + 0x12345678], EAX. Then you will have to read another register that supplies the base register, in addition to the index register and scale. Note that this applies for 16 bits mode addressing as well, when you can have a mix of ['BX+SI]‘ or only ‘[BX]‘. Also note that sometimes in 32/64 bits mode, you can have a SIB byte, that sets only the base register and the index register is unused, but diStorm3 will return it as an SMEM, to simplify matters. This way it’s really easy to read the instruction’s parameters.
Another feature for text formatting is the ability to tell the disassembler to limit the address to 16 or 32 bits. This is good since the offsets are 64 bits today. And if you encounter an instruction that jumps backwards, you will get a huge negative value, which won’t make much sense if you disassemble 16 bits code…
diStorm3 still supplies the bad old interface. And now it supports two new additional functions. The decompose function, which returns the structures for each instruction read. And another function that formats a given structure into text, which is pretty straight forward. The text format is not an accurate behavior of diStorm64, it’s more simplified, but good enough. Besides I have never heard any special comments about the formatting of diStorm64, so I guess it doesn’t matter much to you guys. And maybe maybe I will add AT&T syntax later on.
Another field that is returned now, unlike diStorm64, is the instruction-set-class type of the instruction, with very broad categories, like Integer instructions, FPU instructions, SSE instructions, and so on. Still might be handy. And the hint about the flow-control type of the instruction.
Also I changed tons of code, and I really mean it, the skeleton is still the same, but the prefixes engine works totally different now. Trying to imitate a real processor this time. By including the last prefix found of that prefix-type. You can read more about this, here. I made the code way more optimized and eliminated double code and it’s still readable, if not for the better. Also I changed the way instruction are fetched, so the locate-instruction function is much smaller and better.
I’m pertty satisfied with the new version of diStorm and hopefully I will be able to share it with you guys soon. Still I got tons of tests to do, maybe I will add that unit-test module in Python to the proejct so you can enjoy it too, not sure yet.
Also I got a word from Mario Vilas, that he is going to help with compiling diStorm for different platforms, and I’m going to integrate his new Python wrappers that use ctypes, so you don’t need the Python extension anymore. Thanks Mario!
However, diStorm3 has its own Python module for the new structure output.
If you have more ideas, comments, complaints or you just hate me, this is the time to say so.
Cheers, happy new year soon!
Gil
In the DOS days there was this lovely trick that you could just ‘ret’ anytime you wanted and the process would be terminated, I’m talking about .com files. That’s true that there were no real processes back then, more like a big chaotic jungle. Anyways, how it worked? Apparently the RET instruction, popped a zero word from the stack, and jumped to address 0. At address 0, there was this PSP (Program Segment Prefix) which had lots of interesting stuff, like command line buffer, and the like. The first word in this PSP (at address 0 of the whole segment too) was 0×20CD (instruction INT 0×20), or ‘Terminate Process’ interrupt. So branching to address 0 would run this INT 0×20 and close the program. Of course, you could execute INT 0×20 on your own, but then it would cost another byte
But as long as the stack was balanced it seemed you could totally rely on popping a zero word in the stack for this usage. In other times I used this value to zero a register, by simply pop ax, for instance…
Time passed, and now we all use Windows (almost all of us anyway) and I came with a similar trick when I wrote Tiny PE. What I did was to put a last byte in my code with the value of 0xBC. Now usually there is some extra bytes following your code, either because you have data there or page alignment when the code section was allocated by the PE loader. This byte really means “MOV ESP, ???”. Though we don’t know what comes next to fill ESP with (some DWORD value), probably some junk, which is great.
At the cost of 1 byte, we caused ESP to get some uncontrolled value, which probably doesn’t map to anywhere valid. After this MOV ESP instruction was executed, nothing happens, alas, the next instruction is getting executed too and so on. When the process gets to terminate you ask? That’s the point we have to wait for one of two conditions to occur. Either by executing some junk instruction that will cause an access violation because it touched some random address which is unmapped. Or the second option is that we run out of bytes to execute and hit an unmapped page, but this time because of the EIP pointer itself.
This is where the SEH mechanism comes in, the system sees there’s an AV exception and tries to call the safe-exception handler because it’s just raised an exception. Now since ESP is a junk really it’s in an unrecoverable state already, so the system terminated the process. 1 byte FTW.
I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to pascal strings in a way, you got the length and the string doesn’t have to be null terminated, the length though, is stored in bytes. Normally I don’t look at the assembly listing of the application I compile, but when you get to debug it you get to see the code the compiler generated. Since some of my functions use strings for input but as null terminated ones, I had to copy the original string to my own copy and add the null character myself. And now that I think of it, I will rewrite everything to use lengths, I don’t like extra wcslen’s.
Here is a simple usage case:
I will show you the resulting assembly code, so you can judge yourself:
One time the compiler converts the length to WCHAR units, as I asked. Then it realizes it should take that value and use it as an index into the unicode string, thus it has to multiply the index by two, to get to the correct offset. It’s a waste-y.
This is the output of a fully optimized code by VS08, shame.
It’s silly, but this would generate what we really want:
With this fix, this time without the extra div/mul. I just did a few more tests and it seems the dead-code removal and the simplifier algorithms are not perfect with doing some divisions inside the indexing for pointers.
Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler does this extra shr/mul. The reason is that the compiler can’t know whether the length is odd, thus it has to round it.
In the last few week I’ve been working on diStorm to add the new instruction sets: AVX and FMA. You can find lots of information about them on the inet. In a brief, the big advantages, are support of 256 bit registers, called now YMM’s (their low halves are XMM’s) and also support for AES built in, you have a few instruction to do small block encryption and decryption, really sweet. I guess it will help some security companies out there to boost stuff. Also the main feature behind these instruction sets is the 3 registers operands. So now you are not stuck with 2 registers per instruction, you can have up to four sometimes. This is good because you have two source operands and a destination operand, which means it saves you other instructions (to move or backup registers) and you don’t have to ruin your dest-src operand like in the old sets. Almost forgot to mention FMA itself, which is fused multiply-add instructions, so you can do two operations at once, like A*B+C, etc.
I wanted to talk about the VEX prefix itself. It’s really a new design and approach to prefixes that was never seen before. And the title of this post says it all, it’s really annoying.
The VEX (Vector Extension) prefix, is a multi-byte prefix for a change. It can be either 2 or 3 bytes. If you take a look at the one byte opcodes map, all of them are taken. Intel was in the need of a new unused byte, which didn’t really exist. What they did instead, was to share two existing opcodes, for each prefix. The sharing works in a special way, that let them know if you meant to use the original instruction or the new prefix. The chosen instructions are LDS (0xc4) and LES (0xc5). When you examine the second byte of these instructions, the byte upon which the new information is extracted from, you can learn that the most significant two bits can’t be set together (I.E: the value of 0xc0 or higher). However, if they are set, the processor will raise an illegal instruction exception. This is where the VEX prefix enters into the game. Instead of raising an exception they will be decoded as this special multi-byte prefix. Note that in 64 bits mode, all those Load-Segment instruction are invalid, so there is no need for sharing the opcode. When you encounter 0xc4 or 0xc5, you know it’s a VEX prefix, as simple as that. Unfortunately this is not the case in 32 bits mode, and since the second byte has to be with a value higher than 0xc0 (because the two most significant bits have to be set in 32 bits), the field in these corresponding bits is inverted actually, which means you will have to extract a few bits that represent some fields and bitwise-not them. This is seriously gross, but it seems Intel didn’t have much of a choice here. If it were up to me, I would do the same eventually, for the sake of backward compatibility, but it doesn’t make it any prettier to be honest. And for your information, AMD pulled the same trick but with the POP instruction (0×8f) for their new instruction sets (XOP, etc), without full backward compatibility.
To make some order in the bits let’s have a look at the following figure:

Cited Intel
Well, I am not going to talk about all fields, (which somewhat are similar to REX for 64 bits) but just about one interesting feature. Since now the prefix is 2/3 bytes and usually an SSE instruction is at least 3 bytes, this will explode the code segment with huge instruction, and certainly gonna make the processor cry a lot to fetch instructions. The trick that was used by Intel (and AMD too) was to have a field that will imply which prefix byte to put virtually before the VEX prefix itself, so this way we saved one byte. And the same idea was used again to spare the 0×0f escape byte or even two bytes of 0×0f, 0×38 or 0×0f, 0×3a which are very common basic opcodes for SSE instructions. So if we had to use an SSE instruction, for instance:
66 0f 38 17 c0; PTEST XMM0, XMM0 – has first 3 bytes that can be implied in the VEX prefix, thus it stays the same size! Kawabanga
I talked in an earlier post that I am not going to support SSE5 as for now, in the hope it’s gonna die. I believe CPU (or instruction set architectures, to be accurate) wars are bad for the coders and even for end users who can’t really enjoy those great technologies eventually.