Archive for the ‘Assembly’ Category

Terminate Process @ 1 byte

Monday, December 28th, 2009

In the DOS days there was this lovely trick where you could just ‘ret’ anytime you wanted and the process would be terminated; I’m talking about .com files. True, there were no real processes back then, it was more like a big chaotic jungle. Anyway, how did it work? The RET instruction popped a zero word from the stack and jumped to address 0. At offset 0 of the segment sat the PSP (Program Segment Prefix), which held lots of interesting stuff, like the command line buffer and the like. The first word of this PSP (at address 0 of the whole segment too) was 0x20CD (the instruction INT 0x20), the ‘Terminate Process’ interrupt. So branching to address 0 would run this INT 0x20 and close the program. Of course, you could execute INT 0x20 on your own, but that would cost another byte :) As long as the stack was balanced, you could totally rely on popping a zero word from the stack for this. At other times I used this value to zero a register, by simply doing POP AX, for instance…

Time passed, and now we all use Windows (almost all of us anyway), and I came up with a similar trick when I wrote Tiny PE. What I did was to make the last byte of my code the value 0xBC. Usually there are some extra bytes following your code, either because you have data there or because of page alignment when the code section is allocated by the PE loader. This byte really means “MOV ESP, ???”: we don’t know what comes next to fill ESP with (some DWORD value), probably some junk, which is great.
At the cost of 1 byte, we cause ESP to get some uncontrolled value, which probably doesn’t map anywhere valid. After this MOV ESP instruction executes, nothing happens immediately; the next instruction gets executed too, and so on. When does the process get terminated, you ask? That’s the point: we have to wait for one of two conditions to occur. Either some junk instruction causes an access violation because it touched a random unmapped address, or we run out of bytes to execute and hit an unmapped page, this time through the EIP pointer itself.
This is where the SEH mechanism comes in: the system sees there’s an AV exception and tries to dispatch it to an exception handler. But since ESP holds junk, the stack is already in an unrecoverable state, so the system terminates the process. 1 byte FTW.

Optimize My Index Yo

Thursday, November 26th, 2009

I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to Pascal strings in a way: you’ve got the length, and the string doesn’t have to be null-terminated; the length, though, is stored in bytes. Normally I don’t look at the assembly listing of the application I compile, but when you debug it you get to see the code the compiler generated. Since some of my functions take strings as input but expect them null-terminated, I had to copy the original string to my own buffer and add the null character myself. Now that I think of it, I will rewrite everything to use lengths; I don’t like extra wcslen’s. :)

Here is a simple usage case:

p = (PWCHAR)ExAllocatePool(SomePool, Str->Length + sizeof(WCHAR));
if (p == NULL) return STATUS_NO_MEMORY;
memcpy(p, Str->Buffer, Str->Length);
p[Str->Length / sizeof(WCHAR)] = UNICODE_NULL;

I will show you the resulting assembly code, so you can judge yourself:

shr    esi, 1
xor    ecx, ecx
mov    word ptr [edi+esi*2], cx

First the compiler converts the length to WCHAR units, as I asked. Then it realizes it should use that value as an index into the Unicode string, so it has to multiply the index by two again to get the correct byte offset. Wasteful.
This is the output of fully optimized code from VS08. Shame.

It’s silly, but this would generate what we really want:

*(PWCHAR)((PUCHAR)p + Str->Length) = UNICODE_NULL;

With this fix, the store is emitted without the extra shift and multiply, since Length is added as a byte offset directly. I did a few more tests and it seems the dead-code removal and simplifier passes are not perfect at folding divisions inside pointer indexing.

Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler does this extra shr/mul. The reason is that the compiler can’t know whether the length is odd, thus it has to round it down to a whole number of WCHARs.

It’s Vexed :)

Tuesday, November 17th, 2009

In the last few weeks I’ve been working on diStorm to add the new instruction sets: AVX and FMA. You can find lots of information about them on the net. In brief, the big advantages are support for 256-bit registers, now called YMMs (their low halves are XMMs), and built-in AES support: you have a few instructions to do small-block encryption and decryption, really sweet. I guess it will help some security companies out there to boost stuff. The other main feature of these instruction sets is three-register operands. You are no longer stuck with 2 registers per instruction; you can have up to four operands sometimes. This is good because with two source operands and a separate destination operand you save other instructions (to move or back up registers) and you don’t have to ruin your dest-src operand like in the old sets. Almost forgot to mention FMA itself, which is fused multiply-add, so you can do two operations at once, like A*B+C, etc.

I wanted to talk about the VEX prefix itself. It’s a really new design and approach to prefixes, never seen before. And the title of this post says it all: it’s really annoying.
The VEX (Vector Extension) prefix is, for a change, a multi-byte prefix. It can be either 2 or 3 bytes. If you take a look at the one-byte opcode map, all the entries are taken. Intel needed a new unused byte, which didn’t really exist. What they did instead was to share two existing opcodes, one for each prefix length. The sharing works in a special way that lets the processor know whether you meant the original instruction or the new prefix. The chosen instructions are LDS (0xc4) and LES (0xc5). When you examine the second byte of these instructions, the byte from which the new information is extracted, you learn that the two most significant bits can’t both be set (i.e., a value of 0xc0 or higher) in a valid encoding; if they are set, the processor raises an illegal-instruction exception. This is where the VEX prefix enters the game: instead of raising an exception, such encodings are decoded as this special multi-byte prefix. Note that in 64-bit mode all those load-segment instructions are invalid anyway, so there is no need to share the opcode: when you encounter 0xc4 or 0xc5, you know it’s a VEX prefix, as simple as that. Unfortunately this is not the case in 32-bit mode, and since the second byte has to have a value of 0xc0 or higher (because the two most significant bits have to be set in 32-bit mode), the fields in those bits are actually stored inverted, which means you have to extract them and bitwise-NOT them. This is seriously gross, but it seems Intel didn’t have much of a choice here. If it were up to me, I would do the same eventually, for the sake of backward compatibility, but it doesn’t make it any prettier, to be honest. For your information, AMD pulled the same trick, but with the POP instruction (0x8f), for their new instruction sets (XOP, etc.), without full backward compatibility.

To make some order in the bits, let’s have a look at the following figure:
[Figure: VEX prefix bit layout – cited from Intel]
Well, I am not going to talk about all the fields (which are somewhat similar to REX in 64-bit mode), just about one interesting feature. Since the prefix is now 2/3 bytes and an SSE instruction is usually at least 3 bytes, this would bloat the code segment with huge instructions and certainly make the processor cry a lot while fetching them. The trick Intel (and AMD too) used was to have a field that implies which prefix byte to put virtually before the VEX prefix itself; this way we save one byte. The same idea was used again to spare the 0x0f escape byte, or even the two-byte escapes 0x0f 0x38 and 0x0f 0x3a, which are very common basic opcodes for SSE instructions. So if we had to encode an SSE instruction, for instance:
66 0f 38 17 c0 ; PTEST XMM0, XMM0 – its first 3 bytes can be implied in the VEX prefix, thus the VEX version stays the same size! Kawabanga

I said in an earlier post that I am not going to support SSE5 for now, in the hope that it’s gonna die. I believe CPU wars (or instruction-set-architecture wars, to be accurate) are bad for coders and even for end users, who can’t really enjoy those great technologies eventually.

A BSWAP Issue

Saturday, November 7th, 2009

The BSWAP instruction is very handy when you want to convert a big-endian value to a little-endian value and vice versa. Instead of reversing the bytes yourself with a few moves (or shifts, depending on how you implement it), a single instruction does it, as simple as that. It reminds me of htons (and friends) from socket programming. The instruction supports 32-bit and 64-bit registers; it will reverse all the bytes inside accordingly. But it won’t work for 16-bit registers; the documentation says the result is undefined. Now WTF, what was the issue with supporting 16-bit registers, seriously? That’s child’s play for Intel, but instead the result is documented as undefined. I really love those undefined results, NOT.
When decoding a stream in diStorm in 16-bit decoding mode, BSWAP still shows the registers as 32 bits. Which, I agree, can be misleading, and I should change it (next version). Intel decided that this instruction won’t support 16-bit registers and yet stays a legal instruction, rather than, say, raising an undefined-instruction exception. There are already known cases where a specific destination register causes an undefined-instruction exception, like MOV CR5, EAX, etc. It’s true that it’s a bit different (because there it’s the register index rather than the decoding mode), but I guess it was easier for them to keep it defined and behaving weirdly. Come to think of it, there are some instructions that don’t work in 64-bit mode, like LDS. Maybe that was before they were willing to break backward compatibility… So now I keep getting emails asking me to fix diStorm to support BSWAP for 16-bit registers, which is really dumb. And I have to admit that I don’t like this whole instruction in 16-bit mode, because it’s really confusing and doesn’t give a true result. So what’s the point?

I was wondering whether the people who sent me emails regarding this issue wrote the code themselves or fed diStorm some existing code. The question is, how come nobody saw that it doesn’t work well for 16-bit operands? And what’s wrong with XCHG AH, AL? That’s equivalent to BSWAP AX, and it works 100% of the time. Actually I can’t remember whether compilers generate code that uses BSWAP, but I am sure I have seen the instruction used with normal 32-bit registers, of course.

Integer Promotion is Dodgey & Dangerous

Wednesday, October 28th, 2009

I know this subject, and every time it surprises me again. Even if you know the rules of how it works and you have read K&R, it will still confuse you, and you will end up being wrong in some cases. At least, that’s what happened to me. So I decided to cover the subject and give two examples along the way.

Integer promotions probably happen in your code many times, and most of us are not even aware of it and don’t understand the way it works. To those of you who have no idea what integer promotion is, to make a long story short: “Objects of an integral type can be converted to another wider integral type (that is, a type that can represent a larger set of values). This widening type of conversion is called “integral promotion.””, as cited from MSDN. Why? So calculations can be faster some of the time; so seamless conversions happen between different types, etc. There are exact rules in the standard for how it works and when; you should check it out on your own.

typedef enum {foo_a = -1, foo_b} foo_t;

unsigned int a = -1;
printf("%s", a == foo_a ? "true" : "false");

Can you tell what it prints?
It will print “true”. Nothing special, right? It just works as we expect.
Check the next one out:

unsigned char a = -1;
printf("%s", a == foo_a ? "true" : "false");

And this time? This one results in “false”, only because the type of ‘a’ is unsigned char. It is therefore promoted to an integer with the value 0x000000ff, which is compared to 0xffffffff and yields false, of course.
If ‘a’ were defined as signed, it would be OK, since the integer promotion would sign-extend it.

Another simple case:

unsigned char a = 5, b = 200;
unsigned int c = a * b;
printf("%d", c);

Any idea what the result is? I would expect it to be (200*5) & 0xff – that is, the low byte of the result, since we multiply uchars here. Wouldn’t you? But then I would be wrong as well: the result is 1000. You know why? … Integer promotions, ta-da. It’s not like c = (unsigned char)(a * b); and that is what’s confusing sometimes.
Let’s see some Assembly then:

movzx       eax,byte ptr [a]
movzx       ecx,byte ptr [b]
imul        eax,ecx
mov         dword ptr [c],eax

Nasty: the unsigned char variables are promoted to int. Then the multiplication happens with 32-bit operand size! And the result is never truncated back to unsigned char.

Why is it dangerous? I think the answer is obvious: you expect one result when you read/write the code, but in reality something different happens. You end up with a small piece of code that doesn’t do what you expect, and then with some integer-overflow vulnerability without even slightly noticing. Ouch.

Update: Thanks to Daniel I changed my erroneous (second) example to what I really had in mind when I wrote this post.

diStorm3 – Call for Features

Friday, October 2nd, 2009

[Update diStorm3 News]

I have been working more and more on diStorm3 recently. The core code is already written, and it works great. I am still not going to talk about the structure itself that diStorm uses to format the instructions. There are two APIs now: the old one, which takes a stream and formats it to text, and a newer one, which takes a stream and formats it into structures. The latter is much faster. Unlike diStorm64, where the text formatting was coupled with the decoding code, it’s now totally separated. For example, if you want to support AT&T syntax, you can do it in a couple of hours or less, really. I don’t like AT&T syntax, hence I am not going to implement it. I bet many people still don’t know how to read it without getting confused…

Hereby, I am asking you guys to come up with ideas for diStorm3. So far I got some new ideas from people, which I am going to implement. Such as:
1) You will be able to tell the decoder to stop on any flow control instruction.
2) Instructions are going to be categorized, such as, flow-control, data-control, string instructions, io, etc. (To be honest, I am still not totally sure about this one).
3) Helper macros to extract data references. Since diStorm3 outputs structures, it’s really easy to know if there’s a data reference and its address. Therefore some macros will aid to do this work.
4) Code references – whether an instruction continues to the next instruction, continues to a target according to a condition, or is a jump-always/call-always.

I am looking to hear more suggestions from you guys. Please be sure you are talking about disassembler features, and not other layers which use the disassembler.

Just wanted to let you know that diStorm3 is going to be dual licensed with GPL and commercial. diStorm64 is deprecated and I am not going to touch it anymore, though it’s still licensed as BSD, of course.

Trampolines In x64

Sunday, September 27th, 2009

We got a few nice features from the new x64 architecture: larger memory addressing, more registers (so fastcall is the standard now, with the first few arguments passed in registers and the rest on the stack), and of course a wider bandwidth of 64 bits, etc. AMD had a once-in-a-lifetime opportunity to change the ISA (instruction set architecture) a bit and make it much better, but instead they only added very few new instructions, dropped a lot of old ones, and left the decoding as hard as before. They were probably in a crazy rush, and this time Intel had to catch up with them!

The problem we face when hooking a function is how many bytes we need to overwrite. I already talked about hot patching and branching in x86, but I have never talked at length about x64. Most hooking engines use the JMP relative instruction (0xE9), which is useful for x64 as well, but once again we are limited to a range of ±2GB from the JMP instruction’s address, and in x64, 2GB is really nothing much. I also searched the net a bit for more info and found some interesting approaches. I decided to describe here how they work.

1)
The JMP relative instruction is a very good method when you know in advance that the distance from the hooked function to the target trampoline is less than 2GB; it takes only 5 bytes.

2)

MOV RAX, <Absolute Address>
JMP RAX

This one is almost optimal: you can branch anywhere in the address space, and it takes only 12 bytes. It does clobber a register, though. Of course, in the ABI (Application Binary Interface) the compiler implements, some registers are defined as volatile, meaning you can use them almost any time without worrying about restoring them. By analyzing the function with a disassembler you may be able to tell which register you can use safely, but that’s a big headache.
Note that you could replace the JMP RAX with PUSH RAX; RET, but it’s still the same size.

3)

PUSH <Low DWORD of Address>
; Only if required (when the high half is non-zero):
MOV [RSP+4], DWORD <High DWORD of Address>
RET

This one was found on Nikolay Igotti’s blog. You split the 64-bit address value into two DWORDs. The first PUSH, although it pushes a 32-bit immediate, really allocates a 64-bit slot on the stack. Then, if the high half of the address is non-zero, you write it directly onto the stack. Together this puts a full QWORD on the stack, which you then branch to by RETting. It takes 14 (=5+8+1) bytes and doesn’t dirty any register.
Note that it’s possible to shave off some bytes in the case where the high half is -1, with OR [RSP+4], -1 and the like.
4)
This one takes advantage of the RIP-relative addressing mode, check it out, yo:

JMP [RIP+0]
DQ <Absolute Address>

This one is cool: it branches to the full address stored in the QWORD that follows. So it takes 14 bytes as well. On the other hand, I don’t like mixing code and data. In the world of firewalls (which do tons of hooking), for instance, hooking this function twice with two different methods will probably lead to a crash: most hooking engines disassemble the prologue, and beginning with the second “instruction” they will hit the data QWORD and read garbage.
This method leads to an interesting idea: you can place the address value anywhere within a range of ±2GB, and then you need only 6 bytes. At the instruction level, any time you mention RIP it costs 4 bytes for the required 32-bit displacement, even if it’s 0, which sucks.
An unlikely but sometimes possible method is JMP <Abs Addr>: some OS APIs let you choose the allocated address, so you can make sure it’s in the first 2GB.
5)

MOV RAX, [PointerToAddress64]
JMP RAX
;;;
PointerToAddress64:
DQ <Absolute Address>

This one is similar to the former, but it can read the QWORD from anywhere, because this addressing mode really supports 64-bit addresses. It takes 12 bytes. Needless to say, it dirties a register.
Note that this MOV RAX form is a special instruction that supports a full 64-bit address.

Each method has its own pros and cons. It seems you can do best by choosing a specific method according to the difference from the hooked function to the target trampoline address.
One crazy idea which I haven’t seen before is to use the simple JMP relative (0xE9) instruction and jump into the MIDDLE of some function nearby; that function gets hooked in the middle to recover the hole (like a transparent proxy), and the hole is filled with the full jump to any address.

Any other tricks you wanna share with us? Leave a comment.

Branching to Absolute Addresses

Friday, September 25th, 2009

The x86-64 architecture is very annoying sometimes. Especially when you want to hook functions or do some magical assembly tricks :)

Suppose you want to hook a function: you have to change its first instruction to a jump to your code, so the next time the function is called, control is transferred to your stub. Everybody knows this technique. Basically you need 5 bytes: the opcode 0xE9 (JMP relative) and a DWORD for the relative offset itself. That lets you jump backward or forward by up to 2GB. Sometimes that’s not enough, just because you have some limitations.
Now the problem with the processor is that there’s no instruction that can branch to an absolute address given directly to the instruction as an immediate operand. Think of MOV EAX, 0x12345678,
but as a branch: JMPABS 0x12345678. What a shame, really. But no worries, of course it’s possible to work around, at the cost of a few more bytes, and sometimes you are short on bytes. However, just to whet your appetite: you can sometimes even hook a 1-byte function that only RETs, in an efficient way. That’s worth another blog post sometime.

Anyway, what you can do is change the code a bit into something like this:

MOV EAX, <Absolute Address>
JMP EAX ; <--- Suddenly the processor supports absolute addresses, ah ha!

But now the code takes 7 bytes and it also clobbers one of the registers. There are two possible fixes: either save and reload EAX at the callee site (BAHH!!) or change the code again, which is trivial:

PUSH <Absolute Address>
RET

This is a very nice one, although, technically, I am not sure whether it's slower than a simple JMP; the logic behind the RET instruction is quite impressive (branch-predicting the caller/return address...), mind you. Now the code takes only 6 bytes, plus, you can reach the whole address space.

Let's move on to another similar issue: suppose you want to CALL an absolute address. What are you gonna do now?

PUSH EBX
CALL $ + 5
HERE:
POP EBX
LEA EBX, [EBX + NEXT-HERE]
PUSH EBP
MOV EBP, ESP
XCHG EBX, [EBP + 4] ; 1337
POP EBP
PUSH <Target Absolute Address>
RET
NEXT:

I will leave it to you to figure it out; hint - it doesn't screw up any register.

Arkon Under the Woods

Saturday, May 2nd, 2009

Yeah, I am not alright, I am even an ass, leaving you all (my dear readers, are there any left?) without saying a word for the second time. At the end of last year I was in South East Asia for 3 months, and now I am in South America for 3 months and counting… It is just that I really wish to keep this blog purely technological, but I guess we are all human after all. So yeah, I have been trekking a lot in Chile and Argentina (mostly in Patagonia) and having a great time here, now in Buenos Aires. Good steaks and wines, ohhh and the girls. Say no more.

It is really cool that almost every shitty hostel you go to has WiFi available for free. Carrying an iPod touch with me, I can actually be online, but apparently not many web developers think about mobile web pages, and thus I couldn’t write blog posts with Safari, because there is some problem with the text-area object. For some reason, I guess some JS code doesn’t run well on the iPod, so I don’t get that keyboard thingy up and cannot type anything in. WordPress…

I am always surprised to see how many computers here, in coffee shops or just Internet shops, are not really secured. You run as admin some of the time, and there is no anti-virus, which I think average users should have. And if you plug in your camera to upload some pictures, the next time you will see some stupid new files on it, named desktop.ini and autorun.inf. Sounds familiar? And then I read some MS blog post about disabling AutoRun for removable storage devices… yipee, about time. What I am also trying to say is that one could so easily build a zombie army with all those computers… the ease of access and lack of protection drive me mad.

Anyhow, I had some free time, of course, I am on a vacation, sort of, after all. And I accidentally reached an amazing blog that I couldn’t stop reading for a few days. Meet NO EXECUTE! If you are a low-level freak like me, you will really like it too. Although Darek begins with hardware stuff, which will fill some gaps for most people I believe, he talks about virtualization and emulators (his obsession), and I just read it like a fantasy book, eager to get to the next chapter every time. I learnt tons of stuff there, and I really like to see that even today a few people still measure and check optimizations in cycles per instruction rather than seconds or milliseconds. One of the things I really liked there was a trick he pulled when the guest OS runs little-endian, for instance, while the host OS runs big-endian. Every access to memory has to be byte-swapped when the access is wider than one byte, of course. Therefore, in order to eliminate the byte swaps, which are expensive, he kind of turned the whole memory of the guest OS upside down, so the endianness changed as well. Now it might sound like a simple matter, but this is awesome, and the way he describes it, you can really feel the excitement behind the invention… He also talks about how lame Intel and AMD are, coming up with new instruction sets every Monday, which I have also mentioned in the past.

Regarding diStorm now: I decided to discontinue the development of the current diStorm64 version. But hey, don’t worry, I am going to open-source diStorm3, and I still consider making it dual-licensed. The benefits of diStorm3 are structure output and, believe me, amazing speed; like in the good old days, the structure per instruction is unbelievably tiny in size (relative to other disassemblers I have seen out there), and you guys are gonna like it.

Thing is, I have no idea when I am getting home… Now with this swine flu spreading like hell, I don’t know where I will end up. The only great thing about this swine flu, so to speak, is that you can see evolution in progress.

Salud

VML + ANI ZERT Patches

Tuesday, February 3rd, 2009

It is time to release an old presentation about the VML and ANI vulnerabilities that were patched by ZERT. It explains the vulnerabilities and how they were closed. It is very technical; Assembly knowledge is required if you wanna really enjoy it. I also gave a talk based on this presentation at CCC 2007. It so happened that I wrote the patches, with the extensive help of the team, of course.

ZERT Patches.ppt