Archive for the ‘Assembly’ Category

Escape

Sunday, February 1st, 2009

Wanted to share this with the world:

e 0:0 cc
e 100 c4 c4 54 27

NULL, Vulnerabilities and Fuzzing

Wednesday, December 31st, 2008

I remember seeing Ilja at BH07. We talked about kernel attacks, aka privilege escalation. He also told me back then that he had found some holes through which he managed to execute code. I think the target platform was Windows, although Ilja specializes in Unix; he already gave a talk about Unix kernel auditing back in '05. Probably nothing new there, at least by now. However, the approach of fuzzing the kernel – the system calls, to be accurate – was pretty new back then. But feel free to correct me if I'm wrong about it. And it seems Ilja managed to find some holes using fuzzing. (BTW, he has a much more interesting paper about Unusual Bugs.)

Personally, I don't believe in fuzzing. The holes I usually find are ones a fuzzer has no way of finding. Although, I do believe that you need to mix tools/knowledge in order to find holes and audit software in a better way. It is enough that there is a simple validation of some parameter you pass to a potentially vulnerable function, and your whole test can be thrown away because of that validation; even though there is still a weakness in that function, you won't get to it. Then you say "Ah Uh", and you think that you can refine the randomness of the parameters you pass to that function and hopefully prevail. Well, it might work, it might not. As I said, I'm not a big fan of fuzzing.

Still, it might be cool to have a tool that analyzes the code of a function and builds the parameters in a special way to reach 100% code coverage of that function – which is not fuzzing anymore, and means you walk all paths of execution, so the chances of finding a weakness are so much greater. Writing such a tool is craziness, and yet possible, if you ask me.

Fuzzing or not, there are still weaknesses in Win32k, which is supposed to be one of the most "secured"/audited components in the kernel – probably because many researchers have had their go at it as well. And that's simply sad.

Speaking of Ilja's kernel fuzzing and of how cool we think we are for finding weaknesses nowadays – Mark Russinovich wrote NTCrash back in '96, for god's sake, and it was a fuzzer(!), but back then nobody called it that or knew about fuzzers. And NTCrash, as simple as it is, found some weaknesses in kernel system calls of NT4 ;) Respect (though today it won't even scratch the kernel, so we may keep thinking ourselves cool for still finding stuff :) ).

A friend and I are trying to audit another application, and my friend found some NULL dereference which crashes that software. So we fired up Olly and tried to see what's going on. It seems that some interface is queried and returns a success code, and at the same time we get NULL for that interface, which means something is really f*cked up there. Thing is, as you can probably imagine, we want to execute code out of it. But the odds seem to be against us this time, since we can't control that NULL or anything about it.

I then wanted to see what people have done with NULL before, how to exploit it better. Usually, 99% of the applications running out there don't have page 0 mapped into their address space. CSRSS and NTVDM, for instance, do have it mapped, but who cares now…? It doesn't help our cause, and besides, you probably can't control that page 0 and its data anyway. So I came across that Flash exploitation. To be honest, I didn't read the whole white paper about the exploitation, I only looked for how the arbitrary data write worked. It seems that some CALLOC had failed to allocate memory because of an integer overflow weakness, and from there you got a NULL pointer to begin with. But Flash didn't access that pointer immediately – it had some pointer arithmetic added to it. And you guessed it right, you can control some offset before the pointer is really accessed, thus you can write (almost) anywhere you want. Now, I really don't underestimate the exploitation – from the bits I read it is a crazy and very beautiful exploitation. But to say that it is a new technique and a new class of exploitation is one thing I really don't agree with. You know what, looking at it in a different light – it would probably not have led to code execution if that CALLOC had not returned NULL, because then you wouldn't know where you are on the heap and you couldn't really write anywhere accurately. And besides, the NULL wasn't dereferenced directly, an offset was added to it (no matter what the calculation was, for the sake of conversation), therefore I don't see it as so exciting, if you ask me (again, not the exploitation itself but the "new class of exploitation"). Still, you should check it out :)

So, as I saw that no one has done anything really useful with a real NULL dereference, it seems that the weakness he found is only a DoS. But maybe we can control something there – yet to be researched…

String Initialization is Tricksy

Monday, December 29th, 2008

A friend of mine had to hand in an assignment for Computer Science at university. As I understood it, it was a relatively easy assignment. The point is that this friend is a very experienced programmer and knows a thing or two about it. Anyway, he had a line in the code which goes like this:
char buf[1024] = "abc";

You don't even need to know C in order to understand that line, right? I assume we all agree on that. It simply initializes the buffer with a constant string literal. So his lecturer asked him what this line does, precisely. And to his surprise, his answer was incorrect. The correct answer is that the whole buffer is initialized and then the string constant is copied into it (this can be done in a few ways, for example copying a prepared buffer that has the zeros at its end). So today another friend called me on the phone to ask about this thing – why was our first friend wrong about it? Now, as a reverser, I suppose I need to know the answer to such a simple matter as well. But the sad part was that I was wrong just like the two of them. I fired up the C standard and started searching for the solution. I wanted a living proof of the matter at hand. Looking here and there, it took me around 15 minutes to lay my hands on the sentence that settled the whole matter. And I quote:

“If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.”

That text is the answer – if there are fewer characters than the size of the array to initialize, the remainder has to be initialized as well. There is another clause which explains how that initialization is done, but for now let's call it 'zeroing'.

Now, the reason I was wrong about it is that I happened to see many snippets like this (for example):
mov [buf+0], 'a'
mov [buf+1], 'b'
mov [buf+2], 'c'
mov [buf+3], 0

in lots of functions, and that means the source C code is:

char buf[] = "abc";

The standard says that in this case the size of the buffer is taken from the size of the constant string literal – don't forget the null-termination character as well. So that's why I never saw a memset coming in to initialize the whole buffer. Besides, most people probably code it this way:
char buf[1024];
strcpy(buf, "abc");

which doesn't lead to a memset or any other initialization of the rest of the array.
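
Just to make the difference concrete, here is a tiny Python model of the semantics (only a sketch of what the standard mandates, not what the compiler actually emits):

# Model of: char buf[1024] = "abc";
# The literal is copied and the remainder of the array is zero-initialized.
buf_fixed = b"abc" + b"\x00" * (1024 - 3)
assert len(buf_fixed) == 1024 and buf_fixed[3:] == b"\x00" * 1021

# Model of: char buf[] = "abc";
# The array size is deduced from the literal, including the terminating null.
buf_sized = b"abc\x00"
assert len(buf_sized) == 4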

Instructions’ Prefixes Hell

Sunday, December 21st, 2008

Since the first day diStorm was out, people haven't known how to deal with the fact that I drop (ignore) some prefixes. It seems that dropping unused prefixes isn't such a great feature for many people, and it only complicates the scanning of streams. Therefore I am thinking about removing the whole mechanism, or maybe changing it in a way that still preserves the same interface but behaves differently.

For the following stream, "67 50", the result by diStorm will be: "db 0x67" – "push eax". The 0x67 prefix is supposed to change the address size, but no address is used in our case, thus it's dropped. However, if we look at the hex code of the "push eax" part, we will see "67 50". And this is where most people become dumbfounded. Getting the same prefix byte of the stream twice, in two results, is confusing in a way. Taking a look at other disassemblers will tell you that diStorm is not the only one to play such games with prefixes. Sometimes I get emails regarding this "impossible" prefix – since it gets output twice, which is wrong, right? Well, I don't know, it depends on how you choose to decode it. The way I chose to decode prefixes was really advanced: each prefix could be ignored unless it really affected (one of) the operands. I had to really keep track of each prefix and know whether it affected any operand in the instruction, and only then did I decide which prefixes to drop. This all sounds right, in a way. Hey, at least to me.

However, we didn't even talk about what you will do if you have multiple prefixes of the same family (segment-override: DS, ES, SS, etc). This one is really up to the designer's interpretation. Probably the way I did it in diStorm is wrong, I admit it; that's why I want to rewrite the whole prefixes thing from the beginning. There are 4 or 5 types of prefixes, and according to the specs (Intel/AMD), I quote: "A single instruction should include a maximum of one prefix from each of the five groups." … "The result of using multiple prefixes from a single group is unpredictable.". This pretty much sums up all the problems in the world related to prefixes. I guess you can see for yourself from these two lines that you can actually treat them in many different ways. We know that it can lead to "unpredictable" results if you have many prefixes – in reality it won't shut down your CPU, it won't even throw an exception. So screw it, you say, and you're right. Now let's see some CPU (16 bits) logic for decoding the prefixes:

while (prefix byte is read) {
 switch (prefix) {
  case seg_cs: use_seg = cs; break;
  case seg_ds: use_seg = ds; break;
  case seg_ss: use_seg = ss; break;
  ...
  case op_size: op_size = 32; break;
  case op_addr: op_addr = 32; break;
  case rep_z: rep = z; break;
  ...
 }
 /* skip byte in stream */
}

The processor will use those flags in order to know which prefix was present or not. The thing about using a loop (in any form) is that when you have to show text out of a stream with many prefixes, you don't know whether the processor really uses the first occurrence of a prefix or its last, or maybe both? And maybe Intel and AMD implement it differently?

You know what? Why the heck do I bother so much with some minor edge cases that never really happen in real code sections? I ask myself that too; maybe I shouldn't. Although I happened to see for myself some malware code that tries to screw up the disassembler with many extra prefixes, etc., and I thought diStorm could help malware analyzers with advanced prefix decoding as well.

Anyways, according to the logic code above, I'm supposed to use the last prefix of each type. Given a stream such as: 66 66 67 67 40, I will get:
0: 66 (dropped)
2: 67 (dropped)
1: 66 67 40
Now you can see that the prefixes used are the second and the fourth, and that the instruction starts at the second byte of the stream. I can now officially commit suicide; even I can't follow these addresses, it's hell. So, any better solution?
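
For illustration only, here is a minimal Python sketch of the "last prefix of each group wins" interpretation described above (the table and names are mine and cover just a handful of prefixes; this is not how diStorm actually does it):

# Map a prefix byte to its group (only a few prefixes, for illustration).
PREFIX_GROUPS = {
    0x26: "segment", 0x2e: "segment", 0x36: "segment", 0x3e: "segment",
    0x66: "op_size",
    0x67: "addr_size",
    0xf0: "lock",
    0xf2: "rep", 0xf3: "rep",
}

def scan_prefixes(stream):
    # Returns the offset of the last prefix of each group and the opcode offset.
    last = {}
    i = 0
    while i < len(stream) and stream[i] in PREFIX_GROUPS:
        last[PREFIX_GROUPS[stream[i]]] = i
        i += 1
    return last, i

# The stream from above: 66 66 67 67 40
used, opcode_off = scan_prefixes(bytes([0x66, 0x66, 0x67, 0x67, 0x40]))
# used == {'op_size': 1, 'addr_size': 3}, opcode_off == 4
# so the first 0x66 and the first 0x67 are the ones to drop.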

Proxy Functions – The Right Way

Thursday, August 21st, 2008

As much as I am an Assembly freak, I try to avoid it whenever possible. It's the "pick the right language for your project" thing – don't use overqualified stuff. Actually, in the beginning, when I started my patch on the iPhone, I compiled a simple stub for my proxy, then fixed it manually and only then used that code for the patch. Just to be clear about something here – a proxy function is a function that gets called instead of the original function, and once control belongs to the proxy function, it may call the original function or not.

The way most people do this proxy function technique is detour patching, which simply means that we patch the first instruction (or a few, depending on the architecture) and change it to branch into our code. Mind you that I'm messing with ARM here – iPhone… The most important difference is that the return address of a function is stored in a register rather than on the stack, which, if you're not used to it, will easily get you confused and experiencing some crashes.

So suppose my target function begins with something like:

SUB SP, SP, #4
STMFD SP!, {R4-R7,LR}
ADD R7, SP, #0xC

This prologue is pretty much equivalent to the push ebp; mov ebp, esp thing on x86, plus storing a few registers so we can change their values without harming the caller, of course. And one last thing – we also store LR (the link register), the register which holds the return address of the caller.

Anyhow, in my case, I override (detour) the first instruction to branch into my code, wherever it is. Therefore, in order for my proxy function to continue execution in the original function, I have to somehow emulate that overridden instruction and only then continue from the next instruction, as if the original patched function wasn't touched. There are rare times when you cannot override some specific instructions, but then it only means you have to work harder and change the way your detour works (instructions that use the program counter as an operand, or branches, etc.).

Since the return address of the caller is stored in a register, we can't override the first instruction with a branch-link (the 'call' equivalent on x86), because then we would lose the original caller's return address. Give it a thought for a second; it's confusing the first time, I know. An interesting point to note: if a function doesn't call other functions internally, it doesn't have to store LR on the stack and later pop the PC (program counter, the IP register) off the stack, because nobody touched that register – unless the function needs around 14 registers for optimizations instead of using local stack variables… This way you can tell which functions are leaves on the call graph, although it is not guaranteed.

Once we understand how the ARM architecture works, we can move on. However, I have to mention that the first 4 parameters are passed in registers (R0 to R3) and the rest on the stack, so in the proxy we will have to treat the parameters accordingly. The good thing is that this ABI (Application Binary Interface) is something the compiler knows (LLVM with a GCC front-end in my case), so you don't have to worry about it, unless you write the proxy function manually yourself.

My proxy function can be written fully in C, although it’s possible to use C++ as well, but then you can’t use all features…

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
}

That's my sample foo proxy function, which doesn't do anything useful or interesting, but usually in proxies we want to change the arguments before moving on to the original function.

Once it is compiled, we can rip the code from the object or executable file – doesn't really matter – and put it inside our patched file, but we are still missing the glue code. The glue code is a sequence of manually crafted instructions that allow you to use your C code within the rest of the binary file. And to be honest, this is what I really wanted to avoid in the first place. Of course, you say, "but you could write it once and then copy-paste that glue code and voila". In a way you're right, I could do that. But it's bothersome and takes too much time, even that simple copy-paste. And besides, it is enough that you have one or more data objects stored after your function, and you then have to relocate all the references to them. For instance, you might have a string that you use in the proxy function. Now, the way ARM works, it all gets compiled as PIC (Position Independent Code), for better and worse – probably the better, in our case. But then, if you want to put your glue code inside the function and before the string itself, you will have to change the offset from the current PC register to the string… Sometimes it's just easier to see some code:

stmfd sp!, {lr}
mov r0, #0
add r0, pc, r0
bl _strlen
ldmfd sp!, {pc}
db "this function returns my length :)", 0

When you read the current PC, you get the current instruction's address + 8, because of the way the pipeline works in ARM. That's why the offset to the string is 0. If you try to put another instruction at the end of the function, for the sake of glue code, you will have to change the offset to 4. This really gets complicated if you have more than one resource to read. Even 32 bit values are stored after the end of the function, rather than in the operand of the instruction itself as we know it on x86.

So to complete our proxy code in C, it will have to be:

int foo(int a, int b)
{
 // +4 = we skip the first instruction of the original foo, which branches into this code!
 int (*orig_foo)(int, int) = (int (*)(int, int))<addr of orig_foo + 4>;
 if (a == 1000) b /= 2;
 // Emulate the real instruction we overrode, so the stack is balanced before we continue with the original function.
 asm("sub sp, sp, #4");
 return orig_foo(a, b);
}

This code looks more complete than before, but it contains a potential bug – can you spot it? OK, I'll give you a hint: if you were to use this code on x86 it would blow up, though on ARM it would work well, to some extent.

The bug lies in the number of arguments the original function receives. Since on ARM only the 5th argument onwards is passed through the stack, our "sub sp, sp, #4" would make things go wrong there. The stack of the original function should look as if it were running without us having touched that function. This means that we want to push the arguments onto the stack, ONLY then do the stack fix by 4, and afterwards branch to the second instruction of the original function. Sounds good, but this is not possible in C :( because it means we have to run 'user-defined' code between the 'pushing-arguments' phase and the 'calling-function' phase, which is actually not possible in any language I'm aware of. Correct me if I'm wrong, though. So my next sentence is going to be "except Assembly". Saved again ;)

Since I don't want to dirty my hands with editing the binary of my new proxy function after I compile it, we have to fix the problem I just described above. This is the way to do it, ladies and gentlemen:

int orig_foo(int a, int b);

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
 return orig_foo(a, b);
}

int __attribute__((naked)) orig_foo(int a, int b)
{
 // Emulate the real instruction we overrode, so the stack is balanced before we continue with the original function.
 asm("sub sp, sp, #4\n ldr r12, [pc]\n bx r12\n .long <FOO ADDR + 4>");
}

The code simply fixes the stack, reads the absolute address of the original foo, again skipping the first instruction, and branches into that code. It doesn't change the return address in LR, therefore when the original function is over, it returns straight to the caller of orig_foo, which is our proxy function; that way we can still control the return value, if we wish to do so.

We had to use the naked attribute (__declspec(naked) in VC) so that the compiler won't emit a prologue that would unbalance our stack again. Either way, the epilogue would never get to run…

This technique will work on x86 the same way, though for branching into an absolute address, one should use: push <addr>; ret.

Bottom line, I don't mind paying the price of a few lines of Assembly, that's perfectly OK with me. The problem was that I had to edit the binary after compilation in order to make it ready to be put into the original binary as a patch. Besides, the Assembly code is a must if you wish to compile it without further ado, and as long as the first instruction of the function hasn't changed, your code is good to go.

This code works well, and just as I really wanted, so I thought to share it with you guys, as a better "infrastructure" for making proxy function patches.

However, it would have been perfect if the compiler stored the functions in the same order you write them in the source code, so the first instruction of the block would be the first instruction you have to run. As it is, you might need to add another branch at the beginning of the code so it skips the non-entry code. This is really compiler dependent. GCC seems to be the best at preserving the functions' order. VC and LLVM are more problematic when optimizations are enabled. I believe I will cover this topic in the future.

One last thing: if you use -O3, or the functions get inlined, the orig_foo naked function becomes part of the foo function, and then the way we assume the original function returns to our foo proxy function won't happen. So just be sure to peek at the generated code and verify everything is fine ;)

x86 Instruction Set Wars

Thursday, August 14th, 2008

It all started back in the '90s when Intel came up with the dashing MMX technology. AMD wasn't late to respond with its new 3DNow! instruction set, which, if you ask me, was much less popular and less used. The good thing about 3DNow!, though, is that it handled floating point rather than integers. And then came the SSE instruction set, and the world got better; nowadays compilers even use it up to SSE2, knowing it will be there when the code runs on 99% of machines today. However, they still test for SSE to be sure they can use it – I believe this check will stay in the code forever, for assurance; it can't harm, you know.

Now, almost 10 years later, we see another split in technologies. Intel came up with SSE4 – I already talked about it in an earlier post – which contains really valuable instructions. And then, guess what? AMD added several instructions on top of Intel's.

In August '07 AMD announced a new set: SSE5. Aren't we sick of SSE already??? Anyway, in April this year, Intel announced in response a new instruction set, AVX (Advanced Vector Extensions).

This is going nowhere. Each company in its turn announces a new instruction set. The one company doesn't support the other's and vice versa. This is just wrong and it's our nightmare. Basically, most developers shouldn't care anyway – they don't use these sets, and they are (partly) not out officially yet (meaning you can't use them for now). Therefore, the game matters mostly for compiler developers and those who write the tools that mess with machine code, etc. Probably many codecs and crypto algorithms will use them directly too, if they exist on the processor they get to run on…

The thing is, I decided to take a side and stick only to AVX, and therefore I won't support SSE5 in diStorm. If anyone from the community wants to help with that, I will gladly accept it, don't get me wrong. Since I don't have much time for it anyway, and I hate this mess, I will stick to Intel. I don't know if that's good or bad news for diStorm, but as the sole owner, the way I see it, I gotta do something about this lack of a standard. Everything has standards today, why can't the damned x86 have one as well?

I’m not the only one who speaks about this, Agner Fog has also talked about it here. The difference is that I can make my own small change.

What do you think should happen with this issue??? Who’s right, Intel or AMD? Should developers code twice now for each instruction set?

arrrrgh so many questions

Anti-Unpacker Tricks

Friday, July 18th, 2008

Peter Ferrie, a former employee of Symantec who now works for MS, wrote a paper about anti-unpacker tricks. I was really fascinated reading that paper. There are so many examples in there of tricks that still work nowadays. Some I knew already, some were new to me; he covers so many tricks. The useful thing is that every trick has a thorough description and a code snippet (mostly Assembly). So it has become one of the most valuable papers on the subject, and you should really read it to get up to date. The paper can be found here.

One idea from the paper that I really like is something Peter himself found: you can use ReadFile (or WriteProcessMemory) to overwrite a memory block so that no software breakpoints will be raised when you execute it. But on second thought, why wouldn't a simple memcpy do the same trick?

If you guys remember the Tiny PE challenge I posted 2 years ago on Securiteam.com, Peter was the only one who kicked my ass with a version of 232 bytes, where I came in with 274 bytes. But no worries, after a long while I came back with a version of 213(!) bytes (over here) using some new tricks. Today I'm still waiting for Peter's last word…

Have fun

Signed Division In Python (and Vial)

Friday, April 25th, 2008

My stupid hosting company decided to move the site to a different server blah blah, the point is that I lost some of the recent DB changes and my email hasn’t been working for a week now :(

Anyways, I'm reposting it. The sad truth is that I had to find the post in Google's cache in order to restore it, way to go.

Friday, April 18th, 2008:

As I was working on Vial to implement the IDIV instruction, I needed a signed division operator in Python. And since the x86 is 2's complement based, I first have to convert the number from unsigned into a Python negative and only then perform the operation, in my case a simple division. It was supposed to be a matter of a few minutes to code this function, which gets the two operands of IDIV and returns the result, but in practice it took a few bad hours.

The conversion is really easy. Say we mess with 8 bit integers, then 0xff is -1, and 0x80 is -128, etc. The equation to convert such a value to a Python negative is: val - (1 << sizeof(val)*8). Of course, you do that only if the most significant bit, the sign bit, is set. Eventually you return the result of val1 / val2. So far so good – but no. As I was feeding my IDIV with random input numbers, I saw that the result my Python code returns is not the same as the processor's. This was when I started to freak out, trying to figure out what's wrong with my very simple snippet of code. And alas, later on I realized nothing was wrong with my code – it's all Python's fault.

What's wrong with Python's division operator? Well, to be strict, it does not round the negative result toward 0, but toward negative infinity. Now, to be honest, I'm not really into math stuff, but all x86 processors round negative numbers (and positive also, to be accurate) toward 0. So one would really assume Python does the same, as C does, for instance. The simple case that shows what I mean: 5/-3 in Python results in -2, rather than -1, which is what the x86 IDIV instruction is expected to and does return. And besides, -(5/3) is not equal to 5/-3 in Python – now is the time you say WTF. Which is another annoying point. But again, as I'm not a math guy – though I was speaking with many friends about this behavior – that equality (or to be accurate, inequality) is OK in real-world math. Seriously, what do we, coders, care about real-world math now? I just want to simulate a simple instruction. I really wanted to go and shout "hey, there's a bug in Python's division operator" – and how come nobody saw it before? But after some digging, it turns out this behavior is actually documented in Python. As much as I, and many other people I know, would hate it, that's that. I even took a look at the source code of the integer division algorithm, and saw a 'patch' that floors the result if it is negative, because C89 doesn't define the rounding well enough.

While you're coding something and you have a bug, you usually just start debugging your code, track it down and then fix it easily while keeping on working on the code, because you're in the middle of the coding phase. Then there are those rare times when you really go crazy: you're absolutely sure your code is supposed to work (which it does not), and then you realize that the layer you're supposed to trust is broken (in a way). You really want to kill someone… but being a good guy, I won't do that.

Did I hear anyone say modulo?? Oh, don't even bother – though this time I think Python returns the (mathematically) expected result rather than what the CPU does. But what does it matter now? I only want to imitate the processor's behavior. So I had to hack that one too.

The solution, after all, was to take the absolute value of the (Python) negative number and remember its original sign – and we do that for both operands. Then we do an unsigned division, and if the signs of the inputs differ, we flip the sign of the result. This works because we know the unsigned division behaves the same as the processor's, so we can use it safely.

res = abs(x) / abs(y); if (sign_of_x != sign_of_y) res = -res;
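
Here is a minimal Python sketch of that approach for 32 bit operands (the helper name is mine; flags and the remainder are left out, and division by zero isn't handled):

def idiv32(dividend, divisor):
    # Interpret the unsigned 32 bit inputs as 2's complement signed values.
    if dividend & 0x80000000: dividend -= 1 << 32
    if divisor & 0x80000000: divisor -= 1 << 32
    # Divide the absolute values, then fix the sign, so the result is
    # truncated toward zero like the processor does, not floored like Python.
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient & 0xffffffff  # back to the 2's complement representation

# 5 / -3: Python's own 5 // -3 gives -2, the processor (and this helper) gives -1.
assert idiv32(5, 0xfffffffd) == 0xffffffff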

The bottom line is that I really hate this behavior in Python, and it's not a bug after all. I'm not sure how many people have run into this issue like me, but it's really annoying. I don't believe they are going to fix it in Python 3 – you never know, though.

Anyway, I got my IDIV working now, and that was the last instruction I had to cover in my unit tests. Now it's analysis time :)

AAA/AAS Quirk

Thursday, April 3rd, 2008

As I was working on the simulation of these two instructions, I found that they have a quirk: although the algorithms for these instructions are described in Intel's specs, which seemingly makes the output defined for all inputs, that is not the case. Every time I finish writing an implementation for a specific instruction, I add that instruction to my unit tests. The instruction is simulated with random (and some smarter) input and then checked against pure native execution to see if the results are correct. This way I found, for a certain range of input, a quirk that reveals how the instruction is really implemented (microcode stuff, probably) rather than how it's documented.

AL = AL + 6 is done when AF is set or the low nibble of AL is above 9. According to the documentation the destination register is AL, but in reality the destination register is AX. Now, how do we know such a thing?

If we try the following input:
mov al, 0xff
aaa

The result will be 0x205, rather than 0x105 (which is what we expect according to the docs).

What really happens is that we supply a number that, when 6 is added to it, creates a carry into AH, thus incrementing AH by 1. Then, looking at the docs again, we see that when 6 is added to AL, AH is also explicitly incremented by 1. Thus AH really gets incremented by 2. :P
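
To make it concrete, here is a small Python sketch that models the two interpretations (function names are mine; only the AL/AH arithmetic is modeled, the flags are left out):

def aaa_documented(ax, af=0):
    al, ah = ax & 0xff, ax >> 8
    if (al & 0x0f) > 9 or af:
        al = (al + 6) & 0xff    # only AL is touched, so the carry is lost
        ah = (ah + 1) & 0xff
    return (ah << 8) | (al & 0x0f)

def aaa_observed(ax, af=0):
    if (ax & 0x0f) > 9 or af:
        ax = (ax + 6) & 0xffff  # the add is done on AX, the carry goes into AH
        ax = (ax + 0x100) & 0xffff
    return (ax & 0xff00) | (ax & 0x0f)

# mov al, 0xff / aaa (with AH = 0):
assert aaa_documented(0x00ff) == 0x0105
assert aaa_observed(0x00ff) == 0x0205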

The question is why they do AX = AX + 6 rather than operating on AL. No, actually the bigger question is why I get this same behavior on an AMD processor (whereas I work on an Intel processor). And we already know from my last post about SHLD that they don't behave the same in some undefined-behavior aspects (although some people believe AMD copied Intel's architecture and implementation)…

There might be some people who will say that I went too far with testing this instruction, because I supply input which is, somewhat, not in the valid range (it's unpacked BCD after all), and therefore I must not rely on the output. The thing is, the algorithm is defined well enough to receive any input I pass it, hence I expect it to work even for undefined input. Though I believe there is no such thing as undefined input, only undefined output, and that's why I implemented my instruction the way they both did – especially since neither of them states anything about undefined input/output, which makes my case stronger. Anyway, the point is that they don't tell us something here: the implementation is not the one documented in either the AMD or Intel docs.

This quirk works the same for AAS, where instead of doing AL = AL - 6, it's really AX = AX - 6. I also tried to see whether they operate on the whole of EAX, but the high word wasn't changed (by carry/borrow). And I also checked whether this ill behavior exists in DAA/DAS, but no.

Shift Double Precision

Saturday, March 29th, 2008

If you were to ask me, I have no idea why Intel supports shift double precision in the 80x86. Probably their answer would be "because it used to be a CISC processor". The shift double precision is a pretty easy algorithm to implement. But maybe it was popular back then and they decided to support it in hardware. Like nowadays, when they add very important instructions to the SSE sets. Even so, everyone (me included) seems to implement the algorithm like this:

(a << c) | (b >> (32-c))

Where a and b are the 32 bit input variables (/registers) and c is the count. The code shows a shift left double precision; shifting right would require changing the direction of each of the shifts. However, if a and b were 16 bits, the second shift amount changes to (16-c). And now there is a problem. Why? Because we might enter the magical world of undefined behavior. And why is that? Because the first thing the specs say about the shift/rotate instructions is that the count operand is masked to preserve only the 5 least significant bits. This is because the largest shift amount for a 32 bit input is 32 shifts (and then you get 0; ignore SAR for now). And if the input is 16 bits, the count is still masked with 31. That means you can shift a 16 bit register by more than its size – which doesn't make much sense, but is possible with the other shift instructions. But when you use a shift double precision, not only does it not make sense, it is also undefined. That is, the result is undefined, because then you try to move bits from b into a, but the count becomes negative. For example: shld ax, bx, 17. Internally the second shift amount is calculated as (16-c), which becomes (16-17). And that's bad, right?

In reality, everything is defined when it comes to digital logic – even the undefined stuff. There must be a reason for the result I get from executing an instruction like in the example above, even though it's correctly and officially undefined. And I know there is a rationale behind it, because the result is consistent (at least on my Intel Core 2 Duo processor). So, being the stubborn person I am, I decided I wanted to know how that calculation is really done at the hardware level.

I forgot to mention that the reason I care how this instruction is implemented is that I have to simulate it for the Vial project. I guess it's ultimately a waste of time, but I really wanted to know what's going on anyway. Therefore I decided to research the matter and come up with the algorithm my processor uses. Examining the officially undefined results, I quickly managed to see how to calculate the shift the way the processor does, and it goes like this for 16 bit input (I guess it will work the same for 8 bit input as well; note that 32 bit input can't have an undefined range, because you can't get a negative shift amount):

def shld(a, b, c):
    c &= 31
    if c <= 15:
        return ((a << c) | (b >> (16 - c))) & 0xffff
    else:
        # Undefined behavior:
        c &= 15
        return ((b << c) | (a >> (16 - c))) & 0xffff

Yes, the code is in Python. But you can see that if the count is bigger than 15, the input order is swapped. And then comes the part where you say "NOW WTF?!". Even though I got this algorithm to return the same results as the processor for both defined and undefined input, I would wager the processor doesn't do this kind of stuff internally. So I sat down some (long) more, stared at the code and did a few experiments here and there. Eventually it occurred to me:

def shld(a, b, c):
    c &= 31
    x = a | (b << 16)
    return ((x << c) | (x >> (32 - c))) & 0xffff

Now you can see that the input for the original equation is a single combined bit buffer which contains both inputs together. Taking a count of 17 won't yield a negative shift amount, but something else. Anyway, I have no idea why they implemented this instruction the way they did (and it applies to SHRD as well), but I believe it has something to do with the way their processor's so-called 'engine' works, and hardware stuff.

After I learned how it works, I was eager to see how it works on AMD. And guess what? They don't behave the same when it comes to the undefined behavior, of course. And since I don't have an AMD anymore, I didn't get to see how they really implemented their shift double precision instructions.

In the Vial project, where I simulate these instructions, I added a special check on the count, to see that it's not bigger than the input size, and if it is, I mark the destination register and some of the flags as Undefined. This way, when I do code analysis, I will know that something is really wrong/buggy with the way the application works. Now, what if the application purposely uses the undefined behavior? Screw us both then. Why would a sane application do that? Ohh, and that's another story…
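
Something along these lines (just a sketch following the description above – the state object and its methods are hypothetical, the real Vial code looks different):

def simulate_shld16(state, dst, src, count):
    count &= 31
    if count > 16:
        # Officially undefined result: taint the destination and the flags
        # so the later code analysis knows not to rely on them.
        state.set_undefined(dst)
        state.set_undefined("CF", "OF", "AF")
        return
    a, b = state.read(dst), state.read(src)
    state.write(dst, ((a << count) | (b >> (16 - count))) & 0xffff)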

By the way, the other shift/rotate instructions don't have any problem with the shift amount, since they can't internally yield a negative shift amount in any way; therefore their results are always defined for every input.