Archive for March, 2008

Shift Double Precision

Saturday, March 29th, 2008

Were you asking me I had no idea why Intel has support for shift double precision in the 80×86. Probably their answer would be “because it used to be a CISC processor”. The shift double precision is pretty easy to implement algorithm. But maybe it was popular back then and they decided to support it hardware-ly. Like now that they add very important instructions to the SSE sets. Even so, everyone (includes me) seems to implement the algorithm like this:

(a << c) | (b >> (32-c))

Where a and b are the 32 bits input variables(/registers) and c is the count. The code shows a shift left double precision. Shifting right will require to change the shifts direction for each one of the shifts. However, if a and b were 16 bits, the equation of the second shift amount changes to (16-c). And now there is a problem, why? Because we might enter into the magical world of undefined behavior. And why is that? Because the first thing that describes the shift/rotate instructions is that the count operand is masked to preserve only the 5 least significant bits. This is because the largest shift amount for a 32 bits input is 32 shifts (and then you get a 0, ignore SAR for now). And if the input is 16 bits, the count is still masked with 31. That means that you can shift a 16 bits register more than its size. Which doesn’t make much sense, but possible for other shift instructions. But when you use a shift double preicision, not that it doesn’t makes sense, it is also undefined. That is the result is undefined, because then you try to move bits from b into a. But the count becomes negative. For example: shld ax, bx, 17. And internally the second shift amount is calculated as (16-c) which becomes (16-17). And that’s bad, right?

In reality everything is defined when it comes to digital logic. Even the undefined stuff. There must be a reason to the result I get from executing such an instruction like in the example above, even though it’s correctly and officially undefined. And I know that there is a rational behind it, because the result is consistent (at least to my Intel Core2Duo processor). So being the stubborn I am, I decided I want to know how that calculation is really being done in the hardware level.

I forgot to mention that the reason I care of how to implement this instruction is because I have to simulate it for the Vial project. I guess eventually it’s a waste of time, but I really wanted to know what’s going on anyway. Therefore I decided to research the matter and get with the algorithm my processor uses. Examining the results of officially undefined results, I quickly managed to see how to calculate the shift like the processor does, and it goes like this for 16 bits input (I guess, it will work the same for 8 bits input as well, and note that 32 bits input can’t have an undefined range, because you can’t get a negative shift amount):

def shld(a, b, c):
 c &= 31
 if c <= 15:
  return ((a << c) | (b >> (16-c))) & 0xffff
 else:

  # Undefined behavior:
  c &= 15
  return ((b << c) | (a >> (16-c))) & 0xffff

Yes, the code is in Python. But you can see that if the the count is bigger than 15, then we are replacing the input order. And then comes the part where you say “NOW WTF?!”. Even though I got this algorithm to return the same results as the processor does for defined and undefined input, I could wager the processor won’t do this kind of stuff internally. So I sat down some (long) more, and stared at the code, doing a few experiments here and there. Eventually it occurred to me:

def shld(a, b, c):
 c &= 31
 x = a | (b << 16)
 return ((x << c) | (x >> (32-c))) & 0xffff

Now you can see that the input for the original equation is the same bits-buffer input, which contains both inputs together as one. Taking a count of 17, won’t yield a negative register, but something else. Anyway, I have no idea why they implemented this instruction like they did (and it applies to SHRD as well), but I believe it has something to do with the way their processor so-called ‘engine’ works and hardware stuff.

After I learned how it works I was so eager to see how it works on AMD. And guess what? They don’t work the same, where it comes to the undefined behavior, of course. And since I don’t have an AMD anymore I didn’t see how they really implemented their shift double precision instructions.

In the Vial project, where I simulate these instructions, I added a special check for the count, to see that it’s not bigger than the input size, and if it is, I mark the destination register and some of the flags as Undefined. This way I will know when I do code-analysis that something is really wrong/buggy with the way the application works. Now what if the application is purposely uses the undefined behavior? Screw us both then. Now why would a sane application do that? ohh and that’s another story…

By the way, other shift/rotate instructions don’t have any problem with the shift amount since they can’t yield negative shift amount internally in any way, therefore the results are always defined for every input.

Testing The VM

Saturday, March 22nd, 2008

Since Imri and I started working on this huge project, we finally got to a situation where we got a few layers ready: disassembler to structure output –> structure to expression tress –> VM to run those expressions. This is all for the x86 so far, but the code is supposed to be generic, that’s why there’s the expression trees infrastructure from the beginning, so you can translate any machine code to this language and then the same VM will be able to run it. Sounds familiar in a way? Java, Ahm Ahm…

Imri has described some of our work in his last blog post, where you can find more information on what’s going on. But now that we got the VM somewhat working, my next step is to check all layers, that is, running a piece of code (currently assembly based) and see that I get the same results after running it for real. So after getting the idea of inline-assembly in Python, yes yes :) you heard me right. However, thanks to a friend who hasn’t published it yet. I changed the idea so instead of inline assembly I have a function that takes a chunk of assembly text and runs it both on the VM and natively under Python. Then when both are done executing, I check the result in EAX and see if they match… EAX can hold anything, like the EFLAGS and then I can even see if the flags of an operation like INC were calculated well in my VM…

The way I run the code in Python natively is by using ctypes. I wrote a wrapper for YASM and now I can compile a few assembly lines and then the decomposer (disassembler to expressions layer) is fed with the output of machine code, which is being ran in the VM. Then I take ctypes to run a raw buffer of code, using WINFUNCTYPE and the address of the buffer of the binary code, which I can then execute as simple as calling that instance. And I run it natively inside Python’s own thread’s context. The downsides are that I’m capped to 32 bits (since I’m on 32 bits environment), where x86 can be 16, 32 or 64 bits. Thus, I can’t really check 16 and 64 bits at the moment. More disadvantage are that I can’t write anywhere in memory for the sake of it, I need to allocate a buffer and feed my code with the pointer to it, unlike in the VM where I can read and write from anywhere I wish (it doesn’t imitate a PC). And I better not raise any exceptions whether intantionaly or not, because I will blow up my own Python’s process. ;) So I have to be a good boy and do simple tests like: mov al, 0x7f; inc al. and then see if the result is 0x80 and whether OF is set for example. This is really amazing to see the unit test function returns true on such a function :) So on the way we find bugs in the upper layers that sit above diStorm (which is quite stable on its own) and fixing them immediately and then retrying the unit tests. While I add more unit tests and fix more things. Eventually the goal is to take an inline C compiler and run the code and see the result. I’m interested in checking specific instructions at this phase of the project to know that everything we are based on works great, and then we can go with bigger blocks of code, doing fibunaci and other stuff in C…

I am a bit annoyed about the way that I run the code natively. I mean, I trust the code since I write it myself and there are only simple tests. But I can screw up the whole process by dividing by zero for example. I have to run the tests on an x86 machine with 32 bits environment, so I can’t really check different modes. The good thing is that I can really use the stack, as long as I leave it balanced when I end the test (like recovering all regs, etc)… Though I wish I could have run the tests in a real v86 processor or something like that, it won’t be portable with Linux, etc… Even spawning another process and inject the code inside will require some code for Windows and for Linux. And even then if the code is bogus, I can’t really control everything, it will require more work per OS as well. So for now the infrastructure is pretty cool and it gives me what I need. But if you have any idea of how to better doing it, let me know.

Converting A Floating Point Number To A String – Explained

Sunday, March 16th, 2008

I was always curios about how to convert a floating point number into a decimal ASCII string. The equation is trivial after you get to implement it. But the idea behind took me a long while to come up with. For the sake of example I will deal with a single precision floating point of IEEE 754. However, it doesn’t really matter to me how many bits I have to convert into a string, the algorithm stays the same. Floating Point numbers have all this crap about NaNs and QaNs and other weird stuff that I didn’t care about. They are just technical stuff that you have to implement, nothing really hard about it. I am just gonna focus on the way to convert the mantissa (that’s the part of the fraction of the real number) into decimal. I think it’s rather easy to take a look at the implementation of printf in libc of Linux. That way you don’t have a challenge with coming with this, eventually, simple algo on your own.

To simplify matters, I don’t handle big exponents, again it was not my focus. If you wish to change this code to something really usable, it’s possible with some work. Here I am not going to cover the format of the IEEE 754 floating point. I presume you know it, and if not, give it a look here.

void pf(unsigned long x)
{
 unsigned int sign = x >> 31;
 unsigned int exp = ((x >> 23) & 0xff) - 127;
 unsigned int man = x & ((1 << 23) - 1);
 man |= 1 << 23;
 if (sign) printf("-");
 printf("%d", man >> (23 - exp));
 printf(".");
 unsigned int frac = man & ((1 << (23-exp)) - 1);
 unsigned int base = 1 << (23 - exp);
 int c = 0;
 while (frac != 0 && c++ < 6)
 {
  frac *= 10;
  printf("%d", (frac / base));
  frac %= base;
 }
 printf("\n");}
}

You can see in the beginning that we extract the sign bit, exponent and mantissa from the raw number we got as a parameter. Then we fix the exponent since it is biased, though it’s a really nice trick, I won’t cover it today. And we set bit 24 in the mantissa, because we assume we got a normalized floating point number where bit 24th in the data is implied and not saved to spare this extra bit… Printing the sign is self explantory. And then we print the integer part of the mantissa, while taking care of the exponent. This is where a too big or too small exponent will screw up things. But for small integers everything is fine just yet.

Now the fun begins, converting the mantissa from base 2 to base 10. Yipi kay hey. Let’s see. We store the fraction part of the floating point number and the base. For now let’s ignore this base variable and we will get back to it later. We limit the loop to print 6 digits or less when there is no more what to print. The algorithm here is to ‘pull’ the digits over to be in the integer part of the number and print that part which is straight forward (though it’s only a digit and not a whole integer). Then we see how much more digits we have left to print and repeat the body of the loop until we’re done or printed enough digits.

I will try to explain this loop again this way: Let’s say you want to print the number 5/6 in decimal. How will you do that? That’s the key algorithm for converting the number to decimal as well in our case. I was told by a friend that children are taught this method in elemtary school, maybe I lacked of that class or we didn’t have it. :) Then what we are doing is: multiplying the 5 by 10 then see how many times it gets in the result: 50/6 = 8, then we do a modulu with 6, getting 50%6=2. And starting again, 2*10=20; 20 / 6 = 3; 20%6=2; 2*10=20; 20/6=3 and over and over again, the final result is 0.8333… This is similar to 1/3, try this algo on 1/4 and you will see that you get 2 and then 5 and then you reach 0 and stop, resulting in 0.25.

Back to the code above, we take the fraction part and we have a base, take a long breath – of 2 powered by the number of bits the fraction use to be stored. Why is that the base?

Like when you convert an integer to decimal, you have to multiply every set bit by 2**index, where the index is the index of the set bit in the integer stream. For example: 111b is 1*2**0 + 1*2**1 + 1*2**2 = 1 + 2 + 4 = 7. The same goes for bits that are on the right side of the point (thus fractions): .11b is 1*2**(-1) + 1*2**(-2) = 1/2 + 1/4 = 0.75. Notice that the indices this time are negative, and power of negative number is division, therefore: 1/base**(-index). Now instead of doing this conversion per bit, we can do it simply by dividing the fraction once, in our example: 11b by 4. (base powered by number of bits, 2**2=4). Now we treat the fraction as an integer and divide it by the new calculated base; we get 3/4 = 0.75. We saw how the conversion can be done easily and we need a way to print the result of such division… And now we’re back to the description above of how to print a simple fraction. This time we know the reasons behind the fraction and the base. Note that the base is bigger than the fraction by nature, otherwise we won’t have a correct input for the algo and then it won’t work well.

How to use the above code:
float x = 1337.99;
pf(*(unsigned long*)&x);
printf(“%f”, x);

Running the code using a different number every time, you will see that printf %f actually rounds the number where we print it as a raw floating number, without touching it. Therefore, next time I will talk about rounding the number.

A Common Bug With LEA

Friday, March 7th, 2008

If we examine the LEA instruction from text output point of view, we can say that it is similar to the MOV instruction, but a one that uses memoy indirection. Hence you got the square brackets to denote a memory address… Only when you interpret LEA you should know to remove the fake memory indirection and to treat the ‘address’ as an immediate value, or as an expression. The expressions can get as complex  as eax*8+ebx+1234. In reality, in order to simplify matters – the second (source) operand’s type of LEA will be probably OP_TYPE_MEM. Just like any other instruction that might have only a memory indirection, for instance, cmpxchg… So why shouldn’t we (disassemblers) have a special operand type for LEA? Well, mainly to say because it’s a headache to maintain two types that parse the memory indirection bytes, and eventually they really output the same stuff, so why not having only one type which LEA will be able to piggy back?

In diStorm, I used OP_TYPE_MEM for LEA’s operand. I didn’t want to have another special OP_TYPE_LEA. And I didn’t have any problem with it. Until today, I thought of a ‘bug’. You can use a segment override to change the default segment an address will be read from or written to. Here MOV EAX, [FS:EAX]. Normally the segment overrides you will see are only GS and FS, since DS and ES are usually set to the same value, so you can copy data from a source to dest bufferw using the MOVS instruction. Or you can compare two buffers using CMPS… And of course you cannot run without CS and SS… For that reason in 64bits they got rid off CS, DS, ES and SS segment overrides.

So what’s the bug you’re asking? LEA EAX, [FS:EAX] – Doesn’t really make any sense ah? And why does it happen? Because I use the MEM type… So I had two options, either add a special type so you filter the segment overrides for LEA. Or just filter the segment overrides when you see the instruction you decode is LEA. For simplicity I implemented the latter.

Anyway, diStorm is already updated by now. The bug was found in other disassemblers as well, to name Olly and VS’s debugger. I didn’t try other disassemblers, I guess more have this issue. Not to my surprise, IDA is immune.

The moral of the story? hmmm, don’t be a lazy jackass? No. Just don’t assume too much and try to think of the small details. :)

UPDATE:

Try to feed the assemblers with that buggy instruction and you will see that they generate it errornously with the segment override prefix :)

Fooling Around With LEA

Wednesday, March 5th, 2008

Yesterday it hit me. I just realized something so funny that I had to post it right here. I have been using LEA for years now and so have you I guess. Most of the times LEA is used to load an offset of a local variable in a function, for example:

void f()
{
int x;
g(&x);
}

The parameter &x for calling g will use LEA to load the address of x and pass it to g so g can change it inside. But this is nothing new.

You can write something like this:
LEA EAX, [0x12345678]
And you know what?
EAX will be now 0x12345678
This is somewhat trivial when you get to think about it, but when do you??
I wonder how good it is as anti-disassemblers stuff, I think it will get the disassembler a bit crazy, it worth a test… because now instead of loading immediates with MOV, you can use LEA…

Basic Ops #4 – Division

Saturday, March 1st, 2008

I wasn’t sure whether to post this one or not. More over, my dilemma was whether to document this one or not. I decided that it’s so hard to explain what’s going on, that I will only dump my code here. The interface is ugly, because you cannot divide by zero…

// return success, *ret = a / b
int div(unsigned short a, unsigned short b, unsigned short *ret)
{
 if (b == 0) return 0;
 *ret = 0;
 unsigned short mask = 0, len = 0;

 // calc the mask of b.
 for (int i = 15; i >= 0; i–) {
  if (b & (1 << i)) {
  mask = (1 << i) – 1;
  len = 16 – i;
  break;
  }
 }
 // complement mask.
 mask = ~mask;

 // left justify b.
 while ((~b) & (1 << 15)) b <<= 1;

 while (len–) {
  *ret <<= 1;
  if ((a & mask) >= b) {
   *ret |= 1;
   a -= b;
  }
  b >>= 1;
  mask >>= 1;
 }

 return 1;
}

And yes it’s 16 bits division, it will apply for 32 or 64 bits as well. It really was one of the toughest things to write from scratch. Division is not that trivial like subtraction or multiplication. The trick eventually was to use the fact that we’re dealing with binary division..oh well, I quit. Get someone else to describe this one. Although I think there are many ways to implement this one. I remember a friend telling me he didn’t expect this way of solution. I don’t think, however, that this technique may be used for long division for bit array or really big numbers, maybe there’s something faster.

By the way, this is an unsigned division. As far as I know, and I might be wrong, you can’t do signed division without using unsigned division. And to do signed division you just remember the sign of a and b and then make them absolute and then you change the result accordingly. Anyways, if I’m wrong with my assumption, let me know otherwise.