X86 Assemblyyy

Complex instructions are really useful, especially if you are trying to optimize the size of your code. Of course, modern processors are becoming more and more RISC-ish. But on x86, backward compatibility keeps those instructions around (forever?), ready for you to use. The funny thing is that on modern x86 processors the simple RISC-like instructions are probably faster, so compilers don’t generate code with the CISC instructions. The thing is, when you are size-optimizing your code or writing shellcode, you don’t care much about speed at all. So why not take advantage of those instructions?

The most popular x86 CISC instruction is LOOP. It’s a simple one as well: it decrements the general-purpose register CX (/ECX/RCX) by one and jumps to some address if the result is not zero. So you get something like three sub-instructions in one, or call them micro-ops: a decrement, an if statement (cmp) and a branch.
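To make the three pieces concrete, here is a minimal C sketch that models the 32-bit form of LOOP. The names `loop_step` and `sum_with_loop` are made up for illustration; this is a model of the behaviour described above, not how any processor implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one LOOP step (32-bit form): decrement the
   counter, then report whether the branch would be taken.
   Unlike DEC, the real instruction leaves the flags untouched. */
int loop_step(uint32_t *ecx)
{
    *ecx -= 1;          /* the "decrement" part */
    return *ecx != 0;   /* the "compare and branch" part */
}

/* Sums n entries the way a "body ... LOOP body" construct would.
   The do/while shape matters: the body runs before the counter is
   tested, so entering with a zero counter would wrap around. */
uint32_t sum_with_loop(const uint32_t *a, uint32_t n)
{
    uint32_t ecx = n, sum = 0;
    assert(n != 0);     /* real code would guard this, e.g. with JECXZ */
    do {
        sum += *a++;
    } while (loop_step(&ecx));
    return sum;
}
```

Note that a counter of zero on entry does not mean "skip the loop": the decrement happens first, so the counter wraps to 0xFFFFFFFF and the branch is taken.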

Speaking of LOOP, there are also LOOPZ and LOOPNZ. In addition to requiring that rCX is not zero, these instructions test the Zero flag: LOOPZ branches only if ZF is set, and LOOPNZ only if ZF is clear. Which means that you “earn” another condition test for free. For instance, if you want to run some test on each entry in an array, and continue to the next entry only if the previous one was successful and there are still cells to scan, those instructions might be helpful.
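The array-scanning case can be sketched in C as follows. It models a loop whose body compares each entry against zero and ends with LOOPNZ, so the scan stops either when a zero entry is hit or when the counter runs out. The function name `scan_until_zero` and the choice of "entry is zero" as the test are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of:
       next: cmp dword [esi], 0   ; ZF = 1 if the entry is zero
             lea esi, [esi+4]     ; advance without touching flags
             loopnz next          ; continue while ecx != 0 and ZF == 0
   Returns the number of entries examined and stores the final ZF in
   *zf_out (1 means a zero entry ended the scan). */
uint32_t scan_until_zero(const uint32_t *a, uint32_t n, int *zf_out)
{
    uint32_t ecx = n;
    int zf;
    assert(n != 0);                 /* same zero-count caveat as LOOP */
    do {
        zf = (*a++ == 0);           /* the cmp sets ZF */
        ecx -= 1;                   /* LOOPNZ decrements, flags untouched */
    } while (ecx != 0 && !zf);      /* both conditions in one instruction */
    *zf_out = zf;
    return n - ecx;
}
```

After the loop, ZF still tells the two exit reasons apart, which is why the model hands it back to the caller.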

I have never seen anyone use those instructions, not even in code crunching. I think it’s because most people just don’t read the specs, and even when they do, they don’t know how to use those instructions. Not that they are hard to use, but maybe they are a bit confusing or just not popular.

I found a somewhat useless combination: the repeat prefix with the LODS instruction. A single LODSB reads the byte at address DS:rSI into AL and advances rSI (forward or backward, depending on DF…); with a REP prefix it does so rCX times. So you end up with code that leaves in AL the last byte of the buffer that rSI was pointing to… (of course, that depends on the initial value of rCX). I think that on the 8086 this REP and LODS combination was prohibited. So while I was working on diStorm, I made it ignore the REP prefix whenever it precedes a LODS instruction. Then I got an angry email saying that this is not the case today and that the combo is supported… I even checked the current specs, and it seems that the guy was right. So honestly, I’m not sure it’s useful for anything… but it’s cool to note.
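As a rough illustration, REP LODSB can be modelled in C like this. The `cpu_t` struct and the function name are invented for the sketch; only the registers that matter here are represented, and segment handling is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal register file for the model. */
typedef struct {
    const uint8_t *rsi;  /* source pointer (DS:rSI) */
    size_t rcx;          /* repeat counter */
    uint8_t al;          /* accumulator */
    int df;              /* direction flag: 0 = forward, 1 = backward */
} cpu_t;

/* Model of REP LODSB: each iteration overwrites AL, so only the last
   byte read survives. With rcx == 0 nothing is executed at all. */
void rep_lodsb(cpu_t *c)
{
    assert(c->df == 0 || c->df == 1);
    while (c->rcx != 0) {
        c->al = *c->rsi;             /* LODSB: load the byte into AL */
        c->rsi += c->df ? -1 : 1;    /* advance according to DF */
        c->rcx -= 1;                 /* REP: count down */
    }
}
```

With DF clear, the net effect is that AL holds the last of the rcx bytes and rsi has moved forward by rcx.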

Another instruction I wanted to talk about is SCAS. I guess you know this instruction from the strlen implementation that goes as follows:

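; assumes EDI points to the NUL-terminated string
sub ecx, ecx   ; ecx = 0
sub al, al     ; al = 0, the byte to search for
not ecx        ; ecx = 0xFFFFFFFF, the maximum count
cld            ; scan forward
repne scasb    ; scan until the byte equals al or ecx runs out
not ecx        ; ecx = number of bytes scanned, terminator included
dec ecx        ; drop the terminator: ecx = length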
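To see why the NOT/DEC arithmetic at the end yields the length, here is a line-by-line C model of the sequence above. It is only a sketch: `strlen_scasb` is a made-up name, EDI is modelled as a plain pointer, and the REPNE loop is spelled out by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the repne scasb strlen sequence. */
uint32_t strlen_scasb(const char *s)
{
    const uint8_t *edi = (const uint8_t *)s;
    uint32_t ecx = 0;                /* sub ecx, ecx */
    uint8_t al = 0;                  /* sub al, al   */
    assert(s != NULL);
    ecx = ~ecx;                      /* not ecx: 0xFFFFFFFF */
    while (ecx != 0) {               /* repne scasb */
        int zf = (al == *edi++);     /*   compare, then advance edi */
        ecx -= 1;                    /*   the count drops on every pass */
        if (zf)                      /*   stop once the byte matched */
            break;
    }
    ecx = ~ecx;                      /* not ecx: bytes scanned, NUL included */
    ecx -= 1;                        /* dec ecx: drop the NUL */
    return ecx;
}
```

For a string of length n the scan runs n + 1 times, leaving ecx at 0xFFFFFFFF - (n + 1); inverting that gives n + 1, and the final decrement gives n.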

Now, I’m not sure whether this is the fastest way to implement strlen; some compilers use this implementation and others use the find-a-zero-byte-inside-a-dword trick. Though maybe I should talk about those tricks in another post someday…

Anyway, back to SCASB. Now that we have seen how strlen is implemented, we know that the REPNE prefix means: continue as long as rCX is not zero and as long as the Zero flag is clear. So we test two conditions in one instruction. In the code above the REPNE prefix tests ZF, but the truth is that the SCAS instruction updates all the other arithmetic flags as well. So think of SCAS as a compare instruction between the accumulator register (AL/AX/EAX/RAX) and the memory at ES:rDI… For example, you can do a SCAS and then a JS (jump on sign)…
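A small C sketch of that "SCAS is really a CMP" view, limited to three of the flags. The `flags_t` type and `scasb_flags` name are hypothetical; the model assumes the byte form, where the flags come from the subtraction AL minus the memory byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the flags that SCASB updates. */
typedef struct {
    int zf;  /* zero flag: the operands were equal */
    int sf;  /* sign flag: bit 7 of the difference */
    int cf;  /* carry flag: an unsigned borrow occurred */
} flags_t;

/* Model of the comparison done by SCASB: the flags are set as if by
   AL - mem, and the difference itself is thrown away. */
flags_t scasb_flags(uint8_t al, uint8_t mem)
{
    uint8_t diff = (uint8_t)(al - mem);
    flags_t f;
    f.zf = (diff == 0);
    f.sf = (diff & 0x80) != 0;
    f.cf = (al < mem);
    assert(f.zf == (al == mem));  /* sanity check: ZF is plain equality */
    return f;
}
```

In this model a JS after SCASB would be taken whenever `sf` is 1, for example with AL = 1 and a memory byte of 2.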

There are many other forsaken instructions that are not fully used, so the next time you fire up your assembler, take another look at the specs; maybe you will find something better. And if you have more ideas of this kind, you are welcome to leave a comment.

This entry was posted on Monday, October 8th, 2007 at 5:28 pm and is filed under Assembly.

11 Responses to “X86 Assemblyyy”

arkon says:

October 9, 2007 at 3:53 am

I think you meant this implementation for ROL:
x = (a << b) | (a >> (32-b))

I have never seen it used by compilers. Maybe it was used as some trick combination.

Attila Suszter says:

October 9, 2007 at 4:08 am

> I think you meant this implementation for ROL:

Yes, and I wrote this, of course. I don’t know what happened to my code, maybe the parser was not working very well.

toptigin says:

October 15, 2007 at 10:14 am

what is this rant? have some more beer…
stop wasting time, write something useful

arkon says:

October 15, 2007 at 6:34 pm

give me some ideas, or i start talking COM here. :)

Peter Ferrie says:

October 16, 2007 at 10:27 am

The rep lods is a particularly interesting one – it’s equivalent to lea esi, [ecx*1/2/4 + esi] (1/2/4 depending on lodsb/w/d), and smaller, too. For lodsb, it’s the same as add esi,ecx but without affecting the carry.
Very cool.

arkon says:

October 16, 2007 at 2:14 pm

that’s an interesting way to look at it.

and –
of course it is necessary to remember that it also affects the accumulator.
funny thing is that if esi points to an invalid page you will get an av :)

Erez says:

November 3, 2007 at 7:52 am

“I have never seen anyone uses those instructions”
That is because you have never looked at my code :)

But the best use of a CISC instruction I’ve seen so far was within the MASM stdlib: They brilliantly used DAS (Decimal Adjust) in order to convert a byte to 2 hex chars. Truly a work of art. (If it interests anyone I’ll make the effort of finding that implementation again)

Charles Doty says:

November 29, 2007 at 1:10 pm

rep movsd and LOOP have been slower than a load/store/decrement counter loop since the 286. That could explain why LOOP is very rarely used.

Petey B says:

January 31, 2009 at 6:54 am

Fifteen years ago I found a use for “REP LODS?” writing a database engine in assembler. Assume there is a structure of indeterminate size that starts with a value consisting of bit flags which indicate what values are present in that structure. If a value is not present in the structure then a default value (defined as a const) should be used.

lds si,Source
les di,Target
mov bx,Flags

You could (and probably would) do something like this:

test bx,Flag1
jnz LoadVal
mov ax,Default1
stosw
jmp done
LoadVal:
movsw
done:

But how about

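xor cx,cx
shr bx,1        ; CF = current flag bit (assuming Flag1 is bit 0)
rcl cx,1        ; cx = 1 if the flag is set, else 0
mov ax,Default1
rep lodsw       ; executes once or not at all
stosw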

If cx is zero the lodsw never gets executed and the default value is stored. In the modern paradigm it may not be faster; but in the old days, when dumping the prefetch queue was to be avoided at all costs, it was an enlightened use of an opcode/prefix combination that even Michael Abrash (Zen of Assembler) considered useless and included only for the sake of completeness (actually, the chip design was such that excluding it would have meant a lot more overhead). Conditional execution without branching: faster on a 286, smaller on anything and, finally, a use for something long considered inutile.

arkon says:

January 31, 2009 at 10:30 pm

Wow man, I really liked this trick!

arkon says:

February 9, 2009 at 3:35 pm

I just realized another interesting use for “rep lodsb” is for something like ProbeForRead :)

Insanely Low-Level
