X86 Assemblyyy

Complex instructions are really useful, especially if you are trying to optimize the size of your code. Of course, modern processors are becoming more and more RISC-ish. But on X86, backward compatibility makes those instructions stay there (forever?), ready for you to use. The funny thing is that on modern X86 processors the simple RISC-like instructions are probably faster, so compilers don’t generate code with the CISC instructions. The thing is, when you are size-optimizing your code or writing shellcode, you don’t care much about speed at all. So why not take advantage of those instructions?

The most popular X86 CISC instruction is LOOP. It’s a simple one as well: it decrements the general purpose register CX (/ECX/RCX) by one and jumps to some address if the result is not zero. So you have something like 3 sub-instructions in one, or call them micro-ops: a decrement, an if statement (cmp) and a branch.
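To make the three steps concrete, here is a minimal sketch of LOOP written as a Python model rather than as assembly, so it can be run anywhere. The names loop_step and run_loop are made up for this illustration, and the 16-bit default width is an assumption standing in for CX.

```python
def loop_step(cx, width=16):
    """Model one execution of LOOP (hypothetical helper, not a real API)."""
    mask = (1 << width) - 1
    cx = (cx - 1) & mask   # the decrement, with wraparound; LOOP leaves the flags alone
    return cx, cx != 0     # the test and the branch: taken while the counter is non-zero


def run_loop(cx, width=16):
    """Count how many times a body placed before LOOP executes."""
    iterations = 0
    while True:
        iterations += 1                    # the loop body runs first
        cx, taken = loop_step(cx, width)   # then LOOP decides whether to jump back
        if not taken:
            return iterations
```

Note what the model does with a counter that starts at zero: the decrement wraps around, so the body runs 2^16 times instead of zero times. That is why size-optimized code often guards the loop with JCXZ.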

Speaking of LOOP, there are also LOOPZ and LOOPNZ. In addition to requiring rCX to be non-zero, these instructions branch only if the Zero flag is set (LOOPZ) or clear (LOOPNZ). Which means that you “earn” another condition test for free. For instance, if you wanted to run some test on each entry in an array and continue to the next entry only if the previous test was successful and there are still cells to scan, those instructions might be helpful.
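The array scenario can be sketched as a small Python model. This is only an illustration under assumptions: scan_while is a made-up name, the test callback stands in for whatever compare sets ZF in real code, and the array is assumed to be non-empty (with rCX = 0 on entry the real instruction would wrap around).

```python
def scan_while(entries, test, continue_on_zf):
    """Model a loop closed by LOOPZ (continue_on_zf=True) or LOOPNZ (False).

    test(entry) returns the Zero flag produced by the loop body.
    Returns (index of the last entry tested, remaining counter, final ZF).
    """
    if not entries:
        raise ValueError("the model assumes a non-empty array")
    cx = len(entries)
    index = 0
    while True:
        zf = test(entries[index])             # loop body: sets ZF
        cx -= 1                               # LOOPcc decrements the counter...
        if cx != 0 and zf == continue_on_zf:  # ...and checks both conditions
            index += 1
            continue
        return index, cx, zf
```

For example, scan_while([5, 3, 0, 9], lambda v: v == 0, False) models a LOOPNZ search for the first zero entry. After the loop you still have to look at ZF to tell whether it stopped because of the test or because the counter ran out.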

I have never seen anyone use those instructions, not even in code crunching. I think it’s because most people just don’t read the specs, and even if they do, they don’t know how to use those instructions. Not that they are hard to use, but maybe they are a bit confusing or just not popular.

I found a somewhat useless combination of the repeat prefix with the LODS instruction. REP LODSB means: read the byte at address DS:rSI into AL and advance rSI (by examining DF…), repeated rCX times. So you end up with code that gets into AL the last byte of the buffer that rSI was pointing to… (Of course, it depends on the initial value of rCX.) I think that on the 8086 this combination of repeat and LODS was prohibited. So while I was working on diStorm, I made it so that if a LODS instruction is prefixed with a REP, that REP prefix is ignored. Then I got an angry email saying that today it’s not the case and this combo is supported… I even checked the current specs and it seems that the guy was right. So honestly, I’m not sure it’s useful for anything… but it’s cool to note it.
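A rough Python model of what REP LODS leaves behind may help here. The function name rep_lods and the flat bytes object standing in for memory are assumptions made for this sketch; segmentation and page faults are ignored.

```python
def rep_lods(memory, si, cx, size=1, df=0):
    """Model REP LODSB/LODSW/LODSD (size = 1, 2 or 4 bytes).

    Returns (accumulator, si, cx). The accumulator is None when cx starts
    at zero, meaning the real register would be left unchanged.
    """
    acc = None
    step = -size if df else size   # DF selects the direction
    while cx != 0:
        acc = int.from_bytes(memory[si:si + size], "little")  # load the element
        si += step                                            # advance the pointer
        cx -= 1                                               # count down
    return acc, si, cx
```

With DF clear the net effect is that the accumulator holds the last element and rSI has moved forward by the initial rCX times the element size.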

Another instruction I wanted to talk about is SCAS. I guess you know this instruction from the strlen implementation, which goes as follows:

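sub ecx, ecx    ; ecx = 0
sub al, al      ; al = 0, the terminator to search for
not ecx         ; ecx = 0xFFFFFFFF, the maximum count
cld             ; scan forward
repne scasb     ; compare al with [edi], advancing edi, until they match (edi must point to the string)
not ecx         ; ecx = number of bytes scanned, terminator included
dec ecx         ; drop the terminator: ecx = length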
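The NOT/DEC arithmetic at the end is easier to believe after tracing it once. The sketch below replays the sequence in Python, one statement per instruction. It is a model under assumptions: a bytes object stands in for memory, edi is an index into it, and the function name is made up.

```python
def strlen_repne_scasb(memory, edi):
    """Replay the strlen sequence on a 32-bit counter."""
    mask = 0xFFFFFFFF
    ecx = 0                     # sub ecx, ecx
    al = 0                      # sub al, al
    ecx = ~ecx & mask           # not ecx -> 0xFFFFFFFF
    # cld: the direction flag is clear, so edi moves forward
    while ecx != 0:             # repne scasb
        byte = memory[edi]
        edi += 1
        ecx = (ecx - 1) & mask
        if byte == al:          # ZF set by the compare: REPNE stops
            break
    ecx = ~ecx & mask           # not ecx -> number of bytes scanned
    ecx = (ecx - 1) & mask      # dec ecx -> do not count the terminator
    return ecx
```

For the string "hello" six bytes are scanned, so the counter ends at 0xFFFFFFFF - 6; NOT turns that into 6 and DEC into 5.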

Now, I’m not sure whether this is the fastest way to implement strlen; some compilers use this implementation and others use the find-a-zero-byte-inside-a-dword trick. Though maybe I should talk about those tricks in another post someday…

Anyway, back to SCASB. Now that we have seen how strlen is implemented, we know that the REPNE prefix means: continue as long as rCX is not zero and as long as the Zero flag is clear as well. So we test two conditions in one instruction. In the code above the REPNE prefix tests ZF, but the truth is that the SCAS instruction updates the other arithmetic flags too. So think of the SCAS instruction as a compare instruction between the accumulator register (AL/AX/EAX/RAX) and the memory at ES:rDI… For example, you can do a SCAS and then a JS (jump on sign)…
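To illustrate the “SCAS is a compare” point, here is a small Python model of the flags a single SCASB produces. It covers only ZF, SF and CF, and the function name is an assumption made for this sketch; the subtraction is accumulator minus memory, as with CMP.

```python
def scasb_flags(al, mem_byte):
    """Model the ZF/SF/CF result of SCASB, which computes AL - [ES:rDI]."""
    result = (al - mem_byte) & 0xFF
    return {
        "ZF": result == 0,          # equal bytes
        "SF": bool(result & 0x80),  # top bit of the difference, what JS tests
        "CF": al < mem_byte,        # unsigned borrow, what JB/JC tests
    }
```

So a SCASB followed by JS branches when the 8-bit difference has its top bit set, and a SCASB followed by JB branches when AL is below the memory byte as an unsigned value.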

There are many other forsaken instructions that are not fully used, so the next time you fire up your assembler, take a look at the specs again; maybe you will find something better. And if you have more ideas like these, you are welcome to leave a comment.

11 Responses to “X86 Assemblyyy”

  1. arkon says:

    I think you meant this implementation for ROL:
    x = (a << b) | (a >> (32-b))

    I have never seen it used by compilers. Maybe it was used as some trick combination.

  2. Attila Suszter says:

    > I think you meant this implementation for ROL:

    Yes, and I wrote this, of course. I don’t know what happened to my code, maybe the parser was not working very well.

  3. toptigin says:

    what is this rant? have some more beer…
    stop wasting time, write something useful

  4. arkon says:

    give me some ideas, or i start talking COM here. :)

  5. Peter Ferrie says:

    The rep lods is a particularly interesting one – it’s equivalent to lea esi, [ecx*1/2/4 + esi] (1/2/4 depending on lodsb/w/d), and smaller, too. For lodsb, it’s the same as add esi,ecx but without affecting the carry.
    Very cool.

  6. arkon says:

    that’s an interesting way to look at it.

    and –
    of course it is necessary to remember that it also affects the accumulator.
    funny thing is that if esi points to an invalid page you will get an av :)

  7. Erez says:

    “I have never seen anyone uses those instructions”
    That is because you have never looked at my code :)

    But the best use of a CISC instruction I’ve seen so far was within the MASM stdlib: They brilliantly used DAS (Decimal Adjust) in order to convert a byte to 2 hex chars. Truly a work of art. (If it interests anyone I’ll make the effort of finding that implementation again)

  8. Charles Doty says:

    rep movsd and LOOP have been slower than a load/store/decrement counter loop since the 286. That could explain why LOOP is very rarely used.

  9. Petey B says:

    Fifteen years ago I found a use for “REP LODS?” writing a database engine in assembler. Assume there is a structure of indeterminate size that starts with a value consisting of bit flags which indicate what values are present in that structure. If a value is not present in the structure then a default value (defined as a const) should be used.

    lds si, Source
    les di, Target
    mov bx,Flags

    You could (and probably would) do something like this:

    test bx,Flag1
    jnz LoadVal
    mov ax,Default1
    stosw
    jmp done
    LoadVal:
    movsw
    done:

    But how about

    xor cx,cx
    ror cx,bx
    mov ax,Default1
    rep lodsw
    stosw

    If cx is zero the lodsw never gets executed and the default value is stored. In the modern paradigm it may not be faster; but in the old days, when dumping the prefetch queue was to be avoided at all cost, it is an enlightened use of an opcode/prefix combination that even Michael Abrash (Zen of Assembly Language) considered useless and included only for the sake of completeness (actually the chip design was such that excluding it would have meant a lot more overhead). Conditional execution without branching, faster on a 286, smaller on anything and, finally, a use for something long considered inutile.

  10. arkon says:

    Wow man, I really liked this trick!

  11. arkon says:

    I just realized another interesting use for “rep lodsb” is for something like ProbeForRead :)
