Multi-byte NOPs

Hooray, Intel 80x86 processors finally support official multi-byte NOPs. OK, not every processor supports them yet, and you have to check the result of a cpuid instruction to know whether you can use them, but it’s a good start nevertheless. It goes without saying that this kind of NOP is a real NOP, and yet I just said it. :)
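To make the cpuid check concrete, here is a minimal sketch. It assumes the rule given in Intel's description of the NOP instruction: the multi-byte form is available when bits 11:8 of EAX returned by CPUID leaf 01H (the family ID) are 0110B or 1111B. The function names are made up for this post, and the code does not execute cpuid itself; it only interprets an EAX value you obtained elsewhere.

```python
def family_id(cpuid_1_eax: int) -> int:
    """Extract bits 11:8 (the family ID) from EAX of CPUID leaf 01H."""
    return (cpuid_1_eax >> 8) & 0xF


def supports_multibyte_nop(cpuid_1_eax: int) -> bool:
    """Apply the documented rule: family ID 0110B (6) or 1111B (15)."""
    return family_id(cpuid_1_eax) in (0x6, 0xF)


if __name__ == "__main__":
    # Sample signatures, used here purely as test inputs.
    for eax in (0x000306A9, 0x00000F29, 0x00000543):
        print(f"EAX={eax:#010x} family={family_id(eax)} "
              f"multi-byte NOP: {supports_multibyte_nop(eax)}")
```

As Peter Ferrie's comment below points out, the instruction worked on CPUs long before it was documented, so treat this as the documented check rather than a complete list of hardware that executes it.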

So why do we care, you ask?

Well, you probably don’t. But some code generators will use it in the future, and so will hand-written assembly code. It makes the code more understandable, and it comes in every size you need, from 2 bytes up to 9 bytes. To be honest, though, I can’t really grasp why they actually support it. I mean, one could simply write X single-byte NOPs. But then I thought: the processor has to spend time decoding each instruction in the pipeline before it realizes it’s a NOP, so I guess one multi-byte NOP is faster than a varying number of single-byte NOPs in a row.

I tried to think what the benefit of such a thing is. Most NOPs or pseudo NOPs are used to align code to 8- or 16-byte boundaries, probably for caching reasons and faster memory reads. But NOPs used for alignment usually follow a branch, sometimes even a conditional branch, so most of the time they might not get executed at all. So why the effort of making a new instruction? I can’t come up with a good answer. Got any idea? Ah, and if you align your code and the padding never runs, why not just dump zeroes as padding? It’s not like the 80x86 architecture has delayed branches… So they must have had good reasons for coming up with this instruction. The only real answer I came up with is time-critical code, where you have to measure your code in MS…
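To show how the 2 to 9 byte sizes cover alignment gaps, here is a rough sketch. The byte sequences are the ones Intel recommends in its description of the NOP instruction; the helper name `nop_padding` and the greedy "largest NOP first" strategy are my own assumptions, not something a particular assembler is known to do.

```python
# Recommended multi-byte NOP encodings, indexed by length in bytes.
LONG_NOPS = {
    1: bytes([0x90]),
    2: bytes([0x66, 0x90]),
    3: bytes([0x0F, 0x1F, 0x00]),
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}


def nop_padding(offset: int, alignment: int = 16) -> bytes:
    """Return NOP bytes that advance `offset` to the next multiple of `alignment`."""
    gap = (-offset) % alignment
    out = bytearray()
    while gap > 0:
        size = min(gap, 9)       # greedy: emit the longest NOP that still fits
        out += LONG_NOPS[size]
        gap -= size
    return bytes(out)


if __name__ == "__main__":
    pad = nop_padding(0x1003, 16)   # 13 bytes needed: one 9-byte and one 4-byte NOP
    print(len(pad), pad.hex(" "))
```

A 13-byte gap becomes two instructions instead of thirteen, which is the decoding argument from the paragraph above.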

Maybe you already saw some pseudo NOPs in the past. These are real instructions that don’t affect registers or flags of the current execution context. You can come up with many variations of such NOPs using the LEA instruction:
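As an illustration, here are a few LEA forms of this kind with their 32-bit mode encodings. These particular variants are my own picks for the example, not necessarily the ones originally listed here. Each one loads EAX with the address [EAX + 0], so the register keeps its value and no flags change.

```python
# (assembly text, machine code) pairs for 32-bit mode.
LEA_PSEUDO_NOPS = [
    ("lea eax, [eax]",                   bytes([0x8D, 0x00])),
    ("lea eax, [eax+0]  ; disp8",        bytes([0x8D, 0x40, 0x00])),
    ("lea eax, [eax+0]  ; SIB + disp8",  bytes([0x8D, 0x44, 0x20, 0x00])),
    ("lea eax, [eax+0]  ; disp32",       bytes([0x8D, 0x80, 0x00, 0x00, 0x00, 0x00])),
    ("lea eax, [eax+0]  ; SIB + disp32", bytes([0x8D, 0x84, 0x20, 0x00, 0x00, 0x00, 0x00])),
]

if __name__ == "__main__":
    for text, code in LEA_PSEUDO_NOPS:
        # One line per variant: length, raw bytes, assembly text.
        print(f"{len(code)} bytes  {code.hex(' '):<22} {text}")
```

The same instruction spans 2, 3, 4, 6 or 7 bytes depending only on how the zero displacement is encoded.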


And you can come up with less popular NOPs, such as MOV REG, REG…

The advantage of the LEA instruction is that it accepts a complex addressing-mode operand, so you can easily make the whole instruction bigger by using different sizes for the immediate ‘0’. If you don’t understand how the ModR/M byte is formatted, you will have to assemble the code and disassemble it until it fits the size you require. Not to mention that you can’t fully control the code the assembler generates, so you may prefer to use the DB directive to emit the pseudo NOP bytes directly. The multi-byte NOP accepts the same kind of source operand as LEA, which is why its size varies.
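For readers who would rather compute the size than assemble and disassemble repeatedly, here is a small sketch of how the ModR/M byte determines the operand length in 32-bit addressing mode. The function names are hypothetical, and the code covers only what a `0F 1F /0` multi-byte NOP (optionally with a 66 prefix) needs; it is not a general x86 length decoder.

```python
def operand_length(modrm: int, sib: int = 0) -> int:
    """Bytes taken by ModR/M + optional SIB + displacement (32-bit addressing)."""
    mod = modrm >> 6
    rm = modrm & 7
    if mod == 3:
        return 1                      # register operand: ModR/M byte only
    length = 1                        # the ModR/M byte itself
    has_sib = rm == 4
    if has_sib:
        length += 1                   # SIB byte follows ModR/M
    if mod == 1:
        length += 1                   # disp8
    elif mod == 2:
        length += 4                   # disp32
    elif rm == 5 or (has_sib and (sib & 7) == 5):
        length += 4                   # mod == 0 special cases carry a disp32
    return length


def long_nop_length(code: bytes) -> int:
    """Length of a multi-byte NOP starting at code[0]."""
    i = 1 if code[0] == 0x66 else 0   # optional operand-size prefix
    if code[i:i + 2] != b"\x0f\x1f":
        raise ValueError("not a 0F 1F multi-byte NOP")
    i += 2
    modrm = code[i]
    sib = code[i + 1] if i + 1 < len(code) else 0
    return i + operand_length(modrm, sib)


if __name__ == "__main__":
    for hex_bytes in ("0f1f00", "0f1f440000", "660f1f840000000000"):
        print(hex_bytes, "->", long_nop_length(bytes.fromhex(hex_bytes)), "bytes")
```

Switching the mod field from 00 to 01 to 10, and choosing rm = 100 to pull in a SIB byte, is all it takes to stretch the instruction.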

Well, you might find yourself fond of ADD EAX, 0 as a NOP, probably because in math adding zero changes nothing. But when executed, it affects the flags. You might say that’s not a big deal, and it does depend on where you dump the pseudo NOPs in your code. However, it seems that even a popular assembler had some bad issues with generating pseudo NOPs. And believe me, when you trust a tool and it misbehaves, you might go crazy until you realize who’s to blame!

Now I have to make diStorm support it 😉


4 Responses to “Multi-byte NOPs”

  1. Peter Ferrie says:

    Multi-byte NOPs have existed since at least the Pentium 3, maybe even earlier. They were documented by Intel only recently, though. Since AMD and Intel CPUs share code, AMD CPUs support these instructions, too.

    They are useful for when the code flow will reach a loop. The loop should be cache-line aligned for best fetch performance.

    LEA is horribly slow on newer CPUs. It also contains a register write. You might not want that write.

    The NOPs are fast because MODR/M resolution happens in parallel to the fetch itself. See here for a more detailed description of how that can go wrong 😉 –

  2. arkon says:

    Thanks for the info 😉

    BTW – In your post you claim that the multi-byte NOP actually might generate an FP, but according to Intel, it doesn’t access the memory no matter what.

  3. Erez says:

    NOPs or fake opcodes are often used as fillers in code which might later be patched. It’s important that they won’t get stuck in the pipeline, because they actually run.
    I was wondering why no one uses jmp as a big nop. For instance, 8-byte align:
    jmp @+6
    db 6 dup (0)

    It doesn’t change flags, doesn’t write registers, doesn’t read/write memory, and it shouldn’t affect the pipeline or branch prediction in any way since it’s not conditional. Modern CPUs should just skip it with no problem.
    Can anyone tell me if and why I’m wrong?

  4. arkon says:

    ah, i guess because it *does* affect some state in the cpu (other than eip, of course). though, i don’t know why or how.
