Archive for the ‘Hardware’ Category

Custom Kernel Debugging is Faster

Tuesday, July 20th, 2010

When you start to write a post you always run into the problem of choosing a headline for it. You need to find something that will, in a few words, sum it up for the reader. I was wondering which one is better: "Boosting WinDbg", "Faster Kernel Debugging in WinDbg", "Hacking WinDbg" and so on. But they might not be accurate, and once you read the post you might not find them appropriate. But instead of talking about meta-post issues, let's get going.

Two posts ago, I was talking about hunting a specific race condition bug we had in some software I work on. At last, I have free time to write this post and get into some interesting details about Windows Kernel and Debugging.

First I want to say that I got really pissed off that I couldn't hunt the bug we had in the software like a normal human being, and that Jond and I had to do it the lame old-school way, which takes more time, lots of time. What really bothered me is that computers are fast and so should debugging be, at least in theory. Why the heck do I have to sit down in front of the computer, not to mention trying to reproduce the damned bug, and only then manage to debug it and see what's going wrong? Unacceptable. You might say, write better code in the first place; I agree, but even then people have bugs, and will have, forever, and I was called in simply to help.

Suppose we want to set a breakpoint on memory access this time, but something more complicated, with conditions. The reason we need a condition, rather than a plain breakpoint, is that the memory we want to monitor gets accessed thousands of times per second; in my case with the race condition, for instance.
You're even welcome to run the following test locally on your computer. Fire up Visual Studio and try this code: unsigned int counter = 1; while (counter < 99999999+1) { counter++; }. Set a memory-access breakpoint on counter that stops when the hit count reaches 99999999, time the whole run, then time it again without the breakpoint set and compare the results (a minimal version of this test is sketched below). What ratio did you get? Isn't that just crazy?
Here's an example in WinDbg's syntax; it would be something like this: ba w4 0x491004 "j (poi(0x491004)==0) 'gc'". Which reads: break on write access for an integer at address 0x491004 only if its value is 0, otherwise continue execution.
It will be tens of thousands of times faster without the breakpoint set, hence the debugging infrastructure, even locally (user mode), slows things down seriously. Now think about debugging something similar on a remote machine: it's impossible, you are going to wait years in vain for something to happen on that machine. Think of all the COM/Pipe/USB/whatever-protocol messages that have to be transmitted back and forth between the debugged machine and the debugger. Add to that the conditional breakpoint we set: someone has to check whether the condition is true or false and continue execution accordingly. And that's the case even if you use great tools like VirtualKD.
Suppose you set a breakpoint at a given address. What really happens once the processor executes the instruction at that address? Obviously a lot, but I am going to talk about it from the Windows kernel's point of view. Let's start bottom up: interrupt #3 is raised by the processor that ran the thread, which halts execution of the thread and transfers control to _KiTrap3 in ntoskrnl. _KiTrap3 builds a context for the trapped thread, with all the registers and that kind of info, and calls CommonDispatchException with code 0x80000003 (to denote a breakpoint exception). Since this exception-raising path is common, everybody uses it, for other exceptions as well. CommonDispatchException calls _KiDispatchException, and _KiDispatchException is really the brain behind the whole Windows exception mechanism. I'm not going to cover normal exception handling in Windows, which is very interesting in its own right.
So far nothing is new here. But we're getting to this function because it has something to do with debugging: it checks whether _KdDebuggerEnabled is set, and eventually it calls _KiDebugRoutine if that is set as well. Note that _KiDebugRoutine is a pointer to a function that gets set when the machine is debug-enabled. This is where we are going to get into business later. As you can see, the kernel has some minimal infrastructure to support kernel debugging with lots of functionality: many functions in ntoskrnl whose names start with "Kdp", like KdpReadPhysicalMemory, KdpSetContext and many others. Eventually the controlling machine that runs WinDbg has to speak to the remote machine using a protocol named KdCom; there's a KDCOM.DLL which is responsible for all of it.
Now, once we set a breakpoint in WinDbg, I don't know exactly what happens, but I guess it's something like this: WinDbg stores the breakpoint in some internal table locally, then sends it to the debugged machine over the KdCom protocol; the other machine receives the command and sets the breakpoint locally. When the breakpoint hits, WinDbg eventually gets an event describing the debug event from the other machine, and then it needs to decide what to do with it according to the dude who debugs the machine. So much going on for what looks like a simple breakpoint.
The process is very similar for single stepping as well, just with a different exception code.
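Here is that minimal timing test sketched out; the loop and the 99999999 bound come from the text above, the clock() timing is just my own convenience (build it without optimizations so the loop isn't folded away):

#include <stdio.h>
#include <time.h>

int main(void)
{
    unsigned int counter = 1;
    clock_t start = clock();
    while (counter < 99999999 + 1) {
        counter++;
    }
    printf("counter=%u, took %.2f seconds\n", counter,
        (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}

Run it once as-is and once with a data breakpoint set on counter, and compare the two times.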

The problem with conditional breakpoints is that the condition is tested locally, on the WinDbg machine, not on the "server", so to speak. I agree it's a fine design for Windows; after all, Windows wasn't meant to be an uber debugging infrastructure but an operating system, so for having kernel debugging built in at all we should say thanks… So no complaints about the design, and yet something has to be done.

Custom Debugging to our call!

That's the reason I decided to describe above how the debugging mechanism works in the kernel, so we know where we can intervene in that process and do something useful. Since we want to do smart debugging, we have to use conditional breakpoints; otherwise, with critical variables that get touched very frequently, we would have to hit F5 ('go') all the time and the application we are debugging would never get time to run. That's clear. The next thing we realized is that the condition tests are done locally on our machine, the one that runs WinDbg. That's not OK, so here's the trick:
I wrote a driver that replaces (hooks) _KiDebugRoutine with my own function, which checks the exception code, then examines the context according to my condition, and only then sends the event to WinDbg on the other machine, or otherwise simply continues execution. So the whole technique happens on the debugged machine without sending a single message outside (regarding the breakpoint we set) unless the condition is true, and that's why everything is thousands of times or so faster, which is now acceptable and usable. Luckily, we only need to replace a pointer to a function, and using very simple tests we get the ability to filter exceptions on the spot. We do need to get our hands dirty touching the debug registers and the context of the trapping thread, but that's a win, after all.
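Just to make the hook part concrete, here is a minimal sketch of the pointer swap itself. Locating the address of _KiDebugRoutine is left out on purpose: it is not an exported symbol, so treat the pKiDebugRoutine argument as something you have to resolve yourself (from symbols, for example), and the whole thing as an assumption-heavy illustration rather than production code:

#include <ntddk.h>

typedef int (__stdcall *KIDEBUGROUTINE)(PVOID TrapFrame, PVOID Reserved,
    PEXCEPTION_RECORD ExceptionRecord, PCONTEXT Context,
    KPROCESSOR_MODE PreviousMode, UCHAR LastChance);

static KIDEBUGROUTINE old_debug_routine;

// Swap our filter in and remember the original routine so we can chain to it.
static void hook_ki_debug_routine(KIDEBUGROUTINE *pKiDebugRoutine, KIDEBUGROUTINE mine)
{
    old_debug_routine = (KIDEBUGROUTINE)InterlockedExchangePointer(
        (PVOID volatile *)pKiDebugRoutine, (PVOID)mine);
}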

Here's the debug routine I used to experiment with this issue (using hard-coded constants, though):

// Replacement for _KiDebugRoutine: filter breakpoint events on the debugged
// machine itself and forward only the interesting ones.
int __stdcall my_debug(IN PVOID TrapFrame,
	IN PVOID Reserved,
	IN PEXCEPTION_RECORD ExceptionRecord,
	IN PCONTEXT Context,
	IN KPROCESSOR_MODE PreviousMode,
	IN UCHAR LastChance)
{
	ULONG _dr6, _dr0;
	// Read the debug registers directly (see the note below about the Context).
	__asm {
		mov eax, dr6
		mov _dr6, eax
		mov eax, dr0
		mov _dr0, eax
	};
	// A breakpoint exception, with a hardware breakpoint condition latched in
	// DR6, on the monitored address (DR0), but not from the EIP we care about:
	// report it as handled so the thread silently resumes.
	if ((ExceptionRecord->ExceptionCode == 0x80000003) &&
		(_dr6 & 0xf) &&
		(_dr0 == MY_WANTED_POINTER) &&
		(ExceptionRecord->ExceptionAddress != MY_WANTED_EIP))
	{
		return 1;
	}
	// Everything else goes down the original path (and eventually to WinDbg).
	return old_debug_routine(TrapFrame, Reserved, ExceptionRecord, Context, PreviousMode, LastChance);
}

This routine checks whether a breakpoint interrupt happened and stops the thread only if the pointer I wanted to monitor was accessed from a given address; otherwise it resumes the thread. This is where you go custom and write whatever crazy condition you are up to: using up to 4 breakpoints (that's the processor limit for hardware breakpoints), checking which thread or process trapped, etc., using the kernel APIs… It just reminds me of "compiled sprites" :)
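For instance, filtering by the owning process is a one-liner with the Ps APIs; the PID constant here is purely hypothetical:

#include <ntddk.h>

#define MY_WANTED_PID ((HANDLE)1234)  // hypothetical: the process we care about

// Return nonzero if the trapping thread belongs to the process we want to debug.
static int is_interesting_process(void)
{
    return PsGetCurrentProcessId() == MY_WANTED_PID;
}

Calling something like this at the top of my_debug and returning 1 when it fails would keep hits from uninteresting processes completely silent.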

I was assuming that there's only one breakpoint set on the machine, the one I set through WinDbg. This time there was no need to set a conditional breakpoint in WinDbg itself, since we filter the hits with our own routine, and once WinDbg gets the event it will stop and let us act.

For some reason I had a problem with accessing the DRs from the Context structure. I didn't try too hard, so I just fell back to reading them directly, because I can.

Of course, what I did is not anything close to production quality; it was only a proof of concept, and it worked well. Next time I find myself in a weird bug hunt, I will know that I can draw this weapon.
I'm not sure how many people are interested in such things, but I thought it might help someone out there. I wish one day someone would write an open source WinDbg plugin that injects kernel code through WinDbg into the debugged machine, sets this routine, and supports custom runtime conditional breakpoints :)

I really wanted to paint some stupid pictures that show what's going on between the two machines and everything, but my capabilities at doing that are awful, so it's up to you to imagine it, sorry.

For more related information you can see:
http://uninformed.org/index.cgi?v=8&a=2&p=16
http://www.vsj.co.uk/articles/display.asp?id=265

It’s Vexed :)

Tuesday, November 17th, 2009

In the last few weeks I've been working on diStorm to add the new instruction sets: AVX and FMA. You can find lots of information about them on the net. In brief, the big advantages are support for 256-bit registers, now called YMMs (their low halves are the XMMs), and also built-in AES support: you get a few instructions to do small-block encryption and decryption, really sweet. I guess it will help some security companies out there to boost stuff. The other main feature behind these instruction sets is the three-register operand form. So now you are not stuck with 2 registers per instruction, you can sometimes have up to four. This is good because you have two source operands and a destination operand, which saves you other instructions (to move or back up registers) and you don't have to ruin your dest-src operand like in the old sets. Almost forgot to mention FMA itself, which is fused multiply-add instructions, so you can do two operations at once, like A*B+C, etc.
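If you want to see the "two operations at once" and the non-destructive three-operand idea from C, the standard FMA intrinsic shows it nicely. This is only an illustration and needs a compiler and CPU with AVX/FMA support; it has nothing to do with diStorm itself:

#include <immintrin.h>

// a * b + c for 8 packed floats in a single fused instruction,
// and none of the three source registers gets clobbered.
__m256 fused_madd(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);
}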

I wanted to talk about the VEX prefix itself. It's really a new design and approach to prefixes that was never seen before. And the title of this post says it all: it's really annoying.
The VEX (Vector Extension) prefix is a multi-byte prefix for a change. It can be either 2 or 3 bytes. If you take a look at the one-byte opcode map, all the slots are taken. Intel needed a new unused byte, which didn't really exist. What they did instead was to share two existing opcodes, one for each prefix length. The sharing works in a special way that lets the processor know whether you meant the original instruction or the new prefix. The chosen instructions are LDS (0xc4) and LES (0xc5). When you examine the second byte of these instructions, the byte from which the new information is extracted, you learn that for a legal LDS/LES its two most significant bits can't both be set (i.e. a value of 0xc0 or higher); if they are set, the processor raises an illegal-instruction exception. This is where the VEX prefix enters the game: instead of raising an exception, such bytes are decoded as this special multi-byte prefix. Note that in 64-bit mode all those load-segment instructions are invalid, so there is no need to share the opcode; when you encounter 0xc4 or 0xc5 you know it's a VEX prefix, as simple as that. Unfortunately this is not the case in 32-bit mode, and since the second byte has to have a value of 0xc0 or higher (because the two most significant bits have to be set in 32-bit mode), the fields stored in those bits are actually inverted, which means you have to extract a few bits that represent some fields and bitwise-not them. This is seriously gross, but it seems Intel didn't have much of a choice here. If it were up to me, I would eventually do the same, for the sake of backward compatibility, but it doesn't make it any prettier, to be honest. And for your information, AMD pulled the same trick, but with the POP instruction (0x8f), for their new instruction sets (XOP, etc.), without full backward compatibility.
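Here's a tiny sketch of that 32-bit mode ambiguity check; it's simplified (no buffer-length checks, no 16-bit mode), just to show the idea:

#include <stdint.h>

// In 32-bit mode, 0xC4/0xC5 is a VEX prefix only when the following byte has
// both top bits set (mod == 11), which would be an illegal ModR/M for LDS/LES.
static int is_vex_prefix_32(const uint8_t *code)
{
    if (code[0] != 0xc4 && code[0] != 0xc5)
        return 0;
    return (code[1] & 0xc0) == 0xc0;
}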

To make some order in the bits let’s have a look at the following figure:
[Figure: VEX prefix layout, cited from Intel]
Well, I am not going to talk about all the fields (which are somewhat similar to REX for 64 bits), but just about one interesting feature. Since the prefix is now 2/3 bytes and an SSE instruction is usually at least 3 bytes, this would explode the code segment with huge instructions and certainly make the processor cry a lot fetching them. The trick Intel used (and AMD too) was to have a field that implies which prefix byte to put virtually before the VEX prefix itself, so this way we save one byte. And the same idea was used again to spare the 0x0f escape byte, or even the two-byte escapes 0x0f 0x38 and 0x0f 0x3a, which are very common opening bytes for SSE instructions. So if we had to use an SSE instruction, for instance:
66 0f 38 17 c0; PTEST XMM0, XMM0 - its first 3 bytes can be implied in the VEX prefix, thus the instruction stays the same size! Kawabanga
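And this is roughly how the interesting fields fall out of the 3-byte form; a sketch based on the public documentation, ignoring the R/X/B bits of the second byte:

#include <stdint.h>

typedef struct {
    int mmmmm;  // implied escape: 1 = 0x0F, 2 = 0x0F 0x38, 3 = 0x0F 0x3A
    int W;
    int vvvv;   // extra register operand, stored inverted in the encoding
    int L;      // 0 = 128-bit (XMM), 1 = 256-bit (YMM)
    int pp;     // implied legacy prefix: 0 = none, 1 = 0x66, 2 = 0xF3, 3 = 0xF2
} vex3_fields;

static vex3_fields decode_vex3(const uint8_t *p)  // p[0] is assumed to be 0xC4
{
    vex3_fields f;
    f.mmmmm = p[1] & 0x1f;
    f.W     = (p[2] >> 7) & 1;
    f.vvvv  = (~(p[2] >> 3)) & 0xf;  // the bitwise-not the post complains about
    f.L     = (p[2] >> 2) & 1;
    f.pp    = p[2] & 3;
    return f;
}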

I said in an earlier post that I am not going to support SSE5 for now, in the hope that it's gonna die. I believe CPU wars (or instruction-set-architecture wars, to be accurate) are bad for coders and even for end users, who can't really enjoy those great technologies eventually.

A BSWAP Issue

Saturday, November 7th, 2009

The BSWAP instruction is very handy when you want to convert a big endian value to a little endian value and vice versa. Instead of reversing the bytes yourself with a few moves (or shifts, depending on how you implement it), a single instruction will do it, as simple as that. It personally reminds me of HTONS (and friends) in socket programming. The instruction supports 32-bit and 64-bit registers and swaps (reverses) all the bytes inside accordingly. But it won't work on 16-bit registers; the documentation says the result is undefined. Now WTF, what was the problem with supporting 16-bit registers, seriously? That's a children's game for Intel, but instead the result is documented as undefined. I really love those undefined results, NOT.
When decoding a stream in diStorm in 16-bit decoding mode, BSWAP still shows the registers as 32 bits, which, I agree, can be misleading, and I should change it (next version). Intel decided that this instruction won't support 16-bit registers, and yet it stays a legal instruction rather than, say, raising an undefined-instruction exception. There are already known cases where a specific destination register causes an undefined-instruction exception, like MOV CR5, EAX, etc. It's true that this is a bit different (because it's not the register index but the decoding mode), but I guess it was easier for them to keep it defined and behaving weirdly. When I think of it again, there are some instructions that don't work in 64 bits, like LDS. Maybe that was before they wanted to break backward compatibility… So now I keep getting emails asking me to fix diStorm to support BSWAP for 16-bit registers, which is really dumb. And I have to admit that I don't like this whole instruction in 16 bits, because it's really confusing and doesn't give a true result. So what's the point?
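In plain C, the 32-bit behavior and the 16-bit swap you'd do with XCHG AH, AL look like this; just an illustration of the semantics, not diStorm code:

#include <stdint.h>

// What BSWAP does on a 32-bit register.
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

// The 16-bit swap you'd write with XCHG AH, AL instead of BSWAP AX.
static uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}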

I was wondering whether the people who sent me emails regarding this issue were writing the code themselves or feeding diStorm some existing code. The question is, how come nobody noticed it doesn't work well for 16 bits? And what's the deal with XCHG AH, AL? That's the equivalent of BSWAP AX, and it works 100%. Actually I can't remember whether compilers generate code that uses BSWAP, but I am sure I have seen the instruction being used with normal 32-bit registers, of course.

Arkon Under the Woods

Saturday, May 2nd, 2009

Yeah, I am not alright, I am even an ass, leaving you all (my dear readers, are there any left?) without saying a word for the second time. At the end of last year I was in South East Asia for 3 months, and now I am in South America for 3 months and counting… It is just that I really wish to keep this blog totally technological, but I guess we are all human after all. So yeah, I have been trekking a lot in Chile and Argentina (mostly in Patagonia) and having a great time here, now in Buenos Aires. Good steaks and wines, ohhh and the girls. Say no more.

It is really cool that almost every shitty hostel you go to has WiFi available for free. So, carrying an iPod touch with me, I can actually be online, but apparently not many web developers think about mobile web pages, and thus I couldn't write blog posts with Safari, because there is some problem with the text area object. For some reason, I guess some JS code doesn't run well on the iPod, I don't get that keyboard thingy up and cannot type in anything. WordPress…

I am always surprised anew to see how many computers here, in coffee shops or just Internet shops, are not really secured. You run as admin some of the time. And there is no antivirus, which I think is good to have for average users. And if you plug in your camera to upload some pictures, the next time you will see some stupid new files on it, named desktop.ini and autorun.inf. Sounds familiar? And then I read some MS blog post about disabling AutoRun for removable storage devices… yippee, about time. What I am also trying to say is that one could create a zombie army so easily with all those computers… The ease of access and lack of protection drive me mad.

Anyhow, I had some free time, of course, I am on a vacation, sort of, after all. And I accidentally reached some amazing blog that I couldn't stop reading for a few days. Meet NO EXECUTE! If you are a low-level freak like me, you will really like it too. Although Darek begins with hardware stuff, which will fill some gaps for most people I believe, he talks about virtualization and emulators (his obsession), and I just read it like some fantasy book, eager to get to the next chapter every time. I learnt tons of stuff there, and I really like to see that even today a few people still measure and check optimizations in cycles per instruction rather than in seconds or milliseconds. One of the things I really liked there was a trick he pulled for when the guest OS runs on little endian, for instance, and the host OS runs on big endian. Every access to memory has to be swapped when the size of the access is 2 bytes or more, of course. Therefore, in order to eliminate the byte swaps, which are expensive, he kinda turned all the memory of the guest OS upside down, and therefore the endianness changed as well. Now it might sound like a simple matter, but this is awesome, and the way he describes it, you can really feel the excitement behind the invention… He also talks about how lame Intel and AMD are to come up with new instruction sets every Monday, which I have also mentioned in the past.

Regarding diStorm now, I have decided to discontinue the development of the current diStorm64 version. But hey, don't worry. I am going to open source diStorm3 and I am still considering making it dual licensed. The benefits of diStorm3 are structured output and, believe me, amazing speed, and like in the good old days, the structure per instruction is unbelievably tiny in size (relative to other disassemblers I saw out there), and you guys are gonna like it.

Thing is, I have no idea when I am getting home… Now with this Swine Flu spreading like hell, I don't know where I will end up. The only great thing about this Swine Flu, so to speak, is that you can see Evolution in Progress.

Salud

x86 Instruction Set Wars

Thursday, August 14th, 2008

It all started back in the 90's when Intel came up with the dashing MMX technology. AMD wasn't too late to respond with its new 3DNow! instruction set, which, if you ask me, was much less popular and less used. But the good thing about 3DNow! was that it handled floating point rather than integers. And then came the SSE instruction set, and the world got better; nowadays compilers even use it up to SSE2, knowing it will be there on 99% of the machines the code runs on today. However, they still emit an SSE check to be sure they can use it, and I believe this check will always stay in the code for assurance anyway; it can't harm, you know.
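By the way, that runtime check is a one-liner nowadays; here's a tiny sketch using a compiler builtin (GCC/Clang, which is my assumption here; MSVC would go through __cpuid instead):

#include <stdio.h>

int main(void)
{
    if (__builtin_cpu_supports("sse2"))
        puts("SSE2 code path");
    else
        puts("plain x86 fallback");
    return 0;
}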

Now, almost 10 years later, we see another split in technologies. Intel came up with SSE4 (I already talked about it in an earlier post), which contained really valuable instructions. And then, guess what? AMD added several instructions on top of Intel's.

In August 07 AMD announced a new set: SSE5. Aren't we sick of SSE by now??? Anyway, in April this year, Intel announced in response a new AVX instruction set (Advanced Vector Extensions).

This doesn't lead anywhere. Each company in its turn announces a new instruction set, the first company doesn't support the other's, and vice versa. This is just going wrong and it's our nightmare. Basically, most developers shouldn't care anyway: they don't use these sets, and the sets are (partly) not out officially yet (meaning you can't use them for now). Therefore, the game matters for compiler developers and those who write the tools that mess with machine code, etc. Probably many codecs and crypto algorithms will use them directly too, if they exist on the processor they get to run on…

The thing is, I decided to take a side and stick only to AVX, and therefore I won't support SSE5 in diStorm. If anyone from the community is gonna help with that, I will gladly accept it, don't get me wrong. Since I don't have much time for it anyway, and I hate this mess, I will stick to Intel. I don't know if that's good or bad news for diStorm, but as the only owner, and the way I see it, I gotta do something against this lack of a standard. Everything has standards today, why can't the damned x86 have one as well?

I'm not the only one who speaks about this; Agner Fog has also talked about it here. The difference is that I can make my own small change.

What do you think should happen with this issue??? Who’s right, Intel or AMD? Should developers code twice now for each instruction set?

arrrrgh so many questions

AAA/AAS Quirk

Thursday, April 3rd, 2008

As I was working on the simulation of these two instructions I found that they have a quirk: although the algorithms for these instructions are described in Intel's specs, which (seemingly) makes the output defined for all inputs, that is not the case. Every time I finish writing an implementation for a specific instruction I add it to my unit tests. The instruction is simulated with random (and some smarter) input and then checked against pure native execution to see whether the results are correct. This way I found, for a range of inputs, a quirk that reveals how the instruction is really implemented (microcode stuff, probably) rather than how it's documented.

AL = AL + 6 is done when AF is set or the low nibble of AL is above 9. According to the documentation the destination register is AL, but in reality the destination register is AX. Now how do we know such a thing?

If we try the following input:
mov al, 0xff
aaa

The result will be 0x205, rather than 0x105 (which is what we expect according to the docs).

What really happens is that we supply a number that, when 6 is added to it, carries into AH, thus incrementing AH by 1. Then, looking at the docs again, we see that if AL was added with 6, AH is also explicitly incremented by 1. Thus AH is really incremented by 2. :P
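Written out in C, the behavior I observe looks roughly like this; it is my own reconstruction from the test above, so take it as a sketch rather than a spec:

#include <stdint.h>

// AAA as it appears to behave on real hardware: the +6 is applied to AX,
// not just to AL. 'af' is the incoming auxiliary-carry flag.
static uint16_t aaa_observed(uint16_t ax, int af)
{
    if (((ax & 0x0f) > 9) || af) {
        ax = (uint16_t)(ax + 6);      // documented as AL = AL + 6
        ax = (uint16_t)(ax + 0x100);  // AH = AH + 1
        // AF and CF get set here.
    }
    // else AF and CF get cleared.
    return (uint16_t)(ax & 0xff0f);   // AL keeps only its low nibble
}

// aaa_observed(0x00ff, 0) returns 0x0205, matching the example, while the
// documented AL-only addition would give 0x0105.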

The question is why they do AX = AX + 6 rather than operating on AL. No, actually the biggest question is why I get this same behavior on an AMD processor (whereas I work on an Intel processor). And we already know from my last post about SHLD that they don't behave the same in some undefined-behavior aspects (although some people believe AMD copied Intel's architecture and implementation)…

There might be some people who will say that I went too far with testing this instruction, because I somewhat supply an input which is not in the valid range (it's unpacked BCD after all), and therefore I must not rely on the output. The thing is, the algorithm is defined well enough to receive any input I pass it, hence I expect it to work even for undefined input. Though I believe there is no such thing as undefined input, only undefined output, and that's why I implemented my instruction the way they both did. Specifically, neither of them states anything about undefined input/output here, which makes my case stronger. Anyway, the point is they don't tell us something here: the implementation is not the same as what is documented in both the AMD and Intel docs.

This quirk works the same for AAS, where instead of doing AL = AL - 6, it's really AX = AX - 6. I also tried to see whether they work on the whole of EAX, but I saw that the high word wasn't changed (by carry/borrow). And I also tried to see whether this ill behavior exists in DAA/DAS as well, but no.

Shift Double Precision

Saturday, March 29th, 2008

If you were to ask me, I would have no idea why Intel supports shift double precision in the 80x86. Probably their answer would be "because it used to be a CISC processor". Shift double precision is a pretty easy algorithm to implement. But maybe it was popular back then and they decided to support it in hardware. Like now, when they add very important instructions to the SSE sets. Either way, everyone (including me) seems to implement the algorithm like this:

(a << c) | (b >> (32-c))

Where a and b are the 32-bit input variables (or registers) and c is the count. The code shows a shift-left double precision; shifting right requires changing the direction of each of the shifts. However, if a and b were 16 bits, the second shift amount changes to (16-c). And now there is a problem. Why? Because we might enter the magical world of undefined behavior. And why is that? Because the first thing the documentation says about the shift/rotate instructions is that the count operand is masked to preserve only the 5 least significant bits. This is because the largest shift amount for a 32-bit input is 32 (and then you get 0, ignoring SAR for now). If the input is 16 bits, the count is still masked with 31, which means you can shift a 16-bit register by more than its size. That doesn't make much sense, but it's possible with the other shift instructions. But when you use a shift double precision, not only does it not make sense, it is also undefined. That is, the result is undefined, because you then try to move bits from b into a, but the count becomes negative. For example: shld ax, bx, 17. Internally the second shift amount is calculated as (16-c), which becomes (16-17). And that's bad, right?

In reality everything is defined when it comes to digital logic. Even the undefined stuff. There must be a reason for the result I get from executing such an instruction like the example above, even though it's correctly and officially undefined. And I know that there is a rationale behind it, because the result is consistent (at least on my Intel Core2Duo processor). So being the stubborn guy I am, I decided I wanted to know how that calculation is really done at the hardware level.

I forgot to mention that the reason I care about how to implement this instruction is that I have to simulate it for the Vial project. I guess eventually it's a waste of time, but I really wanted to know what's going on anyway. Therefore I decided to research the matter and come up with the algorithm my processor uses. Examining the officially undefined results, I quickly managed to see how to calculate the shift the way the processor does, and it goes like this for 16-bit input (I guess it will work the same for 8-bit input as well; note that 32-bit input can't have an undefined range, because you can't get a negative shift amount):

def shld(a, b, c):
    c &= 31
    if c <= 15:
        return ((a << c) | (b >> (16 - c))) & 0xffff
    else:
        # Undefined behavior:
        c &= 15
        return ((b << c) | (a >> (16 - c))) & 0xffff

Yes, the code is in Python, but you can see that if the count is bigger than 15, we swap the input order. And then comes the part where you say "NOW WTF?!". Even though I got this algorithm to return the same results as the processor does for both defined and undefined input, I could wager the processor doesn't do this kind of stuff internally. So I sat down some (long) while more, stared at the code, and did a few experiments here and there. Eventually it occurred to me:

def shld(a, b, c):
    c &= 31
    x = a | (b << 16)
    return ((x << c) | (x >> (32 - c))) & 0xffff

Now you can see that the input for the original equation is one and the same bit buffer, which contains both inputs together. Taking a count of 17 won't yield a negative shift amount, but something else. Anyway, I have no idea why they implemented this instruction the way they did (and it applies to SHRD as well), but I believe it has something to do with the way their processor's so-called 'engine' works, and hardware stuff.

After I learned how it works I was eager to see how it works on AMD. And guess what? They don't work the same, when it comes to the undefined behavior, of course. And since I don't have an AMD anymore, I couldn't see how they really implemented their shift double precision instructions.

In the Vial project, where I simulate these instructions, I added a special check for the count, to see that it's not bigger than the input size; if it is, I mark the destination register and some of the flags as undefined. This way, when I do code analysis, I will know that something is really wrong/buggy with the way the application works. Now what if the application purposely uses the undefined behavior? Screw us both then. Now why would a sane application do that? Ohh, that's another story…

By the way, the other shift/rotate instructions don't have any problem with the shift amount, since they can't yield a negative shift amount internally in any way; therefore the results are always defined for every input.

Multi-byte NOPs

Tuesday, June 26th, 2007

Hooray, official multi-byte NOPs are finally supported by Intel 80x86 processors. OK, not all processors support it yet, and you have to check the result of a CPUID instruction to know whether you can use it or not, but it's a good start nevertheless. It is unnecessary to mention that this kind of NOP is a real NOP, and yet I just did. :)

So why do we care you ask?

Well, you probably don't. But some code-generator tools will use it in the future, as will hand-written assembly code. It will make the code more understandable and it fits all the sizes you need, from 2 bytes up to 9 bytes. However, to be honest, I can't really grasp why they actually added it. I mean, one could write X single-byte NOPs. But then I thought: the processor will spend time decoding the instructions in the pipeline until it realizes each one is a NOP, so I guess one multi-byte NOP will be faster than a run of single-byte NOPs. I tried to think what the benefit of such a thing is. Most NOPs or pseudo-NOPs are used to align code to 8- or 16-byte boundaries, probably for caching reasons and faster memory reads. But if the NOPs are used for alignment they usually follow a branch, (sometimes) even a conditional branch, so most of the time they won't even get executed. So why the effort of making such a new instruction? I can't come up with a good answer. Got any idea? Ah, and if you align code that won't be executed anyway, why not just dump zeroes as padding? And it's not like the 80x86 architecture has delayed branches… So they must have had good reasoning for coming up with this instruction. The only real answer I came up with is time-critical code where you have to measure your code in milliseconds…

Maybe you already saw some pseudo NOPs in the past. These are real instructions that don’t affect registers or flags of the current execution context. You can come up with many variations of such NOPs using the LEA instruction:

LEA REG, [REG+0]

And you can come up with less popular NOPs, MOV REG, REG…

The advantage of the LEA instruction is that it accepts a complex addressing-mode operand, so you can easily make the whole instruction bigger by using different sizes of the '0' displacement. If you don't understand how the ModR/M is formatted, you will have to assemble the code and disassemble it until it fits the size you require. Not to mention that you can't control 100% of the code generated by the assembler, so you would prefer to use the DB thingy directly to emit the pseudo-NOP. The multi-byte NOP accepts the same source operand forms as LEA does, therefore its size varies.
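For reference, the commonly recommended multi-byte NOP encodings (built from 0x0F 0x1F plus growing ModR/M, SIB and displacement bytes) look like this; double-check against your target's optimization manual before relying on them:

// Each nopN below is an N-byte NOP: 0x66 is the operand-size prefix,
// 0x90 is the classic one-byte NOP, 0x0F 0x1F /0 is the multi-byte NOP opcode.
static const unsigned char nop2[] = { 0x66, 0x90 };
static const unsigned char nop3[] = { 0x0f, 0x1f, 0x00 };
static const unsigned char nop4[] = { 0x0f, 0x1f, 0x40, 0x00 };
static const unsigned char nop5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
static const unsigned char nop6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
static const unsigned char nop7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };
static const unsigned char nop8[] = { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };
static const unsigned char nop9[] = { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };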

Well, you might find yourself fond of ADD EAX, 0 as a NOP, probably because in math it doesn't mean anything special. But when executed it affects the flags. You might say it's not a big deal, and that depends on where you dump the pseudo-NOPs in your code. However, it seems that even a popular assembler had some bad issues with generating pseudo-NOPs. And believe me, when you trust a tool and it misbehaves, you might go crazy until you realize who's to blame!

Now I have to make diStorm support it ;)

NOP NOP NOP!!!1

FAT Bastard

Monday, June 25th, 2007

Hey, sorry for being so quiet for the last week… been busy.

Anyways, this is an interesting thing that happened to me a couple of weeks ago. My father bought some iPod nano thingy. I wouldn't call it a fake, but it has its advantages: radio, video, and it even lets you delete files directly. Don't get me wrong, I myself have the iPod nano, which is really cool and probably enough for most of my needs. But it is pretty expensive for a 2 GB disk… oh well. So I was uploading songs to my father's player. I was kinda lazy and decided to dump all the songs in the root directory. After a while I got some message while trying to copy more songs, something like "Cannot create a file or directory". It really annoyed me. I thought it might be something with the Unicode file names, so I even tried to rename them to plain English letters, just in case, but to no avail. At that point I wanted to give up; I had no idea what the problem might be. I almost thought the player was broken in some way and should be replaced… who knows.

The day after, I sat with my friends from work, telling them about this issue; no one had any idea what it might be (and they are people like me, who should have at least a faint idea… but nothing). So I told them that the only crazy idea I had in mind was that I couldn't copy the songs because the player is formatted with FAT16, which, whether you know it or not, has a static size (a fixed number of sectors) for the root directory. It means that if you allocate all these sectors, the root directory is full and you can't create another directory or copy a new file into it. I even wrote a sample that shows how to parse FAT12 (FAT16 is easier to parse, because the size of a chain pointer is a whole 2 bytes, not 1.5… but both are a piece of cake compared to today's file systems). Anyways, I got back home and formatted the player's disk; for some reason it went crazy and did directory loops (recursively linking to the same directory)??? I didn't know what was going on, and to be honest, really didn't care. Later on, I decided that now that I thought I knew what the problem was, I must create directories. In FAT the sub-directories are not limited in any way (as opposed to the root directory). Creating directories and uploading songs into them, I could finally upload songs easily and use the whole disk.
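If you're curious, the fixed root-directory size comes straight from the boot sector. Here's a small sketch of reading it for FAT12/FAT16; the 512-byte sector buffer is assumed to already hold the first sector of the volume, and each directory entry is 32 bytes:

#include <stdint.h>

// Number of root-directory entries (BPB field at offset 17, little endian).
static unsigned root_dir_entries(const uint8_t *sector)
{
    return (unsigned)sector[17] | ((unsigned)sector[18] << 8);
}

// Total bytes reserved for the root directory; typically 512 * 32 = 16 KB.
static unsigned root_dir_bytes(const uint8_t *sector)
{
    return root_dir_entries(sector) * 32;
}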

I bet that even the support team of the company who invented this player has no idea about this limitation. Funnily enough, I later noticed that it also supports FAT32 (which doesn't suffer from the root-directory limitation), but it seemed like the player had problems browsing the directories… aargh.

I think this is a really cool example that shows that your low-level knowledge might help you with real high-level stuff… Knowledge is power. :)

P.S.: When saying "disk" for the player's storage, I meant flash storage… but it doesn't really have anything to do with my post.

Here is more info about FAT16 vs FAT32.