Sad But True, Really Long Paths

November 4th, 2009

It’s really an edge case, some might claim. Still, I find it irritating and I wanted to share it with you guys.
Nothing serious, I bet you know it. Well, let’s get to the point then.
MAX_PATH (as the Windows headers define it) is 260. 260 characters to hold a path, and everybody uses that. The thing is that under NTFS you can create really long paths (~32K characters), since each element (directory name or file name) can be up to 255 characters, so you can easily chain a few of them as sub-directories and pass the 260 limit.

I created a very long pathname using Python for ease:

import win32file
win32file.CreateDirectoryW(u"\\\\?\\c:\\01" + u"2" * 250, None)
win32file.CreateDirectoryW(u"\\\\?\\c:\\01" + u"2" * 250 + u"\\" + u"3" * 250, None)

Then if you try to open it with explorer.exe, you can enter only the first directory. You cannot browse the sub-directory, not even right-click on it (you get the menu as if you right-clicked on the background of the window) or delete it. Explorer really acts weird with those directories.
Luckily RD /S usually works, though it also fails in some cases. I also tried creating many nested sub-directories, and at some point it wouldn’t let me access the deepest one.

Now I ask myself: usually you query the required path length by passing NULL instead of a buffer to the API, and that call returns the size; then you allocate that much and fetch the data itself with a second call. Almost all Windows APIs work this way. So why support only a 260-character full path name? Maybe it’s not practical to have file names that long? Even so, you are supposed to malloc for the second call anyway… so it turns out people are just lazy: they supply a 260-byte buffer for the first call and go.
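For what it’s worth, the call-twice idiom itself is easy to get right. Here is a minimal sketch in C, using the portable snprintf as a stand-in for the Windows size-query APIs (join_path is just an illustrative name, not a real API):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The call-twice idiom: first call with a NULL buffer to learn the
 * required size, then allocate exactly that much and call again.
 * snprintf stands in here for the Windows APIs that follow the same
 * pattern -- no hardcoded 260-byte buffer anywhere. */
char *join_path(const char *dir, const char *name)
{
    /* First call: NULL buffer, size 0 -- returns the length needed. */
    int needed = snprintf(NULL, 0, "%s\\%s", dir, name);
    if (needed < 0)
        return NULL;

    char *path = (char *)malloc((size_t)needed + 1); /* +1 for the terminator */
    if (path == NULL)
        return NULL;

    /* Second call: now with a buffer of exactly the right size. */
    snprintf(path, (size_t)needed + 1, "%s\\%s", dir, name);
    return path;
}
```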

One note though:
when I say 260 characters, in reality it’s 260*2 bytes, cause NTFS stores the names in Unicode (UTF-16).

Waiting for someone to tell me I am wrong about the whole issue.

Don’t Wait, Shoot. (KeSetEvent)

November 3rd, 2009

Apparently when you call down a driver with IoCallDriver you can either wait till the operation is finished or not. If you wait, you need somebody to tell you “hey dude, you can stop waiting”. That’s trivial: you set up a completion routine that will be called once the lower driver is finished with your IRP. The problem is when you don’t check the return code from that driver and assume you should always wait. So you wait. Now what? Suppose the lower driver returned immediately – did you even notice? Probably not, because you’re not blocked forever; otherwise you would have noticed it immediately. Seemingly there’s no problem, because the lower driver will call your completion routine anyway, and there you will signal “hey dude, you can stop waiting”, right? But it turns out you waited for nothing and just consumed some resources (locks).
That’s why you will usually see a simple test of whether the return code from IoCallDriver is STATUS_PENDING, and only then a wait till the operation is finished, in order to make it synchronous – that’s what all the talk is about. The thing is that you still need to do that same check in the completion routine you supplied. It seems that if you simply call KeSetEvent (and remember, now we know nobody is waiting on the event anymore, so why signal it in the first place), you still pay some performance penalty. When you’re in a filter driver, for instance, you shouldn’t. And it’s bad programming practice anyway.
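The idiom looks roughly like the following sketch (not compilable as-is; MyCompletion and LowerDeviceObject are placeholder names). The point is the two symmetric checks: wait only on STATUS_PENDING, and signal only when the lower driver really went asynchronous:

```c
/* Completion routine: signal only if the lower driver actually
   returned pending; otherwise nobody is waiting, and KeSetEvent
   would just burn the dispatcher lock for nothing. */
NTSTATUS MyCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    if (Irp->PendingReturned)
        KeSetEvent((PKEVENT)Context, IO_NO_INCREMENT, FALSE);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

/* Caller side: */
KEVENT event;
NTSTATUS status;

KeInitializeEvent(&event, NotificationEvent, FALSE);
IoSetCompletionRoutine(Irp, MyCompletion, &event, TRUE, TRUE, TRUE);

status = IoCallDriver(LowerDeviceObject, Irp);
if (status == STATUS_PENDING) {
    /* Only now is there anything to wait for. */
    KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
    status = Irp->IoStatus.Status;
}
```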

I think it’s quite clear why WaitForSingleObject is “slow”, though in our case it would be satisfied immediately, and yet… But at first thought I didn’t realize that SetEvent is also problematic. I thought it was just a matter of flagging a boolean. In some sense that’s true, but there’s more to it. You see, since somebody might be waiting on the event, you have to wake up the waiting thread, and for that you need to acquire the dispatcher lock, yield execution, etc. Now suddenly it becomes a pain, huh?

Actually the way KeSetEvent works is quite interesting. They knew they had to satisfy waiters, so they acquire the dispatcher lock in the first place, and then they can also safely touch the event state.

Moral of the story, don’t wait if you don’t have to, just shoot!

Integer Promotion is Dodgy & Dangerous

October 28th, 2009

I know this subject, and yet it surprises me again every time. Even if you know the rules of how it works and you have read K&R, it will still confuse you and you will end up being wrong in some cases. At least, that’s what happened to me. So I decided to mention the subject and give two examples along the way.

Integer promotions probably happen in your code many times, and most of us are not even aware of that fact and don’t understand how it works. To those of you who have no idea what integer promotion is, to make a long story short: “Objects of an integral type can be converted to another wider integral type (that is, a type that can represent a larger set of values). This widening type of conversion is called “integral promotion.””, citing MSDN. Why? Sometimes so calculations can be faster, sometimes so operands of different types can be combined seamlessly, etc. There are exact rules in the standard for how and when it works; you should check them out on your own.

typedef enum {foo_a = -1, foo_b} foo_t;

unsigned int a = -1;
printf("%s", a == foo_a ? "true" : "false");

Can you tell what it prints?
It will print “true”. Nothing special, right? It just works as we expect.
Check the next one out:

unsigned char a = -1;
printf("%s", a == foo_a ? "true" : "false");

And this time? This one will print “false”, only because the type of ‘a’ is unsigned char. ‘a’ is promoted to an int holding 0x000000ff, which is compared to foo_a, an int holding 0xffffffff (-1), and that yields false, of course.
If ‘a’ were defined as a signed char, it would be ok, since the integer promotion would sign-extend it.

Another simple case:

unsigned char a = 5, b = 200;
unsigned int c = a * b;
printf("%d", c);

Any idea what the result is? I would expect it to be (200*5) & 0xff – aka the low byte of the result, since we multiply uchars here – and you? But then I would be wrong as well. The result is 1000. You know why? … Integer promotions, ta da. It’s not like c = (unsigned char)(a * b); And that is what’s confusing sometimes.
Let’s see some Assembly then:

movzx       eax,byte ptr [a]
movzx       ecx,byte ptr [b]
imul        eax,ecx
mov         dword ptr [c],eax

Nasty: the unsigned char variables are promoted to unsigned int, then the multiplication happens in 32-bit operand size, and the result is never truncated back to unsigned char – just like that.

Why is it dangerous? I think the answer is obvious.. you trivially expect for one result when you read/write the code, but in reality something different happens. Then you end up with a small piece of code that doesn’t do what you expect it to. And then you end up with some integer overflow vulnerability without slightly noticing. Ouch.
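To convince yourself, all the examples above can be folded into one self-checking function (assuming 32-bit int, as on the platforms discussed here; the function name is mine):

```c
#include <assert.h>

enum { foo_a = -1, foo_b };

/* Returns 1 if the promotions behave exactly as described in the text. */
int promotions_behave_as_described(void)
{
    unsigned int ui = -1;            /* wraps to 0xffffffff */
    unsigned char uc = -1;           /* wraps to 0xff */
    signed char sc = -1;
    unsigned char a = 5, b = 200;
    unsigned int c = a * b;          /* both operands promote to int first */

    return (ui == foo_a)             /* foo_a converts to 0xffffffff: true */
        && (uc != foo_a)             /* uc promotes to int 255, foo_a is -1 */
        && (sc == foo_a)             /* sign-extended to -1: true */
        && (c == 1000)               /* no 8-bit wrap-around happens */
        && ((unsigned char)(a * b) == (1000 & 0xff)); /* explicit truncation */
}
```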

Update: Thanks to Daniel I changed my erroneous (second) example to what I really had in mind when I wrote this post.

diStorm3 – Call for Features

October 2nd, 2009

[Update diStorm3 News]

I have been working more and more on diStorm3 recently. The core code is already written, and it works great. I am still not going to talk about the structure itself that diStorm uses to format the instructions. There are two APIs now: the old one, which takes a stream and formats it to text, and a newer one, which takes a stream and formats it into structures. The latter is much faster. Unlike in diStorm64, where the text formatting was coupled with the decoding code, here it’s totally separated. For example, if you want to support AT&T syntax, you can do it in a couple of hours or less, really. I don’t like AT&T syntax, hence I am not going to implement it. I bet many people still don’t know how to read it without getting confused…

Hereby, I am asking you guys to come up with ideas for diStorm3. So far I got some new ideas from people, which I am going to implement. Such as:
1) You will be able to tell the decoder to stop on any flow control instruction.
2) Instructions are going to be categorized, such as, flow-control, data-control, string instructions, io, etc. (To be honest, I am still not totally sure about this one).
3) Helper macros to extract data references. Since diStorm3 outputs structures, it’s really easy to know if there’s a data reference and its address. Therefore some macros will aid to do this work.
4) Code references – whether an instruction continues to the next instruction, continues to a target according to a condition, or jumps/calls unconditionally.

I am looking to hear more suggestions from you guys. Please be sure you are talking about disassembler features, and not other layers which use the disassembler.

Just wanted to let you know that diStorm3 is going to be dual licensed with GPL and commercial. diStorm64 is deprecated and I am not going to touch it anymore, though it’s still licensed as BSD, of course.

Trampolines In x64

September 27th, 2009

We got a few nice features from the new x64 architecture: larger memory addressing, more registers (so fastcall is the standard – the first four arguments go in registers and the rest on the stack), and of course a wider bandwidth of 64 bits, etc. AMD had a once-in-a-lifetime opportunity to change the ISA (Instruction Set Architecture) a bit and make it much better, but instead they only added a very few new instructions, dropped a lot, and left the decoding as hard as before. Probably they were in a crazy rush, but that time Intel had to catch up with them!

The problem we face when hooking a function is how many bytes we will need to override. I already talked about hot patching and branching in x86, but I have never talked at length about x64. Usually most hooking engines use the relative JMP instruction (0xE9), which is useful for x64 as well, but once again we are limited to a range of 2GB from the JMP instruction’s address – and in x64, 2GB is really nothing much. I also searched the net a bit for more info and found some interesting approaches. I decided to talk about them here and describe how they work.

1)
The JMP relative instruction, when you know in advance that the distance from the hooked function to the target trampoline is less than 2GB, is a very good method – only 5 bytes.

2)

MOV RAX, <Absolute Address>
JMP RAX

This one is almost optimal: you can branch anywhere in the address space, and it takes only 12 bytes. It costs the destruction of a register, though. Of course, by the ABI (Application Binary Interface) the compiler implements, some registers are defined as volatile, meaning you can use them almost any time without worrying or needing to restore them. By analyzing the function (using a disassembler) you may be able to know which register you can use safely. That’s a big headache though.
Note that you could replace the JMP RAX with PUSH RAX; RET, but it’s still the same size.

3)

PUSH <Low DWORD of Absolute Address>
; Only if required (when the high half is non-zero):
MOV [RSP+4], DWORD <High DWORD of Absolute Address>
RET

This one was found on Nikolay Igotti’s blog. You split the QWORD address value into two DWORDs. The first PUSH, although it pushes a 32-bit value, really allocates a 64-bit slot on the stack. Then, if the high half of the address is non-zero, you write it directly to the stack. This leaves a full QWORD on the stack which you can then branch to by RETting. It takes 14 (=5+8+1) bytes and doesn’t dirty any register.
Note that it’s possible to shave some bytes when the high half is -1, with OR [RSP+4], -1 and the like.
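A hooking engine would emit those 14 bytes with something like the following sketch (my own helper name; assuming a little-endian host, which is a given when you generate x64 code on x64):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emits method 3: PUSH low32 / MOV DWORD [RSP+4], high32 / RET.
   14 bytes (5+8+1), and no register gets dirtied.
   Encodings: 68 id = PUSH imm32; C7 44 24 04 id = MOV [RSP+4], imm32;
   C3 = RET. */
size_t emit_push_mov_ret(uint8_t *buf, uint64_t target)
{
    uint32_t lo = (uint32_t)target;
    uint32_t hi = (uint32_t)(target >> 32);

    buf[0] = 0x68;                 /* PUSH imm32 (allocates a QWORD on x64) */
    memcpy(&buf[1], &lo, 4);
    buf[5] = 0xC7;                 /* MOV r/m32, imm32 */
    buf[6] = 0x44;                 /* ModRM: [SIB+disp8] */
    buf[7] = 0x24;                 /* SIB: base = RSP */
    buf[8] = 0x04;                 /* disp8 = +4, the high half of the slot */
    memcpy(&buf[9], &hi, 4);
    buf[13] = 0xC3;                /* RET = branch to the pushed QWORD */
    return 14;
}
```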
4)
This one takes advantage of RIP relative addressing mode, check it out, yo

JMP [RIP+0]
DQ <Absolute Address>

This one is cool: it branches to the full address held in the QWORD value, so it takes 14 bytes as well. On the other hand, I don’t like to mix code and data. In the world of firewalls (which do tons of hooking), for instance, hooking this function twice with two different methods will probably lead to a crash: most hooking engines disassemble the instructions, and here they would get garbage beginning with the second “instruction” (the data).
This method leads to an interesting idea, though: you can move the address value somewhere else within a range of +-2GB, and then you need only the 6 bytes of the JMP [RIP+disp32] itself. At the instruction level, any time you mention RIP it costs 4 bytes for a required 32-bit displacement, even if it’s 0 – that sucks.
An unlikely, but still sometimes possible, method is a plain relative JMP <Abs Addr>: some OS APIs let you choose the allocated address, so you can make sure the trampoline lands in the first 2GB.
5)

MOV RAX, [PointerToAddress64]
JMP RAX
;;;
PointerToAddress64:
DQ <Absolute Address>

This one is similar to the former, but it can read the QWORD from anywhere, since this special form of MOV RAX supports a full 64-bit address in its addressing mode. It takes 12 bytes. Needless to say, it dirties a register.

Each method has its own pros and cons. It seems you can do best by choosing a specific method according to the difference from the hooked function to the target trampoline address.
One crazy idea which I haven’t seen elsewhere is to use the simple relative JMP (0xE9) and jump into the MIDDLE of some function nearby; that function gets hooked at that spot to recover the hole (like a transparent proxy), and the hole is filled with the full jump to any address.
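For completeness, here is how the 12-byte method 2 (MOV RAX, imm64 / JMP RAX) could be emitted in C – a sketch with a name of my choosing, little-endian host assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emits MOV RAX, imm64 / JMP RAX into buf (at least 12 bytes);
   returns the number of bytes written.
   Encodings: 48 B8 imm64 = MOV RAX, imm64; FF E0 = JMP RAX. */
size_t emit_mov_jmp_rax(uint8_t *buf, uint64_t target)
{
    buf[0] = 0x48;                 /* REX.W prefix */
    buf[1] = 0xB8;                 /* MOV RAX, imm64 */
    memcpy(&buf[2], &target, 8);   /* the absolute address, little-endian */
    buf[10] = 0xFF;                /* JMP r/m64... */
    buf[11] = 0xE0;                /* ...with RAX as the operand */
    return 12;
}
```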

Any other tricks you wanna share with us? Leave a comment.

Branching to Absolute Addresses

September 25th, 2009

The x86-64 architecture is very annoying sometimes. Especially when you want to hook functions or do some magical assembly tricks :)

Suppose you want to hook a function: you will have to change its first instruction to a jump instruction to your code, so the next time the function is called, control is transferred to your stub. Everybody knows this technique; basically you need 5 bytes, for the opcode E9 (JMP relative) and a DWORD for the relative offset itself. Then you can jump backward or forward by up to 2GB. Sometimes that’s not enough, simply because you have other limitations.
Now the problem with the processor is that there’s no instruction that can branch to an absolute address given straight to the instruction as an immediate operand. Think of MOV EAX, 0x12345678, but instead, JMPABS 0x12345678. What a shame, really. But no worries, it’s of course possible to work around, at the cost of a few more bytes – and sometimes you are short on bytes. However, just to whet your appetite: sometimes you can even hook a 1-byte function that only RETs, in an efficient way. That’s worth another blog post sometime.

Anyway, what you can do is changing the code a bit and do something like this:

MOV EAX, <Absolute Address>
JMP EAX ; <--- Suddenly the processor supports absolute addresses, ah ha!

But now this code takes 7 bytes and it also screws up one of the registers. There are two possible fixes: either store and then reload EAX at the callee site (BAHH!!) or change the code again, which is trivial:

PUSH <Absolute Address>
RET

This is a very nice one, although, technically, I am not sure whether it's slower than a simple JMP. The branch-prediction logic behind the RET instruction (predicting the caller/return address...) is quite impressive, mind you. Now the code takes only 6 bytes, plus you can reach the whole address space.
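Emitting those 6 bytes is straightforward; here's a small sketch (helper name is mine, little-endian host assumed):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emits the 6-byte PUSH imm32 / RET pair into buf; returns the number
   of bytes written. 32-bit (x86) absolute addresses only.
   Encodings: 68 id = PUSH imm32; C3 = RET. */
size_t emit_push_ret(uint8_t *buf, uint32_t target)
{
    buf[0] = 0x68;                 /* PUSH imm32 */
    memcpy(&buf[1], &target, 4);   /* the absolute target, little-endian */
    buf[5] = 0xC3;                 /* RET -- "returns" into the target */
    return 6;
}
```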

Let's move to another similar issue, suppose you want to CALL an absolute address, what you gonna do now?

PUSH EBX
CALL $ + 5
HERE:
POP EBX
LEA EBX, [EBX + NEXT-HERE]
PUSH EBP
MOV EBP, ESP
XCHG EBX, [EBP + 4] ; 1337
POP EBP
PUSH <Absolute Address>
RET
NEXT:

I will leave it to you to figure it out, hint - it doesn't screw any register.

Crashing The VS Compiler

September 16th, 2009

With this code:

template <class T> class bugs {
};
template<> bugs<int>::bugs()
{
}

It seems the compiler doesn’t like it when we specialize a member function that wasn’t first declared in the class. Oh well.

Cleaning Resources Automatically

September 15th, 2009

HelllllloW everybody!!!

Finally I am back, after 7 months, from a crazy trip in South America, where I got robbed at gunpoint and denied entry to the USA (I wanted to consult there), saw amazing views and creatures, among other stories – but this isn’t the place to talk about them :) Maybe if you insist I will post about it once.

<geek warning>
Honestly, I freaked out on the trip without a computer (bits/assembly/compilers), so all I could do was read and follow many blogs and stuff.
</geek warning>

I even learned that my post about the kernel DoS in XP SP3, the desktop wallpaper weakness, became a CVE. It seems MS has fixed it already, yey.

And now I need to warm up a bit, so I decided to dive into C++ with an interesting and very useful example: cleaning resources automatically when they go out of scope. I think not many people are aware of how simple it is to write such a feature in C++ (or any other language that supports templates).

The whole issue is about how you destroy resources, I will use Win32 resources for the sake of example. I already talked once about this issue in C.

Suppose we have the following snippet:

HBITMAP h = LoadBitmap(...);
if (h == NULL) return 0;
HBITMAP h2 = LoadBitmap(...);
if (h2 ==  NULL) {
   DeleteObject(h);
   return 0;
}
char* p = (char*)malloc(1000);
if (p == NULL) {
   DeleteObject(h2);
   DeleteObject(h);
   return 0;
}

And so on – with every failure handled in the if statements, we have to clean up more and more resources. And you know what, even today people forget to clean up all their resources, and it might even lead to security problems. But that was C times; now we are all cool and know C++, so why not use it? One book which talks about these issues is Effective C++ – recommended.

Also, another problem with the above code: if an exception is thrown in the middle or afterward, you still have to clean those resources, and copy/pasting the cleanup lines makes the code prone to errors.

Basically, all we need is a nice small class, call it AutoResource, that holds the object itself and manages it. Personally, it reminds me of the auto_ptr class, but it’s way less permissive: you will only be able to initialize and use the object. And of course, it will be destroyed automatically when it goes out of scope.
How about this code now:

AutoResource<HBITMAP> h(LoadBitmap(...));
AutoResource<HBITMAP> h2(LoadBitmap(...));
char* p = new char[1000]; // If an exception is thrown, all objects initialized prior to this line are automatically cleaned.

Now you can move on and be totally free of ugly failure-testing code and not worry about leaking objects, etc. Let’s get into details. What we really need is a special class whose CleanUp() behavior we can change according to the object’s type; that’s easily possible in C++ with member specialization. We will not allow copying or assigning the class – we want total control over the object – though we will let the user get() it. And as a kind of defensive programming, we will force the coder to implement a specialized CleanUp() in case he uses the class with a new type and forgets to write the new cleanup code, by using a compile-time assertion (I borrowed this trick from Boost). Also, there might be a case where the constructor input is NULL, and then the constructor informs the caller by throwing an exception – download and check out the complete code later.

#include <boost/static_assert.hpp>

template <class T> class AutoResource {
public:
   AutoResource(T t) : m_obj(t) { }

   void CleanUp()
   {
      // WARNING:
      // If this assertion fires, you have to specialize CleanUp() for the new type.
      // (The sizeof(T) == 0 form keeps the assertion dependent on T, so it only
      // fails when this non-specialized version is actually instantiated.)
      BOOST_STATIC_ASSERT(sizeof(T) == 0);
   }

   ~AutoResource()
   {
      CleanUp();
      m_obj = NULL;
   }
   T get() const
   {
      return m_obj;
   }
private:
   // Not copyable and not assignable - we want total control over the object.
   AutoResource(const AutoResource&);
   AutoResource& operator=(const AutoResource&);

   T m_obj;
};

//Here we specialize the CleanUp() for the HICON resource.
template<> void AutoResource<HICON>::CleanUp()
{
   DestroyIcon(m_obj);
}

You can easily add new types and enjoy the class.  If you have any suggestions/fixes please leave a comment!
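If you want to play with the idea without the Win32 types, here is a simplified, portable sketch of the same pattern (my own reduced version, not the class from the download), specialized for FILE* so a file gets closed automatically on scope exit:

```cpp
#include <cassert>
#include <cstdio>

// A reduced, portable sketch of the AutoResource idea: CleanUp() has no
// default body, so every new type must provide its own specialization.
template <class T> class AutoResourceDemo {
public:
    explicit AutoResourceDemo(T t) : m_obj(t) { }
    ~AutoResourceDemo() { CleanUp(); }
    T get() const { return m_obj; }
private:
    void CleanUp();  // declared only: unspecialized types fail to link
    AutoResourceDemo(const AutoResourceDemo&);             // non-copyable
    AutoResourceDemo& operator=(const AutoResourceDemo&);
    T m_obj;
};

// The per-type specialization: how a FILE* gets released.
template<> void AutoResourceDemo<std::FILE*>::CleanUp()
{
    if (m_obj != NULL)
        std::fclose(m_obj);
}

// Usage: the file is closed when 'f' goes out of scope, on every path.
bool write_greeting(const char* path)
{
    AutoResourceDemo<std::FILE*> f(std::fopen(path, "w"));
    if (f.get() == NULL)
        return false;
    std::fputs("hello\n", f.get());
    return true;  // fclose() happens automatically here
}
```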

Download complete code: AutoResource.cpp.

Thanks to Yan Michalevsky for helping with the code!

P.S – While compiling this stuff, I found a crash in the Optimization Compiler of VS2008 :)  lol.

Arkon Under the Woods

May 2nd, 2009

Yeah, I am not alright, I am even an ass, leaving you all (my dear readers, are there any left?) without saying a word, for the second time. At the end of last year I was in South East Asia for 3 months, and now I am in South America for 3 months and counting… It’s just that I really wish to keep this blog purely technological, but I guess we are all human after all. So yeah, I have been trekking a lot in Chile and Argentina (mostly in Patagonia) and having a great time, now in Buenos Aires. Good steaks and wines, ohhh and the girls. Say no more.

It is really cool that almost every shitty hostel you go to has WiFi available for free use. So, carrying an iPod touch with me, I can actually be online – but apparently not many web developers think about mobile web pages, and thus I couldn’t write blog posts with Safari, because there is some problem with the text-area object. For some reason – I guess some JS code – it doesn’t run well on the iPod, so I don’t get that keyboard thingy up and cannot type anything. WordPress…

I am always surprised again to see how many computers here, in coffee shops or just Internet shops, are not really secured. You run as admin some of the time, and there is no antivirus, which I do think is good to have for the average user. And if you plug in your camera to upload some pictures, the next time you will see some stupid new files on it, named desktop.ini and autorun.inf – sounds familiar? And then I read some MS blog post about disabling AutoRun for removable storage devices… yipi, about time. What I am also trying to say is that one could so easily build a zombie army with all those computers… the ease of access and the lack of protection drive me mad.

Anyhow, I had some free time, of course – I am on a vacation, sort of, after all – and I accidentally reached an amazing blog that I couldn’t stop reading for a few days. Meet NO EXECUTE! If you are a low-level freak like me, you will really like it too. Although Darek begins with hardware stuff, which will fill some gaps for most people I believe, he talks about virtualization and emulators (his obsession), and I just read it like a fantasy book, eager to get to the next chapter every time. I learned tons of stuff there, and I really like to see that even today a few people still measure and check optimizations in cycles per instruction rather than in seconds or milliseconds. One of the things I really liked there was a trick he pulled for the case where the guest OS runs little endian, for instance, and the host OS runs big endian. Every access to memory has to be byte-swapped when the access is wider than one byte, of course. Therefore, in order to eliminate the byte swaps, which are expensive, he kind of turned all the memory of the guest OS upside down, and thereby the endianness changed as well. Now it might sound like a simple matter, but this is awesome, and the way he describes it, you can really feel the excitement behind the invention… He also talks about how lame Intel and AMD are, coming up with new instruction sets every Monday, which I already mentioned in the past as well.

Regarding diStorm now: I have decided to discontinue the development of the current diStorm64 version. But hey, don’t worry. I am going to open-source diStorm3, and I am still considering making it dual-licensed. The benefits of diStorm3 are structure output and, believe me, amazing speed; and like in the good old days, the structure per instruction is unbelievably tiny in size (relative to the other disassemblers I have seen out there). You guys are gonna like it.

Thing is, I have no idea when I am getting home… Now with this Swine Flu spreading like hell, I don’t know where I will end up. The only great thing about this Swine Flu, so to speak, is that you can see Evolution in progress.

Salud

VML + ANI ZERT Patches

February 3rd, 2009

It is time to release an old presentation about the VML and ANI vulnerabilities that were patched by ZERT. It explains the vulnerabilities and how they were closed. It is very technical; Assembly knowledge is required if you wanna really enjoy it. I also gave a talk using this presentation at CCC 2007. It so happened that I wrote the patches, with the extensive help of the team, of course.

ZERT Patches.ppt