Archive for the ‘C++’ Category

Private Symbols Look Up by Binary Signatures

Friday, July 1st, 2011

This post could really be extended and divided into a few posts, but I decided to try and keep it small as much as I can. If I see it draws serious attention I might elaborate on the topic.

Signature matching for finding functions is a very old technique, but I haven’t found anyone who talks about it with juicy details or at all, and decided to show you a real life example. It is related to the last post about finding service functions in the kernel. The problem is that sometimes inside the kernel you want to use internal functions, which are not exported. Don’t start with “this is not documented story”, I don’t care, sometimes we need to get things done no matter what. Sometimes there is no documented way to do what you want. Even in legitimate code, it doesn’t have to be a rootkit, alright? I can say, however, that when you wanna add new functionality to an existing and working system, in whatever level it might be, you would better depend as much as you can on the existing functionality that was written by the original programmers of that system. So yes, it requires lot of good reversing, before injecting more code and mess up with the process.
The example of a signature I’m going to talk about is again about getting the function ZwProtectVirtualMemory address in the kernel. See the old post here to remember what’s going on. Obviously the solution in the older post is almost 100% reliable, because we have anchors to rely upon. But sometimes with signature matching the only anchors you have are binary pieces of:
* immediate operand values
* strings
* xrefs
* disassembled instructions
* a call graph to walk on
and the list gets longer and can really get crazy and it does, but that’s another story.

I don’t wanna convert this post into a guideline of how to write a good signature, though I have lots of experience with it, even for various archs, though I will just say that you never wanna put binary code as part of your signature, only in extreme cases (I am talking about the actual opcode bytes), simply because you usually don’t know what the compiler is going to do with the source code, how it’s going to look in assembly, etc. The idea of a good signature is that it will be as generic as possible so it will survive (hopefully) the updates of the target binary you’re searching in. This is probably the most important rule about binary signatures. Unfortunately we can never guarantee a signature is to be future compatible with new updates. But always test that the signature matches on a few versions of the binary file. Suppose it’s a .DLL, then try to get as many versions of that DLL file as possible and make a script to try it out on all of them, a must. The more DLLs the signature is able to work on successfully, the better the signature is! Usually the goal is to write a single signature that covers all versions at once.
The reason you can’t rely on opcodes in your binary signature is because they get changed many times, almost in every compilation of the code in a different version, the compiler will allocate new registers for the instructions and thus change the instructions. Or since code might get compiled to many variations which effectively do the same thing, I.E: MOV EAX, 0 and XOR EAX, EAX.
One more note, a good signature is one that you can find FAST. We don’t really wanna disassemble the whole file and run on the listing it generated. Anyway, caching is always a good idea and if you have many passes to do for many signatures to find different things, you can always cache lots of stuff, and save precious loading time. So think well before you write a signature and be sure you’re using a good algorithm. Finding an XREF for a relative branch takes lots of time, try to avoid that, that should be cached, in one pass of scanning the whole code section of the file, into a dictionary of “target:source” pairs, with false positives (another long story) that can be looked up for a range of addresses…

I almost forgot to mention, I used such a binary signature inside the patch I wrote as a member of ZERT, for closing a vulnerability in Internet Explorer, I needed to find the weak function and patch it in memory, so you can both grab the source code and see for yourself. Though the example does use opcodes (and lots of them) as part of the signature, but there’s special reason for it. Long story made short: The signature won’t match once the function will get officially patched by MS (recall that we published that solution before MS acted), and then this way the patcher will know that it didn’t find the signature and probably the function was already patched well, so we don’t need to patch it on top of the new patch.. confusing shit.

The reason I find signatures amazing is because only reversers can do them well and it takes lots of skills to generate good ones,
happy signaturing :)

And surprisingly I found the following link which is interesting:

So let’s delve into my example, at last.
Here’s a real example of a signature for ZwProtectVirtualMemory in a Kernel driver.

Signature Source Code

From my tests this signature worked well on many versions…though always expect it might be broken.

Uh Ah! I Happened To Use POP ESP

Friday, April 15th, 2011

I was telling the story to a friend of mine about me using POP ESP in some code I wrote, and then he noted how special it is to use such an instruction and probably I’m the first one whom he’s heard of that used it. So I decided to share. I’m sorry to be mystical about my recent posts, it’s just that they are connected to the place I work at, and I can’t talk really elaborate about everything.

Here we go.
I had to call a C++ function from my Assembly code and keep the return value untouched so the caller will get it. Usually return values are passed on EAX, in x86 that is. But that’s not the whole truth, they might be passed on EDX:EAX, if you want to return 64 bits integer, for instance.
My Assembly code was a wrapper to the C++ function, so once the C++ function returned, it got back to me, and so I couldn’t touch both EDX and EAX. The problem was that I had to clean the stack, as my wrapper function acted as STDCALL calling convention. Cleaning the stack is pretty easy, after you popped EBP and the stack pointer points to the return address, you still have to do POPs as the number of arguments your function receives. The calling convention also specifies which registers are to be preserved between calls, and which registers are scratch. Therefore I decided to use ECX for my part, because it’s a scratch register, and I didn’t want to dirty any other register. Note that by the time you need to return to the caller and both clean the arguments on the stack, it’s pretty hard to use push and pop instructions to back up a register so you can freely use it. Again, because you’re in the middle of cleaning the stack, so by the time you POP that register, the ESP moved already. Therefore I got stuck with ECX only, but that’s fine with me. After the C++ function returned to me, I read from some structure the number of arguments to clean. Suppose I had the pointer to that structure in my frame and it was easily accessible as a local variable. Then I cleaned my own stack frame, mov esp, ebp and pop ebp. Then ESP pointed the return address.
This is where it gets tricky:

Assume ECX holds the number of arguments to clean:
lea ecx, [esp + ecx*4 + 4]

That calculation gets the fixed stack address, like the ESP that a RET N instruction would get it to. So it needs to skip the number of arguments multiplied by 4, 4 bytes per argument, and add to that the return value itself.

Going on with:
xchg [esp], ecx

Which puts on the stack the fixed stack address, and getting ECX with the return address. This is where usually people get confused, take your time. I’m waiting ;)

And then the almighty:
pop esp
jmp ecx

We actually popped the fixed stack pointer from the stack itself into the stack pointer. LOL
And since we got ECX loaded with the return address, we just have to branch to it.

What I was actually doing is to simulate the RET N instruction, using only ECX. And ESP should be used anyway. Now the function I was returning to, could access both the optional EDX and EAX as return values from the C++ function.

It seems that the solution begged a SMC (self modifying code) so I could just patch the N, in the RET N instruction, which is a 16 bits immediate value. But SMC is bad for performance, and obviously for multi threading…

Also note that I could just clean the stack, and then branched to something like: jmp [esp – argsCount*4 – 4],
but I don’t like reading off my stack pointer, that’s a bad practice (mostly from the days of interrupts…).


Memory Management

Tuesday, December 7th, 2010

In this post I decided to write about a few things you have to keep on mind while writing or designing a big application regarding memory management, I will also try to give you a new way for looking at memory allocations.

The problem is that one day you wake up and you have either memory leaks or memory fragmentation, you don’t know which or why. To eliminate the possibility of memory fragmentation, you will have to rule out memory leaks first.

Trying to find a solution to such a case you find that your dozens of DLLs work (normally but-)sharing the same CRT, so if you pass an allocated object from one DLL to be used and freed by another DLL, you’re screwed. Also, you can’t separate the CRTs, etc. The root of the problem is because each CRT has its own heap, thus if you statically link a DLL with its own CRT, it will have its own heap. If all DLLs use the same CRT, they share the same heap, or even the application’s global heap. Why is it bad? Because then if you have memory leaks in one DLL, it will dirty the whole global heap. I will talk about it soon. An advantage for separating heaps is that each thread that belongs to some DLL can access its own heap faster, without congestion on the global heap lock which all DLLs/threads need.
The idea is to separate the applications into logical groups of components, and each logical group will use its own heap, so you can pass pointers around. This is really hard to do if you have existing code and never thought about this issue. So you’re saying, “oh yeah, but if I don’t have leaks, and my code is doing alright, I don’t need it anyway”. Well I wouldn’t be so quick about thinking that. Keep on reading.

In C++ for instance, overriding new and delete for your class, is really useful, because even if it gets used in a different component, C++ is responsible to bring the “deletion” to the appropriate heap.

Now I wanna go back a bit to some background on how memory allocators work in general, which is usually in a naive fashion. It really means that when you ask for a block of memory, the memory manager (malloc for you) allocates some chunk and returns a pointer to you. Once another request for memory is being done, the next memory address is getting used to supply the memory request… (suppose the first chunk requested is still busy), and so on. What you can see here, is that the memory allocator works linearly (allocates a block after the other) and chronologically (depends on the time you asked for the allocation).

Let me give you some example so you understand what I’m talking about, in this example there are two threads: Thread A and Thread B.
Thread A allocates chunks of 0.5kb in a loop, and at some point shuts down after it’s done with its task.
Thread B allocates chunks of 0.5kb in a loop, simultaneously, and keeps on working.
And suppose they are using the same CRT, thus the same heap.

What we will see from the memory manager perspective is something like:
<0.5kb busy> <0.5kb busy> <0.5kb busy>…………………….and then something like ….

Each chunk might belong to either Thread A or Thread B. Don’t be picky about it, it’s really up to so many environmental issues (processors, locks, etc).

Suppose Thread A is finished with its memory because it was only up to do some initialization job(!) and decides to free everything it has allocated. The memory picture would be something like this now:
<0.5kb free><0.5kb busy><0.5kb free><0.5kb busy> …………………lots of <0.5kb busy belongs to ThreadB> and then ….

What we got now is a hole-y memory map, some blocks are free, the others are busy. Now if we want to allocate another 0.5kb, no problem, we can use the first free block, it will match perfectly, so everybody’s happy so far. As long as the block we want to allocate in such a memory state is smaller or equal to 0.5kb, we will get them for sure, right?
But what happens if we ask to allocate a block of 10kb, ha ah, now we have to scan the whole memory to find a free chunk whose size is bigger than 10kb, right? Can you see what’s wrong here?
What really is wrong is that you have a total free amount of X kbytes, which surely can contain 10kb, but they are all chunked, so the memory allocator has to find the next block which can satisfy such a request and it so happens that all these free kbytes are useless to us.
In short, this example is a classic memory fragmentation scenario. To check how bad a memory fragmentation is, you have to measure it by the total free memory, and largest free block size. This way, if you have 2 megs that are fragmented in tiny chunks over 200mb, and the biggest free block is 2 megs – it means you can’t even allocate a 3mb block while the total free size is 198mb. Memory is cruel.

This brings me to say that each memory-consumer has a specific behavior. A memory-consumer like Thread A, that was doing some initialization job and then got down, and freeing all its blocks; is just an init phase memory user.
A memory-consumer like Thread B, that still works after Thread A is done and uses its allocated memory, is an application life-time memory user.
Basically each memory-consumer can be classified into some behavior! Once you mix between those behaviors on the same heap, you’re sooner or later screwed.

The rule of thumb I would suggest you should follow is:
You should distinguish between each memory-consumer type, if you know something consumes memory and it really caches some objects, make sure it works as early as possible in your application. And only then allocate your application life-time objects. This causes its memory to be allocated first and not cause fragmentations to other consumers with different behaviors.

If you know you have to cache some objects while other application life-time objects have to be allocated from the same heap at the same time(!), this is where you should see the red light and create another heap for each behavior.
A practical example is when you open some application that has a workspace, now while you open a workspace in the first time, the application is designed to init its UI stuff only at that moment. Therefore while you’re loading and parsing the data from the file into memory, you’re also setting up the UI and do lots of allocations, the UI is going to stay there until you shutdown the application, and the data itself is going to be de-allocated once you close the workspace. This will create fragmentations if you don’t split the heaps. Can you see why? Think of loading and unloading workspaces and what happens to the memory. Don’t forget that each time you open a workspace, the memory chunks that need to get allocated have a different size. And that the UI still uses its memory which is probably fragmented, so it only gets worse to some extent. Another alternative would be to make sure the UI allocations happen first, no matter what. And only then you can allocate and de-allocate lots of memory blocks without worrying for fragmentations.
Even if you got the UI allocations done first, then there still might be a problem. Suppose you load the file’s data to memory and some logic works in a way that some of the data stays cached for the application life-time even after the workspace is going down. This is a big no no, it will leave you with fragmentations again, hence you should split your heap again.

That’s it mostly, I hope I managed to make it clear.
I didn’t want to talk about the implementation of the heaps, whether they are (LFH) low-fragmentation-heap or whatever, if you mix your memory-consumers, nothing will save you.

One note to mention, that if your cached-memory leaks, for whatever reason, with the new strategy and a standalone heap, you won’t hurt anybody with your leaks, and it will even be easier to track leaks per heap rather than dozens of DLLs on the same global heap. And if you already have a big application with a jungle of one heap and various memory-consumers types, there are some techniques to save the day…

Deleting a New

Sunday, October 24th, 2010

Recently I’ve been working on some software to find memory leaks and fix its fragmentation too. I wanted to start with some example of a leak. And next time I will talk about the fragmentations also.

The leak happens when you allocate objects with new [] and delete (scalar) it. If you’re a c++ savvy, you might say “WTF, of course it leaks”. But I want to be honest for a moment and say that except from not calling the destructors of the instances in the array, I really did not except it to leak. After all, it’s a simple memory pointer we are talking about. But I was wrong and thought it’s worth a post to share this with you.

Suppose we got the following code:

class A {
  A() { }
  ~A() { }
  int x, y;

A * a = new A[3];
delete a; // BUGBUG 

What really happens in the code, how does the constructors and destructors get called? How does the compiler know whether to destroy a single objects, or how many objects to destroy when you’re finished working with the array ?

Well, first of all, I have to warn you that some of it is implementation specific (per compiler). I’m going to focus on Visual Studio though.

The first thing to understand is that the compiler knows which object to construct and destroy. All this information is available in compilation time (in our case since it’s all static). But if the object was dynamic, the compiler would have called the destructor dynamically, but I don’t care about that case now…

So allocating memory for the objects, it will eventually do malloc(3 * sizeof(A)) and return a pointer to assign in the variable ‘a’. Here’s the catch, the compiler can’t know how many destructors to call when it wants to delete the array, right? It has to bookkeep the count somehow. But Where??
Therefore the call to the memory allocation has more to it. The way MSVC does it is as following (some pseudo code):

int* tmpA = (int*)malloc(sizeof(A) * 3 + sizeof(int)); // Notice the extra space for the integer!
*tmpA = 3; // This is where it stores the count of instances! ta da
A* a = (A*)(tmpA + 1); // Skip that integer, really assigns the pointer allocated + 4 bytes in 'a'.

Now all it has to do is calling the constructors on the array for each entry, easy.
When you work with ‘a’ the implementation is hidden to you, eventually you get a pointer you should only use for accessing the array and nothing else.
At the end you’re supposed to delete the array. The right way to delete this array is to do ‘delete []a;’. The compiler then understands you ask to delete a number of instances rather than a single instance. So it will loop on all the entries and call a destructor for each instance, and at last free the whole memory block. One of the questions I asked in the beginning is how would the compiler know how many objects to destroy? We already know the answer by now. Simple, it stored the count before the pointer you hold.

So deleting the array in a correct manner (and reading the counter) would be as easy as:

int* tmpA = (int*)a - 1; // Notice we take the pointer the compiler handed to you, and get the 'count' from it.
for (int i = 0; i < *tmpA; i++) a[i].~a();
free (tmpA); // Notice we call free() with the original pointer that got allocated! 

So far so good, right? But the leak happens if you call a scalar delete on the pointer you get from allocating a new array. And that's the problem even if you have an array of primitive types (like integers, chars, etc) that don't require to call a destructor you still leak memory. Why's that?
Since the new array, as we saw in this implementation returns you a pointer, which does not point to the beginning of the allocated block. And then eventually you call delete upon it, will make the memory manager not find the start of the allocated block (cause you feed it with an offset into the allocated block) and then it has a few options. Either ignore your call, and leak that block. Crash your application or give you a notification, aka Debug mode. Or maybe in extreme cases cause a security breach...

In some forum I read that there are many unexpected behaviors in our case, one of them made me laugh so hard, I feel I need to share it with you:
"A* a = new A[3]; delete a; Might get the first object destroyed and released, but keep the rest in memory".

Well it doesn't take a genius to understand that the compiler prefers to bulk allocate all objects in the same block...and yet, funny.
The point the guy tries to make is that you cannot know what the compiler implementation is, as weird as it might be, don't ever rely on it. And I totally agree.

So in our case a leak happens in the following code:

int*a = new int[100];
delete a;

The point is that when you new[], you should must call a corresponding delete [].
Except from the need to make your code readable and correct, it won't be broken, and never trust the compiler, just code right in the first place.

And now you can imagine what happens if you alloc a single object and tries to delete[] it. Not healthy, to say the least.

SmartPointer In C++

Wednesday, March 3rd, 2010

Smart pointers, the way I see it, are there to help you with, eventually, two things: saving memory and auto-destruction. There are plenty kinds of smart pointers and only one type of a dumb pointer ;) I am going talk about the one that keeps a reference count to the data. To me they are one of the most important and useful classes I have used in my code. Also the AutoResource class I posted about, here, is another type of a smart pointer. I fell in love with smart pointers as soon as I learnt about them long time ago. However I only happened to write the implementation for this concept only once, in some real product code. Most of the times I got to use libraries that supply them, like ATL and stuff. Of course, when we write code in high level languages like Python, C#, Java, etc. We are not even aware to the internal use of them, mostly anyway.

This topic is not new or anything, it is covered widely on the net, but I felt the need to share a small code snippet with my own implementation, which I wrote from scratch. It seems that in order to write this class you don’t need high skills in C++, not at all. Though if you wanna get dirty with some end cases, like the ones described in ‘More Effective c++’, you need to know the language pretty well.

As I said earlier, the smart pointer concept I’m talking about here is the one that keeps the number of references to the real instance of the object and eventually when all references are gone, it will simply delete the only real instance. Another requirement from this class is to behave like a dumb pointer (that’s just the normal pointer the language supplies), my implementation is not as perfect as the dumb pointer, in the essence of operators and the operations you can apply on the pointer. But I think for the most code usages, it will be just enough. It can be always extended, and besides if you really need a crazy ultra generic smart pointer, Boost is waiting for you.

In order to keep a reference count for the instance, we need to allocate that variable, also the instance itself, and to make sure they won’t go anywhere as long as somebody else still points to it. The catch is that if it will be a member of the SmartPointer class itself, it will die when the SmartPointer instance goes out of scope. Therefore it has to be a pointer to another object, which will hold the number of references and the real instance. Then a few smart pointers will be able to point to this core object that holds the real stuff. I think this was the only challenge in understanding how it works. The rest is a few more lines to add functionality to get the pointer, copy constructor, assignment operator and stuff.

Of course, it requires a template class, I didn’t even mention that once, because I think it’s obvious.
Here are the classes:

template  class SmartPtr {
  SmartPtr(T o)
    // Notice we create a DataObject that gets an object of type T.
    m_Obj = new DataObj(o);
  // ... A few of additional small methods are absent from this snippet, check link below
  // Now, here we define an internal class, which holds the reference count and the real object's instance.
  class DataObj {
    DataObj(T o) : m_ReferenceCount(0)
      m_Ptr = new T(o); // First allocate, this time the real deal
      AddRef(); // And only then add the first reference count
    unsigned int AddRef()
    {  return m_ReferenceCount++;  }
    void Release()
      if (--m_ReferenceCount == 0) {
        delete m_Ptr; // Delete the instance
        delete this; // Delete the DataObj instance too
  T* m_Ptr; // Pointer to a single instance of T
  unsigned int m_ReferenceCount; // Number of references to the instance

// This is now part of the SmartPointer class itself, you see? It points the DataObj and not T !
DataObj* m_Obj;

To see the full source code get it SmartPointer.txt.

I didn’t show it in the snippet above but the assignment operator or copy constructor which get a right hand of a smart pointer class, will simply copy the m_Ptr from it and add a reference to it. And by that, the ‘magic’ was done.

To support multi-thread accesses to the class, you simply need to change the AddRef method to use InterlockedAdd. And to change the Release to use InterlockedSub, ahh of course, use InterlockedAdd with -1.
And then you would be fully thread safe. Also note that you will need to use the returned value of the InterlockedAdd in the Release, rather than compare the value directly after calling the function on it. This is a common bug when writing multi-thread code. Note that if the type object you want to create using the SmartPointer doesn’t support multi-threading in the first place, nothing you can do in the smart pointer method themselves is going to solve it, of course.

I didn’t show it in the snippet again but the code supports the comparison to NULL on the SmartPointer variable. Though you won’t be able to check something like:
if (!MySmartPtr) fail… It will shout at you that the operator ! is not supported. It takes exactly 3 lines to add it.

The only problem with this implementation is that you can write back to the data directly after getting the pointer to it. For me this is not a problem cause I never do that. But if you feel it’s not good enough for you, for some reason. Check out other implementations or just read the book I mentioned earlier.

Overall it’s really a small class that gives a lot. Joy

Crashing The VS Compiler

Wednesday, September 16th, 2009

With this code:

template <class T> class bugs {
template<> bugs<int>::bugs()

It seems the compiler doesn’t like that we specialize a function that wasn’t first declared in the class. Oh well.

Cleaning Resources Automatically

Tuesday, September 15th, 2009

HelllllloW everybody!!!

Finally I am back, after 7 months, from a crazy trip in South America! Where I got robbed at gun-point, and denied access to USA (wanted to consult there), saw amazing views and creatures, among other stories, but it’s not the place to talk about them :) Maybe if you insist I will post once about it.

<geek warning>
Honestly, I freaked out in the trip without a computer (bits/assembly/compiler),  so all I could do was reading and following many blogs and stuff.
</geek warning>

I even learned that my post about the kernel DoS in XPSP3 about the desktop wallpaper weakness became a CVE. It seems MS has fixed it already, yey.

And now I need to warm up a bit, and I decided to dive into C++ with an interesting and very useful example. Cleaning resources automatically when they go out of scope. I think that not many people are aware to the simplicity of writing such a feature in C++ (or any other language that supports templates).

The whole issue is about how you destroy resources, I will use Win32 resources for the sake of example. I already talked once about this issue in C.

Suppose we have the following snippet:

HBITMAP h = LoadBitmap(...);
if (h == NULL) return 0;
HBITMAP h2 = Loadbitmap(...);
if (h2 ==  NULL) {
   return 0;
char* p = (char*)malloc(1000);
if (p == NULL) {
   return 0;

And so on, every failure handling in the if statements we have to clean more and more resources. And you know what, even today people forget to clean all resources, and it might even lead to security problems. But that was C times, and now we are all cool and know C++, so why not use it? One book which talks about these issues is Effective C++, recommended.

Also, another problem with the above code is while an exception is being thrown in the middle or afterward, you still have to clean those resources and copy/paste some lines, which makes it prone to errors.

Basically, all we need is a nice small class which will be called AutoResource that holds the object itself and will manage it. Personally, it reminds me auto_ptr class, but it’s way less permissive. You will be only able to initialize and use the object. Of course, it will be destroyed automatically when it goes out of scope.
How about this code now:

AutoResource<HBITMAP> h(LoadBitmap(...));
AutoResource<HBITMAP> h2(LoadBitmap(...));
char* p = new char[1000]; // If an exception is thrown, all objects initialized prior to this line are automatically cleaned.

Now you can move on and be totally free of ugly failure testing code and not worry about leaking objects, etc. Let’s get into details. What we really need is a special class that we can change the behavior of the CleanUp()  method according to its object’s type, that’s easily possible in C++ by using a method specialization technique. We will not let to copy or assign to the class. We want a total control over the object, therefore we will let the user to get() it too. And as a type of defense programming, we will force the coder to implement a specialized CleanUp() in a case he uses the class for new types and forgets to implement the new CleanUp code; By using compile time assertion (I used this trick from Boost). Also, there might be a case where the constructor input is NULL, and therefore the constructor will have to inform the caller by throwing an exception, download and check out the complete code later.

template <class T> class AutoResource {
   AutoResource(T t) : m_obj(t) { }

   void CleanUp()
      // WARNING:
      // If the assertion occurred you will have to specialize the CleanUp with the new type.

      m_obj = NULL;
   T get() const
      return m_obj;
   T m_obj;

//Here we specialize the CleanUp() for the HICON resource.
template<> void AutoResource<HICON>::CleanUp()

You can easily add new types and enjoy the class.  If you have any suggestions/fixes please leave a comment!

Download complete code: AutoResource.cpp.

Thanks to Yan Michalevsky for helping with the code!

P.S – While compiling this stuff, I found a crash in the Optimization Compiler of VS2008 :)  lol.

String Initialization is Tricksy

Monday, December 29th, 2008

A friend of mine had to hand in an assignment for Computer Science in university. As I understood, it was a relatively easy assignment. And the point is that that friend is very experienced programmer and knows a thing or two about it. Anyway, he had a line in the code which goes like this:
char buf[1024] = “abc”;

You don’t even need to know C in order to understand that line, right? I assume we all agree to that. It simply initializes the buffer with a constant string literal. So his lecturer asked him, what does this line do precisely. And to his surprise his answer was incorrect. The correct answer is that the whole buffer is initialized and then the string constant is copied (this can be done in a few ways, for example copying a buffer with the zeros at the end of it). So today another friend called me on the phone to ask about this thing, why our first friend was wrong about it. Now as a reverser, I suppose I need to know the answer to such a simple matter as well. But, the sad part was that I was wrong as well as the two of them. I fired up the C standard and started to search for the solution. I wanted al iving proof to the matter at hand. Looking here and there it took me around 15 mins to lie my hands on the piece of sentence that settled all that matter down. And I quote:

“If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.”

The underlined text is the answer – If there are less characters than the size of the array to initialize, the remainder has to be initialized as well. There is another clause which explains how the initialization is being done, but for now, let it be ‘zeroing’.

Now the reason I was wrong about it is because I happened to see many (for example):
mov [buf+1], ‘a’
mov [buf+2], ‘b’
mov [buf+3], ‘c’
mov [buf+4], ‘\0’

in lots of functions, and that means the source C code is:

char buf[] = “abc”;

The standard says about this case that the size of the buffer is to be acquired from the size of the literal constant string, don’t forget the null termination character as well. So that’s why I didn’t see the memset coming in to initialize the all buffer. Besides, maybe most of the people code it this way:
char buf[1024];
strcpy(buf, “abc”);

Which doesn’t lead to a memset or other way of initialization of the rest of the array.

Delegators #3 – The ATL Way

Saturday, November 24th, 2007

The Active-Template Library (or ATL) is very useful. I think that if you code in C++ under Windows it’s even a must. It will solve your great many hours of work. Although I have personally found a few bugs in this library, the code is very tight and does the work well. In this post I’m going to focus on the CWindow class, although there are many other classes which do the delegations seamlessly for the user, such as: CAxDialogImpl, CAxWindow, etc. CWindow is the main one and so we will examine it.

I said in an earlier post that ATL uses thunks to call the instance’s window-procedure. A thunk is a mechanism to convert a function invocation between callee and caller. Look it up in Wiki for more info… To be honest, I was somewhat surprised to see that the mighty ATL uses Assembly to implement the thunks. As I was suggesting a few ways myself to avoid Assembly, I don’t see a really good reason to use Assembly here. You can say that the the ways I suggested are less safe, but if a window is choosing to be malicious you can make everything screwed anyway, so I don’t think it’s for this reason they used Assembly. Another reason I can think of is because their way, they don’t have to look-up for the instance’s ‘this’ pointer, they just have it, wheareas you will have to call GetProp or GetWindowLong. But come on… so if you got any idea let me know. I seriously have no problem with Assembly, but as most people thought that delegations must be implemented in Assembly, I showed you that’s not true. The reason it’s really surprised me is that the Assembly code is not portable among processors as you know; and ATL is very popular and used library. So if you take a look at ATL’s thunks code, you will see that they support x86 (obviously), AMD64, MIPS, ARM and more. And I ask, why the heck?when you can avoid it all? Again, for speed? Not sure it’s really worth it. The ATL guys know what they do, I doubt they didn’t know they could have done it without Assembly.

Anyhow, let’s get dirty, it’s all about their _stdcallthunk struct in the file atlstdthunk.h. The struct has a few members, that their layout in memory will be the same as it was in the file, that’s the key here. There is an Init function which constructs the members. These members are the byte code of the thunk itself, that’s why their layout in memory is important, because they are get to ran later by the processor. The Init function gets the ‘this’ pointer and the window procedure pointer. And then it will initialize the members to form the following code:

mov dword ptr [esp+4], this
jmp WndProc

Note that ‘this’ and ‘WndProc’ are member values that their values will be determined in construction-time. They must be known in advance, prior to creation of the thunk. Seeing [esp+4] we know they override the first argument of the instance’s window-procedure which is hWnd. They could have pushed another argument for the ‘this’ pointer, but why should they do it if they can recover the hWnd from the ‘this’ pointer anyway…? And save a stack-access? :)

Since the jmp instruction is relative in its behaviour, that is, the offset to the target address is not absolute but rather relative to the address of the jmp instruction itself, upon initialization the offset is calculated as well, like this:

DWORD((INT_PTR)proc – ((INT_PTR)this+sizeof(_stdcallthunk)));

Note that the ‘this’ here is the address of the thunk itself in memory (already allocated).
Now that we know how the thunk really looks and what it does, let’s see how it’s all get connected, from the global window procedure to the instance’s one. This is some pseudo code I came up with (and does not really reflect the ATL code, I only wanna give you the idea of it):


WndProcThunk = AllocThunk((DWORD)this.WndProc, (DWORD)this);
m_hWnd = CreateWindow (…);
SetWindowLong(m_hWnd, GWL_WNDPROC, WndProcThunk);

This is not all yet, although in reality it’s a bit more complicated in the way they bind the HWND with its thunk…Now when the window will be sent a message, the stack will contain the HWND, MESSAGE, WPARAM and LPARAM arguments for the original window procedure, but then the thunk will change the HWND to THIS and immediately transfer control to the global window procedure, but this time they got the context of the instance!

CWindow::WindowProc(HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam)
 CWindowImplBaseT< TBase, TWinTraits >* pThis = (CWindowImplBaseT< TBase, TWinTraits >*)hWnd;
 pThis->ProcessWindowMessage(pThis->m_hWnd, uMsg, wParam, lParam, lRes, 0);

And there we go. Notice the cast from hWnd to the ‘this’ pointer and the call with the context of the instance, at last, mission accomplished ;)

More information can be found here.

 A new problem now arises, I will mention it in the next post in this chain, but in a couple of words, here’s a hint: NX bit.

Delegators #2 – C++ & Win32

Saturday, November 17th, 2007

Last post I was talking about using SetProp/GetProp to link between an HWND and an instance to a class that encapsulates the window widget. There’s another way to do it.

The first time you are taught how to create a window in Win32, they tell you that you have to fill in all fields in the WNDCLASS structure, but they always keep on saying to ignore some of the advanced fields. One of the fields inside that structure is cbWndExtra, this one says to the windows manager how many bytes to allocate after the original windows manager’s structure. So suppose all we need is to save the ‘this’ pointer, it means we will need 4 bytes, in 32bits environment system, of course. So we can set cbWndExtra to 4, and then we have a DWORD we can access to as much as we like with the SetWindowLong/GetWindowLong API.

Something like this would suffice:

wc. fields = …
wc.lpszClassName = “myclass”;
wc.cbWndExtra = 4;
HWND hWnd = CreateWindow(“myclass”, …)
SetWindowLong(hWnd, 0, this);


And later on in the window-procedure, you can do this:

LRESULT WINAPI MyWindow::WndProc(HWND hWnd, UINT message, WPARAM wparam, LPARAM lparam)
 MyWindow* pThis = (MyWindow*)GetWindowLong(hWnd, 0);
 if (pThis != 0) return pThis->WndProc(message, wparam, lparam);
 return DefWindowProc(hWnd, message, wparam, lparam);

 And there you go. There are a few problems with this technique though. You can’t handle the WM_CREATE message inside the instance’s WndProc, that’s because before CreateWindow returns, the global WndProc will get called with this message, and you didn’t have the chance to set the ‘this’ pointer…

The second problem is that this technique is good only for windows you create yourself, because you control the WNDCLASS… You can use pre-created window-classes of the system by changing their fields in your application only, I think this will create a copy of the pre-created classes only for your process.

Anyhow, to solve the first problem, I saw implementation that create the instance of the object when they receive the WM_CREATE inside the global WndProc. This is really a matter of design, but this way you can call an initializer of the instance once WM_CREATE was received. I can’t come up with another way at the moment if the instance is already created. Maybe just call the initializer on the first message the window gets? Well it’s a bit cracky.