RSS Feeder

January 2nd, 2010

Hey guys,
apparently RSS was broken for this blog, since I redirected it to feedburner. But now everything seems to work once again. I appareciate that anonymous fellow’s comment ;)
Gil

diStorm3 – News

December 29th, 2009

Yo yo yo… forgot to say happy xmas last time, never too late, ah? :)

This time I wanted to update you about diStorm3 once again. Yesterday I had a good coding session and I added some of the new features regarding flow control. The decode function gets a new parameter called ‘features’. Which is a bit field flag that lets you ask the disassembler to do some new stuff such as:

  1. Stop on INT instructions [INT, INT1, INT3, INTO]
  2. Stop on CALL instructions [CALL, CALL FAR]
  3. Stop on RET instructions [RET, RETF, IRET]
  4. Stop on JMP instructions [JMP, JMP FAR]
  5. Stop on any conditional branch instructions [JXXX, JXCX, LOOPXX]
  6. Stop on any flow control (all of the above)

I wasn’t sure about SYSCALL and the like and UD2, for now I left them out. So what we got now is the ability to instruct the disassembler to stop decoding after it encounters one of the above conditions. This makes the higer disassembler layer more efficient, because now you can disassemble code by basic blocks. Also building a call-graph or branches-graph faster.

Note that now you will be able to ask the disassembler to return a single instruction. I know it sounds stupid, but I talked about it already, and I had some reasons to avoid this behavior. Anyway, now you’re free to ask how many instructions you want, as long as the disassembler can read them from the stream you supply.

Another feature added is the ability to filter non-flow-control instructions. Suppose you are interested in building a call-graph only, there’s no reason that you will get all the data-control instructions, because they are probably useless for the case. Mixing this flag with ‘Stop on RET’ and ‘Stop on CALL’, you can do nice stuff.

Another thing is that I separated the memory-indirection description of an operand into two forms. First of all, memory indirection operand is when an instruction reads/writes from/to memory. Usually in Assembly text, you will see the brackets characters surrounding some expressions. Something like: MOV [EDX], EAX. Means we write a DWORD to EDX pointer. If you followed me ’till here, you should know exactly what I’m talking about anyway.

When you get the result of such instruction from diStorm3, the type of the operand will be SMEM (stands for simple-memory), which hints there’s only one register in the memory-indirection operand. Although it doesn’t hint anything about the displacement, that’s that offset you usually see in the brackets. Like MOV [EDX+0x12345678], EAX. So you will have to test if the displacement exists in both forms. The other form is MEM (Normal memory indirection, or probably should be called ‘complex’) since it supports the full memory indirection operand, like: MOV [EAX*4 + ESI + 0x12345678], EAX. Then you will have to read another register that supplies the base register, in addition to the index register and scale. Note that this applies for 16 bits mode addressing as well, when you can have a mix of [‘BX+SI]’ or only ‘[BX]’. Also note that sometimes in 32/64 bits mode, you can have a SIB byte, that sets only the base register and the index register is unused, but diStorm3 will return it as an SMEM, to simplify matters. This way it’s really easy to read the instruction’s parameters.

Another feature for text formatting is the ability to tell the disassembler to limit the address to 16 or 32 bits. This is good since the offsets are 64 bits today. And if you encounter an instruction that jumps backwards, you will get a huge negative value, which won’t make much sense if you disassemble 16 bits code…

diStorm3 still supplies the bad old interface. And now it supports two new additional functions. The decompose function, which returns the structures for each instruction read. And another function that formats a given structure into text, which is pretty straight forward. The text format is not an accurate behavior of diStorm64, it’s more simplified, but good enough. Besides I have never heard any special comments about the formatting of diStorm64, so I guess it doesn’t matter much to you guys. And maybe maybe I will add AT&T syntax later on.

Another field that is returned now, unlike diStorm64, is the instruction-set-class type of the instruction, with very broad categories, like Integer instructions, FPU instructions, SSE instructions, and so on. Still might be handy. And the hint about the flow-control type of the instruction.

Also I changed tons of code, and I really mean it, the skeleton is still the same, but the prefixes engine works totally different now. Trying to imitate a real processor this time. By including the last prefix found of that prefix-type. You can read more about this, here. I made the code way more optimized and eliminated double code and it’s still readable, if not for the better. Also I changed the way instruction are fetched, so the locate-instruction function is much smaller and better.

I’m pertty satisfied with the new version of diStorm and hopefully I will be able to share it with you guys soon. Still I got tons of tests to do, maybe I will add that unit-test module in Python to the proejct so you can enjoy it too, not sure yet.

Also I got a word from Mario Vilas, that he is going to help with compiling diStorm for different platforms, and I’m going to integrate his new Python wrappers that use ctypes, so you don’t need the Python extension anymore. Thanks Mario! ;) However, diStorm3 has its own Python module for the new structure output.

If you have more ideas, comments, complaints or you just hate me, this is the time to say so.
Cheers, happy new year soon!
Gil

Terminate Process @ 1 byte

December 28th, 2009

In the DOS days there was this lovely trick that you could just ‘ret’ anytime you wanted and the process would be terminated, I’m talking about .com files. That’s true that there were no real processes back then, more like a big chaotic jungle. Anyways, how it worked? Apparently the RET instruction, popped a zero word from the stack, and jumped to address 0. At address 0, there was this PSP (Program Segment Prefix) which had lots of interesting stuff, like command line buffer, and the like. The first word in this PSP (at address 0 of the whole segment too) was 0x20CD (instruction INT 0x20), or ‘Terminate Process’ interrupt. So branching to address 0 would run this INT 0x20 and close the program. Of course, you could execute INT 0x20 on your own, but then it would cost another byte :) But as long as the stack was balanced it seemed you could totally rely on popping a zero word in the stack for this usage. In other times I used this value to zero a register, by simply pop ax, for instance…

Time passed, and now we all use Windows (almost all of us anyway) and I came with a similar trick when I wrote Tiny PE. What I did was to put a last byte in my code with the value of 0xBC. Now usually there is some extra bytes following your code, either because you have data there or page alignment when the code section was allocated by the PE loader. This byte really means “MOV ESP, ???”. Though we don’t know what comes next to fill ESP with (some DWORD value), probably some junk, which is great.
At the cost of 1 byte, we caused ESP to get some uncontrolled value, which probably doesn’t map to anywhere valid. After this MOV ESP instruction was executed, nothing happens, alas, the next instruction is getting executed too and so on. When the process gets to terminate you ask? That’s the point we have to wait for one of two conditions to occur. Either by executing some junk instruction that will cause an access violation because it touched some random address which is unmapped. Or the second option is that we run out of bytes to execute and hit an unmapped page, but this time because of the EIP pointer itself.
This is where the SEH mechanism comes in, the system sees there’s an AV exception and tries to call the safe-exception handler because it’s just raised an exception. Now since ESP is a junk really it’s in an unrecoverable state already, so the system terminated the process. 1 byte FTW.

Short Update

December 27th, 2009

I’ve slowed down with posting here, been working hard recently both on diStorm and real life job. Hopefully I will be able to release diStorm3 soon (matter of weeks for a start). I’m almost finished with coding everything I wanted, though I’m still left with the features you guys asked for and tons of tests since I also changed lots of code and added new AVX instruction set.

Opening a file by ID – FILE_OPEN_BY_FILE_ID

December 25th, 2009

Sample code to open a file by its file-id. Had to use it for some tests and thought it might be useful for other people out there.

#include windows.h

typedef ULONG (__stdcall *pNtCreateFile)(
  PHANDLE FileHandle,
  ULONG DesiredAccess,
  PVOID ObjectAttributes,
  PVOID IoStatusBlock,
  PLARGE_INTEGER AllocationSize,
  ULONG FileAttributes,
  ULONG ShareAccess,
  ULONG CreateDisposition,
  ULONG CreateOptions,
  PVOID EaBuffer,
  ULONG EaLength
);

typedef ULONG (__stdcall *pNtReadFile)(
	IN HANDLE  FileHandle,
	IN HANDLE  Event  OPTIONAL,
	IN PVOID  ApcRoutine  OPTIONAL,
	IN PVOID  ApcContext  OPTIONAL,
	OUT PVOID  IoStatusBlock,
	OUT PVOID  Buffer,
	IN ULONG  Length,
	IN PLARGE_INTEGER  ByteOffset  OPTIONAL,
	IN PULONG  Key  OPTIONAL    );

typedef struct _UNICODE_STRING {
	USHORT Length, MaximumLength;
	PWCH Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

typedef struct _OBJECT_ATTRIBUTES {
    ULONG Length;
    HANDLE RootDirectory;
    PUNICODE_STRING ObjectName;
    ULONG Attributes;
    PVOID SecurityDescriptor;        // Points to type SECURITY_DESCRIPTOR
    PVOID SecurityQualityOfService;  // Points to type SECURITY_QUALITY_OF_SERVICE
} OBJECT_ATTRIBUTES;

#define InitializeObjectAttributes( p, n, a, r, s ) { \
    (p)->Length = sizeof( OBJECT_ATTRIBUTES );          \
    (p)->RootDirectory = r;                             \
    (p)->Attributes = a;                                \
    (p)->ObjectName = n;                                \
    (p)->SecurityDescriptor = s;                        \
    (p)->SecurityQualityOfService = NULL;               \
    }

#define OBJ_CASE_INSENSITIVE					0x00000040L
#define FILE_NON_DIRECTORY_FILE                 0x00000040
#define FILE_OPEN_BY_FILE_ID                    0x00002000
#define FILE_OPEN								0x00000001

int main(int argc, char* argv[])
{
	HANDLE d = CreateFile(L"\\\\.\\c:", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, 0, 0  );
	BY_HANDLE_FILE_INFORMATION i;
	HANDLE f = CreateFile(L"c:\\bla.bla", GENERIC_WRITE, 0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
	ULONG bla;
	WriteFile(f, "helloworld", 11, &bla, NULL);
	printf("%x, %d\n", f, GetLastError());
	GetFileInformationByHandle(f, &i);
	printf("id:%08x-%08x\n", i.nFileIndexHigh, i.nFileIndexLow);
	CloseHandle(f);

	pNtCreateFile NtCreatefile = (pNtCreateFile)GetProcAddress(GetModuleHandle(L"ntdll.dll"), "NtCreateFile");
	pNtReadFile NtReadFile = (pNtReadFile)GetProcAddress(GetModuleHandle(L"ntdll.dll"), "NtReadFile");

	ULONG fid[2] = {i.nFileIndexLow, i.nFileIndexHigh};
	UNICODE_STRING fidstr = {8, 8, (PWSTR) fid};

	OBJECT_ATTRIBUTES oa = {0};
    InitializeObjectAttributes (&oa, &fidstr, OBJ_CASE_INSENSITIVE, d, NULL);

    ULONG iosb[2];
    ULONG status = NtCreatefile(&f, GENERIC_ALL, &oa, iosb, NULL, FILE_ATTRIBUTE_NORMAL, FILE_SHARE_READ | FILE_SHARE_WRITE, FILE_OPEN, FILE_OPEN_BY_FILE_ID | FILE_NON_DIRECTORY_FILE, NULL, 0);
	printf("status: %X, handle: %x\n", status, f);
	UCHAR buf[11] = {0};
	LONG Off[2] = {0};
	status = NtReadFile(f, NULL, NULL, NULL, (PVOID)&iosb, (PVOID)buf, sizeof(buf), (PLARGE_INTEGER)&Off, NULL);
	printf("status: %X, bytes: %d\n", status, iosb[1]);
	printf("buf: %s\n", buf);
	CloseHandle(f);
	CloseHandle(d);
}

Optimize My Index Yo

November 26th, 2009

I happened to work with UNICODE_STRING recently for some kernel stuff. That simple structure is similar to pascal strings in a way, you got the length and the string doesn’t have to be null terminated, the length though, is stored in bytes. Normally I don’t look at the assembly listing of the application I compile, but when you get to debug it you get to see the code the compiler generated. Since some of my functions use strings for input but as null terminated ones, I had to copy the original string to my own copy and add the null character myself. And now that I think of it, I will rewrite everything to use lengths, I don’t like extra wcslen’s. :)

Here is a simple usage case:

p = (PWCHAR)ExAllocatePool(SomePool, Str->Length + sizeof(WCHAR));
if (p == NULL) return STATUS_NO_MEMORY;
memcpy(p, Str->buffer, Str->Length);
p[Str->Length / sizeof(WCHAR)] = UNICODE_NULL;

I will show you the resulting assembly code, so you can judge yourself:

shr    esi,1 
xor    ecx,ecx 
mov  word ptr [edi+esi*2],cx 

One time the compiler converts the length to WCHAR units, as I asked. Then it realizes it should take that value and use it as an index into the unicode string, thus it has to multiply the index by two, to get to the correct offset. It’s a waste-y.
This is the output of a fully optimized code by VS08, shame.

It’s silly, but this would generate what we really want:

*(PWCHAR)((PWCHAR)p + Str->Length) = UNICODE_NULL;

With this fix, this time without the extra div/mul. I just did a few more tests and it seems the dead-code removal and the simplifier algorithms are not perfect with doing some divisions inside the indexing for pointers.

Update: Thanks to commenter Roee Shenberg, it is now clear why the compiler does this extra shr/mul. The reason is that the compiler can’t know whether the length is odd, thus it has to round it.

Process File Name Spoofing

November 20th, 2009

I saw an interesting post about spoofing the process file name (and he has other interesting posts so you better check it out anyway). This is really not surprising that many applications fail to retrieve the name correctly, since they access a string in the usermode controlled area, probably something they get from the PEB. So I tried to come up with a quick and reliable way that will be done from usermode without any kernel tendency.
I tried it out myself (I mean with spoofing, using the code he shows in his post), and it worked well.

#include 
#include 
#include 
#pragma comment(lib, "psapi.lib")
void main()
{
 WCHAR buf[260];
 GetMappedFileName(GetCurrentProcess(), main, buf, sizeof(buf));
 printf("%S\n", buf);
}

FYI: GetMappedFileName uses an undocumented info-class for NtQueryVirtualMemory. :)

Getting PID of CSRSS

November 20th, 2009

I thought this one might help some people out there… Instead of scanning all processes, or getting special exports in ntdll.dll or similar ideas. There’s a two-lines code to do it. The trick is to get the desktop’s handle to window, which really belongs to CSRSS and then get its process.

DWORD pid, tid;
tid = GetWindowThreadProcessId(GetDesktopWindow(), &pid);

Also you get the thread id by product, and this code is compatible since 95. I guess it might be handy.

It’s Vexed :)

November 17th, 2009

In the last few week I’ve been working on diStorm to add the new instruction sets: AVX and FMA. You can find lots of information about them on the inet. In a brief, the big advantages, are support of 256 bit registers, called now YMM’s (their low halves are XMM’s) and also support for AES built in, you have a few instruction to do small block encryption and decryption, really sweet. I guess it will help some security companies out there to boost stuff. Also the main feature behind these instruction sets is the 3 registers operands. So now you are not stuck with 2 registers per instruction, you can have up to four sometimes. This is good because you have two source operands and a destination operand, which means it saves you other instructions (to move or backup registers) and you don’t have to ruin your dest-src operand like in the old sets. Almost forgot to mention FMA itself, which is fused multiply-add instructions, so you can do two operations at once, like A*B+C, etc.

I wanted to talk about the VEX prefix itself. It’s really a new design and approach to prefixes that was never seen before. And the title of this post says it all, it’s really annoying.
The VEX (Vector Extension) prefix, is a multi-byte prefix for a change. It can be either 2 or 3 bytes. If you take a look at the one byte opcodes map, all of them are taken. Intel was in the need of a new unused byte, which didn’t really exist. What they did instead, was to share two existing opcodes, for each prefix. The sharing works in a special way, that let them know if you meant to use the original instruction or the new prefix. The chosen instructions are LDS (0xc4) and LES (0xc5). When you examine the second byte of these instructions, the byte upon which the new information is extracted from, you can learn that the most significant two bits can’t be set together (I.E: the value of 0xc0 or higher). However, if they are set, the processor will raise an illegal instruction exception. This is where the VEX prefix enters into the game. Instead of raising an exception they will be decoded as this special multi-byte prefix. Note that in 64 bits mode, all those Load-Segment instruction are invalid, so there is no need for sharing the opcode. When you encounter 0xc4 or 0xc5, you know it’s a VEX prefix, as simple as that. Unfortunately this is not the case in 32 bits mode, and since the second byte has to be with a value higher than 0xc0 (because the two most significant bits have to be set in 32 bits), the field in these corresponding bits is inverted actually, which means you will have to extract a few bits that represent some fields and bitwise-not them. This is seriously gross, but it seems Intel didn’t have much of a choice here. If it were up to me, I would do the same eventually, for the sake of backward compatibility, but it doesn’t make it any prettier to be honest. And for your information, AMD pulled the same trick but with the POP instruction (0x8f) for their new instruction sets (XOP, etc), without full backward compatibility.

To make some order in the bits let’s have a look at the following figure:
vexprefix
Cited Intel
Well, I am not going to talk about all fields, (which somewhat are similar to REX for 64 bits) but just about one interesting feature. Since now the prefix is 2/3 bytes and usually an SSE instruction is at least 3 bytes, this will explode the code segment with huge instruction, and certainly gonna make the processor cry a lot to fetch instructions. The trick that was used by Intel (and AMD too) was to have a field that will imply which prefix byte to put virtually before the VEX prefix itself, so this way we saved one byte. And the same idea was used again to spare the 0x0f escape byte or even two bytes of 0x0f, 0x38 or 0x0f, 0x3a which are very common basic opcodes for SSE instructions. So if we had to use an SSE instruction, for instance:
66 0f 38 17 c0; PTEST XMM0, XMM0 – has first 3 bytes that can be implied in the VEX prefix, thus it stays the same size! Kawabanga

I talked in an earlier post that I am not going to support SSE5 as for now, in the hope it’s gonna die. I believe CPU (or instruction set architectures, to be accurate) wars are bad for the coders and even for end users who can’t really enjoy those great technologies eventually.

A BSWAP Issue

November 7th, 2009

The BSWAP instruction is very handy when you want to convert a big endian value to a little endian value and vice versa. Instead of reversing the bytes yourself with a few moves (or shifts, depends how you implement it), a single instruction will do it as simple as that. It personally reminds me something like the HTONS (and friends) in socket programming. The instruction supports 32 bits and 64 bits registers. It will swap (reverse) all bytes inside correspondingly. But it won’t work for 16 bits registers. The documentation says the result is undefined. Now WTF, what was the issue to support 16 bits registers, seriously? That’s a children’s game for Intel, but instead it’s documented to be undefined result. I really love those undefined results, NOT.
When decoding a stream in diStorm in 16 bits decoding mode, BSWAP still shows the registers as 32 bits. Which, I agree can be misleading and I should change it (next ver). Intel decided that this instruction won’t support 16 bits registers, and yet it will stay a legal instruction, rather than, say, raising an exception of undefined instruction, there are known cases already when a specific destination register can cause to an undefined instruction exception, like MOV CR5, EAX, etc. It’s true that it’s a bit different (because it’s not the register index but the decoding mode), but I guess it was easier for them to keep it defined and behave weird. When I think of it again, there are some instruction that don’t work in 64 bits, like LDS. Maybe it was before they wanted to break backward compatibility… So now I keep on getting emails to fix diStorm to support BSWAP for 16 bits registers, which is really dumb. And I have to admit that I don’t like this whole instruction in 16 bits, because it’s really confusing, and doesn’t give a true result. So what’s the point?

I was wondering whether those people who sent me emails regarding this issue were writing the code themselves or they were feeding diStorm with some existing code. The question is how come nobody saw it doesn’t work well for 16 bits? And what’s the deal with XCHG AH, AL, that’s an equal for BSWAP AX, which should work 100%. Actually I can’t remember whether compilers generate code that uses BSWAP, but I am sure I have seen the instruction being used with normal 32 bits registers, of course.