Assembly « Insanely Low-Level

Archive for the ‘Assembly’ Category

Testing The VM

Saturday, March 22nd, 2008

Since Imri and I started working on this huge project, we have finally reached a point where a few layers are ready: disassembler to structure output –> structure to expression trees –> VM to run those expressions. This is all for the x86 so far, but the code is supposed to be generic. That’s why the expression trees infrastructure has been there from the beginning: you can translate any machine code to this language, and the same VM will be able to run it. Sounds familiar in a way? Java, ahem ahem…

Imri has described some of our work in his last blog post, where you can find more information on what’s going on. Now that we have the VM somewhat working, my next step is to check all the layers, that is, to run a piece of code (currently assembly based) and see that I get the same results as when running it for real. The idea started out as inline assembly in Python, yes yes :) you heard me right, thanks to a friend who hasn’t published it yet. I changed the idea, though, so instead of inline assembly I have a function that takes a chunk of assembly text and runs it both on the VM and natively under Python. When both are done executing, I check the result in EAX and see whether they match… EAX can hold anything, such as the EFLAGS, so I can even see whether the flags of an operation like INC were calculated correctly in my VM…

The way I run the code natively in Python is by using ctypes. I wrote a wrapper for YASM, so now I can assemble a few lines of assembly, and the decomposer (the disassembler-to-expressions layer) is fed with the resulting machine code, which is then run in the VM. Then I use ctypes to run the same raw buffer of code: I create a WINFUNCTYPE from the address of the buffer holding the binary code, and executing it is as simple as calling that instance. It runs natively, inside Python’s own thread context.

The downsides are that I’m capped to 32 bits (since I’m on a 32-bit environment), whereas x86 can be 16, 32 or 64 bits, so I can’t really check 16 and 64 bits at the moment. Another disadvantage is that I can’t write just anywhere in memory; I need to allocate a buffer and feed my code a pointer to it, unlike in the VM, where I can read and write anywhere I wish (it doesn’t imitate a PC). And I’d better not raise any exceptions, whether intentionally or not, because I will blow up my own Python process. ;) So I have to be a good boy and do simple tests like: mov al, 0x7f; inc al, and then see whether the result is 0x80 and whether OF is set, for example.

It is really amazing to see the unit test function return true on such a test :) Along the way we find bugs in the upper layers that sit above diStorm (which is quite stable on its own), fix them immediately and rerun the unit tests, while I keep adding more unit tests and fixing more things. Eventually the goal is to take an inline C compiler, run the code and check the result. At this phase of the project I’m interested in checking specific instructions, to know that everything we build on works great; then we can move on to bigger blocks of code, doing Fibonacci and other stuff in C…
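To make the mov al, 0x7f; inc al test concrete, here is a minimal sketch of the values such a test would expect. It is not the project’s VM code; the function name inc8 and the flags dictionary are made up for illustration, and it only models the flags INC touches (CF is left alone by INC).

```python
def inc8(value):
    """Model x86 INC on an 8-bit register (hypothetical helper).

    Returns (result, flags). CF is absent on purpose: INC does not modify it.
    """
    result = (value + 1) & 0xFF
    flags = {
        "ZF": result == 0,                      # result is zero
        "SF": bool(result & 0x80),              # sign bit of the result
        "OF": value == 0x7F,                    # signed overflow: 0x7F + 1 -> 0x80
        "AF": (value & 0x0F) == 0x0F,           # carry out of bit 3
        "PF": bin(result).count("1") % 2 == 0,  # even number of set bits
    }
    return result, flags


if __name__ == "__main__":
    result, flags = inc8(0x7F)
    print(hex(result), flags)
```

A unit test in the spirit of the post would compare this kind of expected state against what both the VM and the native run leave in EAX.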

I am a bit annoyed about the way I run the code natively. I mean, I trust the code since I write it myself and there are only simple tests, but I can screw up the whole process by dividing by zero, for example. I have to run the tests on an x86 machine with a 32-bit environment, so I can’t really check different modes. The good thing is that I can really use the stack, as long as I leave it balanced when the test ends (restoring all regs, etc.)… I wish I could run the tests on a real v86 processor or something like that, but it wouldn’t be portable to Linux, etc… Even spawning another process and injecting the code into it would require separate code for Windows and for Linux. And even then, if the code is bogus, I can’t really control everything, and it would require more work per OS as well. So for now the infrastructure is pretty cool and gives me what I need. But if you have any idea how to do it better, let me know.

Posted in Assembly, diStorm | 1 Comment »

A Common Bug With LEA

Friday, March 7th, 2008

If we examine the LEA instruction from the text output point of view, we can say that it is similar to the MOV instruction, but one that uses memory indirection. Hence you get the square brackets that denote a memory address… Only when you interpret LEA, you should know to remove the fake memory indirection and treat the ‘address’ as an immediate value, or as an expression. The expressions can get as complex as eax*8+ebx+1234. In reality, in order to simplify matters, the type of LEA’s second (source) operand will probably be OP_TYPE_MEM, just like any other instruction that can only have a memory indirection there, for instance cmpxchg… So why shouldn’t we (disassemblers) have a special operand type for LEA? Well, mainly because it’s a headache to maintain two types that parse the memory indirection bytes when eventually they output the same stuff, so why not have only one type that LEA can piggyback on?

In diStorm, I used OP_TYPE_MEM for LEA’s operand. I didn’t want to have another special OP_TYPE_LEA, and I didn’t have any problem with it. Until today, when I thought of a ‘bug’. You can use a segment override to change the default segment an address will be read from or written to, for example: MOV EAX, [FS:EAX]. Normally the only segment overrides you will see are GS and FS, since DS and ES are usually set to the same value so that you can copy data from a source buffer to a destination buffer using the MOVS instruction, or compare two buffers using CMPS… And of course you cannot run without CS and SS… For that reason, in 64 bits they got rid of the CS, DS, ES and SS segment overrides.

So what’s the bug, you’re asking? LEA EAX, [FS:EAX] doesn’t really make any sense, huh? And why does it happen? Because I use the MEM type… So I had two options: either add a special operand type that filters the segment overrides for LEA, or just filter the segment overrides when the instruction being decoded is LEA. For simplicity I implemented the latter.
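The second option can be sketched in a few lines. This is only an illustration of the idea, not diStorm’s actual code; the function name format_mem_operand and its parameters are hypothetical.

```python
def format_mem_operand(mnemonic, segment, expr):
    """Format a memory operand as text (hypothetical helper).

    mnemonic: instruction name, e.g. "MOV" or "LEA"
    segment:  segment override as text ("FS", "GS", ...) or None
    expr:     the address expression, e.g. "EAX" or "EAX*8+EBX+0x1234"
    """
    # LEA never touches memory, so a segment override is meaningless for it
    # and is dropped from the output.
    if mnemonic.upper() == "LEA":
        segment = None
    prefix = f"{segment}:" if segment else ""
    return f"[{prefix}{expr}]"


if __name__ == "__main__":
    print("MOV EAX,", format_mem_operand("MOV", "FS", "EAX"))
    print("LEA EAX,", format_mem_operand("LEA", "FS", "EAX"))
```

The shared memory-operand parser stays untouched; only the one instruction that fakes its memory access gets the special case.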

Anyway, diStorm is already updated by now. The bug was found in other disassemblers as well, namely Olly and VS’s debugger. I didn’t try other disassemblers, but I guess more of them have this issue. Not to my surprise, IDA is immune.

The moral of the story? hmmm, don’t be a lazy jackass? No. Just don’t assume too much and try to think of the small details. :)

UPDATE:

Try to feed the assemblers with that buggy instruction and you will see that they erroneously generate it with the segment override prefix :)

Posted in Assembly | No Comments »

Fooling Around With LEA

Wednesday, March 5th, 2008

Yesterday it hit me. I just realized something so funny that I had to post it right here. I have been using LEA for years now, and so have you, I guess. Most of the time LEA is used to load the address of a local variable in a function, for example:

void f()
{
    int x;
    g(&x);
}

The parameter &x for calling g will use LEA to load the address of x and pass it to g so g can change it inside. But this is nothing new.

You can write something like this:
LEA EAX, [0x12345678]
And you know what?
EAX will be now 0x12345678
This is somewhat trivial when you get to think about it, but when do you??
I wonder how good it is as anti-disassembler stuff. I think it will drive the disassembler a bit crazy, so it’s worth a test… because now, instead of loading immediates with MOV, you can use LEA…
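For reference, here is a small sketch that builds the machine code of both forms so the bytes can be compared. It assumes 32-bit mode, and the helper names are made up for this example: MOV EAX, imm32 is opcode B8 followed by the immediate, while LEA EAX, [disp32] is opcode 8D with a ModRM byte of 05 (mod 00, reg EAX, r/m 101 meaning displacement only).

```python
import struct


def encode_mov_eax_imm32(value):
    """MOV EAX, imm32 in 32-bit mode: B8 id."""
    return b"\xB8" + struct.pack("<I", value)


def encode_lea_eax_disp32(value):
    """LEA EAX, [disp32] in 32-bit mode: 8D /r with ModRM 05."""
    return b"\x8D\x05" + struct.pack("<I", value)


if __name__ == "__main__":
    print(encode_mov_eax_imm32(0x12345678).hex())
    print(encode_lea_eax_disp32(0x12345678).hex())
```

The LEA form costs one extra byte, so it is a curiosity rather than a size optimization.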

Posted in Assembly | 7 Comments »

Converting An Integer To Decimal – Assembly Style

Wednesday, February 13th, 2008

I know this is one of the most trivial things to implement in a high level language. In C it goes like this:

#include <stdio.h>

void print_integer(unsigned int x)
{
    /* print the higher digits first, then the lowest one */
    if ((x / 10) != 0) print_integer(x / 10);
    putchar((x % 10) + '0');
}

The annoying thing is that you have to div twice and do a modulo once, which in reality can be merged into a single x86 instruction. Another thing is that if you want to print a normal result for an input of 0, you have to test the result of the division instead of simply checking x itself. The conversion is done from the least significant digit to the most significant, but when we display the result (or put it in a buffer) we have to reverse it. That’s why recursion is so handy here. This is my go at it in 16 bits; it’s code that I wrote a few years ago, and I just decided I should put it here for reference. I have to admit that I have used this same code for 32 bits and even for different processors, since it’s so elegant, works so well and is easy to port. But I leave it for you to judge ;)

bits 16
    mov ax, 12345
    call print_integer
    xor ax, ax
    call print_integer
    ret

print_integer:
    ; base 10
    push byte 10
    pop bx
.next:
    ; divide the 32-bit value dx:ax by bx: quotient in ax, remainder in dx
    xor dx, dx
    div bx
    push dx          ; push remainder
    or ax, ax        ; if the quotient is 0, we will stop recursing
    jz .stop
    call .next       ; now this is the coolest twist ever, the IP that is pushed onto the stack…
.stop:
    pop ax           ; get the remainder (in reversed order)
    add al, 0x30     ; convert it to a character
    int 0x29         ; use (what used to be an undocumented) interrupt to print al
    ret              ; go back to '.stop' and read the next digit…

I urge you to compile the C code with full optimization and compare the codes for yourself.
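For readers who don’t want to trace the stack by hand, here is a rough Python rendering of the same idea: one combined divide per digit (divmod plays the role of DIV, which yields quotient and remainder together), with the recursion taking care of the reversal. The function name to_decimal is made up for this sketch, and it returns a string instead of printing through int 0x29.

```python
def to_decimal(x):
    """Convert a non-negative integer to its decimal string.

    Mirrors the assembly: a single division per digit, recursing while
    the quotient is non-zero so the digits come out most significant first.
    """
    quotient, remainder = divmod(x, 10)   # one DIV gives both values
    higher = to_decimal(quotient) if quotient else ""
    return higher + chr(remainder + 0x30)  # add 0x30 to get the ASCII digit


if __name__ == "__main__":
    print(to_decimal(12345))
    print(to_decimal(0))
```

Testing the quotient rather than x itself is what makes an input of 0 come out as "0".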

Posted in Algorithms, Assembly | 4 Comments »

Crunching 3 More Bytes

Tuesday, February 12th, 2008

I already uploaded the new version, whose size is 213 bytes. Download here.

The trick I used this time is not popular at all, and I doubt any of you know it. A friend showed me this trick a year ago, and it only occurred to me last night to use it in Tiny PE itself.

C:\Documents and Settings\arkon>ping ragestorm.net
Pinging ragestorm.net [63.247.129.107] with 32 bytes of data:
Ctrl^c
python
>>> 107+129*256+247*256**2+63*256**3
1073185131
>>> len(str(_))
10
>>> len("ragestorm.net")
13
13
>>>

I guess you still didn’t get it :) So check this link out:
http://1073185131
The IP is converted to a decimal integer, which the Net API knows how to parse just fine…
Whooooo’s the biatch now?! http://0x3ff7816b/ also works… but it’s still 10 bytes. heh
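The same conversion can be done with Python’s standard ipaddress module instead of multiplying by hand. This is just a convenience sketch; the helper name ip_to_int is made up.

```python
import ipaddress


def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 address to a single integer."""
    return int(ipaddress.IPv4Address(dotted))


if __name__ == "__main__":
    n = ip_to_int("63.247.129.107")
    print(n, hex(n))
    # and back again
    print(ipaddress.IPv4Address(n))
```

Both the decimal and the hexadecimal spellings are 10 characters long, three fewer than the 13 of the host name.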

BTW – I updated diSlib64 to be able to parse Tiny PE style files well…It’s the only thing that parses them from all the PE tools I got. Though it still doesn’t parse the forwarded export table. Will fix that in next update ;)

Posted in Assembly, Optimization | 4 Comments »

Nop nop nop

Saturday, February 9th, 2008

Two posts ago I talked about NOP in 64 bits. It was unclear whether 0x90 acts as a true NOP. Because if you really think about it, the way a 64-bit processor works when it executes a 32-bit operation would be to do the exchange between EAX and EAX, and then zero the high dword of the touched register. This is why XOR EAX, EAX is similar to XOR RAX, RAX… Only in 64 bits!

In that post some guy (or girl?) wrote in a comment that exchanging two registers using 0x90 is possible when you use a REX prefix. The REX prefix is like the 0x66 of the new 64-bit code. It lets you access more registers and indicate that the instruction should run with an operand size of 64 bits rather than 32 bits (which is the default, hence 0x90 in 64 bits is supposedly xchg eax, eax).

The documentation is not readable in shit. But that might be cause I’m the shock here… But let’s put it this way, after I wrote a disassembler and read the docs so many times regarding instructions etc – then if I don’t understand it, who should? FFS

Talking with Stefan and Peter (the guy behind YASM), we got the issue cleared up on both processors (Intel/AMD). I just want to state first that Stefan came up with the whole problem, and then Peter did some tests on his end too. So thank you both :) We now know that XCHG R8, RAX can be encoded in two ways, but we will focus on the one that uses the 0x90 byte. The byte code for that instruction is: 0x49 0x90. 0x49 for Width (64-bit operand size) and Base (access R8), and 0x90 for the XCHG. Together it really works! In the same way, 0x41 0x90 works as XCHG R8D, EAX (clearing the high dwords though…).

diStorm treated this erroneously by giving an output of:
DB 0x49
NOP

to indicate that the 0x49 prefix wasn’t used, so it was (what I call) DB’ed. So I had to change the behavior of this one, and it wasn’t so trivial, because 0x90 can’t just say “Hey, I’m NOP from now on and always was”. Now it’s up to the prefixes to decide whether 0x90 is XCHG or NOP, that is, runtime detection. The static DB of instructions can’t help here. In addition, since we are mentioning prefixes of 0x90, don’t forget that 0xf3 0x90 is PAUSE, which is a completely different instruction…
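The decision described above can be sketched as a tiny function. This is not diStorm’s implementation, just an illustration for 64-bit mode that looks only at a REX prefix and at 0xf3; other prefixes (such as 0x66) are ignored, and the combination of 0xf3 with REX.B is deliberately left unhandled because the post doesn’t cover it. The name decode_0x90 is made up.

```python
def decode_0x90(prefixes):
    """Decide what opcode 0x90 means in 64-bit mode (illustrative only).

    prefixes: the bytes that precede 0x90, e.g. b"\\x49".
    """
    rex = next((p for p in prefixes if 0x40 <= p <= 0x4F), None)
    rex_b = rex is not None and bool(rex & 0x01)  # Base: selects R8
    rex_w = rex is not None and bool(rex & 0x08)  # Width: 64-bit operands

    if 0xF3 in prefixes:
        if rex_b:
            raise NotImplementedError("F3 combined with REX.B is not modeled")
        return "PAUSE"
    if rex_b:
        return "XCHG R8, RAX" if rex_w else "XCHG R8D, EAX"
    # Without REX.B both registers are (E/R)AX, so nothing is exchanged.
    return "NOP"


if __name__ == "__main__":
    for p in (b"", b"\x49", b"\x41", b"\x48", b"\xf3"):
        print(p.hex() or "--", "90 ->", decode_0x90(p))
```

The point is that the meaning of 0x90 can only be settled after the prefixes are known, which is why a static instruction table alone can’t decide it.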

Posted in Assembly | No Comments »

Last Version Hopefully

Saturday, February 9th, 2008

of Tiny PE that is. 216 bytes – over here.

Some juicy details of the new tricks:

1. I had a gap of 4 bytes between two code chunks. 4 bytes is an immediate of 32 bits. Instead of jumping over these 4 bytes (which are a pointer to the import data directory), I just skip over them using something stupid like MOV EDX, <DWORD>, where <DWORD> is a placeholder for the pointer. This way I destroy the value in EDX, but I chose EDX because I don’t use it anyway. I end up saving a byte this way, because a jmp is 2 bytes long and the opcode for the mov is 1 byte.

2. Another problem was that the opcode of the mov was also part of the export data directory size, which poses a limitation on the values I could put there. To make a long story short, I changed the base address of my image so that when the size field of the directory (the byte code for the mov is the high byte of the size DWORD field) is added to the new base address, there is no overflow and the exports continue to work well…

3. Immediately following the ‘Number Of Sections’ field I had the decryption loop code. In the beginning I used XOR for simplicity, but now it’s changed to ADD R/M8, REG8. This is because this field is two bytes (a WORD) with the value of 1, meaning the image contains only one section, so its second (high) byte is 0. The ADD byte code is 0. Do the math! So now I reuse this 0 byte as code for my decryption loop. Cool as it is, I had to change the code that encrypts the data in the image. XORing is really easy to do, but now my instruction looks something like this: add byte [ebx+ecx+off], bl. Mind you, I know the value in BL statically. Therefore I had to change the algorithm. Let the plain-text byte be P and the value of BL be B; we get the following: Cipher = (P+256-B) & 255 for each byte we want to encrypt. The trick here is the modulo base: since we work in byte units (results are eventually truncated to 8 bits), we have to treat the add as modular. Decryption is actually: P = (Cipher + B) % 256. Anyway, this way I saved one byte and it was good practice on modulo stuff… ;)

4. Another byte was shaved by changing the URL of the image I download and execute by changing the URL from:

http ://ragestorm.net/f.DLL
to
http ://ragestorm.net/kernel32.DLL

Naughty? I need the kernel32.DLL anyway. So this way I can remove the ‘f’ and put the “kernel32” instead. So yes, now you’re downloading a file called ‘kernel32.DLL’ from my site. Though it’s being saved as something else locally… Ahh BTW-this way the image should be Win2000 compliant, but I can’t check it out. Any volunteers?
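The modulo arithmetic from item 3 is easy to check in a few lines. This sketch applies the two formulas from the post to a whole buffer; the function names and the sample key value are made up for illustration and are not taken from the Tiny PE source.

```python
def encrypt(plain, b):
    """Cipher = (P + 256 - B) & 255 for every plain-text byte P."""
    return bytes((p + 256 - b) & 255 for p in plain)


def decrypt(cipher, b):
    """P = (Cipher + B) % 256, which is what ADD byte [mem], BL computes."""
    return bytes((c + b) % 256 for c in cipher)


if __name__ == "__main__":
    key = 0x5A  # stand-in for the statically known value of BL
    secret = encrypt(b"kernel32.DLL", key)
    print(secret.hex())
    print(decrypt(secret, key))
```

Adding 256 before subtracting keeps the intermediate value non-negative, and the mask truncates it back to a byte, exactly like the 8-bit register would.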

I wanted to release the code a month ago, but since then I have changed the code so many times. I think this is my final version for now :grins:. So I will prettify the code a bit, document some stuff and release it.

Posted in Assembly, Portable Executable | 2 Comments »

Weird Stuff

Monday, January 28th, 2008

As I am still working very hard on TinyPE NG, I have it at 220 bytes at the moment. I am not frustrated yet, and I think I will be able to shave off a few more bytes. Since my last post I have been talking with Peter Ferrie on the Code Crunchers mailing list, which you’re invited to sign up for right here. Peter suggested that I drop WinExec: instead of executing the downloaded file, which was an .exe by then, I should download a .dll file and LoadLibrary it. Thing was, I didn’t use LoadLibrary; that was one of the tricks in the new version. Eventually I removed lots of code (18 bytes so far!) and managed to download the .dll and load it using export forwarding, but this time on the downloaded file! It even spared the ExitProcess trick (one byte…) that I came up with together with Matthew Murphey in the last challenge. I don’t need to call ExitProcess, since the dll is now loaded into the same process, and ExitProcess in the dll itself will do the job…

My only problem was that my server didn’t let me download any file with an extension of ‘.dll’. I got freaked out and didn’t understand why the damned thing wouldn’t let me download it. I tried removing the access list in .htaccess and playing with it, but nothing helped. I almost gave up on the whole idea, until at the last moment I thought that since my server is Linux based (so why does it care about dll files in the first place?) I could call the file “.DLL”, notice the capital letters. The loader doesn’t really care about upper or lower case, so everything went OK from then on…

On to a different matter now. A friend (who contributed to diStorm in the past) keeps on using it heavily himself and found something interesting. He was trying to exchange two registers, eax and r8d (xchg eax, r8d). That would be something with the REX prefix (specifically 41) followed by 90. The thing was that no matter what you do (that is, prefix 90 with any byte), it won’t change its behavior. It’s as if 90 is really hardwired to do nothing (no-operation). Ahh sorry, for those who were following me, 90 is xchg eax, eax, which is used to denote a NOP instruction. So imagine you want to exchange two registers and the assembler generates 41 90: nothing happens when you run it. Quite absurd. So it has to be changed into the 2-byte form of the exchange instruction… The cool thing about this whole story is that diStorm showed the output correctly: DB 0x41; NOP. Now to be honest, I never gave it a thought when I ported diStorm to support the 64-bit instructions. But it so happens that 0x90 is really being turned into NOP rather than xchg eax, eax, so the prefix is useless and thus dropped… Anyway, a nice finding, Stefan!

Oh yeah, well I was not saying the whole truth, there is a prefix for the NOP instruction, 0xf3. Together with 0x90, it becomes a PAUSE instruction…

Posted in Assembly | 3 Comments »

TinyPE NG is Out

Thursday, January 17th, 2008

Here you go guys:
http://ragestorm.net/tiny/tinypeng.exe

Source will be released within a couple of weeks.
Have fun :)
Meanwhile I will be in Turkey for the weekend to relax and leave the bits behind.

Kix$

Posted in Assembly, Portable Executable, Win32 | 4 Comments »

TinyPE NG

Tuesday, January 15th, 2008

I rewrote TinyPE and just got to 240 bytes!!!!!1!!11!!11!!1! * 9**9**9

Downloading a file from the .net (specifically my own site) and running it, while the strings in the .exe must be encrypted in someway. You can find more information by googling the TinyPE Challenge.

Holy shit, I’m kinda excited myself, it was my goal and I have just reached it after a few days of hard work. Now I gotta shave one more byte and then I’m done :)

I will release the source once I am finished. I do new stuff, and again no tools can read my .exe file, not even my own diSlib64… bummer.

Stay Tuned :) blee

Posted in Assembly, Optimization, Portable Executable | 2 Comments »
