Archive for February, 2008

Embedded Python And raw_input

Monday, February 25th, 2008

If you ever messed with embedded Python you would know that you can’t use the Win32 Console so easily. For that you will have to hack around a few bits in order to bind the stdin and stdout with the Win32 ones. Another familiar technique is to override sys.stdout with your own class that implements the ‘write’  method. So every time something goes to output this method is really called with an argument which is then being printed on the screen with whatever Win32 API you want, to name WriteFile

This same technique works also for stderr which is used when one writes some invalid text which doesn’t compile and an exception is thrown.

‘Hooking’ both stdout and stderr we still got stdin out of the loop. Now all these years that embedded Python was around I have never encountered the problem that the stdin doesn’t work with raw_input(). Don’t be confused, the way things work in embedded Python, it doesn’t mean that the interactive interpreter is fed from stdin, it’s a bit more messy. In all examples of embedded Python you have to implement your own interactive interpreter on your own, therefore you bypass stdin from the first moment. Hence if you never used raw_input you will not know if it’s broken or not. So I want to thank a special friend for reporting this bug to me… ;) J

The fix… To fix the problem I tried to replace sys.stdin with my own implementation of a class with a ‘read’ method (by guessing-because ‘write’ is stdout’s). Doing so I found out that the method name should be ‘readline’ according to raw_input’s shouts at me. Then I renamed the method and the implementation was to call a function that I wrote in C and which is wrapped in Python that reads input from the console (Using ReadFile(GetStdHandle(STD_INPUT_HANDLE),…) and returns that input string rawly. That did the trick and now raw_input works again.

I am pretty sure that all embedded Python users out there will need this hack if they already hack stdout…

Converting An Integer To Decimal – Assembly Style

Wednesday, February 13th, 2008

I know this is one of the most trivial things to implement in a high level language. In C it goes like this:

void print_integer(unsigned int x)
{
if ((x / 10) != 0) pi(x / 10);
putch((x % 10) + ‘0’);
}

The annoying thing is that you have to div twice and do a modulo once. Which in reality can be merged into a single X86 instruction. Another thing is that if you want to be able to print a normal result for an input of 0 you will have to test the result of the division instead of checking simply x itself. The conversion is done from the least significant digit to the most. But when we display the result (or put it in a buffer) we have to reverse it. Therefore the recursion is so handy here. This is my go in 16bit, it’s a code that I wrote a few years ago, and I just decided I should put it here for a reference. I have to admit that I happened to use this same code for also 32bits or even different processors and since it’s so elegant it works so well and easy to port. But I leave it for you to judge ;)

bits 16
mov ax, 12345
call print_integer
xor ax, ax
call print_integer
ret

print_integer:
; base 10
push byte 10
pop bx
.next:; make a 32 bits division, remainder in dx, quotient in ax
xor dx, dx
div bx
push dx ; push remainder
or ax, ax ; if the result if 0, we will stop recursing
jz .stop
call .next ; now this is the coolest twist ever, the IP that is pushed onto the stack…
.stop:
pop ax ; get the remainder (in reversed order)
add al, 0x30 ; convert it to a character
int 0x29 ; use (what used to be an undocumented) interrupt to print al
ret ; go back to ‘stop’ and read the next digit…

I urge you to compile the C code with full optimization and compare the codes for yourself.

Crunching 3 More Bytes

Tuesday, February 12th, 2008

I already uploaded the new version, whose size is 213 bytes. Download here.

The trick I used this time is very not popular and I doubt if any of you know it at all. A friend showed me this trick a year a go and it only occurred to me yester night to use it in Tiny PE itself.

C:\Documents and Settings\arkon>ping ragestorm.net
Pinging ragestorm.net [63.247.129.107] with 32 bytes of data:
Ctrl^c
python
>>> 107+129*256+247*256**2+63*256**3
1073185131
>>> len(str(_))
10
>>> len(“ragestorm.net”)
13
>>>

I guess you still didn’t get it :) So check this link out:
http://1073185131
The IP is converted to a decimal integer which the Net API knows and parses it well…
Whooooo’s the biatch now?! http://0x3ff7816b/ also works… but it’s still 10 bytes. heh

 BTW – I updated diSlib64 to be able to parse Tiny PE style files well…It’s the only thing that parses them from all the PE tools I got. Though it still doesn’t parse the forwarded export table. Will fix that in next update ;)

Nop nop nop

Saturday, February 9th, 2008

Two posts ago I talked about NOP in 64bits. It was unclear whether 0x90 acts as a true NOP. Because if you really think about it, the way 64bits processors work when it executes 32bits operations is to do the exchange between EAX and EAX, and then zero the high dword of touched registered. This is why XOR EAX, EAX is similar to XOR RAX, RAX… Only in 64bits!

So in that posts some guy (or girl?) wrote in a comment that exchanging two registers using 0x90 is possible when you use a REX prefix. REX prefix is like the 0x66 of the new 64bits code. It lets you access more registers and indicate the instruction may run with operation size of 64 bits rather than 32 bits (which is the default, hence 0x90 in 64bits is supposedly xchg eax, eax).

The documentation is not readable in shit. But that might be cause I’m the shock here… But let’s put it this way, after I wrote a disassembler and read the docs so many times regarding instructions etc – then if I don’t understand it, who should? FFS

Talking with Stefan And Peter (the guy behind YASM) we got the issue cleared on both processors (Intel/AMD). I just want to state before that Stefan came up with the whole problem and then Peter did some tests too on his end. So thank you both :) we now know that XCHG R8, RAX can be decoded in two ways, but we will focus on the one with the byte code of 0x90. The byte code for that instruction is: 0x49 0x90. 49 for Width (64bits operand size) and Base (access R8) and 0x90 for the XCHG. Together it really works! Same as that 0x41 0x90 works as XCHG R8D, EAX (clearing high dword though…).

diStorm treated this errornously by giving an output of:
DB 0x49
NOP

to indicate that the 0x49 prefix wasn’t used it was (what I call) DB’ed. So I had to change the behavior of this one and it wasn’t so trivial because 0x90 can’t just say “Hey, I’m NOP from now on and always was”. Now it’s up to the prefixes to decide whether 0x90 is XCHG or NOP – runtime detection. The static DB of instructions can’t help it. In addition, don’t forget 0xf3 0x90 is PAUSE which is a completely different instruction for the sake of example when mentioning prefixes of 0x90…

Last Version Hopefully

Saturday, February 9th, 2008

of Tiny PE that is. 216 bytes – over here.

 Some juicy details of the new tricks:

1. I had a gap of 4 bytes between two code chunks. 4 bytes is an immediate of 32 bits. Instead of jumping over this 4 bytes (which is a pointer to import data directory), I just skip over them using something stupid like MOV EDX, <DWORD>. Where <DWORD> is a place holder for the pointer. This way I destroy the value in EDX, but I chose EDX because I don’t use it anyway. I end up sparing a byte this way because jmp is 2 bytes long, and opcode for the mov is 1 byte.

2. Another problem was that the opcode of the mov was the same data as part of the export data directory size which poses a limitation on the values I could put there. Make long story short – I changed the base address of my image and adding size field of the directory size (the byte code for the mov is the high byte in the size DWORD field)  with the new base address there is no overflow and the exports continue to work well…

3. Immediately following the ‘Number Of Sections’ field I had the decryption loop code. In the beginning I used XOR for simplicity but now it’s changed to ADD R/M8, REG8. This is because this field is two bytes (a WORD) with the value of 1, means the image contains only one section and the second high byte is 0. The ADD byte code is 0. Do you make the math? So now I reuse this 0 byte to be a code for my decryption loop. Cool as it is, I had to change the code which encrypts the data in the image. XORing is really easy to do, but now my instruction looks something like this: add byte [ebx+ecx+off], bl. Mind you I know the value in BL in static time. Therefore I had to change the algorithm. Let plain-text-byte be P and value-of-BL be B, we got the following: Cipher = (P+256-B) & 255 for each byte we want to encrypt. The trick here is the modulo base, since we work with byte units (and it seems they are eventually truncated to 8 bits), we have to consider the add with the modulo. Decryption is actually: P = (Cipher + B) % 256. Anyway this way I saved one byte and it was a good practice on modulo stuff… ;)

4. Another byte was shaved by changing the URL of the image I download and execute by changing the URL from:

http ://ragestorm.net/f.DLL
to
http ://ragestorm.net/kernel32.DLL

Naughty? I need the kernel32.DLL anyway. So this way I can remove the ‘f’ and put the “kernel32” instead. So yes, now you’re downloading a file called ‘kernel32.DLL’ from my site. Though it’s being saved as something else locally… Ahh BTW-this way the image should be Win2000 compliant, but I can’t check it out. Any volunteers?

I wanted to release the code a month ago, but since then I have changed the code so many times. I think this is my final version for now :grins:. So I will pretty the code a bit and document some stuff and release it.