Archive for the ‘ReviveR’ Category

Unsigned in Java

Tuesday, October 26th, 2010

Hello everyone,

as you may know we (ReviveR team) chose to use Java as the main language for the framework and maybe the UI too (they are totally separated for now).
We started by converting diStorm to Java using JNI, and converting diSlib64, my robust PE file parser in Python.

While we were doing the conversion we found out that there is no unsigned keyword in Java. Yes, I gotta admit, we are noobs in Java, but we are professional coders using other languages, as a matter of fact. On the other side, everyone knows it’s all about new syntax and the benefits of the language itself that once you’re used to, then you rock with the language. So syntax is easy. And it’s gonna take a short while ’till we learn the benefits of Java. After all we chose Java because it is widely cross platform, and C# is ten times better, I can just claim it, not going to prove it. In this post, however, I’m going to talk about the disadvantages of Java, to name one, unsigned numbers.

When you parse a PE+ file (that’s for AMD64), you need to read some 64 bit integers. Therefore we needed a way to hold an address in 64 bits, usually addresses are unsigned, in contrast to RVAs. The problem was that there is no way to define an unsigned long in Java. This is a really unpleasant welcome to Java, seriously. Wtf did the designer think? And I looked for his stupid comment, it read something like: “ahh, most coders don’t need ‘unsigned’, it only complicates stuff”. What a douche. Now, this is a denial to reality. Looking for other alternatives on the net I found that most people use a bigger size for their integer, suppose they need an unsigned 8 bits integer, then they will use the next bigger size that could hold such an integer as unsigned, which is short… This is so lame, you can even get unsigned 32 bit integers, by using longs, right? But what about using unsigned 64 bit integers? No bigger size, no way.
Others say, you can use BigIntegers, the moment I heard about that I wanted to cry out loud. My guess is that the implementation is a bit vector. So using BigInteger only for representing unsigned longs, that’s useless, if you ask me. Oh and I almost forgot to mention that it accepts the byte[] in big endian only, blee.

I really got pissed off, there were moments I wanted to go back to C++. Although I knew that I’m going to waste time on auto-pointers, data structures, and shit like that, but C++ has unsigned. How cool is that.

I consulted with a friend and he referred me to this link: Unsigned arithmetic in Java.
That seemed a bit helpful, and I liked the general idea. I think there are errors in the code snippets (didn’t check them though). Anyway, the guy suggests to use an “isLessThanUnsigned” comparison, I didn’t want to limit my unsigned long’s interface in such a way.

Therefore I took a look at the interface of BigInteger, saw that they use a compareTo method, and did the same on a new class I wrote, named ULong. The class can accept, byte array, bytebuffer, longs, and also as big endian if necessary.

The compareTo was written from scratch:

public long compareTo(ULong rh)
{
	// If both numbers have the same sign, it's up to their real values.
	if (((mValue ^ rh.mValue) >> 63) == 0) return mValue - rh.mValue;
	// Here they have different signs, if mValue has the MSB set, it's negative _in Java_, thus bigger.
	if (mValue < 0) return 1;
	// Else, the rh.mValue is bigger.
	return -1;
}

Very basic arithmetic operations, and it's pretty quick relative to BigInteger's, mine is 8.5 times faster, on my machine...
The point is that I couldn't accept all the extra stuff it needed to do in order to represent an unsigned long. It bugged me. I'm not going to stop and take my time again (hopefully) on issues like this, but since I don't know Java this well, I was curious to see how things work.

Another issue that I didn't like is that you cannot define global functions (or am I wrong here?), everything has to be in classes, this is annoying sometimes, but I guess the rational was to force a kind of 'namespaces', so it's fine eventually - but let me decide what to do, I know what I'm doing.

Last one, the separation to files based on public classes, it really forces one to divide all his classes into lots of files. Or dump them one after the other as inner classes. And then if you have a third inner enum, for instance, the compiler shouts at you that the outer class has to be static, etc. Consequently, it forces you to move it out, and then you find yourself dividing your code again, and now it's out of context of the class you wanted to put it in...

Oh dear Java, a love begins :(

P.S - I think that the beauty is that I know to use high level languages when I have to, with all due respect to me and low level.

diStorm for Java, JNI

Monday, October 4th, 2010

Since we decided to use Java for the reverse engineering studio, ReviveR, I had to wrap diStorm for Java. For now we decided that the core framework is going to be written in Java, and probably the UI too, although we haven’t concluded that yet. Anyway, now we are thinking about the design of the whole system, and I’m so excited about how things start to look. I will save a whole post to tell you about the design once it’s ready.

I wanted to talk a bit about the JNI, that’s the Java Native Interface. Since diStorm is written in C, I had to use JNI to use it inside Java now. It might remind P/Invoke to people, or Python extensions, etc.

The first thing I had to do is to define the same C structures of diStorm’s API, but in Java. And this time they are classes, encapsulated obviously. After I had this classes ready, and stupid Java, I had to put each public class in a separate file… Eventually I had like 10 files for all definitions and then next step was to compile the whole thing and use the javah tool to get the definitions for the native functions. I didn’t like the way it worked, for instance, any time you rename the package name, add/remove a package the name of the exported C function, of the native .DLL file, changes as well, big lose.
Once I decided on the names of the packages and classes finally I could move on to implement the native C functions that correspond to the native definitions in the Java class that I wrote earlier. If you’re familiar a bit with JNI, you probably know well jobject and its friends. And because I use classes rather than a few primitive type arguments, I had to work hard to wrap them, not mentioning arrays of the instructions I want to return to the caller.

The Java definition looks as such:

public static native void Decompose(CodeInfo ci, DecomposedResult dr);

The corresponding C function looks as such:

JNIEXPORT void JNICALL Java_distorm3_Distorm3_Decompose
  (JNIEnv *env, jobject thiz, jobject jciObj, jobject jdrObj);

Since the method is static, there’s no use for the thiz (equivalent of class’s this) argument. And then the two objects of input and output.
Now, the way we treat the jobjects is dependent on the classes we set in Java. I separated them in such a way that one class, CodeInfo, is used for the input of the disassembler. And the other class, DecomposedResult, is used for output, this one would contain an array to return the instructions that were disassembled.

Since we are now messing with arrays, we don’t need to use another out-argument to indicate the number of entries we returned in the array, right? Because now we can use something like array.length… As opposed to C function: void f(int a[], int n). So I found myself having to change the classes a bit to take this into account, nothing’s special though. Just need to get advantages of high level languages.

Moving on, we have to access the fields of the classes, this is where I got really irritated by the way the JNI works. I wish it were as easy as cTypes for Python, of course they are not parallel exactly, but they solve the same problem after all. Or a different approach like parsing a tuple in Embedded Python, PyArg_ParseTuple, which eases this process so much.

For each field in the class, you need to know both its type and its Id. The type is something you know at compile time, it’s predefined and simply depends on the way you defined your classes in Java, easy. The ugly part now begins, Ids – You have to know to which field you want to access, either for read or write access. The idea behind those Ids was to make the code more flexible, in the way that if you inherit a class, then the field you want to access probably moved to a new slot in the raw structure that contains it.
Think of something like this:

struct A {
int x, y;
};

struct B {
 int z, color;
};

struct AB {
 A;
 B;
};

Suddenly, accessing to AB::B.z has a different index than accessing to B.z. Can you see that?
So they guys who designed JNI came with the idea of querying the class, by using internal reflection to get this Id (or really an index to the variable in the struct, I take a guess). But this reflection thingy is really slow, obviously you need to do string comparisons on all members of the class, and all classes in the derived class… No good. So you might say, “but wait a sec, the class’s hierarchy is not going to change in the lifetime of the application, so why not reuse its value?”. This is where the JNI documentation talks about caching-ids. Now seriously, why don’t you guys do it for us internally, why I need to implement my own caching. Don’t give me this ‘more-control’ bullshit. I don’t want control, I want to access the field as quickly as possible and get on to other tasks.

Well, since the facts are different, and we have to do things the way we do, now we have to cache the stupid Ids for good. While I read how people did it and why they got mysterious crashes, I solved the problem quickly, but I want to elaborate on it.

In order to cache the Ids of the fields you want to have access to, you do the following:

if (g_ID_CodeOffset == NULL) {
    g_ID_CodeOffset = (*env)->GetFieldID(env, jObj, "mCodeOffset", "J");
    // Assume the field exists, otherwise your interfaces are broken anyway.
}
// Now we can use it...

Great right? Well, not so fast. The problem is that if you have a few functions that each accesses this same class and its members, you will need to have this few lines of code everywhere for each use. No go. Therefore the common solution is to have another native static InitIDs function and invoke it right after loading the native library in your Java code, for instance:

static {
	System.loadLibrary("distorm3");
	InitIDs();
}

Another option would be to use the JNI_OnLoad exported function to initialize all global Ids before the rest of the functions get ever called. I like that option more than the InitIDs, which is less artificial in my opinion.

Once we got the Id ready we can use it, for instance:

codeOffset = (*env)->GetLongField(env, jciObj, g_ID_CodeOffset);

Note that I use the C interface of the JNI API, just so you are aware to it. And jciObj is the argument we got from Java calling us in the Decompose function.

When calling the GetField function we have to pass a jclass, that’s a Java-class object’s pointer kinda. In contrast to the class instance, I hope you know the difference. Now since we cache the Ids for the rest of the application life time, we have to keep a reference to this Java-class, otherwise weird problems and nice crashes should (un)surprise you. This is crucial since we use the same Ids for the same classes along the code. So when we call the GetFieldID we should hold a reference to that class, by calling:

(*env)->NewWeakGlobalRef(env, jCls);

Note that jCls was retrieved using:

jCls = (*env)->FindClass(env, "distorm3/CodeInfo");

Of course, don’t forget to remove the reference to those classes you used in your code, by calling DeleteGlobalRef in JNI_OnUnload to avoid leaks…

The FindClass function is very good once you know how to use it. It took me a while to figure out the syntax and naming convention. For example, the String which seems to be a primitive type in Java, is really not, it’s just a normal class, therefore you will have to use “java/lang/String” if you want to access a string member.
Suppose you got a class “CodeInfo” in the “distorm3” package, then “distorm3/CodeInfo” is the package-name/class-name.
Suppose you got an inner class (inside another class), then “distorm3/Outer$Inner” is the package-name/outer-class-name$inner-class-name.
And probably there are a bit more to it, but that’s a good start.

About returning new objects to the caller. We said already that we don’t use out-arguments in Java.
Think of:

void f(int *n)
{
 *n = 5;
}

That’s exactly what an out-argument is, to return some value rather than using the return keyword…
When you want to return lots of info, it’s not a good idea, you will have to pass lots of arguments as well, pretty ugly.
The idea is to pass a structure/class that will hold this information, and even have some encapsulation to it.
The problem at hand is whether to use a constructor of the class, or just create the object and set each of its values manually.
Also, I wonder which method is faster, letting the JVM do it on its own in a constructor, or doing everything using JNI.
Unfortunately I don’t have an answer to this question. I can only say that I used the latter method of creating the raw object and setting its fields. I thought it would be better.
It looks like this:

jobject jOperand = (*env)->AllocObject(env, g_OperandIds.jCls);
if (jOperand == NULL) // Handle error!
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Type, insts[i].ops[j].type);
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Index, insts[i].ops[j].index);
(*env)->SetIntField(env, jOperand, g_OperandIds.ID_Size, insts[i].ops[j].size);

(*env)->SetObjectArrayElement(env, jOperands, j, jOperand);

This is real piece of code taken from the wrapper code. It constructs an Operand class from the Operand structure in C. Notice the way the AllocObject is used, using that jCls we hold a reference to, instead of calling FindClass again… Then setting the fields and setting this object in the array of Operands.

What I didn’t like much in the JNI is that I had to call SetField, GetField and those variations. On one hand, I understand they wanted you to know which type of field you access to. But on the other hand, when I queried the Id of the field, I specified its type, so I pretty much know what type-value I’m setting, so… Well, unless you have bugs in your code, but that will always cause problems.

To another issue, one of the members of the CodeInfo structure in diStorm is a pointer to the binary code that you want to disassemble. It means that we have to get some buffer from Java as well. But apparently, sometimes the JVM decided to make a copy of the buffer/array that is being passed to the native function. In the beginning I used a straight forward byte[] member in the class. This sucks hard. We don’t want to waste time on copying and freeing buffers that are read-only. Performance, if can be better, should be better by default, if you ask me. So reading the documentation there’s an extension to the JNI, to use java.nio.ByteBuffer, which gives you a direct access to the Java buffer without the extra efforts of copying. Note that it requires the caller to the native API to use this class specifically and sometimes you’re limited…

The bottom line is that it takes a short while to understand how to use JNI and then you get going with it. I found it cumbersome a bit… The most annoying part is all the extra preparations you have to do in order to access a class or a field, etc. Unless you don’t care at all about performance but then your code is less readable for sure. We don’t have any information about performance of allocating new objects and array usage. We can’t base our ways of coding on anything. I wish it could be more user friendly or parts of it eliminated somehow.

ReviveR – Request for Features

Wednesday, September 29th, 2010

Hey guys,

finally, we are starting to come up with a huge features-list for the project.
This is *your* time to affect the project.
I can’t promise that we will implement everything, but we are open minded, and we are going to implement tons of stuff.
We really believe we can make a production level reversing platform with top features.

So if you got ideas, about how the GUI is going to look, or what kind of “windows” we should have in the application, or even tiny stuff, like how an instruction is supposed to be colored, new ideas for code analysis, a crazy view to load a few files in the same workspace, or just whatever that’s having to do with reversing, just go ahead.

So please help us, just ask for anything on your mind
Thanks