Piiano Data Protection – My Startup

July 18th, 2023

Hey everybody,
As much as I love writing, I’ve been moving to write for my own startup Piiano.

Piiano celebrates 2 years this month, and we build incredible infrastructure for backend developers to build their apps securely around sensitive data and complying with regulations like GDPR/CCPA and others at the software level.

Piiano’s name is based on the privacy acronym PII, and we pronounce it just as ‘Piano’ the great musical instrument. PII stands for personal identifiable information – aka our identifiers like full name, email, phone number, address, etc.

At Piiano, we help businesses protect their sensitive data (normally, customer data) with APIs for data encryption and tokenization. So for the first time, backend developers can continue to use their own databases, each team with whatever tech stack they like, and using our APIs they can encrypt the sensitive fields on top of it.

In addition, to the backend developers, the security teams will be happy, for a few reasons:
1. Encrypting data in a robust way over different database requires different skills and expertise.
2. Given that each R&D team has different database, it will require lots of different people supporting securing it, which is hard to do in scale, practically.
3. The security team always chases the R&D teams to fix stuff, and when things get broken, it’s sometimes too late, already in production.

With our approach, at Piiano, security teams ask the backend application developers to use our APIs for encrypting data and decrypting data. Friendly APIs that are designed from scratch to be used by developers with simplicity at mind, thus RESTful APIs for example. This way R&D teams streamline securing the sensitive data. In return, the security teams get to control who can access the data, when, why, etc. And everything gets to a unified control center that lets them manage everything at scale.

Building an encryption system is a very hard task, and require real expertise that most technologists aren’t aware to the complicates of implementing them, unfortunately. “Don’t roll your own crypto”, they say, but most don’t even listen or know it. We try to take away this pain and complication. We also try to provide the developers with privacy-aware infrastructure, so upon decryption, we can do data masking, data expiration, data transformation, and providing even more support for privacy compliance.
And we have lots of other innovations there… Cool stuff really. Like IDOR protection, mitigation for SQL injections, object level security, and many more features. Cause we’re tired of seeing stupid bugs in serious applications, lol.

So that’s my current endeavor for the last couple of years, and the next few years for sure.

As someone who loves security engineering and vulnerability research, starting Piiano was a must. Our dream is to move the needle in the data security industry and help businesses really protect their customer data – eventually our own data, as netizens, for whatever apps we use on the internet.

We want to solve this data breaches problem at the source code level, at the core of where data is being stored and accessed. This is where real data security takes place, with all due respect to more firewalls and WAFs that people like myself just sit down for a few hours and manage to bypass eventually…

Wish us like in protecting the data against the bad guys :)

Optimizing diStorm for 2x

May 27th, 2020

I always loved code optimization, because it’s easy to see the goal getting closer every iteration in an objective way, and it’s a hard challenge. You set a bar and then you start to sweat getting there. I decided I want to take latest distorm version and make it 2x faster in processing time. I find it deeply fascinating that the same mechanism with the same input/output can run much faster.

diStorm was designed to be fast from day one, and it was pretty much the case for many (~15) years when comparing diStorm to other disassemblers. Until recently (last few years) a new one claimed to be faster from some benchmarking I saw, although I’ve never validated it on my own. Amyway getting 2x faster wasn’t a small feat.

By designed-for-speed I mean:
1) single threaded!
2) diStorm doesn’t allocate memory at all (it takes buffers from the user)
3) doesn’t call any syscalls
4) it’s very simple in terms of calling-graph/functions
5) works with length’ed strings (as opposed to C strings)
6) parsing/decoding binary is fully separated from formatting the output strings
7) no initialization code is required (imagine distorm_init)

The test case consists of running the disasm tool (only commenting out the printf call) on some input file to decode it in 32 bits. Compiling the disasm itself with full-optimizations on and for x64 arch. My CPU is old: Intel SandyBridge Core i7-2600 @ 3.4 GHz. OS: Windows 10. The input is a dummy ~120MB file of code (c:\windows\system32\mrt.exe), which is good enough for my test. Starting above 5000 milisec and going down all the way to 2500 for a single thread!

The process of researching the code was done in two big steps (mind you, I had lotsa free time):
1) Research – Quick and dirty changes to the code, testing only once in a while for validness of output. Seeing progress fast. Eventually reaching goal (1 month almost full time).
2) Development – Starting coding all over again, verifying every new piece of code, adding more tests/coverage and making it production grade.

Code optimization falls under two categories, the way I see it, the obvious stuff that a profiler points to you and you can easily fix (normally not a great boost in performance, very specific touch). And the non-obvious stuff that can be refactored to get better performance but nothing will guide you the way to achieve it. And this is why I believe code optimization is really hard and more of an art form, as there’s no method directing you how to do it. Normally, one should make sure their code is tight in the algorithmic level and only then try to exhaust all the obvious fixes the profiler hints about. In reality, you go back and forth and after a while that you believe there’s nothing more to squeeze or re-write, you find new creative ways for the new code to work faster, and these are the big boosts.

The tools I used for my performance research were Intel vTune which is now free, and linux’ perf tool. Note that flamegraph wasn’t helping as the major code loop has 2 levels deep call stack. The advantage of using perf on linux is that you can query for the exact HW event, unlike the free version of vTune which queries all events altogether and their statistics are a bit more noisy thereof.

In the rest of this post I’m going to show a few of the best optimizations that had the biggest impact on performance (nonadjacent to research order):

  1. PGOProfile Guided Optimization (feature is courtesy of Visual Studio). This one easily gets your code 10-20% faster and it’s ridiculous that I’ve never heard of it before, and neither many acquainted developers. The drawback is that it requires certain input to ‘learn’ how to prioritize/lay-out the code in order to make the code faster, in a disassembler’s case it’s pretty obvious anyway, so nothing to worry about. Basically, what it does is ordering basic-blocks and instructions to be in hot path (the instructions that mostly happen in a loop, for instance) and the cold paths (like error-handling that doesn’t happen often).
    For example, for “if-else” ladder (or switch-case), it’s preferred that the more likely condition appear first. As to get to it you have less branches. This is one of the optimizations done automatically by PGO if it has sufficient data.
    Now a stream of instructions can run consequently one after the other without branching and that’s amazing. See this picture of decode_inst disassembly of before-PGO here, and after-PGO here, notice the code layout and the latter function’s core is within one page.

    Note that in order to use this feature it’s important to copy some DLL to the release directory, run the tool with the wanted input, and then re-compile the code with ‘optimizing for pgo’ in VS.
    (exemplary path:
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64\pgort140.dll).

    A tip: I created a compilation target that copies this file, compiles for PGO and runs the input file, and then compiles again with the created profile, this saved a lot of time.
  2. Textual formatting a binary structure from _DInst structure to a textual structure (_DecodedInst) requires to do lots of string copyings of both mnemonic (ADD, MOV, XOR, …) and registers (AL, EAX, XMM0,…) and other constant string literals.

    The main trick that boosted performance around 10% is to change the calls to memcpy with varied length to a fixed length.
    Instead of calling
    memcpy(output->mnemonicText, mnemonicsTable[mnemonicId].buffer, mnemonicsTable[mnemonicId].length);
    We better do:
    memcpy(output->mnemonicText, mnemonicsTable[mnemonicId].buffer,

    Now the compiler can fully omit the call to memcpy and just do a single instruction 16 bytes copy!
    The reason it always copies 16 is because the longest mnemonic length is 16 of the instruction VAESKEYGENASSIST.
    Caveats: 1) Adding padding in the table to avoid out of bounds is crucial. 2) All mnemonics are also zero terminated, so it’s okay to copy full 16 bytes even if the mnemonic is ‘MOV’, it’s really ‘MOV\0’, and yes it will copy other stuff in (but it’s ignored). 3) All final strings also have an integer length accompanying them!

    For copying registers, making sure the table is as tight as possible and also copying a fixed length did the trick. Meaning that if there’s a table of registers in the form of:
    char registersTable[100][10] = {"eax", "ecx", ...};
    It means that the access to each entry is a multiplication of 10, making it a power of 2 is easier for the compiler to optimize and it also saves a lot of memory space too. Biggest register string literal length is 5 (YMM15) and a null termination char is needed, so we can use 6, but better use 8 already. Then the memcpy of a register always copies 8 bytes no matter what.

    Since x86 instructions have both string literal prefixes (rep, repnz, lock) and postfixes (eg CMPSD, CMPSW, CMPSB, etc), doing lookup tables also quickened stuff.

    See code diff here (scroll down all the way).

    Another trick that really helped, surprisingly, a lot is that I noticed string concatenation can work even faster. Note that all distorm strings are already length’ed, meaning it’s a structure with length and a buffer. And when a proprietary version of strcat is used it appends to the string and instead of searching the null termination it knows its length. But the surprise was that appending to the string with
    memcpy(&str->buf[str->length], inputStr, inputLen);
    was very slow according to just use a normal pointer, eg:
    memcpy(str, inputStr, inputLen); str += inputLen;
    So I had to change all distorm string functionality to work with a single pointer, and only at the end calculate the length and stuffing a null termination character.
    Also note that at the same breadth I figured that distorm’s code has fixed sized string literals like the DWORD in ‘mov dword [eax], 8’, etc. So I forced distorm to use fixed sized memcpy’s too, which helped to compiler to eliminate them all and turn them into a simple qword-move. If you open distorm’s disasm code now you won’t see a single call to a memcpy function. This also saved a lot of time, 5-10%.
    See relevant code diff here.
  3. Another set of optimizations that shaved 10% further were to re-arrange if statements and code blocks that happen to run per instruction, but in reality they aren’t always needed.

    The obvious guideline is that if the code does less stuff per iteration it’s likely to run faster.

    Imagine the code checks something in the instruction. For example, do something for privileged instructions only. The first issue is that such instructions are input dependent (i.e out of our hands), the second one is that 99% instructions out of all instruction sets aren’t privileged to begin with. This means that the code should avoid doing checks for them, if possible. So I managed to omit a conditional statement…

    One thing was to move things to that origin, for example, if something was dependent on the optional operand of an instruction, and the test was done in the per-instruction code, obviously if we move the test to the origin (inside the operand parser), then things will run faster.

    Another was that some extra features were done naively (distorm tracks all used/unused prefixes and generates relevant masks), so optimizing these boosted the performance too. Sometimes a profiler makes things clear where it’s worth to optimize stuff.

    Sometimes it’s better to do a=x; if (y) a=z; instead of if (y) a=z; else a=x; which also might help for performance.

    Assuming happy flow wins and optimize for that can also drastically help. Sometimes you can check for faults at the very end of the step instead of every sub step (if the mechanism allows to do so).

    Taking into account data cache locality (which I haven’t ever before dealt with till now). If you touch some memory, be sure to exhaust that piece of data until you don’t need it anymore. Instead of re-visiting that memory again and again. So re-ordering code accordingly saves a lot of time loading stuff to the cache.

    Trying to re-order the multiple conditions in an if statement might help, perhaps PGO does it on its own. But always start with the less likely condition to happen, so you quit checking further and sooner.

    Another example is that sometimes the code used to deref some pointer and not always use that value, so postponing the deref only to the site where it’s really needed, can also boost performance.
    Many small things like that helped to increase performance of the decode_inst function, you can check out the diff here, but it’s spanned across a few commits.
  4. Optimizing operands extraction also saved around 10%. For example some operands need to read further information from the binary stream and reading from the stream can vary with size of 1, 2, 4, 8 bytes. The original function is something like this:
    int read_stream(const char* codeBuf, unsigned size, uint64_t* out)
    switch (size) {
    case 1: *out = *codeBuf; break;
    case 2: *out = *(uint16_t*)codeBuf; break;

    Now given the compiler knows the sizes of how much to read from the stream because all callers to read_stream specify ‘size’ as a constant, then it can really optimize the whole switch statement and just read what’s need from the stream, no biggy.
    The problem is that the function wasn’t inlined at all, and thus each stream reading took a lot of time just because of the branches. Making this function inline saved around 5% by itself, for free! :)

    Another issue was that a few things always happened in the beginning or ending of an operand extraction function that could be done lazily or postponed to the exact right moment, like described in the previous point, and the also saved significant time.

    The operand_extract function originally took around 11 argument, trying to reduce these and work with less argument is normally better. Some operands were calculated again and again (like the operands array index, &ops[index]) which sometimes is already calculated once, so try to pass it along instead, can save also some time. It sounds like small stuff, and it is, but eventually when you stop coding naively and start to see this stuff in the profiler and also by looking at the code, it adds to a lot.

    Since there are some 100 internal types of operands in distorm, they were classified to a few categories. Now the original code did something like: switch (type) case 1, case 2, case… and then another switch (type) case 10, case 11, case 12… and then yet another switch (type) case 20, case 21, case 22, and so on. Meaning that the case of 22 will get evaluated in two switch tables even though it surely gets to run code only in the third switch case. This can generate branch prediction noise (as these switch statements eventually become a jump table) and waste time also because it’s not hitting a target basic block, just going through to the next basic block. So I chose to add a corresponding range checks before each switch statement to spare the wasteful look ups. Alternatively it was possible to coalesce all switch cases under a single switch, but I didn’t like that form of code in my case.
  5. distorm has eventually two APIs, one for textual based disassembling and another for actual binary structure output, internally this functionality runs first, and then it formats the binary to text.
    But since the prototype of the functions take different input and output structure it changes how the core engine works with the output memory.
    In one of the main loops of converting the binary structure into textual output, distorm used to copy the binary structure (from uer’s memory) to a temporary on-stack memory, and then feed it as input into the formatting function, so eventually the user’s memory is the output one. Eventually I decided to eliminate the temporary buffer and make sure the formatting functionality can work with the same input/output buffer in-place!
    Well better see it in pseudo code:
    for (i = 0; i < instsCount; i++) {
    // insts[i] is a binary form of the instruction
    memcpy(&tmp, insts[i], sizeof inst);
    format_inst(&tmp, insts[i]); // in: tmp, out: insts[i]
    // Now insts[i] is a different struct with textual output

    This in-place memory trick saved around ~3% and it required to refactor the formatting function so same memory fields had to be touched carefully, so we don’t override with text some binary fields that we need yet to convert…
    See code diff here.
  6. distorm has some extra features that require extra work CPU time to get them done. One of them is a very simple feature to return the relevant eflags of each decoded instruction (inside _DInst it’s modified, undefined and tested eflags masks). For example a JZ instruction takes ZF as input. ADD instruction writes to a few flags such as: AF, ZF, PF, OF, CF, etc. And some instructions even change some flags into an undefined state.

    This feature works all the time, even in a textual mode of the disassembler, like in most cases, and wastes time. So I turned this code into a feature that the user can select if they want to opt-in, and this saved ~3%.

    Moral of the story: don’t calculate stuff nobody uses!

    Note that what takes time in this is copying the compacted eflags to the extracted eflags compatible masks in the final output buffer…
    See code diff here, nothing interesting.

There are many more tricks, but these are the most performance impact.
Eventually this research taught me a lot about the modern processors pipeline, which is fascinating and worth to read about it from a few links.
Sometimes it was frustrating move one c code line from one place to another, which at times made things faster and sometimes slower. Sometimes eliminating a whole line made things slower, really. And some other times putting a new line and even an access to a memory address made things faster in the right place. Modern pipelines are somewhat voodoo, and be careful not to optimize for a specific one, there was some point that I decided to keep the code in readable C and playing with single lines is not a great idea, unless you can reason with it and a profiler yells at you to change it.

Cache was always an undervalued thing for me, even though I’ve been coding for over 2 decades now, I always took it for granted, but in reality once you understand the cache, you can get your code faster. And many naive things I did (which I didn’t consider myself a naive coder in terms of speed), were also bad. So in this exercise I learned a lot and tried to share some useful and practical points with you and less about the research itself. Hence I avoided talking about the processor itself (for instance, icache vs. DSB) and the tiresome process of benchmarking instructions, branches and uops, etc.
I guess looking at the code of the original version and the final version and comparing them can teach one a lot, in files such as decoder.c, distorm.c, operands.c, prefix.c and textdefs.c.

Here’s the log I kept of the second phase, (headers: timestamp, measurement time in MS, comment):
4:42 PM 4/21/2020 5170 BASELINE # MRT.exe (120MB) decoded in 32 bits. disasm compiled for x64.
5:35 PM 4/21/2020 4964 DF_FILL_FLAGS feature.
11:02 PM 4/21/2020 4832 In-place formatting.
11:58 PM 4/21/2020 4482 Fixed size register & mnemonic copying.
8:58 AM 4/22/2020 4393 str_hex optimized.
9:35 AM 4/22/2020 4386 textdefs.c cosmetics.
5:01 PM 4/22/2020 4130 wstring and textdefs optimizations.
5:00 PM 4/23/2020 3969 format optimizations.
5:02 PM 4/23/2020 3562 PGO enabled for mrt.exe.
4:56 PM 4/25/2020 3300 prefixes and decode_internal optimizations.
10:12 AM 4/26/2020 2890 optimized inst_decode.
3:33 PM 5/4/2020 2690 prefixes and operands optimizations.
11:44 PM 5/7/2020 2610 more prefixes and operands optimizations.
9:53 AM 5/14/2020 2500 prefixes, operands, decoder optimizations.

And these are vTune profiler showing the functions and timings (not fully accurate to my original tests TBH) for the formatting functionality (down from 1474 to 741 MS – 1.98X ) and the decoding functionality (down from 3084 to 1818 – 1.7X), you will see less symbols in the optimized functionality thanks to in-lining.

Future challenges for distorm – there are still 2 functions that drive me nuts that I managed to optimize a lot but are still slow and deserve probably tons of attention: a proprietary version of itoa for converting integers to hex ascii numbers (str_int_impl), and converting binary bytes to hex ascii (str_hex). And optimization there is super helpful.

Some useful links:

Win32k – Smash The Ref

April 1st, 2020

After a year in the making, I’m happy to publish my biggest research so far. It’s a whole bug class of vulnerabilities already patched by MSFT recently.

You can find all the POCs and the paper in my github.


SetWindowsHookEx Leaks A Kernel Pointer – CVE-2019-1469

December 12th, 2019

This is another information disclosure bug I submitted a few weeks ago and was just patched by MSFT. And apparently another group of researchers found it at the same time.
The interesting thing about this bug, as of the previous info disclosure ones on my blog, is that before I found them all, I was asking myself: “How come I don’t find any information disclosure vulnerabilities?”. All I am used to find is normally potential execution vulnerabilities. And then I realized my thought-process of approaching code auditing / reverse engineering is biased toward different types of vulnerabilities, but less to the type of info leaks. And I decided I need to change what I’m looking for. And after a few days of work I found a few info leaks at once. The difference was that I realized that info leaks could be much simpler bugs (normally) and the patterns I’m used to look for, don’t normally cover info leaks.

I started to think about how code works and where kernel might pass kernel data/pointers to usermode.

It always helps me to learn previous work to get ideas and classic examples are:

  1. Data structures that have padding and alignment might contain stale/uninitialized memory and being memcpy’ed to usermode might leak data.
  2. The return value register (RAX/EAX on x64/86) might contain partial pointers once returning from a syscall (because the syscall returns void but the compiler didn’t clean the register which is always copied back to usermode).
  3. Imagine there are two subsystems in the system you’re trying to attack (say, NtUser and GDI syscalls). And then these subsystems also talk to each other in the kernel directly. Many times when NtUser calls GDI functions, it doesn’t use to check the return value, and expects to get an out parameter with some information. Now, what if you could alter the path of execution of such a GDI call from NtUser (for example, by supplying a bad HBITMAP that it will pass to the GDI subsystem). And the out parameter wasn’t initialized to begin with, and the GDI call failed, and the data is copied somehow back to usermode… happy happy joy joy. Always check return values.
  4. Also check out J00ru’s amazing work who took it to the extreme with automation.

This time our bug is different and once I figured it out, thinking how the hooking mechanism of SetWindowsHookEx works, it was trivial. Basically, for any type of events you hook in Windows, there’s 3 parameters involved in the hook procedure (passed as parameters), the event code, wparam and lparam. LPARAM is the long pointer parameter (dated from Win3.1 I believe) and contains normally a pointer to the data structure corresponding for the event code. Meaning that the kernel sends data through the lparam (which will be thunked to usermode).

But then it just occurred to me that there’s another unique type of a hook and that is the debug hook! The debug hook is a hook that will be called for any other hook procedure…

So I imagined that if the kernel passes an lparam internally before it’s being thunked, it means the debug hook sees the kernel pointer of the lparam still. And then they might forget to remove it from the lparam of the debug hook, which is the lparam of another hook procedure just about to be called.

And that worked…

In the POC I set up two hooks: the CALL-WND-PROC hook to get a window message that is sent from another thread, which has the lparam points to a data structure – now we don’t care about the info in the structure as long as it’s eventually represented by a pointer. and the second hook – the Debug, that eventually gives (leaks) the kernel pointer of the data structure of the window-create message.
And then I just SendMessage to start triggering everything.
Note that we need to hook a low level kernel-to-user-callback function that gets the raw pointer from kernel and that function eventually calls the debug procedure we passed in to SetWindowsHookEx, otherwise the pointer isn’t passed to us normally and it stays hidden.

Check it out on my github, it’s really simple.

The case of DannyDinExec – CVE-2019-1440

November 13th, 2019

Yet another pointer leak in the win32k driver, this time it’s a cool bug. It’s inside the DDE (Dynamic Data Exchange) component.

  1. It’s possible creating DDE objects (whatever that means) and each gets a corresponding newly created kernel window by using the NtUserDdeInitialize syscall.
  2. Internally, it adds the new DDE object to a global linked list of all such objects.
  3. In certain cases, it needs to update all DDE windows that some event occurred by sending them a message (by xxxSendMessage). Alas, this message encodes a kernel heap pointer in the LPARAM argument.

Ah ha, I thought to myself: if I manage to get the message routed to a user-mode window procedure, then the LPARAM will reveal sensitive kernel info. Profit.
Looking at the function xxxCsEvent which is the workhorse of step #3:

  1. It first converts the linked list into an array of HWNDs (not pointers, but the window-identifier) – Problem #1
  2. It then walks over the array and verifies the HWND still represents an existing window, and then it sends a DDE message to that window procedure – Problem #2

It looks something like that (which is super simplified):

count = 0;
// Query the list depth.
for (p = g_listHead; p != NULL; p = p->next) count++;
array = malloc(sizeof(HWND) * count);
// Convert the list into an array of handles.
for (i = 0, p = g_listHead; p != NULL; p = p->next, i++)
   array[i] = p->pwnd->hwnd;
// Iterate over the array and dispatch.
for (i = 0; i < count; i++)
  pwnd = verifywindow(array[i]);
  if (pwnd != NULL)
    xxxSendMessage(pwnd, WM_DDE_MSG, 0, kernel_ptr);

The way to attack this algorithm is by understanding that the kernel windows that belong to the DDE objects can be replaced during the first callback to user-mode, so the second iteration (given we created 2 DDE objects) will send a message to a window that we created from user-mode, that has exactly the same HWND (defeating problem #1 and #2), so we get the message directly from kernel, because window verification passes.

Wait, what???
If we could get the message on the first iteration to user-mode, then it's a game over already. The catch is that we can't set a user-mode window-procedure for a kernel DDE window. But we don't need to, as the DDE window procedure (xxxEventWndProc) anyway calls back to user-mode on its own by calling ClientEventCallback. Once we hook that function we're good to get control back to user-mode in the middle of the first iteration.

Now back in user-mode, we can destroy the next iterated DDE object, and it will destroy the corresponding window too, whose HWND we already know.
And now, all we have to do is to brute force to create a window that will have the same HWND, to make the verifywindow() succeed, and then xxxSendMessage sends the raw pointer to us :)

Brute forcing a HWND is very simple, because given the HWND encoding, one has to iterate exactly 16 bits space (hence, loop 0x10000 times) to get the same HWND again (over simplified again, but would work 99% of the cases). Without going into all details.

The following diagram tries to explain the flow that the attacker has to go through in order to exploit described code above:

A POC can be found on github.

Integer or pointer? CVE-2019-1071

November 10th, 2019

A few months ago I found an info-leak in the win32k driver. It was in the xxxCreateWindowEx (CreateWindow API) and it leaked a kernel pointer to (the desktop heap of) a menu object.

The desktop heap is where all UI objects are allocated in kernel and it’s mapped to user address space too. Meaning that if one finds out the kernel address of an object, and she already got the address in user space, then it reveals the delta between the mappings (which is a secret). And now all objects addresses are known.

Note that in the beginning win32k used to map the desktop heap which revealed kernel pointers (because kernel objects contained kernel pointers in them) and it was recently fixed to harden breaking KASLR. So now such a pointer leak is very helpful for exploitation.

The origin of the bug is in the fact that the same field used for a child-window-ID is also used for a menu pointer (through ‘union’) in the window structure in the kernel. Back in the beginning of the 90’s saving memory space was sacred and this very specific optimization costed MS a few vulnerabilities already.

If you look carefully in MSDN at the CreateWindow API you will notice the HMENU argument, which can also be an ID (or an integer eventually). The bug is a confusion between the time we create the window and the time it queries for a menu.

When we create the window we pass NULL for the HMENU, however the class we use specifies a menu resource through the WNDCLASS.lpszMenuName field. Making the window creation code calling back to user-mode (#1 in snippet) to create a menu for that menu name. While back in user-mode we SetWindowLong() the same window to become a child window (WS_CHILD), meaning that from now on that field should be interpreted as an ID/integer. And now back into the kernel at xxxCreateWindow the following if statement (#2) gets confused.

if (window-not-child &&
    NULL == pmenu &&
    NULL != cls.lpszMenuName)
  pmenu = xxxClientLoadMenu(cls.lpszMenuName); ///// #1
if (window is child) ///// #2
  pwnd->menu = pmenu; ///// #3
  // By now pmenu should be an ID, but it's really a pointer!
  HMAssignmentLock(&pwnd->menu, pmenu);

And at last, at #3 our menu is really a pointer and not an ID! Again, the code assumes that if the window is a child then the pmenu variable is always an ID and if the window is not a child, then pmenu is really a pointer to a menu (hence the else with locking the object…).
Now since the window is really a child (as we changed it from user-mode from the xxx callback) and the menu is interpreted as an ID, we managed to leak a kernel pointer to a window-ID.

Finally, it’s accessible just by querying:
GetWindowLongPtr(hWnd, GWL_ID).

Code can be found on my github. Bonus – there’s also a NULL deref bug hidden inside that is exploitable on Windows versions that can map the NULL page.


Flash News

November 13th, 2018

TLDR; There’s a bug in Adobe Flash. [It got allocated with CVE-2018-15981]
The interpreter code of the Action Script Virtual Machine (AVM)
does not reset a with-scope pointer when an exception is caught,
leading later to a type confusion bug, and eventually to a remote code execution.

First, we will start with general information on how the Flash engine, AVM, works,
because I find it fascinating and important for understanding the matter at hand.

Flash runs code by either JITting it or by interpreting it. The design of the engine is to verify the ActionScript3 (AS in short) code and the meta data only once upon loading a (SWF) file and then to run the code while trusting it (without re-validation). This improves significantly the performance of running Flash applications. This affects instructions, they are also designed to have operands as part of their (fixed) byte code. For example, the pushstring instruction will take an index number into the strings array that is defined in the SWF file and it can only point to a valid entry in that array (if it won’t, the verifier component will reject this file from being run). Later, in the JITted code or interpreter, it will assume that the verifier already confirmed this instruction with its operand and just load the requested entry and push it to the stack, as opposed to reading the entry from a register, which can be dynamically set in runtime and thus require a runtime check.
The AS code also supports a stack and registers, and since it is a dynamic language that lets one define members and properties (key/value pairs) on the fly, the engine will have to search for them every time, and to coerce their types too to the expected types; of a function argument for example. This coercion is done all over the place and is the heart of making sure types are safe all over their use. The coercion will even go further and change a type to another expected type automatically to be valid for that function call.

When the AVM runs code, the relevant part in the verifier is doing its important job by following and analyzing statically all instructions by simulating them and it verifies the types of all registers, variables on the stack and more, in order to see how they are about to be used, and if it sees they are used as invalid/wrong types, it will fail loading the file in advance. Otherwise it will just let the operation continue normally with assuming the type is correct and expected. Imagine there’s a function prototype that receives an integer argument:
function foo(bar:int):void {}
but you pass a string to it instead:
then the verifier won’t like it and will fail this code from running. In other times, if you pass an object which has a default-value method, it might invoke it in run time to hopefully get an integer, but it will make sure in runtime that it really got an integer… so some of the checks also happen in runtime when it can’t know in static time (so yes, it’s okay to pass object as an integer).

Another trait of AS is what make properties dynamic, or technically how they are searched for every time. Starting with local scope and going to higher scopes in the hierarchy up to the globals’ scope. This is all documented. It hurts to know that much searching is done in order to find the right property in the scope chain every time code accessed it.
Perhaps Flash gives full OOP abilities on top of a dynamic language but when we take a look at the open source code of the AVM, it’s really horrifying, code dups all over instead of encapsulation, or the same part of the SWF will be parsed in multiple places and each time different functionality will be applied to it and other horrors…that we, security researchers, love.

Anyway, Flash will prefer to use the JIT as much as possible to run actual code, because it’s faster. It will emit real code of the target hosting architecture while doing the same pass for verifying the code. And it can even JIT a function while it’s already running inside the interpreter and continue from that state, impressive. However, remember that the interpreter is always used to run constructors of a user class defined in the SWF. And this is how we make sure our code will get to run inside the vulnerable interpreter. The engine can also handle exceptions on its own (try-catch and throw), employing setbuf and jmpbuf, so if an exception is raised inside an AS function, the interpreter implementation or the JIT infrastructure itself will catch it and pass it along the
next handler in the chain until a correct AS catch-handler with the right type will be found and execution will resume inside the interpreter, in our case, from that spot of the catch handler.
Basically, if you try to make an infinite call-recursion to itself, you won’t see a native (Windows/Linux) exception thrown, but rather the interpreter checks the size of its own stack artificially in some spots and will emulate its own stack overflow/underflow exceptions. If you will try to do an integer division by zero, then they already handle it themselves in the div instruction handler, etc. It’s pretty much robust and a closed system. There are many other interesting Flash internals topics, like atoms, the garbage collector, JIT, etc, but that won’t be covered today.

Now that you have some idea of how the AVM works.
Let’s talk business.
In AS you can use the keyword “with” to be lazy and omit the name of the instance whose members you want to access.
That’s normal in many scripting languages, but I will show it nevertheless.
For example, instead of doing:
point.x = 0x90;
point.y = 0xc3;
point.z = 0xcd;
It can be written as following using the “with” keyword:
with (point)
 x = 0x90;
 y = 0xc3;
 z = 0xcd;
For each assignment the interpreter will first use the with-scope if it’s set to avoid searching the full chain, once it got the property it will make the assignment itself. So, there’s an actual function that looks up the property in the scopes array of a given function. The origin of the bug occurs once an exception is thrown in the code, the catch handler infrastructure inside the interpreter will NOT reset its with-scope variable, so actually it will keep looking for properties in that with-scope in the next times. But (a big BUTT) the verifier, while simulating the exception catching,
will reset its own state for the with-scope. And this, ladies and gentlemen, leads to a discrepancy between the verifier and the interpreter. Thus, opens a type confusion attack. Pow pow pow. In other words, we managed to make the interpreter do one thing, while the verifier thinks it does another (legit) thing.

This is how we do it:
In the beginning we load the with-scope with a legit object. We later raise a dummy exception and immediately catch it ourselves. Now, the interpreter will still use the with-object we loaded, although the verifier thinks we don’t use a with-scope anymore, we will query for a member with a certain controlled type from the with-scope again and now use it as an argument for a function or an operand for an instruction that expects something else, and voila we got a type confusion. Remember, the verifier will think we use a different property that matches the expected type we want to forge and thus won’t fail loading our file.

Let’s see the bug in the interpreter code first, and then an example on how to trigger it.
At line 796, you can see:
register Atom* volatile withBase = NULL;

That’s the local variable of the interpreter function that we’re messing up with (pun intended)! Note that technically the withBase is used as a pointer-offset to the scopeBase array
(that’s all the scopes that the function loaded on its scopes stack), but that doesn’t change the logic or nature of the bug, just how we tailor the bits and bytes to trigger the bug, if you are interested to understand this confusing description, you will have to read findproperty implementation. And at line 2347 which is the handler of the findproperty instruction:
*(++sp) = env->findproperty(scope, scopeBase, scopeDepth, multiname, b1, withBase);

See they pass the withBase to the findproperty, that’s the one to look up a property in the scope chain. This is where we will make it return a different property,
while the verifier will think it returned a valid typed property from our object. Now, we can use throw keyword to raise an exception, and the catch handler infrastructure at line 3540, will handle it.
CATCH (Exception *exception)

You can see that it will reset many variables of its own state machine, and set the interpreter’s PC (program counter, or next instruction’s address) to start with the target handler’s address, etc.
But they forgot to reset the withBase. I bet they didn’t forget, and they did it for performance sake, their code has tons of micro and obsessive optimizations, that today a good engineer wouldn’t just do. You can also note that they clear scopeBase array only in debugger mode (line 3573), and that used to be another bug, until they realized they better do it always.
They used to have many bugs around the scope arrays in the interpreter, but they’re all fixed now in the binaries, since we look at an old source code you can still find them.

Finally, let’s see how we would maneuver this altogether.
I used the great rabcdasm tool to assemble this.
This code is partial and only the relevant guts of the triggering exploit.

; All functions start with some default scope, we keep it too.

; We create a property with the name "myvar" that is really an object itself of type NewClass2.
; This property which is inside the object/dictionary will be the one the verifier sees.
findpropstrict QName(PackageNamespace(""), "NewClass2")
constructprop QName(PackageNamespace(""), "NewClass2"), 0
initproperty QName(PackageInternalNs(""), "myvar")

; We push a second item on the scope array,
; to make the withBase point to its location in the scopes array.
; Note this array contains both normal scope objects and with scope objects.

; Now we throw an exception, just to make the interpreter keeping a stale withBase.
pushbyte 1
; This is our catch handler, we continue to do our thing.
; Now note that the withBase points to the second item on the scopes array,
; which currently won't be used until we put something there (because the scopes array is empty).

; Put back the first scope on the scope stack, so we can work normally.

; Now we're going to create an object which has one property of an integer type and its name is "myvar".
; This property will be returned instead of the correct one from the object we created above.
pushstring "myvar"
; Next is the attacking type to be really used instead!!!1@ :)
; This is the actual value that will be fetched by getproperty instruction below.
pushint 534568
; Create an object with 1 property (name, value pair on the stack)
newobject 1
; Mark it as an object type.
coerce QName(PackageNamespace(""), "Object")

; And now push it to the scope, note we don't use pushwith this time!
; Because we want to keep old value, otherwise verifier will be on to us,
; This makes our object that we just created as the new with-scope object.

; Now, findproperty et al will actually scan for our property using the withBase in practice,
; which has our fake object that we recently created,
; containing a property with the "myvar" name, with a different type from what the verifier sees
; (Remember - it sees the object above, in the beginning of this function).
findproperty Multiname("myvar", [PackageInternalNs(""), PackageNamespace("")])
getproperty Multiname("myvar", [PackageInternalNs(""), PackageNamespace("")])

; Now practically on the stack we have an integer,
; instead of an object, and next executing getslot which assumes an object (NewClass2) is in the stack,
; will crash miserably!
getslot 1


The triggering code can be done with many different variations. Instructions like nextvlaue can be targeted too, because it doesn’t verify its operands in runtime and can leak pointers etc.
When I found this bug at first, I thought there’s small chance it’s a real bug. Particularly, I had my doubts, because the chances to have a forgotten/dangling with-scope is high in a normal Flash application. So how come nobody encountered this bug before as a misbehavior of their app? E.G. by getting a wrong variable, etc. Apparently, the combination to cause this scenario accurately is not that high after all.

Good bye Flash, you’ve been kind…

Exploiting Android Stagefright with ASLR Bypass

March 4th, 2016

Everybody knows that exploiting remote code execution vulnerabilities is a real challenge. Rumors say it that some entities around the world managed to do it but AFAIK, nobody published anything, as they might use it for gathering intelligence. Israeli NorthBit security consultancy company researched the vulnerability and managed to bypass the ASLR through exploiting the bug using a browser on Android 5.0 and 5.1 for Samsung devices.

For more details see http://tinyurl.com/h4deqjg

Armstorm – ARM Disassembler

December 12th, 2012


It seems the distorm-arm project is going to be called Armstorm, after asking the guys on Twitter.
Anyway, as I’m working on this project, I just came up with the following cool snippet and wanted to submit it as a riddle:

int base = 0;
int runLength = 0;
while (mask && (~mask & 1)) base++, mask >>= 1;
while (mask && (mask & 1)) runLength++, mask >>= 1;
if (!mask && runLength > 2) {

Any idea what does it do or why I need it ?


November 11th, 2012

Hey guys

I’ve been pretty busy in the last few months, working, and working.

I am thinking about coming up with a diStorm-ARM version, it’s pretty self-explanatory. ;)
For the ones who are not familiar with diStorm, it is a disassembler that returns a binary structure that describes an x86/x64 instruction, instead of textual output.

Would love to hear your thoughts, as always.