Ending The Race (Condition)

After talking to my co-worker, Jond, he agreed that I will write about him too. Actually we were working on solving that race condition together.
So everything I told you in the last post was in a timeline of around 15 hours, almost consecutive, where Jond and I were debugging the system and trying to track down the bass-turd. So it was around 6 am in the morning, after we had a few hooks on the critsec acquire and leave functions in the kernel. But the log looked fine and this is where I decided to call it a night and went home to sleep a bit. Jond decided to continue, the problem with us, is that we take bugs personally. So he got the logs better and wrote some Python script to analyze it. I was too lazy to do that earlier, I decided to analyze manually once, it is the excuse that if we do it only once, writing a script might take longer. I was wrong. Pity. Then, according to Jond’s story, he actually saw something wrong in the log, at f@cking last. So I’m not sure about the small details, but he noticed that the critsec was entered twice or something imaginary like that from different threads, obvisouly. And that time he knew he nailed the guy down.

There are not many options, once you see that the other ‘waiters’ don’t wait when some guy holds it, right? So he looked at the code again, and yet it looked fine! Now he decided it’s time to act upon “WTF is going on”, and he did the following experiment, trying to acquire the critsec in a loop (he didn’t really need a loop, but after you’re going insane… so he had to write something that totally looks like “I GOT THE CRIT” – or not). And to his surprise other threads continued to work normally as if there was no lock. As if huh. Soooo, this is going to be embarrassing a bit. And then he found out that the call to the critsec acquire function wasn’t correct. It was missing a dereference to a pointer. A single character, you got it right. To make it clearer, he saw something like Enter-Crit (m_ptr), instead of Enter-Crit(*m_ptr), which is a pointer to a pointer of an ERESOURCE.
So obviously, the the lock wasn’t acquired at all, for some odd reason it aligned well in the logs we analyzed together, until he improved the logs and found a quirk. A question I asked myself, after we knew what was the bug, is that we gave it some garbage pointer, instead of an ERESOURCE, so the function obviously failed all the times we called it. But how come we didn’t think of testing the return value even though we knew the lock didn’t work? I guess it has something to do that nobody ever checks the return value of “acquire” crit-sec, even in MS code… Bad practice? Not sure, what can you do if you want the lock, and can’t get it? It means one thing, that you have a bug, otherwise it should wait on the lock… So it’s the kind of stuff nobody checks anyway, but maybe a line of ASSERT could help. Oh well, next time.

That was it, kinda nasty, it always come down to something stupid at the end, no? :(
Now it leaves me totally with that breakpoint we couldn’t do because the system was too slow with it, and I will write about it next week.
See you then.

3 Responses to “Ending The Race (Condition)”

  1. John says:

    I’m surprised your code compiled with this type mis-match. Unless you had an explicit type cast or a PVOID involved there.

    anyway, checking the return probably would not have worked because as far as ExAcquireResourceExclusiveLite is concerned, you managed to acquire a resource, just not the one you intended.

    p.s

    please stop calling ERESOURCE crit-sec. Critical section is a user-mode structure. ERESOURCE is a single-writer multiple-reader lock.

  2. arkon says:

    Hey John,

    It didn’t acquire the ERESOURCE, that’s why he even tried a loop.
    And sorry, technically speaking it is an ERESOURCE, but to simplify things I prefer to call it a critsec. But you are right.

  3. […] posts ago, I was talking about hunting a specific race condition bug we had in some software I work on. At last, I have free time to write this post and get into […]

Leave a Reply