{"id":245,"date":"2010-04-23T19:31:23","date_gmt":"2010-04-23T21:31:23","guid":{"rendered":"http:\/\/www.ragestorm.net\/blogs\/?p=245"},"modified":"2010-04-23T19:31:23","modified_gmt":"2010-04-23T21:31:23","slug":"ending-the-race-condition","status":"publish","type":"post","link":"https:\/\/www.ragestorm.net\/blogs\/?p=245","title":{"rendered":"Ending The Race (Condition)"},"content":{"rendered":"<p>After talking to my co-worker, Jond, he agreed that I will write about him too. Actually we were working on solving that race condition together.<br \/>\nSo everything I told you in the last <a href=\"http:\/\/www.ragestorm.net\/blogs\/?p=243\">post<\/a> was in a timeline of around 15 hours, almost consecutive, where Jond and I were debugging the system and trying to track down the bass-turd. So it was around 6 am in the morning, after we had a few hooks on the critsec acquire and leave functions in the kernel. But the log looked fine and this is where I decided to call it a night and went home to sleep a bit. Jond decided to continue, the problem with us, is that we take bugs personally. So he got the logs better and wrote some Python script to analyze it. I was too lazy to do that earlier, I decided to analyze manually once, it is the excuse that if we do it only once, writing a script might take longer. I was wrong. Pity. Then, according to Jond&#8217;s story, he actually saw something wrong in the log, at f@cking last. So I&#8217;m not sure about the small details, but he noticed that the critsec was entered twice or something imaginary like that from different threads, obvisouly. And that time he knew he nailed the guy down.<\/p>\n<p>There are not many options, once you see that the other &#8216;waiters&#8217; don&#8217;t wait when some guy holds it, right? So he looked at the code again, and yet it looked fine! Now he decided it&#8217;s time to act upon &#8220;WTF is going on&#8221;, and he did the following experiment, trying to acquire the critsec in a <i>loop<\/i> (he didn&#8217;t really need a loop, but after you&#8217;re going insane&#8230; so he had to write something that totally looks like &#8220;I GOT THE CRIT&#8221; &#8211; or not). And to his surprise other threads continued to work normally as if there was no lock. As if huh. Soooo, this is going to be embarrassing a bit. And then he found out that the call to the critsec acquire function wasn&#8217;t correct. It was missing a dereference to a pointer. A single character, you got it right. To make it clearer, he saw something like Enter-Crit (m_ptr), instead of Enter-Crit(*m_ptr), which is a pointer to a pointer of an ERESOURCE.<br \/>\nSo obviously, the the lock wasn&#8217;t acquired at all, for some odd reason it aligned well in the logs we analyzed together, until he improved the logs and found a quirk. A question I asked myself, after we knew what was the bug, is that we gave it some garbage pointer, instead of an ERESOURCE, so the function obviously failed all the times we called it. But how come we didn&#8217;t think of testing the return value even though we knew the lock didn&#8217;t work? I guess it has something to do that nobody ever checks the return value of &#8220;acquire&#8221; crit-sec, even in MS code&#8230; Bad practice? Not sure, what can you do if you want the lock, and can&#8217;t get it? It means one thing, that you have a bug, otherwise it should wait on the lock&#8230; So it&#8217;s the kind of stuff nobody checks anyway, but maybe a line of ASSERT could help. Oh well, next time.<\/p>\n<p>That was it, kinda nasty, it always come down to something stupid at the end, no? :(<br \/>\nNow it leaves me totally with that breakpoint we couldn&#8217;t do because the system was too slow with it, and I will write about it next week.<br \/>\nSee you then.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After talking to my co-worker, Jond, he agreed that I will write about him too. Actually we were working on solving that race condition together. So everything I told you in the last post was in a timeline of around 15 hours, almost consecutive, where Jond and I were debugging the system and trying to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":""},"categories":[12,19,11,13],"tags":[],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pbWKd-3X","_links":{"self":[{"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/245"}],"collection":[{"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=245"}],"version-history":[{"count":2,"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/245\/revisions"}],"predecessor-version":[{"id":247,"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/245\/revisions\/247"}],"wp:attachment":[{"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=245"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=245"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ragestorm.net\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=245"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}