Hard Problem


We had an old trouble ticket that we wanted to resolve in the next maintenance release. One of the developers got assigned to determine the root cause of the problem. Everyone was hoping the analysis would take around a day to complete. After a week of analysis, this developer made no progress. This was going to cause us to slip our release. Not acceptable.

The analysis got reassigned to me. I was told I had a day to complete the analysis. There was already a work around in place. A script ran hourly on the cron that tried to detect and correct bad records. I traced down one of these corrections in detail.

I was frustrated because I could not figure out how the records were wrong. In fact, I kicked it into high gear and proved that the records were not wrong. That’s when I made a breakthrough. The workaround itself was buggy. It was rooted in a timing issue. Not a rookie mistake.

Once I knew that I could not trust the work around, I found a real example of the bug. Then I reconstructed what happened to cause the problem. It was tricky too. There was an optimization in one of the database triggers that tried to skip some unnecessary updates. However one of those updates was masking crucial changes from another update. Doh.