Fixing the Release

Last week we had a big software release due. The build went to our internal test team by the end of the week. They found one application blowing up with an Oracle exception. I told our team lead that we had better get to the bottom of this problem. The lead thought it was just a discrepancy between the expected and actual database versions, and that a database change would resolve it. With that in mind, I left for the weekend.

It turned out the database change did not fix the problem. A bunch of people on the team got together late Friday night to try to figure it out. They left me some emails and voice messages, but by that time I was long gone. When I got back to work on Monday, we were in a state of emergency. My team lead said our company was losing money because we were late on the software release.

Apparently my team lead had spent the weekend trying to determine the cause of the problem. He still thought it had something to do with recent database changes. That did not seem encouraging. In software development it is not enough to think you know the cause. You have to know. So we were nowhere with the problem. I was assigned the task of figuring it out. I tried to duplicate the problem by running equivalent SQL directly against the database, but I had no luck.

I started applying my normal techniques. The next thing I tried was to run the application against an old version of the database. It had the same problem. At that point I ruled out the new database changes as the cause. Finally I started reviewing the history of the files containing the code that was crashing. A developer had recently tried to fix a problem in one of those files. I rolled back those changes, which confirmed them as the source of the problem. At that point we were able to continue with the software release.
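
The elimination described above can be sketched in code. This is only an illustration under assumptions, not the procedure used on the release, where a single recent change was rolled back by hand. The function name find_first_bad, the revision labels, and the crashes check are all hypothetical. The sketch assumes the revisions are ordered oldest to newest, the oldest one works, the newest one crashes, and the crash never disappears once it has been introduced.

```python
from typing import Callable, Sequence


def find_first_bad(revisions: Sequence[str], crashes: Callable[[str], bool]) -> str:
    """Return the earliest revision for which crashes(rev) is True.

    Hypothetical helper: revisions must be ordered oldest to newest,
    the oldest must work, and the newest must crash.
    """
    if not revisions:
        raise ValueError("no revisions to search")
    if crashes(revisions[0]):
        raise ValueError("oldest revision already crashes")
    if not crashes(revisions[-1]):
        raise ValueError("newest revision does not crash")

    good, bad = 0, len(revisions) - 1  # known-good and known-bad positions
    while bad - good > 1:
        mid = (good + bad) // 2
        if crashes(revisions[mid]):
            bad = mid   # the crash was introduced at mid or earlier
        else:
            good = mid  # the crash was introduced after mid
    return revisions[bad]


# Made-up history in which the bad change arrived in "r104".
history = ["r101", "r102", "r103", "r104", "r105"]
culprit = find_first_bad(history, lambda rev: history.index(rev) >= 3)
```

In practice the crashes check would build that revision and run the failing scenario against it. Halving the search range each time keeps the number of builds small even when the file history is long.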

In the end we were only one day late. That is still not a good thing, though I did not lose any sleep over it. What could we have done to avoid this in the first place? We could have eliminated rushed, last-minute changes to the application. Or at least we could have run sufficient regression testing on the late changes. Better yet, we should have done a more thorough peer review of those changes. Let’s see if our project learns anything from these mistakes.