Reproducing a Race Condition

We have a job at work that runs every Wednesday night. All of a sudden, it aborted the last 2 weeks. This caused some critical data to be late. The main question was what changed? Initial analysis indicated nothing relevant changed. However things did change. An architect identified the most likely candidate. Some new job was trying to improve performance. This new job ran 10 to 11 minutes before the Wednesday job aborted each time.

I babysat the job this week. We ran it one day early to give us time in case we encountered errors. Of course the job ran fine when I ran it during the day. So I declared the next order of business was to replicate this problem in development. My old architect buddy thought it could be done. Maybe not. But I was going to give it the old college try.

I worked with a developer from the team that introduced the new performance job at night. We both ran our scripts at the same time. After a couple tries, I had still not replicated the problem. I told the dude to give me a copy of his code. Then I went to work. I spent 30 minutes reading up on the UNIX Korn shell. Then I wrote a Korn shell script to (1) generate new data, (2) run my job in the background, (3) wait a bit, and (4) run the new job.

Now I varied the amount of new data added each time from 1 record to 300 records. I also varied the amount of time waiting from 1 second to 5 minutes. Made sure to log everything. And since I launched this thing from an interactive shell, I prepended it with a "nohup" and ran my main script in the background too. I estimated best case this thing would finish in 12 hours. However it might run for as long as 24 hours. Luckily it is Friday. I imagine nobody is working in the development environment on a weekend.