Timing is Everything

I was assigned a tricky problem to work on. Some of the data was missing in our production environment. Everything seemed to work fine in development and test. Trouble is, I did not have access to the production environment. I got someone to run some queries on my behalf.
 
The source data was correct. The target tables had records inserted. But they were missing key fields. This was quite the mystery. I had to back up and look at the big picture. Normally this is not needed. But in this case, it was crucial.
 
We have some main data loaded into some staging tables. Then some processes run after the initial load that collect all the data and stuff it into a couple of main tables. Then those main tables are copied to the reporting machine. It is here that my jobs run to mine the main tables to update the reporting tables.
 
Turns out the copy to the reporting machine is delayed every so often. Wouldn’t you know it? My jobs are not actually scheduled to wait until the copying of source data is completed. Usually it happens way before my jobs run. But not always.
 
The fix is to add in the dependency, and fix all those records that got loaded with blanks. The hard part of this problem was figuring out the source of the problem. Wasn’t really a coding issue. More a timing issue.
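
Here is a minimal sketch of what "adding the dependency" can look like, assuming the job can ask whether the copy has finished. The names wait_for_copy, run_reporting_job, and is_copy_done are hypothetical; a real scheduler would more likely express this as a job dependency than as a polling loop.

```python
import time


def wait_for_copy(is_copy_done, timeout_s=3600, poll_s=60,
                  sleep=time.sleep, clock=time.monotonic):
    """Block until is_copy_done() returns True, or raise after timeout_s.

    is_copy_done is a hypothetical callable, e.g. a query against a
    load-status table on the reporting machine.
    """
    deadline = clock() + timeout_s
    while not is_copy_done():
        if clock() >= deadline:
            # Fail loudly instead of loading rows with blank fields.
            raise TimeoutError("source copy did not finish in time")
        sleep(poll_s)


def run_reporting_job(is_copy_done, job, **wait_kwargs):
    """Run job() only after the source copy has completed."""
    wait_for_copy(is_copy_done, **wait_kwargs)
    return job()
```

The point of the timeout is that a late copy becomes a visible failure rather than a silent load of incomplete records.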

You're Doing It Wrong

I got on a conference call for our latest software release. The new project manager asked the test team to verify a bunch of tickets. A bunch came out failed and ended up in my queue. The manager asked me if I had any insight on these tickets.

Luckily, any time I do work on a ticket, I enter a note in our ticketing system. I had seen all these tickets before. I researched them and determined a level of effort for them. However, none of them have been scheduled for a release yet. So the work to resolve the tickets has not even started. Of course they are going to fail if someone tests them now.

I reported what I knew about these tickets. Nobody should have been testing them. The manager said he wanted testing to check the actual state of the problems in the system. Fair enough. But these are not failures. They are in a state where the work has not even started.

At least I am not on the testing team. It is no fun to spend your time testing fixes that have not even been done yet. The good news is that the testers should be ready to verify these as soon as there is a fix. Who knows when that will be?

Software Upgrade Rumble

We use some software from the same vendor for issue tracking and source code control. The customer has declared that we will be upgrading to a new version of both of these clients. From past experience, I expected this to take about a week to happen.


After being given the green light, I tried to download the update. Too many people were doing that. So I got put into a queue. I was number 14, and then 13, then 12. This lasted all afternoon and evening. My position in line got updated once a minute.


Then midnight struck. The updates every minute stopped. Apparently the queue could not handle the transition at midnight to the next day. It got confused and thought instead of an update in 1 minute, it would give me an update 21 hours later. Nope.
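
The vendor's code is not available here, so the following is only a hypothetical sketch of the general class of bug: computing a delay from the time of day alone, which loses the date change at midnight. It does not reproduce the 21 hour figure. The function names and the date are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def seconds_until_naive(now, next_update):
    """Buggy on purpose: uses time of day only, so the date is ignored."""
    now_s = now.hour * 3600 + now.minute * 60 + now.second
    next_s = (next_update.hour * 3600 + next_update.minute * 60
              + next_update.second)
    return next_s - now_s


def seconds_until(now, next_update):
    """Subtract full datetimes so the day rollover is handled."""
    return (next_update - now).total_seconds()


# Arbitrary example: 30 seconds before midnight, next update due in 1 minute.
now = datetime(2020, 1, 1, 23, 59, 30)
next_update = now + timedelta(minutes=1)

print(seconds_until_naive(now, next_update))  # negative, nearly a full day off
print(seconds_until(now, next_update))        # 60.0
```

Any code that then tries to "repair" a nonsense delay like the first one can end up scheduling the next update hours away instead of one minute away.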


I killed my download and started again. This time it worked. By the next morning, the software packages had downloaded. They asked me if I agreed to uninstall the old versions. I concurred. Then a reboot was required.


I checked my software. Old versions were gone. New versions nowhere to be seen. I dialed into a huge conference for help. They were of no help. Some other guy on my project noticed my problems and told me to start again from the beginning.


Well I knew to wait until after midnight to start up the download. This time it went by faster. The software got installed. But when I tried to run it and access some code, it complained that it could not connect to the license manager. I will stop here. There was a lot more pain. Basically a fail. And I hear they want to switch to some other software later this year.

Good-fast-cheap. Pick two.

I got invited to a meeting with the customer today. There was a problem in production. And the customer wanted answers. When it came time, I explained what was going on. Our new system used the same data source as the old one. But we were not doing the exact same transform of data. Therefore they saw different values in the reporting system.

I said we could fix this by updating the function that loads the data. We could apply this transform there. We could also write a script to correct all the wrong data loaded so far. Should close the case. The customer understood that. But then they asked if we could be proactive and prevent this type of problem.

Immediately I saw through the request. I said we could take the action to find out other times when the old system does a transform, and make sure we have those transforms present. However no, I could not do anything to make sure everything was being done 100% correctly. This is a huge system with non-trivial software involved. There are bugs in there.

Essentially I responded that no I could not do anything to make the reporting results 100% correct. It is just not feasible. This is an application of the good/fast/cheap management trifecta. They only get to pick 2 of those. The third one will suffer. In our scenario, the schedule is fixed, and the cost is fixed. Therefore you get the quality that you get. No way to increase that without more money or longer schedules.

Trouble with Technical Books

I self-register as a PL/SQL Oracle database developer with my company. So it came as no surprise when I got a targeted email from my company. It highlighted the book Mastering Oracle PL/SQL: Practical Solutions.

This book is ancient. It was published back in 2004. The funny thing is that most of the information is still relevant today. I checked on Amazon. There seems to be an update to the book from 2013. But it is selling used for $800. WTF?

I went through the introduction section of the book. It had me set up AUTOTRACE in my Oracle database instance. I also read up on SQL trace and using tkprof. Not bad seeing how I have not even made it to Chapter 1 yet.

The only downside is that the book is made available to me through the Books 24x7 web site. My company has some sort of bulk license with them. The web user interface is painful. I had to go through each page. But the pages do not fit on my screen. So there is a lot of scrolling with their custom controls.

I tried to print out a copy of the book. I have the ability to print to a file. Unfortunately the Books 24x7 site intercepts print requests. All I got in the output was a page with copyright information on it. How useless.

So I went through the whole book, copying and pasting the pages into Word documents, one per chapter. Now I can read at my leisure. I could buy myself a copy from Amazon. New ones go for $19. I can also pick up a used copy there for under $8. Could be a good investment.

Simple Solution Wins

The customer found some problems in our reporting system. Whole parts of data were missing. Counts were not matching up. I was put on the task of resolving this. Had to dig a bit to find the root cause. There was a view that depended on another view that depended on some data being loaded. Many levels of indirection going on here.


Truly fixing all the problems would require a lot of analysis. It might also slow down the system. Then I decided to look at this from another viewpoint. What was the true minimal cause of the problem? Once I found that, I attacked it with some solutions that were outside of the box.


I had to change just 2 lines of code. I love it when I get a solution like this, even if it is sort of a hack.

Finding a Way

I got word from my manager that a new problem would be resolved imminently. I had not even looked at the problem. So I dug in. I told my manager that it would not be done on time. The customer got angry. All of a sudden, I am approved to work all kinds of overtime.

I was tired. So at first I tried to run the 10 jobs that generate the data responsible for the problem. The first job took almost 3 hours before it died due to an out of memory error. I got the second job to run. The third job never finished and the database shut down for maintenance at night.

Ouch. A new deadline was created. Again I told my boss we would not be making the new deadline. Then I got offers for help from another team. I declined. They would only come, ask questions, and literally waste time. Finally time to start working smart.

Instead of running these massive jobs, I decided to inspect the code. I figured I should just study each function until I was 90% sure it wasn't a problem. Then I would move on to the next one. I got to the 9th of 10 functions. This one had some trouble. This was it.

This was the sort of problem where the research is the majority of the work. I coded up a solution, tested it, and passed it on. It is still going to be a tight schedule with the configuration management and test people. But I got it done.