Failure Propagation

This year our customer wanted some complicated new changes to our system. The boss put me on the job. We actually have a requirements analysis team that is tasked with gathering and documenting the requirements.

Like all complex changes, there was not enough information in the original requirements documentation. Since we had a team that handled this, I sprayed them with all sorts of questions to figure out exactly what the users wanted. They were very responsive in getting answers for me from the customer. This is no small feat. Unfortunately it seems like not all of the answers got recorded in the documentation. I did not care too much since I got all the answers I needed.

I went off and coded a solution. Did a bunch of unit tests. Had to manually set up a lot of data to test the different scenarios. When my code passed all my unit tests, I shipped it to our in-house test team. This is where we started to have some problems. The test team could not decipher the requirements documentation. So they just came over and asked me what the software was supposed to do. Then they asked how I performed my unit tests. Then the test team proceeded to run the same unit tests and counted them as their independent tests.

You can see where this is going. But wait. There is more. After our internal test team passed my software, it got delivered to the customer. The customers themselves have a huge acceptance test team. I got deja vu once I got the call from the customer acceptance test team. They too could not decipher the requirements for my changes. So I told them the same thing I told our own internal test team. Explained the requirements as I understood them through the numerous questions I posed and the answers I got back. I also answered the acceptance test team's questions on how I went about unit testing my changes.

So it comes time for the real users to experience my new changes. And it turns out that I had made a coding error in the middle of my changes. I made some assumptions about some other parts of the system that were not accurate. The result was that my processing did not work according to plan. My unit tests masked this error since I set up the test data the way I assumed the other parts of the system would produce it for real. Turns out our internal test team, as well as the customer acceptance test team, did the same. These multiple levels of supposedly independent tests did not catch the problem.
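Here is the masking effect in miniature. This is only a sketch with made-up names: `upstream_export`, `process`, and the record layout are hypothetical and not from our real system. The hand-built fixture bakes in the same wrong assumption as the code, so that check passes. A check fed by the actual producer exposes the mismatch.

```python
def upstream_export():
    """Stand-in for the other part of the system (hypothetical).

    In this sketch it really emits status codes in lowercase.
    """
    return [{"id": 1, "status": "active"}, {"id": 2, "status": "closed"}]


def process(records):
    """Code under test. It wrongly assumes status codes are uppercase."""
    return [r["id"] for r in records if r["status"] == "ACTIVE"]


def check_with_assumed_data():
    # Fixture built by hand from the developer's own assumption.
    fixture = [{"id": 1, "status": "ACTIVE"}, {"id": 2, "status": "CLOSED"}]
    return process(fixture) == [1]


def check_with_real_upstream():
    # Independent check: the data comes from the actual producer.
    return process(upstream_export()) == [1]


print("assumed data passes:", check_with_assumed_data())
print("real upstream passes:", check_with_real_upstream())
```

A test team that copies the developer's fixture is running the first check again. Only the second kind of check counts as independent.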

I will admit that I was part of the problem here. It is tough to turn down requests for information from overworked testers who do not have the information to do their job. However the true solution is not to provide them with details on how I do my job. I think we need to get to the root cause of why they do not have sufficient information, and correct that problem. Then we can avoid situations like the one we are in now. The real gauge as to whether we learned our lesson is how these test teams handle testing my latest changes to fix the problem. Are they just going to come to me and ask me what went wrong and how I fixed it? If they once again just repeat my unit tests, we will have learned nothing. Let's hope we can be strong here.

Accurate Metrics

I received an e-mail from our project's Help Desk representative. Apparently he devised a new way to track the trouble ticket metrics for the month. So I was curious about my own metrics for trouble tickets I had resolved. I looked at the metrics the Help Desk is reporting for me.

Out of a total of 24 tickets opened, I was assigned 6 of them. This sounds about right. A lot of tickets come my way for resolution. However the "Days Open" statistic for the problems I worked seemed to be very high. My average was 21 days.

If we are talking calendar days, then 21 days is 3 weeks. I normally knock out the problems assigned to me in a few days. Once in a rare while, I will need some information from a user who does not get back to me for a week. But that happens once or twice a year. Maybe they are counting these Days Open as some total time from when the user first encounters the problem to when there is confirmation that the problem is fixed. Sometimes it takes a lot of follow-up to get the user to confirm that the problem has actually been resolved. I don't know. The numbers still seem high to me.
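One way numbers like these can arise is a single long-running ticket dragging up the average. The sketch below uses made-up dates, chosen only so the mean lands at 21. It is not the Help Desk's real data, and I do not know how they actually compute Days Open. It compares the mean against the median for six hypothetical tickets.

```python
from datetime import date
from statistics import mean, median

# Hypothetical tickets as (opened, closed) pairs.
# Five quick fixes plus one ticket that sat waiting on a user.
tickets = [
    (date(2008, 3, 3), date(2008, 3, 5)),
    (date(2008, 3, 4), date(2008, 3, 7)),
    (date(2008, 3, 10), date(2008, 3, 12)),
    (date(2008, 3, 11), date(2008, 3, 14)),
    (date(2008, 3, 17), date(2008, 3, 19)),
    (date(2008, 1, 1), date(2008, 4, 24)),
]

# Calendar days between open and close for each ticket.
days_open = [(closed - opened).days for opened, closed in tickets]

print("days open per ticket:", days_open)
print("mean:", mean(days_open))
print("median:", median(days_open))
```

With this data the mean is 21 while the median is 2.5. If the real report is a plain average, one stale ticket could explain the gap between the metric and my own experience.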

The real kicker in all these metrics is that even with all these detailed stats, I still received a comment from our Project Coordinator. Apparently the Project Manager asked her what the status of our trouble tickets was. We have all these detailed metrics, but nobody actually knows where we stand with our problems.

Customer to Consultants

Our customer had a high priority task for a large volume of data to be updated. The customer thought one of the consultants on the project had done this last year. So they contacted the consultant directly. Another consultant found out and informed the customer that I was the one who did the work last year. So I got the impression the task was coming to me.

I asked my team lead whether I should make this task happen. To my surprise I was informed that the team lead and the project manager decided that we would not do this task. Apparently we first needed an official request for this work. Then we needed to do a cost estimate. And furthermore, the analysis was supposed to tell us that this was a complicated task.

The project manager gets to make these calls. However I inquired whether the PM informed the customer of his decision. Apparently not. Luckily the team lead took the action and let them know this task was not going to be accepted. In the end, the original consultant agreed to take on the task. I wish him luck. There is a short time frame to complete the task, along with some questions about the business rules for the task. Somehow I think my company missed out on some business here. But I do not question when a PM actually comes out and says no. There is not enough of that in the world.

Getting Paid

Back in 2004, our team was starting up a complete rewrite of our system. This was a significant task since the system is complicated and handles a ton of data. During this rewrite the developers got squeezed to produce results based on a crazy schedule. The result was plain insanity.

Management started to realize that things were going bad. So they started a number of efforts to improve morale. One such effort was a contest based on knowledge of the system. Since I was the senior guy on the project, I won this contest. Maybe this was not a fair contest. But I was happy.

My reward for winning the contest was $50. Unfortunately the project manager was low on cash so I got an IOU. For some reason the project manager kept forgetting to pay me, or never had the cash. After a while I wrote off this IOU as bad debt. But I posted the IOU up on my cubicle wall as a badge of courage for enduring the insane rewrite. (As an aside, this rewrite was a failure).

Recently the project manager who was in charge during the rewrite came back to the project. He visited my cubicle and saw my 4-year-old IOU. And at this time he actually paid me back with interest. All I had to do was sign that I had received the money. I am not sure what the moral of this story is. Maybe it is that sometimes you get paid in the end? What do you think?

Taskbar Buttons

Our users started complaining about how the icons in the taskbar stack. Apparently some of our windows no longer stack above the program's icon in the taskbar. Some windows get their own new icon in the taskbar.

At first I could not tell what the users were referring to. All the application windows get their own icon in my taskbar. Then with a little research I realized that most of our users employ the "Group similar taskbar buttons" option in the taskbar properties.

So I had to research this option. A Google search on the topic did not provide much low level information. The most I got was that a registry setting of TaskbarGlomming gets set to 1 when you choose this option. And apparently the number of icons present sometimes determines when the windows stack up on the program icon.
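For anyone who wants to check the setting on a user's machine, here is a minimal sketch using Python's standard `winreg` module. The key path is my assumption about where Windows keeps TaskbarGlomming (under the current user's Explorer "Advanced" settings), so verify it on your own system. On a non-Windows machine, or if the value is missing, the function just returns None.

```python
import sys

# Assumed location of the setting; verify on the target Windows version.
KEY_PATH = r"Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced"
VALUE_NAME = "TaskbarGlomming"


def read_taskbar_glomming():
    """Return the TaskbarGlomming value, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # The registry only exists on Windows.
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except OSError:
        return None  # Key or value is missing on this machine.


setting = read_taskbar_glomming()
if setting is None:
    print("TaskbarGlomming not available on this machine")
else:
    print("Grouping enabled" if setting == 1 else "Grouping disabled")
```

This only reads the option. It says nothing about how the operating system decides which windows count as "similar".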

I did some research. At first I thought the stacking might be related to the name of the executable the user runs. So I created 2 copies of a dummy application. Their windows did not stack. Apparently you need to be running the same application from the same directory for the operating system to consider the windows to be "similar". The problem with our application suite is that we have multiple executables that compose one program that the user works with. It is going to take some magic to make all our applications show up under one icon in the taskbar. But I am a programmer. So I am up to the task (no pun intended). I will let you know how it turns out.

Experience Counts

A developer came to me for guidance with a trouble ticket. He asked if we ever delete data from the database. Apparently the customer had some evidence that some data had disappeared. This seemed very strange.

I told my colleague that there are some scenarios where we do delete data:
  • if there is an error when loading data, we delete the partially loaded data
  • at the end of the year we delete out the old records

None of these scenarios seemed to cover the case that the customer was experiencing. I agreed to take a quick look at the facts. The customer had printed out evidence that the data was there last month. So I checked our audit data in the Production environment. Sure enough the data was loaded and worked with our application. Then I detected a significant clue. The date when the data was loaded was the day when we do our system startup testing. I realized that this would have been test data that gets loaded and then deleted before the Production system goes live.

Once we realized the situation, it was easy to work this trouble ticket. But being able to correlate that specific date with the load testing activities requires insider knowledge of our system. An outsider would have never deduced that this was the case. This goes to show you that sometimes a little key knowledge can go a long way in problem resolution.
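The correlation itself is simple once you know to look for it. Here is a rough sketch with made-up record ids and dates. `STARTUP_TEST_DAYS` and the audit row layout are hypothetical, not our real audit schema. It flags audit records whose load date falls on a known system startup test day.

```python
from datetime import date

# Hypothetical calendar of system startup test days.
STARTUP_TEST_DAYS = {date(2008, 1, 7), date(2008, 2, 4)}

# Hypothetical audit rows: (record id, date loaded).
audit_rows = [
    ("A-100", date(2008, 1, 7)),
    ("A-101", date(2008, 1, 15)),
    ("A-102", date(2008, 2, 4)),
]


def likely_test_loads(rows, test_days):
    """Return ids of records that were loaded on a startup test day."""
    return [record_id for record_id, loaded in rows if loaded in test_days]


print(likely_test_loads(audit_rows, STARTUP_TEST_DAYS))
```

The hard part is not the lookup. It is knowing that the test-day calendar exists and that data loaded on those days gets deleted before go-live.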

Pair Programming

Our customer had a high priority problem. It got assigned to a developer. But when the complaints about the slow progress on the problem came in, it got reassigned to me. So I started trying to gather information about the problem. Tried to call the customer. Did not get a call back.

Another developer wanted to help out. Turns out he had some good ideas. So I shelved my research and teamed up with him. He had some theories as to why the software was not working. We did not have any means to verify his theories for sure. But we decided to go with his instinct and ship out a script to fix some data (which should also fix the problem if the theory holds up).

The funny thing about this is that I have heard about pair programming. However it is my understanding that this is a technique used in software development, not software maintenance. What do you know? It seems to apply just as well for maintenance. We shall see how it turns out for this specific problem.

Help Self by Helping Others

Our team had a high priority trouble ticket come in. It got assigned to a developer who has a number of years of experience, but not a lot of domain knowledge for our system. This developer concluded that the software was working correctly.

My team lead told me that we needed to make sure we got the resolution to this problem correct. So we took a look at the developer's findings. Turns out the software was not working correctly after all. So I gave the developer some pointers.

The developer came back for help. I provided advice on some initial steps that could be taken to bound the problem. Then I showed how to query some of the audit data to get more insight into the problem. I continued to provide advice on which parts of the system to investigate. Basically this was just like the type of help a junior team member would need.

Having spent a long time maintaining the system, I figure this is my duty to help those developers in need. It does take up a lot of my time. But the end goal of serving our customers is met. Today the developer continues to track down the root cause of the high priority problem. However in helping this developer do the research, I have got a couple more ideas about potential causes for a different tough problem I have been researching. My problem is that I cannot get in touch with the user who is encountering the problem. Nevertheless I march on. Maybe it will turn out that assisting a fellow developer will give me the ideas needed to crack this tough problem. I sure hope so.

Instill Excellence

How do you instill excellence in others? Maybe this is an impossible task. But I hope that it can be done.

Being in the software maintenance business, we get a lot of trouble tickets when our users have problems. I frequently like to put myself in my customer's shoes to get the right perspective. A lot of times problem resolution is difficult because our customers use a different language to describe the work they use our software for. Most of the time I take ownership of the problems assigned to me. It is almost as if I am on a mission. I never stop if I find that my software is working correctly. I dig further to check that the data the software is working against is correct. And I also make sure that the customer and I agree on and understand how the software is supposed to work. I think I got this software maintenance thing down pretty well.

The trouble starts when I do some research on a problem, and find that another part of our system is at fault. The process is for me to hand off the problem to a developer on the team that supports the other parts of our system. And many times I do not like what I see. I get a lot of questions from these other team members. A common theme I have seen recently is that developers just state that their code is working correctly. These assertions usually have little or nothing to do with our customer's problems. So far I have just been encouraging the developers to provide better customer service. But it seems the words are not getting through. I see developers say things like "I think it is a data problem". This is preposterous. Even if the issue were a data problem, what kind of due diligence have you done if you can only say "I think"?

Maybe I take my job too seriously. But I think that is what my job is about. I can do top quality work myself. The real challenge is how I can rally other developers to do the same. How can I pass on the motivation? I want to get others excited about resolving customer problems. Maybe I need to take some time to soul search in order to determine where my own excitement comes from. A lot of it has to do with building a relationship with many customers who use my software. Their mission and their careers often hinge on whether the software gets the job done. Now that is a good motivation. I think that if I can channel this motivation to others, I will have achieved leadership. Right now it is a struggle though.

Trouble Ticket Trouble

I had been out sick for a couple days. There were a number of trouble tickets assigned to me. These got reassigned to other developers on the team. When I came back, I talked to each developer to make sure they took over and worked the issues to completion. One such developer researched a problem, got customer input, and wrote a script to fix the problem. Good work.

However I approached another developer that got one of my tickets. I had a bad feeling as they were reading the newspaper online. So I asked about the ticket that got reassigned to them. No progress had been made. They were waiting to get access to the Production data. Since this person did not directly work for me, I just made sure they understood that the ticket was their responsibility now.

Later the newspaper reading developer came to visit me. Said that it would be better if I took back the trouble ticket. Actually this sounded like the best plan of action for the customer. So I dug into the problem and figured out all the questions I needed to ask the customer who encountered the problem. The only encouraging thing I heard from this developer was that they had a lot of business rule questions regarding a different trouble ticket assigned to them. Since I know a lot about the customer's business, I gave the developer the info needed to proceed. And I hoped the work done on that ticket would be sufficient to help the customer.

Passing the Buck

A user noticed that some data was missing in the system. So the user notified their manager. The manager relayed the information up to headquarters. Headquarters asked a system administrator if they knew whether the data was loaded. The system administrator had no idea what the issue was or how to even check whether the data was loaded.

So the system administrator contacted a member of our performance engineering team. The performance engineer asked the project manager and development team lead if they had any ideas. Now after a while the project manager inquired whether anybody gave the performance engineer any ideas. This was the first time I heard about this issue.

I am a senior developer on my project. It just so happens that I have been here a long time. I know how to check where the breakdown occurred that resulted in the missing data. But I went and had a chat with the project manager. I don't like rewarding backwards requests for information like this, because doing so normally results in abuse and reliance on doing things the wrong way. So I told our customer's headquarters that indeed the data is missing, and that the best way to proceed would be to follow procedure and open a trouble ticket.

The amusing thing about this problem is that along the way, it reached a multitude of people who had no clue as to how to figure out whether the data got loaded. At each step the person just passed the buck to somebody else. I was going to name this post "The Buck Stops Here" because I am one of those rare individuals on the project who knows how everything works. But that is not the point of the post. The point is that information has value. And when you do not have the information or the ability to figure out the information, apparently all you can do is pass the problem along to somebody else. This is an inefficient system.

On a lighter note, I sometimes bend the rules for those I like. And I will do a 5 minute investigation to provide critical data for some of my customers. The problem with this is that if the word ever gets out, a lot of people will want to call me and get help. If all I did was browse the web all day, I could honor these requests. But I have a lot of duties and need to keep these distractions to a minimum. Such is life.

A Day in the Life

I thought I would chronicle a typical day in the life of a Software Maintenance Engineer. Now some of the day's events may not be typical for all since I am a senior developer, and have been on the project for almost 8 years. So some of this will also be a day in the life of a Subject Matter Expert.

First thing when I come in is check the ton of e-mails and voice mail messages. I just filter through them to figure out what things I need to do immediately. Then I head off to a meeting chaired by our project manager. There I bring up some issues regarding the team's response to a trouble ticket.

Then I call one of our customers who oversees our software across the nation. I explain some of the updates I provided in our trouble ticket system for a problem she is tracking. I tell her the technical reason why the software is failing and how we plan to fix this problem. Then I take an action item to help the customer gauge the impact of the problem.

Next I check in some code changes for another trouble ticket. Got to apply my changes to all branches in source code control. That is because we maintain more than one version of the software. Once the code is checked in, I generate the documentation to release an update to our users.

I get a confusing e-mail from our requirements analysis team. Apparently the project manager asked them a confusing question, and they could not comprehend the requirements for a certain piece of our software. So they came to me. I figured this was important enough to explain in person. So I went and gave the requirements guy some background on the software, the inputs, and our required processing. I then gave the same speech to the project manager, and then to the developer working on that piece of the software.

Then I tackled a new software problem. Called up the customer and got some examples of the problem. Checked and verified that this was an actual problem. In fact, I queried the production data and found some new functionality was non-existent. So I spent some time to set up some test data. But I could not make the problem happen in a development environment. So I asked one of our DBAs to get me a copy of the database code that was compiled into the Production environment. All the work is done in an Oracle PL/SQL package.

Finally I looked into one other new problem that was described as "a tough one". At first I thought I could duplicate the problem even in development. However on closer inspection of the steps leading up to the problem, I decided it was too early to tell if this was the same problem. Either way I found at least one scenario that causes a problem in the software. By this time I had worked a long day and it was time to bolt.