ITIL® and Me

In the trenches with ITIL and ITSM.

As I’ve been working on Problem Management at my job I’ve realized that problem solving in this organization is almost like running a mini project.  I’m sure in the perfect ITIL world Problem Management flows beautifully into Change Management once a Root Cause is found and a Change needs to be made to fix it, but that flow depends on a full understanding of what’s going on in the I.T. environment.  Since we don’t have a CMDB, Problem Management and RCA are as difficult as having a blind person drive in the Indianapolis 500; they can still do it, but it’s going to be messy.  And keeping with this analogy of a blind driver, it means that to get to the finish line someone is going to have to give detailed instructions on where to go.  This is why I’m thinking of treating Problems as projects and defining a Problem Life-cycle.

So far I’ve broken the life-cycle down into six phases.  The first phase is Initiation, in which detection, logging and categorization occur.  The second phase is Assessment, where Prioritization occurs based on the impact on the users.  The third phase is Investigation, where investigation and diagnosis occur, as well as finding a workaround to resolve Incidents and minimize impact on the organization.

The fourth phase is Confirmation/Elimination of Root Causes.  I borrowed part of the Kepner-Tregoe methodology of problem solving and considered that during the Problem Life-cycle several possible Root Causes will probably be found, so they’ll need to be tested.  This is the phase where the majority of the work will occur if Event Management or a CMDB is not in place.  Here is where tasks need to be assigned, results need to be reported, and if a task isn’t complete, follow-ups will occur.  This is also the phase that mimics a project, with meeting minutes, action items and a lot of work to ensure involvement from the different technical teams within the I.T. department.  All of this has the goal of finding the Root Cause.  Problem resolution efforts will probably also go back and forth between phase three and phase four, since it’s possible that every candidate root cause will be eliminated and efforts will have to “go back to the drawing board,” so to speak.

Phase five is Analysis, in which the costs and benefits of implementing the Resolution are compared to the costs of staying with a workaround, assuming a workaround has actually been found.  If it’s decided to implement a resolution, then the final phase is reached: Change and Validation.  Here a Request for Change will be submitted and, once it is complete, reporting will need to take place to validate the reduction in Incidents.  The validation is extremely important and often overlooked in my department because of the assumption that our work is flawless, so a Change to resolve a Problem will indeed work.  This phase also lines up with the “Check” phase of the Deming Cycle.
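
To make the six phases a little more concrete, here is a minimal Python sketch of the life-cycle as a small state machine.  Nothing here comes from ITIL itself or from any real tool; the names `Phase`, `ALLOWED` and `ProblemRecord` are my own placeholders, and the only assumption baked in is the one described above: the single backward step is from phase four to phase three.

```python
from enum import Enum


class Phase(Enum):
    INITIATION = 1             # detection, logging, categorization
    ASSESSMENT = 2             # prioritization based on user impact
    INVESTIGATION = 3          # diagnosis and finding a workaround
    CONFIRMATION = 4           # confirm or eliminate candidate root causes
    ANALYSIS = 5               # cost/benefit of a resolution vs. the workaround
    CHANGE_AND_VALIDATION = 6  # Request for Change, then validate fewer Incidents


# Allowed moves between phases (hypothetical rule set).
# The only backward edge is CONFIRMATION -> INVESTIGATION.
ALLOWED = {
    Phase.INITIATION: {Phase.ASSESSMENT},
    Phase.ASSESSMENT: {Phase.INVESTIGATION},
    Phase.INVESTIGATION: {Phase.CONFIRMATION},
    Phase.CONFIRMATION: {Phase.INVESTIGATION, Phase.ANALYSIS},
    Phase.ANALYSIS: {Phase.CHANGE_AND_VALIDATION},
    Phase.CHANGE_AND_VALIDATION: set(),
}


class ProblemRecord:
    """Tracks which life-cycle phase a Problem is in, plus the path it took."""

    def __init__(self, title):
        self.title = title
        self.phase = Phase.INITIATION
        self.history = [Phase.INITIATION]

    def advance(self, target):
        """Move to the target phase, rejecting moves the life-cycle doesn't allow."""
        if target not in ALLOWED[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.name} to {target.name}"
            )
        self.phase = target
        self.history.append(target)


# Example: every candidate root cause is eliminated once, so the Problem
# goes "back to the drawing board" before a root cause is confirmed.
problem = ProblemRecord("Intermittent login failures")
for step in (
    Phase.ASSESSMENT,
    Phase.INVESTIGATION,
    Phase.CONFIRMATION,
    Phase.INVESTIGATION,  # back to phase three
    Phase.CONFIRMATION,
    Phase.ANALYSIS,
    Phase.CHANGE_AND_VALIDATION,
):
    problem.advance(step)

print([p.name for p in problem.history])
```

Keeping the history alongside the current phase is a deliberate choice: the number of trips between phases three and four is exactly the kind of thing worth reporting on later.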

Like a project, I’ve broken down the steps of the Problem Management Process into different divisions, each with its own defined goals, and as I work on this idea I’ll also develop documentation for each phase.  I know there are other ways to do Problem Management, but the beautiful thing about ITIL is that it’s a set of best practices, which means I have the flexibility to mold the processes to my department’s culture, and as my organization changes, the implementation of those processes will change as well.

I just read the IT Skeptic’s post about how APMG is raising the limit of students for an ITIL v3 intermediate course from 12 to 18.  The IT Skeptic makes a very good argument in his blog about how this is degrading ITIL standards and will only hurt the industry (http://www.itskeptic.org/apmg-further-degrade-standards-itil-v3-certificati).  I couldn’t agree with the author more on this point, and to be honest, I’m a little depressed that I’m only now starting in ITIL and will be receiving my certification in v3 and not v2, for the simple reason that the v2 testing standards are much more difficult and the answers require an explanation as opposed to just choosing from “A, B, C, etc.”  I’m certainly not against making the test easier since I’ll be taking those tests, but with easier requirements comes an increased likelihood that a bunch of yahoos will be going into ITIL so it looks good on their resumes.  Since I’m someone that truly likes the framework and loves learning the material and how it can be applied to I.T., it’s degrading to be part of an “ITIL Generation” that doesn’t have a weed-out system.  And I’m not talking about an intelligence filter, but something that helps to keep unmotivated and close-minded people out of ITIL, because the people that take the Intermediate courses should be those that want to see a change for the better, not those who just want to add another line to their resumes or whose job requires them to take a certification course.  For my long-term goals this means I may have to contend with people that have advanced certifications in ITIL v3 but don’t truly understand how it can be applied or how it can change an organization for the better.

And what about the short-term impact of this change?  For me, it doesn’t matter.  I’m learning something new about ITIL each day and this knowledge will only help me in my current job and organization.  So ultimately the change may mean less competent people will have advanced ITIL certifications, but it doesn’t mean I have to join the ranks of the incompetent.  After all, what do you call the bottom-ranking student in medical school?  The answer is “doctor.”

Often in my job I’m referred to as a “bad guy” (since I’m 6’4″ I’ve considered wearing a Darth Vader costume to work, but that’s a topic for another day).  Being a bad guy at my job is something I’ve been used to ever since I started working in my department, and my cohort in ITIL (Brett) is now finding out what it’s like to be a loathed member of a team.  Why are we even bad guys?  Is it because we help create policies and procedures that are accepted by the department leadership and meant to be followed by our coworkers?  Is it because when analysts don’t want to stick with policy we stand our ground and justifiably deny their requests?  If we just look at why we even like policies and processes, is it because we like to have a structure in which the I.T. department can function as efficiently as possible while providing value to the business?  Whatever the reason, the truth is that I’m often looked at as a bad guy.  Some of you may think I’m paranoid, but today I came to a realization that helps to prove my point.

A manager in my department has been really making my job difficult.  This particular manager is in charge of pretty much all of our I.T. infrastructure systems: servers, virtual applications, desktop images; his group is pretty much doing it all.  This also means that when there’s a Problem I’m required to go and question his staff as to why things break.  Often one analyst will give a “root cause” as being another system failing, and when I question why that system failed I’m often met with “I don’t know.”  When I ask how we can find out, the response is either “I don’t know” or “we don’t collect logs from that particular system.”  Needless to say, my attempts to get to a Root Cause from a Problem accomplish nothing except a bunch of meetings that waste time and produce a lot of blank stares (I think the next root cause analysis meeting should be at a local bar so at least I can numb my consciousness).  The manager in question also doesn’t help much, since he doesn’t hold his staff accountable for their actions.  In fact, whenever he’s involved with a Root Cause Analysis session he tends to not add much value except to constantly tell jokes that derail the focus of the meeting.

Why is it so difficult for this manager to follow the processes?  I’ve been racking my brain for the past few days and I think I’ve finally found the answer.  In general, people don’t like to have their actions reviewed and questioned, especially when it has something to do with a failure.  For the past few weeks I’ve been raising a lot of questions about a particular group in my department, and when the analysts can’t provide answers, who is responsible?  Who is accountable?  I could be wrong, but something tells me one of the responsibilities of a manager is to make sure his/her staff are following policies and procedures and are working as efficiently as possible.

When someone questions the actions of a group, it questions the manager’s ability to motivate and lead the staff of that group.  Suddenly it makes sense why this manager is blowing me off and making my job difficult; my succeeding at my job could potentially show that he’s not succeeding at his.  Now, do I really want someone to lose their job?  Absolutely not.  A core book in ITIL is Continual Service Improvement, and improvement could mean improving processes, technology, or even people (preferably by training).

Since I don’t want to be paranoid and just assume this manager is “out to get me,” I’d like to talk about a different manager.  This manager is in charge of our network and communications group, a.k.a. “the network.”  During Root Cause Analysis meetings it’s a given that someone will blame the Problem on network instability.  My instinct is to defend the network team and force the members of the meeting to look for other causes.  Why do I do this?  Is it because the manager of the network team bribes me?  Is it because I have a crush on that manager and I show favoritism?  Or maybe it’s because whenever a network failure occurs I’m almost immediately notified of the outage, even before users contact our Service Desk?  In fact, this particular manager keeps the team running like a well-oiled machine.  Event monitoring is in place, communication is very open, and there’s always plenty of cooperation during Problem investigations.  That manager doesn’t even need to be present and I know the team will work as quickly and as efficiently as if the manager were there.

I’ve given examples of two managers.  One of them seems to be aloof and unaware of their team’s actions, while the other one maintains contact with their team, helps during Root Cause Analysis and is always looking to improve how the team functions.  So, am I really a bad guy for asking questions and trying to find where improvements need to be made?  Maybe.  But if you know about the Deming Cycle then you know that asking those questions is a part of Continual Service Improvement.  I may be viewed as a bad guy by some of my coworkers, but in the spirit of ITIL, I think it’s a good thing.  After all, Darth Vader may have been a bad guy but at least he knew how to maintain order.  And let’s face it, those Stormtroopers were pretty organized and efficient at what they did, even if they couldn’t find two little droids in a desert.

I’ve been working to help a coworker resolve an issue on a Service that involves multiple systems.  At one point an analyst sent an email giving the “it’s not me” response.  This is where Customer Service suffers at the hands of the “not my department” mentality.  Too often analysts/programmers/support staff look for a reason why they’re not responsible for the cause of a problem and are more than happy to announce that they don’t need to be bothered by the issue since it doesn’t involve them.  Going back to the analyst that gave the “it’s not me” response, part of the problem did actually turn out to be an issue with a system he supports.  So instead of looking at the problem and making sure the components that he supports were functioning properly, he just immediately passed over the issue, and now more time is wasted having to go back to that analyst in order to get the issue resolved.

This is where the concept of Service Ownership really helps.  The idea that I.T. provides services removes the cultural trap of I.T. analysts living in a bubble of “this is what I support and I don’t care about anything else.”  In ITIL’s Service Transition book there’s the concept of a Service Design Package (SDP).  An SDP is basically the architectural blueprint of what is needed to provide an I.T. Service, which includes the hardware, software and processes required to make that Service available for users.  The person responsible for making sure that Service is available and working is the Service Owner.  So under ITIL there’s a designated person that’s responsible for the availability of a Service, and now when someone sends that email of “it’s not me” there’s a person responsible for going back and saying “if you can’t fix it, who can?” instead of everyone else ignoring the problem because they are all thinking the same thing:  “It’s not my problem.”

Not only does ITIL force an I.T. department to assign responsibility for Service availability, but a positive side effect is a change in the culture.  When analysts and programmers know that a specific person is in charge of making sure something works, there’s a feeling that their work is being questioned, which it most likely is.  Suddenly the thinking will change from “it’s not me” to “I better make sure it’s not me.”  This helps to push I.T. staff to act in a proactive manner instead of the issue being passed around the different technical groups like it’s a hot potato.  If there’s no ownership of such an issue then eventually a user will complain, management will get involved, and more than likely someone will call a meeting to pull all the different technical teams together to get everyone focused on finding a resolution.

As much as I’d love for everyone to work together, I hate to pull technical people together in those types of meetings because most of the conversation involves one technical analyst stating that another analyst should be doing a specific task, and so on and so forth.  If all the technical teams just checked their systems to begin with and reported this information to the Service Owner, more of those “problem solving” meetings could be avoided and the issue could be resolved in less time and with fewer meetings, which means less time of Service unavailability and a savings in money from not having to pay for people to attend a (possibly) pointless meeting.
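
As a rough illustration of the “check your own systems and report to the Service Owner” idea, here is a small Python sketch.  It is only a thought experiment, not how any ITSM tool or the SDP itself is structured: the `Service` class, the `follow_ups` function, and the example service, teams and components are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """A hypothetical, stripped-down view of a Service and who supports what."""
    name: str
    owner: str
    # Maps each component of the Service to the team that supports it.
    components: Dict[str, str] = field(default_factory=dict)


def follow_ups(service: Service, reports: Dict[str, bool]) -> List[str]:
    """Build the Service Owner's follow-up list.

    `reports` maps a component name to True (checked, healthy) or
    False (checked, unhealthy).  A component missing from `reports`
    means its team has not checked it yet.
    """
    messages = []
    for component, team in sorted(service.components.items()):
        if component not in reports:
            messages.append(
                f"{team}: please check {component} and report to {service.owner}"
            )
        elif not reports[component]:
            messages.append(
                f"{team}: {component} reported unhealthy, please investigate"
            )
    return messages


# Example with made-up names: one team has reported a fault,
# one has reported healthy, and one has not reported at all.
email = Service(
    name="Email",
    owner="Service Owner",
    components={
        "database": "Database team",
        "mail server": "Infrastructure team",
        "network": "Network team",
    },
)
todo = follow_ups(email, {"network": True, "database": False})
for line in todo:
    print(line)
```

The point of the sketch is that silence is treated as “not checked yet” rather than “not my problem,” so the Service Owner always knows exactly who to go back to.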