When does an incident become a problem? This is a question often found on an ITIL exam in one form or another. The answer, of course, is “never”. To some, this may seem like the wrong answer. If you are the person responsible for managing a service that has frequent incidents, or worse, the person responsible for implementing a change that is now causing incidents, the fact that incidents exist may be a problem for your career, but the incident itself never turns into a problem.
Turn the question around, though, and the answer to “When does a problem become an incident?” is “almost always”, or at least that will be the case if your organization does not have a solid problem management process in place. By solid, I mean one that does both reactive and proactive problem management. Reactive problem management is root cause analysis of incidents that have already occurred, done to identify and remove the underlying problems, and it is generally conducted as part of Service Operation.
Proactive problem management, on the other hand, is often addressed in processes belonging to the Service Strategy, Service Design, Service Transition, and Continual Service Improvement stages of the Lifecycle. These include Demand Management, Release and Deployment Management, Availability and Capacity Management, Configuration Management, and Continual Service Improvement.
- In Demand Management, analyze your patterns of business activity to find peak transaction periods and spot capacity issues that could cause service incidents before they occur.
- As Release and Deployment activities occur, ensure that any known errors or defects being accepted into production are recorded in the Known Error Database along with any workarounds.
- As part of Availability Management, look for ways to reduce downtime and increase uptime, as well as ways to identify conditions that will help predict when services may be impacted.
- As part of Capacity Management, identify capacity thresholds that, when reached, will alert support teams or kick off automation that corrects the issue before incidents can occur.
- Use the information in your Configuration Management System to identify CIs that have a high failure rate, and perform an analysis to determine whether there is a common root cause.
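
To make the last point a little more concrete, here is a minimal Python sketch of that kind of analysis. Everything in it is an assumption for illustration: the incident records, the CI names, the category labels, the `flag_problem_candidates` function, and the threshold of three incidents are all hypothetical, and a real Configuration Management System would supply this data through its own reports or exports.

```python
from collections import Counter, defaultdict

# Hypothetical incident records exported from an ITSM tool:
# (CI identifier, incident category). Names and values are illustrative only.
incidents = [
    ("srv-01", "disk"),
    ("srv-01", "disk"),
    ("srv-01", "memory"),
    ("srv-01", "disk"),
    ("srv-02", "network"),
    ("db-01", "disk"),
    ("db-01", "disk"),
    ("db-01", "disk"),
]


def flag_problem_candidates(records, min_incidents=3):
    """Return CIs with at least `min_incidents` incidents.

    Each result maps a CI to a tuple of
    (total incidents, most common category, count of that category).
    The threshold is an assumed value, not an ITIL recommendation.
    """
    categories_by_ci = defaultdict(list)
    for ci_id, category in records:
        categories_by_ci[ci_id].append(category)

    candidates = {}
    for ci_id, categories in categories_by_ci.items():
        if len(categories) >= min_incidents:
            top_category, top_count = Counter(categories).most_common(1)[0]
            candidates[ci_id] = (len(categories), top_category, top_count)
    return candidates


candidates = flag_problem_candidates(incidents)
for ci_id, (total, category, count) in sorted(candidates.items()):
    print(f"{ci_id}: {total} incidents, {count} in category '{category}'")
```

A CI that shows up here with most of its incidents in one category is a reasonable candidate for a problem record. When several CIs share the same dominant category, that is a hint, though not proof, that they share a root cause.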
These are just some of the ways to become more proactive in Problem Management. What makes all of this possible is having the right key performance indicators and metrics in place to support the level of analysis needed for both reactive and proactive problem management.
Posted by: Chuck Spencer