The incident was resolved by replacing the hardware. Should the incident be kept open to manage the repair process?

By ITIL® from Experience© with contribution by John Gabel

Here is a scenario:

A piece of hardware caused an incident1
To resolve the incident, the hardware is replaced with a spare from storage
To manage the repair of the defective hardware…

1. The incident is kept open but a stop-the-clock action is taken, or

2. The incident is closed and:
a) A Problem Record is logged, or
b) A Request For Change (RFC) is logged, or
c) A Service Request is logged

Let us examine each option to determine which one is best.

1. The incident is kept open but a stop-the-clock action is taken
Although this avoids creating multiple records for the same Event2, the incident management process has actually done its job at this point, since its purpose is to “…ensure that normal service operation is restored as quickly as possible and the business impact is minimized.”3 Keeping the incident open also adversely affects Incident Management metrics such as total duration, even though the SLA would not be breached because the SLA clock has been stopped.
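To make the metrics point concrete, here is a minimal Python sketch. The timeline (two hours to restore service, fourteen days for the repair) and the function name `incident_durations` are assumptions made for illustration only; they do not come from any particular ITSM tool.

```python
from datetime import datetime, timedelta


def incident_durations(opened, closed, pauses):
    """Return (total_duration, sla_elapsed) for one incident.

    `pauses` is a list of (start, end) stop-the-clock intervals, assumed
    to be non-overlapping and to fall between `opened` and `closed`.
    """
    total = closed - opened
    paused = sum((end - start for start, end in pauses), timedelta())
    return total, total - paused


# Hypothetical timeline: service is restored with the spare after 2 hours,
# then the clock is stopped for 14 days while the defective unit is repaired.
opened = datetime(2024, 1, 1, 9, 0)
restored = opened + timedelta(hours=2)
closed = restored + timedelta(days=14)

total, sla_elapsed = incident_durations(opened, closed, [(restored, closed)])
print(total)        # 14 days, 2:00:00
print(sla_elapsed)  # 2:00:00
```

Under these assumptions the SLA sees only two hours, but any report based on total duration shows an incident that took more than fourteen days.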

Also, in practical terms, not all ITSM tools can launch a workflow from an incident record, especially after the incident has been resolved. In addition, if the supplier is involved in conducting the repair, the ITSM tool may not be able to add a new SLA to a resolved incident in order to account for the supplier’s agreement.

2a) The incident is closed and a Problem Record is opened
The incident management metrics are not affected since the incident is closed, but Problem4 Management is primarily an investigative process. It identifies problems, determines their root cause, and then either prepares known issues for the knowledge base or fixes the cause by initiating a change. In this scenario, the objective is not to determine the cause of the incident but to manage the repair of the defective hardware. In addition, it is not feasible for most organizations to log a problem for every unexplained incident.

If many incidents are generated by failures of this type of hardware, a problem may be opened to investigate the cause of those failures. This is discussed further, following the analysis of the options.

2b) The incident is closed and an RFC is logged
The incident management metrics are not affected since the incident is closed, but in this case repairing the hardware does not change the infrastructure. The repair may not generate a change immediately, since the repaired hardware would be returned to storage until it is ready to be deployed.

Also, depending on the scope of the change management process, this Configuration Item (CI) may not be under the control of change management, yet the repair process still needs to be managed.

2c) The incident is closed and a Service Request is logged
The incident management metrics are not affected since the incident is closed, and the Request Fulfillment process can be used to manage the repair, with a supplier SLA attached where a supplier is involved. A workflow may also be available. If the defective hardware cannot be repaired, an acquisition process could be launched to replace it and ensure that a spare is available in storage to address a future failure.

The recommendation is option 2c: the Service Request process is the most appropriate way to handle the repair of the replaced hardware. There is no need to open a Problem Record or an RFC, although the Service Request logged in 2c may later lead to an RFC to put the repaired hardware back in service.

Let’s continue the story to understand when a problem would be logged…

The repaired or new hardware comes in and is tested but fails during testing. An incident would not be logged for this Event since the hardware was simply being tested – it didn’t even have time to generate an incident. In this case neither a problem nor a change would be logged. The hardware would simply be returned to the supplier, using the original service request or a new one to take advantage of a workflow and/or a supplier SLA.

Or…

The repaired or new hardware is functional and deployed (via an RFC) but fails the following month. A problem would be logged if this hardware is important enough to warrant investigating why it keeps failing. If this hardware is not mission critical or highly visible, problem management may still identify the situation through trend analysis and “proactively prevent incidents from happening and minimizes the impact of incidents that cannot be prevented.”5
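The reasoning in this article can be summarized as a small decision sketch in Python. This is only an illustration of the logic described above: the function name, its parameters, and the record labels are hypothetical and do not correspond to any ITSM tool's API.

```python
def follow_up_records(passed_testing, deployed=False,
                      failed_again=False, important=False):
    """List the records used after the incident itself has been closed."""
    # Option 2c: the repair is always managed through a service request.
    records = ["service request"]

    if not passed_testing:
        # Failed on the test bench: the unit goes back to the supplier under
        # the original (or a new) service request. No incident, problem or
        # change is logged.
        return records

    if deployed:
        # Putting the repaired or new unit back in service is a change.
        records.append("rfc")
        if failed_again and important:
            # Repeated failures of important hardware justify an investigation.
            # For less important hardware, trend analysis may still raise a
            # problem later; that path is not modeled here.
            records.append("problem")

    return records


print(follow_up_records(passed_testing=False))
print(follow_up_records(True, deployed=True, failed_again=True, important=True))
```

The first call covers the unit that fails during testing, and the second covers the unit that is deployed and then fails again the following month.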
