IoT Optimizes Major Incident Handling

Major Incidents occur on a daily basis across all industry types. IoT provides an innovative platform enabling a consistent workflow process which will handle major incidents in a manner that reduces their adverse consequences.

Image of a comic book character running away from a cybersecurity incident
Illustration: © IoT For All

Major incidents occur daily across all industry types. Examples include rolling blackouts from a power grid under strain, or a cyclone hitting a coastline with high winds and a storm surge that causes extensive damage and loss of life. Currently, most of these incidents are handled reactively and in crisis mode. The Internet of Things (IoT) provides a platform, and enables a consistent workflow process, for handling these major incidents in a manner that reduces their adverse consequences.

The high-level workflow associated with a major incident can be integrated with IoT devices, especially at the sensor and wearable level. The primary benefit is that IoT can precisely record the time of each step in the major incident workflow. Time is crucial in analysis and is the cornerstone of establishing causation. The diagram below presents the major incident workflow and indicates the important times that need to be noted:

Image Credit: Ron Bartels

The detailed steps are as follows:

  • Time when the major incident started: This is the actual time when the disaster or crisis began, when something of significant negative consequence happened to inventory items or a large risk event was triggered.
  • Time when the incident was detected: A major incident is detected by IoT sensors, by people assigned to the task, or via a third party.
  • Time of diagnosis: This is when the underlying cause is identified, and we know what happened beyond the visual or immediate symptoms presented.
  • Time of repair: The process of fixing failures has started, or corrective action has been initiated.
  • Time of recovery: Components have been recovered, which means that the inventory is available again for production and business is ready to resume.
  • Time of restoration: Normal operations have resumed, and the inventory item or associated service is back in full production.
  • Time of workaround: When a service is back in production with a workaround.  An example would be where the utility power for a data center has failed and the standby generator has kicked in.
  • Time of escalation: When the major incident is escalated to a higher level, such as the introduction of a Tiger Team.
  • Duration service was unavailable: The downtime as measured and determined by Service Level Agreements (SLA).
  • Duration service was degraded: Often overlooked in SLAs, this refers to the period when the service was not able to operate at optimal or expected capacity levels.
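The timestamps listed above could be captured in a simple record, with the elapsed time between consecutive milestones derived from it. The following is a minimal Python sketch, not a reference implementation: the class name, field names and `phase_durations` method are illustrative assumptions, and the two SLA durations are left out because how they are measured depends on the SLA in question.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class MajorIncidentTimeline:
    """Timestamps for one major incident (all names are hypothetical)."""
    started: datetime
    detected: datetime
    diagnosed: Optional[datetime] = None
    repair_started: Optional[datetime] = None
    recovered: Optional[datetime] = None
    restored: Optional[datetime] = None
    workaround: Optional[datetime] = None   # recorded, not part of the main sequence
    escalated: Optional[datetime] = None    # recorded, not part of the main sequence

    def phase_durations(self) -> Dict[str, timedelta]:
        """Elapsed time between consecutive milestones that were recorded."""
        milestones = [
            ("started", self.started),
            ("detected", self.detected),
            ("diagnosed", self.diagnosed),
            ("repair_started", self.repair_started),
            ("recovered", self.recovered),
            ("restored", self.restored),
        ]
        # Skip milestones that were never captured so gaps do not raise errors.
        recorded = [(name, t) for name, t in milestones if t is not None]
        return {
            f"{a}->{b}": tb - ta
            for (a, ta), (b, tb) in zip(recorded, recorded[1:])
        }


# Example with made-up times: detection took 12 minutes, diagnosis a further 35.
t0 = datetime(2024, 1, 1, 8, 0)
incident = MajorIncidentTimeline(
    started=t0,
    detected=t0 + timedelta(minutes=12),
    diagnosed=t0 + timedelta(minutes=47),
    restored=t0 + timedelta(hours=3),
)
for phase, elapsed in incident.phase_durations().items():
    print(phase, elapsed)
```

Keeping every milestone optional except the start and detection times reflects that, in practice, some steps may not be recorded for every incident.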

When graphed, these times allow analysts to better understand where problems exist within the workflow. For example, is there a delay in major incidents being detected digitally, or is a disproportionate amount of time being spent diagnosing the underlying causes? It's also useful to aggregate these statistics over multiple major incidents to understand trends.
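As a rough illustration of aggregating over multiple incidents, the sketch below takes per-incident phase durations in minutes and reports the median for each phase, which points to the phase that is typically slowest. The function name, phase names and all figures are made up for the example.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def median_phase_minutes(incidents: List[Dict[str, float]]) -> Dict[str, float]:
    """Median duration per workflow phase across many incidents."""
    by_phase = defaultdict(list)
    for phases in incidents:
        for phase, minutes in phases.items():
            by_phase[phase].append(minutes)
    return {phase: median(values) for phase, values in by_phase.items()}


# Hypothetical history: minutes spent in each phase for three incidents.
history = [
    {"detect": 12, "diagnose": 35, "repair": 60},
    {"detect": 45, "diagnose": 20, "repair": 55},
    {"detect": 30, "diagnose": 25, "repair": 90},
]

medians = median_phase_minutes(history)
slowest = max(medians, key=medians.get)
print(medians)
print("Phase to investigate first:", slowest)
```

The median is used here rather than the mean so that a single unusually long incident does not dominate the trend; that choice is an assumption, not something the workflow prescribes.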

These statistics can also be extrapolated to define and set appropriate SLA times and durations. Recording only the start and end time of a major incident provides little benefit, as it gives no guidance on where time improvements are possible. The workflow recommended above is a significant improvement. An IoT platform or system therefore needs to have this workflow knowledge embedded, and the workflow is the basis upon which AI can be leveraged to deal with major incidents.

IoT platforms that deal with major incidents also need to incorporate a method to gauge risks, which in this case are the probabilities of loss, failures and errors. This aligns with the CIA (confidentiality, integrity and availability) metrics, which are defined as:

  • Confidentiality: Inventory is only accessible to those authorized to use it. This is associated with loss.
  • Integrity: Safeguarding the accuracy and completeness of inventory. This is associated with errors.
  • Availability: Authorized use is provided when required. This is associated with failure.

The diagram below contextualizes the risk. Normal operating conditions are impacted by unexpected activity. If proactive mitigations are in place for that activity, then workarounds are possible and a major incident can be avoided. Should the mitigations not address the problems being caused by the event, then there is an expectation that reactive countermeasures can reduce any further negative consequences. A lack of countermeasures, or poor ones, will trigger a full-blown major incident.

Image Credit: Ron Bartels
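The escalation path described for this diagram can be sketched as a small decision function. This is a simplified Python illustration under stated assumptions: the `Outcome` names and the three boolean inputs are hypothetical, and a real platform would derive them from sensor data rather than receive them as flags.

```python
from enum import Enum


class Outcome(Enum):
    NORMAL = "normal operations"
    WORKAROUND = "workaround in place, major incident avoided"
    CONTAINED = "major incident, consequences reduced by countermeasures"
    FULL_BLOWN = "full-blown major incident"


def classify(unexpected_activity: bool,
             mitigation_effective: bool,
             countermeasure_effective: bool) -> Outcome:
    """Walk the path: proactive mitigation first, reactive countermeasure second."""
    if not unexpected_activity:
        return Outcome.NORMAL
    if mitigation_effective:
        return Outcome.WORKAROUND
    if countermeasure_effective:
        return Outcome.CONTAINED
    return Outcome.FULL_BLOWN


# Unexpected activity, no working mitigation, but countermeasures succeed.
print(classify(True, False, True).value)
```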

IoT should be able to deliver on the digitalization of this risk management by both managing the proactive mitigations and triggering the reactive countermeasures. The benefit here is time. Digitalizing the process can reduce the negative impact of a major incident, as algorithms can react faster than humans and help identify causation sooner. The algorithm can also learn from previous incidents, creating a feedback loop that allows the entities experiencing major incidents to incrementally improve operations over time.

IoT platforms that provide workflow and risk management are key to dealing with major incidents; an IoT implementation lacking this functionality is superficial at best.