Measuring Service Dependability Metrics: MTTR vs MTTF vs MTBF vs MTTD

Metrics are useful performance indicators that can be measured across a variety of operational objectives to guide your ITSM functions for improved service delivery. In the domain of IT Service Desk, metrics hold valuable insights on functions and processes that directly impact the operational performance of your business and end-user experience.

 

The overarching goal of an ITSM organization is to ensure dependable IT service delivery and performance, especially when the cost of IT downtime averages $1-5 million per hour for nearly half of the enterprises experiencing outages according to a recent research report. Even for the smaller organizations with up to 50 employees, the cost of hourly downtime averages $50k-100k, excluding legal fees and fines.

 

The first step to reducing the downtime is to measure it with the right metrics that hold the most useful insights into service availability. Popular metrics such as MTTR, MTTF and MTBF are commonly used to describe the frequency and duration of an IT outage incident. Before we take a deep dive into the ITSM metrics calculation, let’s quickly review the characteristics of a service that are described by these metrics:

  • Service Availability: Probability that a system performs adequately and satisfies the defined specifications at a specific instant of its operation (not over a duration). Availability refers to the readiness of an adequate service.
  • Service Reliability: Probability that a system performs adequately and satisfies the defined specifications over a period of time, such that no repair is required to maintain the service operations. Reliability refers to the continuity of the service.
  • Service Dependability: A holistic view of how far the service can be trusted over time, combining availability and reliability with other attributes such as safety, maintainability and integrity.

The dependability metrics below rely on two parameters:


  • Failure Rate: Frequency of failure, or the number of failure incidents per unit time. Denoted by the Greek letter λ (Lambda).
  • Repair Rate: Frequency of repair operations to a failed system component, or the number of repairs per unit time. Denoted by the Greek letter μ (Mu).
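
As a rough illustration of the two parameters, the sketch below estimates both rates from simple counts over an observation window. The function names and sample figures are hypothetical, and the estimates assume the rates stay roughly constant over the window.

```python
def failure_rate(failure_count: int, operating_hours: float) -> float:
    """Estimate lambda as failures per operating hour."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failure_count / operating_hours


def repair_rate(repair_count: int, repair_hours: float) -> float:
    """Estimate mu as completed repairs per hour of repair work."""
    if repair_hours <= 0:
        raise ValueError("repair_hours must be positive")
    return repair_count / repair_hours


# Hypothetical figures: 4 failures in 8,000 operating hours,
# and 4 repairs that took 10 hours of repair work in total.
lam = failure_rate(4, 8000.0)
mu = repair_rate(4, 10.0)
print(f"lambda = {lam:.4f} failures/hour, mu = {mu:.1f} repairs/hour")
```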

 

Mean Time to Failure (MTTF)

The average time a non-repairable system component operates before it fails. Such a component delivers adequate performance for a limited duration, so the measurement covers the time between initial operation and the irreversible failure that leaves it unable to deliver dependable service. The measurement is repeated across multiple components and averaged as follows:

 

MTTF = Total Hours of Operation / Total Number of Components

MTTF = 1 / λ
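
A minimal sketch of the MTTF calculation follows. The lifetimes are made-up sample data and the helper name `mttf` is hypothetical. Deriving the failure rate as 1 / MTTF additionally assumes the failure rate is constant over the component's life.

```python
from typing import Sequence


def mttf(hours_to_failure: Sequence[float]) -> float:
    """Total hours of operation divided by the number of components observed."""
    if not hours_to_failure:
        raise ValueError("at least one component lifetime is required")
    return sum(hours_to_failure) / len(hours_to_failure)


# Hypothetical lifetimes (in hours) of five identical non-repairable components.
lifetimes = [9500.0, 10200.0, 11000.0, 9800.0, 9500.0]

mttf_hours = mttf(lifetimes)
# Failure rate implied by MTTF = 1 / lambda (constant-rate assumption).
implied_lambda = 1.0 / mttf_hours

print(f"MTTF = {mttf_hours:.0f} hours, lambda = {implied_lambda:.6f} failures/hour")
```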

Mean Time to Repair (MTTR)

The average time spent repairing a failed component. The metric measures the time between the detection of a failure and the component's return to an operational state. It applies to repairable components and counts only the time spent on the repair process, including repair planning and execution. The maintenance time across all component failures over a long period is measured and averaged as follows:

MTTR = Total Hours of Maintenance / Total Number of Repairs

MTTR = 1 / μ
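
The repair-time average (MTTR) can be sketched the same way. The repair durations below are invented for illustration and `mttr` is a hypothetical helper name. The implied repair rate assumes repairs complete at a roughly constant rate.

```python
from typing import Sequence


def mttr(repair_hours: Sequence[float]) -> float:
    """Total hours of maintenance divided by the number of repairs."""
    if not repair_hours:
        raise ValueError("at least one repair duration is required")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical repair durations (in hours), measured from detection
# of the failure to the component's return to service.
repairs = [1.5, 4.0, 2.5, 2.0]

mttr_hours = mttr(repairs)
# Repair rate implied by MTTR = 1 / mu.
implied_mu = 1.0 / mttr_hours

print(f"MTTR = {mttr_hours:.1f} hours, mu = {implied_mu:.1f} repairs/hour")
```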

Mean Time to Detect (MTTD)

The average time between the occurrence of a component failure and detection of this failure. This time is variable and depends on the ability to monitor system performance, identify a failure and pinpoint the affected component. It’s not always possible to identify exactly when a failure occurs, especially in a situation where the service continues to meet desired dependability objectives despite a failure incident. This is possible due to risk mitigation mechanisms such as redundancy, which ensure high system availability. The Mean Time to Detect is calculated as follows:



MTTD = Total Time Spent for Incident Detection / Total Number of Incidents
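
One way to compute MTTD from incident records is sketched below, assuming each record carries an occurrence timestamp and a detection timestamp. The timestamps are hypothetical, and in practice the occurrence time is often an estimate reconstructed after the fact.

```python
from datetime import datetime
from typing import Sequence, Tuple

Incident = Tuple[datetime, datetime]  # (occurred_at, detected_at)


def mttd_hours(incidents: Sequence[Incident]) -> float:
    """Average hours between the occurrence of a failure and its detection."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = 0.0
    for occurred_at, detected_at in incidents:
        if detected_at < occurred_at:
            raise ValueError("detection cannot precede occurrence")
        total_seconds += (detected_at - occurred_at).total_seconds()
    return total_seconds / len(incidents) / 3600.0


# Hypothetical incident log: detection delays of 0.5, 1.5 and 1.0 hours.
incident_log = [
    (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 30)),
    (datetime(2024, 3, 20, 9, 0), datetime(2024, 3, 20, 10, 0)),
]

mttd = mttd_hours(incident_log)
print(f"MTTD = {mttd:.2f} hours")
```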

Mean Time Between Failures (MTBF)

The average time elapsed between successive failures of a repairable system component. This metric accounts for the time spent detecting a failure (MTTD), repairing it (MTTR) and the time elapsed until the next failure incident (MTTF). Maximizing system dependability therefore means minimizing the MTTD and MTTR portions of this cycle while maximizing the MTTF portion. MTBF is calculated as follows:

MTBF = MTTD + MTTR + MTTF
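
To make the composition of one failure-to-failure cycle concrete, the sketch below adds the three components and reports how much of the cycle is downtime. The class name `FailureCycle` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCycle:
    """Average durations (in hours) of the three phases of one failure cycle."""
    mttd: float  # failure occurs but is not yet detected
    mttr: float  # failure detected, repair in progress
    mttf: float  # service running normally until the next failure

    @property
    def mtbf(self) -> float:
        """MTBF as the sum MTTD + MTTR + MTTF."""
        return self.mttd + self.mttr + self.mttf

    @property
    def downtime(self) -> float:
        """Hours per cycle during which the component is failed."""
        return self.mttd + self.mttr


# Hypothetical averages for a single service component.
cycle = FailureCycle(mttd=1.0, mttr=2.5, mttf=996.5)
print(f"MTBF = {cycle.mtbf:.1f} hours, of which {cycle.downtime:.1f} hours is downtime")
```

With these sample numbers, halving MTTD removes half an hour of downtime from every cycle without any change to the repair process.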

Service Availability

Finally, let’s review how these metrics contribute to the service availability. This is one of the most important Service Level Agreement (SLA) metrics that guarantees dependable service performance. It’s commonly measured in terms of 9’s – e.g. five 9’s, or available 99.999% of the time. It’s measured as follows:

 

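Availability, A = MTTF / MTBF = MTTF / (MTTD + MTTR + MTTF)

Sources that define MTBF as uptime only, rather than as the full failure-to-failure cycle used here, write the same ratio as MTBF / (MTBF + MTTR).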
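
The sketch below turns the metrics into an availability figure and an expected annual downtime. It takes availability as the uptime share of one full failure cycle, MTTF / (MTTD + MTTR + MTTF), which is an assumption chosen to match the MTBF definition above. When MTBF is taken to mean uptime only, this is the same quantity as the widely used MTBF / (MTBF + MTTR) with detection time folded into MTTR. The function names and sample values are hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # 365 days; leap years are ignored in this sketch


def availability(mttd: float, mttr: float, mttf: float) -> float:
    """Fraction of one failure cycle during which the service is up."""
    cycle = mttd + mttr + mttf
    if cycle <= 0:
        raise ValueError("cycle length must be positive")
    return mttf / cycle


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical averages (in hours) for a single service component.
avail = availability(mttd=1.0, mttr=2.5, mttf=996.5)
downtime = annual_downtime_hours(avail)
print(f"Availability = {avail:.4%}, about {downtime:.2f} hours of downtime per year")

# For comparison, five 9's leaves a budget of only a few minutes per year.
five_nines_minutes = annual_downtime_hours(0.99999) * 60.0
print(f"Five 9's allows about {five_nines_minutes:.2f} minutes of downtime per year")
```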

Managing MTTD with Intelligence

As the old business adage goes, What Gets Measured, Gets Managed. ITSM organizations are often overwhelmed by the variety of metrics and business KPIs required to measure the true performance of their IT Service desk. Managing service availability by reducing downtime incidents and resolution time requires a thorough understanding of metrics that describe various stages of the incident and resolution process.

 

While popular metrics such as MTTF and MTTR get significant attention, the unpredictable and highly variable MTTD metric is often the most impactful lever for reducing downtime, which determines long-term service dependability and end-user experience. Not surprisingly, MTTD performance is also unique to every organization, and reducing it depends on the ability to uncover hidden patterns in incident data and ultimately identify the root cause of repetitive incidents.

 

For the IT Service Desk, analyzing trends in ticketing requests and historical archives using advanced NLP technologies can help map a problem to known solutions that are not always apparent during the resolution process. This is integral to an end-to-end hyperautomation intelligence strategy with the goal of optimizing the entire IT Service Desk pipeline and workflows. By looking at the right metrics, your IT Service Desk can not only reduce ticketing volume, but also identify opportunities for proactively eliminating expensive IT outages and downtime incidents.