Intro to Computer Systems

Chapter 10: Equipment Failure

Understanding Failure Statistics

Computers are complex devices made up of many subsystems and components, each with a different resistance to failure. When a component or subsystem does fail, the effect can be relatively minor (e.g. a port no longer works) or serious (a failure that propagates to other devices, causing them to fail as well).

The study of system life cycles and the ways to measure, predict and forecast failures is known as reliability engineering. This subtopic introduces some of the basic reliability engineering concepts, as applied to electronic equipment.

The Bathtub Curve

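The bathtub curve is a plot of failure rate against time that commonly holds true for complex electronic equipment: the failure rate is high early in life, low and roughly constant through the useful service life, and rises again as components wear out.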

The bathtub curve.
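The shape of the curve can be sketched numerically by adding together three failure-rate components: one that falls over time (infantile failures), one that stays constant (random failures) and one that rises over time (wear-out). The Weibull form and every parameter value below are illustrative assumptions chosen only to produce the characteristic shape; they are not measurements of any real device.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate h(t) = (k/s) * (t/s)**(k-1), in failures per hour.

    t must be greater than zero when shape < 1.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of three illustrative failure-rate components at age t (hours)."""
    infantile = weibull_hazard(t, shape=0.5, scale=200_000.0)  # shape < 1: falls with age
    random_failures = 1.0 / 500_000.0                          # constant rate
    wear_out = weibull_hazard(t, shape=5.0, scale=60_000.0)    # shape > 1: rises with age
    return infantile + random_failures + wear_out


# Hypothetical ages, picked to sample the early, middle and late parts of the curve.
for hours in (100, 1_000, 20_000, 50_000, 80_000):
    print(f"{hours:>6} h  failure rate = {bathtub_hazard(hours):.2e} per hour")
```

With these assumed parameters the printed rate starts high, drops to a minimum in the middle of the range, and climbs again at the end, which is the bathtub shape.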

The bathtub curve is derived from a number of different failure modes:

Infantile Failure

Infantile failure is the failure mode that a systems integrator most commonly experiences when building computer systems. Because computer equipment contains an increasing proportion of solid-state components, reliability has improved to the point that the time between the infantile and wear-out failure periods of a component is very long.

Component manufacturers typically have a fairly efficient process for dealing with infantile failures: the customer notifies the manufacturer or supplier of the fault and is issued a Return Merchandise Authorisation (RMA) in advance of the physical return of the goods.

Mean Time Between Failure (MTBF)

Mean Time Between Failure is a very commonly quoted statistic that aims to predict the hours of operation of a certain device type before a failure is encountered. Note that the MTBF is calculated as an arithmetic mean across the entire population and does not refer to an individual device: e.g. a 50,000 hour MTBF does not mean that the manufacturer expects your particular device to last 50,000 hours.
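To see why a population mean says little about one device, it helps to look at how such a figure can be produced. The sketch below uses the common estimate of total operating hours divided by the number of failures observed; the test size, duration and failure count are hypothetical numbers chosen to reproduce the 50,000 hour example.

```python
def observed_mtbf(total_operating_hours, failures):
    """Estimate MTBF as total operating hours divided by failures observed."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_operating_hours / failures


# Hypothetical test: 200 devices each run for 5,000 hours, with 20 failures seen.
units = 200
hours_each = 5_000
failures = 20

print(observed_mtbf(units * hours_each, failures))  # 50000.0
```

No device in this hypothetical test ran for anywhere near 50,000 hours, yet the population figure comes out at 50,000 hours.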

MTBF statistics are not component life expectancy figures.
Computerworld have a useful article on MTBF, in the context of data centres.

MTBF statistics are commonly quoted for mass storage such as hard disks. A typical consumer-grade hard disk might have an MTBF rating of 500,000 hours (about 57 years) or more. This does not mean that a single disk should last 57 years, nor that half of a population will survive that long, because a mean is not a median. The figure is better read as a failure rate across a population. A pool of a thousand disks (for example) accumulates 1,000 operating hours every hour, so from the specification we could reasonably expect one disk in the pool to fail about every 500,000 / 1,000 = 500 hours, or roughly every three weeks, for as long as the disks remain within their useful service life.
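The population arithmetic can be checked with a few lines of code. This is a sketch that assumes a constant failure rate (exponentially distributed lifetimes), which is the usual simplification behind a quoted MTBF and only approximates the flat part of the bathtub curve; the function names are made up for the illustration.

```python
import math

HOURS_PER_YEAR = 8_760  # 365 days * 24 hours


def hours_between_fleet_failures(mtbf_hours, fleet_size):
    """Expected hours between failures somewhere in a fleet of identical devices."""
    return mtbf_hours / fleet_size


def fraction_failed_by(t_hours, mtbf_hours):
    """Fraction of a population expected to have failed by age t (constant rate)."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


mtbf = 500_000
print(mtbf / HOURS_PER_YEAR)                      # MTBF expressed in years
print(hours_between_fleet_failures(mtbf, 1_000))  # hours between failures in a pool of 1,000
print(fraction_failed_by(mtbf, mtbf))             # share failed once age equals the MTBF
```

Under this assumption the share of devices that have failed by the time their age equals the MTBF is 1 - 1/e, about 63%, rather than one half.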

MTBF figures are often extrapolated from "burn-in" analysis, and as such should not be treated as a definitive measure of component reliability. Although they may not be useful as an absolute reference, they can be useful as a relative one. For example, if you had to choose between two hard drive models for a disk array (e.g. for a server room), and one model had a quoted MTBF of 600,000 hours and the other 1.5 million hours, it would be prudent to choose the latter if reliability were of utmost importance.
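One way to make the relative comparison concrete is to convert each quoted MTBF into an annualised failure rate (AFR), the share of drives expected to fail in a year of continuous operation. The sketch below again assumes a constant failure rate, and "Model A" and "Model B" are placeholder names for the two drives in the example.

```python
import math

HOURS_PER_YEAR = 8_760  # 365 days * 24 hours


def annualised_failure_rate(mtbf_hours):
    """Share of devices expected to fail within one year of continuous use."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for label, mtbf in (("Model A", 600_000), ("Model B", 1_500_000)):
    afr = annualised_failure_rate(mtbf)
    print(f"{label}: MTBF {mtbf:,} h -> AFR {afr:.2%}")
```

On these quoted figures the first model would be expected to lose roughly 1.4% of its drives per year and the second roughly 0.6%, a ratio that is likely to be more trustworthy than either absolute number.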

Google operates data centres around the world, which use consumer-grade hardware. They analysed the actual reliability of the hard disks in their data centres and summarised the results in a published paper.

Warranties

Product manufacturers use this kind of reliability data to determine a reasonable warranty length for their product, balancing product reliability, failure rate, profits from sales, and losses from repairs and returns. This type of analysis is known as actuarial science, and is a branch of applied mathematics.

It can be tempting to conclude from this that warranty length is a good proxy for other reliability statistics, but warranty length varies for so many non-actuarial reasons that this conclusion does not have a solid basis.

Warranty length can, however, be used to choose between vendors whose reliability statistics are equivalent. For example, if two hard drive vendors each offer a product with a 1 million hour MTBF, and one offers a three-year warranty and the other a five-year warranty, it is prudent to choose the latter if post-failure support is a concern.