Intro to Computer Systems

Chapter 10: Equipment Failure

How Do Computers Fail?

Components can fail suddenly, or they can deteriorate gradually until something fails outright. To borrow some medical terminology, you may find it easier to think of the former as an "acute" failure and the latter as a "chronic" failure.

Chassis and Cooling Failure

A failure in the exterior chassis may not, in itself, appear critical to the system's overall wellbeing. However, it is very likely to set off a cascade of failures in other components, which may well be terminal for the system.

Some kinds of failures of the chassis and cooling apparatus may be:

A broken heatsink retention clip. Photo: DansData

A very widespread case of chronic capacitor failure in the early to mid 2000s was caused by a faulty electrolyte formula used in electrolytic capacitors, reportedly the result of corporate espionage.

Blown (left) and healthy (right) capacitors.

This article from Silicon Chip is a good summary of the situation at the time.

The ramifications of these poorly built capacitors were felt throughout the computer industry, affecting not only component and motherboard manufacturers like Asus and Gigabyte, but also integrators like Dell and Apple.

Chip Failure

The efficiency of the transistors in a processor depends on the temperature of the silicon: the higher the temperature, the less efficient they become. This can have a snowballing effect, as the less efficient CPU draws more power and therefore produces more heat, which makes it less efficient still. This is known as thermal runaway.

Early CPUs had no protection against thermal runaway, so when they overheated (either due to poor operating conditions or a failure of the heatsink/fan unit) they simply melted. Modern CPUs have protection circuits that prevent thermal runaway from causing damage.
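The feedback loop behind thermal runaway can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function name `simulate`, the linear relationship between temperature and power draw, and every number (power, thermal resistance, heat capacity, trip temperature) are invented for illustration and do not describe any real CPU.

```python
def simulate(r_th, trip=None, steps=3000, dt=0.1,
             t_amb=25.0, p0=60.0, k=0.02, c_th=50.0):
    """Return the temperature history (deg C) of a toy CPU thermal model.

    r_th  : thermal resistance of the cooler in K/W (higher = worse cooling)
    trip  : shutdown temperature in deg C; None models a CPU with no protection
    p0, k : assumed power draw at ambient (W) and its growth per kelvin
    c_th  : assumed heat capacity of the chip and heatsink in J/K
    """
    temps = [t_amb]
    for _ in range(steps):
        t = temps[-1]
        if trip is not None and t >= trip:
            break  # protection circuit shuts the CPU down
        power = p0 * (1.0 + k * (t - t_amb))   # hotter silicon draws more power
        cooling = (t - t_amb) / r_th           # heat carried away by the cooler
        temps.append(t + dt * (power - cooling) / c_th)
    return temps


healthy = simulate(r_th=0.3)               # working heatsink and fan
runaway = simulate(r_th=2.0)               # failed cooler, no protection
protected = simulate(r_th=2.0, trip=100)   # failed cooler, with protection

print(f"healthy cooler settles at {healthy[-1]:.1f} C")
print(f"failed cooler reaches     {runaway[-1]:.0f} C")
print(f"protected CPU stops at    {protected[-1]:.1f} C")
```

In this model the temperature settles whenever the cooler removes heat faster than the extra power adds it (here, when `r_th * p0 * k` is below 1). With the failed cooler that condition no longer holds, so the temperature keeps climbing until the protection trips or, without protection, until the hardware is destroyed.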

Tom's Hardware Guide demonstrated the effects of CPU thermal runaway back in 2001, when such protection was not a common feature.

The video summary is also on YouTube.

Storage Device Failure

Mechanical Storage

Mechanical storage devices such as hard disks and optical drives have many ways in which they can fail.

A hard disk head crash. Photo: barnold/Flickr

Solid-State Storage

Solid-state media are far more robust than mechanical storage, but they are not completely failure-proof. The primary non-physical failure mode relates to the flash memory cells and their limited write durability.

Diagram of a flash memory cell. Diagram: eeherald.com

The limited durability of a flash memory cell is a property of its construction: it is a 'floating gate' transistor, in which a transistor gate is sandwiched between two insulating layers. It is the erosion of these boundary layers, a little with each write, that limits the cell's lifespan.
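The idea of a fixed write budget can be sketched in a few lines of code. This is a simplified model, not how any real drive's firmware works: the class name `FlashBlock`, its methods, and the endurance figure of 3000 cycles are all assumptions chosen for illustration.

```python
class FlashBlock:
    """Toy model of a flash block that tolerates a limited number of writes."""

    def __init__(self, endurance=3000):
        self.endurance = endurance  # assumed number of write cycles before wear-out
        self.cycles = 0             # write cycles used so far

    @property
    def worn_out(self):
        return self.cycles >= self.endurance

    def write(self):
        """Use up one write cycle, or fail if the block is already worn out."""
        if self.worn_out:
            raise RuntimeError("block worn out: write failed")
        self.cycles += 1


def writes_until_failure(block):
    """Write to the block repeatedly and count how many writes succeed."""
    count = 0
    while not block.worn_out:
        block.write()
        count += 1
    return count


block = FlashBlock(endurance=3000)
print(writes_until_failure(block), "writes succeeded before wear-out")
```

A real cell does not fail at an exact count; the model only captures the point that every write uses up part of a finite budget.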

Display Failure

The most common failures of LCD monitors include the backlight, which can fail in a couple of ways:

These two failure modes are notably absent in LED-backlit LCD panels, which is one of the reasons these panels are more reliable.

A dead LCD subpixel.

Failures can also occur due to errors in manufacture, which lead to 'dead' (always off) or 'stuck' (always on) pixels. These are cells in the LCD pixel matrix that are permanently set to a state and cannot be switched.
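The difference between the two states can be expressed as a small check against two test patterns. This is a sketch only: the function `classify_subpixel` and the idea of feeding it observations from an all-black and an all-white screen are assumptions for illustration, not part of any real diagnostic tool.

```python
def classify_subpixel(lit_on_black, lit_on_white):
    """Classify a subpixel from two observations.

    lit_on_black : True if the subpixel is lit while the screen shows all black
    lit_on_white : True if the subpixel is lit while the screen shows all white
    """
    if lit_on_black and lit_on_white:
        return "stuck"      # always on, whatever the panel is told to show
    if not lit_on_black and not lit_on_white:
        return "dead"       # always off, whatever the panel is told to show
    if lit_on_white:
        return "ok"         # off on black, on on white: switching normally
    return "inverted"       # on for black, off for white: some other fault


print(classify_subpixel(lit_on_black=False, lit_on_white=False))
print(classify_subpixel(lit_on_black=True, lit_on_white=True))
print(classify_subpixel(lit_on_black=False, lit_on_white=True))
```

Checking both patterns matters because a dead subpixel looks normal on a black screen and a stuck one looks normal on a white screen.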

The back side of an LCD panel.

Screen anomalies can also occur when the control lines of the LCD panel are damaged. The control circuitry is typically on the upper back side of the panel, with very fragile control lines running over the top edge. If these lines are damaged, the fault may show up as lines appearing on the screen at regular intervals.

In the image above, the following can be seen: