Intro to Computer Systems
Chapter 10: Equipment Failure
How Do Computers Fail?
Components can fail either suddenly or through a gradual deterioration that continues until something fails outright. To borrow some medical terminology, you might find it easier to think of the former as an "acute" failure and the latter as a "chronic" failure.
- A sudden failure may be due to something outside the component's control: e.g. a clumsy user might drop a hard disk, permanently damaging the heads or platters so that the drive no longer works.
- An ongoing failure may have a continuous external cause: e.g. the hard drive may have insufficient cooling, which reduces the reliability of its constituent parts (e.g. motors and bearings) until the drive fails outright.
Chassis and Cooling Failure
A failure in the exterior chassis may not in itself appear to be critical to the system's overall wellbeing - however, it is very likely to set off a cascading series of failures in other components, which may well be terminal for the system.
Failures of the chassis and cooling apparatus include:
Figure: A broken heatsink retention clip. Photo: DansData
- a failure in systems integration, where a too-heavy heatsink/fan unit is mounted vertically in the case. A CPU socket specification covers, among other things, how the thermal solution is mounted and any limits on its physical size or weight. This is particularly important in tower cases, where the CPU's heatsink does not sit in line with gravity but hangs at right angles to it. The resulting bending load can overstress the mounting points and lead to problems with CPU cooling, or even a warped or damaged motherboard.
- a fan failure - most thermal solutions rely on active cooling to achieve their rated heat dissipation, which is sized against the processor's thermal design power (TDP). A failed fan may not be immediately apparent while a system is idle, but it will cause components to overheat under load.
- a failure of discrete components - most commonly, capacitors in a power supply (or motherboard). Capacitors are discrete electronic devices that store and release electrical charge, and high-capacity capacitors (and cheap motherboard capacitors) are built using a liquid electrolyte rather than solid-state compounds. In a high-temperature environment, or with a poor-quality electrolyte, the capacitor can break down and stop working correctly - or even generate gas inside its enclosure until the component ruptures and spills the fluid over its neighbouring components.
A very widespread case of chronic capacitor failure in the early to mid 2000s was due to a faulty electrolyte formulation in the capacitors, reportedly the result of corporate espionage.
Figure: Blown (left) and healthy (right) capacitors.
This article from Silicon Chip is a good summary of the situation at the time.
The ramifications of these poorly built capacitors were felt throughout the computer industry, affecting not only component and motherboard manufacturers like Asus and Gigabyte, but also integrators like Dell and Apple.
Chip Failure
The efficiency of transistors in a processor depends on the temperature of the silicon; a higher temperature makes them less efficient. This can have a snowballing effect: the less efficient CPU draws more power and so generates more heat, which makes it less efficient still. This is known as thermal runaway.
Early CPUs had no protection against thermal runaway, and so when they overheated (either due to poor conditions, or a failure of the heatsink/fan unit) they simply burned out. (Modern CPUs have protection circuits that prevent thermal runaway from causing damage.)
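The feedback loop behind thermal runaway can be illustrated with a toy model. The sketch below is only an illustration, not a model of any real processor: it assumes power draw rises linearly with temperature and that the cooler is described by a single thermal resistance figure. The function name, parameters, and all numbers are hypothetical.

```python
def simulate_chip_temperature(base_power_w, power_rise_per_deg,
                              thermal_resistance_c_per_w,
                              ambient_c=25.0, limit_c=150.0, steps=200):
    """Iterate a toy heat/power feedback loop.

    Returns (temperature_c, ran_away). All parameters are illustrative
    assumptions, not measured values for any real chip.
    """
    temp_c = ambient_c
    for _ in range(steps):
        # Assumed: power draw grows linearly as the chip heats above ambient.
        power_w = base_power_w * (1.0 + power_rise_per_deg * (temp_c - ambient_c))
        # Temperature the cooler would settle at for that power draw.
        temp_c = ambient_c + thermal_resistance_c_per_w * power_w
        if temp_c >= limit_c:
            # Each pass is hotter than the last: the loop has run away.
            return temp_c, True
    return temp_c, False


# Working cooler (low thermal resistance): the loop settles at a stable value.
print(simulate_chip_temperature(50.0, 0.02, 0.4))
# Degraded cooler (high thermal resistance): the loop diverges.
print(simulate_chip_temperature(50.0, 0.02, 1.5))
```

In this model the loop is stable only while the product of base power, power rise per degree, and thermal resistance stays below 1. A failed fan raises the thermal resistance, which is how a cooling failure can push an otherwise stable chip into runaway.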
Storage Device Failure
Mechanical Storage
Mechanical storage devices such as hard disks and optical drives have many ways in which they can fail.
- bearing wear: motor bearings can wear, lose their grease, and seize up, rendering the motor unable to maintain a stable rotational speed.
- motor failure: for similar reasons, the failure of a motor unit (spindle motor or head servo) will also render a drive inoperable.
- loss of alignment: due to either shock or excessive thermal cycling, the heads of a drive unit may drift so far out of alignment that the internal alignment functionality cannot compensate.
- foreign objects/debris: foreign debris, such as dust, is problematic for optical drives, and disastrous for magnetic hard disks.
- head crashing: a significant external blow to a hard disk may cause the heads to crash against the disk surface, destroying both the read/write head and the platter. Parking the heads does not remove the risk of a head crash; it just makes one less likely.
- disc shattering: a poorly made compact disc, when spun to extreme speeds as part of a CLV access mode, may not withstand the rotational forces and can shatter within the drive. This destroys the disc and, in many cases, the drive as well.
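To get a feel for why rotational speed matters in the disc shattering case, the sketch below estimates how fast a disc must spin to hold a constant linear velocity (CLV) at a given radius. It is a rough illustration only: the 1x scanning velocity of 1.3 m/s and the data-area radii are assumed, approximate figures, and the function name is hypothetical.

```python
import math

# Assumed, approximate figures for a compact disc.
CD_1X_LINEAR_VELOCITY_M_S = 1.3   # 1x scanning velocity is roughly 1.2-1.4 m/s
INNER_RADIUS_M = 0.025            # approximate start of the data area
OUTER_RADIUS_M = 0.058            # approximate end of the data area


def clv_rpm(speed_multiple, radius_m,
            base_velocity_m_s=CD_1X_LINEAR_VELOCITY_M_S):
    """Rotation rate (rpm) needed to hold a constant linear velocity."""
    linear_velocity = speed_multiple * base_velocity_m_s
    angular_velocity = linear_velocity / radius_m  # radians per second
    return angular_velocity * 60.0 / (2.0 * math.pi)


for multiple in (1, 8, 24):
    print(multiple,
          round(clv_rpm(multiple, INNER_RADIUS_M)),
          round(clv_rpm(multiple, OUTER_RADIUS_M)))
```

Under these assumptions the required rotation rate is highest at the inner radius and scales linearly with the speed multiple, so every doubling of read speed doubles the rotation rate the disc has to survive.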
Solid-State Storage
Solid-state media is far more robust than mechanical storage, but it is not completely failure-proof. The primary non-physical failure mode concerns the flash memory cells and their limited write endurance.
The limited endurance of a flash memory cell is a property of its construction: each cell is a 'floating gate' transistor, in which the gate is sandwiched between two insulating layers. It is the gradual erosion of these boundary layers, as the cell is repeatedly written, that limits its lifespan.
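The practical effect of limited write endurance can be estimated with some simple arithmetic. The sketch below is a back-of-envelope model, not a manufacturer's method: it assumes writes are spread perfectly evenly across all cells, and the rated number of write cycles per cell and the write amplification factor are assumed inputs. The function names and example numbers are hypothetical.

```python
def estimated_total_writes_tb(capacity_gb, write_cycles_per_cell,
                              write_amplification=1.0):
    """Rough upper bound on host data (in TB) writable before wear-out.

    Assumes perfectly even wear across all cells (an idealisation).
    """
    if write_amplification < 1.0:
        raise ValueError("write amplification cannot be below 1.0")
    total_gb = capacity_gb * write_cycles_per_cell / write_amplification
    return total_gb / 1000.0


def estimated_lifetime_years(capacity_gb, write_cycles_per_cell,
                             daily_writes_gb, write_amplification=1.0):
    """Years until the estimated write budget is used up at a steady daily load."""
    total_tb = estimated_total_writes_tb(
        capacity_gb, write_cycles_per_cell, write_amplification)
    return total_tb * 1000.0 / daily_writes_gb / 365.0


# Hypothetical drive: 256 GB, 3000 write cycles per cell, amplification of 2.
print(estimated_total_writes_tb(256, 3000, 2.0))       # 384.0
print(estimated_lifetime_years(256, 3000, 50.0, 2.0))
```

The point of the model is that endurance is a budget. Larger drives and lighter write loads stretch it out, while uneven wear or high write amplification use it up sooner.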
Display Failure
The most common failures of LCD monitors involve the backlight, which can fail in a couple of ways:
- backlight failure, most commonly the breakage of the CCFL backlight tube. In this situation, the LCD panel matrix itself is operating normally, but without any backlight to pass through the panel, nothing appears on the screen. If one looks closely at an LCD panel with a broken backlight, it may be possible to see a faint image of the screen content.
- the CCFL inverter circuit can also fail, with similar symptoms. The inverter is the small piece of circuitry attached to the LCD panel that generates the high voltage needed to excite the gas in a CCFL tube.
These two failure modes are notably absent in an LED-backlit LCD panel, which is one of the reasons why these panels are more reliable.
Figure: A dead LCD subpixel.
Failures can also occur due to errors in manufacture, which lead to 'dead' (always off) or 'stuck' (always on) pixels. These are cells in the LCD pixel matrix that are permanently fixed in one state and cannot be switched.
Figure: The back side of an LCD panel.
Screen anomalies can also occur due to the control lines of the LCD panel being damaged. The control circuitry is typically on the upper back side of the panel, with very fragile control lines running over the top edge. If these are damaged, it may manifest itself as lines appearing on the screen at regular intervals.
In the image above, the following can be seen:
- inverter circuit connector (two-pronged connector at the bottom). The inverter usually sits just below the LCD panel, not behind it.
- control circuit (green circuit board on the upper quarter of the panel). This converts the video signals into the required control pulses to enable/disable individual LCD subpixels.
- control lines (orange/brown area just above the control circuit). This is a flexible plastic with the control lines embedded within it. This connects the LCD subpixel matrix with the control circuit.