Intro to Computer Systems

Chapter 10: Equipment Failure

Error Correction and Redundancy

In mission-critical systems, it is often not acceptable for the inevitable errors or hardware failures to corrupt data or to cause downtime. There are two general approaches to hardening a system against failure or corruption: error correction and redundancy.

The two examples of these strategies we will focus on are error-correcting memory (ECC memory) and redundant disk storage (RAID).

Error-Correcting Memory

Memory is particularly vulnerable to corruption from outside interference, as it takes only a very small pulse of energy to change an individual bit's state. Electrical interference from other components and power sources is only a relatively minor cause of this sort of one-off "soft error"; the primary source of such interference is background radiation from space, such as cosmic rays.

Depending on the task, such an error may have a negligible impact or cause a serious issue (for example, in financial transactions or scientific computing). Without some process in place, these soft errors are neither detectable nor correctable; additional steps must be taken to add resilience to the memory system.

Parity Memory

The simpler solution, as implemented on early IBM PCs, is memory parity. This is a simple 'check digit' of redundancy: one extra bit attached to every byte of data that records whether the number of 1 bits in the byte is odd or even. If external interference corrupted a bit, the error could be detected, as the parity would no longer match the data.
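The parity check described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming even parity over a single byte; the function names `parity_bit` and `check` are made up for this example and do not correspond to any real hardware interface.

```python
def parity_bit(byte: int) -> int:
    """Return 1 if the byte holds an odd number of 1 bits, else 0 (even parity)."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, stored_parity: int) -> bool:
    """Return True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


data = 0b10110010              # four 1 bits, so the parity bit is 0
stored = parity_bit(data)      # written to the extra (ninth) bit

one_flip = data ^ 0b00001000   # a single bit is corrupted
two_flips = data ^ 0b00011000  # two bits are corrupted

print(check(data, stored))       # intact data passes the check
print(check(one_flip, stored))   # a single flipped bit is detected
print(check(two_flips, stored))  # two flipped bits cancel out and go unnoticed
```

Note that the check can only report that something is wrong; it carries no information about which bit changed.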

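There were drawbacks to this approach, however: parity can only detect an error, not correct it, and an even number of flipped bits within the same byte goes undetected.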

ECC Memory

ECC (Error-Correcting Code) memory is an improvement on parity memory that allows for detection of multi-bit (typically double-bit) data corruption and correction of single-bit errors, usually through the implementation of a SECDED (single error correction, double error detection) Hamming code. Calculating and checking these codes comes at a small performance penalty, but greatly increases the resilience of the memory system.
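To make the SECDED idea concrete, here is a toy sketch of an extended Hamming code that protects 4 data bits with 4 check bits. It is an assumption-laden illustration rather than how any particular memory controller works: real ECC memory commonly protects 64-bit words with 8 check bits, and the function names `encode` and `decode` are invented for this example.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity enables double-error detection
    return code


def decode(received: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error cannot be corrected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1    # True when an odd number of bits flipped

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1             # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "double-error", None     # two flips: detectable but not correctable

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
damaged = list(word)
damaged[6] ^= 1                         # simulate a single soft error
print(decode(word))
print(decode(damaged))
```

The syndrome names the position of a single flipped bit, which is what allows correction, while the extra overall parity bit distinguishes a single error from a double one.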

Where ultimate reliability is required, enhanced versions of ECC such as IBM's Chipkill offer an even greater level of memory redundancy.

A 2009 study by Google focused on DRAM error rates, reaching the surprising conclusion that "hard" errors (due to production defects, etc.) greatly outnumbered "soft" errors caused by circumstances such as cosmic rays.

An Ars Technica article in October 2009 reported the conclusions of the study.

Redundant Array of Independent Disks (RAID)

Similar to redundant power supplies, mass storage devices can also be arranged to provide redundancy. There are a number of strategies, collected in a standard known as RAID (Redundant Array of Independent Disks).

RAID striping (level 0), mirroring (level 1) and parity (level 5).

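The various RAID levels offer varying levels of redundancy, recoverability, and increased performance by connecting drives together in a number of forms: striping, which spreads data across drives (level 0); mirroring, which keeps identical copies on more than one drive (level 1); and parity, which stores recovery information calculated from the data (level 5).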

Synology, a network-attached storage manufacturer, has an interactive RAID calculator and comparison tool:

https://www.synology.com/en-us/support/RAID_calculator

RAID levels use one or a number of these techniques to add performance and redundancy to a disk system.
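As a rough illustration of how parity adds redundancy, the sketch below computes an XOR parity block for one stripe and uses it to rebuild the contents of a lost drive. It is a simplified model under stated assumptions: the three "drives" are just short byte strings, the helper `xor_blocks` is invented for this example, and a real RAID 5 array also rotates the parity block across the drives from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: a block of data on each of three drives (illustrative contents).
data_blocks = [b"DISK", b"FAIL", b"SAFE"]

# The parity block is stored on a fourth drive.
parity = xor_blocks(data_blocks)

# Drive 1 fails: XOR the surviving data blocks with the parity block to rebuild it.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)
```

Because XOR is its own inverse, any single missing block in the stripe can be recovered this way, at the cost of one drive's worth of capacity.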

The difference between RAID 0+1 and RAID 1+0 may not be immediately apparent. This article from thegeekstuff.com provides extra explanation.
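One way to see the difference is to count which two-drive failures each layout survives. The sketch below does this for a four-drive array under a simplified model: RAID 1+0 is treated as a stripe across two mirrored pairs, and RAID 0+1 as a mirror of two striped sets, where a striped set is unusable as soon as any one of its drives fails. The drive numbering and function names are assumptions made for this example.

```python
from itertools import combinations

DISKS = range(4)
GROUPS = [(0, 1), (2, 3)]   # mirrored pairs in RAID 1+0, striped sets in RAID 0+1


def raid10_survives(failed):
    """Stripe of mirrors: every mirrored pair needs at least one working drive."""
    return all(any(d not in failed for d in pair) for pair in GROUPS)


def raid01_survives(failed):
    """Mirror of stripes: at least one striped set must be completely intact."""
    return any(all(d not in failed for d in stripe) for stripe in GROUPS)


two_drive_failures = [set(c) for c in combinations(DISKS, 2)]
raid10_ok = sum(raid10_survives(f) for f in two_drive_failures)
raid01_ok = sum(raid01_survives(f) for f in two_drive_failures)

print(f"RAID 1+0 survives {raid10_ok} of {len(two_drive_failures)} two-drive failures")
print(f"RAID 0+1 survives {raid01_ok} of {len(two_drive_failures)} two-drive failures")
```

In this model both layouts survive any single drive failure, but RAID 1+0 tolerates more of the possible second failures, since it only fails when both drives of the same mirrored pair are lost.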