Intro to Computer Systems

Chapter 4: Processors

CPU Architecture

Modern processors incorporate a number of design features that maximise their computational performance.

Instructions Per Clock (vs. Clock Speed)

Each type of processor is designed and built differently, with its own set of priorities and concessions. One of these is how much work it performs per "clock tick" (the rate of these clock ticks is what the Megahertz/Gigahertz rating measures).

The amount of work performed per clock tick is referred to as a processor's Instructions Per Clock (IPC). In recent computing history, the most famous architectural showdown between these two design concepts has been between Intel's Pentium 4, used in PC systems in the early- to mid-2000s, and Motorola's PPC7400, popularly known as the "PowerPC G4", used in Apple Macintosh systems of the same era.

This difference in design was publicised by Apple as the "megahertz myth". The basis of this myth is that clock speed alone isn't a reliable indicator of performance -- it all depends on how much work actually gets done per clock tick.

The Pentium 4 processor sacrificed IPC for raw clock speed. At each clock tick, the processor wasn't able to compute much; however, it ran so fast (at times, twice the speed of competitors) that in the end its performance was equivalent to other contemporary offerings. This is an example of a processor with a low IPC, but a high clock speed.

This is why the Pentium 4's pipeline was designed to be so deep.

The larger each stage of a pipeline is, the longer it takes to do its job -- and the slower the clock can tick. So, with its very long pipeline of small micro-ops, it was relatively easy to make the Pentium 4's clock run fast.

Motorola's 7400 processor was designed with the opposite approach: the clock would run slowly, but more work would be performed per clock tick. This processor had a high IPC, but a low clock speed.
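
We can make this trade-off concrete with a little arithmetic: a processor's throughput is roughly its IPC multiplied by its clock speed. The sketch below uses illustrative figures, not real measurements of either chip:

```python
# Rough performance model: instructions per second = IPC x clock speed.
# The IPC and clock figures below are illustrative only, not measured
# values for either processor.

def instructions_per_second(ipc, clock_hz):
    """Approximate a processor's throughput."""
    return ipc * clock_hz

low_ipc_high_clock = instructions_per_second(ipc=1.0, clock_hz=3.0e9)  # Pentium 4 style
high_ipc_low_clock = instructions_per_second(ipc=2.0, clock_hz=1.5e9)  # G4 style

print(f"Low IPC, high clock: {low_ipc_high_clock:.2e} instructions/sec")
print(f"High IPC, low clock: {high_ipc_low_clock:.2e} instructions/sec")
# Both work out to 3.00e+09: the same throughput, despite a 2x clock difference.
```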

Each approach has its advantages and disadvantages, which eventually led to problems for both processors.

Instruction Level Parallelism

Instruction level parallelism is all about trying to get the CPU to do multiple things in parallel. The most common method for achieving this is instruction pipelining; processors that go further and issue more than one instruction per clock tick are known as superscalar architectures.

Instruction Pipelining

Instruction pipelining is a technique where the processor works on instructions like an assembly line: each part of the computation process is split into several ordered segments, so that many instructions can be worked on at the same time (albeit each in a different stage of execution).

A diagram illustrating the basic concept of pipelining.

Much like an assembly line, this doesn't make any single item faster to produce (or compute); however, it greatly enhances throughput: very handy when there are several million instructions to compute every second!
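
A toy model makes the throughput gain concrete; the stage count and instruction count below are hypothetical:

```python
# Toy model of pipelined vs. non-pipelined execution. The stage count
# and instruction count are hypothetical.

STAGES = 4            # e.g. fetch, decode, execute, store
N_INSTRUCTIONS = 1000

# Without pipelining, each instruction occupies the processor for
# all of its stages before the next one can begin.
serial_ticks = N_INSTRUCTIONS * STAGES

# With pipelining, once the pipeline is full, one instruction
# completes on every clock tick.
pipelined_ticks = STAGES + (N_INSTRUCTIONS - 1)

print(f"Serial:    {serial_ticks} ticks")     # 4000
print(f"Pipelined: {pipelined_ticks} ticks")  # 1003
# A single instruction still takes 4 ticks (latency is unchanged),
# but throughput approaches one instruction per tick.
```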

At its most basic, a pipelined CPU has four main stages: fetch, decode, execute and store. However, to improve efficiency, each of these stages can itself be broken up into smaller steps, producing a longer pipeline. These smaller steps are called micro-ops.

The Pentium 4 family took the pipelining concept to extremes: it started with a 20-stage pipeline, and by the end of its useful life, the pipeline had no fewer than 31 stages!

The "length" of a pipeline is known as its depth - e.g. the Pentium 4 had a very deep pipeline.

Pipeline Stalls

In practice, a pipelined CPU architecture isn't 100% efficient in its throughput of instructions. Whether it's waiting for data from cache or memory, or recovering from an unsuccessful branch prediction, a processor can suffer a pipeline stall - that is, there aren't enough instructions ready to process, and 'bubbles' of no useful work appear in the pipeline.
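
Extending the toy pipeline model from earlier gives a rough sense of what stalls cost; the stall rate and penalty here are invented numbers, purely for illustration:

```python
# Extending the earlier toy model with stalls. The stall rate and
# penalty are invented numbers, purely for illustration.

STAGES = 4
N_INSTRUCTIONS = 1000
STALL_RATE = 0.1     # fraction of instructions that stall (e.g. a cache miss)
STALL_PENALTY = 10   # bubbles (wasted ticks) inserted per stall

ideal_ticks = STAGES + (N_INSTRUCTIONS - 1)
total_ticks = ideal_ticks + int(N_INSTRUCTIONS * STALL_RATE * STALL_PENALTY)

print(f"Ideal: {ideal_ticks} ticks; with stalls: {total_ticks} ticks")
print(f"Effective IPC drops from ~1.0 to {N_INSTRUCTIONS / total_ticks:.2f}")
```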

Branch Prediction

Pipelining has been further enhanced with speculative branch prediction: this is where the processor tries to guess which way an upcoming branch will go, and executes the predicted instructions in any 'spare time' it might happen to have (for example, inside a pipeline stall). This is a lot more effective than you might think!
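
One classic prediction scheme (a general textbook technique, not tied to any particular processor mentioned here) is the two-bit saturating counter, sketched below:

```python
# A two-bit saturating counter, a classic branch-prediction scheme.
# States 0-1 predict "not taken", states 2-3 predict "taken"; it takes
# two wrong guesses in a row to flip the prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start at "weakly taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

predictor = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:  # a loop branch: taken 9 times, then exits
    if predictor.predict() == taken:
        correct += 1
    predictor.update(taken)

print(f"{correct}/10 predicted correctly")  # 9/10 - loop branches are very predictable
```

Because a single anomaly (such as a loop finally exiting) doesn't flip the counter, branches that almost always go the same way are predicted correctly almost every time.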

Out of Order Execution

Out of order execution is a technique where the processor attempts to re-arrange the instructions queued up for it, such that it can complete the batch of instructions in the shortest possible time, with the fewest pipeline stalls.
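
The sketch below captures the core idea with a hypothetical three-instruction program and made-up latencies: an instruction may issue as soon as the registers it reads are ready, regardless of its position in program order.

```python
# Sketch of out-of-order issue with a hypothetical three-instruction
# program: an instruction may issue as soon as the registers it reads
# hold valid data, regardless of its position in program order.

program = [
    # (name, registers read, register written, latency in ticks)
    ("load r1",        set(),        "r1", 3),  # slow memory load
    ("add r2, r1, r1", {"r1"},       "r2", 1),  # must wait for the load
    ("mul r3, r4, r5", {"r4", "r5"}, "r3", 1),  # independent of both
]

ready = {"r4", "r5"}  # registers that already hold valid data
tick, pending, in_flight = 0, list(program), []

while pending or in_flight:
    # Retire any instruction that has finished, making its result available.
    for finish, writes, name in [f for f in in_flight if f[0] <= tick]:
        ready.add(writes)
        in_flight.remove((finish, writes, name))
    # Issue every instruction whose inputs are ready - not in program order.
    for instr in [i for i in pending if i[1] <= ready]:
        name, _, writes, latency = instr
        print(f"tick {tick}: issue {name}")
        in_flight.append((tick + latency, writes, name))
        pending.remove(instr)
    tick += 1

# The mul issues at tick 0, before the add that precedes it in the program,
# because the add is still waiting on the load's result.
```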

Symmetric Multiprocessing (and Multi-Core Processors)

Symmetric Multiprocessing (SMP) is a technique where there is more than one processing unit (of identical type), so that more than one instruction can be executed at the same time. Virtually all consumer multi-processor systems are designed in this way.

When transistors on silicon were comparatively large, this meant having more than one CPU socket on the motherboard, with each CPU package holding one processor core. Nowadays, a single CPU package can contain many cores - these are referred to as multi-core processors.

However, there are issues with symmetric multiprocessing - most importantly, not every workload can be split across multiple processors.

You can think of parallelisable workloads with the analogy of trying to cook a meal: if a meal takes 60 minutes to cook in the oven, it doesn't mean you can split it in half, and cook each half in a separate oven for 30 minutes. Cooking is not parallelisable.

However, the more kitchen hands available for chopping onions, the faster onions can be chopped. This workload is very parallelisable.

For these reasons, adding multiple processors doesn't necessarily lead to a doubling (or quadrupling) of performance for a particular task. However, nowadays it's fairly uncommon for mainstream computers to concentrate on a single task - instead, these systems are constantly multitasking, running more than one process at once.

In this use case, multi-core processors can provide significant benefits: multiple (active) tasks can run at once with little or no performance penalty - in a 4-core configuration, for example, up to four processes can each have their own execution core.
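
The cooking analogy is formalised by Amdahl's law: if only a fraction p of a task can be parallelised, then n cores give a speedup of 1 / ((1 - p) + p/n). A quick sketch:

```python
# Amdahl's law: the speedup from n cores when only a fraction p of a
# task can be parallelised. In the meal analogy, oven time is the
# serial part; chopping onions is the parallel part.

def speedup(p, n_cores):
    return 1 / ((1 - p) + p / n_cores)

for cores in (2, 4, 8):
    print(f"{cores} cores: "
          f"50% parallel -> {speedup(0.50, cores):.2f}x, "
          f"95% parallel -> {speedup(0.95, cores):.2f}x")
# Even with 8 cores, a half-serial task speeds up by less than 2x.
```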

Simultaneous Multithreading

Simultaneous multithreading (SMT) refers to the process of exploiting pipeline stalls in the processor to feed in another thread of execution. The most well-known implementation of this is Intel's Hyper-Threading Technology.

This technique seeks to fill up these bubbles in the pipeline by scheduling another execution thread in them. What this means is that a single physical SMT-capable CPU core will make itself appear to the operating system as two logical processors, and it will interleave the processes scheduled for each of these two processors such that any potential bubbles from one logical processor can be used to execute the instructions of the other.

A visual representation of Simultaneous Multithreading. Diagram: Intel

Although this isn't as good as having two dedicated processor cores, it does make each individual core more efficient.

Pipeline stalls are especially troublesome in a deeply pipelined CPU - it was no mistake that Intel introduced simultaneous multithreading on its Pentium 4 line.
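
As a rough sketch of the idea, the toy model below interleaves two hypothetical instruction streams, handing thread A's stall ticks to thread B:

```python
# Sketch of SMT on one core: thread B's instructions are slotted into
# thread A's stall ticks. Both traces are hypothetical; '.' marks a
# tick where thread A is stalled (a bubble).

thread_a = ["a1", ".", ".", "a2", "a3", ".", "a4"]  # stalls waiting on memory
thread_b = ["b1", "b2", "b3", "b4"]

issued, b = [], iter(thread_b)
for slot in thread_a:
    if slot != ".":
        issued.append(slot)          # thread A has work: it keeps the slot
    else:
        issued.append(next(b, "."))  # bubble: hand the slot to thread B

print(issued)  # ['a1', 'b1', 'b2', 'a2', 'a3', 'b3', 'a4'] - no wasted ticks
```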

CPU Internals

The CPU is where almost all of the operations of the computer are performed. The CPU includes a set of registers where information such as data and addresses may be stored while processing, and it has access to the buses to transfer information to and from memory.

The main components of the CPU are:

Main components of a CPU

To see how these operate, let's consider the instruction:

add E,B
Add the contents of register E to the contents of register B and save the result into register B.

Following the diagram above, this instruction may be implemented as follows:

  1. Transfer the contents of register E to an ALU input via the A-Bus.
  2. Transfer the contents of register B to the other ALU input via the C-Bus.
  3. Select add as the ALU operation.
  4. Transfer the result from the ALU to register B via the B-Bus.
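
These four steps can be sketched in code, modelling the registers as a dict and each bus transfer as an assignment; the starting values are arbitrary:

```python
# Sketch of the register-transfer steps for "add E,B". Registers are
# modelled as a dict and each bus transfer as an assignment; the
# starting values are arbitrary.

registers = {"B": 7, "E": 35}

alu_input_1 = registers["E"]            # 1. register E -> ALU input, via the A-Bus
alu_input_2 = registers["B"]            # 2. register B -> other ALU input, via the C-Bus
alu_result = alu_input_1 + alu_input_2  # 3. select add as the ALU operation
registers["B"] = alu_result             # 4. ALU result -> register B, via the B-Bus

print(registers)  # {'B': 42, 'E': 35}
```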

Fetch, Decode and Execute

The standard computer cycle consists of continuously Fetching the next instruction and then Decoding and Executing the instruction. In each iteration:

  1. The PC is copied into the MAR to access the memory location of the next instruction, and the CPU also sets the Read/Write line to read.
  2. The MAR drives the address onto the Address Bus, which selects the memory location to be accessed.
  3. The contents of the accessed memory location are copied onto the Data Bus and moved into the CPU's Instruction Decoder for decoding.
  4. The Instruction Decoder decodes the instruction and uses its Control Lines to prepare to execute it: clocking data into registers, enabling registers onto buses, selecting ALU functions, and so on.
  5. The ALU executes the instruction.
  6. The PC is incremented to the address of the next instruction, ready for the next Fetch.
  7. GOTO 1
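
The whole cycle can be sketched as a loop. The accumulator machine below is invented for illustration; real instruction sets are far richer, but the fetch-decode-execute rhythm is the same:

```python
# A minimal fetch-decode-execute loop for a made-up accumulator machine.
# Memory holds (opcode, operand) pairs; the opcodes are invented.

memory = [
    ("LOAD", 5),  # acc = 5
    ("ADD",  3),  # acc = acc + 3
    ("HALT", 0),
]

pc, acc = 0, 0
while True:
    opcode, operand = memory[pc]  # steps 1-3: fetch the instruction at the PC
    pc += 1                       # step 6: increment the PC, ready for the next fetch
    if opcode == "LOAD":          # step 4: decode selects the operation...
        acc = operand             # step 5: ...which is then executed
    elif opcode == "ADD":
        acc += operand
    elif opcode == "HALT":
        break                     # step 7: otherwise, GOTO 1

print(acc)  # 8
```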