The original MIPS, SPARC, and Motorola 88000 CPUs were classic scalar RISC processors. Later, Hennessy and Patterson invented yet another classic RISC, the DLX, for use in their seminal textbook "Computer Architecture: A Quantitative Approach."
Each of these designs fetched and attempted to execute one instruction per cycle. Each design had a five stage execution pipeline. During operation, each pipeline stage would work on one instruction at a time.
Each of these stages consisted of an initial set of flip-flops, and combinatorial logic which operated on the outputs of those flops.
=The Classic Five Stage RISC Pipeline=
==Instruction Fetch==
The Instruction Cache on these machines had a latency of one cycle.
During the Instruction Fetch stage, a 32 bit instruction was fetched
from the cache.
At the same time the instruction was fetched, these machines predicted
the address of the next instruction by incrementing the address of the
instruction just fetched. This prediction was always wrong in the
case of a taken branch, jump, or exception, but the penalty for being
wrong was small, and a simple increment takes very little circuitry.
Later machines would use more complicated and accurate algorithms
(branch prediction and branch target prediction) to guess the next
instruction address.
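The increment-and-guess scheme can be sketched in a few lines; the function names and the 4-byte instruction size (true of MIPS and SPARC, whose instructions are all 32 bits wide) are illustrative assumptions:

```python
WORD = 4  # each instruction is one 32-bit word

def predict_next_pc(pc):
    """Fetch's guess: the next instruction is at pc + 4."""
    return pc + WORD

def mispredicted(pc, actual_next_pc):
    """True when the sequential guess turns out to be wrong,
    e.g. because the fetched instruction was a taken branch."""
    return predict_next_pc(pc) != actual_next_pc
```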
==Decode==
All MIPS and SPARC instructions have at most two register inputs.
During the decode stage, these two register names are identified
within the instruction, and the two registers named are read from the
register file. In the MIPS design, the register file had 32 entries.
At the same time the register file was read, instruction issue logic
in this stage determined if the pipeline was ready to execute the
instruction in this stage. If not, the issue logic would cause both
the Instruction Fetch stage and the Decode stage to stall. On a stall
cycle, the stages would prevent their initial flops from accepting new
bits.
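As a rough illustration of what such issue logic checks, here is a sketch of one classic hazard: an instruction in Decode that reads the destination register of a load still in Execute. The instruction representation and this specific stall condition are illustrative assumptions (the original MIPS I in fact exposed this case to software as a load delay slot rather than interlocking):

```python
from collections import namedtuple

# Hypothetical decoded-instruction record: opcode, destination, source registers.
Instr = namedtuple("Instr", "op dest srcs")

def must_stall(decode_instr, execute_instr):
    """Load-use hazard check. A load's data is not available until the
    end of the Access stage, so an instruction in Decode that reads the
    load's destination cannot receive the value in time via bypassing
    and must wait one cycle."""
    if execute_instr is None or execute_instr.op != "load":
        return False
    return execute_instr.dest in decode_instr.srcs
```

On a stall cycle, the Fetch and Decode flops hold their contents; the one-cycle bubble lets the load reach the Access stage, after which bypassing supplies the value.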
If the instruction decoded was a branch or jump, the target address of
the branch or jump was computed in parallel with reading the register
file. The branch condition is also determined in parallel, and if the
branch is taken or if the instruction is a jump, the target address is
the next cycle's instruction fetch pointer.
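As a concrete example of that target computation (the encoding here is MIPS-style and is an assumption; the text above only discusses the timing): MIPS conditional branches carry a 16-bit signed word offset, taken relative to the instruction after the branch.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def mips_branch_target(branch_pc, imm16):
    """MIPS-style PC-relative target: a signed word offset relative to
    the instruction following the branch (the delay slot)."""
    return branch_pc + 4 + (sign_extend16(imm16) << 2)
```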
==Execute==
Instructions on these simple RISC machines can be divided into three
latency classes:

*Single cycle latency instructions: during the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.
*Two cycle latency instructions (memory references): during the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
*Multi-cycle instructions: during the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multicycle instructions wrote their results to a separate set of registers.
During this stage, single cycle latency instructions simply have their
results forwarded to the next stage. This forwarding ensures that
both single and two cycle instructions always write their results in
the same stage of the pipeline, so that just one write port to the
register file can be used, and it is always available.
==Access==
There have been a wide variety of data cache organizations. By far
the simplest is direct mapped and virtually tagged, which is what will
be assumed for the following explanation.
The cache has two SRAMs, one storing data, the other storing tags.
During a load, the SRAMs are read in parallel during the access stage.
The single tag read is compared with the virtual address specified by
the load.
If the two are equal, then the datum we are looking for is resident in
the cache, and has just been read from the data SRAM. The load is a
hit, and the load instruction can complete the writeback stage
normally.
If the tag and virtual address are not equal, then the line is not in
the cache. The CPU pipeline must suspend operation (described below
under Cache Miss Handling) while a state machine reads the required
data from memory into the cache, and optionally writes any dirty data
evicted from the cache back into memory.
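A minimal sketch of this load path, assuming an illustrative 1 KB cache with 16-byte lines (the sizes, class, and method names are not from any particular machine):

```python
LINE_BYTES = 16   # 4 words per line (illustrative)
NUM_LINES = 64    # 1 KB direct-mapped cache (illustrative)

class Cache:
    """Direct-mapped, virtually tagged cache: one tag SRAM, one data SRAM."""
    def __init__(self):
        self.tags = [None] * NUM_LINES   # tag SRAM
        self.data = [None] * NUM_LINES   # data SRAM, one line per entry

    def load(self, vaddr):
        """Read both SRAMs 'in parallel' and compare the stored tag with
        the tag bits of the virtual address. Returns (hit, line)."""
        index = (vaddr // LINE_BYTES) % NUM_LINES
        tag = vaddr // (LINE_BYTES * NUM_LINES)
        hit = self.tags[index] == tag
        return hit, (self.data[index] if hit else None)

    def fill(self, vaddr, line):
        """What the miss state machine does: install the line and its tag
        (write-back of a dirty victim is omitted here)."""
        index = (vaddr // LINE_BYTES) % NUM_LINES
        self.tags[index] = vaddr // (LINE_BYTES * NUM_LINES)
        self.data[index] = line
```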
During a store, the tag SRAM is read to determine if the store is a
hit or miss, and if a miss, the same state machine brings the datum
into the cache. The store data cannot be written to the cache during
the access stage because the processor does not yet know if the
correct line is resident. Instead, the store data is written to the
cache during the next store instruction. In the interim, the
store data is held in a Store Data Queue, which, in a classic RISC
pipeline is just a single 32 bit register. For reference, the virtual
address written is held in the Store Address Queue, again just a
single 32 bit register. On more complicated pipelines, the queues
actually have multiple hardware registers and variable length.
There is one more complication. A load immediately after a store
could reference the same memory address, in which case the data must
come from the Store Data Queue rather than from the cache's data SRAM.
So, during a load, the load address is also checked against the Store
Address Queue. If there is a match, the data from the Store Data
Queue is forwarded to the writeback stage, rather than the data from
the data SRAM. This operation does not change the flow of the load
operation through the pipeline.
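The one-entry queues and the forwarding check can be sketched as follows; the class and method names are illustrative:

```python
class StoreQueue:
    """One-entry store queue, as in the classic pipeline: the address and
    data of the most recent store, not yet written into the data SRAM."""
    def __init__(self):
        self.addr = None   # Store Address Queue (one 32-bit register)
        self.data = None   # Store Data Queue (one 32-bit register)

    def store(self, addr, data):
        """Queue a new store; the previous entry (returned here) is what
        gets written into the data SRAM during this store's access stage."""
        pending = (self.addr, self.data)
        self.addr, self.data = addr, data
        return pending

    def forward(self, load_addr):
        """During a load, check the load address against the queued store
        address; on a match the queued data supersedes the data SRAM."""
        if self.addr == load_addr:
            return self.data
        return None
```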
==Writeback==
During this stage, both single cycle and two cycle instructions write
their results into the register file.
=Bypassing=
Suppose the CPU is executing the following piece of code:

 SUB r3,r4 -> r10
 AND r10,3 -> r11

The instruction fetch and decode stages send the second instruction one
cycle after the first, so the two instructions flow down the pipeline
as shown in this diagram:

            cycle 0  cycle 1  cycle 2  cycle 3  cycle 4  cycle 5
 Fetch       SUB      AND
 Decode               SUB      AND
 Execute                       SUB      AND
 Access                                 SUB      AND
 Writeback                                       SUB      AND
In cycle 2, the Decode stage fetches r10 from the register file.
Because the SUB instruction writing to r10 is simultaneously in the
execute stage, the value read from the register file is wrong.
The solution to this problem is a pair of bypass multiplexors. These
multiplexors sit at the end of the decode stage, and their flopped
outputs are the inputs to the ALU. Each multiplexor selects between a
register file read port, the current output of the ALU, and the
current output of the access stage (which is either a loaded value or
a forwarded ALU result).
Decode stage logic compares the registers written by instructions in
the execute and access stages of the pipeline to the registers read by
the instruction in the decode stage, and causes the multiplexors to
select the most recent data. These bypass multiplexors make it
possible for the pipeline to execute simple instructions with just the
latency of the ALU, the multiplexor, and a flip-flop. Without the
multiplexors, the latency of reading and writing the register file
would have to be included in the latency of these instructions.
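The selection rule the decode logic implements (prefer the youngest in-flight writer of the register, otherwise fall back to the register file) can be sketched as follows; the function shape is an illustrative assumption:

```python
def bypass_select(reg, regfile_value, execute, access):
    """Pick the most recent value of register `reg` for an ALU input.

    `execute` and `access` are (dest_reg, value) pairs for the
    instructions currently in those stages, or None. The instruction in
    Execute is younger than the one in Access, so it wins when both
    wrote `reg`."""
    if execute is not None and execute[0] == reg:
        return execute[1]
    if access is not None and access[0] == reg:
        return access[1]
    return regfile_value
```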
=Control Flow=
The classic RISC pipeline resolves branches in the Decode stage, which
means the branch resolution recurrence is two cycles long. There are
three implications:
*The branch resolution recurrence goes through quite a bit of circuitry: the instruction cache read, register file read, branch condition compute (which involves a 32 bit compare on the MIPS CPUs), and the next instruction address multiplexor.
*Because branch and jump targets are calculated in parallel with the register read, RISC ISAs typically do not have instructions which branch to a register+offset address. Jump to register is supported.
*On any taken branch, the instruction immediately after the branch is always fetched from the instruction cache. If this instruction is ignored, there is a one cycle per taken branch IPC penalty, which is quite large. The SPARC, MIPS, and MC88K designers decided to avoid this penalty by architecting a branch delay slot into their ISAs.
A delayed branch specifies that the jump to a new location happens
after the next instruction. That next instruction is the one
unavoidably loaded by the instruction cache after the branch.
Delayed branches have been criticized as a poor short-term choice in
ISA design. Compilers typically have some difficulty finding
logically independent instructions to place after the branch (the
instruction after the branch is called the delay slot). Superscalar
processors, which fetch multiple instructions per cycle and have
superior branch prediction, do not benefit from delayed branches. The
Alpha ISA left out delayed branches, as it was intended for
superscalar processors.
The most serious drawback to delayed branches is the additional
control complexity they entail. If the delay slot instruction takes
an exception, the processor has to be restarted on the branch, rather
than that next instruction. Exceptions now have essentially two
addresses, the exception address and the restart address, and
generating and distinguishing between the two correctly in all cases
has been a source of bugs for later designs.
=Exceptions=
Suppose a 32-bit RISC is processing an ADD instruction which adds
two large numbers together. Suppose the resulting number does not fit
in 32 bits. What happens?
The simplest solution, provided by every architecture, is wrapping
arithmetic. This is the standard arithmetic provided by the language
C. Numbers greater than the maximum possible encoded value have their
most significant bits chopped off until they fit. In the usual
integer number system, 3000000000+3000000000=6000000000. With
unsigned 32 bit wrapping arithmetic, 3000000000+3000000000=1705032704.
This may not seem terribly useful. The largest benefit of wrapping
arithmetic is that every operation has a well defined result.
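The arithmetic of the example above is easy to check: wrapping addition keeps only the low 32 bits of the sum.

```python
MASK32 = 0xFFFFFFFF

def add_wrap32(a, b):
    """Unsigned 32-bit wrapping add: discard everything above bit 31."""
    return (a + b) & MASK32
```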
But the programmer, especially if programming in a language supporting
Bignums (e.g. lisp or scheme), may not want wrapping arithmetic. Some
architectures (e.g. MIPS) define special addition operations that
branch to special locations on overflow, rather than wrapping the
result. Software at the target location is responsible for fixing the
problem. This special branch is called an exception. Exceptions
differ from regular branches in that the target address is not
specified by the instruction itself, and the branch decision is
dependent on the outcome of the instruction.
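The overflow condition such a trapping add must detect (MIPS ADD traps on signed overflow, while ADDU wraps) can be stated directly: the operands have the same sign and the 32-bit sum has the opposite sign. A sketch, treating values as 32-bit two's-complement bit patterns:

```python
def add_overflows32(a, b):
    """True when adding two 32-bit two's-complement patterns overflows:
    same-sign operands whose truncated sum has the opposite sign."""
    s = (a + b) & 0xFFFFFFFF
    sign = lambda x: (x >> 31) & 1
    return sign(a) == sign(b) and sign(s) != sign(a)
```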
The most common kind of software-visible exception on one of the
classic RISC machines is a TLB miss (see Virtual memory).
Exceptions are different from branches and jumps, because those other
control flow changes are resolved in the decode stage. Exceptions are
resolved in the writeback stage. When an exception is detected, the
following instructions (earlier in the pipeline) are marked as
invalid, and as they flow to the end of the pipe their results are
discarded. The program counter is set to the address of a special
exception handler, and special registers are written with the
exception location and cause.
To make it easy (and fast) for the software to fix the problem and restart
the program, the CPU must take a precise exception. A precise exception
means that all instructions up to the excepting instruction have been
executed, and the excepting instruction and everything afterwards have not
been executed.
In order to take precise exceptions, the CPU must commit changes to
the software-visible state in program order. This in-order commit
happens very naturally in the classic RISC pipeline. Most
instructions write their results to the register file in the writeback
stage, and so those writes automatically happen in program order.
Store instructions, however, write their results to the Store Data Queue
in the access stage. If the store instruction takes an exception, the
Store Data Queue entry is invalidated so that it is not written to the
cache data SRAM later.
=Cache Miss Handling=
Occasionally, either the data or instruction cache will not have a
datum or instruction required. In these cases, the CPU must suspend
operation until the cache can be filled with the necessary data, and
then must resume execution. The problem of filling the cache with the
required data (and potentially writing back to memory the evicted
cache line) is not specific to the pipeline organization, and is not
discussed here.
There are two strategies to handle the suspend/resume problem. The
first is a global stall signal. This signal, when activated, prevents
instructions from advancing down the pipeline, generally by gating off
the clock to the flip-flops at the start of each stage. The
disadvantage of this strategy is that there are a large number of
flip-flops, so the global stall signal takes a long time to propagate.
Since the machine generally has to stall in the same cycle that it
identifies the condition requiring the stall, the stall signal becomes
a speed-limiting critical path.
Another strategy to handle suspend/resume is to reuse the exception
logic. The machine takes an exception on the offending instruction,
and all further instructions are invalidated. When the cache has been
filled with the necessary data, the instruction which caused the cache
miss is restarted. To expedite data cache miss handling, the
instruction can be restarted so that its access cycle happens one
cycle after the data cache has been filled.