The L1 data cache should usually be in the most critical recurrence on the CPU, because few things improve instructions per cycle (IPC) as directly as a larger data cache, a larger data cache takes longer to access, and pipelining the data cache makes IPC worse. This page explains a way of reducing the latency of the L1 data cache access by fusing the address generation sum operation with the decode operation in the cache SRAM.
The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused decode described here.
The recurrence we most want to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. If we assume that load results are bypassed with the same priority as integer results, then we can summarize this recurrence as a load followed by another load, as if the program was following a linked list.
The rest of this page assumes an Instruction set architecture with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most RISC ISAs fit this description. The critical recurrence, then, is an adder, a decoder, the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering muxes, and the bypass muxes.
For this example, a direct-mapped 16KB data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, and there are 2048 lines, addressed by Addr[13:3]. The sum addressed SRAM idea applies equally well to set associative caches.
The SRAM decoder for this example has an 11 bit input, Addr[13:3],
and 2048 outputs, the decoded word lines. One word line is driven high
in response to each unique Addr[13:3] value.
In the simplest form of decoder, each of the 2048 lines is
logically an AND gate. The 11 bits (call them A[13:3] and their
complements (call them B[13:3]) are driven up the decoder. For each
line, 11 bits or complements are fed into an 11-input AND gate. For
instance, 1026 decimal is equal to 10000000010 binary. The function
for line 1026 would be:
Recall that the SRAM is indexed with the result of an add. Call
the summands R (for register) and O (for the offset to that register).
The sum-addressed decoder is going to decode R+O. For each decoder
line, call the line number L.
Suppose that our decoder drove both R and O over each decoder line,
and each decoder line implemented:
With this formulation, each row in the decoder is a set of full
adders which reduce the base register, the offset, and the row number
to a carry-save format, and a comparator. We'll find that most of
this hardware is redundant below, but for now it's simpler to think of
it all existing in each row.
The formulation above checks the entire result of an add.
However, in a CPU cache decoder, the entire result of the add is a
byte address, and the cache is usually indexed with a larger address,
in our example, that of a 8-byte block. Ideally, we'd like to ignore
a few of the LSBs of the address. We can't ignore the LSBs of the two
addends, however, because they may produce a carry-out which would
change the doubleword addressed.
Suppose that we add R[13:3] and O[13:3] to get some index I[13:3].
The actual address Addr[13:3] is equal to either I[13:3], or I[13:3] +
1, depending on whether R[2:0]+O[2:0] generates a carry-out. We can
fetch both I and I+1 if we have two banks of SRAM, one with even
addresses, and one with odd. The even bank holds addresses 000xxx,
010xxx, 100xxx, 110xxx, etc, and the odd bank holds addresses 001xxx,
011xxx, 101xxx, 111xxx, etc. We can then use the carry-out from
R[2:0]+O[2:0] to select the even or odd doubleword fetched later.
Note that fetching from two half-size banks of SRAM will dissipate
more power than fetching from one full-size bank, since we are
switching more sense amps and data steering logic.
Referring to the diagram to the right, we can see that the even bank
will fetch line 110 when I[13:3]
In general, the odd SRAM bank should fetch line Lo==2N+1 when
either I[13:3]
Similarly, the even SRAM bank fetches line Le==2N when either
I[13:3]
Before we begin collapsing redundancy between rows, lets review:
Each row of each decoder for each of two banks implements a set of
full adders which reduce the three numbers to be added (R[13:3],
O[13:3], and L) to two numbers (S[14:4] and C[13:3]). The LSB
(
We can partially specialize the full adders to 2-input and, or,
xor, and xnor because the L input is constant. The resulting
expressions are common to all lines of the decoder and can be
collected at the bottom.
A simpler data cache path would have an adder followed by a
traditional decoder. For our example cache subsystem, the critical
path would be a 14 bit adder, producing true and complement values,
followed by an 11 bit AND gate for each row of the decoder.
In the sum-addressed design, the final AND gate in the decoder
remains, although 10 bits wide instead of 11. The adder has been
replaced by a four input logical expression at each bit. The latency
savings comes from the speed difference between the adder and that
four input expression, a savings of perhaps three simple CMOS
gates.
If the reader feels that this was an inordinate amount of
brain-twisting work for a three gate improvement in a multi-cycle
critical path, then the reader has a better appreciation for the level
to which modern CPUs are optimized.
Many decoder designs avoid high-fanin AND gates in the decode line
itself by employing a predecode stage. For instance, an 11 bit
decoder might be predecoded into three groups of 4, 4, and 3 bits
each. Each 3 bit group would drive 8 wires up the main decode array,
each 4 bit group would drive 16 wires. The decoder line then becomes
a 3 input AND gate. This reorganization can save significant
implementation area and some power.
This same reorganization can be applied to the sum-addressed
decoder. Each bit in the non-predecoded formulation above can be
viewed as a local two-bit add. With predecoding, each predecode group
is a local three, four, or even five bit add, with the predecode
groups overlapping by one bit.
Predecoding generally increases the number of wires traversing the
decoder, and sum-addressed decoders generally have about twice as many
wires as the equivalent simple decoder. These wires can be the
limiting factor on the amount of feasible predecoding.
Paul Demone has an explanation of sum-addressed caches in a realworldtech article.
Heald et al have a paper
in ISSCC 1998 that explains what may be the original sum-addressed
cache in the Ultrasparc III.Sum-addressed cache: Collapse the adder and decoder
wordline[1026] = A[13] & B[12] & B[11] & B[10] & B[9] & B[8] & B[7] & B[6] & B[5] & A[4] & B[3]
Both the carry chain of the adder and the decoder combine
information from the entire width of the index portion of the address.
Combining information across the entire width twice is redundant. A
sum-addressed SRAM combines the information just once by implementing
the adder and decoder together in one structure.wordline[L] = (R+O)==L
We can use a set of full adders to reduce R+O+~L to S+C (this is
carry save addition). S+C11..1 <=> S
~C. There will be no carries
in the final add. Note that since C is a row of carries, it's shifted
up one bit, so that R[13:3]+O[13:3]+~L[13:3] == S[13:3] + C[14:4]Ignoring the LSBs: Late select on carry
Match generation
I[13:3] even bank
fetches lineodd bank
fetches line100 100 101 101 110 101 110 110 111 101 or I[13:3]
110. The odd bank
will fetch line 101 when I[13:3]100 or I[13:3]
101.2N or I[13:3]
2N+1. We can write these two
conditions as:I[13:3] = Lo-1 => R[13:3] + O[13:3] + ~Lo+1 = 11..11
=> R[13:3] + O[13:3] + ~Lo = 11..10
I[13:3] = Lo => R[13:3] + O[13:3] + ~Lo = 11..11
We ignore the last digit of the compare: (S+C)[13:4]==11..12N or I[13:3]
2N-1. We can write these two conditions
as follows, and once again ignore the last digit of the compare.I[13:3] = Le-1 => R[13:3] + O[13:3] + ~Le = 11..10
I[13:3] = Le => R[13:3] + O[13:3] + ~Le = 11..11
Gate Level Implementation
R13 ... R6 R5 R4 R3
O13 ... O6 O5 O4 O3
L13 ... L6 L5 L4 L3
--------------------------
S13 ... S6 S5 S4 S3
C14 C13 ... C6 C5 C4
S[3]) is discarded. Carry out (
C[14]) is also discarded. The
row matches if S[13:4] == ~C[13:4], which is &( xor(S[13:4],
C[13:4])).S0;i = S(Ri, Oi, 0) = Ri xor Oi
S1;i = S(Ri, Oi, 1) = Ri xnor Oi
C0;i+1 = C(Ri, Oi, 0) = Ri and Oi
C1;i+1 = C(Ri, Oi, 1) = Ri or Oi.
At each digit position, there are only two possible Si,
two possible Ci, and four possible xors between them:Li=0 and Li-1=0: X0;0;i = S0;i xor C0;i = Ri xor Oi xor (Ri-1 and Oi-1)
Li=0 and Li-1=1: X0;1;i = S0;i xor C1;i = Ri xor Oi xor (Ri-1 or Oi-1)
Li=1 and Li-1=0: X1;0;i = S1;i xor C0;i = Ri xnor Oi xor (Ri-1 and Oi-1) = !X0;0;i
Li=1 and Li-1=1: X1;1;i = S1;i xor C1;i = Ri xnor Oi xor (Ri-1 or Oi-1) = !X0;1;i
One possible decoder for our example might calculate these four
expressions for each of the bits 4..13, and drive all 40 wires up the
decoder. Each line of the decoder would select one of the four wires
for each bit, and consist of an 10-input AND.What has been saved?
Furthur optimizations: predecode
References