Sum addressed decoder

The L1 data cache should usually be in the most critical recurrence on the CPU, because few things improve instructions per cycle (IPC) as directly as a larger data cache, a larger data cache takes longer to access, and pipelining the data cache makes IPC worse. This page explains a way of reducing the latency of the L1 data cache access by fusing the address generation sum operation with the decode operation in the cache SRAM.

The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused decode described here.

The recurrence we most want to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. If we assume that load results are bypassed with the same priority as integer results, then we can summarize this recurrence as a load followed by another load, as if the program was following a linked list.

The rest of this page assumes an Instruction set architecture with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most RISC ISAs fit this description. The critical recurrence, then, is an adder, a decoder, the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering muxes, and the bypass muxes.

For this example, a direct-mapped 16KB data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, and there are 2048 lines, addressed by Addr[13:3]. The sum addressed SRAM idea applies equally well to set associative caches.

Table of contents

1 Sum-addressed cache: Collapse the adder and decoder
2 L <=> R+O-L
3 -1
4 11..1 <=> S
5 Ignoring the LSBs: Late select on carry
6 Match generation
7 101 or I[13:3]
8 100 or I[13:3]
9 2N or I[13:3]
10 2N or I[13:3]
11 Gate Level Implementation
12 S[3]) is discarded. Carry out (
13 What has been saved?
14 Furthur optimizations: predecode
15 References

Sum-addressed cache: Collapse the adder and decoder

The SRAM decoder for this example has an 11 bit input, Addr[13:3], and 2048 outputs, the decoded word lines. One word line is driven high in response to each unique Addr[13:3] value.

In the simplest form of decoder, each of the 2048 lines is logically an AND gate. The 11 bits (call them A[13:3] and their complements (call them B[13:3]) are driven up the decoder. For each line, 11 bits or complements are fed into an 11-input AND gate. For instance, 1026 decimal is equal to 10000000010 binary. The function for line 1026 would be:

wordline[1026] = A[13] & B[12] & B[11] & B[10] & B[9] & B[8] & B[7] & B[6] & B[5] & A[4] & B[3]

Both the carry chain of the adder and the decoder combine information from the entire width of the index portion of the address. Combining information across the entire width twice is redundant. A sum-addressed SRAM combines the information just once by implementing the adder and decoder together in one structure.

Recall that the SRAM is indexed with the result of an add. Call the summands R (for register) and O (for the offset to that register). The sum-addressed decoder is going to decode R+O. For each decoder line, call the line number L.

Suppose that our decoder drove both R and O over each decoder line, and each decoder line implemented:

wordline[L] = (R+O)==L
(R+O)
L <=> R+O-L
0 
         <=> R+O+~L+1==0
         <=> R+O+~L-1
11..1.

We can use a set of full adders to reduce R+O+~L to S+C (this is carry save addition). S+C

11..1 <=> S

~C. There will be no carries in the final add. Note that since C is a row of carries, it's shifted up one bit, so that R[13:3]+O[13:3]+~L[13:3] == S[13:3] + C[14:4]

With this formulation, each row in the decoder is a set of full adders which reduce the base register, the offset, and the row number to a carry-save format, and a comparator. We'll find that most of this hardware is redundant below, but for now it's simpler to think of it all existing in each row.

Ignoring the LSBs: Late select on carry

The formulation above checks the entire result of an add. However, in a CPU cache decoder, the entire result of the add is a byte address, and the cache is usually indexed with a larger address, in our example, that of a 8-byte block. Ideally, we'd like to ignore a few of the LSBs of the address. We can't ignore the LSBs of the two addends, however, because they may produce a carry-out which would change the doubleword addressed.

Suppose that we add R[13:3] and O[13:3] to get some index I[13:3]. The actual address Addr[13:3] is equal to either I[13:3], or I[13:3] + 1, depending on whether R[2:0]+O[2:0] generates a carry-out. We can fetch both I and I+1 if we have two banks of SRAM, one with even addresses, and one with odd. The even bank holds addresses 000xxx, 010xxx, 100xxx, 110xxx, etc, and the odd bank holds addresses 001xxx, 011xxx, 101xxx, 111xxx, etc. We can then use the carry-out from R[2:0]+O[2:0] to select the even or odd doubleword fetched later.

Note that fetching from two half-size banks of SRAM will dissipate more power than fetching from one full-size bank, since we are switching more sense amps and data steering logic.

Match generation

I[13:3]	even bank fetches line	odd bank fetches line
100	100	101
101	110	101
110	110	111

Referring to the diagram to the right, we can see that the even bank will fetch line 110 when I[13:3]

101 or I[13:3]

110. The odd bank will fetch line 101 when I[13:3]

100 or I[13:3]

101.

In general, the odd SRAM bank should fetch line Lo==2N+1 when either I[13:3]

2N or I[13:3]

2N+1. We can write these two conditions as:

I[13:3] = Lo-1 =>  R[13:3] + O[13:3] + ~Lo+1 = 11..11
               =>  R[13:3] + O[13:3] + ~Lo   = 11..10
I[13:3] = Lo   =>  R[13:3] + O[13:3] + ~Lo   = 11..11

We ignore the last digit of the compare: (S+C)[13:4]==11..1

2N or I[13:3]

2N-1. We can write these two conditions as follows, and once again ignore the last digit of the compare.

I[13:3] = Le-1 =>  R[13:3] + O[13:3] + ~Le = 11..10
I[13:3] = Le   =>  R[13:3] + O[13:3] + ~Le = 11..11

Gate Level Implementation

    R₁₃ ... R₆  R₅  R₄  R₃
    O₁₃ ... O₆  O₅  O₄  O₃
    L₁₃ ... L₆  L₅  L₄  L₃
--------------------------
    S₁₃ ... S₆  S₅  S₄  S₃
C₁₄ C₁₃ ... C₆  C₅  C₄

Before we begin collapsing redundancy between rows, lets review: Each row of each decoder for each of two banks implements a set of full adders which reduce the three numbers to be added (R[13:3], O[13:3], and L) to two numbers (S[14:4] and C[13:3]). The LSB (

S[3]) is discarded. Carry out (

C[14]) is also discarded. The row matches if S[13:4] == ~C[13:4], which is &( xor(S[13:4], C[13:4])).

We can partially specialize the full adders to 2-input and, or, xor, and xnor because the L input is constant. The resulting expressions are common to all lines of the decoder and can be collected at the bottom.

S_0;i = S(R_i, O_i, 0) = R_i xor O_i
S_1;i = S(R_i, O_i, 1) = R_i xnor O_i
C_0;i+1 = C(R_i, O_i, 0) = R_i and O_i
C_1;i+1 = C(R_i, O_i, 1) = R_i or O_i.

At each digit position, there are only two possible S_i, two possible C_i, and four possible xors between them:

L_i=0 and L_i-1=0: X_0;0;i = S_0;i xor C_0;i = R_i xor O_i xor (R_i-1 and O_i-1)
L_i=0 and L_i-1=1: X_0;1;i = S_0;i xor C_1;i = R_i xor O_i xor (R_i-1 or O_i-1)
L_i=1 and L_i-1=0: X_1;0;i = S_1;i xor C_0;i = R_i xnor O_i xor (R_i-1 and O_i-1) = !X_0;0;i
L_i=1 and L_i-1=1: X_1;1;i = S_1;i xor C_1;i = R_i xnor O_i xor (R_i-1 or O_i-1)  = !X_0;1;i

One possible decoder for our example might calculate these four expressions for each of the bits 4..13, and drive all 40 wires up the decoder. Each line of the decoder would select one of the four wires for each bit, and consist of an 10-input AND.

What has been saved?

A simpler data cache path would have an adder followed by a traditional decoder. For our example cache subsystem, the critical path would be a 14 bit adder, producing true and complement values, followed by an 11 bit AND gate for each row of the decoder.

In the sum-addressed design, the final AND gate in the decoder remains, although 10 bits wide instead of 11. The adder has been replaced by a four input logical expression at each bit. The latency savings comes from the speed difference between the adder and that four input expression, a savings of perhaps three simple CMOS gates.

If the reader feels that this was an inordinate amount of brain-twisting work for a three gate improvement in a multi-cycle critical path, then the reader has a better appreciation for the level to which modern CPUs are optimized.

Furthur optimizations: predecode

Many decoder designs avoid high-fanin AND gates in the decode line itself by employing a predecode stage. For instance, an 11 bit decoder might be predecoded into three groups of 4, 4, and 3 bits each. Each 3 bit group would drive 8 wires up the main decode array, each 4 bit group would drive 16 wires. The decoder line then becomes a 3 input AND gate. This reorganization can save significant implementation area and some power.

This same reorganization can be applied to the sum-addressed decoder. Each bit in the non-predecoded formulation above can be viewed as a local two-bit add. With predecoding, each predecode group is a local three, four, or even five bit add, with the predecode groups overlapping by one bit.

Predecoding generally increases the number of wires traversing the decoder, and sum-addressed decoders generally have about twice as many wires as the equivalent simple decoder. These wires can be the limiting factor on the amount of feasible predecoding.

References

Paul Demone has an explanation of sum-addressed caches in a realworldtech article.

Heald et al have a paper in ISSCC 1998 that explains what may be the original sum-addressed cache in the Ultrasparc III.