Title: 332:578 Deep Submicron VLSI Design Lecture 17 Functional Units, Multiplication, and Division
1 332:578 Deep Submicron VLSI Design, Lecture 17: Functional Units, Multiplication, and Division
- David Harris and Mike Bushnell
- Harvey Mudd College and Rutgers University
- Spring 2005
2Outline
- Unsigned vs. Signed Numbers
- Boolean Operations
- Error Correcting Codes
- Multi-input Adders
- Multipliers
- Priority Encoders
- Dividers
- Summary
Material from CMOS VLSI Design, by Weste and
Harris, Addison-Wesley, 2005
3Signed vs. Unsigned
- For signed numbers, comparison is harder
- C: carry out
- Z: zero (all bits of A-B are 0)
- N: negative (MSB of result)
- V: overflow (inputs had different signs, output sign ≠ sign of B)
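The flag definitions above can be sketched in Python. This is a minimal illustration that assumes the comparator computes B - A as B + ~A + 1 on n-bit words, an assumption chosen because the overflow rule refers to the sign of B. The function names and the 8-bit default width are hypothetical.

```python
def compare_flags(a: int, b: int, n: int = 8) -> dict:
    """Compute B - A on n-bit words as B + ~A + 1 and return the C, Z, N, V flags."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    total = b + (~a & mask) + 1              # subtraction through an adder
    result = total & mask
    sign = lambda w: (w >> (n - 1)) & 1      # MSB of an n-bit word
    return {
        "C": total >> n,                     # carry out of the adder
        "Z": int(result == 0),               # all result bits are 0
        "N": sign(result),                   # MSB of the result
        "V": int(sign(a) != sign(b) and sign(result) != sign(b)),
    }


def unsigned_a_le_b(a: int, b: int, n: int = 8) -> bool:
    # A carry out means B - A did not borrow, so A <= B.
    return compare_flags(a, b, n)["C"] == 1


def signed_a_gt_b(a: int, b: int, n: int = 8) -> bool:
    # N xor V is the sign of the true (unbounded) result of B - A.
    f = compare_flags(a, b, n)
    return (f["N"] ^ f["V"]) == 1
```

Unsigned comparison needs only C, while signed comparison needs both N and V, which is why the signed case is the harder one.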
4Signed vs. Unsigned
5Boolean Logical Operations
6Circuit Operation
- Assign different P values to get various Boolean operations
- MUX between adder and Boolean unit or merge Boolean unit into adder as in TTL 181 ALU
7Coding
- Correct SRAM/DRAM soft errors
- Due to α particles or cosmic rays
- Reduce bit error rates of communication links
- Parity tree example
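A parity tree can be modeled as a pairwise XOR reduction. The sketch below is illustrative only; the function name is hypothetical and the input is assumed to be a non-empty list of 0/1 values.

```python
def parity_tree(bits):
    """XOR-reduce a non-empty list of bits pairwise, level by level."""
    level = list(bits)
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])    # an odd bit passes through to the next level
        level = nxt
    return level[0]


word = [1, 0, 1, 1]
coded = word + [parity_tree(word)]   # append the bit that makes overall parity even
```

A single flipped bit in `coded` makes `parity_tree(coded)` return 1, so the error is detected but its position is unknown.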
8Hamming Error Correcting Codes (ECCs)
- Hamming distance Hd between 2 numbers: the number of bits in which they differ
- Add check bits to data words for ECC
- Increase the Hd between legal code words
- If an illegal code word is detected, the legal code word closest to it is the corrected word
- Parity has Hd of 2: detects but cannot correct errors
- Make Hd = 3: Hamming code of length $2^c - 1$ with c check bits and $N = 2^c - c - 1$ data bits
9Code Generation Procedure
- Number bits from 1 to $2^c - 1$
- Each bit in a position that is a power of 2 is a check bit
- Choose check bit value to get even parity for all bits whose position number has a 1 in the same place as the check bit's position number
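The procedure above can be sketched in Python for c = 3 (a 7-bit code word with 4 data bits). The function names are hypothetical, and the correction step uses the standard syndrome method, which is an addition not spelled out in the bullets above.

```python
def hamming_encode(data_bits, c=3):
    """Place data at non-power-of-2 positions 1..2**c - 1, then fill even-parity check bits."""
    n = (1 << c) - 1
    assert len(data_bits) == n - c
    code = [0] * (n + 1)                 # index 0 unused so indices match bit numbers
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of 2: data position
            code[pos] = next(data)
    for k in range(c):
        check = 1 << k
        for pos in range(1, n + 1):
            if pos != check and pos & check:
                code[check] ^= code[pos]  # even parity over positions with bit k set
    return code[1:]


def hamming_correct(word, c=3):
    """Return (corrected word, error position); position 0 means no error was seen."""
    n = (1 << c) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos - 1]:
            syndrome ^= pos              # XOR of the position numbers of all 1 bits
    fixed = list(word)
    if syndrome:
        fixed[syndrome - 1] ^= 1         # the syndrome names the flipped position
    return fixed, syndrome
```

Because every legal code word has a zero syndrome, a single flipped bit at position p yields syndrome p, and flipping it back recovers the nearest legal code word.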
10Gray Codes
- Binary-reflected code
- Start with all 0 and keep flipping the right-most bit that gives a new string
- Use to save power in finite state machines: successive states follow Gray code
- Use also to synchronize counters across clock domains
- Either get the current or the previous value because only 1 bit changes per clock
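Both descriptions of the binary-reflected code can be checked against each other in a few lines of Python. The function names are hypothetical; `gray` uses the well-known closed form i XOR (i >> 1), and `gray_by_flipping` follows the flipping rule literally.

```python
def gray(i: int) -> int:
    """Closed form of the binary-reflected Gray code for counter value i."""
    return i ^ (i >> 1)


def gray_by_flipping(n_bits: int):
    """Start at all 0s and keep flipping the right-most bit that gives a new string."""
    seq, seen = [0], {0}
    while len(seq) < (1 << n_bits):
        for k in range(n_bits):              # bit 0 is the right-most bit
            cand = seq[-1] ^ (1 << k)
            if cand not in seen:
                seq.append(cand)
                seen.add(cand)
                break
    return seq


codes = [format(g, "03b") for g in gray_by_flipping(3)]
# codes == ['000', '001', '011', '010', '110', '111', '101', '100']
```

Adjacent entries differ in exactly one bit, which is the property that makes a sampled Gray counter read as either the current or the previous value.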
11Gray Code
12Static XOR/XNOR Circuits
13Static XOR/XNOR Circuit
- Does not swing rail-to-rail
14STATIC CMOS XOR
15CPL XOR/XNOR Circuit
16CVSL XOR/XNOR
17Dynamic XOR/XNOR
- Both true and complementary inputs needed
- Violates monotonicity rule
- Solutions
- Push XOR/XNOR to end of chain of Domino logic and build it as static logic
- Use dual-rail Domino logic
20Multi-input Adders
- Suppose we want to add k N-bit words
- Ex: 0001 + 0111 + 1101 + 0010 = 10111
- Straightforward solution: k-1 N-input CPAs
- Large and slow
21Carry Save Addition
- A full adder sums 3 inputs and produces 2 outputs
- Carry output has twice weight of sum output
- N full adders in parallel are called a carry save adder
- Produce N sums and N carry outs
22CSA Application
- Use k-2 stages of CSAs
- Keep result in carry-save redundant form
- Final CPA computes actual result
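A word-level Python sketch of carry-save addition follows. It models each CSA row with bitwise operations on unbounded integers, so word-width truncation is ignored; the function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int):
    """One CSA row: bitwise full adders turn three words into a sum word and a carry word."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1   # carry has twice the weight of sum
    return s, c


def multi_operand_add(words):
    """Add k >= 3 words with k-2 CSA stages, then one carry-propagate addition."""
    s, c = carry_save_add(words[0], words[1], words[2])
    stages = 1
    for w in words[3:]:
        s, c = carry_save_add(s, c, w)       # result stays in carry-save form
        stages += 1
    return s + c, stages                     # the final '+' stands in for the CPA


total, stages = multi_operand_add([0b0001, 0b0111, 0b1101, 0b0010])
# total == 0b10111 after stages == 2 CSA rows
```

No carry ripples inside a CSA stage, so only the final addition pays the carry-propagation delay.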
31Multiplication
- Example
- M x N-bit multiplication
- Produce N M-bit partial products
- Sum these to produce an (M+N)-bit product
32General Form
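- Multiplicand: $Y = (y_{M-1}, y_{M-2}, \ldots, y_1, y_0)$
- Multiplier: $X = (x_{N-1}, x_{N-2}, \ldots, x_1, x_0)$
- Product: $P = \left(\sum_{j=0}^{M-1} y_j 2^j\right)\left(\sum_{i=0}^{N-1} x_i 2^i\right) = \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} x_i y_j 2^{i+j}$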
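The general form can be exercised with a short Python sketch for unsigned operands. Each multiplier bit selects either 0 or the multiplicand, shifted to that bit's weight; the function name is hypothetical.

```python
def multiply_partial_products(y: int, x: int, m: int, n: int):
    """Form N partial products (one per multiplier bit) from an M-bit multiplicand and sum them."""
    assert 0 <= y < (1 << m) and 0 <= x < (1 << n)
    partial_products = [(y if (x >> i) & 1 else 0) << i for i in range(n)]
    return partial_products, sum(partial_products)


pps, product = multiply_partial_products(0b1100, 0b0101, 4, 4)
# pps == [0b1100, 0, 0b110000, 0] and product == 60, which fits in M + N = 8 bits
```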
33 16x16 Mult. Dot Diagram
- Each dot represents a bit
34Array Multiplier
35Rectangular Array
- Squash array to fit rectangular floorplan
36Optimizations
- 1st row adds 1st partial product to pair of 0s
- Change first CSA row to add 1st 3 partial products together
- Reduces row count by 2 and reduces adder propagation delay
- Can also use 1st row of CSAs to add one or two other inputs with no extra delay
- Most common DSP operation: Y = A × B + C
- Speed up by replacing bottommost row with CPA or lookahead or tree adder
- Asymmetric circuit: some inputs have more logical effort than others
37 2's Complement Multiplication
- 2 partial products have negative weight
- Must be subtracted
- Baugh-Wooley algorithm takes 2's comp. of terms to be subtracted
- In example, AND gates replaced by NAND gates in hatched cells
- Extra ones added in unused inputs to take correct 2's complement
- Use XORs to conditionally invert some of the terms to select between signed and unsigned multiplication
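The negative-weight terms can be made concrete with a Python sketch. It sums the individual bit products with explicit signs: a bit product is negative when exactly one of its two factors is a sign bit. This illustrates only which terms must be subtracted, not the NAND-gate and added-ones implementation; the function names are hypothetical.

```python
def to_signed(word: int, bits: int) -> int:
    """Interpret a bits-wide word as a two's complement number."""
    return word - (1 << bits) if (word >> (bits - 1)) & 1 else word


def signed_product_by_weights(x_word: int, y_word: int, n: int, m: int) -> int:
    """Sum the bit products x_i * y_j of two two's complement words with their signs."""
    total = 0
    for i in range(n):
        for j in range(m):
            bit = ((x_word >> i) & 1) & ((y_word >> j) & 1)
            # negative when exactly one factor is a sign bit (weight -2^(n-1) or -2^(m-1))
            negative = (i == n - 1) != (j == m - 1)
            total += -(bit << (i + j)) if negative else (bit << (i + j))
    return total


# 1011 is -5 and 0110 is 6 as 4-bit two's complement words
result = signed_product_by_weights(0b1011, 0b0110, 4, 4)   # -30
```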
38 2's Comp. Multiplier
39Simplified Partial Products
40Modified Baugh-Wooley
41Fewer Partial Products
- Array multiplier requires N partial products
- If we looked at groups of r bits, we could form N/r partial products.
- Faster and smaller?
- Called radix-$2^r$ encoding
- Ex: r = 2: look at pairs of bits
- Form partial products of 0, Y, 2Y, 3Y
- First three are easy, but 3Y requires an adder
42Booth Encoding
- Instead of 3Y, try -Y, then increment next partial product to add 4Y
- Similarly, for 2Y, try -2Y + 4Y in next partial product
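A radix-4 Booth recoding can be sketched in Python as follows. The sketch assumes an unsigned multiplier with an even bit count n and zero-extends it by one extra group, so it produces n/2 + 1 digits; the function names are hypothetical.

```python
def booth_radix4_digits(x: int, n: int):
    """Recode an unsigned n-bit multiplier (n even) into digits in {-2, -1, 0, 1, 2}."""
    assert n % 2 == 0 and 0 <= x < (1 << n)
    digits = []
    prev = 0                                # bit to the right of the first pair is 0
    for i in range(0, n + 2, 2):            # extra top group covers the zero extension
        pair = (x >> i) & 0b11
        hi, lo = pair >> 1, pair & 1
        digits.append(-2 * hi + lo + prev)  # prev carries the "+4Y" into this group
        prev = hi
    return digits


def booth_multiply(y: int, x: int, n: int) -> int:
    """Sum the selected multiples of Y, each weighted by 4**group."""
    return sum(d * y * 4 ** g for g, d in enumerate(booth_radix4_digits(x, n)))


digits = booth_radix4_digits(0b0111, 4)   # [-1, 2, 0]: 7 = -1 + 2*4
```

Every digit is 0, ±1, or ±2, so each partial product needs only a shift and an optional negation, never a 3Y adder.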
50Booth Hardware
- Booth encoder generates control lines for each PP
- Booth selectors choose PP bits
- Xi means add in Y
- 2Xi means add in 2Y
- M means negate partial product
51Sign Extension
- Partial products can be negative
- Require sign extension, which is cumbersome
- High fanout on most significant bit
52Simplified Sign Ext.
- Sign bits are either all 0s or all 1s
- Note that all 0s is all 1s + 1 in proper column
- Use this to reduce loading on MSB
53Even Simpler Sign Ext.
- No need to add all the 1s in hardware
- Precompute the answer!
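The sign-extension identity can be checked numerically. In this Python sketch each partial product has a sign bit that would be replicated from a starting column up to the top of a fixed-width sum; the starting columns, the width, and the function names are hypothetical values chosen for illustration.

```python
def sign_field(s: int, k: int) -> int:
    """k copies of sign bit s, as an integer."""
    return (1 << k) - 1 if s else 0


def extension_sum_direct(signs, starts, width):
    """Add every replicated sign field, as straightforward sign extension would."""
    mask = (1 << width) - 1
    return sum(sign_field(s, width - p) << p for s, p in zip(signs, starts)) & mask


def extension_sum_precomputed(signs, starts, width):
    """Replace each field by all 1s plus the inverted sign bit; the 1s fold into a constant."""
    mask = (1 << width) - 1
    constant = sum(((1 << (width - p)) - 1) << p for p in starts) & mask  # known at design time
    inverted = sum((1 - s) << p for s, p in zip(signs, starts))           # one bit per row
    return (constant + inverted) & mask
```

Both functions return the same value modulo 2^width, so the hardware needs to add only one inverted sign bit per row plus a fixed constant.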
54Advanced Multiplication
- Signed vs. unsigned inputs
- Higher radix Booth encoding
- Array vs. tree CSA networks
55Wallace Tree Multiplication
- CSA is effectively a ones counter
- Called a (3,2) counter: converts 3 inputs into count encoded as 2 outputs
56Dot Diagram of Array Mult.
57Wallace Tree
- Sum partial products in parallel
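The benefit of summing in parallel shows up in the number of CSA levels. The Python sketch below groups whole words in threes at each level; this is a simplification, since a real Wallace tree is built column by column on individual bits. The function name is hypothetical.

```python
def wallace_reduce(words):
    """Reduce a list of words to at most two using levels of parallel CSAs; return (sum, levels)."""
    words = list(words)
    levels = 0
    while len(words) > 2:
        nxt = []
        for i in range(0, len(words) - 2, 3):          # every full group of three
            x, y, z = words[i:i + 3]
            nxt.append(x ^ y ^ z)                      # sum word
            nxt.append(((x & y) | (x & z) | (y & z)) << 1)  # carry word, twice the weight
        nxt.extend(words[len(words) - len(words) % 3:])     # leftovers pass through
        words = nxt
        levels += 1
    return sum(words), levels                          # final sum stands in for the CPA


total, levels = wallace_reduce(range(1, 17))   # 16 operands
# total == 136 and levels == 6, versus 14 CSA rows for a linear array of 16 operands
```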
58Example Wallace Tree
59Original Wallace Tree
60Use 4,2 Compressor
- Also called (5,3) counter
61Better Wallace Tree
62Hybrid Multiplication
- Tradeoff
- Arrays offer regular layout
- Wallace Trees have fewer levels of CSAs but less regular layout
- Hybrids give tradeoffs
- Odd/even arrays
- Arrays of arrays
- Balanced delay trees
- Overturned-staircase trees
- Have as few levels of logic as Wallace trees but with more regular wiring
63Fused Multiply-Add
- DSP frequently requires computation of P = XY + Z
- Can do with multiplier and adder
- Better to do it with fused multiply-add unit
- Ordinary multiplier modified to have another partial product Z
64Serial Multiplication
- Serial: multiplies M-bit multiplicand and N-bit multiplier in N × M clocks; use for wrist watches
- Half-parallel: use this
- Multiplies M-bit multiplicand and N-bit multiplier in N clocks
- Widely used in DSP units
- Needs an M-bit adder and an (M+N)-bit shift register
- Obtains final product after N steps
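The N-clock scheme can be modeled in Python as a conventional shift-and-add loop. This sketch assumes the usual organization in which the multiplier starts in the low N bits of the shift register and the adder output is written into the high M bits; the slides may use a different arrangement, and the function name is hypothetical.

```python
def half_parallel_multiply(y: int, x: int, m: int, n: int) -> int:
    """Shift-and-add: N clocks, one M-bit add per clock, (M+N)-bit shift register."""
    assert 0 <= y < (1 << m) and 0 <= x < (1 << n)
    low_mask = (1 << n) - 1
    reg = x                                   # multiplier in the low N bits, high M bits are 0
    for _ in range(n):
        # add the multiplicand to the high part when the current multiplier bit is 1
        high = (reg >> n) + (y if reg & 1 else 0)
        # the adder's carry out becomes the new top bit after the right shift
        reg = ((high << n) | (reg & low_mask)) >> 1
    return reg


product = half_parallel_multiply(0b1100, 0b0101, 4, 4)   # 60
```

Each clock consumes one multiplier bit from the bottom of the register and frees one bit at the top for the growing product, which is why a single (M+N)-bit register suffices.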
65Half-Parallel Multiplier
66Multiplication Steps
67Priority Encoders
- Also a prefix computation
- Arbitrate among N units requesting a shared resource
- Ai: unit i requests service
- Logic
68Prefix Equations
- Bitwise precomputation
- Group logic
- Output logic
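The three stages can be sketched in Python. The equations themselves are not reproduced in the text above, so this sketch assumes a common formulation: the lowest-indexed request has priority, the prefix signal is the AND of the inverted requests, and the grant is one-hot. The function name and the 0-based indexing are hypothetical.

```python
def priority_encode(requests):
    """requests[i] is A_i; return one-hot grants Y_i = A_i AND (no lower-indexed request)."""
    n = len(requests)
    # bitwise precomputation: x[i] = NOT A_i
    x = [1 - a for a in requests]
    # group logic: log-depth prefix AND, after which x[i] covers positions i down to 0
    dist = 1
    while dist < n:
        x = [x[i] & x[i - dist] if i >= dist else x[i] for i in range(n)]
        dist *= 2
    # output logic: grant unit i if it requests and nothing below it does
    return [requests[i] & (x[i - 1] if i else 1) for i in range(n)]


grants = priority_encode([0, 0, 1, 0, 1, 1])   # [0, 0, 1, 0, 0, 0]
```

The group-logic loop runs about log2(N) times, which is the advantage of treating the encoder as a prefix computation instead of a ripple chain.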
69Priority Encoder Trees
70Priority Encoder Trees
71Other Prefix Computations
- Incrementers
- Decrementers
- 2's complement circuits
- Modified priority encoders
72Wheeler's Division Algorithm
- Invert the divisor with hardware that converges in 3 iterations
- Multiply dividend by inverted divisor using Wallace tree
- Better than ordinary division algorithms, which are usually serial subtraction
- x is positive normalized fraction
- p is approximation to 1/x
- Set $a_1 = px$ and $b_1 = p$ and iterate
73Wheeler Division
- Converges quadratically, $a_n$ to 1 and $b_n$ to 1/x
- Inspect 1st 6 digits of x and then generate p
- Use part of Wallace tree to compute this approximation
- Get 40-bit reciprocal in only 3 steps
- Sometimes necessary to extend pseudo-adders in Wallace tree to guarantee accuracy
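The convergence claim can be checked with exact rational arithmetic in Python. The iteration equations are not reproduced in the text above, so this sketch assumes the multiplicative form a ← a(2 − a), b ← b(2 − a), which matches the stated behavior (a tends to 1, b tends to 1/x, quadratic convergence). The initial p is taken as the exact reciprocal of x truncated to 6 fraction bits, standing in for whatever approximation the hardware generates; the function name is hypothetical.

```python
from fractions import Fraction


def wheeler_reciprocal(x, steps=3):
    """Approximate 1/x for 1/2 <= x < 1, starting from the first 6 fraction bits of x."""
    x = Fraction(x)
    assert Fraction(1, 2) <= x < 1
    x6 = Fraction(int(x * 64), 64)     # x truncated to 6 fraction bits
    p = 1 / x6                         # assumed initial approximation to 1/x
    a, b = p * x, p                    # a_1 = p*x, b_1 = p
    for _ in range(steps):
        a, b = a * (2 - a), b * (2 - a)    # both updates use the old value of a
    return b


x = Fraction(2, 3)
error = abs(wheeler_reciprocal(x) * x - 1)   # relative error of the reciprocal
```

With this starting point the initial error |1 − a| is below 2^-5 and squares on every step, so three steps bring the relative error below 2^-40, consistent with the 40-bit figure above.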
74Summary
- Unsigned vs. Signed numbers
- Boolean operations
- Error Correcting Codes
- Multi-input Adders
- Carry Save Addition
- Multipliers
- Booth Encoding
- Array vs. Tree CSA Networks
- Priority Encoders
- Dividers