332:578 Deep Submicron VLSI Design Lecture 17 Functional Units, Multiplication, and Division

1
332:578 Deep Submicron VLSI Design
Lecture 17: Functional Units, Multiplication, and Division

David Harris and Mike Bushnell
Harvey Mudd College and Rutgers University
Spring 2005

2
Outline

Unsigned vs. Signed Numbers
Boolean Operations
Error Correcting Codes
Multi-input Adders
Multipliers
Priority Encoders
Dividers
Summary

Material from CMOS VLSI Design, by Weste and
Harris, Addison-Wesley, 2005
3
Signed vs. Unsigned

For signed numbers, comparison is harder
C: carry out
Z: zero (all bits of A-B are 0)
N: negative (MSB of result)
V: overflow (inputs had different signs, output
sign ≠ sign of A)
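
As a rough illustration of how the four flags drive comparisons, here is a minimal Python sketch. It assumes the subtraction A - B is computed as A + ~B + 1 on n bits; the function names and the default width of 4 bits are illustrative choices, not taken from the lecture.

```python
def compare_flags(a, b, n=4):
    """Compute A - B on n bits as A + ~B + 1 and return the flags (C, Z, N, V)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    total = a + (~b & mask) + 1        # two's complement subtraction
    result = total & mask
    c = total >> n                     # carry out of the MSB
    z = int(result == 0)               # all result bits are 0
    neg = result >> (n - 1)            # MSB of the result
    sa, sb = a >> (n - 1), b >> (n - 1)
    # overflow: operand signs differ and the result sign differs from A's sign
    v = int(sa != sb and neg != sa)
    return c, z, neg, v


def less_than_unsigned(a, b, n=4):
    c, _, _, _ = compare_flags(a, b, n)
    return c == 0                      # no carry out means a borrow occurred


def less_than_signed(a, b, n=4):
    _, _, neg, v = compare_flags(a, b, n)
    return (neg ^ v) == 1              # N xor V is the sign of the true difference
```

For the bit patterns 1000 and 0111, the unsigned test reports 8 < 7 as false, while the signed test reports -8 < 7 as true, which is why the signed case needs V in addition to N.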

4
Signed vs. Unsigned
5
Boolean Logical Operations

Use a MUX circuit

6
Circuit Operation

Assign different P values to get various Boolean
operations
MUX between adder and Boolean unit, or merge
Boolean unit into adder as in the TTL 74181 ALU
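
The idea of choosing P values can be sketched in Python as a 4:1 multiplexer per bit. This assumes A and B drive the select lines and the four P bits are the data inputs; the bit ordering and the constant names below are hypothetical.

```python
def boolean_unit(a, b, p):
    """4:1 mux for one bit: (a, b) select one of the four bits of p."""
    return (p >> ((a << 1) | b)) & 1


# Assumed encoding: bit k of P is the output when (a, b) read as binary equals k.
P_AND, P_OR, P_XOR, P_NOR = 0b1000, 0b1110, 0b0110, 0b0001
```

Any of the 16 two-input Boolean functions can be selected this way, since P is simply the truth table of the function.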

7
Coding

Correct SRAM/DRAM soft errors
Due to α particles or cosmic rays
Reduce bit error rates of communication links
Parity tree example

8
Hamming Error Correcting Codes (ECCs)

Hamming distance Hd between 2 numbers: the number
of bits in which they differ
Add check bits to data words for ECC
Increase the Hd between legal code words
If an illegal code word is detected, the legal code
word closest to it is the corrected word
Parity has Hd of 2: detects but cannot correct
single-bit errors
Make Hd = 3: Hamming code of length 2^c - 1 with c
check bits and N = 2^c - c - 1 data bits

9
Code Generation Procedure

Number bits from 1 to 2^c - 1
Each bit in a position that is a power of 2 is a
check bit
Choose check bit value to get even parity for all
bits whose position number has a 1 in the same
binary digit as the check bit's position number
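
The procedure above can be sketched in Python as follows. This is a minimal illustration, assuming bits are held in a list indexed by position 1 to 2^c - 1; the function names and the syndrome-based correction step are illustrative additions, not taken from the slides.

```python
def hamming_encode(data_bits, c):
    """Place data in non-power-of-2 positions 1..2^c-1, then fill the check bits."""
    n = (1 << c) - 1
    code = [0] * (n + 1)                   # index 0 unused; positions are 1-based
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):                # not a power of 2: data position
            code[pos] = next(it)
    for k in range(c):
        check = 1 << k
        parity = 0
        for pos in range(1, n + 1):
            # cover every other position whose number has bit k set
            if pos & check and pos != check:
                parity ^= code[pos]
        code[check] = parity               # makes the covered group even parity
    return code[1:]


def hamming_correct(word, c):
    """Return (corrected word, error position), with position 0 meaning no error."""
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1                # a single-bit error sits at this position
    return code[1:], syndrome
```

With c = 3 and data bits 1, 0, 1, 1, the code word is 0110011 (positions 1 to 7). Flipping any single bit yields a nonzero syndrome equal to the position of the flipped bit.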

10
Gray Codes

Binary-reflected code
Start with all 0 and keep flipping the right-most
bit that gives a new string
Use to save power in finite state machines:
successive states follow Gray code
Use also to synchronize counters across clock
domains
Either get the current or the previous value
because only 1 bit changes per clock
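
A short Python sketch of the two views of the binary-reflected code: the flipping rule stated above, and the common closed form n XOR (n >> 1). The function names are illustrative.

```python
def gray(n):
    """Closed form for the n-th binary-reflected Gray code word."""
    return n ^ (n >> 1)


def gray_sequence_by_flipping(bits):
    """Start at all 0s; keep flipping the right-most bit that gives a new string."""
    seq, seen = [0], {0}
    while True:
        cur = seq[-1]
        for k in range(bits):              # k = 0 is the right-most bit
            cand = cur ^ (1 << k)
            if cand not in seen:
                seq.append(cand)
                seen.add(cand)
                break
        else:
            return seq                     # no flip gives a new string: done
```

For 3 bits both views give 000, 001, 011, 010, 110, 111, 101, 100, and successive words differ in exactly one bit.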

11
Gray Code
12
Static XOR/XNOR Circuits
13
Static XOR/XNOR Circuit

Does not swing rail-to-rail

14
STATIC CMOS XOR
15
CPL XOR/XNOR Circuit
16
CVSL XOR/XNOR
17
Dynamic XOR/XNOR

Both true and complementary inputs needed
Violates monotonicity rule
Solutions
Push XOR/XNOR to end of chain of Domino logic and
build it as static logic
Use dual-rail Domino logic

20
Multi-input Adders

Suppose we want to add k N-bit words
Ex: 0001 + 0111 + 1101 + 0010 = 10111
Straightforward solution: k-1 N-bit CPAs
Large and slow

21
Carry Save Addition

A full adder sums 3 inputs and produces 2 outputs
Carry output has twice the weight of sum output
N full adders in parallel are called a carry-save
adder (CSA)
Produces N sums and N carry outs

22
CSA Application

Use k-2 stages of CSAs
Keep result in carry-save redundant form
Final CPA computes actual result
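
A minimal Python sketch of this scheme, treating each word as an integer so that bitwise operators act like N full adders in parallel. The function names are illustrative, and the final `+` stands in for the carry-propagate adder.

```python
def carry_save_add(x, y, z):
    """One CSA row: independent full adders, no carry propagation between bits."""
    s = x ^ y ^ z                              # per-bit sum outputs
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries, twice the weight
    return s, c


def multi_input_add(words):
    """Add k >= 2 words with k-2 CSA stages followed by one carry-propagate add."""
    s, c = words[0], words[1]
    stages = 0
    for w in words[2:]:
        s, c = carry_save_add(s, c, w)         # result stays in carry-save form
        stages += 1
    return s + c, stages                       # the '+' models the final CPA
```

Adding 0001, 0111, 1101 and 0010 this way takes two CSA stages and gives 10111.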

31
Multiplication

Example
M × N-bit multiplication
Produce N M-bit partial products
Sum these to produce an (M+N)-bit product
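
The partial-product view can be sketched in Python for unsigned operands. The function names and the sample operands are illustrative and are not taken from the slide's figure.

```python
def partial_products(y, x, n):
    """Return the N partial products of multiplicand y and N-bit multiplier x."""
    # Partial product i is y ANDed with multiplier bit i, shifted left by i.
    return [(y if (x >> i) & 1 else 0) << i for i in range(n)]


def multiply(y, x, n):
    """Sum the partial products to form the product."""
    return sum(partial_products(y, x, n))
```

For example, `partial_products(0b1100, 0b0101, 4)` gives the four rows 1100, 0, 110000, 0, and their sum 111100 fits in M + N = 8 bits.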

32
General Form

Multiplicand: Y = (y_{M-1}, y_{M-2}, ..., y_1, y_0)
Multiplier: X = (x_{N-1}, x_{N-2}, ..., x_1, x_0)
Product: P = sum over i = 0..N-1 and j = 0..M-1 of
x_i y_j 2^{i+j}

33
16X16 Mult. Dot Diagram

Each dot represents a bit

34
Array Multiplier
35
Rectangular Array

Squash array to fit rectangular floorplan

36
Optimizations

1st row adds 1st partial product to pair of 0s
Change first CSA row to add 1st 3 partial
products together
Reduces row count by 2 and reduces adder
propagation delay
Can also use 1st row of CSAs to add one or two
other inputs with no extra delay
Most common DSP operation: Y = A × B + C
Speed up by replacing bottommost row with a faster
CPA such as a lookahead or tree adder
Asymmetric circuit: some inputs have more logical
effort than others

37
2's Complement Multiplication

2 partial products have negative weight
Must be subtracted
Baugh-Wooley algorithm takes 2's comp. of terms
to be subtracted
In example, AND gates replaced by NAND gates in
hatched cells
Extra ones added in unused inputs to take correct
2's complement
Use XORs to conditionally invert some of the
terms to select between signed and unsigned
multiplication
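
The arithmetic behind the negative-weight terms can be checked with a small Python sketch. It is not a model of the Baugh-Wooley hardware itself, only of the identity that hardware implements: in n-bit 2's complement the sign bit has weight -2^(n-1), so any bit product involving exactly one sign bit must be subtracted. Function names are illustrative.

```python
def to_signed(v, n):
    """Interpret the n-bit pattern v as a 2's complement value."""
    return v - (1 << n) if (v >> (n - 1)) & 1 else v


def signed_multiply(y, x, n):
    """Multiply two n-bit 2's complement patterns via weighted bit products."""
    total = 0
    for i in range(n):
        for j in range(n):
            bit = ((x >> i) & 1) & ((y >> j) & 1)
            # a term with exactly one sign bit carries negative weight
            sign = -1 if (i == n - 1) != (j == n - 1) else 1
            total += sign * (bit << (i + j))
    return total
```

For instance, `signed_multiply(0b1111, 0b0011, 4)` evaluates -1 × 3 and returns -3.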

38
2's Comp. Multiplier
39
Simplified Partial Products
40
Modified Baugh-Wooley
41
Fewer Partial Products

Array multiplier requires N partial products
If we looked at groups of r bits, we could form
N/r partial products.
Faster and smaller?
Called radix-2^r encoding
Ex: r = 2: look at pairs of bits
Form partial products of 0, Y, 2Y, 3Y
First three are easy, but 3Y requires an adder

42
Booth Encoding

Instead of 3Y, try -Y, then increment next
partial product to add 4Y
Similarly, for 2Y, try -2Y + 4Y in next partial
product
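
A minimal Python sketch of radix-4 Booth recoding along these lines. It assumes the multiplier is an n-bit 2's complement value with n even; the function names are illustrative. Each pair of multiplier bits, together with the top bit of the previous pair, selects a digit in {-2, -1, 0, +1, +2}, so no 3Y term is ever needed.

```python
def booth_radix4_digits(x, n):
    """Recode an n-bit 2's complement multiplier (n even) into radix-4 digits."""
    x &= (1 << n) - 1
    digits = []
    prev = 0                               # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        # b1 contributes -2 here and +1 (that is, +4 at this weight) to the next digit
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits                          # digits[k] has weight 4**k


def booth_multiply(y, x, n):
    """Sum the recoded partial products d * Y * 4**k."""
    return sum(d * y * 4 ** k for k, d in enumerate(booth_radix4_digits(x, n)))
```

For the multiplier 0111 the digits are -1 and +2: the low pair 11 becomes -Y, and the next partial product is incremented from +Y to +2Y, giving -1 + 2 × 4 = 7.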

50
Booth Hardware

Booth encoder generates control lines for each PP
Booth selectors choose PP bits

X_i means add in Y
2X_i means add in 2Y
M means negate partial prod.
51
Sign Extension

Partial products can be negative
Require sign extension, which is cumbersome
High fanout on most significant bit

52
Simplified Sign Ext.

Sign bits are either all 0s or all 1s
Note that all 0s is all 1s + 1 in proper column
Use this to reduce loading on MSB
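
The identity can be checked numerically with a short Python sketch. It assumes a partial product whose sign bit s would be replicated across columns m to width-1 of a fixed-width sum; the function names and parameters are hypothetical.

```python
def sign_ext_field(s, m, width):
    """Direct sign extension: copies of sign bit s in columns m..width-1."""
    return ((1 << width) - (1 << m)) if s else 0


def simplified_field(s, m, width):
    """All 1s in columns m..width-1, plus the inverted sign bit in column m."""
    ones = (1 << width) - (1 << m)
    # When s = 0 the added 1 ripples through the 1s and leaves all 0s (mod 2^width).
    return (ones + ((1 - s) << m)) % (1 << width)
```

Because the run of 1s is a constant, only the single inverted sign bit depends on the data, which is what removes the high fanout on the MSB.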

53
Even Simpler Sign Ext.

No need to add all the 1s in hardware
Precompute the answer!

54
Advanced Multiplication

Signed vs. unsigned inputs
Higher radix Booth encoding
Array vs. tree CSA networks

55
Wallace Tree Multiplication

CSA is effectively a ones counter
Called a (3,2) counter: converts 3 inputs into a
count encoded as 2 outputs

56
Dot Diagram of Array Mult.
57
Wallace Tree

Sum partial products in parallel
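
To get a feel for the benefit, the following Python sketch counts CSA levels with a simple row-based model: at each level every group of three rows is reduced to two, and leftover rows pass through. This is an approximation of a real column-by-column tree, and the function name is illustrative.

```python
def wallace_levels(rows):
    """Count CSA levels needed to reduce `rows` operands to 2 (row-based model)."""
    levels = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3  # each group of 3 rows becomes 2
        levels += 1
    return levels
```

Under this model, 16 partial products need 6 levels, compared with the 14 CSA stages of a linear array (k-2 with k = 16).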

58
Example Wallace Tree
59
Original Wallace Tree
60
Use 4:2 Compressor

Also called (5,3) counter

61
Better Wallace Tree
62
Hybrid Multiplication

Tradeoff
Arrays offer regular layout
Wallace Trees have fewer levels of CSAs but less
regular layout
Hybrids give tradeoffs
Odd/even arrays
Arrays of arrays
Balanced delay trees
Overturned-staircase trees
Have as few levels of logic as Wallace trees but
with more regular wiring

63
Fused Multiply-Add

DSP frequently requires computation of P = XY + Z
Can do with multiplier and adder
Better to do it with fused multiply-add unit
Ordinary multiplier modified to have another
partial product Z

64
Serial Multiplication

Serial: multiplies M-bit multiplicand and N-bit
multiplier in N × M clocks; used in wrist
watches
Half-parallel: use this
Multiplies M-bit multiplicand and N-bit
multiplier in N clocks
Widely used in DSP units
Needs an M-bit adder and an (M+N)-bit shift
register
Obtains final product after N steps
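
One common way to organize such a multiplier is shift-and-add, sketched below in Python for unsigned operands. The register layout (running sum in the upper M bits, multiplier in the lower N bits) is an assumption for illustration and may differ from the figure in the lecture.

```python
def half_parallel_multiply(y, x, m, n):
    """Shift-and-add with an M-bit adder and an (M+N)-bit shift register."""
    y &= (1 << m) - 1
    low_mask = (1 << n) - 1
    reg = x & low_mask                     # low N bits start as the multiplier
    for _ in range(n):                     # one clock per multiplier bit
        high = reg >> n                    # upper M bits: running partial sum
        if reg & 1:                        # current multiplier bit
            high += y                      # M-bit add; the carry out is kept
        # shift the whole register right; the carry moves into the top bit
        reg = ((high << n) | (reg & low_mask)) >> 1
    return reg
```

After N clocks the multiplier bits have all been shifted out and the register holds the (M+N)-bit product; `half_parallel_multiply(12, 5, 4, 4)` returns 60.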

65
Half-Parallel Multiplier
66
Multiplication Steps
67
Priority Encoders

Also a prefix computation
Arbitrate among N units requesting a shared
resource
A_i: unit i requests service
Logic
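
One way to express the arbitration as a prefix computation is sketched below in Python. It assumes the lowest-numbered unit has the highest priority; the function name and list representation are illustrative.

```python
def priority_encode(requests):
    """requests[i] is A_i (0 or 1); returns grants with at most one bit set."""
    grants = []
    none_before = 1                        # prefix: no earlier unit has requested
    for a in requests:
        grants.append(a & none_before)     # grant only if nobody ahead asked
        none_before &= 1 - a               # running AND of the inverted requests
    return grants
```

The running AND is the prefix term: in hardware it can be computed with the same tree structures used for carries in adders, rather than serially as in this loop.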

68
Prefix Equations
Bitwise precomputation Group logic Output logic
69
Priority Encoder Trees
70
Priority Encoder Trees
71
Other Prefix Computations

Incrementers
Decrementers
2s complement circuits
Modified priority encoders

72
Wheeler's Division Algorithm

Invert the divisor with hardware that converges
in 3 iterations
Multiply dividend by inverted divisor using
Wallace tree
Better than ordinary division algorithms, which
are usually serial subtraction
x is positive normalized fraction
p is approximation to 1/x
Set a_1 = px and b_1 = p and iterate

73
Wheeler Division

Converges quadratically: a_n to 1 and b_n to 1/x
Inspect 1st 6 digits of x and then generate p
Use part of Wallace tree to compute this
approximation
Get 40-bit reciprocal in only 3 steps
Sometimes necessary to extend pseudo-adders in
Wallace tree to guarantee accuracy
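
The iteration equations are not preserved in this transcript. The Python sketch below therefore assumes the update a_{n+1} = a_n (2 - a_n), b_{n+1} = b_n (2 - a_n), which is one rule consistent with the stated behaviour: it keeps b_n / a_n = 1/x fixed and squares the error 1 - a_n at every step. The function name and floating-point arithmetic are illustrative only.

```python
def wheeler_reciprocal(x, p, steps=3):
    """Refine p, a rough guess at 1/x, using the assumed update rule."""
    a, b = p * x, p                        # a_1 = p*x, b_1 = p
    for _ in range(steps):
        # both updates use the old a; afterwards 1 - a_new == (1 - a_old) ** 2
        a, b = a * (2 - a), b * (2 - a)
    return a, b                            # a is close to 1, b is close to 1/x
```

For example, with x = 0.75 and the rough guess p = 1.3, the initial error 1 - a_1 is 0.025, and three steps shrink it to about 0.025^8. A quotient then follows from one multiplication of the dividend by b.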

74
Summary