Intro - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Intro


1
Presentation 12: MAD MAC 525
Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)
W2
Design Manager: Zack Menegakis
26th April, 2006: Short Final Presentation
Project Objective: Design a crucial part of a GPU
called the Multiply Accumulate Unit (MAC) which
will revolutionize graphics.
2
Agenda
  • Marketing (Jigar)
  • Project Description (Farhan)
  • Algorithmic Description (Farhan)
  • Design Process (Sonali)
  • Floorplan Evolution (Sonali)
  • Layout (Avni)
  • Design Specifications (Avni)
  • Conclusion (Jigar)

3
MARKETING
  • Application of product: HDR rendering in gaming
    graphics
  • Why HDR? Used in games like Far Cry
  • Optimization for speed (chosen because of the
    market)
  • Competition: possible barriers to entry if we
    enter the market

4
MAD MAC and HDR
  • What is HDR?
  • Show animation explaining concept

5
MAD MAC and HDR
  • MAD MAC accelerates FP16 blending to enable true
    HDR graphics
  • What is HDR?
  • HDR: High Dynamic Range
  • Dynamic range is defined as the ratio of the
    largest value of a signal to the lowest
    measurable value
  • Dynamic range of luminance in real-world scenes
    can be 100,000:1
  • With HDR rendering, pixel intensities are allowed
    to extend beyond the 0..1 range of traditional
    graphics
  • Nature isn't clamped to 0..1 and neither should
    CG be
  • In lay terms:
  • Bright things can be really bright
  • Dark things can be really dark
  • And the details can be seen in both
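
To put the dynamic range definition above into numbers, here is a minimal sketch that computes the ratio of the largest to the smallest normal FP16 value. It assumes the standard 16-bit half-precision layout (1 sign, 5 exponent, 10 fraction bits) read through Python's struct format 'e'. The helper name half_from_bits and the 8-bit comparison (255:1 for a traditional integer colour channel) are illustrative assumptions, not figures from the presentation.

```python
import struct

def half_from_bits(bits):
    """Interpret a 16-bit pattern as a half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

largest = half_from_bits(0x7BFF)          # largest finite FP16 value
smallest_normal = half_from_bits(0x0400)  # smallest normal FP16 value, 2**-14

# Dynamic range = largest value / smallest measurable value.
fp16_range = largest / smallest_normal
fixed8_range = 255 / 1  # assumed 8-bit channel: levels 1..255

print(f"FP16 (normal numbers): {fp16_range:,.0f} : 1")
print(f"8-bit channel:         {fixed8_range:,.0f} : 1")
```

Even without denormals, the FP16 ratio is well above the 100,000:1 quoted for real-world luminance, while the assumed 8-bit channel is far below it.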

6
(No Transcript)
7
PROJECT DESCRIPTION
  • Multiply Accumulate unit (MAC)
  • Executes the function AB+C on 16-bit floating
    point inputs. Inputs will be in OpenEXR format.
  • Multiplies and adds in parallel to greatly speed
    up operation
  • Rounding is performed only once, giving greater
    accuracy than separate multiply and add
    functions
  • Also known as:
  • Fused Multiply Add (FMA)
  • Multiply Add (MAD/MADD) in graphics shader
    programs
  • Many applications benefit from a fast FMA:
  • Graphics: HDR rendering, blending and shader
    ops
  • DSPs: computing vector dot-products in digital
    filters
  • Fast division and square root: eliminates extra
    hardware
  • Available in many newer CPUs and DSPs because
    it's so cool
  • One ring (circuit) to rule them all!
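
The single-rounding point above can be illustrated with a small sketch. It assumes Python's struct half-precision format ('e') as a stand-in for the FP16 inputs. The helper to_half and the operand values are hypothetical choices made for illustration. For these particular operands the intermediate a*b + c is exact in double precision, so rounding it once models a fused operation.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest FP16 value (round-half-to-even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

a = to_half(1.0009765625)  # 1 + 2**-10
b = to_half(1.001953125)   # 1 + 2**-9
c = to_half(-1.0)

# Separate multiply and add: the product is rounded, then the sum is rounded.
separate = to_half(to_half(a * b) + c)

# Fused multiply-add: the exact a*b + c is rounded only once.
fused = to_half(a * b + c)

print(separate == 3 * 2**-10)           # low-order product bit was lost
print(fused == 3 * 2**-10 + 2**-19)     # low-order product bit survives
```

The two results differ by one unit in the last place of the FP16 result, because the separate version discards the 2**-19 term of the product before the subtraction.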

8
ALGORITHMIC DESCRIPTION
  • Step through entire process
  • Multiply and align occur concurrently; C is
    always aligned to AB
  • Outputs go to adder, normalize, round, overflow
    checker and output register
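
The flow listed above (multiply, align, add, normalize, round, overflow check) can be sketched as a bit-level software model. This is only an illustration under stated assumptions, not the team's Verilog: inputs are finite FP16 values, denormals are flushed to zero, rounding is round-to-nearest-even, and alignment is done with unbounded Python integers rather than a fixed-width shifter or leading-zero anticipator. All function names (bits, value, unpack, fma16) are hypothetical.

```python
import struct

BIAS = 15  # FP16 exponent bias

def bits(x):
    """FP16 bit pattern of a Python float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def value(h):
    """Python float for an FP16 bit pattern."""
    return struct.unpack("<e", struct.pack("<H", h))[0]

def unpack(h):
    """Return (sign, biased exponent, 11-bit significand); denormals read as 0."""
    sign, exp, frac = h >> 15, (h >> 10) & 0x1F, h & 0x3FF
    if exp == 0:
        return sign, 0, 0
    return sign, exp, frac | 0x400  # restore the hidden leading 1

def fma16(a, b, c):
    """Bit pattern of a*b + c, rounded once (finite inputs assumed)."""
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    sc, ec, mc = unpack(c)

    # Multiply and exponent calculation: product = mp * 2**xp.
    sp, mp, xp = sa ^ sb, ma * mb, ea + eb - 2 * BIAS - 20
    # Addend as an integer times a power of two: c = mc * 2**xc.
    xc = ec - BIAS - 10

    # Align: bring both operands to a common LSB weight.
    scale = min(xp, xc)
    p = mp << (xp - scale)
    q = mc << (xc - scale)

    # Add or subtract according to the operand signs.
    total = (-p if sp else p) + (-q if sc else q)
    if total == 0:
        return 0
    sign, mag = int(total < 0), abs(total)

    # Normalize to an 11-bit significand, then round once (nearest even).
    n = mag.bit_length()
    exp = scale + (n - 1) + BIAS
    if n > 11:
        shift = n - 11
        sig = mag >> shift
        rem, half = mag & ((1 << shift) - 1), 1 << (shift - 1)
        if rem > half or (rem == half and sig & 1):
            sig += 1
            if sig == 0x800:  # rounding carried out of the significand
                sig, exp = 0x400, exp + 1
    else:
        sig = mag << (11 - n)

    # Overflow check; results below the normal range are flushed to zero.
    if exp >= 31:
        return (sign << 15) | 0x7C00  # infinity
    if exp <= 0:
        return sign << 15
    return (sign << 15) | (exp << 10) | (sig & 0x3FF)

print(value(fma16(bits(2.0), bits(3.0), bits(4.0))))  # 10.0
```

Keeping the full 22-bit product until the final rounding step is what lets the model round only once.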

9
Block Diagram
(Figure. Blocks shown: three 16-bit Inputs, RegArray A, RegArray B, RegArray C, Multiplier, Exp Calc, Align, Control Logic / Sign Determine, Leading 0 Anticipator, Adder/Subtractor, Normalize, Round, Ovf Checker, Reg Y, 16-bit Output. Individual bus-width labels are not recoverable from the transcript.)
10
IMPLEMENTATION
  • Implementation of each module: how and why we
    chose a particular method, keeping in mind the
    goal of speed (multiplier, adder)

11
Design Decisions (contd.)
  • Multiplier Implementation
  • 11 x 11 Carry-Save Multiplier
  • Reasons
  • Fast because it avoids having ripple carry in
    every stage
  • Enables Compact Layout
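
The reason given above (no ripple carry in every stage) can be sketched in software. This is a behavioural illustration of the carry-save idea, assuming unsigned 11-bit operands and a row-by-row accumulation. The function names are hypothetical and the structure is not necessarily the one used in the layout.

```python
def carry_save_add(x, y, z):
    """3:2 compression on whole words: x + y + z == s + c, no carry ripple."""
    s = x ^ y ^ z                             # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1    # per-bit carry, moved up one place
    return s, c

def csa_multiply(a, b, width=11):
    """Multiply two unsigned width-bit values with carry-save accumulation."""
    s, c = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0
        # Each row only computes local sums and carries; nothing ripples.
        s, c = carry_save_add(s, c, partial)
    # A single carry-propagate addition resolves the saved carries.
    return s + c

print(csa_multiply(1025, 1026))  # 1051650
```

Only the final addition needs a fast carry-propagate adder. Every earlier row has a delay of one full adder regardless of operand width.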

12
Design Process
  • Verilog -> Schematic -> Layout
  • Behavioral -> Structural Verilog
  • Transistors/gates -> Full Schematic
  • Gate/Component Layout -> Top Level
  • Transistor count fluctuated from 20,200 to 12,800
  • Major design decisions:
  • Decided against implementing denormal arithmetic
    because it would increase the complexity of the
    project beyond the scope of the class
  • Rounding performed only once, at the end
  • Picked nPass over Tgate in the normalize shifter
  • Adder: variable-length carry select -> Han-Carlson
    binary tree adder
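
For readers unfamiliar with the Han-Carlson structure named above, here is a behavioural sketch of the prefix network: one combine stage on odd bits, Kogge-Stone stages on the odd bits only, then a final stage that fills in the even bits. The width parameter and the function name are assumptions for illustration. The presentation does not state the adder width used in this sketch, and the real design was built at transistor level.

```python
def han_carlson_add(a, b, width=16):
    """Add two unsigned width-bit values using a Han-Carlson prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    G, P = g[:], p[:]

    # First stage: each odd bit combines with the even bit below it.
    for i in range(1, width, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]

    # Kogge-Stone stages on odd bits only, with spans 2, 4, 8, ...
    d = 2
    while d < width:
        Gn, Pn = G[:], P[:]  # every stage reads the previous stage's outputs
        for i in range(d + 1, width, 2):
            Gn[i] = G[i] | (P[i] & G[i - d])
            Pn[i] = P[i] & P[i - d]
        G, P = Gn, Pn
        d *= 2

    # Final stage: even bits take the completed prefix from the odd bit below.
    for i in range(2, width, 2):
        G[i] = g[i] | (p[i] & G[i - 1])

    # Sum bit i is propagate XOR carry-in, where the carry-in is G[i - 1].
    total, carry = 0, 0
    for i in range(width):
        total |= (p[i] ^ carry) << i
        carry = G[i]
    return total | (carry << width)

print(han_carlson_add(0xFFFF, 1) == 0x10000)
```

Restricting the Kogge-Stone stages to odd bits roughly halves the number of prefix cells and wires at the cost of one extra stage.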

13
VERIFICATION OF DESIGN
  • Verilog simulations (show outputs)
  • Overview
  • How/Why it works
  • Behavioral/Structural
  • Explain why we couldn't get a high-level
    simulator and how we tested our Verilog design

14
SCHEMATICS
  • Show schematics of major blocks: adder,
    multiplier, and top-level
  • How we verified: analog simulation

15
Top Level Schematic
16
Multiplier Schematic
17
Adder Schematic
18
FLOORPLAN EVOLUTION
  • Initial floorplan
  • How it evolved (with animation): why and how we
    changed it

19
Main Floorplan
(Figure. Block labels: Multiplier, Reg A, Reg B, Reg C, Exp Calc, Align C, Pipeline Reg (x3), Adder, Ld Zero, Round, Normalize, Reg Y)
20
Floorplan
21
Full Chip Layout
(Figure. Block labels: Exponent, Multiplier, Zero, Align, Adder, Ovf, Normalize, Round)
22
Pipelining
  • Initially planned 5-6 pipeline stages
  • Reduced to 4 pipeline stages, made possible by
    implementing fast carry lookahead adders in
    critical path modules (adder and multiplier)

23
Pipelining Stages
(Figure. Labels in order of appearance: Reg C, Multiplier, Reg A, Exp Calc, Reg B; Pipeline Reg (x2); Align C; Pipeline Reg (x2); Adder, Ld Zero; Pipeline Reg; Round, Normalize, Overflow checker; Reg Y)
24
LAYOUT
  • Final Layout
  • Layout of large blocks such as multiplier, adder
    and normalize

25
Layout Decisions
  • 3 standard cell heights
  • Uniform width vdd and ground rails
  • Wider vdd and ground rails in power hungry
    modules
  • Max of 8 flip flops per clock pulse generator
  • Metal directionality

26
Multiplier Layout with pipelining
27
Adder Layout
28
Normalize Layout
29
FINAL LAYOUT
30
Design Specifications
  • Worst case delay: 2.25 ns
  • Long buses are all buffered (not tested yet)
  • Estimated clocking speed: 400 MHz
  • Height by width: 193.86 um x 301.545 um
  • Area: 58,458 um^2
  • Aspect ratio: 1:1.55
  • Total transistor density: 0.22 transistors/um^2
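
The figures above are related by simple arithmetic, which the following sketch reproduces. It assumes the clock period is bounded by the worst case delay and that the total transistor count is the 12,730 listed in the per-block table. Variable names are illustrative.

```python
worst_case_delay_ns = 2.25
height_um, width_um = 193.86, 301.545
transistors = 12_730  # total transistor count from the per-block table

f_max_mhz = 1e3 / worst_case_delay_ns  # upper bound on clock frequency
area_um2 = height_um * width_um
aspect = width_um / height_um
density = transistors / area_um2       # transistors per square micron

print(f"Max clock from delay: {f_max_mhz:.1f} MHz")
print(f"Area:                 {area_um2:,.0f} um^2")
print(f"Aspect ratio:         1:{aspect:.3f}")
print(f"Transistor density:   {density:.2f} per um^2")
```

The delay-limited bound of about 444 MHz leaves some margin over the estimated 400 MHz clocking speed.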

31
Layout densities
  • Active: 14.05%
  • Poly: 9.25%
  • Metal 1: 33.89%
  • Metal 2: 18.00%
  • Metal 3: 14.99%
  • Metal 4: 6.29%

32
Layer Masks - Poly
33
Layer Masks - Metal 1
34
Layer Masks - Metal 2
35
Layer Masks - Metal 3
36
Layer Masks - Metal 4
37
Block                     Schematic Power (mW, 350 MHz)  Layout Power (mW)  Schematic Delay    Layout Delay
Multiplier / w/ pipeline  2.97 / ??                      N/A / ??           3.38 ns / 1.9 ns   N/A / 2.25 ns
Exponents                 1.608                          2.21               1.01 ns            1.2 ns
Align                     0.094                          0.113              480 ps             637 ps
Adder                     8.48                           9.73               1.34 ns            1.7 ns
Leading 0                 0.232                          0.857              506 ps             551 ps
Normalize                 1.458                          1.546              407 ps             437 ps
Round                     0.631                          1.21               864 ps             986 ps
OvfCheck                  0.13                           0.19               453 ps             475 ps
Registers                 ??                             ??                 179 ps             193 ps
Total                     ??                             ??                 -                  -
38
Block                     Area (um^2)  Transistor Count  Transistor Density (per um^2)
Multiplier (w/ pipeline)  20,388       4,496             0.22
Exponents                 5,163        738               0.14
Align                     3,995        500               0.13
Adder                     13,202       3,174             0.24
Leading 0                 1,253        364               0.29
Normalize                 3,190        942               0.3
Round                     1,802        494               0.28
OvfCheck                  200          70                0.35
Registers, etc            N/A          1,948             N/A
Total                     58,458       12,730            0.22
39
Conclusion
  • More marketing
  • Summarize chip functionality
  • Extending applications of chip

40
Comments?