Spatial Computation: Computing without General-Purpose Processors

Slides: 73
Provided by: MIh73
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes


1
Spatial Computation: Computing without General-Purpose Processors
  • Mihai Budiu
  • mihaib_at_cs.cmu.edu
  • Carnegie Mellon University
  • July 8, 2004

2
Spatial Computation
  • A computation model based on:
    • application-specific hardware
    • no interpretation
    • minimal resource sharing

3
The Engine Behind This Talk
  • main( ) {
  •   signal(SIGINT, welcome);
  •   while (slides( ) && time( ))
  •     talk( );
  • }

4
Research Scope
Object: future architectures
Tool: compilers
Evaluation: simulators
5
Research Methodology
[Figure: constraint space with axes X (e.g., power) and Y (e.g., cost); labels: "state-of-the-art", "reasonable limits"]
6
Outline
  • Introduction: problems of current architectures
  • Compiling Application-Specific Hardware
  • Pipelining
  • ASH Evaluation
  • Conclusions

7
Resources
[Figure: chart; labels: "1000", "Performance", "Intel"]
  • We do not worry about not having hardware
    resources
  • We worry about being able to use hardware
    resources

8
Design Complexity
[Figure: chip size (transistors, 10^4 to 10^10) and designer productivity, 1981 to 2009]
9
Communication vs. Computation
wire
gate
5ps
20ps
Power consumption on wires is also dominant
10
Power Consumption
Toasted CPU: about 2 sec after removing cooler.
(Tom's Hardware Guide)
11
Energy Efficiency
Pentium 4
12
Clock Speed
3GHz 6GHz 10GHz
Cannot rely on global signals
(clock is a global signal)
13
Instruction-Set Architecture
Software
ISA
Hardware
14
Our Proposal
  • ASH addresses these problems
  • ASH is not a panacea
  • ASH is complementary to the CPU

15
Outline
  • Problems of current architectures
  • CASH: Compiling ASH
    • program representation
    • compiling C programs
  • Pipelining
  • ASH Evaluation
  • Conclusions

16
Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
17
Application-Specific Hardware
Soft
C program
Compiler
Dataflow IR
SW backend
Machine code
CPU predication
18
Key Intermediate Representation
Traditionally
  • may-dep.
  • CFG
  • ...
  • def-use

Our IR
  • SSA + predication + speculation
  • Uniform for scalars and memory
  • Explicitly encodes may-depend
  • Executable
  • Precise semantics
  • Dataflow IR
  • Close to asynchronous target
19
Computation Dataflow
Programs
Circuits
x = a & 7; ... y = x >> 2;
[Figure: circuit in which inputs a and 7 feed an & unit; its output x and the constant 2 feed a >> unit]
  • Operations ⇒ functional units
  • Variables ⇒ wires
  • No interpretation
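
The mapping from a program to a circuit can be imitated in C. This is only a minimal sketch, assuming the two statements on this slide are `x = a & 7` and `y = x >> 2`; the names `and_unit`, `shr_unit` and `circuit` are hypothetical and do not come from the talk.

```c
/* Each function models one functional unit; its arguments and return
   value play the role of wires. The operators are assumptions. */
static unsigned and_unit(unsigned a, unsigned b) { return a & b; }
static unsigned shr_unit(unsigned a, unsigned b) { return a >> b; }

/* The "circuit": units are wired together directly. Nothing is fetched
   or decoded to decide which operation runs next. */
unsigned circuit(unsigned a) {
    unsigned x = and_unit(a, 7u);   /* x = a & 7  */
    unsigned y = shr_unit(x, 2u);   /* y = x >> 2 */
    return y;
}
```

Calling `circuit(13u)` walks the value through both units in turn; in hardware the two units would exist side by side and the wire `x` would connect them.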

20
Basic Computation

latch
data
ack
valid
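
The data/valid/ack signals around an output latch can be modelled in software. The sketch below is an assumption about how such a handshake might behave, not the circuit from the talk; `latch_t`, `latch_put` and `latch_take` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical model of one operation's output latch. */
typedef struct {
    int  data;   /* value held in the latch */
    bool valid;  /* raised by the producer when data is meaningful */
} latch_t;

/* Producer side: store a result only if the previous one was
   acknowledged. Returns false when the producer has to stall. */
bool latch_put(latch_t *l, int value) {
    if (l->valid) return false;   /* consumer has not acked yet */
    l->data = value;
    l->valid = true;
    return true;
}

/* Consumer side: take the value and acknowledge it, which frees the
   latch for the next result. Returns false when nothing is available. */
bool latch_take(latch_t *l, int *out) {
    if (!l->valid) return false;
    *out = l->data;
    l->valid = false;             /* this plays the role of the ack */
    return true;
}
```

A producer that calls `latch_put` twice in a row is refused the second time until the consumer has called `latch_take`, which is the stall behaviour that the ack wire provides.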
21
Asynchronous Computation

data
ack
valid
1
22
Distributed Control Logic
ack
rdy

-
short, local wires
asynchronous control
23
Outline
  • Problems of current architectures
  • CASH: Compiling ASH
    • program representation
    • compiling C programs
  • Pipelining
  • ASH Evaluation
  • Conclusions

24
MUX: Forward Branches
x
b
0
if (x > 0) y = -x; else y = b*x;

-
>
!
f
y
critical path
Conditionals ⇒ Speculation
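
Turning a forward branch into a multiplexor can be sketched in C. This assumes the conditional on this slide is `if (x > 0) y = -x; else y = b*x;`; the function names are hypothetical.

```c
/* Branching version: only one arm is evaluated. */
int branchy(int x, int b) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* If-converted version: both arms are computed speculatively and a
   multiplexor driven by the predicate picks one of them. */
int muxed(int x, int b) {
    int p      = x > 0;    /* predicate */
    int y_then = -x;       /* speculated "then" arm */
    int y_else = b * x;    /* speculated "else" arm */
    return p ? y_then : y_else;   /* the MUX */
}
```

Both functions return the same value for every input that does not overflow. The speculation is safe here because neither arm has side effects.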
25
Control Flow ⇒ Data Flow
data
f
Merge (label)
data
data
predicate
Gateway
26
Loops
  • int sum=0, i;
  • for (i=0; i < 100; i++)
  •   sum += i*i;
  • return sum;
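
One way to picture how such a loop runs without a program counter is to drive it with merge and gateway nodes. The sketch below assumes the loop sums `i*i` for `i` from 0 to 99. The `token_t` type and the exact node behaviour are simplifications chosen for illustration, not the compiler's actual representation.

```c
#include <stdbool.h>

/* A value travelling on a wire; present == false means "nothing yet". */
typedef struct { bool present; int value; } token_t;

static const token_t NONE = { false, 0 };

/* Merge: forwards whichever input carries a token (at most one is
   expected to be present at a time). */
static token_t merge(token_t a, token_t b) { return a.present ? a : b; }

/* Gateway: lets the data through only when the predicate is true. */
static token_t gateway(token_t data, bool pred) {
    return (data.present && pred) ? data : NONE;
}

/* Sum of i*i for 0 <= i < n, driven by merge/gateway nodes. */
int sum_squares(int n) {
    token_t i_in = { true, 0 }, sum_in = { true, 0 };  /* initial values */
    token_t i_back = NONE, sum_back = NONE;            /* loop back-edges */
    for (;;) {
        token_t i   = merge(i_in, i_back);
        token_t sum = merge(sum_in, sum_back);
        i_in = sum_in = NONE;              /* initial tokens are consumed */

        bool p = i.value < n;              /* loop predicate */
        token_t out = gateway(sum, !p);    /* exit edge */
        if (out.present) return out.value;

        token_t ib = gateway(i, p), sb = gateway(sum, p);
        sum_back = (token_t){ true, sb.value + ib.value * ib.value };
        i_back   = (token_t){ true, ib.value + 1 };
    }
}
```

The `for (;;)` loop only stands in for the passage of time. In hardware, all nodes would be active at once and tokens would circulate on the back-edges.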

27
Predication and Side-Effects
addr
token
to memory
Load
pred
data
token
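
A predicated load with a token input and output might behave roughly as follows. This is a sketch under the assumption that the token only orders side effects and that a false predicate suppresses the memory access; `load_out_t` and `predicated_load` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  data;      /* loaded value; meaningless when pred was false */
    bool token;     /* passed on so later memory operations can proceed */
    bool accessed;  /* true only if memory was really read */
} load_out_t;

load_out_t predicated_load(const int *mem, size_t addr,
                           bool pred, bool token_in) {
    load_out_t out = { 0, false, false };
    if (!token_in) return out;   /* earlier side effects not done: wait */
    if (pred) {                  /* only a true predicate reaches memory */
        out.data = mem[addr];
        out.accessed = true;
    }
    out.token = true;            /* the token is released either way */
    return out;
}
```

With a false predicate the address is never dereferenced, so a speculated load with an invalid address is harmless, while the outgoing token still lets dependent operations continue.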
28
Memory Access
LD
Monolithic Memory
pipelined arbitrated network
ST
LD
local communication
global structures
Future work: fragment this!
related work
complexity
29
CASH Optimizations
  • SSA-based optimizations
    • unreachable/dead code, gcse, strength reduction,
      loop-invariant code motion, software pipelining,
      reassociation, algebraic simplifications,
      induction variable optimizations, loop unrolling,
      inlining
  • Memory optimizations
    • dependence and alias analysis, register promotion,
      redundant load/store elimination, memory access
      pipelining, loop decoupling
  • Boolean optimizations
    • Espresso CAD tool, bitwidth analysis
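
As one concrete instance from the list, register promotion can be shown as a before/after pair. This is a hand-written sketch of the general technique, not output of the compiler; the function names are hypothetical.

```c
/* Before: every iteration loads and stores *total. */
void accumulate_naive(int *total, const int *a, int n) {
    for (int i = 0; i < n; i++)
        *total += a[i];
}

/* After register promotion: *total lives in a local for the whole
   loop. This is legal only if total does not alias any a[i]. */
void accumulate_promoted(int *total, const int *a, int n) {
    int t = *total;                 /* one load before the loop */
    for (int i = 0; i < n; i++)
        t += a[i];
    *total = t;                     /* one store after the loop */
}
```

The aliasing condition in the comment is why this transformation depends on dependence and alias analysis.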

30
Outline
  • Problems of current architectures
  • Compiling ASH
  • Pipelining
  • Evaluation: CASH vs. clocked designs
  • Conclusions

31
Pipelining
  • int sum=0, i;
  • for (i=0; i < 100; i++)
  •   sum += i*i;
  • return sum;

[Figure: dataflow circuit for this loop; labels: i, 1, 100, <, pipelined multiplier (8 stages), sum; animation step 1]
32
Pipelining
[Figure: same circuit (i, 1, 100, <, sum); animation step 2]
33
Pipelining
[Figure: same circuit (i, 1, 100, <, sum); animation step 3]
34
Pipelining
[Figure: same circuit (i, 1, 100, <, sum); animation step 4]
35
Pipelining
[Figure: same circuit (i, 1, 100, <, sum) with values i=1 and i=0 in flight; animation step 5]
36
Pipelining
[Figure: same circuit (i, 1, 100, <, sum) with values i=1 and i=0 in flight; animation step 6]
37
Pipelining
[Figure: same circuit (i, 1, 100, <, sum); animation step 7]
38
Pipelining
[Figure: same circuit (i, 1, 100, <, sum) with the critical path marked]
Predicate ack edge is on the critical path.

39
Pipeline balancing
[Figure: same circuit (i, 1, 100, <, sum) with a decoupling FIFO inserted; animation step 7]
40
Pipeline balancing
[Figure: same circuit (i, 1, 100, <, sum); labels: critical path, i's loop, decoupling FIFO, sum's loop]
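
A decoupling FIFO is, in software terms, a small bounded queue between a fast producer and a slower consumer. The sketch below is a generic ring buffer. The depth of 4 and all names are assumptions for illustration and say nothing about the FIFO sizes used in the actual circuits.

```c
#include <stdbool.h>

#define FIFO_DEPTH 4   /* hypothetical depth */

typedef struct {
    int buf[FIFO_DEPTH];
    int head;    /* index of the oldest element */
    int count;   /* number of stored elements */
} fifo_t;

/* Producer side: fails (producer must stall) only when the FIFO is full. */
bool fifo_push(fifo_t *f, int v) {
    if (f->count == FIFO_DEPTH) return false;
    f->buf[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

/* Consumer side: fails only when the FIFO is empty. */
bool fifo_pop(fifo_t *f, int *out) {
    if (f->count == 0) return false;
    *out = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

With such a buffer between the two loops, the producer can run up to `FIFO_DEPTH` values ahead instead of waiting for each value to be acknowledged by the slower consumer.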

41
Outline
  • Problems of current architectures
  • Compiling ASH
  • Pipelining
  • Evaluation: CASH vs. clocked designs
  • Conclusions

42
Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
C
CASH core
Verilog back-end
Synopsys, Cadence P/R
180nm std. cell library, 2V
1999 technology
ModelSim (Verilog simulation)
performance numbers
Mem
ASIC
43
ASH Area
P4 217
minimal RISC core
normalized area
44
ASH vs 600MHz CPU (.18 μm)
45
Bottleneck: Memory Protocol
LD
Memory
ST
46
Power
Xeon cache: 67000
μP: 4000
DSP: 110
47
Energy Efficiency
Dedicated hardware
ASH media kernels
Asynchronous μP
FPGAs
General-purpose DSP
Microprocessors
[Figure: chart comparing the categories above; axis: Energy Efficiency (Operations/nJ)]
48
Outline
  • Problems of current architectures
  • Compiling ASH
  • Pipelining
  • ASH Evaluation
  • Future/related work, conclusions

49
Related Work
Nanotechnology
Dataflow machines
Asynchronous circuits
High-level synthesis
Embedded systems
Reconfigurable computing
Computer architecture
Compilation
50
Future Work
  • Optimizations for area/speed/power
  • Memory partitioning
  • Concurrency
  • Compiler-guided layout
  • Explore extensible ISAs
  • Hybridization with superscalar mechanisms
  • Reconfigurable hardware support for ASH
  • Formal verification

51
Grand Vision: Certified Circuit Generation
  • Translation validation (input = output)
  • Preserve input properties
    • e.g., C programs cannot deadlock
    • e.g., type-safe programs cannot crash
  • Debug, test, verify only at source-level

How far can you go?
HLL
IR
IRopt
Verilog
gates
layout
formally validated
52
Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Design productivity, no ISA
53
Backup Slides
  • Reconfigurable hardware
  • Critical paths
  • Control logic
  • ASH vs ...
  • ASH weaknesses
  • Exceptions
  • Normalized area
  • Why C?
  • Splitting memory
  • More performance
  • Recursive calls

54
Reconfigurable Hardware
55
Main RH Ingredient: RAM Cell
data in
0
control
Switch controlled by a 1-bit RAM cell
56
Critical Paths
x
b
0
if (x > 0) y = -x; else y = b*x;

-
>
!
y
57
Lenient Operations
x
b
0
if (x > 0) y = -x; else y = b*x;

!
y
Solves the problem of unbalanced paths
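
The difference between a strict and a lenient multiplexor can be sketched as follows. The model assumes that a lenient operation may produce its output as soon as the inputs it actually needs have arrived; `wire_t` and both function names are hypothetical.

```c
#include <stdbool.h>

/* A wire whose value may not have arrived yet. */
typedef struct { bool known; int value; } wire_t;

static const wire_t UNKNOWN = { false, 0 };

/* Strict mux: waits for every input, including the one it discards. */
wire_t strict_mux(wire_t pred, wire_t a, wire_t b) {
    if (!pred.known || !a.known || !b.known) return UNKNOWN;
    return pred.value ? a : b;
}

/* Lenient mux: needs only the predicate and the input it selects. */
wire_t lenient_mux(wire_t pred, wire_t a, wire_t b) {
    if (!pred.known) return UNKNOWN;
    return pred.value ? a : b;   /* stays unknown if the selected input is */
}
```

If the predicate selects `a` while `b` is still being computed, for example by a slow multiplier, the lenient version already has a result, so the slow path no longer delays the fast one.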
58
Asynchronous Control
ackout
C
rdyin
D
rdyout
ackin

Reg
datain
dataout
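
Assuming the box labelled C in this diagram is a Muller C-element, a common building block of asynchronous control, its behaviour can be sketched as a tiny state machine. The type and function names are hypothetical.

```c
#include <stdbool.h>

/* State of one C-element: it remembers its last output. */
typedef struct { bool out; } c_element_t;

/* The output follows the inputs when they agree and holds its
   previous value when they differ. */
bool c_element_step(c_element_t *c, bool a, bool b) {
    if (a == b) c->out = a;
    return c->out;
}
```

This hold-until-both-agree behaviour is what lets a stage wait until both a ready signal and an acknowledgment have arrived before changing state.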
59
HLL to HW
High-level Synthesis: Behavioral HDL → Synchronous Hardware
Reconfigurable Computing: C subsets → Hardware configuration (spatial computation)
Asynchronous circuits: Concurrent Language → Asynchronous Hardware
Prior work
This research
60
CASH vs High-Level Synthesis
  • CASH is the only existing tool to translate
    complete ANSI C to hardware
  • CASH generates asynchronous circuits
  • CASH does not treat C as an HDL
    • no annotations required
    • no reactivity model
    • does not handle non-C, e.g., concurrency

61
ASH Weaknesses
  • Low efficiency for low-ILP code
  • Does not adapt at runtime
  • Monolithic memory
  • Resource waste
  • Not flexible
  • No support for exceptions

62
ASH Weaknesses (2)
  • Both branch and join not free
  • Static dataflow (no re-issue of same instr)
  • Memory is far
  • Fully static
  • No branch prediction
  • No dynamic unrolling
  • No register renaming
  • Calls/returns not lenient

63
Branch Prediction
i
1
  • for (i=0; i < N; i++)
  •   ...
  •   if (exception) break;

<
exception
!

64
Exceptions
  • Strictly speaking, C has no exceptions
  • In practice, it is hard to accommodate exceptions in
    hardware implementations
  • An advantage of software flexibility: the PC is
    a single point of execution control

CPU
ASH
Low-ILP computation + OS + VM + exceptions
High-ILP computation

Memory
65
Why C?
  • Huge installed base
  • Embedded specifications written in C
  • Small and simple language
  • Can leverage existing tools
  • Simpler compiler
  • Techniques generally applicable
  • Not a toy language

66
Performance
67
Parallelism Profile
68
Normalized Area
69
Memory Partitioning
  • MIT RAW project: Babb FCCM '99, Barua HiPC '00,
    Lee ASPLOS '00
  • Stanford SpC: Semeria DAC '01, TVLSI '02
  • Berkeley CCured: Necula POPL '02
  • Illinois FlexRAM: Fraguela PPoPP '03
  • Hand-annotations: #pragma

70
Memory Complexity
RAM
LSQ
addr
data
71
Recursion
save live values
recursive call
restore live values
stack
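
The save/call/restore pattern can be illustrated with a factorial in which the argument lives in a single shared variable, standing in for the one physical copy of a wire or register in the circuit. This is a sketch under that assumption; all names and the stack size are hypothetical.

```c
#define STACK_MAX 64   /* hypothetical stack size */

static int stack_mem[STACK_MAX];  /* explicit stack kept in memory */
static int sp;                    /* stack pointer */
static int n_reg;                 /* the single "register" holding n */

/* Recursion depth is assumed to stay below STACK_MAX. */
static int fact_body(void) {
    if (n_reg <= 1) return 1;
    stack_mem[sp++] = n_reg;      /* save live value */
    n_reg = n_reg - 1;
    int r = fact_body();          /* recursive call overwrites n_reg */
    n_reg = stack_mem[--sp];      /* restore live value */
    return n_reg * r;
}

int fact(int n) {
    sp = 0;
    n_reg = n;
    return fact_body();
}
```

Only values that are still needed after the call, here `n_reg`, have to be saved. The result `r` is produced after the call returns and therefore needs no slot on the stack.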
72
Me?