CASH: Compiling Application-Specific Hardware

About This Presentation
Title:

CASH: Compiling Application-Specific Hardware

Description:

CASH: Compiling Application-Specific Hardware. Mihai Budiu. ST Microelectronics, June 11, 2003 ... fast, local communication. Inexpensive large bandwidth: ... – PowerPoint PPT presentation

Slides: 76
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
CASH Compiling Application-Specific Hardware
  • Mihai Budiu
  • ST Microelectronics, June 11, 2003

2
CPU Problems
  • Design Complexity
  • Power
  • Global Signals
  • Limited issue window => limited ILP

3
Communication vs. Computation
[Figure: delay of a wire vs. delay of a gate; labels on the figure: 5ps, 20ps]
Power consumption on wires is also dominant
4
Global Communication
Instruction unit
Network
Reg
5
Our Approach: ASH (Application-Specific Hardware)
6
1) Unroll Pipeline
Instruction unit
Network
Network
Reg
Reg
Network
Reg
7
Resource Binding Time

1.
1.
Programs
2.
2.
Programs
CPU
ASH
8
2) Specialize Pipeline
Fixed program
Instruction unit
Network
Network
Reg
Reg
Network
Reg
9
2) Specialize Pipeline: Functional Units
Fixed program
Instruction unit
Network
Network
Reg
Reg
Network
Reg
10
2) Specialize Pipeline: Interconnection Network
Fixed program
Instruction unit
Reg
Reg
Reg
11
2) Specialize Pipeline: Register Files
Fixed program
Instruction unit
0
1
12
2) Specialize Pipeline: Shrink Wires
Fixed program
Instruction unit
0
1
13
2) Specialize Pipeline: No Instruction Fetch, Decode, Issue
0
1
14
Loops and Memory
Spatial Computation
0
1
LSQ
To memory
15
Outline
  • Introduction: spatial computation
  • ASH vs CPU
  • CASH: Compiling for ASH
  • ASH at run-time
  • Conclusions

16
Proposed Architecture
CPU
ASH
Low-ILP computation + OS + VM
High-ILP computation
Memory
17
Computational Bandwidth
  • FU clock freq

CPU
ASH
18
Registers
  • Fixed

Unbounded
R0 R1 ... R31
ASH
CPU
19
Register Bandwidth
Fixed
Unbounded
R1 R2 R3 W1 W2
CPU
ASH
20
Parallelism
In-order
HLL program
Compiler
Fetch
Decode
Dispatch
Execute
Commit
Limited by window
Circuit
ASH
CPU
Compiler's window is unbounded
21
ASH vs CPU: Summary
  • Late resource binding => match application
    needs
  • No centralized structures => fast, local
    communication
  • Inexpensive large bandwidth: computation,
    registers

22
Outline
  • Introduction
  • ASH vs CPU
  • CASH: Compiling for ASH
  • ASH at run-time
  • Conclusions

23
Our Solution
  • General: applicable to today's software
  • Automatic: compiler-driven RISC approach
  • Scalable: with technology, hardware prog size
  • Parallelism: exploit application parallelism
    (bit-level, ILP, pipeline, loop-level)
24
Application-Specific Hardware
C program
Dataflow IR
Compiler
dataflow machine
Reconfigurable/custom hw
25
Intermediate Representation
Traditionally
Our proposal
  • SSA + predication
  • Uniform for scalars and memory
  • Explicitly encode may-depend
  • Summarize control-flow
  • Executable

may-dep.
CFG
...
def-use
26
New
  • Entire C applications
  • Dynamically scheduled circuits
  • Custom dataflow machines
    - application-specific
    - direct execution (no interpretation)

27
CASH: Compiling for ASH
C Program
RH
Circuits
Memory partitioning
Interconnection net
28
Asynchronous Computation

data
ack
data valid
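
The three wire labels on this slide (data, data valid, ack) suggest a simple handshake between a producer and a consumer. A minimal software sketch of that idea follows; the type `channel_t` and the functions `channel_send` and `channel_recv` are hypothetical names introduced here for illustration, and the slide itself does not specify the protocol in this detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One point-to-point channel between a producer and a consumer.
 * Hypothetical model: `valid` plays the role of the "data valid" wire,
 * and clearing it plays the role of the "ack" wire. */
typedef struct {
    bool valid;
    int  data;
} channel_t;

/* Producer side: succeeds only if the previous value was acknowledged. */
bool channel_send(channel_t *ch, int value) {
    assert(ch != NULL);
    if (ch->valid) return false;   /* consumer has not acked yet: stall */
    ch->data = value;
    ch->valid = true;              /* raise "data valid" */
    return true;
}

/* Consumer side: takes the value and acknowledges it. */
bool channel_recv(channel_t *ch, int *out) {
    assert(ch != NULL && out != NULL);
    if (!ch->valid) return false;  /* nothing to consume yet */
    *out = ch->data;
    ch->valid = false;             /* "ack": producer may send again */
    return true;
}
```

A producer that calls `channel_send` twice in a row sees the second call fail until the consumer has called `channel_recv`, which is the stall behavior a handshake is meant to provide.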
29
Distributed Control Logic
ack
rdy

-
30
Outline
  • Introduction
  • ASH vs CPU
  • CASH: Compiling for ASH (some details)
  • ASH at run-time
  • Conclusions

31
Forward Branches
x
b
0
if (x > 0) y = -x; else y = b*x;

-
gt
!
y
Decoded mux
Conditionals => Speculation
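
The slide's point, that a forward branch becomes speculation plus a decoded mux, can be sketched in C. Both arms are computed unconditionally and a mux with one predicate per data input selects the result. The function names are hypothetical, and reading the garbled statement as `y = b*x` is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version with ordinary control flow. */
int32_t branchy(int32_t x, int32_t b) {
    int32_t y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Decoded mux: each data input has its own predicate,
 * and exactly one predicate is expected to be true. */
int32_t decoded_mux2(int p0, int32_t d0, int p1, int32_t d1) {
    assert((p0 != 0) != (p1 != 0));        /* predicates must be one-hot */
    int32_t m0 = -(int32_t)(p0 != 0);      /* all ones if p0, else zero */
    int32_t m1 = -(int32_t)(p1 != 0);
    return (d0 & m0) | (d1 & m1);
}

/* Speculative version: both sides execute, the mux picks one. */
int32_t speculated(int32_t x, int32_t b) {
    int p = x > 0;
    int32_t neg = -x;       /* computed even when x <= 0 */
    int32_t mul = b * x;    /* computed even when x > 0 */
    return decoded_mux2(p, neg, !p, mul);
}
```

For inputs that do not overflow, `speculated` returns the same value as `branchy`; the difference is that no operation waits on the comparison before it starts.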
32
Control Flow => Data Flow
data
Merge
data
data
predicate
Gateway
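
A rough C model of the two operators named on this slide may help. It assumes that a gateway forwards its data input only when its predicate is true, and that a merge forwards whichever input actually carries a value. The `token_t` type and the function names are hypothetical, and the example reuses the conditional from the Forward Branches slide with the garbled statement read as `y = b*x`.

```c
#include <assert.h>
#include <stdbool.h>

/* A value travelling on a dataflow edge; valid == false means "no value". */
typedef struct {
    bool valid;
    int  value;
} token_t;

/* Gateway: lets the data through only if the predicate holds. */
token_t gateway(token_t data, bool predicate) {
    token_t out = { data.valid && predicate, data.value };
    return out;
}

/* Merge: forwards the one input that carries a value. */
token_t merge2(token_t a, token_t b) {
    assert(!(a.valid && b.valid));   /* at most one path may be taken */
    return a.valid ? a : b;
}

/* Steer x down exactly one path, compute there, then join the paths. */
int steer_and_join(int x, int b) {
    token_t in = { true, x };
    bool p = x > 0;
    token_t t = gateway(in, p);      /* taken path:     y = -x  */
    token_t f = gateway(in, !p);     /* not-taken path: y = b*x */
    if (t.valid) t.value = -t.value;
    if (f.valid) f.value = b * f.value;
    return merge2(t, f).value;
}
```

Unlike the speculative mux version, only the selected path does any work here, which is why this form suits operations that must not run speculatively.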
33
Loops
  • int sum=0, i;
  • for (i=0; i < 100; i++)
  •     sum += i*i;
  • return sum;

34
Read-write Sets
Memory
p if (x) q else r
35
Token Edges
Memory
p if (x) q else r
36
Tokens in Hardware
addr
token
pred
LSQ
Load
Memory
data
token
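
The figure lists a load with addr, token and pred inputs and data and token outputs. One plausible reading, sketched below as an assumption rather than the author's exact semantics, is that the load waits for its input token, touches memory only when its predicate is true, and then passes the token on so that later memory operations may proceed. The LSQ is not modeled, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy memory that counts how many real accesses were made. */
typedef struct {
    const int *words;
    size_t     size;
    unsigned   accesses;
} memory_t;

typedef struct {
    bool fired;      /* did the operation execute at all? */
    int  data;       /* loaded value (0 if the load was squashed) */
    bool token_out;  /* token handed to the next memory operation */
} load_result_t;

load_result_t predicated_load(memory_t *mem, size_t addr,
                              bool pred, bool token_in) {
    load_result_t r = { false, 0, false };
    assert(mem != NULL);
    if (!token_in) return r;         /* earlier accesses not done: wait */
    r.fired = true;
    if (pred) {                      /* only a true predicate reads memory */
        assert(addr < mem->size);
        r.data = mem->words[addr];
        mem->accesses++;
    }
    r.token_out = true;              /* release dependent operations */
    return r;
}
```

In this model a load with a false predicate still forwards its token, so operations ordered after it are not blocked by a branch that was not taken.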
37
Outline
  • Introduction
  • ASH vs CPU
  • Compiling for ASH
  • ASH at run-time
  • Conclusions

38
Critical Paths
x
b
0
if (x > 0) y = -x; else y = b*x;

-
gt
!
y
39
Lenient Operations
x
b
0
if (x > 0) y = -x; else y = b*x;

!
y
Solve the problem of unbalanced paths
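
The benefit of a lenient mux on unbalanced paths can be illustrated with a small timing model. It assumes a strict mux waits for every input, while a lenient mux produces its output once the predicate and the selected data input have arrived. The `signal_t` type, the function names and any latency values used with them are illustrative assumptions, not figures from the talk.

```c
#include <assert.h>

/* A signal with its value and the time at which it becomes available. */
typedef struct {
    int ready_at;   /* arbitrary time units */
    int value;
} signal_t;

static int max2(int a, int b) { return a > b ? a : b; }

/* Strict mux: output time is set by the slowest input. */
int strict_mux_ready(signal_t pred, signal_t d_true, signal_t d_false) {
    return max2(pred.ready_at, max2(d_true.ready_at, d_false.ready_at));
}

/* Lenient mux: only the predicate and the selected input matter. */
int lenient_mux_ready(signal_t pred, signal_t d_true, signal_t d_false) {
    assert(pred.value == 0 || pred.value == 1);
    signal_t selected = pred.value ? d_true : d_false;
    return max2(pred.ready_at, selected.ready_at);
}
```

With a predicate and a negation that are both ready at time 1 and a multiply that is ready at time 8, the strict mux answers at time 8 whenever either side is selected, while the lenient mux answers at time 1 when the negation is the selected side.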
40
Pipelining
i
1

100

lt
pipelined multiplier (8 stages)
sum
  • int sum=0, i;
  • for (i=0; i < 100; i++)
  •     sum += i*i;
  • return sum;


cycle1
41
Pipelining
i
1

100

lt
sum

cycle2
42
Pipelining
i
1

100

lt
sum

cycle3
43
Pipelining
i
1

100

lt
sum

cycle4
44
Pipelining
i
1

100

lt
sum

cycle5
45
Pipelining
i
1

100

i1
lt
i0
sum

cycle6
46
Pipelining
i
1

100

lt
sum

cycle7
47
Pipelining
i
1

100

critical path
lt
Predicate ack edge is on the critical path.
sum

48
Pipelining
i
1

100

lt
decoupling FIFO
sum

cycle7
49
Pipelining
i
1

100

lt
critical path
i's loop
decoupling FIFO
sum
sum's loop

50
FIFO Impact
i
1

100

lt
Pipe  FIFO  Cycles
N     0     903
N     1     903
Y     0     653
Y     1     474
Y     2     408
Y     3     408
decoupling FIFO
sum

51
Dataflow Loop Pipelining
  • Related to software pipelining
  • Copes with unknown latencies
    - control-flow
    - memory accesses
  • Does not require parallelization
  • Applicable to memory accesses as well
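
The decoupling idea from the preceding slides can be sketched as a software simulation. The loop that advances i and the loop that accumulates sum run as two independent agents connected by a bounded FIFO, so the i loop may run ahead by up to the FIFO depth. This is a functional model only: it does not reproduce the cycle counts in the FIFO Impact table, and `FIFO_MAX`, `fifo_t` and the function name are hypothetical.

```c
#include <assert.h>

#define FIFO_MAX 16

typedef struct {
    int buf[FIFO_MAX];
    int head;    /* index of the oldest element */
    int count;   /* number of elements currently stored */
    int depth;   /* usable capacity, 1..FIFO_MAX */
} fifo_t;

/* Computes sum of i*i for i in [0, n) with the two loops decoupled. */
int decoupled_sum_of_squares(int n, int depth) {
    assert(depth >= 1 && depth <= FIFO_MAX);
    fifo_t q = { {0}, 0, 0, depth };
    int i = 0;          /* state of i's loop (producer) */
    int sum = 0;        /* state of sum's loop (consumer) */
    int consumed = 0;

    while (consumed < n) {            /* one iteration = one time step */
        /* sum's loop: consume one value if any is waiting */
        if (q.count > 0) {
            int v = q.buf[q.head];
            q.head = (q.head + 1) % FIFO_MAX;
            q.count--;
            sum += v * v;
            consumed++;
        }
        /* i's loop: run ahead while the FIFO has room */
        if (i < n && q.count < q.depth) {
            q.buf[(q.head + q.count) % FIFO_MAX] = i;
            q.count++;
            i++;
        }
    }
    return sum;
}
```

The result does not depend on the FIFO depth; the depth only changes how far the producer can run ahead of the consumer.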

52
Performance of Selected Kernels
[Chart: performance on selected Mediabench kernels: mpeg2_d, gsm_e, gsm_d, g721_d, mpeg2_e, pegwit_e, g721_e, jpeg_d, pegwit_d, jpeg_e, adpcm_e, adpcm_d. Value labels shown on the chart: 25, 17/12, 19/16, 10.5]
53
OpenDIVX IDCT, Normalized Running Time
54
OpenDIVX IDCT, Sustained IPC
includes speculative ops
no data
55
Full Dataflow Speed
wrong!
  • ASH runs at full dataflow speed, so CPU cannot
    do any better (if compilers are equally good)
  • CPU weapons:
    - speculation (branch prediction)
    - centralized memory access

56
Muxes Speculation Squash
x
b
0
if (x > 0) y = -x; else y = b*x;

-
gt
!
y
57
Control-Flow Speculation
i
1
  • for (i=0; i < N; i++) {
  •     ...
  •     if (exception) break;
  • }

lt
exception
!

58
Summary
  • ANSI C automatically translated to
    dynamically-scheduled hardware circuits
  • EPIC-based compilation technology (predication,
    speculation, hyperblocks)
  • Novel specific optimizations
    - leniency, ack-on-critical-path, loop decoupling,
      bitwidth
  • ASH does not naturally support control-speculation
    (aka branch prediction)

59
Conclusions
  • ASH = compiler-synthesized hw from HLL
  • Exposes program parallelism
  • ILP and loop pipelining
  • Dataflow techniques applied to hardware
  • Impressive performance on data-intensive
    kernels

60
Backup Slides
  • Compiler structure
  • Predication + speculation
  • Procedure calls
  • Evaluation model
  • Program partitioning
  • Status/resources
  • Control logic

61
Compiler Structure
Pegasus
SUIF
C/FORTRAN
  • CSE
  • dead-code
  • PRE
  • induction variables
  • strength reduction
  • loop-invariant lift
  • reassociation
  • memory optimization
  • constant propagation
  • constant folding
  • unreachable code
  • register promotion
  • inlining
  • unrolling
  • call-graph
  • pointer analysis
  • live var. analysis
  • CFG construction
  • unreachable code
  • build hyperblocks
  • ctrl dominance
  • path predicates

call-graph
C circuit simulation
Verilog
62
Hyperblocks
Procedure
63
Predication
hyperblock
if (!p) .......
p
!p
if (p) .......
q
q
64
Speculation
if (!p) ......
if (!p) ......
ops w/ side-effects
q
q
65
Computing Predicates
s
t
b
  • Correct for irreducible graphs
  • Correct even when speculatively computed
  • Boolean operations are lenient

66
Procedure calls
network
Extract args
args
call P
result
caller
ret
Procedure P
67
Recursion
save live values
recursive call
local stack
restore live values
hyperblock
68
How Performance Is Evaluated
C
Mem
L2 1/4M
L1 8K
LSQ
2
limited BW (2 words/c)
Unlimited ILP
8
72
69
Simulation Parameters
  • Compared to 4-wide OOO SimpleScalar
  • Same operation latencies
  • Same cache hierarchy
  • No measurements in library functions/OS
  • 3-cycle multiply, 20 cycle divide

70
Unit of Partitioning: Procedure
Program call-graph
recursive
leaves
library
71
Peering
Program
a( ) b( ) b( ) c( ) c( ) d( ) d( )
a
CPU
ASH
b
c
d
72
RPC
RH
CPU
a
b
b
c
c
d
d
73
Status
  • Handle all C constructs except
    - longjmp
    - exit
    - alloca
    - varargs
  • Generate coarse C simulation of circuits
  • Preliminary Verilog back-end available
    - no FP, procedure calls
    - uses a standard cell library
    - generates inefficient memory interface
74
How Many Resources?
  • Using a back-of-the-envelope calculation
  • Estimated SpecINT95 and Mediabench
  • Average < 100 bit-operations/SLOC
  • Routing resources harder to estimate

75
Control Logic
rdyin
C
C
ackin
D
rdyout
ackout
D
datain
dataout
Reg
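
The boxes labelled C in this figure are presumably Muller C-elements, the usual building block of handshake control logic; the transcript preserves only the label, so this reading is an assumption. A C-element drives its output to the common value when both inputs agree and otherwise holds its previous output. A minimal behavioral sketch, with hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State of one Muller C-element: just its current output. */
typedef struct {
    bool out;
} c_element_t;

/* Apply new input levels and return the resulting output.
 * Inputs equal: output follows them. Inputs differ: output is held. */
bool c_element_step(c_element_t *c, bool a, bool b) {
    assert(c != NULL);
    if (a == b) {
        c->out = a;
    }
    return c->out;
}
```

Feeding such an element with a ready signal from upstream and an acknowledge-derived signal from downstream is a common way to make a stage fire only when both neighbors agree; how exactly the slide wires rdyin, ackin, rdyout and ackout is not recoverable from the transcript.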