Title: Breaking the Memory Wall for Scalable Microprocessor Platforms
1. Breaking the Memory Wall for Scalable Microprocessor Platforms
- Wen-mei Hwu
- with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta
- University of Illinois at Urbana-Champaign
2. Semiconductor computing platform challenges
(Diagram: candidate platforms - microprocessors, DSP/ASIPs, intelligent RAM, reconfigurability - surrounded by the forces acting on them: S/W inertia, O/S limitations, reliability, feature set, performance, security, accelerators, power, cost, memory latency/bandwidth, power constraints, wire load, leakage, fab cost, process variation, billion transistors.)
3. ASIC/ASIP economics
(Chart: total ASIC/ASSP revenues +10-20%, engineering costs -5-20%, per-chip development cost +30-100%, number of IC designs -40%.)
- Optimistically, ASIC/ASSP revenues are growing 10-20% / year
- The engineering portion of the budget is supposed to be trimmed every year (but never is)
- Chip development costs are rising faster than increased revenues and decreased engineering costs can make up the difference
- This implies 40% fewer IC designs (each doing more applications) - every process generation!
4. ASIPs: non-traditional programmable platforms
- Level of concurrency must be comparable to ASICs
- ASIPs will be on-chip, high-performance multi-processors
5. Example embedded ASSP implementations
- Philips Nexperia (Viper): MIPS control processor plus VLIW media processing
- Intel IXP1200 Network Processor
6. What about the general purpose world?
- Clock frequency increase of computing engines is slowing down
  - Power budget hinders higher clock frequency
  - Device variation limits deeper pipelining
  - Most future perf. improvement will come from concurrency and specialization
- Size increase of single-thread computing engines is slowing down
  - Power budget limits the number of transistors activated by each instruction
  - Need finer-grained units for defect containment
  - Wire delay is becoming a primary limiter in large, monolithic designs
- The approach of covering all applications with a primarily single execution model is showing limitations
7. Impact of transistor variations
(Chart: normalized frequency vs. normalized leakage (Isb) for 130nm parts; a ~30% spread in frequency corresponds to a ~5X spread in leakage power. Source: Shekhar Borkar, Intel.)
8. Metal interconnects
(Charts: relative line resistance rises while line capacitance falls across process nodes from 500nm to 32nm, even with low-K ILD; the RC delay of a 1mm copper interconnect grows relative to the clock period, even assuming 0.7x scaled RC delay per generation. Source: Shekhar Borkar, Intel.)
9. Measured SPECint2000 performance on real hardware with the same fabrication technology
Date: October 2003
10. Convergence of future computing platforms
11. Breaking the memory wall with distributed memory and data movement
12. Parallelization with deep analysis: deconstructing von Neumann (IWLS 2004)
- Memory dataflow that enables:
  - Extraction of independent memory access streams
  - Conversion of implicit flows through memory into explicit communication
- Applicability to the mass software base requires pointer analysis, control flow analysis, and array dependence analysis
(Figure: the G.724 decoder post-filter mapped from a CPU + DRAM organization onto PEs with distributed memory. Calls to Weight_Ai, Residu, Copy, Set_zero, Syn_filt, preemphasis, agc, and Corr0/Corr1 communicate through arrays and temporaries such as Az_4, F_g3, F_g4, syn, res2, Ap3, Ap4, h, tmp, tmp1, and tmp2; the inner loops accumulate sums of h[i]*h[i] and h[i]*h[i+1], shift by 8, and scale by MU.)
13. Memory bottleneck example (G.724 decoder post-filter, C code)
(Timeline: Residu produces res2 in one index order (39..0) while Syn_filt consumes it in another (0..39), forcing all traffic through MEM.)
- Problem: production/consumption occur with different patterns across 3 kernels
  - Anti-dependence in the preemphasis function (loop reversal not applicable)
  - Consumer must wait until producer finishes
- Goal: convert memory access to inter-cluster communication
14. Breaking the memory bottleneck
- Remove anti-dependence by array renaming
- Apply loop reversal to match producer/consumer I/O
- Convert array access to inter-component communication
(Timeline: Residu, preemphasis (res), and Syn_filt (res2) now overlap in time.)
Interprocedural pointer analysis + array dependence test + array access pattern summary = interprocedural memory data flow
15. A prototyping experience with the Xilinx ML300
- Full system environment
  - Linux running on PowerPC
  - Lean system with custom Linux (Nacho Navarro, UIUC/UPC)
  - Virtex 2 Pro FPGA logic treated as software components
- Removing the memory bottleneck
  - Random memory access converted to dataflow
  - Memory objects assigned to distributed Block RAM
- SW / HW communication
  - PLB vs. OCM interface
16. Initial results from our ML300 testbed
- Case study: GSM vocoder
  - Main filter in FPGA
  - Rest in software running under Linux with customized support
- Straightforward software/accelerator communications pattern
- Fits in available resources on Xilinx ML300 V2P7
- Performance compared to all-software execution, with communication overhead
(Figure: hardware implementation.)
17. Grand challenge
- Moving the mass-market software base to heterogeneous computing architectures
  - Embedded computing platforms in the near term
  - General purpose computing platforms in the long run
(Diagram: applications and systems software above platforms, tied together by OS support, programming models, accelerator architectures, restructuring compilers, and communications and storage management.)
18. Slicing through software layers
19. Taking the first step: pointer analysis
- To what can this variable point? (points-to)
- Can these two variables point to the same thing? (alias)
- Fundamental to unraveling communications through memory - programmers like modularity and pointers!
- Pointer analysis is abstract execution
  - Models all possible executions of the program
  - Has to include important facets, or the result won't be useful
  - Has to ignore irrelevant details, or the result won't be timely
  - Unrealizable dataflow: artifacts of corners cut in the model
- Typically, the emphasis has been on timeliness, not resolution, because expensive algorithms cause unstable analysis time; for typical alias uses this may be OK
  - But we have new applications that can benefit from higher accuracy
  - Data flow unraveling for logic synthesis and heterogeneous systems
20. How to be fast, safe, and accurate?
- An efficient, accurate, and safe pointer analysis based on two key ideas:
  - Efficient analysis of a large program necessitates that only relevant details are forwarded to a higher-level component
  - The algorithm can locally cut its losses (like a bulkhead) to avoid a global explosion in problem size
21. One facet: context sensitivity
- Context sensitivity avoids unrealizable data flow by distinguishing the proper calling context
- Example: what assignments do a and g receive?
  - CI: a and g each receive 1 and 3
  - CS: g receives only 1 and a receives only 3
- Typical reactions to CS costs:
  - Forget it; live with lots of unrealizable dataflow
  - Combine it with a "cheapener," like the lossy compression of a Steensgaard analysis
- We want to do better, but we may sometimes need to mix CS and CI to keep analysis fast
(Figure: example code and desired results.)
22. Context insensitive (CI)
- Collecting all the assignments in the program and solving them simultaneously yields a context-insensitive solution
- Unfortunately, this leads to three spurious solutions
23. Context sensitive (CS): naïve process
- Excess statements are unnecessary and costly
- Retention of side effects still leads to spurious results
24. CS: accurate and efficient approach
- Compact summary of jade used
- Summary accounts for all side effects; DELETE the assignment to prevent contamination
- Now, only the correct result is derived
25. Analyzing large, complex programs (SAS 2004)
- Originally, problem size exploded as more contexts were encountered (on the order of 10^12)
- The new algorithm contains problem size with each additional context (on the order of 10^4)
- This results in an efficient analysis process without loss of accuracy
26. Example application and current challenges (PASTE 2004)
- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Example: improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools
27. From benchmarks to a broad application code base
- The long-term trend is for all code to go through a compiler and be managed by a runtime system
  - Microsoft code base to go through Phoenix (OpenIMPACT participation)
  - Open source code base to go through GCC/OpenIMPACT under Gelato
- The compiler and runtime will perform deep analysis to give tools visibility into software
  - Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtime, etc.
28. Global memory dataflow analysis
- Integrates analyses to deconstruct the memory black box
  - Interprocedural pointer analysis: lets the programmer use language features and modularity without losing transformability
  - Array access pattern analysis: figures out communication among loops that communicate through arrays
  - Control and data flow analyses: enhance resolution by understanding program structure
  - Heap analysis: extends the analysis to a much wider software base
- SSA-based inductor detection and dependence test have been integrated into the IMPACT environment
29. Example of deriving memory data flow

main(...) { int A[100]; foo(A, 64); bar(A+1, 64); }

- foo writes A[0..63] with stride 1; bar reads A[1..64] with stride 1 (parameter mapping across the procedure call)
- Data flow analysis determines that A[64] is not produced by foo

foo(int *s, int L) { int *p = s, i; for (i = 0; i < L; i++) { *p = ...; p++; } }
- Summary for the whole loop: writes from (s) to (s+L) with stride 1

bar(int *t, int M) { int *q = t, i; for (i = 0; i < M; i++) { ... = *q; q++; } }
- Summary for the whole loop: reads from (t) to (t+M) with stride 1

- Pointer relation analysis restates p/q (loop-body accesses: write *p, read *q) in terms of s/t
30. Conclusions and outlook
- Heterogeneous multiprocessor systems will be the model for both general purpose and embedded computing platforms in the future
  - Both are motivated by powerful trends
  - Shorter-term adoption for embedded systems
  - Longer term for general purpose systems
- Programming models and parallelization of traditional programs are needed to channel software to these new platforms
- Feasibility of deep pointer analysis has been demonstrated
- Many need to participate in solving this grand challenge problem