Breaking the Memory Wall for Scalable Microprocessor Platforms

1
Breaking the Memory Wall for Scalable
Microprocessor Platforms
  • Wen-mei Hwu
  • with John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steve S. Lumetta
  • University of Illinois at Urbana-Champaign

2
Semiconductor computing platform challenges
[Diagram: pressures on the billion-transistor computing platform. Software side: S/W inertia, O/S limitations, reliability, feature set, performance, security, accelerators, power, cost. Architectural alternatives: reconfigurability, microprocessors, intelligent RAM, DSP/ASIP, under memory latency/bandwidth and power constraints. Process side: wire load, leakage, fab cost, process variation.]
3
ASIC/ASIP economics
[Chart: total ASIC/ASSP revenues (up 10-20%/year), engineering costs (planned down 5-20%/year), per-chip development cost (up 30-100%), number of IC designs (down 40%).]
  • Optimistically, ASIC/ASSP revenues are growing 10-20% per year
  • The engineering portion of the budget is supposed to be trimmed every year (but never is)
  • Chip development costs are rising faster than increased revenues and decreased engineering costs can make up the difference
  • This implies 40% fewer IC designs (each doing more applications) - every process generation!

4
ASIPs: non-traditional programmable platforms
The level of concurrency must be comparable to ASICs.
ASIPs will be on-chip, high-performance multi-processors.
5
Example embedded ASSP implementations
[Block diagrams: Philips Nexperia (Viper), which pairs a MIPS control CPU with a VLIW media processor, and the Intel IXP1200 Network Processor.]
6
What about the general purpose world?
  • Clock frequency increase of computing engines is slowing down
  • Power budget hinders higher clock frequency
  • Device variation limits deeper pipelining
  • Most future performance improvement will come from concurrency and specialization
  • Size increase of single-thread computing engines is slowing down
  • Power budget limits the number of transistors activated by each instruction
  • Finer-grained units are needed for defect containment
  • Wire delay is becoming a primary limiter in large, monolithic designs
  • The approach of covering all applications with primarily a single execution model is showing its limitations

7
Impact of Transistor Variations
[Chart: normalized frequency vs. normalized leakage (Isb) for 130nm devices - roughly a 30% spread in frequency and a 5X spread in leakage power. Source: Shekhar Borkar, Intel]
8
Metal Interconnects
[Charts: relative metal line resistance and capacitance across process nodes (500nm to 32nm, with low-K ILD), and the RC delay of a 1mm copper interconnect vs. clock period (350nm to 65nm) - even with 0.7x scaled RC delay, interconnect delay grows relative to the clock period. Source: Shekhar Borkar, Intel]
9
Measured SPECint2000 performance on real hardware with the same fabrication technology
Date: October 2003
10
Convergence of future computing platforms
11
Breaking the memory wall with distributed memory and data movement
12
Parallelization with deep analysis: deconstructing von Neumann [IWLS 2004]
  • Memory dataflow that enables:
  • Extraction of independent memory access streams
  • Conversion of implicit flows through memory into explicit communication (a C sketch follows this list)
  • Applicability to the mass software base requires pointer analysis, control flow analysis, and array dependence analysis
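A minimal C sketch of the implicit-to-explicit conversion (the fifo channel API and the produce/consume kernels are hypothetical names for this note, not from the talk):

    short produce(int i);                     /* hypothetical kernels  */
    void  consume(short v);
    struct fifo;                              /* hypothetical channel  */
    void  fifo_put(struct fifo *ch, short v);
    short fifo_get(struct fifo *ch);

    short buf[40];
    void through_memory(void) {
        int i;                                /* implicit flow: values */
        for (i = 0; i < 40; i++)              /* round-trip through a  */
            buf[i] = produce(i);              /* shared array in DRAM  */
        for (i = 0; i < 40; i++)
            consume(buf[i]);
    }

    void producer_pe(struct fifo *ch) {       /* explicit flow: the    */
        int i;                                /* same values stream    */
        for (i = 0; i < 40; i++)              /* between two PEs, so   */
            fifo_put(ch, produce(i));         /* the stages overlap    */
    }
    void consumer_pe(struct fifo *ch) {
        int i;
        for (i = 0; i < 40; i++)
            consume(fifo_get(ch));
    }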

[Figure: the G.724 decoder post-filter mapped from a CPU + DRAM organization onto PEs with distributed memory. Kernels Weight_Ai, Residu, Copy, Set_zero, Syn_filt, preemphasis, agc, and Corr0/Corr1 originally pass arrays (Az_4, F_g3, F_g4, Ap3, Ap4, res2, syn, h, tmp, tmp1, tmp2) through DRAM; the analysis converts these implicit flows into explicit inter-PE communication. The inset C code reads approximately:]

    tmp = h[0] * h[0];
    for (i = 1; i < 22; i++)
        tmp = tmp + h[i] * h[i];
    tmp1 = tmp >> 8;
    tmp = h[0] * h[1];
    for (i = 1; i < 21; i++)
        tmp = tmp + h[i] * h[i+1];
    tmp2 = tmp >> 8;
    if (tmp2 < 0) tmp2 = 0;
    else { tmp2 = tmp2 * MU; tmp2 = tmp2 / tmp1; }
    preemphasis(res2, tmp2, 40);
    Syn_filt(Ap4, res2, syn_pst, 40, mem_syn_pst, 1);
    agc(&syn[i_subfr], syn_pst, 29491, 40);
13
Memory bottleneck example (G.724 decoder post-filter, C code)
[Timeline: Residu and Syn_filt run back-to-back, serialized through memory (MEM); elements 0..39 of the shared array are produced and consumed in differing orders (39..0 vs. 0..39), so the kernels cannot overlap.]
  • Problem: production and consumption occur with different patterns across the 3 kernels
  • An anti-dependence in the preemphasis function means loop reversal is not applicable
  • The consumer must wait until the producer finishes
  • Goal: convert memory accesses into inter-cluster communication

14
Breaking the memory bottleneck
  • Remove anti-dependence by array renaming
  • Apply loop reversal to match producer/consumer
    I/O
  • Convert array access to inter-component
    communication

[Timeline: after the transformation, Residu, preemphasis, and Syn_filt overlap in time, streaming through the renamed arrays res and res2 instead of waiting on memory.]
Interprocedural pointer analysis + array dependence testing + array access pattern summaries yield interprocedural memory dataflow.
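A minimal C sketch of the renaming step (array names and the Q15 fixed-point handling are illustrative, not the decoder's exact code). In-place preemphasis must run backward because each element depends on the previous one, which blocks loop reversal; writing to a fresh array removes the anti-dependence so the loop can run forward and match the producer's and consumer's access order:

    #define L 40
    /* before: in-place update forces a backward loop (39..1), since   */
    /* res2[i-1] must still hold its old value when res2[i] is written */
    void preemph_inplace(short res2[L], short mu) {
        int i;
        for (i = L - 1; i > 0; i--)
            res2[i] = res2[i] - (short)((mu * res2[i - 1]) >> 15);
    }
    /* after renaming: output goes to a fresh array res, the loop runs */
    /* forward, and each res[i] is ready as soon as res2[i] arrives -  */
    /* res2 and res can now become inter-component streams             */
    void preemph_renamed(const short res2[L], short res[L], short mu) {
        int i;
        res[0] = res2[0];
        for (i = 1; i < L; i++)
            res[i] = res2[i] - (short)((mu * res2[i - 1]) >> 15);
    }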
15
A prototyping experience with the Xilinx ML300
  • Full system environment
  • Linux running on PowerPC
  • Lean system with custom Linux (Nacho Navarro, UIUC/UPC)
  • Virtex-II Pro FPGA logic treated as software components
  • Removing the memory bottleneck
  • Random memory accesses converted to dataflow
  • Memory objects assigned to distributed Block RAM
  • SW / HW communication
  • PLB vs. OCM interface
16
Initial results from our ML300 testbed
  • Case study: GSM vocoder
  • Main filter in FPGA
  • Rest in software running under Linux with customized support
  • Straightforward software/accelerator communication pattern
  • Fits in the available resources of the Xilinx ML300 V2P7
  • Performance compared to all-software execution, including communication overhead

[Figure: hardware implementation]
17
Grand challenge
  • Moving the mass-market software base to
    heterogeneous computing architectures
  • Embedded computing platforms in the near term
  • General purpose computing platforms in the long
    run

[Diagram: applications and systems software layered over platforms, connected by OS support, programming models, accelerator architectures, restructuring compilers, and communications and storage management.]
18
Slicing through software layers
19
Taking the first step: pointer analysis
  • To what can this variable point? (points-to)
  • Can these two variables point to the same thing? (alias) - see the sketch after this list
  • Fundamental to unraveling communication through memory - programmers like modularity and pointers!
  • Pointer analysis is abstract execution: model all possible executions of the program
  • It has to include the important facets, or the result won't be useful
  • It has to ignore irrelevant details, or the result won't be timely
  • "Unrealizable dataflow": artifacts of corners cut in the model
  • Typically the emphasis has been on timeliness, not resolution, because expensive algorithms cause unstable analysis times; for typical alias uses that may be OK
  • ...but we have new applications that can benefit from higher accuracy
  • Dataflow unraveling for logic synthesis and heterogeneous systems
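A small illustration of the two queries (variable names invented for this note):

    extern int flag;
    int x, y;
    void example(void) {
        int *p = &x;               /* points-to(p) = { x }    */
        int *q = flag ? &x : &y;   /* points-to(q) = { x, y } */
        /* alias(p, q)? possibly - the two sets intersect at x */
        /* alias between p and &y? never - p cannot point to y */
    }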

20
How to be fast, safe and accurate?
  • An efficient, accurate, and safe pointer analysis can be built on two key ideas:
  • Efficient analysis of a large program necessitates that only relevant details are forwarded to a higher-level component
  • The algorithm can locally cut its losses (like a bulkhead) to avoid a global explosion in problem size
21
One facet: context sensitivity
  • Context sensitivity avoids unrealizable dataflow by distinguishing the proper calling context
  • What assignments do a and g receive?
  • CI: a and g each receive both 1 and 3
  • CS: g receives only 1 and a receives only 3
  • Typical reactions to CS costs:
  • Forget it - live with lots of unrealizable dataflow
  • Combine it with a "cheapener," like the lossy compression of a Steensgaard analysis
  • We want to do better, but we may sometimes need to mix CS and CI to keep the analysis fast (a sketch of the CI/CS contrast follows)

[Figure: example code and desired results]
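A minimal sketch of the scenario above (the slide's example code is not in the transcript, so these names are invented): one callee reached from two call sites.

    int a, g;
    void store(int *p, int v) { *p = v; }
    void caller(void) {
        store(&g, 1);    /* call site 1 */
        store(&a, 3);    /* call site 2 */
    }
    /* CI merges the two calls: p -> { g, a }, v -> { 1, 3 }, so the  */
    /* analysis concludes a and g may each receive 1 and 3.           */
    /* CS keeps each site's bindings separate: g receives only 1 and  */
    /* a receives only 3 - no unrealizable dataflow.                  */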
22
Context Insensitive (CI)
  • Collecting all the assignments in the program and solving them simultaneously yields a context-insensitive solution
  • Unfortunately, this leads to three spurious solutions (the CI outcome sketched above shows how such spurious results arise)

23
Context Sensitive (CS): the naïve process
The naïve process re-analyzes the callee's statements for each calling context: the excess statements are unnecessary and costly, and retention of side effects still leads to spurious results.
24
CS: an accurate and efficient approach
A compact summary of the callee (jade in the slide's example) is used. The summary accounts for all side effects, and the original assignment is DELETEd to prevent contamination across contexts. Now only the correct result is derived.
25
Analyzing large, complex programs [SAS 2004]
Originally, the problem size exploded (toward ~10^12) as more contexts were encountered. The new algorithm contains the problem size (~10^4) as each additional context is added, resulting in an efficient analysis process without loss of accuracy.
26
Example application and current challenges [PASTE 2004]
Improved efficiency increases the scope over which unique heap-allocated objects can be discovered. Example: improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools.
27
From benchmarks to broad application code base
  • The long-term trend is for all code to go through a compiler and be managed by a runtime system
  • Microsoft code base to go through Phoenix (with OpenIMPACT participation)
  • Open-source code base to go through GCC/OpenIMPACT under Gelato
  • The compiler and runtime will perform deep analysis to give tools visibility into the software
  • Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtimes, etc.

28
Global memory dataflow analysis
  • Integrates analyses to deconstruct the memory "black box"
  • Interprocedural pointer analysis: lets the programmer use the language and modularity without losing transformability
  • Array access pattern analysis: figures out the communication among loops that communicate through arrays
  • Control and data flow analyses: enhance resolution by understanding program structure
  • Heap analysis: extends the analysis to a much wider software base
  • SSA-based induction-variable detection and a dependence test have been integrated into the IMPACT environment

29
Example of deriving memory dataflow

    main(...) {
        int A[100];
        foo(A, 64);          /* procedure call */
        bar(A + 1, 64);
    }

foo writes A[0..63] with stride 1; bar reads A[1..64] with stride 1. Dataflow analysis determines that A[64] is not from foo. Parameter mapping connects the callers' arguments to each procedure body's summary:

    foo(int *s, int L) {             /* summary for the whole loop:  */
        int *p = s, i;               /*   writes from (s) to (s+L)   */
        for (i = 0; i < L; i++) {    /*   with stride 1              */
            *p = ...;                /* loop body: write *p          */
            p++;
        }
    }

    bar(int *t, int M) {             /* summary for the whole loop:  */
        int *q = t, i;               /*   reads from (t) to (t+M)    */
        for (i = 0; i < M; i++) {    /*   with stride 1              */
            ... = *q;                /* loop body: read *q           */
            q++;
        }
    }

Pointer relation analysis restates p and q in terms of s and t.
30
Conclusions and outlook
  • Heterogeneous multiprocessor systems will be the model for both general-purpose and embedded computing platforms in the future
  • Both are motivated by powerful trends: shorter-term adoption for embedded systems, longer-term for general-purpose systems
  • Programming models and parallelization of traditional programs are needed to channel software onto these new platforms
  • The feasibility of deep pointer analysis has been demonstrated
  • Many need to participate in solving this grand challenge problem