Title: Performance Analysis, Modeling, and Optimization: Understanding the Memory Wall
1 Performance Analysis, Modeling, and Optimization
Understanding the Memory Wall
- Leonid Oliker (LBNL) and
- Katherine Yelick (UCB and LBNL)
2 Berkeley Institute for Performance Studies
- Joint venture between U.C. Berkeley (Demmel, Yelick)
- And LBNL (Oliker, Strohmaier, Bailey, and others)
- Three performance techniques
  - Analysis (benchmarking)
  - Modeling (prediction)
  - Optimization (tuning)
3 Investigating Architectural Balance using Adaptable Probes
- Kaushik Datta, Parry Husbands, Paul Hargrove,
Shoaib Kamil, Leonid Oliker, John Shalf,
Katherine Yelick
4 Overview
- The gap between peak and sustained performance is a well-known problem in HPC
- Generally attributed to the memory system, but the bottleneck is difficult to identify
  - Application benchmarks are too complex to isolate specific architectural features
  - Microbenchmarks are too narrow to predict actual code performance
- We use adaptable probes to isolate performance limitations
  - Give application developers possible optimizations
  - Give hardware designers feedback on current and proposed architectures
- Single-processor probes
  - Sqmat captures regular and irregular memory access patterns (such as dense and sparse linear algebra)
  - Stencil captures nearest-neighbor computation (work in progress)
- Architectures examined
  - Commercial: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4, G5
  - Research: Imagine, VIRAM, DIVA
5 Sqmat Overview
- Sqmat is based on matrix multiplication and linear solvers
- A Java program generates optimally unrolled C code
- Squares a set of matrices M times (uses enough matrices to exceed cache)
  - M controls computational intensity (CI): the ratio between flops and memory accesses
- Each matrix is of size NxN
  - N controls working-set size: 2N^2 registers required per matrix
- Direct storage: Sqmat's matrix entries are stored contiguously in memory
- Indirect: entries are accessed indirectly through a pointer
  - Parameter S controls the degree of indirection: S matrix entries stored contiguously, then a random jump in memory
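The two access patterns can be sketched as follows. This is a minimal Python model of what the generated C does (all names here are illustrative, not the probe's actual API): the direct path walks contiguous entries, while the indirect path reaches them through an index vector in which runs of s contiguous entries are separated by random jumps.

```python
import random

def square_once(a, n):
    """One squaring step, out = a * a, for an n x n matrix stored row-major."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i * n + k] * a[k * n + j]
            out[i * n + j] = acc
    return out

def sqmat_direct(a, n, m):
    """Direct storage: entries are contiguous; square the matrix m times.
    CI grows with m: each squaring costs ~2*n^3 flops against ~2*n^2 words."""
    for _ in range(m):
        a = square_once(a, n)
    return a

def indirection_vector(length, s, rng):
    """Index vector modeling parameter S: runs of s contiguous entries,
    separated by random jumps in memory."""
    starts = list(range(0, length, s))
    rng.shuffle(starts)
    idx = []
    for b in starts:
        idx.extend(range(b, min(b + s, length)))
    return idx

def sqmat_indirect(flat, idx, n, m):
    """Indirect storage: entries reached through the pointer vector idx
    (gather), squared m times, then written back (scatter)."""
    a = [flat[i] for i in idx]
    a = sqmat_direct(a, n, m)
    for pos, i in enumerate(idx):
        flat[i] = a[pos]
    return flat
```

Both paths do identical arithmetic; only the address stream differs, which is exactly what the probe varies to isolate memory-system behavior.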
6 Unit Stride Algorithmic Peak
- The curve increases until the memory system is fully utilized, then plateaus when the FPU units saturate
- Itanium2 requires a longer time to reach its plateau due to register-spill penalty
- Opteron's SIMD implementation of SSE2 inhibits high algorithmic peak
- Power3 effectively hides cache-access latency
- Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
7 Slowdown due to Indirection
Unit-stride access via indirection (S=∞):
- Opteron and Power3/4 suffer less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium2 shows a high penalty for indirection; the issue is currently under investigation
8 Cost of Irregularity (1)
- Itanium2 and Opteron perform well for irregular accesses due to
  - Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
  - Opteron's low memory latency from its on-chip memory controller
9 Cost of Irregularity (2)
- Power3 and Power4 perform poorly for irregular accesses due to
  - Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
  - Power4's requirement of 4 consecutive cache-line hits to activate prefetching
10 Tolerating Irregularity
- S50
  - Start with some M at S=∞ (indirect unit stride)
  - For a given M, how large must S be to achieve at least 50% of the original performance?
- M50
  - Start with M=1, S=∞
  - At S=1 (every access random), how large must M be to achieve 50% of the original performance?
11 Tolerating Irregularity
- S50: what fraction of memory accesses can be random before performance drops by half?
  - Gather/scatter is expensive on commodity cache-based systems
  - Power4 can tolerate only 1.6% random accesses (1 in 64); Itanium2 is much less sensitive at 25% (1 in 4)
- M50: how much computational intensity is required to hide the penalty of all-random access?
  - A huge amount of computation may be required to hide the overhead of irregular data access
  - Itanium2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
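One way to extract S50 from such measurements can be sketched as below. The `perf` callable and the synthetic saturation curve are assumptions for illustration (not the probe's real interface): the sweep doubles S, mirroring how the Sqmat parameter is typically varied, until the measured rate recovers half the direct-storage baseline.

```python
def s50(perf, baseline, s_max=1 << 20):
    """Smallest power-of-two contiguity length S at which perf(S) recovers
    at least 50% of the unit-stride (direct-storage) baseline.
    perf: callable S -> measured rate (e.g. Mflop/s) at fixed M.
    Returns None if performance never recovers within s_max."""
    s = 1
    while s <= s_max:
        if perf(s) >= 0.5 * baseline:
            return s
        s *= 2
    return None

# Illustrative synthetic curve (not measured data): the rate rises with
# contiguity and saturates near a 1000 Mflop/s baseline.
model = lambda s: 1000.0 * s / (s + 63.0)
```

With this model, `s50(model, 1000.0)` returns 64: runs of 64 contiguous entries between random jumps are needed to reach half of unit-stride performance.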
- Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
12 Emerging Architectures
- General-purpose processors are badly suited for data-intensive ops
  - Large caches are not useful if re-use is low
  - Low memory bandwidth, especially for irregular patterns
  - Superscalar methods of increasing ILP are inefficient
  - Power consumption
- Application-specific ASICs
  - Good, but expensive/slow to design
- Solution: general-purpose memory-aware processors
  - Large number of ALUs to exploit data-parallelism
  - Huge memory bandwidth to keep the ALUs busy
  - Concurrency: overlap memory with computation
13 VIRAM Overview
- MIPS core (200 MHz)
- Main memory system
  - 8 banks w/ 13 MB of on-chip DRAM
  - Large 6.4 GB/s on-chip peak bandwidth
- Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single issue, in order
- Low power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops (64/32/16-bit operations)
  - 1.6 Gflops (single-precision)
- Fabricated by IBM
  - Deep pipelines mask DRAM latency
- Cray's vcc compiler adapted to VIRAM
- Simulator used for results
14 VIRAM Power Efficiency
- Comparable performance with a lower clock rate
- Large power/performance advantage for VIRAM from
  - PIM technology and the data-parallel execution model
15 Imagine Overview
- Vector VLIW processor
- Coprocessor to an off-chip host processor
- 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
- Central 128 KB Stream Register File (SRF) @ 32 GB/s
  - SRF can overlap computation w/ memory (double buffering)
  - SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s off-chip bandwidth
- 544 GB/s inter-cluster communication
- Host sends instructions to the stream controller; the SC issues commands to on-chip modules
16 VIRAM and Imagine

                          VIRAM        Imagine (memory)   Imagine (SRF)
  Bandwidth (GB/s)        6.4          2.7                32
  Peak float (32-bit)     1.6 GF/s     20 GF/s            20 GF/s
  Peak float/word         1            30                 2.5
  Speed (MHz)             200          400
  Chip area               15x18 mm     12x12 mm
  Data widths (bits)      64/32/16     32/16/8
  Transistors             130 x 10^6   21 x 10^6
  Power consumption       2 W          10 W

- Imagine: an order of magnitude higher peak performance
- VIRAM: twice the memory bandwidth, less power consumption
- Notice the peak flop/word ratios
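The peak flop/word row follows directly from the two rows above it: divide the peak rate by the bandwidth expressed in 32-bit words. A quick sketch of that arithmetic (the function name is ours, not from the slides):

```python
def flops_per_word(peak_gflops, bandwidth_gb_s, word_bytes=4):
    """Peak flop/word: flops available per 32-bit word the memory path delivers."""
    gwords_per_s = bandwidth_gb_s / word_bytes  # GB/s -> Gwords/s
    return peak_gflops / gwords_per_s

# VIRAM:          1.6 GF/s over 6.4 GB/s -> 1.0 flop/word
# Imagine memory: 20 GF/s over 2.7 GB/s  -> ~30 flops/word
# Imagine SRF:    20 GF/s over 32 GB/s   -> 2.5 flops/word
```

The ratio is the CI a code must exceed to stay compute-bound: Imagine's off-chip path demands roughly 30 flops per word, which is why only high-CI kernels approach its peak.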
17 What Does This Have to Do with PIMs?
- Performance of Sqmat on PIMs and other architectures for 3x3 matrices, squared 10 times (high computational intensity!)
- Imagine is much faster for long streams, slower for short ones
18 SQMAT Performance Crossover
- Large number of ops per word (3x3 matrices squared 10 times)
- Crossover point: L=64 (cycles), L=256 (Mflop/s)
- Imagine's power becomes apparent at almost 4x VIRAM at L=1024; codes at this end of the spectrum greatly benefit from the Imagine architecture
19 Stencil Probe
- Stencil computations are the core of a wide range of scientific applications
  - Applications include Jacobi solvers, complex multigrid, and block-structured AMR
- We are developing an adaptable stencil probe to model a range of computations
- Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
- Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors
  - Small blocks inhibit automatic prefetching performance
  - Modern large on-chip L2/L3 caches have bandwidth similar to L1
- Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
- Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
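The kind of kernel the stencil probe exercises can be sketched as follows: a 2-D five-point Jacobi sweep over a row-major grid, with an optional blocking factor on the inner loop to model the L1-tiling vs. streaming tradeoff above. This is a Python sketch with illustrative names; the actual probe is compiled code.

```python
def jacobi_sweep(grid, nx, ny, block=None):
    """One 5-point Jacobi relaxation on an nx x ny row-major grid.
    block=None streams whole rows (the access order that engages hardware
    prefetch on real machines); an integer block tiles the inner loop,
    modeling an L1-blocked traversal."""
    new = list(grid)                      # boundary values carry over
    bs = block if block else ny - 2       # default: full-row streaming
    for j0 in range(1, ny - 1, bs):
        for i in range(1, nx - 1):
            for j in range(j0, min(j0 + bs, ny - 1)):
                new[i * ny + j] = 0.25 * (grid[(i - 1) * ny + j] +
                                          grid[(i + 1) * ny + j] +
                                          grid[i * ny + j - 1] +
                                          grid[i * ny + j + 1])
    return new
```

Since Jacobi reads only the old grid, the blocked and streamed traversals return identical values; the probe can therefore vary `block` purely to study memory behavior, which is what makes the blocking-vs-prefetching tradeoff measurable in isolation.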