Title: Performance Analysis, Modeling, and Optimization: Understanding the Memory Wall
1 Performance Analysis, Modeling, and Optimization
Understanding the Memory Wall
- Leonid Oliker (LBNL) and
- Katherine Yelick (UCB and LBNL)
2 Berkeley Institute for Performance Studies
- Joint venture between U.C. Berkeley (Demmel, Yelick)
- And LBNL (Oliker, Strohmaier, Bailey, and others)
- Three performance techniques
  - Analysis (benchmarking)
  - Modeling (prediction)
  - Optimization (tuning)
3 Investigating Architectural Balance using Adaptable Probes
- Kaushik Datta, Parry Husbands, Paul Hargrove,
Shoaib Kamil, Leonid Oliker, John Shalf,
Katherine Yelick
4 Overview
- The gap between peak and sustained performance is a well-known problem in HPC
- Generally attributed to the memory system, but the bottleneck is difficult to identify
  - Application benchmarks are too complex to isolate specific architectural features
  - Microbenchmarks are too narrow to predict actual code performance
- We use adaptable probes to isolate performance limitations
  - Give application developers possible optimizations
  - Give hardware designers feedback on current and proposed architectures
- Single-processor probes
  - Sqmat captures regular and irregular memory access patterns (such as dense and sparse linear algebra)
  - Stencil captures nearest-neighbor computation (work in progress)
- Architectures examined
  - Commercial: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4, G5
  - Research: Imagine, VIRAM, DIVA
5 Sqmat Overview
- Sqmat is based on matrix multiplication and linear solvers
- A Java program generates optimally unrolled C code
- Squares a set of matrices M times (uses enough matrices to exceed cache)
  - M controls computational intensity (CI): the ratio between flops and memory accesses
- Each matrix is of size NxN
  - N controls working-set size: 2N^2 registers required per matrix
- Direct storage: Sqmat's matrix entries are stored contiguously in memory
- Indirect: entries are accessed indirectly through a pointer
  - Parameter S controls the degree of indirection: S matrix entries stored contiguously, then a random jump in memory
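The two access patterns can be sketched as follows. This is a minimal Python model of what the generated C does (all names here are illustrative, not the probe's actual API): the direct path walks contiguous entries, while the indirect path reaches them through an index vector in which runs of s contiguous entries are separated by random jumps.

```python
import random

def square_once(a, n):
    """One squaring step, out = a * a, for an n x n matrix stored row-major."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i * n + k] * a[k * n + j]
            out[i * n + j] = acc
    return out

def sqmat_direct(a, n, m):
    """Direct storage: entries are contiguous; square the matrix m times.
    CI grows with m: each squaring costs ~2*n^3 flops against ~2*n^2 words."""
    for _ in range(m):
        a = square_once(a, n)
    return a

def indirection_vector(length, s, rng):
    """Index vector modeling parameter S: runs of s contiguous entries,
    separated by random jumps in memory."""
    starts = list(range(0, length, s))
    rng.shuffle(starts)
    idx = []
    for b in starts:
        idx.extend(range(b, min(b + s, length)))
    return idx

def sqmat_indirect(flat, idx, n, m):
    """Indirect storage: entries reached through the pointer vector idx
    (gather), squared m times, then written back (scatter)."""
    a = [flat[i] for i in idx]
    a = sqmat_direct(a, n, m)
    for pos, i in enumerate(idx):
        flat[i] = a[pos]
    return flat
```

Both paths do identical arithmetic; only the address stream differs, which is exactly what the probe varies to isolate memory-system behavior.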
6 Unit Stride Algorithmic Peak
- The curve increases until the memory system is fully utilized, then plateaus when the FPU units saturate
- Itanium2 requires a longer time to reach its plateau due to register-spill penalty
- Opteron's SIMD implementation of SSE2 inhibits high algorithmic peak
- Power3 effectively hides cache-access latency
- Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
7 Slowdown due to Indirection
Unit-stride access via indirection (S=∞):
- Opteron and Power3/4 suffer less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium2 shows a high penalty for indirection; the issue is currently under investigation
8 Cost of Irregularity (1)
- Itanium2 and Opteron perform well for irregular accesses due to
  - Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
  - Opteron's low memory latency from its on-chip memory controller
9 Cost of Irregularity (2)
- Power3 and Power4 perform poorly for irregular accesses due to
  - Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
  - Power4's requirement of 4 consecutive cache-line hits to activate prefetching
10 Tolerating Irregularity
- S50
  - Start with some M at S=∞ (indirect unit stride)
  - For a given M, how large must S be to achieve at least 50% of the original performance?
- M50
  - Start with M=1, S=∞
  - At S=1 (every access random), how large must M be to achieve 50% of the original performance?
11 Tolerating Irregularity
- S50: what fraction of memory accesses can be random before performance drops by half?
  - Gather/scatter is expensive on commodity cache-based systems
  - Power4 can tolerate only 1.6% random accesses (1 in 64); Itanium2 is much less sensitive at 25% (1 in 4)
- M50: how much computational intensity is required to hide the penalty of all-random access?
  - A huge amount of computation may be required to hide the overhead of irregular data access
  - Itanium2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
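One way to extract S50 from such measurements can be sketched as below. The `perf` callable and the synthetic saturation curve are assumptions for illustration (not the probe's real interface): the sweep doubles S, mirroring how the Sqmat parameter is typically varied, until the measured rate recovers half the direct-storage baseline.

```python
def s50(perf, baseline, s_max=1 << 20):
    """Smallest power-of-two contiguity length S at which perf(S) recovers
    at least 50% of the unit-stride (direct-storage) baseline.
    perf: callable S -> measured rate (e.g. Mflop/s) at fixed M.
    Returns None if performance never recovers within s_max."""
    s = 1
    while s <= s_max:
        if perf(s) >= 0.5 * baseline:
            return s
        s *= 2
    return None

# Illustrative synthetic curve (not measured data): the rate rises with
# contiguity and saturates near a 1000 Mflop/s baseline.
model = lambda s: 1000.0 * s / (s + 63.0)
```

With this model, `s50(model, 1000.0)` returns 64: runs of 64 contiguous entries between random jumps are needed to reach half of unit-stride performance.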
- Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
12 Emerging Architectures
- General-purpose processors are badly suited for data-intensive ops
  - Large caches are not useful if re-use is low
  - Low memory bandwidth, especially for irregular patterns
  - Superscalar methods of increasing ILP are inefficient
  - Power consumption
- Application-specific ASICs
  - Good, but expensive/slow to design
- Solution: general-purpose memory-aware processors
  - Large number of ALUs to exploit data-parallelism
  - Huge memory bandwidth to keep the ALUs busy
  - Concurrency: overlap memory with computation
13 VIRAM Overview
- MIPS core (200 MHz)
- Main memory system
  - 8 banks w/ 13 MB of on-chip DRAM
  - Large 6.4 GB/s on-chip peak bandwidth
- Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single issue, in order
- Low power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops (64/32/16-bit operations)
  - 1.6 Gflops (single-precision)
- Fabricated by IBM
  - Deep pipelines mask DRAM latency
- Cray's vcc compiler adapted to VIRAM
- Simulator used for results
14 VIRAM Power Efficiency
- Comparable performance with a lower clock rate
- Large power/performance advantage for VIRAM from
  - PIM technology and the data-parallel execution model
15 Imagine Overview
- Vector VLIW processor
- Coprocessor to an off-chip host processor
- 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
- Central 128 KB Stream Register File (SRF) @ 32 GB/s
  - SRF can overlap computation w/ memory (double buffering)
  - SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s off-chip bandwidth
- 544 GB/s inter-cluster communication
- Host sends instructions to the stream controller; the SC issues commands to on-chip modules
16 VIRAM and Imagine

                          VIRAM        Imagine (memory)   Imagine (SRF)
  Bandwidth (GB/s)        6.4          2.7                32
  Peak float (32-bit)     1.6 GF/s     20 GF/s            20 GF/s
  Peak float/word         1            30                 2.5
  Speed (MHz)             200          400
  Chip area               15x18 mm     12x12 mm
  Data widths (bits)      64/32/16     32/16/8
  Transistors             130 x 10^6   21 x 10^6
  Power consumption       2 W          10 W

- Imagine: an order of magnitude higher peak performance
- VIRAM: twice the memory bandwidth, less power consumption
- Notice the peak flop/word ratios
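The peak flop/word row follows directly from the two rows above it: divide the peak rate by the bandwidth expressed in 32-bit words. A quick sketch of that arithmetic (the function name is ours, not from the slides):

```python
def flops_per_word(peak_gflops, bandwidth_gb_s, word_bytes=4):
    """Peak flop/word: flops available per 32-bit word the memory path delivers."""
    gwords_per_s = bandwidth_gb_s / word_bytes  # GB/s -> Gwords/s
    return peak_gflops / gwords_per_s

# VIRAM:          1.6 GF/s over 6.4 GB/s -> 1.0 flop/word
# Imagine memory: 20 GF/s over 2.7 GB/s  -> ~30 flops/word
# Imagine SRF:    20 GF/s over 32 GB/s   -> 2.5 flops/word
```

The ratio is the CI a code must exceed to stay compute-bound: Imagine's off-chip path demands roughly 30 flops per word, which is why only high-CI kernels approach its peak.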
17 What Does This Have to Do with PIMs?
- Performance of Sqmat on PIMs and other architectures for 3x3 matrices, squared 10 times (high computational intensity!)
- Imagine is much faster for long streams, slower for short ones
18 SQMAT Performance Crossover
- Large number of ops per word (3x3 matrices squared 10 times)
- Crossover point: L=64 (cycles), L=256 (Mflop/s)
- Imagine's power becomes apparent at almost 4x VIRAM at L=1024; codes at this end of the spectrum greatly benefit from the Imagine architecture
19 Stencil Probe
- Stencil computations are the core of a wide range of scientific applications
  - Applications include Jacobi solvers, complex multigrid, and block-structured AMR
- We are developing an adaptable stencil probe to model a range of computations
- Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
- Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors
  - Small blocks inhibit automatic prefetching performance
  - Modern large on-chip L2/L3 caches have bandwidth similar to L1
- Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
- Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
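The kind of kernel the stencil probe exercises can be sketched as follows: a 2-D five-point Jacobi sweep over a row-major grid, with an optional blocking factor on the inner loop to model the L1-tiling vs. streaming tradeoff above. This is a Python sketch with illustrative names; the actual probe is compiled code.

```python
def jacobi_sweep(grid, nx, ny, block=None):
    """One 5-point Jacobi relaxation on an nx x ny row-major grid.
    block=None streams whole rows (the access order that engages hardware
    prefetch on real machines); an integer block tiles the inner loop,
    modeling an L1-blocked traversal."""
    new = list(grid)                      # boundary values carry over
    bs = block if block else ny - 2       # default: full-row streaming
    for j0 in range(1, ny - 1, bs):
        for i in range(1, nx - 1):
            for j in range(j0, min(j0 + bs, ny - 1)):
                new[i * ny + j] = 0.25 * (grid[(i - 1) * ny + j] +
                                          grid[(i + 1) * ny + j] +
                                          grid[i * ny + j - 1] +
                                          grid[i * ny + j + 1])
    return new
```

Since Jacobi reads only the old grid, the blocked and streamed traversals return identical values; the probe can therefore vary `block` purely to study memory behavior, which is what makes the blocking-vs-prefetching tradeoff measurable in isolation.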