Title: Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS)
1 Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS)
- Katherine Yelick, BIPS Director
- Lawrence Berkeley National Laboratory and U.C. Berkeley, EECS Dept.
- National Science Foundation
2 Challenges to Performance
- Two trends in High End Computing
- Increasingly complicated systems
- Multiple forms of parallelism
- Many levels of memory hierarchy
- Complex systems software in between
- Increasingly sophisticated algorithms
- Unstructured meshes and sparse matrices
- Adaptivity in time and space
- Multi-physics models lead to hybrid approaches
- Conclusion: a deep understanding of performance at all levels is important
3 BIPS Institute Goals
- Bring together researchers on all aspects of performance engineering
- Use performance understanding to
- Improve application performance
- Compare architectures for application suitability
- Influence the design of processors, networks, and compilers
- Identify algorithmic needs
4 BIPS Approaches
- Benchmarking and Analysis
- Measure performance
- Identify opportunities for improvements in software, hardware, and algorithms
- Modeling
- Predict performance on future machines
- Understand performance limits
- Tuning
- Improve performance
- By hand or with automatic self-tuning tools
5 Multi-Level Analysis
- Full Applications
- What users want
- Do not reveal the impact of individual architectural features
- Compact Applications
- Can be ported with modest effort
- Easily matched to phases of full applications
- Microbenchmarks
- Isolate architectural features
- Hard to tie to real applications
6 Projects Within BIPS
- Application evaluation on vector processors
- APEX: Application Performance Characterization Benchmarking
- BeBOP: Berkeley Benchmarking and Optimization Group
- Architectural probes for alternative architectures
- LAPACK: Linear Algebra Package
- PERC: Performance Engineering Research Center
- Top500
- ViVA: Virtual Vector Architectures
7 Application Evaluation of Vector Systems
- Two vector architectures
- The Japanese Earth Simulator (ES)
- The Cray X1
- Comparison to commodity-based systems
- IBM SP, Power4
- SGI Altix
- Ongoing study of DOE applications
- CACTUS: Astrophysics, 100,000 lines, grid based
- PARATEC: Material Science, 50,000 lines, Fourier space
- LBMHD: Plasma Physics, 1,500 lines, grid based
- GTC: Magnetic Fusion, 5,000 lines, particle based
- MADCAP: Cosmology, 5,000 lines, dense linear algebra
- Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, H. Shan
8 Architectural Comparison

Node Type | Where | CPU/Node | Clock (MHz) | Peak (Gflop/s) | Mem BW (GB/s) | Peak (byte/flop) | Netwk BW (GB/s/P) | Bisect BW (byte/flop) | MPI Latency (usec) | Network Topology
Power3 | NERSC | 16 | 375 | 1.5 | 1.0 | 0.47 | 0.13 | 0.087 | 16.3 | Fat-tree
Power4 | ORNL | 32 | 1300 | 5.2 | 2.3 | 0.44 | 0.13 | 0.025 | 7.0 | Fat-tree
Altix | ORNL | 2 | 1500 | 6.0 | 6.4 | 1.1 | 0.40 | 0.067 | 2.8 | Fat-tree
ES | ESC | 8 | 500 | 8.0 | 32.0 | 4.0 | 1.5 | 0.19 | 5.6 | Crossbar
X1 | ORNL | 4 | 800 | 12.8 | 34.1 | 2.7 | 6.3 | 0.088 | 7.3 | 2D-torus
- Custom vector architectures have high memory bandwidth relative to peak
- Tightly integrated networks result in lower latency (Altix)
- Bisection bandwidth depends on topology
- The ES also dominates here
- A key balance point for vector systems is the scalar:vector ratio
9 Summary of Results

Percent of peak at P=64, and speedup of the ES (at the maximum available P) over each system:

Code | Pwr3 % | Pwr4 % | Altix % | ES % | X1 % | ES vs. Pwr3 | ES vs. Pwr4 | ES vs. Altix | ES vs. X1
LBMHD | 7 | 5 | 11 | 58 | 37 | 30.6x | 15.3x | 7.2x | 1.5x
CACTUS | 6 | 11 | 7 | 34 | 6 | 45.0x | 5.1x | 6.4x | 4.0x
GTC | 9 | 6 | 5 | 16 | 11 | 9.4x | 4.3x | 4.1x | 0.9x
PARATEC | 57 | 33 | 54 | 58 | 20 | 8.2x | 3.9x | 1.4x | 3.9x
MADCAP | 61 | 40 | --- | 53 | 19 | 3.4x | 2.3x | --- | 0.9x
- Tremendous potential of vector architectures
- 4 codes running faster than ever before
- Vector systems allow resolution not possible with scalar systems (at any processor count)
- Advantage of having larger/faster nodes
- ES shows much higher sustained performance than the X1
- Limited X1-specific optimization so far; more may be possible (CAF, etc.)
- Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
- Vectors potentially at odds with emerging methods (sparse, irregular, adaptive)
- GTC example: code at odds with data-parallelism
10 Comparison to HPCC Four Corners
[Diagram: the HPCC benchmarks placed on temporal- vs. spatial-locality axes, with RandomAccess, STREAM, FFT, and LINPACK at the four corners]
11 APEX-MAP Benchmark
- Goal: quantify the effects of temporal and spatial locality
- Focus on memory system and network performance
- Graphs over temporal and spatial locality axes
- Show performance valleys/cliffs
12 Microbenchmarks
- Using adaptable probes to understand micro-architecture limits
- Tunable to match application kernels
- Ability to collect continuous data sets over parameters reveals performance cliffs
- Two examples
- Sqmat
- APEX-Map
- Also application kernel benchmarks
- SPMV (for HPCS)
- Stencil probe
13 APEX-MAP Probe
- Use an array of size M.
- Access data in vectors of length L.
- Regular
- Walk over consecutive (unit stride) vectors through memory.
- Re-access each vector k times.
- Random
- Pick the start address of each vector randomly.
- Use the properties of the random numbers to achieve a re-use number k.
- Use the power distribution for the non-uniform random address generator (sketch below).
- Exponent α in (0,1]
- α = 1: uniform random access.
- α → 0: access to a single vector only.
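A minimal sketch of the sequential probe described above, assuming the power-law start-address generator has CDF x^alpha (structure and names are mine, not the official APEX-Map source):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Walk an array of size M in vectors of length L.  Start blocks are
   drawn from a power distribution, P(block <= x) ~ x^alpha, so
   alpha = 1 gives uniform random access and alpha -> 0 concentrates
   all accesses on a single vector. */
double apex_map_walk(const double *data, long M, long L,
                     double alpha, long n_accesses)
{
    double sum = 0.0;
    long n_blocks = M / L;
    for (long i = 0; i < n_accesses; i++) {
        long block = (long)(n_blocks * pow(drand48(), 1.0 / alpha));
        const double *v = data + block * L;
        for (long j = 0; j < L; j++)      /* one unit-stride vector */
            sum += v[j];
    }
    return sum;
}

int main(void)
{
    long M = 1L << 22, L = 64;            /* illustrative sizes */
    double *data = calloc(M, sizeof(double));
    if (!data) return 1;
    printf("checksum: %f\n", apex_map_walk(data, M, L, 0.5, 100000));
    free(data);
    return 0;
}
```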
14 Apex-Map Sequential
[Surface plot over the spatial- and temporal-locality axes]
15 Apex-Map Sequential
[Surface plot: performance is sensitive to both spatial and temporal locality]
16 Apex-Map Sequential
[Surface plot: performance is sensitive to both spatial and temporal locality]
17 Apex-Map Sequential
[Surface plot: performance is less sensitive to temporal locality]
18 Apex-Map Sequential
[Surface plot: performance is less sensitive to temporal locality]
19 Parallel Version
- Same design principle as the sequential code.
- Data evenly distributed among processes.
- L contiguous addresses are accessed together.
- Each remote access is a communication message of length L.
- Random access.
- MPI version first; plans to do SHMEM and UPC (sketch of the idea below).
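A hedged sketch of the parallel idea: the power-distributed start index is now global, and a vector that lives on another rank becomes a message of length L. One-sided MPI_Get is used here for brevity; the slides only say the first implementation was an MPI version, and all sizes and the alpha value are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    srand48(rank + 1);                      /* per-rank random stream */

    long local_n = 1L << 20, L = 64;
    long local_blocks = local_n / L;
    double *local = calloc(local_n, sizeof(double));
    double *buf   = malloc(L * sizeof(double));

    MPI_Win win;
    MPI_Win_create(local, (MPI_Aint)(local_n * sizeof(double)),
                   (int)sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double alpha = 0.5;
    long global_blocks = (long)nproc * local_blocks;

    MPI_Win_fence(0, win);
    for (int i = 0; i < 10000; i++) {
        /* power-distributed global start block, as in the sequential probe */
        long block   = (long)(global_blocks * pow(drand48(), 1.0 / alpha));
        int  owner   = (int)(block / local_blocks);
        MPI_Aint off = (MPI_Aint)((block % local_blocks) * L);
        MPI_Get(buf, (int)L, MPI_DOUBLE, owner, off, (int)L, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(local); free(buf);
    MPI_Finalize();
    return 0;
}
```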
20 Parallel APEX-Map
[Figure: parallel APEX-Map results]
21 Parallel APEX-Map
[Figure: parallel APEX-Map results]
22 Application Kernel Benchmarks
- Microbenchmarks are good for
- Identifying architecture/compiler bottlenecks
- Optimization opportunities
- Application benchmarks are good for
- Machine selection for specific apps
- In between: benchmarks that capture important behavior in real applications
- Sparse matrices: SPMV benchmark
- Stencil operations: stencil probe
- Possible future: sorting, narrow datatype ops, ...
23 Sparse Matrix Vector Multiply (SPMV)
- Sparse matrix algorithms
- Increasingly important in applications
- Challenge memory systems: poor locality
- Many matrices have structure, e.g., dense sub-blocks, that can be exploited
- Benchmarking SPMV
- NAS CG and SciMark use a random matrix
- Not reflective of most real problems
- Benchmark challenge
- Shipping real matrices: cumbersome and inflexible
- Building realistic synthetic matrices
24 Importance of Using Blocked Matrices
[Figure: speedup of the best-case blocked matrix vs. unblocked]
25 Generating Blocked Matrices
- Our approach: uniformly distributed random structure, each nonzero an r x c block (sketch below)
- Collect data for r and c from 1 to 12
- Validation: can our random matrices simulate typical matrices?
- 44 matrices from various applications
- 1: dense matrix in sparse format
- 2-17: Finite-Element-Method (FEM) matrices
- 2-9: single block size; 10-17: multiple block sizes
- 18-44: non-FEM
- Summarization: weighted by occurrence in test suite (ongoing)
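A minimal sketch of such a generator, filling the usual BCSR arrays (row_ptr/col_idx/vals are my names, not the benchmark's; the real benchmark matches the nonzero density of the FEM test matrices and handles duplicate columns):

```c
#include <stdlib.h>

/* Scatter r x c nonzero blocks uniformly at random, a fixed number per
   block row.  Duplicate column picks are tolerated for brevity. */
void random_bcsr_pattern(int block_rows, int block_cols,
                         int blocks_per_row, int r, int c,
                         int *row_ptr, int *col_idx, double *vals)
{
    for (int I = 0; I <= block_rows; I++)
        row_ptr[I] = I * blocks_per_row;          /* fixed row lengths */
    for (int k = 0; k < block_rows * blocks_per_row; k++) {
        col_idx[k] = rand() % block_cols;         /* uniform block column */
        for (int e = 0; e < r * c; e++)
            vals[k * r * c + e] = drand48();      /* random block entries */
    }
}
```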
26 Itanium 2 prediction
27 UltraSPARC III prediction
28 Benchmark details
- BCSR: randomly scattered nonzero blocks
- Non-zero density: average from FEM matrices
- Outputs
- Different block dimensions: 1x1, best case, average over common block dimensions for FEM problems
- Different problem sizes
- Small: matrix and vectors in cache
- Medium: matrix out of cache, vectors in cache
- Large: matrix and vectors out of cache
- Still working on this: the distribution of nonzeros could make SpMV on a large matrix act like SpMV on a smaller matrix
- What if the cache size is not known?
- Working on classification algorithms to guess the cache size, based on a range of performance tests
29 Sample summary results (Apple G5, 1.8 GHz)
30 Selected SpMV benchmark results
- Raw results
- Which machine is fastest?
- Scaled by machine's peak floating-point rate
- Mitigates chip technology factors
- Influenced by compiler issues
- Fraction of peak memory bandwidth
- Use the STREAM benchmark for attainable peak (see the helper sketch below)
- How close to this bound is SPMV running?
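A small helper illustrating the scaling in the last bullets, under the common rough assumption of 12 bytes of memory traffic per nonzero for CSR (an 8-byte value plus a 4-byte column index), ignoring row pointers and vector reuse:

```c
/* Fraction of attainable (STREAM) bandwidth achieved by an SpMV run.
   Returns e.g. 0.5 if SpMV streamed data at half the STREAM rate. */
double spmv_bw_fraction(long nnz, double spmv_seconds, double stream_MB_per_s)
{
    double mb_moved = 12.0 * (double)nnz / 1e6;   /* ~12 bytes per nonzero */
    return (mb_moved / spmv_seconds) / stream_MB_per_s;
}
```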
31 [Slide image only; no transcript]
32 [Slide image only; no transcript]
33 [Slide image only; no transcript]
34 Lessons Learned
- Tuning is important
- Motivates tools for automatic tuning
- Scaling by peak floating-point rate
- SSE2 machines are hurt by this measure: it is hard for compilers to identify SIMD parallelism
- Scaling by peak memory bandwidth
- Blocking a matrix improves actual bandwidth
- Also reduces total matrix size (less metadata)
35 Automatic Performance Tuning
- Performance depends on machine, kernel, and matrix
- Matrix known only at run-time
- Best data structure and implementation can be surprising
- Filling in explicit zeros can
- Reduce storage (less index metadata)
- Improve performance
- PIII example: 50% more nonzeros, 50% faster
- BeBOP approach: empirical modeling and search
- Up to 4x speedups and 31% of peak for SpMV
- Many optimization techniques for SpMV
- Several other kernels: triangular solve, A^T A x, A^k x
- Proof-of-concept: integrate with Omega3P
- Release: OSKI library, integrate into PETSc
36 Extra Work Can Improve Efficiency!
- More complicated non-zero structure in general
- Example: 3x3 blocking
- Logical grid of 3x3 cells
- Fill in explicit zeros
- Unroll 3x3 block multiplies (sketch below)
- Fill ratio: 1.5
- On Pentium III: 1.5x speedup!
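A sketch of what the unrolled 3x3 kernel looks like in block compressed sparse row (BCSR) form; the array names follow the usual BCSR convention, not the benchmark's actual code:

```c
/* y = A*x for a BCSR matrix with 3x3 blocks.  row_ptr holds block-row
   starts, col_idx block-column indices, vals the 3x3 blocks row-major. */
void spmv_bcsr_3x3(int n_block_rows,
                   const int *row_ptr, const int *col_idx,
                   const double *vals, const double *x, double *y)
{
    for (int I = 0; I < n_block_rows; I++) {
        double y0 = 0, y1 = 0, y2 = 0;
        for (int k = row_ptr[I]; k < row_ptr[I + 1]; k++) {
            const double *b  = vals + 9 * k;
            const double *xp = x + 3 * col_idx[k];
            /* fully unrolled 3x3 block multiply */
            y0 += b[0]*xp[0] + b[1]*xp[1] + b[2]*xp[2];
            y1 += b[3]*xp[0] + b[4]*xp[1] + b[5]*xp[2];
            y2 += b[6]*xp[0] + b[7]*xp[1] + b[8]*xp[2];
        }
        y[3*I + 0] = y0; y[3*I + 1] = y1; y[3*I + 2] = y2;
    }
}
```

Keeping the three partial sums in registers across the inner loop is what register blocking buys; the explicit zeros filled in above simply ride along inside the 9-entry blocks.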
37 [Figure: SpMV register-blocking profiles — Ultra 2i: 35 to 63 Mflop/s (9% of peak); Ultra 3: 53 to 109 Mflop/s (6%); Pentium III: 42 to 96 Mflop/s (19%); Pentium III-M: 58 to 120 Mflop/s (15%)]
38 [Figure: SpMV register-blocking profiles — Power3: 100 to 195 Mflop/s (13% of peak); Power4: 469 to 703 Mflop/s (14%); Itanium 1: 103 to 225 Mflop/s (7%); Itanium 2: 276 Mflop/s to 1.1 Gflop/s (31%)]
39 Opteron Performance Profile
[Figure: SpMV register-blocking profile — Opteron at 18% of peak]
40 Extra Work Can Improve Efficiency!
- Example: 3x3 blocking
- Logical grid of 3x3 cells
- Fill in explicit zeros
- Unroll 3x3 block multiplies
- Fill ratio: 1.5
- On Pentium III: 1.5x speedup!
- Automatic tuning
- A counter-intuitive optimization
- Selects the block size and generates optimized code/matrix
41 Summary of Optimizations
- Optimizations for SpMV (numbers shown are maximums)
- Register blocking (RB): up to 4x
- Variable block splitting: 2.1x over CSR, 1.8x over RB
- Diagonals: 2x
- Reordering to create dense structure plus splitting: 2x
- Symmetry: 2.8x
- Cache blocking: 6x
- Multiple vectors (SpMM): 7x
- Sparse triangular solve
- Hybrid sparse/dense data structure: 1.8x
- Higher-level kernels
- A A^T x, A^T A x: 4x
- A^2 x: 2x over CSR, 1.5x
- Future: automatic tuning for vectors
42 Architectural Probes
- Understanding memory system performance
- Interaction with processor architecture
- Number of registers
- Arithmetic units (parallelism)
- Prefetching
- Cache size, structure, policies
- APEX-MAP: memory and network system
- Sqmat: processor features included
43 Impact of Indirection
- Results from the Sqmat probe
- Unit stride access via indirection (S = 1)
- Opteron and Power3/4 show less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium 2 shows a high penalty for indirection
44 Tolerating Irregularity
- S50 (penalty for random access)
- S is the length of each unit-stride run
- Start with S = ∞ (indirect unit stride)
- How large must S be to achieve at least 50% of this performance?
- All done for a fixed computational intensity
- CI50 (hide the random-access penalty using high computational intensity)
- CI is the computational intensity, controlled by the number of squarings (M) per matrix
- Start with M = 1, S = ∞
- At S = 1 (every access random), how large must M be to achieve 50% of this performance?
- For both, lower numbers are better
45 Tolerating Irregularity
- S50: What fraction of memory accesses can be random before performance decreases by half?
- Gather/scatter is expensive on commodity cache-based systems: Power4 is only 1.6% (1 in 64); Itanium 2 is much less sensitive at 25% (1 in 4)
- CI50: How much computational intensity is required to hide the penalty of all-random access?
- A huge amount of computation may be required to hide the overhead of irregular data access: Itanium 2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
46 Memory System Observations
- Caches are important
- The important gap has moved: it is now between L3 and memory, not L1 and L2
- Prefetching is increasingly important
- Limited and finicky
- Its effect may overwhelm cache optimizations if blocking increases non-unit-stride access
- In sparse codes, matrix volume is the key factor, not the indirect loads
47 Ongoing Vector Investigation
- How much hardware support is needed for vector-like performance?
- Can small changes to a conventional processor get this effect?
- Role of compilers/software
- Related to the Power5 effort
- Latency hiding in software
- Prefetch engines are easily confused
- Sparse matrix (random) and grid-based (strided) applications are the targets
- Currently investigating simulator tools and any emerging hardware
48 Summary
- High-level goals
- Understand future HPC architecture options that are commercially viable
- Can minimal hardware extensions improve effectiveness for scientific applications?
- Various technologies
- Current, future, academic
- Various performance analysis techniques
- Application-level benchmarks
- Application kernel benchmarks (SPMV, stencil)
- Architectural probes
- Performance modeling and prediction
49 People within BIPS
- Jonathan Carter
- Kaushik Datta
- James Demmel
- Joe Gebis
- Paul Hargrove
- Parry Husbands
- Shoaib Kamil
- Bill Kramer
- Rajesh Nishtala
- Leonid Oliker
- John Shalf
- Hongzhang Shan
- Horst Simon
- David Skinner
- Erich Strohmaier
- Rich Vuduc
- Mike Welcome
- Sam Williams
- Katherine Yelick
And many collaborators outside Berkeley Lab/Campus
50 End of Slides
51 Sqmat overview
- A Java code generator produces unrolled C code
- Stream of matrices
- Square each matrix M times
- M controls the computational intensity (CI): the ratio between flops and memory accesses
- Each matrix is of size NxN
- N controls the working-set size: 2N^2 registers are required per matrix. N is varied to cover the observable register-set size.
- Two storage formats (kernel sketch below)
- Direct storage: Sqmat's matrix entries stored contiguously in memory
- Indirect: entries accessed through an indirection vector. Stanza length S controls the degree of indirection.
[Figure: a stream of NxN matrices; runs of S consecutive matrices are accessed through the indirection vector]
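A minimal sketch of the direct-storage Sqmat kernel for N = 3, reconstructed from the description above (not the actual generated code):

```c
/* Square each N x N matrix M times.  Flops scale with M while memory
   traffic per matrix stays fixed, which is how M dials the CI. */
enum { N = 3 };

static void square_once(double *a)
{
    double tmp[N * N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * a[k * N + j];
            tmp[i * N + j] = s;
        }
    for (int e = 0; e < N * N; e++)
        a[e] = tmp[e];
}

void sqmat_direct(double *mats, long n_mats, int M)
{
    for (long m = 0; m < n_mats; m++)     /* stream of matrices */
        for (int s = 0; s < M; s++)       /* M squarings each   */
            square_once(mats + m * N * N);
}
```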
52 Slowdown due to Indirection
- Unit stride access via indirection (S = 1)
- Opteron and Power3/4 show less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium 2 shows a high penalty for indirection
53 Potential Impact on Applications: T3P
- Source: SLAC [Ko]
- 80% of time spent in SpMV
- Relevant optimization techniques
- Symmetric storage
- Register blocking
- On a single-processor Itanium 2
- 1.68x speedup
- 532 Mflop/s, or 15% of the 3.6 Gflop/s peak
- 4.4x speedup with 8 multiple vectors
- 1380 Mflop/s, or 38% of peak
54 Potential Impact on Applications: Omega3P
- Application: accelerator cavity design [Ko]
- Relevant optimization techniques
- Symmetric storage
- Register blocking
- Reordering
- Reverse Cuthill-McKee ordering to reduce bandwidth
- Traveling-Salesman-Problem-based ordering to create blocks
- Nodes: columns of A
- Weights(u, v): number of nonzeros u and v have in common
- Tour: ordering of columns
- Choose the maximum-weight tour
- See [Pinar & Heath '97]
- 2x speedup on Itanium 2, but SPMV is not dominant
55 Tolerating Irregularity
- S50 (penalty for random access)
- S is the length of each unit-stride run
- Start with S = ∞ (indirect unit stride)
- How large must S be to achieve at least 50% of this performance?
- All done for a fixed computational intensity
- CI50 (hide the random-access penalty using high computational intensity)
- CI is the computational intensity, controlled by the number of squarings (M) per matrix
- Start with M = 1, S = ∞
- At S = 1 (every access random), how large must M be to achieve 50% of this performance?
- For both, lower numbers are better
56 Tolerating Irregularity
- S50: What fraction of memory accesses can be random before performance decreases by half?
- Gather/scatter is expensive on commodity cache-based systems: Power4 is only 1.6% (1 in 64); Itanium 2 is much less sensitive at 25% (1 in 4)
- CI50: How much computational intensity is required to hide the penalty of all-random access?
- A huge amount of computation may be required to hide the overhead of irregular data access: Itanium 2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
57 Emerging Architectures
- General-purpose processors are badly suited for data-intensive ops
- Large caches are not useful if re-use is low
- Low memory bandwidth, especially for irregular patterns
- Superscalar methods of increasing ILP are inefficient
- Power consumption
- Research architectures
- Berkeley IRAM: vector and PIM chip
- Stanford Imagine: stream processor
- ISI Diva: PIM with a conventional processor
58 Sqmat on PIM Systems
- Performance of Sqmat on PIMs and other systems for 3x3 matrices, squared 10 times (high computational intensity!)
- Imagine is much faster for long streams, slower for short ones
59 Comparison to HPCC Four Corners
- Opteron: LINPACK 2000 Mflop/s @ 1.4 GHz vs. Sqmat 2145 Mflop/s @ 1.6 GHz; STREAM 1969 MB/s vs. Sqmat 2047 MB/s; RandomAccess 0.00442 GUPs vs. Sqmat 0.00440 GUPs
- Itanium 2: LINPACK 4.65 Gflop/s vs. Sqmat 4.47 Gflop/s; STREAM 3895 MB/s vs. Sqmat 4055 MB/s; RandomAccess 0.00484 GUPs vs. Sqmat 0.0141 GUPs
- Sqmat settings used to mimic each corner: STREAM — unit stride, M=1, N=1; RandomAccess — S=1, M=1, N=1; LINPACK — unit stride, M=8, N=8; FFT (future)
[Diagram axes: temporal locality vs. spatial locality]