Title: Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS)
1 Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS)
- Katherine Yelick, BIPS Director
- Lawrence Berkeley National Laboratory and U.C. Berkeley, EECS Dept.
- National Science Foundation
2 Challenges to Performance
- Two trends in High End Computing
- Increasingly complicated systems
- Multiple forms of parallelism
- Many levels of memory hierarchy
- Complex systems software in between
- Increasingly sophisticated algorithms
- Unstructured meshes and sparse matrices
- Adaptivity in time and space
- Multi-physics models lead to hybrid approaches
- Conclusion: a deep understanding of performance at all levels is important
3 BIPS Institute Goals
- Bring together researchers on all aspects of performance engineering
- Use performance understanding to
- Improve application performance
- Compare architectures for application suitability
- Influence the design of processors, networks, and compilers
- Identify algorithmic needs
4 BIPS Approaches
- Benchmarking and Analysis
- Measure performance
- Identify opportunities for improvements in software, hardware, and algorithms
- Modeling
- Predict performance on future machines
- Understand performance limits
- Tuning
- Improve performance
- By hand or with automatic self-tuning tools
5 Multi-Level Analysis
- Full Applications
- What users want
- Do not reveal the impact of individual architectural features
- Compact Applications
- Can be ported with modest effort
- Easily matched to phases of full applications
- Microbenchmarks
- Isolate architectural features
- Hard to tie to real applications
6 Projects Within BIPS
- Application evaluation on vector processors
- APEX: Application Performance Characterization Benchmarking
- BeBOP: Berkeley Benchmarking and Optimization Group
- Architectural probes for alternative architectures
- LAPACK: Linear Algebra Package
- PERC: Performance Engineering Research Center
- Top500
- ViVA: Virtual Vector Architectures
7 Application Evaluation of Vector Systems
- Two vector architectures
- The Japanese Earth Simulator (ES)
- The Cray X1
- Comparison to commodity-based systems
- IBM SP, Power4
- SGI Altix
- Ongoing study of DOE applications
- CACTUS: Astrophysics, 100,000 lines, grid based
- PARATEC: Material Science, 50,000 lines, Fourier space
- LBMHD: Plasma Physics, 1,500 lines, grid based
- GTC: Magnetic Fusion, 5,000 lines, particle based
- MADCAP: Cosmology, 5,000 lines, dense linear algebra
- Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, H. Shan
8 Architectural Comparison

Node Type | Where | CPU/Node | Clock (MHz) | Peak (Gflop/s) | Mem BW (GB/s) | Peak (byte/flop) | Netwk BW (GB/s/P) | Bisect BW (byte/flop) | MPI Latency (usec) | Network Topology
Power3 | NERSC | 16 | 375 | 1.5 | 1.0 | 0.47 | 0.13 | 0.087 | 16.3 | Fat-tree
Power4 | ORNL | 32 | 1300 | 5.2 | 2.3 | 0.44 | 0.13 | 0.025 | 7.0 | Fat-tree
Altix | ORNL | 2 | 1500 | 6.0 | 6.4 | 1.1 | 0.40 | 0.067 | 2.8 | Fat-tree
ES | ESC | 8 | 500 | 8.0 | 32.0 | 4.0 | 1.5 | 0.19 | 5.6 | Crossbar
X1 | ORNL | 4 | 800 | 12.8 | 34.1 | 2.7 | 6.3 | 0.088 | 7.3 | 2D-torus
- Custom vector architectures have high memory bandwidth relative to peak
- Tightly integrated networks result in lower latency (Altix)
- Bisection bandwidth depends on topology
- The ES also dominates here
- A key balance point for vector systems is the scalar:vector ratio
9 Summary of Results

Percent of peak at P=64, and speedup of the ES (at the maximum available P) over each system:

Code | Pwr3 % | Pwr4 % | Altix % | ES % | X1 % | ES vs. Pwr3 | ES vs. Pwr4 | ES vs. Altix | ES vs. X1
LBMHD | 7 | 5 | 11 | 58 | 37 | 30.6x | 15.3x | 7.2x | 1.5x
CACTUS | 6 | 11 | 7 | 34 | 6 | 45.0x | 5.1x | 6.4x | 4.0x
GTC | 9 | 6 | 5 | 16 | 11 | 9.4x | 4.3x | 4.1x | 0.9x
PARATEC | 57 | 33 | 54 | 58 | 20 | 8.2x | 3.9x | 1.4x | 3.9x
MADCAP | 61 | 40 | --- | 53 | 19 | 3.4x | 2.3x | --- | 0.9x
- Tremendous potential of vector architectures
- 4 codes running faster than ever before
- Vector systems allow resolution not possible with scalar systems (at any processor count)
- Advantage of having larger/faster nodes
- ES shows much higher sustained performance than the X1
- Limited X1-specific optimization so far; more may be possible (CAF, etc.)
- Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
- Vectors potentially at odds with emerging methods (sparse, irregular, adaptive)
- GTC example: code at odds with data-parallelism
10 Comparison to HPCC Four Corners
[Diagram: the HPCC benchmarks placed on temporal- vs. spatial-locality axes, with RandomAccess, STREAM, FFT, and LINPACK at the four corners]
11 APEX-MAP Benchmark
- Goal: quantify the effects of temporal and spatial locality
- Focus on memory system and network performance
- Graphs over temporal and spatial locality axes
- Show performance valleys/cliffs
12 Microbenchmarks
- Using adaptable probes to understand micro-architecture limits
- Tunable to match application kernels
- Ability to collect continuous data sets over parameters reveals performance cliffs
- Two examples
- Sqmat
- APEX-Map
- Also application kernel benchmarks
- SPMV (for HPCS)
- Stencil probe
13 APEX-MAP Probe
- Use an array of size M.
- Access data in vectors of length L.
- Regular
- Walk over consecutive (unit stride) vectors through memory.
- Re-access each vector k times.
- Random
- Pick the start address of each vector randomly.
- Use the properties of the random numbers to achieve a re-use number k.
- Use the power distribution for the non-uniform random address generator (sketch below).
- Exponent α in (0,1]
- α = 1: uniform random access.
- α → 0: access to a single vector only.
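A minimal sketch of the sequential probe described above, assuming the power-law start-address generator has CDF x^alpha (structure and names are mine, not the official APEX-Map source):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Walk an array of size M in vectors of length L.  Start blocks are
   drawn from a power distribution, P(block <= x) ~ x^alpha, so
   alpha = 1 gives uniform random access and alpha -> 0 concentrates
   all accesses on a single vector. */
double apex_map_walk(const double *data, long M, long L,
                     double alpha, long n_accesses)
{
    double sum = 0.0;
    long n_blocks = M / L;
    for (long i = 0; i < n_accesses; i++) {
        long block = (long)(n_blocks * pow(drand48(), 1.0 / alpha));
        const double *v = data + block * L;
        for (long j = 0; j < L; j++)      /* one unit-stride vector */
            sum += v[j];
    }
    return sum;
}

int main(void)
{
    long M = 1L << 22, L = 64;            /* illustrative sizes */
    double *data = calloc(M, sizeof(double));
    if (!data) return 1;
    printf("checksum: %f\n", apex_map_walk(data, M, L, 0.5, 100000));
    free(data);
    return 0;
}
```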
14 Apex-Map Sequential
[Surface plot over the spatial- and temporal-locality axes]
15 Apex-Map Sequential
[Surface plot: performance is sensitive to both spatial and temporal locality]
16 Apex-Map Sequential
[Surface plot: performance is sensitive to both spatial and temporal locality]
17 Apex-Map Sequential
[Surface plot: performance is less sensitive to temporal locality]
18 Apex-Map Sequential
[Surface plot: performance is less sensitive to temporal locality]
19 Parallel Version
- Same design principle as the sequential code.
- Data evenly distributed among processes.
- L contiguous addresses are accessed together.
- Each remote access is a communication message of length L.
- Random access.
- MPI version first; plans to do SHMEM and UPC (sketch of the idea below).
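A hedged sketch of the parallel idea: the power-distributed start index is now global, and a vector that lives on another rank becomes a message of length L. One-sided MPI_Get is used here for brevity; the slides only say the first implementation was an MPI version, and all sizes and the alpha value are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    srand48(rank + 1);                      /* per-rank random stream */

    long local_n = 1L << 20, L = 64;
    long local_blocks = local_n / L;
    double *local = calloc(local_n, sizeof(double));
    double *buf   = malloc(L * sizeof(double));

    MPI_Win win;
    MPI_Win_create(local, (MPI_Aint)(local_n * sizeof(double)),
                   (int)sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double alpha = 0.5;
    long global_blocks = (long)nproc * local_blocks;

    MPI_Win_fence(0, win);
    for (int i = 0; i < 10000; i++) {
        /* power-distributed global start block, as in the sequential probe */
        long block   = (long)(global_blocks * pow(drand48(), 1.0 / alpha));
        int  owner   = (int)(block / local_blocks);
        MPI_Aint off = (MPI_Aint)((block % local_blocks) * L);
        MPI_Get(buf, (int)L, MPI_DOUBLE, owner, off, (int)L, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    free(local); free(buf);
    MPI_Finalize();
    return 0;
}
```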
20 Parallel APEX-Map
[Figure: parallel APEX-Map results]
21 Parallel APEX-Map
[Figure: parallel APEX-Map results]
22 Application Kernel Benchmarks
- Microbenchmarks are good for
- Identifying architecture/compiler bottlenecks
- Optimization opportunities
- Application benchmarks are good for
- Machine selection for specific apps
- In between: benchmarks that capture important behavior in real applications
- Sparse matrices: SPMV benchmark
- Stencil operations: stencil probe
- Possible future: sorting, narrow datatype ops, ...
23 Sparse Matrix Vector Multiply (SPMV)
- Sparse matrix algorithms
- Increasingly important in applications
- Challenge memory systems: poor locality
- Many matrices have structure, e.g., dense sub-blocks, that can be exploited
- Benchmarking SPMV
- NAS CG and SciMark use a random matrix
- Not reflective of most real problems
- Benchmark challenge
- Shipping real matrices: cumbersome and inflexible
- Building realistic synthetic matrices
24 Importance of Using Blocked Matrices
[Figure: speedup of the best-case blocked matrix vs. unblocked]
25 Generating Blocked Matrices
- Our approach: uniformly distributed random structure, each nonzero an r x c block (sketch below)
- Collect data for r and c from 1 to 12
- Validation: can our random matrices simulate typical matrices?
- 44 matrices from various applications
- 1: dense matrix in sparse format
- 2-17: Finite-Element-Method (FEM) matrices
- 2-9: single block size; 10-17: multiple block sizes
- 18-44: non-FEM
- Summarization: weighted by occurrence in test suite (ongoing)
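A minimal sketch of such a generator, filling the usual BCSR arrays (row_ptr/col_idx/vals are my names, not the benchmark's; the real benchmark matches the nonzero density of the FEM test matrices and handles duplicate columns):

```c
#include <stdlib.h>

/* Scatter r x c nonzero blocks uniformly at random, a fixed number per
   block row.  Duplicate column picks are tolerated for brevity. */
void random_bcsr_pattern(int block_rows, int block_cols,
                         int blocks_per_row, int r, int c,
                         int *row_ptr, int *col_idx, double *vals)
{
    for (int I = 0; I <= block_rows; I++)
        row_ptr[I] = I * blocks_per_row;          /* fixed row lengths */
    for (int k = 0; k < block_rows * blocks_per_row; k++) {
        col_idx[k] = rand() % block_cols;         /* uniform block column */
        for (int e = 0; e < r * c; e++)
            vals[k * r * c + e] = drand48();      /* random block entries */
    }
}
```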
26 Itanium 2 prediction
27 UltraSPARC III prediction
28 Benchmark details
- BCSR: randomly scattered nonzero blocks
- Non-zero density: average from FEM matrices
- Outputs
- Different block dimensions: 1x1, best case, average over common block dimensions for FEM problems
- Different problem sizes
- Small: matrix and vectors in cache
- Medium: matrix out of cache, vectors in cache
- Large: matrix and vectors out of cache
- Still working on this: the distribution of nonzeros could make SpMV on a large matrix act like SpMV on a smaller matrix
- What if the cache size is not known?
- Working on classification algorithms to guess the cache size, based on a range of performance tests
29 Sample summary results (Apple G5, 1.8 GHz)
30 Selected SpMV benchmark results
- Raw results
- Which machine is fastest?
- Scaled by machine's peak floating-point rate
- Mitigates chip technology factors
- Influenced by compiler issues
- Fraction of peak memory bandwidth
- Use the STREAM benchmark for attainable peak (see the helper sketch below)
- How close to this bound is SPMV running?
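A small helper illustrating the scaling in the last bullets, under the common rough assumption of 12 bytes of memory traffic per nonzero for CSR (an 8-byte value plus a 4-byte column index), ignoring row pointers and vector reuse:

```c
/* Fraction of attainable (STREAM) bandwidth achieved by an SpMV run.
   Returns e.g. 0.5 if SpMV streamed data at half the STREAM rate. */
double spmv_bw_fraction(long nnz, double spmv_seconds, double stream_MB_per_s)
{
    double mb_moved = 12.0 * (double)nnz / 1e6;   /* ~12 bytes per nonzero */
    return (mb_moved / spmv_seconds) / stream_MB_per_s;
}
```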
31 [Slide image only; no transcript]
32 [Slide image only; no transcript]
33 [Slide image only; no transcript]
34 Lessons Learned
- Tuning is important
- Motivates tools for automatic tuning
- Scaling by peak floating-point rate
- SSE2 machines are hurt by this measure: it is hard for compilers to identify SIMD parallelism
- Scaling by peak memory bandwidth
- Blocking a matrix improves actual bandwidth
- Also reduces total matrix size (less metadata)
35 Automatic Performance Tuning
- Performance depends on machine, kernel, and matrix
- Matrix known only at run-time
- Best data structure and implementation can be surprising
- Filling in explicit zeros can
- Reduce storage (less index metadata)
- Improve performance
- PIII example: 50% more nonzeros, 50% faster
- BeBOP approach: empirical modeling and search
- Up to 4x speedups and 31% of peak for SpMV
- Many optimization techniques for SpMV
- Several other kernels: triangular solve, A^T A x, A^k x
- Proof-of-concept: integrate with Omega3P
- Release: OSKI library, integrate into PETSc
36 Extra Work Can Improve Efficiency!
- More complicated non-zero structure in general
- Example: 3x3 blocking
- Logical grid of 3x3 cells
- Fill in explicit zeros
- Unroll 3x3 block multiplies (sketch below)
- Fill ratio: 1.5
- On Pentium III: 1.5x speedup!
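A sketch of what the unrolled 3x3 kernel looks like in block compressed sparse row (BCSR) form; the array names follow the usual BCSR convention, not the benchmark's actual code:

```c
/* y = A*x for a BCSR matrix with 3x3 blocks.  row_ptr holds block-row
   starts, col_idx block-column indices, vals the 3x3 blocks row-major. */
void spmv_bcsr_3x3(int n_block_rows,
                   const int *row_ptr, const int *col_idx,
                   const double *vals, const double *x, double *y)
{
    for (int I = 0; I < n_block_rows; I++) {
        double y0 = 0, y1 = 0, y2 = 0;
        for (int k = row_ptr[I]; k < row_ptr[I + 1]; k++) {
            const double *b  = vals + 9 * k;
            const double *xp = x + 3 * col_idx[k];
            /* fully unrolled 3x3 block multiply */
            y0 += b[0]*xp[0] + b[1]*xp[1] + b[2]*xp[2];
            y1 += b[3]*xp[0] + b[4]*xp[1] + b[5]*xp[2];
            y2 += b[6]*xp[0] + b[7]*xp[1] + b[8]*xp[2];
        }
        y[3*I + 0] = y0; y[3*I + 1] = y1; y[3*I + 2] = y2;
    }
}
```

Keeping the three partial sums in registers across the inner loop is what register blocking buys; the explicit zeros filled in above simply ride along inside the 9-entry blocks.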
37 [Figure: SpMV register-blocking profiles — Ultra 2i: 35 to 63 Mflop/s (9% of peak); Ultra 3: 53 to 109 Mflop/s (6%); Pentium III: 42 to 96 Mflop/s (19%); Pentium III-M: 58 to 120 Mflop/s (15%)]
38 [Figure: SpMV register-blocking profiles — Power3: 100 to 195 Mflop/s (13% of peak); Power4: 469 to 703 Mflop/s (14%); Itanium 1: 103 to 225 Mflop/s (7%); Itanium 2: 276 Mflop/s to 1.1 Gflop/s (31%)]
39 Opteron Performance Profile
[Figure: SpMV register-blocking profile — Opteron at 18% of peak]
40 Extra Work Can Improve Efficiency!
- Example: 3x3 blocking
- Logical grid of 3x3 cells
- Fill in explicit zeros
- Unroll 3x3 block multiplies
- Fill ratio: 1.5
- On Pentium III: 1.5x speedup!
- Automatic tuning
- A counter-intuitive optimization
- Selects the block size and generates optimized code/matrix
41 Summary of Optimizations
- Optimizations for SpMV (numbers shown are maximums)
- Register blocking (RB): up to 4x
- Variable block splitting: 2.1x over CSR, 1.8x over RB
- Diagonals: 2x
- Reordering to create dense structure plus splitting: 2x
- Symmetry: 2.8x
- Cache blocking: 6x
- Multiple vectors (SpMM): 7x
- Sparse triangular solve
- Hybrid sparse/dense data structure: 1.8x
- Higher-level kernels
- A A^T x, A^T A x: 4x
- A^2 x: 2x over CSR, 1.5x
- Future: automatic tuning for vectors
42 Architectural Probes
- Understanding memory system performance
- Interaction with processor architecture
- Number of registers
- Arithmetic units (parallelism)
- Prefetching
- Cache size, structure, policies
- APEX-MAP: memory and network system
- Sqmat: processor features included
43 Impact of Indirection
- Results from the Sqmat probe
- Unit stride access via indirection (S = 1)
- Opteron and Power3/4 show less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium 2 shows a high penalty for indirection
44 Tolerating Irregularity
- S50 (penalty for random access)
- S is the length of each unit-stride run
- Start with S = ∞ (indirect unit stride)
- How large must S be to achieve at least 50% of this performance?
- All done for a fixed computational intensity
- CI50 (hide the random-access penalty using high computational intensity)
- CI is the computational intensity, controlled by the number of squarings (M) per matrix
- Start with M = 1, S = ∞
- At S = 1 (every access random), how large must M be to achieve 50% of this performance?
- For both, lower numbers are better
45 Tolerating Irregularity
- S50: What fraction of memory accesses can be random before performance decreases by half?
- Gather/scatter is expensive on commodity cache-based systems: Power4 is only 1.6% (1 in 64); Itanium 2 is much less sensitive at 25% (1 in 4)
- CI50: How much computational intensity is required to hide the penalty of all-random access?
- A huge amount of computation may be required to hide the overhead of irregular data access: Itanium 2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
46 Memory System Observations
- Caches are important
- The important gap has moved: it is now between L3 and memory, not L1 and L2
- Prefetching is increasingly important
- Limited and finicky
- Its effect may overwhelm cache optimizations if blocking increases non-unit-stride access
- In sparse codes, matrix volume is the key factor, not the indirect loads
47 Ongoing Vector Investigation
- How much hardware support is needed for vector-like performance?
- Can small changes to a conventional processor get this effect?
- Role of compilers/software
- Related to the Power5 effort
- Latency hiding in software
- Prefetch engines are easily confused
- Sparse matrix (random) and grid-based (strided) applications are the targets
- Currently investigating simulator tools and any emerging hardware
48 Summary
- High-level goals
- Understand future HPC architecture options that are commercially viable
- Can minimal hardware extensions improve effectiveness for scientific applications?
- Various technologies
- Current, future, academic
- Various performance analysis techniques
- Application-level benchmarks
- Application kernel benchmarks (SPMV, stencil)
- Architectural probes
- Performance modeling and prediction
49 People within BIPS
- Jonathan Carter
- Kaushik Datta
- James Demmel
- Joe Gebis
- Paul Hargrove
- Parry Husbands
- Shoaib Kamil
- Bill Kramer
- Rajesh Nishtala
- Leonid Oliker
- John Shalf
- Hongzhang Shan
- Horst Simon
- David Skinner
- Erich Strohmaier
- Rich Vuduc
- Mike Welcome
- Sam Williams
- Katherine Yelick
And many collaborators outside Berkeley Lab/Campus
50 End of Slides
51 Sqmat overview
- A Java code generator produces unrolled C code
- Stream of matrices
- Square each matrix M times
- M controls the computational intensity (CI): the ratio between flops and memory accesses
- Each matrix is of size NxN
- N controls the working-set size: 2N^2 registers are required per matrix. N is varied to cover the observable register-set size.
- Two storage formats (kernel sketch below)
- Direct storage: Sqmat's matrix entries stored contiguously in memory
- Indirect: entries accessed through an indirection vector. Stanza length S controls the degree of indirection.
[Figure: a stream of NxN matrices; runs of S consecutive matrices are accessed through the indirection vector]
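A minimal sketch of the direct-storage Sqmat kernel for N = 3, reconstructed from the description above (not the actual generated code):

```c
/* Square each N x N matrix M times.  Flops scale with M while memory
   traffic per matrix stays fixed, which is how M dials the CI. */
enum { N = 3 };

static void square_once(double *a)
{
    double tmp[N * N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * a[k * N + j];
            tmp[i * N + j] = s;
        }
    for (int e = 0; e < N * N; e++)
        a[e] = tmp[e];
}

void sqmat_direct(double *mats, long n_mats, int M)
{
    for (long m = 0; m < n_mats; m++)     /* stream of matrices */
        for (int s = 0; s < M; s++)       /* M squarings each   */
            square_once(mats + m * N * N);
}
```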
52 Slowdown due to Indirection
- Unit stride access via indirection (S = 1)
- Opteron and Power3/4 show less than a 10% penalty once M > 8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
- Itanium 2 shows a high penalty for indirection
53 Potential Impact on Applications: T3P
- Source: SLAC [Ko]
- 80% of time spent in SpMV
- Relevant optimization techniques
- Symmetric storage
- Register blocking
- On a single-processor Itanium 2
- 1.68x speedup
- 532 Mflop/s, or 15% of the 3.6 Gflop/s peak
- 4.4x speedup with 8 multiple vectors
- 1380 Mflop/s, or 38% of peak
54 Potential Impact on Applications: Omega3P
- Application: accelerator cavity design [Ko]
- Relevant optimization techniques
- Symmetric storage
- Register blocking
- Reordering
- Reverse Cuthill-McKee ordering to reduce bandwidth
- Traveling-Salesman-Problem-based ordering to create blocks
- Nodes: columns of A
- Weights(u, v): number of nonzeros u and v have in common
- Tour: ordering of columns
- Choose the maximum-weight tour
- See [Pinar & Heath '97]
- 2x speedup on Itanium 2, but SPMV is not dominant
55 Tolerating Irregularity
- S50 (penalty for random access)
- S is the length of each unit-stride run
- Start with S = ∞ (indirect unit stride)
- How large must S be to achieve at least 50% of this performance?
- All done for a fixed computational intensity
- CI50 (hide the random-access penalty using high computational intensity)
- CI is the computational intensity, controlled by the number of squarings (M) per matrix
- Start with M = 1, S = ∞
- At S = 1 (every access random), how large must M be to achieve 50% of this performance?
- For both, lower numbers are better
56 Tolerating Irregularity
- S50: What fraction of memory accesses can be random before performance decreases by half?
- Gather/scatter is expensive on commodity cache-based systems: Power4 is only 1.6% (1 in 64); Itanium 2 is much less sensitive at 25% (1 in 4)
- CI50: How much computational intensity is required to hide the penalty of all-random access?
- A huge amount of computation may be required to hide the overhead of irregular data access: Itanium 2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
57 Emerging Architectures
- General-purpose processors are badly suited for data-intensive ops
- Large caches are not useful if re-use is low
- Low memory bandwidth, especially for irregular patterns
- Superscalar methods of increasing ILP are inefficient
- Power consumption
- Research architectures
- Berkeley IRAM: vector and PIM chip
- Stanford Imagine: stream processor
- ISI Diva: PIM with a conventional processor
58 Sqmat on PIM Systems
- Performance of Sqmat on PIMs and other systems for 3x3 matrices, squared 10 times (high computational intensity!)
- Imagine is much faster for long streams, slower for short ones
59 Comparison to HPCC Four Corners
- Opteron: LINPACK 2000 Mflop/s @ 1.4 GHz vs. Sqmat 2145 Mflop/s @ 1.6 GHz; STREAM 1969 MB/s vs. Sqmat 2047 MB/s; RandomAccess 0.00442 GUPs vs. Sqmat 0.00440 GUPs
- Itanium 2: LINPACK 4.65 Gflop/s vs. Sqmat 4.47 Gflop/s; STREAM 3895 MB/s vs. Sqmat 4055 MB/s; RandomAccess 0.00484 GUPs vs. Sqmat 0.0141 GUPs
- Sqmat settings used to mimic each corner: STREAM — unit stride, M=1, N=1; RandomAccess — S=1, M=1, N=1; LINPACK — unit stride, M=8, N=8; FFT (future)
[Diagram axes: temporal locality vs. spatial locality]