Transcript and Presenter's Notes

Title: Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS)


1
Performance Understanding, Prediction, and
Tuning at the Berkeley Institute for
Performance Studies (BIPS)
  • Katherine Yelick, BIPS Director
  • Lawrence Berkeley National Laboratory and
  • U. C. Berkeley, EECS Dept.

National Science Foundation
2
Challenges to Performance
  • Two trends in High End Computing
  • Increasingly complicated systems
  • Multiple forms of parallelism
  • Many levels of memory hierarchy
  • Complex systems software in between
  • Increasingly sophisticated algorithms
  • Unstructured meshes and sparse matrices
  • Adaptivity in time and space
  • Multi-physics models lead to hybrid approaches
  • Conclusion: Deep understanding of performance at
    all levels is important

3
BIPS Institute Goals
  • Bring together researchers on all aspects of
    performance engineering
  • Use performance understanding to
  • Improve application performance
  • Compare architectures for application suitability
  • Influence the design of processors, networks and
    compilers
  • Identify algorithmic needs

4
BIPS Approaches
  • Benchmarking and Analysis
  • Measure performance
  • Identify opportunities for improvements in
    software, hardware, and algorithms
  • Modeling
  • Predict performance on future machines
  • Understand performance limits
  • Tuning
  • Improve performance
  • By hand or with automatic self-tuning tools

5
Multi-Level Analysis
  • Full Applications
  • What users want
  • Do not reveal impact of features
  • Compact Applications
  • Can be ported with modest effort
  • Easily match phases of full applications
  • Microbenchmarks
  • Isolate architectural features
  • Hard to tie to real applications

6
Projects Within BIPS
  • Application evaluation on vector processors
  • APEX: Application Performance Characterization
    Benchmarking
  • BeBOP: Berkeley Benchmarking and Optimization
    Group
  • Architectural probes for alternative
    architectures
  • LAPACK: Linear Algebra Package
  • PERC: Performance Engineering Research Center
  • Top500
  • ViVA: Virtual Vector Architectures

7
Application Evaluation of Vector Systems
  • Two vector architectures
  • The Japanese Earth Simulator
  • The Cray X1
  • Comparison to commodity-based systems
  • IBM SP, Power4
  • SGI Altix
  • Ongoing study of DOE applications
  • CACTUS (astrophysics): 100,000 lines, grid-based
  • PARATEC (materials science): 50,000 lines, Fourier space
  • LBMHD (plasma physics): 1,500 lines, grid-based
  • GTC (magnetic fusion): 5,000 lines, particle-based
  • MADCAP (cosmology): 5,000 lines, dense linear algebra
  • Work by L. Oliker, J. Borrill, A. Canning, J.
    Carter, J. Shalf, H. Shan

8
Architectural Comparison
Node Type  Where  CPU/Node  Clock (MHz)  Peak (GFlop/s)  Mem BW (GB/s)  Peak (byte/flop)  Netwk BW (GB/s/P)  Bisect BW (byte/flop)  MPI Latency (usec)  Network Topology
Power3     NERSC  16        375          1.5             1.0            0.47              0.13               0.087                  16.3                Fat-tree
Power4     ORNL   32        1300         5.2             2.3            0.44              0.13               0.025                  7.0                 Fat-tree
Altix      ORNL   2         1500         6.0             6.4            1.1               0.40               0.067                  2.8                 Fat-tree
ES         ESC    8         500          8.0             32.0           4.0               1.5                0.19                   5.6                 Crossbar
X1         ORNL   4         800          12.8            34.1           2.7               6.3                0.088                  7.3                 2D-torus
  • Custom vector architectures have
  • High memory bandwidth relative to peak
  • Tightly integrated networks result in lower
    latency (Altix)
  • Bisection bandwidth depends on topology
  • ES also dominates here
  • A key balance point for vector systems is the
    scalar:vector ratio

9
Summary of Results
Code      % of peak (P=64)                     Speedup of ES (P=max avail) vs.
          Pwr3   Pwr4   Altix   ES     X1      Pwr3   Pwr4   Altix   X1
LBMHD     7%     5%     11%     58%    37%     30.6   15.3   7.2     1.5
CACTUS    6%     11%    7%      34%    6%      45.0   5.1    6.4     4.0
GTC       9%     6%     5%      16%    11%     9.4    4.3    4.1     0.9
PARATEC   57%    33%    54%     58%    20%     8.2    3.9    1.4     3.9
MADCAP    61%    40%    ---     53%    19%     3.4    2.3    ---     0.9
  • Tremendous potential of vector architectures
  • 4 codes running faster than ever before
  • Vector systems allow resolution not possible with
    scalar (any procs)
  • Advantage of having larger/faster nodes
  • ES shows much higher sustained performance than
    X1
  • Limited X1-specific optimization so far; more
    may be possible (CAF, etc.)
  • Non-vectorizable code segments become very
    expensive (8:1 or even 32:1 ratio)
  • Vectors potentially at odds with emerging methods
    (sparse, irregular, adaptive)
  • GTC is an example of code at odds with data-parallelism

10
Comparison to HPCC Four Corners
(Four-corners diagram: Stream, RandomAccess, LINPACK, and FFT placed on temporal-locality vs. spatial-locality axes)
11
APEX-MAP Benchmark
  • Goal: Quantify the effects of temporal and
    spatial locality
  • Focus on memory system and network performance
  • Graphs over temporal and spatial locality axes
  • Show performance valleys/cliffs

12
MicroBenchmarks
  • Using adaptable probes to understand
    micro-architecture limits
  • Tunable to match application kernels
  • Ability to collect continuous data sets over
    parameters reveals performance cliffs
  • Two examples
  • Sqmat
  • APEX-Map
  • Also application kernel benchmarks
  • SPMV (for HPCS)
  • Stencil probe

13
APEX-MAP Probe
  • Use an array of size M.
  • Access data in vectors of length L.
  • Regular
  • Walk over consecutive (unit stride) vectors
    through memory.
  • Re-access each vector k-times.
  • Random
  • Pick the start address of the vector randomly.
  • Use the properties of the random numbers to
    achieve a re-use number k.
  • Use the Power distribution for the non-uniform
    random address generator.
  • Exponent α in [0,1]
  • α=1: uniform random access
  • α=0: access to a single vector only
    (a sketch of this address generator follows)
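
A minimal C sketch of the sequential access pattern described above. The index formula start = floor(n_vectors * u^(1/α)) * L is an assumption chosen to match the stated limits (α=1 gives uniform random starts; α→0 concentrates all accesses on one vector); it is not taken from the APEX-MAP source, and the re-use count k is not modeled.

#include <stdlib.h>
#include <math.h>

/* alpha must be > 0 in this sketch. */
double apex_map_sweep(const double *data, long M, long L, double alpha, long n_accesses)
{
    double sum = 0.0;
    long n_vectors = M / L;                     /* number of length-L vectors in the array */
    for (long a = 0; a < n_accesses; a++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);              /* u in (0,1) */
        long start = (long)((double)n_vectors * pow(u, 1.0 / alpha)) * L;  /* power-law start */
        if (start > M - L) start = M - L;       /* clamp to a valid vector */
        for (long j = 0; j < L; j++)            /* touch one vector of length L */
            sum += data[start + j];
    }
    return sum;                                 /* returned so the loads are not optimized away */
}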

14
Apex-Map Sequential
(Plot: performance over spatial and temporal locality axes)
15
Apex-Map Sequential
(Plot: performance over spatial and temporal locality axes)
Performance sensitive to both spatial and
temporal locality
16
Apex-Map Sequential
(Plot: performance over spatial and temporal locality axes)
Performance sensitive to both spatial and
temporal locality
17
Apex-Map Sequential
(Plot: performance over spatial and temporal locality axes)
Performance less sensitive to temporal locality
18
Apex-Map Sequential
(Plot: performance over spatial and temporal locality axes)
Performance less sensitive to temporal locality
19
Parallel Version
  • Same design principle as the sequential code.
  • Data evenly distributed among processes.
  • L contiguous addresses are accessed together.
  • Each remote access is a communication message
    of length L (see the sketch below).
  • Random access pattern.
  • MPI version first
  • Plans for SHMEM and UPC versions
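
A rough illustration of "each remote access is a communication message of length L", written here with MPI one-sided MPI_Get for brevity. The slides only say that an MPI version exists, so the real benchmark's communication scheme may differ; fetch_vector and local_len are illustrative names, not the benchmark's API.

#include <mpi.h>

/* buf: destination for L doubles; local: this rank's block; win: window over local. */
void fetch_vector(double *buf, long global_start, long L, long local_len,
                  const double *local, int my_rank, MPI_Win win)
{
    int  owner  = (int)(global_start / local_len);   /* block data distribution */
    long offset = global_start % local_len;
    if (owner == my_rank) {
        for (long j = 0; j < L; j++)
            buf[j] = local[offset + j];              /* purely local access */
    } else {
        MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
        MPI_Get(buf, (int)L, MPI_DOUBLE, owner, (MPI_Aint)offset,
                (int)L, MPI_DOUBLE, win);            /* one message of length L */
        MPI_Win_unlock(owner, win);
    }
}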

20
Parallel APEX-Map
21
Parallel APEX-Map
22
Application Kernel Benchmarks
  • Microbenchmarks are good for
  • Identifying architecture/compiler bottlenecks
  • Optimization opportunities
  • Application benchmarks are good for
  • Machine selection for specific apps
  • In between: benchmarks that capture important
    behavior of real applications
  • Sparse matrices: SPMV benchmark
  • Stencil operations: stencil probe
  • Possible future: sorting, narrow-datatype ops, ...

23
Sparse Matrix Vector Multiply (SPMV)
  • Sparse matrix algorithms
  • Increasingly important in applications
  • Challenge for memory systems: poor locality
  • Many matrices have structure, e.g., dense
    sub-blocks, that can be exploited
  • Benchmarking SPMV (baseline kernel sketched below)
  • NAS CG and SciMark use a random matrix
  • Not reflective of most real problems
  • Benchmark challenge
  • Shipping real matrices: cumbersome, inflexible
  • Build realistic synthetic matrices
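
For reference, the baseline operation being benchmarked is a standard compressed sparse row (CSR) matrix-vector multiply; the indirect access into x is the source of the poor locality noted above. Array names are illustrative.

/* y = A*x for A stored in CSR format. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double yi = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += val[k] * x[col_idx[k]];   /* indirect access into x: poor locality */
        y[i] = yi;
    }
}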

24
Importance of Using Blocked Matrices
Speedup of best-case blocked matrix vs unblocked
25
Generating Blocked Matrices
  • Our approach: uniformly distributed random
    structure, each nonzero an r x c block
    (see the generator sketch below)
  • Collect data for r and c from 1 to 12
  • Validation: can our random matrices simulate
    typical matrices?
  • 44 matrices from various applications
  • Matrix 1: dense matrix in sparse format
  • Matrices 2-17: finite-element-method (FEM) matrices
  • 2-9: single block size; 10-17: multiple block sizes
  • Matrices 18-44: non-FEM
  • Summarization: weighted by occurrence in the test
    suite (ongoing)
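
A minimal sketch of the synthetic-matrix idea above: dense r-by-c blocks scattered uniformly at random over an n-by-n matrix until a target density is reached. Duplicate block positions are not handled, and the conversion to BCSR is omitted; all names are illustrative, under the assumption that r and c divide n.

#include <stdlib.h>

typedef struct { int bi, bj; } Block;   /* block-row, block-column indices */

int make_random_blocked(int n, int r, int c, double density,
                        Block *blocks, int max_blocks)
{
    int brows = n / r, bcols = n / c;
    long target_nnz = (long)(density * (double)n * n);
    long nnz = 0;
    int nblocks = 0;
    while (nnz < target_nnz && nblocks < max_blocks) {
        blocks[nblocks].bi = rand() % brows;   /* uniformly distributed block position */
        blocks[nblocks].bj = rand() % bcols;   /* (duplicates possible in this sketch) */
        nblocks++;
        nnz += (long)r * c;                    /* every stored block is fully dense */
    }
    return nblocks;
}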

26
Itanium 2 prediction
27
UltraSparc III prediction
28
Benchmark details
  • BCSR: randomly scattered nonzero blocks
  • Non-zero density: average taken from FEM matrices
  • Outputs
  • Different block dimensions: 1x1, best case, and the
    average over common block dimensions for FEM
    problems
  • Different problem sizes (see the sizing sketch below)
  • small: matrix and vectors in cache
  • medium: matrix out of cache, vectors in cache
  • large: matrix and vectors out of cache
  • Still working on this: the distribution of nonzeros
    could make SpMV on a large matrix act like SpMV
    on a smaller matrix
  • What if the cache size is not known?
  • Working on classification algorithms to guess the
    cache size, based on a range of performance tests
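
The small/medium/large cases above amount to comparing working-set sizes against the cache. A rough sketch, assuming CSR-style storage with 8-byte values and 4-byte indices; the cache size is a parameter, since (as noted) it may have to be guessed.

#include <stddef.h>

typedef enum { SIZE_SMALL, SIZE_MEDIUM, SIZE_LARGE } ProblemSize;

ProblemSize classify(long n_rows, long nnz, size_t cache_bytes)
{
    size_t matrix_bytes = nnz * (sizeof(double) + sizeof(int))  /* values + column indices */
                        + (n_rows + 1) * sizeof(int);           /* row pointers */
    size_t vector_bytes = 2 * n_rows * sizeof(double);          /* x and y */

    if (matrix_bytes + vector_bytes <= cache_bytes) return SIZE_SMALL;   /* all in cache */
    if (vector_bytes <= cache_bytes)                return SIZE_MEDIUM;  /* vectors only */
    return SIZE_LARGE;                                                   /* nothing fits */
}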

29
Sample summary results (Apple G5, 1.8 GHz)
30
Selected SpMV benchmark results
  • Raw results
  • Which machine is fastest?
  • Scaled by the machine's peak floating-point rate
  • Mitigates chip technology factors
  • Influenced by compiler issues
  • Fraction of peak memory bandwidth
  • Use the Stream benchmark for attainable peak
  • How close to this bound does SPMV run?
    (a rough bound is sketched below)
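
As a rough illustration of the bandwidth-bound view (a sketch that assumes 8-byte values, 4-byte column indices, and ignores vector and row-pointer traffic), CSR SpMV moves about 12 bytes and performs 2 flops per nonzero, so the Stream bandwidth B bounds the achievable rate:

\[
  \text{SpMV rate} \;\lesssim\; \frac{2\ \text{flops}}{12\ \text{bytes}} \times B
  \;=\; \frac{B}{6}\ \text{flops/s}
\]

For example, a Stream bandwidth of 2 GB/s would cap SpMV near 333 Mflop/s by this estimate.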

31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
Lessons Learned
  • Tuning is important
  • Motivates tool for automatic tuning
  • Scaling by peak floating-point rate
  • SSE2 machines are hurt by this measure: hard for
    compilers to identify SIMD parallelism
  • Scaling by peak memory bandwidth
  • Blocking a matrix improves actual bandwidth
  • Also reduces total matrix size (less metadata)

35
Automatic Performance Tuning
  • Performance depends on machine, kernel, matrix
  • Matrix known at run-time
  • Best data structure implementation can be
    surprising
  • Filling in explicit zeros can
  • Reduce storage
  • Improve performance
  • PIII example: 50% more nonzeros, 50% faster
  • BeBOP approach: empirical modeling and search
    (sketched below)
  • Up to 4x speedups and 31% of peak for SpMV
  • Many optimization techniques for SpMV
  • Several other kernels: triangular solve, A^TAx,
    A^kx
  • Proof-of-concept: integrate with Omega3P
  • Release: OSKI library, integrate into PETSc
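
A sketch of the empirical modeling and search idea: estimate each r-by-c register blocking as a dense-profile rate divided by an estimated fill ratio and keep the best. dense_mflops and estimate_fill_ratio are placeholders standing in for an offline machine profile and for sampling the matrix; they are not the actual BeBOP/OSKI API.

#define R_MAX 8
#define C_MAX 8

extern double dense_mflops[R_MAX + 1][C_MAX + 1];               /* offline machine profile */
extern double estimate_fill_ratio(const void *A, int r, int c); /* sampled from matrix A */

void choose_block_size(const void *A, int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= R_MAX; r++)
        for (int c = 1; c <= C_MAX; c++) {
            double fill = estimate_fill_ratio(A, r, c);   /* >= 1: explicit zeros added */
            double est  = dense_mflops[r][c] / fill;      /* predicted blocked SpMV rate */
            if (est > best) { best = est; *best_r = r; *best_c = c; }
        }
}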

36
Extra Work Can Improve Efficiency!
  • More complicated non-zero structure in general
  • Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll 3x3 block multiplies (see the kernel sketch below)
  • Fill ratio: 1.5
  • On Pentium III: 1.5x speedup!
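
To make "unroll 3x3 block multiplies" concrete, here is the inner kernel of a 3x3 register-blocked SpMV over BCSR-style arrays (names illustrative): every stored block is fully dense after zero filling, so its nine multiply-adds are unrolled and the three partial sums stay in registers.

void spmv_bcsr_3x3(int n_brows, const int *brow_ptr, const int *bcol_idx,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n_brows; I++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;        /* three output rows in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
            const double *b  = &bval[9 * k];        /* 3x3 block, row-major */
            const double *xp = &x[3 * bcol_idx[k]]; /* matching 3-element slice of x */
            y0 += b[0] * xp[0] + b[1] * xp[1] + b[2] * xp[2];
            y1 += b[3] * xp[0] + b[4] * xp[1] + b[5] * xp[2];
            y2 += b[6] * xp[0] + b[7] * xp[1] + b[8] * xp[2];
        }
        y[3 * I + 0] = y0;  y[3 * I + 1] = y1;  y[3 * I + 2] = y2;
    }
}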

37
(Register profile plots: Ultra 2i, 9% of peak, best 63 vs. reference 35 Mflop/s; Ultra 3, 6% of peak, 109 vs. 53 Mflop/s; Pentium III-M, 15% of peak, 120 vs. 58 Mflop/s; Pentium III, 19% of peak, 96 vs. 42 Mflop/s)
38
(Register profile plots: Power3, 13% of peak, best 195 vs. reference 100 Mflop/s; Power4, 14% of peak, 703 vs. 469 Mflop/s; Itanium 2, 31% of peak, 1.1 Gflop/s vs. 276 Mflop/s; Itanium 1, 7% of peak, 225 vs. 103 Mflop/s)
39
Opteron Performance Profile
(Register profile plot: Opteron, 18% of peak)
40
Extra Work Can Improve Efficiency!
  • Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll 3x3 block multiplies
  • Fill ratio: 1.5
  • On Pentium III: 1.5x speedup!
  • Automatic tuning
  • A counter-intuitive optimization
  • Selects the block size and generates optimized
    code/matrix

41
Summary of Optimizations
  • Optimizations for SpMV (numbers shown are
    maximums)
  • Register blocking (RB): up to 4x
  • Variable block splitting: 2.1x over CSR, 1.8x
    over RB
  • Diagonals: 2x
  • Reordering to create dense structure + splitting:
    2x
  • Symmetry: 2.8x
  • Cache blocking: 6x
  • Multiple vectors (SpMM): 7x
  • Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x
  • Higher-level kernels
  • AA^Tx, A^TAx: 4x
  • A^2x: 2x over CSR, 1.5x
  • Future: automatic tuning for vectors

42
Architectural Probes
  • Understanding memory system performance
  • Interaction with processor architecture
  • Number of registers
  • Arithmetic units (parallelism)
  • Prefetching
  • Cache size, structure, policies
  • APEX-MAP: memory and network system
  • Sqmat: processor features included

43
Impact of Indirection
  • Results from the sqmat probe
  • Unit stride access via indirection (S1)
  • Opteron, Power3/4: less than 10% penalty once M>8,
    demonstrating that bandwidth between cache and
    processor effectively delivers addresses and
    values
  • Itanium 2 shows a high penalty for indirection

44
Tolerating Irregularity
  • S50 (penalty for random access)
  • S is the length of each unit-stride run
  • Start with S=∞ (indirect unit stride)
  • How large must S be to achieve at least 50% of
    this performance?
  • All done for a fixed computational intensity
  • CI50 (hide the random-access penalty using high
    computational intensity)
  • CI is computational intensity, controlled by the
    number of squarings (M) per matrix
  • Start with M=1, S=∞
  • At S=1 (every access random), how large must M be
    to achieve 50% of this performance?
  • For both, lower numbers are better
    (a measurement sketch for S50 follows)
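
One way such an S50 value could be extracted is to sweep the stanza length S downward from a long unit-stride baseline and report the smallest S that still reaches half of that baseline rate. A sketch only: run_sqmat_mflops is a hypothetical timing hook, not part of the real probe.

extern double run_sqmat_mflops(int N, int M, long S);   /* placeholder timed probe run */

long find_S50(int N, int M, long S_max)
{
    double baseline = run_sqmat_mflops(N, M, S_max);    /* S_max ~ indirect unit stride */
    long S50 = S_max;
    for (long S = S_max; S >= 1; S /= 2) {              /* halve the stanza length each step */
        if (run_sqmat_mflops(N, M, S) >= 0.5 * baseline)
            S50 = S;                                     /* still within 50% of baseline */
        else
            break;
    }
    return S50;
}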

45
Tolerating Irregularity
S50: What fraction of memory accesses can be random before performance decreases by half? Gather/scatter is expensive on commodity cache-based systems: Power4 is only 1.6% (1 in 64); Itanium 2 is much less sensitive at 25% (1 in 4).
CI50: How much computational intensity is required to hide the penalty of all-random access? A huge amount of computation may be required to hide the overhead of irregular data access: Itanium 2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
46
Memory System Observations
  • Caches are important
  • Important gap has moved
  • between L3/memory, not L1/L2
  • Prefetching increasingly important
  • Limited and finicky
  • Effect may overwhelm cache optimizations if
    blocking increases non-unit stride access
  • For sparse codes, matrix volume is the key factor
  • Not the indirect loads

47
Ongoing Vector Investigation
  • How much hardware support for vector-like
    performance?
  • Can small changes to a conventional processor get
    this effect?
  • Role of compilers/software
  • Related to Power5 effort
  • Latency hiding in software
  • Prefetch engines easily confused
  • Sparse matrix (random) and grid-based (strided)
    applications are the targets
  • Currently investigating simulator tools and any
    emerging hardware

48
Summary
  • High level goals
  • Understand future HPC architecture options that
    are commercially viable
  • Can minimal hardware extensions improve
    effectiveness for scientific applications?
  • Various technologies
  • Current, future, academic
  • Various performance analysis techniques
  • Application level benchmarks
  • Application kernel benchmarks (SPMV, stencil)
  • Architectural probes
  • Performance modeling and prediction

49
People within BIPS
  • Jonathan Carter
  • Kaushik Datta
  • James Demmel
  • Joe Gebis
  • Paul Hargrove
  • Parry Husbands
  • Shoaib Kamil
  • Bill Kramer
  • Rajesh Nishtala
  • Leonid Oliker
  • John Shalf
  • Hongzhang Shan
  • Horst Simon
  • David Skinner
  • Erich Strohmaier
  • Rich Vuduc
  • Mike Welcome
  • Sam Williams
  • Katherine Yelick

And many collaborators outside Berkeley Lab/Campus
50
End of Slides
51
Sqmat overview
  • A Java code generator produces unrolled C code
  • Stream of matrices
  • Square each matrix M times
  • M controls computational intensity (CI): the
    ratio between flops and memory accesses
  • Each matrix is of size NxN
  • N controls working-set size: 2N² registers
    required per matrix. N is varied to cover the
    observable register-set size.
  • Two storage formats
  • Direct storage: Sqmat's matrix entries stored
    contiguously in memory
  • Indirect: entries accessed through an indirection
    vector. Stanza length S controls the degree of
    indirection (see the kernel sketch below)

(Diagram: a stream of NxN matrices, with S contiguous entries in a row accessed through the indirection vector)
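
A rolled-up sketch of the Sqmat work loop (the real probe emits fully unrolled C from the Java generator): square one NxN matrix M times, reading entries either directly or through an indirection vector. Writing the result back through the indirection vector is an assumption of this sketch, which requires ind to be a permutation of 0..N*N-1.

void sqmat_one(double *a, double *tmp, int N, int M,
               const int *ind /* NULL for direct storage */)
{
    for (int s = 0; s < M; s++) {                 /* M squarings: controls CI */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double acc = 0.0;
                for (int k = 0; k < N; k++) {
                    double aik = ind ? a[ind[i * N + k]] : a[i * N + k];
                    double akj = ind ? a[ind[k * N + j]] : a[k * N + j];
                    acc += aik * akj;             /* 2 flops per loaded pair */
                }
                tmp[i * N + j] = acc;
            }
        for (int t = 0; t < N * N; t++)           /* write the square back over A */
            a[ind ? ind[t] : t] = tmp[t];
    }
}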
52
Slowdown due to Indirection
Unit stride access via indirection (S1)
  • Opteron, Power3/4: less than 10% penalty once M>8,
    demonstrating that bandwidth between cache and
    processor effectively delivers addresses and
    values
  • Itanium 2 shows a high penalty for indirection

53
Potential Impact on Applications T3P
  • Source: SLAC [Ko]
  • 80% of time spent in SpMV
  • Relevant optimization techniques
  • Symmetric storage
  • Register blocking
  • On a single-processor Itanium 2
  • 1.68x speedup
  • 532 Mflop/s, or 15% of the 3.6 GFlop/s peak
  • 4.4x speedup with 8 multiple vectors
  • 1380 Mflop/s, or 38% of peak

54
Potential Impact on Applications Omega3P
  • Application: accelerator cavity design [Ko]
  • Relevant optimization techniques
  • Symmetric storage
  • Register blocking
  • Reordering
  • Reverse Cuthill-McKee ordering to reduce
    bandwidth
  • Traveling-Salesman-Problem-based ordering to
    create blocks
  • Nodes = columns of A
  • Weight(u, v) = no. of nonzeros u and v have in
    common (see the sketch below)
  • Tour = ordering of columns
  • Choose the maximum-weight tour
  • See [Pinar & Heath '97]
  • 2x speedup on Itanium 2, but SpMV is not dominant
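
As an illustration of the TSP edge weight above: for two columns whose nonzero row indices are stored sorted (CSC-style), the weight is the size of the intersection of their index lists. A sketch with illustrative names.

int common_nonzeros(const int *rows_u, int len_u, const int *rows_v, int len_v)
{
    int i = 0, j = 0, common = 0;
    while (i < len_u && j < len_v) {            /* merge two sorted index lists */
        if (rows_u[i] == rows_v[j])      { common++; i++; j++; }
        else if (rows_u[i] < rows_v[j])  { i++; }
        else                             { j++; }
    }
    return common;   /* high weight => placing u next to v creates dense blocks */
}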

55
Tolerating Irregularity
  • S50 (penalty for random access)
  • S is the length of each unit-stride run
  • Start with S=∞ (indirect unit stride)
  • How large must S be to achieve at least 50% of
    this performance?
  • All done for a fixed computational intensity
  • CI50 (hide the random-access penalty using high
    computational intensity)
  • CI is computational intensity, controlled by the
    number of squarings (M) per matrix
  • Start with M=1, S=∞
  • At S=1 (every access random), how large must M be
    to achieve 50% of this performance?
  • For both, lower numbers are better

56
Tolerating Irregularity
S50: What fraction of memory accesses can be random before performance decreases by half? Gather/scatter is expensive on commodity cache-based systems: Power4 is only 1.6% (1 in 64); Itanium 2 is much less sensitive at 25% (1 in 4).
CI50: How much computational intensity is required to hide the penalty of all-random access? A huge amount of computation may be required to hide the overhead of irregular data access: Itanium 2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
57
Emerging Architectures
  • General purpose processors badly suited for data
    intensive ops
  • Large caches not useful if re-use is low
  • Low memory bandwidth, especially for irregular
    patterns
  • Superscalar methods of increasing ILP inefficient
  • Power consumption
  • Research architectures
  • Berkeley IRAM: vector and PIM chip
  • Stanford Imagine: stream processor
  • ISI DIVA: PIM with a conventional processor

58
Sqmat on PIM Systems
  • Performance of Sqmat on PIMs and others for 3x3
    matrices, squared 10 times (high computational
    intensity!)
  • Imagine much faster for long streams, slower for
    short ones

59
Comparison to HPCC Four Corners
Opteron: LINPACK 2000 MFLOPS @1.4 GHz vs. Sqmat 2145 MFLOPS @1.6 GHz; STREAM 1969 MB/s vs. Sqmat 2047 MB/s; RandomAccess 0.00442 GUPs vs. Sqmat 0.00440 GUPs
Itanium 2: LINPACK 4.65 GFLOPs vs. Sqmat 4.47 GFLOPs; STREAM 3895 MB/s vs. Sqmat 4055 MB/s; RandomAccess 0.00484 GUPs vs. Sqmat 0.0141 GUPs
Corner proxies: Stream ~ Sqmat (S0, M1, N1); RandomAccess ~ Sqmat (S1, M1, N1); LINPACK ~ Sqmat (S0, M8, N8); FFT (future)
(Diagram axes: temporal locality vs. spatial locality)