Title: A Comparison of Empirical and Model-driven Optimization, PLDI '03, K. Yotov et al.

Slide 1: A Comparison of Empirical and Model-driven Optimization, PLDI '03, K. Yotov et al.
- Summarized by Ian Kuon for ECE1724
Slide 2: Overview
- Empirical optimization: parameter values chosen from the results of running the program on actual hardware
- Model-based optimization: parameter values set based on a model of the hardware
- Empirical library generators outperform model-based compilers
- The paper measures the gap between the two optimization styles
- It finds that model-based optimization can be competitive
Slide 3: Experimental Methodology
- (Figure: side-by-side flows for empirical and model-based optimization)
Slide 4: BLAS
- Basic Linear Algebra Subroutines
- Level 3 contains the most complex and time-consuming operations
  - C = alpha*A*B + beta*C
- Example:

    for (int j = 0; j < M; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++)
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
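For reference, a minimal runnable version of this triple loop, using row-major 1-D indexing in place of the slide's 2-D arrays (the slide is loose about which loop bound belongs to which matrix dimension; the usual GEMM convention is assumed here):

```c
#include <stddef.h>

/* Naive triple-loop matrix multiply sketched on the slide.
 * A is M x K, B is K x N, C is M x N, all row-major; computes C += A*B. */
void naive_mmm(int M, int N, int K,
               const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            for (int k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

Every version ATLAS generates computes exactly this; the optimizations below change only the order and grouping of these operations.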
Slide 5: ATLAS
- Automatically Tuned Linear Algebra Software
- R. C. Whaley, A. Petitet and J. J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project", Parallel Computing, 27(1-2):3-35, 2001
- http://math-atlas.sourceforge.net/
- Optimizes BLAS code
- Empirical search using a code generator
- The code generator accepts many optimization parameters
Slide 6: ATLAS Optimizations: Cache Level
- Multiply an MB x KB submatrix by a KB x NB submatrix
- Called a mini-MMM
- Square tiles only: NB = KB = MB
- A non-optimal NB increases L1 cache misses
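A sketch of the cache-level tiling idea (not ATLAS's generated code: NB is hard-wired here where ATLAS would search for it, and the min() handling of ragged edges stands in for ATLAS's cleanup code):

```c
#include <stddef.h>

#define NB 64  /* tile size; ATLAS determines this value empirically */

static int imin(int a, int b) { return a < b ? a : b; }

/* Decompose C += A*B into mini-MMMs, each multiplying an NB x NB tile of A
 * by an NB x NB tile of B (row-major; A: M x K, B: K x N, C: M x N). */
void blocked_mmm(int M, int N, int K,
                 const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < M; ii += NB)
            for (int kk = 0; kk < K; kk += NB)
                /* one mini-MMM on the current tiles */
                for (int j = jj; j < imin(jj + NB, N); j++)
                    for (int i = ii; i < imin(ii + NB, M); i++)
                        for (int k = kk; k < imin(kk + NB, K); k++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

The point of the decomposition is that each mini-MMM touches a working set sized to fit in the L1 cache, regardless of the overall matrix size.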
Slide 7: ATLAS Optimizations: Register Level
- Tile the mini-MMM again so the innermost computation fits in registers
- Called a micro-MMM
- MU and NU define the register tile size
- KU defines the amount of loop unrolling
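A sketch of the register-level idea with MU = NU = 2 and KU = 1 (illustrative only: it assumes M and N are multiples of 2 and relies on the compiler keeping the c** scalars in registers):

```c
/* Micro-MMM sketch: keep an MU x NU (here 2 x 2) block of C in scalar
 * variables while streaming through the K dimension. Row-major;
 * A: M x K, B: K x N, C: M x N; computes C += A*B. */
void micro_tiled_mmm(int M, int N, int K,
                     const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; j += 2)
        for (int i = 0; i < M; i += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < K; k++) {   /* KU = 1: K loop not unrolled */
                double a0 = A[i * K + k], a1 = A[(i + 1) * K + k];
                double b0 = B[k * N + j], b1 = B[k * N + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * N + j]           += c00;
            C[i * N + j + 1]       += c01;
            C[(i + 1) * N + j]     += c10;
            C[(i + 1) * N + j + 1] += c11;
        }
}
```

Each (a, b) pair loaded from memory now feeds MU x NU multiply-adds, improving the ratio of computation to loads.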
Slide 8: ATLAS Optimizations: Scheduling
- Interleave computation and memory accesses
- MULADD: is there a combined multiply-add instruction?
- LATENCY: the FP multiplier latency, used to skew addition and multiplication operations
Slide 9: ATLAS Optimizations: Scheduling (2)
- Memory access interleaving:
  - NFETCH: the number of outstanding loads the hardware supports
  - IFETCH: the number of loads that overlap between iterations
  - FFETCH: sometimes suppress the initial loads of C
Slide 10: ATLAS Optimizations: Versioning
- Several versions of each operation are generated
- The appropriate version is selected at run-time
- Decisions made at run-time:
  - Whether to copy the mini-MMM tiles
  - Loop order
  - Handling of boundary sub-matrices
Slide 11: ATLAS Optimization Procedure
- Basic machine parameters determined using micro-benchmarks:
  - L1 data cache size
  - Number of floating-point registers
  - Availability of a multiply-add instruction
  - FP unit latency
- Optimization parameters then determined experimentally
Slide 12: ATLAS Optimization Procedure (2)
- Find NB
- Find MU and NU
- Find KU
- Find LATENCY
- Find the fetch factors
- Find the non-copy version threshold
- Find the optimal cleanup codes
- Phase ordering is a concern:
  - Would a different search order provide better results?
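Each step above is a one-dimensional sweep with all other parameters held fixed. A sketch of one such step, with time_mini_mmm standing in (hypothetically) for generating, compiling, and timing a mini-MMM at a given NB:

```c
/* One step of an ATLAS-style orthogonal search: sweep a single parameter
 * (here NB) while all others stay fixed, keeping the fastest value. */
typedef double (*timer_fn)(int nb);

int search_best_nb(timer_fn time_mini_mmm, int nb_min, int nb_max, int step)
{
    int best_nb = nb_min;
    double best_time = time_mini_mmm(nb_min);
    for (int nb = nb_min + step; nb <= nb_max; nb += step) {
        double t = time_mini_mmm(nb);
        if (t < best_time) { best_time = t; best_nb = nb; }
    }
    return best_nb;
}

/* Purely illustrative stand-in timer whose minimum is at nb = 40. */
double fake_timer(int nb) { double d = nb - 40.0; return 1000.0 + d * d; }
```

The phase-ordering concern is visible in this structure: each sweep's answer depends on the values the earlier sweeps have already fixed.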
Slide 13: Model-Based Optimization
- Instead of searching, generate an analytical estimate for each optimization parameter
Slide 14: Model Development: NB
- Goal: the largest tile size such that the working set of the multiply still seems small relative to the L1 cache
- Approach: start from a simple cache model and refine it to handle realistic effects
Slide 15: Model Development: NB

    for (j = 0; j < M; j++)
      for (i = 0; i < N; i++)
        for (k = 0; k < K; k++)
          C[i][j] += A[i][k] * B[k][j];

- Simplest model: roughly one NB x NB tile, one row/column of length NB, and one extra element must fit in a cache of capacity C elements:
  - NB^2 + NB + 1 <= C
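The largest NB admitted by this inequality is easy to compute directly; a sketch, where capacity_elems is the L1 size in array elements (bytes divided by sizeof(double)):

```c
/* Largest NB satisfying NB^2 + NB + 1 <= C, the simplest cache model. */
int model_nb(long capacity_elems)
{
    int nb = 0;
    while ((long)(nb + 1) * (nb + 1) + (nb + 1) + 1 <= capacity_elems)
        nb++;
    return nb;
}
```

For example, a 32 KB L1 holds 4096 doubles, and model_nb(4096) gives NB = 63.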
Slide 16: Model Development: NB (2)
- With larger cache lines, the inequality is restated in units of cache lines rather than elements
- Its exact form depends on the loop ordering relative to the memory organization of the arrays
- If the loops are ordered differently (IJK, IKJ, KIJ), the inequality changes accordingly
Slide 17: Model Development: NB (3)
- Consider the effect of LRU replacement
- Access order after one iteration of the inner two loops:
  - A(1,1), A(1,2), ..., A(1,NB), C(1,j)
  - A(2,1), A(2,2), ..., A(2,NB), C(2,j)
  - ...
  - A(NB,1), B(1,j), A(NB,2), B(2,j), ..., A(NB,NB), C(NB,j)
- The inequality is tightened to account for this ordering
Slide 18: Model Development: NB Summary
Slide 19: Model Development: Micro-MMM
- Similar to cache tiling, but for registers
- Assume MU = NU
- KU is set to allow the maximum unrolling permitted by the L1 instruction cache
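The register analogue of the cache inequality can be sketched under the common assumption (not spelled out on the slide) that registers must hold an MU x NU tile of C plus MU elements of A and NU elements of B, i.e. MU*NU + MU + NU <= NR; with MU = NU this becomes MU^2 + 2*MU <= NR:

```c
/* Largest square register tile with MU^2 + 2*MU <= NR, where NR is the
 * number of FP registers. Illustrative only; the paper's refined model
 * also reserves registers for scheduling (LATENCY). */
int model_mu(int num_fp_registers)
{
    int mu = 1;
    while ((mu + 1) * (mu + 1) + 2 * (mu + 1) <= num_fp_registers)
        mu++;
    return mu;
}
```

With 32 FP registers this yields MU = NU = 4, i.e. a 4 x 4 register tile.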
Slide 20: Model Development: Other Parameters
- LATENCY and MULADD are set based on the detected hardware
- FFETCH is set (arbitrarily) to 1
- IFETCH and NFETCH are set to 2
Slide 21: Performance Results: Installation

Slide 22: Performance Results: Parameters
Slide 23: Mini-MMM Performance
- Multiply performance for a single block
- Performance similar to ATLAS is possible, but NOT guaranteed
Slide 24: Performance: SGI R12000
- (Chart: performance vs. matrix size; the non-copy version is marked)

Slide 25: Performance: UltraSPARC III
- (Chart: performance vs. matrix size)

Slide 26: Performance: Pentium III Xeon
- (Chart: performance vs. matrix size)
Slide 27: Performance Questions
- The model code generates more versions of the clean-up code than ATLAS does
  - What is the performance impact of this?
  - Likely small, but it reduces the accuracy of the comparison
- Native compiler performance on the SGI and Sun is close to that of ATLAS and the model when the matrix size is hard-coded
  - Is neither modeling nor measurement necessary in this case?
Slide 28: ATLAS Performance (Whaley et al.)
- Vendor BLAS isn't significantly better
Slide 29: Sensitivity Analysis
- Which parameter differences affected performance the most?
- Optimizations could be simplified for performance-insensitive parameters
- Too many parameters to vary simultaneously, so:
  - All parameters are set to the ATLAS-optimized values
  - A single parameter is then varied
Slide 30: Sensitivity: SGI Tile Size
- Both ATLAS and the model miss out on performance
- The L2 cache size must also be considered
Slide 31: Sensitivity: Sun Tile Size
- The model mis-predicts significantly
- Performance is affected by more than just cache size and line length
Slide 32: Sensitivity: Intel Tile Size
- The model did well
- Performance varies widely (and wildly)
Slide 33: Sensitivity: Register Tile Size (Sun)

Slide 34: Sensitivity: Register Tile Shape (Intel)
- A dependence on tile shape is seen when using gcc
- Was icc used for everything?
Slide 35: Sensitivity: Latency (Sun)
- An incorrect prediction by the model causes a large performance loss
Slide 36: Conclusions
- Model-driven optimization can sometimes be effective
- Performance was 20% worse on the Sun machine
- The authors claim empirical search may not be necessary
- Both the empirical and modeling approaches perform much worse than the vendor BLAS
Slide 37: Future Work
- Apply model-driven optimization to other areas, such as SPIRAL and FFTW
- Speed up empirical searches using modeling results
- Develop performance equations that compilers can use
Slide 38: Further Reading
- X. Li, M. Garzarán and D. Padua. A Dynamically Tuned Sorting Library. In Proc. of the International Symposium on Code Generation and Optimization, pp. 111-123, 2004.
Slide 39: Discussion
- Is the comparison fair?
- What are the weaknesses in the model?
- What improvements could be made to the model?
Slide 41: Extra Slides

Slide 42: Sensitivity: Latency (Intel)
Slide 43: Mini-MMM Pseudo Code

    loop on N, step NU (NU = 2)
      loop on M, step MU (MU = 4)
        prefetch C                  // FFetch = 1
        loop on K, step KU (KU = 1)
          load A0; load B0          // IFetch = 2
          mul00
          load A1; load A2          // NFetch = 2
          mul10
          load A3; load B1          // NFetch = 2
          add00; mul20; add10
          mul30; add20; mul01
          add30; mul11; add01
          mul21; add11; mul31
          add21; add31
        end loop on K
        store S00 S10 S20 S30 S01 S11 S21 S31
      end loop on M
    end loop on N

(mulXY / addXY denote the multiply and the dependent accumulate for element (X,Y) of the 4 x 2 register tile; each add is skewed a few operations after its multiply, reflecting LATENCY.)