Title: A Comparison of Empirical and Model-driven Optimization, PLDI '03, K. Yotov et al.

Slide 1: A Comparison of Empirical and Model-driven Optimization, PLDI '03, K. Yotov et al.
- Summarized by Ian Kuon for ECE1724
Slide 2: Overview
- Empirical optimization: parameter values chosen from the results of running the program on actual hardware
- Model-based optimization: parameter values set based on a model of the hardware
- Empirical library generators outperform model-based compilers
- The paper measures the gap between the two optimization styles
- It finds that model-based optimization can be competitive
Slide 3: Experimental Methodology
- (Figure: side-by-side flows for empirical and model-based optimization)
Slide 4: BLAS
- Basic Linear Algebra Subroutines
- Level 3 contains the most complex and time-consuming operations
  - C = alpha*A*B + beta*C
- Example:

    for (int j = 0; j < M; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++)
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
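For reference, a minimal runnable version of this triple loop, using row-major 1-D indexing in place of the slide's 2-D arrays (the slide is loose about which loop bound belongs to which matrix dimension; the usual GEMM convention is assumed here):

```c
#include <stddef.h>

/* Naive triple-loop matrix multiply sketched on the slide.
 * A is M x K, B is K x N, C is M x N, all row-major; computes C += A*B. */
void naive_mmm(int M, int N, int K,
               const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            for (int k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

Every version ATLAS generates computes exactly this; the optimizations below change only the order and grouping of these operations.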
Slide 5: ATLAS
- Automatically Tuned Linear Algebra Software
- R. C. Whaley, A. Petitet and J. J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project", Parallel Computing, 27(1-2):3-35, 2001
- http://math-atlas.sourceforge.net/
- Optimizes BLAS code
- Empirical search using a code generator
- The code generator accepts many optimization parameters
Slide 6: ATLAS Optimizations: Cache Level
- Multiply an MB x KB submatrix by a KB x NB submatrix
- Called a mini-MMM
- Square tiles only: NB = KB = MB
- A non-optimal NB increases L1 cache misses
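A sketch of the cache-level tiling idea (not ATLAS's generated code: NB is hard-wired here where ATLAS would search for it, and the min() handling of ragged edges stands in for ATLAS's cleanup code):

```c
#include <stddef.h>

#define NB 64  /* tile size; ATLAS determines this value empirically */

static int imin(int a, int b) { return a < b ? a : b; }

/* Decompose C += A*B into mini-MMMs, each multiplying an NB x NB tile of A
 * by an NB x NB tile of B (row-major; A: M x K, B: K x N, C: M x N). */
void blocked_mmm(int M, int N, int K,
                 const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < M; ii += NB)
            for (int kk = 0; kk < K; kk += NB)
                /* one mini-MMM on the current tiles */
                for (int j = jj; j < imin(jj + NB, N); j++)
                    for (int i = ii; i < imin(ii + NB, M); i++)
                        for (int k = kk; k < imin(kk + NB, K); k++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

The point of the decomposition is that each mini-MMM touches a working set sized to fit in the L1 cache, regardless of the overall matrix size.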
Slide 7: ATLAS Optimizations: Register Level
- Tile the mini-MMM again so the innermost computation fits in registers
- Called a micro-MMM
- MU and NU define the register tile size
- KU defines the amount of loop unrolling
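A sketch of the register-level idea with MU = NU = 2 and KU = 1 (illustrative only: it assumes M and N are multiples of 2 and relies on the compiler keeping the c** scalars in registers):

```c
/* Micro-MMM sketch: keep an MU x NU (here 2 x 2) block of C in scalar
 * variables while streaming through the K dimension. Row-major;
 * A: M x K, B: K x N, C: M x N; computes C += A*B. */
void micro_tiled_mmm(int M, int N, int K,
                     const double *A, const double *B, double *C)
{
    for (int j = 0; j < N; j += 2)
        for (int i = 0; i < M; i += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < K; k++) {   /* KU = 1: K loop not unrolled */
                double a0 = A[i * K + k], a1 = A[(i + 1) * K + k];
                double b0 = B[k * N + j], b1 = B[k * N + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * N + j]           += c00;
            C[i * N + j + 1]       += c01;
            C[(i + 1) * N + j]     += c10;
            C[(i + 1) * N + j + 1] += c11;
        }
}
```

Each (a, b) pair loaded from memory now feeds MU x NU multiply-adds, improving the ratio of computation to loads.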
Slide 8: ATLAS Optimizations: Scheduling
- Interleave computation and memory accesses
- MULADD: is there a combined multiply-add instruction?
- LATENCY: the FP multiplier latency, used to skew addition and multiplication operations
Slide 9: ATLAS Optimizations: Scheduling (2)
- Memory access interleaving:
  - NFETCH: the number of outstanding loads the hardware supports
  - IFETCH: the number of loads that overlap between iterations
  - FFETCH: sometimes suppress the initial loads of C
Slide 10: ATLAS Optimizations: Versioning
- Several versions of each operation are generated
- The appropriate version is selected at run-time
- Decisions made at run-time:
  - Whether to copy the mini-MMM tiles
  - Loop order
  - Handling of boundary sub-matrices
Slide 11: ATLAS Optimization Procedure
- Basic machine parameters determined using micro-benchmarks:
  - L1 data cache size
  - Number of floating-point registers
  - Availability of a multiply-add instruction
  - FP unit latency
- Optimization parameters then determined experimentally
Slide 12: ATLAS Optimization Procedure (2)
- Find NB
- Find MU and NU
- Find KU
- Find LATENCY
- Find the fetch factors
- Find the non-copy version threshold
- Find the optimal cleanup codes
- Phase ordering is a concern:
  - Would a different search order provide better results?
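Each step above is a one-dimensional sweep with all other parameters held fixed. A sketch of one such step, with time_mini_mmm standing in (hypothetically) for generating, compiling, and timing a mini-MMM at a given NB:

```c
/* One step of an ATLAS-style orthogonal search: sweep a single parameter
 * (here NB) while all others stay fixed, keeping the fastest value. */
typedef double (*timer_fn)(int nb);

int search_best_nb(timer_fn time_mini_mmm, int nb_min, int nb_max, int step)
{
    int best_nb = nb_min;
    double best_time = time_mini_mmm(nb_min);
    for (int nb = nb_min + step; nb <= nb_max; nb += step) {
        double t = time_mini_mmm(nb);
        if (t < best_time) { best_time = t; best_nb = nb; }
    }
    return best_nb;
}

/* Purely illustrative stand-in timer whose minimum is at nb = 40. */
double fake_timer(int nb) { double d = nb - 40.0; return 1000.0 + d * d; }
```

The phase-ordering concern is visible in this structure: each sweep's answer depends on the values the earlier sweeps have already fixed.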
Slide 13: Model-Based Optimization
- Instead of searching, generate an analytical estimate for each optimization parameter
Slide 14: Model Development: NB
- Goal: the largest tile size such that the working set of the multiply still seems small relative to the L1 cache
- Approach: start from a simple cache model and refine it to handle realistic effects
Slide 15: Model Development: NB

    for (j = 0; j < M; j++)
      for (i = 0; i < N; i++)
        for (k = 0; k < K; k++)
          C[i][j] += A[i][k] * B[k][j];

- Simplest model: roughly one NB x NB tile, one row/column of length NB, and one extra element must fit in a cache of capacity C elements:
  - NB^2 + NB + 1 <= C
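The largest NB admitted by this inequality is easy to compute directly; a sketch, where capacity_elems is the L1 size in array elements (bytes divided by sizeof(double)):

```c
/* Largest NB satisfying NB^2 + NB + 1 <= C, the simplest cache model. */
int model_nb(long capacity_elems)
{
    int nb = 0;
    while ((long)(nb + 1) * (nb + 1) + (nb + 1) + 1 <= capacity_elems)
        nb++;
    return nb;
}
```

For example, a 32 KB L1 holds 4096 doubles, and model_nb(4096) gives NB = 63.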
Slide 16: Model Development: NB (2)
- With larger cache lines, the inequality is restated in units of cache lines rather than elements
- Its exact form depends on the loop ordering relative to the memory organization of the arrays
- If the loops are ordered differently (IJK, IKJ, KIJ), the inequality changes accordingly
Slide 17: Model Development: NB (3)
- Consider the effect of LRU replacement
- Access order after one iteration of the inner two loops:
  - A(1,1), A(1,2), ..., A(1,NB), C(1,j)
  - A(2,1), A(2,2), ..., A(2,NB), C(2,j)
  - ...
  - A(NB,1), B(1,j), A(NB,2), B(2,j), ..., A(NB,NB), C(NB,j)
- The inequality is tightened to account for this ordering
Slide 18: Model Development: NB Summary
Slide 19: Model Development: Micro-MMM
- Similar to cache tiling, but for registers
- Assume MU = NU
- KU is set to allow the maximum unrolling permitted by the L1 instruction cache
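The register analogue of the cache inequality can be sketched under the common assumption (not spelled out on the slide) that registers must hold an MU x NU tile of C plus MU elements of A and NU elements of B, i.e. MU*NU + MU + NU <= NR; with MU = NU this becomes MU^2 + 2*MU <= NR:

```c
/* Largest square register tile with MU^2 + 2*MU <= NR, where NR is the
 * number of FP registers. Illustrative only; the paper's refined model
 * also reserves registers for scheduling (LATENCY). */
int model_mu(int num_fp_registers)
{
    int mu = 1;
    while ((mu + 1) * (mu + 1) + 2 * (mu + 1) <= num_fp_registers)
        mu++;
    return mu;
}
```

With 32 FP registers this yields MU = NU = 4, i.e. a 4 x 4 register tile.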
Slide 20: Model Development: Other Parameters
- LATENCY and MULADD are set based on the detected hardware
- FFETCH is set (arbitrarily) to 1
- IFETCH and NFETCH are set to 2
Slide 21: Performance Results: Installation

Slide 22: Performance Results: Parameters
Slide 23: Mini-MMM Performance
- Multiply performance for a single block
- Performance similar to ATLAS is possible, but NOT guaranteed
Slide 24: Performance: SGI R12000
- (Chart: performance vs. matrix size; the non-copy version is marked)

Slide 25: Performance: UltraSPARC III
- (Chart: performance vs. matrix size)

Slide 26: Performance: Pentium III Xeon
- (Chart: performance vs. matrix size)
Slide 27: Performance Questions
- The model code generates more versions of the clean-up code than ATLAS does
  - What is the performance impact of this?
  - Likely small, but it reduces the accuracy of the comparison
- Native compiler performance on the SGI and Sun is close to that of ATLAS and the model when the matrix size is hard-coded
  - Is neither modeling nor measurement necessary in this case?
Slide 28: ATLAS Performance (Whaley et al.)
- Vendor BLAS isn't significantly better
Slide 29: Sensitivity Analysis
- Which parameter differences affected performance the most?
- Optimizations could be simplified for performance-insensitive parameters
- Too many parameters to vary simultaneously, so:
  - All parameters are set to the ATLAS-optimized values
  - A single parameter is then varied
Slide 30: Sensitivity: SGI Tile Size
- Both ATLAS and the model miss out on performance
- The L2 cache size must also be considered
Slide 31: Sensitivity: Sun Tile Size
- The model mis-predicts significantly
- Performance is affected by more than just cache size and line length
Slide 32: Sensitivity: Intel Tile Size
- The model did well
- Performance varies widely (and wildly)
Slide 33: Sensitivity: Register Tile Size (Sun)

Slide 34: Sensitivity: Register Tile Shape (Intel)
- A dependence on tile shape is seen when using gcc
- Was icc used for everything?
Slide 35: Sensitivity: Latency (Sun)
- An incorrect prediction by the model causes a large performance loss
Slide 36: Conclusions
- Model-driven optimization can sometimes be effective
- Performance was 20% worse on the Sun machine
- The authors claim empirical search may not be necessary
- Both the empirical and modeling approaches perform much worse than the vendor BLAS
Slide 37: Future Work
- Apply model-driven optimization to other areas, such as SPIRAL and FFTW
- Speed up empirical searches using modeling results
- Develop performance equations that compilers can use
Slide 38: Further Reading
- X. Li, M. Garzarán and D. Padua. A Dynamically Tuned Sorting Library. In Proc. of the International Symposium on Code Generation and Optimization, pp. 111-123, 2004.
Slide 39: Discussion
- Is the comparison fair?
- What are the weaknesses in the model?
- What improvements could be made to the model?
Slide 41: Extra Slides

Slide 42: Sensitivity: Latency (Intel)
Slide 43: Mini-MMM Pseudo Code

    loop on N, step NU (NU = 2)
      loop on M, step MU (MU = 4)
        prefetch C                  // FFetch = 1
        loop on K, step KU (KU = 1)
          load A0; load B0          // IFetch = 2
          mul00
          load A1; load A2          // NFetch = 2
          mul10
          load A3; load B1          // NFetch = 2
          add00; mul20; add10
          mul30; add20; mul01
          add30; mul11; add01
          mul21; add11; mul31
          add21; add31
        end loop on K
        store S00 S10 S20 S30 S01 S11 S21 S31
      end loop on M
    end loop on N

(mulXY / addXY denote the multiply and the dependent accumulate for element (X,Y) of the 4 x 2 register tile; each add is skewed a few operations after its multiply, reflecting LATENCY.)