A Comparison of Empirical and Model-driven Optimization, PLDI'03, K. Yotov et al. - PowerPoint PPT Presentation

1
A Comparison of Empirical and Model-driven Optimization, PLDI'03
K. Yotov et al.
  • Summarized by Ian Kuon for ECE1724

2
Overview
  • Empirical optimization
    • Parameters chosen by running the program on the actual hardware
  • Model-based optimization
    • Parameters set from a model of the hardware
  • Empirical library generators are believed to outperform model-based
    compilers
  • This paper measures the gap between the two optimization styles
  • It finds that model-based optimization can be competitive

3
Experimental Methodology
Empirical Optimization
Model-based Optimization
4
BLAS
  • Basic Linear Algebra Subprograms
  • Level 3: the most complex and time-consuming operations
  • C ← αA·B + βC
  • Example:
  • for (int j = 0; j < M; j++)
  •   for (int i = 0; i < N; i++)
  •     for (int k = 0; k < K; k++)
  •       C[i][j] = C[i][j] + A[i][k] * B[k][j];
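As a concrete reference point, the triple loop above can be written as a complete routine. This is a minimal sketch of the Level-3 GEMM operation C ← αA·B + βC for row-major arrays; the function name `gemm_naive` is illustrative, not from the paper or BLAS:

```c
#include <stddef.h>

/* Naive GEMM: C <- alpha*A*B + beta*C.
   A is M x K, B is K x N, C is M x N, all row-major. */
void gemm_naive(size_t M, size_t N, size_t K,
                double alpha, const double *A, const double *B,
                double beta, double *C)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < K; k++)
                sum += A[i * K + k] * B[k * N + j];  /* dot product of row i and column j */
            C[i * N + j] = alpha * sum + beta * C[i * N + j];
        }
}
```

This untiled form is exactly what ATLAS starts from and then transforms for the memory hierarchy.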

5
ATLAS
  • Automatically Tuned Linear Algebra Software
  • R. C. Whaley, A. Petitet and J. J. Dongarra,
    Automated Empirical Optimization of Software and
    the ATLAS Project, Parallel Computing, 27(1-2)
    3-35, 2001
  • http://math-atlas.sourceforge.net/
  • Optimizes BLAS code
  • Empirical search using code generator
  • Code generator accepts many optimization
    parameters

6
ATLAS Optimizations Cache Level
  • Multiply an MB x KB submatrix by a KB x NB submatrix
  • Called a mini-MMM
  • Square tiles only: NB = KB = MB
  • A non-optimal NB increases L1 cache misses
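The cache-level blocking described above can be sketched as follows. This is a simplified illustration with α = β = 1 and n a multiple of the tile size nb; ATLAS's generated code, with copied tiles and boundary-case handling, is considerably more involved:

```c
#include <stddef.h>

/* Tiled square MMM, C += A*B, row-major n x n matrices.
   Each (nb x nb)*(nb x nb) block product is a "mini-MMM".
   n must be a multiple of nb; ATLAS emits separate clean-up
   code for boundary sub-matrices. */
void mmm_tiled(size_t n, size_t nb,
               const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += nb)
        for (size_t ii = 0; ii < n; ii += nb)
            for (size_t kk = 0; kk < n; kk += nb)
                /* mini-MMM on one tile triple */
                for (size_t i = ii; i < ii + nb; i++)
                    for (size_t j = jj; j < jj + nb; j++)
                        for (size_t k = kk; k < kk + nb; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

The point of the transformation is that the three nb x nb tiles touched by the inner three loops fit in the L1 cache, so each element is reused from cache rather than re-fetched from memory.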

7
ATLAS Optimizations Register Level
  • Tile the mini-MMM again to use general-purpose registers
  • Called a micro-MMM
  • MU, NU define the register tile size
  • KU defines the amount of k-loop unrolling
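A hand-written illustration of the register tiling with MU = NU = 2. This is a sketch, not ATLAS's generated code: n is assumed even, and each element of the C tile is held in a scalar local that the compiler can keep in a register across the whole k loop:

```c
#include <stddef.h>

/* Register-tiled MMM, C += A*B, row-major n x n, n even.
   The 2x2 micro-MMM tile of C is loaded once, accumulated in
   scalars (registers), and stored once per (i, j) tile. */
void mmm_register_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            double c00 = C[i * n + j],       c01 = C[i * n + j + 1];
            double c10 = C[(i + 1) * n + j], c11 = C[(i + 1) * n + j + 1];
            for (size_t k = 0; k < n; k++) {
                double a0 = A[i * n + k],  a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j],  b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;   /* MU x NU = 4 multiply-adds */
                c10 += a1 * b0;  c11 += a1 * b1;   /* per 2 + 2 loads           */
            }
            C[i * n + j] = c00;        C[i * n + j + 1] = c01;
            C[(i + 1) * n + j] = c10;  C[(i + 1) * n + j + 1] = c11;
        }
}
```

Each k iteration now does MU·NU multiply-adds for only MU + NU loads, which is why larger register tiles improve the compute-to-load ratio until the register file runs out.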

8
ATLAS Optimizations Scheduling
  • Interleave computation and memory accesses
  • MULADD: is there a combined multiply-add
    instruction?
  • LATENCY: FP multiplier latency, used to skew the
    addition and multiplication operations
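The skewing idea can be illustrated on a dot product. When there is no fused multiply-add (MULADD = 0), a product computed at iteration k is only consumed by an add LATENCY iterations later, so the add never waits on the multiplier. A sketch with LATENCY = 2; the names are illustrative:

```c
#include <stddef.h>

#define LAT 2  /* assumed FP multiplier latency, in iterations */

/* Dot product with multiplies and adds skewed by LAT iterations:
   a small ring buffer holds in-flight products. */
double dot_skewed(size_t n, const double *a, const double *b)
{
    double sum = 0.0, pend[LAT] = {0};
    size_t k;
    for (k = 0; k < n && k < LAT; k++)   /* prologue: issue multiplies only */
        pend[k] = a[k] * b[k];
    for (; k < n; k++) {                 /* steady state: one add + one multiply */
        sum += pend[k % LAT];            /* consume product issued LAT iterations ago */
        pend[k % LAT] = a[k] * b[k];
    }
    for (size_t r = 0; r < LAT; r++)     /* epilogue: drain remaining products */
        sum += pend[r];
    return sum;
}
```

In ATLAS the same skew is applied inside the unrolled micro-MMM body rather than a plain dot product, but the dependence-distance trick is the one shown here.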

9
ATLAS Optimizations Scheduling (2)
  • Memory access interleaving
  • NFETCH: the number of outstanding loads supported
  • IFETCH: number of loads that overlap between
    iterations
  • FFETCH: sometimes suppress the initial loads of C

10
ATLAS Optimizations Versioning
  • Several versions of each operation are generated
  • The appropriate version is selected at run-time
  • Run-time decisions include:
    • Whether to copy the mini-MMM tiles
    • Loop order
    • Handling of boundary sub-matrices

11
ATLAS Optimization Procedure
  • Basic parameters determined using
    micro-benchmarks:
    • L1 data cache size
    • Number of floating-point registers
    • Availability of a combined multiply-add instruction
    • FP unit latency
  • Optimization parameters then determined
    experimentally

12
ATLAS Optimization Procedure (2)
  • Find NB
  • Find MU and NU
  • Find KU
  • Find Latency
  • Find Fetch factors
  • Find non-copy version threshold
  • Find optimal cleanup codes
  • Phase ordering is a concern
  • Would a different search order provide better
    results?

13
Model-Based Optimization
  • Generate estimate for each optimization parameter

14
Model Development NB
  • Goal: the largest tile for which the matrices still
    appear small relative to the L1 cache
  • Approach: start from a simple cache model and
    refine it to handle realistic effects

15
Model Development NB
  • for (j = 0; j < M; j++)
  •   for (i = 0; i < N; i++)
  •     for (k = 0; k < K; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • Working set: the NB x NB tile of A, one column of B,
    and one element of C must fit, giving NB² + NB + 1 ≤ C

16
Model Development NB (2)
  • With larger cache lines inequality becomes
  • Depends on loop ordering relative to memory
    organization of array
  • If ordered differently (IJK, IKJ, KIJ) inequality
    becomes

17
Model Development NB (3)
  • Consider the effect of LRU replacement
  • Access order after one iteration of the inner two loops:
  • A(1,1), A(1,2), …, A(1,NB), C(1,j)
  • A(2,1), A(2,2), …, A(2,NB), C(2,j)
  • …
  • A(NB,1), B(1,j), A(NB,2), B(2,j), …, A(NB,NB), C(NB,j)
  • The inequality becomes ⌈NB²/B⌉ + 3⌈NB/B⌉ + 1 ≤ C/B

18
Model Development NB Summary
19
Model Development MicroMMM
  • Similar to the cache-level tiling, but for registers
  • Assume MU = NU
  • KU set for maximum unrolling subject to the L1
    instruction cache capacity
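One reading of the register-allocation constraint behind this step: MU·NU registers for the C tile, MU for a column of A, NU for a row of B, plus LATENCY registers for in-flight results, must fit in NR registers. A sketch under that assumption, with MU = NU and illustrative numbers:

```c
/* Largest MU = NU with MU*NU + MU + NU + latency <= nr,
   where nr is the number of FP registers (assumed constraint,
   following the paper's register-tile model). */
int estimate_mu(int nr, int latency)
{
    int mu = 1;
    while ((mu + 1) * (mu + 1) + 2 * (mu + 1) + latency <= nr)
        mu++;
    return mu;
}
```

For example, with 32 FP registers and a latency of 4, this yields MU = NU = 4, i.e. a 4x4 register tile of C.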

20
Model Development Other Parameters
  • LATENCY, MULADD set based on detected hardware
  • FFETCH set (arbitrarily) to 1
  • IFETCH, NFETCH set to 2

21
Performance Results Installation
22
Performance Results Parameters
23
Mini-MMM Performance
  • Multiply performance for a single block
  • Performance similar to ATLAS is possible but NOT
    guaranteed

24
Performance SGI R12000
(Chart: performance vs. matrix size; non-copy version marked)
25
Performance UltraSPARCIII
(Chart: performance vs. matrix size)
26
Performance Pentium III-XEON
(Chart: performance vs. matrix size)
27
Performance Questions
  • The model code generates more versions of the
    clean-up code than ATLAS does
  • What is the performance impact of this?
  • Likely small, but it reduces the accuracy of the
    comparison
  • Native-compiler performance on the SGI and Sun is
    close to that of ATLAS and the model when the
    matrix size is hard-coded
  • Is neither modeling nor measurement necessary in
    that case?

28
ATLAS Performance (Whaley et al.)
Vendor BLAS isn't significantly better
29
Sensitivity Analysis
  • Which parameter differences affected performance most?
  • Optimizations could be simplified for
    performance-insensitive parameters
  • Too many parameters to vary simultaneously
  • All parameters set to the ATLAS-optimized values
  • A single parameter is then varied

30
Sensitivity SGI Tile Size
  • Both ATLAS and Model miss out on performance
  • L2 cache size must also be considered

31
Sensitivity Sun Tile Size
  • Model mis-predicts significantly
  • Performance affected by more than just size and
    line length

32
Sensitivity Intel Tile Size
  • Model did well
  • Performance varies widely (and wildly)

33
Sensitivity Register Tile Size (Sun)
34
Sensitivity Register Tile Shape (Intel)
  • Dependence on shape seen when using gcc
  • Was icc used for everything?

35
Sensitivity Latency (Sun)
  • An incorrect prediction by the model causes a big
    performance loss

36
Conclusions
  • Model-driven optimization can sometimes be effective
  • Performance was 20% worse on the Sun
  • The authors claim empirical search may not be necessary
  • Both the empirical and model-based approaches perform
    much worse than the vendor BLAS

37
Future Work
  • Applying model-driven optimization to other domains,
    such as SPIRAL and FFTW
  • Speed empirical searches based on modeling
    results
  • Develop performance equations that compilers can
    use

38
Further Reading
  • X. Li, M. Garzarán, and D. Padua. A Dynamically
    Tuned Sorting Library. In Proc. of the
    International Symposium on Code Generation and
    Optimization, pp. 111-123, 2004.

39
Discussion
  • Is the comparison fair?
  • What are the weaknesses in the model?
  • What improvements could be made to the model?

40
41
Extra Slides
42
Sensitivity Latency (Intel)
43
Mini-MMM Pseudo Code
  • loop j over N, step NU = 2
  •   loop i over M, step MU = 4
  •     prefetch C                     // FFETCH = 1
  •     loop k over K, step KU = 1
  •       load A0, load B0             // IFETCH = 2
  •       multiply/add for c00
  •       load A1, load A2             // NFETCH = 2
  •       multiply/add for c10
  •       load A3, load B1             // NFETCH = 2
  •       multiplies and adds for c20, c30, c01, c11,
  •         c21, c31, skewed by LATENCY
  •     end loop k
  •     store c00, c10, c20, c30, c01, c11, c21, c31
  •   end loop i
  • end loop j