1
Compilers and Optimization on AIX systems
  • Le Yan
  • Jan 26, 2007

2
Outline
  • Overview
  • Basic compiler options
  • Optimization
  • General programming tips
  • Compiler options
  • Optimized libraries

4
Overview
5
Overview
  • Most flags and options are the same for all three
    groups of compilers.
  • Prefix mp indicates compatibility with MPI
      • e.g. mpxlf is the Fortran compiler compatible
        with MPI
  • Suffix _r indicates a thread-safe compiler
  • Usage
      • compiler <options> input_files

6
Documentation and references
  • IBM AIX compiler center
      • http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp
  • LSU HPC documentation
      • http://appl003.lsu.edu/ocsweb/hpchome.nsf/Content/document?OpenDocument

8
Basic options
9
Basic options (contd)
11
Optimization general tips
  • Do not excessively hand-tune your code.
      • Unusual constructs may confuse the compiler and
        make it difficult to optimize for new machines
  • Use the MASS and ESSL libraries rather than
    writing your own code (details later)
      • Optimized for Power5 machines
  • Try not to break your code into too many small
    functions and subroutines, to avoid excessive call
    overhead.

12
Optimization general tips (cont'd)
  • Avoid unnecessary use of global variables
  • Use local variables for loop index and bounds
    when possible
      • Example: when using a global variable in a loop,
        load it into a local variable before the loop and
        store it back afterwards.
  • Limit the use of ALLOCATABLE arrays to
    situations that demand dynamic allocation.
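
The global-variable tip above can be sketched in C as follows. The names g_total, accumulate_global and accumulate_local are made up for illustration, and whether the local copy is actually faster depends on the compiler and optimization level.

```c
#include <stddef.h>

/* Hypothetical global that is updated inside a hot loop. */
double g_total = 0.0;

/* Updates the global directly on every iteration. */
void accumulate_global(const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        g_total += x[i];
}

/* Loads the global into a local before the loop, works on the local,
   and stores the result back once the loop has finished. */
void accumulate_local(const double *x, size_t n)
{
    double total = g_total;          /* load once */
    for (size_t i = 0; i < n; i++)
        total += x[i];
    g_total = total;                 /* store back once */
}
```

Both functions leave the same value in g_total; the second one simply makes it easier for the compiler to keep the running sum in a register.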

14
High Order Transformations (-qhot)
  • What does it do?
      • Scalar replacement
      • Loop transformations (blocking, interchange,
        fusion, reversal and unrolling of loops)
      • Reduces the generation of temporary arrays
  • Controlled by the characteristics of loops and
    the cost of loop transformations
  • When -qhot is specified, the compiler assumes an
    optimization level of -O2 (details later)
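
As a rough illustration of two of the transformations listed above (loop fusion and scalar replacement), here is a hand-written C sketch. The function names are hypothetical, the arrays are assumed not to overlap, and the code shows the kind of rewrite involved, not what the compiler actually emits.

```c
#include <stddef.h>

/* Two separate passes: a[] is written in the first loop and read back
   from memory in the second. */
void two_loops(double *a, double *b, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1.0;
}

/* Fused version: a single pass over the data. The value of a[i] is kept
   in the scalar t instead of being reloaded (scalar replacement). */
void fused_loop(double *a, double *b, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const double t = 2.0 * x[i];
        a[i] = t;
        b[i] = t + 1.0;
    }
}
```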

15
Example: outer loop unroll
  Original:
    Do I=1,N
      Do J=1,N
        Sum=Sum+X(J)*A(J,I)
      Enddo
    Enddo
  Unrolled:
    Do I=1,N,4
      Do J=1,N
        Sum=Sum+X(J)*A(J,I)+X(J)*A(J,I+1)+X(J)*A(J,I+2)+X(J)*A(J,I+3)
      Enddo
    Enddo
  • Minimize loads/stores by finding variables that
    can be loaded once and used multiple times
  • Original: 2 flops/2 loads
  • Unrolled: 8 flops/5 loads
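
The same outer loop unroll can be written by hand in C. This is a sketch under a few assumptions: the matrix is stored as a flat row-major array (so the contiguous index is the second one, the opposite of Fortran), the function names are invented, and a clean-up loop is added for sizes that are not a multiple of 4, which the slide's version does not handle.

```c
#include <stddef.h>

/* Original form: every multiply-add loads x[j] and one element of a. */
double dot_rows(const double *a, const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += x[j] * a[i * n + j];
    return sum;
}

/* Outer loop unrolled by 4: x[j] is loaded once and reused for four
   rows, giving 8 flops for 5 loads in the inner loop body. */
double dot_rows_unroll4(const double *a, const double *x, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t j = 0; j < n; j++) {
            const double xj = x[j];
            sum += xj * a[i * n + j]
                 + xj * a[(i + 1) * n + j]
                 + xj * a[(i + 2) * n + j]
                 + xj * a[(i + 3) * n + j];
        }
    }
    /* Clean-up loop for the remaining rows when n is not a multiple of 4. */
    for (; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += x[j] * a[i * n + j];
    return sum;
}
```

The two versions add the terms in a different order, so for general floating-point data the results can differ in the last bits.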

16
Outer loop unroll test (chart, performance in MFLOP/s)

17
Example: interchange loops
  Original:
    Do I=1,N
      Do J=1,N
        Sum=Sum+A(I,J)
      Enddo
    Enddo
  Interchanged:
    Do J=1,N
      Do I=1,N
        Sum=Sum+A(I,J)
      Enddo
    Enddo
  • Minimize strides
  • Remember Fortran and C are different
      • Fortran: column-major arrays
      • C: row-major arrays
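
Because C is row-major, the stride-1 order in C is the reverse of the Fortran one. A small sketch, with invented function names and a flat n x n array:

```c
#include <stddef.h>

/* Inner loop walks down a column: consecutive accesses are n elements
   apart in memory (stride n), which is the slow order in C. */
double sum_column_order(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}

/* Loops interchanged: the inner loop walks along a row (stride 1). */
double sum_row_order(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}
```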

18
Interchange loop test (chart, performance in MFLOP/s)

19
Optimizing for a target machine
  • Instruct the compiler to generate code for
    optimal execution on a given processor or
    architecture.
  • Target machine options
      • -q32: generates code for a 32-bit environment
      • -q64: generates code for a 64-bit environment
      • -qarch: selects a specific architecture
      • -qtune: biases optimization toward execution on a
        given machine
      • -qcache: defines a specific cache or memory geometry

20
32/64-bit environment
  • Performance considerations
  • 64-bit mode
      • Can handle larger amounts of data
        directly in physical memory rather than relying
        on disk I/O
  • 32-bit mode
      • Smaller programs, less demanding on physical
        memory
      • Division is faster

21
32/64-bit environment
  • Specify -q32 (default) or -q64 when compiling
      • Alternatively, set the OBJECT_MODE environment
        variable to 32 or 64
  • Some tips on working with 64-bit programs
      • Avoid performing mixed 32-bit and 64-bit
        operations
      • Avoid long division whenever possible
      • For C and C++ programs, use long types instead of
        signed, unsigned and plain int types for
        variables that are accessed frequently.
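
A minimal C sketch of the last tip. On AIX, int is 32 bits in both modes while long is 32 bits with -q32 and 64 bits with -q64, so a long index already has the native register width in 64-bit mode. The function names are invented for illustration.

```c
#include <stddef.h>

/* Loop index and length are long, so in 64-bit mode no widening of a
   32-bit index is needed when it is used to address the array. */
double sum_long_index(const double *x, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Reports the size of long in bytes for the current compilation mode. */
size_t long_size(void)
{
    return sizeof(long);
}
```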

22
Target a specific architecture (-qarch)
  • Syntax: -qarch=architecture
  • On Pelican system
      • -qarch=pwr4: Power4 machines
      • -qarch=pwr5: Power5 machines
      • -qarch=auto: use the architecture of the
        compiling machine.
      • Remember: the head node on Pelican is a Power4
        machine!
  • On LONI AIX systems
      • -qarch=auto or -qarch=pwr5: it does not matter
        because all nodes are Power5 machines

23
-qtune
  • Biases optimization toward a specific machine
  • Tunes instruction selection, scheduling and other
    implementation-dependent performance enhancements
  • Affects performance but not correctness
  • Primarily of benefit for floating-point intensive
    programs
  • If not explicitly specified, it is determined by the
    -qarch, -q32 and -q64 options
  • -qtune=auto assumes that the execution
    environment will be the same as the compilation
    environment

24
-qcache
  • Specifies the cache configuration for a specific
    machine
  • Especially useful for loop operations (process
    only the amount of data that can fit into the
    data cache)
  • Must be used in conjunction with -qhot
  • Options
      • line=bytes: line size of the cache
      • size=Kbytes: total size of the cache (in KB)
      • level=level: specifies the level of cache
        affected
      • cost=cycles: specifies the performance penalty
        resulting from a cache miss
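
"Process only the amount of data that can fit into the data cache" is the idea behind loop blocking. The C sketch below blocks a matrix transpose by hand. The tile size of 32 and the function name are arbitrary assumptions; a real value would be derived from the cache geometry.

```c
#include <stddef.h>

/* Hypothetical tile edge, chosen only for illustration. */
#define TILE 32

/* Transposes the n x n row-major matrix src into dst one TILE x TILE
   block at a time, so the working set of each block stays small. */
void transpose_blocked(double *dst, const double *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block edges at the matrix boundary. */
            const size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            const size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```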

25
Profile directed optimization
  • Profile-directed feedback (PDF)
  • Two-stage optimization: compile with -qpdf1, run
    the program on representative input, then
    recompile with -qpdf2
  • Should be mainly used on code that has rarely
    executed conditional error handling or
    instrumentation.

26
Interprocedural analysis (-qipa)
  • Optimizes across different files (whole program
    analysis)
  • Has different levels
  • level=0
      • Program partitioning and simple interprocedural
        optimization
  • level=1
      • Default level of -qipa
      • Inlining and global data mapping
  • level=2
      • Global alias analysis
      • Interprocedural data flow

27
Inlining
  • Can be turned on by specifying
    -qipa=inline=inline-options (or
    -qinline=inline-options)
  • Useful when your program has many subprogram
    calls
      • Reduces the call overhead
  • Identify the subprograms that are called the most
    and inline only those subprograms
  • Examples
      • -qipa=inline=auto: inline procedures automatically
      • -qipa=inline=sub1:inline=noauto: only inline the
        procedure sub1
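
What inlining buys can be seen by writing both forms by hand. In this C sketch (all names invented), the second function is roughly what the first one looks like after the compiler has inlined the helper.

```c
#include <stddef.h>

/* Small helper called once per element; the call overhead can be
   comparable to the work it does. */
double square(double v)
{
    return v * v;
}

/* One call to square() per loop iteration. */
double sum_squares_call(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += square(x[i]);
    return sum;
}

/* The same loop with the body of square() substituted at the call site. */
double sum_squares_inlined(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;
}
```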

28
Choose an optimization level
  • -On option
      • -O0: very limited optimization, fast compilation,
        debuggable code
      • -O2: comprehensive low-level optimization,
        partial debug support
      • -O3: more extensive optimization, some precision
        trade-off
      • -O4: everything from -O3 plus -qhot -qipa
        -qarch=auto -qtune=auto -qcache=auto
      • -O5: everything from -O4 plus -qipa=level=2
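
One way a "precision trade-off" can arise is when floating-point operations are regrouped, because floating-point addition is not associative. The C sketch below shows the effect by hand. Whether a compiler applies this particular regrouping to a given piece of code is an assumption, and the function names are invented.

```c
/* Adds the three values in the order written in the source. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* Same values, regrouped the way an optimizer might regroup them when
   it is allowed to trade precision for speed. */
double sum_regrouped(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first function returns 1.0 and the second returns 0.0 in IEEE double precision, because 1.0 is lost when it is added to -1e20.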

29
Choose an optimization level
  • Test and debug your code before applying any level of
    optimization
  • If you encounter problems with -O2, check the code
    for any non-standard use of aliasing rules.
      • Consider using -qalias=nostd (Fortran) or
        -qalias=noansi (C) so that the compiler does not
        assume standard-conforming aliasing in your
        compilation unit.
  • If you encounter problems with -O3, consider using
    -qstrict along with -O3.
      • -qstrict ensures the optimizer will not alter the
        semantics of a program
  • Try to optimize your program with at least -O3
    -qhot
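
A typical "non-standard use of aliasing rules" in C is reading an object through a pointer of an unrelated type. The sketch below shows the pattern to look for (commented out) and a standard-conforming alternative. The function names are invented, and the example assumes a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Pattern to look for: reads a float through a uint32_t pointer, which
   violates the ANSI aliasing rules the optimizer relies on.
   Left commented out on purpose.

   uint32_t bits_by_cast(float f) { return *(uint32_t *)&f; }
*/

/* Standard-conforming alternative: copy the bytes instead of casting
   the pointer. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```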

31
Optimized libraries
  • Mathematical Acceleration SubSystem (MASS)
  • Engineering and Scientific Subroutine Library
    (ESSL)
  • Both support Fortran, C, and C++.

32
MASS
  • Optimized intrinsic functions
      • Examples: sqrt, sin, cos, exp, log, x**y
  • Better performance at the expense of reduced
    precision (1 to 2 bits less)
  • Has both scalar and vector versions
  • Thread safe
  • Usage
      • Compile normally, then link with the option -lmass

33
MASS performance (chart, Moperations/s)

34
Intrinsic vector functions
  • Intrinsic vector functions
      • Compiler generates vector intrinsic functions
        when -qhot is specified
      • Examples: vlog, vexp, vdiv, vsqrt
      • The performance is very good
  • Example: with -qhot, the loop
      Do i=1,n
        A(i)=log(B(i))
      Enddo
    is replaced by
      Call __vlog(A,B,n)

35
Vector MASS library
  • Usage
      • Need to call explicitly in the code
      • Example: call vexp(B,A,n) rather than
          do i=1,n
            B(i)=exp(A(i))
          enddo
      • Link with -lmassv
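
The shape of such an explicit vector call can be sketched in portable C. The routine below is a stand-in written for this example, not the MASS implementation. Its name and its argument order (output array first, length passed by reference as a Fortran caller would pass it) are assumptions made for illustration.

```c
/* Hypothetical stand-in for a vector library routine: computes
   z[i] = x[i] / y[i] for i = 0 .. *n - 1 in a single call. */
void vdiv_sketch(double *z, const double *x, const double *y, const int *n)
{
    for (int i = 0; i < *n; i++)
        z[i] = x[i] / y[i];
}

/* Caller: one vector call replaces an element-by-element loop. */
void ratios(double *z, const double *x, const double *y, int n)
{
    vdiv_sketch(z, x, y, &n);
}
```

Keeping the element-wise work behind one call like this is what lets a tuned library routine be swapped in at link time.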

36
Vector MASS performance (chart, Moperations/s)

37
ESSL
  • Has over 400 subroutines
  • Tuned for PowerPC systems
  • Also available for parallel computing environments
  • Usage
      • Link with -lessl

38
ESSL library
  • Linear algebra subprograms
  • Linear equations
  • Eigensystem analysis
  • Fourier transforms
  • Convolution and correlation
  • Sorting and searching
  • Interpolation
  • Numerical quadrature
  • Random number generation

39
BLAS (Basic Linear Algebra Subprograms)
  • BLAS 1 (vector-vector operations)
      • Compiler-generated code is faster than ESSL
  • BLAS 2 (matrix-vector operations)
      • Compiler-generated code is equivalent to ESSL
  • BLAS 3 (matrix-matrix operations)
      • ESSL is significantly faster
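
To make the three BLAS levels concrete, here are naive C reference loops for one operation of each level. These are illustrative sketches with invented names and row-major storage, not the ESSL or BLAS interfaces.

```c
#include <stddef.h>

/* Level 1 (vector-vector): y = alpha*x + y.
   O(n) work on O(n) data, so there is little for a library to reuse. */
void axpy_ref(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y = A*x for an n x n matrix.
   O(n^2) work on O(n^2) data. */
void gemv_ref(size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Level 3 (matrix-matrix): C = A*B for n x n matrices.
   O(n^3) work on O(n^2) data, so each element is reused many times;
   this reuse is what a tuned library can exploit with cache blocking. */
void gemm_ref(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}
```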

40
ESSL performance