PGI Compilers & Tools for Scientists and Engineers

1
PGI Compilers & Tools for Scientists and Engineers
Brent Leback, brent.leback_at_pgroup.com
Dave Norton, dave.norton_at_pgroup.com
www.pgroup.com
2
Outline of Today's Topics
  • Introduction to PGI Compilers and Tools
  • Documentation, Getting Help
  • Basic Compiler Options
  • Optimization Strategies
  • Questions and Answers

3
PGI Compilers and Tools, features
  • Optimization: State-of-the-art vector,
    parallel, IPA, feedback, ...
  • Cross-platform: AMD & Intel, 32/64-bit, Linux &
    Windows
  • PGI Unified Binary for AMD and Intel processors
  • Tools: Integrated OpenMP/MPI debug & profile,
    IDE integration
  • Parallel: MPI, OpenMP 2.5, auto-parallel for
    multi-core
  • Comprehensive OS Support: Red Hat 7.3 to 9.0,
    RHEL 3.0/4.0, Fedora Core 2/3/4/5, SuSE 7.1 to
    10.1, SLES 8/9/10, Windows XP, Windows x64

4
PGI Tools Enable Developers to
  • View x64 as a unified CPU architecture
  • Extract peak performance from x64 CPUs
  • Ride innovation waves from both Intel and AMD
  • Use a single source base and toolset across Linux
    and Windows
  • Develop, debug, tune parallel applications
    for Multi-core, Multi-core SMP, Clustered
    Multi-core SMP

5
PGI Documentation and Support
  • PGI provided documentation
  • PGI User Forums, at www.pgroup.com
  • PGI FAQs, Tips & Techniques pages
  • Email support, via trs_at_pgroup.com
  • Web support, a form-based system similar to email
    support
  • Fax support

6
PGI Docs & Support, cont.
  • Legacy phone support, direct access, etc.
  • PGI download web page
  • PGI prepared/personalized training
  • PGI ISV program
  • PGI Premier Service program

7
PGI Basic Compiler Options
  • Basic Usage
  • Language Dialects
  • Target Architectures
  • Debugging aids
  • Optimization switches

8
PGI Basic Compiler Usage
  • A compiler driver interprets options and invokes
    pre-processors, compilers, assembler, linker,
    etc.
  • Option precedence: if options conflict, the last
    option on the command line takes precedence
  • Use -Minfo and -Mneginfo to see a listing of
    optimizations and transformations performed
    (-Minfo) or not performed (-Mneginfo) by the
    compiler
  • Use -help to list all options or see details on
    how to use a given option, e.g. pgf90 -Mvect
    -help
  • Use man pages for more details on options, e.g.
    man pgf90
  • Use -v to see under the hood

9
Flags to support language dialects
  • Fortran
  • pgf77, pgf90, pgf95, pghpf tools
  • Suffixes: .f, .F, .for, .fpp, .f90, .F90, .f95,
    .F95, .hpf, .HPF
  • -Mextend, -Mfixed, -Mfreeform
  • Type size: -i2, -i4, -i8, -r4, -r8, etc.
  • -Mcray, -Mbyteswapio, -Mupcase, -Mnomain,
    -Mrecursive, etc.
  • C/C++
  • pgcc, pgCC, aka pgcpp
  • Suffixes: .c, .C, .cc, .cpp, .i
  • -B, -c89, -c9x, -Xa, -Xc, -Xs, -Xt
  • -Msignextend, -Mfcon, -Msingle, -Muchar, -Mgccbugs

10
Specifying the target architecture
  • Not an issue on XT3.
  • Defaults to the type of processor/OS you are
    running on
  • Use the -tp switch.
  • -tp k8-64 or -tp p7-64 or -tp core2-64 for 64-bit
    code.
  • -tp amd64e for AMD Opteron rev E or later
  • -tp x64 for unified binary
  • -tp k8-32, k7, p7, piv, piii, p6, p5, px for
    32-bit code

11
Flags for debugging aids
  • -g generates symbolic debug information used by a
    debugger
  • -gopt generates debug information in the presence
    of optimization
  • -Mbounds adds array bounds checking
  • -v gives verbose output, useful for debugging
    system or build problems
  • -Minfo provides feedback on optimizations made by
    the compiler
  • -S or -Mkeepasm to see the exact assembly
    generated

12
Basic optimization switches
  • Traditional optimization controlled through
    -O<n>, n is 0 to 4.
  • -fastsse and -fast are equal to -O2 -Munroll=c:1
    -Mnoframe -Mlre -Mvect=sse -Mscalarsse
    -Mcache_align -Mflushz
  • For -Munroll, c specifies completely unroll loops
    with this loop count or less
  • -Munroll=n:<m> says unroll other loops m times
  • -Mcache_align aligns top level arrays and objects
    on cache-line boundaries
  • -Mflushz flushes SSE denormal numbers to zero
  • -Mnoframe does not set up a stack frame
  • -Mlre is loop-carried redundancy elimination
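
As a rough illustration of what loop-carried redundancy elimination means, the sketch below applies the idea by hand. The function names are hypothetical and are not from the slides; the compiler performs this kind of rewrite internally, and its actual transformation may differ.

```c
#include <stddef.h>

/* Naive form: iteration i recomputes b[i] * b[i], which iteration i-1
   already computed as its second term. */
void sum_sq_naive(const float *b, float *a, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = b[i] * b[i] + b[i + 1] * b[i + 1];
}

/* Hand-applied version: the shared product is carried in a scalar from
   one iteration to the next, so each iteration does one multiply. */
void sum_sq_carried(const float *b, float *a, size_t n)
{
    if (n < 2)
        return;
    float prev = b[0] * b[0];
    for (size_t i = 0; i + 1 < n; i++) {
        float next = b[i + 1] * b[i + 1];
        a[i] = prev + next;
        prev = next;
    }
}
```

Both functions write n-1 results into a; the second simply reuses work that the first repeats.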

13
Optimization Strategies
  • Establish a workload
  • Optimization from the top-down
  • Use of proper tools, methods
  • Processor level optimizations, parallel methods
  • Different flags/features for different types of
    code

14
Node level tuning
  • Vectorization: packed SSE instructions maximize
    performance
  • Interprocedural Analysis (IPA): use it!
    Motivating examples
  • Function Inlining: especially important for C
    and C++
  • Parallelization: for multi-core processors
  • Miscellaneous Optimizations: hit or miss, but
    worth a try

15
Vectorizable F90 Array Syntax (Data is REAL*4)
350 !
351 ! Initialize vertex, similarity and coordinate arrays
352 !
353 Do Index = 1, NodeCount
354   IX = MOD (Index - 1, NodesX) + 1
355   IY = ((Index - 1) / NodesX) + 1
356   CoordX (IX, IY) = Position (1) + (IX - 1) * StepX
357   CoordY (IX, IY) = Position (2) + (IY - 1) * StepY
358   JetSim (Index) = SUM (Graph (:, :, Index) * &
359           & GaborTrafo (:, :, CoordX(IX,IY), CoordY(IX,IY)))
360   VertexX (Index) = MOD (Params%Graph%RandomIndex (Index) - 1, NodesX) + 1
361   VertexY (Index) = ((Params%Graph%RandomIndex (Index) - 1) / NodesX) + 1
362 End Do
Inner loop at line 358 is vectorizable, can use packed SSE instructions
16
-fastsse to Enable SSE Vectorization, -Minfo to List Optimizations to stderr
% pgf95 -fastsse -Mipa=fast -Minfo -S graphRoutines.f90
localmove:
    334, Loop unrolled 1 times (completely unrolled)
    343, Loop unrolled 2 times (completely unrolled)
    358, Generated an alternate loop for the inner loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
         Generated vector sse code for inner loop
         Generated 2 prefetch instructions for this loop
17
Vector SSE:
.LB6_1245:
# lineno: 358
        movlps  (%rdx,%rcx),%xmm2
        subl    $8,%eax
        movlps  16(%rcx,%rdx),%xmm3
        prefetcht0 64(%rcx,%rsi)
        prefetcht0 64(%rcx,%rdx)
        movhps  8(%rcx,%rdx),%xmm2
        mulps   (%rsi,%rcx),%xmm2
        movhps  24(%rcx,%rdx),%xmm3
        addps   %xmm2,%xmm0
        mulps   16(%rcx,%rsi),%xmm3
        addq    $32,%rcx
        testl   %eax,%eax
        addps   %xmm3,%xmm0
        jg      .LB6_1245

Scalar SSE:
.LB6_668:
# lineno: 358
        movss   -12(%rax),%xmm2
        movss   -4(%rax),%xmm3
        subl    $1,%edx
        mulss   -12(%rcx),%xmm2
        addss   %xmm0,%xmm2
        mulss   -4(%rcx),%xmm3
        movss   -8(%rax),%xmm0
        mulss   -8(%rcx),%xmm0
        addss   %xmm0,%xmm2
        movss   (%rax),%xmm0
        addq    $16,%rax
        addss   %xmm3,%xmm2
        mulss   (%rcx),%xmm0
        addq    $16,%rcx
        testl   %edx,%edx
        addss   %xmm0,%xmm2
        movaps  %xmm2,%xmm0
        jg      .LB6_625

Facerec Scalar: 104.2 sec
Facerec Vector: 84.3 sec
18
Vectorizable C Code Fragment?
217 void func4(float *u1, float *u2, float *u3, ...
...
221   for (i = -NE+1, p1 = u2-ny, p2 = n2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Minfo functions.c
func4:
    221, Loop unrolled 4 times
    221, Loop not vectorized due to data dependency
    223, Loop not vectorized due to data dependency
19
Pointer Arguments Inhibit Vectorization
217 void func4(float *u1, float *u2, float *u3, ...
...
221   for (i = -NE+1, p1 = u2-ny, p2 = n2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Msafeptr -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Unrolled inner loop 4 times
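
The same information that -Msafeptr gives for a whole compilation can be given per pointer in the source with the C99 restrict qualifier. The sketch below is a simplified stand-in for the loop at line 225, with hypothetical function names; whether a given compiler vectorizes either version depends on the compiler and flags.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume u3 may overlap u1 or u2,
   so a store to u3[i] could change a value loaded in a later iteration. */
void update_may_alias(const float *u1, const float *u2, float *u3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        u3[i] = 2.0f * u2[i] - u1[i];
}

/* restrict is the programmer's promise that the three arrays do not
   overlap, which removes the potential dependency. */
void update_no_alias(const float *restrict u1, const float *restrict u2,
                     float *restrict u3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        u3[i] = 2.0f * u2[i] - u1[i];
}
```

The promise is not checked: calling update_no_alias with overlapping arrays is undefined behavior, just as -Msafeptr is unsafe on code whose pointers really do alias.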
20
C Constant Inhibits Vectorization
217 void func4(float *u1, float *u2, float *u3, ...
...
221   for (i = -NE+1, p1 = u2-ny, p2 = n2+ny; i < nx+NE-1; i++)
222     u3[i] += clz * (p1[i] + p2[i]);
223   for (i = -NI+1; i < nx+NE-1; i++) {
224     float vdt = v[i] * dt;
225     u3[i] = 2.*u2[i]-u1[i]+vdt*vdt*u3[i];
226   }

% pgcc -fastsse -Msafeptr -Mfcon -Minfo functions.c
func4:
    221, Generated vector SSE code for inner loop
         Generated 3 prefetch instructions for this loop
    223, Generated vector SSE code for inner loop
         Generated 4 prefetch instructions for this loop
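
The constant in question is the 2. on line 225: in C an unsuffixed floating constant has type double, so the float operands are converted to double and back. A minimal sketch of the source-level alternative to -Mfcon follows; the function names are hypothetical and the loop is reduced to the part that matters.

```c
#include <stddef.h>

/* 2. is a double constant: u2[i] and u1[i] are converted to double, the
   arithmetic is done in double, and the result is narrowed on the store. */
void step_double_const(const float *u1, const float *u2, float *u3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        u3[i] = 2. * u2[i] - u1[i];
}

/* 2.0f is a float constant: the expression stays in single precision. */
void step_float_const(const float *u1, const float *u2, float *u3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        u3[i] = 2.0f * u2[i] - u1[i];
}
```

Writing the f suffix makes the intent explicit in the source and does not rely on a compiler flag, which is the "manually convert constants" option listed among the common barriers.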
21
-Msafeptr Option and Pragma
-M[no]safeptr[=all | arg | auto | dummy | local | static | global]
    all     All pointers are safe
    arg     Argument pointers are safe
    local   local pointers are safe
    static  static local pointers are safe
    global  global pointers are safe

#pragma [scope] [no]safeptr={arg | local | global | static | all},...
Where scope is global, routine or loop
22
Common Barriers to SSE Vectorization
  • Potential Dependencies & C Pointers: Give
    compiler more info with -Msafeptr, pragmas,
    or restrict type qualifier
  • Function Calls: Try inlining with -Minline or
    -Mipa=inline
  • Type conversions: manually convert constants
    or use flags
  • Too few iterations: Usually better to unroll
    the loop
  • Real dependencies: Must restructure loop, if
    possible
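
To make the difference between a real dependency and an independent loop concrete, here is a small sketch with hypothetical function names. It only illustrates the two loop shapes and makes no claim about what any particular compiler reports for them.

```c
#include <stddef.h>

/* Real (loop-carried) dependency: x[i] needs the x[i-1] that the previous
   iteration just wrote, so iterations cannot simply run side by side. */
void running_sum(float *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* No dependency between iterations: each out[i] uses only a[i] and b[i]. */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```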

23
Barriers to Efficient Execution of Vector SSE
Loops

  • Not enough work: vectors are too short
  • Vectors not aligned to a cache line boundary
  • Non-unit strides
  • Code bloat if altcode is generated
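
The stride point can be seen in how a row-major C array is traversed. The sketch below uses hypothetical function names and a matrix stored as a flat array; both functions compute the same sum, but only the first reads consecutive memory in its inner loop.

```c
#include <stddef.h>

/* Unit stride: the inner loop walks consecutive floats in memory. */
float sum_by_rows(const float *m, size_t rows, size_t cols)
{
    float sum = 0.0f;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Stride of cols: the inner loop jumps cols floats between accesses. */
float sum_by_cols(const float *m, size_t rows, size_t cols)
{
    float sum = 0.0f;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```

In Fortran the roles are reversed, since arrays are stored column-major and the leftmost index is the unit-stride one.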

24
  • Vectorization: packed SSE instructions
    maximize performance
  • Interprocedural Analysis (IPA): use it!
    Motivating example
  • Function Inlining: especially important for C
    and C++
  • Parallelization: for Cray XD1 and multi-core
    processors
  • Miscellaneous Optimizations: hit or miss, but
    worth a try


25
What can Interprocedural Analysis and
Optimization with -Mipa do for You?
  • Interprocedural constant propagation
  • Pointer disambiguation
  • Alignment detection, Alignment propagation
  • Global variable mod/ref detection
  • F90 shape propagation
  • Function inlining
  • IPA optimization of libraries, including
    inlining


26
Effect of IPA on the WUPWISE Benchmark
  • -Mipa=fast => constant propagation => compiler
    sees complex matrices are all 4x3 =>
    completely unrolls loops
  • -Mipa=fast,inline => small matrix multiplies
    are all inlined
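
The pattern that interprocedural constant propagation exploits can be sketched in a few lines. This is not WUPWISE code; the routine names and the real-valued 4x3 shape are assumptions made for illustration only.

```c
#include <stddef.h>

/* General routine: compiled on its own, the trip counts rows and cols
   are only known at run time. */
void matvec(int rows, int cols, const float *m, const float *x, float *y)
{
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += m[r * cols + c] * x[c];
        y[r] = acc;
    }
}

/* Every call passes the same constants (4 and 3). A compiler that sees
   all call sites can treat rows and cols as constants inside matvec. */
void apply_to_all(const float *m, const float *xs, float *ys, size_t count)
{
    for (size_t i = 0; i < count; i++)
        matvec(4, 3, m, xs + 3 * i, ys + 4 * i);
}
```

Once the trip counts are known to be 4 and 3, the loops are short enough to unroll completely, which is the effect the slide describes.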

27
Using Interprocedural Analysis
  • Must be used at both compile time and link time
  • Non-disruptive to development process
    edit/build/run
  • Speed-ups of 5% - 10% are common
  • -Mipa=safe:<name> - safe to optimize functions
    which call or are called from unknown
    function/library name
  • -Mipa=libopt - perform IPA optimizations on
    libraries
  • -Mipa=libinline - perform IPA inlining from
    libraries


28
  • Vectorization: packed SSE instructions
    maximize performance
  • Interprocedural Analysis (IPA): use it!
    Motivating examples
  • Function Inlining: especially important for C
    and C++
  • SMP Parallelization: for Cray XD1 and
    multi-core processors
  • Miscellaneous Optimizations: hit or miss, but
    worth a try


29
Explicit Function Inlining
-Minline[=[lib:]<inlib> | [name:]<func> | except:<func> | size:<n> | levels:<n>]
    [lib:]<inlib>    Inline extracted functions from inlib
    [name:]<func>    Inline function func
    except:<func>    Do not inline function func
    size:<n>         Inline only functions smaller than n statements (approximate)
    levels:<n>       Inline n levels of functions
For C++ Codes, PGI Recommends IPA-based inlining or -Minline=levels:10!
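
The kind of code that benefits most from inlining is a tiny helper called inside a hot loop. The sketch below is written in C rather than C++, with hypothetical names, and only illustrates the shape of the problem; whether the call is actually inlined is up to the compiler and the flags used.

```c
/* A small helper: the work it does is comparable to the cost of calling
   it, and an opaque call inside a loop can also block vectorization. */
typedef struct {
    float x, y, z;
} vec3;

static inline float vec3_dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Hot loop: once vec3_dot is inlined, the body is plain arithmetic
   that the optimizer can unroll or vectorize. */
float sum_of_dots(const vec3 *a, const vec3 *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += vec3_dot(a[i], b[i]);
    return sum;
}
```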
30
Other C++ recommendations
  • Encapsulation, Data Hiding - small functions,
    inline!
  • Exception Handling: use --no_exceptions until
    7.0
  • Overloaded operators, overloaded functions -
    okay
  • Pointer Chasing - -Msafeptr, restrict qualifier,
    32 bits?
  • Templates, Generic Programming: now okay
  • Inheritance, polymorphism, virtual functions:
    runtime lookup or check, no inlining, potential
    performance penalties


31
  • Vectorization: packed SSE instructions
    maximize performance
  • Interprocedural Analysis (IPA): use it!
    Motivating examples
  • Function Inlining: especially important for C
    and C++
  • SMP Parallelization: for multi-core processors
  • Miscellaneous Optimizations: hit or miss, but
    worth a try


32
SMP Parallelization
  • -mp[=nonuma] to enable OpenMP 2.5 parallel
    programming model
  • See PGI User's Guide or OpenMP 2.5 standard
  • OpenMP programs compiled w/out -mp just work

33
MGRID Benchmark Main Loop
      DO 10 I3=2, N-1
      DO 10 I2=2,N-1
      DO 10 I1=2,N-1
10    R(I1,I2,I3) = V(I1,I2,I3)
     &  -A(0)*(U(I1,I2,I3))
     &  -A(1)*(U(I1-1,I2,I3)+U(I1+1,I2,I3)
     &        +U(I1,I2-1,I3)+U(I1,I2+1,I3)
     &        +U(I1,I2,I3-1)+U(I1,I2,I3+1))
     &  -A(2)*(U(I1-1,I2-1,I3)+U(I1+1,I2-1,I3)
     &        +U(I1-1,I2+1,I3)+U(I1+1,I2+1,I3)
     &        +U(I1,I2-1,I3-1)+U(I1,I2+1,I3-1)
     &        +U(I1,I2-1,I3+1)+U(I1,I2+1,I3+1)
     &        +U(I1-1,I2,I3-1)+U(I1-1,I2,I3+1)
     &        +U(I1+1,I2,I3-1)+U(I1+1,I2,I3+1) )
     &  -A(3)*(U(I1-1,I2-1,I3-1)+U(I1+1,I2-1,I3-1)
     &        +U(I1-1,I2+1,I3-1)+U(I1+1,I2+1,I3-1)
     &        +U(I1-1,I2-1,I3+1)+U(I1+1,I2-1,I3+1)
     &        +U(I1-1,I2+1,I3+1)+U(I1+1,I2+1,I3+1))
34
Auto-parallel MGRID: Overall Speed-up is 40% on
Dual-core AMD Opteron
% pgf95 -fastsse -Mipa=fast,inline -Minfo -Mconcur mgrid.f
resid:
    . . .
    189, Parallel code for non-innermost loop activated
         if loop count >= 33; block distribution
    291, 4 loop-carried redundant expressions
         removed with 12 operations and 16 arrays
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
         Generated vector SSE code for inner loop
         Generated 8 prefetch instructions for this loop
35
  • Vectorization: packed SSE instructions
    maximize performance
  • Interprocedural Analysis (IPA): use it!
    Motivating examples
  • Function Inlining: especially important for C
    and C++
  • SMP Parallelization: for Cray XD1 and
    multi-core processors
  • Miscellaneous Optimizations: hit or miss, but
    worth a try


36
Miscellaneous Optimizations (1)
  • -Mfprelaxed: single-precision sqrt, rsqrt, div
    performed using reduced-precision reciprocal
    approximation
  • -Mprefetch=d:<p>,n:<q>: control prefetching
    distance, max number of prefetch
    instructions per loop
  • -tp k8-32: can result in big performance win
    on some C/C++ codes that don't require > 2GB
    addressing; pointer and long data become
    32-bits
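
One reason a 32-bit target can help pointer-heavy code is simply data size. The sketch below uses a hypothetical linked-list node; on a typical 64-bit Linux target the pointer and the long are 8 bytes each, while on a 32-bit target they are 4 bytes each, so the same list occupies about half the memory and cache.

```c
#include <stddef.h>

/* A typical pointer-chasing node: one pointer and one long per element. */
struct list_node {
    struct list_node *next;
    long value;
};

/* Bytes needed for n nodes; this scales directly with the sizes of
   pointer and long on the target being compiled for. */
size_t list_bytes(size_t n)
{
    return n * sizeof(struct list_node);
}
```

The exact sizes depend on the platform's data model, so it is worth printing sizeof(struct list_node) on each target rather than assuming them.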


37
Miscellaneous Optimizations (2)
  • -O3 or -O4: more aggressive hoisting and
    scalar replacement; not part of -fastsse, always
    time your code to make sure it's faster
  • For C++ codes: --no_exceptions
    -Minline=levels:10
  • -M[no]movnt: disable / force non-temporal
    moves
  • -V[version] to switch between PGI releases at
    file level
  • -Mvect=noaltcode: disable multiple versions of
    loops


38
What's New in PGI 6.2
  • Industry-leading SPECFP06 and SPECINT06
    Performance
  • PGI Visual Fortran for Windows x64 & Windows XP
  • Full-featured PGI Workstation/Server for 32-bit
    Windows XP
  • PGI Unified Binary performance enhancements
  • More gcc extensions / compatibility
  • New SSE intrinsics
  • PGI CDK ROLL for ROCKS clusters
  • MPICH1 and MPICH2 support in the PGI CDK
  • Incremental debugger/profiler enhancements
  • Limited tuning for Intel Core2 (Woodcrest et al)

39
PGI Visual Fortran 6.2
  • Deep integration with Visual Studio 2005
  • PGI-custom Fortran-aware text editor
  • Syntax coloring, keyword completion
  • Fortran 95 Intrinsics tips
  • PGI-custom project system and icons
  • PGI-custom property pages
  • One-touch project build / execute
  • MS Visual C++ interoperability
  • Mixed VC++ / PGI Fortran applications
  • PGI-custom parallel F95 debug engine
  • OpenMP 2.5 / threads debugging
  • Just-in-time debugging features
  • DVF/CVF compatibility features
  • Win32 API support
  • Complete (Vis Studio bundled) and Standard
    (no Vis Studio) versions
  • PGI Unified Binary executables
  • Auto-parallel for multi-core CPUs
  • Native OpenMP 2.5 parallelization
  • World-class performance
  • 64-bit Windows x64 support
  • 32-bit Windows 2000/XP support
  • Optimization/support for AMD64
  • Optimization/support for Intel EM64T
  • DEC/IBM/Cray compatibility features
  • cpp-compatible pre-processing
  • Visual Studio 2005 bundled
  • MSDN Library bundled
  • GUI parallel debugging/profiling
  • Assembly-optimized BLAS/LAPACK/FFTs
  • Boxed CD-ROM/Manuals media kit

PVF Workstation Complete Only
40
On the PGI Roadmap
  • PGI Unified Binary directives and enhancements
  • Aggressive Intel Core2 and next gen AMD64 tuning
  • Industry-leading SPECFP06 and SPECINT06
    Performance on Linux/Windows/AMD/Intel/32/64
  • Incremental PGDBG enhancements, improved C++
    support
  • MPI Debugging / Profiling for Windows x64 CCS
    Clusters
  • All-new cross-platform PGPROF performance
    profiler
  • Fortran 2003/C99 language features
  • GCC front-end compatibility, g++
    interoperability
  • PGC++ tuning, PGC++/VC++ interoperability
  • Windows SUA and Apple/MacOS platform support
  • De facto standard scalable C/Fortran
    language/tools extensions

41
Questions?
Reach me at brent.leback_at_pgroup.com
Thanks for your time
42
PathScale
  • Version 3.1 on odin (latest release)
  • Well worth trying in addition to PGI
  • Not the default compiler
  • Often gives better results!
  • Very fine-grained control of optimization and
    code generation
  • Less informative optimization information
