1
High Performance Computing with AMD Opteron
  • Maurizio Davini

2
Agenda
  • OS
  • Compilers
  • Libraries
  • Some Benchmark results
  • Conclusions

3
64-Bit Operating Systems: Recommendations and Status
  • SUSE SLES 9 with the latest Service Pack available
  • Has technology for supporting the latest AMD processor features
  • Widest breadth of NUMA support, enabled by default
  • Oprofile system profiler installable as an RPM and modularized
  • Complete support for static and dynamically linked 32-bit binaries
  • Red Hat Enterprise Server 3.0 Service Pack 2 or later
  • NUMA feature support not as complete as that of SUSE SLES 9
  • Oprofile installable as an RPM, but installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory
  • Only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit mode)
  • POSIX threading library changed between 2.1 and 3.0; may require users to rebuild applications

4
AMD Opteron Compilers: PGI, PathScale, GNU, Absoft, Intel, Microsoft and Sun
5
Compiler Comparisons Table: Critical Features Supported by x86 Compilers
Features compared: vector SIMD support, peels vector loops, global IPA, OpenMP, links ACML libraries, profile-guided feedback, aligns vector loops, parallel debuggers, large array support, medium memory model
Compilers: PGI, GNU, Intel, PathScale, Absoft, Sun, Microsoft
[Per-compiler feature check marks did not survive the transcript]
6
Tuning Performance with Compilers: Maintaining Stability while Optimizing
  • STEP 0: Build the application using the following procedure
  • Compile all files with the most aggressive optimization flags below
  • -tp k8-64 -fastsse
  • If compilation fails or the application doesn't run properly, turn off vectorization
  • -tp k8-64 -fast -Mscalarsse
  • If problems persist, compile at the lowest optimization level
  • -tp k8-64 -O0
  • STEP 1: Profile the binary and determine the performance-critical routines
  • STEP 2: Repeat STEP 0 on performance-critical functions, one at a time, and run the binary after each step to check stability
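The STEP 0 cascade above can be sketched as a small shell helper that tries each flag set in order and keeps the first one whose binary both compiles and survives a smoke run. This is a sketch, not PGI tooling: the pick_flags name, the app output name and the smoke-test convention are assumptions, and $FC stands in for the compiler driver (e.g. pgf90).

```shell
# pick_flags SRC FLAGSET...
# Try each candidate flag set, most aggressive first; echo the first one
# that both compiles SRC and passes a smoke run of the resulting binary.
pick_flags() {
  src=$1; shift
  for flags in "$@"; do
    if $FC $flags -o app "$src" >/dev/null 2>&1 && ./app >/dev/null 2>&1; then
      echo "$flags"
      return 0
    fi
  done
  return 1  # no candidate produced a working binary
}

# Typical call, mirroring the three sets above:
# FC=pgf90 pick_flags main.f90 \
#   "-tp k8-64 -fastsse" "-tp k8-64 -fast -Mscalarsse" "-tp k8-64 -O0"
```

STEP 2's per-function refinement is the same loop run over individual source files rather than the whole build.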

7
PGI Compiler Flags: Optimization Flags
  • Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases
  • Most aggressive: -tp k8-64 -fastsse -Mipa=fast
  • enables instruction-level tuning for Opteron, -O2-level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
  • strongly recommended for any single precision source code
  • Middle ground: -tp k8-64 -fast -Mscalarsse
  • enables all of the most aggressive options except vector code generation, which can reorder loops and generate slightly different results
  • a good substitute for double precision source bases, since Opteron has the same throughput on both scalar and vector code
  • Least aggressive: -tp k8-64 -O0 (or -O1)

8
PGI Compiler Flags: Functionality Flags
  • -mcmodel=medium
  • use if your application statically allocates a net sum of data structures greater than 2GB
  • -Mlarge_arrays
  • use if any array in your application is greater than 2GB
  • -KPIC
  • use when linking to shared object (dynamically linked) libraries
  • -mp
  • process OpenMP/SGI directives/pragmas (build multi-threaded code)
  • -Mconcur
  • attempt auto-parallelization of your code on an SMP system

9
Absoft Compiler Flags: Optimization Flags
  • Below are 3 different sets of recommended Absoft compiler flags for flag mining application source bases
  • Most aggressive: -O3
  • loop transformations, instruction preference tuning, cache tiling, SIMD code generation (CG); generally provides the best performance but may cause compilation failure or slow performance in some cases
  • strongly recommended for any single precision source code
  • Middle ground: -O2
  • enables most of the options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, pipelining and unrolling
  • a good substitute for double precision source bases, since Opteron has the same throughput on both scalar and vector code
  • Least aggressive: -O1

10
Absoft Compiler Flags: Functionality Flags
  • -mcmodel=medium
  • use if your application statically allocates a net sum of data structures greater than 2GB
  • -g77
  • enables full compatibility with g77-produced objects and libraries (must use this option to link to GNU ACML libraries)
  • -fpic
  • use when linking to shared object (dynamically linked) libraries
  • -safefp
  • performs certain floating point operations in a slower manner that avoids overflow and underflow and assures proper handling of NaNs

11
PathScale Compiler Flags: Optimization Flags
  • Most aggressive: -Ofast
  • equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
  • Aggressive: -O3
  • optimizations for highest-quality code, enabled at the cost of compile time
  • some of the generally beneficial optimizations included may hurt performance in particular cases
  • Reasonable: -O2
  • extensive conservative optimizations
  • optimizations almost always beneficial
  • faster compile time
  • avoids changes that affect floating point accuracy

12
PathScale Compiler Flags: Functionality Flags
  • -mcmodel=medium
  • use if static data structures are greater than 2GB
  • -ffortran-bounds-check
  • (Fortran) check array bounds
  • -shared
  • generate position-independent code for calling shared object libraries
  • Feedback-directed optimization:
  • STEP 0: compile the binary with -fb_create fbdata
  • STEP 1: run the code to collect data
  • STEP 2: recompile the binary with -fb_opt fbdata
  • -march=(opteron|athlon64|athlon64fx)
  • optimize code for the selected platform (Opteron is the default)
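The three feedback-directed optimization steps can be wrapped in one shell helper. This is a sketch: the fdo_build name, the app output name and the -O3 level are assumptions; $FC stands in for the PathScale driver (e.g. pathf90), and any extra arguments are passed to the training run.

```shell
# fdo_build SRC [TRAINING_ARGS...]
# STEP 0/1/2 of feedback-directed optimization in one helper.
fdo_build() {
  src=$1; shift
  $FC -O3 -fb_create fbdata -o app "$src" || return 1  # STEP 0: instrumented build
  ./app "$@" || return 1                               # STEP 1: training run writes fbdata
  $FC -O3 -fb_opt fbdata -o app "$src"                 # STEP 2: rebuild with the profile
}

# Typical call:
# FC=pathf90 fdo_build main.f90 training_input.dat
```

The training input should exercise the code paths that matter in production; an unrepresentative run can steer the optimizer the wrong way.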

13
ACML 2.1
  • Features
  • BLAS, LAPACK, FFT performance
  • OpenMP performance
  • ACML 2.5 snapshot soon to be released

14
Components of ACML: BLAS, LAPACK, FFTs
  • Linear Algebra (LA)
  • Basic Linear Algebra Subroutines (BLAS)
  • Level 1 (vector-vector operations)
  • Level 2 (matrix-vector operations)
  • Level 3 (matrix-matrix operations)
  • Routines involving sparse vectors
  • Linear Algebra PACKage (LAPACK)
  • leverage BLAS to perform complex operations
  • 28 Threaded LAPACK routines
  • Fast Fourier Transforms (FFTs)
  • 1D and 2D; single and double precision; real-to-real, real-to-complex, complex-to-real and complex-to-complex support
  • C and Fortran interfaces
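Using ACML from a build is mostly a question of linking the serial or the OpenMP-threaded library. The helper below just prints an example link line for each case; it is a sketch, with the compiler name, install path and the acml_link_line helper all assumptions, though the -lacml / -lacml_mp library names follow ACML's convention.

```shell
# acml_link_line MODE
# Print an example link line against ACML; "mp" selects the OpenMP-threaded
# build of the library, anything else the serial one.
acml_link_line() {
  case $1 in
    mp) echo "pgf90 -mp prog.f90 -o prog -L/opt/acml/pgi64_mp/lib -lacml_mp" ;;
    *)  echo "pgf90 prog.f90 -o prog -L/opt/acml/pgi64/lib -lacml" ;;
  esac
}

acml_link_line mp
```

With Absoft, remember the -g77 flag from the functionality-flags slide when linking to the GNU-built ACML.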

15
64-bit BLAS Performance: DGEMM (Double Precision General Matrix Multiply)
16
64-bit FFT Performance (non-power of 2): MKL vs ACML on 2.2 GHz Opteron
17
64-bit FFT Performance (non-power of 2): 2.2 GHz Opteron vs 3.2 GHz Xeon EM64T
18
Multithreaded LAPACK Performance: Double Precision (LU, Cholesky, QR Factorize/Solve)
19
Conclusion and Closing Points
  • How good is our performance?

Averaging over 70 BLAS/LAPACK/FFT routines (computation-weighted average), with all measurements performed on a 4P AMD Opteron™ 844 quartet server:
  • ACML 32-bit is 55% faster than MKL 6.1
  • ACML 64-bit is 80% faster than MKL 6.1
20
64-bit ACML 2.5 Snapshot: Small DGEMM Enhancements
21
Recent CASPUR Results (thanks to M. Rosati): Benchmark Suites
  • ATLSIM: a full-scale GEANT3 simulation of the ATLAS detector (P. Nevski), running typical LHC Higgs events
  • SixTrack: tracking of two particles in a 6-dimensional phase space including synchrotron oscillations (F. Schmidt) (http://frs.home.cern.ch/frs/sixtrack.html); SixTrack benchmark code by E. McIntosh (http://frs.home.cern.ch/frs/Benchmark/benchmark.html)
  • CERN Units: ancient CERN Units benchmark (E. McIntosh)
22
What was measured
On both platforms we ran one or two simultaneous jobs for each of the benchmarks. On Opteron we used the SuSE numactl interface to make sure that at any time each of the two processors makes use of the right bank of memory. Example of submission, 2 simultaneous jobs:
Intel:
  ./TestJob
  ./TestJob
AMD:
  numactl --cpubind=0 --membind=0 ./TestJob
  numactl --cpubind=1 --membind=1 ./TestJob

23
Results
                  CERN Units        SixTrack (s/run)   ATLSIM (s/event)
                  1 job   2 jobs    1 job   2 jobs     1 job   2 jobs
Intel Nocona      491     375, 399  185     218, 218   394     484, 484
AMD Opteron       528     528, 528  165     165, 166   389     389, 389

While both machines behave in a similar way when only one job is run, the situation changes in a visible manner with two jobs. It may take up to 30% more time to run two simultaneous jobs on Intel, while on AMD there is a notable absence of any visible performance drop.
24
HEP Software Bench
25
HEP Software Bench
26
HEP Software Bench
27
HEP Software Bench
28
HEP Software Bench
29
An Original MPI Work on AMD Opteron
  • We got access to the MPI wrapper-library source
  • Environment:
  • 4-way servers
  • Myrinet interconnect
  • Linux 2.6 kernel
  • LAM MPI
  • We inserted libnuma calls after MPI_INIT to bind the newly created MPI tasks to specific processors
  • We avoid unnecessary memory traffic by having each processor access its own memory

30
>20% improvement
31
Conclusions
  • AMD Opteron: HPEW
  • High
  • Performance
  • Easy
  • Way