Title: A Multi-platform Co-Array Fortran Compiler for High-Performance Computing
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{dotsenko, ccristi, johnmc}_at_cs.rice.edu
Programming Ultra-scale Parallel Systems
Co-Array Fortran Language
Rice CAF Compiler
CAF Model Refinements
Co-Array Fortran Language
- Parallel extension of Fortran 90
- SPMD programming model
- fixed number of images during execution
- images operate asynchronously
- Both private and shared data
- real a(20,20)      ! private: a 20x20 array in each image
- real a(20,20)[*]   ! shared: a 20x20 co-array in each image
- Simple one-sided communication (PUT and GET)
- x(:,j:j+2) = a(r,:)[p:p+2]   ! copy rows from p:p+2 into local columns
- Flexible explicit synchronization
- sync_team(team [,wait])
- team: a vector of process ids to synchronize with
- wait: a vector of processes to wait for
- Pointers and dynamic allocation
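The co-array declaration and the one-sided GET listed above can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not CAF and not output of the Rice compiler: images are list entries, nothing runs in parallel, indices are 0-based rather than Fortran's 1-based, and the names NUM_IMAGES, make_coarray and get_rows_into_columns are hypothetical.

```python
# Toy model of co-array semantics (assumption: 4 images, no real parallelism).
NUM_IMAGES = 4
N = 20


def make_coarray(num_images, n):
    """One n x n array per image, like `real a(20,20)[*]`.

    Entry (i, k) on image p holds p*10000 + i*100 + k so its origin is visible.
    """
    return [[[p * 10000 + i * 100 + k for k in range(n)] for i in range(n)]
            for p in range(num_images)]


def get_rows_into_columns(x, a, r, j, p, count=3):
    """Model of x(:, j:j+2) = a(r, :)[p:p+2].

    Row r of images p..p+count-1 is fetched (a one-sided GET: the owning
    image takes no action) and stored into columns j..j+count-1 of x.
    """
    for d in range(count):
        remote_row = a[p + d][r]
        for i, value in enumerate(remote_row):
            x[i][j + d] = value
    return x


a = make_coarray(NUM_IMAGES, N)        # shared: one copy per image
x = [[0] * N for _ in range(N)]        # private: exists only on the local image
get_rows_into_columns(x, a, r=5, j=2, p=1)
```

Column 3 of x now holds row 5 of image 2, which is what the array-section syntax expresses in a single assignment.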
Rice CAF Compiler
- Source-to-source code generation
- Open source compiler
- Built on the Open64/SL infrastructure
- Support for core language features
- Code generation
- library-based communication: portable ARMCI and GASNet communication libraries and the CHASM array descriptor library
- load/store communication on shared-memory platforms
- Operating systems
- Linux IA64/IA32
- Alpha Tru64
- SGI IRIX64
- Interconnects and platforms
- Quadrics QSNet (Elan 3), QSNet II (Elan 4)
- Myrinet 2000
- Ethernet
- SGI Altix 3000, SGI Origin 2000
CAF Model Refinements
- Point-to-point synchronization
- sync_notify(p)
- sync_wait(p)
- Less restrictive memory fences at call site
- Collective operations
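A rough way to see why point-to-point synchronization is attractive: each image only waits for the partner whose data it needs, rather than for a whole team. The sketch below models this with Python threads and semaphores. It is an illustration under stated assumptions, not the Rice CAF runtime: the Images class, its buf slot and run_ring are hypothetical, and only the names sync_notify and sync_wait come from the list above.

```python
import threading


class Images:
    """Toy model: images are threads; notifications are counted per (source, target)."""

    def __init__(self, n):
        self.n = n
        self.sems = {(s, t): threading.Semaphore(0)
                     for s in range(n) for t in range(n)}
        self.buf = [None] * n      # one slot per image, standing in for a co-array

    def sync_notify(self, me, p):
        # Non-blocking: post a notification from image `me` to image `p`.
        self.sems[(me, p)].release()

    def sync_wait(self, me, p):
        # Block image `me` until a notification from image `p` has arrived.
        self.sems[(p, me)].acquire()


def run_ring(n=4):
    imgs = Images(n)
    received = [None] * n

    def body(me):
        right, left = (me + 1) % n, (me - 1) % n
        imgs.buf[right] = me * 10      # PUT into the right neighbour's slot
        imgs.sync_notify(me, right)    # tell only that neighbour the data is there
        imgs.sync_wait(me, left)       # wait only for the left neighbour's PUT
        received[me] = imgs.buf[me]

    threads = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Each image posts its notification before it waits, so the ring cannot deadlock, and no image is delayed by images it does not exchange data with.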
Programming Ultra-scale Parallel Systems
- Challenges
- High performance and good scalability
- Programmer productivity
- CAF: a promising near-term alternative
- As expressive as MPI
- Simpler to program than MPI
- More amenable to compiler optimizations
- User has control over performance-critical factors
- MPI: a library-based parallel programming model
- Portable and widely used
- The developer has explicit control over data and communication placement
- Difficult and error-prone to program
- Most of the burden for communication optimization falls on application developers; compiler support is underutilized
Current Optimizations
- Procedure Splitting
- Hints for non-blocking communication
- Library-based and load/store communication
- Packing of strided communication
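Packing of strided communication can be illustrated with a small message-counting model. The idea sketched here is the general one (gather the strided elements into a contiguous buffer, send it as one message, scatter it at the destination); how the Rice compiler and its communication libraries actually implement packing is not stated on the poster. The Link class and both helper functions are hypothetical.

```python
class Link:
    """Toy connection to one remote image; counts the messages it carries."""

    def __init__(self, remote_size):
        self.remote = [0.0] * remote_size
        self.messages = 0

    def put(self, offset, values):
        """One message: write a contiguous run of values starting at offset."""
        self.remote[offset:offset + len(values)] = values
        self.messages += 1

    def put_packed(self, start, stride, packed):
        """One message carrying a packed buffer; the receiver scatters it."""
        self.messages += 1
        for k, value in enumerate(packed):
            self.remote[start + k * stride] = value


def strided_put_naive(link, local, start, stride, count):
    # One message per element of the strided section.
    for k in range(count):
        idx = start + k * stride
        link.put(idx, [local[idx]])


def strided_put_packed(link, local, start, stride, count):
    # Gather the section into a contiguous buffer, then send a single message.
    packed = local[start:start + count * stride:stride]
    link.put_packed(start, stride, packed)
```

For a section of 5 elements with stride 10, the naive version sends 5 messages and the packed version sends 1, with the same result at the destination.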
Planned Optimizations
- Communication vectorization and aggregation
- Synchronization strength-reduction
- Automatic split-phase communication
- Platform-driven communication optimizations
- transform communication from one-sided into two-sided and collective, if useful
- multi-model code for hierarchical architectures
- convert GETs into PUTs
- Multi-buffer co-arrays for asynchrony tolerance
- Employ virtualization for latency tolerance
- Interoperability with other programming models
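Of the planned optimizations, communication vectorization is the easiest to show in miniature: a loop that issues one fine-grained GET per iteration is rewritten so that a single bulk GET is hoisted out of the loop. The sketch below is a hand-written before/after pair, not the compiler transformation itself; RemoteArray and both functions are hypothetical names.

```python
class RemoteArray:
    """Toy remote co-array section; counts the GET messages issued against it."""

    def __init__(self, data):
        self.data = list(data)
        self.gets = 0

    def get(self, lo, hi):
        """One GET message fetching data[lo:hi]."""
        self.gets += 1
        return self.data[lo:hi]


def sum_fine_grained(remote, n):
    # Before: one remote access (one message) per loop iteration.
    total = 0
    for i in range(n):
        total += remote.get(i, i + 1)[0]
    return total


def sum_vectorized(remote, n):
    # After: the remote accesses are hoisted into one bulk GET before the loop.
    block = remote.get(0, n)
    total = 0
    for i in range(n):
        total += block[i]
    return total
```

Both functions compute the same sum; for n elements the first issues n GETs and the second issues one, at the cost of a temporary buffer of n elements.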
CAF Applications and Benchmarks
- Sweep3D: wave-front parallelism
- Spark98: sparse matrix-vector multiply
- NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU
- Random Access, STREAM
Neutron transport problem: Sweep3D
San Fernando Valley Earthquake Simulation: Spark98
Computational Fluid Dynamics: Cluster Platforms
NAS BT (class C) on Itanium2+Myrinet 2000
NAS MG (class C) on Itanium2+Myrinet 2000
Spark98 on SGI Altix 3000
Sweep3D 150^3 on Itanium2+Quadrics
Mesh
Computational Fluid Dynamics: SGI Altix 3000
- Sparse matrix-vector multiply (sf2 traces)
- Performance of all CAF versions is comparable to that of MPI, and better on large numbers of CPUs
- CAF with GETs is simpler and more natural to code, but up to 13% slower
- Without considering locality, applications do not scale on NUMA architectures (Hybrid)
- ARMCI library is more efficient than MPI
Sweep3D 150^3 on Itanium2+Myrinet
Sweep3D 150^3 on SGI Altix 3000
NAS BT (class B) on SGI Altix 3000
NAS MG (class C) on SGI Altix 3000
Partitioned Mesh