A Multi-platform Co-Array Fortran Compiler for High-Performance Computing
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{dotsenko, ccristi, johnmc}@cs.rice.edu

Co-Array Fortran Language
  • Parallel extension of Fortran 90
  • SPMD programming model
  • fixed number of images during execution
  • images operate asynchronously
  • Both private and shared data
  • real :: a(20,20) declares a private 20x20 array in
    each image
  • real :: a(20,20)[*] declares a shared 20x20 co-array,
    one per image
  • Simple one-sided communication (PUT/GET)
  • x(:,j:j+2) = a(r,:)[p:p+2] copies row r from
    images p through p+2 into local columns (see the
    sketch after this list)
  • Flexible explicit synchronization
  • sync_team(team [, wait])
  • team: a vector of process ids to synchronize
    with
  • wait: a vector of processes to wait for
  • Pointers and dynamic allocation
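
A minimal sketch of these language features, assuming the standard co-array syntax; the 20x20 shape echoes the declarations above, while the partner image p and the data movement are illustrative only:

    program caf_basics
      implicit none
      real    :: x(20,20)           ! private: one independent copy per image
      real    :: a(20,20)[*]        ! co-array: remotely accessible on every image
      integer :: me, p

      me = this_image()
      a  = real(me)
      call sync_all()               ! make every image's initialization visible

      p = 1 + mod(me, num_images()) ! illustrative partner image
      x(:,1)    = a(5,:)[p]         ! one-sided GET: read row 5 from image p
      a(:,1)[p] = x(:,1)            ! one-sided PUT: write a column on image p

      call sync_all()
    end program caf_basics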

Rice CAF Compiler
  • Source-to-source code generation
  • Open source compiler
  • Built on the Open64/SL infrastructure
  • Support for core language features
  • Code generation
  • library-based communication: portable ARMCI and
    GASNet communication libraries, with the CHASM
    library for array descriptors
  • load/store communication on shared-memory
    platforms
  • Operating systems
  • Linux IA64/IA32
  • Alpha Tru64
  • SGI IRIX64
  • Interconnects / Platforms
  • Quadrics QSNet (Elan 3), QSNet II (Elan 4)
  • Myrinet 2000
  • Ethernet
  • SGI Altix 3000, SGI Origin 2000

CAF Model Refinements
  • Point-to-point synchronization (see the sketch
    after this list)
  • sync_notify(p)
  • sync_wait(p)
  • Less restrictive memory fences at call site
  • Collective operations
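
A hedged sketch of how these point-to-point primitives can replace a barrier in a nearest-neighbor handoff; the subroutine, buffer names, and sizes are invented for illustration:

    subroutine shift_right(sendbuf, recvbuf, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: sendbuf(n)
      real                :: recvbuf(n)[*]
      integer :: me, left, right

      me    = this_image()
      left  = me - 1
      right = me + 1

      if (right <= num_images()) then
         recvbuf(:)[right] = sendbuf(:)  ! PUT into the right neighbor's buffer
         call sync_notify(right)         ! signal that the data has been sent
      end if
      if (left >= 1) then
         call sync_wait(left)            ! wait only for the left neighbor,
      end if                             ! rather than for all images
    end subroutine shift_right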

Programming Ultra-scale Parallel Systems
  • Challenges
  • High-performance and good scalability
  • Programmer productivity
  • CAF: a promising near-term alternative
  • As expressive as MPI
  • Simpler to program than MPI
  • More amenable to compiler optimizations
  • User has control over performance-critical
    factors
  • MPI: a library-based parallel programming model
  • Portable and widely used
  • The developer has explicit control over data and
    communication placement
  • Difficult and error-prone to program
  • Most of the burden of communication
    optimization falls on application developers;
    compiler support is underutilized

Current Optimizations
  • Procedure Splitting
  • Hints for non-blocking communication
  • Library-based and load/store communication
  • Packing of strided communication
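
A rough illustration of the packing idea that this optimization automates; the staging buffer, subroutine, and target image are hypothetical:

    subroutine put_packed_row(a, buf, n, p)
      implicit none
      integer, intent(in) :: n, p
      real :: a(n,n)[*]
      real :: buf(n)[*]          ! contiguous staging co-array

      ! Unpacked, a(1,:)[p] = a(1,:) moves a strided row, which a
      ! communication library may break into n small transfers.
      buf(:)    = a(1,:)         ! pack the strided row into contiguous storage
      buf(:)[p] = buf(:)         ! one contiguous PUT to image p
      call sync_notify(p)        ! image p unpacks after a matching sync_wait
    end subroutine put_packed_row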

Planned Optimizations
  • Communication vectorization and aggregation (see
    the sketch after this list)
  • Synchronization strength-reduction
  • Automatic split-phase communication
  • Platform-driven communication optimizations
  • transform communication from one-sided into
    two-sided and collective, if useful
  • multi-model code for hierarchical architectures
  • convert GETs into PUTs
  • Multi-buffer co-arrays for asynchrony tolerance
  • Employ virtualization for latency tolerance
  • Interoperability with other programming models
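
A hedged before/after sketch of what communication vectorization would do to an element-wise remote read; the subroutine and shapes are illustrative, and in generated code only one of the two forms would remain:

    subroutine fetch_remote(a, b, n, p)
      implicit none
      integer, intent(in) :: n, p
      real :: a(n)[*]
      real :: b(n)
      integer :: i

      ! Before: one fine-grained GET per loop iteration
      do i = 1, n
         b(i) = a(i)[p]
      end do

      ! After vectorization: the whole section moves in a single GET
      b(1:n) = a(1:n)[p]
    end subroutine fetch_remote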

CAF Applications and Benchmarks
  • Sweep3D: wave-front parallelism
  • Spark98: sparse matrix-vector multiply
  • NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU
  • Random Access, STREAM

[Performance figures: Sweep3D neutron transport problem; Spark98 San
Fernando Valley earthquake simulation; computational fluid dynamics on
cluster platforms and on the SGI Altix 3000; NAS BT class C and NAS MG
class C on Itanium2+Myrinet 2000; Spark98 on SGI Altix 3000; Sweep3D
150^3 on Itanium2+Quadrics; mesh illustration]
  • Sparse matrix-vector multiply (sf2 traces)
  • Performance of all CAF versions is comparable to
    that of MPI, and better at large CPU counts
  • The CAF GET-based version is simpler and more
    natural to code, but up to 13% slower (see the
    sketch below)
  • Without considering locality, applications do
    not scale on NUMA architectures (Hybrid)
  • The ARMCI library is more efficient than MPI
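
A small sketch of the difference between the two codings; the subroutines, arrays, and src/dst images are illustrative, not the actual Spark98 kernels:

    ! GET style: each image pulls the data it needs; simple and natural,
    ! but the transfer latency sits on the reader's critical path.
    subroutine exchange_get(x, y, n, src)
      implicit none
      integer, intent(in) :: n, src
      real :: x(n)[*]
      real :: y(n)
      y(:) = x(:)[src]
    end subroutine exchange_get

    ! PUT style: each image pushes data to its consumer, which overlaps
    ! more easily with local computation but takes more bookkeeping.
    subroutine exchange_put(x, y, n, dst)
      implicit none
      integer, intent(in) :: n, dst
      real :: x(n)
      real :: y(n)[*]
      y(:)[dst] = x(:)
    end subroutine exchange_put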

[Performance figures: Sweep3D 150^3 on Itanium2+Myrinet; Sweep3D 150^3 on
SGI Altix 3000; NAS BT class B on SGI Altix 3000; NAS MG class C on SGI
Altix 3000; partitioned mesh]