1
Experiences Building a Multi-platform Compiler
for Co-array Fortran
  • John Mellor-Crummey
  • Cristian Coarfa, Yuri Dotsenko
  • Department of Computer Science
  • Rice University

AHPCRC PGAS Workshop, September 2005
2
Goals for HPC Languages
  • Expressiveness
  • Ease of programming
  • Portable performance
  • Ubiquitous availability

3
PGAS Languages
  • Global address space programming model
  • one-sided communication (GET/PUT)
  • Programmer has control over performance-critical
    factors
  • data distribution and locality control
  • computation partitioning
  • communication placement
  • Data movement and synchronization as language
    primitives
  • amenable to compiler-based communication
    optimization

Simpler than message passing
4
Co-array Fortran Programming Model
  • SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
  • Both private and shared data
  • real :: x(20,20) declares a private 20x20 array in each image
  • real :: y(20,20)[*] declares a shared 20x20 co-array in each image
  • Simple one-sided shared-memory communication
  • x(:,j:j+2) = y(:,p:p+2)[r] copies columns from image r into local
    columns (see the sketch below)
  • Synchronization intrinsic functions
  • sync_all a barrier and a memory fence
  • sync_mem a memory fence
  • sync_team(team members to notify, team members
    to wait for)
  • Pointers and (perhaps asymmetric) dynamic
    allocation
  • Parallel I/O
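
The declarations and the one-sided copy above can be assembled into a
minimal complete CAF program. This is an illustrative sketch, not code
from the slides; the program name, column indices, and initialization
are made up.

      program caf_sketch               ! illustrative sketch only
        real :: x(20,20)               ! private: an independent 20x20 array in each image
        real :: y(20,20)[*]            ! co-array: a 20x20 array in each image, remotely addressable
        integer :: j, p, r
        j = 1; p = 1; r = 1            ! example column indices and source image
        y = real(this_image())         ! each image fills its own copy of y
        call sync_all()                ! barrier + memory fence: every image's y is ready
        x(:,j:j+2) = y(:,p:p+2)[r]     ! one-sided GET: copy three columns from image r
        call sync_all()
      end program caf_sketch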

5
One-sided Communication with Co-Arrays
(figure: co-arrays spanning images 1 through N, illustrating one-sided GET/PUT between images)
6
CAF Compilers
  • Cray compilers for the X1 and T3E architectures
  • Rice Co-Array Fortran Compiler (cafc)

7
Rice cafc Compiler
  • Source-to-source compiler
  • source-to-source yields multi-platform
    portability
  • Implements core language features
  • core sufficient for non-trivial codes
  • preliminary support for derived types
  • soon support for allocatable components
  • Open source

Performance comparable to that of hand-tuned MPI
codes
8
Implementation Strategy
  • Goals
  • portability
  • high performance on a wide range of platforms
  • Approach
  • source-to-source compilation of CAF codes
  • use Open64/SL Fortran 90 infrastructure
  • CAF is translated to Fortran 90 plus communication operations
  • communication
  • ARMCI and GASNet one-sided comm libraries for
    portability
  • load/store communication on shared-memory
    platforms

9
Key Implementation Concerns
  • Fast access to local co-array data
  • Fast communication
  • Overlap of communication and computation

10
Accessing Co-Array Data
  • Two Representations
  • SAVE and COMMON co-arrays as Fortran 90 pointers
  • F90 pointers to memory allocated outside Fortran
    run-time system
  • original references accessing local co-array data
  • rhs(1,i,j,k,c) = u(1,i-1,j,k,c) - ...
  • transformed references
  • rhs%ptr(1,i,j,k,c) = u%ptr(1,i-1,j,k,c) - ...
  • Procedure co-array arguments as F90
    explicit-shape arrays
  • CAF language requires explicit shape for co-array
    arguments

real :: a(10,10,10)[*]          ! co-array declaration

type CAFDesc_real_3             ! pointer-based representation used by cafc
  real, pointer :: ptr(:,:,:)   ! F90 pointer to local co-array data
end type CAFDesc_real_3
type(CAFDesc_real_3) :: a
11
Performance Challenges
  • Problem
  • Fortran 90 pointer-based representation does not
    convey
  • the lack of co-array aliasing
  • contiguity of co-array data
  • co-array bounds information
  • lack of knowledge inhibits important code
    optimizations
  • Approach: procedure splitting

12
Procedure Splitting
! original CAF code
subroutine f()
  real, save :: c(100)[*]
  ...
  c(50) = ...
end subroutine f

! after procedure splitting (a CAF-to-CAF optimization)
subroutine f()
  real, save :: c(100)[*]
  interface
    subroutine f_inner(..., c_arg)
      real :: c_arg(100)[*]
    end subroutine f_inner
  end interface
  call f_inner(..., c(1))
end subroutine f
subroutine f_inner(..., c_arg)
  real :: c_arg(100)[*]
  ...
  c_arg(50) = ...
end subroutine f_inner
  • Benefits
  • better alias analysis
  • contiguity of co-array data
  • co-array bounds information
  • better dependence analysis

result: the back-end compiler can generate better code
13
Implementing Communication
  • x(1:n) = a(1:n)[p]
  • General approach: use a temporary buffer to hold the off-processor
    data (sketched below)
  • allocate buffer
  • perform GET to fill buffer
  • perform the computation x(1:n) = buffer(1:n)
  • deallocate buffer
  • Optimizations
  • no buffer for co-array to co-array copies
  • unbuffered load/store on shared-memory systems
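
A conceptual sketch of this buffering strategy, written as CAF-level
pseudo-generated code rather than cafc's actual output (which calls the
communication library directly); the names and sizes are illustrative.

      program buffered_get_sketch      ! illustrative sketch only
        integer, parameter :: n = 100
        real :: x(n)                   ! private destination
        real :: a(n)[*]                ! co-array source
        real, allocatable :: buffer(:)
        integer :: p
        p = 1
        a = real(this_image())
        call sync_all()
        ! conceptual expansion of the statement  x(1:n) = a(1:n)[p]
        allocate(buffer(n))            ! temporary to hold the off-processor data
        buffer(1:n) = a(1:n)[p]        ! the runtime performs a GET from image p into the buffer
        x(1:n) = buffer(1:n)           ! the assignment now touches only local data
        deallocate(buffer)
        call sync_all()
      end program buffered_get_sketch

For co-array-to-co-array copies the buffer is elided, and on
shared-memory platforms the reference becomes direct loads and stores,
as the optimization bullets above note.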

14
Strided vs. Contiguous Transfers
  • Problem
  • CAF remote reference might induce many small data
    transfers
  • a(i,1:n)[p] = b(j,1:n)
  • Solution
  • pack strided data on source and unpack it on
    destination
  • Constraints
  • can't express both source-level packing and unpacking for a
    one-sided transfer (see the sketch below)
  • two-sided packing/unpacking is awkward for users
  • Preferred approach
  • have communication layer perform packing/unpacking
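
To illustrate the constraint above, here is a sketch of a source-level
attempt at packing for the strided PUT a(i,1:n)[p] = b(j,1:n); the
staging co-array rbuf and the image choices are invented for
illustration.

      program packing_sketch           ! illustrative sketch only; run with at least 2 images
        integer, parameter :: n = 64
        real :: a(n,n)[*], b(n,n)
        real :: rbuf(n)[*]             ! hypothetical contiguous staging co-array
        real :: tmp(n)
        integer :: i, j, p
        i = 1; j = 1; p = 1
        b = 1.0
        call sync_all()
        if (this_image() == 2) then
          tmp(1:n) = b(j,1:n)          ! pack the strided source row into a contiguous buffer
          rbuf(1:n)[p] = tmp(1:n)      ! one contiguous PUT to image p
        end if
        ! the matching unpack  a(i,1:n) = rbuf(1:n)  must run on image p, and a
        ! one-sided PUT alone cannot trigger it; two-sided coordination is needed
        call sync_all()
        if (this_image() == p) a(i,1:n) = rbuf(1:n)   ! unpack on the destination image
      end program packing_sketch

Triggering the remote unpack requires two-sided coordination (barriers
or notify/wait), which is the awkwardness noted above; packing inside
the communication layer keeps it hidden from the programmer.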

15
Pragmatics of Packing
  • Who should implement packing?
  • CAF programmer
  • difficult to program
  • CAF compiler
  • must convert PUTs into two-sided communication to
    unpack
  • difficult whole-program transformation
  • Communication library
  • most natural place
  • ARMCI currently performs packing on Myrinet (at
    least)

16
Synchronization
  • Original CAF specification: team synchronization only
  • sync_all, sync_team
  • Limits performance on loosely-coupled
    architectures
  • Point-to-point extensions
  • sync_notify(q)
  • sync_wait(p)
  • Point-to-point synchronization semantics
  • delivery of a notify from p to q implies that all communication
    from p to q issued before the notify has also been delivered to q
    (see the sketch below)
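
A sketch of the producer/consumer pattern these primitives enable,
using the Rice sync_notify/sync_wait extensions named above; the image
numbers and data are illustrative.

      program notify_sketch            ! illustrative sketch only; run with at least 2 images
        integer, parameter :: n = 100
        real :: x(n)[*]
        integer :: p, q
        p = 1; q = 2
        if (this_image() == p) then
          x(1:n)[q] = 3.14             ! PUT data into image q's co-array
          call sync_notify(q)          ! tell q the data is on its way
        else if (this_image() == q) then
          call sync_wait(p)            ! delivery of p's notify implies p's earlier PUT has arrived
          print *, x(1)                ! safe to read the data from p
        end if
      end program notify_sketch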

17
Hiding Communication Latency
  • Goal: enable communication/computation overlap
  • Impediments to generating non-blocking
    communication
  • use of indexed subscripts in co-dimensions
  • lack of whole program analysis
  • Approach: support hints for non-blocking communication
  • overcome conservative compiler analysis
  • enable sophisticated programmers to achieve good
    performance today

18
Questions about PGAS Languages
  • Performance
  • can performance match that of hand-tuned message-passing programs?
  • what are the obstacles to top performance?
  • what should be done to overcome them?
  • language modifications or extensions?
  • program implementation strategies?
  • compiler technology?
  • run-time system enhancements?
  • Programmability
  • how easy is it to develop high performance
    programs?

19
Investigating these Issues
  • Evaluate CAF, UPC, and MPI versions of NAS
    benchmarks
  • Performance
  • compare CAF and UPC performance to that of MPI
    versions
  • use hardware performance counters to pinpoint
    differences
  • determine optimization techniques common to both languages as well
    as language-specific optimizations
  • language features
  • program implementation strategies
  • compiler optimizations
  • runtime optimizations
  • Programmability
  • assess programmability of the CAF and UPC variants

20
Platforms and Benchmarks
  • Platforms
  • Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  • Alpha + Quadrics QSNet I (1 GHz Alpha EV68CB)
  • SGI Altix 3000 (1.5 GHz Itanium2)
  • SGI Origin 2000 (R10000)
  • Codes
  • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
  • MG, CG, SP, BT
  • CAF and UPC versions were derived from the Fortran77+MPI versions

21
MG class A (256³) on Itanium2 + Myrinet 2000
Higher is better
22
MG class C (512³) on SGI Altix 3000
Fortran compiler linearized array subscripts: 30% slowdown compared to multidimensional subscripts
Higher is better
23
MG class B (256³) on SGI Origin 2000
Higher is better
24
CG class C (150000) on SGI Altix 3000
Higher is better
25
CG class B (75000) on SGI Origin 2000
Higher is better
26
SP class C (162³) on Itanium2 + Myrinet 2000
Higher is better
27
SP class C (162³) on Alpha + Quadrics
Higher is better
28
BT class C (162³) on Itanium2 + Myrinet 2000
Higher is better
29
BT class B (102³) on SGI Altix 3000
Higher is better
30
Performance Observations
  • Achieving highest performance can be difficult
  • need effective optimizing compilers for PGAS
    languages
  • Communication layer is not the problem
  • CAF with ARMCI or GASNet yields equivalent
    performance
  • Scalar code optimization of scientific code is
    the key!
  • SP and BT: SGI Fortran unroll-and-jam, software pipelining
  • MG: SGI Fortran loop alignment and fusion
  • CG: Intel Fortran optimized sum reduction
  • Linearized subscripts for multidimensional arrays hurt! (see the
    sketch below)
  • measured a 30% performance gap with Intel Fortran
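
To make the last point concrete, a sketch of the two subscript forms
for the same array element; the array names and extents here are made
up, not taken from the benchmarks.

      program subscripts_sketch              ! illustrative comparison only
        real :: u4(5,64,64,64)               ! multidimensional representation
        real :: u1(5*64*64*64)               ! linearized representation
        integer :: m, i, j, k
        m = 1; i = 2; j = 3; k = 4
        ! multidimensional subscripts: bounds and strides are explicit to the compiler
        u4(m,i,j,k) = 0.0
        ! linearized subscripts: same element, but the back-end compiler must recover
        ! the access pattern from the index arithmetic
        u1(m + 5*((i-1) + 64*((j-1) + 64*(k-1)))) = 0.0
      end program subscripts_sketch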

31
Performance Prescriptions
  • For portable high performance, we need
  • Better language support for CAF synchronization
  • point-to-point synchronization is an important
    common case!
  • currently only a Rice extension outside the CAF
    standard
  • Better CAF and UPC compiler support
  • communication vectorization (see the sketch below)
  • synchronization strength reduction important for
    programmability
  • Compiler optimization of loops with complex
    dependences
  • Better run-time library support
  • efficient communication support for strided array
    sections
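
As an illustration of communication vectorization, a sketch of the
transformation on a made-up loop; the compiler support called for above
would perform this rewrite automatically.

      program comm_vectorization_sketch      ! illustrative sketch only
        integer, parameter :: n = 100
        real :: x(n)
        real :: y(n)[*]
        integer :: k, p
        p = 1
        y = real(this_image())
        call sync_all()
        ! before: one fine-grain GET per iteration
        do k = 1, n
          x(k) = y(k)[p]
        end do
        ! after communication vectorization: a single bulk GET
        x(1:n) = y(1:n)[p]
        call sync_all()
      end program comm_vectorization_sketch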

32
Programmability Observations
  • Matching MPI performance required using bulk
    communication
  • communicating multi-dimensional array sections is
    natural in CAF
  • library-based primitives are cumbersome in UPC
  • Strided communication is problematic for
    performance
  • tedious programming of packing/unpacking at the source level
  • Wavefront computations
  • MPI: buffered communication easily decouples sender and receiver
  • PGAS models: buffering must be explicitly managed by the programmer