Experiences with Sweep3D Implementations in Coarray Fortran presentation

1
Experiences with Sweep3D Implementations in
Co-array Fortran

Cristian Coarfa Yuri Dotsenko
John Mellor-Crummey
Department of Computer Science
Rice University
Houston, TX USA

2
Motivation

Parallel Programming Models
MPI: de facto standard
difficult to program
OpenMP: inefficient to map onto distributed-memory platforms
lack of locality control
HPF: hard to obtain high performance
heroic compilers needed!
An appealing middle ground: global address space languages
CAF, Titanium, UPC

Evaluate CAF for an application with
sophisticated parallelization: Sweep3D
3
Co-Array Fortran

Global address space programming model
one-sided communication (GET/PUT)
Programmer has control over performance-critical
factors
data distribution
computation partitioning
communication placement
Data movement and synchronization as language
primitives
amenable to compiler-based communication
optimization

4
CAF Programming Model Features

SPMD process images
fixed number of images during execution
images operate asynchronously
Both private and shared data
real x(20,20) ! a private 20x20 array in each image
real y(20,20)[*] ! a shared 20x20 array in each image
Simple one-sided shared-memory communication
x(:,j:j+2) = y(:,p:p+2)[r] ! copy columns from image r into local columns
Synchronization intrinsic functions
sync_all: a barrier and a memory fence
sync_mem: a memory fence
sync_team(team members to notify, team members to wait for)
Pointers and (perhaps asymmetric) dynamic
allocation
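
The one-sided column copy on this slide can be mimicked outside CAF. The following is a minimal Python sketch, not CAF and not how cafc implements anything: each image is modeled as a dictionary holding its own copy of y, and make_images, get_columns and put_into_local are hypothetical helper names. Indices are 0-based here, unlike Fortran.

```python
# Toy model of CAF images: every image owns its own copy of the co-array y.
# All names here (make_images, get_columns, put_into_local) are illustrative
# assumptions, not part of CAF or cafc.

def make_images(num_images, rows, cols):
    """Create one 'y' per image; each value encodes (image, row, col)."""
    return [
        {"y": [[img * 10000 + r * 100 + c for c in range(cols)]
               for r in range(rows)]}
        for img in range(num_images)
    ]

def get_columns(images, r, p, width):
    """One-sided read: fetch columns p..p+width-1 of y from image r.

    Image r takes no action; the caller alone decides what is read.
    """
    remote_y = images[r]["y"]
    return [row[p:p + width] for row in remote_y]

def put_into_local(x, j, block):
    """Store the fetched block into local columns j..j+width-1 of x."""
    for row, new_vals in zip(x, block):
        row[j:j + len(new_vals)] = new_vals

images = make_images(num_images=4, rows=20, cols=20)
x = [[0] * 20 for _ in range(20)]      # private array on the calling image
block = get_columns(images, r=2, p=5, width=3)
put_into_local(x, j=0, block=block)
```

The point of the sketch is that only the reading side names the remote image; nothing runs on image 2 to serve the request.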

5
One-sided Communication with Co-Arrays
[Figure: images 1, 2, ..., N shown before and after each image copies from its left neighbor]
6
Outline

CAF programming model
cafc
Sweep3D implementations in CAF
Experimental evaluation
Conclusions

7
Rice Co-Array Fortran Compiler (cafc)

First CAF multi-platform compiler
previous compiler only for Cray shared memory
systems
Implements core of the language
currently lacks support for derived type and
dynamic co-arrays
Core sufficient for non-trivial codes
Performance comparable to that of hand-tuned MPI
codes
Open source

8
cafc Implementation Strategy

Goals
portability
high performance on a wide range of platforms

Source-to-source compilation of CAF codes
uses Open64/SL Fortran 90 infrastructure
translates CAF into Fortran 90 plus communication operations
Communication
ARMCI library for one-sided communication on
clusters (PNNL)
load/store communication on shared-memory
platforms

9
Synchronization

Original CAF specification: team synchronization
only
sync_all, sync_team
Limits performance on loosely-coupled
architectures
Point-to-point extensions
sync_notify(q)
sync_wait(p)

Point to point
synchronization semantics
Delivery of a notify to q from p implies that
all communication from p to q issued before the
notify has been delivered to q
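
The notify/wait semantics can be illustrated with a small Python sketch using threads. This is an assumption-laden toy, not the cafc or ARMCI implementation: the Image class and the Python functions sync_notify and sync_wait below are hypothetical stand-ins that borrow the CAF names, and a counting semaphore per sender plays the role of the notify.

```python
import threading

class Image:
    """Toy image: a co-array buffer plus one notify counter per sender."""
    def __init__(self):
        self.buffer = None
        self._notifies = {}              # sender id -> semaphore
        self._lock = threading.Lock()

    def _sem(self, sender):
        with self._lock:
            return self._notifies.setdefault(sender, threading.Semaphore(0))

def sync_notify(images, me, q):
    """Tell image q that everything 'me' wrote to q so far is in place."""
    images[q]._sem(me).release()

def sync_wait(images, me, p):
    """Block until a matching notify from image p has arrived."""
    images[me]._sem(p).acquire()

def run():
    images = [Image(), Image()]
    seen = []

    def image_p():                           # image 0 plays the role of p
        images[1].buffer = [1.0, 2.0, 3.0]   # PUT into q's buffer
        sync_notify(images, me=0, q=1)       # issued after the PUT

    def image_q():                           # image 1 plays the role of q
        sync_wait(images, me=1, p=0)         # returns only after the notify
        seen.append(list(images[1].buffer))

    tq = threading.Thread(target=image_q)
    tp = threading.Thread(target=image_p)
    tq.start()
    tp.start()
    tq.join()
    tp.join()
    return seen[0]
```

Because q reads its buffer only after sync_wait returns, it always sees the data p wrote before the notify, whichever thread happens to run first.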

10
CAF Compiler Targets (Oct 2004)

Processors
Pentium, Alpha, Itanium2, MIPS
Interconnects
Quadrics, Myrinet, Gigabit Ethernet, shared
memory
Operating systems
Linux, Tru64, IRIX

11
Outline

CAF programming model
cafc
Sweep3D implementations
Original MPI implementation
CAF versions
Communication microbenchmark
Experimental evaluation
Conclusions

12
Sweep3D

Core of an ASCI application
Solves a
one-group
time-independent
discrete ordinates (Sn)
3D Cartesian (XYZ) geometry
neutron transport problem
Deterministic particle transport accounts for
50-80% of the execution time of many realistic DOE
simulations

13
Sweep3D Parallelization
2D spatial domain decomposition onto a 2D
processor array
14
Sweep3D Parallelization
Wavefront parallelism
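
As a rough illustration of the wavefront idea, the Python sketch below groups the processors of a 2D grid by the step at which a sweep reaches them. It assumes the sweep enters at corner (0, 0) and that each processor needs input from its two upstream neighbors; the function name wavefronts and the grid sizes are made up for the example.

```python
def wavefronts(npe_i, npe_j):
    """Group processors of an npe_i x npe_j grid by the step at which a
    sweep starting in corner (0, 0) reaches them."""
    steps = {}
    for i in range(npe_i):
        for j in range(npe_j):
            # A processor can start once both upstream neighbors are done,
            # which happens i + j steps after the corner starts.
            steps.setdefault(i + j, []).append((i, j))
    return [steps[s] for s in range(npe_i + npe_j - 1)]

for step, procs in enumerate(wavefronts(3, 3)):
    print(step, procs)
```

All processors on the same anti-diagonal work concurrently, which is why the active set grows, peaks, and shrinks as the sweep crosses the grid.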
25
Sweep3D Kernel Pseudocode
do iq = 1, 8
  do mo = 1, mmo
    do kk = 1, kb
      recv e/w into Phiib
      recv n/s into Phijb
      ...
      ! heavy computation with use/update of Phiib and Phijb
      ...
      send e/w Phiib
      send n/s Phijb
    enddo
  enddo
enddo
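
To see how the kk loop pipelines work across the processor grid, here is a hedged Python sketch of an idealized schedule for a single (iq, mo) sweep. It assumes every block costs one time step, messages are free, and the sweep enters at processor (0, 0); pipeline_schedule and makespan are illustrative names, not part of Sweep3D.

```python
def pipeline_schedule(npe_i, npe_j, kb):
    """Earliest unit-time step at which processor (i, j) can work on block kk.

    Assumes every block costs one step, messages are free, and the sweep
    enters at processor (0, 0).
    """
    start = {}
    for kk in range(kb):
        for i in range(npe_i):
            for j in range(npe_j):
                deps = []
                if i > 0:
                    deps.append(start[(i - 1, j, kk)] + 1)   # e/w input
                if j > 0:
                    deps.append(start[(i, j - 1, kk)] + 1)   # n/s input
                if kk > 0:
                    deps.append(start[(i, j, kk - 1)] + 1)   # own previous block
                start[(i, j, kk)] = max(deps, default=0)
    return start

def makespan(npe_i, npe_j, kb):
    """Number of steps until the last block on the last processor is done."""
    return max(pipeline_schedule(npe_i, npe_j, kb).values()) + 1
```

Under these assumptions processor (i, j) starts block kk at step i + j + kk, so one sweep takes npe_i + npe_j + kb - 2 steps; more blocks per sweep amortize the pipeline fill and drain.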
29
Initial Sweep3D CAF Implementation

Based on the MPI implementation
Maintain original computation
Convert communication buffers into co-arrays
Fundamental issue: converting from two-sided
communication into one-sided communication

30
2-sided vs 1-sided Communication
[Figure, built up over slides 30-37: 2-sided communication pairs MPI_Send with MPI_Recv; the 1-sided diagram is labeled sync_notify, sync_wait, PUT, sync_notify, sync_wait]
38
CAF Implementation Issues

Synchronization necessary to avoid data races
might lead to inefficiency
Using multiple communication buffers enables
overlap of synchronization with computation

39
One- vs. Two-buffer Communication
[Figure: one-buffer communication versus two-buffer communication between a source and a destination image]
40
Asynchrony-tolerant CAF Implementation of Sweep3D

Multiple-versioned communication buffers
Benefits
Overlap PUT with computation on destination
Overlap of synchronization with computation on
source
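
The benefit of multi-versioned buffers can be explored with a toy timing model in Python. This is a sketch under stated assumptions, not a measurement or the authors' code: PUTs are instantaneous, the sender computes message k and then blocks until a buffer version is free, and the receiver frees a version only after finishing its computation on it. The function name finish_time and the work lists are invented for the example.

```python
def finish_time(work_s, work_r, nbuf):
    """Time at which the receiver finishes the last message.

    work_s[k]: sender's compute time before message k is ready.
    work_r[k]: receiver's compute time on message k.
    nbuf:      number of buffer versions on the receiver.
    """
    send_done = []   # time message k has been PUT into a buffer version
    recv_done = []   # time the receiver is finished with message k
    for k in range(len(work_s)):
        ready = (send_done[k - 1] if k > 0 else 0) + work_s[k]
        # Version k % nbuf is reusable once message k - nbuf is consumed.
        slot_free = recv_done[k - nbuf] if k >= nbuf else 0
        send_done.append(max(ready, slot_free))
        prev = recv_done[k - 1] if k > 0 else 0
        recv_done.append(max(send_done[k], prev) + work_r[k])
    return recv_done[-1]

work_s = [1, 1, 4, 1]   # sender is slow on its third step
work_r = [4, 1, 1, 1]   # receiver is slow on its first step
one_buffer = finish_time(work_s, work_r, nbuf=1)
two_buffers = finish_time(work_s, work_r, nbuf=2)
```

With one buffer the sender stalls behind the receiver's slow first step and then pays for its own slow step afterwards; with two versions it runs ahead, so the two delays overlap instead of adding up.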

41
Three-buffer Communication
42
Communication Throughput Microbenchmark

MPI implementation: blocking send and receive
CAF: one-version buffer
CAF: multi-versioned buffers
ARMCI implementation: one buffer

43
Outline

CAF programming model
cafc
Sweep3D implementations
Experimental evaluation
Conclusions

44
Experimental Evaluation

Platforms
Itanium2+Quadrics QSNet II (Elan4)
SGI Altix 3000
Itanium2+Myrinet 2000
Alpha+Quadrics QSNet (Elan3)
Problem sizes
50x50x50
150x150x150
300x300x300

45
Itanium2 Quadrics, Size 50x50x50
46
Itanium2 Quadrics, Size 150x150x150
47
Itanium2 Quadrics, Size 300x300x300

multi-version buffers improve performance of
CAF codes by 15%
imperative to use non-blocking notifies

48
Itanium2+Quadrics, Communication Throughput
Microbenchmark

multi-version buffers improve throughput
by 30% for messages up to 8KB
by 10% for messages larger than 8KB
overhead of the CAF translation is acceptable

49
SGI Altix 3000, Size 50x50x50
50
SGI Altix 3000, Size 150x150x150

multi-version buffers are effective for
asynchrony-tolerance

51
SGI Altix 3000, Size 300x300x300

both CAF implementations outperform MPI

52
SGI Altix 3000, Communication Throughput
Microbenchmark
Warm cache

ARMCI library effectively exploits
the hardware support for efficient
data movement
MPI performs extra data copies

53
Summary of results

MPI buffering for small messages helps latency and
asynchrony tolerance
CAF multi-version buffers improve performance of
one-sided communication for wavefront
computations
enable PUT and receiver's computation to overlap
asynchrony tolerance between sender and receiver
Non-blocking notifies are important for
performance
enable synchronization to overlap with
computation
Platform results
CAF outperforms MPI for large problem sizes by
10% on Itanium2+Quadrics, Myrinet, Altix
CAF 16% slower on Alpha+Quadrics (Elan3)
ARMCI lacks non-blocking notifies on Elan3

54
Enhancing CAF Usability

CAF vs MPI usability
easier to use than MPI for simple parallel
programs
as difficult for carefully-tuned parallel codes
Improving CAF ease of use
compiler support for managing multi-version
communication buffers
vectorizing fine-grain communication to best
support X1 and cluster platforms with same code

http://www.hipersoft.rice.edu/caf
56
Implementing Communication

x(1:n) = a(1:n)[p]
Use a temporary buffer to hold off-processor data
allocate buffer
perform GET to fill buffer
perform computation: x(1:n) = buffer(1:n)
deallocate buffer
Optimizations
no temporary storage for co-array to co-array
copies
load/store communication on shared-memory systems
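
The buffer-based translation listed on this slide can be sketched in Python as follows. This is only a model of the listed steps, not cafc's generated code: remote_get is a hypothetical stand-in for a one-sided GET, images are plain dictionaries, and indices are 0-based with an exclusive upper bound.

```python
def remote_get(images, p, name, lo, hi):
    """Hypothetical stand-in for a one-sided GET of name(lo:hi) on image p."""
    return list(images[p][name][lo:hi])

def assign_from_remote(images, me, p, n):
    """Model of x(1:n) = a(1:n)[p] executed on image `me`."""
    buffer = remote_get(images, p, "a", 0, n)   # allocate buffer, GET fills it
    images[me]["x"][0:n] = buffer               # local computation on the buffer
    del buffer                                  # deallocate buffer

images = [
    {"x": [0, 0, 0, 0], "a": [1, 2, 3, 4]},
    {"x": [0, 0, 0, 0], "a": [10, 20, 30, 40]},
]
assign_from_remote(images, me=0, p=1, n=3)
```

On a shared-memory system the temporary would be unnecessary, which is the load/store optimization the slide mentions.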

57
Detailed Results

Itanium2+Quadrics (Elan4)
similar for 50^3, 9% better for 150^3 and 300^3
Alpha+Quadrics (Elan3)
8% better for 50^3, 16% lower for 150^3 and similar for 300^3
ARMCI lacks non-blocking notifies on Elan3
SGI Altix 3000
comparable for 50^3 and 150^3, 10% better for 300^3
Itanium2+Myrinet
similar for 50^3, 12% better for 150^3 and 9% better for 300^3

58
SGI Altix 3000, communication throughput
microbenchmark
Warm cache
Cold cache
60
Asynchrony-tolerant CAF Implementation
sync_notify
sync_notify