Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems

Transcript and Presenter's Notes
1
Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems
  • University of Tsukuba
  • HPCS lab
  • Jinpil Lee, Mitsuhisa Sato

2
Background
  • HPC with parallel platforms
  • existing parallel programming models
  • shared memory
  • OpenMP: adding directives to the serial code
  • easy to program
  • distributed memory
  • MPI (Message Passing Interface)
  • describes internode communication explicitly
  • high performance
  • low productivity
  • Another programming model for distributed memory
    is needed

3
Research Objective
  • New programming model for distributed memory
  • high productivity
  • performance equivalent to MPI and other PGAS
    languages
  • XcalableMP
  • proposed by the XcalableMP Specification WG
  • language extension of C and Fortran
  • OpenMP-like directives (adopting HPF and CAF features)
  • What we have done in this work
  • implemented the C version of the compiler
  • evaluated it using the HPC Challenge Benchmarks

4
Overview of XcalableMP
  • C- and Fortran-based language extension
  • Execution model
  • SPMD (Single Program Multiple Data)
  • starts with a single thread per process
  • Explicit parallelism
  • no virtual shared memory
  • no automatic communication (RMA, sync)
  • Two paradigms in one language
  • global view model
  • OpenMP-like, provides directives for distributed
    memory
  • incremental parallelization from the serial code
  • local view model
  • describes communication using a language extension
    (co-array)
  • supports PGAS

5
Global View Model
  • Directive-based Model
  • OpenMP-like directives describing data/task
    parallelism
  • explicit parallelism

6
Local View Model
  • Data distribution and work-sharing are manual
  • One-sided communication supported by a language extension

    int array[YMAX/YPROC][XMAX];

    int main(int argc, char **argv)
    {
        int i, j, res = 0, res_local = 0;
        for (i = 0; i < YMAX/YPROC; i++)
            for (j = 0; j < XMAX; j++) {
                array[i][j] = func(i, j);
                res_local += array[i][j];
            }
        for (i = 1; i < YPROC; i++)
            res += res_local:[i];   /* co-array get of each node's partial sum */
    }

7
Related Works
name                      feature
High Performance Fortran  communication inserted by the compiler
VPP Fortran               explicit parallelization using directives
OpenMPD                   explicit parallelization using directives;
                          template, align, ...
(these relate to the XMP global view)

name                      feature
Unified Parallel C        global shared memory by runtime support
Co-Array Fortran          supports one-sided communication; co-array
(these relate to the XMP local view)
8
Data Distribution Using Template
  • Template
  • a virtual array representing the data (index) space
  • array distribution and work-sharing must be done
    using a template

Example)
#pragma xmp nodes p(4)                  // declare node set
#pragma xmp template t(0:99)            // declare template
#pragma xmp distribute t(BLOCK) onto p  // distribute template
#pragma xmp align array[i] with t(i)    // distribute array:
                                        // owner of t(i) has array[i]
9
Parallel Execution of Loops
  • #pragma xmp loop on template-ref
  • example)

    #pragma xmp distribute t(BLOCK) onto p
    #pragma xmp align a[i] with t(i)
    #pragma xmp loop on t(i)
    for (i = 2; i < 10; i++) a[i] = . . .

[Figure: distributed image of the array across NODE(1)-NODE(4);
 each element must be processed by its owner node]
10
Data Synchronization of Array (shadow)
  • Shadow region
  • in XMP, memory access is always local
  • duplicates overlapped data distributed onto other
    nodes
  • data is synchronized with the reflect directive

#pragma xmp shadow a[1:1]   // declare shadow

[Figure: array a[0..15] distributed in blocks of four onto
 NODE1-NODE4, with one-element shadow regions at the block
 boundaries]

#pragma xmp reflect a       // synchronize shadow
11
Internode Communication
  • broadcast
  • #pragma xmp bcast var on node-set from node
  • barrier synchronization
  • #pragma xmp barrier
  • reduction operation
  • #pragma xmp reduction (op:var)
  • data movement in the global view (next slide)
  • #pragma xmp gmove

12
gmove Directive
  • Data movement in the global view
  • collective communication
  • translated to message-passing communication
  • Ex) get data from a distributed array

    #pragma xmp gmove
    x = A[1];        // error without gmove

  • C extension: array section, ex) array[0:99]

[Figure: x = A[4] -- the element is fetched from the node
 that owns A[4] (nodes 0-3)]

  • Ex) all-gather

    #pragma xmp gmove
    L[0:N-1] = A[0][0:N-1];   // all-gather

[Figure: row 0 of the distributed A[N][N] gathered into L[N]
 on all nodes (nodes 0-3)]
13
Local View Model
  • Co-Array Fortran
  • describes one-sided communication
  • ex)  real, dimension(100) :: a[*]   ! co-array declaration
        . . .
        b(1) = a(1)[p]                 ! get from node p
  • XcalableMP co-array
  • XMP-Fortran: CAF compatible
  • XMP-C: coarray directive plus co-array statement
  • ex)  #pragma xmp coarray a[100]
        . . .
        b[1] = a[1]:[p];               /* get from node p */
  • usage: performance tuning,
    describing arbitrary communication patterns

14
Performance Evaluation
  • T2K-Tsukuba system (2 to 32 nodes)
  • Target: HPC Challenge Benchmarks
  • Stream, Linpack, FFT: global view
  • RandomAccess: local view

CPU      AMD Opteron quad-core 8000 series, 2.3 GHz x 4 sockets (16 cores)
MEM      32 GB
NETWORK  InfiniBand (x4 rails)
MPI lib  MVAPICH2 1.2
15
Parallelizing Stream
  • Straightforward implementation

    double a[SIZE], b[SIZE], c[SIZE];
    #pragma xmp nodes p(*)
    #pragma xmp template t(0:SIZE-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align [j] with t(j) :: a, b, c
    . . .
    #pragma xmp loop on t(j)
    for (j = 0; j < SIZE; j++)
        a[j] = b[j] + scalar * c[j];
    . . .
    #pragma xmp reduction(+:triadGBs)
16
Evaluation Result Stream
  • Lines of code: serial 86 → XMP 98

[Result graph: vector elements 256K, double]
17
Parallelizing RandomAccess
  • Co-array for RandomAccess

    #define SIZE (TABLE_SIZE/PROCS)
    u64Int Table[SIZE];
    #pragma xmp nodes p(PROCS)
    #pragma xmp coarray Table:[PROCS]
    . . .
    for (i = 0; i < SIZE; i++)
        Table[i] = b + i;
    . . .
    for (i = 0; i < NUPDATE; i++) {
        temp = (temp << 1) ^ ((s64Int)temp < 0 ? POLY : 0);
        Table[temp%SIZE]:[(temp%TABLE_SIZE)/SIZE] ^= temp;
    }
    #pragma xmp barrier
18
Evaluation Result RandomAccess
  • Lines of code: serial 73 → XMP 77

[Result graph: table size 128, double -- bad scalability!]
19
Parallelizing Linpack
  • 1-dimensional cyclic distribution
  • Selecting the pivot and exchanging it
  • needs internode communication
  • vector exchange using gmove

    /* in the dgefa function */
    #pragma xmp gmove
    pvt_v[k:n-1] = a[k:n-1][l];
    if (l != k) {
    #pragma xmp gmove
        a[k:n-1][l] = a[k:n-1][k];
    #pragma xmp gmove
        a[k:n-1][k] = pvt_v[k:n-1];
    }

[Figure: columns k and l exchanged via pvt_v]
20
Evaluation Result Linpack
  • Lines of code: serial 208 → XMP 243

[Result graph: matrix size 30K x 30K, double]
21
Parallelizing FFT
  • six-step FFT algorithm
  • needs three matrix transposes
  • 1-dimensional block distribution
  • the transpose needs internode communication
  • implemented with gmove

    for (i = 0; i < N1; i++)
        for (j = 0; j < N2; j++)
            b[i][j] = a[j][i];

[Figure: a[N2][N1] is redistributed into a_work[N2][N1] by
 "gmove a_work = a"; each node then fills its part of
 b[N1][N2] with a local copy]
22
Parallelizing FFT
  • XMP code

    #pragma xmp align a_work[i] with t1(i)
    #pragma xmp align a[i] with t2(i)
    #pragma xmp align b[i] with t1(i)
    . . .
    #pragma xmp gmove
    a_work[:] = a[:];                          // gmove
    #pragma xmp loop on t1(i)
    for (i = 0; i < N1; i++)
        for (j = 0; j < N2; j++)
            c_assgn(b[i][j], a_work[j][i]);    // local copy
    #pragma xmp loop on t1(i)
    for (i = 0; i < N1; i++)
        HPCC_fft235(b[i], work, w2, N2, ip2);  // 1-dim FFT
23
Evaluation Result FFT
  • Lines of code: serial 186 → XMP 217

[Result graph: matrix size 44K x 44K, double]
24
Parallelization for Multicore
  • XMP plus multi-threading for multicore clusters
  • Currently
  • flat MPI
  • OpenMP + MPI
  • XMP extension for multicore clusters

25
Conclusion
  • XcalableMP
  • parallel programming model for distributed memory
  • C and Fortran language extension
  • two programming models
  • global view: directive-based
  • local view: one-sided communication using co-arrays
  • easy to program, high performance
  • Evaluation with the HPCC Benchmarks
  • Stream, RandomAccess, Linpack, FFT