Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems

Transcript and Presenter's Notes
1
Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems
  • University of Tsukuba
  • HPCS lab
  • Jinpil Lee, Mitsuhisa Sato

2
Background
  • HPC with parallel platforms
  • existing parallel programming models
  • shared memory
  • OpenMP: adding directives to the serial code
  • easy to program
  • distributed memory
  • MPI (Message Passing Interface)
  • describes internode communication explicitly
  • high performance
  • low productivity
  • Another programming model for distributed memory
    is needed

3
Research Objective
  • New programming model for distributed memory
  • high productivity
  • performance equivalent to MPI and other PGAS
    languages
  • XcalableMP
  • proposed by the XcalableMP Specification WG
  • language extension of C and Fortran
  • OpenMP-like directives (adopting HPF and CAF features)
  • What we have done in this work
  • implemented the C version of the compiler
  • evaluated it using the HPC Challenge Benchmarks

4
Overview of XcalableMP
  • C- and Fortran-based language extension
  • Execution model
  • SPMD (Single Program Multiple Data)
  • starts with a single thread per process
  • Explicit parallelism
  • no virtual shared memory
  • no automatic communication (RMA, sync)
  • Two paradigms in one language
  • global view model
  • OpenMP-like, provides directives for distributed
    memory
  • incremental parallelization from the serial code
  • local view model
  • describes communication using a language extension
    (co-array)
  • supports PGAS

5
Global View Model
  • Directive-based Model
  • OpenMP-like directives describing data/task
    parallelism
  • explicit parallelism

6
Local View Model
  • Data distribution and work-sharing are manual
  • One-sided communication supported by a language extension

    int array[YMAX/YPROC][XMAX];

    int main(int argc, char **argv)
    {
        int i, j, res = 0, res_local = 0;
        for (i = 0; i < YMAX/YPROC; i++)
            for (j = 0; j < XMAX; j++) {
                array[i][j] = func(i, j);
                res_local += array[i][j];
            }
        for (i = 1; i < YPROC; i++)
            res += res_local:[i];   /* co-array get of each node's partial sum */
    }

7
Related Works
name                      feature
High Performance Fortran  communication inserted by the compiler
VPP Fortran               explicit parallelization using directives
OpenMPD                   explicit parallelization using directives;
                          template, align, ...
(these relate to the XMP global view)

name                      feature
Unified Parallel C        global shared memory by runtime support
Co-Array Fortran          supports one-sided communication; co-array
(these relate to the XMP local view)
8
Data Distribution Using Template
  • Template
  • a virtual array representing the data (index) space
  • array distribution and work-sharing must be done
    using a template

Example)
#pragma xmp nodes p(4)                  // declare node set
#pragma xmp template t(0:99)            // declare template
#pragma xmp distribute t(BLOCK) onto p  // distribute template
#pragma xmp align array[i] with t(i)    // distribute array:
                                        // owner of t(i) has array[i]
9
Parallel Execution of Loops
  • #pragma xmp loop on template-ref
  • example)

    #pragma xmp distribute t(BLOCK) onto p
    #pragma xmp align a[i] with t(i)
    #pragma xmp loop on t(i)
    for (i = 2; i < 10; i++) a[i] = . . .

[Figure: distributed image of the array across NODE(1)-NODE(4);
 each element must be processed by its owner node]
10
Data Synchronization of Array (shadow)
  • Shadow region
  • in XMP, memory access is always local
  • duplicates overlapped data distributed onto other
    nodes
  • data is synchronized with the reflect directive

#pragma xmp shadow a[1:1]   // declare shadow

[Figure: array a[0..15] distributed in blocks of four onto
 NODE1-NODE4, with one-element shadow regions at the block
 boundaries]

#pragma xmp reflect a       // synchronize shadow
11
Internode Communication
  • broadcast
  • #pragma xmp bcast var on node-set from node
  • barrier synchronization
  • #pragma xmp barrier
  • reduction operation
  • #pragma xmp reduction (op:var)
  • data movement in the global view (next slide)
  • #pragma xmp gmove

12
gmove Directive
  • Data movement in the global view
  • collective communication
  • translated to message-passing communication
  • Ex) get data from a distributed array

    #pragma xmp gmove
    x = A[1];        // error without gmove

  • C extension: array section, ex) array[0:99]

[Figure: x = A[4] -- the element is fetched from the node
 that owns A[4] (nodes 0-3)]

  • Ex) all-gather

    #pragma xmp gmove
    L[0:N-1] = A[0][0:N-1];   // all-gather

[Figure: row 0 of the distributed A[N][N] gathered into L[N]
 on all nodes (nodes 0-3)]
13
Local View Model
  • Co-Array Fortran
  • describes one-sided communication
  • ex)  real, dimension(100) :: a[*]   ! co-array declaration
        . . .
        b(1) = a(1)[p]                 ! get from node p
  • XcalableMP co-array
  • XMP-Fortran: CAF compatible
  • XMP-C: coarray directive plus co-array statement
  • ex)  #pragma xmp coarray a[100]
        . . .
        b[1] = a[1]:[p];               /* get from node p */
  • usage: performance tuning,
    describing arbitrary communication patterns

14
Performance Evaluation
  • T2K-Tsukuba system (2 to 32 nodes)
  • Target: HPC Challenge Benchmarks
  • Stream, Linpack, FFT: global view
  • RandomAccess: local view

CPU      AMD Opteron quad-core 8000 series, 2.3 GHz x 4 sockets (16 cores)
MEM      32 GB
NETWORK  InfiniBand (x4 rails)
MPI lib  MVAPICH2 1.2
15
Parallelizing Stream
  • Straightforward implementation

    double a[SIZE], b[SIZE], c[SIZE];
    #pragma xmp nodes p(*)
    #pragma xmp template t(0:SIZE-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align [j] with t(j) :: a, b, c
    . . .
    #pragma xmp loop on t(j)
    for (j = 0; j < SIZE; j++)
        a[j] = b[j] + scalar * c[j];
    . . .
    #pragma xmp reduction(+:triadGBs)
16
Evaluation Result Stream
  • Lines of code: serial 86 → XMP 98

[Result graph: vector elements 256K, double]
17
Parallelizing RandomAccess
  • Co-array for RandomAccess

    #define SIZE (TABLE_SIZE/PROCS)
    u64Int Table[SIZE];
    #pragma xmp nodes p(PROCS)
    #pragma xmp coarray Table:[PROCS]
    . . .
    for (i = 0; i < SIZE; i++)
        Table[i] = b + i;
    . . .
    for (i = 0; i < NUPDATE; i++) {
        temp = (temp << 1) ^ ((s64Int)temp < 0 ? POLY : 0);
        Table[temp%SIZE]:[(temp%TABLE_SIZE)/SIZE] ^= temp;
    }
    #pragma xmp barrier
18
Evaluation Result RandomAccess
  • Lines of code: serial 73 → XMP 77

[Result graph: table size 128, double -- bad scalability!]
19
Parallelizing Linpack
  • 1-dimensional cyclic distribution
  • Selecting the pivot and exchanging it
  • needs internode communication
  • vector exchange using gmove

    /* in the dgefa function */
    #pragma xmp gmove
    pvt_v[k:n-1] = a[k:n-1][l];
    if (l != k) {
    #pragma xmp gmove
        a[k:n-1][l] = a[k:n-1][k];
    #pragma xmp gmove
        a[k:n-1][k] = pvt_v[k:n-1];
    }

[Figure: columns k and l exchanged via pvt_v]
20
Evaluation Result Linpack
  • Lines of code: serial 208 → XMP 243

[Result graph: matrix size 30K x 30K, double]
21
Parallelizing FFT
  • six-step FFT algorithm
  • needs three matrix transposes
  • 1-dimensional block distribution
  • the transpose needs internode communication
  • implemented with gmove

    for (i = 0; i < N1; i++)
        for (j = 0; j < N2; j++)
            b[i][j] = a[j][i];

[Figure: a[N2][N1] is redistributed into a_work[N2][N1] by
 "gmove a_work = a"; each node then fills its part of
 b[N1][N2] with a local copy]
22
Parallelizing FFT
  • XMP code

    #pragma xmp align a_work[i] with t1(i)
    #pragma xmp align a[i] with t2(i)
    #pragma xmp align b[i] with t1(i)
    . . .
    #pragma xmp gmove
    a_work[:] = a[:];                          // gmove
    #pragma xmp loop on t1(i)
    for (i = 0; i < N1; i++)
        for (j = 0; j < N2; j++)
            c_assgn(b[i][j], a_work[j][i]);    // local copy
    #pragma xmp loop on t1(i)
    for (i = 0; i < N1; i++)
        HPCC_fft235(b[i], work, w2, N2, ip2);  // 1-dim FFT
23
Evaluation Result FFT
  • Lines of code: serial 186 → XMP 217

[Result graph: matrix size 44K x 44K, double]
24
Parallelization for Multicore
  • XMP plus multi-threading for multicore clusters
  • Currently
  • flat MPI
  • OpenMP + MPI
  • XMP extension for multicore clusters

25
Conclusion
  • XcalableMP
  • parallel programming model for distributed memory
  • C and Fortran language extension
  • two programming models
  • global view: directive-based
  • local view: one-sided communication using co-arrays
  • easy to program, high performance
  • Evaluation with the HPCC Benchmarks
  • Stream, RandomAccess, Linpack, FFT