Transcript and Presenter's Notes

Title: ParCo 2003 Presentation


1
Advanced Hybrid MPI/OpenMP Parallelization
Paradigms for Nested Loop Algorithms onto
Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
2
Overview
  • Introduction
  • Pure MPI Model
  • Hybrid MPI-OpenMP Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

3
Introduction
  • Motivation
  • SMP clusters
  • Hybrid programming models
  • Mostly fine-grain MPI-OpenMP paradigms
  • Mostly DOALL parallelization

4
Introduction
  • Contribution
  • 3 programming models for the parallelization of
    nested loop algorithms
  • pure MPI
  • fine-grain hybrid MPI-OpenMP
  • coarse-grain hybrid MPI-OpenMP
  • Advanced hyperplane scheduling
  • minimize synchronization needs
  • overlap computation with communication

5
Introduction
  • Algorithmic Model
    FOR j0 = min0 TO max0 DO
      ...
      FOR jn-1 = minn-1 TO maxn-1 DO
        Computation(j0, ..., jn-1)
      ENDFOR
      ...
    ENDFOR
  • Perfectly nested loops
  • Constant flow data dependencies

6
Introduction
Target Architecture SMP clusters
7
Overview
  • Introduction
  • Pure MPI Model
  • Hybrid MPI-OpenMP Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

8
Pure MPI Model
  • Tiling transformation groups iterations into
    atomic execution units (tiles)
  • Pipelined execution
  • Overlapping computation with communication
  • Makes no distinction between inter-node and
    intra-node communication

9
Pure MPI Model
Example:
FOR j1 = 0 TO 9 DO
  FOR j2 = 0 TO 7 DO
    A[j1,j2] = A[j1-1,j2] + A[j1,j2-1]
  ENDFOR
ENDFOR
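The same example rendered as plain sequential C (not part of the slides): the array sizes and the halo row/column at index 0, which holds the boundary values, are assumptions made here so that the j-1 accesses stay in bounds.

/* Sequential C version of the 2-D example above. */
#include <stdio.h>

#define N1 10                      /* j1 takes 10 values */
#define N2 8                       /* j2 takes 8 values  */

static double A[N1 + 1][N2 + 1];   /* +1 for the boundary halo */

int main(void)
{
    for (int j1 = 1; j1 <= N1; j1++)        /* shifted by +1 for the halo */
        for (int j2 = 1; j2 <= N2; j2++)
            A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

    printf("A[%d][%d] = %g\n", N1, N2, A[N1][N2]);
    return 0;
}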
10
Pure MPI Model
(Figure: tiled iteration space mapped onto 4 MPI processes: two SMP nodes, NODE0 and NODE1, with two CPUs, CPU0 and CPU1, each)
11
Pure MPI Model
(Figure, continued: the same mapping onto 4 MPI processes, NODE0/NODE1 with CPU0/CPU1 each)
12
Pure MPI Model
tile0 = nod0
...
tilen-2 = nodn-2
FOR tilen-1 = 0 TO ... DO
  Pack(snd_buf, tilen-1 - 1, nod)
  MPI_Isend(snd_buf, dest(nod))
  MPI_Irecv(recv_buf, src(nod))
  Compute(tile)
  MPI_Waitall
  Unpack(recv_buf, tilen-1 + 1, nod)
ENDFOR
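A minimal compilable sketch of the same loop structure in C/MPI, assuming a 1-D pipeline in which every process sends boundary data to rank+1 and receives from rank-1; the buffer size, the number of tiles and the pack/unpack/compute_tile helpers are illustrative stubs, not the presenters' code.

/* Pure MPI: pipelined tile loop with non-blocking communication that
   overlaps the transfers with the computation of the current tile. */
#include <mpi.h>

#define TILES 16                     /* tiles along the last dimension (assumed) */

static void pack(double *buf, int t)    { (void)buf; (void)t; }   /* stub */
static void unpack(double *buf, int t)  { (void)buf; (void)t; }   /* stub */
static void compute_tile(int t)         { (void)t; }              /* stub */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double snd_buf[64], recv_buf[64];         /* placeholder boundary buffers */

    for (int t = 0; t < TILES; t++) {
        MPI_Request req[2];
        int nreq = 0;

        pack(snd_buf, t - 1);                 /* results of the previous tile */
        if (rank < size - 1)
            MPI_Isend(snd_buf, 64, MPI_DOUBLE, rank + 1, t,
                      MPI_COMM_WORLD, &req[nreq++]);
        if (rank > 0)
            MPI_Irecv(recv_buf, 64, MPI_DOUBLE, rank - 1, t,
                      MPI_COMM_WORLD, &req[nreq++]);

        compute_tile(t);                      /* overlaps the transfers */

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        unpack(recv_buf, t + 1);              /* boundary data for the next tile */
    }

    MPI_Finalize();
    return 0;
}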
13
Overview
  • Introduction
  • Pure MPI Model
  • Hybrid MPI-OpenMP Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

14
Hyperplane Scheduling
  • Implements coarse-grain parallelism assuming
    inter-tile data dependencies
  • Tiles are organized into data-independent
    subsets (groups)
  • Tiles of the same group can be concurrently
    executed by multiple threads
  • Barrier synchronization between threads

15
Hyperplane Scheduling
(Figure: tiles mapped onto 2 MPI processes x 2 OpenMP threads: NODE0 and NODE1 with CPU0 and CPU1 each)
16
Hyperplane Scheduling
(Figure, continued: the same mapping onto 2 MPI processes x 2 OpenMP threads)
17
Hyperplane Scheduling
#pragma omp parallel
{
  group0 = nod0
  ...
  groupn-2 = nodn-2
  tile0 = nod0 * m0 + th0
  ...
  tilen-2 = nodn-2 * mn-2 + thn-2
  FOR(groupn-1){
    tilen-1 = groupn-1 - (tile0 + ... + tilen-2)
    if(0 <= tilen-1 <= ...)
      compute(tile)
    #pragma omp barrier
  }
}
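A self-contained OpenMP sketch of the hyperplane idea for a 2-D tile space inside one node (an illustration, not the presenters' code): tiles with ti + tj == g form group g and can be executed by different threads, and a barrier separates consecutive groups. The tile counts and the per-tile "work" (a printf) are assumptions.

/* Hyperplane (wavefront) scheduling of 2-D tiles across OpenMP threads. */
#include <omp.h>
#include <stdio.h>

#define TI 4                       /* tiles along dimension 0 (assumed) */
#define TJ 4                       /* tiles along dimension 1 (assumed) */

int main(void)
{
    #pragma omp parallel
    {
        int th  = omp_get_thread_num();
        int nth = omp_get_num_threads();

        for (int g = 0; g < TI + TJ - 1; g++) {       /* hyperplane groups   */
            for (int ti = th; ti < TI; ti += nth) {   /* this thread's tiles */
                int tj = g - ti;                      /* hyperplane equation */
                if (tj >= 0 && tj < TJ)
                    printf("group %d: thread %d computes tile (%d,%d)\n",
                           g, th, ti, tj);
            }
            #pragma omp barrier    /* all tiles of group g finish before g+1 */
        }
    }
    return 0;
}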
18
Overview
  • Introduction
  • Pure MPI Model
  • Hybrid MPI-OpenMP Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

19
Fine-grain Model
  • Incremental parallelization of computationally
    intensive parts
  • Relatively straightforward to derive from the pure
    MPI version
  • Threads are (re-)spawned for each computation phase
  • Inter-node communication outside of
    multi-threaded part
  • Thread synchronization through implicit barrier
    of omp parallel directive

20
Fine-grain Model
FOR(groupn-1){
  Pack(snd_buf, tilen-1 - 1, nod)
  MPI_Isend(snd_buf, dest(nod))
  MPI_Irecv(recv_buf, src(nod))
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num()
    if(valid(tile, thread_id, groupn-1))
      Compute(tile)
  }
  MPI_Waitall
  Unpack(recv_buf, tilen-1 + 1, nod)
}
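A compilable sketch of this structure (same 1-D neighbour-exchange assumption and stubbed helpers as before): the parallel region, and therefore the thread team, is re-created for every group, while all MPI calls stay outside of it.

/* Fine-grain hybrid: MPI outside, threads (re-)spawned per group. */
#include <mpi.h>
#include <omp.h>

#define GROUPS 16                                  /* number of groups (assumed) */

static void pack(double *buf)           { (void)buf; }                     /* stub */
static void unpack(double *buf)         { (void)buf; }                     /* stub */
static void compute_tile(int g, int t)  { (void)g; (void)t; }              /* stub */
static int  valid(int g, int t)         { (void)g; (void)t; return 1; }    /* stub */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double snd_buf[64], recv_buf[64];

    for (int g = 0; g < GROUPS; g++) {
        MPI_Request req[2];
        int nreq = 0;

        pack(snd_buf);                            /* boundary data of the previous group */
        if (rank < size - 1)
            MPI_Isend(snd_buf, 64, MPI_DOUBLE, rank + 1, g,
                      MPI_COMM_WORLD, &req[nreq++]);
        if (rank > 0)
            MPI_Irecv(recv_buf, 64, MPI_DOUBLE, rank - 1, g,
                      MPI_COMM_WORLD, &req[nreq++]);

        #pragma omp parallel                      /* team spawned here, joined at the */
        {                                         /* implicit barrier at the end      */
            int thread_id = omp_get_thread_num();
            if (valid(g, thread_id))
                compute_tile(g, thread_id);
        }

        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        unpack(recv_buf);                         /* boundary data for the next group */
    }

    MPI_Finalize();
    return 0;
}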
21
Overview
  • Introduction
  • Pure MPI Model
  • Hybrid MPI-OpenMP Models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

22
Coarse-grain Model
  • SPMD paradigm
  • Requires more programming effort
  • Threads are only spawned once
  • Inter-node communication inside the multi-threaded
    part (requires MPI_THREAD_MULTIPLE; see the
    initialization sketch after this list)
  • Thread synchronization through explicit barrier
    (omp barrier directive)
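Because MPI calls are issued from inside the OpenMP parallel region in this model, the MPI library has to be initialized with the required thread-support level. A minimal sketch of that initialization (the error handling is an assumption):

/* Request full thread support before entering the hybrid region. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "insufficient MPI thread support: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... coarse-grain hybrid MPI/OpenMP work ... */
    MPI_Finalize();
    return 0;
}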

23
Coarse-grain Model
#pragma omp parallel
{
  thread_id = omp_get_thread_num()
  FOR(groupn-1){
    #pragma omp master
    {
      Pack(snd_buf, tilen-1 - 1, nod)
      MPI_Isend(snd_buf, dest(nod))
      MPI_Irecv(recv_buf, src(nod))
    }
    if(valid(tile, thread_id, groupn-1))
      Compute(tile)
    #pragma omp master
    {
      MPI_Waitall
      Unpack(recv_buf, tilen-1 + 1, nod)
    }
    #pragma omp barrier
  }
}
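A compilable sketch of this structure under the same assumptions as the fine-grain sketch (1-D neighbour exchange, stubbed helpers): the team is spawned once, the master thread handles all MPI traffic inside the region, and an explicit barrier separates consecutive groups.

/* Coarse-grain hybrid: threads spawned once, MPI inside the region. */
#include <mpi.h>
#include <omp.h>

#define GROUPS 16                                  /* number of groups (assumed) */

static void pack(double *buf)           { (void)buf; }                     /* stub */
static void unpack(double *buf)         { (void)buf; }                     /* stub */
static void compute_tile(int g, int t)  { (void)g; (void)t; }              /* stub */
static int  valid(int g, int t)         { (void)g; (void)t; return 1; }    /* stub */

int main(int argc, char **argv)
{
    int provided, rank, size, nreq = 0;
    double snd_buf[64], recv_buf[64];              /* shared inside the region */
    MPI_Request req[2];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel                           /* threads spawned only once */
    {
        int thread_id = omp_get_thread_num();

        for (int g = 0; g < GROUPS; g++) {
            #pragma omp master                     /* only master talks to MPI */
            {
                nreq = 0;
                pack(snd_buf);
                if (rank < size - 1)
                    MPI_Isend(snd_buf, 64, MPI_DOUBLE, rank + 1, g,
                              MPI_COMM_WORLD, &req[nreq++]);
                if (rank > 0)
                    MPI_Irecv(recv_buf, 64, MPI_DOUBLE, rank - 1, g,
                              MPI_COMM_WORLD, &req[nreq++]);
            }

            if (valid(g, thread_id))               /* every thread computes its tiles */
                compute_tile(g, thread_id);

            #pragma omp master
            {
                MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
                unpack(recv_buf);
            }
            #pragma omp barrier                    /* group g done on all threads */
        }
    }

    MPI_Finalize();
    return 0;
}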
24
Summary: Fine-grain vs. Coarse-grain
  • Thread management: fine-grain re-spawns the thread team for every
    group, coarse-grain spawns it only once
  • Inter-node MPI communication: outside of the multi-threaded region
    in the fine-grain model, inside the multi-threaded region (handled
    by the master thread) in the coarse-grain model
  • Intra-node synchronization: implicit barrier of omp parallel in the
    fine-grain model, explicit OpenMP barrier in the coarse-grain model
25
Overview
  • Introduction
  • Pure MPI model
  • Hybrid MPI-OpenMP models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

26
Experimental Results
  • 8-node SMP Linux Cluster (800 MHz PIII, 128 MB
    RAM, kernel 2.4.20)
  • MPICH v.1.2.5 (--with-device=ch_p4,
    --with-comm=shared)
  • Intel C compiler 7.0 (-O3 -mcpu=pentiumpro
    -static)
  • FastEthernet interconnection
  • ADI micro-kernel benchmark (3D)

27
Alternating Direction Implicit (ADI)
  • Unitary data dependencies
  • 3D Iteration Space (X x Y x Z)
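The slides do not reproduce the ADI kernel itself; the fragment below is only a generic stand-in showing a 3-D loop nest with unitary flow dependencies, i.e. the dependence pattern the three models have to respect (array sizes are placeholders).

/* Generic 3-D sweep with unitary flow dependencies (not the actual ADI
   kernel): every point depends on its predecessor in each dimension. */
#define X 32
#define Y 32
#define Z 64

static double A[X][Y][Z];

static void sweep(void)
{
    for (int i = 1; i < X; i++)
        for (int j = 1; j < Y; j++)
            for (int k = 1; k < Z; k++)
                A[i][j][k] += A[i - 1][j][k] + A[i][j - 1][k] + A[i][j][k - 1];
}

int main(void)
{
    sweep();
    return 0;
}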

28
ADI 4 nodes
29
ADI 4 nodes
  • X < Y
  • X > Y

30
ADI, X=512, Y=512, Z=8192, 4 nodes
31
ADI, X=128, Y=512, Z=8192, 4 nodes
32
ADI, X=512, Y=128, Z=8192, 4 nodes
33
ADI 2 nodes
34
ADI 2 nodes
  • X < Y
  • X > Y

35
ADI, X=128, Y=512, Z=8192, 2 nodes
36
ADI, X=256, Y=512, Z=8192, 2 nodes
37
ADI, X=512, Y=512, Z=8192, 2 nodes
38
ADI, X=512, Y=256, Z=8192, 2 nodes
39
ADI, X=512, Y=128, Z=8192, 2 nodes
40
ADI, X=128, Y=512, Z=8192, 2 nodes (breakdown into computation and communication time)
41
ADI, X=512, Y=128, Z=8192, 2 nodes (breakdown into computation and communication time)
42
Overview
  • Introduction
  • Pure MPI model
  • Hybrid MPI-OpenMP models
  • Hyperplane Scheduling
  • Fine-grain Model
  • Coarse-grain Model
  • Experimental Results
  • Conclusions and Future Work

43
Conclusions
  • Nested loop algorithms with arbitrary data
    dependencies can be adapted to the hybrid
    parallel programming paradigm
  • Hybrid models can be competitive with the pure
    MPI paradigm
  • The coarse-grain hybrid model can be more
    efficient than the fine-grain one, but is also
    more complicated
  • Programming efficiently in OpenMP is no easier
    than programming efficiently in MPI

44
Future Work
  • Application of the methodology to real
    applications and benchmarks
  • Work balancing for the coarse-grain model
  • Performance evaluation on advanced
    interconnection networks (SCI, Myrinet)
  • Generalization as a compiler technique

45
Questions?
http://www.cslab.ece.ntua.gr/~ndros