Title: Dynamic Multi Phase Scheduling for Heterogeneous Clusters


1
Dynamic Multi Phase Scheduling for Heterogeneous Clusters
20th International Parallel and Distributed Processing Symposium, 25-29 April 2006
  • Florina M. Ciorba, Theodore Andronikos, Ioannis
    Riakiotakis,
  • Anthony T. Chronopoulos and George
    Papakonstantinou

National Technical University of Athens, Computing Systems Laboratory
University of Texas at San Antonio
cflorina_at_cslab.ece.ntua.gr www.cslab.ece.ntua.gr
2
Outline
  • Introduction
  • Notation
  • Some existing self-scheduling algorithms
  • Dynamic self-scheduling for dependence loops
  • Implementation and test results
  • Conclusions
  • Future work

3
Introduction
  • Motivation for dynamically scheduling loops with
    dependencies
  • Existing dynamic algorithms cannot cope with
    dependencies, because they lack inter-slave
    communication
  • Static algorithms are not always efficient
  • Applied in their original form to loops with
    dependencies, dynamic algorithms yield a serial
    or invalid execution

5
Notation
  • Algorithmic model

FOR (i1 = l1; i1 <= u1; i1++)
  FOR (i2 = l2; i2 <= u2; i2++)
    ...
      FOR (in = ln; in <= un; in++)
        Loop Body
      ENDFOR
    ...
  ENDFOR
ENDFOR
  • Perfectly nested loops
  • Constant flow data dependencies
  • General program statements within the loop body
  • J: index space of an n-dimensional uniform
    dependence loop
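
As a rough illustration of this model, the sketch below enumerates the index space J of a small two-dimensional loop and lists, for a given iteration, the iterations it depends on. The bounds and the dependence vectors (1,0) and (0,1) are hypothetical values chosen for the example; the helper names are not from the presentation.

```python
from itertools import product

def index_space(lower, upper):
    """All points of J for inclusive bounds l_k <= i_k <= u_k, in lexicographic order."""
    ranges = [range(l, u + 1) for l, u in zip(lower, upper)]
    return list(product(*ranges))

def predecessors(point, dependence_vectors, lower, upper):
    """Iterations that `point` reads from: point - d for every dependence vector d inside J."""
    preds = []
    for d in dependence_vectors:
        p = tuple(x - dx for x, dx in zip(point, d))
        if all(l <= x <= u for x, l, u in zip(p, lower, upper)):
            preds.append(p)
    return preds

# Hypothetical 3 x 4 index space with two uniform (constant) dependence vectors.
lower, upper = (0, 0), (2, 3)
deps = [(1, 0), (0, 1)]
J = index_space(lower, upper)

# Serial (lexicographic) order respects every dependence: predecessors come first.
order_ok = all(J.index(p) < J.index(pt)
               for pt in J
               for p in predecessors(pt, deps, lower, upper))
```

Because every dependence vector is constant across J, the same check applies to each iteration, which is what makes the loop a uniform dependence loop.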

6
Notation
  • u1: synchronization dimension, un: scheduling
    dimension
  • set of dependence vectors
  • PE: processing element
  • P1,...,Pm: slaves
  • N: number of scheduling steps
  • Ci: chunk size at the i-th scheduling step
  • Vi: size (iteration-wise) of Ci along
    scheduling dimension un
  • VPk: virtual computing power of slave Pk
  • Qk: number of processes in the run-queue of
    slave Pk
  • Ak = VPk / Qk: available computing power of slave
    Pk
  • A = A1 + ... + Am: total available
    computing power of the cluster
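
The last two quantities can be illustrated with a short sketch. It assumes that the available power of a slave is its virtual power divided by the length of its run-queue, and that the cluster total is the sum over all slaves; the function name and the sample values are hypothetical.

```python
def available_power(virtual_power, run_queue):
    """Per-slave available power, assuming Ak = VPk / Qk."""
    return [vp / q for vp, q in zip(virtual_power, run_queue)]

# Hypothetical cluster of three slaves; the third one runs an extra process (Qk = 2).
vp = [1.5, 0.5, 0.5]
q = [1, 1, 2]

a_k = available_power(vp, q)   # per-slave available power
a_total = sum(a_k)             # total available power A of the cluster
```

A slave that is loaded by one extra process sees its available power halved, so a scheme that sizes chunks by Ak hands it correspondingly less work.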

8
Some existing self-scheduling algorithms
  • Three self-scheduling algorithms
  • CSS: Chunk Self-Scheduling, Ci = constant
  • TSS: Trapezoid Self-Scheduling, Ci = Ci-1 - D,
    where D = decrement, the first chunk is F =
    J/(2m) and the last chunk is L = 1.
  • DTSS: Distributed TSS, Ci = Ci-1 - D, where D =
    decrement, the first chunk is F = J/(2A)
    and the last chunk is L = 1.
  • CSS and TSS are devised for homogeneous systems
  • DTSS improves on TSS for heterogeneous systems by
    selecting the chunk sizes according to
  • the virtual computational power of the slaves, Vk
  • the number of processes in the run-queue of each
    PE, Qk

9
Some existing self-scheduling algorithms
  • J = 5000 × 10000
  • m = 10 slaves
  • CSS and TSS each give the same chunk sizes in
    dedicated and non-dedicated systems
  • DTSS adjusts the chunk sizes to match the
    different Ak of slaves

Chunk sizes per algorithm
CSS: 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 200
TSS: 277 270 263 256 249 242 235 228 221 214 207 200 193 186 179 172 165 158 151 144 137 130 123 116 109 102 73
DTSS (dedicated): 392 253 368 237 344 221 108 211 103 300 192 276 176 176 252 160 77 149 72 207 130 183 114 159 98 46 87 41 44
DTSS (non-dedicated): 263 383 369 355 229 112 219 107 209 203 293 279 265 169 33 96 46 89 86 83 80 77 74 24 69 66 31 59 56 53 50 47 44 20 39 20 33 30 27 24 21 20 20 20 20 20 20 20 20 8
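
The CSS and TSS rows can be approximated with a small sketch. It assumes the usual TSS parameters (first chunk F rounded down, number of steps N = ceil(2·total/(F+L)), decrement D = floor((F-L)/(N-1))) and that the final chunk takes whatever remains; the function names are hypothetical and this is not the authors' code. With 5000 iterations along the scheduling dimension, it gives the listed CSS sizes for a chunk of 300 and the listed TSS sizes for m = 9. The value m = 9 is an assumption: with the stated m = 10 these rounding rules would start at 250 rather than 277.

```python
import math

def css_chunks(total, chunk):
    """CSS: constant chunk size; the last chunk takes whatever remains."""
    chunks, remaining = [], total
    while remaining > 0:
        c = min(chunk, remaining)
        chunks.append(c)
        remaining -= c
    return chunks

def tss_chunks(total, m, last=1):
    """TSS: chunk sizes decrease linearly from F = total/(2m) towards `last`."""
    first = total // (2 * m)                         # F
    n_steps = math.ceil(2 * total / (first + last))  # N
    decrement = (first - last) // (n_steps - 1)      # D
    chunks, remaining, size = [], total, first
    while remaining > 0:
        c = min(max(size, last), remaining)          # never below L, never past the end
        chunks.append(c)
        remaining -= c
        size -= decrement
    return chunks

css = css_chunks(5000, 300)
tss = tss_chunks(5000, 9)
```

DTSS is not sketched here, because its chunk sizes also depend on the Ak reported by each requesting slave.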
11
More notation
  • SP: synchronization point
  • M: number of SPs inserted along synchronization
    dimension u1
  • H: interval (iteration-wise) between two SPs
    along u1
  • H is the same for every chunk
  • SCi,j: the set of iterations of Ci between
    SPj-1 and SPj
  • Ci = Vi × M × H
  • Current slave: the slave assigned chunk Ci
  • Previous slave: the slave assigned chunk Ci-1

12
Self-scheduling with synchronization
  • Chunks are formed along scheduling dimension,
    here say u2
  • SPs are inserted along synchronization dimension,
    u1
  • Phase 1: Apply self-scheduling algorithms to the
    scheduling dimension
  • Phase 2: Insert synchronization points along the
    synchronization dimension
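
Phase 2 can be pictured with the sketch below, which cuts the synchronization dimension into intervals of H iterations. It assumes one SP closes each interval, so that the number of intervals equals M; the sizes used (1000 iterations along u1, H = 100, a chunk 277 iterations wide) are hypothetical.

```python
def sync_intervals(u1_size, h):
    """Half-open iteration ranges along u1, one per synchronization point."""
    return [(start, min(start + h, u1_size)) for start in range(0, u1_size, h)]

intervals = sync_intervals(1000, 100)   # hypothetical u1 extent and interval H
m_sps = len(intervals)                  # M, the number of SPs along u1

# Iterations in one chunk of width Vi = 277 along the scheduling dimension,
# following Ci = Vi x M x H (exact here because H divides the u1 extent).
chunk_iterations = 277 * m_sps * 100
```

When H does not divide the extent of u1, the last interval is simply shorter, as in `sync_intervals(250, 100)`.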

13
The inter-slave communication scheme
  • Ci-1 is assigned to Pk-1, Ci to Pk and
    Ci+1 to Pk+1
  • When Pk reaches SPj+1, it sends to Pk+1 only the
    data Pk+1 requires (i.e., those iterations
    imposed by the existing dependence vectors)
  • Afterwards, Pk receives from Pk-1 the data
    required for the current computation
  • Slaves do not reach an SP at the same time, which
    leads to a wavefront-style execution
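
The wavefront pattern can be reproduced with a toy timing model. The sketch assumes that each chunk runs on its own slave, that every interval between two SPs costs the same, that communication is free, and that a slave can start an interval only after finishing its own previous interval and after the previous slave has finished the same interval. None of these simplifications come from the presentation.

```python
def wavefront_finish_times(num_chunks, num_sps, step_cost):
    """finish[i][j]: time at which the slave of chunk i completes the interval ending at SP j."""
    finish = [[0.0] * num_sps for _ in range(num_chunks)]
    for i in range(num_chunks):
        for j in range(num_sps):
            own_previous = finish[i][j - 1] if j > 0 else 0.0     # same slave, earlier interval
            from_neighbor = finish[i - 1][j] if i > 0 else 0.0    # data from the previous slave
            finish[i][j] = max(own_previous, from_neighbor) + step_cost
    return finish

finish = wavefront_finish_times(num_chunks=3, num_sps=4, step_cost=1.0)
```

Under this model, cells with the same i + j finish at the same time, so the computation advances along anti-diagonals of the (chunk, SP) grid.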

14
Dynamic Multi-Phase Scheduling DMPS(x)
  • INPUT
  • (a) An n-dimensional dependence nested loop.
  • (b) The choice of the algorithm: CSS, TSS or
    DTSS.
  • (c) If CSS is chosen, then the chunk size Ci.
  • (d) The synchronization interval H.
  • (e) The number of slaves m; in case of DTSS, the
    virtual power Vk of every slave.
  • Master
  • Initialization: (M.a) Register slaves. In case of
    DTSS, slaves report their Ak.
  • (M.b) Calculate F, L, N, D for TSS and DTSS. For
    CSS use the given Ci.
  • While there are unassigned iterations do
  • (M.1) If a request arrives, put it in the
    queue.
  • (M.2) Pick a request from the queue, and
    compute the next chunk size using CSS, TSS or
    DTSS.
  • (M.3) Update the current and previous slave
    ids.
  • (M.4) Send the id of the current slave to the
    previous one.
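
The master's bookkeeping in steps (M.1) to (M.4) can be mimicked in a single process. The sketch below is only a simulation under stated assumptions: requests arrive in a fixed, hypothetical order, chunk sizes are supplied up front (for example by CSS), and "sending the id" is recorded in a dictionary instead of going over MPI.

```python
from collections import deque

def simulate_master(total, chunk_sizes, request_order):
    """Assign chunks to requesting slaves and record which slave each chunk must send to."""
    requests = deque(request_order)      # (M.1) pending requests, in arrival order
    assignments = []                     # (chunk index, slave, first iteration, size)
    send_to = {}                         # chunk index -> slave that needs its boundary data
    next_start, previous = 0, None
    for i, size in enumerate(chunk_sizes):
        if next_start >= total or not requests:
            break                        # nothing left to assign, or nobody is asking
        current = requests.popleft()     # (M.2) pick a request and size the chunk
        size = min(size, total - next_start)
        assignments.append((i, current, next_start, size))
        if previous is not None:
            send_to[i - 1] = current     # (M.4) tell the previous slave who comes next
        previous = current               # (M.3) update current/previous slave ids
        next_start += size
    return assignments, send_to

assignments, send_to = simulate_master(
    total=1000, chunk_sizes=[300] * 4, request_order=["P1", "P2", "P3", "P1"])
```

In this run P1 receives both the first and the last chunk, so the slave holding chunk 2 (P3) is told to send its boundary data back to P1.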

15
Dynamic Multi-Phase Scheduling DMPS(x)
  • Slave Pk
  • Initialization: (S.a) Register with the master.
    In case of DTSS, report Ak.
  • (S.b) Compute M according to the given H.
  • (S.1) Send request to the master.
  • (S.2) Wait for reply; if received chunk from
    master, go to step 3, else go to OUTPUT.
  • (S.3) While the next SP is not reached, compute
    chunk i.
  • (S.4) If id of the send-to slave is known, go to
    step 5, else go to step 6.
  • (S.5) Send computed data to send-to slave.
  • (S.6) Receive data from the receive-from slave
    and go to step 3.
  • OUTPUT
  • Master: If there are no more chunks to be
    assigned, terminate.
  • Slave Pk: If no more tasks come from master,
    terminate.
16
Dynamic Multi-Phase Scheduling DMPS(x)
  • Advantages of DMPS(x)
  • Can take as input any self-scheduling algorithm,
    without any modifications
  • Phase 2 is independent of Phase 1
  • Phase 1 deals with the heterogeneity and load
    variation in the system
  • Phase 2 deals with minimizing the inter-slave
    communication cost
  • Suitable for any type of heterogeneous system

18
Implementation and testing setup
  • The algorithms are implemented in C and C++
  • MPI is used for master-slave and
    inter-slave communication
  • The heterogeneous system consists of 10 machines
  • 4 Intel Pentium III, 1266 MHz with 1GB RAM
    (called zealots), assumed to have VPk = 1.5 (one
    of them is the master)
  • 6 Intel Pentium III, 500 MHz with 512MB RAM
    (called kids), assumed to have VPk = 0.5.
  • Interconnection network is Fast Ethernet, at
    100Mbit/sec.
  • Dedicated system: all machines are dedicated to
    running the program and no other loads are
    interposed during the execution.
  • Non-dedicated system: at the beginning of the
    program's execution, a resource-expensive
    process is started on some of the slaves, halving
    their Ak.

19
Implementation and testing setup
  • System configuration: zealot1 (master), zealot2,
    kid1, zealot3, kid2, zealot4, kid3, kid4, kid5,
    kid6.
  • Three series of experiments for both dedicated
    and non-dedicated systems, for m = 3, 4, 5, 6, 7,
    8, 9 slaves
  • DMPS(CSS)
  • DMPS(TSS)
  • DMPS(DTSS)
  • Two real-life applications: heat equation,
    Floyd-Steinberg computation
  • Speedup Sp is computed with
    Sp = min{TP1, ..., TPm} / TPAR
  • where TPi: serial execution time on
    slave Pi, 1 ≤ i ≤ m, and
  • TPAR: parallel execution time (on m
    slaves)
  • In the plotting of Sp, VP is used instead of m on
    the x-axis.

20
Performance results: Heat equation
Dedicated system (columns: number of slaves m)
Sync. interval H   Series tested    m=3    m=4    m=5    m=6    m=7    m=8    m=9
100                1) DMPS(CSS)     2.32   1.75   1.73   1.23   1.21   1.21   1.18
100                2) DMPS(TSS)     2.20   1.73   1.56   1.38   1.25   1.14   1.02
100                3) DMPS(DTSS)    1.42   1.14   1.00   0.95   0.91   0.85   0.78
150                1) DMPS(CSS)     2.31   1.74   1.71   1.21   1.22   1.21   1.18
150                2) DMPS(TSS)     2.18   1.72   1.54   1.38   1.25   1.14   1.02
150                3) DMPS(DTSS)    1.42   1.13   0.99   0.93   0.90   0.84   0.78
200                1) DMPS(CSS)     2.30   1.74   1.73   1.22   1.23   1.22   1.19
200                2) DMPS(TSS)     2.21   1.74   1.55   1.38   1.25   1.14   1.02
200                3) DMPS(DTSS)    1.42   1.13   0.99   0.94   0.90   0.83   0.78
21
Performance results: Heat equation
Non-dedicated system (columns: number of slaves m)
Sync. interval H   Series tested    m=3    m=4    m=5    m=6    m=7    m=8    m=9
100                1) DMPS(CSS)     2.33   1.76   1.73   2.46   2.45   2.38   2.06
100                2) DMPS(TSS)     2.20   1.74   1.56   2.52   2.56   2.18   2.10
100                3) DMPS(DTSS)    1.95   1.45   1.30   1.31   1.33   1.38   1.25
150                1) DMPS(CSS)     2.33   1.74   1.72   2.46   2.49   2.43   2.05
150                2) DMPS(TSS)     2.19   1.72   1.54   2.42   2.23   2.31   2.06
150                3) DMPS(DTSS)    1.94   1.47   1.30   1.30   1.28   1.36   1.23
200                1) DMPS(CSS)     2.30   1.74   1.73   2.39   2.36   2.38   2.10
200                2) DMPS(TSS)     2.22   1.75   1.56   1.79   2.32   2.10   2.02
200                3) DMPS(DTSS)    1.96   1.44   1.29   1.29   1.27   1.32   1.21
22
Performance results: Floyd-Steinberg
Dedicated system (columns: number of slaves m)
Sync. interval H   Series tested    m=3     m=4     m=5     m=6     m=7     m=8     m=9
50                 1) DMPS(CSS)     27.79   22.14   16.78   16.69   16.53   11.38   11.36
50                 2) DMPS(TSS)     25.32   19.77   17.30   15.41   13.80   12.43   11.40
50                 3) DMPS(DTSS)    19.63   14.87   13.28   12.72   11.57   11.45   10.73
100                1) DMPS(CSS)     27.52   22.01   16.70   16.65   16.43   11.34   11.33
100                2) DMPS(TSS)     25.22   19.70   17.24   15.35   13.75   12.38   11.38
100                3) DMPS(DTSS)    19.63   14.80   13.21   12.66   11.52   11.34   10.64
150                1) DMPS(CSS)     27.58   22.03   16.75   16.70   16.44   11.43   11.43
150                2) DMPS(TSS)     25.22   19.70   17.22   15.34   13.75   12.39   11.38
150                3) DMPS(DTSS)    19.62   14.82   13.24   12.67   11.53   11.34   10.65
23
Performance results: Floyd-Steinberg
Non-dedicated system (columns: number of slaves m)
Sync. interval H   Series tested    m=3     m=4     m=5     m=6     m=7     m=8     m=9
50                 1) DMPS(CSS)     27.72   22.13   16.76   23.81   22.32   22.47   22.44
50                 2) DMPS(TSS)     25.18   19.72   17.24   22.34   24.14   22.26   20.95
50                 3) DMPS(DTSS)    21.88   16.06   14.38   13.74   13.26   13.02   11.71
100                1) DMPS(CSS)     27.49   21.99   16.67   22.61   22.42   22.59   22.35
100                2) DMPS(TSS)     25.18   19.66   17.17   19.23   24.15   22.24   20.88
100                3) DMPS(DTSS)    21.85   15.96   14.32   13.65   13.11   12.80   11.58
150                1) DMPS(CSS)     27.57   22.01   16.74   22.49   22.48   22.32   22.46
150                2) DMPS(TSS)     25.17   19.65   17.20   26.20   24.14   22.02   20.82
150                3) DMPS(DTSS)    21.86   15.96   14.31   13.58   13.18   12.80   11.59
24
Interpretation of the results
  • Dedicated system
  • As expected, all algorithms perform better on a
    dedicated system than on a non-dedicated
    one.
  • DMPS(TSS) slightly outperforms DMPS(CSS) for
    parallel loops, because it provides better load
    balancing
  • DMPS(DTSS) outperforms both other algorithms
    because it explicitly accounts for the system's
    heterogeneity
  • Non-dedicated system
  • DMPS(DTSS) stands out even more, since the other
    algorithms cannot handle extra load variations
  • The speedup for DMPS(DTSS) increases in all cases
  • H must be chosen so as to keep the comm/comp
    ratio < 1 for every test case
  • Even then, small variations in the value of H do
    not significantly affect the overall performance.

26
Conclusions
  • Loops with dependencies can now be dynamically
    scheduled on heterogeneous dedicated and
    non-dedicated systems
  • Distributed algorithms efficiently compensate for
    the system's heterogeneity for loops with
    dependencies, especially in non-dedicated systems

28
Future work
  • Establish a model for predicting the optimal
    synchronization interval H and minimizing the
    communication
  • Extend all other self-scheduling algorithms, so
    that they can handle loops with dependencies and
    account for the system's heterogeneity

29
Thank you
  • Questions?