Windows-NT based Distributed Virtual Parallel Machine - PowerPoint PPT Presentation

1
Windows-NT based Distributed Virtual Parallel Machine
The MILLIPEDE Project, Technion, Israel
http://www.cs.technion.ac.il/Labs/Millipede
2
What is Millipede?
A strong Virtual Parallel Machine employing non-dedicated distributed environments:

  Programs
  Implementation of Parallel Programming Langs
  Distributed Environment
3
Programming Paradigms
  SPLASH, Cilk/Calipso, CC, Java, CParPar, ParC, ParFortran90, Other, Bare Millipede
Millipede Layer
  Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM)
Communication Packages
  U-Net, Transis, Horus, ...
Operating System Services
  Communication, Threads, Page Protection, I/O
Software Packages
  User-mode threads
4
So, what's in a VPM? Checklist:
  • Uses a non-dedicated cluster of PCs (SMPs)
  • Multi-threaded
  • Shared memory
  • User-mode
  • Strong support for weak memory
  • Dynamic page- and job-migration
  • Load sharing for maximal locality of reference
  • Convergence to optimal level of parallelism
Millipede inside
5
Using a non-dedicated cluster
  • Dynamically identify idle machines
  • Move work to idle machines
  • Evacuate busy machines
  • Do everything transparently to the native user
  • Co-existence of several parallel applications
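The idle-detection step above can be sketched as a small load monitor; the threshold and window constants are assumed tuning knobs for illustration, not Millipede's actual criteria:

```c
#include <assert.h>

/* Sketch of "dynamically identify idle machines": call a host idle once
 * the native user's CPU load has stayed below a threshold for a full
 * window of consecutive samples. Both constants are assumed values. */
#define IDLE_LOAD_PCT 10   /* below this, a sample counts as idle */
#define IDLE_WINDOW    5   /* consecutive idle samples required   */

typedef struct { int below; } idle_monitor;

/* Feed one periodic load sample (percent CPU used by the native user);
 * returns nonzero once the host qualifies as idle. */
int idle_sample(idle_monitor *m, int load_pct) {
    m->below = (load_pct < IDLE_LOAD_PCT) ? m->below + 1 : 0;
    return m->below >= IDLE_WINDOW;
}
```

Any non-idle sample resets the count, which is what makes "evacuate busy machines" cheap to trigger: the first native keystroke breaks the window.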

6
Multi-Threaded Environments
  • Well known
  • Better utilization of resources
  • An intuitive and high level of abstraction
  • Latency hiding by overlapping computation and communication
  • Natural for parallel programming paradigms and environments
  • Programmer-defined max level of parallelism
  • Actual level of parallelism set dynamically; applications scale up and down
  • Nested parallelism
  • SMPs

7
Convergence to Optimal Speedup
  • The tradeoff: higher level of parallelism vs. better locality of memory reference
  • Optimal speedup is not necessarily achieved with the maximal number of computers
  • The achieved level of parallelism depends on the program's needs and on the capabilities of the system
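The tradeoff can be made concrete with a toy cost model (not Millipede's actual policy): the compute share shrinks as t/p while remote-reference overhead grows with every added host, so the modeled optimum sits at an interior host count rather than at the maximum.

```c
#include <assert.h>

/* Toy model: total time on p hosts = compute share t/p plus a remote-
 * reference penalty c per extra host. Both t and c are illustrative. */
static double exec_time(double t, double c, int p) {
    return t / p + c * (p - 1);
}

/* Converge on the host count 1..max_p with the smallest modeled time. */
static int best_host_count(double t, double c, int max_p) {
    int best = 1;
    for (int p = 2; p <= max_p; p++)
        if (exec_time(t, c, p) < exec_time(t, c, best))
            best = p;
    return best;
}
```

With t = 100 and c = 1 the modeled optimum is 10 hosts rather than the maximum 32, which is the slide's point; with c = 0 (perfect locality) the maximum wins.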

8
No/Explicit/Implicit Access to Shared Memory

PVM:
    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);
    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }
    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);
    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);
    /* Exit PVM before stopping */
    pvm_exit();

C-Linda:
    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);
    /* Worker id is given at creation, no need to compute it now */
    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));

Bare Millipede:
    result = work(milGetMyId(), n, data, milGetTotalIds());
9
Relaxed Consistency (avoiding false sharing and ping-pong)
  • Sequential, CRUW, Sync(var), Arbitrary-CW Sync
  • Multiple relaxations for different shared variables within the same program
  • No broadcast, no central address servers (so it can work efficiently over interconnected LANs)
  • New protocols welcome (user defined?!)
  • Step-by-step optimization towards maximal parallelism

(Figure: page copies)
10
  • LU Decomposition of a 1024x1024 matrix written in SPLASH: advantages gained when reducing the consistency of a single variable (the Global structure)

11

MJEC - Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented
  • A job has a unique system-wide id
  • Jobs communicate and synchronize by sending events
  • Although a job is mobile, its events follow it and reach its events queue wherever it goes
  • Event handlers are context-sensitive

12
MJEC (cont.)
  • Modes:
  • In Execution Mode, arriving events are enqueued
  • In Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine

13
MJEC Interface
Dispatching loop of milEnterDispatchingMode(func, context):
  ret = func(INIT, context)
  while ret != EXIT:
      if an event is pending:
          ret = func(event, context)
      else:
          wait for an event
  ret = func(EXIT, context)

  • Registration and entering Dispatch Mode:
    milEnterDispatchingMode((FUNC)foo, void *context)
  • Post event:
    milPostEvent(id target, int event, int data)
  • Dispatcher routine syntax:
    int foo(id origin, int event, int data, void *context)
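The dispatching loop above can be sketched in C. The fixed-size in-memory event queue, the INIT/EXIT sentinel events, and the single-job routing are illustrative stand-ins, not Millipede's actual implementation:

```c
#include <assert.h>

/* Sketch of MJEC's dispatching loop. Events are delivered to a user-
 * supplied dispatcher until it answers EXIT_DISPATCHER. */
typedef struct { int origin, event, data; } mjec_event;
typedef int (*FUNC)(int origin, int event, int data, void *context);

enum { EV_INIT = -1, EV_EXIT = -2 };               /* assumed sentinels */
enum { STAY_IN_DISPATCHER = 0, EXIT_DISPATCHER = 1 };

static mjec_event queue[64];                       /* toy event queue  */
static int q_head, q_tail;

void milPostEvent(int target, int event, int data) {
    (void)target;                   /* single-job sketch: no routing */
    queue[q_tail].origin = 0;
    queue[q_tail].event = event;
    queue[q_tail].data = data;
    q_tail++;
}

int milEnterDispatchingMode(FUNC func, void *context) {
    int ret = func(0, EV_INIT, 0, context);        /* enter Dispatching Mode */
    while (ret != EXIT_DISPATCHER && q_head < q_tail) {
        mjec_event ev = queue[q_head++];           /* dequeue pending event  */
        ret = func(ev.origin, ev.event, ev.data, context);
    }
    return func(0, EV_EXIT, 0, context);           /* back to Execution Mode */
}

/* Demo dispatcher: count ordinary events, leave after seeing event 99. */
static int counter_dispatch(int origin, int event, int data, void *context) {
    (void)origin; (void)data;
    if (event == EV_INIT || event == EV_EXIT) return STAY_IN_DISPATCHER;
    int *count = context;
    ++*count;
    return event == 99 ? EXIT_DISPATCHER : STAY_IN_DISPATCHER;
}
```

Events posted after the dispatcher exits simply stay enqueued, matching the Execution Mode behavior on the previous slide.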

14
Experience with MJEC
  • ParC: 250 lines; SPLASH: 120 lines
  • Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
  • Implementation of location-dependent services (e.g., graphical display)

15
Example - Barriers with MJEC
  Barrier()
      milPostEvent(BARSERV, ARR, 0)
      milEnterDispatchingMode(wait_in_barrier, 0)

  wait_in_barrier(src, event, context)
      if (event == DEP)
          return EXIT_DISPATCHER
      else
          return STAY_IN_DISPATCHER

(Figure: each job's BARRIER(...) call posts an ARR event to the Barrier Server's dispatcher)
16
Example - Barriers with MJEC (cont)
  BarrierServer()
      milEnterDispatchingMode(barrier_server, info)

  barrier_server(src, event, context)
      if (event == ARR)
          enqueue(context.queue, src)
          if (should_release(context))
              while (context.cnt > 0)
                  milPostEvent(context.dequeue, DEP)
      return STAY_IN_DISPATCHER

(Figure: the Barrier Server's dispatcher releases the waiting jobs by posting DEP events back to their dispatchers)
17
Dynamic Page- and Job-Migration
  • Migration may occur in case of:
  • Remote memory access
  • Load imbalance
  • User comes back from lunch
  • Improving locality by location rearrangement
  • Sometimes migration should be disabled:
  • by the system (ping-pong, critical section)
  • by the programmer (to control the system)

18
Locality of memory reference is THE dominant efficiency factor. Migration can help locality.
(Figure compares: Only Job Migration / Only Page Migration / Page + Job Migration)
19
Load Sharing + Max. Locality: a Minimum-Weight Multiway Cut
(Figure: jobs p, q, r and their pages form a graph, partitioned between hosts by the cut)
20
Problems with the multiway cut model
  • NP-hard for cuts > 2. We have n > X,000,000. Polynomial 2-approximations known
  • Not optimized for load balancing
  • Page replicas
  • Graph changes dynamically
  • Only external accesses are recorded => only partial information is available
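The objective being approximated can still be written down concretely. This toy function (the sizes and access counts are illustrative, not from Millipede) scores a placement by the thread-to-page accesses that cross hosts, which is exactly the weight the multiway cut minimizes:

```c
#include <assert.h>

/* Toy version of the multiway-cut objective: a placement's weight is
 * the number of thread-to-page accesses that cross hosts. */
#define T 3  /* threads */
#define P 3  /* pages   */

int cut_weight(int access[T][P], const int thread_host[T],
               const int page_host[P]) {
    int w = 0;
    for (int t = 0; t < T; t++)
        for (int p = 0; p < P; p++)
            if (thread_host[t] != page_host[p])
                w += access[t][p];   /* remote access pays full cost */
    return w;
}
```

A placement that keeps every page on the host of its heaviest user scores 0, while splitting a hot thread from its page adds that access count to the cut: the load-sharing constraint is what forbids simply putting everything on one host.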

21
Our Approach
(Figure: a job's access history: page 1, page 2, page 1, page 0)
  • Record the history of remote accesses
  • Use this information when making decisions concerning load balancing/load sharing
  • Save old information to avoid repeating bad decisions (learn from mistakes)
  • Detect and solve ping-pong situations
  • Do everything by piggybacking on communication that is taking place anyway
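A sketch of such a history record follows; the per-page counters, the halving decay rule, and the field names are assumptions for illustration (in Millipede the updates would ride on DSM traffic that flows anyway):

```c
#include <assert.h>

/* Sketch of a per-thread access history: count remote accesses per page,
 * decay old counts so stale information fades but is not forgotten at once. */
#define MAX_PAGES 4

typedef struct {
    int remote_hits[MAX_PAGES];  /* remote accesses by this thread, per page */
} access_history;

/* Piggybacked on a DSM message: one more remote access observed. */
void history_record(access_history *h, int page) {
    h->remote_hits[page]++;
}

/* Halve all counts periodically: old (possibly bad) decisions still
 * leave a trace, but recent behavior dominates. */
void history_decay(access_history *h) {
    for (int p = 0; p < MAX_PAGES; p++)
        h->remote_hits[p] /= 2;
}

/* The page this thread most wants local: argmax of recent remote hits. */
int history_hottest_page(const access_history *h) {
    int best = 0;
    for (int p = 1; p < MAX_PAGES; p++)
        if (h->remote_hits[p] > h->remote_hits[best]) best = p;
    return best;
}
```

A migration heuristic can then compare the hottest page's count against the cost of moving the thread instead, without any central server seeing the whole graph.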

22
Ping Pong
  • Detection (local):
  • 1. Local threads attempt to use the page a short time after it leaves the local host
  • 2. The page leaves the host shortly after arrival
  • Treatment (by the ping-pong server):
  • Collect information regarding all participating hosts and threads
  • Try to locate an underloaded target host
  • Stabilize the system by locking-in pages/threads
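The two local detection rules above amount to comparing page arrival/departure timestamps against a threshold. This sketch assumes millisecond timestamps and a made-up threshold; the real tuning values are the sensitivity discussed on slide 25:

```c
#include <assert.h>

/* Sketch of the local ping-pong tests: a page re-requested soon after
 * leaving, or leaving soon after arriving, is flagged as ping-ponging.
 * PP_THRESHOLD_MS is an assumed tuning value. */
#define PP_THRESHOLD_MS 50

typedef struct {
    long arrived_ms;   /* when the page last arrived at this host */
    long left_ms;      /* when the page last left this host       */
} page_times;

/* Rule 1: a local thread faults on the page shortly after it left. */
int pp_on_local_miss(const page_times *pt, long now_ms) {
    return now_ms - pt->left_ms < PP_THRESHOLD_MS;
}

/* Rule 2: a remote request takes the page away shortly after arrival. */
int pp_on_departure(const page_times *pt, long now_ms) {
    return now_ms - pt->arrived_ms < PP_THRESHOLD_MS;
}
```

Either flag would trigger a report to the ping-pong server, which sees the global picture and can lock the page in somewhere.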

23
Optimization - TSP - Effect of Locality
15 cities, Bare Millipede
(Figure: execution time in seconds (0-4000) vs. number of hosts (1-6) for NO-FS, OPTIMIZED-FS, and FS)
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads; in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.

24
TSP on 6 hosts; k = number of threads falsely sharing a page

  k | optimized? | DSM-related messages | ping-pong treatment msgs | thread migrations | execution time (sec)
  2 | Yes        |   5100               | 290                      |  68               |  645
  2 | No         | 176120               |   0                      |  23               | 1020
  3 | Yes        |   4080               | 279                      |  87               |  620
  3 | No         | 160460               |   0                      |  32               | 1514
  4 | Yes        |   5060               | 343                      |  99               |  690
  4 | No         | 155540               |   0                      |  44               | 1515
  5 | Yes        |   6160               | 443                      | 139               |  700
  5 | No         | 162505               |   0                      |  55               | 1442
25
Ping Pong Detection Sensitivity

26
Applications
  • Numerical computations: Multigrid
  • Model checking: BDDs
  • Compute-intensive graphics: Ray-Tracing, Radiosity
  • Games, search trees, pruning, tracking, CFD ...

27
Performance Evaluation
  • L - underloaded, H - overloaded
  • Delta(ms) - lock-in time
  • t/o delta - polling (MGS, DSM)
  • msg delta - system pages delta
  • T_epoch - max history time
  • ??? - remove old histories / refresh old histories
  • L_epoch - histories length
  • page histories vs. job histories
  • migration heuristic - which func?
  • ping-pong - what is the initial noise? what freq. is PP?
28
  • LU Decomposition 1024x1024 matrix written in
    SPLASH
  • Performance improvements when there are few
    threads on each host

29
LU Decomposition of a 2048x2048 matrix written in SPLASH. Super-linear speedups due to the caching effect.
30
Jacobi Relaxation 512x512 matrix (using 2
matrices, no false sharing) written in ParC
31
Overhead of ParC/Millipede on a single host, tested with a Tracking algorithm.
32
Info...
http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
Release available at the Millipede site!