Title: Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
1. Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
- Kai Shen, Hong Tang, and Tao Yang
- http://www.cs.ucsb.edu/research/tmpi
- Department of Computer Science
- University of California, Santa Barbara
2. MPI-Based Parallel Computation on Shared Memory Machines
- Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- ⇒ MPI on SMMs
- Threads are easier to program, but people still use MPI on SMMs:
  - Better portability for running on other platforms (e.g., SMM clusters)
  - Good data locality due to data partitioning
3. Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling
  - Good for parallel programs that synchronize frequently
  - Low resource utilization (processor fragmentation when there is not enough parallelism)
- Space/time sharing
  - Time sharing on dynamically partitioned machines
  - Short response time and high throughput
- Impact on MPI program execution
  - Not all MPI nodes are scheduled simultaneously.
  - The number of available processors for each application may change dynamically.
- Optimization is needed for fast MPI execution on SMMs.
4. Techniques Studied
- Thread-based MPI execution [PPoPP '99]
  - Compile-time transformation for thread-safe MPI execution
  - Fast context switch and synchronization
  - Fast communication through address sharing
- Two-level thread management for multiprogrammed environments
  - Even faster context switch/synchronization
  - Use scheduling information to guide synchronization
- Our prototype system: TMPI
5. Related Work
- MPI-related work
  - MPICH, a portable MPI implementation [Gropp, Lusk, et al.]
  - SGI MPI, highly optimized on SGI platforms
  - MPI-2, multithreading within a single MPI node
- Scheduling and synchronization
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research
  - Scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism
6. Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
7. Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, more synchronization leads to more context switches.
- ⇒ Context switch/synchronization has a large performance impact in multiprogrammed environments.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - Faster context switch and synchronization among user-level threads
  - Very few kernel-level context switches
8. System Architecture
[Architecture diagram: multiple MPI applications, each running on its own TMPI runtime over user-level threads, on top of a system-wide resource management layer]
- Targeted at multiprogrammed environments
- Two-level thread management
9. Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor)
  - collects information about active MPI applications
  - partitions processors among them
- Application-wide user-level thread management
  - maps each MPI node to a user-level thread
  - schedules user-level threads on a pool of kernel threads
  - keeps the number of active kernel threads close to the number of allocated processors
- Big picture (in the whole system):
  - ⇒ # active kernel threads ≈ # processors
  - ⇒ minimize kernel-level context switches
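The partitioning step above can be sketched as a small function. This is an illustrative even-split policy under assumed names; the slides do not specify the manager's actual algorithm:

```c
#include <assert.h>

/* Hypothetical sketch of the system-wide resource manager's
   partitioning step: divide `total` processors among `napps` active
   MPI applications as evenly as possible.  The real manager (OS
   kernel or user-level central monitor) may use a different policy. */
void partition_processors(int total, int napps, int alloc[]) {
    int base = total / napps;   /* every application gets at least this many */
    int extra = total % napps;  /* the remainder goes to the first few apps */
    for (int i = 0; i < napps; i++)
        alloc[i] = base + (i < extra ? 1 : 0);
}
```

Each application then adapts its number of active kernel threads to its `alloc[i]` share, which is what keeps kernel-level context switches rare.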
10. User-level Thread Scheduling
- Every kernel thread can be
  - active: executing an MPI node (user-level thread), or
  - suspended.
- Execution invariants for each application:
  - # active kernel threads = # allocated processors (minimize kernel-level context switches)
  - # kernel threads = # MPI nodes (avoid dynamic thread creation)
- Every active kernel thread polls the system resource manager, which leads to
  - Deactivation: suspending itself
  - Activation: waking up some suspended kernel threads
  - No action
- When to poll?
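The three polling outcomes above follow directly from the first invariant. A minimal sketch of the decision, with assumed names (the slides describe the rule, not this interface):

```c
#include <assert.h>

typedef enum { POLL_NO_ACTION, POLL_ACTIVATE, POLL_DEACTIVATE } poll_action;

/* Hypothetical sketch of the per-application polling rule: compare the
   number of currently active kernel threads against the number of
   processors the resource manager has allocated to this application. */
poll_action poll_resource_manager(int active_threads, int allocated_procs) {
    if (active_threads > allocated_procs)
        return POLL_DEACTIVATE;  /* too many runners: suspend self */
    if (active_threads < allocated_procs)
        return POLL_ACTIVATE;    /* spare processors: wake suspended threads */
    return POLL_NO_ACTION;       /* invariant already holds */
}
```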
11. Polling in User-level Context Switch
- A context switch is the result of synchronization (e.g., an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during the context switch:
  - Two stack switches if deactivating
    - ⇒ suspend on a dummy stack
  - One stack switch otherwise
- After optimization: 2 μs on average on the SGI Power Challenge
12. Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
13. Event Waiting Synchronization
- All MPI synchronization is based on waitEvent(pflag, value).
- Waiting could be
  - spinning
  - yielding/blocking
[Diagram: the waiter calls waitEvent(pflag, value) and waits until *pflag == value; the caller sets the flag and wakes the waiter up]
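A minimal spin-then-block sketch of the waitEvent primitive, assuming C11 atomics and using `sched_yield()` as a stand-in for the user-level context switch TMPI actually performs on the blocking path:

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Illustrative sketch, not the TMPI implementation.  Returns 1 if the
   event was observed while spinning, 0 if the waiter had to fall back
   to the blocking path.  spin_limit plays the role of the tuned spin
   time discussed on the next slide. */
int wait_event(atomic_int *pflag, int value, int spin_limit) {
    for (int i = 0; i < spin_limit; i++)
        if (atomic_load(pflag) == value)
            return 1;                    /* caller set *pflag while we spun */
    while (atomic_load(pflag) != value)
        sched_yield();                   /* stand-in for yielding/blocking */
    return 0;
}
```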
14. Tradeoff Between Spin and Block
- Basic rules for waiting using spin-then-block:
  - Spinning wastes CPU cycles.
  - Blocking introduces context switch overhead; always-blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - Fast context switch has a substantial performance impact.
  - Use scheduling information to guide the spin/block decision.
  - Spinning is futile when the caller is not currently scheduled.
  - Most blocking cost comes from the cache-flushing penalty (actual cost varies, up to several ms).
15. Scheduler-conscious Event Waiting
- The user-level scheduler provides
  - scheduling info
  - affinity info
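One way the scheduling and affinity info could feed the spin/block decision is sketched below. The slides give the principle (spinning is futile when the caller is not scheduled; blocking mostly costs cache flushing), but not this interface, so the struct and the spin counts are assumptions:

```c
#include <assert.h>

/* Hypothetical view the user-level scheduler could expose to a waiter. */
typedef struct {
    int caller_scheduled;    /* is the MPI node that will set the flag running? */
    int waiter_has_affinity; /* is the waiter's cache state still warm here? */
} sched_info;

/* Scheduler-conscious choice of spin time before blocking.  Never spin
   for a caller that is not on a processor; spin longer when affinity is
   warm, since blocking would pay the cache-flushing penalty. */
int choose_spin_limit(sched_info s) {
    if (!s.caller_scheduled)
        return 0;                          /* block immediately: spinning is futile */
    return s.waiter_has_affinity ? 1000 : 100;  /* illustrative spin counts */
}
```

The returned value would be handed to a spin-then-block waiting loop as its spin budget.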
16. Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000s and 2 GB memory
  - SGI Power Challenge with 4 200 MHz MIPS R4400s and 256 MB memory
- Compare among:
  - TMPI-2: TMPI with two-level thread management
  - SGI MPI: SGI's native MPI implementation
  - TMPI: the original TMPI without two-level thread management
17. Testing Benchmarks
- Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more synchronization leads to context switches.
- The sparse LU benchmarks have much more frequent synchronization than the others.
18. Performance Evaluation on a Multiprogrammed Workload
- The workload contains a sequence of six jobs launched at a fixed interval.
- Compare job turnaround time on the Power Challenge.
19. Workloads with Certain Multiprogramming Degrees
- Goal: identify the performance impact of multiprogramming degrees.
- Experimental setting:
  - Each workload has one benchmark program.
  - Run n MPI nodes on p processors (n ≥ p).
  - The multiprogramming degree is n/p.
- Compare megaflop rates or speedups of the kernel part of each application.
20. Performance Impact of Multiprogramming Degree (SGI Power Challenge)
21. Performance Impact of Multiprogramming Degree (SGI Origin 2000)
- Performance ratios of TMPI-2 over TMPI
- Performance ratios of TMPI-2 over SGI MPI
22. Benefits of Scheduler-conscious Event Waiting
- Improvement over simple spin-block on the Power Challenge
- Improvement over simple spin-block on the Origin 2000
23. Conclusions
- Contributions for optimizing MPI execution:
  - Adaptive two-level thread management
  - Scheduler-conscious event waiting
- Large performance improvement, up to an order of magnitude, depending on the application and load.
- In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
- Current and future work:
  - Support threaded MPI on SMP clusters
- http://www.cs.ucsb.edu/research/tmpi