Title: Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
1. Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
- Kai Shen, Hong Tang, and Tao Yang
- http://www.cs.ucsb.edu/research/tmpi
- Department of Computer Science
- University of California, Santa Barbara
2. MPI-Based Parallel Computation on Shared Memory Machines
- Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- ⇒ MPI on SMMs
- Threads are easier to program, but people still use MPI on SMMs:
  - Better portability for running on other platforms (e.g., SMM clusters)
  - Good data locality due to data partitioning
3. Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling
  - Good for parallel programs that synchronize frequently
  - Low resource utilization (processor fragmentation when there is not enough parallelism)
- Space/time sharing
  - Time sharing on dynamically partitioned machines
  - Short response time and high throughput
- Impact on MPI program execution
  - Not all MPI nodes are scheduled simultaneously.
  - The number of available processors for each application may change dynamically.
- Optimization is needed for fast MPI execution on SMMs.
4. Techniques Studied
- Thread-based MPI execution [PPoPP '99]
  - Compile-time transformation for thread-safe MPI execution
  - Fast context switch and synchronization
  - Fast communication through address sharing
- Two-level thread management for multiprogrammed environments
  - Even faster context switch/synchronization
  - Use scheduling information to guide synchronization
- Our prototype system: TMPI
5. Related Work
- MPI-related work
  - MPICH, a portable MPI implementation [Gropp, Lusk, et al.]
  - SGI MPI, highly optimized on SGI platforms
  - MPI-2, multithreading within a single MPI node
- Scheduling and synchronization
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research
  - Scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism
6. Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
7. Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, more synchronization leads to more context switches.
- ⇒ Context switch/synchronization has a large performance impact in multiprogrammed environments.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - Faster context switch and synchronization among user-level threads
  - Very few kernel-level context switches
8. System Architecture
[Architecture diagram: multiple MPI applications, each running on its own TMPI runtime over user-level threads, on top of a system-wide resource management layer]
- Targeted at multiprogrammed environments
- Two-level thread management
9. Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor)
  - collects information about active MPI applications
  - partitions processors among them
- Application-wide user-level thread management
  - maps each MPI node to a user-level thread
  - schedules user-level threads on a pool of kernel threads
  - keeps the number of active kernel threads close to the number of allocated processors
- Big picture (in the whole system):
  - ⇒ # active kernel threads ≈ # processors
  - ⇒ minimize kernel-level context switches
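The partitioning step above can be sketched as a small function. This is an illustrative even-split policy under assumed names; the slides do not specify the manager's actual algorithm:

```c
#include <assert.h>

/* Hypothetical sketch of the system-wide resource manager's
   partitioning step: divide `total` processors among `napps` active
   MPI applications as evenly as possible.  The real manager (OS
   kernel or user-level central monitor) may use a different policy. */
void partition_processors(int total, int napps, int alloc[]) {
    int base = total / napps;   /* every application gets at least this many */
    int extra = total % napps;  /* the remainder goes to the first few apps */
    for (int i = 0; i < napps; i++)
        alloc[i] = base + (i < extra ? 1 : 0);
}
```

Each application then adapts its number of active kernel threads to its `alloc[i]` share, which is what keeps kernel-level context switches rare.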
10. User-level Thread Scheduling
- Every kernel thread can be
  - active: executing an MPI node (user-level thread), or
  - suspended.
- Execution invariants for each application:
  - # active kernel threads = # allocated processors (minimize kernel-level context switches)
  - # kernel threads = # MPI nodes (avoid dynamic thread creation)
- Every active kernel thread polls the system resource manager, which leads to
  - Deactivation: suspending itself
  - Activation: waking up some suspended kernel threads
  - No action
- When to poll?
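The three polling outcomes above follow directly from the first invariant. A minimal sketch of the decision, with assumed names (the slides describe the rule, not this interface):

```c
#include <assert.h>

typedef enum { POLL_NO_ACTION, POLL_ACTIVATE, POLL_DEACTIVATE } poll_action;

/* Hypothetical sketch of the per-application polling rule: compare the
   number of currently active kernel threads against the number of
   processors the resource manager has allocated to this application. */
poll_action poll_resource_manager(int active_threads, int allocated_procs) {
    if (active_threads > allocated_procs)
        return POLL_DEACTIVATE;  /* too many runners: suspend self */
    if (active_threads < allocated_procs)
        return POLL_ACTIVATE;    /* spare processors: wake suspended threads */
    return POLL_NO_ACTION;       /* invariant already holds */
}
```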
11. Polling in User-level Context Switch
- A context switch is the result of synchronization (e.g., an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during the context switch:
  - Two stack switches if deactivating
    - ⇒ suspend on a dummy stack
  - One stack switch otherwise
- After optimization: 2 μs on average on the SGI Power Challenge
12. Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
13. Event Waiting Synchronization
- All MPI synchronization is based on waitEvent(pflag, value).
- Waiting could be
  - spinning
  - yielding/blocking
[Diagram: the waiter calls waitEvent(pflag, value) and waits until *pflag == value; the caller sets the flag and wakes the waiter up]
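A minimal spin-then-block sketch of the waitEvent primitive, assuming C11 atomics and using `sched_yield()` as a stand-in for the user-level context switch TMPI actually performs on the blocking path:

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Illustrative sketch, not the TMPI implementation.  Returns 1 if the
   event was observed while spinning, 0 if the waiter had to fall back
   to the blocking path.  spin_limit plays the role of the tuned spin
   time discussed on the next slide. */
int wait_event(atomic_int *pflag, int value, int spin_limit) {
    for (int i = 0; i < spin_limit; i++)
        if (atomic_load(pflag) == value)
            return 1;                    /* caller set *pflag while we spun */
    while (atomic_load(pflag) != value)
        sched_yield();                   /* stand-in for yielding/blocking */
    return 0;
}
```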
14. Tradeoff Between Spin and Block
- Basic rules for waiting using spin-then-block:
  - Spinning wastes CPU cycles.
  - Blocking introduces context switch overhead; always-blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - Fast context switch has a substantial performance impact.
  - Use scheduling information to guide the spin/block decision.
  - Spinning is futile when the caller is not currently scheduled.
  - Most blocking cost comes from the cache-flushing penalty (actual cost varies, up to several ms).
15. Scheduler-conscious Event Waiting
- The user-level scheduler provides
  - scheduling info
  - affinity info
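One way the scheduling and affinity info could feed the spin/block decision is sketched below. The slides give the principle (spinning is futile when the caller is not scheduled; blocking mostly costs cache flushing), but not this interface, so the struct and the spin counts are assumptions:

```c
#include <assert.h>

/* Hypothetical view the user-level scheduler could expose to a waiter. */
typedef struct {
    int caller_scheduled;    /* is the MPI node that will set the flag running? */
    int waiter_has_affinity; /* is the waiter's cache state still warm here? */
} sched_info;

/* Scheduler-conscious choice of spin time before blocking.  Never spin
   for a caller that is not on a processor; spin longer when affinity is
   warm, since blocking would pay the cache-flushing penalty. */
int choose_spin_limit(sched_info s) {
    if (!s.caller_scheduled)
        return 0;                          /* block immediately: spinning is futile */
    return s.waiter_has_affinity ? 1000 : 100;  /* illustrative spin counts */
}
```

The returned value would be handed to a spin-then-block waiting loop as its spin budget.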
16. Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000s and 2 GB memory
  - SGI Power Challenge with 4 200 MHz MIPS R4400s and 256 MB memory
- Compare among:
  - TMPI-2: TMPI with two-level thread management
  - SGI MPI: SGI's native MPI implementation
  - TMPI: the original TMPI without two-level thread management
17. Testing Benchmarks
- Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more synchronization leads to context switches.
- The sparse LU benchmarks have much more frequent synchronization than the others.
18. Performance Evaluation on a Multiprogrammed Workload
- The workload contains a sequence of six jobs launched at a fixed interval.
- Compare job turnaround time on the Power Challenge.
19. Workloads with Certain Multiprogramming Degrees
- Goal: identify the performance impact of multiprogramming degrees.
- Experimental setting:
  - Each workload has one benchmark program.
  - Run n MPI nodes on p processors (n ≥ p).
  - The multiprogramming degree is n/p.
- Compare megaflop rates or speedups of the kernel part of each application.
20. Performance Impact of Multiprogramming Degree (SGI Power Challenge)
21. Performance Impact of Multiprogramming Degree (SGI Origin 2000)
- Performance ratios of TMPI-2 over TMPI
- Performance ratios of TMPI-2 over SGI MPI
22. Benefits of Scheduler-conscious Event Waiting
- Improvement over simple spin-block on the Power Challenge
- Improvement over simple spin-block on the Origin 2000
23. Conclusions
- Contributions for optimizing MPI execution:
  - Adaptive two-level thread management
  - Scheduler-conscious event waiting
- Large performance improvement, up to an order of magnitude, depending on the application and load.
- In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
- Current and future work:
  - Support threaded MPI on SMP clusters
- http://www.cs.ucsb.edu/research/tmpi