Title: Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
1. Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
- Kai Shen, Hong Tang, and Tao Yang
- http://www.cs.ucsb.edu/research/tmpi
- Department of Computer Science
- University of California, Santa Barbara
2. MPI-Based Parallel Computation on Shared Memory Machines
- Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- ⇒ MPI on SMMs
- Threads are easy to program, but MPI is still used on SMMs:
  - Better portability for running on other platforms (e.g., SMM clusters)
  - Good data locality due to data partitioning
3. Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling
  - Good for parallel programs that synchronize frequently
  - Hurts resource utilization (processor fragmentation: not enough parallelism to use the allocated resources)
- Space/time sharing
  - Time sharing combined with dynamic partitioning
  - High throughput; popular in current OSes (e.g., IRIX 6.5)
- Impact on MPI program execution
  - Not all MPI nodes are scheduled simultaneously.
  - The number of available processors for each application may change dynamically.
- Optimization is needed for fast MPI execution on SMMs.
4. Techniques Studied
- Thread-based MPI execution [PPoPP '99]
  - Compile-time transformation for thread-safe MPI execution
  - Fast context switch and synchronization
  - Fast communication through address sharing
- Two-level thread management for multiprogrammed environments
  - Even faster context switch/synchronization
  - Use of scheduling information to guide synchronization
- Our prototype system: TMPI
5. Impact of Synchronization on Coarse-grain Parallel Programs
- Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3.
- Synchronization costs 43-84% of total time.
- [Figure: execution time breakdown for TMPI and SGI MPI]
6. Related Work
- MPI-related work
  - MPICH, a portable MPI implementation [Gropp/Lusk et al.]
  - SGI MPI, highly optimized on SGI platforms
  - MPI-2, multithreading within a single MPI node
- Scheduling and synchronization
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research.
  - Scheduler-conscious Synchronization [Kontothanassis et al.]: focuses on primitives such as barriers and locks.
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
7. Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
8. Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, synchronization leads to more context switches ⇒ large performance impact.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - Faster context switch and synchronization among user-level threads
  - Very few kernel-level context switches
9. System Architecture
- [Figure: layered architecture; multiple MPI applications, each running on its own TMPI runtime with user-level threads, on top of a system-wide resource management layer]
- Targeted at multiprogrammed environments
- Two-level thread management
10. Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor)
  - collects information about active MPI applications
  - partitions processors among them (one possible policy is sketched after this list)
- Application-wide user-level thread management
  - maps each MPI node to a user-level thread
  - schedules user-level threads on a pool of kernel threads
  - keeps the number of active kernel threads close to the number of allocated processors
- Big picture (in the whole system)
  - ⇒ active kernel threads ≈ processors
  - ⇒ minimize kernel-level context switches
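A minimal sketch of the system-wide manager's partitioning step, assuming a simple equal-share policy; the slides do not specify the actual policy, and the function and names below are illustrative, not TMPI's interface:

```c
/* Hypothetical equal-share partitioning: divide num_procs processors
 * among num_apps active MPI applications; leftover processors go to
 * the first few applications. Illustrative only. */
void partition_processors(int num_procs, int num_apps, int alloc[]) {
    int base  = num_procs / num_apps;   /* every application's base share */
    int extra = num_procs % num_apps;   /* remainder, handed out one each */
    for (int i = 0; i < num_apps; i++)
        alloc[i] = base + (i < extra ? 1 : 0);
}
```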
11. User-level Thread Scheduling
- Every kernel thread can be
  - active: executing an MPI node (a user-level thread), or
  - suspended.
- Execution invariants for each application
  - active kernel threads = allocated processors (minimizes kernel-level context switches)
  - kernel threads = MPI nodes (avoids dynamic thread creation)
- Every active kernel thread polls the system resource manager, which leads to one of three actions (sketched after this list):
  - deactivation: suspending itself
  - activation: waking up some suspended kernel threads
  - no action
- When to poll?
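A hedged sketch of the per-poll decision, restoring the invariant above; the three-way result type and function name are invented for illustration:

```c
/* Illustrative polling step: each active kernel thread compares the
 * application's count of active kernel threads against its current
 * processor allocation, restoring "active threads = allocated procs". */
typedef enum { POLL_NO_ACTION, POLL_DEACTIVATE, POLL_ACTIVATE } poll_action;

poll_action poll_resource_manager(int active_threads, int allocated_procs) {
    if (active_threads > allocated_procs)
        return POLL_DEACTIVATE;  /* too many runners: suspend myself */
    if (active_threads < allocated_procs)
        return POLL_ACTIVATE;    /* spare processors: wake a suspended thread */
    return POLL_NO_ACTION;       /* allocation matches: keep running */
}
```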
12. Polling in User-level Context Switch
- A context switch is the result of synchronization (e.g., an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during each context switch:
  - Two stack switches if deactivating ⇒ suspend on a dummy stack (illustrated below)
  - One stack switch otherwise
- After optimization, about 2 μs on average on the SGI Power Challenge
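A small runnable demo of the two-switch deactivation path, assuming glibc's ucontext API; the suspension itself is simulated by a print statement. Hopping onto a dummy stack first frees the user-level thread's own stack so another kernel thread could resume it:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, dummy_ctx;
static char dummy_stack_mem[64 * 1024];

static void on_dummy_stack(void) {
    /* Here the real system would suspend the kernel thread until the
     * resource manager reactivates it; we just print and return. */
    printf("on dummy stack: kernel thread would suspend here\n");
    swapcontext(&dummy_ctx, &main_ctx);   /* 2nd stack switch: resume */
}

int main(void) {
    getcontext(&dummy_ctx);
    dummy_ctx.uc_stack.ss_sp   = dummy_stack_mem;
    dummy_ctx.uc_stack.ss_size = sizeof dummy_stack_mem;
    dummy_ctx.uc_link = &main_ctx;
    makecontext(&dummy_ctx, on_dummy_stack, 0);

    printf("deactivating: leaving the user-level thread's stack\n");
    swapcontext(&main_ctx, &dummy_ctx);   /* 1st stack switch */
    printf("reactivated: running user-level threads again\n");
    return 0;
}
```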
13. Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
14. Event Waiting Synchronization
- All MPI synchronization is based on waitEvent(pflag, value).
- [Figure: the waiter calls waitEvent(pflag, value) and waits until the caller sets *pflag to value, triggering the wakeup]
- Waiting could be (a spin-then-block sketch follows):
  - spinning
  - yielding/blocking
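A minimal spin-then-block sketch of such a waitEvent primitive using pthreads; the event_t type, SPIN_LIMIT bound, and postEvent name are illustrative, not TMPI's actual interface:

```c
#include <pthread.h>

#define SPIN_LIMIT 1000   /* hypothetical spin budget before blocking */

typedef struct {
    volatile long   flag;   /* the *pflag word being watched */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event_t;
/* e.g.: event_t ev = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }; */

/* Waiter: return once the flag reaches the expected value. */
void waitEvent(event_t *ev, long value) {
    for (int i = 0; i < SPIN_LIMIT; i++)        /* phase 1: spin */
        if (ev->flag == value) return;
    pthread_mutex_lock(&ev->lock);              /* phase 2: block */
    while (ev->flag != value)
        pthread_cond_wait(&ev->cond, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}

/* Caller: publish the value and wake any blocked waiter. */
void postEvent(event_t *ev, long value) {
    pthread_mutex_lock(&ev->lock);
    ev->flag = value;
    pthread_cond_broadcast(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}
```

The spin phase covers the common case where the event arrives quickly; the block phase bounds the CPU cycles wasted when it does not.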
15. Tradeoff between Spin and Block
- Basic rules for waiting using spin-then-block
  - Spinning wastes CPU cycles.
  - Blocking introduces context switch overhead; always blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings
  - Fast context switch has a substantial performance impact.
  - Use scheduling information to guide the spin/block decision.
  - Spinning is futile when the caller is not currently scheduled.
  - Most blocking cost comes from the cache-flushing penalty (actual cost varies, up to several ms).
16. Scheduler-conscious Event Waiting
- The user-level scheduler provides
  - scheduling info
  - affinity info
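A hedged sketch of how the scheduling info could guide the spin/block decision, building on the event_t and SPIN_LIMIT sketch above; caller_is_scheduled() is an invented stand-in for the scheduler's query, stubbed here so the example compiles:

```c
/* Hypothetical scheduler query: in the real system this would come
 * from the user-level scheduler's scheduling info. */
static int caller_is_scheduled(int caller_node) { (void)caller_node; return 1; }

/* Scheduler-conscious waiting: if the MPI node that will post the
 * event is not currently running on a processor, spinning is futile,
 * so block immediately; otherwise spin briefly before blocking. */
void waitEvent_sc(event_t *ev, long value, int caller_node) {
    if (caller_is_scheduled(caller_node)) {
        for (int i = 0; i < SPIN_LIMIT; i++)    /* worth spinning */
            if (ev->flag == value) return;
    }
    pthread_mutex_lock(&ev->lock);              /* block right away, or
                                                   after the spin budget */
    while (ev->flag != value)
        pthread_cond_wait(&ev->cond, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}
```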
17. Experimental Settings
- Machines
  - SGI Origin 2000 with 32 195 MHz MIPS R10000s and 2 GB memory
  - SGI Power Challenge with 4 200 MHz MIPS R4400s and 256 MB memory
- Compared systems
  - TMPI-2: TMPI with two-level thread management
  - SGI MPI: SGI's native MPI implementation
  - TMPI: original TMPI without two-level thread management
18. Testing Benchmarks
- Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more spin-blocks (context switches) during each synchronization.
- The sparse LU benchmarks have much more frequent synchronization than the others.
19. Performance Evaluation on a Multiprogrammed Workload
- The workload contains a sequence of six jobs launched at a fixed interval.
- Compare job turnaround times on the Power Challenge.
20. Workload with Certain Multiprogramming Degrees
- Goal: identify the performance impact of the multiprogramming degree.
- Experimental setting
  - Each workload has one benchmark program.
  - Run n MPI nodes on p processors (n ≥ p).
  - The multiprogramming degree is n/p.
- Compare megaflop rates or speedups of the kernel part of each application.
21. Performance Impact of Multiprogramming Degree (SGI Power Challenge)
22. Performance Impact of Multiprogramming Degree (SGI Origin 2000)
- [Figure: performance ratios of TMPI-2 over TMPI]
- [Figure: performance ratios of TMPI-2 over SGI MPI]
23. Benefits of Scheduler-conscious Event Waiting
- [Figure: improvement over simple spin-block on the Power Challenge]
- [Figure: improvement over simple spin-block on the Origin 2000]
24. Conclusions
- Contributions to optimizing MPI execution
  - Adaptive two-level thread management
  - Scheduler-conscious event waiting
  - Large performance improvements, up to an order of magnitude, depending on application and load
- In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
- Current and future work
  - Support threaded MPI on SMP clusters
- http://www.cs.ucsb.edu/research/tmpi