Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines

Kai Shen, Hong Tang, and Tao Yang, http://www.cs.ucsb.edu/research/tmpi



1
Adaptive Two-level Thread Management for MPI
Execution on Multiprogrammed Shared Memory
Machines
  • Kai Shen, Hong Tang, and Tao Yang
  • http://www.cs.ucsb.edu/research/tmpi
  • Department of Computer Science
  • University of California, Santa Barbara

2
MPI-Based Parallel Computation on Shared Memory
Machines
  • Shared Memory Machines (SMMs) or SMM clusters
    have become popular for high-end computing.
  • MPI is a portable, high-performance parallel
    programming model.
  • → MPI on SMMs
  • Threads are easy to program, but MPI is still
    used on SMMs:
  • Better portability for running on other platforms
    (e.g., SMM clusters)
  • Good data locality due to data partitioning.

3
Scheduling for Parallel Jobs in Multiprogrammed
SMMs
  • Gang-scheduling
  • Good for parallel programs which synchronize
    frequently
  • Affects resource utilization (processor
    fragmentation; not enough parallelism to use
    allocated resources).
  • Space/time Sharing
  • Time sharing combined with dynamic partitioning
  • High throughput. Popular in current OSes (e.g.,
    IRIX 6.5)
  • Impact on MPI program execution
  • Not all MPI nodes are scheduled simultaneously
  • The number of available processors for each
    application may change dynamically.
  • Optimization is needed for fast MPI execution on
    SMMs.

4
Techniques Studied
  • Thread-based MPI execution [PPoPP'99]
  • Compile-time transformation for thread-safe MPI
    execution
  • Fast context switch and synchronization
  • Fast communication through address sharing
  • Two-level thread management for multiprogrammed
    environments
  • Even faster context switch/synchronization
  • Use scheduling information to guide
    synchronization
  • Our prototype system: TMPI

5
Impact of synchronization on coarse-grain
parallel programs
  • Running a communication-infrequent MPI program
    (SWEEP3D) on 8 SGI Origin 2000 processors with
    multiprogramming degree 3.
  • Synchronization costs 43-84% of total time.
  • Execution time breakdown for TMPI and SGI MPI

6
Related Work
  • MPI-related Work
  • MPICH, a portable MPI implementation [Gropp/Lusk
    et al.].
  • SGI MPI, highly optimized on SGI platforms.
  • MPI-2, multithreading within a single MPI node.
  • Scheduling and Synchronization
  • Process Control [Tucker/Gupta] and Scheduler
    Activation [Anderson et al.]: focus on OS
    research.
  • Scheduler-conscious Synchronization
    [Kontothanassis et al.]: focus on primitives such
    as barriers and locks.
  • Hood/Cilk threads [Arora et al.] and Loop-level
    Scheduling [Yue/Lilja]: focus on fine-grain
    parallelism.

7
Outline
  • Motivations & Related Work
  • Adaptive Two-level Thread Management
  • Scheduler-conscious Event Waiting
  • Experimental Studies

8
Context Switch/Synchronization in Multiprogrammed
Environments
  • In multiprogrammed environments, synchronization
    leads to more context switches → large
    performance impact.
  • Conventional MPI implementation maps each MPI
    node to an OS process.
  • Our earlier work maps each MPI node to a kernel
    thread.
  • Two-level Thread Management maps each MPI node
    to a user-level thread.
  • Faster context switch and synchronization among
    user-level threads
  • Very few kernel-level context switches

9
System Architecture
[Figure: multiple MPI applications, each with its own TMPI runtime and user-level threads, running on top of system-wide resource management]
  • Targeted at multiprogrammed environments
  • Two-level thread management

10
Adaptive Two-level Thread Management
  • System-wide resource manager (OS kernel or
    User-level central monitor)
  • collects information about active MPI
    applications
  • partitions processors among them.
  • Application-wide user-level thread management
  • maps each MPI node into a user-level thread
  • schedules user-level threads on a pool of kernel
    threads
  • controls the number of active kernel threads
    close to the number of allocated processors.
  • Big picture (in the whole system)
  • → Number of active kernel threads ≈ number of processors
  • → Minimize kernel-level context switches
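
The slide says the system-wide resource manager partitions processors among the active MPI applications but does not name a policy. The following is a minimal sketch assuming one possible policy: equal shares handed out round-robin, capped at each application's MPI node count. The function name `partition_processors` and the policy itself are assumptions for illustration, not TMPI's actual algorithm.

```python
def partition_processors(total_processors, nodes_per_app):
    """Assumed policy: equal shares, capped by each app's MPI node count.

    nodes_per_app maps a (hypothetical) application name to its number
    of MPI nodes. Returns a dict of processors allocated per application.
    """
    allocation = {app: 0 for app in nodes_per_app}
    remaining = total_processors
    # Hand out one processor at a time, round-robin, to every
    # application that can still use another processor.
    while remaining > 0:
        progressed = False
        for app, nodes in nodes_per_app.items():
            if remaining > 0 and allocation[app] < nodes:
                allocation[app] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every application already has one processor per MPI node
    return allocation
```

With 8 processors and three applications of 8, 8, and 2 MPI nodes, this sketch allocates 3, 3, and 2 processors. Each application would then try to keep its number of active kernel threads close to its share.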

11
User-level Thread Scheduling
  • Every kernel thread is either:
  • active: executing an MPI node (user-level
    thread)
  • suspended.
  • Execution invariant for each application:
  • Number of active kernel threads ≈ number of allocated processors
  • (minimize kernel-level context switches)
  • Number of kernel threads = number of MPI nodes
  • (avoid dynamic thread creation)
  • Every active kernel thread polls the system
    resource manager, which leads to one of:
  • Deactivation: suspending itself
  • Activation: waking up some suspended kernel
    threads
  • No action
  • When to poll?
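
The three polling outcomes above can be sketched as a small decision function. This is an illustration only: the names `PollResult` and `poll_decision` are hypothetical, and the exact comparison TMPI uses is not given in the slides. The sketch simply steers the number of active kernel threads toward the number of allocated processors.

```python
from enum import Enum


class PollResult(Enum):
    DEACTIVATE = "deactivate"
    ACTIVATE = "activate"
    NO_ACTION = "no-action"


def poll_decision(active_kernel_threads, allocated_processors,
                  suspended_kernel_threads):
    """Return (action, count) for the kernel thread that is polling.

    count is how many threads the action applies to: 1 for
    deactivation (the polling thread suspends itself), the number of
    suspended threads to wake for activation, and 0 for no action.
    """
    if active_kernel_threads > allocated_processors:
        # Too many runnable kernel threads: the poller suspends itself.
        return PollResult.DEACTIVATE, 1
    if (active_kernel_threads < allocated_processors
            and suspended_kernel_threads > 0):
        # Spare processors: wake as many suspended threads as can be used.
        to_wake = min(allocated_processors - active_kernel_threads,
                      suspended_kernel_threads)
        return PollResult.ACTIVATE, to_wake
    return PollResult.NO_ACTION, 0
```

For example, an application with 4 active kernel threads whose allocation shrinks to 2 processors gets a deactivation on the next poll, and the same happens on later polls until the counts match.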

12
Polling in User-Level Context Switch
  • Context switch is a result of synchronization
    (e.g. an MPI node waits for a message).
  • Underlying kernel thread polls system resource
    manager during context switch
  • Two stack switches if deactivation
  • → suspend on a dummy stack
  • One stack switch otherwise
  • After optimization, 2 μs on average on SGI Power
    Challenge

13
Outline
  • Motivations & Related Work
  • Adaptive Two-level Thread Management
  • Scheduler-conscious Event Waiting
  • Experimental Studies

14
Event Waiting Synchronization
  • All MPI synchronization is based on waitEvent.
  • [Diagram: the waiter calls waitEvent on a flag (pflag) and a target value and waits; the caller sets the flag to that value, which wakes up the waiter.]
  • Waiting could be:
  • spinning
  • yielding/blocking

15
Tradeoff between spin and block
  • Basic rules for waiting using spin-then-block:
  • Spinning wastes CPU cycles.
  • Blocking introduces context switch overhead;
    always blocking is not good for dedicated
    environments.
  • Previous work focuses on choosing the best spin
    time.
  • Our optimization focus and findings
  • Fast context switch has substantial performance
    impact
  • Use scheduling information to guide spin/block
    decision
  • Spinning is futile when the caller is not
    currently scheduled
  • Most of the blocking cost comes from the cache
    flushing penalty (actual cost varies, up to
    several ms).
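
The spin/block rule described above can be sketched as a pure decision function. The names `wait_action` and `simple_spin_block` and the spin budget parameter are hypothetical, and the sketch models only the scheduling information (whether the caller is currently scheduled), not cache affinity or the real cost of a context switch.

```python
def wait_action(flag_is_set, caller_is_scheduled, spun_us, spin_limit_us):
    """Return what a waiting MPI node should do next.

    Returns one of 'proceed', 'spin', or 'block'.
    """
    if flag_is_set:
        return "proceed"   # the awaited event has already happened
    if not caller_is_scheduled:
        return "block"     # spinning is futile: the caller cannot set the flag now
    if spun_us < spin_limit_us:
        return "spin"      # the caller is running, so a short spin may pay off
    return "block"         # spin budget used up: fall back to blocking


def simple_spin_block(flag_is_set, spun_us, spin_limit_us):
    """Baseline spin-then-block that ignores scheduling information."""
    return wait_action(flag_is_set, True, spun_us, spin_limit_us)
```

The only difference from the baseline is the early `block` when the caller is not scheduled, which avoids burning the whole spin budget on a wait that cannot succeed.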

16
Scheduler-conscious Event Waiting
  • User-level scheduler provides
  • scheduling info
  • affinity info

17
Experimental Settings
  • Machines
  • SGI Origin 2000 system with 32 195 MHz MIPS
    R10000s and 2 GB memory
  • SGI Power Challenge with 4 200 MHz MIPS R4400s
    and 256 MB memory
  • Compare among:
  • TMPI-2: TMPI with two-level thread management
  • SGI MPI: SGI's native MPI implementation
  • TMPI: original TMPI without two-level thread
    management

18
Testing Benchmarks
  • Sync frequency is obtained by running each
    benchmark with 4 MPI nodes on a 4-processor Power
    Challenge.
  • The higher the multiprogramming degree, the more
    spin-blocks (context switches) during each
    synchronization.
  • Sparse LU benchmarks have much more frequent
    synchronization than others.

19
Performance evaluation on a Multiprogrammed
Workload
  • Workload contains a sequence of six jobs
    launched with a fixed interval.
  • Compare job turnaround time on the Power Challenge.

20
Workload with Certain Multiprogramming Degrees
  • Goal: identify the performance impact of
    multiprogramming degrees.
  • Experimental setting:
  • Each workload has one benchmark program.
  • Run n MPI nodes on p processors (n ≥ p).
  • Multiprogramming degree is n/p.
  • Compare megaflop rates or speedups of the kernel
    part of each application.
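
As a quick worked example of the definition above, the multiprogramming degree is just the ratio of MPI nodes to processors. The helper name `multiprogramming_degree` is hypothetical, and the check that n is at least p is an assumption about the experimental setup.

```python
def multiprogramming_degree(n_nodes, p_processors):
    """Multiprogramming degree n/p for n MPI nodes on p processors."""
    # Assumption: at least one MPI node per processor.
    if p_processors <= 0 or n_nodes < p_processors:
        raise ValueError("expected n >= p > 0")
    return n_nodes / p_processors
```

For instance, 12 MPI nodes on 4 processors gives a degree of 3, while 4 nodes on 4 processors gives a degree of 1 (a dedicated run).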

21
Performance Impact of Multiprogramming Degree
(SGI Power Challenge)

22
Performance Impact of Multiprogramming Degree
(SGI Origin 2000)
  • [Figure: performance ratios of TMPI-2 over TMPI]
  • [Figure: performance ratios of TMPI-2 over SGI MPI]

23
Benefits of Scheduler-conscious Event Waiting
  • [Figure: improvement over simple spin-block on Power Challenge]
  • [Figure: improvement over simple spin-block on Origin 2000]

24
Conclusions
  • Contributions for optimizing MPI execution:
  • Adaptive two-level thread management
  • Scheduler-conscious event waiting
  • Large performance improvement: up to an order of
    magnitude, depending on applications and load
  • In multiprogrammed environments, fast context
    switch/synchronization is important even for
    communication-infrequent MPI programs.
  • Current and future work
  • Support threaded MPI on SMP-clusters

http://www.cs.ucsb.edu/research/tmpi