1
MPICH2: A High-Performance and Widely Portable Open-Source MPI Implementation
  • Darius Buntinas
  • Argonne National Laboratory

2
Overview
  • MPICH2
  • High-performance
  • Open-source
  • Widely portable
  • MPICH2-based implementations
  • IBM for BG/L and BG/P
  • Cray for XT3/4
  • Intel
  • Microsoft
  • SiCortex
  • Myricom
  • Ohio State

3
Outline
  • Architectural overview
  • Nemesis: a new communication subsystem
  • New features and optimizations
  • Intranode communication
  • Optimizing non-contiguous messages
  • Optimizing large messages
  • Current work in progress
  • Optimizations
  • Multi-threaded environments
  • Process manager
  • Other optimizations
  • Libraries and tools

4
Traditional MPICH2 Developer APIs
  • Two APIs for porting MPICH2 to new communication
    architectures
  • ADI3
  • CH3
  • ADI3: implement a new device
  • Richer interface
  • 60 functions
  • More work to port
  • More flexibility
  • CH3: implement a new CH3 channel
  • Simpler interface
  • 15 functions
  • Easier to port
  • Less flexibility

5
[Architecture diagram: an application calls the MPI interface (PMPI, with MPE/Jumpshot for tracing and ROMIO for MPI-IO); the MPI layer sits on the ADI3 interface, implemented by devices (CH3, MX, BG, Cray); ROMIO's ADIO interface targets file systems (PVFS, GPFS, XFS, ...); the CH3 device exposes the CH3 interface to channels (Sock, Nemesis, SCTP, SHM, SSHM, ...); Nemesis exposes a netmod interface to network modules (TCP, IB/iWARP, PSM, MX, GM); the PMI interface connects to process managers (MPD, SMPD, Gforker)]
  • Support for high-speed networks
  • 10-Gigabit Ethernet (iWARP), QLogic PSM, InfiniBand, Myrinet (MX and GM)
  • Supports proprietary platforms
  • BlueGene/L, BlueGene/P, SiCortex, Cray
  • Distributed with the ROMIO MPI-IO library
  • Profiling and visualization tools (MPE, Jumpshot)

6
Nemesis
  • Nemesis is a new CH3 channel for MPICH2
  • Shared-memory for intranode communication
  • Lock-free queues
  • Scalability
  • Improved intranode performance
  • Network modules for internode communication
  • New interface
  • New developer API: the Nemesis netmod interface
  • Simpler interface than ADI3
  • More flexible than CH3

7
Nemesis Lock-Free Queues
  • Atomic memory operations (a CAS-based enqueue is sketched below)
  • Scalable
  • One recv queue per process
  • Optimized to reduce cache misses
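
A minimal sketch of the enqueue side of such a lock-free queue, using GCC's __sync compare-and-swap builtin. The names (lf_queue, lf_elem, lf_enqueue) are hypothetical and this is not MPICH2's internal code; the dequeue side, memory barriers, and the offset-based addressing a real shared-memory queue needs are omitted.

#include <stddef.h>

typedef struct lf_elem {
    struct lf_elem *next;
    char payload[64];                /* small message carried in the cell */
} lf_elem;

typedef struct {
    lf_elem *head;                   /* read by the single receiving process */
    lf_elem *tail;                   /* appended to by any sender via CAS */
} lf_queue;

/* Append an element by atomically swinging the tail pointer. */
static void lf_enqueue(lf_queue *q, lf_elem *e)
{
    lf_elem *old_tail;

    e->next = NULL;
    do {
        old_tail = q->tail;
    } while (!__sync_bool_compare_and_swap(&q->tail, old_tail, e));

    if (old_tail != NULL)
        old_tail->next = e;          /* link the predecessor to the new cell */
    else
        q->head = e;                 /* the queue was empty */
}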

8
Nemesis Network Modules
  • Improved interface for network modules
  • Allows optimized handling of non-contiguous data
  • Allows optimized transfer of large data
  • Optimized small contiguous-message path
  • < 2.5 µs over QLogic PSM
  • Future work
  • Multiple network modules
  • E.g., Myrinet for intra-cluster and TCP for
    inter-cluster
  • Dynamically loadable

9
Optimized Non-contiguous Messages
  • Issues with non-contiguous data
  • Representation
  • Manipulation
  • Packing, generating other representations (e.g., I/O vectors), etc.
  • Dataloops: MPICH2's optimized internal datatype representation
  • Efficiently describes non-contiguous data (an MPI-level example
    follows below)
  • Utilities to efficiently manipulate non-contiguous data
  • The dataloop is passed to the network module
  • Previously, an I/O vector was generated and then passed
  • The netmod implementation manipulates the dataloop itself, e.g.:
  • TCP generates an I/O vector
  • IB and PSM pack data into a send buffer
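
Dataloops are internal to MPICH2. At the application level, the same kind of non-contiguous layout is expressed with MPI derived datatypes, which MPICH2 then converts to dataloops. The following standard-MPI example sends every other element of an array without packing it by hand.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, buf[100], recvbuf[50];
    MPI_Datatype every_other;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 50 blocks of 1 int with a stride of 2 ints: elements 0, 2, ..., 98 */
    MPI_Type_vector(50, 1, 2, MPI_INT, &every_other);
    MPI_Type_commit(&every_other);

    if (rank == 0) {
        for (i = 0; i < 100; i++)
            buf[i] = i;
        MPI_Send(buf, 1, every_other, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* the 50 strided elements arrive contiguously */
        MPI_Recv(recvbuf, 50, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("recvbuf[1] = %d\n", recvbuf[1]);   /* prints 2 */
    }

    MPI_Type_free(&every_other);
    MPI_Finalize();
    return 0;
}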

10
Optimized Large Message Transfer Using Rendezvous
  • MPICH2 uses a rendezvous protocol to transfer large messages (the
    RTS/CTS/DATA handshake is sketched below)
  • In the original implementation, the channel was oblivious to the
    rendezvous protocol
  • CH3 sent RTS, CTS, DATA
  • Shared memory: large messages were sent through the queue
  • Netmod: the netmod would perform its own rendezvous
  • Shared memory: queues may not be the most efficient mechanism for
    transferring large data
  • E.g., network RDMA, an inter-process copy mechanism, or a copy buffer
  • Netmod: redundant rendezvous
  • Developed LMT interface to support various
    mechanisms
  • Sender transfers data (put)
  • Receiver transfers data (get)
  • Both sender and receiver participate in data
    transfer
  • Modified CH3 to use LMT
  • Works with rendezvous protocol
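
The handshake itself can be illustrated with ordinary point-to-point calls. The sketch below is a user-level analogue of the RTS/CTS/DATA exchange, not MPICH2's internal LMT code; the tags and the 1 MiB message size are arbitrary.

#include <mpi.h>
#include <stdlib.h>

#define TAG_RTS  1
#define TAG_CTS  2
#define TAG_DATA 3

int main(int argc, char **argv)
{
    int rank, nbytes = 1 << 20;          /* a "large" 1 MiB message */
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* RTS: announce the message size, then wait for clear-to-send */
        MPI_Send(&nbytes, 1, MPI_INT, 1, TAG_RTS, MPI_COMM_WORLD);
        MPI_Recv(NULL, 0, MPI_BYTE, 1, TAG_CTS, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, nbytes, MPI_BYTE, 1, TAG_DATA, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int incoming;
        MPI_Recv(&incoming, 1, MPI_INT, 0, TAG_RTS, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* CTS: the receive buffer is ready, so tell the sender to go ahead */
        MPI_Send(NULL, 0, MPI_BYTE, 0, TAG_CTS, MPI_COMM_WORLD);
        MPI_Recv(buf, incoming, MPI_BYTE, 0, TAG_DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}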

11
Optimization: LMT for Intranode Communication
  • For intranode communication, the LMT copies through a buffer in
    shared memory
  • Sender allocates shared memory region
  • Sends buffer ID to receiver in RTS packet
  • Receiver attaches to memory region
  • Both sender and receiver participate in transfer
  • Use double-buffering (the copy loop is sketched below)

[Diagram: sender and receiver alternately filling and draining the two shared-memory copy buffers]
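
A simplified version of the double-buffered copy might look like the following. The structure and function names are hypothetical, the synchronization uses busy-waiting on volatile flags without memory barriers, and the shared-memory setup (e.g., via shm_open/mmap) is omitted; MPICH2's actual implementation differs.

#include <string.h>
#include <stddef.h>

#define COPY_BUF_LEN (32 * 1024)

typedef struct {
    volatile size_t len[2];        /* bytes in each buffer; 0 means "empty" */
    char buf[2][COPY_BUF_LEN];     /* the two copy buffers */
} copy_buf;

/* Sender: fill buffers 0 and 1 alternately while the receiver drains them. */
void lmt_shm_send(copy_buf *cb, const char *src, size_t nbytes)
{
    int i = 0;
    while (nbytes > 0) {
        size_t n = nbytes < COPY_BUF_LEN ? nbytes : COPY_BUF_LEN;
        while (cb->len[i] != 0)        /* wait until the receiver emptied it */
            ;
        memcpy(cb->buf[i], src, n);
        cb->len[i] = n;                /* publish the chunk */
        src += n;
        nbytes -= n;
        i = 1 - i;                     /* switch to the other buffer */
    }
}

/* Receiver: drain buffers 0 and 1 alternately while the sender fills them. */
void lmt_shm_recv(copy_buf *cb, char *dst, size_t nbytes)
{
    int i = 0;
    while (nbytes > 0) {
        size_t n;
        while ((n = cb->len[i]) == 0)  /* wait for the sender's chunk */
            ;
        memcpy(dst, cb->buf[i], n);
        cb->len[i] = 0;                /* hand the buffer back */
        dst += n;
        nbytes -= n;
        i = 1 - i;
    }
}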
12
Current Work In Progress
  • Optimizations
  • Multi-threaded environments
  • Process manager
  • Other work
  • Atomic Operations Library

13
Current Optimization Work
  • Handle the common case fast: eager contiguous messages
  • Identify this case early in the operation
  • Call the netmod's send_eager_contig() function directly
  • Bypass the receive queue
  • Currently: check the unexpected queue, post on the posted queue,
    check the network
  • Optimized: check the unexpected queue, check the network
  • Reduced instruction count by 48
  • Eliminate function calls
  • Collapse layers where possible
  • Merge Nemesis with CH3
  • Move Nemesis functionality to CH3
  • CH3 shared memory support
  • New CH3 channel/netmod interface
  • Cache-aware placement of fields in structures

14
Fine Grained Threading
  • MPICH2 supports multi-threaded applications
  • MPI_THREAD_MULTIPLE (requested as in the example below)
  • Currently, thread safety is implemented with a
    single lock
  • Lock is acquired on entering an MPI function
  • And released on exit
  • Also released when making blocking communication
    system calls
  • Limits concurrency in communication
  • Only one thread can be in the progress engine at
    one time
  • New architectures have multiple DMA engines for
    communication
  • These can work independently of each other
  • Concurrency is needed in the progress engine for
    maximum performance
  • Even without independent network hardware
  • Internal concurrency can improve performance
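
For reference, an application requests full thread support at initialization and checks what level the library actually granted; this is standard MPI:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for MPI_THREAD_MULTIPLE; the library reports what it provides. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available "
                        "(provided = %d)\n", provided);

    /* ... any thread may now make MPI calls, subject to the granted level ... */

    MPI_Finalize();
    return 0;
}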

15
Multicore-Aware Collectives
  • Intra-node communication is much faster than
    inter-node
  • Take advantage of this in collective algorithms
  • E.g., Broadcast
  • Send to one process per node; that process broadcasts to the other
    processes on that node (sketched below)
  • A step further: collectives over shared memory
  • E.g., Broadcast
  • Within a node, process writes data to shared
    memory region
  • Other processes read data
  • Issues
  • Memory traffic, cache misses, etc.
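
A minimal sketch of the two-level broadcast idea, using MPI-3's MPI_Comm_split_type to build the per-node communicator. MPI-3 is used here only for brevity; an implementation inside the library would not be written this way.

#include <mpi.h>

/* Broadcast buf from rank 0 of comm: first across one leader per node,
 * then within each node.  Assumes the root is rank 0 of comm. */
void two_level_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Group the processes that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Local rank 0 on each node becomes that node's leader. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Step 1: broadcast across the node leaders (rank 0 of comm is one). */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Step 2: broadcast within each node. */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}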

16
Process Manager
  • Enhanced support for third party process managers
  • PBS, Slurm, ...
  • Working on others
  • Replacement for existing process managers
  • Scalable to 10,000s of nodes
  • Fault-tolerant
  • Aware of topology

17
Other Work
  • Heterogeneous data representations
  • Different architectures use different data
    representations
  • E.g., big/little-endian, 32/64-bit, IEEE
    floats/non-IEEE floats, etc
  • Important for heterogeneous clusters and grids
  • Use existing datatype manipulation utilities
  • Fault-tolerance support
  • CIFTS fault-tolerance backplane
  • Fault detection and reporting

18
Atomic Operations Library
  • Lock-free algorithms use atomic assembly
    instructions
  • Assembly instructions are non-portable
  • Must be ported for each architecture and compiler
  • We're working on an atomic operations library
  • Implementations for various architectures and
    various compilers
  • Stand-alone library
  • Not all atomic operations are natively supported on all
    architectures
  • E.g., some have LL/SC but no SWAP
  • Such operations can be emulated using the provided operations (see
    the sketch below)
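
For example, an atomic swap can be emulated on top of compare-and-swap. The sketch below uses GCC's __sync builtin; the library's actual API and naming will differ.

/* Emulate an atomic swap using compare-and-swap: retry until the CAS
 * succeeds, then return the value that was replaced. */
static inline int atomic_swap_int(volatile int *ptr, int newval)
{
    int oldval;
    do {
        oldval = *ptr;
        /* retry if another thread changed *ptr between the read and the CAS */
    } while (!__sync_bool_compare_and_swap(ptr, oldval, newval));
    return oldval;
}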

19
Tools Included in MPICH2
  • MPE library for tracing MPI and other calls (basic usage is sketched
    below)
  • Scalable log file format (slog2)
  • Jumpshot tool for visualizing log files
  • Supports threads
  • Collchk library for checking that the application
    calls collective operations correctly
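
A minimal manual MPE logging sequence looks roughly like this, assuming the program is linked against the MPE logging library (see the MPE documentation for the exact build flags):

#include <mpi.h>
#include <mpe.h>

int main(int argc, char **argv)
{
    int ev_begin, ev_end;

    MPI_Init(&argc, &argv);
    MPE_Init_log();

    /* Define a custom state and log its begin/end events around a region. */
    ev_begin = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_begin, ev_end, "compute", "red");

    MPE_Log_event(ev_begin, 0, NULL);
    /* ... application work to be traced ... */
    MPE_Log_event(ev_end, 0, NULL);

    MPE_Finish_log("mylog");     /* writes a log file viewable with Jumpshot */
    MPI_Finalize();
    return 0;
}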

20
For more information
  • MPICH2 website
  • http://www.mcs.anl.gov/research/projects/mpich2
  • SVN repository
  • svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk mpich2
  • Developer pages
  • http://wiki.mcs.anl.gov/mpich2/index.php/Developer_Documentation
  • Mailing lists
  • mpich2-maint@mcs.anl.gov
  • mpich-discuss@mcs.anl.gov
  • Me
  • buntinas@mcs.anl.gov
  • http://www.mcs.anl.gov/~buntinas