1
MPICH2: A High-Performance and Widely Portable Open-Source MPI Implementation
  • Darius Buntinas
  • Argonne National Laboratory

2
Overview
  • MPICH2
  • High-performance
  • Open-source
  • Widely portable
  • MPICH2-based implementations
  • IBM for BG/L and BG/P
  • Cray for XT3/4
  • Intel
  • Microsoft
  • SiCortex
  • Myricom
  • Ohio State

3
Outline
  • Architectural overview
  • Nemesis: a new communication subsystem
  • New features and optimizations
  • Intranode communication
  • Optimizing non-contiguous messages
  • Optimizing large messages
  • Current work in progress
  • Optimizations
  • Multi-threaded environments
  • Process manager
  • Other optimizations
  • Libraries and tools

4
Traditional MPICH2 Developer APIs
  • Two APIs for porting MPICH2 to new communication
    architectures
  • ADI3
  • CH3
  • ADI3: implement a new device
  • Richer interface
  • 60 functions
  • More work to port
  • More flexibility
  • CH3: implement a new CH3 channel
  • Simpler interface
  • 15 functions
  • Easier to port
  • Less flexibility

5
[Architecture diagram: an application calls the MPI interface (PMPI, with MPE/Jumpshot for tracing and ROMIO for MPI-IO); the MPI layer sits on the ADI3 interface, implemented by devices (CH3, MX, BG, Cray); ROMIO's ADIO interface targets file systems (PVFS, GPFS, XFS, ...); the CH3 device exposes the CH3 interface to channels (Sock, Nemesis, SCTP, SHM, SSHM, ...); Nemesis exposes a netmod interface to network modules (TCP, IB/iWARP, PSM, MX, GM); the PMI interface connects to process managers (MPD, SMPD, Gforker)]
  • Support for high-speed networks
  • 10-Gigabit Ethernet (iWARP), QLogic PSM, InfiniBand, Myrinet (MX and GM)
  • Supports proprietary platforms
  • BlueGene/L, BlueGene/P, SiCortex, Cray
  • Distributed with the ROMIO MPI-IO library
  • Profiling and visualization tools (MPE, Jumpshot)

6
Nemesis
  • Nemesis is a new CH3 channel for MPICH2
  • Shared-memory for intranode communication
  • Lock-free queues
  • Scalability
  • Improved intranode performance
  • Network modules for internode communication
  • New interface
  • New developer API: the Nemesis netmod interface
  • Simpler interface than ADI3
  • More flexible than CH3

7
Nemesis Lock-Free Queues
  • Atomic memory operations (a CAS-based enqueue is sketched below)
  • Scalable
  • One recv queue per process
  • Optimized to reduce cache misses
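
A minimal sketch of the enqueue side of such a lock-free queue, using GCC's __sync compare-and-swap builtin. The names (lf_queue, lf_elem, lf_enqueue) are hypothetical and this is not MPICH2's internal code; the dequeue side, memory barriers, and the offset-based addressing a real shared-memory queue needs are omitted.

#include <stddef.h>

typedef struct lf_elem {
    struct lf_elem *next;
    char payload[64];                /* small message carried in the cell */
} lf_elem;

typedef struct {
    lf_elem *head;                   /* read by the single receiving process */
    lf_elem *tail;                   /* appended to by any sender via CAS */
} lf_queue;

/* Append an element by atomically swinging the tail pointer. */
static void lf_enqueue(lf_queue *q, lf_elem *e)
{
    lf_elem *old_tail;

    e->next = NULL;
    do {
        old_tail = q->tail;
    } while (!__sync_bool_compare_and_swap(&q->tail, old_tail, e));

    if (old_tail != NULL)
        old_tail->next = e;          /* link the predecessor to the new cell */
    else
        q->head = e;                 /* the queue was empty */
}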

8
Nemesis Network Modules
  • Improved interface for network modules
  • Allows optimized handling of non-contiguous data
  • Allows optimized transfer of large data
  • Optimized small contiguous-message path
  • < 2.5 µs over QLogic PSM
  • Future work
  • Multiple network modules
  • E.g., Myrinet for intra-cluster and TCP for
    inter-cluster
  • Dynamically loadable

9
Optimized Non-contiguous Messages
  • Issues with non-contiguous data
  • Representation
  • Manipulation
  • Packing, generating other representations (e.g., I/O vectors), etc.
  • Dataloops: MPICH2's optimized internal datatype representation
  • Efficiently describes non-contiguous data (an MPI-level example
    follows below)
  • Utilities to efficiently manipulate non-contiguous data
  • The dataloop is passed to the network module
  • Previously, an I/O vector was generated and then passed
  • The netmod implementation manipulates the dataloop itself, e.g.:
  • TCP generates an I/O vector
  • IB and PSM pack data into a send buffer
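
Dataloops are internal to MPICH2. At the application level, the same kind of non-contiguous layout is expressed with MPI derived datatypes, which MPICH2 then converts to dataloops. The following standard-MPI example sends every other element of an array without packing it by hand.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, buf[100], recvbuf[50];
    MPI_Datatype every_other;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 50 blocks of 1 int with a stride of 2 ints: elements 0, 2, ..., 98 */
    MPI_Type_vector(50, 1, 2, MPI_INT, &every_other);
    MPI_Type_commit(&every_other);

    if (rank == 0) {
        for (i = 0; i < 100; i++)
            buf[i] = i;
        MPI_Send(buf, 1, every_other, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* the 50 strided elements arrive contiguously */
        MPI_Recv(recvbuf, 50, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("recvbuf[1] = %d\n", recvbuf[1]);   /* prints 2 */
    }

    MPI_Type_free(&every_other);
    MPI_Finalize();
    return 0;
}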

10
Optimized Large Message Transfer Using Rendezvous
  • MPICH2 uses a rendezvous protocol to transfer large messages (the
    RTS/CTS/DATA handshake is sketched below)
  • In the original implementation, the channel was oblivious to the
    rendezvous protocol
  • CH3 sent RTS, CTS, DATA
  • Shared memory: large messages were sent through the queue
  • Netmod: the netmod would perform its own rendezvous
  • Shared memory: queues may not be the most efficient mechanism for
    transferring large data
  • E.g., network RDMA, an inter-process copy mechanism, or a copy buffer
  • Netmod: redundant rendezvous
  • Developed LMT interface to support various
    mechanisms
  • Sender transfers data (put)
  • Receiver transfers data (get)
  • Both sender and receiver participate in data
    transfer
  • Modified CH3 to use LMT
  • Works with rendezvous protocol
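
The handshake itself can be illustrated with ordinary point-to-point calls. The sketch below is a user-level analogue of the RTS/CTS/DATA exchange, not MPICH2's internal LMT code; the tags and the 1 MiB message size are arbitrary.

#include <mpi.h>
#include <stdlib.h>

#define TAG_RTS  1
#define TAG_CTS  2
#define TAG_DATA 3

int main(int argc, char **argv)
{
    int rank, nbytes = 1 << 20;          /* a "large" 1 MiB message */
    char *buf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* RTS: announce the message size, then wait for clear-to-send */
        MPI_Send(&nbytes, 1, MPI_INT, 1, TAG_RTS, MPI_COMM_WORLD);
        MPI_Recv(NULL, 0, MPI_BYTE, 1, TAG_CTS, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, nbytes, MPI_BYTE, 1, TAG_DATA, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int incoming;
        MPI_Recv(&incoming, 1, MPI_INT, 0, TAG_RTS, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* CTS: the receive buffer is ready, so tell the sender to go ahead */
        MPI_Send(NULL, 0, MPI_BYTE, 0, TAG_CTS, MPI_COMM_WORLD);
        MPI_Recv(buf, incoming, MPI_BYTE, 0, TAG_DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}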

11
Optimization: LMT for Intranode Communication
  • For intranode communication, the LMT copies through a buffer in
    shared memory
  • Sender allocates shared memory region
  • Sends buffer ID to receiver in RTS packet
  • Receiver attaches to memory region
  • Both sender and receiver participate in transfer
  • Use double-buffering (the copy loop is sketched below)

[Diagram: sender and receiver alternately filling and draining the two shared-memory copy buffers]
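
A simplified version of the double-buffered copy might look like the following. The structure and function names are hypothetical, the synchronization uses busy-waiting on volatile flags without memory barriers, and the shared-memory setup (e.g., via shm_open/mmap) is omitted; MPICH2's actual implementation differs.

#include <string.h>
#include <stddef.h>

#define COPY_BUF_LEN (32 * 1024)

typedef struct {
    volatile size_t len[2];        /* bytes in each buffer; 0 means "empty" */
    char buf[2][COPY_BUF_LEN];     /* the two copy buffers */
} copy_buf;

/* Sender: fill buffers 0 and 1 alternately while the receiver drains them. */
void lmt_shm_send(copy_buf *cb, const char *src, size_t nbytes)
{
    int i = 0;
    while (nbytes > 0) {
        size_t n = nbytes < COPY_BUF_LEN ? nbytes : COPY_BUF_LEN;
        while (cb->len[i] != 0)        /* wait until the receiver emptied it */
            ;
        memcpy(cb->buf[i], src, n);
        cb->len[i] = n;                /* publish the chunk */
        src += n;
        nbytes -= n;
        i = 1 - i;                     /* switch to the other buffer */
    }
}

/* Receiver: drain buffers 0 and 1 alternately while the sender fills them. */
void lmt_shm_recv(copy_buf *cb, char *dst, size_t nbytes)
{
    int i = 0;
    while (nbytes > 0) {
        size_t n;
        while ((n = cb->len[i]) == 0)  /* wait for the sender's chunk */
            ;
        memcpy(dst, cb->buf[i], n);
        cb->len[i] = 0;                /* hand the buffer back */
        dst += n;
        nbytes -= n;
        i = 1 - i;
    }
}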
12
Current Work In Progress
  • Optimizations
  • Multi-threaded environments
  • Process manager
  • Other work
  • Atomic Operations Library

13
Current Optimization Work
  • Handle the common case fast: eager contiguous messages
  • Identify this case early in the operation
  • Call the netmod's send_eager_contig() function directly
  • Bypass the receive queue
  • Currently: check the unexpected queue, post on the posted queue,
    check the network
  • Optimized: check the unexpected queue, check the network
  • Reduced instruction count by 48
  • Eliminate function calls
  • Collapse layers where possible
  • Merge Nemesis with CH3
  • Move Nemesis functionality to CH3
  • CH3 shared memory support
  • New CH3 channel/netmod interface
  • Cache-aware placement of fields in structures

14
Fine Grained Threading
  • MPICH2 supports multi-threaded applications
  • MPI_THREAD_MULTIPLE (requested as in the example below)
  • Currently, thread safety is implemented with a
    single lock
  • Lock is acquired on entering an MPI function
  • And released on exit
  • Also released when making blocking communication
    system calls
  • Limits concurrency in communication
  • Only one thread can be in the progress engine at
    one time
  • New architectures have multiple DMA engines for
    communication
  • These can work independently of each other
  • Concurrency is needed in the progress engine for
    maximum performance
  • Even without independent network hardware
  • Internal concurrency can improve performance
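
For reference, an application requests full thread support at initialization and checks what level the library actually granted; this is standard MPI:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for MPI_THREAD_MULTIPLE; the library reports what it provides. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available "
                        "(provided = %d)\n", provided);

    /* ... any thread may now make MPI calls, subject to the granted level ... */

    MPI_Finalize();
    return 0;
}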

15
Multicore-Aware Collectives
  • Intra-node communication is much faster than
    inter-node
  • Take advantage of this in collective algorithms
  • E.g., Broadcast
  • Send to one process per node; that process broadcasts to the other
    processes on that node (sketched below)
  • A step further: collectives over shared memory
  • E.g., Broadcast
  • Within a node, process writes data to shared
    memory region
  • Other processes read data
  • Issues
  • Memory traffic, cache misses, etc.
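
A minimal sketch of the two-level broadcast idea, using MPI-3's MPI_Comm_split_type to build the per-node communicator. MPI-3 is used here only for brevity; an implementation inside the library would not be written this way.

#include <mpi.h>

/* Broadcast buf from rank 0 of comm: first across one leader per node,
 * then within each node.  Assumes the root is rank 0 of comm. */
void two_level_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Group the processes that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Local rank 0 on each node becomes that node's leader. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Step 1: broadcast across the node leaders (rank 0 of comm is one). */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Step 2: broadcast within each node. */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}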

16
Process Manager
  • Enhanced support for third party process managers
  • PBS, Slurm, ...
  • Working on others
  • Replacement for existing process managers
  • Scalable to 10,000s of nodes
  • Fault-tolerant
  • Aware of topology

17
Other Work
  • Heterogeneous data representations
  • Different architectures use different data
    representations
  • E.g., big/little-endian, 32/64-bit, IEEE
    floats/non-IEEE floats, etc
  • Important for heterogeneous clusters and grids
  • Use existing datatype manipulation utilities
  • Fault-tolerance support
  • CIFTS fault-tolerance backplane
  • Fault detection and reporting

18
Atomic Operations Library
  • Lock-free algorithms use atomic assembly
    instructions
  • Assembly instructions are non-portable
  • Must be ported for each architecture and compiler
  • We're working on an atomic operations library
  • Implementations for various architectures and
    various compilers
  • Stand-alone library
  • Not all atomic operations are natively supported on all
    architectures
  • E.g., some have LL/SC but no SWAP
  • Such operations can be emulated using the provided operations (see
    the sketch below)
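
For example, an atomic swap can be emulated on top of compare-and-swap. The sketch below uses GCC's __sync builtin; the library's actual API and naming will differ.

/* Emulate an atomic swap using compare-and-swap: retry until the CAS
 * succeeds, then return the value that was replaced. */
static inline int atomic_swap_int(volatile int *ptr, int newval)
{
    int oldval;
    do {
        oldval = *ptr;
        /* retry if another thread changed *ptr between the read and the CAS */
    } while (!__sync_bool_compare_and_swap(ptr, oldval, newval));
    return oldval;
}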

19
Tools Included in MPICH2
  • MPE library for tracing MPI and other calls (basic usage is sketched
    below)
  • Scalable log file format (slog2)
  • Jumpshot tool for visualizing log files
  • Supports threads
  • Collchk library for checking that the application
    calls collective operations correctly
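
A minimal manual MPE logging sequence looks roughly like this, assuming the program is linked against the MPE logging library (see the MPE documentation for the exact build flags):

#include <mpi.h>
#include <mpe.h>

int main(int argc, char **argv)
{
    int ev_begin, ev_end;

    MPI_Init(&argc, &argv);
    MPE_Init_log();

    /* Define a custom state and log its begin/end events around a region. */
    ev_begin = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_begin, ev_end, "compute", "red");

    MPE_Log_event(ev_begin, 0, NULL);
    /* ... application work to be traced ... */
    MPE_Log_event(ev_end, 0, NULL);

    MPE_Finish_log("mylog");     /* writes a log file viewable with Jumpshot */
    MPI_Finalize();
    return 0;
}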

20
For more information
  • MPICH2 website
  • http://www.mcs.anl.gov/research/projects/mpich2
  • SVN repository
  • svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk mpich2
  • Developer pages
  • http://wiki.mcs.anl.gov/mpich2/index.php/Developer_Documentation
  • Mailing lists
  • mpich2-maint@mcs.anl.gov
  • mpich-discuss@mcs.anl.gov
  • Me
  • buntinas@mcs.anl.gov
  • http://www.mcs.anl.gov/~buntinas