1
Efficient Asynchronous Message Passing via SCI
with Zero-Copying
SCI Europe 2001, Trinity College Dublin
  • Joachim Worringen, Friedrich Seifert, Thomas
    Bemmerl

2
Agenda
  • What is Zero-Copying? What is it good for?
  • Zero-Copying with SCI
  • Support through the SMI library (Shared Memory
    Interface)
  • Zero-Copy Protocols in SCI-MPICH
  • Memory Allocation Setups
  • Performance Optimizations
  • Performance Evaluation
  • Point-to-Point
  • Application Kernel
  • Asynchronous Communication

3
Zero-Copying
  • Transfer of data between two user-level
    accessible memory buffers with N explicit
    intermediate copies: N-way copying
  • No intermediate copy: zero-copying
  • Determines effective bandwidth and efficiency
    (see the simple model below)
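A simple back-of-the-envelope model (not on the slide; it assumes the N intermediate copies run sequentially at the same copy bandwidth B_copy as the final transfer) shows why the number of copies matters:

  B_{eff} = \frac{B_{copy}}{N+1}, \qquad
  \eta = \frac{B_{eff}}{B_{copy}} = \frac{1}{N+1}

Zero-copying (N = 0) delivers the full copy bandwidth, while already a single intermediate copy halves the effective bandwidth.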

4
Efficiency Comparison
5
Zero-Copying with SCI
  • SCI does zero-copy by nature.
  • But SCI via the I/O bus is limited:
  • No SMP-style shared memory
  • Specially allocated memory regions were required
  • No general zero-copy possible
  • New possibility:
  • Using user-allocated buffers for SCI
    communication
  • Allows general zero-copy!
  • Connection setup is always required.

6
SMI Library: Shared Memory Interface
  • High-level SCI support library for parallel
    applications or libraries
  • Application startup
  • Synchronization & basic communication
  • Shared-memory setup:
  • Collective regions
  • Point-to-point regions
  • Individual regions
  • Dynamic memory management
  • Data transfer

7
Data Moving (I)
  • Shared-memory paradigm:
  • Import remote memory into the local address
    space
  • Perform memcpy() or possibly DMA
  • SMI support (see the sketch after this list):
  • Region type REMOTE
  • Synchronous (PIO): SMI_Memcpy()
  • Asynchronous (DMA if possible): SMI_Imemcpy()
    followed by SMI_Mem_wait()
  • Problems:
  • High mapping overhead
  • Resource usage (ATT entries on the PCI-SCI
    adapter)
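The usage pattern is sketched below; the call names SMI_Memcpy(), SMI_Imemcpy() and SMI_Mem_wait() come from the slide, but the parameter lists are assumptions, not the documented SMI prototypes.

  #include <stddef.h>

  /* Assumed prototypes -- the names come from the slide, the signatures do not. */
  extern void SMI_Memcpy(void *dst, void *src, size_t len, int flags);
  extern void SMI_Imemcpy(void *dst, void *src, size_t len, int flags, int *handle);
  extern void SMI_Mem_wait(int handle);
  extern void do_computation(void);            /* placeholder for useful work */

  /* Copy nbytes out of a mapped REMOTE region, either synchronously (PIO)
     or with the copy overlapped by computation. */
  void copy_from_remote(void *local_buf, void *remote_ptr, size_t nbytes, int async)
  {
      if (!async) {
          SMI_Memcpy(local_buf, remote_ptr, nbytes, 0);         /* PIO copy          */
      } else {
          int handle;
          SMI_Imemcpy(local_buf, remote_ptr, nbytes, 0, &handle); /* DMA if possible */
          do_computation();                                     /* overlap           */
          SMI_Mem_wait(handle);                                 /* complete transfer */
      }
  }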

8
Mapping Overhead
  • Not suitable for dynamic memory setups!

9
Data Moving (II)
  • Connection paradigm:
  • Connect to the remote memory location
  • No representation in the local address space
  • Only DMA is possible
  • SMI support (sketched below):
  • Region type RDMA
  • Synchronous / asynchronous DMA: SMI_Put/SMI_Iput,
    SMI_Get/SMI_Iget, SMI_Memwait
  • Problems:
  • Alignment restrictions
  • Source needs to be pinned down
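The connection paradigm in the same hedged style; again only the call names come from the slide, the parameter lists are invented for illustration.

  #include <stddef.h>

  /* Assumed prototypes -- names from the slide, signatures invented. */
  extern void SMI_Iput(int region, size_t offset, void *src, size_t len, int *handle);
  extern void SMI_Memwait(int handle);
  extern void do_computation(void);

  /* Write nbytes into a connected RDMA region; only DMA is possible, since
     the region has no representation in the local address space. */
  void put_to_remote(int rdma_region, size_t offset, void *src, size_t nbytes)
  {
      int handle;
      SMI_Iput(rdma_region, offset, src, nbytes, &handle);  /* asynchronous DMA  */
      do_computation();                                     /* overlap           */
      SMI_Memwait(handle);                                  /* complete transfer */
  }
  /* SMI_Put() / SMI_Get() are the blocking (synchronous) counterparts. */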

10
Setup Acceleration
  • Memory buffer setup costs time!
  • Reduce the number of setup operations to
    increase performance
  • Desirable: only one setup operation per buffer
  • Problem: limited resources
  • Solution: caching of SCI segment states with
    lazy release (sketched after this list)
  • Leave buffers registered, remote segments
    connected or mapped
  • Release unneeded resources if the setup of a new
    resource fails
  • Different replacement strategies possible: LRU,
    LFU, best-fit, random, immediate
  • Attention: remote segment deallocation!
  • Callback on the connection event to release the
    local connection
  • MPI persistent communication operations:
  • Pre-register the user buffer with a higher hold
    priority
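A self-contained sketch of such a lazy-release cache with LRU replacement; every identifier here is invented for illustration (and it assumes the cache has at least as many slots as buffers that can be registered at once). It is not the SCI-MPICH implementation.

  #include <stddef.h>

  /* Hypothetical segment-state cache: registrations stay alive after use
     ("lazy release") and are only evicted, LRU first, when registering a
     new buffer fails, e.g. because the ATT entries on the PCI-SCI adapter
     are exhausted. */
  #define CACHE_SLOTS 8       /* assumed >= number of registrable buffers */

  typedef struct {
      void         *buf;        /* registered user buffer (NULL = free slot) */
      size_t        len;
      int           in_use;     /* referenced by an ongoing transfer?        */
      unsigned long last_used;  /* logical clock for LRU                     */
  } seg_t;

  static seg_t cache[CACHE_SLOTS];
  static unsigned long now;

  /* Stand-ins for the real driver calls (assumed, not the SMI API). */
  extern int  sci_register(void *buf, size_t len);   /* returns 0 on failure */
  extern void sci_unregister(void *buf);

  seg_t *acquire_segment(void *buf, size_t len)
  {
      for (int i = 0; i < CACHE_SLOTS; i++)          /* 1. cache hit?        */
          if (cache[i].buf == buf && cache[i].len >= len) {
              cache[i].last_used = ++now;
              return &cache[i];
          }

      while (!sci_register(buf, len)) {              /* 2. register; on      */
          int v = -1;                                /*    failure evict LRU */
          for (int i = 0; i < CACHE_SLOTS; i++)
              if (cache[i].buf && !cache[i].in_use &&
                  (v < 0 || cache[i].last_used < cache[v].last_used))
                  v = i;
          if (v < 0)
              return NULL;                           /* nothing idle to evict */
          sci_unregister(cache[v].buf);
          cache[v].buf = NULL;
      }

      for (int i = 0; i < CACHE_SLOTS; i++)          /* 3. record new entry  */
          if (cache[i].buf == NULL) {
              cache[i] = (seg_t){ buf, len, 1, ++now };
              return &cache[i];
          }
      return NULL;   /* unreachable under the slot-count assumption above */
  }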

11
Memory Allocation
  • Allocate "good" memory:
  • MPI_Alloc_mem() / MPI_Free_mem()
  • Part of MPI-2 (mostly for one-sided operations)
  • SCI-MPICH defines attributes (example below):
  • type: shared, private or default
  • Shared memory performs best.
  • alignment: none, specified or default
  • Non-shared memory should be page-aligned
  • "Good" memory should only be enforced for
    communication buffers!
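A short example of requesting "good" memory: MPI_Alloc_mem(), MPI_Free_mem() and the info object are standard MPI-2, but the exact info keys and values understood by SCI-MPICH are assumed from the attributes listed above.

  #include <mpi.h>

  /* Allocate a "good" communication buffer of len bytes via SCI-MPICH.
     The keys "type" and "alignment" mirror the attributes above; their
     exact spelling in SCI-MPICH is an assumption. */
  static void *alloc_good_buffer(MPI_Aint len)
  {
      void    *buf = NULL;
      MPI_Info info;

      MPI_Info_create(&info);
      MPI_Info_set(info, "type", "shared");       /* shared memory performs best  */
      MPI_Info_set(info, "alignment", "default"); /* page-align non-shared memory */
      MPI_Alloc_mem(len, info, &buf);
      MPI_Info_free(&info);
      return buf;            /* release later with MPI_Free_mem(buf) */
  }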

12
Zero-Copy Protocols
  • Applicable to the handshake-based rendezvous
    protocol
  • Requirements:
  • registered user-allocated buffers, or
  • regular SCI segments
    ("good" memory via MPI_Alloc_mem())
  • The state of a memory range must be known
  • SMI provides query functionality
  • Registering / connecting / mapping may fail
  • Several different setups are possible
  • A fallback mechanism is required (sketched below)
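A sketch of that fallback chain; the helper functions and constants are hypothetical, only the decision structure follows the bullets above.

  #include <stddef.h>

  /* Hypothetical protocol selection for the rendezvous path. */
  typedef enum { PROTO_ZEROCOPY, PROTO_COPY } proto_t;

  extern int is_sci_segment(const void *buf, size_t len);      /* assumed query */
  extern int try_register_user_buffer(void *buf, size_t len);  /* may fail      */

  proto_t choose_protocol(void *buf, size_t len)
  {
      if (is_sci_segment(buf, len))            /* "good" memory via MPI_Alloc_mem() */
          return PROTO_ZEROCOPY;
      if (try_register_user_buffer(buf, len))  /* registering may fail              */
          return PROTO_ZEROCOPY;
      return PROTO_COPY;                       /* always-working copying fallback   */
  }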

13
Asynchronous Rendez-Vous
[Figure: timeline of the asynchronous rendezvous protocol, showing the
control messages, the posted Irecv, the data transfer, and the Wait calls
on both sides.]
14
Test Setup
  • Systems used for performance evaluation:
  • Pentium-III @ 800 MHz
  • 512 MB RAM @ 133 MHz
  • 64-bit / 66 MHz PCI (ServerWorks ServerSet III
    LE)
  • Dolphin D330 (single ring topology)
  • Linux 2.4.4-bigphysarea
  • modified SCI driver (user memory for SCI)

15
Bandwidth Comparison
16
Application Kernel: NPB IS
  • Parallel bucket sort
  • Keys are integer numbers
  • Dominant communication: MPI_Alltoallv for the
    distributed key array

17
MPI_Alltoallv Performance
  • MPI_Alltoallv is translated into point-to-point
    operations: MPI_Isend / MPI_Irecv / MPI_Waitall
    (see the sketch below)
  • Improved performance with asynchronous DMA
    operations
  • Application speedup can be deduced from this
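The point-to-point decomposition looks roughly like the following; this is a simplified sketch of the pattern, not SCI-MPICH's actual source.

  #include <mpi.h>
  #include <stdlib.h>

  /* Exchange variable-sized blocks with every rank by posting all receives
     and sends as nonblocking operations, then completing them in
     MPI_Waitall.  Asynchronous DMA lets these transfers overlap. */
  void alltoallv_p2p(void *sendbuf, const int *scounts, const int *sdispls,
                     void *recvbuf, const int *rcounts, const int *rdispls,
                     MPI_Datatype type, MPI_Comm comm)
  {
      int size, tsize;
      MPI_Comm_size(comm, &size);
      MPI_Type_size(type, &tsize);          /* assumes a contiguous datatype */

      MPI_Request *req = malloc(2 * size * sizeof(MPI_Request));
      for (int i = 0; i < size; i++)        /* post all receives first       */
          MPI_Irecv((char *)recvbuf + (size_t)rdispls[i] * tsize, rcounts[i],
                    type, i, 0, comm, &req[i]);
      for (int i = 0; i < size; i++)        /* then all sends                */
          MPI_Isend((char *)sendbuf + (size_t)sdispls[i] * tsize, scounts[i],
                    type, i, 0, comm, &req[size + i]);
      MPI_Waitall(2 * size, req, MPI_STATUSES_IGNORE);
      free(req);
  }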

18
Asynchronous Communication
  • Goal: overlap computation and communication
  • How to quantify the efficiency for this?
  • Typical overlapping effect

19
Saturation and Efficiency (I)
  • Two parameters are required:
  • Saturation s:
  • Duration of the computation period required to
    make the total time (communication + computation)
    increase
  • Efficiency e:
  • Relation of the overhead to the message latency

20
Saturation and Efficiency (II)
[Figure: timing diagram relating the total time t_total, the computation
time t_busy, the non-overlapped remainder t_total - t_busy, and the
asynchronous / synchronous message latencies t_msg_a and t_msg_s; a
possible formalization follows.]
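One plausible formalization of the two quantities from this diagram (the slides give only the picture, so these exact definitions are an assumption): with t_busy the computation time, t_total the measured time of the overlapped run, and t_msg_s / t_msg_a the synchronous / asynchronous message latencies,

  e = 1 - \frac{t_{total} - t_{busy}}{t_{msg\_s}}, \qquad
  s = \min\{\, t_{busy} \;:\; t_{total}(t_{busy}) > t_{msg\_a} \,\}

With this reading, e = 1 means the transfer is completely hidden behind the computation (t_total = t_busy), e = 0 means no overlap at all, and s is the shortest computation period for which the total time starts to exceed the pure asynchronous message latency.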
21
Experimental Setup: Overlap
  • Micro-benchmark to quantify overlapping
    (a runnable sketch follows below):

    latency = MPI_Wtime()
    if (sender)
        MPI_Isend(msg, msgsize)
        while (elapsed_time < spinning_duration)
            spin (with multiple threads)
        MPI_Wait()
    else
        MPI_Recv()
    latency = MPI_Wtime() - latency
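A runnable version of this micro-benchmark could look like the following sketch, assuming rank 0 is the sender and spin_for() is a hypothetical helper that dispatches to the FIXED/DAXPY spinning variants of the next slide.

  #include <mpi.h>
  #include <stdlib.h>

  extern void spin_for(double seconds, size_t msgsize);  /* FIXED or DAXPY spinning */

  /* Measure the latency of one message while the sender keeps the CPU busy
     for spin_sec between MPI_Isend and MPI_Wait. */
  double overlap_latency(size_t msgsize, double spin_sec, MPI_Comm comm)
  {
      int rank;
      char *msg = malloc(msgsize);
      MPI_Comm_rank(comm, &rank);

      double latency = MPI_Wtime();
      if (rank == 0) {                       /* sender */
          MPI_Request req;
          MPI_Isend(msg, (int)msgsize, MPI_BYTE, 1, 0, comm, &req);
          spin_for(spin_sec, msgsize);       /* computation overlapped with transfer */
          MPI_Wait(&req, MPI_STATUS_IGNORE);
      } else {                               /* receiver */
          MPI_Recv(msg, (int)msgsize, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
      }
      latency = MPI_Wtime() - latency;

      free(msg);
      return latency;
  }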

22
Experimental Setup: Spinning
  • Different ways of keeping the CPU busy
    (sketched below):
  • FIXED: Spin on a single variable for a given
    amount of CPU time
  • No memory stress
  • DAXPY: Perform a given number of DAXPY operations
    on vectors (vector sizes of x, y equivalent to
    the message size)
  • Stresses the memory system
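A minimal sketch of the two spinning variants that the hypothetical spin_for() helper above could dispatch to.

  #include <stddef.h>
  #include <mpi.h>   /* for MPI_Wtime() */

  /* FIXED: burn CPU time on a single variable -- no memory stress. */
  static void spin_fixed(double seconds)
  {
      volatile unsigned long x = 0;
      double end = MPI_Wtime() + seconds;
      while (MPI_Wtime() < end)
          x++;
  }

  /* DAXPY: y[i] += a * x[i] over vectors sized like the message --
     this stresses the memory system while the DMA transfer runs. */
  static void spin_daxpy(double *y, const double *x, double a, size_t n, int reps)
  {
      for (int r = 0; r < reps; r++)
          for (size_t i = 0; i < n; i++)
              y[i] += a * x[i];
  }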

23
DAXPY 64kiB Message
24
DAXPY 256kiB Message
25
FIXED 64kiB Message
26
Asynchronous Performance
  • Saturation and Efficiency derived from
    experiments

27
Summary & Outlook
  • Efficient utilization of new SCI driver
    functionality for MPI communication
  • Max. bandwidth of 230 MiB/s (regular segments) /
    190 MiB/s (user memory)
  • Connection overhead hidden by segment caching
  • Asynchronous communication pays off much
    earlier than before
  • New (?) quantification scheme for efficiency of
    asynchronous communication
  • Flexible MPI memory allocation supports the MPI
    application writer
  • Connection-oriented DMA transfers reduce resource
    utilization
  • DMA alignment problems
  • Segment callback required for improved connection
    caching