1
Efficient Asynchronous Message Passing via SCI
with Zero-Copying
SCI Europe 2001, Trinity College Dublin
  • Joachim Worringen, Friedrich Seifert, Thomas
    Bemmerl

2
Agenda
  • What is Zero-Copying? What is it good for?
  • Zero-Copying with SCI
  • Support through the SMI library (Shared Memory
    Interface)
  • Zero-Copy Protocols in SCI-MPICH
  • Memory Allocation Setups
  • Performance Optimizations
  • Performance Evaluation
  • Point-to-Point
  • Application Kernel
  • Asynchronous Communication

3
Zero-Copying
  • Transfer of data between two user-level
    accessible memory buffers with N explicit
    intermediate copies: N-way copying
  • No intermediate copy: zero-copying
  • Determines effective bandwidth and efficiency
    (see the simple model below)
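A simple back-of-the-envelope model (not on the slide; it assumes the N intermediate copies run sequentially at the same copy bandwidth B_copy as the final transfer) shows why the number of copies matters:

  B_{eff} = \frac{B_{copy}}{N+1}, \qquad
  \eta = \frac{B_{eff}}{B_{copy}} = \frac{1}{N+1}

Zero-copying (N = 0) delivers the full copy bandwidth, while already a single intermediate copy halves the effective bandwidth.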

4
Efficiency Comparison
5
Zero-Copying with SCI
  • SCI does zero-copy by nature.
  • But SCI via the I/O bus is limited:
  • No SMP-style shared memory
  • Specially allocated memory regions were required
  • No general zero-copy possible
  • New possibility:
  • Using user-allocated buffers for SCI
    communication
  • Allows general zero-copy!
  • Connection setup is always required.

6
SMI Library: Shared Memory Interface
  • High-level SCI support library for parallel
    applications or libraries
  • Application startup
  • Synchronization & basic communication
  • Shared-memory setup:
  • Collective regions
  • Point-to-point regions
  • Individual regions
  • Dynamic memory management
  • Data transfer

7
Data Moving (I)
  • Shared-memory paradigm:
  • Import remote memory into the local address
    space
  • Perform memcpy() or possibly DMA
  • SMI support (see the sketch after this list):
  • Region type REMOTE
  • Synchronous (PIO): SMI_Memcpy()
  • Asynchronous (DMA if possible): SMI_Imemcpy()
    followed by SMI_Mem_wait()
  • Problems:
  • High mapping overhead
  • Resource usage (ATT entries on the PCI-SCI
    adapter)
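The usage pattern is sketched below; the call names SMI_Memcpy(), SMI_Imemcpy() and SMI_Mem_wait() come from the slide, but the parameter lists are assumptions, not the documented SMI prototypes.

  #include <stddef.h>

  /* Assumed prototypes -- the names come from the slide, the signatures do not. */
  extern void SMI_Memcpy(void *dst, void *src, size_t len, int flags);
  extern void SMI_Imemcpy(void *dst, void *src, size_t len, int flags, int *handle);
  extern void SMI_Mem_wait(int handle);
  extern void do_computation(void);            /* placeholder for useful work */

  /* Copy nbytes out of a mapped REMOTE region, either synchronously (PIO)
     or with the copy overlapped by computation. */
  void copy_from_remote(void *local_buf, void *remote_ptr, size_t nbytes, int async)
  {
      if (!async) {
          SMI_Memcpy(local_buf, remote_ptr, nbytes, 0);         /* PIO copy          */
      } else {
          int handle;
          SMI_Imemcpy(local_buf, remote_ptr, nbytes, 0, &handle); /* DMA if possible */
          do_computation();                                     /* overlap           */
          SMI_Mem_wait(handle);                                 /* complete transfer */
      }
  }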

8
Mapping Overhead
  • Not suitable for dynamic memory setups!

9
Data Moving (II)
  • Connection paradigm:
  • Connect to the remote memory location
  • No representation in the local address space
  • Only DMA is possible
  • SMI support (sketched below):
  • Region type RDMA
  • Synchronous / asynchronous DMA: SMI_Put/SMI_Iput,
    SMI_Get/SMI_Iget, SMI_Memwait
  • Problems:
  • Alignment restrictions
  • Source needs to be pinned down
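The connection paradigm in the same hedged style; again only the call names come from the slide, the parameter lists are invented for illustration.

  #include <stddef.h>

  /* Assumed prototypes -- names from the slide, signatures invented. */
  extern void SMI_Iput(int region, size_t offset, void *src, size_t len, int *handle);
  extern void SMI_Memwait(int handle);
  extern void do_computation(void);

  /* Write nbytes into a connected RDMA region; only DMA is possible, since
     the region has no representation in the local address space. */
  void put_to_remote(int rdma_region, size_t offset, void *src, size_t nbytes)
  {
      int handle;
      SMI_Iput(rdma_region, offset, src, nbytes, &handle);  /* asynchronous DMA  */
      do_computation();                                     /* overlap           */
      SMI_Memwait(handle);                                  /* complete transfer */
  }
  /* SMI_Put() / SMI_Get() are the blocking (synchronous) counterparts. */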

10
Setup Acceleration
  • Memory buffer setup costs time!
  • Reduce the number of setup operations to
    increase performance
  • Desirable: only one setup operation per buffer
  • Problem: limited resources
  • Solution: caching of SCI segment states with
    lazy release (sketched after this list)
  • Leave buffers registered, remote segments
    connected or mapped
  • Release unneeded resources if the setup of a new
    resource fails
  • Different replacement strategies possible: LRU,
    LFU, best-fit, random, immediate
  • Attention: remote segment deallocation!
  • Callback on the connection event to release the
    local connection
  • MPI persistent communication operations:
  • Pre-register the user buffer with a higher hold
    priority
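A self-contained sketch of such a lazy-release cache with LRU replacement; every identifier here is invented for illustration (and it assumes the cache has at least as many slots as buffers that can be registered at once). It is not the SCI-MPICH implementation.

  #include <stddef.h>

  /* Hypothetical segment-state cache: registrations stay alive after use
     ("lazy release") and are only evicted, LRU first, when registering a
     new buffer fails, e.g. because the ATT entries on the PCI-SCI adapter
     are exhausted. */
  #define CACHE_SLOTS 8       /* assumed >= number of registrable buffers */

  typedef struct {
      void         *buf;        /* registered user buffer (NULL = free slot) */
      size_t        len;
      int           in_use;     /* referenced by an ongoing transfer?        */
      unsigned long last_used;  /* logical clock for LRU                     */
  } seg_t;

  static seg_t cache[CACHE_SLOTS];
  static unsigned long now;

  /* Stand-ins for the real driver calls (assumed, not the SMI API). */
  extern int  sci_register(void *buf, size_t len);   /* returns 0 on failure */
  extern void sci_unregister(void *buf);

  seg_t *acquire_segment(void *buf, size_t len)
  {
      for (int i = 0; i < CACHE_SLOTS; i++)          /* 1. cache hit?        */
          if (cache[i].buf == buf && cache[i].len >= len) {
              cache[i].last_used = ++now;
              return &cache[i];
          }

      while (!sci_register(buf, len)) {              /* 2. register; on      */
          int v = -1;                                /*    failure evict LRU */
          for (int i = 0; i < CACHE_SLOTS; i++)
              if (cache[i].buf && !cache[i].in_use &&
                  (v < 0 || cache[i].last_used < cache[v].last_used))
                  v = i;
          if (v < 0)
              return NULL;                           /* nothing idle to evict */
          sci_unregister(cache[v].buf);
          cache[v].buf = NULL;
      }

      for (int i = 0; i < CACHE_SLOTS; i++)          /* 3. record new entry  */
          if (cache[i].buf == NULL) {
              cache[i] = (seg_t){ buf, len, 1, ++now };
              return &cache[i];
          }
      return NULL;   /* unreachable under the slot-count assumption above */
  }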

11
Memory Allocation
  • Allocate "good" memory:
  • MPI_Alloc_mem() / MPI_Free_mem()
  • Part of MPI-2 (mostly for one-sided operations)
  • SCI-MPICH defines attributes (example below):
  • type: shared, private or default
  • Shared memory performs best.
  • alignment: none, specified or default
  • Non-shared memory should be page-aligned
  • "Good" memory should only be enforced for
    communication buffers!
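A short example of requesting "good" memory: MPI_Alloc_mem(), MPI_Free_mem() and the info object are standard MPI-2, but the exact info keys and values understood by SCI-MPICH are assumed from the attributes listed above.

  #include <mpi.h>

  /* Allocate a "good" communication buffer of len bytes via SCI-MPICH.
     The keys "type" and "alignment" mirror the attributes above; their
     exact spelling in SCI-MPICH is an assumption. */
  static void *alloc_good_buffer(MPI_Aint len)
  {
      void    *buf = NULL;
      MPI_Info info;

      MPI_Info_create(&info);
      MPI_Info_set(info, "type", "shared");       /* shared memory performs best  */
      MPI_Info_set(info, "alignment", "default"); /* page-align non-shared memory */
      MPI_Alloc_mem(len, info, &buf);
      MPI_Info_free(&info);
      return buf;            /* release later with MPI_Free_mem(buf) */
  }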

12
Zero-Copy Protocols
  • Applicable to the handshake-based rendezvous
    protocol
  • Requirements:
  • registered user-allocated buffers, or
  • regular SCI segments
    ("good" memory via MPI_Alloc_mem())
  • The state of a memory range must be known
  • SMI provides query functionality
  • Registering / connecting / mapping may fail
  • Several different setups are possible
  • A fallback mechanism is required (sketched below)
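A sketch of that fallback chain; the helper functions and constants are hypothetical, only the decision structure follows the bullets above.

  #include <stddef.h>

  /* Hypothetical protocol selection for the rendezvous path. */
  typedef enum { PROTO_ZEROCOPY, PROTO_COPY } proto_t;

  extern int is_sci_segment(const void *buf, size_t len);      /* assumed query */
  extern int try_register_user_buffer(void *buf, size_t len);  /* may fail      */

  proto_t choose_protocol(void *buf, size_t len)
  {
      if (is_sci_segment(buf, len))            /* "good" memory via MPI_Alloc_mem() */
          return PROTO_ZEROCOPY;
      if (try_register_user_buffer(buf, len))  /* registering may fail              */
          return PROTO_ZEROCOPY;
      return PROTO_COPY;                       /* always-working copying fallback   */
  }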

13
Asynchronous Rendez-Vous
[Figure: timeline of the asynchronous rendezvous protocol, showing the
control messages, the posted Irecv, the data transfer, and the Wait calls
on both sides.]
14
Test Setup
  • Systems used for performance evaluation:
  • Pentium-III @ 800 MHz
  • 512 MB RAM @ 133 MHz
  • 64-bit / 66 MHz PCI (ServerWorks ServerSet III
    LE)
  • Dolphin D330 (single ring topology)
  • Linux 2.4.4-bigphysarea
  • modified SCI driver (user memory for SCI)

15
Bandwidth Comparison
16
Application Kernel: NPB IS
  • Parallel bucket sort
  • Keys are integer numbers
  • Dominant communication: MPI_Alltoallv for the
    distributed key array

17
MPI_Alltoallv Performance
  • MPI_Alltoallv is translated into point-to-point
    operations: MPI_Isend / MPI_Irecv / MPI_Waitall
    (see the sketch below)
  • Improved performance with asynchronous DMA
    operations
  • Application speedup can be deduced from this
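The point-to-point decomposition looks roughly like the following; this is a simplified sketch of the pattern, not SCI-MPICH's actual source.

  #include <mpi.h>
  #include <stdlib.h>

  /* Exchange variable-sized blocks with every rank by posting all receives
     and sends as nonblocking operations, then completing them in
     MPI_Waitall.  Asynchronous DMA lets these transfers overlap. */
  void alltoallv_p2p(void *sendbuf, const int *scounts, const int *sdispls,
                     void *recvbuf, const int *rcounts, const int *rdispls,
                     MPI_Datatype type, MPI_Comm comm)
  {
      int size, tsize;
      MPI_Comm_size(comm, &size);
      MPI_Type_size(type, &tsize);          /* assumes a contiguous datatype */

      MPI_Request *req = malloc(2 * size * sizeof(MPI_Request));
      for (int i = 0; i < size; i++)        /* post all receives first       */
          MPI_Irecv((char *)recvbuf + (size_t)rdispls[i] * tsize, rcounts[i],
                    type, i, 0, comm, &req[i]);
      for (int i = 0; i < size; i++)        /* then all sends                */
          MPI_Isend((char *)sendbuf + (size_t)sdispls[i] * tsize, scounts[i],
                    type, i, 0, comm, &req[size + i]);
      MPI_Waitall(2 * size, req, MPI_STATUSES_IGNORE);
      free(req);
  }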

18
Asynchronous Communication
  • Goal: overlap computation and communication
  • How to quantify the efficiency for this?
  • Typical overlapping effect

19
Saturation and Efficiency (I)
  • Two parameters are required:
  • Saturation s:
  • Duration of the computation period required to
    make the total time (communication + computation)
    increase
  • Efficiency e:
  • Relation of the overhead to the message latency

20
Saturation and Efficiency (II)
[Figure: timing diagram relating the total time t_total, the computation
time t_busy, the non-overlapped remainder t_total - t_busy, and the
asynchronous / synchronous message latencies t_msg_a and t_msg_s; a
possible formalization follows.]
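One plausible formalization of the two quantities from this diagram (the slides give only the picture, so these exact definitions are an assumption): with t_busy the computation time, t_total the measured time of the overlapped run, and t_msg_s / t_msg_a the synchronous / asynchronous message latencies,

  e = 1 - \frac{t_{total} - t_{busy}}{t_{msg\_s}}, \qquad
  s = \min\{\, t_{busy} \;:\; t_{total}(t_{busy}) > t_{msg\_a} \,\}

With this reading, e = 1 means the transfer is completely hidden behind the computation (t_total = t_busy), e = 0 means no overlap at all, and s is the shortest computation period for which the total time starts to exceed the pure asynchronous message latency.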
21
Experimental Setup: Overlap
  • Micro-benchmark to quantify overlapping
    (a runnable sketch follows below):

    latency = MPI_Wtime()
    if (sender)
        MPI_Isend(msg, msgsize)
        while (elapsed_time < spinning_duration)
            spin (with multiple threads)
        MPI_Wait()
    else
        MPI_Recv()
    latency = MPI_Wtime() - latency
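A runnable version of this micro-benchmark could look like the following sketch, assuming rank 0 is the sender and spin_for() is a hypothetical helper that dispatches to the FIXED/DAXPY spinning variants of the next slide.

  #include <mpi.h>
  #include <stdlib.h>

  extern void spin_for(double seconds, size_t msgsize);  /* FIXED or DAXPY spinning */

  /* Measure the latency of one message while the sender keeps the CPU busy
     for spin_sec between MPI_Isend and MPI_Wait. */
  double overlap_latency(size_t msgsize, double spin_sec, MPI_Comm comm)
  {
      int rank;
      char *msg = malloc(msgsize);
      MPI_Comm_rank(comm, &rank);

      double latency = MPI_Wtime();
      if (rank == 0) {                       /* sender */
          MPI_Request req;
          MPI_Isend(msg, (int)msgsize, MPI_BYTE, 1, 0, comm, &req);
          spin_for(spin_sec, msgsize);       /* computation overlapped with transfer */
          MPI_Wait(&req, MPI_STATUS_IGNORE);
      } else {                               /* receiver */
          MPI_Recv(msg, (int)msgsize, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
      }
      latency = MPI_Wtime() - latency;

      free(msg);
      return latency;
  }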

22
Experimental Setup: Spinning
  • Different ways of keeping the CPU busy
    (sketched below):
  • FIXED: Spin on a single variable for a given
    amount of CPU time
  • No memory stress
  • DAXPY: Perform a given number of DAXPY operations
    on vectors (vector sizes of x, y equivalent to
    the message size)
  • Stresses the memory system
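A minimal sketch of the two spinning variants that the hypothetical spin_for() helper above could dispatch to.

  #include <stddef.h>
  #include <mpi.h>   /* for MPI_Wtime() */

  /* FIXED: burn CPU time on a single variable -- no memory stress. */
  static void spin_fixed(double seconds)
  {
      volatile unsigned long x = 0;
      double end = MPI_Wtime() + seconds;
      while (MPI_Wtime() < end)
          x++;
  }

  /* DAXPY: y[i] += a * x[i] over vectors sized like the message --
     this stresses the memory system while the DMA transfer runs. */
  static void spin_daxpy(double *y, const double *x, double a, size_t n, int reps)
  {
      for (int r = 0; r < reps; r++)
          for (size_t i = 0; i < n; i++)
              y[i] += a * x[i];
  }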

23
DAXPY 64kiB Message
24
DAXPY 256kiB Message
25
FIXED 64kiB Message
26
Asynchronous Performance
  • Saturation and Efficiency derived from
    experiments

27
Summary & Outlook
  • Efficient utilization of new SCI driver
    functionality for MPI communication
  • Max. bandwidth of 230 MiB/s (regular segments) /
    190 MiB/s (user memory)
  • Connection overhead hidden by segment caching
  • Asynchronous communication pays off much
    earlier than before
  • New (?) quantification scheme for efficiency of
    asynchronous communication
  • Flexible MPI memory allocation supports the MPI
    application writer
  • Connection-oriented DMA transfers reduce resource
    utilization
  • DMA alignment problems
  • Segment callback required for improved connection
    caching