Realizing the Performance Potential of the Virtual Interface Architecture presentation

Title: Realizing the Performance Potential of the Virtual Interface Architecture

1
Realizing the Performance Potential of the
Virtual Interface Architecture

Evan Speight, Hazim Abdel-Shafi, and John K. Bennett
Rice University, Dept. of Electrical and Computer Engineering
Presented by Constantin Serban, R.U.

2
VIA Goals

Communication infrastructure for System Area Networks (SANs)
Targets mainly high-speed cluster applications
Efficiently harnesses the communication performance of the underlying networks

3
Trends

Peak bandwidth has increased by two orders of magnitude over the past decade, while user latency has decreased only modestly.
The latency introduced by the protocol is typically several times the latency of the transport layer.
The problem is especially acute for small messages.
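The effect on small messages can be illustrated with a simple cost model: delivery time is a fixed per-message overhead plus the time on the wire. This is only a sketch. The function names and the numbers in the usage loop (100 microseconds of overhead, 100 MB/s peak bandwidth) are illustrative assumptions, not measurements from the presentation.

```python
def transfer_time_us(size_bytes, overhead_us, bandwidth_mb_s):
    """Time to deliver one message: fixed per-message overhead plus wire time."""
    # 1 MB/s is 1 byte per microsecond, so size / bandwidth is in microseconds.
    return overhead_us + size_bytes / bandwidth_mb_s


def effective_bandwidth_mb_s(size_bytes, overhead_us, bandwidth_mb_s):
    """Bandwidth the application actually observes for messages of this size."""
    return size_bytes / transfer_time_us(size_bytes, overhead_us, bandwidth_mb_s)


if __name__ == "__main__":
    # Assumed, illustrative parameters: 100 us overhead, 100 MB/s peak bandwidth.
    for size in (64, 1_000, 1_000_000):
        bw = effective_bandwidth_mb_s(size, overhead_us=100.0, bandwidth_mb_s=100.0)
        print(f"{size:>9} bytes: {bw:8.2f} MB/s effective")
```

Under these assumed numbers a 64-byte message sees well under 1 MB/s, while a 1 MB message comes close to the peak: the fixed overhead, not the wire, dominates small transfers.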

4
Targets

The VI architecture addresses the following issues:
Decrease latency, especially for small messages (used in synchronization)
Increase aggregate bandwidth (only a fraction of the peak bandwidth is utilized)
Reduce the CPU processing spent on message overhead

5
Overhead

Overhead comes mainly from two sources:
Every network access requires one or two traps into the kernel; each user/kernel mode switch is time-consuming
Usually two data copies occur: from the user buffer to the message-passing API, and from the message layer to the kernel buffer

6
VIA approach

Remove the kernel from the critical path: move communication code out of the kernel into user space
Provide a zero-copy protocol: data is sent from and received into the user buffer directly, and no message copy is performed
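The difference between the conventional two-copy path and a zero-copy path can be sketched in a few lines. This is a toy model, not VIA code: the function names are hypothetical, and Python's memoryview stands in for a NIC that accesses the user buffer in place.

```python
def send_with_copies(user_data: bytes):
    """Conventional path: two copies before the data reaches the kernel buffer."""
    copies = 0
    message_layer = bytearray(user_data)   # copy 1: user buffer -> message-passing layer
    copies += 1
    kernel_buffer = bytes(message_layer)   # copy 2: message layer -> kernel buffer
    copies += 1
    return kernel_buffer, copies


def send_zero_copy(user_buffer: bytearray):
    """Zero-copy path: hand out a view of the user buffer, duplicating nothing."""
    return memoryview(user_buffer), 0
```

Because the view shares storage with the user buffer, a later change to the buffer is visible through the view, which is exactly why such a buffer must stay in place while a transfer is outstanding.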

VIA emerged as a standardization effort from Compaq, Intel, and Microsoft
It was built on several academic ideas:
The overall architecture is most similar to U-Net
Essential features are derived from VMMC
Current implementations include:
GigaNet cLan: VIA implemented in hardware
Tandem ServerNet: VIA emulated by a software driver
Myricom Myrinet: VIA emulated in firmware

8
VIA architecture

9
VIA operations

Set-Up/Tear-Down: VIA is a point-to-point, connection-oriented protocol; the VI endpoint is the core concept in VIA
Register/De-Register Memory
Connect/Disconnect
Transmit
Receive
RDMA

10
VIA operations

Set-Up/Tear-Down
VIA is a point-to-point, connection-oriented protocol
The VI endpoint is the core concept in VIA
The VipCreateVi function creates a VI endpoint in user space.
The user-level library passes the call to the kernel agent, which passes the creation information to the NIC.
The OS thus controls the application's access to the NIC.

11
VIA operations - contd

Register/De-Register Memory
All data buffers and descriptors reside in registered memory
The NIC performs DMA I/O operations in this registered memory
Registration pins the pages in physical memory, provides a handle for manipulating them, and transfers their addresses to the NIC
It is performed once, usually at the beginning of the communication session
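The registration rule can be sketched as a small table kept by the NIC. This is a toy model under stated assumptions: ToyNic and its methods are hypothetical names, not the VIA API, and real registration also pins pages in physical memory, which plain Python cannot show.

```python
import itertools


class ToyNic:
    """Hypothetical stand-in for a VIA NIC's table of registered memory."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self._regions = {}  # handle -> registered buffer

    def register(self, buffer: bytearray) -> int:
        """Record the buffer and return a handle (a real NIC would also pin its pages)."""
        handle = next(self._next_handle)
        self._regions[handle] = buffer
        return handle

    def deregister(self, handle: int) -> None:
        """Forget the region; later DMA through this handle is refused."""
        del self._regions[handle]

    def dma_read(self, handle: int, offset: int, length: int) -> bytes:
        """Read from a registered region; unregistered memory is refused."""
        if handle not in self._regions:
            raise PermissionError("memory is not registered with the NIC")
        region = self._regions[handle]
        if offset < 0 or length < 0 or offset + length > len(region):
            raise ValueError("access outside the registered region")
        return bytes(region[offset:offset + length])
```

Registering once and reusing the handle for every transfer mirrors the slide's point that registration is a setup cost, not a per-message cost.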

12
VIA operations - contd

Connect/Disconnect
Before communication, each endpoint is connected to a remote endpoint
The connection request is passed to the kernel agent and down to the NIC
VIA does not define an addressing scheme; implementations can use existing schemes

13
VIA operations - contd

Transmit/receive
The sender builds a descriptor for the message to be sent. The descriptor points to the actual data buffer. Both the descriptor and the data buffer reside in a registered memory area.
The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor.
The doorbells are maintained in an internal queue inside the NIC

14
VIA operations - contd

Transmit/receive (contd)
Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell in the receiver NIC's queue
When the doorbell reaches the head of the sender's queue, the data is sent into the network through a double indirection (doorbell to descriptor, descriptor to data buffer).
The first doorbell/descriptor is picked up from the receiver's queue, and its buffer is filled with the data
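The descriptor and doorbell flow described on the two transmit/receive slides can be sketched as a small simulation. All names here (Descriptor, ToyVi, nic_transfer) are hypothetical, the "addresses" are just dictionary keys, and the network itself is reduced to a direct copy between two endpoints.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer: bytearray      # data buffer the descriptor points to
    length: int = 0        # number of valid bytes in the buffer
    done: bool = False     # set once the NIC has processed the descriptor


class ToyVi:
    """Hypothetical VI endpoint: registered memory plus send and receive doorbell queues."""

    def __init__(self):
        self.memory = {}            # address -> Descriptor (stands in for registered memory)
        self.send_queue = deque()   # doorbells, each holding a descriptor address
        self.recv_queue = deque()
        self._next_address = 0x1000

    def _store(self, descriptor):
        address = self._next_address
        self._next_address += 0x40
        self.memory[address] = descriptor
        return address

    def post_send(self, descriptor):
        address = self._store(descriptor)
        self.send_queue.append(address)    # post the doorbell
        return address

    def post_recv(self, descriptor):
        address = self._store(descriptor)
        self.recv_queue.append(address)
        return address


def nic_transfer(sender, receiver):
    """Process the doorbell at the head of the sender's queue."""
    if not sender.send_queue:
        raise RuntimeError("no send descriptor posted")
    if not receiver.recv_queue:
        raise RuntimeError("no receive descriptor posted")
    send_desc = sender.memory[sender.send_queue[0]]       # doorbell -> descriptor
    payload = send_desc.buffer[:send_desc.length]         # descriptor -> data buffer
    recv_desc = receiver.memory[receiver.recv_queue[0]]
    if len(payload) > len(recv_desc.buffer):
        raise ValueError("receive buffer too small")
    sender.send_queue.popleft()
    receiver.recv_queue.popleft()
    recv_desc.buffer[:len(payload)] = payload             # fill the receiver's buffer
    recv_desc.length = len(payload)
    send_desc.done = recv_desc.done = True
```

The sketch makes one constraint visible: a send can only complete if the receiver has already posted a receive descriptor with a large enough buffer.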

15
VIA operations - contd

RDMA
As a mechanism derived from VMMC, VIA allows Remote DMA (RDMA) operations: RDMA Read and RDMA Write
Each node allocates a receive buffer and registers it with the NIC. Additional structures that contain read and write pointers to the receive buffers are exchanged during connection setup.
Each node can read and write the remote node's memory directly.
These operations pose potential implementation problems.
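A minimal sketch of the RDMA idea follows, assuming a hypothetical ToyNode class whose registered buffers are addressed by a handle exchanged at connection setup. Unlike the send/receive path, the remote side posts no descriptor; the initiator names the remote buffer and offset itself. None of these names come from the VIA API.

```python
class ToyNode:
    """Hypothetical node whose NIC exposes registered buffers by handle."""

    def __init__(self):
        self._regions = {}
        self._next_handle = 1

    def register(self, size: int) -> int:
        """Allocate and register a receive buffer; the handle is sent to the peer."""
        handle = self._next_handle
        self._next_handle += 1
        self._regions[handle] = bytearray(size)
        return handle

    def region(self, handle: int) -> bytearray:
        if handle not in self._regions:
            raise PermissionError("buffer is not registered")
        return self._regions[handle]


def _check(region, offset, length):
    if offset < 0 or length < 0 or offset + length > len(region):
        raise ValueError("access outside the registered buffer")


def rdma_write(remote, handle, offset, data):
    """Place data directly into the remote node's registered buffer."""
    region = remote.region(handle)
    _check(region, offset, len(data))
    region[offset:offset + len(data)] = data


def rdma_read(remote, handle, offset, length):
    """Fetch bytes directly from the remote node's registered buffer."""
    region = remote.region(handle)
    _check(region, offset, length)
    return bytes(region[offset:offset + length])
```

The bounds and registration checks hint at where the implementation problems mentioned on the slide come from: the remote NIC must validate every access without involving the remote CPU.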

16
Evaluation Benchmarks

Two VI implementations:
GigaNet cLan: 125 MB/s, latency 480 ns
Tandem ServerNet: 50 MB/s, latency 300 ns
Performance measured:
Bandwidth and latency
Polling vs. blocking
CPU utilization

17
Bandwidth

18
Latency

19
Latency Polling/Blocking

20
CPU utilization

21
MPI performance using VIA

The challenge is to deliver this performance to distributed applications
Software layers such as MPI are commonly used between VIA and the application; they improve usability but add overhead
How can this layer be optimized to use VIA efficiently?

22
MPI VIA - performance

23
MPI observations

The difference between MPI-UDP and MPI-VIA-baseline is remarkable
MPI-VIA-baseline is still far from VIA-Native
Several improvements are proposed to reduce MPI overhead and bring MPI-VIA closer to VIA-Native

24
MPI Improvements

Eliminating unnecessary copies
MPI-UDP and MPI-VIA use a single set of receive buffers, so data must be copied to the application. Improvement: allow the user to register any buffer
Choosing a synchronization primitive
All synchronization formerly used OS constructs (events). A better implementation uses the processor's swap instructions
No acknowledgments
Remove message acknowledgments by switching to a reliable VIA mode

25
VIA - Disadvantages

Polling vs. blocking synchronization: a tradeoff between CPU consumption and overhead
Memory registration: locking a large amount of memory makes virtual memory mechanisms inefficient, and registering/deregistering on the fly is slow
Point-to-point vs. multicast: VIA lacks multicast primitives, and implementing multicast over the existing mechanism makes communication inefficient
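The polling vs. blocking tradeoff in the first point can be sketched with Python's threading.Event standing in for a completion notification. The function names are hypothetical and this is not how a VIA library waits; it only contrasts sleeping in the OS with spinning on a flag.

```python
import threading
import time


def wait_blocking(event: threading.Event, timeout_s: float = 5.0) -> bool:
    """Sleep in the OS until the event is set; light on CPU, but waking up costs time."""
    return event.wait(timeout_s)


def wait_polling(event: threading.Event, timeout_s: float = 5.0) -> int:
    """Spin on the flag; reacts quickly but burns CPU. Returns the number of spins."""
    deadline = time.monotonic() + timeout_s
    spins = 0
    while not event.is_set():
        spins += 1
        if time.monotonic() > deadline:
            raise TimeoutError("completion never arrived")
    return spins
```

The spin count returned by wait_polling is a rough proxy for the CPU work spent waiting, which is the cost the blocking variant avoids at the price of a slower wake-up.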

26
Conclusion

Low latency for small messages, which have a strong impact on application behavior
Significant improvement over UDP communication (does this still hold after recent TCP/UDP hardware implementations?)
At the expense of an uncomfortable API
