Title: Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O
1. Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O
2. Introduction
- Network processing can saturate servers running high-speed TCP/IP applications
- Packet processing can consume too many CPU cycles
- The idea is to free up CPU cycles for application processing
- One solution is to offload TCP/IP to the network interface card
3. TOE (TCP Offload Engines)
- TOEs have not demonstrated significant performance improvements
- The processing elements of a TOE lag behind mainstream processors
- TOEs suffer from resource limitations, such as limited memory
4. Hardware developments
- Faster multi-core processors
- Shared caches reduce memory latency when transferring data between processors
- High-bandwidth point-to-point links
- Memory and I/O bandwidth will scale with the number of processing units
- Reduced latency
5. More developments
- TCP/IP stack improvements
- OS-bypass reduces OS overheads
- Asynchronous I/O increases I/O concurrency
6. Embedded Transport Acceleration (ETA)
- Dedicates a set of hardware threads or processing units, called Packet Processing Engines (PPEs), to network processing
- PPE features:
  - Avoids costly context switches and cache conflicts with the application process
  - Avoids costly interrupts by polling the network interfaces and applications
  - OS-bypass for most of the I/O
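The interrupt-free polling model can be sketched as follows. This is an illustrative simulation only: `doorbell_queue`, `nic_rx_ring`, and the handler callbacks are invented stand-ins, not part of ETA's actual implementation, and the real PPE polls shared-memory queues and NIC descriptor rings rather than Python deques.

```python
from collections import deque

# Hypothetical stand-ins for the doorbell queue and a NIC receive ring.
doorbell_queue = deque()
nic_rx_ring = deque()

def ppe_poll_once(handle_request, handle_packet):
    """One iteration of the PPE loop: poll both event sources
    instead of waiting for interrupts or context switches."""
    progress = False
    if doorbell_queue:                    # application posted a request?
        handle_request(doorbell_queue.popleft())
        progress = True
    if nic_rx_ring:                       # NIC delivered a packet?
        handle_packet(nic_rx_ring.popleft())
        progress = True
    return progress
```

The loop simply spins over its sources; when both are empty it returns without blocking, which is why a dedicated PPE shows near-constant CPU utilization.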
7. Asynchronous I/O
- An asynchronous version of the traditional socket interface
- Applications issue operations without being blocked
- Applications receive an event signifying completion of the I/O
- Socket status need not be verified
8. Asynchronous I/O (Cont.)
- Provides for concurrent I/O operations
- No need for separate software threads
- Separate operation queues: the PPEs and the applications can independently speed up or slow down with respect to each other
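The decoupling through separate queues can be illustrated with a single-producer/single-consumer ring buffer. This is a conceptual sketch only: Python cannot express the memory barriers a real lock-free queue needs, and this is not the actual DUSI/PPE queue code.

```python
class SPSCQueue:
    """Single-producer/single-consumer ring buffer. With exactly one
    writer of `tail` (producer) and one writer of `head` (consumer),
    enqueue and dequeue need no shared lock -- the property the shared
    application/PPE queues exploit."""
    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)  # one slot wasted to distinguish full/empty
        self.head = 0                       # written only by the consumer
        self.tail = 0                       # written only by the producer

    def enqueue(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False                    # full: the producer must slow down
        self.buf[self.tail] = item
        self.tail = nxt
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None                     # empty: the consumer finds no work
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

Because neither side ever blocks the other, the producer and consumer naturally run at independent rates, backing off only when the queue is full or empty.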
9. Architecture Overview
10. Architecture Overview (Cont.)
- Host CPUs, Direct User Sockets Interface (DUSI), PPE CPU
- DUSI implements all the asynchronous I/O functions
- All the queue data structures are shared between the host applications and the PPE
- Queue operations are extremely fast, lock-free operations
11. Architecture (Cont.)
- Applications establish communication with the PPE through a User Adaptation Layer (UAL)
- DTI (Direct Transport Interface), analogous to a BSD socket
  - References the queue data structures
- Applications post requests and receive notifications through the DTI
- Polling happens directly from cache
- NIC and ETA data structures
12. ETA Queues
- Every application has an instance of the UAL
- One Doorbell Queue (DbQ): DUSI posts application requests here to alert the PPE
- One or more Event Queues (EvQ): the PPE posts completion events here to notify the application
- Transmit and receive descriptor queues for each DTI: pointers to send/receive buffers are posted here; they can be bound to any EvQ
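The per-application queue layout just listed can be modeled as a toy object graph. All field and class names here are illustrative; the real structures live in shared memory and are not Python objects.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class EventQueue:
    events: deque = field(default_factory=deque)  # PPE -> application completions

@dataclass
class DTI:
    tx_descriptors: deque = field(default_factory=deque)  # pointers to send buffers
    rx_descriptors: deque = field(default_factory=deque)  # pointers to receive buffers
    event_queue: EventQueue = None                        # any EvQ may be bound here

@dataclass
class UAL:
    doorbell: deque = field(default_factory=deque)        # DbQ: application -> PPE
    event_queues: list = field(default_factory=list)
    dtis: list = field(default_factory=list)

    def open_dti(self, evq):
        """Create a DTI and bind its completions to the chosen EvQ."""
        dti = DTI(event_queue=evq)
        self.dtis.append(dti)
        return dti
```

Note the fan-in this allows: many DTIs can share one EvQ, so the application can wait on a single queue for completions from many connections.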
13. Asynchronous I/O calls
- Dusi_Open_Ual
  - Creates the UAL and initializes application-related data structures
  - A synchronous function that interacts with the ETA kernel agent
  - Provides access to kernel functions that pin memory and translate user virtual addresses to kernel virtual addresses
14. Asynchronous I/O calls (Cont.)
- The application is responsible for creating the Event Queues
  - Receive/transmit event queues
- Dusi_Accept
  - Permits the application to initiate a receive on the TCP stream before the accept completes
15. Asynchronous I/O calls (Cont.)
- Dusi_Send/Dusi_Receive
  - A pointer to the buffer is posted to the DTI's descriptor queue
  - A notification is sent to the DbQ
  - The PPE transfers data between the NIC and the user-provided buffer
  - Prior to this, user virtual space and kernel virtual space must be mapped
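The send path described above can be simulated end to end. Every name below is an illustrative stand-in rather than the real DUSI API; the point is the order of operations: post descriptor, ring doorbell, PPE transfer, completion event.

```python
from collections import deque

descriptor_q = deque()   # DTI transmit descriptor queue
doorbell_q = deque()     # DbQ: alerts the PPE that work is pending
event_q = deque()        # EvQ: completion notifications back to the application
nic_wire = []            # stands in for the NIC / the network

def dusi_send(buf):
    descriptor_q.append(buf)     # 1. post a pointer to the user buffer
    doorbell_q.append("send")    # 2. ring the doorbell

def ppe_service():
    while doorbell_q:            # 3. the PPE notices the doorbell by polling
        doorbell_q.popleft()
        buf = descriptor_q.popleft()
        nic_wire.append(buf)     # 4. PPE moves data from the user buffer to the NIC
        event_q.append(("send_done", len(buf)))  # 5. post the completion event
```

The application never blocks in `dusi_send`; it learns the outcome only by consuming the completion event later.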
16. Asynchronous I/O calls (Cont.)
- DUSI provides for multiple outstanding send and receive requests
- DUSI allows blocking on a single queue
  - Once the event completes, DUSI raises a signal
17. uServer
- An open-source web server implemented with the SPED (Single Process Event-Driven) architecture
- SPED: one CPU processes multiple connections in an infinite event loop
- uServer on standard Linux uses the sendfile system call
18. uServer with AIO
- uServer executes an infinite loop:
  1. Dequeue completion events
  2. Based on the completion event, determine what action to take on which connection
  3. Issue the appropriate AIO operation
- Three event queues:
  - Accept completions
  - Receive completions
  - Send and shutdown completions
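One pass of the three-queue dispatch loop might look like the sketch below. The queue and handler names are hypothetical, not uServer's actual identifiers; each handler would advance the connection's state machine and issue the next AIO operation.

```python
from collections import deque

# Stand-ins for the three completion queues.
accept_q, recv_q, send_q = deque(), deque(), deque()

def run_once(on_accept, on_recv, on_send):
    """One pass of the event loop: drain each completion queue and
    dispatch every event to the handler for its queue type. The event
    itself identifies which connection it belongs to."""
    handled = 0
    for q, handler in ((accept_q, on_accept), (recv_q, on_recv), (send_q, on_send)):
        while q:
            handler(q.popleft())
            handled += 1
    return handled
```

A real server would call `run_once` inside `while True:`, blocking on the event queues when all three are empty.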
19. State machine for handling a single connection
20. Receive event processing
- Upon a successful receive completion event, a new receive is posted in anticipation of another request from the client
- Sending a static file:
  - uServer mmaps the requested files and registers the mmapped regions with DUSI
  - DUSI uses the kernel agent to pin down the memory sections
  - Data is transferred directly from memory to the NIC
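The mmap-then-register flow can be sketched as follows. `register_with_dusi` is a hypothetical stand-in for the DUSI registration call (which would pin the pages via the kernel agent); the final byte copy stands in for posting a send descriptor that points into the mapping.

```python
import mmap
import os
import tempfile

pinned_regions = []  # stand-in for DUSI's record of registered regions

def register_with_dusi(region):
    # Hypothetical: real DUSI would pin these pages through the kernel agent
    # so the PPE can DMA from them without a copy into kernel buffers.
    pinned_regions.append(region)

def serve_static(path):
    """mmap the file once, register the region, then 'send' from it."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    region = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    os.close(fd)                 # the mapping stays valid after closing the fd
    register_with_dusi(region)
    return region[:]             # stands in for posting a send descriptor into the region
```

The design point is that the file contents are never copied into a per-request buffer: send descriptors reference the pinned mapping directly.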
21. Performance Evaluation
- Experimental environment:
  - A workload that minimizes disk I/O
  - Clients request static files
  - The file set is small enough to fit in memory
- Two-processor system
  - ETA: one processor for application processing, the other for network processing
  - Compared against the best-configured baseline system
22. Processing Efficiency and Capacity
23. Processing Efficiency and Capacity (Cont.)
- The baseline with sendfile performed much better than without sendfile
- For the most part, ETA CPU utilization is about 50% because the PPE is always polling
- ETA reached peak throughput faster than the baseline with sendfile
24. ETA vs. Baseline Throughput with Encryption
25. Processing Efficiency with Normalized 1-CPU data
- ETA+AIO has a modest advantage over the best baseline configuration
- ETA uses a single CPU for network processing
- On the baseline system there may be instances of TCP/IP executing on both processors
  - This can cause cache conflicts and semaphore contention
26. Processing Efficiency with Normalized 1-CPU data (Cont.)
- 1-CPU configuration:
  - One CPU
  - 2 NICs instead of 4 NICs
- The 1-CPU results are normalized to the 2-CPU ETA by doubling the throughput
- This configuration is free of the multiprocessing overhead of packet processing
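The normalization step is simple arithmetic; the throughput number below is invented purely for illustration and does not reproduce the paper's measurements.

```python
# Hypothetical result from the 1-CPU, 2-NIC run (Mb/s; invented value).
one_cpu_throughput = 1500.0

# Doubling approximates an idealized 2-CPU system with no multiprocessing
# overhead, making the figure comparable to the 2-CPU ETA configuration.
normalized_throughput = 2 * one_cpu_throughput
```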
27. Efficiency Graph
- Around peak throughput, CPU utilization is the same for ETA and the baseline