Title: Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O
1. Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O
2. Introduction
- Network processing can saturate servers running high-speed TCP/IP applications
- Packet processing can consume too many CPU cycles
- The idea is to free up CPU cycles for application processing
- One solution is to offload TCP/IP to the network interface card
3. TOE (TCP Offload Engines)
- TOEs have not demonstrated significant performance improvements
- The processing elements of a TOE lag behind mainstream processors
- TOEs suffer from resource limitations, such as limited memory
4. Hardware developments
- Faster multi-core processors
- Shared caches reduce memory latency when transferring data between processors
- High-bandwidth point-to-point links
- Memory and I/O bandwidth will scale with the number of processing units
- Reduced latency
5. More developments
- TCP/IP stack improvements
- OS-bypass reduces OS overheads
- Asynchronous I/O increases I/O concurrency
6. Embedded Transport Acceleration (ETA)
- Dedicates a set of hardware threads or processing units, called Packet Processing Engines (PPEs), to network processing
- PPE features:
  - Avoids costly context switches and cache conflicts with the application process
  - Avoids costly interrupts by polling the network interfaces and applications
  - OS-bypass for most of the I/O
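The interrupt-free polling model can be sketched as follows. This is an illustrative simulation only: `doorbell_queue`, `nic_rx_ring`, and the handler callbacks are invented stand-ins, not part of ETA's actual implementation, and the real PPE polls shared-memory queues and NIC descriptor rings rather than Python deques.

```python
from collections import deque

# Hypothetical stand-ins for the doorbell queue and a NIC receive ring.
doorbell_queue = deque()
nic_rx_ring = deque()

def ppe_poll_once(handle_request, handle_packet):
    """One iteration of the PPE loop: poll both event sources
    instead of waiting for interrupts or context switches."""
    progress = False
    if doorbell_queue:                    # application posted a request?
        handle_request(doorbell_queue.popleft())
        progress = True
    if nic_rx_ring:                       # NIC delivered a packet?
        handle_packet(nic_rx_ring.popleft())
        progress = True
    return progress
```

The loop simply spins over its sources; when both are empty it returns without blocking, which is why a dedicated PPE shows near-constant CPU utilization.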
7. Asynchronous I/O
- An asynchronous version of the traditional socket interface
- Applications issue operations without being blocked
- Applications receive an event signifying completion of the I/O
- Socket status need not be verified
8. Asynchronous I/O (Cont.)
- Provides for concurrent I/O operations
- No need for separate software threads
- Separate operation queues: the PPEs and the applications can independently speed up or slow down with respect to each other
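The decoupling through separate queues can be illustrated with a single-producer/single-consumer ring buffer. This is a conceptual sketch only: Python cannot express the memory barriers a real lock-free queue needs, and this is not the actual DUSI/PPE queue code.

```python
class SPSCQueue:
    """Single-producer/single-consumer ring buffer. With exactly one
    writer of `tail` (producer) and one writer of `head` (consumer),
    enqueue and dequeue need no shared lock -- the property the shared
    application/PPE queues exploit."""
    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)  # one slot wasted to distinguish full/empty
        self.head = 0                       # written only by the consumer
        self.tail = 0                       # written only by the producer

    def enqueue(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False                    # full: the producer must slow down
        self.buf[self.tail] = item
        self.tail = nxt
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None                     # empty: the consumer finds no work
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

Because neither side ever blocks the other, the producer and consumer naturally run at independent rates, backing off only when the queue is full or empty.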
9. Architecture Overview
10. Architecture Overview (Cont.)
- Host CPUs, Direct User Sockets Interface (DUSI), PPE CPU
- DUSI implements all the asynchronous I/O functions
- All the queue data structures are shared between the host applications and the PPE
- Queue operations are extremely fast, lock-free operations
11. Architecture (Cont.)
- Applications establish communication with the PPE through a User Adaptation Layer (UAL)
- DTI (Direct Transport Interface), analogous to a BSD socket
  - References the queue data structures
- Applications post requests and receive notifications through the DTI
- Polling happens directly from cache
- NIC and ETA data structures
12. ETA Queues
- Every application has an instance of the UAL
- One Doorbell Queue (DbQ): DUSI posts application requests here to alert the PPE
- One or more Event Queues (EvQ): the PPE posts completion events here to notify the application
- Transmit and receive descriptor queues for each DTI: pointers to send/receive buffers are posted here; they can be bound to any EvQ
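The per-application queue layout just listed can be modeled as a toy object graph. All field and class names here are illustrative; the real structures live in shared memory and are not Python objects.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class EventQueue:
    events: deque = field(default_factory=deque)  # PPE -> application completions

@dataclass
class DTI:
    tx_descriptors: deque = field(default_factory=deque)  # pointers to send buffers
    rx_descriptors: deque = field(default_factory=deque)  # pointers to receive buffers
    event_queue: EventQueue = None                        # any EvQ may be bound here

@dataclass
class UAL:
    doorbell: deque = field(default_factory=deque)        # DbQ: application -> PPE
    event_queues: list = field(default_factory=list)
    dtis: list = field(default_factory=list)

    def open_dti(self, evq):
        """Create a DTI and bind its completions to the chosen EvQ."""
        dti = DTI(event_queue=evq)
        self.dtis.append(dti)
        return dti
```

Note the fan-in this allows: many DTIs can share one EvQ, so the application can wait on a single queue for completions from many connections.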
13. Asynchronous I/O calls
- Dusi_Open_Ual
  - Creates the UAL and initializes application-related data structures
  - A synchronous function that interacts with the ETA kernel agent
  - Provides access to kernel functions that pin memory and translate user virtual addresses to kernel virtual addresses
14. Asynchronous I/O calls (Cont.)
- The application is responsible for creating the Event Queues
  - Receive/transmit event queues
- Dusi_Accept
  - Permits the application to initiate a receive on the TCP stream before the accept completes
15. Asynchronous I/O calls (Cont.)
- Dusi_Send/Dusi_Receive
  - A pointer to the buffer is posted to the DTI's descriptor queue
  - A notification is sent to the DbQ
  - The PPE transfers data between the NIC and the user-provided buffer
  - Prior to this, user virtual space and kernel virtual space must be mapped
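The send path described above can be simulated end to end. Every name below is an illustrative stand-in rather than the real DUSI API; the point is the order of operations: post descriptor, ring doorbell, PPE transfer, completion event.

```python
from collections import deque

descriptor_q = deque()   # DTI transmit descriptor queue
doorbell_q = deque()     # DbQ: alerts the PPE that work is pending
event_q = deque()        # EvQ: completion notifications back to the application
nic_wire = []            # stands in for the NIC / the network

def dusi_send(buf):
    descriptor_q.append(buf)     # 1. post a pointer to the user buffer
    doorbell_q.append("send")    # 2. ring the doorbell

def ppe_service():
    while doorbell_q:            # 3. the PPE notices the doorbell by polling
        doorbell_q.popleft()
        buf = descriptor_q.popleft()
        nic_wire.append(buf)     # 4. PPE moves data from the user buffer to the NIC
        event_q.append(("send_done", len(buf)))  # 5. post the completion event
```

The application never blocks in `dusi_send`; it learns the outcome only by consuming the completion event later.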
16. Asynchronous I/O calls (Cont.)
- DUSI provides for multiple outstanding send and receive requests
- DUSI allows blocking on a single queue
  - Once the event completes, DUSI raises a signal
17. uServer
- An open-source web server implemented with the SPED (Single Process Event-Driven) architecture
- SPED: one CPU processes multiple connections in an infinite event loop
- uServer on standard Linux uses the sendfile system call
18. uServer with AIO
- uServer executes an infinite loop:
  1. Dequeue completion events
  2. Based on the completion event, determine what action to take on which connection
  3. Issue the appropriate AIO operation
- Three event queues:
  - Accept completions
  - Receive completions
  - Send and shutdown completions
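One pass of the three-queue dispatch loop might look like the sketch below. The queue and handler names are hypothetical, not uServer's actual identifiers; each handler would advance the connection's state machine and issue the next AIO operation.

```python
from collections import deque

# Stand-ins for the three completion queues.
accept_q, recv_q, send_q = deque(), deque(), deque()

def run_once(on_accept, on_recv, on_send):
    """One pass of the event loop: drain each completion queue and
    dispatch every event to the handler for its queue type. The event
    itself identifies which connection it belongs to."""
    handled = 0
    for q, handler in ((accept_q, on_accept), (recv_q, on_recv), (send_q, on_send)):
        while q:
            handler(q.popleft())
            handled += 1
    return handled
```

A real server would call `run_once` inside `while True:`, blocking on the event queues when all three are empty.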
19. State machine for handling a single connection
20. Receive event processing
- Upon a successful receive completion event, a new receive is posted in anticipation of another request from the client
- Sending a static file:
  - uServer mmaps the requested files and registers the mmapped regions with DUSI
  - DUSI uses the kernel agent to pin down the memory sections
  - Data is transferred directly from memory to the NIC
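The mmap-then-register flow can be sketched as follows. `register_with_dusi` is a hypothetical stand-in for the DUSI registration call (which would pin the pages via the kernel agent); the final byte copy stands in for posting a send descriptor that points into the mapping.

```python
import mmap
import os
import tempfile

pinned_regions = []  # stand-in for DUSI's record of registered regions

def register_with_dusi(region):
    # Hypothetical: real DUSI would pin these pages through the kernel agent
    # so the PPE can DMA from them without a copy into kernel buffers.
    pinned_regions.append(region)

def serve_static(path):
    """mmap the file once, register the region, then 'send' from it."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    region = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    os.close(fd)                 # the mapping stays valid after closing the fd
    register_with_dusi(region)
    return region[:]             # stands in for posting a send descriptor into the region
```

The design point is that the file contents are never copied into a per-request buffer: send descriptors reference the pinned mapping directly.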
21. Performance Evaluation
- Experimental environment:
  - A workload that minimizes disk I/O
  - Clients request static files
  - The file set is small enough to fit in memory
- Two-processor system
  - ETA: one processor for application processing, the other for network processing
  - Compared against the best-configured baseline system
22. Processing Efficiency and Capacity
23. Processing Efficiency and Capacity (Cont.)
- The baseline with sendfile performed much better than without sendfile
- For the most part, ETA CPU utilization is about 50% because the PPE is always polling
- ETA reached peak throughput faster than the baseline with sendfile
24. ETA vs. Baseline Throughput with Encryption
25. Processing Efficiency with Normalized 1-CPU data
- ETA+AIO has a modest advantage over the best baseline configuration
- ETA uses a single CPU for network processing
- On the baseline system there may be instances of TCP/IP executing on both processors
  - This can cause cache conflicts and semaphore contention
26. Processing Efficiency with Normalized 1-CPU data (Cont.)
- 1-CPU configuration:
  - One CPU
  - 2 NICs instead of 4 NICs
- The 1-CPU results are normalized to the 2-CPU ETA by doubling the throughput
- This configuration is free of the multiprocessing overhead of packet processing
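The normalization step is simple arithmetic; the throughput number below is invented purely for illustration and does not reproduce the paper's measurements.

```python
# Hypothetical result from the 1-CPU, 2-NIC run (Mb/s; invented value).
one_cpu_throughput = 1500.0

# Doubling approximates an idealized 2-CPU system with no multiprocessing
# overhead, making the figure comparable to the 2-CPU ETA configuration.
normalized_throughput = 2 * one_cpu_throughput
```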
27. Efficiency Graph
- Around peak throughput, CPU utilization is the same for ETA and the baseline