1
Evaluating Network Processing Efficiency with
Processor Partitioning and Asynchronous I/O
  • Presented by Rao Bayyana

2
Introduction
  • Network processing can saturate a server running high-speed TCP/IP applications
  • Packet processing can consume too many CPU cycles
  • The goal is to free up CPU cycles for application processing
  • One solution is to offload TCP/IP to network interface cards

3
TOE (TCP Offload Engines)
  • TOEs have not demonstrated significant improvements
  • The processing elements of a TOE lag behind mainstream processors
  • TOEs suffer from resource limitations, such as memory

4
Hardware developments
  • Faster multi-core processors
  • Shared caches reduce memory latency when transferring data between processors
  • High-bandwidth point-to-point links
  • Memory and I/O bandwidth will scale with the number of processing units
  • Overall reduction in latency

5
More developments
  • TCP/IP stack improvements
  • OS-bypass reduces OS overheads
  • Asynchronous I/O increases I/O concurrency

6
Embedded Transport Acceleration (ETA)
  • Dedicates a set of hardware threads or processing units, called Packet Processing Engines (PPEs), to network processing
  • PPE features:
  • Avoids costly context switches and cache conflicts with the application process
  • Avoids costly interrupts by polling the network interfaces and applications
  • OS-bypass for most of the I/O

7
Asynchronous I/O
  • An asynchronous version of the traditional sockets interface
  • Applications issue operations without blocking
  • Applications receive an event signifying completion of the I/O
  • Socket status need not be checked

8
Asynchronous I/O
  • Provides for concurrent I/O operations
  • No need for separate software threads
  • Separate operation queues mean the PPEs and the applications can independently speed up or slow down with respect to each other (a familiar analogy is sketched below)
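
As a familiar point of reference, POSIX AIO follows the same post-then-reap model (this is standard POSIX AIO, not the DUSI interface described later): the application posts a read, keeps working, and later collects the completion.

  /* POSIX AIO illustration of the post-then-reap model.
   * Compile on Linux with: cc aio_demo.c -lrt */
  #include <aio.h>
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      static char buf[4096];
      struct aiocb cb;

      memset(&cb, 0, sizeof cb);
      cb.aio_fildes = STDIN_FILENO;  /* any readable fd; a socket works too */
      cb.aio_buf    = buf;
      cb.aio_nbytes = sizeof buf;

      if (aio_read(&cb) != 0) {      /* post the operation; returns at once */
          perror("aio_read");
          return 1;
      }
      while (aio_error(&cb) == EINPROGRESS) {
          /* application work continues here instead of blocking in read() */
      }
      printf("read %zd bytes\n", aio_return(&cb));   /* reap the completion */
      return 0;
  }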

9
Architecture Overview
10
Architecture Overview (Cont.)
  • Host CPUs, the Direct User Sockets Interface (DUSI), and the PPE CPU
  • DUSI implements all the asynchronous I/O functions
  • All the queue data structures are shared between the host applications and the PPE
  • Queue operations are extremely fast, lock-free operations (a sketch follows)
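
A minimal sketch of the single-producer/single-consumer lock-free ring such shared queues are typically built on. This illustrates the technique only; it is not the paper's actual queue code.

  /* One producer (the application side) and one consumer (the PPE side)
   * can share this ring without locks. C11 atomics. */
  #include <stdatomic.h>
  #include <stddef.h>

  #define RING_SIZE 256                 /* must be a power of two */

  struct spsc_ring {
      _Atomic size_t head;              /* advanced by the consumer */
      _Atomic size_t tail;              /* advanced by the producer */
      void *slots[RING_SIZE];
  };

  /* Producer: returns 0 on success, -1 if the ring is full. */
  static int ring_enqueue(struct spsc_ring *q, void *item)
  {
      size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
      size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
      if (tail - head == RING_SIZE)
          return -1;                                    /* full */
      q->slots[tail & (RING_SIZE - 1)] = item;
      atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
      return 0;
  }

  /* Consumer: returns NULL when the ring is empty. */
  static void *ring_dequeue(struct spsc_ring *q)
  {
      size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
      size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
      if (head == tail)
          return NULL;                                  /* empty */
      void *item = q->slots[head & (RING_SIZE - 1)];
      atomic_store_explicit(&q->head, head + 1, memory_order_release);
      return item;
  }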

11
Architecture (Cont.)
  • Applications establish communication with the PPE through a User Adaptation Layer (UAL)
  • DTI (Direct Transport Interface), analogous to a BSD socket
  • A DTI references its queue data structures
  • Applications post operations and receive notifications through a DTI
  • The PPE polls the shared queues directly from cache
  • NIC and ETA data structures (shown in the architecture diagram)

12
ETA Queues
  • Every application has an instance of the UAL
  • One Doorbell Queue (DbQ): DUSI posts application requests here to alert the PPE
  • One or more Event Queues (EvQs): the PPE posts completion events here to notify the application
  • Transmit and receive descriptor queues for each DTI: pointers to send/receive buffers are posted here; each descriptor queue can be bound to any EvQ (layout sketched below)
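
The per-application layout just listed might be organized roughly as below. The slides name the queues but not their definitions, so every type and field name here is an assumption for illustration.

  struct eta_ring;                     /* opaque shared lock-free ring */

  struct eta_dti {                     /* one per connection, socket-like */
      struct eta_ring *tx_descq;       /* descriptors for send buffers    */
      struct eta_ring *rx_descq;       /* descriptors for receive buffers */
      struct eta_ring *tx_evq;         /* EvQ bound for send completions  */
      struct eta_ring *rx_evq;         /* EvQ bound for recv completions  */
  };

  struct eta_ual {                     /* one instance per application    */
      struct eta_ring *doorbell;       /* DbQ: application -> PPE alerts  */
      struct eta_ring **evqs;          /* one or more EvQs: PPE -> app    */
      struct eta_dti **dtis;           /* the application's open DTIs     */
  };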

13
Asynchronous IO calls
  • Dusi_Open_Ual
  • Creates the UAL and initializes application-related data structures
  • A synchronous function that interacts with the ETA kernel agent
  • Provides access to kernel functions that pin memory and translate user virtual addresses to kernel virtual addresses (a hypothetical call sequence follows)
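
A hypothetical call sequence. The slides give the name Dusi_Open_Ual but not its signature, so the prototype and handle type below are assumptions.

  typedef struct dusi_ual dusi_ual_t;        /* opaque UAL handle (assumed) */

  /* Assumed prototype: synchronous, interacts with the ETA kernel agent. */
  extern int Dusi_Open_Ual(dusi_ual_t **ual_out);

  int init_ual(dusi_ual_t **ual)
  {
      /* Returns only after the kernel agent has created the shared queue
       * pages and granted access to memory pinning and user-to-kernel
       * address translation. */
      return Dusi_Open_Ual(ual);
  }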

14
Asynchronous IO calls (Cont.)
  • The application is responsible for creating Event Queues
  • Receive/transmit event queues
  • Dusi_Accept
  • Permits the application to initiate a receive on the TCP stream before the accept completes (sketched below)
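
A sketch of that early-receive pattern. All signatures below are assumptions; the point from the slide is that the receive is posted before the accept completion arrives.

  #include <stddef.h>

  typedef struct dusi_dti dusi_dti_t;           /* assumed opaque types */
  typedef struct dusi_evq dusi_evq_t;

  extern int Dusi_Accept(dusi_dti_t *listener, dusi_dti_t **new_dti,
                         dusi_evq_t *accept_evq);               /* assumed */
  extern int Dusi_Receive(dusi_dti_t *dti, void *buf, size_t len,
                          dusi_evq_t *recv_evq);                /* assumed */

  void accept_and_prepost(dusi_dti_t *listener, dusi_evq_t *accept_evq,
                          dusi_evq_t *recv_evq, void *buf, size_t len)
  {
      dusi_dti_t *dti;
      Dusi_Accept(listener, &dti, accept_evq);    /* returns immediately */
      /* Posted before the accept completes, so the first request bytes
       * arrive without an extra pass through the event loop. */
      Dusi_Receive(dti, buf, len, recv_evq);
  }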

15
Asynchronous IO calls (Cont.)
  • Dusi_Send / Dusi_Receive
  • A pointer to the buffer is posted on the DTI's descriptor queue
  • A notification is sent to the DbQ
  • The PPE transfers data between the NIC and the user-provided buffer
  • Prior to this, the user virtual space and kernel virtual space need to be mapped (see the sketch below)
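
Putting those bullets together, a send plausibly decomposes as below. descq_post and dbq_ring are hypothetical helpers, and eta_dti is abbreviated from the layout sketch above.

  #include <stddef.h>

  struct eta_ring;                                    /* opaque shared ring */
  struct eta_dti { struct eta_ring *tx_descq; };      /* abbreviated        */

  extern void descq_post(struct eta_ring *q,
                         void *buf, size_t len);      /* hypothetical */
  extern void dbq_ring(struct eta_ring *dbq,
                       struct eta_dti *dti);          /* hypothetical */

  int dusi_send_sketch(struct eta_dti *dti, struct eta_ring *dbq,
                       void *buf, size_t len)
  {
      /* 1. Post a pointer to the (already registered) buffer on the
       *    DTI's transmit descriptor queue. */
      descq_post(dti->tx_descq, buf, len);
      /* 2. Alert the PPE through the Doorbell Queue. */
      dbq_ring(dbq, dti);
      /* 3. Return at once; the PPE later moves the data between the
       *    user buffer and the NIC and posts a completion event. */
      return 0;
  }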

16
Asynchronous IO calls (Cont.)
  • DUSI allows multiple send and receive requests to be outstanding
  • DUSI allows blocking on a single event queue
  • Once an event completes, DUSI signals the blocked application (sketched below)
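
A sketch of the blocking pattern. Dusi_Evq_Wait and dusi_event_t are assumed names; the slides describe the behavior but not the interface.

  typedef struct dusi_evq dusi_evq_t;                           /* assumed */
  typedef struct { int kind; void *dti; long result; } dusi_event_t; /* assumed */

  /* Assumed: blocks the caller until the PPE posts an event to evq. */
  extern int Dusi_Evq_Wait(dusi_evq_t *evq, dusi_event_t *ev);

  void wait_for_completion(dusi_evq_t *evq)
  {
      dusi_event_t ev;
      if (Dusi_Evq_Wait(evq, &ev) == 0) {
          /* ev identifies which of the many outstanding operations
           * finished, and with what result. */
      }
  }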

17
USERVER
  • An open-source web server implemented with the SPED architecture
  • SPED (Single Process Event Driven):
  • One CPU processes multiple connections in an infinite event loop
  • The userver baseline on standard Linux uses the sendfile system call (a sketch of such a loop follows)
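
For contrast with the AIO version on the next slide, here is an illustrative SPED-style loop for the standard-Linux baseline using epoll and sendfile. This is a sketch, not userver's actual source; HTTP parsing and error handling are omitted.

  #include <netinet/in.h>
  #include <sys/epoll.h>
  #include <sys/sendfile.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <unistd.h>

  void sped_loop(int listen_fd, int file_fd, off_t file_size)
  {
      int ep = epoll_create1(0);
      struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
      epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

      for (;;) {                                  /* the infinite event loop */
          struct epoll_event events[64];
          int n = epoll_wait(ep, events, 64, -1); /* one CPU, many sockets  */
          for (int i = 0; i < n; i++) {
              int fd = events[i].data.fd;
              if (fd == listen_fd) {              /* new connection         */
                  int c = accept(listen_fd, NULL, NULL);
                  struct epoll_event cev = { .events = EPOLLIN, .data.fd = c };
                  epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);
              } else {                            /* request arrived        */
                  char req[4096];
                  if (read(fd, req, sizeof req) > 0) {
                      off_t off = 0;              /* kernel copies the file
                                                     straight to the socket */
                      sendfile(fd, file_fd, &off, file_size);
                  } else {
                      epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
                      close(fd);
                  }
              }
          }
      }
  }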

18
Userver with AIO
  • userver executes an infinite loop (sketched below):
  • 1. Dequeue completion events
  • 2. Based on the completion event, determine what action to take on which connection
  • 3. Issue the appropriate AIO operation
  • Three event queues:
  • Accept completions
  • Receive completions
  • Send and shutdown completions
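
The same loop in code form. Dusi_Evq_Dequeue and the post_*/advance_state helpers are assumed names standing in for the operations the slide lists.

  typedef struct dusi_evq dusi_evq_t;                 /* assumed */
  typedef struct { void *dti; } dusi_event_t;         /* assumed */

  /* Assumed: returns 0 if an event was dequeued, nonzero otherwise. */
  extern int Dusi_Evq_Dequeue(dusi_evq_t *evq, dusi_event_t *ev);

  extern void post_receive(void *dti);                /* hypothetical helpers */
  extern void post_send(void *dti);
  extern void advance_state(void *dti);

  void aio_event_loop(dusi_evq_t *accept_evq, dusi_evq_t *recv_evq,
                      dusi_evq_t *send_evq)
  {
      dusi_event_t ev;
      for (;;) {                                /* 1. dequeue completions  */
          if (Dusi_Evq_Dequeue(accept_evq, &ev) == 0)
              post_receive(ev.dti);             /* 2./3. new connection:
                                                   post its first receive  */
          if (Dusi_Evq_Dequeue(recv_evq, &ev) == 0)
              post_send(ev.dti);                /* request in: parse it and
                                                   post the reply          */
          if (Dusi_Evq_Dequeue(send_evq, &ev) == 0)
              advance_state(ev.dti);            /* send/shutdown done: move
                                                   the connection's state
                                                   machine forward         */
      }
  }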

19
State machine for handling a single connection
20
Receive event processing
  • Upon a successful receive completion event:
  • A new receive is posted in anticipation of another request from the client
  • Sending a static file (sketched below):
  • userver mmaps the requested files and registers the mmapped regions with DUSI
  • DUSI uses the kernel agent to pin down the memory sections
  • Data is transferred directly from memory to the NIC
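
In code, the static-file path might look as follows. mmap is the real POSIX call; Dusi_Register_Memory and Dusi_Send are assumed names for the registration and send steps the slide describes.

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Assumed DUSI calls; the slides name the steps, not the API. */
  extern int Dusi_Register_Memory(void *addr, size_t len);
  extern int Dusi_Send(void *dti, void *buf, size_t len, void *send_evq);

  static int serve_static_file(void *dti, void *send_evq, const char *path)
  {
      struct stat st;
      int fd = open(path, O_RDONLY);
      if (fd < 0 || fstat(fd, &st) != 0)
          return -1;

      /* Map the file; the mapping itself becomes the send buffer. */
      void *region = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (region == MAP_FAILED) {
          close(fd);
          return -1;
      }

      /* The kernel agent pins these pages and records the user-to-kernel
       * address translation so the PPE can read them directly. */
      Dusi_Register_Memory(region, st.st_size);

      /* The PPE transfers the data from the mapped pages to the NIC. */
      Dusi_Send(dti, region, st.st_size, send_evq);
      close(fd);
      return 0;
  }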

21
Performance Evaluation
  • Experimental environment:
  • A workload that minimizes disk I/O
  • Clients request a static file
  • The file set is small enough to fit in memory
  • Two-processor system
  • ETA: one processor for application processing, the other for network processing
  • Compared against the best-configured baseline system

22
Processing Efficiency and Capacity
23
Processing Efficiency and Capacity
  • The baseline with sendfile performed much better than without sendfile
  • For the most part, ETA CPU utilization is about 50%
  • The PPE is always polling
  • ETA reached peak throughput faster than the baseline with sendfile

24
ETA vs Baseline Throughput w/ Encryption
25
Processing Efficiency with Normalized 1-CPU Data
  • ETA/AIO has a modest advantage over the best baseline configuration
  • ETA uses a single CPU for network processing
  • On the baseline system there may be instances of TCP/IP executing on both processors
  • This can cause cache conflicts and semaphore contention

26
Processing Efficiency with Normalized 1-CPU Data (Cont.)
  • 1-CPU configuration:
  • One CPU
  • 2 NICs instead of 4 NICs
  • Normalized the 1-CPU results to the 2-CPU ETA case by doubling the throughput
  • Free of the multiprocessing overhead of packet processing

27
Efficiency Graph
  • Around peak throughput, CPU utilization is the same for ETA and the baseline