Transcript and Presenter's Notes

Title: Task Partitioning for Multi-Core Network Processors


1
Task Partitioning for Multi-Core Network
Processors
  • Rob Ennals, Richard Sharp
  • Intel Research, Cambridge
  • Alan Mycroft
  • Programming Languages Research Group,
  • University of Cambridge Computer Laboratory

2
Talk Overview
  • Network Processors
  • What they are, and why they are interesting
  • Architecture Mapping Scripts (AMS)
  • How to separate your high level program from low
    level details
  • Task Pipelining
  • How it can go wrong, and how to make sure it goes
    right

3
Network Processors
  • Designed for high speed packet processing
  • Up to 40Gb/s
  • High performance per watt
  • ASIC performance with CPU programmability
  • Highly parallel
  • Multiple programmable cores
  • Specialised co-processors
  • Exploit the inherent parallelism of packet
    processing
  • Products available from many manufacturers
  • Intel, Broadcom, Hifn, Freescale, EZChip,
    Xelerated, etc

4
Lots of Parallelism
  • Intel IXP2800: 16 cores, each with 8 threads
  • EZChip NP-1c: 5 different types of cores
  • Agere APP: several specialised cores
  • Freescale C-5: 16 cores, 5 co-processors
  • Hifn 5NP4G: 16 cores
  • Xelerated X10: 200 VLIW packet engines
  • Broadcom BCM1480: 4 cores

5
Pipelined Programming Model
  • Used by many NP designs
  • Packets flow between cores
  • Why do this?
  • Cores may have different functional units
  • Cores may maintain state tables locally
  • Cores may have limited code space
  • Reduce contention for shared resources
  • Makes it easier to preserve packet ordering

[Diagram: packets flowing through a pipeline of cores]
6
An Example: IXP2800
  • 16 microengine cores
  • Each with 8 concurrent threads
  • Each with local memory and specialised functional
    units
  • Pipelined programming model
  • Dedicated datapath between adjacent microengines
  • Exposed IO Latency
  • Separate operations to schedule IO, and to wait for it to finish (sketched below)
  • No cache hierarchy
  • Must manually cache data in faster memories
  • Very powerful, but hard to program
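
To make the exposed-IO-latency point concrete: the hardware gives you one operation to start an IO and a separate one to wait for its completion, so a read can be issued early and the thread only blocks when the result is needed. The microengines' real primitives are not shown on these slides; purely as a portable analogue, this C sketch uses POSIX AIO (link with -lrt on some systems) to show the same issue-early, wait-late structure.

```c
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file will do */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    aio_read(&cb);                      /* schedule the IO...          */

    /* ...do independent work here while the read is in flight...      */

    const struct aiocb *pending[1] = { &cb };
    aio_suspend(pending, 1, NULL);      /* ...and only now wait for it */
    printf("read %zd bytes\n", aio_return(&cb));

    close(fd);
    return 0;
}
```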

7
IXP2800
[Block diagram: IXP2800 internals. 16 MEv2 microengines in clusters; XScale core with 32K instruction and 32K data caches; three RDRAM channels (72-bit) with stripe/byte-align logic; four QDR SRAM channels (18-bit) with enqueue/dequeue logic; 16KB scratch memory; hash unit (64/48/128-bit); PCI (64-bit, 66 MHz); SPI-4 or CSIX media interface (16-bit) with receive and transmit buffers (RBuf/TBuf, 64 x 128B); CSRs (fast write, UART, timers, GPIO, boot ROM/slow port).]
8
IXDP-2400
  • Things are even harder in practice
  • Systems contain multiple NPs!

[Diagram: IXDP-2400 system. Packets arrive from the network, pass through multiple network processors, and are sent back to the network.]
9
What People Do Now
  • Design their programs around the architecture
  • Explicitly program each microengine thread
  • Explicitly access low level functional units
  • Manually hoist IO operations to be early
  • THIS SUCKS!
  • High level program gets polluted with low level
    details
  • IO hoisting breaks modularity
  • Programs are hard to understand, hard to modify,
    hard to write, hard to maintain, and hard to port
    to other platforms.

10
The PacLang Project
  • Aiming to make it easier to program Network
    Processors
  • Based around the PacLang language
  • C-like syntax and semantics
  • Statically allocated threads, linked by queues (sketched below)
  • Abstracts away all low level details
  • A number of interesting features
  • Linear type system
  • Architecture Mapping scripts (this talk)
  • Various other features in progress
  • A prototype implementation is available
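
PacLang source syntax is not reproduced on these slides, so the following is only a rough approximation, in plain C with POSIX threads, of the programming model described above: a fixed set of statically created threads linked by bounded queues. The queue type and the rx/forward task names are invented for illustration.

```c
#include <pthread.h>
#include <stdio.h>

/* Illustrative only: a tiny bounded queue standing in for PacLang's
 * built-in queues; "packets" are just ints here. */
#define QSIZE 16
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t m;
    pthread_cond_t not_empty, not_full;
} queue_t;

static void q_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->m, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void q_enq(queue_t *q, int v) {
    pthread_mutex_lock(&q->m);
    while (q->count == QSIZE) pthread_cond_wait(&q->not_full, &q->m);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QSIZE; q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->m);
}

static int q_deq(queue_t *q) {
    pthread_mutex_lock(&q->m);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->m);
    int v = q->buf[q->head]; q->head = (q->head + 1) % QSIZE; q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->m);
    return v;
}

static queue_t rx_to_fwd;          /* statically allocated queue */

static void *rx_task(void *arg) {  /* one statically allocated thread */
    (void)arg;
    for (int pkt = 0; pkt < 8; pkt++)
        q_enq(&rx_to_fwd, pkt);    /* "receive" a packet, hand it on */
    return NULL;
}

static void *fwd_task(void *arg) { /* a second statically allocated thread */
    (void)arg;
    for (int i = 0; i < 8; i++)
        printf("forwarding packet %d\n", q_deq(&rx_to_fwd));
    return NULL;
}

int main(void) {
    pthread_t rx, fwd;
    q_init(&rx_to_fwd);
    pthread_create(&rx, NULL, rx_task, NULL);
    pthread_create(&fwd, NULL, fwd_task, NULL);
    pthread_join(rx, NULL);
    pthread_join(fwd, NULL);
    return 0;
}
```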

11
Architecture Mapping Scripts
  • Our compiler takes two files
  • A high level PacLang program
  • An architecture mapping script (AMS)
  • PacLang program contains no low-level details
  • Portable across different architectures
  • Very easy to read and debug
  • Low level details are all in the AMS
  • Specific to a particular architecture
  • Can change performance, but not semantics
  • Tells the compiler how to transform the program
    so that it executes efficiently

12
Design Flow with an AMS
[Diagram: design flow. The PacLang program and the AMS are fed to the compiler; the developer analyses performance, refines the AMS, and repeats until ready to deploy.]
13
Advantages of the AMS Approach
  • Improved code readability and portability
  • The code isn't polluted with low-level details
  • Easier to get programs correct
  • Correctness depends only on the PacLang program
  • The AMS can change the performance, but not the
    semantics
  • Easy exploration of optimisation choices
  • You only need to modify the AMS
  • Performance
  • The programmer still has a lot of control over
    the generated code.
  • No need to pass all control over to someone else's optimiser

14
AMS + Optimiser = Good
  • Writing an optimiser that can do everything
    perfectly is hard
  • Network Processors are much harder to optimise
    for than CPUs
  • More like hardware synthesis than conventional
    compilation
  • Writing a program that applies an AMS is easier
  • AMS can fill in gaps left by an optimiser
  • Write an optimiser that usually does a reasonable
    job
  • Use an AMS to deal with places where the
    optimiser does poorly
  • Programmers like to have control
  • I may know exactly how I want to map my program
    to hardware
  • Optimisers can give unpredictable behaviour

15
An AMS is an addition, not an alternative to an
automatic optimiser!
  • This is a sufficiently important point that it is
    worth making twice

16
What can an AMS say?
  • How to pipeline a task across multiple
    microengines
  • What to store in each kind of memory
  • When to move data between different memories
  • How to represent data in memory (e.g. pack or
    not?)
  • How to protect shared resources
  • How to implement queues
  • Which code should be considered the critical path
  • Which code should be placed on the XScale core
  • Low level details such as loop unrolling and
    function inlining
  • Which of several alternative algorithms to use
  • And whatever else one might think of

17
AMS-based program pipelining
  • High-level program has problem-oriented concurrency
  • Division of program into tasks models the problem
  • Tasks do not map directly to hardware units
  • AMS transforms this to implementation-oriented
    concurrency
  • Original tasks are split and joined to make new
    tasks
  • New tasks map directly to hardware units

[Diagram: the compiler, guided by the AMS, splits and joins the user tasks into hardware tasks, each of which maps directly onto a hardware unit.]
18
Task Pipelining
  • Convert one repeating task into several tasks with a queue between them (see the C sketch below)

[Diagram: the pipeline transform converts a single task performing A, B, C into three tasks A, B and C connected by queues.]
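
A minimal C sketch (assumed, not taken from the talk) of the pipeline transform: a loop that did two pieces of work per item is split into two threads, with a POSIX pipe standing in for the dedicated queue between pipeline stages; the stage functions a() and b() are placeholders.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Pipeline transform sketch: the single loop { a(); b(); } becomes two
 * tasks with a queue between them.  A pipe stands in for the
 * inter-stage queue; payloads are single bytes. */
static int q[2];                             /* q[0] = read end, q[1] = write end */

static int a(int x) { return x + 1; }        /* first half of the work  */
static int b(int x) { return x * 2; }        /* second half of the work */

static void *stage1(void *arg) {             /* runs a(), enqueues      */
    (void)arg;
    for (char x = 0; x < 8; x++) {
        char y = (char)a(x);
        write(q[1], &y, 1);
    }
    close(q[1]);
    return NULL;
}

static void *stage2(void *arg) {             /* dequeues, runs b()      */
    (void)arg;
    char y;
    while (read(q[0], &y, 1) == 1)
        printf("%d\n", b(y));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pipe(q);
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```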
19
Pipelining is not always safe
  • May change the behaviour of the program

[Diagram: a task that repeatedly executes q.enq(1); q.enq(2) emits 1,2,1,2,... Splitting it between the two enqueues gives tasks t1 and t2; iterations of t1 can get ahead of t2, so the queue may instead see 1,1,2,2,... Elements are now written to the queue out of order.]
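
The hazard in the diagram above can be reproduced with an ordinary-threads sketch (again an analogue in C, not PacLang): after the split, both halves write the same queue and nothing keeps their iterations in lock-step. The inter-stage queue that would carry the rest of each iteration from t1 to t2 is omitted here; it bounds how far t1 can run ahead but does not restore the 1,2,1,2,... order.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* The original task emitted 1,2,1,2,... by doing q.enq(1); q.enq(2) each
 * iteration.  After the split, BOTH new tasks write the same queue, so t1
 * can run ahead and the queue may see 1,1,2,2,... instead. */
static int q[2];                      /* pipe standing in for the queue */

static void *t1(void *arg) {          /* the "q.enq(1)" half */
    (void)arg;
    for (int i = 0; i < 4; i++)
        write(q[1], "1", 1);
    return NULL;
}

static void *t2(void *arg) {          /* the "q.enq(2)" half */
    (void)arg;
    for (int i = 0; i < 4; i++)
        write(q[1], "2", 1);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pipe(q);
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    close(q[1]);

    char c;
    while (read(q[0], &c, 1) == 1)    /* order now depends on scheduling */
        putchar(c);
    putchar('\n');
    return 0;
}
```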
20
Pipelining Safety is tricky (1/3)
  • Concurrent tasks interact in complex ways

[Diagram: the split task performs q1.enq(1) and q2.enq(2), with the pipeline split point between them; a separate task performs q2.enq(q1.deq), passing values from q1 to q2. Values can then appear on q2 out of order, e.g. 1,1,2,2,...]
21
Pipelining Safety is tricky (2/3)
  • Concurrent tasks interact in complex ways

[Diagram: as before, the split task performs q1.enq(1) and q2.enq(2) with the pipeline split point between them, while another task (t3) performs q1.enq(3) and q2.enq(4). Observed queue contents: q1 holds 1,1,3,... and q2 holds 4,2,2,... So q1 says 1,1 was written before 3; q2 says 4 was written before 2; the task doing q1.enq(3) and q2.enq(4) says 3 was written before 4; and the unsplit task says 2 was written before 1,1. This combination is not possible in the original program.]
22
Pipelining Safety is tricky (3/3)
[Diagram: examples of unsafe and safe pipeline split points.]
23
Checking Pipeline Safety
  • Difficult for programmer to know if pipeline is
    safe
  • Fortunately, our compiler checks safety
  • Rejects AMS if pipelining is unsafe
  • Applies a safety analysis that checks that
    pipelining cannot change observable program
    behaviour
  • I won't subject you to the full safety analysis now
  • Read the details in the paper

24
Task Rearrangement in Action
[Diagram: the IP forwarder's user tasks (Rx, Classify, IP forwarding, IP Options, ARP, ICMP Error, Tx) rearranged by the AMS, with the IP forwarding task split into pipeline stages IP(1/3), IP(2/3), ... across microengines.]
25
The PacLang Language
  • High level language, abstracting all low level
    details
  • Not IXP-specific; can be targeted to any architecture
  • Our toolset can also generate Click modules
  • C-like, imperative language
  • Static threads, connected by queues
  • Advanced type system
  • Linearly typed packets allow better packet
    implementation
  • Packet views make it easier to work with multiple protocols

26
Performance
  • One of the main aims of PacLang
  • No feature is added to the language if it can't be implemented efficiently
  • PacLang programs run fast
  • We have implemented a high performance IP
    forwarder
  • It achieves 3Gb/s on a RadiSys ENP2611 (IXP2400) card
  • Worst case, using min-size packets
  • Using a standard longest-prefix-match algorithm (a generic sketch follows this list)
  • Using only 5 of the 8 available micro-engines
    (including drivers)
  • Competitive with other IP forwarders on the same
    platform
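
The slides don't give the forwarder's actual data structures; purely as an illustration of what "a standard longest-prefix-match algorithm" means, here is a generic binary-trie LPM sketch in C (the routes and next hops are made up).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Generic longest-prefix-match on a binary trie; illustrative only,
 * not the IXP implementation referred to above. */
typedef struct node {
    struct node *child[2];
    int next_hop;                  /* -1 = no route stored at this node */
} node_t;

static node_t *new_node(void) {
    node_t *n = calloc(1, sizeof *n);
    n->next_hop = -1;
    return n;
}

static void insert(node_t *root, uint32_t prefix, int len, int next_hop) {
    node_t *n = root;
    for (int i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (!n->child[bit]) n->child[bit] = new_node();
        n = n->child[bit];
    }
    n->next_hop = next_hop;
}

static int lookup(node_t *root, uint32_t addr) {
    node_t *n = root;
    int best = -1;
    for (int i = 0; i < 32 && n; i++) {
        if (n->next_hop >= 0) best = n->next_hop;   /* longest match so far */
        n = n->child[(addr >> (31 - i)) & 1];
    }
    if (n && n->next_hop >= 0) best = n->next_hop;
    return best;
}

int main(void) {
    node_t *root = new_node();
    insert(root, 0x0A000000, 8, 1);            /* 10.0.0.0/8  -> port 1 */
    insert(root, 0x0A010000, 16, 2);           /* 10.1.0.0/16 -> port 2 */
    printf("%d\n", lookup(root, 0x0A010203));  /* 10.1.2.3 -> 2 */
    printf("%d\n", lookup(root, 0x0A020304));  /* 10.2.3.4 -> 1 */
    return 0;
}
```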

27
Availability
  • A preview release of the PacLang compiler is
    available
  • Download it from Intel Research Cambridge, or
    from SourceForge
  • Full source-code is available
  • A research prototype, not a commercial quality
    product
  • Runs simple demo programs
  • But lacks many features that would be needed in a
    full product
  • Not all AMS features are currently working

28
A Tangent: LockBend
  • Abstracted Lock Optimisation for C Programs
  • Take an existing C program
  • Add some pragmas telling the compiler how to
    transform the program to use a different locking
    strategy
  • Fine-grained, ordered, optimistic, two-phase, etc. (a hypothetical example follows the diagram below)
  • Compiler verifies that the program's semantics are preserved

[Diagram: a legacy C program plus LockBend pragmas are fed to the compiler, which produces a program with an optimised locking strategy.]
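
The LockBend pragma syntax is not given on the slide, so the following is a hypothetical sketch of the idea only: the legacy code keeps its simple coarse-grained lock, and an annotation (written here as a comment, since the real directive names are unknown) asks the compiler to re-implement it with, say, a fine-grained per-bucket strategy while checking that observable behaviour is unchanged.

```c
#include <pthread.h>

/* Hypothetical illustration only: the real LockBend pragma syntax is not
 * shown on the slide.  The legacy source keeps its coarse-grained lock,
 * and the annotation asks the compiler to generate a finer-grained
 * (e.g. per-bucket) locking strategy, verified to preserve semantics. */

#define NBUCKETS 256

static int table[NBUCKETS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* #pragma lockbend(table_lock, strategy=fine_grained, partition=bucket) */
void increment(int bucket)
{
    pthread_mutex_lock(&table_lock);     /* coarse lock in the legacy code */
    table[bucket % NBUCKETS]++;          /* compiler may re-lock per bucket */
    pthread_mutex_unlock(&table_lock);
}
```

As with the AMS, the annotation changes only how the locking is implemented, never what the program means.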