Task Partitioning for Multi-Core Network Processors

1
Task Partitioning for Multi-Core Network
Processors

Rob Ennals, Richard Sharp
Intel Research, Cambridge
Alan Mycroft
Programming Languages Research Group,
University of Cambridge Computer Laboratory

2
Talk Overview

Network Processors
What they are, and why they are interesting
Architecture Mapping Scripts (AMS)
How to separate your high level program from low
level details
Task Pipelining
How it can go wrong, and how to make sure it goes
right

3
Network Processors

Designed for high speed packet processing
Up to 40Gb/s
High performance per watt
ASIC performance with CPU programmability
Highly parallel
Multiple programmable cores
Specialised co-processors
Exploit the inherent parallelism of packet
processing
Products available from many manufacturers
Intel, Broadcom, Hifn, Freescale, EZChip,
Xelerated, etc

4
Lots of Parallelism

Intel IXP2800: 16 cores, each with 8 threads
EZChip NP-1c: 5 different types of cores
Agere APP: several specialised cores
Freescale C-5: 16 cores, 5 co-processors
Hifn 5NP4G: 16 cores
Xelerated X10: 200 VLIW packet engines
Broadcom BCM1480: 4 cores

5
Pipelined Programming Model

Used by many NP designs
Packets flow between cores
Why do this?
Cores may have different functional units
Cores may maintain state tables locally
Cores may have limited code space
Reduce contention for shared resources
Makes it easier to preserve packet ordering

[Diagram: packets flowing through a pipeline of four cores]

6
An Example: IXP2800

16 microengine cores
Each with 8 concurrent threads
Each with local memory and specialised functional
units
Pipelined programming model
Dedicated datapath between adjacent microengines
Exposed IO Latency
Separate operations to schedule IO, and to wait
for it to finish
No cache hierarchy
Must manually cache data in faster memories
Very powerful, but hard to program
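
The "exposed IO latency" point can be illustrated with a minimal Python sketch. This is not IXP microcode: `issue_read`, `wait` and `compute` are hypothetical stand-ins for the separate schedule and wait operations, and the 100-cycle latency is an assumed value. Time is modelled as a simple cycle counter. The sketch shows why programmers hoist IO: issuing the read early lets independent work overlap with the latency.

```python
class Memory:
    """Toy memory with a fixed read latency in cycles (assumed value)."""
    LATENCY = 100

    def __init__(self, data):
        self.data = data
        self.now = 0  # current cycle count

    def issue_read(self, addr):
        # Schedule the read; the handle records when the data is ready.
        return (addr, self.now + self.LATENCY)

    def wait(self, handle):
        # Stall until the read has completed, then return the value.
        addr, ready_at = handle
        self.now = max(self.now, ready_at)
        return self.data[addr]

    def compute(self, cycles):
        # Independent work that does not need the value being read.
        self.now += cycles


def naive(mem):
    mem.compute(80)
    handle = mem.issue_read(0)   # read issued only when it is needed
    return mem.wait(handle), mem.now


def hoisted(mem):
    handle = mem.issue_read(0)   # read hoisted above the independent work
    mem.compute(80)
    return mem.wait(handle), mem.now


naive_value, naive_cycles = naive(Memory([42]))
hoisted_value, hoisted_cycles = hoisted(Memory([42]))
```

Both versions return the same value, but in this model the hoisted one finishes in 100 cycles rather than 180. The cost is that the read and its use are now far apart in the source.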

7
IXP2800

[Block diagram: 16 MEv2 microengines (MEv2 1 to MEv2 16);
XScale core with 32K instruction cache and 32K data cache;
3 RDRAM channels with stripe/byte align;
4 QDR SRAM channels;
SPI-4 or CSIX interface (16b) with Rbuf and Tbuf, each 64 @ 128B;
PCI (64b) 66 MHz; gasket; hash unit (64/48/128); 16KB scratch;
CSRs: Fast_wr, UART, timers, GPIO, BootROM/SlowPort]

8
IXDP-2400

Things are even harder in practice
Systems contain multiple NPs!

[Diagram labels: packets from network, packets to network]

9
What People Do Now

Design their programs around the architecture
Explicitly program each microengine thread
Explicitly access low level functional units
Manually hoist IO operations to be early
THIS SUCKS!
High level program gets polluted with low level
details
IO hoisting breaks modularity
Programs are hard to understand, hard to modify,
hard to write, hard to maintain, and hard to port
to other platforms.

10
The PacLang Project

Aiming to make it easier to program Network
Processors
Based around the PacLang language
C-like syntax and semantics
Statically allocated threads, linked by queues
Abstracts away all low level details
A number of interesting features
Linear type system
Architecture Mapping scripts (this talk)
Various other features in progress
A prototype implementation is available

11
Architecture Mapping Scripts

Our compiler takes two files
A high level PacLang program
An architecture mapping script (AMS)
PacLang program contains no low-level details
Portable across different architectures
Very easy to read and debug
Low level details are all in the AMS
Specific to a particular architecture
Can change performance, but not semantics
Tells the compiler how to transform the program
so that it executes efficiently
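
The claim that an AMS can change performance but not semantics can be sketched in Python. Everything here is hypothetical: the stage names, the list-of-groups mapping format and the `apply_mapping` helper are illustration only and do not reflect real PacLang or AMS syntax. The sketch treats the program as named stages and the mapping as a grouping of those stages into tasks, then checks that two different groupings give the same output.

```python
from functools import reduce


def compose(funcs):
    """Fuse several stages into one task that runs them in order."""
    return lambda packet: reduce(lambda value, f: f(value), funcs, packet)


def apply_mapping(stages, mapping):
    """Group the program's stages into tasks as the mapping dictates."""
    return [compose([stages[name] for name in group]) for group in mapping]


def run(tasks, packets):
    """Pass every packet through all tasks, preserving packet order."""
    out = []
    for packet in packets:
        for task in tasks:
            packet = task(packet)
        out.append(packet)
    return out


# High-level program: three named stages, no placement details.
stages = {
    "classify": lambda p: p + ["classified"],
    "route": lambda p: p + ["routed"],
    "transmit": lambda p: p + ["sent"],
}

# Two hypothetical mappings: everything in one task, or split across two.
one_task = [["classify", "route", "transmit"]]
two_tasks = [["classify"], ["route", "transmit"]]

packets = [["p0"], ["p1"]]
result_one = run(apply_mapping(stages, one_task), packets)
result_two = run(apply_mapping(stages, two_tasks), packets)
```

Only the mapping differs between the two runs; the `stages` dictionary, which plays the role of the high-level program, is untouched.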

12
Design Flow with an AMS
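[Flow diagram: the PacLang program and the AMS go into the
compiler; performance is analysed; the AMS is refined and the
program recompiled until it is ready to deploy]
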
13
Advantages of the AMS Approach

Improved code readability and portability
The code isn't polluted with low-level details
Easier to get programs correct
Correctness depends only on the PacLang program
The AMS can change the performance, but not the
semantics
Easy exploration of optimisation choices
You only need to modify the AMS
Performance
The programmer still has a lot of control over
the generated code.
No need to pass all control over to someone
else's optimiser

14
AMS + Optimiser = Good

Writing an optimiser that can do everything
perfectly is hard
Network Processors are much harder to optimise
for than CPUs
More like hardware synthesis than conventional
compilation
Writing a program that applies an AMS is easier
AMS can fill in gaps left by an optimiser
Write an optimiser that usually does a reasonable
job
Use an AMS to deal with places where the
optimiser does poorly
Programmers like to have control
I may know exactly how I want to map my program
to hardware
Optimisers can give unpredictable behaviour

15
An AMS is an addition, not an alternative to an
automatic optimiser!

This is a sufficiently important point that it is
worth making twice

16
What can an AMS say?

How to pipeline a task across multiple
microengines
What to store in each kind of memory
When to move data between different memories
How to represent data in memory (e.g. pack or
not?)
How to protect shared resources
How to implement queues
Which code should be considered the critical path
Which code should be placed on the XScale core
Low level details such as loop unrolling and
function inlining
Which of several alternative algorithms to use
And whatever else one might think of

17
AMS-based program pipelining

High-level program has problem-oriented
concurrency
Division of program into tasks models the problem
Tasks do not map directly to hardware units
AMS transforms this to implementation-oriented
concurrency
Original tasks are split and joined to make new
tasks
New tasks map directly to hardware units

[Diagram: the compiler, directed by the AMS, turns two user
tasks into many hardware tasks]

18
Task Pipelining

Convert one repeating task into several tasks
with a queue between them

[Diagram: one task running A B C becomes, after the pipeline
transform, three tasks A, B and C]
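
The transform above can be sketched with Python threads and FIFO queues. This is an illustration only, not compiler output: the functions `a`, `b` and `c` are hypothetical stand-ins for the stages A, B and C, and the `DONE` sentinel is an assumption used to shut the pipeline down.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Repeatedly take an item, apply func, and pass the result on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Run each func as its own task, with a FIFO queue between neighbours."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while (item := queues[-1].get()) is not DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stages standing in for A, B and C.
a = lambda x: x + 1
b = lambda x: x * 2
c = lambda x: x - 3

items = list(range(5))
sequential = [c(b(a(x))) for x in items]     # one task running A B C
pipelined = run_pipeline([a, b, c], items)   # three tasks linked by queues
```

The two results agree here because each stage only reads its input queue and writes its output queue. Stages that also write to a shared queue are a different matter.
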
19
Pipelining is not always safe

May change the behaviour of the program

Before: one task runs q.enq(1) q.enq(2), so the queue
receives 1,2,1,2,...
Pipeline Transform
After: t1 runs q.enq(1) and t2 runs q.enq(2)
Iterations of t1 get ahead of t2, so the queue can
receive 1,1,2,2,...
Elements now written to queue out of order!
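
The reordering on this slide can be reproduced with a small deterministic simulation. It is a sketch under assumptions: the `link` queue stands in for the queue the transform inserts between t1 and t2, and the `t1_lead` scheduling policy is made up so that the trace is repeatable rather than dependent on a real scheduler.

```python
from collections import deque


def original(iterations):
    """One task: each iteration enqueues 1 then 2 on the shared queue q."""
    q = []
    for _ in range(iterations):
        q.append(1)
        q.append(2)
    return q


def pipelined(iterations, t1_lead):
    """Split version: t1 does q.enq(1) and signals t2, which does q.enq(2).

    t1_lead (must be at least 1) is how many iterations t1 runs before
    t2 is scheduled. This policy is an assumption, not a hardware model.
    """
    q = []
    link = deque()  # queue inserted between t1 and t2 by the split
    done = 0
    while done < iterations or link:
        # t1 runs up to t1_lead iterations ahead of t2.
        for _ in range(min(t1_lead, iterations - done)):
            q.append(1)
            link.append("go")
            done += 1
        # t2 drains everything t1 has handed over so far.
        while link:
            link.popleft()
            q.append(2)
    return q


before = original(4)
lockstep = pipelined(4, t1_lead=1)
skewed = pipelined(4, t1_lead=2)
```

With `t1_lead=1` the split program happens to match the original, which is why this kind of bug can survive testing. With `t1_lead=2` the queue receives 1,1,2,2,... instead.
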
20
Pipelining Safety is tricky (1/3)

Concurrent tasks interact in complex ways

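One task runs q1.enq(1) q2.enq(2), with the pipeline split
point between the two enqueues
A second task runs q2.enq(q1.deq), which passes values
from q1 to q2
After the split, q1 can receive 1,1,...
q2 can then receive 1,1,2,2,...
Values can appear on q2 out of order
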
21
Pipelining Safety is tricky (2/3)

Concurrent tasks interact in complex ways

One task runs q1.enq(1) q2.enq(2), with the pipeline split
point between the two enqueues
Another task runs q1.enq(3) q2.enq(4)
q1 receives 1,1,3,...
q2 receives 4,2,2,...
q1 says 1,1 written before 3. q2 says 4 written
before 2. t4 says 3 written before 4. Unsplit
task says 2 written before 1,1. This
combination is not possible in the original program.
[Diagram label: t3]

22
Pipelining Safety is tricky (3/3)
[Diagrams: an unsafe pipeline split and a safe one]

23
Checking Pipeline Safety

Difficult for programmer to know if pipeline is
safe
Fortunately, our compiler checks safety
Rejects AMS if pipelining is unsafe
Applies a safety analysis that checks that
pipelining cannot change observable program
behaviour
I won't subject you to the full safety analysis
now
Read the details in the paper
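
To make "cannot change observable program behaviour" concrete, here is a brute-force sketch in Python. It is explicitly not the safety analysis from the paper, which is a compile-time check. Instead it enumerates every interleaving of a tiny, bounded program and compares what can end up on q2 before and after a split. The step encoding (`enq`, `deq`, `fwd`), the `link` queue and the bound `N` are all assumptions made for the example, which is the q2.enq(q1.deq) forwarding case.

```python
def step(op, queues):
    """Apply one step to a copy of the queues; return None if it would block."""
    q = dict(queues)
    if op[0] == "enq":                      # ("enq", queue, value)
        q[op[1]] = q.get(op[1], ()) + (op[2],)
    elif op[0] == "deq":                    # ("deq", queue)
        if not q.get(op[1]):
            return None
        q[op[1]] = q[op[1]][1:]
    else:                                   # ("fwd", src, dst)
        if not q.get(op[1]):
            return None
        head, q[op[1]] = q[op[1]][0], q[op[1]][1:]
        q[op[2]] = q.get(op[2], ()) + (head,)
    return q


def outcomes(tasks, observed):
    """Final contents of queue `observed` over all interleavings of tasks."""
    results, seen = set(), set()
    stack = [(tuple(0 for _ in tasks), ())]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, frozen = state
        queues = dict(frozen)
        if all(pc == len(t) for pc, t in zip(pcs, tasks)):
            results.add(queues.get(observed, ()))
            continue
        for i, task in enumerate(tasks):
            if pcs[i] == len(task):
                continue
            nxt = step(task[pcs[i]], queues)
            if nxt is None:                 # task i is blocked
                continue
            new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            stack.append((new_pcs, tuple(sorted(nxt.items()))))
    return results


N = 2  # iterations explored; a small bound keeps the search tiny
forwarder = [("fwd", "q1", "q2")] * N
unsplit = [("enq", "q1", 1), ("enq", "q2", 2)] * N
t1 = [("enq", "q1", 1), ("enq", "link", "go")] * N
t2 = [("deq", "link"), ("enq", "q2", 2)] * N

original = outcomes([unsplit, forwarder], "q2")
split = outcomes([t1, t2, forwarder], "q2")
new_behaviours = split - original
```

For this bound the split program can leave 1,1,2,2 on q2, which the original cannot, so a checker of this kind would reject the split. Exhaustive enumeration does not scale and only covers the bound explored, which is one reason a static analysis is preferable.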

24
Task Rearrangement in Action
[Diagram of the rearranged tasks: Rx; Classify + IP(1/3);
IP(2/3); IP(2/3); IP Options + ARP + ICMP Err; Tx]

25
The PacLang Language

High level language, abstracting all low level
details
Not IXP specific: can be targeted to any
architecture
Our toolset can also generate Click modules
C-like, imperative language
Static threads, connected by queues
Advanced type system
Linearly typed packets allow better packet
implementation
Packet views make it easier to work with
multiple protocols

26
Performance

One of the main aims of PacLang
No feature is added to the language if it can't
be implemented efficiently
PacLang programs run fast
We have implemented a high performance IP
forwarder
It achieves 3Gb/s on a RadiSys ENP2611, IXP2400
card
Worst case, using min-size packets
Using a standard longest-prefix-match algorithm
Using only 5 of the 8 available micro-engines
(including drivers)
Competitive with other IP forwarders on the same
platform
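
To give a feel for what 3Gb/s of minimum-size packets means, here is a back-of-envelope calculation in Python. The slide does not say which link layer is used or how the rate is counted, so the 64-byte minimum Ethernet frame and the 20 bytes of preamble plus inter-frame gap are assumptions, and both ways of counting are shown.

```python
LINE_RATE_BPS = 3e9        # 3 Gb/s, from the slide
MIN_FRAME_BYTES = 64       # assumption: minimum Ethernet frame
WIRE_OVERHEAD_BYTES = 20   # assumption: 8-byte preamble + 12-byte inter-frame gap


def packets_per_second(rate_bps, frame_bytes, overhead_bytes=0):
    """Packets per second at a given bit rate and per-packet size."""
    return rate_bps / ((frame_bytes + overhead_bytes) * 8)


# Counting frame bytes only.
pps_frames_only = packets_per_second(LINE_RATE_BPS, MIN_FRAME_BYTES)

# Counting the per-frame overhead on the wire as well.
pps_on_wire = packets_per_second(
    LINE_RATE_BPS, MIN_FRAME_BYTES, WIRE_OVERHEAD_BYTES
)

# Time budget per packet in the more demanding case.
ns_per_packet = 1e9 / pps_frames_only
```

Under these assumptions the forwarder handles roughly 4.5 to 5.9 million packets per second, leaving about 170 ns per packet.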

27
Availability

A preview release of the PacLang compiler is
available
Download it from Intel Research Cambridge, or
from SourceForge
Full source-code is available
A research prototype, not a commercial quality
product
Runs simple demo programs
But lacks many features that would be needed in a
full product
Not all AMS features are currently working

28
A Tangent: LockBend

Abstracted Lock Optimisation for C Programs
Take an existing C program
Add some pragmas telling the compiler how to
transform the program to use a different locking
strategy
Fine grained, ordered, optimistic, two phase, etc
Compiler verifies that program semantics is
preserved

[Diagram: a legacy C program and LockBend pragmas go into the
compiler, which produces a program with an optimised locking
strategy]
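
The idea of changing the locking strategy while preserving semantics can be sketched by hand in Python. This is not LockBend: the slide does not show the pragma syntax, the tool targets C, and the two classes below are hypothetical before and after versions written manually. They contrast one coarse lock with fine-grained per-counter locks.

```python
import threading


class CoarseCounters:
    """One lock protects every counter."""

    def __init__(self, n):
        self.counts = [0] * n
        self.lock = threading.Lock()

    def add(self, i, amount):
        with self.lock:
            self.counts[i] += amount


class FineCounters:
    """One lock per counter, so different counters do not contend."""

    def __init__(self, n):
        self.counts = [0] * n
        self.locks = [threading.Lock() for _ in range(n)]

    def add(self, i, amount):
        with self.locks[i]:
            self.counts[i] += amount


def hammer(table, n, workers=4, rounds=1000):
    """Update the table from several threads and return the final counts."""
    def work():
        for r in range(rounds):
            table.add(r % n, 1)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.counts


coarse = hammer(CoarseCounters(4), 4)
fine = hammer(FineCounters(4), 4)
```

Both strategies give the same final counts. The point of a tool like the one described is that the compiler, rather than the programmer, verifies that such a change preserves the program's semantics.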
