Reconfigurable Architectures for High Bandwidth Network Processing Systems

Author: Sakir Sezer
Last modified by: John McCanny
Created Date: 1/13/2005 10:43:13 AM
Document presentation format: A4 Paper (210x297 mm)

Slides: 68

1
Reconfigurable Architectures for High Bandwidth
Network Processing Systems

Professor John McCanny CBE FRS FREng
Dr Sakir Sezer, Dr Maire McLoone

2
Institute of Electronics Communications and
Information Technology
"You see things and say 'Why?' but I dream things
that never were and say 'Why not?'"
George Bernard Shaw
3
Purpose of Talk

An overview of Research on reconfigurable
architectures for Network Processing applications
Three aspects
Node throughput
QoS
Data security

4
Structure of talk

Convergence of Communication systems
Processing demands of future networks
Trade-offs of reconfigurability in network
processing in the context of Application specific
architectures for
Programmable Data-Link layer Datagram Processing
Programmable Packet scheduling Architectures
Configurable Cryptographic Architectures
Conclusions

5
Convergence of Communication and Information
Systems
Broadcast, Multicast VoD, TV, Radio
Convergence of Technology, Applications and
Services
Real Time Interactive Services
Best effort Services
WLAN
6
Bandwidth Demand vs Moore's Law
Data processing demand at network and access
nodes doubles every 6-9 months
Technology Gap
Network Processing Gap
Internet traffic doubles every 12 months
Moore's Law: silicon integration
capability doubles every 18 months
Existing data processing architectures are unable
to keep up with network processing demands !!
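The widening gap on this slide follows directly from the two doubling periods. A minimal sketch of the arithmetic, assuming simple exponential growth from a common starting point (the function name and the starting point are illustrative, not from the talk):

```python
def capacity_gap(months: float,
                 demand_doubling_months: float = 12.0,
                 silicon_doubling_months: float = 18.0) -> float:
    """Ratio of demand growth to silicon capability growth after `months`.

    Assumes both quantities start equal and grow exponentially with the
    doubling periods quoted on the slide.
    """
    demand = 2.0 ** (months / demand_doubling_months)
    silicon = 2.0 ** (months / silicon_doubling_months)
    return demand / silicon


if __name__ == "__main__":
    for years in (1, 3, 6):
        # Internet traffic (12 months) against Moore's Law (18 months)
        print(years, "years:", round(capacity_gap(12 * years), 2))
```

With the 6-9 month doubling quoted for node processing demand, pass `demand_doubling_months=9` (or 6) to see the gap open faster.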
7
Issues

Internet traffic is continuously doubling every
12 months
Emerging services require
Higher bandwidth (VoD, DVB-IP, VoIP)
Higher degree of security (Internet Banking,
internet shopping, e-business)
Network processing demands are a consequence of
Smaller packet sizes of real-time and interactive
services
QoS requirements of real-time and interactive
services
Complex security processing of sensitive data
Network protection from viruses and intruders

8
Network Processor Architectures - High
performance with flexibility

To cope with exponential growth in bandwidth
demands
Complex traffic profiles and heterogeneity of
service
To efficiently utilise resources by dynamically
adapting network nodes to various traffic
patterns
Capability for on-demand and customised QoS
support
Cost effective upgrades to new communication
protocols
Ideally high levels of compute power, high levels
of flexibility

9
Application Specific, Configurable Network
Processing Architectures

Programmable Data-Link layer Datagram Processing
Frame Delineation
Frame Check Sequence
Programmable Packet scheduling Architectures
Logic-level reconfigurable packet scheduling
architecture
System-level configurable packet scheduling
architecture
Configurable Cryptographic Architectures
Iterative and Non-iterative architectures

Data-Link Layer
Protocol processing

11
Data-Link layer Protocol processing

Data Link Layer protocols enable a
point-to-point connection between two peers over
a physical link.

Common Data Link Layer protocols are ATM,
Ethernet, PPP, GFP, Frame Relay, Fibre Channel
etc.
12
IP over SDH/SONET Data Link Layer Protocols
Internet Protocol
Network Layer
Data Link Layer
Ethernet Bridge, VLAN
Frame Relay
ATM (MPOA, MPLS,..)
PPP
GFP
PHY Layer
SDH/SONET
Legacy Protocols
Emerging Protocols
13
Data-link layer frame processing

Frame processing involves two key functions
frame delineation and
frame check sequence (FCS)
The circuit architectures for both functions are
determined by
the protocols and
the data-path width
Numerous Frame Delineation and FCS architectures
for PPP, ATM and GFP investigated.
Scalability
Throughput
Hardware costs
Programmable frame processing architecture is
desirable to support a variety of protocols

Data-Link Layer
Protocol processing
Frame Delineation

15
PPP 32-bit ACCM Transmitter Circuit
Includes Asynchronous-Control-Character-Map
(ACCM) function.
16
PPP Frame Delineation Circuit: Post-layout
Synthesis - Altera Stratix II
Hardware Complexity O(N) = N^2
17
8 bit and 32 bit Data Paths

32 bit data path requires additional hardware
for rearrangement of data words before and after
transmission.
Scaling involves a significant area penalty.
Complex data reorganization circuits designed to
overcome the limitations set by an octet based
protocol
Requires an increased logic cost by factors of 15
and 26 for the ACCM receiver and the ACCM
transmitter circuits respectively.
Majority of logic increase due to the number of
byte comparators, as well as the provision of
extra routing and the conditional multiplexers
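The octet-based nature of PPP that makes wide data paths costly can be seen in a software model of the stuffing step. The following is a minimal sketch, assuming the HDLC-like framing rules of RFC 1662 (flag 0x7E, escape 0x7D, XOR with 0x20, one ACCM bit per control character below 0x20). The function names are illustrative, and the FCS field is left out.

```python
FLAG = 0x7E      # frame delimiter octet
ESCAPE = 0x7D    # control-escape octet
XOR_MASK = 0x20  # escaped octets are XORed with this value


def ppp_stuff(payload: bytes, accm: int = 0xFFFFFFFF) -> bytes:
    """Byte-stuff a payload and wrap it in flag octets (FCS omitted)."""
    out = bytearray([FLAG])
    for b in payload:
        # Bit n of the ACCM says whether control character n must be escaped.
        mapped = b < 0x20 and (accm >> b) & 1
        if b in (FLAG, ESCAPE) or mapped:
            out += bytes([ESCAPE, b ^ XOR_MASK])
        else:
            out.append(b)
    out.append(FLAG)
    return bytes(out)


def ppp_unstuff(frame: bytes) -> bytes:
    """Reverse ppp_stuff for a single, complete frame."""
    if len(frame) < 2 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("frame must start and end with the flag octet")
    out = bytearray()
    escaped = False
    for b in frame[1:-1]:
        if escaped:
            out.append(b ^ XOR_MASK)
            escaped = False
        elif b == ESCAPE:
            escaped = True
        else:
            out.append(b)
    return bytes(out)
```

Every octet is compared and possibly expanded to two octets, so a 32-bit data path has to perform four such comparisons per cycle and realign the resulting words, which is the data reorganization cost described above.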

18
ATM Frame Delineation

ATM Frame: 5-byte header, 48-byte payload
Based on Cyclic Coding
Header Error Check HEC
Cyclic Redundancy Check (CRC)
Provides header error detection and frame
delineation
5th header byte (HEC) calculated from CRC
computation of 1st 4 header bytes
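CRC polynomial G(x) = x^8 + x^2 + x + 1

[Figure: ATM header, 8 bits wide]
Byte 1: GFC (UNI) or VPI | VPI
Byte 2: VPI | VCI
Byte 3: VCI
Byte 4: VCI | PT | CLP
Byte 5: HEC = CRC computation with G(x) = x^8 + x^2 + x + 1
over the first 4 header bytes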
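The HEC rule on this slide (fifth header byte computed as a CRC over the first four, with G(x) = x^8 + x^2 + x + 1) can be modelled bit-serially in a few lines. This is a sketch, not the circuit from the talk. The `coset` parameter is an addition: ITU-T I.432 XORs the pattern 0x55 onto the remainder before transmission, and the default of 0 gives the plain CRC as described on the slide.

```python
HEC_POLY = 0x07  # x^8 + x^2 + x + 1, with the x^8 term implicit


def crc8(data: bytes, poly: int = HEC_POLY) -> int:
    """Bit-serial CRC-8, MSB first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def atm_hec(header4: bytes, coset: int = 0x00) -> int:
    """HEC byte for the first four ATM header bytes."""
    if len(header4) != 4:
        raise ValueError("expected exactly 4 header bytes")
    return crc8(header4) ^ coset


def header_is_valid(header5: bytes, coset: int = 0x00) -> bool:
    """True if byte 5 matches the HEC computed over bytes 1-4."""
    return len(header5) == 5 and atm_hec(header5[:4], coset) == header5[4]
```

Frame delineation then amounts to testing `header_is_valid` at successive candidate positions until a match is found.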
19
ATM Bit-by-Bit HEC HUNT 4-Bit Data-Path
Architecture
Enable O/P if Match
8 cycles of Data
Compare with next 2 nibbles
Reset CRC Unit
20
4, 8, 16, 32 and 64 bit implementations
Altera Stratix Technology
16 bit design - 2.5 Gbps, supports SONET OC48 line rate
64 bit design - 6.8 Gbps
21
Generic Frame Procedure (GFP)

The Generic Frame Procedure is a Layer-2 framing
protocol for data over high-capacity optical
networks.
Recently standardised (ITU-T G.7041) to replace
ATM and PPP in high capacity Wide Area Networks
(WANs)
GFP is scalable, allowing the implementation of
wide data-path architectures.
GFP deploys a CRC based frame delineation
architecture similar to ATM HEC HUNT and
synchronisation technique

22
GFP Frame Structure
16-bit GFP Core Header Error Check (CHEC) field
is used for frame delineation
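A CRC-based hunt for the GFP core header can be sketched as follows. The slides only state that the 16-bit cHEC is used for delineation. The header layout (a 2-byte length field followed by the 2-byte cHEC), the generator x^16 + x^12 + x^5 + 1 with zero initial value, and the byte-aligned search are assumptions made here for illustration. The standard is understood to scramble the core header as well, which is omitted, so an all-zero run would match falsely in this sketch.

```python
CHEC_POLY = 0x1021  # assumed generator x^16 + x^12 + x^5 + 1 (x^16 implicit)


def crc16(data: bytes, poly: int = CHEC_POLY) -> int:
    """Bit-serial CRC-16, MSB first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_core_header(payload_len: int) -> bytes:
    """Hypothetical helper: 2-byte length field plus its cHEC."""
    length = payload_len.to_bytes(2, "big")
    return length + crc16(length).to_bytes(2, "big")


def hunt(stream: bytes, start: int = 0) -> int:
    """Return the first byte offset that looks like a core header, or -1."""
    for off in range(start, len(stream) - 3):
        chec = int.from_bytes(stream[off + 2:off + 4], "big")
        if crc16(stream[off:off + 2]) == chec:
            return off
    return -1
```

A hardware implementation checks all candidate alignments within a wide data word in parallel; the loop here visits them one at a time.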
23
GFP Frame Delineation 64-bit Datapath with 1-bit
Header Error Correction Circuit
Preliminary Design study
Max CLK 165 MHz
ALUTs 1107
Registers 653
ALMs 751
LABs 149
Throughput 10.5 Gbps
Altera Stratix II-3 FPGA Technology
24
GFP Frame Delineation 64-bit Datapath with 1-bit
Header Error Correction Circuit

Cadence Encounter UMC-130nm
Clock frequency 250 MHz
Total area 0.12 mm^2
Throughput 16.0 Gbps
Total-power 1.6 x 10^-2 Watts
Internal-power 1.4 x 10^-2 Watts
Switching-power 2.3 x 10^-3 Watts
Leakage-power 8.1 x 10^-5 Watts

UMC-130nm Reference Design
Fastest implementation in the literature
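The throughput figures quoted for these designs are consistent with one data-path word being processed per clock cycle. A small sketch of that arithmetic, where the one-word-per-cycle assumption is an inference from the numbers rather than a statement from the slides:

```python
def throughput_gbps(clock_mhz: float, datapath_bits: int) -> float:
    """Throughput in Gbps, assuming one data-path word per clock cycle."""
    return clock_mhz * 1e6 * datapath_bits / 1e9


if __name__ == "__main__":
    print(throughput_gbps(165, 64))  # FPGA design, 64-bit data path
    print(throughput_gbps(250, 64))  # UMC 130 nm design, 64-bit data path
```

Under this assumption the 64-bit FPGA design at 165 MHz gives about 10.56 Gbps and the 130 nm design at 250 MHz gives 16 Gbps.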
25
Programmable Frame Delineation
Target of 10 Gbps not achievable in FPGA technology,
but should be achievable with an ASIC
Altera Stratix II-3 FPGA Technology
26
32-Bit Protocol Processing Circuit Decomposition
27
Programmable ATM/GFP Protocol Frame Delineation
Architecture
Separate Data Path
Shared Common Elements
28
Architectural Studies Frame Delineation
Architectures

Header Error Check architectures easier to scale
than pattern based architectures.
Examined the feasibility of deriving a common
programmable frame delineation architecture for
layer-2 protocols (PPP, ATM, GFP) operating at
2.5 Gbps or above.
Unable to derive one due to the diverse nature of
the techniques used.
Implementations of the GFP and ATM frame delineation
techniques, which are based on a similar header
error check method, have shown significant
diversity and restrictions for a common
architecture.
However, a programmable architecture that is
slightly faster and smaller can be derived and
is highly suitable for standard-cell-based
implementation. Key aspect: reduction of
registers by 50%, a key cost.

29
Frame Delineation Architectures - Conclusions

Options
Multiplexed specific-purpose circuits
FPGA that can be reconfigured to implement a
specific protocol
The first is the more efficient implementation (area and
speed) for 10 or fewer protocols
Derivation of a programmable datapath based on
common low level functional elements is a
potential low hardware cost option

Data-Link Layer
Protocol processing
Frame Check Sequence

31
Frame Check Sequence Circuits

Data integrity is paramount for data-link layer
protocols
Cyclic Redundancy Check (CRC) is the preferred
method for detecting bit and burst errors in
protocol payloads caused by medium-related
noise.
Commonly used CRC types for layer-2 protocols

32
Investigated Architectures

Hardwired parallel CRC circuits for a given port
size and generator polynomial G(x).
Semi-reconfigurable parallel CRC circuit with
reconfigurable input port size and a given
generator polynomial G(x).
Fully reconfigurable CRC computation circuit for
any generator polynomial G(x) of degree up to
32 and port sizes of 4, 8, 16, 24, and 32 bits.
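The idea behind a parallel CRC with a programmable polynomial and port size can be illustrated in software: because the CRC register update is linear over GF(2), the effect of a whole input word can be precomputed as a set of XOR columns, and "reconfiguring" means recomputing those columns for a new G(x) or port width. This is a sketch of that principle only, not the patented circuit. All names are illustrative, and it assumes MSB-first processing, a zero initial value, and frames that are a multiple of the port width.

```python
def serial_step(state: int, bit: int, poly: int, degree: int) -> int:
    """Shift one input bit into a `degree`-bit CRC register."""
    top = (state >> (degree - 1)) & 1
    state = (state << 1) & ((1 << degree) - 1)
    if top ^ bit:
        state ^= poly  # poly holds G(x) without its leading term
    return state


def word_step(state: int, word: int, width: int, poly: int, degree: int) -> int:
    """Reference model: feed `width` bits one at a time, MSB first."""
    for i in reversed(range(width)):
        state = serial_step(state, (word >> i) & 1, poly, degree)
    return state


def build_columns(width: int, poly: int, degree: int):
    """Response of one word step to each single state bit and input bit."""
    state_cols = [word_step(1 << j, 0, width, poly, degree) for j in range(degree)]
    input_cols = [word_step(0, 1 << i, width, poly, degree) for i in range(width)]
    return state_cols, input_cols


def parallel_step(state: int, word: int, cols) -> int:
    """One-step update for a whole word: XOR of the selected columns."""
    state_cols, input_cols = cols
    nxt = 0
    for j, col in enumerate(state_cols):
        if (state >> j) & 1:
            nxt ^= col
    for i, col in enumerate(input_cols):
        if (word >> i) & 1:
            nxt ^= col
    return nxt


def crc_parallel(data: bytes, width: int, poly: int, degree: int) -> int:
    """CRC of `data`, consuming `width` bits per step."""
    total_bits = len(data) * 8
    if total_bits % width:
        raise ValueError("frame length must be a multiple of the port width")
    cols = build_columns(width, poly, degree)
    value = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    state = 0
    for shift in range(total_bits - width, -1, -width):
        state = parallel_step(state, (value >> shift) & mask, cols)
    return state
```

For example, `crc_parallel(frame, 8, 0x07, 8)` and `crc_parallel(frame, 24, 0x07, 8)` give the same CRC-8 while consuming 8 or 24 bits per step. In hardware each column corresponds to a fixed XOR network, which is why the fully programmable version carries an area and speed penalty.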

33
Parallel Hardwired CRC-8 Circuit
34
Parallel CRC-32 with Programmable Input Bus
Input bus configuration
A programmable input bus is required if the frame
size is not a multiple of the port size, or the
frame data is not aligned to the Least-Significant
Byte (LSB) of the input bus, as illustrated below
Requires feedback circuit reconfiguration
35
Dedicated Parallel CRC architectures
Altera Stratix II-3 FPGA Technology
36
Programmable CRC with Partial Programmability
(uses multiplexing)
Altera Stratix II-3 FPGA Technology
37
Parallel and Fully Reconfigurable CRC Computation
Circuit for High Speed Data Processing
Patent Pending
Max CLK 114.98 MHz
ALUTs 2240
Registers 1365
ALMs 1620
LABs 292
Supports throughput rates above 2.5 Gbps (3.68 Gbps)
Altera Stratix II-3 FPGA Technology
38
Parallel and Fully Reconfigurable CRC Computation
Circuit for High Speed Data Processing

Cadence Encounter UMC-130nm
Clock frequency 125 MHz
Total area 0.27 mm^2
Throughput 8.0 Gbps
Total-power 5.9 x 10^-3 Watts
Internal-power 4.2 x 10^-3 Watts
Switching-power 1.6 x 10^-3 Watts
Leakage-power 1.2 x 10^-4 Watts

UMC-130nm Reference Design
39
Performance Evaluation

There is a trade-off cost for programmability
Fully reconfigurable CRC is 8x larger and 2x
slower than the hard-wired CRC-32
For CRCs with small polynomials and input bus
sizes, the area cost difference can be a factor
of 100
Hardware efficient programmability for parallel
FEC circuits can be achieved by multiplexing
between different custom implementations

40
Performance Evaluation

If polynomial G(x) is not known, then a fully
programmable implementation is an appropriate
solution
Other applications include storage, where CRC
computation can be 30% of overall

Programmable Packet Scheduling

42
Programmable IP packet scheduling

Programmability of Internet protocol packet
scheduling is an essential feature to deal with
Complex traffic profiles and heterogeneity of
service
Efficient bandwidth resource utilisation
To provide on-demand and customised QoS support

43
Current Internet QoS Problems

Internet routers support best-effort packet
delivery only (Best Effort Service)

Delay guarantees for delay-sensitive services
cannot be provided
Real-time and interactive services (Voice, Video)
will not meet users' expectations of quality

44
How can we provide QoS in the Internet?
Single Lane Model
Current Internet: Best Effort Service
Resource Reservation Model (Railway / Aircraft)
Proposed method: Integrated Services (IntServ)
Multiple Lane Model (The Motorway)
Proposed and partially deployed method:
Differentiated Services (DiffServ)
45
Packet scheduling is paramount for QoS

A packet scheduler decides when to send each
packet based on
Traffic type ID (Tag, DiffServ)
Flow ID (Source/Destination Address, IntServ)
The scheduling algorithm and the deployed service
policy determine the QoS performance of the
network
Scheduling Method Trade-offs
Computation Complexity vs Desired Fairness

Router / Switch
Output Control
46
Switch adaptation via partial reconfiguration
Adaptation achieved by partially reconfiguring
FPGA by adding or removing (i.e. reconfiguring)
packet FIFO circuits and output packet schedulers
Partial Reconfiguration
47
Issues relating to run-time, gate level
configurable logic

Limited memory resources, off chip memory access
a bottleneck
Reconfiguration interrupts traffic flow - QoS
degradation.
Partial reconfiguration is limited to similar
scheduling policies.
Runtime and partial configuration add further
complexity
Partial reconfiguration control remains an
unsolved challenge despite a promising model.
Current FPGA technology immature in terms of
design tools and run-time reconfiguration support

48
Systems Level Approach for Programmable Packet
Scheduling
Patent Pending
Packet handling isolated from schedule policy
functions
49
Systems Level Approach for Programmable Packet
Scheduling

Packet handling functions isolated from
scheduler policy specific functions.
Individual queues replaced by a more complex
shared buffer architecture that can support
multiple queues
controlled via address pointers and link lists.
Provides clear separation of the packet
scheduling architecture into
a circuit purely responsible for dealing with
packet service policies i.e. scheduling
algorithms and
a circuit concerned with packet handling e.g.
store/retrieve
Allows flexible programmability of the scheduling
policy and number of packet queues without
reconfiguring the hardware
Comparable throughput rates to implementations
with physically built queues
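The separation described on this slide can be modelled in software: a shared packet memory holds every logical queue as a linked list of slots, and the scheduling policy is a separate, replaceable function that only picks which queue to serve next. This is a minimal sketch under those assumptions. The class and function names are illustrative and do not reflect the patented architecture.

```python
from typing import Callable, List, Optional


class SharedPacketBuffer:
    """One packet memory shared by many logical queues (linked lists)."""

    def __init__(self, slots: int, queues: int) -> None:
        self.data: List[Optional[bytes]] = [None] * slots
        # next[i] links slot i to the following slot of its list.
        self.next: List[int] = list(range(1, slots)) + [-1]
        self.free_head = 0 if slots else -1
        self.head = [-1] * queues  # address pointer to first packet per queue
        self.tail = [-1] * queues  # address pointer to last packet per queue

    def enqueue(self, q: int, packet: bytes) -> bool:
        slot = self.free_head
        if slot == -1:
            return False  # buffer full, packet dropped
        self.free_head = self.next[slot]
        self.data[slot] = packet
        self.next[slot] = -1
        if self.head[q] == -1:
            self.head[q] = slot
        else:
            self.next[self.tail[q]] = slot
        self.tail[q] = slot
        return True

    def dequeue(self, q: int) -> Optional[bytes]:
        slot = self.head[q]
        if slot == -1:
            return None
        packet = self.data[slot]
        self.head[q] = self.next[slot]
        if self.head[q] == -1:
            self.tail[q] = -1
        self.data[slot] = None
        self.next[slot] = self.free_head  # return slot to the free list
        self.free_head = slot
        return packet

    def backlog(self) -> List[int]:
        """Ids of the non-empty queues, in ascending order."""
        return [q for q, h in enumerate(self.head) if h != -1]


Policy = Callable[[List[int]], int]


def strict_priority(ready: List[int]) -> int:
    """Serve the lowest-numbered non-empty queue first."""
    return min(ready)


class RoundRobin:
    """Serve non-empty queues in cyclic order."""

    def __init__(self) -> None:
        self.last = -1

    def __call__(self, ready: List[int]) -> int:
        later = [q for q in ready if q > self.last]
        self.last = later[0] if later else ready[0]
        return self.last


class Scheduler:
    """Policy circuit: decides which queue to serve, never touches storage."""

    def __init__(self, buffer: SharedPacketBuffer, policy: Policy) -> None:
        self.buffer = buffer
        self.policy = policy

    def send_next(self) -> Optional[bytes]:
        ready = self.buffer.backlog()
        return self.buffer.dequeue(self.policy(ready)) if ready else None
```

Changing `scheduler.policy` or the number of queues alters behaviour without touching the stored packets, which mirrors the claim that policy changes need no hardware reconfiguration.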

50
Configurable Packet Scheduling
Patent Pending

Cadence Encounter UMC130nm
Clock frequency 143 MHz
Number of IOs 478 Pins
Total area 14.4 mm^2
Number of Sessions 1,000,000
Number of Packets External DDR
Up to 30 million packets can be supported
Throughput 35.8M packets/sec
Throughput 40 Gbps line rate
(assuming a mean IP packet size of 130 bytes)
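The relationship between the packet rate and the line rate quoted here can be checked with a short calculation. This is a sketch that counts packet bytes only. Any framing overhead is not given in the slides and is left out.

```python
def data_rate_gbps(packets_per_sec: float, mean_packet_bytes: float) -> float:
    """Packet data rate in Gbps for a given packet rate and mean size."""
    return packets_per_sec * mean_packet_bytes * 8 / 1e9


def packets_per_sec_needed(line_rate_gbps: float, mean_packet_bytes: float) -> float:
    """Packet rate required to fill a line rate with packets of a mean size."""
    return line_rate_gbps * 1e9 / (mean_packet_bytes * 8)


if __name__ == "__main__":
    print(data_rate_gbps(35.8e6, 130))       # about 37.2 Gbps of packet data
    print(packets_per_sec_needed(40, 130))   # about 38.5 million packets/sec
```

With a mean packet size of 130 bytes, 35.8M packets/sec corresponds to roughly 37.2 Gbps of packet data, which is in the region of the quoted 40 Gbps line rate.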

Address Translation Table
Search Trie Memory
51
Conclusions

Queue and scheduling policy adaptation via
address pointer, lookup tables and packet
time-stamp processing
Low programming complexity
Service requirements can be translated
immediately
No traffic interruption is required
instant change of queues and scheduling policies

52
Conclusions

Performance comparable with customized
implementations
Complex, but affordable data processing hardware
Programmability at the system level NOT
reconfigurability at gate level
Programming scheduler does not require place and
route at gate level

Programmable
Cryptographic Architectures

54
Programmable Cryptographic Architectures

Encryption needs to be performed on data in
real-time
100 Mbps networks, 1G Ethernet, 10G Ethernet
This holds the key to successful growth of
applications such as WLANs, satellite
communications, e-businesses
Software architectures are too slow
Hardware solutions required

55
Programmable Cryptographic Architectures

Reconfigurable Cryptographic Architectures can be
used to provide the security requirements of many
applications
FPGAs are well suited for crypto algorithms
allow algorithm agility
support alterable architecture parameters,
scalable security (DES/ 3DES)
Clever mapping of complex math operations onto
special purpose silicon architectures

56
Private Key Algorithms AES

NIST requested a new Advanced Encryption Standard
(AES) to replace DES - Sept 1997
Interim measure TripleDES
RIJNDAEL AES Winner - Oct 2000
Developed by Joan Daemen, Vincent Rijmen
Replaced DES as Federal Standard in November 2001
128-bit Data, 128, 192 or 256-bit Key

57
Reconfigurable AES Architecture

In conjunction with AES, NIST recommended 5 modes
of operation
Electronic Codebook (ECB) mode
Cipher Block Chaining (CBC) mode
Output Feedback (OFB) mode
Cipher Feedback (CFB) mode
Counter (CTR) mode, a simplification of OFB mode
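How the modes wrap a single block-cipher core can be sketched independently of AES itself. In the example below the cipher core is a deliberately trivial, invertible placeholder (`toy_cipher`, an assumption for illustration and NOT secure, NOT AES). Only ECB, CBC and CTR are shown, and the 16-byte block size matches the 128-bit AES data block.

```python
from typing import Callable, List, Tuple

BLOCK = 16  # bytes per block (128-bit data block)
BlockFn = Callable[[bytes], bytes]


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR, truncated to the shorter input."""
    return bytes(x ^ y for x, y in zip(a, b))


def blocks(data: bytes) -> List[bytes]:
    if len(data) % BLOCK:
        raise ValueError("pad to a multiple of 16 bytes first")
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def toy_cipher(key: bytes) -> Tuple[BlockFn, BlockFn]:
    """Placeholder for the AES core: XOR with key, rotate. NOT secure."""
    def enc(b: bytes) -> bytes:
        t = xor(b, key)
        return t[1:] + t[:1]

    def dec(b: bytes) -> bytes:
        return xor(b[-1:] + b[:-1], key)

    return enc, dec


def ecb_encrypt(enc: BlockFn, pt: bytes) -> bytes:
    """ECB: every block is encrypted independently."""
    return b"".join(enc(b) for b in blocks(pt))


def cbc_encrypt(enc: BlockFn, iv: bytes, pt: bytes) -> bytes:
    """CBC: each plaintext block is XORed with the previous ciphertext."""
    out, prev = [], iv
    for b in blocks(pt):
        prev = enc(xor(b, prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(dec: BlockFn, iv: bytes, ct: bytes) -> bytes:
    out, prev = [], iv
    for b in blocks(ct):
        out.append(xor(dec(b), prev))
        prev = b
    return b"".join(out)


def ctr_crypt(enc: BlockFn, nonce: int, data: bytes) -> bytes:
    """CTR: encrypt a counter and XOR it with the data (same for decrypt)."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        counter = ((nonce + i // BLOCK) % (1 << 128)).to_bytes(BLOCK, "big")
        out += xor(data[i:i + BLOCK], enc(counter))
    return bytes(out)
```

CBC feeds each ciphertext block back into the next encryption, so a pipelined core cannot be kept full for a single stream, whereas ECB and CTR blocks are independent of one another.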

58
Private Key Algorithms AES
59
Reconfigurable AES Architecture

Reconfigurable AES architecture with following
features
Iterative architecture
On-chip key scheduling
Support for 3 key lengths
Encryption and decryption
Support for feedback modes of operation

60
Performance Evaluation
Reconfigurable AES vs Specific-purpose Enc/Dec

Design                              Device   Area                    Throughput (Mbps)
AES Encryptor, 128-bit key          XCV400E  1987 Slices, 18 BRAMs   423
AES Decryptor, 128-bit key          XCV600E  2121 Slices, 18 BRAMs   557
AES Enc/Dec, 128-bit key, 5 modes   XCV600E  4681 Slices, 20 BRAMs   310
61
Performance Evaluation

Only 2 additional BRAMs required in the reconfigurable
design, as memory re-use is possible
Reconfigurable Design
Throughput reduced by up to 40%
Area increased by 10%
However, the modes of operation are supported
=> Area/speed penalty is an acceptable trade-off in
favour of using reconfiguration over multiple
specific-purpose circuits

Conclusions

63
Conclusions

Frame Delineation
Common architecture could not be found
Separate FPGA circuit for each or multiplex
between separate circuits on an ASIC
Derivation of a programmable datapath based on
common low level functional elements is a
potential low hardware cost option
CRC circuits: well-defined (i.e. 8) options
Fully reconfigurable ASIC possible but larger and
slower than 8 separate versions
If G(x) and the number of options are not known then
use the fully programmable solution

64
Conclusions

Packet Scheduler
Systems level approach deploying address pointer,
lookup tables and packet time-stamp processing
the most appropriate approach
Enables programmability while supporting line
rates beyond 100 Gbps
Best approach, tackle at the Systems and
Architecture level rather than FPGA level
Current FPGA technology and design tools for
run-time reconfiguration too immature for packet
scheduling

65
Conclusions

Encryption/Decryption
Re-configurable architecture identified
Supports a number of modes of operation
Reconfigurable Design
Throughput reduced by up to 40%
Area increased by 10%
Area/speed penalty is an acceptable trade-off compared
with using multiple specific-purpose circuits

66
Reconfigurable Architectures for High Bandwidth
Network Processing Systems

Professor John McCanny CBE FRS FREng
Dr Sakir Sezer (s.sezer_at_ecit.qub.ac.uk) ,
Dr Maire McLoone (m.mcloone_at_ecit.qub.ac.uk)

67
Institute of Electronics Communications and
Information Technology
"You see things and say 'Why?' but I dream things
that never were and say 'Why not?'"
George Bernard Shaw
