Title: Hardware Implementation of Fast Forwarding Engine using Standard Memory and Dedicated Circuit
1 Hardware Implementation of Fast Forwarding Engine using Standard Memory and Dedicated Circuit
- Kazuya ZAITSU, Shingo ATA, Ikuo OKA
- (Osaka City University, Japan)
- Koji YAMAMOTO
- (Renesas Design Corporation, Japan)
- Yasuto KURODA, Kazunari INOUE
- (Renesas Electronics Corporation, Japan)
2 Outline
- Background
- Objective
- Proposed hardware architecture
- Hardware architecture evaluation
- FPGA implementation
- Hardware evaluation
- Conclusion
3 What is TCAM?
- TCAM: Ternary Content Addressable Memory
- Feature
- Very high-speed searching
- Input: data for matching; output: memory address
- Third matching state, "don't care", in addition to 1s and 0s
- Application
- Looking up the routing table in IP routers
Routing table
Addr.  Prefix
1      192.168.*.*
2      192.168.100.*
3      192.168.101.*
Input: 192.168.101.1
Output: 3
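The ternary match in the routing table example can be sketched in software as follows. This is a minimal illustration, not the TCAM circuit: it assumes the don't-care octets are written as `*`, stores each entry as a value/mask pair, and assumes that when several entries match, the one with the most cared-for bits wins. The names `parse_pattern`, `ip_to_int`, and `tcam_lookup` are hypothetical.

```python
def parse_pattern(pattern):
    """Turn a pattern such as '192.168.*.*' into a (value, mask) pair."""
    value = 0
    mask = 0
    for octet in pattern.split("."):
        value <<= 8
        mask <<= 8
        if octet != "*":          # '*' is the don't-care state
            value |= int(octet)
            mask |= 0xFF
    return value, mask


def ip_to_int(ip):
    """Convert a dotted IPv4 address to a 32-bit integer."""
    result = 0
    for octet in ip.split("."):
        result = (result << 8) | int(octet)
    return result


def tcam_lookup(table, ip):
    """Return the address of the matching entry with the longest mask.

    A real TCAM compares all entries at once; this loop only models
    the result, not the timing.
    """
    key = ip_to_int(ip)
    best_addr = None
    best_bits = -1
    for addr, pattern in table:
        value, mask = parse_pattern(pattern)
        cared_bits = bin(mask).count("1")
        if (key & mask) == value and cared_bits > best_bits:
            best_addr, best_bits = addr, cared_bits
    return best_addr


# Entries taken from the routing table on the slide.
ROUTING_TABLE = [
    (1, "192.168.*.*"),
    (2, "192.168.100.*"),
    (3, "192.168.101.*"),
]

print(tcam_lookup(ROUTING_TABLE, "192.168.101.1"))  # 3
```

Entries 1 and 3 both match the input 192.168.101.1; entry 3 is returned because it leaves fewer bits as don't care.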
4 TCAM problems
- Manufacturing cost
- Cost per bit is 4 times that of SRAM.
- Power consumption
- All logic gates must be energized for every search.
- Capacity
- High price per bit and power-saving efforts
- Hard to pursue denser TCAM
              Search performance  Manufacturing cost  Power consumption  Capacity
Requirements  High                Low                 Low                High
TCAM          High                High                High               Low
5 Objective
- Propose a new hardware architecture
- Focus on the address lookup in the routing table of routers
- RAM-based design
- Named Custom Memory
- Hardware design of the Custom Memory
- Verify the effectiveness of the Custom Memory
- Effectiveness of our architecture
- Dramatically reduce cost and power consumption
- Implementation on an FPGA
               Speed  Cost  Power  Capacity
Custom Memory  High   Low   Low    High
TCAM           High   High  High   Low
6 Design concepts
               Speed  Cost  Power  Capacity  Interface
Custom Memory  High   Low   Low    High      Same as TCAM
- Divide the memory area into equal-sized tables
- Low power
- RAM-based design
- Low cost, low power, high capacity
- Lookup operation by single access
- High search performance
- Same physical user interface as TCAM
- Aim to replace the TCAM in the market
7 Architectural overview
[Figure: block diagram of the Custom Memory. It receives a command and an address (IP addr.) through the same physical user interface as a TCAM. Search devices 0 to N each use a RAM-based design divided into subtables (Table 0, Table 1, ...) and a comparator for the stored prefixes.]
8 Search device partitioning
- How to decide which search device stores a prefix?
Partitioning based on prefix length
Example prefixes: 6.0.0.0/8, 24.128.0.0/9, 62.30.0.0/16, 112.63.240/20, 184.128.191.0/24, 232.95.225.1/32
Search device 0 (prefix length 8)
Search device 1 (prefix length 9)
...
Search device N (prefix length 32)
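Partitioning by prefix length can be sketched as a simple mapping. The sketch below assumes one search device per prefix length from 8 to 32, so that device 0 holds /8 prefixes and device 24 holds /32 prefixes, as the example suggests. The function name `device_for_prefix` and the constants are hypothetical.

```python
MIN_PREFIX_LEN = 8    # search device 0 stores prefixes of length 8
MAX_PREFIX_LEN = 32   # the last search device stores prefixes of length 32


def device_for_prefix(prefix):
    """Return the search device number for a prefix such as '6.0.0.0/8'."""
    length = int(prefix.split("/")[1])
    if not MIN_PREFIX_LEN <= length <= MAX_PREFIX_LEN:
        raise ValueError(f"unsupported prefix length: {length}")
    return length - MIN_PREFIX_LEN


# Example prefixes from the slide.
for prefix in ["6.0.0.0/8", "24.128.0.0/9", "62.30.0.0/16",
               "184.128.191.0/24", "232.95.225.1/32"]:
    print(prefix, "-> search device", device_for_prefix(prefix))
```

Because every prefix in a device has the same length, each device needs no per-entry mask, which is what lets an ordinary RAM replace the ternary cells.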
9 Table partitioning
- How to decide which table stores a prefix?
- Some bits of the prefix are extracted as index bits.
- The remainder bits are stored.
- How to determine the index bits?
Extract the last bits of the prefix
Example (8 index bits)
Search device (prefix length 16)
154.1.0.0/16 = 10011010.00000001
Index bits: 00000001 (selects Table 1)
Remainder bits: 10011010 (stored in Table 1)
[Figure: RAM of the search device, divided into tables. Each table holds stored remainder bits such as 01011000 and 01101011, or empty entries.]
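The split into index bits and remainder bits can be sketched as below, using the 154.1.0.0/16 example with 8 index bits taken from the end of the prefix. The function name `split_prefix` and the constant `INDEX_BITS` are hypothetical; the sketch assumes the prefix is at least as long as the index.

```python
INDEX_BITS = 8  # number of index bits used in the slide's example


def split_prefix(prefix, index_bits=INDEX_BITS):
    """Split a prefix into (table index, remainder bits).

    The last `index_bits` bits of the prefix select the table;
    the bits before them are the remainder stored in that table.
    """
    addr, length = prefix.split("/")
    length = int(length)
    if length < index_bits:
        raise ValueError("prefix is shorter than the index")
    value = 0
    for octet in addr.split("."):
        value = (value << 8) | int(octet)
    prefix_bits = value >> (32 - length)          # keep only the prefix
    index = prefix_bits & ((1 << index_bits) - 1)  # last bits of the prefix
    remainder = prefix_bits >> index_bits          # everything before them
    return index, remainder


index, remainder = split_prefix("154.1.0.0/16")
print(index, format(remainder, "08b"))  # 1 10011010
```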
10 Search operation
[Figure: data flow of a search in the Custom Memory. The input-output controller receives the search command and the destination IP address. Each search device (prefix lengths 8, 9, ..., 32) passes the destination IP address through an index calculator, reads the selected table (Table 0, Table 1, ...) from its RAM, and checks it with a comparator. The LPM (longest prefix match) comparator turns the per-device hit addresses into the final hit address.]
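The search operation can be modeled in software roughly as follows. This is a behavioral sketch under stated assumptions, not the hardware design: it assumes one search device per prefix length from 8 to 32, 8 index bits taken from the end of the prefix, and a sequential loop where the hardware would probe all devices at once. The class and method names (`SearchDevice`, `CustomMemory`, `insert`, `search`) are hypothetical.

```python
INDEX_BITS = 8
MIN_LEN, MAX_LEN = 8, 32


def ip_to_int(ip):
    """Convert a dotted IPv4 address to a 32-bit integer."""
    result = 0
    for octet in ip.split("."):
        result = (result << 8) | int(octet)
    return result


class SearchDevice:
    """Holds all prefixes of one length, spread over 2**INDEX_BITS tables."""

    def __init__(self, prefix_len):
        self.prefix_len = prefix_len
        # Each table is a list of (remainder bits, hit address) entries.
        self.tables = [[] for _ in range(1 << INDEX_BITS)]

    def _split(self, addr):
        """Index calculator: last bits pick the table, the rest is compared."""
        bits = addr >> (32 - self.prefix_len)
        return bits & ((1 << INDEX_BITS) - 1), bits >> INDEX_BITS

    def insert(self, addr, hit_address):
        index, remainder = self._split(addr)
        self.tables[index].append((remainder, hit_address))

    def search(self, addr):
        """Read one table and compare every stored remainder."""
        index, remainder = self._split(addr)
        for stored, hit_address in self.tables[index]:
            if stored == remainder:
                return hit_address
        return None


class CustomMemory:
    def __init__(self):
        self.devices = {n: SearchDevice(n) for n in range(MIN_LEN, MAX_LEN + 1)}

    def insert(self, prefix, hit_address):
        ip, length = prefix.split("/")
        self.devices[int(length)].insert(ip_to_int(ip), hit_address)

    def search(self, ip):
        """LPM comparator: the hit from the longest prefix length wins."""
        addr = ip_to_int(ip)
        for length in range(MAX_LEN, MIN_LEN - 1, -1):
            hit = self.devices[length].search(addr)
            if hit is not None:
                return hit
        return None


memory = CustomMemory()
memory.insert("154.0.0.0/8", 20)
memory.insert("154.1.0.0/16", 10)
memory.insert("154.1.2.0/24", 30)
print(memory.search("154.1.2.3"))  # 30
```

Each device reads exactly one table per search, which is the software counterpart of the single-access lookup named in the design concepts.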
11 Evaluation of partitioning
- Which bits are better to use as index bits?
- The distribution of prefixes across tables affects the cost.
- Evaluation metric
- Maximum number of prefixes in a table
[Figure: a RAM table with its word lines and comparators (Comp.), for the case of extracting the last bits from the prefix. Labels: prefixes in table, word lines, comparators.]
12 Effectiveness of indexing
- Top k bits: using the top bits for index bits
- Proposal: using the last bits for index bits
- Bottom: ideal value (unrealizable)
[Plot: maximum number of prefixes in a table versus prefix length.]
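The evaluation metric, the maximum number of prefixes that land in one table, can be computed as sketched below for both index choices. The prefix list is a small hypothetical set chosen for illustration, not data from the evaluation, and the function names are assumptions.

```python
from collections import Counter


def prefix_bits(prefix):
    """Return (prefix bits as an integer, prefix length)."""
    addr, length = prefix.split("/")
    length = int(length)
    value = 0
    for octet in addr.split("."):
        value = (value << 8) | int(octet)
    return value >> (32 - length), length


def max_table_load(prefixes, index_bits, use_last_bits):
    """Maximum number of prefixes that share one table."""
    counts = Counter()
    for prefix in prefixes:
        bits, length = prefix_bits(prefix)
        if use_last_bits:
            index = bits & ((1 << index_bits) - 1)   # last bits of the prefix
        else:
            index = bits >> (length - index_bits)    # top bits of the prefix
        counts[index] += 1
    return max(counts.values())


# Hypothetical /16 prefixes; several share the same first octet.
SAMPLE = ["154.1.0.0/16", "154.2.0.0/16", "154.3.0.0/16",
          "154.4.0.0/16", "62.30.0.0/16", "62.31.0.0/16"]

print(max_table_load(SAMPLE, 8, use_last_bits=False))  # 4
print(max_table_load(SAMPLE, 8, use_last_bits=True))   # 1
```

In this toy set the top bits repeat across prefixes while the last bits differ, so indexing by the last bits spreads the entries more evenly and the largest table is smaller.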
13 FPGA implementation
- Altera Stratix IV GX FPGA Development Kit
- Verilog-HDL
- Parameters
- 4 search devices
- 256 tables/device
- 128 prefixes/table
[Figure: search devices 0 to 3. Each has a RAM holding Table 0 to Table 255, with 128 prefixes per table, and a comparator.]
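The listed parameters imply the index width and an upper bound on capacity, which the following small sketch works out. It assumes the table count is a power of two and that every table can be filled completely, which is an upper bound rather than a measured figure; the function names are hypothetical.

```python
SEARCH_DEVICES = 4
TABLES_PER_DEVICE = 256
PREFIXES_PER_TABLE = 128


def index_bits_for(tables):
    """Number of index bits needed to address `tables` tables (power of two)."""
    return (tables - 1).bit_length()


def total_capacity(devices, tables, prefixes_per_table):
    """Upper bound on stored prefixes if every table is completely full."""
    return devices * tables * prefixes_per_table


print(index_bits_for(TABLES_PER_DEVICE))  # 8
print(total_capacity(SEARCH_DEVICES, TABLES_PER_DEVICE, PREFIXES_PER_TABLE))  # 131072
```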
14 Hardware evaluation
               Search performance     Power consumption (mA)  Chip area (ratio)
Custom Memory  every clock (125 MHz)  6.53 (52%)              62%
TCAM           every clock (360 MHz)  12.43                   100%
(4k bits array, Vdd = 1.0 V, room temp., 125 Msps)
[Figure: Custom Memory layout showing the RAM blocks, their word lines and comparators (Comp.), and the operation area.]
15 FPGA experiment
- Examine the hardware operation
- Use raw data (BGP routing table)
16 Conclusion
- Designed a RAM-based fast forwarding engine
- Hardware architecture
- FPGA implementation
- Reduced the cost and power
- 62% cost (compared with TCAM)
- 52% power consumption (compared with TCAM)
- Future work
- Implementation parameter optimization
- Handling of table overflow