Title: Block Design Review: Lookup for IPv4 MR, LC Ingress and LC Egress
1Block Design Review: Lookup for IPv4 MR, LC Ingress and LC Egress
John DeHart jdd_at_arl.wustl.edu http://www.arl.wustl.edu/projects/techX
2Revision History
- 10/11/06 (JDD)
- Created
- 10/23/06 (JDD)
- Finished for presentation on 10/24/06
- 10/24/06 (JDD)
- Updates from comments during review.
- Added more TCAM info
- Added information on format of Database entry
files
3Guidelines for Design Reviews
- Definition of interfaces: In/Out
- Block diagram of module
- Including list of files where code for each block/module exists.
- Macros
- List macros and files where they can be found
- For each macro, provide a few lines of comments in the code that describe the macro.
- Document local and global registers used by macro.
- Memory assumptions
- What addresses are pre-defined, etc.
- Initialization of Memory
- Data Structures
- Control Blocks
- Details of memory accesses, xfer register usage, signal usage.
- Critical path
- Testing
- Develop a well defined acceptance test that convinces you that your block works
- Document acceptance test
- Pktgen project file?
- Known bugs
4Contents
[Figure: line card block diagram on either side of the SWITCH. Ingress blocks: Phy Int Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Switch Tx. Egress blocks: Switch Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Phy Int Tx.]
5File locations
- Code
- src/applications/LC_Ingress/src/lookup/PL/lookup.uc
- src/applications/LC_Egress/src/lookup/PL/lookup.uc
- src/applications/IPv4_MR/src/lookup/PL/lookup.uc
- Configuration and Database Entry Files
- src/applications/LC_Ingress/build/PL/LCI_config.txt
- LC_Ingress_Database_64bKey_64bResult_BothQM.txt
- src/applications/LC_Egress/build/PL/LCE_config.txt
- LC_Egress_Database_24bKey_64bResult.txt
- src/applications/IPv4_MR/build/PL/IPv4_config.txt
- GM_Database_144Key_128bResult.txt
- IDT Includes
- src/IDT_NSE/data_plane_IXP2XXX/include/Iipc.uc
- Which then includes Iipc.h from same directory
- IDT Simulation Library
- Typical Installed location
- C:/IDT_NSE/simulation/windows/IDT75K234.dll
- Repository location
- src/IDT_NSE/simulation/windows/IDT75K234.dll
6TCAM Documentation
- Docs are sprinkled through the different installation directories
- We have gathered most of the important stuff here
- /project/techX/DataSheets/IDT
- The following documents are located in the above directory
- Datasheet (Under non-disclosure)
- 75K72234_datasheet.pdf
- User Manual
- 75K72234_UserManual.pdf
- Instruction Latency Application Note
- 75K72234_latency.pdf
- SLAM Simulation
- IDT75K234SLAM_UsersManual.pdf
- Dataplane Macros
- NSEDataPlaneMacroAPIGuide.pdf
- IMS API
- IMS_API.pdf
7WU Macros
- LC Ingress
- dl_nn_ring_init
- dl_source_1ME_NN_4words
- dl_sink_1ME_NN_4words
- IPv4_MR
- dl_nn_ring_init
- dl_source_1ME_NN_9words
- dl_sink_1ME_NN_4words
- LC Egress
- dl_nn_ring_init
- dl_source_1ME_NN_4words
- dl_sink_1ME_NN_5words
- Diagnostics
- GetTimeStamp
- CompareTimeStamps
8IDT Macros
- IipcStartTimestamp
- Does CAP read and write to set bit in MISC_CONTROL to start the timestamp counter.
- IipcFormContextFromCsrMeCtx
- Sets up the Context field for the TCAM command word based on the ME and context
- 128 Contexts per LA-1 Interface
- IipcMakeBase
- Form the base address word for any instruction for this context
- Address is 22 bit WORD address, covers 16 MByte address space
- IipcMakeDirectInstruction
- Form the command word for any of the 4 Direct instructions
- Result of IipcMakeBase and IipcMakeDirectInstruction will be passed as the two address parameters to sram[write]
- sram[write, w00, iipc_base_word, iipc_command_word, count]
- IipcDelayUsingFutureCount(cycles)
- Sets the Future Count register to this many cycles
- Sets the Future Count Signal register
- Ctx_arb on that signal
- IipcSramRead
- Performs an SRAM read until Done bit is set in result.
- We don't use this any more.
9Lookup Initialization and Control
- XScale utility to initialize NSE and Databases
- Control Plane and XScale mechanisms to read and
write TCAM entries while system is active.
10Lookup Miscellany
- Bugs: No known bugs
- Testing
- Minimal testing done so far
- Some simple functional tests to show distribution of packets across all output ports based on Key fields for each of the three projects.
- More complete test plan needed.
- Still To Do
- Add information on how to configure Filters for Lookup engine.
- Handle init_done signal from Rx
- Turn on optimizer
- Substrate only lookup for IPv4_MR GPE?NPE pkts
- Add second database to IPv4 MR
- DB1: GM/EM Database
- DB2: Route Lookup
- LD bit in Lookup Result
- Clean up definition of DB Ids.
- Consider making Lookup code one common file with ifdefs to differentiate
- Consider removing ifdef DONE_BIT_FIX code
- Refers to a Done bit bug in the Dual Port QDR (which is what we have)
- I have not seen this bug mentioned anywhere else.
11TCAM Entries in Simulation
- Four Parts to a TCAM Entry in simulation
- dbindex
- Slot in database occupied by entry.
- Start at 0
- Incremented by 1 for each entry
- Not dependent on size
- core
- What is matched against a provided key
- mask
- Indicates what part of the entry (core) has to match the supplied key to give a hit
- data
- Results data
- Configuration and Database Entry files
- src/applications/LC_Ingress/build/PL/LCI_config.txt
- LC_Ingress_Database_64bKey_64bResult_BothQM.txt
- src/applications/LC_Egress/build/PL/LCE_config.txt
- LC_Egress_Database_24bKey_64bResult.txt
- src/applications/IPv4_MR/build/PL/IPv4_config.txt
- GM_Database_144Key_128bResult.txt
12TCAM Entries in Simulation
- LC Ingress Database entry from file
- src/applications/LC_Ingress/build/PL/LC_Ingress_Database_64bKey_64bResult_BothQM.txt
- dbindex: 0x0
- core: 0x51C0A80002110001
- SL Type: 0x5
- Port: 1
- IP DA: 192.168.0.2
- IP Proto: 17 (UDP)
- UDP DPort: 0x0001
- mask: 0xf0ffffffffffffff (Exact Match everything, except wildcard Port)
- data: 0x0001004A01100001
- VLAN (16b): 0x0001
- Stats_Index (16b): 74 (0x4A)
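
The core/mask pair above can be read as a ternary match: a key hits when it agrees with the core on every bit that is set in the mask. The sketch below illustrates that reading with the values from this entry. It is a model of the matching rule only, not the IDT simulation library; the function names `tcam_match` and `make_key` are hypothetical, and the key layout (SL 4b, Port 4b, IP DA 32b, Proto 8b, DPort 16b) is taken from the field list above.

```python
# Values copied from the LC Ingress entry above.
CORE = 0x51C0A80002110001
MASK = 0xF0FFFFFFFFFFFFFF  # zero bits are "don't care" (here: the 4-bit Port field)


def tcam_match(key: int, core: int, mask: int) -> bool:
    """Ternary match: only bits set in mask must agree between key and core."""
    return ((key ^ core) & mask) == 0


def make_key(sl: int, port: int, ip_da: int, proto: int, dport: int) -> int:
    """Pack the 64-bit key: SL(4b) | Port(4b) | IP DA(32b) | Proto(8b) | DPort(16b)."""
    return (sl << 60) | (port << 56) | (ip_da << 24) | (proto << 16) | dport


if __name__ == "__main__":
    ip_da = (192 << 24) | (168 << 16) | (0 << 8) | 2  # 192.168.0.2
    print(tcam_match(make_key(0x5, 1, ip_da, 17, 1), CORE, MASK))  # same Port as core
    print(tcam_match(make_key(0x5, 7, ip_da, 17, 1), CORE, MASK))  # Port is wildcarded
    print(tcam_match(make_key(0x5, 1, ip_da, 17, 2), CORE, MASK))  # DPort differs
```

Packing the listed field values reproduces the core exactly, which is a quick way to check a hand-written database entry before loading it.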
13TCAM Entries in Simulation
- IPv4 MR Database entry from file
- src/applications/IPv4_MR/build/PL/GM_Database_144Key_128bResult.txt
- dbindex: 0x0
- core: 0x0AAA0002C0A84001C0A82002000100020011
- MR ID (VLAN): 0x0AAA
- UDP DPort: 0x0002
- IP DA: 192.168.64.1
- IP SA: 192.168.32.2
- TCP/UDP SPort: 0x0001
- TCP/UDP DPort: 0x0002
- TCP_FLAGS_Proto: 0x0011 (Proto UDP, no TCP Flags)
- mask: 0xffffffffffffffffffffffffffffffffffff (Exact match everything)
- data: 0x0000003780FC99F95555666601000001
- Reserved (3b), Drop Bit (1b)
- Reserved (12b)
- Cntr_Index (16b): 55 (0x37)
- Tx IP DAddr: 128.252.153.249
- Tx UDP Dport: 0x5555
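
As a cross-check on the 144-bit core above, the following sketch concatenates the listed key fields, most significant first, and counts the 32-bit words needed to carry the key. The field names and helper functions are hypothetical labels for illustration. How the 144 bits are aligned inside the five transfer words is not stated on the slide, so the sketch does not attempt that step.

```python
import ipaddress

# (field name, width in bits), most significant first, in the order listed above.
KEY_FIELDS = [
    ("mr_id", 16),
    ("rx_udp_dport", 16),
    ("ip_da", 32),
    ("ip_sa", 32),
    ("sport", 16),
    ("dport", 16),
    ("tcp_flags_proto", 16),
]


def pack_key(values):
    """Concatenate the fields into one integer, first field in the top bits."""
    key = 0
    for name, width in KEY_FIELDS:
        value = values[name]
        if value < 0 or value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        key = (key << width) | value
    return key


def words_needed(bits, word_bits=32):
    """Number of 32-bit words needed to carry a key of this many bits."""
    return -(-bits // word_bits)  # ceiling division


EXAMPLE = {
    "mr_id": 0x0AAA,
    "rx_udp_dport": 0x0002,
    "ip_da": int(ipaddress.IPv4Address("192.168.64.1")),
    "ip_sa": int(ipaddress.IPv4Address("192.168.32.2")),
    "sport": 0x0001,
    "dport": 0x0002,
    "tcp_flags_proto": 0x0011,
}

if __name__ == "__main__":
    key_bits = sum(width for _, width in KEY_FIELDS)
    print(f"{pack_key(EXAMPLE):#0{key_bits // 4 + 2}x}")
    print(key_bits, "bits ->", words_needed(key_bits), "words")
```

The packed value equals the core shown in the entry, and the word count agrees with the 5-word key used for the IPv4 MR lookup.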
14TCAM Entries in Simulation
- LC Egress Database entry from file
- src/applications/LC_Egress/build/PL/LC_Egress_Database_24bKey_64bResult.txt
- dbindex: 0x0
- core: 0x11000100
- IP Proto (8b): 0x11 (UDP)
- UDP SPort (16b): 1
- Rsvd (8b): 0
- mask: 0xffffffff (Exact Match)
- data: 0x000101000021
- Rsvd (4b): 0
- VLAN (12b): 0x001
- Rsvd (4b): 0
- Port (4b): 1
- Rsvd (4b)
- QID (20b): 33 (0x00021)
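
Going the other direction, a core or data value can be split back into its fields by walking the field widths from the most significant end. The sketch below does this for the LC Egress values as printed above, assuming the fields are packed contiguously in the listed order. Field names such as `rsvd0` and the `unpack` helper are hypothetical.

```python
# (name, width in bits), most significant first, as listed on the slide.
KEY_FIELDS = [("ip_proto", 8), ("udp_sport", 16), ("rsvd", 8)]
RESULT_FIELDS = [
    ("rsvd0", 4),
    ("vlan", 12),
    ("rsvd1", 4),
    ("port", 4),
    ("rsvd2", 4),
    ("qid", 20),
]


def unpack(value, fields):
    """Split an integer into named bit fields, first field in the top bits."""
    shift = sum(width for _, width in fields)
    out = {}
    for name, width in fields:
        shift -= width
        out[name] = (value >> shift) & ((1 << width) - 1)
    return out


if __name__ == "__main__":
    print(unpack(0x11000100, KEY_FIELDS))
    print(unpack(0x000101000021, RESULT_FIELDS))
```

Decoding the data value this way yields VLAN 1, Port 1 and QID 33, matching the annotations in the entry.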
15Basics of TCAM Operation
- Instruction is given to TCAM as an sram write
- Address bus gives instruction
- 4 Direct Instructions
- Lookup: This is all we use right now.
- MultiHit Lookup (MHL) or Simultaneous Multi-Database Lookup
- Which one is determined by a bit in a config register
- Preload
- Indirect: Uses data field to specify subinstruction
- Data bus gives
- Subinstruction for Indirect instructions (There are 16 subinstructions)
- Data for all instructions
- Our lookup keys go here.
- Example: IPv4 MR Lookup (Key of 144 bits in 5 words)
- Load xfer registers w00, w01, w02, w03, w04 with the lookup key
- sram[write, w00, iipc_base_word, iipc_command_word, 5]
- More about iipc_base_word and iipc_command_word later
- 5: number of data words needed for key
- Result is read back from the Context's Results Mailbox
- This is an SRAM read, not a TCAM Read instruction.
16LC Ingress Lookup
[Figure: LC Ingress block diagram: Phy Int Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Switch Tx, SWITCH.]
- Main functions
- Perform TCAM Lookup
- Pass Through Data
- Buf Handle
- IP Pkt Length and Ethernet Header Length
- Single code path with possible loop around Result Read
- NN communication
- Uses 8 threads
17LC Ingress Lookup Block Interfaces
[Figure: LC Ingress Lookup block interfaces. Blocks: Phy Int Rx, Key Extract, Lookup, Hdr Format, Switch Tx, SWITCH.]
- Input (from Key Extract)
- Buf Handle (32b)
- IP Pkt Length (16b), Eth Hdr Len (8b), Reserved (8b)
- Lookup Key [63:32] (32b)
- Lookup Key [31:0] (32b)
- Output (to Hdr Format)
- Buf Handle (32b)
- IP Pkt Length (16b), Eth Hdr Len (8b), Reserved (8b)
- Lookup Result: VLAN (16b), Stats Index (16b), Rsvd (4b), QID (20b), DAddr (8b), Port (4b)
- Lookup Key fields: SL (4b), Port (4b), D_Addr[31:8] (24b), D_Addr[7:0] (8b), Protocol (8b), UDP DPort (16b)
18LC Ingress Lookup Block Diagram
[Figure: LC Ingress Lookup flowchart. Box labels: dl_source() (NN Dequeue, 4W); Load Xfer Regs; Wait for prev ctx; Send Lookup Request (SRAM Write, 2W); Signal next ctx; TimeStamp Delay; ctx_swap; Read Result (SRAM Read, 2W); Signal next ctx; ctx_swap; Check Done Bit; Reformat Output; Wait for prev ctx; dl_sink() (NN Enqueue, 4W). Legend: mem access, init signal.]
19IPv4 MR Lookup
- Main functions
- Perform TCAM Lookup
- Pass Through Data
- Buf Handle
- IP Pkt Length and Offset
- Slice Data Ptr
- Exception Bits
- Single code path with possible loop around Result Read
- NN communication
- Uses 8 threads
20IPv4 MR Lookup Block Interfaces
[Figure: IPv4 MR Lookup block interfaces. Blocks: Rx, DeMux, Parse, Lookup, Header Format, Tx.]
- Input (from Parse)
- Buf Handle (32b)
- IP Pkt Length (16b), IP Pkt Offset (16b)
- Slice Data Ptr (32b)
- Reserved (28b), Code (4b)
- Lookup Key (144b): Slice ID/Rx UDP DPort (32b), IP DAddr (32b), IP SAddr (32b), SPort (16b), DPort (16b), Proto/TCP_Flags (16b)
- Output (to Header Format)
- Buf Handle (32b)
- IP Pkt Length (16b), IP Pkt Offset (16b)
- Rx UDP DPort (16b), Slice ID (VLAN) (16b)
- Cntr Index (16b), Rsvd (1b), D (1b), H (1b), Exception Bits (12b), LD (1b)
- Tx IP DAddr (32b)
- Tx UDP SPort (16b), Tx UDP DPort (16b)
- Port (4b), QID (20b), DA (8b)
- Slice Data Ptr (32b)
- Reserved (28b), Code (4b)
21IPv4 MR Functional Block Results
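[Figure: IPv4 MR lookup key and result formats.]
- Lookup Key (144b)
- Lookup Result (128b) as stored in TCAM
- Lookup Result (128b) as given to HF
- TCAM Status Bits: Done (1b), Hit (1b), MHit (1b)
- Result fields shown: Cntr Index (16b), D (1b), Reserved (11b), Done (1b), Hit (1b), MHit (1b), LD (1b), Tx IP DAddr (32b), Tx UDP SPort (16b), Tx UDP DPort (16b), Port (4b), QID (20b), DA (8b)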
22IPv4 MR Lookup Block Diagram
[Figure: IPv4 MR Lookup flowchart. Box labels: dl_source() (NN Dequeue, 9W); Load Xfer Regs; Wait for prev ctx; Send Lookup Request (SRAM Write, 5W); Signal next ctx; TimeStamp Delay; ctx_swap; Read Result (SRAM Read, 4W); Signal next ctx; ctx_swap; Check Done Bit; Reformat Output; Wait for prev ctx; dl_sink() (NN Enqueue, 9W). Legend: mem access, init signal.]
23LC Egress Lookup
[Figure: LC Egress block diagram: SWITCH, Switch Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Phy Int Tx.]
- Main functions
- Perform TCAM Lookup
- Pass Through Data
- Buf Handle
- IP Pkt Length and Ethernet Header Length
- IP Destination Address
- Single code path with possible loop around Result Read
- NN communication
- Uses 8 threads
24LC Egress Lookup Block Interfaces
[Figure: LC Egress Lookup block interfaces. Blocks: SWITCH, Switch Rx, Key Extract, Lookup, Hdr Format, Phy Int Tx.]
- Input (from Key Extract)
- Buf Handle (32b)
- IP DAddr (32b)
- Lookup Key: UDP SPort (16b), IP Proto (8b), Reserved (8b)
- Output (to Hdr Format)
- Buf Handle (32b)
- IP DAddr (32b)
- Lookup Result [63:32] (32b)
- Lookup Result [31:0] (32b)
25LC Egress Lookup Block Diagram
[Figure: LC Egress Lookup flowchart. Box labels: dl_source() (NN Dequeue, 4W); Load Xfer Regs; Wait for prev ctx; Send Lookup Request (SRAM Write, 1W); Signal next ctx; TimeStamp Delay; ctx_swap; Read Result (SRAM Read, 2W); Signal next ctx; ctx_swap; Check Done Bit; Reformat Output; Wait for prev ctx; dl_sink() (NN Enqueue, 5W). Legend: mem access, init signal.]
26Performance
27Packet Sizes
28Cycle Budget (min eth packets)
- To hit 5 Gb rate
- 76B per min IPv4 packet (64B min Eth + 12B IFS)
- 1.4 GHz clock rate
- 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mp/sec
- 1.4 Gcycle/sec * 1 sec/8.22 Mp = 170.3 cycles per packet
- compute budget: 170 cycles
- latency budget (threads * 170)
- 8 threads: 1360 cycles
- To hit 10 Gb rate
- 76B per min IPv4 packet (64B min Eth + 12B IFS)
- 1.4 GHz clock rate
- 10 Gb/sec * 1B/8b * packet/76B = 16.44 Mp/sec
- 1.4 Gcycle/sec * 1 sec/16.44 Mp = 85.16 cycles per packet
- compute budget: 85 cycles
- latency budget (threads * 85)
- 8 threads: 680 cycles
29Cycle Budget (IPv4 MN packets)
- To hit 5 Gb rate
- 90B per min IPv4 packet (78B min IPv4MN + 12B IFS)
- 1.4 GHz clock rate
- 5 Gb/sec * 1B/8b * packet/90B = 6.94 Mp/sec
- 1.4 Gcycle/sec * 1 sec/6.94 Mp = 201.7 cycles per packet
- compute budget: 201 cycles
- latency budget (threads * 201)
- 8 threads: 1608 cycles
- To hit 10 Gb rate
- 90B per min IPv4 packet (78B min IPv4MN + 12B IFS)
- 1.4 GHz clock rate
- 10 Gb/sec * 1B/8b * packet/90B = 13.88 Mp/sec
- 1.4 Gcycle/sec * 1 sec/13.88 Mp = 100.86 cycles per packet
- compute budget: 100 cycles
- latency budget (threads * 100)
- 8 threads: 800 cycles
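
The budget arithmetic on the two Cycle Budget slides can be reproduced with a few lines. This is a sketch of the calculation only; the function name `cycle_budget` and its parameter names are hypothetical, and the 1.4 GHz clock and 8 threads are the values stated on the slides. Because the slides round the packet rate before dividing, the unrounded cycles-per-packet figures come out slightly lower (for example 170.24 rather than 170.3), but the integer budgets are the same.

```python
def cycle_budget(rate_bps, wire_bytes, clock_hz=1_400_000_000, threads=8):
    """Return (packets/sec, cycles/packet, compute budget, latency budget).

    wire_bytes is the packet size including the 12B inter-frame spacing.
    The compute budget is the per-packet cycle count rounded down; the
    latency budget is that figure multiplied by the number of threads.
    """
    pkts_per_sec = rate_bps / (8 * wire_bytes)
    cycles_per_pkt = clock_hz / pkts_per_sec
    compute = int(cycles_per_pkt)
    return pkts_per_sec, cycles_per_pkt, compute, compute * threads


if __name__ == "__main__":
    cases = [
        ("min Eth, 5 Gb/s", 5e9, 64 + 12),
        ("min Eth, 10 Gb/s", 10e9, 64 + 12),
        ("IPv4 MN, 5 Gb/s", 5e9, 78 + 12),
        ("IPv4 MN, 10 Gb/s", 10e9, 78 + 12),
    ]
    for label, rate, size in cases:
        pps, cpp, compute, latency = cycle_budget(rate, size)
        print(f"{label}: {pps / 1e6:.2f} Mp/s, {cpp:.2f} cycles/pkt, "
              f"compute {compute}, latency {latency}")
```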
30TCAM Instruction Latency Analysis
- QDR Clock: 200 MHz, 5 ns period
- TCAM core Clock: 200 MHz, 5 ns period
- NPU Clock: 1400 MHz, 0.714 ns period
- 1 QDR cycle = 1 TCAM cycle = 7 NPU cycles
- TCAM Lookup Latencies
- QDR xfer: 1 cycle per word in key
- Instruction Fifo: constant 2 cycles
- Synchronizer: constant 3 cycles
- Execution Latency: fct(key width, output data width)
- Table in IDT Latency Application Note
- Re-Synchronizer: constant 1 cycle
31TCAM Instruction Latency Analysis
- IPv4 MR
- Key: 144 bit (5 words)
- Output data: 128 bit
- QDR Xfer: 5 cycles
- Constants: 2 + 3 + 1 = 6 cycles
- Execution Latency: 36 cycles
- Total Latency: 47 TCAM cycles (235 ns) (329 NPU cycles)
- LC Ingress
- Key: 64 bit (2 words)
- Output data: 64 bit
- QDR Xfer: 2 cycles
- Constants: 2 + 3 + 1 = 6 cycles
- Execution Latency: 32 cycles
- Total Latency: 40 TCAM cycles (200 ns) (280 NPU cycles)
- LC Egress
- Key: 24 bit (1 word)
- Output data: 64 bit
- QDR Xfer: 1 cycle
- Constants: 2 + 3 + 1 = 6 cycles
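
The totals above follow from adding the QDR transfer, the three constant stages and the execution latency, then converting at 5 ns and 7 NPU cycles per TCAM cycle. The sketch below reproduces that sum. The function and constant names are hypothetical, and the execution latency is an input that has to come from the table in the IDT Latency Application Note; it is not computed here.

```python
INSTRUCTION_FIFO = 2      # TCAM cycles, constant
SYNCHRONIZER = 3          # TCAM cycles, constant
RESYNCHRONIZER = 1        # TCAM cycles, constant
TCAM_CYCLE_NS = 5         # 200 MHz TCAM core clock
NPU_CYCLES_PER_TCAM = 7   # 1400 MHz NPU clock / 200 MHz TCAM clock


def lookup_latency(key_words, execution_cycles):
    """Return (TCAM cycles, nanoseconds, NPU cycles) for one lookup.

    key_words is the number of 32-bit key words (1 QDR cycle each);
    execution_cycles is read from the vendor latency table.
    """
    tcam_cycles = (key_words + INSTRUCTION_FIFO + SYNCHRONIZER
                   + RESYNCHRONIZER + execution_cycles)
    return (tcam_cycles,
            tcam_cycles * TCAM_CYCLE_NS,
            tcam_cycles * NPU_CYCLES_PER_TCAM)


if __name__ == "__main__":
    print("IPv4 MR:   ", lookup_latency(key_words=5, execution_cycles=36))
    print("LC Ingress:", lookup_latency(key_words=2, execution_cycles=32))
```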
32TCAM Performance (Rates in M/sec)
[Chart: TCAM rates in M/sec. Series: LC_Egress, LC_Ingress, IPv4 MR.]
33TCAM Performance (Rates in M/sec)
[Chart: TCAM rates in M/sec. Series: LC_Egress, LC_Ingress, IPv4 MR.]
34IPv4 Performance Snapshot
[Figure: IPv4 MR lookup thread timeline, 610 Cycles. Annotations: sram write; sram read; Timestamp Delay; Timestamp Delay setup; dl_source Xfer reg loads; dl_sink ctx_arb; dl_sink processing; Ctx_arb vs br_signal optimization.]
35IPv4 Performance Snapshot
[Figure: thread timeline. Write issued at 33333; next write issued at 34016.]
- 34016 - 33333 = 683 Cycles
- IPv4 MR lookup
- Hack to Parse: loop and repeatedly call dl_sink with same buf_handle
- Should guarantee that there is always something in NN ring for lookup to pick up
- Hack to HF: set dlNextBlock to IX_DROP
- Keep Tx from trying to transmit something bad.
36LC_Ingress Performance Snapshots
>563 Cycles
- LC Ingress lookup
- unloaded
37LC_Ingress Performance Snapshots
[Figure: thread timeline. Write issued at 59888; next write issued at 60494.]
- 60494 - 59888 = 606 Cycles
- LC Ingress lookup
- Hack to KE stub: loop and repeatedly call dl_sink with same buf_handle
- Should guarantee that there is always something in NN ring for lookup to pick up
- Hack to HF stub: set dl_next_block to IX_DROP
- Keep Tx from trying to transmit something bad.
38LC_Egress Performance Snapshots
560 Cycles
- LC Egress lookup
- Unloaded
39LC_Egress Performance Snapshots
610 Cycles
- LC Egress lookup
- Loaded with KE and HF hacks.
40Performance Summary
- Processing Cycles
- LC Ingress: 41
- IPv4 MR: 57
- LC Egress: 43
- Abort Cycles
- LC Ingress: 16
- IPv4 MR: 16
- LC Egress: 16
- Latency Cycles (snapshot total minus processing and abort cycles)
- LC Ingress: 560 - 57 = 503?
- IPv4 MR: 610 - 73 = 537?
- LC Egress: 560 - 59 = 501?
- Expected performance
- LC Ingress: 10Gb/s
- IPv4 MR: 5Gb/s
- LC Egress: 10Gb/s
41Optimization Possibilities
- May still be some code we can move out of processing loop or at least between sram write or read and the ctx swap.
- dl_sink has a possible improvement.
- ctx_arb vs. br_signal/br_!signal