Title: Efficient Memory Utilization on Network Processors for Deep Packet Inspection
1. Efficient Memory Utilization on Network Processors for Deep Packet Inspection
- Piti Piyachon
- Yan Luo
- Electrical and Computer Engineering Department
- University of Massachusetts Lowell
2. Our Contributions
- Study the parallelism of a pattern matching algorithm
- Propose the Bit-Byte Aho-Corasick Deterministic Finite Automaton (DFA)
- Construct a memory model to find the optimal settings that minimize the memory usage of the DFA
3. DPI and Pattern Matching
- Deep Packet Inspection
- Inspect packet header and payload
- Detect computer viruses, worms, spam, etc.
- Network intrusion detection applications: Bro, Snort, etc.
- Pattern Matching requirements
- Match multiple predefined patterns (keywords, or strings) at the same time
- Keywords can be any size.
- Keywords can be anywhere in the payload of a packet.
- Matching at line speed
- Flexibility to accommodate new rule sets
4. Classical Aho-Corasick (AC) DFA, example 1
- A set of keywords
- he, her, him, his
Failure edges back to state 1 are shown as dashed lines; failure edges back to state 0 are not shown.
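The classical AC construction above (goto trie, then failure edges computed by BFS) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation:

```python
from collections import deque

def build_ac(keywords):
    """Build an Aho-Corasick automaton: goto trie, failure links, outputs."""
    goto, out, fail = [{}], [set()], [0]
    for word in keywords:                       # 1) build the goto trie
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(word)                        # keyword ends at state s
    q = deque(goto[0].values())                 # 2) failure links via BFS
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:      # walk up failure chain
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]              # inherit suffix matches
    return goto, out, fail

def search(text, keywords):
    goto, out, fail = build_ac(keywords)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:          # follow failure edges
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i, w) for w in out[s]]
    return hits

print(search("hisher", ["he", "her", "him", "his"]))
# -> [(2, 'his'), (4, 'he'), (5, 'her')]
```

Running it on the slide's keyword set finds "his", "he", and "her" in a single pass over the text.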
5. Memory Matrix Model of the AC DFA
- Snort (Dec. 2005): 2733 keywords
- 256 next state pointers
- width = 15 bits
- > 27,000 states
- keyword-ID width = 2733 bits
- 27,538 × (2733 + 256 × 15) bits ≈ 22 MB
22 MB is too big for on-chip RAM
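The 22 MB figure can be reproduced directly from the slide's numbers; a quick sanity check (sizes taken from the slide, 1 MB = 2^20 bytes assumed):

```python
states = 27538          # > 27,000 states for Snort (Dec. 2005)
kid_width = 2733        # keyword-ID width in bits (one bit per keyword)
pointers = 256          # one next-state pointer per input byte value
ptr_width = 15          # ceil(log2(27538)) = 15 bits per pointer

# per-state row width in bits, times number of states
bits = states * (kid_width + pointers * ptr_width)
mb = bits / 8 / 2**20
print(f"{mb:.1f} MB")   # ~22 MB: far too big for on-chip RAM
```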
6. Bit-AC DFA (Tan-Sherwood's Bit-Split)
Needs 8 bit-DFAs.
7. Memory Matrix of the Bit-AC DFA
- Snort (Dec. 2005): 2733 keywords
- 2 next state pointers
- width = 9 bits
- 361 states
- keyword-ID width = 16 bits
- 1368 DFAs
- 1368 × 361 × (16 + 2 × 9) bits ≈ 2 MB
8. Bit-AC DFA Techniques
- Shrinking the width of the keyword-ID
- From 2733 to 16 bits
- By dividing the 2733 keywords into 171 subsets
- Each subset has 16 keywords
- Reducing the next state pointers
- From 256 to 2 pointers
- By dividing each input byte into 1-bit streams
- Needs 8 bit-DFAs
- Extra benefits
- The number of states (per DFA) drops from > 27,000 to ~300.
- The width of the next state pointer drops from 15 to 9 bits.
- Memory
- Reduced from 22 MB to 2 MB
- The number of DFAs?
- With 171 subsets, each subset has 8 DFAs.
- Total DFAs = 171 × 8 = 1368
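The intuition behind the bit-split technique can be checked with a toy sketch (not the authors' implementation): project the keywords and the input onto 8 single-bit planes, match each plane independently, and intersect the per-plane candidate sets at every end position. Because the 8 bit planes together determine each byte exactly, the intersection reports exactly the true matches. Naive per-plane matchers stand in for the 8 bit-DFAs here:

```python
def bit_plane(data, t):
    """Project a byte string onto bit plane t (bit t of every byte)."""
    return tuple((b >> t) & 1 for b in data)

def bit_split_match(text, keywords):
    """Report (end_index, keyword) pairs by intersecting, per position,
    the keywords whose bit-plane projections all match -- the role the
    8 bit-DFAs and their partial-match vectors play in Bit-AC."""
    hits = []
    for i in range(len(text)):
        cand = set(keywords)
        for t in range(8):                     # one "bit-DFA" per bit plane
            plane = bit_plane(text, t)
            cand = {w for w in cand
                    if i + 1 >= len(w)
                    and plane[i + 1 - len(w):i + 1] == bit_plane(w, t)}
        hits += [(i, w) for w in sorted(cand)]
    return hits

print(bit_split_match(b"hisher", [b"he", b"her", b"him", b"his"]))
# same matches as the byte-level AC DFA would report
```

The payoff shown on the slides is that each bit-DFA sees a binary alphabet, so it needs only 2 next-state pointers per state instead of 256.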
What can we do better to reduce the memory usage?
9. Classical AC DFA, example 2
28 states
Failure edges are not shown.
10. Byte-AC DFA
- Considering 4 bytes at a time
- 4 DFAs
- < 9 states per DFA
- 256 next state pointers!
Similar to Dharmapurikar-Lockwood's JACK DFA (ANCS'05).
11. Bit-Byte-AC DFA
- 4 bytes at a time
- Each byte is divided into bits.
- 32 DFAs (= 4 × 8)
- < 6 states per DFA
- 2 next state pointers
12. Memory Matrix of the Bit-Byte-AC DFA
- Snort (Dec. 2005): 2733 keywords
- 4 bytes at a time
- < 36 states per DFA
- 2 next state pointers
- width = 6 bits
- keyword-ID width = 3 bits
- 29,152 DFAs (= 911 × 32)
- 29,152 × 36 × (3 + 2 × 6) bits ≈ 1.9 MB
- 1.9 MB is only a little better than 2 MB. This is because:
- This is not yet an optimal setting.
- Each DFA has a different number of states.
- We don't need to provide the same size of memory matrix for every DFA.
13. Bit-Byte-AC DFA Techniques
- Still keeps the width of the keyword-ID as low as the Bit-DFA.
- Still keeps the next state pointers as small as the Bit-DFA.
- Reduces the states per DFA by:
- Skipping bytes
- Exploiting more shared states than the Bit-DFA
- Results of reducing the states per DFA:
- From > 27,000 to 36 states
- The width of the next state pointer drops from 15 to 6 bits.
14. Construction of Bit-Byte AC DFA
bit 3 of byte 0
4 bytes (considered) at a time
15–22. Construction of Bit-Byte AC DFA (animation steps; 4 bytes considered at a time)
23. Construction of Bit-Byte AC DFA
Failure edges are not shown.
24. Construction of Bit-Byte AC DFA
25. Construction of Bit-Byte AC DFA
32 bit-byte DFAs need to be constructed.
26–30. Bit-Byte-DFA Searching (animation steps, starting from state 0; a failure edge is shown where necessary)
31. Find the optimal settings to minimize memory
- k = keywords per subset
- The width of the keyword-ID = k bits
- k = 1, 2, 3, …, K
- where K = the number of keywords in the whole set
- Snort (Dec. 2005): K = 2733 keywords
- b = bit(s) extracted from each byte
- b = 1, 2, 4, 8
- Number of next state pointers = 2^b
- Example 2 used b = 1.
- Beyond b = 8: > 256 next state pointers
- B = bytes considered at a time
- B = 1, 2, 3, …
- Example 2 used B = 4.
- Total memory (T) is a function of k, b, and B: T = f(k, b, B)
32Ts Formula
,
and
,
when
Total memory of all bit-ACs in all subset
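Under the uniform per-DFA state counts the earlier slides use, T can be evaluated directly. A hedged sketch: the single `states` argument is a simplification, since in the full model each DFA has its own state count S_ij:

```python
from math import ceil, log2

K = 2733  # Snort (Dec. 2005) keyword count

def total_memory_bits(k, b, B, states):
    """T(k, b, B): subsets x DFAs-per-subset x per-DFA matrix size,
    assuming every DFA has the same number of states (a simplification)."""
    subsets = ceil(K / k)            # keyword subsets of size k
    dfas = subsets * (8 * B // b)    # 8B/b bit-byte DFAs per subset
    ptr_width = ceil(log2(states))   # next-state pointer width in bits
    return dfas * states * (k + 2**b * ptr_width)

# Bit-AC setting (slide 7): k=16, b=1, B=1, 361 states -> ~2 MB
print(total_memory_bits(16, 1, 1, 361) / 8 / 2**20)
# Bit-Byte setting (slide 12): k=3, b=1, B=4, 36 states -> ~1.9 MB
print(total_memory_bits(3, 1, 4, 36) / 8 / 2**20)
```

Plugging in the two settings from slides 7 and 12 reproduces both the ~2 MB and the ~1.9 MB totals, which is a useful check that the formula is term-by-term consistent with the memory matrices shown earlier.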
33. Find the optimal k
- Each pair of (b, B) has one optimal k that gives a minimal T.
[Chart: T vs. keywords per subset (k)]
34. Find the optimal b
- Each setting of k, b, and B has a different optimal point.
- Comparing only the optimal settings:
- b = 2 is the best.
35. Find the optimal B
- With b = 2:
- T decreases as B increases, non-linearly.
- For B > 16, T begins to increase.
- B = 16 is the best for Snort (Dec. 2005).
36. Comparing with Existing Works
- Tan-Sherwood's, Brodie-Cytron-Taylor's, and ours
- Our Bit-Byte DFA, when B = 16:
- The optimal point is at b = 2 and k = 12.
- 272 KB
- 14% of 2001 KB (Tan's)
- 4% of 6064 KB (Brodie's)
37. Comparing with Existing Works
- Tan-Sherwood's and ours, at B = 1:
- Tan's, on ASIC:
- 2001 KB
- k = 16 is not the optimal setting for B = 1.
- Each bit-DFA uses the same storage capacity, sized to fit the largest one (worst case).
- Ours, on NP:
- 396 KB < 2001 KB
- k = 3 is the optimal setting for B = 1.
- Each bit-DFA uses exactly the memory space needed to hold it.
38. Results with an NP Simulator
- NePSim2
- An open-source IXP24xx/28xx simulator
- NP architecture based on the IXP2855
- 16 MicroEngines (MEs)
- 512 KB
- 1.4 GHz
- Bit-Byte AC DFA: b = 2, B = 16, k = 12
- T = 272 KB
- 5 Gbps
39. Conclusion
- The Bit-Byte DFA model can reduce memory usage by up to 86%.
- Implementing on an NP uses on-chip memory more efficiently, without wasting space, compared to an ASIC.
- An NP has the flexibility to accommodate:
- The optimal setting of k, b, and B.
- Different sizes of Bit-Byte DFAs.
- New rule sets in the future.
- The optimal setting may change.
- The performance (measured with an NP simulator) satisfies line speed, up to 5 Gbps throughput.
40. Thank You
Questions? Piti_Piyachon_at_student.uml.edu, Yan_Luo_at_uml.edu