Title: Automatic Synthesis of Efficient Intrusion Detection Systems on FPGAs by Zachary K' Baker and Viktor
1Automatic Synthesis of Efficient
IntrusionDetection Systems on FPGAsby Zachary
K. Baker and Viktor K. PrasannaUniversity of
Southern California, Los Angeles, CA, USAFPL
2004Review 12-9-04
2Outline
- Introduction
- Related Work in Automated IDS Generation
- Approach
- Tree-based Prefix Sharing
- Performance Results for Tool-Generated Designs
- Conclusion
3Background on network security and intrusion
detection
- Performance
- ability to match against a large set of patterns
- Features
- automatically optimize and synthesize large
designs - Application for Intrusion detection software
- Snort
- Hogwash
- Trends
- Move away from general-purpose microprocessor
- Admins remove rules from the databases when
performance is limited - Move to string matching to an FPGA based
reconfigurable hardware
4Related Work in Automated IDS Generation
- Open-source Intrusion detection software
- Snort
- Hogwash
- String matching is just one component of the many
functions used in Snort and Hogwash
5Related work in FPGA-based network scanning
- First reassemble TCP stream (WashU Schuehler)
- demultiplex the TCP/IP stream into substreams
- Sort packets based on rules
- (WashU Mike Attig)
- Washington University research approach
- Deterministic Finite Automata pattern matchers
(WashU RegEx) - Spread the load over several parallel matching
units - Four parallel reduce the runtime by 4X
- Limited number of regular expressions patterns
support - Bloom Filters
- Provides large amount of string matching
- Pre-decoded shift-and-compare shift registers
- Alternative approaches
- This paper lacks references much of the related
work
6Approach of this work
- Partitions large rule sets into multiple
pipelines - Minimizes FPGA memory area usage
- Optimizes Throughput
- String search (one part of IDS)
- Characters tend to repeat across strings
- Each string only contains a few dozen characters
- pipelines created
- 2-4 for 400 patterns
- 8 for 600 1000 patterns
- Rules
- repeated characters within a partition is
maximized - number of characters repeated between partitions
is minimized - Tools
- Process graphs with trie nodes
- Generate synthesizeable VHDL circuit description
7Approach
8Contribution of this work
- Tool
- accepts rule strings
- creates pipelined distribution networks
- converts template-generated Java to netlists
w/JHDL - Reduces amount of routing required
- Reduces complexity of finite automata state
machines - Proposed Tool combines common prefixes to form
matching trees - Adds pre-decoded wide parallel inputs
9Approach
- Basic idea - characters shared across patterns do
not need to be redundantly compared - Architectures
- a pre-decoded shift-and-compare
- Data pre-decoded" into its character equivalent
- Large array of AND gates asserts output for
character hot coded - Gate delays provided through a shift-register
- Appropriate decoded character is selected from
each time-delayed shift-register stage - tree-based area optimization prevents repeating
comparisons - Precomputing redundancy information
- shared decoding
- pushes all character-level comparisons to the
beginning of the comparator pipelines - reduces the single character match operation to
the inspection of a single bit. - Eg valueh represented with a single bit, one
bit for each possible value - Enables pattern groups to be handled by an
independent pipeline - Eg acker
10Patterns of text
- Set of strings appear start in common partition
- Strings (vertices) with characters in common have
are connected (with an edge) - Partitions are established that have characters
in common - i.e., fully connected groups are partitioned
together - reducing the cut between the partitions decreases
the number of pipeline registers
Objective - maximize the number of edges between
nodes within the group minimize the number of
edges between nodes in different groups
11Formalized operation
Node - collection of patterns is the number of
characters piped through each pipeline Pattern -
Each pattern is composed of letters. Edge
Every node with a given letter is connected by an
edge to every other node with that letter. The
edge is added between any vertex-patterns that
have a common character
- vertex V
- graph R
- pattern p
- ruleset T
- edge E
- character class C
- k, l edge instances
12Tree-based Prefix Sharing
- efficient search of matching rules
- Boyer-Moore algorithm
- Aho-Corasick algorithm
- hashing mechanism utilizing the Bloom filter
- pre-decoding strategy - converts characters to
single bit lines in the cycle before they are
required in the state machine - reduces the area of designs
- allows more patterns to be packed in the FPGA
- customized
- for the 4-bit blocks of characters
- four character prefixes map to four decoded bits
- Fits Xilinx Virtex 4-bit lookup table
13Tree-based sharing Example operation
/scripts
/unsafe.cgi
/cgi-bin
-win
/cgi
/insecure.cgi
/unsafe.cgi
(Ignored characters)
14Graph theory
- Every node with a given letter is connected by an
edge to every other node with that letter - Patterns are partitioned n-ways to reduces
pipeline register width Improve character usage
within pipeline - number of repeated characters within a partition
is maximized - number of characters repeated between partitions
is minimized - system is composed of n pipelines, each with a
minimum of bit lines - Uses mincut physical design automation to
minimize representation - Weighted reduces the problems associated with
large rulesets - Uses edge weighting based on number of characters
in the pattern - Patterns sharing character locality to be more
likely grouped together - keep similar prefixes together but doesnt force
incompatible patterns in the same group
15On-line Tools
- Tools on-line at
- http//halcyon.usc.edu/zbaker/idstools
- Tool Package
- Partitioning tools - KMETIS toolset
- the trie data structure modulehttp//search.cpan.
org/avif/Tree-Trie-0.4/Trie.pm - IDS database - Hogwash
16Clock Period and Area Efficiency as a function of
of partitions and Size of Ruleset
Clock Period(has small effect for different s
ofPartitions)
Area is smallestwith smaller ofPartitions.
(3x range of area)
- For large rulesets, the penalty to add more
partitions is not so bad (1.5 x larger) - Clock period is relatively unaffected for all
cases
17Architecture Comparison
18Comparisons to Related Work
(Unit Size / Performance 1000)
Pattern size, average unit size for a 16
character pattern (in logic cells one slice is
two logic cells), and performance (in Mb/s/cell)
(Where are Bloom filters on this chart ?!?
)Whole list of 16-character patterns fit in one
filter !This chart does not consider BlockRAMs!
19Performance Results for Tool-Generated Designs
- Performance
- General improvement is approximately 30
- 602 pattern ruleset reduces the area by almost
50 in some cases - unpartitioned experiments - increased area due to
the tree architecture - large numbers of patterns share the same pipeline
increased fanout - impossible to make fair comparisons without
reimplementing all other designs - performance - throughput/area
- small, fast designs perform best with Virtex II
Pro XC2VP100 - Area increases moderately as the number of
partitions increases - Platform - Xilinx ML-300 contains a Virtex II Pro
XC2VP7 - VHDL file size - 300kB for the 361 ruleset
- 9,000 lines of __HDL code
- 1200 slices
20Conclusion
- Throughput (1 / Clock_rate) not greatly affected
by of partitions - Area increases with of Partitions
- Performance of approach (speed/area)
- Comparable to that achieved by UCLA RDL
- About 2x better than GaTech and U.Create
- About 30x better than Los Alamos
- Not compared to Bloom filters