Title: Construction of Aho Corasick automaton in Linear time for Integer Alphabets
1 Construction of Aho Corasick automaton in Linear time for Integer Alphabets
- Gad Landau, Shiri Dori
- University of Haifa
2 Overview
- Classic Aho Corasick
- Our algorithm
- Goto Function
- Failure Function
- Combining the two
- Queries in O(m log Σ)
3 Set Pattern Matching Problem
- Find patterns in text
- P = {P1, P2, ..., Pq}, in T
- Aho and Corasick solved it in 1975
- Generalized version of KMP
- Uses a state machine
4 Aho Corasick - Example
P = {her, iris, he, is}
T = the iris for her
5 Aho Corasick - Example
P = {her, iris, he, is}
T = the iris for her
Travel along the Goto function, which is a trie of all patterns
If stuck, travel along the KMP-style Failure link
6 Aho Corasick - Example
P = {her, iris, he, is}
T = the iris for her
Travel along the Goto function, which is a trie of all patterns
When a pattern is found, output it
If stuck, travel along the KMP-style Failure link
[Figure: the trie, with nodes root, h, i, he, ir, is, her, iri, iris]
7 Aho Corasick Definitions
- Goto function: a trie of the patterns
- Failure function: for each label, the largest proper suffix which is a prefix of a pattern
- Like KMP, but a prefix of any pattern qualifies
- Output function: the patterns ending at this label
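The three definitions above can be sketched in code. This is a minimal illustration of the classic, textbook construction (BFS over the trie), not the linear-time algorithm of this talk; it uses dictionaries for branching, and the names `build_automaton` and `search` are chosen here for illustration only.

```python
from collections import deque

def build_automaton(patterns):
    """Classic Aho-Corasick: goto trie, BFS failure links, output sets."""
    goto = [{}]      # goto[v][c] -> child of node v on character c
    fail = [0]       # fail[v] -> node of the largest proper suffix that is a trie label
    out = [set()]    # out[v] -> patterns ending at the label of node v
    for p in patterns:
        v = 0
        for c in p:
            if c not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].add(p)
    queue = deque(goto[0].values())   # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for c, u in goto[v].items():
            f = fail[v]
            while f and c not in goto[f]:
                f = fail[f]
            fail[u] = goto[f].get(c, 0)
            out[u] |= out[fail[u]]    # inherit patterns ending at the failure target
            queue.append(u)
    return goto, fail, out

def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, in scan order."""
    goto, fail, out = build_automaton(patterns)
    v, hits = 0, []
    for i, c in enumerate(text):
        while v and c not in goto[v]:
            v = fail[v]               # stuck: follow the failure link
        v = goto[v].get(c, 0)
        for p in sorted(out[v]):
            hits.append((i - len(p) + 1, p))
    return hits

print(search("the iris for her", ["her", "iris", "he", "is"]))
```

On the running example this reports he at index 1, iris at 4, is at 6, and he and her at 13.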
8 Classic Aho Corasick - Analysis
- Constructed in O(n), where n is the cumulative pattern length
- Answers queries in O(m + k), for a text of length m with k occurrences
- ... For constant alphabets only!
- For integer alphabets, the algorithm changes depending on the branching method
- List, Array or Search Tree
- Recent developments inspire us to do better!
- Farach, 97; Kärkkäinen & Sanders, 03
- Ko & Aluru, 03; Kim, Sim, Park & Park, 03
9 Our Results
- Our algorithm achieves better results
- Construction in O(n) time, O(n) space
- Query in O(m log Σ)
- Works for integer alphabets, Σ = O(n^c)
10 Algorithm - Goto Function
- Sort the patterns in time linear in their total length
- By building the suffix array of S_P = P1P2...Pq and just ignoring the non-pattern suffixes
- Or by a two-pass radix sort, O(D + Σ) = O(n)
- Paige & Tarjan, 87; Andersson & Nilsson, 94
- Now create the trie in lexicographic order
- Hold a list of sons; insert each new node at the end of the list
- Lo and behold: we have a sorted trie
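The trie-building step can be sketched as follows. This is an illustration under assumptions: the patterns are already sorted (here with Python's built-in `sorted`, which is not the linear-time sort the slide refers to), and the name `build_sorted_trie` and the `label` array are hypothetical conveniences for display.

```python
def build_sorted_trie(sorted_patterns):
    """Build a trie from patterns given in lexicographic order.

    children[v] is a list of (char, child) pairs. Because the patterns arrive
    sorted, a new son always belongs at the end of its parent's list, so only
    the tail of each list is ever inspected.
    """
    children = [[]]   # node 0 is the root
    label = [""]      # label[v] is kept only for display; it is not needed by the method
    for p in sorted_patterns:
        v = 0
        for c in p:
            kids = children[v]
            if kids and kids[-1][0] == c:
                v = kids[-1][1]                  # the tail already branches on c
            else:
                children.append([])
                label.append(label[v] + c)
                kids.append((c, len(children) - 1))
                v = kids[-1][1]                  # the new node is now the tail
    return children, label

children, label = build_sorted_trie(sorted(["her", "iris", "he", "is"]))
print(label)
print([c for c, _ in children[0]])
```

Every list of sons comes out sorted without any search or comparison beyond the tail, which is the point of inserting in lexicographic order.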
11 Example - Goto Function
P = {her, iris, he, is}
Sorting Patterns
P = {he, her, iris, is}
12 Example - Goto Function
P = {he, her, iris, is}
Patterns inserted so far: he; he, her; he, her, iris; he, her, iris, is
[Figure: the trie, with nodes root, h, i, he, ir, is, her, iri, iris]
13 Example - Goto Function
P = {he, her, iris, is}
Sorted List, keep the tail
[Figure: the trie, with nodes root, h, i, he, ir, is, her, iri, iris; each node holds its sons in a sorted list]
14 Algorithm - Failure Function
- We need to construct Failure links on the trie
- The original algorithm did this by traversing the trie
- We found a deep connection between
- the Failure function of the patterns, and
- the Suffix Tree of the reversed patterns
- or the Enhanced Suffix Array
- Abouelhoda, Kurtz & Ohlebusch, 04; Kim, Jeon & Park, 04
- We'll learn by example...
15 Example - Failure Function
- Failure function: iris → is
- The reverses: siri, si
- si is a prefix of siri
- (si is marked, so "is" is a prefix of a pattern)
P^R = {eh, reh, siri, si}
16 Understanding Failure Function
- The Failure function is defined as the largest suffix which is a prefix of any pattern
- Reverse: largest suffix → largest prefix
- Any prefix of a label will be its ancestor in the ST
- Largest means nearest
- Prefix of a pattern → suffix of the reversed pattern
- It will be a node in the ST, and it will be marked
- So: the closest ancestor which is marked
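The reversal argument can be checked directly from the definition. The sketch below is a brute-force illustration, not the linear-time method: it reverses every trie label and looks for the longest proper prefix of the reversed label that is itself a reversed trie label. The function name `failure_by_reversal` is an assumption made for this example.

```python
def failure_by_reversal(patterns):
    """Compute failure targets from the definition, working on reversed labels."""
    # Every trie label is a non-empty prefix of some pattern.
    labels = {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    reversed_labels = {s[::-1] for s in labels}
    failure = {}
    for s in labels:
        r = s[::-1]
        # Proper prefixes of the reversed label, longest first:
        # these are the reversed proper suffixes of s.
        for k in range(len(r) - 1, 0, -1):
            if r[:k] in reversed_labels:
                failure[s] = r[:k][::-1]
                break
        else:
            failure[s] = ""   # no candidate: the failure link goes to the root
    return failure

failure = failure_by_reversal(["her", "iris", "he", "is"])
print(failure["iris"])   # is
```

Here siri has the prefix si, which is the reverse of the trie label is, so the failure link of iris points to is.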
17 Algorithm - Failure Function
- We found a deep connection between
- the Failure function of the patterns, and
- the Suffix Tree of the reversed patterns
- We define S_P = P1P2...Pq
- We define T^R to be the suffix tree of (S_P)^R
- T^R can be built in linear time
- We can use the Enhanced Suffix Array, E^R, instead
- Note: T^R is a Generalized Suffix Tree
- How will we link the trie and T^R?
18 Example - 1-to-1 Mapping
Note: r doesn't get a link, since it's not marked
19 Example - 1-to-1 Mapping
20 Algorithm - Review
- Build Goto function (trie)
- Sort patterns
- Construct trie
- Build Failure function
- Construct T^R
- Compute the proper ancestor for marked nodes
- Combine information
- Through the mapping, create Failure links on the trie
21 Adjustment for Integer Alphabet
- We used recent developments (SA, ST)
- Constructed Goto using a suffix array
- Found a connection between the Failure function and suffix trees
- Thus, reduced the construction to O(n)
- Yet we manage to keep queries at O(m log Σ)
- Again: how?
22 Queries in O(m log Σ)
- We've built the trie in O(n)
- But we have a sorted list
- Search is compromised
- Our simple solution...
23 Example - Goto Function
P = {he, her, iris, is}
Array can be searched in log Σ
[Figure: the trie, with nodes root, h, i, he, ir, is, her, iri, iris; each node's list of sons is converted to an array]
24 Queries in O(m log Σ)
- Once the trie is complete
- Convert the lists in each node to arrays
- Array sizes are known: O(n) space overall
- Binary search can now be employed
- Reduces the time in each node to log Σ
- Can be applied to a Suffix Tree built from Suffix Array + LCP
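The list-to-array conversion can be sketched as below. The trie for he, her, iris, is is written out by hand as sorted lists of (char, child) pairs; the node numbering and the names `freeze` and `step` are assumptions made for this illustration. Python's `bisect_left` performs the binary search.

```python
from bisect import bisect_left

def freeze(children):
    """Convert each node's sorted list of sons into two parallel arrays."""
    return [([c for c, _ in kids], [u for _, u in kids]) for kids in children]

def step(frozen, v, c):
    """Return the son of node v on character c, or None, using binary search."""
    chars, targets = frozen[v]
    i = bisect_left(chars, c)
    if i < len(chars) and chars[i] == c:
        return targets[i]
    return None

# Hand-written trie for P = {he, her, iris, is}; node 0 is the root.
# Nodes: 1=h, 2=he, 3=her, 4=i, 5=ir, 6=iri, 7=iris, 8=is.
children = [
    [("h", 1), ("i", 4)], [("e", 2)], [("r", 3)], [],
    [("r", 5), ("s", 8)], [("i", 6)], [("s", 7)], [], [],
]
frozen = freeze(children)
print(step(frozen, 4, "s"))   # 8
```

Each lookup costs a binary search over at most Σ sons, and the arrays together hold one entry per trie edge, so the space stays O(n).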
25 The End
26 Algorithm - Combining the two
- Build a 1-to-1 mapping between marked nodes in T^R and trie nodes
- We compute the mapping through the string
- For each char in S_P, we keep its Goto node
- For each suffix tree node, we know what indices it represents (in (S_P)^R, and so in S_P)
- Now, build Failure links atop the trie
- Like we saw in the example
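The "mapping through the string" idea can be sketched as follows, assuming S_P is the plain concatenation of the patterns as written on the slides. While the patterns are walked through the trie, the Goto node reached at each character of S_P is recorded; a position j in (S_P)^R then translates to position n-1-j in S_P. The function names and the `label` array are hypothetical, and no suffix tree is built here.

```python
def goto_nodes_along(patterns):
    """Return (S_P, node_at, label).

    node_at[i] is the trie node reached after reading character i of
    S_P = P1 P2 ... Pq; label[v] is the string spelled by node v.
    """
    child = [{}]
    label = [""]
    node_at = []
    for p in patterns:
        v = 0
        for c in p:
            if c not in child[v]:
                child.append({})
                label.append(label[v] + c)
                child[v][c] = len(child) - 1
            v = child[v][c]
            node_at.append(v)
    return "".join(patterns), node_at, label

def trie_node_for_reversed_position(node_at, j):
    """Position j of (S_P)^R holds the character at position n-1-j of S_P."""
    return node_at[len(node_at) - 1 - j]

s_p, node_at, label = goto_nodes_along(["he", "her", "iris", "is"])
s_r = s_p[::-1]
v = trie_node_for_reversed_position(node_at, 2)
print(s_r[2:], label[v])
```

The suffix of (S_P)^R starting at position j always begins with the reversed label of the trie node found this way, which is what lets a suffix tree node be matched to its trie node by index alone.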
27 Algorithm - Failure Function
- For each node, find its proper ancestor
- The closest ancestor that is marked
- Found with a simple preorder traversal
- The properties of T^R ensure that...
- For each failure link v1 → v2
- and their corresponding nodes, u1 and u2
- u2 is the proper ancestor of u1
- If we link the trie and T^R, we find the Failure!
- How will we link them?
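The preorder step can be illustrated with a simplified stand-in for T^R. Instead of a compressed suffix tree, the sketch below builds a plain, uncompressed trie of the reversed trie labels (so it is not linear time), marks each node whose label is a reversed trie label, and carries the nearest marked proper ancestor down a preorder traversal. The function name `failure_via_reversed_trie` is an assumption for this example.

```python
def failure_via_reversed_trie(patterns):
    """Failure links as nearest marked proper ancestors in a reversed-label trie."""
    labels = {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    # Stand-in for T^R (assumption): an uncompressed trie of all reversed labels.
    child = [{}]
    marked = [None]      # marked[u] = original trie label if u is marked, else None
    for s in sorted(labels):
        u = 0
        for c in reversed(s):
            if c not in child[u]:
                child.append({})
                marked.append(None)
                child[u][c] = len(child) - 1
            u = child[u][c]
        marked[u] = s
    # Preorder traversal: each node is handled before its children, and passes
    # down the label of the nearest marked node seen so far ("" means the root).
    failure = {}
    stack = [(0, "")]
    while stack:
        u, nearest = stack.pop()
        if marked[u] is not None:
            failure[marked[u]] = nearest
            nearest = marked[u]
        for w in child[u].values():
            stack.append((w, nearest))
    return failure

failure = failure_via_reversed_trie(["her", "their", "eye", "iris", "he", "is"])
print(failure["their"], failure["the"], failure["he"])   # ir he e
```

The node for r is unmarked because no pattern begins with r, so the traversal skips it and ir fails to the root.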
28 Example - Automaton - Goto
Travel along the Goto function, which is a trie of all patterns
P = {her, their, eye, iris, he, is}
29 Example - T^R
P = {her, their, eye, iris, he, is}
30 Example - T^R and Failure
P = {her, their, eye, iris, he, is}
iris → is; the → he → e; eye → e; their → ir → root
31 T^R - Reversed Suffix Tree
- We defined S_P = P1P2...Pq
- We define T^R to be the suffix tree of (S_P)^R
- This tree has interesting properties
- Each trie node v is represented by exactly one T^R node u, so that Label(v) = Label(u)^R
- In T^R, a node's label is a prefix of its child's label; in the trie, it is a suffix of the original
- A marked node in T^R means that the original label is a prefix of a pattern
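32 Example - T^R
P = {her, iris, he, is}
[Figure: T^R, with each node's label followed by the original label in parentheses: root, e (e), r (r), h (h), i (i), si (is), eh (he), reh (her), iri (iri), ri (ir), siri (iris)]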
33 Example - T^R and Failure
- We took
- Failure of their → ir (from iris)
- Largest suffix which is a prefix of a pattern
- Their reverse strings are rieht, ri
- Now it is a prefix... so it's an ancestor in a suffix tree!
- To be a prefix of a pattern px, it should be a suffix of the reversed pattern (px)^R
- So it will be in the suffix tree, and it will be marked
P = {her, their, eye, iris, he, is}