Construction of Aho Corasick automaton in Linear time for Integer Alphabets

1
Construction of Aho Corasick automaton in Linear time
for Integer Alphabets
  • Gad Landau, Shiri Dori
  • University of Haifa

2
Overview
  • Classic Aho Corasick
  • Our algorithm
  • Goto Function
  • Failure Function
  • Combining the two
  • Queries in O(m log Σ)

3
Set Pattern Matching Problem
  • Find patterns in text
  • P = {P1, P2, ..., Pq}, in T
  • Aho and Corasick solved it in '75
  • Generalized version of KMP
  • Uses a state machine

4
Aho Corasick - Example
P = {her, iris, he, is}
T = the iris for her
5
Aho Corasick - Example
P = {her, iris, he, is}
T = the iris for her
Travel along the Goto function, which is a trie
of all patterns
If stuck, travel along KMP-style Failure link
6
Aho Corasick - Example
P = {her, iris, he, is}
T = the iris for her
Travel along the Goto function, which is a trie
of all patterns
If stuck, travel along KMP-style Failure link
When found a pattern, output it
[Figure: trie of the patterns with nodes root, h, he, her, i, ir, iri, iris, is]
7
Aho Corasick - Definitions
  • Goto function: a trie of the patterns
  • Failure function: for each label, the largest
    proper suffix which is a prefix of a pattern
  • Like KMP, but a prefix of any pattern qualifies
  • Output function: patterns ending at this label
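
The three functions defined above can be sketched in Python as follows. This is a minimal illustration of the classic construction (breadth-first traversal of the trie, dictionary-based branching), not the linear-time method for integer alphabets that the later slides describe. The names `build_automaton` and `search` are chosen here for illustration.

```python
from collections import deque

def build_automaton(patterns):
    """Classic Aho-Corasick construction with dict-based branching."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # failure link of each state
    out = [[]]    # patterns ending at each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal: depth-1 states fail to the root,
    # deeper states derive their link from the parent's link.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out

def search(text, patterns):
    """Return (start index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]          # stuck: follow failure links
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

hits = search("the iris for her", ["her", "iris", "he", "is"])
```

On the running example, `hits` holds the occurrences of he, iris, is, he and her in the order in which the scan reaches their last character.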

8
Classic Aho Corasick - Analysis
  • Constructed in O(n) (cumulative pattern length)
  • Answered queries in O(m + k)
  • ... For constant alphabets only!
  • For integer alphabets, algorithm changes
    depending on branching method
  • List, Array or Search Tree
  • Recent developments inspire something better!
  • Farach, '97; Kärkkäinen & Sanders, '03
  • Ko & Aluru, '03; Kim, Sim, Park & Park, '03

9
Our Results
  • Our algorithm achieves better results
  • Construction in O(n) time, O(n) space
  • Query in O(m log Σ)
  • Works for integer alphabets, Σ = O(n^c)

10
Algorithm - Goto Function
  • Sort patterns in time linear in their total length
  • By building the suffix array of S_P = P1 P2 ... Pq,
    and just ignoring non-pattern suffixes
  • Or by two-pass radix sort, O(D Σ) = O(n)
  • Paige & Tarjan, '87; Andersson & Nilsson, '94
  • Now create the trie in lexicographic order
  • Hold a list of sons; insert each new node at the
    end of the list
  • Lo and behold - we have a sorted trie
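
A rough sketch of the trie-building step, assuming the patterns have already been sorted. Python's built-in `sorted` is used here as a stand-in and is a comparison sort, not the linear-time suffix-array or radix-sort step named above. `Node` and `build_sorted_trie` are illustrative names.

```python
class Node:
    def __init__(self, label):
        self.label = label        # stored for readability only
        self.children = []        # list of (char, Node), stays sorted
        self.is_pattern = False

def build_sorted_trie(patterns):
    root = Node("")
    # Stand-in sort: NOT the linear-time sort from the slides.
    for p in sorted(patterns):
        node = root
        for ch in p:
            # With lexicographic insertion order, an existing child
            # for ch can only be the tail of the child list.
            if node.children and node.children[-1][0] == ch:
                node = node.children[-1][1]
            else:
                child = Node(node.label + ch)
                node.children.append((ch, child))   # append at the end
                node = child
        node.is_pattern = True
    return root

root = build_sorted_trie(["her", "iris", "he", "is"])
```

Because only the tail of each child list is ever inspected, no search inside a node is needed during construction, and every child list ends up in sorted order.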

11
Example - Goto Function
P = {her, iris, he, is}
Sorting Patterns
P = {he, her, iris, is}
12
Example - Goto Function
P = {he, her, iris, is}
[Figure: the trie grows as patterns are inserted in sorted order: he; he, her; he, her, iris; he, her, iris, is. Nodes: root, h, he, her, i, ir, iri, iris, is]
13
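Example - Goto Function
P = {he, her, iris, is}
Sorted List, keep the tail
[Figure: trie with a sorted list of sons at each node, e.g. h, i at the root and r, s under i. Nodes: root, h, he, her, i, ir, iri, iris, is]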
14
Algorithm - Failure Function
  • We need to construct Failure links on trie
  • Original algorithm included traversing trie
  • We found a deep connection between
  • Failure function of the patterns, and
  • Suffix Tree of the reversed patterns
  • Or Enhanced Suffix Array
  • Abouelhoda, Kurtz & Ohlebusch, '04; Kim, Jeon &
    Park, '04
  • We'll learn by example...

15
Example - Failure Function
  • Failure function: iris → is
  • The reverses: siri, si
  • si is a prefix of siri
  • (with a separator mark, so "is" is a prefix of a pattern)

P^R = {eh, reh, siri, si}
16
Understanding Failure Function
  • Failure function is defined as the largest suffix
    which is a prefix of any pattern
  • Reverse: largest suffix → largest prefix
  • Any prefix of a label will be its ancestor in the ST
  • Largest means nearest
  • Prefix of a pattern → suffix of the reversed pattern
  • It will be a node in the ST, marked by a separator
  • So: the closest ancestor which is marked by a separator
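
The equivalence between the direct definition and its reversed form can be checked by brute force. The sketch below is only a quadratic-time illustration of the definition, not the suffix-tree method; both function names are made up for this example.

```python
def failure_by_definition(label, patterns):
    """Largest proper suffix of label that is a prefix of some pattern."""
    for start in range(1, len(label) + 1):
        suffix = label[start:]
        if any(p.startswith(suffix) for p in patterns):
            return suffix
    return ""

def failure_via_reversal(label, patterns):
    """Same value, phrased on reversed strings: the largest proper prefix
    of the reversed label that is a suffix of some reversed pattern."""
    rev = label[::-1]
    rev_patterns = [p[::-1] for p in patterns]
    for length in range(len(rev) - 1, -1, -1):
        prefix = rev[:length]
        if any(rp.endswith(prefix) for rp in rev_patterns):
            return prefix[::-1]
    return ""

P = ["her", "iris", "he", "is"]
direct = failure_by_definition("iris", P)
reversed_view = failure_via_reversal("iris", P)
```

For the label iris both calls give is, matching the example: si is a prefix of siri and a suffix of the reversed pattern si.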

17
Algorithm - Failure Function
  • We found a deep connection between
  • Failure function of the patterns, and
  • Suffix Tree of the reversed patterns
  • We define S_P = P1 P2 ... Pq
  • We define T^R to be the suffix tree of (S_P)^R
  • T^R can be built in linear time
  • Can use Enhanced Suffix Array, E^R, instead
  • Note: T^R is a Generalized Suffix Tree
  • How will we link the trie and T^R?

18
Example - 1-to-1 Mapping
Note: r doesn't get a link since it's not
marked by a separator
19
Example - 1-to-1 Mapping
20
Algorithm Review
  • Build Goto function (trie)
  • Sort patterns
  • Construct trie
  • Build Failure function
  • Construct T^R
  • Compute proper ancestor for separator-marked nodes
  • Combine information
  • Through mapping, create Failure links on trie

21
Adjustment for Integer Alphabet
  • We used recent developments (SA, ST)
  • Constructed Goto using suffix array
  • Found a connection between Failure function and
    suffix trees
  • Thus, reduced the construction to O(n)
  • Yet, we manage to keep queries at O(m log Σ)
  • Again - how?

22
Queries in O(m log Σ)
  • We've built the trie in O(n)
  • But we have a sorted list
  • Search is compromised
  • Our simple solution

23
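Example - Goto Function
P = {he, her, iris, is}
Array can be searched in log Σ
[Figure: trie with the list of sons at each node converted to an array, e.g. (h, i) at the root and (r, s) under i. Nodes: root, h, he, her, i, ir, iri, iris, is]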
24
Queries in O(m log Σ)
  • Once the trie is complete
  • Convert lists in each node to arrays
  • Array sizes are known: O(n) space overall
  • Binary search can now be employed
  • Reduce the time in each node to log Σ
  • Can be applied to a Suffix Tree built from Suffix
    Array + LCP
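
A small sketch of the list-to-array conversion and the binary search it enables, using Python's standard `bisect` module. The nested `trie_lists` literal is written out by hand for the running example, and `ArrayNode`, `freeze` and `walk` are illustrative names rather than anything from the slides.

```python
from bisect import bisect_left

class ArrayNode:
    def __init__(self, keys, kids):
        self.keys = keys   # sorted edge labels (characters or integers)
        self.kids = kids   # kids[i] is the child reached via keys[i]

    def goto(self, ch):
        # Binary search over the sorted edge labels of this node.
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.kids[i]
        return None

def freeze(child_list):
    """Convert a sorted list of (char, child_list) pairs into arrays."""
    keys = [ch for ch, _ in child_list]
    kids = [freeze(sub) for _, sub in child_list]
    return ArrayNode(keys, kids)

# Sorted lists of sons for P = {he, her, iris, is}, written by hand.
trie_lists = [
    ("h", [("e", [("r", [])])]),
    ("i", [("r", [("i", [("s", [])])]), ("s", [])]),
]
root = freeze(trie_lists)

def walk(node, word):
    """Follow word from node; return the node reached or None."""
    for ch in word:
        node = node.goto(ch)
        if node is None:
            return None
    return node
```

Each array is allocated once with a known size, so the total space stays proportional to the number of trie edges.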

25
The End
26
Algorithm - Combining the two
  • Build a 1-to-1 mapping between separator-marked nodes in
    T^R and trie nodes
  • We compute the mapping through the string
  • For each char in S_P, we keep its Goto node
  • For each suffix tree node, we know what indices
    it represents (in (S_P)^R, and so in S_P)
  • Now, build Failure links atop the trie
  • Like we saw in the example
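
One way to picture the "keep its Goto node for each char" step is sketched below. The separator character `$` and its placement before every pattern are assumptions made for this example, since the transcript does not preserve the exact form of the concatenated string. `map_positions` is a hypothetical helper name.

```python
def map_positions(patterns, sep="$"):
    """Build the trie while recording, for every position i of the
    concatenated string, the trie node reached after reading that char.
    The separator '$' is an assumption of this sketch."""
    goto = [{}]
    labels = [""]          # labels[v] is the string spelled by node v
    chars, node_at = [], []
    for p in patterns:
        chars.append(sep)
        node_at.append(0)  # a separator position maps to the root
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                labels.append(labels[s] + ch)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
            chars.append(ch)
            node_at.append(s)
    return "".join(chars), node_at, labels

s_p, node_at, labels = map_positions(["he", "her", "iris", "is"])
```

Position i of the concatenated string corresponds to position len(s_p) - 1 - i of its reverse, which is how an index known at a suffix tree node can be translated back into a trie node.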

27
Algorithm - Failure Function
  • For each node, find its proper ancestor
  • Closest ancestor marked with a separator
  • Found with a simple preorder traversal
  • The properties of T^R ensure that...
  • For each failure link v1 → v2
  • And their corresponding nodes, u1 and u2
  • u2 is the proper ancestor of u1
  • If we link the trie and T^R, we find the Failure!
  • How will we link them?
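
The preorder step can be sketched on a small hand-built tree. The `children` and `marked` structures below are an assumed, simplified rendering of the reversed-label tree for P = {her, iris, he, is} (nodes named by their reversed labels), not the output of a real suffix tree construction. `nearest_marked_ancestors` is an illustrative name.

```python
def nearest_marked_ancestors(children, marked, root=""):
    """Preorder traversal that records, for every node, its closest
    proper ancestor that is marked (the root if there is none)."""
    result = {}
    stack = [(root, root)]   # (node, nearest marked proper ancestor)
    while stack:
        node, anc = stack.pop()
        result[node] = anc
        below = node if node in marked else anc
        for child in reversed(children.get(node, [])):
            stack.append((child, below))
    return result

# Hand-built tree of reversed labels (an assumption of this sketch).
children = {
    "": ["e", "h", "i", "r", "si"],
    "e": ["eh"],
    "r": ["reh", "ri"],
    "i": ["iri"],
    "si": ["siri"],
}
# Marked nodes: those whose reversed label is a prefix of a pattern.
marked = {"h", "i", "si", "eh", "reh", "iri", "ri", "siri"}

anc = nearest_marked_ancestors(children, marked)
# Reverse the labels back to obtain failure links on the trie.
failure = {u[::-1]: a[::-1] for u, a in anc.items() if u in marked}
```

Here the unmarked nodes e and r are skipped over, so reh and ri fall through to the root, while siri finds si and therefore iris fails to is.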

28
Example - automaton - Goto
Travel along the Goto function, which is a trie
of all patterns
P = {her, their, eye, iris, he, is}
29
Example - T^R
P = {her, their, eye, iris, he, is}
30
Example - T^R and Failure
P = {her, their, eye, iris, he, is}
iris → is; the → he → e; eye → e; their → ir → root
31
T^R - Reversed Suffix Tree
  • We defined S_P = P1 P2 ... Pq
  • We define T^R to be the suffix tree of (S_P)^R
  • This tree has interesting properties
  • Each trie node v is represented by exactly one T^R
    node u, so that Label(v) = Label(u)^R
  • In T^R, a node's label is a prefix of its child's
    label; in the trie, it is a suffix of the
    original
  • A separator-marked node in T^R means that the original
    label is a prefix of a pattern

32
Example - T^R
[Figure: T^R nodes, with the corresponding trie label in parentheses: e (e), r (r), h (h), i (i), si (is), eh (he), reh (her), iri (iri), ri (ir), siri (iris)]
P = {her, iris, he, is}
33
Example - T^R and Failure
  • We took
  • Failure of their → ir (from iris)
  • Largest suffix, which is a prefix of a pattern
  • Their reverse strings are rieht, ri
  • Now a prefix... it's an ancestor in a suffix tree!
  • To be a prefix of a pattern p_x, it should be a
    suffix of the reverse pattern (p_x)^R
  • So it will be in the suffix tree, and end with a separator

P = {her, their, eye, iris, he, is}