Pattern Matching Algorithms: An Overview

Provided by: sciBrookl

Learn more at: http://www.sci.brooklyn.cuny.edu

1
Pattern Matching Algorithms An Overview

Shoshana Neuburger
The Graduate Center, CUNY
9/15/2009

2
Overview

Pattern Matching in 1D
Dictionary Matching
Pattern Matching in 2D
Indexing
Suffix Tree
Suffix Array
Research Directions

3
What is Pattern Matching?

Given a pattern and text,
find the pattern in the text.

4
What is Pattern Matching?

$\Sigma$ is an alphabet.
Input:
Text $T = t_1 t_2 \cdots t_n$
Pattern $P = p_1 p_2 \cdots p_m$
Output:
All $i$ such that $t_i t_{i+1} \cdots t_{i+m-1} = P$

5
Pattern Matching - Example

Input: P = cagc, $\Sigma = \{a, g, c, t\}$
T = acagcatcagcagctagcat

Output: {2, 8, 11}

acagcatcagcagctagcat
 ^     ^  ^
 2     8  11
6
Pattern Matching Algorithms

Naïve Approach
Compare pattern to text at each location.
O(mn) time.
More efficient algorithms utilize information
from previous comparisons.
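
The naïve approach can be sketched in a few lines of Python. This is only an illustration: the function name `naive_search` is an assumption, and positions are reported 1-indexed to agree with the example on the previous slide.

```python
def naive_search(text, pattern):
    """Return all 1-indexed positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        # Compare the whole pattern against the text window starting at i.
        # Each comparison costs up to m character checks, hence O(mn) overall.
        if text[i:i + m] == pattern:
            hits.append(i + 1)  # the slides count positions from 1
    return hits


print(naive_search("acagcatcagcagctagcat", "cagc"))  # [2, 8, 11]
```

Nothing learned from one window is reused in the next, which is the waste the linear-time methods remove.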

7
Pattern Matching Algorithms

Linear-time methods have two stages:
preprocess pattern in O(m) time and space.
scan text in O(n) time and space.
Knuth, Morris, Pratt (1977): automata method
Boyer, Moore (1977): can be sublinear

8
KMP Automaton
P = ababcb
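
The automaton for this pattern can be represented by a failure table. The following is a minimal sketch of the standard KMP failure-table formulation, assuming a non-empty pattern; the names `failure_table` and `kmp_search` are illustrative and do not come from the slides.

```python
def failure_table(pattern):
    """fail[q] = length of the longest proper border of pattern[:q + 1]."""
    fail = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for q in range(1, len(pattern)):
        while k > 0 and pattern[q] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[q] == pattern[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_search(text, pattern):
    """Return all 1-indexed occurrences of a non-empty pattern in text."""
    fail = failure_table(pattern)
    hits, k = [], 0  # k = number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # follow failure links on a mismatch
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # convert 0-indexed end to 1-indexed start
            k = fail[k - 1]  # keep going to find overlapping occurrences
    return hits


print(failure_table("ababcb"))  # [0, 0, 1, 2, 0, 0]
```

Each text character is read once and the matched length only decreases through failure links, which gives the O(n) scan.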
9
Dictionary Matching

$\Sigma$ is an alphabet.
Input:
Text $T = t_1 t_2 \cdots t_n$
Dictionary of patterns $D = \{P_1, P_2, \ldots, P_k\}$
All characters in patterns and text belong to
$\Sigma$.
Output:
All $(i, j)$ such that $t_i t_{i+1} \cdots t_{i+m_j-1} = P_j$,
where $m_j = |P_j|$

10
Dictionary Matching Algorithms

Naïve Approach
Use an efficient pattern matching algorithm for
each pattern in the dictionary.
O(kn) time.
More efficient algorithms process text once.

11
AC Automaton

Aho and Corasick extended the KMP automaton to
dictionary matching.
Preprocessing time: $O(d)$, where $d$ is the total length of the patterns.
Matching time: $O(n \log |\Sigma| + k)$.
Independent of dictionary size!

12
AC Automaton

D = {ab, ba, bab, babb, bb}
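
A compact sketch of an Aho-Corasick automaton for this dictionary follows. It uses Python dictionaries for the goto function, so it illustrates the structure (trie, failure links, output sets) rather than the exact complexity bounds quoted earlier. The names `build_ac` and `ac_search` are assumptions made for this example.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables for a list of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:  # insert the pattern into the trie
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit patterns ending here
            queue.append(t)
    return goto, fail, out


def ac_search(text, patterns):
    """Return sorted (i, j): pattern j (1-indexed) starts at text position i."""
    goto, fail, out = build_ac(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, idx + 1))
    return sorted(hits)


D = ["ab", "ba", "bab", "babb", "bb"]
print(ac_search("babb", D))  # [(1, 2), (1, 3), (1, 4), (2, 1), (3, 5)]
```

The text is scanned once regardless of how many patterns the dictionary holds; only the number of reported occurrences adds to the scan time.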

13
Dictionary Matching

The KMP automaton does not depend on alphabet size,
while the AC automaton does, because its states branch.
Dori, Landau (2006): AC automaton is built in
linear time for integer alphabets.
Breslauer (1995): eliminates log factor in text
scanning stage.

14
Periodicity

A crucial task in the preprocessing stage of most
pattern matching algorithms:
computing periodicity.
Many forms:
failure table
witnesses

15
Periodicity

A periodic pattern can be superimposed on itself
without mismatch before its midpoint.
Why is periodicity useful?
Can quickly eliminate many candidates for
pattern occurrence.

16
Periodicity

Definition
S is periodic if $S = u' u^k$, where $k \ge 2$ and $u'$ is a proper
suffix of $u$.
S is periodic if its longest prefix that is also
a suffix has length at least $|S|/2$.
The shortest period corresponds to the longest
border.

17
Periodicity - Example

S = abcabcabcab, $|S| = 11$
Longest border of S: b = abcabcab
$|b| = 8 \ge |S|/2$, so S is periodic.
Shortest period of S: abc
$|abc| = 3 \le |S|/2$, so S is periodic.
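
The numbers in this example can be checked with a short sketch that derives the longest border from the KMP failure table. The function names are illustrative, and non-empty strings are assumed.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix."""
    fail = [0] * len(s)
    k = 0
    for q in range(1, len(s)):
        while k > 0 and s[q] != s[k]:
            k = fail[k - 1]
        if s[q] == s[k]:
            k += 1
        fail[q] = k
    return fail[-1] if s else 0


def shortest_period(s):
    # The shortest period and the longest border add up to |s|.
    return len(s) - longest_border(s)


def is_periodic(s):
    # Periodic in the sense of the slides: the border covers at least half of s.
    return 2 * longest_border(s) >= len(s)


S = "abcabcabcab"
print(longest_border(S), shortest_period(S), is_periodic(S))  # 8 3 True
```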

18
Witnesses

Popular paradigm in pattern matching:
find consistent candidates
verify candidates
consistent candidates $\Rightarrow$ verification is linear

19
Witnesses

Vishkin introduced the duel to choose between two
candidates by checking the value of a witness.
Alphabet-independent method.

20
Witnesses

Preprocess pattern
Compute witness for each location of
self-overlap.
Size of witness table
, if P is periodic,
, otherwise.

21
Witnesses

$WIT[i]$ = any $k$ such that $P[k] \ne P[k-i+1]$.
$WIT[i] = 0$, if there is no such $k$.
$k$ is a witness against $i$ being a period of $P$.
[Figure: example pattern and its witness table]
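
A naïve quadratic-time sketch of this definition is shown below. It is not the linear-time preprocessing used in practice; it only makes the indexing concrete. It assumes 1-indexed positions as in the definition above and picks the smallest valid k for each i; the name `witness_table` is hypothetical.

```python
def witness_table(pattern):
    """Return {i: WIT[i]} for i = 2..m, using 1-indexed pattern positions."""
    m = len(pattern)
    p = " " + pattern  # pad so that p[1] is the first pattern character
    wit = {}
    for i in range(2, m + 1):
        wit[i] = 0  # 0 means no witness: the pattern overlaps itself at i
        for k in range(i, m + 1):
            if p[k] != p[k - i + 1]:
                wit[i] = k  # the first mismatch serves as the witness
                break
    return wit


print(witness_table("abab"))  # {2: 2, 3: 0, 4: 4}
```

For `abab`, the entry for i = 3 is 0 because the pattern placed at position 3 agrees with itself on the whole overlap.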

22
Witnesses

Let $j > i$.
Candidates $i$ and $j$ are consistent if
they are sufficiently far from each other
OR
$WIT[j-i] = 0$.

23
Duel

Scan text
If pair of candidates is close and inconsistent,
perform duel to eliminate one (or both).
Sufficient to identify pairwise consistent
candidates: consistency of positions is transitive.

[Figure: duel between candidates i and j in text T, decided at the witness position]
24
2D Pattern Matching
MRI

$\Sigma$ is an alphabet.
Input:
Text $T[1 \ldots n, 1 \ldots n]$
Pattern $P[1 \ldots m, 1 \ldots m]$
Output:
All $(i, j)$ such that $T[i \ldots i+m-1,\ j \ldots j+m-1] = P$

25
2D Pattern Matching - Example

Input: $\Sigma = \{A, B\}$

Pattern

A B A
A B A
A A B

Text

A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B

Output: (1,4), (2,2), (4,3)
26
Bird / Baker

First linear-time 2D pattern matching algorithm.
View each pattern row as a metacharacter to
linearize problem.
Convert 2D pattern matching to 1D.

27
Bird / Baker

Preprocess pattern
Name rows of pattern using AC automaton.
Using names, pattern has 1D representation.
Construct KMP automaton of pattern.
Identical rows receive identical names.

28
Bird / Baker

Scan text
Name positions of text that match a row of
pattern, using AC automaton within each row.
Run KMP on named columns of text.
Since the 1D names are unique, only one name can
be given to a text location.

29
Bird / Baker - Example

Preprocess pattern
Name rows of pattern using AC automaton.
Using names, pattern has 1D representation.
Construct KMP automaton of pattern.

Pattern row   Name
A B A         1
A B A         1
A A B         2
30
Bird / Baker - Example

Scan text
Name positions of text that match a row of
pattern, using AC automaton within each row.
Run KMP on named columns of text.

Text              Names
A A B A B A A     0 0 2 1 0 1 0
B A B A B A B     0 0 0 1 0 1 0
A A B A A B B     0 0 2 1 0 2 0
B A A B A A A     0 0 0 2 1 0 0
A B A B A A A     0 0 1 0 1 0 0
B B A A B A B     0 0 0 0 2 1 0
B B B A B A B     0 0 0 0 0 1 0
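
The two stages can be sketched as follows. This is a simplification, not the Bird / Baker algorithm itself: row naming uses a dictionary lookup on each text window instead of an AC automaton, so it does not achieve linear time. The column stage runs a KMP automaton over the names as described. The function names are assumptions, and all rows of the pattern are assumed to have the same length.

```python
def failure_table(seq):
    """KMP failure table for any sequence (here: a list of row names)."""
    fail, k = [0] * len(seq), 0
    for q in range(1, len(seq)):
        while k > 0 and seq[q] != seq[k]:
            k = fail[k - 1]
        if seq[q] == seq[k]:
            k += 1
        fail[q] = k
    return fail


def bird_baker_sketch(text, pattern):
    """Return sorted 1-indexed (row, col) top-left corners of occurrences."""
    m, w = len(pattern), len(pattern[0])
    names = {}
    for row in pattern:  # identical rows receive identical names
        names.setdefault(row, len(names) + 1)
    named_pattern = [names[row] for row in pattern]
    fail = failure_table(named_pattern)
    n_cols = len(text[0])
    state = [0] * n_cols  # one KMP state per text column
    hits = []
    for r, row in enumerate(text):
        for c in range(w - 1, n_cols):
            # Name of the pattern row ending at (r, c); 0 if none matches.
            name = names.get(row[c - w + 1:c + 1], 0)
            k = state[c]
            while k > 0 and name != named_pattern[k]:
                k = fail[k - 1]
            if name == named_pattern[k]:
                k += 1
            if k == m:  # all m row names matched down this column
                hits.append((r - m + 2, c - w + 2))
                k = fail[k - 1]
            state[c] = k
    return sorted(hits)


T = ["AABABAA", "BABABAB", "AABAABB", "BAABAAA",
     "ABABAAA", "BBAABAB", "BBBABAB"]
P = ["ABA", "ABA", "AAB"]
print(bird_baker_sketch(T, P))  # [(1, 4), (2, 2), (4, 3)]
```

The result agrees with the output of the earlier 2D example, and the intermediate names are the ones shown in the table above.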
31
Bird / Baker

Complexity of Bird / Baker algorithm
time and space.
Alphabet-dependent.
Real-time since scans text characters once.
Can be used for dictionary matching:
replace KMP with AC automaton.

32
2D Witnesses

Amir et al.: a 2D witness table can be used for
linear time and space alphabet-independent 2D
matching.
The order of duels is significant.
Duels are performed in 2 waves over text.

33
Indexing

Index text
Suffix Tree
Suffix Array
Find pattern in O(m) time
Useful paradigm when text will be searched for
several patterns.

34
Suffix Trie
T = banana
[Figure: suffix trie of T with leaves suf1 to suf7]

One leaf per suffix.
An edge represents one character.
Concatenation of edge-labels on the path from the
root to leaf i spells the suffix that starts at
position i.
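
A suffix trie can be sketched with nested dictionaries. The example assumes that a terminator character `$` is appended to T, which would account for the seven leaves named on this slide; the function names and the `"leaf"` marker key are illustrative choices.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:  # one edge per character
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1  # leaf i spells the suffix starting at position i
    return root


def contains(trie, pattern):
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def leaf_labels(node):
    """Collect the suffix numbers stored at the leaves below node."""
    labels = []
    for key, child in node.items():
        if key == "leaf":
            labels.append(child)
        else:
            labels.extend(leaf_labels(child))
    return labels


trie = build_suffix_trie("banana$")
print(sorted(leaf_labels(trie)))  # [1, 2, 3, 4, 5, 6, 7]
print(contains(trie, "nan"), contains(trie, "nab"))  # True False
```

The trie can hold a quadratic number of nodes, which is the motivation for the compact suffix tree on the next slide.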

35
Suffix Tree
T = banana
[Figure: suffix tree of T with leaves suf1 to suf7; edges are labeled with index pairs into T, such as [1,7], [2,2], [3,4], [5,7] and [7,7]]


Compact representation of trie.
A node with one child is merged with its
parent.
Up to n internal nodes.
O(n) space by using indices to label edges

36
Suffix Tree Construction

Naïve Approach: $O(n^2)$ time
Linear-time algorithms:

Author         Date   Innovation                                                     Scan Direction
Weiner         1973   First linear-time algorithm, alphabet-dependent suffix links   Right to left
McCreight      1976   Alphabet-independent suffix links, more efficient              Left to right
Ukkonen        1995   Online linear-time construction, represents current end        Left to right
Amir and Nor   2008   Real-time construction                                         Left to right
37
Suffix Tree Construction

Linear-time suffix tree construction algorithms
rely on suffix links to facilitate traversal of
tree.
A suffix link is a pointer from a node labeled xS
to a node labeled S, where x is a character and S is a
possibly empty substring.
Alphabet-dependent suffix links point from a node
labeled S to a node labeled xS, for each
character x.

38
Index of Patterns

Can answer Lowest Common Ancestor (LCA) queries
in constant time if preprocess tree accordingly.
In suffix tree, LCA corresponds to Longest Common
Prefix (LCP) of strings represented by leaves.

39
Index of Patterns

To index several patterns:
Concatenate patterns with unique characters
separating them and build suffix tree.
Problem: inserts meaningless suffixes that span
several patterns.
OR
Build generalized suffix tree: a single structure
for suffixes of individual patterns.
Can be constructed with Ukkonen's algorithm.

40
Suffix Array

The Suffix Array stores lexicographic order of
suffixes.
More space efficient than suffix tree.
Can locate all occurrences of a substring by
binary search.
With Longest Common Prefix (LCP) array can
perform even more efficient searches.
LCP array stores longest common prefix between
two adjacent suffixes in suffix array.

41
Suffix Array

Sort suffixes alphabetically:

Text order                  Sorted order
Index  Suffix               Index  Suffix        LCP
1      mississippi          11     i             0
2      ississippi           8      ippi          1
3      ssissippi            5      issippi       1
4      sissippi             2      ississippi    4
5      issippi              1      mississippi   0
6      ssippi               10     pi            0
7      sippi                9      ppi           1
8      ippi                 7      sippi         0
9      ppi                  4      sissippi      2
10     pi                   6      ssippi        1
11     i                    3      ssissippi     3
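
The table can be reproduced with a naïve construction that sorts the suffixes directly and compares adjacent ones. This is a sketch for illustration only, far slower than the construction algorithms discussed later; the function names are assumptions.

```python
def suffix_array(text):
    """1-indexed start positions of the suffixes in lexicographic order."""
    # Naive: sorting compares whole suffixes, so this is not linear time.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp_array(text, sa):
    """lcp[r] = longest common prefix of the suffixes at ranks r - 1 and r."""
    def common_prefix(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    lcps = [0]  # the first suffix has no predecessor
    for prev, cur in zip(sa, sa[1:]):
        lcps.append(common_prefix(text[prev - 1:], text[cur - 1:]))
    return lcps


T = "mississippi"
sa = suffix_array(T)
print(sa)                # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(T, sa))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```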
42
Suffix array

T = mississippi
[Figure: suffix array of T annotated with LCP values]
43
Search in Suffix Array

$O(m \log n)$
Idea: two binary searches
- search for leftmost position of X
- search for rightmost position of X
In between are all suffixes that begin with X.
With LCP array: $O(m + \log n)$ search.
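
The two binary searches can be sketched as below, without the LCP refinement, so each step compares up to m characters. The suffix array is built naïvely to keep the example self-contained; the function names are illustrative.

```python
def suffix_array(text):
    """Naive construction: 1-indexed suffix starts in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def sa_search(text, sa, pattern):
    """Return the sorted 1-indexed occurrences of pattern in text."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + m]  # first m characters of that suffix

    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])  # ranks left..lo-1 all begin with pattern


T = "mississippi"
print(sa_search(T, suffix_array(T), "issi"))  # [2, 5]
```

Once the index exists, every new pattern costs only the two searches, which is why indexing pays off when the same text is queried many times.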

44
Suffix Array Construction

Naïve Approach: $O(n^2)$ time
Indirect Construction:
preorder traversal of suffix tree
LCA queries for LCP.
Problem: does not achieve better space efficiency.

45
Suffix Array Construction

Direct construction algorithms
LCP array construction: range-minima queries.

Author                   Date   Complexity   Innovation
Manber, Myers            1993   O(n log n)   Sort and search, KMR renaming
Karkkainen and Sanders   2003   O(n)         Linear-time
Ko and Aluru             2003   O(n)         Linear-time
Kim et al.               2003   O(n)         Linear-time
46
Compressed Indices

Suffix Tree: O(n) words = O(n log n) bits
Compressed suffix tree
Grossi and Vitter (2000)
O(n) space.
Sadakane (2007)
$O(n \log |\Sigma|)$ space.
Supports all suffix tree operations efficiently.
Slowdown of only polylog(n).

47
Compressed Indices

Suffix array is an array of n indices, which is
stored in
O(n) words = O(n log n) bits
Compressed Suffix Array (CSA)
Grossi and Vitter (2000)
$O(n \log |\Sigma|)$ bits
access time increased from $O(1)$ to $O(\log^{\epsilon} n)$
Sadakane (2003)
Pattern matching as efficient as in uncompressed
SA.
$O(n \log H_0)$ bits
Compressed self-index

48
Compressed Indices

FM index
Ferragina and Manzini (2005)
Self-indexing data structure
First compressed suffix array that respects the
high-order empirical entropy
Size relative to compressed text length.
Improved by Navarro and Makinen (2007)

49
Dynamic Suffix Tree

Dynamic Suffix Tree
Choi and Lam (1997)
Strings can be inserted or deleted efficiently.
Update time proportional to string
inserted/deleted.
No edges labeled by a deleted string.
Two-way pointer for each edge, which can be done
in space linear in the size of the tree.

50
Dynamic Suffix Array

Dynamic Suffix Array
Recent work by Salson et al.
Can update suffix array after construction if
text changes.
More efficient than rebuilding suffix array.
Open problems
Worst case O(n log n).
No online algorithm yet.

51
Word-Based Index

Text of size n contains k distinct words.
Index a subset of positions that correspond to
word beginnings.
With O(n) working space can index entire text and
discard unnecessary positions.
Desired complexity:
O(k) space.
Will always need O(n) time.
Problem: missing suffix links.

52
Word-Based Suffix Tree

Construction Algorithms

Author                   Date   Results
Karkkainen and Ukkonen   1996   O(n) time and O(n/j) space construction of sparse suffix tree (every jth suffix)
Andersson et al.         1999   Expected linear-time and k-space construction of word-based suffix tree for k words.
Inenaga and Takeda       2006   Online, O(n) time and k-space construction of word-based suffix tree for k words.
53
Word-Based Suffix Array

Ferragina and Fischer (2007): word-based suffix
array construction algorithm
Time and space optimal construction.
Computation of word-based LCP array in O(n) time
and O(k) space.
Alternative algorithm for construction of
word-based suffix tree.
Searching as efficient as ordinary suffix array.

54
Research Directions

Problems we are considering
Small space dictionary matching.
Time-space optimal 2D compressed dictionary
matching algorithm.
Compressed parameterized matching.
Self-indexing word-based data structure.
Dynamic suffix array in O(n) construction time.

55
Small-Space

Applications arise in which storage space is
limited.
Many innovative algorithms exist for single
pattern matching using small additional space
Galil and Seiferas (1981) developed first
time-space optimal algorithm for pattern
matching.
Rytter (2003) adapted the KMP algorithm to work
in O(1) additional space, O(n) time.

56
Research Directions

Fast dictionary matching algorithms exist for 1D
and 2D. They achieve expected sublinear time.
No deterministic dictionary matching method
works in linear time and small space.
We believe that recent results in compressed
self-indexing will facilitate the development of
a solution to the small space dictionary matching
problem.

57
Compressed Matching

Data is compressed to save space.
Lossless compression schemes can be reversed
without loss of data.
Pattern matching cannot be done in compressed
text: pattern can span a compressed character.
LZ78 data can be uncompressed in time and space
proportional to the uncompressed data.
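
A small sketch of textbook 1D LZ78 illustrates both points. It is not the 2D compressed matching method discussed on the following slides, and the function names are assumptions. Each output pair refers to an earlier phrase by index and adds one character, so a pattern occurrence can straddle several pairs.

```python
def lz78_compress(text):
    """Encode text as (phrase index, next character) pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    phrase, pairs = "", []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known phrase
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # leftover phrase already in the dictionary
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decompress(pairs):
    """Rebuild the text; total phrase length equals the output length."""
    phrases = [""]
    out = []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch  # earlier phrase plus one new character
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_compress("abababab")
print(pairs)  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (0, 'b')]
print(lz78_decompress(pairs))  # abababab
```

In this example the occurrence of `ba` at position 2 spans the second and third pairs, so it cannot be found by inspecting any single pair.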

58
Research Directions

Amir et al. (2003) devised an algorithm for 2D
LZ78 compressed matching.
They define strongly inplace as a criterion for
the algorithm: the extra space is
proportional to the optimal compression of all
strings of the given length.
We are seeking a time-space optimal solution to
2D compressed dictionary matching.