1
A Fast Algorithm for Multi-Pattern Searching

Sun Wu, Udi Manber
May 1994

2
Basic Idea

Boyer-Moore starts by comparing the last character of the pattern.
Extends the skip idea of Boyer-Moore (the bad-character shift) to multiple patterns.
Looks at the text in blocks of characters instead of one character at a time.
Uses hash functions and tables.

3
The preprocessing stage

Let m be the minimum length of a pattern, and consider only the first m chars of each pattern.
If there are k patterns, the total size is $M = km$.
Three tables are built: a SHIFT table, a HASH table, and a PREFIX table.
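
The truncation step can be sketched in a few lines. This is only an illustration: the function name `truncate_patterns` and the sample patterns are made up for the example and do not come from the slides.

```python
def truncate_patterns(patterns):
    """Return (m, prefixes, M) for a list of patterns.

    m        : length of the shortest pattern
    prefixes : the first m characters of each pattern
    M        : k * m, the total size of the truncated patterns
    """
    if not patterns:
        raise ValueError("need at least one pattern")
    m = min(len(p) for p in patterns)
    prefixes = [p[:m] for p in patterns]
    return m, prefixes, len(prefixes) * m


m, prefixes, M = truncate_patterns(["search", "pattern", "text"])
print(m, prefixes, M)  # 4 ['sear', 'patt', 'text'] 12
```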

4
Scanning steps

1. Compute a hash value h based on the current B characters from the text (starting with $t_{m-B+1} \ldots t_m$).
2. Check the value of SHIFT[h]: if it is > 0, shift the text and go back to 1; otherwise, go to 3.
3. Compute the hash value of the prefix of the text; call it text_prefix.
4. Check for each p, $HASH[h] \le p < HASH[h+1]$, whether PREFIX[p] = text_prefix. When they are equal, check the actual pattern against the text directly.
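
The four steps above can be sketched as a small Python function. This is a simplified illustration under stated assumptions, not the authors' implementation: Python dictionaries keyed by the block itself stand in for the hash function and the SHIFT and HASH tables, the prefix size is taken equal to B, and the name `wu_manber_search` is made up for the example.

```python
from collections import defaultdict


def wu_manber_search(text, patterns, B=2):
    """Return sorted (start, pattern) pairs for every occurrence in text."""
    m = min(len(p) for p in patterns)
    if m < B:
        raise ValueError("every pattern needs at least B characters")
    default = m - B + 1                  # shift for blocks seen in no pattern

    # Preprocessing (assumption: dicts replace the hashed tables).
    shift = {}
    suffix_table = defaultdict(list)     # stands in for HASH
    for idx, p in enumerate(patterns):
        for q in range(B, m + 1):        # q = 1-based end position of the block
            block = p[q - B:q]
            shift[block] = min(shift.get(block, default), m - q)
        suffix_table[p[m - B:m]].append(idx)
    prefix = [p[:B] for p in patterns]   # stands in for PREFIX

    matches = []
    pos = m                              # exclusive end of the current window
    while pos <= len(text):
        block = text[pos - B:pos]                    # step 1
        s = shift.get(block, default)                # step 2
        if s > 0:
            pos += s
            continue
        start = pos - m
        text_prefix = text[start:start + B]          # step 3
        for idx in suffix_table[block]:              # step 4
            if prefix[idx] == text_prefix and text.startswith(patterns[idx], start):
                matches.append((start, patterns[idx]))
        pos += 1
    return sorted(matches)


print(wu_manber_search("the cat sat on the mat", ["cat", "mat", "the"]))
# [(0, 'the'), (4, 'cat'), (15, 'the'), (19, 'mat')]
```

Patterns longer than m are handled by the final direct comparison, which checks the whole pattern and not only its first m characters.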

5
SHIFT table

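Let B be the size of the block; each string of size B over the alphabet is mapped to an index into the SHIFT table by a hash function.
Let X be the block of B text characters currently examined, mapped to entry i.
1. X does not appear in any pattern: SHIFT[i] = m - B + 1 (not m).
2. X appears in some patterns: SHIFT[i] = m - q, where q is the position at which X ends in some pattern.
Set SHIFT[i] to the minimum such value over all patterns.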
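
A minimal sketch of the SHIFT construction follows. It assumes a Python dictionary keyed by the block itself in place of the hash function; blocks that appear in no pattern are not stored and get the default shift m - B + 1 at lookup time. The name `build_shift` and the sample patterns are illustrative only.

```python
def build_shift(patterns, B=2):
    """Return (shift, default, m) for the first m characters of each pattern."""
    m = min(len(p) for p in patterns)
    default = m - B + 1                   # case 1: block appears in no pattern
    shift = {}
    for p in patterns:
        for q in range(B, m + 1):         # q = 1-based position where the block ends
            block = p[q - B:q]
            # case 2: keep the minimum of m - q over all patterns
            shift[block] = min(shift.get(block, default), m - q)
    return shift, default, m


shift, default, m = build_shift(["abcde", "bcdxy"])
print(shift)    # {'ab': 3, 'bc': 2, 'cd': 1, 'de': 0, 'dx': 1, 'xy': 0}
print(default)  # 4
```

In the example, the block "bc" ends at position 3 in the first pattern and at position 2 in the second, so the smaller shift, 5 - 3 = 2, is kept.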

6
HASH table

Uses the same hash function as the SHIFT table.
Maps the last B chars of all patterns.
HASH[i] contains a pointer to a list of pointers to the patterns whose last B characters hash into i.
The same value is also an index into the PREFIX table.
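
The grouping of patterns by their last B characters can be sketched as below. The assumptions are that a dictionary of lists replaces the pointer-based layout, and that "last B characters" means the last B of the first m characters, since only the first m characters of each pattern are considered. The name `build_hash` is made up for the example.

```python
from collections import defaultdict


def build_hash(patterns, B=2):
    """Map each suffix block to the indices of the patterns that end with it."""
    m = min(len(p) for p in patterns)
    table = defaultdict(list)
    for idx, p in enumerate(patterns):
        table[p[m - B:m]].append(idx)     # last B chars of the first m chars
    return dict(table)


print(build_hash(["abcde", "xxcde", "bcdxyz"]))
# {'de': [0, 1], 'xy': [2]}
```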

7
PREFIX table

Maps the first B' chars of all patterns into the PREFIX table.
Contains the hash value of each prefix of size B'.
Used to filter out patterns whose suffix is the same but whose prefix is different.
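
The prefix filter can be illustrated with a toy hash. Both `block_hash` and `filter_by_prefix` are hypothetical names, and the hash is a simple stand-in rather than the one used by the authors.

```python
def block_hash(s):
    """Toy hash of a short block (an assumption, not the authors' function)."""
    h = 0
    for ch in s:
        h = (h << 5) + ord(ch)
    return h


def filter_by_prefix(patterns, candidates, window, Bp=2):
    """Keep the candidates whose prefix hash equals that of the text window."""
    prefix_table = [block_hash(p[:Bp]) for p in patterns]   # the PREFIX table
    text_prefix = block_hash(window[:Bp])
    return [i for i in candidates if prefix_table[i] == text_prefix]


patterns = ["abcde", "xxcde"]           # both end in "de"
print(filter_by_prefix(patterns, [0, 1], "xxcde"))  # [1]
```

Only the surviving candidates need the full character-by-character comparison against the text.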

8
(Diagram: the SHIFT table with entries SHIFT[i] and SHIFT[i+1], the HASH table with entries HASH[i] and HASH[i+1], the PREFIX table, and the pattern pointer list.)

9
Performance

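There are $c^B$ entries in the SHIFT table, where $c$ is the alphabet size; the table is constructed in time $O(M)$.
It takes $O(B)$ to compute one hash function, so the total amount of work in the cases of non-zero shifts is $O(BN/m)$, where $N$ is the text length.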
Assume $B' = B$; then the amount of work for the case of shift value 0 is also $O(B)$, and the expected total amount for this step is also $O(BN/m)$.
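
A rough way to see where a bound of the form $O(BN/m)$ comes from, under an assumption not stated on this slide: that the expected shift is at least $m/2$. Here $N$ is the text length, $m$ the minimum pattern length, and $B$ the block size.

```latex
\[
\underbrace{\frac{N}{m/2}}_{\text{positions examined}}
\times
\underbrace{O(B)}_{\text{one hash per position}}
= O\!\left(\frac{2BN}{m}\right)
= O\!\left(\frac{BN}{m}\right)
\]
```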

10
(No Transcript)
11
A comparison of different search routines on a
15.8MB text

For the figure above:
Pattern sizes range from 5 to 15, with the average size slightly above 6.
The original egrep and fgrep cannot handle more than a few hundred patterns.

12
(No Transcript)
13
A comparison of running times for different
number of patterns

For the figure above:
Running time improves once the number of patterns exceeds about 8000.
This is related to the way greps work rather than to the specific algorithm.
Agrep (and every other grep) outputs the lines that match the query.
Above 8000 patterns, most lines are matched, so less work is needed.

14
(No Transcript)
15
The effect of the minimum pattern length on the
running time

For the figure above:
The larger m is, the more chances of shifting there are, leading to less work.
The running time matches the curve of a function of the average shift value.
Preprocessing is very fast.
Example: for 10000 patterns, agrep takes 0.17 seconds and GNU grep takes 0.9 seconds.

16
Additional applications

Finding all similar files in a large file system needs a data structure to handle the queries.
If the data is fetched from disk anyway, we can store the records as we obtain them, without sorting them,
or put each record together with its identifier on one line.

17
Additional applications

Benefits
No need for any additional space for the data
structure.
No need for preprocessing or organizing the data
structure, e.g. sorting.
More flexible search.

18
Additional applications

Another application: match-and-replace.
Each pattern is associated with a replacement pattern.
Each discovered pattern is replaced in the output by its replacement.
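
A small sketch of match-and-replace with a pattern-to-replacement table. To keep it short, the matching is done with an alternation built by Python's `re` module as a stand-in for the multi-pattern search; it does not use the SHIFT, HASH, or PREFIX tables. The name `replace_all` is made up for the example.

```python
import re


def replace_all(text, replacements):
    """Replace every occurrence of each key of `replacements` by its value."""
    if not replacements:
        return text
    # Longest patterns first, so a longer pattern wins over its own prefix.
    ordered = sorted(replacements, key=len, reverse=True)
    matcher = re.compile("|".join(re.escape(p) for p in ordered))
    return matcher.sub(lambda mo: replacements[mo.group(0)], text)


print(replace_all("the cat sat", {"cat": "dog", "sat": "stood"}))
# the dog stood
```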

19
Conclusion

Aho and Corasick: a linear-time algorithm, optimal in the worst case.
Boyer and Moore: a regular (single-pattern) string-searching algorithm that can skip a large portion of the text, leading to a faster-than-linear algorithm, but not suitable for multiple patterns.

20
Conclusion

Wu and Manber: concentrate on typical searches rather than on worst-case behavior.
This is crucial to making the algorithm significantly faster than other algorithms in practice.
