A Fast Algorithm for Multi-Pattern Searching

1
A Fast Algorithm for Multi-Pattern Searching
  • Sun Wu, Udi Manber
  • May 1994

2
Basic Idea
  • Boyer-Moore starts by comparing the last
    character of the pattern.
  • Extends the skip idea of Boyer-Moore (the
    bad-character shift) to multiple patterns.
  • Looks at the text in blocks of B characters
    instead of one character at a time.
  • Hash functions and tables are used.

3
The preprocessing stage
  • Compute the minimum length of a pattern, m, and
    consider only the first m chars of each pattern.
  • If there are k patterns, the total size is
    M = k * m.
  • Three tables are built: a SHIFT table, a HASH
    table, and a PREFIX table.

4
Scanning steps
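  • 1. Compute a hash value h based on the current B
    characters from the text (starting with
    t_{m-B+1} ... t_m).
  • 2. Check the value of SHIFT[h]: if it is > 0,
    shift the text by that amount and go back to 1;
    otherwise, go to 3.
  • 3. Compute the hash value of the prefix of the
    text (starting m characters to the left of the
    current position); call it text_prefix.
  • 4. Check for each p, HASH[h] <= p < HASH[h+1],
    whether PREFIX[p] = text_prefix. When
    they are equal, check the actual pattern against
    the text directly.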
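
The four scanning steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes B = 2, uses Python dicts keyed directly by the B-character block in place of hashed arrays, and compares prefix blocks directly instead of comparing their hash values. The names `wu_manber_search` and `buckets` are hypothetical.

```python
def wu_manber_search(text, patterns, B=2):
    """Return sorted (position, pattern) pairs for all occurrences in text."""
    m = min(len(p) for p in patterns)        # minimum pattern length
    if m < B:
        raise ValueError("every pattern must have at least B characters")
    default = m - B + 1                      # shift for blocks in no pattern
    shift = {}                               # block -> shift (stands in for SHIFT)
    buckets = {}                             # suffix block -> [(prefix block, pattern)]
    for p in patterns:
        for q in range(B, m + 1):            # q = 1-based end position of the block
            block = p[q - B:q]
            shift[block] = min(shift.get(block, default), m - q)
        buckets.setdefault(p[m - B:m], []).append((p[:B], p))

    matches = []
    pos = m                                  # current window is text[pos - m:pos]
    while pos <= len(text):
        block = text[pos - B:pos]            # step 1: current B characters
        s = shift.get(block, default)        # step 2: look up the shift
        if s > 0:
            pos += s
            continue
        start = pos - m
        text_prefix = text[start:start + B]  # step 3: prefix of the window
        for prefix, p in buckets.get(block, ()):   # step 4: filter, then verify
            if prefix == text_prefix and text.startswith(p, start):
                matches.append((start, p))
        pos += 1                             # after a zero shift, advance by one
    return sorted(matches)
```

For example, `wu_manber_search("she sells seashells", ["sells", "shell", "sea"])` returns `[(4, 'sells'), (10, 'sea'), (13, 'shell')]`.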

5
SHIFT table
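  • Let B be the size of the block; each string of
    size B over the alphabet is mapped to an index
    into the SHIFT table by a hash function.
  • Let X be the current B characters of the text,
    hashed to entry i.
  • 1. X doesn't appear in any pattern:
    SHIFT[i] = m - B + 1 (not m).
  • 2. X appears in some pattern: SHIFT[i] = m - q,
    where q is the position at which X ends in
    that pattern.
  • Set SHIFT[i] to the minimum value over all
    patterns.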
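
A small worked example may make the two cases concrete. The sketch below assumes B = 2 and uses a dict keyed by the block itself rather than a hashed array; blocks missing from the dict take the default shift m - B + 1, and blocks that occur in a pattern take m - q, following Wu and Manber's paper. The function name `build_shift` is hypothetical.

```python
def build_shift(patterns, B=2):
    """Build a SHIFT table as a dict keyed by B-character blocks."""
    m = min(len(p) for p in patterns)      # minimum pattern length
    default = m - B + 1                    # case 1: block appears in no pattern
    shift = {}
    for p in patterns:
        for q in range(B, m + 1):          # q = 1-based position where the block ends
            block = p[q - B:q]
            # case 2: keep the minimum of m - q over all patterns
            shift[block] = min(shift.get(block, default), m - q)
    return shift, default


shift, default = build_shift(["abcde", "xabcy"])
```

With these two patterns m = 5, so the default shift is 4. The block "ab" ends at position 2 in the first pattern (shift 3) and at position 3 in the second (shift 2), so the table keeps the minimum, 2. Blocks that end a pattern, such as "de" and "cy", get shift 0 and trigger the HASH and PREFIX lookups.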

6
HASH table
  • Uses the same hash function as the SHIFT table.
  • Maps the last B chars of all patterns.
  • HASH[i] contains a pointer p to a list of
    pointers to the patterns whose last B characters
    hash into i.
  • p is also an index into the PREFIX table.

7
PREFIX table
  • Map the first B' chars of all patterns into the
    PREFIX table.
  • Contains the hash value of each prefix of size
    B'.
  • Used to filter out patterns whose suffix is the
    same but whose prefix is different.
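
One way to lay out the HASH and PREFIX tables described above is sketched below, under several assumptions: B' = B = 2, a toy hash function `block_hash` that is not the one used in the paper, and a table of 64 entries. All function names are hypothetical. HASH is stored as an array of start offsets into the sorted pattern pointer list, so the patterns for bucket i are those at indices HASH[i] up to HASH[i+1] - 1.

```python
def block_hash(block, table_size=64):
    """Toy hash for a short block (an assumption, not the paper's function)."""
    h = 0
    for ch in block:
        h = (h * 31 + ord(ch)) % table_size
    return h


def build_hash_prefix(patterns, B=2, table_size=64):
    m = min(len(p) for p in patterns)
    # Pattern pointer list, sorted by the hash of the last B characters
    # (taken from the first m characters of each pattern).
    pat_point = sorted(patterns, key=lambda p: block_hash(p[m - B:m], table_size))
    # PREFIX[p] holds the hash of the first B characters of pat_point[p].
    prefix = [block_hash(p[:B], table_size) for p in pat_point]
    # Count patterns per bucket, then prefix-sum so that hash_table[i]
    # is the index of the first pattern in bucket i.
    hash_table = [0] * (table_size + 1)
    for p in pat_point:
        hash_table[block_hash(p[m - B:m], table_size) + 1] += 1
    for i in range(table_size):
        hash_table[i + 1] += hash_table[i]
    return hash_table, prefix, pat_point


def candidates(window, tables, B=2, table_size=64):
    """Patterns whose suffix and prefix hashes both match an m-character window."""
    hash_table, prefix, pat_point = tables
    h = block_hash(window[-B:], table_size)
    text_prefix = block_hash(window[:B], table_size)
    return [pat_point[p] for p in range(hash_table[h], hash_table[h + 1])
            if prefix[p] == text_prefix]
```

With patterns "peach", "reach", and "pearl", the first two share the suffix block "ch" and land in the same bucket; the PREFIX check then separates them, so a window "peach" yields only "peach" as a candidate for direct comparison.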

8
(Figure: layout of the HASH, SHIFT, and PREFIX tables, showing entries SHIFT[i] and SHIFT[i+1], HASH[i] and HASH[i+1], and the pattern pointer list.)

9
Performance
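  • With block size B = log_c(2M) over an alphabet of
    size c, there are c^B = 2M entries in the SHIFT
    table, constructed in time O(M).
  • It takes O(B) to compute one hash function, so
    the total amount of work in the cases of non-zero
    shifts is O(BN/m) for a text of size N.
  • Assume B' = B; then the amount of work for the case
    of shift value 0 is also O(B), and the expected
    total amount for this step is also O(BN/m).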

10
(No Transcript)
11
A comparison of different search routines on a
15.8MB text
  • Above figure
  • Pattern sizes: ranging from 5 to 15, with average
    size slightly above 6.
  • The original egrep and fgrep cannot handle more
    than a few hundred patterns.

12
(No Transcript)
13
A comparison of running times for different
number of patterns
  • Above figure
  • Running time improves once the number of
    patterns exceeds about 8000.
  • Related to the way greps work rather than to the
    specific algorithm.
  • Agrep (and every other grep) outputs the lines
    that match the query.
  • Above 8000, most lines are matched, so less work
    is needed.

14
(No Transcript)
15
The effect of the minimum pattern length on the
running time
  • Above figure
  • The larger m is, the more chances of shifting
    there are, leading to less work.
  • Match the curve of the function
  • is the average shift values
  • Preprocessing is very fast.
  • E.g., for 10000 patterns: agrep 0.17 seconds,
    GNU grep 0.9 seconds.

16
Additional applications
  • Finding all similar files in a large file system
    normally needs a data structure to handle the
    search.
  • If the data is fetched from disk anyway, we can:
  • Store the records as we obtained them, without
    sorting them.
  • Or put one record together with its
    identifier per line.
17
Additional applications
  • Benefits
  • No need for any additional space for the data
    structure.
  • No need for preprocessing or organizing the data
    structure, e.g. sorting.
  • More flexible search.

18
Additional applications
  • Another application: match-and-replace.
  • Each pattern is associated with a replacement
    pattern.
  • Each pattern that is discovered is replaced in
    the output by its replacement.
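
The match-and-replace idea can be illustrated with a short Python sketch. It assumes Python's standard `re` module as a stand-in for the multi-pattern matcher, so it shows only the pairing of patterns with replacements, not the Wu-Manber scan itself. The function name `match_and_replace` is hypothetical.

```python
import re


def match_and_replace(text, replacements):
    """Replace every occurrence of each key of `replacements` by its value."""
    if not replacements:
        return text
    # Try longer patterns first so that a long pattern wins over a
    # shorter one that starts at the same position.
    keys = sorted(replacements, key=len, reverse=True)
    matcher = re.compile("|".join(re.escape(k) for k in keys))
    # The callback looks up the replacement for whichever pattern matched.
    return matcher.sub(lambda mo: replacements[mo.group(0)], text)
```

For example, `match_and_replace("the colour of the centre", {"colour": "color", "centre": "center"})` returns `"the color of the center"`.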

19
Conclusion
  • Aho and Corasick: a linear-time algorithm, optimal
    in the worst case.
  • Boyer and Moore: a regular string-searching
    algorithm; it can skip a large portion of the
    text, leading to a faster-than-linear algorithm,
    but it is not suitable for multi-pattern searching.

20
Conclusion
  • Wu and Manber: concentrate on typical searches
    rather than on worst-case behavior.
  • This is crucial to making the algorithm
    significantly faster than other algorithms in
    practice.