Suffix Trees and Suffix Arrays - PowerPoint PPT Presentation

Provided by: chiahu
1
Suffix Trees and Suffix Arrays
  • Modern Information Retrieval
  • by R. Baeza-Yates and B. Ribeiro-Neto
  • Addison-Wesley, 1999.
  • (Chapter 8)

2
Introduction
  • Word-based indexing
  • Inverted indices are good for searching words
  • Queries such as phrases are expensive to solve
    using inverted files
  • For word-based applications, inverted files
    perform better
  • Suffix trees and suffix arrays
  • suited to more complex queries

3
Text Suffixes
This is a text. A text has many words. Words are
made from letters.
  • text. A text has many words. Words are made from
    letters.
  • text has many words. Words are made from letters.
  • many words. Words are made from letters.
  • words. Words are made from letters.
  • Words are made from letters.
  • made from letters.
  • letters.
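
The suffix list above can be reproduced with a short Python sketch. It assumes that every word start is an index point (the slide shows only a selection of them) and that offsets are 0-based; the helper name word_suffixes is made up for illustration.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."

def word_suffixes(text):
    """Return (offset, suffix) pairs, one for each word start in the text."""
    # Assumption: a "word" is a run of \w characters; offsets are 0-based.
    return [(m.start(), text[m.start():]) for m in re.finditer(r"\w+", text)]

if __name__ == "__main__":
    for offset, suffix in word_suffixes(TEXT):
        print(offset, suffix)
```

Indexing only word starts keeps the index small; indexing every character position would give one suffix per character instead.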

4
The Suffix Trie and Suffix Tree
5
PAT Trees and PAT Arrays
  • Information Retrieval: Data Structures and
    Algorithms
  • by W.B. Frakes and R. Baeza-Yates (Eds.)
    Englewood Cliffs, NJ: Prentice Hall, 1992.
  • (Chapter 5)

6
PAT Trees and PAT Arrays
  • Problems of traditional IR models
  • Documents and words are assumed.
  • Keywords must be extracted from the text
    (indexing).
  • Queries are restricted to keywords.
  • New indices for text
  • A text is regarded as a long string.
  • Each position corresponds to a semi-infinite
    string (sistring).
  • No structures and no keywords

7
Semi-infinite Strings
  • Example text: Once upon a time, in a far away
    land
  • sistring 1: Once upon a time ...
  • sistring 2: nce upon a time ...
  • sistring 8: on a time, in a ...
  • sistring 11: a time, in a far ...
  • sistring 22: a far away land ...
  • Comparing sistrings: 22 < 11 < 2 < 8 < 1
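
The ordering above can be checked with a small Python sketch. Two assumptions are mine, not the slide's: positions are 1-based, and the comparison ignores case (with plain ASCII ordering the capital "O" of sistring 1 would sort first). The helper name sistring is illustrative.

```python
TEXT = "Once upon a time, in a far away land"

def sistring(text, pos):
    """Semi-infinite string starting at 1-based position pos (cut at end of text)."""
    return text[pos - 1:]

def sort_positions(text, positions):
    """Order positions by their sistrings, compared case-insensitively."""
    return sorted(positions, key=lambda p: sistring(text, p).casefold())

if __name__ == "__main__":
    print(sort_positions(TEXT, [1, 2, 8, 11, 22]))  # [22, 11, 2, 8, 1]
```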

8
PAT Tree
  • PAT tree: a Patricia tree constructed over all the
    possible sistrings of a text
  • Patricia tree
  • a binary digital tree where the individual bits
    of the keys are used to decide on the branching
  • A zero bit causes a branch to the left
    subtree
  • A one bit causes a branch to the right
    subtree
  • each internal node indicates which bit of the
    query is used for branching, given as either
  • an absolute bit position, or
  • a count of the number of bits to skip
  • each external node points to a sistring
  • stored as the integer displacement into the
    original text

9
Example
Text: 01100100010111
  • sistring 1: 01100100010111
  • sistring 2: 1100100010111
  • sistring 3: 100100010111
  • sistring 4: 00100010111
  • sistring 5: 0100010111
  • sistring 6: 100010111
  • sistring 7: 00010111
  • sistring 8: 0010111
  • ...
Figure: PAT tree over the sistrings above.
  • external node: sistring (integer displacement)
  • internal node: skip counter and pointer, labeled
    with the total displacement of the bit to be
    inspected
10
Text: 01100100010111 (sistrings 1 to 8 as on the previous slide)
Figure: PAT trees over these sistrings.
Search: 00101
11
Indexing Points
  • The above example assumes every position in the
    text is indexed, i.e., n external nodes, one for
    each indexed position in the text
  • Word and phrase searches: only sistrings that are
    at the beginning of words are necessary
  • Trade-off between size of the index and search
    requirements

12
Prefix searching
  • Idea: every subtree of the PAT tree has all the
    sistrings with a given prefix.
  • Search time is proportional to the query length:
    exhaust the prefix or descend to an external node.

Search for the prefix 10100 and its answer
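
The subtree idea can be illustrated with a plain, uncompressed binary trie in Python. This is a simplification I am assuming for brevity, not a Patricia tree: it has no skip counters and stores one node per bit. The text and the first eight sistrings come from the earlier example; the function names are made up.

```python
TEXT = "01100100010111"

def build_trie(text, n_positions):
    """Insert sistrings 1..n_positions into a nested-dict binary trie."""
    root = {}
    for start in range(n_positions):
        node = root
        for bit in text[start:]:
            node = node.setdefault(bit, {})
        # "$" marks the end of a sistring and stores its 1-based position.
        node.setdefault("$", []).append(start + 1)
    return root

def collect(node):
    """Gather every position stored in the subtree, left (0) before right (1)."""
    found = list(node.get("$", []))
    for bit in "01":
        if bit in node:
            found.extend(collect(node[bit]))
    return found

def prefix_search(root, prefix):
    """Walk down one node per query bit, then report the whole subtree."""
    node = root
    for bit in prefix:
        if bit not in node:
            return []
        node = node[bit]
    return collect(node)

if __name__ == "__main__":
    trie = build_trie(TEXT, 8)
    print(prefix_search(trie, "100"))  # [6, 3]
```

The descent costs one step per query bit, which is why the search time depends on the query length and not on the size of the text.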
13
Proximity Searching
  • Find all places where s1 is at most a fixed
    (user-given) number of characters away from s2.
  • Example: "in" within 4 characters of "ation"
    matches insulation, international, information
  • Algorithm
  • 1. Search for s1 and s2.
  • 2. Select the smaller of the two answer sets and
    sort it by position.
  • 3. Traverse the unsorted answer set, searching
    for every position in the sorted set and
    checking whether the distance between positions
    satisfies the proximity condition.

Sort + traverse time: m1 log m1 + m2 log m1
(assuming m1 < m2)
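
A minimal Python sketch of the three steps, assuming the two answer sets are already available as lists of text positions and that "distance" means the absolute difference between two positions (the slide's example may instead measure the gap between the end of s1 and the start of s2). The function name proximity is hypothetical.

```python
from bisect import bisect_left

def proximity(pos1, pos2, max_dist):
    """Return (p1, p2) pairs with |p1 - p2| <= max_dist."""
    # Step 2: pick the smaller answer set and sort it.
    swapped = len(pos1) > len(pos2)
    small, large = (pos2, pos1) if swapped else (pos1, pos2)
    small = sorted(small)
    pairs = []
    # Step 3: traverse the larger (unsorted) set, binary-searching the sorted one.
    for p in large:
        i = bisect_left(small, p - max_dist)
        while i < len(small) and small[i] <= p + max_dist:
            pairs.append((p, small[i]) if swapped else (small[i], p))
            i += 1
    return pairs

if __name__ == "__main__":
    print(proximity([10, 50, 90], [13, 70], 4))  # [(10, 13)]
```

Sorting the smaller set costs m1 log m1 and each of the m2 lookups costs log m1, which gives the total stated above.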
14
Range Searching
  • Search for all the strings within a certain
    lexicographical range.
  • Example: the range "abc" .. "acc"
  • abracadabra, acacia: inside the range
  • abacus, acrimonious: outside the range
  • Algorithm
  • Search for each end of the defining interval.
  • Collect all the subtrees between (and including)
    them.
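
The example range can be checked with a short Python sketch. It uses a sorted list of strings and the standard bisect module in place of the PAT tree, and it treats both ends as plain strings (so a key such as "accent", which sorts after "acc", would fall outside); both choices are simplifying assumptions of mine.

```python
from bisect import bisect_left, bisect_right

def range_search(sorted_keys, low, high):
    """Return all keys k with low <= k <= high from a sorted list."""
    start = bisect_left(sorted_keys, low)    # search the lower end
    stop = bisect_right(sorted_keys, high)   # search the upper end
    return sorted_keys[start:stop]           # everything in between

if __name__ == "__main__":
    words = sorted(["abacus", "abracadabra", "acacia", "acrimonious"])
    print(range_search(words, "abc", "acc"))  # ['abracadabra', 'acacia']
```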

15
Longest Repetition Searching
  • The match between two different positions of a
    text where this match is the longest in the
    entire text, e.g., in 0 1 1 0 0 1 0 0 0 1 0 1 1 1
  • The tallest internal node gives a pair of
    sistrings that match for the greatest number of
    characters

Text: 01100100010111
  • sistring 1: 01100100010111
  • sistring 2: 1100100010111
  • sistring 3: 100100010111
  • sistring 4: 00100010111
  • sistring 5: 0100010111
  • sistring 6: 100010111
  • sistring 7: 00010111
  • sistring 8: 0010111
Figure: PAT tree over sistrings 1 to 8.
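
One way to sketch the longest-repetition search in Python is to sort the sistrings and compare neighbours, since the two sistrings with the longest common prefix are always adjacent in sorted order. This swaps the tree traversal for a sorted list, which is my assumption for brevity; positions are 1-based and the function name is made up. Run on the text above with sistrings 1 to 8, it reports the 4-bit match 0010 between sistrings 4 and 8.

```python
TEXT = "01100100010111"

def longest_repetition(text, n_positions):
    """Return (length, pos_a, pos_b) for the longest match between two sistrings."""
    order = sorted(range(1, n_positions + 1), key=lambda p: text[p - 1:])
    best = (0, None, None)
    for a, b in zip(order, order[1:]):
        sa, sb = text[a - 1:], text[b - 1:]
        n = 0
        # Count how many leading characters the two neighbours share.
        while n < min(len(sa), len(sb)) and sa[n] == sb[n]:
            n += 1
        if n > best[0]:
            best = (n, a, b)
    return best

if __name__ == "__main__":
    length, a, b = longest_repetition(TEXT, 8)
    print(length, a, b, TEXT[a - 1:a - 1 + length])  # 4 4 8 0010
```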
16
Most Significant or Most Frequent Matching
  • The most frequently occurring strings within the
    text database
  • e.g., the most frequent trigram
  • To find the most frequent trigram:
  • find the largest subtree at a distance of 3
    characters from the root
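
The size of a subtree at depth 3 equals the number of sistrings that start with that trigram, so the same answer can be sketched in Python by counting n-grams directly. Using collections.Counter instead of a tree is my simplification, and the 14-bit text is borrowed from the earlier example; ties are all reported.

```python
from collections import Counter

TEXT = "01100100010111"

def most_frequent_ngrams(text, n):
    """Return (count, ngrams) for the most frequent substrings of length n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    top = max(counts.values())
    return top, sorted(g for g, c in counts.items() if c == top)

if __name__ == "__main__":
    print(most_frequent_ngrams(TEXT, 3))  # (2, ['001', '010', '011', '100'])
```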

Figure: PAT tree over sistrings 1 to 8.
Annotation: bits 1, 2, 3 are the same for sistrings
100100010111 and 100010111.
17
Building PAT Trees as Patricia Trees (1)
  • Bucketing of external nodes
  • A bucket collects more than one external node
  • A bucket replaces any subtree with size less than
    a certain threshold (b), saving a significant
    number of internal nodes
  • The external nodes inside a bucket do not have
    any structure associated with them, which
    increases the number of comparisons for each search

18
Building PAT Trees as Patricia Trees (2)
  • Mapping the tree onto the disk using super-nodes
  • Advantage: reduces the number of disk accesses and
    saves space
  • Every disk page has a single entry point,
    contains as much of the tree as possible, and
    terminates either in external nodes or in
    pointers to other disk pages
  • The pointers in internal nodes address
    either a disk page or another node inside the
    same page
  • This reduces the storage cost of internal nodes
  • Example
  • Assume a disk page contains on the order of 1,000
    internal/external nodes
  • On average, each disk page then contains about 10
    steps of a root-to-leaf path (log2 of 1,000 is
    about 10)

19
PAT Trees Represented as Arrays
  • External node bucket size, b
  • If the external nodes in a bucket are kept in
    the same relative order as they would be in the
    tree, the bucket can be searched with an
    indirect binary search instead of a sequential
    search

PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
Figure: PAT tree over sistrings 1 to 8; its external
nodes, read left to right, give the PAT array.
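
The PAT array shown here can be reproduced with a few lines of Python, sketched under the assumption that only sistrings 1 to 8 are indexed and positions are 1-based. The naive sort compares whole suffixes and is meant only as an illustration, not as an efficient construction; the function name is made up.

```python
TEXT = "01100100010111"

def pat_array(text, n_positions):
    """Positions 1..n_positions ordered by the sistring that starts at each one."""
    return sorted(range(1, n_positions + 1), key=lambda p: text[p - 1:])

if __name__ == "__main__":
    print(pat_array(TEXT, 8))  # [7, 4, 8, 5, 1, 6, 3, 2]
```

The array stores only positions, so it needs one integer per indexed sistring and no internal nodes.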
20
Searching PAT Trees as Arrays
  • Prefix searching and range searching: do an
    indirect binary search over the array, with the
    results of the comparisons being less than,
    equal, and greater than.
  • Example: search for the prefix 100 and its answer
  • Most frequent, longest repetition
  • Manber and Baeza-Yates (1991)

PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
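
The indirect binary search for the prefix 100 can be sketched in Python as two binary searches, one for each end of the matching block. "Indirect" means each comparison follows an array entry into the text. Positions are assumed to be 1-based and the function names are hypothetical.

```python
TEXT = "01100100010111"
PAT = sorted(range(1, 9), key=lambda p: TEXT[p - 1:])  # [7, 4, 8, 5, 1, 6, 3, 2]

def prefix_search(text, pat, prefix):
    """Return the PAT array entries whose sistrings start with prefix."""
    def head(i):
        # Follow the array entry into the text and read len(prefix) characters.
        start = pat[i] - 1
        return text[start:start + len(prefix)]

    lo, hi = 0, len(pat)
    while lo < hi:                 # first entry with head >= prefix
        mid = (lo + hi) // 2
        if head(mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pat)
    while lo < hi:                 # first entry with head > prefix
        mid = (lo + hi) // 2
        if head(mid) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return pat[first:lo]

if __name__ == "__main__":
    print(prefix_search(TEXT, PAT, "100"))  # [6, 3]
```

A range search works the same way, with the lower bound taken for one end of the range and the upper bound for the other.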
21
Comparisons
  • Signature files
  • Use hashing techniques to produce an index
  • Advantage
  • storage overhead is small (10%-20%)
  • Disadvantages
  • the search time on the index is linear
  • some answers may not match the query, thus
    filtering must be done

22
Comparisons (Continued)
  • Inverted files
  • storage overhead (30%-100%)
  • search time for word searches is logarithmic
  • PAT arrays
  • potential use in other kinds of searches
  • phrases
  • regular expression searching
  • approximate string searching
  • longest repetitions
  • most frequent searching