Modern Information Retrieval - PowerPoint PPT Presentation

Slides: 23
Provided by: Ken124

Transcript and Presenter's Notes

1
Modern Information Retrieval
  • Chapter 8 Indexing and Searching

2
  • It is worthwhile building and maintaining an
    index when the text collection is large and
    semi-static
  • semi-static: not often updated
  • consider search cost, space overhead,
    construction cost, and maintenance cost

3
  • Inverted file
  • a word-oriented index
  • vocabulary: the set of all different words in the
    text
  • occurrences: lists of the text positions where
    the words appear
  • the positions can refer to words or characters
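
The structure above can be sketched in a few lines of Python. This is only an illustration: the sample text, the regex tokenizer, and the helper name build_inverted_index are assumptions, and positions are 0-based character offsets.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each distinct word (the vocabulary) to the sorted list of
    character offsets where it starts (its occurrences)."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    return dict(index)

# Illustrative sample text (not taken from the slides).
text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)
vocabulary = sorted(index)   # all distinct words
text_hits = index["text"]    # offsets where "text" starts
```

Storing word numbers instead of character offsets only requires replacing match.start() with a running word counter.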

4

5
  • the space required for the vocabulary is rather
    small while the occurrences demand much more
    space
  • between 30% and 40% of the text size
  • block addressing reduces space overhead to 5%

6
  • if the exact occurrence positions are required,
    an online search over the qualifying blocks has
    to be performed
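
A minimal sketch of block addressing, assuming fixed-size character blocks and ASCII text. The index stores only block numbers, and exact offsets are recovered by scanning the qualifying blocks. The helper names and the block size are hypothetical.

```python
import re
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the sorted block numbers in which it starts."""
    index = defaultdict(set)
    for match in re.finditer(r"[a-z0-9]+", text.lower()):
        index[match.group()].add(match.start() // block_size)
    return {word: sorted(blocks) for word, blocks in index.items()}

def exact_positions(text, block_index, word, block_size):
    """Online search: scan only the qualifying blocks for whole-word matches."""
    lowered = text.lower()
    positions = []
    for block in block_index.get(word, []):
        block_end = (block + 1) * block_size
        pos = lowered.find(word, block * block_size)
        while pos != -1 and pos < block_end:
            end = pos + len(word)
            # Accept the hit only if it is delimited by non-alphanumerics.
            starts_word = pos == 0 or not lowered[pos - 1].isalnum()
            ends_word = end == len(lowered) or not lowered[end].isalnum()
            if starts_word and ends_word:
                positions.append(pos)
            pos = lowered.find(word, pos + 1)
    return positions

text = "This is a text. A text has many words. Words are made from letters."
block_index = build_block_index(text, block_size=16)
hits = exact_positions(text, block_index, "text", block_size=16)
```

The index is smaller because each word keeps at most one entry per block, at the price of the extra scan whenever exact positions are needed.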

7
  • searching the inverted file
  • vocabulary search: the words present in the query
    are separately searched in the vocabulary
  • retrieval of occurrences: the lists of the
    occurrences of all the words found are retrieved

8
  • manipulation of occurrences: the lists are
    traversed in synchronization to find places where
    all the words appear in sequence (for a phrase
    query) or appear close enough (for a proximity
    query)
  • how to efficiently manipulate the occurrences
    when block addressing is used?
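
The three search steps can be sketched as follows, assuming full addressing with word positions (not blocks). The function names and the sample text are illustrative. The phrase query shifts each occurrence list by the word's offset in the phrase and intersects the sorted lists with a synchronized traversal.

```python
import re
from collections import defaultdict

def build_word_index(text):
    """Inverted index whose occurrences are word numbers (0-based)."""
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(position)
    return dict(index)

def intersect_sorted(a, b):
    """Synchronized traversal of two sorted lists, keeping common values."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def phrase_query(index, phrase):
    """Word positions where the phrase starts."""
    words = phrase.lower().split()
    # Shift list k by k so that a phrase match gives the same value in every list.
    shifted = [[p - k for p in index.get(w, [])] for k, w in enumerate(words)]
    result = shifted[0]
    for other in shifted[1:]:
        result = intersect_sorted(result, other)
    return result

def proximity_query(index, w1, w2, max_distance):
    """Pairs (p1, p2) of word positions at most max_distance words apart."""
    a, b = index.get(w1, []), index.get(w2, [])
    out, j = [], 0
    for p in a:
        while j < len(b) and b[j] < p - max_distance:
            j += 1          # b[j] is too far to the left of p
        k = j
        while k < len(b) and b[k] <= p + max_distance:
            out.append((p, b[k]))
            k += 1
    return out

text = "This is a text. A text has many words. Words are made from letters."
index = build_word_index(text)
phrase_hits = phrase_query(index, "text has")
near_hits = proximity_query(index, "text", "words", 3)
```

With block addressing the lists hold block numbers only, so the same traversal can narrow the search to candidate blocks, which then have to be scanned.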

9
  • constructing the inverted file

10
  • once constructed, it is written to disk in two
    files
  • the lists of occurrences are stored contiguously
    in the first file
  • in the second file, the vocabulary is stored in
    lexicographical order with a pointer for each
    word to its list in the first file
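
A hedged sketch of the two-file layout. In-memory streams stand in for the two disk files, and the record formats (32-bit little-endian positions, tab-separated vocabulary lines) are assumptions made for the example, not the format used in the presentation.

```python
import io
import struct
from bisect import bisect_left

def write_index(index, occ_file, vocab_file):
    """Write occurrence lists contiguously, and the sorted vocabulary
    with a pointer (byte offset) and length for each word."""
    for word in sorted(index):
        positions = index[word]
        vocab_file.write("%s\t%d\t%d\n" % (word, occ_file.tell(), len(positions)))
        occ_file.write(struct.pack("<%dI" % len(positions), *positions))

def load_vocabulary(vocab_file):
    """Read the vocabulary file back as a list of (word, offset, count)."""
    vocab_file.seek(0)
    entries = []
    for line in vocab_file:
        word, offset, count = line.rstrip("\n").split("\t")
        entries.append((word, int(offset), int(count)))
    return entries

def lookup(vocabulary, occ_file, word):
    """Binary-search the sorted vocabulary, then follow the pointer."""
    words = [entry[0] for entry in vocabulary]
    i = bisect_left(words, word)
    if i == len(words) or words[i] != word:
        return []
    _, offset, count = vocabulary[i]
    occ_file.seek(offset)
    return list(struct.unpack("<%dI" % count, occ_file.read(4 * count)))

index = {"text": [10, 18], "words": [32, 39], "many": [27]}
occ_file, vocab_file = io.BytesIO(), io.StringIO()
write_index(index, occ_file, vocab_file)
vocabulary = load_vocabulary(vocab_file)
positions = lookup(vocabulary, occ_file, "text")
```

Because the vocabulary is small, it can usually be kept in memory while the occurrence file stays on disk.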

11
  • Suffix tree and suffix array
  • can be used to index any text character
  • allow more complex queries to be answered
    efficiently
  • index points are selected from the text; they
    mark the beginning of the text positions that
    will be retrievable
  • each position is considered as a text suffix
  • each suffix is uniquely identified by its position

12

13
  • a suffix tree is a trie data structure built over
    all the suffixes of the text
  • the pointers to the suffixes are stored at the
    leaf nodes
  • the trie is compacted into a Patricia tree where
    unary paths are compressed
  • an indication of the next character position to
    consider is stored at the nodes which root a
    compressed path
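
For illustration, a plain (uncompacted) suffix trie can be written with nested dictionaries. This sketch deliberately omits the Patricia compaction described above, so it uses space quadratic in the text length and is suitable only for tiny inputs. The key None is an assumption of the sketch, used to hold the leaf pointer.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts.
    The pointer to each suffix is stored under the key None."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # pointer to the suffix
    return root

def find(trie, pattern):
    """Return all text positions where pattern occurs."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every suffix pointer below this node is an occurrence of pattern.
    out, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                out.append(child)
            else:
                stack.append(child)
    return sorted(out)

trie = build_suffix_trie("banana")
hits = find(trie, "ana")
```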

14
  • space overhead: 120% to 240% over the text size

15
  • suffix arrays provide the same functionality with
    much lower space requirements
  • an array containing all the pointers to the
    suffixes in lexicographical order
  • space requirements close to 40% overhead
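
A minimal sketch of building a suffix array by sorting the index points by the suffix each one starts. The naive sort is an assumption made for brevity: it compares whole suffixes and is too slow for large texts. The parameter index_points is a hypothetical name for the selected positions.

```python
def build_suffix_array(text, index_points=None):
    """Return the index points sorted by the suffix that starts at each.
    By default every character position is an index point."""
    if index_points is None:
        index_points = range(len(text))
    return sorted(index_points, key=lambda i: text[i:])

# Every character is an index point.
sa_chars = build_suffix_array("banana")

# Only word beginnings are index points.
text = "to be or not to be"
word_starts = [0, 3, 6, 9, 13, 16]
sa_words = build_suffix_array(text, word_starts)
```

Restricting the index points to word beginnings shrinks the array, but only strings that start at a word beginning remain retrievable.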

16
  • binary search is possible by comparing the
    text that each pointer refers to
  • a supra-index over the suffix array is used to
    reduce the number of disk accesses
  • compare with an inverted file
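
The binary search can be sketched as two searches that delimit the range of suffixes starting with the pattern. This in-memory version is an assumption for illustration: it does not model the supra-index or disk accesses, and the array is built with a naive sort.

```python
def search(text, suffix_array, pattern):
    """Return the sorted text positions whose suffix starts with pattern."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the suffix that entry k points to.
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first entry with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first entry with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana"
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
hits = search(text, suffix_array, "ana")
```

Each comparison reads the text at the position a pointer refers to, which is why reducing the number of probes matters when the text is on disk.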

17
  • processing phrase queries by searching the first
    words of the phrases
  • processing proximity queries by searching all the
    words in the queries
  • post-processing needed

18
  • Signature files
  • use a hash function to map words to bit masks of
    B bits
  • the text is divided into blocks of b words each
  • a bit mask of size B is assigned to each block by
    bitwise ORing the signatures of all the words in
    the block

19
  • if a word is present in a block, all the bits set
    in its signature are also set in the bit mask of
    the block
  • when a bit is set in the mask of the query word
    but not in the mask of the block, the word is not
    present in the block

20

21
  • false drop: all the corresponding bits are set
    while the word is not in the block
  • signature file design principle: make the
    probability of a false drop low while keeping the
    signature file as short as possible
  • searching a single word: hash it to a bit
    mask W, check for each block whether
    W & B_i = W (B_i is the bit mask of block i
    and & is bitwise AND), and verify that the word
    is actually there
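
A sketch of a signature file and a single-word search, under stated assumptions: the hash function (SHA-256 slices), the number of bits set per word, the values of B and b, and all helper names are illustrative choices, not part of the original slides.

```python
import hashlib
import re

def signature(word, B=32, bits_per_word=2):
    """Hash a word to a B-bit mask with at most bits_per_word bits set."""
    mask = 0
    for k in range(bits_per_word):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % B)
    return mask

def build_signature_file(words, b, B=32):
    """Split words into blocks of b words; OR the word signatures per block."""
    blocks = [words[i:i + b] for i in range(0, len(words), b)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= signature(word, B)
        masks.append(mask)
    return blocks, masks

def search_word(word, blocks, masks, B=32):
    """Return (candidate blocks, verified blocks) for a single word."""
    W = signature(word, B)
    # A block qualifies only if every bit of W is set in its mask.
    candidates = [i for i, mask in enumerate(masks) if (W & mask) == W]
    # Candidates may include false drops, so check the block's actual words.
    verified = [i for i in candidates if word in blocks[i]]
    return candidates, verified

text = "This is a text. A text has many words. Words are made from letters."
words = re.findall(r"\w+", text.lower())
blocks, masks = build_signature_file(words, b=4)
candidates, verified = search_word("text", blocks, masks)
false_drops = [i for i in candidates if i not in verified]
```

A larger B or fewer words per block lowers the chance of false drops but lengthens the signature file, which is the trade-off named in the design principle.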

22
  • a phrase search is processed by bitwise ORing the
    signatures of all the words in the query
  • the probability of false drops is reduced
  • care has to be exercised at block boundaries by
    overlapping words in consecutive blocks
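
The phrase search and the boundary problem can be sketched as below. The assumptions are the same illustrative hash as a stand-in for the real one, and an overlap of (phrase length - 1) words between consecutive blocks. All names are hypothetical.

```python
import hashlib
import re

def signature(word, B=32, bits_per_word=2):
    """Hash a word to a B-bit mask with at most bits_per_word bits set."""
    mask = 0
    for k in range(bits_per_word):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % B)
    return mask

def overlapping_blocks(words, b, overlap):
    """Blocks of b words; each repeats the last `overlap` words of the previous one."""
    step = b - overlap
    return [words[i:i + b] for i in range(0, max(len(words) - overlap, 1), step)]

def block_mask(block, B=32):
    mask = 0
    for word in block:
        mask |= signature(word, B)
    return mask

def search_phrase(phrase, blocks, masks, B=32):
    """Return the blocks that really contain the phrase."""
    phrase_words = phrase.lower().split()
    W = 0
    for word in phrase_words:
        W |= signature(word, B)       # OR the signatures of the query words
    m = len(phrase_words)
    hits = []
    for i, mask in enumerate(masks):
        if (W & mask) != W:
            continue                  # some query bit is missing: skip block
        block = blocks[i]
        if any(block[j:j + m] == phrase_words for j in range(len(block) - m + 1)):
            hits.append(i)            # verified, not a false drop
    return hits

text = "This is a text. A text has many words. Words are made from letters."
words = re.findall(r"\w+", text.lower())

plain = overlapping_blocks(words, b=4, overlap=0)
plain_hits = search_phrase("many words", plain, [block_mask(x) for x in plain])

overlapped = overlapping_blocks(words, b=4, overlap=1)
overlap_hits = search_phrase("many words", overlapped, [block_mask(x) for x in overlapped])
```

In this sample, "many words" straddles two blocks when the blocks do not overlap, so the phrase is missed. With one word of overlap it falls entirely inside a single block.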