Chapter 8 Indexing and Searching presentation

1
Chapter 8: Indexing and Searching

Hsin-Hsi Chen
Department of Computer Science and Information
Engineering
National Taiwan University

2
Introduction

searching
Online text searching
Scan the text sequentially
Indexed searching
Build data structures over the text to speed up
the search
Semi-static collections: updated at reasonably
regular intervals
indexing techniques
inverted files
suffix (PAT) arrays
signature files

3
Assumptions

n: the size of the text database
m: the length of the search patterns (m < n)
M: the amount of memory available
n': the size of the text that is modified (n' < n)
Experiments
32-bit Sun UltraSparc-1 at 167 MHz with 64 MB of
RAM
TREC-2 (WSJ, DOE, FR, ZIFF, AP)

4
File Structures for IR

lexicographical indices
indices that are sorted
inverted files
Patricia (PAT) trees (Suffix trees and arrays)
cluster file structures (see Chapter 7 in
document clustering)
indices based on hashing
signature files

5
Inverted Files
6
Inverted Files

Each document is assigned a list of keywords or
attributes.
Each keyword (attribute) is associated with
operational relevance weights.
An inverted file is the sorted list of keywords
(attributes), with each keyword having links to
the documents containing that keyword.
Penalty
the size of inverted files ranges from 10% to
100% or more of the size of the text itself
need to update the index as the data set changes

7
Text: This is a text. A text has many words. Words are made from letters.
Word start positions (in characters): 1 6 9 11 17 19 24 28 33 40 46 50 55 60

Vocabulary    Occurrences
letters       60
made          50
many          28
text          11, 19
words         33, 40

Addressing granularity
inverted list: word positions or character positions
inverted file: document

Heaps' law: the vocabulary grows as O(n^β), with β between 0.4 and 0.6
Vocabulary for 1 GB of the TREC-2 collection: 5 MB (before stemming and normalization)
Occurrences: the extra space is O(n), 30% to 40% of the text size
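
The vocabulary/occurrences table above can be reproduced with a short sketch. This is only an illustration, not the method used in the experiments: the function name `build_inverted_index` and the stop list are assumptions chosen so that the output matches the slide.

```python
import re
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset()):
    """Map each lower-cased word to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start() + 1)  # the slide counts from 1
    return dict(sorted(index.items()))  # vocabulary in lexicographic order

SAMPLE_TEXT = "This is a text. A text has many words. Words are made from letters."
# Assumed stop list: the words that do not appear in the slide's vocabulary.
SAMPLE_STOPWORDS = frozenset({"this", "is", "a", "has", "are", "from"})
INDEX = build_inverted_index(SAMPLE_TEXT, SAMPLE_STOPWORDS)
```

Each occurrence list comes out sorted by text position because the text is scanned left to right.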
8
Block Addressing

Full inverted indices
Point to exact occurrences
Block addressing
Point to the blocks where the word appears
Pointers are smaller
5% overhead over the text size

Blocks: fixed-size blocks, files, documents, Web pages, ...
Are blocks also the retrieval units?

Text divided into Block 1 to Block 4: This is a text. A text has many words. Words are made from letters.

Inverted index
Vocabulary    Occurrences
letters       4
made          4
many          2
text          1, 2
words         3
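
A minimal sketch of block addressing, assuming the text has already been cut into blocks. The exact block boundaries are not recoverable from the slide; the ones in `SAMPLE_BLOCKS` are hypothetical and were chosen only to be consistent with the occurrence lists shown (text in blocks 1 and 2, words in block 3, and so on).

```python
import re
from collections import defaultdict

def build_block_index(blocks, stopwords=frozenset()):
    """Map each word to the sorted list of block numbers (1-based) containing it."""
    index = defaultdict(set)
    for number, block in enumerate(blocks, start=1):
        for word in re.findall(r"[A-Za-z]+", block):
            word = word.lower()
            if word not in stopwords:
                index[word].add(number)
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}

# Hypothetical block boundaries (an assumption, see the lead-in).
SAMPLE_BLOCKS = ["This is a text.", "A text has many", "words. Words", "are made from letters."]
BLOCK_STOPWORDS = frozenset({"this", "is", "a", "has", "are", "from"})
BLOCK_INDEX = build_block_index(SAMPLE_BLOCKS, BLOCK_STOPWORDS)
```

Note that the two occurrences of "words" collapse into a single pointer to block 3, which is where the space saving comes from; finding the exact positions then requires scanning the block.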
9
Sorted array implementation of an inverted file
the documents in which keyword occurs
10
Full inversion (all words, exact positions,
4-byte pointers)
2 or 1 byte(s) per pointer independent of the
text size
document size (10KB), 1, 2, 3 bytes per pointer,
depending on text size
All words are indexed
Stop words are not indexed
11
Searching

Vocabulary search
Identify the words and patterns in the query
Search them in the vocabulary
Retrieval of occurrences
Retrieve the lists of occurrences of all the
words
Manipulation of occurrences
Solve phrases, proximity, or Boolean operations
Find the exact word positions when block
addressing is used

Three general steps
12
Structures used in Inverted Files

Sorted Arrays
store the list of keywords in a sorted array
using a standard binary search
advantage: easy to implement
disadvantage: updating the index is expensive
B-Trees
Tries
Hashing Structures
Combinations of these structures

13
Trie
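Text: This is a text. A text has many words. Words are made from letters.
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
[Figure: vocabulary trie over the text. Branches are labeled with letters (l; m, a, then d or n; t; w) and the leaves hold the occurrences: letters 60; made 50; many 28; text 11, 19; words 33, 40.]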
14
B-trees
[Figure: B-tree over a vocabulary. Node keys shown: F, M; Rut, Uni; Al, Br, E; Gr, H, Ja, L. Leaf entries shown: Afgan 2; Russian 9; Ruthenian 1.]
15
Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text (a time- and storage-consuming operation).
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
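
The three steps can be sketched as follows. This is an illustrative sketch only: the records in `SAMPLE_RECORDS` are made up, "location" is reduced to the record number, and the within-record frequency stands in for the term weight of step 3.

```python
from itertools import groupby

def invert(records):
    """Invert a list of text records into term -> [(record number, frequency)]."""
    # Step 1: parse into (term, record number) pairs, in location order.
    pairs = [(term.lower(), rec_no)
             for rec_no, text in enumerate(records, start=1)
             for term in text.split()]
    # Step 2: invert by sorting into alphabetical order (then record order).
    pairs.sort()
    # Step 3: collapse duplicates and attach a simple weight (the frequency).
    inverted = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        rec_numbers = [rec_no for _, rec_no in group]
        inverted[term] = [(r, rec_numbers.count(r)) for r in sorted(set(rec_numbers))]
    return inverted

# Hypothetical records, not taken from the slides.
SAMPLE_RECORDS = ["human factors in information retrieval systems",
                  "report on the retrieval report"]
INVERTED = invert(SAMPLE_RECORDS)
```

Here `INVERTED["retrieval"]` lists both records, while `INVERTED["report"]` has a single posting with frequency 2.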
16
Inversion of Word List
report appears in two records
17
Dictionary and postings file
Idea: the file to be searched should be as short
as possible; split a single file into two
pieces
dictionary (vocabulary)
postings file (occurrences)
e.g., data set: 38,304 records, 250,000 unique
terms
88 postings/record
postings: (document number, frequency)
18
Producing an Inverted File for Large Data Sets
without Sorting
Idea: avoid the use of an explicit sort
by using a right-threaded binary tree
each node holds the current number of term postings and the storage
location of the postings list
to write the index, traverse the binary tree and the linked postings
lists
19
Indexing Statistics
Final index: only 8% of the input text size for the 50MB
database, and 14% of the input
text size for the larger database
Working storage: not much larger than the size of the final
index for the new indexing method
(working storage: the storage needed to build the index)
[Table annotations, table lost in extraction: pp. 17-18; p. 20; 933; 2GB; "the same"]
20
A Fast Inversion Algorithm

Principle 1: large primary memories are
available. If databases can be split into memory
loads that can be rapidly processed and then
combined, the overall cost will be minimized.
Principle 2: the inherent order of the input
data. It is very expensive to use polynomial or
even n log n sorting algorithms for large files.

21
FAST-INV algorithm
concept postings/ pointers
See p. 22.
22
Sample document vector
document number
concept number
(one concept number for each unique word)
Similar to the document-word list shown on p. 16.
The concept numbers are sorted within
document numbers, and document numbers
are sorted within collection
23
HCN = highest concept number in dictionary (total number of concepts in dictionary)
L = number of concept/document (document/concept) pairs in the collection
M = available primary memory size, in bytes
M >> HCN, M < L
L/j < M, so that each of the j parts will fit into primary memory
HCN/j concepts, approximately, are associated with each part
Let
LL = length of current load, i.e., the number of concept-weight pairs (8 bytes for each concept-weight pair)
S = spread of concept numbers in current load (4 bytes for each count of postings)
8*LL + 4*S < M
24
Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc, con> entry in the document vector file: increment con_entries_cnt[con]
[Figure: document vector file, shown as an offset followed by its <doc, con> entries: 0 (1,2), (1,4) ...; 2 (2,3) ...; 3 (3,1), (3,2), (3,5) ...; 6 (4,2), (4,3) ...; 8 ...]
25
Preparation (continued)
5. For each <con, count> pair obtained from
con_entries_cnt: if there is no room for the
documents with this concept to fit in the
current load, then create an entry in the load
table and initialize the next load entry;
otherwise update the information for the current
load table entry.
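
A hedged sketch of the preparation phase, covering only the steps that survive on these slides (counting entries per concept, then cutting the concept range into loads that satisfy 8*LL + 4*S < M). The function names, the tuple layout of a load (start concept, end concept, LL), and the sample memory size are assumptions for illustration; the <doc, con> pairs are the ones visible in the figure.

```python
from collections import Counter

def count_concept_entries(doc_vectors, hcn):
    """con_entries_cnt[con - 1] = number of <doc, con> entries for concept con."""
    counts = Counter(con for _, con in doc_vectors)
    return [counts[con] for con in range(1, hcn + 1)]

def build_load_table(con_entries_cnt, memory_bytes):
    """Cut concepts 1..HCN into consecutive loads with 8*LL + 4*S < M.

    Each load is (start concept, end concept, LL). A single concept that is
    too large for memory still gets its own (oversized) load in this sketch.
    """
    loads = []
    start, ll = 1, 0
    for con, count in enumerate(con_entries_cnt, start=1):
        spread = con - start + 1
        if ll > 0 and 8 * (ll + count) + 4 * spread >= memory_bytes:
            loads.append((start, con - 1, ll))  # no room: close the current load
            start, ll = con, 0                  # and initialize the next one
        ll += count                             # update the current load entry
    loads.append((start, len(con_entries_cnt), ll))
    return loads

DOC_VECTORS = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 2), (3, 5), (4, 2), (4, 3)]
CON_ENTRIES_CNT = count_concept_entries(DOC_VECTORS, hcn=5)
LOAD_TABLE = build_load_table(CON_ENTRIES_CNT, memory_bytes=48)  # tiny M, for illustration
```

With this tiny memory budget the five concepts split into two loads, concepts 1-2 and concepts 3-5, each holding four concept-weight pairs.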
26
the range of concepts for each primary load
LL = length of current load
S = end concept - start concept + 1
(space for each concept/weight pair) * LL + (space for each concept to store its count of postings) * S < M
[Presenter's notes lost in extraction; the surviving terms are (Doc, Con), Load, Load File, CONPTR, and Offset.]
copy rather than sort
27
PAT Trees and PAT Arrays (Suffix Trees and Suffix
Arrays)
28
PAT Trees and PAT Arrays

Problems of traditional IR models
Documents and words are assumed.
Keywords must be extracted from the text
(indexing).
Queries are restricted to keywords.
New indices for text
A text is regarded as a long string.
Each position corresponds to a semi-infinite
string (sistring).
suffix: a string that goes from a text position
to the end of the text
Each suffix is uniquely identified by its
position
no structures and no keywords

29
Text
This is a text. A text has many words. Words are made from letters.
Suffixes (all different)
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
Words are made from letters.
made from letters.
letters.
Index points are selected from the text; they point to the beginning of the text positions which are retrievable.
30
PATRICIA

trie
branch (decision) node: search decision-markers
element node: real data
if branch decisions are made on each bit, a
complete binary tree is formed where the depth is
equal to the number of bits of the longest
strings
many element nodes and branch nodes are null

31
PATRICIA (Continued)

compressed digital search trie
the null element nodes and branch nodes are
removed
an additional field to denote the comparing bit
for branching decision is included in each
decision node
a matching between the searched results and their
search keys is required because only some of bits
are compared during the search process

32
PATRICIA (Continued)

Practical Algorithm to Retrieve Information Coded
in Alphanumeric
augmented branch node an additional field for
storing elements is included in branch node
each element is stored in an upper node or in
itself
an additional root node
note: the number of leaf nodes is always greater than that of internal
nodes by one

33
PAT-tree

PAT tree = PATRICIA + semi-infinite strings
a text T with n basic units u1 u2 ... un
semi-infinite strings: u1 u2 ... un ..., u2 u3 ... un ..., u3 u4 ... un ..., ...
a semi-infinite string has an end to the left but none to the right
store the starting positions of semi-infinite
strings in a text using PATRICIA

34
semi-infinite strings

Example
Text: Once upon a time, in a far away land ...
sistring 1: Once upon a time ...
sistring 2: nce upon a time ...
sistring 8: on a time, in a ...
sistring 11: a time, in a far ...
sistring 22: a far away land ...
Compare sistrings: 22 < 11 < 2 < 8 < 1
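
The ordering above can be checked with a small sketch. The helper names are hypothetical, and the comparison is assumed to be case-insensitive with the blank sorting before letters (plain ASCII order would put the capital "O" of sistring 1 first).

```python
def sistring(text, position):
    """Semi-infinite string starting at a 1-based text position."""
    return text[position - 1:]

def order_sistrings(text, positions):
    """Sort positions by the lexicographic order of their sistrings (case folded)."""
    return sorted(positions, key=lambda p: sistring(text, p).lower())

STORY = "Once upon a time, in a far away land"
ORDER = order_sistrings(STORY, [1, 2, 8, 11, 22])
```

Only the starting positions are stored and reordered; the sistrings themselves are read from the text on demand.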

35
PAT Tree

PAT TreeA Patricia tree constructed over all the
possible sistrings of a text
Patricia tree
a digital tree where the individual bits of the
keys are used to decide on the branching
each internal node indicates which bit of the
query is used for branching
absolute bit position
a count of the number of bits to skip
each external node is a sistring, i.e., the
integer displacement

36
Example
Text: 01100100010111
sistring 1: 01100100010111
sistring 2: 1100100010111
sistring 3: 100100010111
sistring 4: 00100010111
sistring 5: 0100010111
sistring 6: 100010111
sistring 7: 00010111
sistring 8: 0010111
...
[Figure: PAT trees built over these sistrings; node labels lost in extraction.]
external node: sistring (integer displacement)
internal node: the total displacement of the bit to be inspected, or a skip counter and pointer
37
[Figure: PAT trees for the text 01100100010111 over sistrings 1 to 8 (01100100010111, 1100100010111, 100100010111, 00100010111, 0100010111, 100010111, 00010111, 0010111); diagrams lost in extraction.]
Search: 00101
[Annotation garbled in extraction; it mentions bits 3, 6, and 4.]
38
Text: This is a text. A text has many words. Words are made from letters.
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
[Figure: suffix trie and suffix tree over the index points 11, 19, 28, 33, 40, 50, 60; diagrams lost in extraction.]
space overhead: 120% to 240% over the text size
39
PAT Trees Represented as Arrays

indirect binary search vs. sequential search
Keep the external nodes in the bucket in the same
relative order as they would be in the tree

PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
[Figure: the PAT tree whose external nodes, read in order, give the PAT array; diagram lost in extraction.]
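
A minimal sketch of how the PAT array relates to the text: it is the list of index points sorted by the sistrings they start. The function name `pat_array` is an assumption, and sorting whole suffixes with the built-in sort is used only for clarity; it is not an efficient construction.

```python
def pat_array(text, index_points):
    """Return the 1-based index points sorted by the sistrings they start."""
    return sorted(index_points, key=lambda p: text[p - 1:])

BIT_TEXT = "01100100010111"
PAT = pat_array(BIT_TEXT, range(1, 9))  # sistrings 1 to 8, as on the slide
```

The result reproduces the PAT array of the slide, 7 4 8 5 1 6 3 2.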
40
Text: This is a text. A text has many words. Words are made from letters.
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
[Figure: three structures over the index points 11, 19, 28, 33, 40, 50, 60; diagrams lost in extraction.]
(1) Suffix Tree: 120% to 240% overhead
(2) Suffix Array: 40% overhead
(3) Supra-Index on top of the Suffix Array
41
difference between suffix array and inverted list

suffix array: the occurrences of each word are
sorted lexicographically by the text following
the word
inverted list: the occurrences of each word are
sorted by text position

[Figure: the sample text "This is a text. A text has many words. Words are made from letters." with a vocabulary supra-index, shown once over a suffix array and once over an inverted list; diagram lost in extraction.]
42
Indexing Points

The above example assumes every position in the
text is indexed: n external nodes, one for each
position in the text
word and phrase searches: only sistrings that are at
the beginning of words are necessary
trade-off between size of the index and search
requirements

43
Prefix searching

idea: every subtree of the PAT tree has all the
sistrings with a given prefix.
Search time is proportional to the query length: exhaust
the prefix or go down to an external node.

Search for the prefix 10100 and its answer
44
Searching PAT Trees as Arrays

Prefix searching and range searching: do an
indirect binary search over the array, with the
results of the comparisons being less than,
equal, and greater than.
example: Search for the prefix 100 and its answer.

PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
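
A sketch of the prefix search on the example above. For readability the truncated sistrings are materialized in a list before the binary search; a real indirect binary search would instead read the text through the array pointers at each comparison. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right

def prefix_search(text, pat, prefix):
    """Return the index points in the PAT array whose sistrings start with prefix."""
    # Truncating every sistring to len(prefix) makes "equal" mean "has this prefix".
    keys = [text[p - 1:p - 1 + len(prefix)] for p in pat]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix)
    return pat[lo:hi]

SEARCH_TEXT = "01100100010111"
SEARCH_PAT = [7, 4, 8, 5, 1, 6, 3, 2]
ANSWER = prefix_search(SEARCH_TEXT, SEARCH_PAT, "100")
```

The answer is a contiguous slice of the array, here the entries for sistrings 6 and 3.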
45
Proximity Searching

Find all places where s1 is at most a fixed
(given by a user) number of characters away from
s2.
e.g., in <4> ation => insulation, international,
information
Algorithm
1. Search for s1 and s2.
2. Select the smaller answer set from these two sets and
sort it by position.
3. Traverse the unsorted answer
set, searching every position in the sorted
set and checking whether the distance between
positions satisfies the proximity condition.

sort + traverse time: (m1 + m2) log m1 (assume m1 < m2)
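
The algorithm can be sketched as below, taking the two answer sets as lists of text positions. Measuring the distance between starting positions is an assumption of this sketch, as are the function name and the sample positions (the occurrences of "text" and "words" from the earlier inverted-file example).

```python
from bisect import bisect_left

def proximity(positions_1, positions_2, max_distance):
    """Return (p1, p2) pairs whose positions differ by at most max_distance."""
    swapped = len(positions_1) > len(positions_2)
    small, large = (positions_2, positions_1) if swapped else (positions_1, positions_2)
    small = sorted(small)                      # sort only the smaller answer set
    pairs = []
    for p in large:                            # traverse the unsorted set
        i = bisect_left(small, p - max_distance)
        while i < len(small) and small[i] <= p + max_distance:
            pairs.append((p, small[i]) if swapped else (small[i], p))
            i += 1
    return sorted(pairs)

NEAR = proximity([11, 19], [33, 40], max_distance=15)
```

Sorting costs m1 log m1 and each of the m2 lookups costs log m1, which gives the (m1 + m2) log m1 bound quoted above (plus the size of the output).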
46
Range Searching

Search for all the strings within a certain
lexicographical range.
e.g., the range "abc" .. "acc": abracadabra and
acacia are in the range; abacus and acrimonious are not
Algorithm
Search each end of the defining intervals.
Collect all the sub-trees between (and including)
them.

47
Searching Suffix Array

P1 <= S < P2
Binary search both limiting patterns in the
suffix array.
Find all the elements lying between both positions
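
A sketch of the two binary searches, using the words of the range-searching example as a tiny text indexed at word beginnings. The sample text and names are assumptions; as in the prefix-search case, the suffixes are materialized only to keep the sketch short.

```python
from bisect import bisect_left

def range_search(text, suffix_array, p1, p2):
    """Return the index points whose suffixes S satisfy p1 <= S < p2."""
    suffixes = [text[p - 1:] for p in suffix_array]  # already in sorted order
    lo = bisect_left(suffixes, p1)   # first suffix >= p1
    hi = bisect_left(suffixes, p2)   # first suffix >= p2
    return suffix_array[lo:hi]

WORDS_TEXT = "abacus abracadabra acacia acrimonious"
WORD_STARTS = [1, 8, 20, 27]                      # index points at word beginnings
WORDS_SA = sorted(WORD_STARTS, key=lambda p: WORDS_TEXT[p - 1:])
IN_RANGE = range_search(WORDS_TEXT, WORDS_SA, "abc", "acc")
```

The two positions returned are the starts of abracadabra and acacia; abacus and acrimonious fall outside the interval.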

48
Longest Repetition Searching

the match between two different positions of a
text where this match is the longest in the
entire text, e.g., 0 1 1 0 0 1 0 0 0 1 0 1 1 1

the tallest internal node gives a pair of
sistrings that match for the greatest number of
characters
Text: 01100100010111
sistrings 1 to 8: 01100100010111, 1100100010111, 100100010111, 00100010111, 0100010111, 100010111, 00010111, 0010111
[Figure: PAT tree over these sistrings; diagram lost in extraction.]
49
Most Significant or Most Frequent Matching

the most frequently occurring strings within the
text database, e.g., the most frequent trigram
find the most frequent trigram: find the largest
subtree at a distance of 3 characters from the root

[Figure: PAT tree over sistrings 1 to 8 of the text 01100100010111; diagram lost in extraction.]
i.e., bits 1, 2, 3 are the same for sistrings
100100010111 and 100010111
50
Building PAT Trees as Patricia Trees

bucketing of external nodes
collect more than one external node
a bucket replaces any subtree with size less than
a certain constraint (b): saves a significant number
of internal nodes
the external nodes inside a bucket do not have
any structure associated with them: increases the
number of comparisons for each search

51
Building PAT Trees as Patricia Trees(Continued)

mapping the tree onto the disk using super-nodes
Allocate as much as possible of the tree in a
disk page
Every disk page has a single entry point,
contains as much of the tree as possible, and
terminates either in external nodes or in
pointers to other disk pages
The pointers in internal nodes address either a
disk page or another node inside the same page
disk pages contain on the order of 1,000
internal/external nodes
on the average, each disk page contains about 10
steps of a root-to-leaf path

52
Suffix array construction (in MM)

The suffix array and the text must be in main
memory
The suffix array is the set of pointers
lexicographically sorted
The pointers are collected in ascending text
order
The pointers are sorted by the text they point to
(accessing the text at random positions)

53
Suffix array construction (in MM)

Algorithm
All the suffixes are bucket-sorted according to
the first letter only
At iteration i, the suffixes begin already sorted
by their first 2^(i-1) letters and end up sorted by
their first 2^i letters.
To sort the text positions Ta... and Tb... in the
suffix array,
determine the relative order between text
positions Ta+2^(i-1)... and Tb+2^(i-1)... in the current
stage of the search
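
A sketch of the doubling idea described above. It keeps the structure of the algorithm (ranks by the first k letters are combined pairwise to get ranks by the first 2k letters) but, as a simplification, uses Python's comparison sort at every round instead of bucket sorting, so it does not reach the running time of the original algorithm. The function name is an assumption.

```python
def suffix_array(text):
    """Build the suffix array (1-based positions) by prefix doubling."""
    n = len(text)
    rank = [ord(c) for c in text]      # initial order: first letter only
    sa = list(range(n))
    k = 1
    while True:
        # Two suffixes that agree on their first k letters are ordered by the
        # ranks of the suffixes starting k letters later (-1 if past the end).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:   # all suffixes distinguished
            break
        k *= 2
    return [i + 1 for i in sa]

FULL_SA = suffix_array("01100100010111")
```

Each round doubles the number of letters the order accounts for, so at most about log2(n) rounds are needed.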

54
Construction of Suffix Arrays for Large Text

Split the text into blocks that can be sorted in main memory.
Build the suffix array for the first block
Build the suffix array for the second block
Merge both suffix arrays
Build the suffix array for the third block
Merge the suffix array with the previous one
Build the suffix array for the fourth block
Merge the new suffix array with previous one

55
Merge Step

How to merge a large suffix array with the small
suffix array?
Determine how many elements of the large array
are to be placed between each pair of elements in
the small array
Read the large array sequentially into main
memory
Each suffix of that text is searched in the small
suffix array
Increment appropriate counter
Use the information to merge the arrays without
accessing the text
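
The merge step can be sketched as follows, with everything held in memory for simplicity. The helper names and the sample text are assumptions; positions are 1-based, the small block is taken to be the tail of the text, and suffixes of the earlier (long) part run on through the small block.

```python
from bisect import bisect_left
from itertools import islice

def merge_suffix_arrays(text, long_sa, small_sa):
    """Merge two suffix arrays over disjoint position sets of the same text."""
    small_suffixes = [text[p - 1:] for p in small_sa]
    # counters[i] = number of long-array suffixes that fall just before small_sa[i]
    # (counters[-1] counts those that come after every small suffix).
    counters = [0] * (len(small_sa) + 1)
    for q in long_sa:
        counters[bisect_left(small_suffixes, text[q - 1:])] += 1
    # The merge itself uses only the counters, not the text.
    merged, remaining = [], iter(long_sa)
    for i, count in enumerate(counters):
        merged.extend(islice(remaining, count))
        if i < len(small_sa):
            merged.append(small_sa[i])
    return merged

def naive_suffix_array(text, positions):
    """Reference construction used only to set up the example."""
    return sorted(positions, key=lambda p: text[p - 1:])

MERGE_TEXT = "abracadabra"
LONG_SA = naive_suffix_array(MERGE_TEXT, range(1, 7))    # already-built array
SMALL_SA = naive_suffix_array(MERGE_TEXT, range(7, 12))  # array of the new block
MERGED_SA = merge_suffix_arrays(MERGE_TEXT, LONG_SA, SMALL_SA)
```

Because the long array is already sorted, the first `counters[0]` of its entries are exactly the ones that precede the first small suffix, and so on, which is why the second phase needs no text access.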

56
[Figure: the merge in three stages]
(a) small text -> small suffix array: the local suffix array is built
(b) long text, small text, small suffix array -> counters: the counters are computed
(c) small text, long suffix array, small suffix array, counters -> final suffix array: the suffix arrays are merged
57
Signature Files
58
Signature Files

basic idea: inexact filter
discard many of the nonqualifying items
qualifying items definitely pass the test
false hits or false drops may also pass
accidentally
procedure
Documents are stored sequentially in text file.
Their signatures (hash-coded bit patterns) are
stored in the signature file.
Scan the signature file, discard nonqualifying
documents, and check the rest, when a query
arrives.

59
Merits of Signature Files

faster than full text scanning
1 or 2 orders of magnitude faster
modest space overhead
10-15% vs. 50-300% (inversion)
insertions can be handled more easily than
inversion
append only
no reorganization or rewriting

60
Basic Concepts

Use superimposed coding to create signature.
Each document is divided into logical blocks.
A block contains D distinct non-common words.
Each word yields word signature.
A word signature is an F-bit pattern, with m
bits set to 1.
Each word is divided into successive, overlapping
triplets, e.g., free -> " fr", "fre", "ree", "ee " (padded with a blank on each side)
Each such triplet is hashed to a bit position.
The word signatures are ORed to form the block
signature.
Block signatures are concatenated to form the
document signature.

(In the textbook, F is written B and m is written l.)
61
Basic Concepts (Continued)

Example (D = 2, F = 12, m = 4)
word    signature
free    001 000 110 010
text    000 010 101 001
block signature    001 010 111 011
Search
Use the hash function to determine the m 1-bit
positions.
Examine each block signature for 1s in the bit
positions where the signature of the search word
has a 1.
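
A small sketch of superimposed coding on the example above. The word signatures are taken from the slide rather than computed by a hash function; `OTHER` is a made-up signature used only to show a false drop.

```python
def superimpose(word_signatures):
    """OR the word signatures together to form the block signature."""
    block = 0
    for signature in word_signatures:
        block |= signature
    return block

def may_contain(block_signature, word_signature):
    """True if every 1 bit of the word signature is also set in the block signature."""
    return (block_signature & word_signature) == word_signature

FREE = 0b001000110010       # from the slide
TEXT_SIG = 0b000010101001   # from the slide
BLOCK_SIG = superimpose([FREE, TEXT_SIG])

OTHER = 0b001010000011      # hypothetical word that is not in the block
```

`may_contain(BLOCK_SIG, OTHER)` is true even though that word was never inserted: its four bits happen to be covered by the bits of free and text, which is exactly a false drop.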

62
A Signature File
Text divided into Block 1 to Block 4: This is a text. A text has many words. Words are made from letters.
Text signature: 000101 110101 100100 101101
h(text) = 000101
h(many) = 110000
h(words) = 100100
h(made) = 001100
h(letters) = 100001
63
Basic Concepts (Continued)

false alarm (false hit, or false drop) Fd: the
probability that a block signature seems to
qualify, given that the block does not actually
qualify. Fd = Prob{signature qualifies | block
does not}
Ensure the probability of a false alarm is low
enough while keeping the signature file as short
as possible
For a given value of F, the value of m that
minimizes the false drop probability is such that
each row of the matrix contains 1s with
probability 0.5:
$F_d = 2^{-m}$, $F \ln 2 = m D$, i.e., $m = \ln 2 \cdot F / D$

The signature file is an N x F binary matrix
F: signature size in bits
m: number of bits per word
D: number of distinct noncommon words per document
Fd: false drop probability
64
space overhead of index = (1/80)(F/D), where F is
measured in bits and D in words
10% overhead: false drop probability close to 2%
10% = (1/80)(F/D) => F/D = 8, m = 8 ln 2 = 5.545, Fd = 2^(-5.545) ≈ 2%
20% overhead: false drop probability close to 0.046%
20% = (1/80)(F/D) => F/D = 16, m = 16 ln 2 = 11.09, Fd = 2^(-11.09) ≈ 0.046%
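
The arithmetic on this slide can be replayed with a few lines. The function name is hypothetical; the factor 80 (bits per word of text) is taken from the overhead formula on the slide.

```python
import math

def signature_design(overhead, bits_per_word=80):
    """Return (F/D, optimal m, false drop probability) for a given space overhead."""
    ratio = overhead * bits_per_word   # F/D, from overhead = (1/80)(F/D)
    m = ratio * math.log(2)            # m = ln 2 * F/D
    return ratio, m, 2.0 ** (-m)       # Fd = 2^(-m)

RATIO_10, M_10, FD_10 = signature_design(0.10)
RATIO_20, M_20, FD_20 = signature_design(0.20)
```

Doubling the overhead from 10% to 20% squares the false drop probability, since m doubles and Fd = 2^(-m).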
65
Sequential Signature File (SSF)
[Figure: signature matrix, one row per document]
the size of a document signature = the size of a
block signature = F; assume documents span exactly
one logical block
66
Classification of Signature-Based Methods

Compression: If the signature matrix is
deliberately sparse, it can be compressed.
Vertical partitioning: Storing the signature
matrix columnwise improves the response time at
the expense of insertion time.
Horizontal partitioning: Grouping similar
signatures together and/or providing an index on
the signature matrix may result in
better-than-linear search.

67
Classification of Signature-Based Methods

Sequential storage of the signature matrix
without compression: sequential signature files (SSF)
with compression: bit-block compression (BC), variable bit-block compression (VBC)
Vertical partitioning
without compression: bit-sliced signature files (BSSF, B'SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)

68
Classification of Signature-Based
Methods(Continued)

with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)
Horizontal partitioning
data independent partitioning: Gustafson's method, partitioned signature files
data dependent partitioning: 2-level signature files, S-trees

69
Criteria

the storage overhead
the response time on single word queries
the performance on insertion, as well as whether
the insertion maintains the append-only property

70
Compression

idea
Create sparse document signatures on purpose.
Compress them before storing them sequentially.
Method
Use B-bit vector, where B is large.
Hash each word into one (n) bit position(s).
Use run-length encoding.

71
Compression using run-length encoding
word          signature
data          0000 0000 0000 0010 0000
base          0000 0001 0000 0000 0000
management    0000 1000 0000 0000 0000
system        0000 0000 0000 0000 1000
block signature    0000 1001 0000 0010 1000
L1 to L5: the lengths of the intervals of 0s in the block signature
stored signature: [L1] [L2] [L3] [L4] [L5], where [x] is the encoded
value of x.
search: decode the encoded lengths of all the
preceding intervals
example: search "data"
(1) data => 0000 0000 0000 0010 0000
(2) decode [L1] = 0000, decode [L2] = 00, decode [L3] = 000000
disadvantage: search becomes slow
72
Bit-block Compression (BC)
Data structure
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm
Part I. It is one bit long, and it indicates whether there are any 1s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
0000 1001 0000 0010 1000 => 0 1 0 1 1
Part II. It indicates the number s of 1s in the bit-block. It consists of s-1 1s and a terminating zero.
=> 10 0 0
Part III. It contains the offsets of the 1s from the beginning of the bit-block.
=> 00 11 10 00
(with bit-block size b = 4, the offsets 0, 1, 2, 3 are written 00, 01, 10, 11)
block signature: 01011 | 10 0 0 | 00 11 10 00
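
A sketch that reproduces the three parts of the encoding for the example block signature. The function name and the string-based representation are assumptions made for readability; a real implementation would pack bits.

```python
def bit_block_compress(signature, b=4):
    """Encode a sparse bit string into Parts I, II, III of bit-block compression."""
    bits = signature.replace(" ", "")
    blocks = [bits[i:i + b] for i in range(0, len(bits), b)]
    width = max(1, (b - 1).bit_length())   # bits needed to write an offset 0..b-1
    part1, part2, part3 = [], [], []
    for block in blocks:
        offsets = [i for i, bit in enumerate(block) if bit == "1"]
        part1.append("1" if offsets else "0")        # Part I: any 1s in the block?
        if offsets:
            part2.append("1" * (len(offsets) - 1) + "0")  # Part II: s-1 ones, then 0
            part3.extend(format(o, f"0{width}b") for o in offsets)  # Part III: offsets
    return "".join(part1), " ".join(part2), " ".join(part3)

PART_1, PART_2, PART_3 = bit_block_compress("0000 1001 0000 0010 1000")
```

Empty bit-blocks cost a single bit, which is why the scheme pays off when the signature is deliberately sparse.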
73
Bit-block Compression (BC) (Continued)
Search "data"
(1) data => 0000 0000 0000 0010 0000
(2) the 4th bit-block
(3) signature: 01011 | 10 0 0 | 00 11 10 00
(4) OK, there is at least one setting in the 4th bit-block.
(5) Check further. "0" tells us there is only one setting in the 4th bit-block. Is it the 3rd bit?
(6) Yes, "10" confirms the result.
Discussion
(1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.
74
Vertical Partitioning

idea: avoid bringing useless portions of the
document signature into main memory
methods
store the signature file in a bit-sliced form or
in a frame-sliced form
store the signature matrix column-wise to improve
the response time at the expense of insertion time

75
Bit-Sliced Signature Files (BSSF)
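Transposed bit matrix
[Figure: the signature matrix (one document signature per row) is transposed, so that each bit position is represented by a bit-file with one bit per document.]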
76
[Figure: the bit-files, each holding one bit per document]
search
(1) Retrieve m bit vectors (instead of F bit vectors). e.g., the word signature of free is 001 000 110 010; if the document contains free, the 3rd, 7th, 8th, and 11th bits are set, i.e., only the 3rd, 7th, 8th, and 11th files are examined.
(2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.
insertion
Requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
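
A sketch of the bit-sliced search. The three document signatures are a made-up collection built from the word signatures used earlier (free and text); the function names and the string representation of bit-files are assumptions.

```python
def to_bit_slices(signatures):
    """Transpose N document signatures (bit strings of length F) into F bit-files."""
    return ["".join(column) for column in zip(*signatures)]

def bssf_search(bit_slices, word_signature):
    """Return the 1-based numbers of documents whose signature covers the word."""
    positions = [i for i, bit in enumerate(word_signature) if bit == "1"]
    qualifies = [True] * len(bit_slices[0])
    for i in positions:   # only the m relevant bit-files are read, not all F
        qualifies = [q and bit == "1" for q, bit in zip(qualifies, bit_slices[i])]
    return [doc for doc, hit in enumerate(qualifies, start=1) if hit]

# Hypothetical collection: doc 1 = free OR text, doc 2 = text only, doc 3 = free only.
DOC_SIGNATURES = ["001010111011", "000010101001", "001000110010"]
SLICES = to_bit_slices(DOC_SIGNATURES)
HITS = bssf_search(SLICES, "001000110010")   # word signature of free
```

Appending a document means adding one bit to each of the F bit-files, which is the insertion cost noted above.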
77
Frame-Sliced Signature File (FSSF)

Ideas
random disk accesses are more expensive than
sequential ones.
force each word to hash into bit positions that
are closer to each other in the document
signature
these bit files are stored together and can be
retrieved with a few random accesses
Procedures
The document signature (F bits long) is divided
into k frames of s consecutive bits each.
For each word in the document, one of the k
frames will be chosen by a hash function.
Using another hash function, the word sets m bits
in that frame.

78
documents
frames
Each frame will be kept in consecutive disk
blocks.
79
FSSF (Continued)

Example (D = 2, F = 12, s = 6, k = 2, m = 3)
word    signature
free    000000 110010
text    010110 000000
doc. signature    010110 110010
Search
Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
e.g., search for documents that contain the word free: because the word signature of free is placed in the 2nd frame, only the 2nd frame has to be examined.
At most n frames have to be scanned for an n-word query.
Insertion
Only k frames have to be accessed instead of F bit-slices.

80
Vertical Partitioning and Compression

idea
create a very sparse signature matrix
store it in a bit-sliced form
compress each bit slice by storing the position
of the 1s in the slice.

81
Compressed Bit Slices (CBS)

Room for improvement in the bit-sliced method
Searching
Each search word requires the retrieval of m bit
files.
The search time could be improved if m was forced
to be 1.
Insertion
Require too many disk accesses (equal to F, which
is typically 600-1000).

82
Compressed Bit Slices (CBS)(Continued)

Let m = 1. To maintain the same false drop
probability, F (S) has to be increased.

documents
one bit-setting for each word Only one row has to
be read
Size of a signature
83
(document collection)
representation for a word [remainder of this note lost in extraction]
Do not distinguish synonyms.
Hash a word to obtain the bucket address
Obtain the pointers to the relevant documents
from the buckets
h(base) = 30
Doubly Compressed Bit Slices
Idea: compress the sparse directory [remainder of this note lost in extraction; it mentions S, buckets, and a hash function]
Distinguish synonyms partially.
Follow the pointers of posting buckets to
retrieve the qualifying documents.
h1(base) = 30
h2(base) = 011
85
No False Drops Method
Fixed length: saves space
Using a pointer to the word in the text file
Distinguish synonyms completely. [note mentioning h2 lost in extraction]
86
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
[Figure: signature matrix, one row per document]
87
Partitioned Signature Files

Using a portion of a document signature as a
signature key to partition the signature file.
All signatures with the same key will be grouped
into a so-called module.
When a query signature arrives,
examine its signature key and look for the
corresponding modules
scan all the signatures within those modules that
have been selected

88
Comparisons

signature files
Use hashing techniques to produce an index
advantage
storage overhead is small (10-20%)
disadvantages
the search time on the index is linear
some answers may not match the query, thus
filtering must be done

89
Comparisons (Continued)

inverted files
storage overhead (30-100%)
search time for word searches is logarithmic
PAT arrays
potential use in other kinds of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent searching
