Fast Phrase Querying with Combined Indexes

1
Fast Phrase Querying with Combined Indexes
  • Hugh E. Williams, Justin Zobel, and Dirk Bahle
  • {hugh, jz, dbahle}@cs.rmit.edu.au
  • School of Computer Science and
  • Information Technology, RMIT University

2
Overview
  • An overview of phrase querying
  • Word-position inverted indexes
  • Use in evaluating phrase queries
  • Phrase indexes for phrase querying
  • Partial phrase indexes
  • Nextword indexes
  • Combined indexes for fast, space-efficient phrase
    querying
  • Results
  • Conclusions

3
Phrase Querying
  • Searching in text databases is enhanced by
    providing a rich set of alternative methods for
    expressing information needs
  • A phrase query matches documents or phrases that
    have the same word ordering, so it is unambiguous
  • For example, a query may be to find all documents
    containing the exact phrase "All the President's
    men"
  • Collection documents are usually indexed so that
    punctuation defines a phrase
  • For example, the query will not match the
    document fragment "all the presidents. Men, ..."
  • Phrase queries might also be partially specified:
    "All the Pres" or "All the men"

4
Phrase Querying
  • Around 8% of queries from the Excite query logs
    we used in our experiments contain phrases
    enclosed in quotation marks
  • Median phrase length is 2 words and mean is 2.5;
    around 34% contain three or more words
  • Around 41% of the remaining non-phrase queries
    actually match a phrase in a 21Gb Web collection
    we used (derived from the 1997 TREC VLC2)
  • Observation: users often pose phrase queries
    without explicitly using the query operator
    (usually quotation marks)
  • Investigated in our recent TREC terabyte track
    experiments (the talks next week!)

5
More Observations...
  • Around 11% of the explicit phrase queries contain
    one of the three commonest words: "the", "to", and
    "of"
  • But only 0.4% of these are terminated by the
    common word
  • Around 14.5% include one of the twenty commonest
    terms
  • Common words play different roles:
  • Structural only: "tower of london"
  • Important: "flights to london", "flights from
    london"
  • Essential: "the who", "the the", "this is the
    end"
  • Ignoring common words leads to very fast query
    evaluation (details later)

6
What if we ignore common words?
  • Acceptable: the query "tower -- london" may
    match
  • "tower of london ..."
  • "tower in london ..."
  • "tower, while london"
  • Unacceptable examples:
  • The query "-- -- -- end" matches any document
    containing three stopwords and then "end"
  • The query "the who" may be completely stopped
  • Conclusion: stopping of common words may be
    important for efficiency, but can have an
    unpredictable effect on user searching

7
A Word-Position Inverted Index
  • Inverted indexes are the standard method of
    supporting querying of large text databases
  • Example: "hampshire" occurs in 53 documents, the
    first of which is ordinal document 9, where it
    occurs 3 times at word positions 3, 8, and 90
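
A minimal Python sketch of such a word-position index follows. It is an illustration only: the function name, the toy documents, and the nested-dictionary layout are assumptions, and a real index would store compressed postings lists on disk rather than Python objects.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [word positions]}; positions count from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Toy collection (hypothetical documents, keyed by ordinal document number).
docs = {
    5: "hampshire is new to me",
    9: "in old times new hampshire was a colony",
}
index = build_positional_index(docs)

# The postings list for a word gives, per document, where the word occurs.
hampshire_postings = index["hampshire"]
document_frequency = len(hampshire_postings)
```

Here `document_frequency` plays the role of the "occurs in 53 documents" count in the example above, and each inner list holds the word positions within one document.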

8
Evaluating Phrase Queries
  • Resolving phrase queries is straightforward, but
    that does not mean it is fast
  • The phrase "new hampshire" can be identified by
    retrieving and processing the postings lists for
    "new" and "hampshire"
  • A logical AND of the lists would identify the
    phrase beginning at position 7 in document 9 (and
    perhaps in other documents) but not in document
    5
  • The evaluation process:
  • 1. Essential: sort terms from rarest to most
    common
  • 2. Retrieve the shortest postings list
    ("hampshire")
  • 3. Decode and create an in-memory list of
    candidate matches
  • 4. Using the in-memory list:
  • 4a. Retrieve the next shortest postings list
    ("new")
  • 4b. Merge the new postings list with the
    in-memory list to identify (partial) phrases
  • 4c. Repeat from 4a until there are no more lists
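
The evaluation steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: postings are plain dictionaries of the form word -> {doc_id: sorted positions}, positions start at 1, rarity is measured by the number of documents a word occurs in, and the name `phrase_query` is hypothetical.

```python
def phrase_query(index, phrase):
    """Return {doc_id: [start positions]} for an exact phrase."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return {}
    # Step 1: order the query terms from rarest to most common.
    order = sorted(range(len(words)), key=lambda i: len(index[words[i]]))
    # Steps 2-3: the shortest list seeds the in-memory candidates, each
    # stored as the position at which the whole phrase would start.
    first = order[0]
    candidates = {
        doc: {p - first for p in positions if p > first}
        for doc, positions in index[words[first]].items()
    }
    # Step 4: merge each remaining list into the candidate list.
    for i in order[1:]:
        postings = index[words[i]]
        merged = {}
        for doc, starts in candidates.items():
            here = set(postings.get(doc, ()))
            kept = {s for s in starts if s + i in here}
            if kept:
                merged[doc] = kept
        candidates = merged
        if not candidates:
            break  # no candidates left, so later lists need not be read
    return {doc: sorted(starts) for doc, starts in candidates.items()}

# Toy postings mirroring the example: "hampshire" at positions 3, 8, 90 in
# document 9. The other entries are made up for illustration.
index = {
    "new": {5: [3], 9: [4, 7]},
    "hampshire": {5: [1], 9: [3, 8, 90]},
}
matches = phrase_query(index, "new hampshire")
```

With these toy postings the phrase is found starting at position 7 in document 9 and is not found in document 5, where both words occur but not adjacently.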

9
Evaluation Costs
  • The following graph shows the results of running
    132,276 phrase queries on a 21.9Gb collection of
    Web data (from the TREC Web track data)
  • Result: stopping only three words reduces query
    time by 60%!

10
Evaluation Costs
  • The following graph shows how stopping with a
    hand-crafted list of 490 stopwords affects the
    speed of queries of different lengths (same
    queries as previously, but the 1997 TREC WT10g
    web collection)
  • Result: stopping improves speed five-fold, and is
    more effective for longer queries

11
Phrase Indexes
  • Phrase indexes are designed to improve the speed
    of phrase querying, ideally with low main-memory
    and disk space overheads
  • A phrase index is an inverted index, where the
    vocabulary entries are phrases
  • Cannot be used for other purposes: a word-level
    inverted index is still needed
  • It stores entries for selected phrases, and is
    therefore useful for only those phrases; complete
    phrase indexes are unlikely to be practical
  • Two types:
  • Complete index of phrases of a fixed length l
    (the nextword index discussed later is of this
    type, for l = 2)
  • Incomplete phrase index for phrases of arbitrary
    length
  • We experiment with both, and discuss the latter
    first

12
Partial Phrase Indexes
  • A partial phrase index stores phrases that are
    exact matches to queries (Gutwin et al., 1998)
  • The following is an example for five common
    phrases
  • We do not store word positions, as we only
    experiment with exact matching

13
Using partial phrase indexes
  • Query evaluation is straightforward
  • For each phrase query, check the partial phrase
    vocabulary
  • If the phrase is found (for example, "Apple
    Macintosh"), then the query is resolved by
    retrieving its list
  • If the phrase is not found, follow the
    conventional evaluation strategy using the
    word-based inverted index
  • This structure affords the greatest savings for
    phrases containing common words
  • For example, the list for "new york" is at most
    as long as the list for "york" and probably much
    shorter. It's certainly much shorter than the
    list for "new"
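
The lookup-then-fallback logic can be sketched as below. Everything here is illustrative: the function names, the toy document lists, and the stand-in fallback are assumptions. In a real system the fallback would run the conventional evaluation over the word-based inverted index.

```python
def evaluate(query, phrase_index, fallback):
    """Answer from the partial phrase index if possible, else fall back."""
    key = " ".join(query.lower().split())  # normalise case and spacing
    if key in phrase_index:
        return phrase_index[key], "phrase index"
    return fallback(key), "inverted index"

# Hypothetical partial phrase index: phrase -> list of document numbers.
# No word positions are stored, since only exact matches are supported.
phrase_index = {
    "apple macintosh": [2, 17, 40],
    "new york": [3, 17],
}

def word_index_fallback(key):
    # Stand-in for conventional evaluation with the word-level index.
    return []

hit = evaluate("Apple  Macintosh", phrase_index, word_index_fallback)
miss = evaluate("santa claus", phrase_index, word_index_fallback)
```

A hit costs one vocabulary lookup plus one list retrieval. A miss costs only the failed vocabulary lookup on top of normal evaluation.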

14
A Useful Index
  • A useful partial phrase index is dependent on
    being able to successfully predict future
    queries
  • A simple approach is to store only frequent past
    queries, and this is what we did in our
    experiments
  • For example, in our logs, "thumbnail post" was
    the most common (198). Other common queries were
    "jennifer lopez" (126), "the cranberries" (79),
    and "santa claus" (76)
  • In practice, an LRU or LFU strategy might be
    better, but this requires experimental validation
    with much larger query sets (which we now have)
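
Selecting the most frequent past queries is a simple counting exercise. The sketch below assumes the log is a list of query strings. The function name and the toy log are hypothetical, and the counts are invented rather than taken from the Excite log.

```python
from collections import Counter

def select_phrases(query_log, k):
    """Return the k most frequent queries in the log, most frequent first."""
    counts = Counter(" ".join(q.lower().split()) for q in query_log)
    return [phrase for phrase, _ in counts.most_common(k)]

# Toy training log (made-up frequencies).
log = (["thumbnail post"] * 3) + (["Jennifer Lopez"] * 2) + ["santa claus"]
chosen = select_phrases(log, 2)
```

An LRU or LFU policy would replace this one-off selection with a cache that is updated as queries arrive.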

15
Experiments
  • We experimented with a partial phrase index,
    combined with a complete inverted index
  • The partial phrase index is trained using the
    first 66,000 queries from our Excite log, and
    tested using the remaining 66,000
  • We tested keeping the 100, 1,000, and 10,000 most
    common queries in the partial phrase index
  • For 10,000, there are many phrases that occur only
    once, so the choice among them is arbitrary
  • Experiments were on the WT10g web collection

16
Results
  • Storing 10,000 queries in the partial phrase
    index reduces query evaluation costs by around 15%

17
Results
  • Around 70% of the queries in the 10,000-query
    index are two words in length, 21% are three
    words, and 6% are four words
  • The scheme works best for short two-word queries,
    which improve by around 29%
  • The largest index is only 12.8 Mb (or 0.1% of the
    collection size)

18
Nextword Indexes
  • Williams et al. (1999) proposed a nextword index
    as a structure to support fast phrase querying
  • Advantages: no additional main-memory costs;
    works well for phrases of length two

19
Nextword Indexes...
  • Nextword indexes can be used to evaluate two word
    phrase queries as follows
  • 1. Find the word "new" in an in-memory vocabulary
    and retrieve the pointer to a nextword list on
    disk
  • 2. Process the nextwords of "new" to find
    "hampshire" and retrieve the pointer to the
    inverted list on disk
  • 3. Retrieve and process the inverted list (and
    then either display the document identifiers and
    offsets, or retrieve and display the documents)
  • For longer queries, each word need only be
    evaluated as a word or as a nextword; the
    evaluation process is the same as for a
    conventional inverted index
  • Example: the query "the cat sat on the mat" can
    be evaluated by retrieving and merging the lists
    for "the cat", "sat on", and "the mat"
  • For queries with an odd number of words, there is
    a choice as to which of the non-end pairs to
    evaluate
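
A small in-memory sketch of a nextword index and of pair-based evaluation is shown below. It is a simplification under stated assumptions: the structure is a nested dictionary rather than an in-memory vocabulary with on-disk lists, all function names are hypothetical, and for odd-length queries the extra pair is arbitrarily chosen to be the final one, overlapping the pair before it.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Map firstword -> nextword -> {doc_id: [positions of the firstword]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pos, (first, nxt) in enumerate(zip(words, words[1:]), start=1):
            index[first][nxt].setdefault(doc_id, []).append(pos)
    return index

def query_pairs(words):
    """Cover the query with word pairs as (offset, firstword, nextword)."""
    starts = list(range(0, len(words) - 1, 2))
    if len(words) > 1 and len(words) % 2 == 1:
        starts.append(len(words) - 2)  # overlapping pair for the last word
    return [(i, words[i], words[i + 1]) for i in starts]

def nextword_phrase_query(index, phrase):
    """Return {doc_id: [start positions]} by merging the pair lists."""
    result = None
    for offset, first, nxt in query_pairs(phrase.lower().split()):
        postings = index.get(first, {}).get(nxt, {})
        starts = {d: {p - offset for p in ps} for d, ps in postings.items()}
        if result is None:
            result = starts
        else:
            common = result.keys() & starts.keys()
            result = {d: result[d] & starts[d] for d in common}
            result = {d: s for d, s in result.items() if s}
    return {d: sorted(s) for d, s in (result or {}).items()}

docs = {1: "the cat sat on the mat", 2: "the cat sat on a mat"}
nw_index = build_nextword_index(docs)
pairs = query_pairs("the cat sat on the mat".split())
answer = nextword_phrase_query(nw_index, "the cat sat on the mat")
```

For the six-word example query, `pairs` contains exactly the three lookups named above: "the cat", "sat on", and "the mat". A single-word query yields no pairs in this sketch and would be handled by the conventional index.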

20
A Subtle Change
  • Since it was originally proposed in 1999, we have
    experimented with many optimisations to the
    nextword structure
  • In the experiments we describe here, we store
    every nth nextword in memory; on disk, the
    nextwords are front-coded for compact storage
  • This allows the short, in-memory list to be
    scanned for a near match, and then the on-disk
    list to be directly accessed rather than
    completely decoded from its beginning
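
One way to picture this layout is sketched below, assuming the nextwords are kept in sorted order and grouped into blocks of n. The first word of each block stands in for the in-memory sample, and each block is front-coded as (shared prefix length, suffix) pairs. The function names and the word list are hypothetical, and the real on-disk format is not described on the slide.

```python
import bisect

def front_code_blocks(sorted_words, n):
    """Return (heads, blocks) for a sorted word list.

    heads  : first word of every block (the sample kept in memory)
    blocks : per block, a list of (shared_prefix_length, suffix) pairs
    """
    heads, blocks = [], []
    for i in range(0, len(sorted_words), n):
        chunk = sorted_words[i:i + n]
        heads.append(chunk[0])
        coded, prev = [], ""
        for w in chunk:
            k = 0
            while k < min(len(prev), len(w)) and prev[k] == w[k]:
                k += 1
            coded.append((k, w[k:]))
            prev = w
        blocks.append(coded)
    return heads, blocks

def contains(heads, blocks, word):
    """Locate the block via the sample, then decode only that block."""
    b = bisect.bisect_right(heads, word) - 1
    if b < 0:
        return False
    prev = ""
    for k, suffix in blocks[b]:
        prev = prev[:k] + suffix
        if prev == word:
            return True
    return False

nextwords = sorted(["hampshire", "hamper", "haven", "jersey",
                    "mexico", "york", "yorker"])
heads, blocks = front_code_blocks(nextwords, 3)
found = contains(heads, blocks, "hampshire")
```

Because every block restarts with a full word, a lookup decodes at most n entries instead of the whole list from its beginning.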

21
Evaluation Costs
  • The nextword index structure is large
  • For our collection of 10Gb, the compressed
    nextword index is 2.75Gb in size, almost exactly
    twice the size of the inverted index alone
  • But the saving in query evaluation time is
    dramatic
  • Average evaluation time is 0.02 seconds, or
    around 50 times faster than with the inverted
    index
  • Two word queries are 35 times faster
  • Five word queries are 15 times faster
  • Observation: not surprisingly, most of the speed
    gain is for non-rare words; there is little
    improvement for rare words

22
Combined Nextword Inverted Indexes
  • Proposal: combine a partial nextword index with a
    complete conventional index to speed phrase
    queries

23
Combined Nextword Inverted...
  • In a combined inverted nextword index, only
    common firstwords have a nextword list
  • We call this a top frequency based scheme
  • This leads to the following strategy
  • 1. Find all words with nextword lists
    (firstwords). Record inverse document frequency
    (IDF)
  • 2. Find all remaining words not in a
    firstword-nextword pair. Record IDF.
  • 3. Process unique words by increasing IDF (using
    nextword lists where possible)
  • 4. Display results
  • Query evaluation stops when
  • There are no lists left to process
  • There are no candidates left in the in-memory list
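
Steps 1 and 2 of this strategy amount to deciding, for each query word, which index to consult. The sketch below shows only that planning step. Ordering the lookups by frequency and merging the lists are omitted, and the function name, the tuple layout, and the choice of three firstwords are assumptions made for illustration.

```python
# Hypothetical set of common words that have nextword lists.
FIRSTWORDS = {"the", "to", "of"}

def plan_lookups(words, firstwords=FIRSTWORDS):
    """Return lookups as (index_kind, offset, word, nextword_or_None)."""
    plan, covered = [], set()
    # Step 1: every firstword that is followed by another query word is
    # resolved through its nextword list, covering both words of the pair.
    for i, w in enumerate(words[:-1]):
        if w in firstwords:
            plan.append(("nextword", i, w, words[i + 1]))
            covered.update((i, i + 1))
    # Step 2: words not covered by any firstword-nextword pair fall back
    # to their ordinary inverted lists.
    for i, w in enumerate(words):
        if i not in covered:
            plan.append(("inverted", i, w, None))
    return plan

plan = plan_lookups("flights to london".split())
```

In this sketch a query such as "the who" needs a single nextword lookup, while a query with no common words is evaluated entirely with the inverted index.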

24
Evaluation Costs
  • The combined indexing strategy works well
  • It needs around 192Mb of additional space (or 13%
    of the inverted index size) for the three most
    common words
  • Query time is halved, and is only 0.3 seconds
    slower on average per query than with 490-word
    stopping

25
Combining all three
  • We examined combining a partial nextword, partial
    phrase, and complete inverted index
  • Advantages: phrases containing common words can
    use the nextword index, while common phrases
    (which rarely contain common words) can be
    evaluated using the partial phrase index.
    Difficult cases use the inverted index only
  • This works as follows:
  • Look for the phrase in the partial phrase index
    vocabulary. If found, return the answer; if not,
    continue
  • Evaluate the phrase query using the combined
    nextword inverted strategy

26
Three-way results
  • Results for a 10,000 phrase partial phrase index
    on WT10g, with varying numbers of terms in the
    nextword index (0, 3, 6, and 192)

27
Three-way Results
  • The schemes are complementary
  • The partial phrase and partial nextword schemes
    improve different sets of queries
  • Queries of length two improve by 60% compared to
    the phrase and inverted combination, and by 15%
    compared to the nextword and inverted combination
  • Depending on the choice of the number of
    nextwords, the three-way combination is 60% to
    80% faster than an unstopped inverted index alone

28
Conclusions
  • Phrase queries are an important class of
    queries
  • Around 8% of queries explicitly involve a phrase
  • Many more can be evaluated as phrase queries
  • However, phrase queries are costly to evaluate
  • Phrase indexes offer an efficient solution to
    phrase querying
  • With a complete nextword index, querying is 50
    times faster, but the index is twice the size of
    an inverted index
  • With a partial phrase index of 10,000 phrases,
    querying is 15% faster and the index is 1% of the
    size of an inverted index

29
Conclusions
  • Solution: we propose a combined strategy that
    uses the partial phrase index first, then the
    combined partial nextword and inverted indexes
  • With only three common words, this adds 13% to
    the index size and reduces phrase query
    evaluation times by 60%
  • Conclusion: using a combined index, queries with
    stopwords can be efficiently evaluated with only
    a small increase in index size. Stopping is
    unnecessary
  • Questions?

30
Pointers (advertising!)
  • The Search Engine Group,
    http://www.seg.rmit.edu.au/
  • The zettair search engine,
    http://www.seg.rmit.edu.au/zettair/
  • My home page, http://www.cs.rmit.edu.au/hugh
  • H.E. Williams, J. Zobel, and D. Bahle, "Fast
    Phrase Querying with Combined Indexes", ACM
    Transactions on Information Systems, 22(4),
    573-594, 2004.
  • D. Bahle, H.E. Williams, and J. Zobel,
    "Efficient Phrase Querying with an Auxiliary
    Index", In Proc. of the ACM-SIGIR International
    Conference on Research and Development in
    Information Retrieval (Tampere, Finland), K.
    Jarvelin, M. Beaulieu, R. Baeza-Yates, and S. H.
    Myaeng, Eds., 215-221, 2002.
  • D. Bahle, H.E. Williams, and J. Zobel,
    "Compaction techniques for nextword indexes",
    In Proc. of the String Processing and Information
    Retrieval Symposium (San Rafael, Chile). IEEE
    Computer Society Press, 33-45, 2001.
  • D. Bahle, H.E. Williams, and J. Zobel,
    "Optimised phrase querying and browsing in text
    databases", In Proc. of the Australasian
    Computer Science Conference (Gold Coast,
    Australia), M. Oudshoorn, Ed. Conferences in
    Research and Practice in Information Technology,
    11-19, 2001.
  • H.E. Williams, J. Zobel, and P. Anderson,
    "What's next? Index structures for efficient
    phrase querying", In Proc. of the Australasian
    Database Conference (Auckland, New Zealand), M.
    Orlowska, Ed. Springer-Verlag, 141-152, 1999.