EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES
Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
Polytechnic University, Brooklyn
*NEC Laboratories America, Inc., New Jersey

1
  • What is RSS?
    • Rich Site Summary (version 0.91)
    • RDF Site Summary (versions 0.9 and 1.0)
    • Really Simple Syndication (version 2.0)
  • Provides
    • Web content (or summaries)
    • Metadata (TITLE, URL and DESCRIPTION)
  • Goals
    • Web syndication
    • Allow readers to keep track of updates
  • Internal Representations for Efficient Matching
    • Use of an inverted index
    • Queries are indexed by their terms
    • Reduces the number of queries examined
    • Queries, terms and documents are represented by unique identifiers (QIDs, TIDs, DIDs)
  • Comparison to Traditional Search
    • Retrospective Search
      • On a previously crawled file collection
      • Searching the past
      • Collection of files is static
      • Queries are dynamic
    • Prospective Search
      • On newly added or updated files
      • Searching the future
      • Files are dynamic
      • Collection of queries is static

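  • Example: three queries and their inverted index
    • Queries
      • Q1: New York Yankees
      • Q2: Yankees Red Sox
      • Q3: Boston Red Sox
    • Inverted index (term: queries containing it)
      • New: Q1
      • York: Q1
      • Yankees: Q1, Q2
      • Red: Q2, Q3
      • Sox: Q2, Q3
      • Boston: Q3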
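
The example index above can be rebuilt with a few lines of Python. This is only a sketch for illustration: the function name `build_query_index` and the use of lowercase strings instead of numeric TIDs and QIDs are assumptions, not part of the poster.

```python
from collections import defaultdict

def build_query_index(queries):
    """Map each term to the sorted list of query IDs that contain it."""
    index = defaultdict(list)
    for qid in sorted(queries):
        # set() avoids listing a query twice if a term is repeated in it
        for term in set(queries[qid]):
            index[term].append(qid)
    return dict(index)

# The three subscriptions from the example (terms lowercased for illustration)
queries = {
    "Q1": ["new", "york", "yankees"],
    "Q2": ["yankees", "red", "sox"],
    "Q3": ["boston", "red", "sox"],
}
index = build_query_index(queries)
print(index["yankees"])  # ['Q1', 'Q2']
```

Looking up a document term in `index` returns only the subscriptions that mention it, which is why the inverted index reduces the number of queries examined.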
  • Query (Subscription) Types
    • AND only: All terms have to appear
    • k-out-of-n: At least k (out of all n) terms have to appear
    • Boolean: Boolean expression with AND, OR and NOT
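
The three subscription types can be read as three predicates over the set of terms in a document. The sketch below is one possible reading, not the authors' implementation; the function names and the nested-tuple encoding of Boolean expressions are assumptions chosen for brevity.

```python
def matches_and(query_terms, doc_terms):
    """AND only: every query term must appear in the document."""
    return set(query_terms) <= set(doc_terms)

def matches_k_of_n(query_terms, doc_terms, k):
    """k-out-of-n: at least k of the n query terms must appear."""
    return len(set(query_terms) & set(doc_terms)) >= k

def matches_boolean(expr, doc_terms):
    """Boolean: expr is a term string or a tuple such as
    ("AND", a, b), ("OR", a, b) or ("NOT", a) (hypothetical encoding)."""
    doc = set(doc_terms)
    if isinstance(expr, str):
        return expr in doc
    op, *args = expr
    if op == "AND":
        return all(matches_boolean(a, doc) for a in args)
    if op == "OR":
        return any(matches_boolean(a, doc) for a in args)
    if op == "NOT":
        return not matches_boolean(args[0], doc)
    raise ValueError(f"unknown operator: {op}")

doc = ["boston", "red", "sox", "win"]
and_hit = matches_and(["red", "sox"], doc)
k_hit = matches_k_of_n(["new", "york", "yankees", "sox"], doc, 1)
k_miss = matches_k_of_n(["new", "york", "yankees", "sox"], doc, 2)
bool_hit = matches_boolean(("AND", "sox", ("NOT", "yankees")), doc)
```

Note that AND only is the special case of k-out-of-n with k equal to n.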

For more information please send email to
uirmak_at_cis.poly.edu or suel_at_poly.edu.
2
  • Datasets and Experimental Evaluations
    • Subscriptions: Query logs from excite.com
    • Documents: Crawled and parsed web pages
    • Evaluation: Throughput with various numbers of subscriptions
  • A Primitive Matching Algorithm (AND only)
    • For each TID in the document
      • Find queries that contain TID (using inverted index)
      • Maintain a counter (for each query returned)
    • There is a match if (counter == query size)
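
The counting steps above can be sketched as follows. The function name, the dictionary-based index and the small integer TIDs and QIDs are illustrative assumptions; the poster does not specify data structures beyond the inverted index and the counters.

```python
from collections import defaultdict

def match_and_queries(doc_tids, index, query_size):
    """Return the QIDs of AND-only queries fully contained in the document.

    index: TID -> list of QIDs whose query contains that TID
    query_size: QID -> number of distinct terms in the query
    """
    counters = defaultdict(int)        # the accumulators, one per touched query
    for tid in set(doc_tids):          # each distinct document term once
        for qid in index.get(tid, ()):
            counters[qid] += 1
    return sorted(q for q, c in counters.items() if c == query_size[q])

# Hypothetical TIDs: new=0, york=1, yankees=2, red=3, sox=4, boston=5
index = {0: [1], 1: [1], 2: [1, 2], 3: [2, 3], 4: [2, 3], 5: [3]}
query_size = {1: 3, 2: 3, 3: 3}
matches = match_and_queries([5, 3, 4, 2], index, query_size)
print(matches)  # [2, 3]
```

Query 1 receives a counter because the document contains one of its terms, but the counter stops at 1 and never reaches the query size of 3.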

  • Opt 1: Exploiting Term Frequencies and Position Information
    • Assign TIDs based on frequencies
    • Sort terms in the queries by TIDs
    • Sort terms in the incoming document by TIDs
    • For each TID in the document
      • If (TID pos == 0) counter = 1
      • Else if (TID pos == counter) counter++
    • Advantage
      • Fewer counters maintained in the accumulators
      • Smaller hash table
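
A minimal sketch of this position-based rule follows, assuming each posting stores the position of the term within its query's TID-sorted term list. The function names are hypothetical, and the direction of the frequency-based TID assignment is not stated on the poster, so the example simply takes the TIDs as given.

```python
def build_positional_index(queries):
    """queries: QID -> iterable of TIDs.
    Returns (index, sizes); index maps TID -> list of (QID, position),
    where position is the rank of the TID in the query's sorted TID list."""
    index, sizes = {}, {}
    for qid, tids in queries.items():
        ordered = sorted(set(tids))
        sizes[qid] = len(ordered)
        for pos, tid in enumerate(ordered):
            index.setdefault(tid, []).append((qid, pos))
    return index, sizes

def match_with_positions(doc_tids, index, sizes):
    """Return (matching QIDs, number of counters that were created)."""
    counters = {}
    for tid in sorted(set(doc_tids)):          # document terms in TID order
        for qid, pos in index.get(tid, ()):
            if pos == 0:
                counters[qid] = 1              # first query term seen: open a counter
            elif counters.get(qid) == pos:
                counters[qid] += 1             # all earlier query terms were seen
            # otherwise an earlier term is missing, so the query cannot match
    matched = sorted(q for q, c in counters.items() if c == sizes[q])
    return matched, len(counters)

# Hypothetical TIDs: new=0, york=1, yankees=2, red=3, sox=4, boston=5
queries = {1: [0, 1, 2], 2: [2, 3, 4], 3: [5, 3, 4]}
index, sizes = build_positional_index(queries)
matched, num_counters = match_with_positions([5, 3, 4, 2], index, sizes)
print(matched, num_counters)  # [2, 3] 2
```

In this example query 1 never gets a counter, because its first term (TID 0) is absent from the document; plain counting over the same input would allocate three counters instead of two.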
  • Opt 3: Partitioning the Queries
    • Create multiple smaller inverted indexes
    • Repeat the matching algorithm on each index
    • Advantage
      • Better locality (in the processor cache)
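
One way to picture the partitioning is shown below. The round-robin split and all function names are assumptions made for this sketch; the poster does not say how queries are assigned to partitions, and Python will not reproduce the cache effect that motivates the optimization.

```python
from collections import defaultdict

def build_index(queries):
    """Inverted index for one partition: TID -> list of QIDs."""
    index = defaultdict(list)
    for qid, tids in queries.items():
        for tid in set(tids):
            index[tid].append(qid)
    return index

def partition_queries(queries, num_parts):
    """Split queries round-robin into num_parts groups (assumed policy)."""
    parts = [{} for _ in range(num_parts)]
    for i, qid in enumerate(sorted(queries)):
        parts[i % num_parts][qid] = queries[qid]
    return parts

def match_partitioned(doc_tids, partitions):
    """Run the counting algorithm once per (queries, index) partition."""
    doc = set(doc_tids)
    matches = []
    for queries, index in partitions:
        counters = defaultdict(int)    # small accumulator table per partition
        for tid in doc:
            for qid in index.get(tid, ()):
                counters[qid] += 1
        matches.extend(q for q, c in counters.items()
                       if c == len(set(queries[q])))
    return sorted(matches)

queries = {1: [0, 1, 2], 2: [2, 3, 4], 3: [3, 4, 5], 4: [4, 5]}
partitions = [(part, build_index(part))
              for part in partition_queries(queries, 2)]
result = match_partitioned([2, 3, 4, 5], partitions)
print(result)  # [2, 3, 4]
```

Each pass touches only one small index and one small accumulator table, which is the property that would help cache locality in a lower-level implementation.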
  • Opt 2: Use of Bloom Filters
    • Bloom filter: A probabilistic, space-efficient method for membership queries
    • For each new item, set the corresponding bits to 1
    • False negatives are guaranteed not to occur
    • Advantage
      • Reduced cost of maintaining the accumulators
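
For readers unfamiliar with the data structure, here is a generic Bloom filter sketch. It illustrates only the membership test itself; how the authors combine it with the accumulators is not described on the poster, and the class name, bit-array size and SHA-256-based hashing are assumptions.

```python
import hashlib

class BloomFilter:
    """Bit array with num_hashes hash functions; no false negatives,
    but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for term in ["boston", "red", "sox"]:
    bf.add(term)
print("red" in bf)  # True: an added item is always reported as present
```

A lookup for an item that was never added usually returns False, but it may return True if other items happened to set the same bits.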
  • A Clustering Approach
    • Queries usually have common terms, and some are contained by others
    • If a query is already evaluated on a document, contained queries can be answered very efficiently

  • A Greedy Clustering Algorithm
    • Create (artificial) super queries
    • Create inverted index only for super queries
    • Now maintain bit vectors instead of counters in the accumulators
    • Evaluate the corresponding cluster of contained queries for any matched super query
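
The sketch below shows one way these steps could fit together. The poster does not give the greedy rule, so the one used here is an assumption: a query joins the first cluster that shares a term with it, as long as the super query stays within `max_terms`. The function names and the rule that a cluster is evaluated as soon as its super query has an accumulator are also assumptions.

```python
def greedy_cluster(queries, max_terms=8):
    """Group queries into clusters; each cluster's 'terms' list is its
    super query and each member stores a bit mask over that list."""
    clusters = []
    # Larger queries first, ties broken by QID (hypothetical ordering)
    for qid in sorted(queries, key=lambda q: (-len(set(queries[q])), q)):
        terms = set(queries[qid])
        target = None
        for c in clusters:
            existing = set(c["terms"])
            if existing & terms and len(existing | terms) <= max_terms:
                target = c
                break
        if target is None:
            target = {"terms": [], "members": {}}
            clusters.append(target)
        for t in sorted(terms):
            if t not in target["terms"]:
                target["terms"].append(t)   # append only, so bit positions stay stable
        target["members"][qid] = sum(1 << target["terms"].index(t) for t in terms)
    return clusters

def match_clustered(doc_terms, clusters):
    # Inverted index over super-query terms only: term -> [(cluster, bit)]
    index = {}
    for cid, c in enumerate(clusters):
        for bit, t in enumerate(c["terms"]):
            index.setdefault(t, []).append((cid, bit))
    accumulators = {}                       # cluster -> bit vector of seen terms
    for t in set(doc_terms):
        for cid, bit in index.get(t, ()):
            accumulators[cid] = accumulators.get(cid, 0) | (1 << bit)
    matches = []
    for cid, bits in accumulators.items():
        for qid, mask in clusters[cid]["members"].items():
            if (bits & mask) == mask:       # all of the query's terms were seen
                matches.append(qid)
    return sorted(matches)

queries = {
    "Q1": ["new", "york", "yankees"],
    "Q2": ["yankees", "red", "sox"],
    "Q3": ["boston", "red", "sox"],
    "Q4": ["red", "sox"],
}
clusters = greedy_cluster(queries, max_terms=4)
matched = match_clustered(["boston", "red", "sox", "win"], clusters)
print(matched)  # ['Q3', 'Q4']
```

With a bit vector per super query, each contained query is checked with a single mask comparison instead of its own counter.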