EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES
Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
Polytechnic University, Brooklyn
*NEC Laboratories America, Inc., New Jersey


What is RSS?
- Rich Site Summary (version 0.91)
- RDF Site Summary (versions 0.9 and 1.0)
- Really Simple Syndication (version 2.0)
Provides
- Web content (or summaries)
- Metadata (TITLE, URL and DESCRIPTION)
Goals
- Web syndication
- Allow readers to keep track of updates

Internal Representations for Efficient Matching
Use of Inverted Index
- Queries are indexed by their terms
- Reduces the number of queries examined
Queries, terms and documents are represented by unique identifiers (QIDs, TIDs, DIDs)

Comparison to Traditional Search
Retrospective Search
- On a previously crawled file collection
- Searching the past
- Collection of files is static
- Queries are dynamic
Prospective Search
- On newly added or updated files
- Searching the future
- Files are dynamic
- Collection of queries is static

Example queries
Q1: New York Yankees
Q2: Yankees Red Sox
Q3: Boston Red Sox
Inverted index (term: queries that contain it)
New: Q1
York: Q1
Yankees: Q1, Q2
Red: Q2, Q3
Sox: Q2, Q3
Boston: Q3
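
The example index above can be built in a few lines. This is a minimal sketch that uses the terms and query names themselves in place of numeric TIDs and QIDs; the function name `build_query_index` is made up for illustration and does not come from the poster.

```python
from collections import defaultdict

# The three example subscriptions from the poster.
queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}


def build_query_index(queries):
    """Map each term to the sorted list of query IDs that contain it."""
    index = defaultdict(set)
    for qid, terms in queries.items():
        for term in terms:
            index[term].add(qid)
    return {term: sorted(qids) for term, qids in index.items()}


index = build_query_index(queries)
for term, qids in index.items():
    print(term, qids)
```

Only the queries listed under a document's terms need to be examined; a document that mentions none of the six terms touches no query at all.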

Query (Subscription) Types
AND only: All terms have to appear
k-out-of-n: At least k (out of all n) terms have to appear
Boolean: Boolean expression with AND, OR and NOT
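
The three subscription types can be stated as predicates over the set of terms in a document. The sketch below is illustrative only: the function names and the nested-tuple encoding of Boolean expressions are assumptions, not a format described on the poster.

```python
def matches_and(query_terms, doc_terms):
    """AND only: every query term appears in the document."""
    return set(query_terms) <= set(doc_terms)


def matches_k_of_n(query_terms, doc_terms, k):
    """k-out-of-n: at least k of the query's n terms appear."""
    return len(set(query_terms) & set(doc_terms)) >= k


def matches_boolean(expr, doc_terms):
    """Boolean: expr is a term, or a tuple ("AND"|"OR", e1, e2, ...) or ("NOT", e)."""
    if isinstance(expr, str):
        return expr in doc_terms
    op, *args = expr
    if op == "AND":
        return all(matches_boolean(a, doc_terms) for a in args)
    if op == "OR":
        return any(matches_boolean(a, doc_terms) for a in args)
    if op == "NOT":
        return not matches_boolean(args[0], doc_terms)
    raise ValueError(f"unknown operator: {op}")


doc = {"Boston", "Red", "Sox", "beat", "the", "Yankees"}
print(matches_and(["Boston", "Red", "Sox"], doc))
print(matches_k_of_n(["New", "York", "Yankees"], doc, 1))
print(matches_boolean(("AND", "Yankees", ("NOT", "York")), doc))
```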

For more information please send email to
uirmak_at_cis.poly.edu or suel_at_poly.edu.
Datasets and Experimental Evaluations
Subscriptions: Query logs from excite.com
Documents: Crawled and parsed web pages
Evaluation: Throughput with various numbers of subscriptions

A Primitive Matching Algorithm (AND only)
For each TID in the document
- Find queries that contain TID (using inverted index)
- Maintain a counter (for each query returned)
There is a match if (counter == query size)
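
The counting scheme can be sketched as follows, using terms and query names in place of numeric TIDs and QIDs. The function names are made up for illustration, and the sketch assumes each query lists a term at most once.

```python
from collections import defaultdict

queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}


def build_index(queries):
    """Inverted index over the queries: term -> list of query IDs."""
    index = defaultdict(list)
    for qid, terms in queries.items():
        for term in terms:
            index[term].append(qid)
    return index


def match_and_queries(doc_terms, index, queries):
    """Return the IDs of all AND-only queries satisfied by the document."""
    counters = defaultdict(int)          # the accumulators, one per touched query
    for term in set(doc_terms):          # each distinct document term counts once
        for qid in index.get(term, ()):
            counters[qid] += 1
    return sorted(q for q, c in counters.items() if c == len(queries[q]))


index = build_index(queries)
doc = "Boston Red Sox beat the Yankees".split()
print(match_and_queries(doc, index, queries))
```

In this example Q1 still receives a counter (for "Yankees") even though it cannot match, which is the overhead the optimizations on the poster aim to reduce.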

Opt 1 Exploiting Term Frequencies and Position Information
- Assign TIDs based on frequencies
- Sort terms in the queries by TIDs
- Sort terms in incoming document by TIDs
- For each TID in the document
  - If (TID pos == 0) counter = 1
  - Else if (TID pos == counter) counter++
Advantage
Fewer counters maintained in the accumulators
Smaller hash table
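
A minimal sketch of this optimization follows. The poster does not say which frequencies are used or in which direction TIDs are assigned, so the sketch makes an assumption: terms that occur in fewer queries get smaller TIDs, so each query's first term is its rarest one. All function names are hypothetical.

```python
from collections import Counter, defaultdict

queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}


def assign_tids(queries):
    """Assumption: rarer terms (by number of queries) get smaller TIDs."""
    freq = Counter(t for terms in queries.values() for t in set(terms))
    ordered = sorted(freq, key=lambda t: (freq[t], t))
    return {term: tid for tid, term in enumerate(ordered)}


def build_positional_index(queries, tids):
    """Index TID -> [(query ID, position of the TID in the sorted query)]."""
    index, sizes = defaultdict(list), {}
    for qid, terms in queries.items():
        sorted_tids = sorted({tids[t] for t in terms})
        sizes[qid] = len(sorted_tids)
        for pos, tid in enumerate(sorted_tids):
            index[tid].append((qid, pos))
    return index, sizes


def match(doc_terms, tids, index, sizes):
    counters = {}                                  # the accumulators
    doc_tids = sorted({tids[t] for t in doc_terms if t in tids})
    for tid in doc_tids:
        for qid, pos in index.get(tid, ()):
            if pos == 0:
                counters[qid] = 1                  # first term seen: open a counter
            elif counters.get(qid) == pos:
                counters[qid] += 1                 # all earlier terms were seen
    matches = sorted(q for q, c in counters.items() if c == sizes[q])
    return matches, counters


tids = assign_tids(queries)
index, sizes = build_positional_index(queries, tids)
matches, counters = match("Boston Red Sox beat the Yankees".split(), tids, index, sizes)
print(matches, counters)
```

Because the document lacks "New", the first term of Q1 in sorted order, no counter is ever created for Q1, whereas plain counting would have created one for "Yankees".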

Opt 3 Partitioning the Queries
- Create multiple smaller inverted indexes
- Repeat the matching algorithm
Advantage
Better locality (in the processor cache)
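
The structure of this optimization can be sketched as below. The poster does not state how queries are assigned to partitions, so the round-robin split is an assumption, as are the function names. The cache-locality benefit itself will not be visible in Python; the sketch only shows that partitioned matching returns the same result.

```python
from collections import defaultdict

queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}


def build_index(queries):
    index = defaultdict(list)
    for qid, terms in queries.items():
        for term in set(terms):
            index[term].append(qid)
    return index


def match(doc_terms, index, queries):
    """Counter-based AND matching against one inverted index."""
    counters = defaultdict(int)
    for term in set(doc_terms):
        for qid in index.get(term, ()):
            counters[qid] += 1
    return {q for q, c in counters.items() if c == len(set(queries[q]))}


def partition(queries, num_parts):
    """Assumption: round-robin assignment of queries to partitions."""
    parts = [{} for _ in range(num_parts)]
    for i, (qid, terms) in enumerate(queries.items()):
        parts[i % num_parts][qid] = terms
    return [(part, build_index(part)) for part in parts]


def match_partitioned(doc_terms, partitions):
    """Repeat the matching algorithm once per (smaller) index."""
    matches = set()
    for part, index in partitions:
        matches |= match(doc_terms, index, part)
    return matches


doc = "Boston Red Sox beat the Yankees".split()
partitions = partition(queries, 2)
print(sorted(match_partitioned(doc, partitions)))
```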

Opt 2 Use of Bloom Filters
Bloom Filter: A probabilistic, space-efficient method for membership queries
- For each new item, set the corresponding bit to 1
- False negatives are guaranteed not to occur
Advantage
Reduced cost of maintaining the accumulators
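
For reference, a generic Bloom filter can be sketched as below. The poster does not describe how the filter is wired into the accumulators, so this shows only the data structure itself; the class name, the sizes, and the use of SHA-256 to derive bit positions are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bf = BloomFilter()
for term in ["Boston", "Red", "Sox"]:
    bf.add(term)
print("Red" in bf)
```

Every inserted item is always reported as present; an item that was never inserted is usually, but not always, reported as absent.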

A Clustering Approach
- Queries usually have common terms and some are contained by others
- If a query is already evaluated on a document, contained queries can be answered very efficiently

A Greedy Clustering Algorithm
- Create (artificial) super queries
- Create inverted index only for super queries
- Now maintain bit vectors instead of counters in the accumulators
- Evaluate the corresponding cluster of contained queries for any matched super query
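
One plausible reading of these steps is sketched below. The poster gives no details of the greedy rule, so everything specific here is an assumption: a super query is taken to be the union of its members' terms, a query joins the first cluster it shares a term with as long as the super query stays within a hypothetical `max_terms` limit, and every super query touched by the document has its cluster evaluated.

```python
from collections import defaultdict

queries = {
    "Q1": ["New", "York", "Yankees"],
    "Q2": ["Yankees", "Red", "Sox"],
    "Q3": ["Boston", "Red", "Sox"],
}


def greedy_cluster(queries, max_terms=4):
    """Assumed greedy rule: join the first overlapping cluster that stays small."""
    clusters = []  # each cluster: {"terms": super-query terms, "members": {qid: terms}}
    for qid, terms in queries.items():
        terms = set(terms)
        for cluster in clusters:
            merged = cluster["terms"] | terms
            if cluster["terms"] & terms and len(merged) <= max_terms:
                cluster["terms"] = merged
                cluster["members"][qid] = terms
                break
        else:
            clusters.append({"terms": terms, "members": {qid: terms}})
    return clusters


def compile_clusters(clusters):
    """Index only the super queries; give each member query a bit mask."""
    index, masks = defaultdict(list), []
    for cid, cluster in enumerate(clusters):
        bit_of = {t: 1 << i for i, t in enumerate(sorted(cluster["terms"]))}
        for term, bit in bit_of.items():
            index[term].append((cid, bit))
        masks.append({qid: sum(bit_of[t] for t in terms)
                      for qid, terms in cluster["members"].items()})
    return index, masks


def match(doc_terms, index, masks):
    seen = defaultdict(int)  # accumulators: cluster ID -> bit vector of terms seen
    for term in set(doc_terms):
        for cid, bit in index.get(term, ()):
            seen[cid] |= bit
    return sorted(qid for cid, bits in seen.items()
                  for qid, mask in masks[cid].items() if (bits & mask) == mask)


clusters = greedy_cluster(queries)
index, masks = compile_clusters(clusters)
print(match("Boston Red Sox beat the Yankees".split(), index, masks))
```

With the three example queries, Q2 and Q3 end up in one cluster whose super query is {Boston, Red, Sox, Yankees}, so a single bit vector answers both of them.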
