Using Bloom Filters to Refine Web Search Results

1
Using Bloom Filters to Refine Web Search Results
  • Navendu Jain
  • Mike Dahlin
  • University of Texas at Austin
  • Renu Tewari
  • IBM Almaden Research Center

2
Motivation
  • Google, Yahoo, MSN: significant fraction of
    near-duplicates in top search results
  • Google "emacs manual" query
  • 7 of 20 results redundant
  • 3 identical pairs
  • 4 similar to one document
  • Similar results for Yahoo, MSN, A9 search engines

3
Problem Statement
  • Goal: Filter near-duplicates in web search
    results
  • Given a query's search results, identify pages that
    are either
  • Highly similar in content (and link structure)
  • Contained in another page (Inclusions with small
    changes)
  • Key Constraints
  • Low Space Overhead
  • Use only a small amount of information per
    document
  • Low Time Overhead (latency unnoticeable to
    end-user)
  • Perform fast comparison and matching of documents

4
Our Contributions
  • A novel similarity detection technique using
    content-defined chunking and Bloom filters to
    refine web search results
  • Satisfies key requirements
  • Compact Representation
  • Incurs only about 0.4% extra bytes per document
  • Quick Matching
  • 66 ms for top-80 search results
  • Document similarity using bit-wise AND of their
    feature representations
  • Easy Deployment
  • Attached as a filter over any search engine's
    result set

5
Talk Outline
  • Motivation
  • Our approach
  • System Overview
  • Bloom Filters for Similarity Testing
  • Experimental Evaluation
  • Related Work and Conclusions

6
System Overview
  • Applying similarity detection to search engines
  • Crawl time: the web crawler
  • Step 1: fetches a page and indexes it
  • Step 2: computes and stores per-page features
  • Search time: the search engine (or end user's
    browser)
  • Step 1: retrieves the top results' metadata for a
    given query
  • Step 2: runs similarity testing to filter highly
    similar results
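
The search-time step can be sketched as a single pass over the ranked results. This is only an illustration under stated assumptions: each result is taken to arrive with a precomputed Bloom filter stored as a Python integer, the greedy "keep the first, drop later look-alikes" policy and the 0.7 threshold are assumptions, and `matched_fraction` and `filter_results` are hypothetical names rather than anything from the talk.

```python
def matched_fraction(a: int, b: int) -> float:
    """Fraction of a's set bits that are also set in b."""
    set_bits = bin(a).count("1")
    return bin(a & b).count("1") / set_bits if set_bits else 0.0


def filter_results(results, threshold=0.7):
    """Keep a ranked result only if it is not too similar to any kept result.

    `results` is a ranked list of (url, bloom_bits) pairs; both the pair layout
    and the threshold value are assumptions made for this sketch.
    """
    kept = []
    for url, bits in results:
        if all(matched_fraction(bits, kept_bits) < threshold for _, kept_bits in kept):
            kept.append((url, bits))
    return kept


ranked = [("a", 0b1111), ("b", 0b0111), ("c", 0b110000)]
print([url for url, _ in filter_results(ranked)])  # ['a', 'c']
```

Here result "b" is dropped because all of its set bits also appear in the higher-ranked "a", while "c" shares no bits with "a" and is kept.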

7
Feature Extraction and Similarity Testing (1)
  • Divide a file into variable-sized blocks (called
    chunks)
  • Use Rabin fingerprint to compute block
    boundaries
  • SHA-1 hash of each chunk as its feature
    representation
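
A minimal sketch of the chunking step, assuming a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint. The window size, base, modulus, and boundary mask are illustrative values, not parameters from the talk, and `chunk_boundaries` and `chunk_features` are hypothetical names.

```python
import hashlib

WINDOW = 16          # bytes in the sliding window (illustrative value)
BASE = 257           # rolling-hash base (illustrative value)
MOD = (1 << 31) - 1  # rolling-hash modulus (illustrative value)
MASK = 0x3F          # cut when the low 6 bits of the hash are zero


def chunk_boundaries(data: bytes) -> list[int]:
    """Return the end offsets of the content-defined chunks of `data`."""
    boundaries = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if data and (not boundaries or boundaries[-1] != len(data)):
        boundaries.append(len(data))  # the last chunk runs to the end of input
    return boundaries


def chunk_features(data: bytes) -> list[str]:
    """SHA-1 hex digest of each chunk, used as the document's features."""
    features, start = [], 0
    for end in chunk_boundaries(data):
        features.append(hashlib.sha1(data[start:end]).hexdigest())
        start = end
    return features
```

Because a boundary depends only on the last `WINDOW` bytes, inserting data in the middle of a document shifts later boundaries without changing where they fall relative to the content, so only the chunks near the edit get new hashes.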


[Figure: Content-defined chunking. The original document is split into chunks 1 to 6. After data is inserted, the modified document consists of chunks 1, 2', 3, 4, 5, 6.]
8
Feature Extraction and Similarity Testing (2)
  • A Bloom filter is an approximate set
    representation
  • An array of m bits (initially 0)
  • k independent hash functions
  • Supports
  • Insert (x,S)
  • Member (y,S)
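
The two operations can be sketched as follows. This is an illustration, not the authors' implementation: the k hash functions are simulated by salting SHA-1 with the function index (an assumption), and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Approximate set: an m-bit array with k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as the m-bit array, initially all 0

    def _positions(self, item: bytes):
        # Assumption: derive k hash functions by prefixing the item with the
        # function index before hashing with SHA-1 (requires k <= 256).
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def member(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.insert(b"chunk-1")
print(bf.member(b"chunk-1"))  # True
```

An inserted item is always reported as a member; an item that was never inserted is usually, but not always, reported as absent.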

9
Feature Extraction and Similarity Testing (3)
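[Figure: Bloom filter generation. The SHA-1 chunk hashes of documents A and B are inserted into one Bloom filter each, and the bit-wise AND A /\ B is computed. In the example, 75% of A's set bits matched, which is compared against the similarity threshold.]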
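
The matching step reduces to counting bits. A minimal sketch, assuming each Bloom filter is held as a Python integer; the bit patterns and the 0.7 threshold are made up for illustration, and `matched_fraction` and `is_similar` are hypothetical names.

```python
def matched_fraction(a: int, b: int) -> float:
    """Fraction of A's set bits that are also set in B (bit-wise AND)."""
    set_bits = bin(a).count("1")
    return bin(a & b).count("1") / set_bits if set_bits else 0.0


def is_similar(a: int, b: int, threshold: float = 0.7) -> bool:
    # The threshold value is an assumption, not a figure from the talk.
    return matched_fraction(a, b) > threshold


filter_a = 0b0101010100  # 4 set bits
filter_b = 0b0001010110  # shares 3 of them with filter_a
print(matched_fraction(filter_a, filter_b))  # 0.75
print(is_similar(filter_a, filter_b))        # True
```

The measure is asymmetric: it is normalized by A's set bits, so a document contained in a larger one can score high against it even when the reverse comparison does not.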
10
Proof-of-concept examples: Differentiate between
multiple similar documents
  • IBM site (http://www.ibm.com) dataset
  • 20 MB (590 documents)
  • /investor/corpgoverance/index.phtml compared with
    all pages
  • Similar pages (same base URL)
  • cgcoi.phtml (53%)
  • cgblaws.phtml (69%)
  • CVS Repository Dataset
  • Technical doc. file (17 KB)
  • Extracted 20 consecutive versions from the CVS
  • foo → original document
  • foo.1 → first modified version
  • foo.19 → last version

11
Talk Outline
  • Motivation
  • Our approach
  • System Overview
  • Bloom Filters for Similarity Testing
  • Experimental Evaluation
  • Related Work and Conclusions

12
Evaluation (1): Effect of degree of similarity
  • "emacs manual" query on Google
  • 493 results retrieved using the Google API
  • Fraction of duplicates
  • 88% (50% similarity), 31% (90% similarity)
  • Larger aliasing of higher-ranked documents
  • Initial result set repeated more frequently in
    later results
  • Similar results observed for other queries

[Figure: Percentage of duplicate documents vs. number of top search results retrieved (0 to 500), with one curve per similarity level: 50%, 60%, 70%, 80%, and 90% similar.]
13
Evaluation (2): Effect of Search Query Popularity
[Figure: Percentage of duplicate documents vs. number of top search results retrieved, plotted for the queries "jon stewart crossfire", "republican national convention", "hawking black hole bet", "day of the dead", "national hurricane center", "electoral college", "Olympics 2004 doping", "x prize spaceship", and "indian larry".]
14
Evaluation (3): Analyzing Response Times
  • Top-80 search results for "emacs manual" query
  • Offline computation time (pre-computed and
    stored)
  • CDC chunks: 80 × 0.3 ms
  • Bloom filter generation: 80 × 14 ms
  • Online matching time
  • Bit-wise AND of two Bloom filters (4 µs)
  • Matching and clustering time: 66 ms
  • Total (offline + online): 1210 ms
  • Online time: 66 ms
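
The total can be checked with a little arithmetic, reading the offline figures as per-document costs over the 80 results. That reading is an interpretation of the slide, though it reproduces the stated 1210 ms exactly. Times are kept in integer microseconds to avoid floating-point noise.

```python
NUM_RESULTS = 80
CDC_US_PER_DOC = 300       # 0.3 ms of content-defined chunking per document
BLOOM_US_PER_DOC = 14_000  # 14 ms of Bloom filter generation per document
ONLINE_US = 66_000         # 66 ms of matching and clustering

offline_us = NUM_RESULTS * (CDC_US_PER_DOC + BLOOM_US_PER_DOC)
total_us = offline_us + ONLINE_US

print(offline_us / 1000)  # 1144.0 (ms, offline)
print(total_us / 1000)    # 1210.0 (ms, offline + online)
```

Only the 66 ms online portion is paid at query time; the offline portion is computed once and stored.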

15
Selected Related Work
  • Most prior work based on shingling (many
    variants)
  • Basic idea (Broder97)
  • Divide document into k-shingles: all sequences of k
    consecutive words/tokens
  • Represent document by shingle-set
  • Shingle-set intersection large → near-duplicate
    documents
  • Reduce similarity detection problem to set
    intersection
  • Differences with our technique
  • Document similarity based on feature set
    intersection
  • Higher feature-set computational overhead
  • Feature set size dependent on sampling (Min_s,
    Mod_m, etc.)
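
For comparison, the shingling idea can be sketched in a few lines. This is a simplified illustration over whitespace-separated words with no sampling; normalizing the intersection by the union (a Jaccard-style ratio) is an assumption about how "large" is measured, and `shingles` and `resemblance` are hypothetical names.

```python
def shingles(text: str, k: int) -> set[tuple[str, ...]]:
    """All sequences of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Size of the shingle-set intersection relative to the union."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


print(resemblance("the quick brown fox jumps", "the quick brown fox sleeps"))  # 0.5
```

Each five-word sentence yields three 3-shingles, two of which are shared, giving 2 / 4 = 0.5.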

16
Conclusions
  • Problem: Highly similar matches in search results
  • Popular search engines (Google, Yahoo, MSN)
  • Significant fraction of near-duplicates in top
    results
  • Adversely affects query search performance
  • Our solution: A similarity detection technique
    using CDC and Bloom filters
  • Incurs small meta-data overhead
  • 0.4% extra bytes per document
  • Performs fast similarity detection
  • Bit-wise AND operations: on the order of ms
  • Easily deployed as a filter over any search
    engine's results

17
For more information
  • http://www.cs.utexas.edu/users/nav/