Title: Using Bloom Filters to Refine Web Search Results

1. Using Bloom Filters to Refine Web Search Results
- Navendu Jain, Mike Dahlin (University of Texas at Austin)
- Renu Tewari (IBM Almaden Research Center)
2. Motivation
- Google, Yahoo, MSN: a significant fraction of near-duplicates in top search results
- Google "emacs manual" query:
  - 7 of 20 results redundant
    - 3 identical pairs
    - 4 similar to one document
- Similar results for the Yahoo, MSN, and A9 search engines
3. Problem Statement
- Goal: filter near-duplicates in web search results
- Given a query's search results, identify pages that are either:
  - Highly similar in content (and link structure)
  - Contained in another page (inclusions with small changes)
- Key constraints:
  - Low space overhead
    - Use only a small amount of information per document
  - Low time overhead (latency unnoticeable to the end user)
    - Perform fast comparison and matching of documents
4. Our Contributions
- A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results
- Satisfies the key requirements:
  - Compact representation
    - Incurs only about 0.4 extra bytes per document
  - Quick matching
    - 66 ms for the top-80 search results
    - Document similarity computed by a bit-wise AND of the documents' feature representations
  - Easy deployment
    - Attached as a filter over any search engine's result set
5. Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions
6. System Overview
- Applying similarity detection to search engines
- Crawl time: the web crawler
  - Step 1: fetches a page and indexes it
  - Step 2: computes and stores per-page features
- Search time: the search engine (or the end user's browser)
  - Step 1: retrieves the top results' meta-data for a given query
  - Step 2: performs similarity testing to filter highly similar results
7. Feature Extraction and Similarity Testing (1)
- Divide a file into variable-sized blocks (called chunks)
  - Use Rabin fingerprints to compute block boundaries
- The SHA-1 hash of each chunk serves as its feature representation
[Figure: content-defined chunking. The original document is divided into chunks 1-6. In the modified document, data inserted into chunk 2 produces a new chunk 2', while chunks 1 and 3-6 keep their boundaries.]
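The chunking step can be sketched as follows. This is a minimal illustration only: a polynomial rolling hash stands in for a true Rabin fingerprint, and the window size and boundary mask are demo choices, not the paper's parameters.

```python
import hashlib

# Demo parameters (not the paper's): a boundary is declared when the low
# bits of the rolling hash are zero, giving roughly 64-byte average chunks.
BASE, PRIME = 257, 1_000_003
WINDOW, MASK = 16, (1 << 6) - 1

def chunk(data: bytes):
    """Split data into variable-sized, content-defined chunks."""
    top = pow(BASE, WINDOW - 1, PRIME)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:                      # slide the window: drop the oldest byte
            h = (h - data[i - WINDOW] * top) % PRIME
        h = (h * BASE + b) % PRIME           # fold in the incoming byte
        if i >= WINDOW and (h & MASK) == 0:  # content-defined chunk boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing chunk
    return chunks

def features(data: bytes):
    """The SHA-1 hashes of the chunks form the document's feature set."""
    return {hashlib.sha1(c).digest() for c in chunk(data)}
```

Because boundaries depend only on the local window content, an insertion perturbs just the chunk containing the edit; chunks before and after it keep their boundaries and hence their SHA-1 features, which is what makes the feature set robust to small modifications.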
8. Feature Extraction and Similarity Testing (2)
- A Bloom filter is an approximate set representation
  - An array of m bits (initially all 0)
  - k independent hash functions
- Supports two operations:
  - Insert(x, S)
  - Member(y, S)
[Figure: the SHA-1 features of a document's chunks are inserted into its Bloom filter.]
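A minimal Bloom filter supporting these two operations might look as follows; deriving the k bit positions by slicing a single SHA-1 digest is a demo convenience, not necessarily the paper's choice of hash functions.

```python
import hashlib

class BloomFilter:
    """Approximate set: an array of m bits plus k hash functions."""

    def __init__(self, m: int = 256, k: int = 4):
        self.m, self.k = m, k
        self.bits = 0  # the m-bit array, packed into one Python integer

    def _positions(self, item: bytes):
        # Slice k 4-byte values out of one 20-byte SHA-1 digest to get
        # k (approximately independent) bit positions.
        digest = hashlib.sha1(item).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def member(self, item: bytes) -> bool:
        # No false negatives; false positives occur with a probability
        # governed by m, k, and the number of insertions.
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

A membership query on an inserted item always succeeds; a query on an absent item may rarely return a false positive, which is acceptable here since a spurious bit match only slightly inflates the similarity estimate.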
9. Feature Extraction and Similarity Testing (3)
[Figure: Bloom filter generation for documents A and B from their SHA-1 chunk features, followed by a bit-wise AND of the two filters (A ∧ B). In the example, 75% of A's set bits are matched; the pair is flagged as similar when this fraction exceeds the similarity threshold.]
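The comparison reduces to bit counting: of the bits set in A's filter, how many are also set in B's. A short sketch, with filters represented as Python integers for illustration:

```python
def bloom_similarity(a: int, b: int) -> float:
    """Fraction of A's set bits that are also set in B (via bit-wise AND)."""
    set_in_a = bin(a).count("1")
    if set_in_a == 0:
        return 0.0
    return bin(a & b).count("1") / set_in_a

def is_near_duplicate(a: int, b: int, threshold: float = 0.75) -> bool:
    # Flag the pair when the bit overlap meets the similarity threshold.
    # The 0.75 default mirrors the 75% example above, not a prescribed value.
    return bloom_similarity(a, b) >= threshold
```

Since the AND and the popcount are single machine-word operations per word of the filter, comparing two compact filters costs microseconds, which is what makes filtering an entire result set feasible at query time.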
10. Proof-of-concept Examples: Differentiating Between Multiple Similar Documents
- IBM site (http://www.ibm.com) dataset
  - 20 MB (590 documents)
  - /investor/corpgoverance/index.phtml compared with all pages
  - Similar pages (same base URL):
    - cgcoi.phtml (53% match)
    - cgblaws.phtml (69% match)
- CVS repository dataset
  - Technical doc. file (17 KB)
  - Extracted 20 consecutive versions from the CVS
    - foo → original document
    - foo.1 → first modified version
    - foo.19 → last version
11. Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions
12. Evaluation (1): Effect of Degree of Similarity
- "emacs manual" query on Google
  - 493 results retrieved using the Google API
- Fraction of duplicates: 88% (at 50% similarity), 31% (at 90% similarity)
- Larger aliasing of higher-ranked documents
  - The initial result set is repeated more frequently in later results
- Similar results observed for other queries
[Figure: percentage of duplicate documents vs. number of top search results retrieved (0-500), plotted for similarity thresholds of 50%, 60%, 70%, 80%, and 90%.]
13. Evaluation (2): Effect of Search Query Popularity
[Figure: percentage of duplicate documents vs. number of top search results retrieved (up to 1000), for queries of varying popularity: "jon stewart crossfire", "republican national convention", "hawking black hole bet", "day of the dead", "national hurricane center", "electoral college", "Olympics 2004 doping", "x prize spaceship", "indian larry".]
14. Evaluation (3): Analyzing Response Times
- Top-80 search results for the "emacs manual" query
- Offline computation time (pre-computed and stored):
  - CDC chunking: 80 × 0.3 ms
  - Bloom filter generation: 80 × 14 ms
- Online matching time:
  - Bit-wise AND of two Bloom filters: 4 μs
  - Matching and clustering: 66 ms
- Total (offline + online): 24 ms + 1120 ms + 66 ms = 1210 ms
  - Online time alone: 66 ms
15. Selected Related Work
- Most prior work is based on shingling (many variants)
- Basic idea (Broder, 1997):
  - Divide the document into k-shingles: all k consecutive words/tokens
  - Represent the document by its shingle set
  - Large shingle-set intersection → near-duplicate documents
  - Reduces similarity detection to a set-intersection problem
- Differences from our technique:
  - Document similarity based on feature-set intersection
  - Higher feature-set computation overhead
  - Feature-set size dependent on sampling (Min-s, Mod-m, etc.)
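For contrast, Broder-style shingling can be sketched as follows. This toy version computes the Jaccard resemblance of the full shingle sets; the sampling variants mentioned above (Min-s, Mod-m) keep only a subset of shingles to bound the feature-set size.

```python
def shingles(text: str, k: int = 4) -> set:
    """All windows of k consecutive words in the document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Every word of the document contributes to k shingles, so the raw feature set grows with document length; this is the computational and space overhead that sampling variants trade against accuracy, and that the Bloom-filter representation avoids by fixing the per-document footprint.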
16. Conclusions
- Problem: highly similar matches in search results
  - Popular search engines (Google, Yahoo, MSN) return a significant fraction of near-duplicates in their top results
  - Adversely affects query search performance
- Our solution: a similarity detection technique using CDC and Bloom filters
  - Incurs small meta-data overhead: about 0.4 bytes per document
  - Performs fast similarity detection: bit-wise AND operations, on the order of milliseconds
  - Easily deployed as a filter over any search engine's results
17. For More Information
- http://www.cs.utexas.edu/users/nav/