Title: Using Bloom Filters to Refine Web Search Results

1. Using Bloom Filters to Refine Web Search Results
- Navendu Jain, Mike Dahlin (University of Texas at Austin)
- Renu Tewari (IBM Almaden Research Center)
2. Motivation
- Google, Yahoo, MSN: a significant fraction of near-duplicates in top search results
- Google "emacs manual" query:
  - 7 of 20 results redundant
    - 3 identical pairs
    - 4 similar to one document
- Similar results for the Yahoo, MSN, and A9 search engines
3. Problem Statement
- Goal: filter near-duplicates in web search results
- Given a query's search results, identify pages that are either:
  - Highly similar in content (and link structure)
  - Contained in another page (inclusions with small changes)
- Key constraints:
  - Low space overhead
    - Use only a small amount of information per document
  - Low time overhead (latency unnoticeable to the end user)
    - Perform fast comparison and matching of documents
4. Our Contributions
- A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results
- Satisfies the key requirements:
  - Compact representation
    - Incurs only about 0.4 extra bytes per document
  - Quick matching
    - 66 ms for the top-80 search results
    - Document similarity computed by a bit-wise AND of the documents' feature representations
  - Easy deployment
    - Attached as a filter over any search engine's result set
5. Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions
6. System Overview
- Applying similarity detection to search engines
- Crawl time: the web crawler
  - Step 1: fetches a page and indexes it
  - Step 2: computes and stores per-page features
- Search time: the search engine (or the end user's browser)
  - Step 1: retrieves the top results' meta-data for a given query
  - Step 2: performs similarity testing to filter highly similar results
7. Feature Extraction and Similarity Testing (1)
- Divide a file into variable-sized blocks (called chunks)
  - Use Rabin fingerprints to compute block boundaries
- The SHA-1 hash of each chunk serves as its feature representation
[Figure: content-defined chunking. The original document is divided into chunks 1-6. In the modified document, data inserted into chunk 2 produces a new chunk 2', while chunks 1 and 3-6 keep their boundaries.]
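The chunking step can be sketched as follows. This is a minimal illustration only: a polynomial rolling hash stands in for a true Rabin fingerprint, and the window size and boundary mask are demo choices, not the paper's parameters.

```python
import hashlib

# Demo parameters (not the paper's): a boundary is declared when the low
# bits of the rolling hash are zero, giving roughly 64-byte average chunks.
BASE, PRIME = 257, 1_000_003
WINDOW, MASK = 16, (1 << 6) - 1

def chunk(data: bytes):
    """Split data into variable-sized, content-defined chunks."""
    top = pow(BASE, WINDOW - 1, PRIME)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:                      # slide the window: drop the oldest byte
            h = (h - data[i - WINDOW] * top) % PRIME
        h = (h * BASE + b) % PRIME           # fold in the incoming byte
        if i >= WINDOW and (h & MASK) == 0:  # content-defined chunk boundary
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing chunk
    return chunks

def features(data: bytes):
    """The SHA-1 hashes of the chunks form the document's feature set."""
    return {hashlib.sha1(c).digest() for c in chunk(data)}
```

Because boundaries depend only on the local window content, an insertion perturbs just the chunk containing the edit; chunks before and after it keep their boundaries and hence their SHA-1 features, which is what makes the feature set robust to small modifications.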
8. Feature Extraction and Similarity Testing (2)
- A Bloom filter is an approximate set representation
  - An array of m bits (initially all 0)
  - k independent hash functions
- Supports two operations:
  - Insert(x, S)
  - Member(y, S)
[Figure: the SHA-1 features of a document's chunks are inserted into its Bloom filter.]
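A minimal Bloom filter supporting these two operations might look as follows; deriving the k bit positions by slicing a single SHA-1 digest is a demo convenience, not necessarily the paper's choice of hash functions.

```python
import hashlib

class BloomFilter:
    """Approximate set: an array of m bits plus k hash functions."""

    def __init__(self, m: int = 256, k: int = 4):
        self.m, self.k = m, k
        self.bits = 0  # the m-bit array, packed into one Python integer

    def _positions(self, item: bytes):
        # Slice k 4-byte values out of one 20-byte SHA-1 digest to get
        # k (approximately independent) bit positions.
        digest = hashlib.sha1(item).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def member(self, item: bytes) -> bool:
        # No false negatives; false positives occur with a probability
        # governed by m, k, and the number of insertions.
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

A membership query on an inserted item always succeeds; a query on an absent item may rarely return a false positive, which is acceptable here since a spurious bit match only slightly inflates the similarity estimate.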
9. Feature Extraction and Similarity Testing (3)
[Figure: Bloom filter generation for documents A and B from their SHA-1 chunk features, followed by a bit-wise AND of the two filters (A ∧ B). In the example, 75% of A's set bits are matched; the pair is flagged as similar when this fraction exceeds the similarity threshold.]
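The comparison reduces to bit counting: of the bits set in A's filter, how many are also set in B's. A short sketch, with filters represented as Python integers for illustration:

```python
def bloom_similarity(a: int, b: int) -> float:
    """Fraction of A's set bits that are also set in B (via bit-wise AND)."""
    set_in_a = bin(a).count("1")
    if set_in_a == 0:
        return 0.0
    return bin(a & b).count("1") / set_in_a

def is_near_duplicate(a: int, b: int, threshold: float = 0.75) -> bool:
    # Flag the pair when the bit overlap meets the similarity threshold.
    # The 0.75 default mirrors the 75% example above, not a prescribed value.
    return bloom_similarity(a, b) >= threshold
```

Since the AND and the popcount are single machine-word operations per word of the filter, comparing two compact filters costs microseconds, which is what makes filtering an entire result set feasible at query time.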
10. Proof-of-concept Examples: Differentiating Between Multiple Similar Documents
- IBM site (http://www.ibm.com) dataset
  - 20 MB (590 documents)
  - /investor/corpgoverance/index.phtml compared with all pages
  - Similar pages (same base URL):
    - cgcoi.phtml (53% match)
    - cgblaws.phtml (69% match)
- CVS repository dataset
  - Technical doc. file (17 KB)
  - Extracted 20 consecutive versions from the CVS
    - foo → original document
    - foo.1 → first modified version
    - foo.19 → last version
11. Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions
12. Evaluation (1): Effect of Degree of Similarity
- "emacs manual" query on Google
  - 493 results retrieved using the Google API
- Fraction of duplicates: 88% (at 50% similarity), 31% (at 90% similarity)
- Larger aliasing of higher-ranked documents
  - The initial result set is repeated more frequently in later results
- Similar results observed for other queries
[Figure: percentage of duplicate documents vs. number of top search results retrieved (0-500), plotted for similarity thresholds of 50%, 60%, 70%, 80%, and 90%.]
13. Evaluation (2): Effect of Search Query Popularity
[Figure: percentage of duplicate documents vs. number of top search results retrieved (up to 1000), for queries of varying popularity: "jon stewart crossfire", "republican national convention", "hawking black hole bet", "day of the dead", "national hurricane center", "electoral college", "Olympics 2004 doping", "x prize spaceship", "indian larry".]
14. Evaluation (3): Analyzing Response Times
- Top-80 search results for the "emacs manual" query
- Offline computation time (pre-computed and stored):
  - CDC chunking: 80 × 0.3 ms
  - Bloom filter generation: 80 × 14 ms
- Online matching time:
  - Bit-wise AND of two Bloom filters: 4 μs
  - Matching and clustering: 66 ms
- Total (offline + online): 24 ms + 1120 ms + 66 ms = 1210 ms
  - Online time alone: 66 ms
15. Selected Related Work
- Most prior work is based on shingling (many variants)
- Basic idea (Broder, 1997):
  - Divide the document into k-shingles: all k consecutive words/tokens
  - Represent the document by its shingle set
  - Large shingle-set intersection → near-duplicate documents
  - Reduces similarity detection to a set-intersection problem
- Differences from our technique:
  - Document similarity based on feature-set intersection
  - Higher feature-set computation overhead
  - Feature-set size dependent on sampling (Min-s, Mod-m, etc.)
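For contrast, Broder-style shingling can be sketched as follows. This toy version computes the Jaccard resemblance of the full shingle sets; the sampling variants mentioned above (Min-s, Mod-m) keep only a subset of shingles to bound the feature-set size.

```python
def shingles(text: str, k: int = 4) -> set:
    """All windows of k consecutive words in the document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Every word of the document contributes to k shingles, so the raw feature set grows with document length; this is the computational and space overhead that sampling variants trade against accuracy, and that the Bloom-filter representation avoids by fixing the per-document footprint.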
16. Conclusions
- Problem: highly similar matches in search results
  - Popular search engines (Google, Yahoo, MSN) return a significant fraction of near-duplicates in their top results
  - Adversely affects query search performance
- Our solution: a similarity detection technique using CDC and Bloom filters
  - Incurs small meta-data overhead: about 0.4 bytes per document
  - Performs fast similarity detection: bit-wise AND operations, on the order of milliseconds
  - Easily deployed as a filter over any search engine's results
17. For More Information
- http://www.cs.utexas.edu/users/nav/