Approximation Algorithms for Frequency Related Query Processing on Streaming Data

Transcript and Presenter's Notes

1
Approximation Algorithms for Frequency Related Query Processing on Streaming Data

Presented by: Fan Deng
Supervisor: Dr. Davood Rafiei
May 30, 2007

2
Outline

Introduction
Continuous membership query
Point query
Similarity self-join size estimation
Conclusions and future work

3
Data stream

A sequence of data records
Examples
Document/URL streams from a Web crawler
IP packet streams
Web advertisement click streams
Sensor reading streams
...

4
Processing in one pass

One pass processing
Online stream (one scan required)
Massive offline stream (one scan preferred)
Challenges
Huge data volume
Fast processing requirement
Relatively small fast storage space

5
Approximation algorithms

Exact query answers
can be slow to obtain
may need large storage space
sometimes are not necessary
Approximate query answers
can take much less time
may need less space
with acceptable errors

6
Frequency related queries

Frequency
# of occurrences
Continuous membership query
Point query
Similarity self-join size estimation

7
Outline

Introduction
Continuous membership query [SIGMOD06]
Motivating application
Problem statement
Our theoretical and experimental results
Point query
Similarity self-join size estimation
Conclusions and future work

8
A Motivating Application

Duplicate URL detection in Web crawling
Search engines [Broder et al. WWW03]
Fetch web pages continuously
Extract URLs within each downloaded page
Check each URL (duplicate detection)
If never seen before
Then fetch it
Else skip it
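
The check-then-fetch loop above can be sketched with an exact in-memory set. This is only an illustration of the baseline that the later slides argue does not scale; the names crawl, fetch_links and max_pages are hypothetical and do not come from the slides.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl with exact duplicate URL detection.

    fetch_links is a caller-supplied function (an assumption of this sketch)
    that downloads a page and returns the URLs found in it.
    """
    seen = set(seed_urls)        # every URL ever discovered
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never seen before: schedule a fetch
                seen.add(link)
                frontier.append(link)
            # else: duplicate, skip it
    return fetched
```

The set grows with the number of distinct URLs, which is exactly the memory problem the next slide raises.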

9
A Motivating Application (cont.)

Problems
Huge number of distinct URLs
Memory is usually not large enough
Disks are slow
Errors are usually acceptable
A false positive (missed URLs)
A false negative (redundant crawls or disk search)

10
Problem statement

A sequence of elements with order
Storage space M
Not large enough to store all distinct elements
Continuous membership query
Appeared before? Yes or No
d g a f b e a d c b a
Our goal
Minimize the # of errors
Fast

11
SBF theoretical results

SBF will be stable
The expected # of 0s will become a constant after a number of updates
Converge at an exponential rate
Monotonic decreasing
False positive rates become constant
An upper bound of false positive rates
(a function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate)
Setting the optimal parameters (partially empirical)
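
A minimal sketch of a filter driven by the four parameters named above may help make them concrete. The slides do not spell out the update rule, so the one used here (decrement a few randomly chosen cells, then set the element's hashed cells to the max value) is an assumption, as are the class name, the argument names and the SHA-256 based hashing.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative filter with the 4 parameters from the slide (assumed design)."""

    def __init__(self, num_cells, num_hashes, max_value, kick_out, seed=0):
        self.cells = [0] * num_cells   # SBF size
        self.num_hashes = num_hashes   # number of hash functions
        self.max_value = max_value     # max cell value
        self.kick_out = kick_out       # cells decremented per update
        self.rng = random.Random(seed)

    def _positions(self, item):
        # Derive num_hashes cell indexes from salted SHA-256 digests.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def seen_before(self, item):
        """Answer "appeared before?" for item, then record it."""
        positions = self._positions(item)
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Kick out stale information: decrement randomly chosen cells.
        for _ in range(self.kick_out):
            p = self.rng.randrange(len(self.cells))
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # Record the current element at full strength.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate
```

In this sketch the random decrements are what let old elements fade out, so the structure can produce false negatives as well as false positives, matching the two error types listed for the crawling application.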

12
SBF experimental results (cont.)

Comparison: SBF vs. the FPBuffering method (LRU)
700M real URL fingerprints
SBF generates 3-13% fewer false negatives, with the same # of false positives (<10%)
MIN [Broder et al. WWW03], theoretically optimal
assumes the entire sequence of requests is known in advance
beats LRU caching by <5% in most cases
The more false positives are allowed, the more SBF gains

13
Outline

Introduction
Continuous membership query
Point query (to be submitted)
Motivating application
Problem statement
Theoretical and experimental results
Similarity self-join size estimation
Conclusions and future work

14
Motivating application

Internet traffic monitoring
Query the # of IP packets sent by a particular IP address in the past hour
Phone call record analysis
Query the # of calls to a given phone yesterday

15
Problem statement

Point query
Summarize a stream of elements
Estimate the frequency of a given element
Goal: minimize the space cost and answer the query fast

16
CMM theoretical results

Unbiased estimate (deduct mean)
Estimate variance is the same as that of Fast-AGMS, a well-known method
(in the case of deducting the mean)
For less skewed data sets
the estimation accuracies of CMM and Fast-AGMS are exactly the same

17
CMM experimental results and analysis

For skewed data sets
Accuracy (given the same space)
CMM-median Fast-AGMS > CMM-mean
Advantage of CMM: 2 estimates from 1 sketch
More flexible (with estimate upper bound)
More powerful (Count-min can be more accurate for very skewed data sets)
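
The "2 estimates from 1 sketch" idea can be illustrated with a small Count-Min table that answers a point query in two ways: the usual minimum over rows, and a mean-deducted estimate combined by median or mean. The slides only say "deduct mean"; the noise term used below (the remaining stream count spread evenly over the other cells of the row) is an assumption of this sketch, as are all class and method names.

```python
import hashlib
import statistics


class CountMinSketch:
    """Illustrative depth x width counter table (width must be at least 2)."""

    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # stream length seen so far

    def _col(self, row, item):
        # One salted SHA-256 hash per row stands in for a hash family.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for r in range(self.depth):
            self.table[r][self._col(r, item)] += count

    def estimate_min(self, item):
        """Count-Min estimate: never below the true frequency."""
        return min(self.table[r][self._col(r, item)] for r in range(self.depth))

    def estimate_mean_deducted(self, item, combine=statistics.median):
        """Per row, subtract the assumed average noise, then combine the rows."""
        per_row = []
        for r in range(self.depth):
            c = self.table[r][self._col(r, item)]
            noise = (self.total - c) / (self.width - 1)
            per_row.append(c - noise)
        return combine(per_row)
```

Passing combine=statistics.mean gives the mean-combined variant, and the Count-Min answer from the same table can serve as an upper bound on the true frequency.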

18
Outline

Introduction
Continuous membership query
Point query
Similarity self-join size estimation (submitted to VLDB07)
Motivating application
Problem statement
Theoretical and experimental results
Conclusions and future work

19
Motivating application

Near-duplicate document detection for search engines [Broder 99, Henzinger 06]
Very slow (30M pages, 10 days in 1997; 2006?)
To predict the processing time, it is necessary to estimate the number of similar pairs
Data cleaning in general (similarity self-join)
To find a better query plan (query optimization)
An estimate of the similarity self-join size is needed

20
Problem statement

Similarity self-join size
Given a set of records with d attributes,
estimate the # of record pairs that are at least s-similar
An s-similar pair
A pair of records with s attributes in common
E.g. <Davood, Rafiei, CS, UofA, Canada> and
<Fan, Deng, CS, UofA, Canada>
are 3-similar
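
For reference, the exact quantity being estimated can be computed by brute force over all pairs. This sketch assumes attributes are compared position by position and that unordered pairs of distinct records are counted; the function names are illustrative. It takes quadratic time, which is why an estimate is wanted.

```python
from itertools import combinations


def similarity(rec_a, rec_b):
    """Number of attribute positions at which two records agree."""
    return sum(1 for a, b in zip(rec_a, rec_b) if a == b)


def similarity_self_join_size(records, s):
    """Exact # of unordered record pairs that are at least s-similar."""
    return sum(1 for a, b in combinations(records, 2) if similarity(a, b) >= s)


records = [
    ("Davood", "Rafiei", "CS", "UofA", "Canada"),
    ("Fan", "Deng", "CS", "UofA", "Canada"),
]
print(similarity(records[0], records[1]))      # 3
print(similarity_self_join_size(records, 3))   # 1
```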

21
Theoretical results