Approximation Algorithms for Frequency Related Query Processing on Streaming Data

1
Approximation Algorithms for Frequency Related
Query Processing on Streaming Data
  • Presented by Fan Deng
  • Supervisor: Dr. Davood Rafiei
  • May 30, 2007

2
Outline
  • Introduction
  • Continuous membership query
  • Point query
  • Similarity self-join size estimation
  • Conclusions and future work

3
Data stream
  • A sequence of data records
  • Examples
  • Document/URL streams from a Web crawler
  • IP packet streams
  • Web advertisement click streams
  • Sensor reading streams
  • ...

4
Processing in one pass
  • One pass processing
  • Online stream (one scan required)
  • Massive offline stream (one scan preferred)
  • Challenges
  • Huge data volume
  • Fast processing requirement
  • Relatively small fast storage space

5
Approximation algorithms
  • Exact query answers
  • can be slow to obtain
  • may need large storage space
  • sometimes are not necessary
  • Approximate query answers
  • can take much less time
  • may need less space
  • with acceptable errors

6
Frequency related queries
  • Frequency
  • # of occurrences
  • Continuous membership query
  • Point query
  • Similarity self-join size estimation

7
Outline
  • Introduction
  • Continuous membership query [SIGMOD06]
  • Motivating application
  • Problem statement
  • Our theoretical and experimental results
  • Point query
  • Similarity self-join size estimation
  • Conclusions and future work

8
A Motivating Application
  • Duplicate URL detection in Web crawling
  • Search engines [Broder et al. WWW03]
  • Fetch web pages continuously
  • Extract URLs within each downloaded page
  • Check each URL (duplicate detection)
  • If never seen before
  • Then fetch it
  • Else skip it
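
The fetch / extract / check loop above can be sketched with an exact "seen" set. This is only an illustration: the `links` dictionary is a hypothetical in-memory stand-in for fetching and parsing real pages, and `crawl` is a name chosen here, not one from the talk.

```python
from collections import deque

def crawl(seed, links):
    """Breadth-first crawl over a toy link graph with exact duplicate detection.

    `links` maps a URL to the URLs found on that page; it is a hypothetical
    substitute for downloading a page and extracting its URLs.
    """
    seen = {seed}              # every URL scheduled so far
    frontier = deque([seed])
    fetched = []
    while frontier:
        url = frontier.popleft()
        fetched.append(url)                # "fetch" the page
        for out in links.get(url, []):     # extract URLs within the page
            if out not in seen:            # never seen before: fetch it later
                seen.add(out)
                frontier.append(out)
            # else: duplicate, skip it
    return fetched

toy_web = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "d"], "d": []}
order = crawl("a", toy_web)
```

The `seen` set grows with the number of distinct URLs, which is exactly the cost that becomes a problem at Web scale.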

9
A Motivating Application (cont.)
  • Problems
  • Huge number of distinct URLs
  • Memory is usually not large enough
  • Disks are slow
  • Errors are usually acceptable
  • A false positive (missed URLs)
  • A false negative (redundant crawls or disk search)

10
Problem statement
  • A sequence of elements with order
  • Storage space M
  • Not large enough to store all distinct elements
  • Continuous membership query
  • Appeared before? Yes or No
  • d g a f b e a d c b a
  • Our goal
  • Minimize the # of errors
  • Fast
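
As a reference point, the continuous membership query can be answered exactly when memory is unlimited. The sketch below runs it on the example sequence from this slide; `seen_before` is an illustrative name, and the exact set it uses is precisely what the limited storage space M rules out in practice.

```python
def seen_before(stream):
    """Yield (element, appeared_before) for each element, using an exact set."""
    seen = set()
    for x in stream:
        yield x, x in seen   # answer the query first...
        seen.add(x)          # ...then record the element

stream = "d g a f b e a d c b a".split()
answers = ["Yes" if dup else "No" for _, dup in seen_before(stream)]
```

Any approximate method can be scored against these answers: a "Yes" where the exact answer is "No" is a false positive, and a "No" where it is "Yes" is a false negative.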

11
SBF theoretical results
  • SBF will be stable
  • The expected # of 0s will become a constant
    after a number of updates
  • Converge at an exponential rate
  • Monotonic decreasing
  • False positive rates become constant
  • An upper bound of false positive rates
  • (a function of 4 parameters: SBF size, # of hash
    functions, max cell value, and kick-out rate)
  • Setting the optimal parameters (partially
    empirical)
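
A minimal sketch of an SBF (Stable Bloom Filter) with the four parameters listed above may help make the results concrete. It is written from a general understanding of the technique, not from the authors' code: the class and method names, the use of BLAKE2 as the hash family, and the choice to decrement a run of `p` consecutive cells starting at a random position are all assumptions.

```python
import hashlib
import random

class StableBloomFilter:
    def __init__(self, m, k, max_val, p, seed=0):
        self.m = m                # SBF size (number of cells)
        self.k = k                # number of hash functions
        self.max_val = max_val    # maximum cell value
        self.p = p                # cells decremented per update (kick-out rate)
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def _indexes(self, item):
        # k hash functions simulated by salting one hash with the function index
        out = []
        for i in range(self.k):
            d = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            out.append(int.from_bytes(d, "big") % self.m)
        return out

    def seen_then_add(self, item):
        idx = self._indexes(item)
        # Reported as a duplicate only if all k cells are non-zero
        duplicate = all(self.cells[i] > 0 for i in idx)
        # Kick out stale information: decrement p cells
        start = self.rng.randrange(self.m)
        for j in range(self.p):
            c = (start + j) % self.m
            if self.cells[c] > 0:
                self.cells[c] -= 1
        # Record the new element
        for i in idx:
            self.cells[i] = self.max_val
        return duplicate

sbf = StableBloomFilter(m=64, k=3, max_val=3, p=4)
first = sbf.seen_then_add("http://example.com/")
second = sbf.seen_then_add("http://example.com/")
```

Because cells are continuously decremented, old elements are eventually forgotten (false negatives), while the fraction of zero cells, and with it the false positive rate, settles instead of degrading as the stream grows.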

12
SBF experimental results (cont.)
  • Comparison: SBF vs. the FPBuffering method (LRU)
  • 700M real URL fingerprints
  • SBF generates 3-13% fewer false negatives, with the
    same # of false positives (<10%)
  • MIN [Broder et al. WWW03], theoretically optimal
  • assumes the entire sequence of requests is known
    in advance
  • beats LRU caching by <5% in most cases
  • More false positives allowed, SBF gains more
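
For comparison, an LRU buffer baseline can be sketched as follows. This is an assumed, simplified version of a buffering method: it stores elements themselves rather than fingerprints, `LRUBuffer` is a name chosen here, and the capacity of 4 is arbitrary. It is run on the example sequence from the problem statement slide.

```python
from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # oldest entry first

    def seen_then_add(self, x):
        hit = x in self.items
        if hit:
            self.items.move_to_end(x)            # mark as most recently used
        else:
            self.items[x] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict least recently used
        return hit

buf = LRUBuffer(capacity=4)
flags = [buf.seen_then_add(x) for x in "d g a f b e a d c b a".split()]
```

In this simplified form the buffer never reports a false positive, but it misses the repeated `d` and `b` because both were evicted before they reappeared.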

13
Outline
  • Introduction
  • Continuous membership query
  • Point query (to be submitted)
  • Motivating application
  • Problem statement
  • Theoretical and experimental results
  • Similarity self-join size estimation
  • Conclusions and future work

14
Motivating application
  • Internet traffic monitoring
  • Query the # of IP packets sent by a particular IP
    address in the past one hour
  • Phone call record analysis
  • Query the # of calls to a given phone yesterday

15
Problem statement
  • Point query
  • Summarize a stream of elements
  • Estimate the frequency of a given element
  • Goal: minimize the space cost and answer the
    query fast
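
One standard way to summarize a stream for point queries is a Count-Min sketch, which the later slides refer to. The version below is a minimal sketch under assumptions: the class name, the BLAKE2-based hash family, and the `depth`/`width` values are illustrative and not taken from the talk.

```python
import hashlib

class CountMin:
    def __init__(self, depth, width):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        # One hash function per row, simulated by salting with the row index
        d = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(d, "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._col(r, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum never underestimates
        return min(self.table[r][self._col(r, item)] for r in range(self.depth))

stream = ["10.0.0.1"] * 5 + ["10.0.0.2"] * 2 + ["10.0.0.3"]
cm = CountMin(depth=4, width=32)
for ip in stream:
    cm.add(ip)
```

The space cost is `depth * width` counters regardless of how many distinct elements the stream contains, and both updates and queries touch only `depth` cells.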

16
CMM theoretical results
  • Unbiased estimate (deduct mean)
  • Estimate variance is the same as that of
    Fast-AGMS, a well-known method
  • (when the mean is deducted)
  • For less skewed data sets
  • the estimation accuracies of CMM and Fast-AGMS
    are exactly the same
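
The "deduct mean" idea can be illustrated on a Count-Min style table. In a row of width w, a counter that holds an element of true frequency f has expectation f + (N - f)/w for a stream of length N, so subtracting (N - counter)/(w - 1) gives an unbiased per-row estimate. The sketch below is an assumed reading of that idea, not the authors' implementation; the function names and the choice to cap the result by the Count-Min value are illustrative.

```python
import hashlib
from statistics import median

def col(row, item, width):
    d = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % width

def build_sketch(stream, depth, width):
    table = [[0] * width for _ in range(depth)]
    for item in stream:
        for r in range(depth):
            table[r][col(r, item, width)] += 1
    return table

def estimates(table, item, n):
    """Return two estimates from the same sketch (width must be at least 2)."""
    depth, width = len(table), len(table[0])
    counters = [table[r][col(r, item, width)] for r in range(depth)]
    # Deduct the mean of the other width-1 counters in each row
    debiased = [c - (n - c) / (width - 1) for c in counters]
    count_min = min(counters)                  # never underestimates
    cmm = min(count_min, median(debiased))     # capped by the upper bound
    return {"count_min": count_min, "cmm": cmm}

stream = ["a"] * 50 + [f"x{i}" for i in range(50)]
table = build_sketch(stream, depth=5, width=16)
est = estimates(table, "a", len(stream))
```

Both values come from one sketch, and the Count-Min value serves as an upper bound on the mean-deducted one.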

17
CMM experimental results and analysis
  • For skewed data sets
  • Accuracy (given the same space)
  • CMM-median Fast-AGMS > CMM-mean
  • Advantage of CMM: 2 estimates from 1 sketch
  • More flexible (with estimate upper bound)
  • More powerful (Count-min can be more accurate for
    very skewed data sets)

18
Outline
  • Introduction
  • Continuous membership query
  • Point query
  • Similarity self-join size estimation
  • submitted to VLDB07
  • Motivating application
  • Problem statement
  • Theoretical and experimental results
  • Conclusions and future work

19
Motivating application
  • Near-duplicate document detection for search
    engines [Broder 99, Henzinger 06]
  • Very slow (30M pages, 10 days in 1997; 2006?)
  • To predict the processing time,
  • necessary to estimate the number of similar pairs
  • Data cleaning in general (similarity self-join)
  • To find a better query plan (query optimization)
  • An estimate of the similarity self-join size is needed

20
Problem statement
  • Similarity self-join size
  • Given a set of records with d attributes,
    estimate the # of record pairs that are at least
    s-similar
  • An s-similar pair
  • A pair of records with s attributes in common
  • E.g. <Davood, Rafiei, CS, UofA, Canada>
  • <Fan, Deng, CS, UofA, Canada>
  • are 3-similar
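
The definition can be checked with a brute-force count over all pairs, which is the exact quantity an estimator would approximate. This sketch assumes attributes are compared position by position; the third record is a made-up example added only so that more than one pair exists.

```python
from itertools import combinations

def similarity(r1, r2):
    """Number of attribute positions on which two records agree."""
    return sum(a == b for a, b in zip(r1, r2))

def self_join_size(records, s):
    """Exact number of record pairs that are at least s-similar."""
    return sum(1 for r1, r2 in combinations(records, 2)
               if similarity(r1, r2) >= s)

records = [
    ("Davood", "Rafiei", "CS", "UofA", "Canada"),
    ("Fan", "Deng", "CS", "UofA", "Canada"),
    ("Alice", "Example", "Math", "UofA", "Canada"),  # hypothetical record
]
pairs_3 = self_join_size(records, 3)
pairs_2 = self_join_size(records, 2)
```

The brute-force count compares every pair, so its cost is quadratic in the number of records, which is why a one-pass estimate is attractive.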

21
Theoretical results
  • Unbiased estimate
  • Standard deviation bound of the estimate
  • Time and space cost
  • (For both offline and online SimPairCount)

22
Experimental results
  • Online SimPairCount vs. random sampling
  • Given the same amount of space
  • Error = (estimate - trueValue) / trueValue
  • Dataset
  • DBLP paper titles
  • Each converted into a record with 6 attributes
  • Using min-wise independent hashing
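
One plausible way to turn a title into a fixed-length record is to take, for each of 6 hash functions, the minimum hash value over the title's words. The sketch below assumes that reading; the whitespace tokenization, the BLAKE2-based hash family (only an approximation of min-wise independence), and the function names are not from the talk. The error function assumes the metric is a signed relative error.

```python
import hashlib

def h(seed, token):
    d = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def minhash_record(title, num_attrs=6):
    """Map a non-empty title to a record of num_attrs min-hash values."""
    tokens = set(title.lower().split())
    return tuple(min(h(i, t) for t in tokens) for i in range(num_attrs))

def relative_error(estimate, true_value):
    return (estimate - true_value) / true_value

r1 = minhash_record("Approximation Algorithms for Streaming Data")
r2 = minhash_record("streaming data approximation algorithms for")
```

Two titles with the same word set map to the same record, and titles that share most of their words tend to agree on many of the 6 attributes, so s-similar records correspond to similar titles.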

23
Similarity self-join size estimation
Experimental results (cont.)

24
Conclusions and future work
  • Streaming algorithms
  • found real applications (important)
  • can lead to theoretical results (fun)
  • More work to be done
  • Current direction
  • multi-dimensional streaming algorithms
  • E.g., estimating the # of outliers in one pass

25
Questions/Comments?