Effective Change Detection Using Sampling

1
Effective Change Detection Using Sampling
  • Junghoo John Cho
  • Alexandros Ntoulas
  • UCLA

2
Problem
Polling
Update
Query
Remote database
Local database
  • Application
  • Web search engines/crawlers
  • Web archive
  • Data warehouse
  • . . .

3
Existing Approach
  • Round robin
  • Download pages in a round robin manner
  • Change-frequency based [CLW98, CGM00, EMT01]
  • Estimate the change frequency
  • Adjust download frequency
  • Proven to be optimal

4
Our Approach
  • Sampling-based
  • Sample k pages from each source
  • Download more pages from the source with more
    changed samples

5
Comparison
  • Frequency based
  • Proven to be optimal
  • Change history required
  • Difficult to estimate change frequency
  • Sampling based
  • Can be worse than frequency based policy
  • No history/frequency-estimation required
  • Experimental comparison later

6
Questions
  • Are we assuming correlation?
  • How to use sampling results?
  • Proportional vs Greedy
  • How many samples?
  • Dynamic sample size adjustment?
  • What if we have very limited resources?

7
Is Correlation Necessary?
  • Random sampling
  • Correlation is not necessary; only random sampling is required
  • More discussion later

[Figure: two sites, one with 4/5 and one with 1/5 of the sampled pages changed]

8
Questions
  • Are we assuming correlation?
  • How to use sampling results?
  • Proportional vs Greedy
  • How many samples?
  • Dynamic sample size adjustment?
  • What if we have very limited resources?

9
Download Model (1)
  • Fixed download cycle
  • Say, once a month
  • Fixed download resources in each cycle
  • Say, 100,000 page downloads every month
  • Goal
  • Download as many changes as we can
  • ChangeRatio = (no. of changed downloaded pages) / (no. of downloaded pages)
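
The ChangeRatio metric above can be computed directly from two counts. The following is a minimal sketch; the function name `change_ratio` and the example counts are illustrative assumptions, not figures from the talk.

```python
def change_ratio(changed_downloaded: int, downloaded: int) -> float:
    """Fraction of downloaded pages that turned out to have changed."""
    if downloaded <= 0:
        raise ValueError("downloaded must be positive")
    return changed_downloaded / downloaded


# Hypothetical cycle: 100,000 downloads, of which 30,000 pages had changed.
print(change_ratio(30_000, 100_000))  # 0.3
```

A higher ChangeRatio means fewer downloads were spent on pages that had not changed.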

10
Download Model (2)
  • Two-stage sampling policy
  • Sampling stage
  • Download stage
  • Sampling requires page download
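
Because sampling itself consumes downloads, the budget for the download stage is whatever remains after the sampling stage. The sketch below illustrates that accounting under stated assumptions: each site is modeled as a list of booleans (True means the page changed), and the names `sampling_stage`, `sites`, and `budget` are hypothetical. In practice a crawler learns whether a page changed only by downloading it.

```python
import random


def sampling_stage(sites, k, rng):
    """Sample up to k pages per site (sites are assumed non-empty).

    Returns the estimated change fraction of each site and the
    number of downloads the sampling consumed.
    """
    estimates = {}
    used = 0
    for name, changed_flags in sites.items():
        sample = rng.sample(changed_flags, min(k, len(changed_flags)))
        estimates[name] = sum(sample) / len(sample)
        used += len(sample)  # every sampled page costs one download
    return estimates, used


rng = random.Random(0)
# Hypothetical sites of 20 pages each; True marks a changed page.
sites = {
    "A": [True] * 16 + [False] * 4,
    "B": [True] * 4 + [False] * 16,
}
budget = 20
estimates, used = sampling_stage(sites, k=5, rng=rng)
remaining = budget - used  # downloads left for the download stage
print(estimates, remaining)
```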

11
How to Use Sampling Result?
  • Sites A and B, each with 20 pages
  • 20 total downloads, 5 samples from each site
  • 10 page downloads remaining

[Figure: site A with 4 of 5 sampled pages changed; site B with 1 of 5 sampled pages changed]

12
Proportional Policy
  • Download pages proportionally to the detected
    changes
  • 8 pages from A, 2 pages from B
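
The proportional split can be sketched as follows. This is an illustration rather than the authors' implementation: the function name `proportional_allocation` is hypothetical, leftover downloads from rounding go to the largest fractional shares (an assumption the slides do not specify), and site capacity is ignored.

```python
def proportional_allocation(changed_samples, remaining):
    """Split `remaining` downloads in proportion to changed samples per site."""
    total = sum(changed_samples.values())
    if total == 0:
        raise ValueError("no changes detected in any sample")
    exact = {s: remaining * c / total for s, c in changed_samples.items()}
    alloc = {s: int(x) for s, x in exact.items()}  # round down first
    leftover = remaining - sum(alloc.values())
    # Assumption: hand leftover downloads to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc


# Example from the slides: A has 4 changed samples, B has 1, 10 downloads remain.
print(proportional_allocation({"A": 4, "B": 1}, 10))  # {'A': 8, 'B': 2}
```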

13
Greedy Policy
  • Download pages from the sites with most changes
  • 10 pages from A
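
A minimal sketch of the greedy policy, assuming each site has a known number of not-yet-sampled pages: sites are visited in descending order of estimated change fraction, and each receives as many downloads as it can absorb. The names `greedy_allocation` and `unsampled` are hypothetical.

```python
def greedy_allocation(estimates, unsampled, remaining):
    """Give downloads to the sites with the highest estimated change fraction first."""
    alloc = {}
    for site in sorted(estimates, key=estimates.get, reverse=True):
        take = min(unsampled[site], remaining)  # cannot exceed pages left in the site
        alloc[site] = take
        remaining -= take
    return alloc


# Example from the slides: 20-page sites, 5 sampled from each, 10 downloads remain.
print(greedy_allocation({"A": 0.8, "B": 0.2}, {"A": 15, "B": 15}, 10))
# {'A': 10, 'B': 0}
```

If site A had fewer than 10 unsampled pages, the remainder would spill over to site B.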

14
Optimality of Greedy
  • Theorem
  • Greedy is optimal if we make download decisions
    purely based on sampling results
  • Optimality is probabilistic: it holds for the expected values
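
To see why greedy does better in expectation on the two-site example, one can compare the expected number of changed pages each allocation retrieves. The sketch below assumes the sampled fractions (4/5 for A, 1/5 for B) are the change probabilities of the remaining pages; that is a simplification for illustration, not the theorem's proof. The function name `expected_changes` is hypothetical.

```python
from fractions import Fraction


def expected_changes(alloc, estimates):
    """Expected number of changed pages retrieved by an allocation."""
    return sum(n * estimates[site] for site, n in alloc.items())


estimates = {"A": Fraction(4, 5), "B": Fraction(1, 5)}
proportional = {"A": 8, "B": 2}
greedy = {"A": 10, "B": 0}

print(float(expected_changes(proportional, estimates)))  # 6.8
print(float(expected_changes(greedy, estimates)))        # 8.0
```

Every download moved from A to B trades an expected 0.8 changed pages for 0.2, so any allocation that gives B a page while A still has pages available has a lower expected value.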

15
Questions
  • Are we assuming correlation?
  • How to use sampling results?
  • Proportional vs Greedy
  • How many samples?
  • Dynamic sample size adjustment?
  • What if we have very limited resources?

16
How Many Samples?
  • Too few samples
  • Inaccurate change estimates
  • Too many samples
  • Waste of resources for sampling
  • How to determine optimal sample size?

17
Optimal Sample Size
  • Factors to consider
  • Total number of pages that we maintain
  • Number of pages that we can download in the
    current cycle
  • Number of pages in each Web site
  • Change distribution
  • Scenario 1 -- A 90/100, B 10/100
  • Scenario 2 -- A 60/100, B 40/100
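
One way to read the two scenarios is that the closer the sites' change fractions are, the more samples are needed to rank them correctly. The sketch below computes the exact probability that a k-page sample from site B shows at least as many changes as one from site A, using the hypergeometric distribution (sampling without replacement). This framing and the function names are assumptions added for illustration, not the analysis from the paper.

```python
from math import comb


def hypergeom_pmf(x, n_pages, n_changed, k):
    """P(exactly x changed pages in a k-page sample drawn without replacement)."""
    return comb(n_changed, x) * comb(n_pages - n_changed, k - x) / comb(n_pages, k)


def prob_misrank(n_pages, changed_a, changed_b, k):
    """P(sample from B shows at least as many changes as sample from A)."""
    total = 0.0
    for xa in range(k + 1):
        for xb in range(xa, k + 1):
            total += (hypergeom_pmf(xa, n_pages, changed_a, k)
                      * hypergeom_pmf(xb, n_pages, changed_b, k))
    return total


k = 5  # hypothetical sample size
print("Scenario 1:", prob_misrank(100, 90, 10, k))
print("Scenario 2:", prob_misrank(100, 60, 40, k))
```

With the same sample size, Scenario 2 is far more likely to be mis-ranked than Scenario 1, which is why the change distribution appears among the factors.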

18
Change Fraction Distribution
[Figure: change fraction distribution; y-axis: fraction of sites]
  • ρ_i: fraction of changed pages in site i
  • f(ρ): distribution of ρ values

19
Optimal Sample Size
  • N: no. of pages in a site
  • r: (no. of pages to download) / (no. of pages we maintain)
  • Analysis is complex
  • [formula missing from the transcript] is a good rule of thumb

20
Dynamic Sample Size?
  • Do we need the same sample size for every site?
  • Change fractions: A: 0, B: 0.45, C: 0.55, D: 1

21
Adaptive Sampling
  • If the estimated change fraction is high/low enough, make an
    early decision
  • What does high enough mean?
  • Confidence interval above threshold
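
The early-decision rule can be sketched as a confidence-interval test against a threshold. The slides do not say how the interval is constructed, so this sketch assumes a Wilson score interval at roughly 95% confidence (z = 1.96). The function name `early_decision`, the `threshold` parameter, and the return labels are hypothetical.

```python
from math import sqrt


def wilson_interval(changed, sampled, z=1.96):
    """Wilson score interval for a proportion (an assumed choice of interval)."""
    p = changed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return center - half, center + half


def early_decision(changed, sampled, threshold, z=1.96):
    """Stop sampling a site once its interval lies entirely on one side of the threshold."""
    low, high = wilson_interval(changed, sampled, z)
    if low > threshold:
        return "download"   # confidently above the threshold
    if high < threshold:
        return "skip"       # confidently below the threshold
    return "continue"       # interval straddles the threshold: keep sampling


print(early_decision(10, 10, threshold=0.5))  # download
print(early_decision(5, 10, threshold=0.5))   # continue
print(early_decision(0, 10, threshold=0.5))   # skip
```

Under this rule, sites whose change fraction is far from the threshold are settled after a few samples, while sites close to it keep being sampled.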

22
In the Paper
  • More details on
  • Optimal sample size
  • Adaptive policy
  • Cases where resources are too limited for
    sampling

23
Experiments
  • 353,000 pages from 252 sites
  • Mostly popular sites
  • Yahoo, CNN, Microsoft, ...
  • 1400 pages from each site
  • Followed links in a breadth-first manner
  • Monthly change history for 6 months
  • 5 download cycles
  • In experiments, 100,000 page downloads in each
    download cycle

24
Comparison of Policies
[Figure: ChangeRatio of each policy]

25
Optimal Sample Size
[Figure: ChangeRatio (y-axis) vs. sample size (x-axis)]

26
Comparison of Long-Term Performance
  • Problem: We have data for only 5 download cycles
  • Solution: Extrapolate the history

27
Frequency vs. Sampling
[Figure: ChangeRatio (y-axis) vs. download cycle (x-axis) for the Frequency and Greedy policies]

28
Related Work
  • Frequency-based policy
  • Coffman et al., Journal of Scheduling 1998
  • Cho et al., SIGMOD 2000
  • Edwards et al., WWW 2001
  • Source cooperation
  • Olston et al., SIGMOD 2002

29
Conclusion
  • Sampling-based policy
  • Great short-term performance
  • No change history required
  • Frequency-based policy
  • Potentially good long-term performance if the
    change frequency does not change
  • Greedy is easy to implement and shows high
    performance

30
Future Work
  • Combination of sampling and frequency based
    policies
  • Switch to the frequency-based policy after a
    while
  • Good partitioning for sampling?
  • Site based? Directory based?
  • Content based?
  • Link-structure based?

31
Questions?