Internet and Data Management - PowerPoint PPT Presentation

Slides: 33
Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu
Transcript and Presenter's Notes

Title: Internet and Data Management


1
Internet and Data Management
Junghoo John Cho UCLA Computer Science
2
Information Galore
Biblio server
Legacy database
Plain text files
3
Challenges: Too much information?
  • Discovery
  • Management
  • Overload
  • Access

4
Approaches
  • Central caching and indexing
    • Google, Excite, AltaVista
  • Dynamic integration
    • MySimon, BizRate

5
Central Caching and Indexing
Central Index
6
Challenges
  • Page selection and download
    • What page to download?
  • Page and index update
    • How to update pages?
  • Page ranking
    • What page is important or relevant?
  • Scalability

7
Dynamic Integration
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
8
Challenges
  • Heterogeneous sources
  • Different data models: relational, object-oriented
  • Different schemas and representations
    • "Keanu Reeves" or "Reeves, K.", etc.
  • Limited query capabilities
  • Mediator caching

9
Outline of This Talk
  • How can we keep pages fresh?
  • How does the Web change?
  • What do we mean by fresh pages?
  • How should we refresh pages?

10
Web Evolution Experiment
  • How often does a Web page change?
  • How long does a page stay on the Web?
  • How long does it take for 50% of the Web to change?
  • How do we model Web changes?

11
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
    • identified 400 sites with highest PageRank
    • contacted administrators
  • 720,000 pages collected
    • 3,000 pages from each site daily
    • start at root, visit breadth first (get new and old pages)
    • ran only 9pm - 6am, 10 seconds between site requests

12
Average Change Interval
[Figure: fraction of pages vs. average change interval]
13
Change Interval By Domain
[Figure: fraction of pages vs. average change interval, by domain]
14
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is the time to the next event
  • $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
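
A minimal sketch of what this model implies, with illustrative function names that are not from the talk: if changes follow a Poisson process with rate λ, the time to the next change is exponentially distributed, the mean interval between changes is 1/λ, and the probability that a page has changed within t time units is 1 − e^(−λt). The rate of 0.1 changes per day below is an assumed example value.

```python
import math
import random


def change_interval_pdf(rate, t):
    """Density f_T(t) = rate * exp(-rate * t) of the time to the next change."""
    if t <= 0:
        return 0.0
    return rate * math.exp(-rate * t)


def prob_changed_within(rate, t):
    """P(T <= t) = 1 - exp(-rate * t): chance that a change occurs within t."""
    return 1.0 - math.exp(-rate * t)


def sample_change_intervals(rate, n, seed=0):
    """Draw n change intervals from the exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


if __name__ == "__main__":
    rate = 0.1  # assumed: one change every 10 days on average
    intervals = sample_change_intervals(rate, 100_000)
    print("mean simulated interval (days):", sum(intervals) / len(intervals))
    print("P(change within 10 days):", prob_changed_within(rate, 10.0))
```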

15
Change Interval of Pages
[Figure: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, compared with the Poisson model]
16
Change Metrics
  • Freshness
  • Freshness of element $e_i$ at time $t$ is
    $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, and $0$ otherwise

17
Change Metrics
  • Age
  • Age of element $e_i$ at time $t$ is
    $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, and $t - (\text{modification time of } e_i)$ otherwise
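
The two metrics can be sketched for a single element as follows. This is an illustration only: the function names and the event representation (a last-sync time plus a list of source change times) are assumptions, and the "modification time" is taken to be the earliest source change that the local copy has not yet seen.

```python
from typing import Optional, Sequence


def first_unseen_change(last_sync: float, change_times: Sequence[float],
                        t: float) -> Optional[float]:
    """Earliest source change in (last_sync, t], or None if the copy is current."""
    unseen = [c for c in change_times if last_sync < c <= t]
    return min(unseen) if unseen else None


def freshness(last_sync: float, change_times: Sequence[float], t: float) -> int:
    """1 if the local copy is up-to-date at time t, 0 otherwise."""
    return 1 if first_unseen_change(last_sync, change_times, t) is None else 0


def age(last_sync: float, change_times: Sequence[float], t: float) -> float:
    """0 if up-to-date, else time elapsed since the first unseen change."""
    changed_at = first_unseen_change(last_sync, change_times, t)
    return 0.0 if changed_at is None else t - changed_at


if __name__ == "__main__":
    changes = [3.0, 8.0]  # hypothetical days on which the source page changed
    print(freshness(2.0, changes, 2.5), age(2.0, changes, 2.5))
    print(freshness(2.0, changes, 5.0), age(2.0, changes, 5.0))
```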

18
Change Metrics
[Figure: F(ei) (values 1 and 0) and A(ei) plotted against time, with refresh and update events marked]
19
Trick Question
  • Two-page database
    • e1 changes daily
    • e2 changes once a week
  • Can visit one page per week
  • How should we visit pages?
    • e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
    • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
    • e1 e1 e1 e1 e1 e1 ...
    • e2 e2 e2 e2 e2 e2 ...
    • ?

[Figure: e1 and e2 on the web, with copies of e1 and e2 in the database]
20
Proportional Often Not Good!
  • Visit fast-changing e1
    → get 1/2 day of freshness
  • Visit slow-changing e2
    → get 1/2 week of freshness
  • Visiting e2 is a better deal!
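
The comparison can be checked numerically with a small sketch. It rests on assumptions not stated on the slide: changes are Poisson, refreshes of a page are evenly spaced, and "one page per week" is read as a total budget of 1/7 visits per day. Under those assumptions, a page with change rate λ refreshed at frequency f has time-averaged freshness (f/λ)(1 − e^(−λ/f)). The policy names and constants are illustrative.

```python
import math


def expected_freshness(change_rate, refresh_freq):
    """Time-averaged freshness of one page under Poisson changes and
    evenly spaced refreshes: (1 - exp(-r)) / r with r = change_rate / refresh_freq."""
    if refresh_freq <= 0:
        return 0.0  # never refreshed: the copy is stale in the long run
    if change_rate == 0:
        return 1.0  # never changes: the copy is always fresh
    r = change_rate / refresh_freq
    return (1.0 - math.exp(-r)) / r


def average_freshness(change_rates, refresh_freqs):
    """Mean freshness over all pages in the database."""
    scores = [expected_freshness(c, f) for c, f in zip(change_rates, refresh_freqs)]
    return sum(scores) / len(scores)


CHANGE_RATES = [1.0, 1.0 / 7.0]  # e1 changes daily, e2 weekly (per day)
BUDGET = 1.0 / 7.0               # assumed total: one visit per week

POLICIES = {
    "uniform": [BUDGET / 2, BUDGET / 2],
    "proportional": [BUDGET * 7 / 8, BUDGET / 8],
    "e1 only": [BUDGET, 0.0],
    "e2 only": [0.0, BUDGET],
}

SCORES = {name: average_freshness(CHANGE_RATES, freqs)
          for name, freqs in POLICIES.items()}

if __name__ == "__main__":
    for name, score in sorted(SCORES.items(), key=lambda kv: -kv[1]):
        print(f"{name:13s} {score:.3f}")
```

Under these assumptions the proportional policy scores below the uniform one, and spending the whole budget on e2 scores highest of the four.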

21
Optimal Refresh Frequency
  • Problem
  • Given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$,
  • find $f_1, f_2, \ldots, f_N$ that maximize
22
Optimal Refresh Frequency
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

23
Optimal Refresh for Age
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

24
Comparing Policies
Based on statistics from the experiment and a revisit frequency of once a month
25
Not Every Page is Equal!
Some pages are more important
e1: accessed by users 10 times/day
e2: accessed by users 20 times/day
$F(S) = 1 \cdot F(e_1) + 2 \cdot F(e_2)$
26
Weighted Freshness
[Figure: f plotted against λ for weights w = 1 and w = 2]
27
Change Frequency Estimation
  • How to estimate change frequency?
  • Naïve estimator: X/T
    • X: number of detected changes
    • T: monitoring period
    • 2 changes in 10 days: 0.2 times/day
  • Incomplete change history
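
Why an incomplete change history matters can be sketched as follows, assuming Poisson changes and evenly spaced accesses (function names are illustrative). Each access can reveal at most one change, however many occurred since the previous access, so with true rate λ and access frequency f the expected value of X/T is f(1 − e^(−λ/f)), which is always below λ.

```python
import math


def naive_estimate(detected_changes, monitoring_period):
    """Naive change-frequency estimate X / T."""
    return detected_changes / monitoring_period


def expected_naive_estimate(true_rate, access_freq):
    """Expected value of X / T when each access detects at most one change:
    an access sees a change with probability 1 - exp(-true_rate / access_freq)."""
    return access_freq * (1.0 - math.exp(-true_rate / access_freq))


if __name__ == "__main__":
    print(naive_estimate(2, 10))  # the slide's example: 2 changes in 10 days
    # A page that really changes once a day, checked once a day:
    print(expected_naive_estimate(true_rate=1.0, access_freq=1.0))
```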

28
Improved Estimator
  • Based on the Poisson model
  • X: number of detected changes
  • N: number of accesses
  • f: access frequency
  • 3 changes in 10 days: 0.36 times/day
  • → Accounts for missed changes
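
The estimator's formula did not survive in this transcript. One Poisson-based form that reproduces the slide's example is −f ln(1 − X/N): if an access detects a change with probability 1 − e^(−λ/f), then setting X/N equal to that probability and solving for λ gives this expression. The sketch below uses that form as an assumption; the talk's exact formula may differ, and the function name is illustrative.

```python
import math


def poisson_based_estimate(detected_changes, accesses, access_freq):
    """Estimate the change rate as -f * ln(1 - X / N).

    Assumed form, derived from P(change detected at an access)
    = 1 - exp(-rate / access_freq); not confirmed by the transcript.
    """
    if accesses <= 0:
        raise ValueError("need at least one access")
    if detected_changes >= accesses:
        raise ValueError("a change was detected at every access; "
                         "the estimate is unbounded")
    return -access_freq * math.log(1.0 - detected_changes / accesses)


if __name__ == "__main__":
    # Slide example: 3 changes detected in 10 daily accesses.
    print(round(poisson_based_estimate(3, 10, access_freq=1.0), 2))
```

For the same observations the naïve estimator X/T gives 0.3 times/day, so the gap between the two is the allowance for changes that fell between accesses.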

29
Improvement Significant?
  • Application to a Web crawler
  • Visit pages once every week for 5 weeks
  • Estimate change frequency
  • Adjust revisit frequency based on the estimate
  • Uniform: do not adjust
  • Naïve: based on the naïve estimator
  • Ours: based on our improved estimator

30
Improvement from Our Estimator
(9,200,000 visits in total)
31
WebArchive Project
  • Can we store the history of the Web?
  • Web is ephemeral
  • Study of the Web evolution
  • Challenges
  • Update?
  • Compression?
  • New storage?
  • Indexing?

32
Conclusion
  • Exciting area and many challenges ahead!
  • Thank you for your attention
  • For more information, visit http://www.cs.ucla.edu/cho/