How to Crawl the Web - PowerPoint PPT Presentation

About This Presentation
Title:

How to Crawl the Web

Slides: 39
Provided by: Jungh1
Learn more at: http://oak.cs.ucla.edu

Transcript and Presenter's Notes

Title: How to Crawl the Web


1
How to Crawl the Web
  • Looksmart.com
  • 12/13/2002

Junghoo "John" Cho, UCLA
2
What is a Crawler?
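[Diagram: crawler loop. "init" loads the "initial urls" into "to visit urls"; the crawler repeats "get next url", "get page" (from the web), "extract urls"; it maintains "visited urls" and stores the "web pages".]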
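The loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the WebBase crawler: the in-memory `FAKE_WEB` dictionary and the `fake_get_links` helper are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def fake_get_links(url):
    """Stand-in for 'get page' + 'extract urls' (assumption: no real network)."""
    return FAKE_WEB.get(url, [])

def crawl(initial_urls, get_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    to_visit = deque(initial_urls)   # "to visit urls"
    visited = set()                  # "visited urls"
    order = []
    while to_visit and len(order) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        links = get_links(url)       # "get page" and "extract urls"
        order.append(url)
        for link in links:
            if link not in visited:
                to_visit.append(link)
    return order
```

Calling `crawl(["a"], fake_get_links)` visits every page reachable from the seed exactly once, in breadth-first order.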
3
Applications
  • Internet Search Engines
  • Google, AltaVista
  • Comparison Shopping Services
  • My Simon, BizRate
  • Data mining
  • Stanford Web Base, IBM Web Fountain

4
Prototype WebBase Crawler
  • Web Base Project
  • BackRub Crawler, PageRank
  • New Web Base Crawler
  • 20,000 lines in C/C++
  • 130M pages collected

5
Crawling Issues (1)
  • Load at visited web sites
  • Space out requests to a site
  • Limit number of requests to a site per day
  • Limit depth of crawl
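The three limits above can be combined into one check that runs before each request. The sketch below is an assumption about how such a policy might look; the class name `PolitenessPolicy`, its parameters, and the default values are hypothetical, and day rollover for the daily counter is left out for brevity.

```python
from urllib.parse import urlparse

class PolitenessPolicy:
    """Per-site request limits (illustrative sketch, not the talk's code)."""

    def __init__(self, min_delay_s=10.0, max_per_day=3000, max_depth=5):
        self.min_delay_s = min_delay_s    # spacing between requests to a site
        self.max_per_day = max_per_day    # daily request cap per site
        self.max_depth = max_depth        # maximum crawl depth
        self.last_request = {}            # site -> time of last request
        self.count_today = {}             # site -> requests so far (no rollover)

    def may_fetch(self, url, depth, now):
        """Return True if fetching `url` at time `now` respects all limits."""
        site = urlparse(url).netloc
        if depth > self.max_depth:
            return False
        if self.count_today.get(site, 0) >= self.max_per_day:
            return False
        last = self.last_request.get(site)
        if last is not None and now - last < self.min_delay_s:
            return False
        return True

    def record_fetch(self, url, now):
        """Update per-site state after a request has been sent."""
        site = urlparse(url).netloc
        self.last_request[site] = now
        self.count_today[site] = self.count_today.get(site, 0) + 1
```

Passing `now` explicitly, rather than reading the clock inside the class, keeps the policy easy to test.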

6
Crawling Issues (2)
  • Load at crawler
  • Parallelize

[Diagram: parallel crawler. Two copies of the "init", "get next url", "get page", "extract urls" loop run side by side, sharing "initial urls", "to visit urls", "visited urls", and the "web pages" store.]
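One simple way to parallelize is to fetch each breadth-first level with a pool of workers. This is a simplification of the shared-queue picture, offered as a sketch only: `FAKE_WEB` and `fetch_links` are hypothetical stand-ins for real page fetching, and a production crawler would also need per-site rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> URLs linked from that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def parallel_crawl(initial_urls, num_workers=2):
    """Crawl level by level; pages within a level are fetched in parallel."""
    visited = set(initial_urls)
    frontier = list(visited)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while frontier:
            results = pool.map(fetch_links, frontier)
            next_frontier = []
            # Only the main thread touches `visited`, so no lock is needed.
            for links in results:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited
```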
7
Crawling Issues (3)
  • Scope of crawl
  • Not enough space for all pages
  • Not enough time to visit all pages

8
Crawling Issues (4)
  • Replication
  • Pages mirrored at multiple locations
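As a rough illustration of the problem, exact mirrors can be found by hashing page content. This sketch only catches byte-identical copies and is not the replicated-page detection method from the author's research; the function name `group_exact_mirrors` is hypothetical.

```python
import hashlib
from collections import defaultdict

def group_exact_mirrors(pages):
    """pages: dict of URL -> page text. Returns groups of URLs with identical text."""
    by_digest = defaultdict(list)
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    # Keep only digests shared by more than one URL.
    return sorted(sorted(urls) for urls in by_digest.values() if len(urls) > 1)
```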

9
Crawling Issues (5)
  • Incremental crawling
  • How do we avoid crawling from scratch?
  • How do we keep pages fresh?

10
My Research On Crawler
  • Load on sites [PAWS00]
  • Parallel crawler [WWW01]
  • Page selection [WWW7]
  • Replicated page detection [SIGMOD00]
  • Page freshness [SIGMOD00, VLDB01]
  • Crawler architecture [VLDB00]

11
Outline of This Talk
  • How can we keep pages fresh?
  • How does the Web change?
  • What do we mean by fresh pages?
  • How should we refresh pages?

12
Web Evolution Experiment
  • How often does a Web page change?
  • How long does a page stay on the Web?
  • How long does it take for 50% of the Web to change?
  • How do we model Web changes?

13
Experimental Setup
  • February 17 to June 24, 1999
  • 270 sites visited (with permission)
  • identified 400 sites with highest PageRank
  • contacted administrators
  • 720,000 pages collected
  • 3,000 pages from each site daily
  • start at root, visit breadth first (get new and old pages)
  • ran only 9pm - 6am, 10 seconds between site requests

14
Average Change Interval
[Figure: fraction of pages vs. average change interval]
15
Change Interval By Domain
[Figure: fraction of pages vs. average change interval, broken down by domain]
16
Modeling Web Evolution
  • Poisson process with rate $\lambda$
  • $T$ is time to next event
  • $f_T(t) = \lambda e^{-\lambda t}$ ($t > 0$)
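A short simulation makes the model concrete: under a Poisson process, the gaps between changes are exponentially distributed with mean 1/λ. The sketch below uses a rate of 0.1 changes per day as an illustrative value (matching a page that changes every 10 days on average); the function names and the seed are arbitrary choices.

```python
import math
import random

def exponential_pdf(t, lam):
    """Density of the time to the next change: lam * exp(-lam * t) for t > 0."""
    return lam * math.exp(-lam * t) if t > 0 else 0.0

def sample_change_intervals(lam, n, seed=0):
    """Draw n intervals between consecutive changes of a rate-lam Poisson process."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]

intervals = sample_change_intervals(lam=0.1, n=20000)
mean_interval = sum(intervals) / len(intervals)  # should be close to 1 / 0.1 = 10 days
```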

17
Change Interval of Pages
[Figure: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, compared with the Poisson model]
18
Change Metrics
  • Freshness
  • Freshness of element $e_i$ at time $t$ is $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, and $0$ otherwise

19
Change Metrics
  • Age
  • Age of element $e_i$ at time $t$ is $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, and $t - (\text{modification time of } e_i)$ otherwise
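Both metrics can be written as small functions of the last synchronization time and the list of times at which the source page changed. This is a sketch under one stated assumption: the "modification time" in the age definition is taken to be the earliest change that the local copy has not yet picked up. The function and parameter names are hypothetical.

```python
def freshness(t, last_sync, change_times):
    """1 if the copy synced at last_sync is still up-to-date at time t, else 0."""
    return 0 if any(last_sync < c <= t for c in change_times) else 1

def age(t, last_sync, change_times):
    """0 if up-to-date; otherwise time since the first change missed after last_sync."""
    missed = [c for c in change_times if last_sync < c <= t]
    return t - min(missed) if missed else 0
```

For a copy synced at time 0 and a page that changes at time 7, freshness drops from 1 to 0 at time 7 and age then grows linearly until the next refresh.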

20
Change Metrics
[Figure: F(ei) (between 0 and 1) and A(ei) (starting at 0) plotted against time, with "update" and "refresh" events marked]
21
Refresh Order
  • Fixed order
  • Explicit list of URLs to visit
  • Random order
  • Start from seed URLs and follow links
  • Purely random
  • Refresh pages on demand, as requested by user

[Diagram: elements ei on the web and their copies ei in the database]
22
Freshness vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
23
Age vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
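The plotted curves are not preserved in the transcript, but their general shape can be recovered from the Poisson model. The sketch below assumes evenly spaced visits at frequency f and Poisson changes at rate λ; under those assumptions the time-averaged freshness is (1 - e^{-r}) / r and the time-averaged age is (1/f)(1/2 - 1/r + (1 - e^{-r}) / r^2), both obtained by averaging over one revisit interval. The function names are hypothetical.

```python
import math

def avg_freshness(r):
    """Time-averaged freshness for r = lambda / f (periodic visits, Poisson changes)."""
    if r == 0:
        return 1.0  # a page that never changes is always fresh
    return (1 - math.exp(-r)) / r

def avg_age(r, f):
    """Time-averaged age, in the same time unit as 1 / f."""
    if r == 0:
        return 0.0
    return (1.0 / f) * (0.5 - 1.0 / r + (1 - math.exp(-r)) / r ** 2)
```

Freshness falls and age rises as r grows, that is, as pages change faster relative to how often they are revisited.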
24
Trick Question
  • Two-page database
  • e1 changes daily
  • e2 changes once a week
  • Can visit one page per week
  • How should we visit pages?
  • e1 e2 e1 e2 e1 e2 e1 e2 ... (uniform)
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... (proportional)
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • ?

[Diagram: pages e1 and e2 on the web and their copies e1 and e2 in the database]
25
Proportional Often Not Good!
  • Visit fast-changing e1
  • → get 1/2 day of freshness
  • Visit slow-changing e2
  • → get 1/2 week of freshness
  • Visiting e2 is a better deal!
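The comparison can be checked numerically for the two-page example. The sketch assumes Poisson changes and evenly spaced visits, so a page with change rate λ visited at frequency f has time-averaged freshness (f/λ)(1 - e^{-λ/f}); a page that is never visited is treated as having freshness 0. Rates are in events per week, and the policy names and variables are illustrative.

```python
import math

def avg_freshness(lam, f):
    """Time-averaged freshness of one page (periodic visits, Poisson changes)."""
    if f == 0:
        return 0.0  # never refreshed: stale in the long run
    r = lam / f
    return (1 - math.exp(-r)) / r

LAM = {"e1": 7.0, "e2": 1.0}  # e1 changes daily, e2 weekly (changes per week)

# Each policy splits a budget of one visit per week between the two pages.
policies = {
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
}

def database_freshness(freqs):
    """Average freshness over both pages."""
    return sum(avg_freshness(LAM[p], freqs[p]) for p in LAM) / len(LAM)

scores = {name: database_freshness(freqs) for name, freqs in policies.items()}
best = max(scores, key=scores.get)
```

Under these assumptions the proportional policy scores below the uniform one, and spending the whole budget on e2 scores highest of the four.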

26
Optimal Refresh Frequency
  • Problem
  • Given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$,
  • find $f_1, f_2, \ldots, f_N$ that maximize

27
Optimal Refresh Frequency
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

28
Optimal Refresh for Age
  • Shape of curve is the same in all cases
  • Holds for any change frequency distribution

29
Comparing Policies
Based on statistics from the experiment and a revisit frequency of once a month
30
Not Every Page is Equal!
→ Some pages are more important
e1: accessed by users 10 times/day
e2: accessed by users 20 times/day
$F(S) = 1 \cdot F(e_1) + 2 \cdot F(e_2)$
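The weighted sum is straightforward to compute. In this sketch the weights are derived from access counts by dividing by the smallest count, which reproduces the weights 1 and 2 on the slide; that normalization, and the function names, are assumptions made for illustration.

```python
def weights_from_accesses(accesses):
    """Scale access counts so that the least-accessed page has weight 1."""
    smallest = min(accesses.values())
    return {page: count / smallest for page, count in accesses.items()}

def weighted_freshness(freshness, weights):
    """Weighted sum of per-page freshness values (each 0 or 1)."""
    return sum(weights[page] * freshness[page] for page in freshness)

weights = weights_from_accesses({"e1": 10, "e2": 20})  # {"e1": 1.0, "e2": 2.0}
```

With these weights, keeping only e2 fresh contributes twice as much as keeping only e1 fresh.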
31
Weighted Freshness
[Figure: revisit frequency f against change frequency λ, with curves for w = 1 and w = 2]
32
Change Frequency Estimation
  • How to estimate change frequency?
  • Naïve estimator: X/T
  • X: number of detected changes
  • T: monitoring period
  • 2 changes in 10 days: 0.2 times/day
  • Incomplete change history
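The effect of an incomplete change history can be quantified under the Poisson model. If a page is visited f times per unit time, each visit detects at most one change no matter how many occurred since the previous visit, so the expected value of X/T is f(1 - e^{-λ/f}), which is always below the true rate λ. The sketch below assumes evenly spaced visits; the function names are hypothetical.

```python
import math

def naive_estimate(x, t):
    """Naive change-frequency estimate: detected changes per unit time."""
    return x / t

def expected_naive_estimate(lam, f):
    """Expected value of X/T for a rate-lam Poisson page visited f times per unit time."""
    return f * (1 - math.exp(-lam / f))
```

For a page that changes three times a day but is visited once a day, the naive estimate can never exceed one change per day.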

33
Improved Estimator
  • Based on the Poisson model
  • X: number of detected changes
  • N: number of accesses
  • f: access frequency
  • 3 changes in 10 days: 0.36 times/day
  • → Accounts for missed changes
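The estimator's formula is not preserved in the transcript. One form that follows from the Poisson model and reproduces the 0.36 figure is -f · ln(1 - X/N): the fraction of visits with no detected change, 1 - X/N, estimates e^{-λ/f}. The sketch below assumes that form and daily visits (N = 10, f = 1 per day); it should not be read as the exact estimator from the talk.

```python
import math

def improved_estimate(x, n, f):
    """Estimate the change rate from x detected changes in n visits at frequency f."""
    if x >= n:
        return math.inf  # a change at every visit: the rate cannot be bounded
    return -f * math.log(1 - x / n)
```

With 3 detected changes in 10 daily visits this gives about 0.36 changes per day, compared with 0.3 from X/T.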

34
Improvement Significant?
  • Application to a Web crawler
  • Visit pages once every week for 5 weeks
  • Estimate change frequency
  • Adjust revisit frequency based on the estimate
  • Uniform: do not adjust
  • Naïve: based on the naïve estimator
  • Ours: based on our improved estimator

35
Improvement from Our Estimator
Policy    Detected changes    Ratio to uniform
Uniform   2,147,589           100%
Naïve     4,145,582           193%
Ours      4,892,116           228%
(9,200,000 visits in total)
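The ratio column can be recomputed from the detected-change counts, which also gives the share of visits that found a change. The variable names below are illustrative; the figures are the ones reported on the slide.

```python
TOTAL_VISITS = 9_200_000
detected = {"Uniform": 2_147_589, "Naïve": 4_145_582, "Ours": 4_892_116}

# Detected changes relative to the uniform policy, as a rounded percentage.
ratio_to_uniform = {
    name: round(100 * count / detected["Uniform"])
    for name, count in detected.items()
}

# Fraction of all visits that detected a change under each policy.
hit_rate = {name: count / TOTAL_VISITS for name, count in detected.items()}
```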
36
Other Estimators
  • Irregular access interval
  • Last-modified date
  • Categorization

37
Summary
  • Web evolution experiment
  • Change metric
  • Refresh policy
  • Frequency estimator

38
The End
  • Thank you for your attention
  • For more information visit
  • http://www-db.stanford.edu/~cho/