Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
Transcript and Presenter's Notes



1
Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
  • Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen
  • Old Dominion University, Norfolk, Virginia, USA
  • Arlington, Virginia, November 10, 2006

2
Outline
  • Web page threats
  • Web Infrastructure
  • Web caching experiment
  • Web repository crawling
  • Website reconstruction experiment

3
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
4
How much of the Web is indexed?
Estimates from "The Indexable Web is More than 11.5 Billion Pages" by Gulli and Signorini (WWW'05)
5


6
(No Transcript)
7
Cached Image
8
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
(Figure: canonical version alongside the MSN, Yahoo, and Google cached versions)
9
Web Repository Characteristics


(Table: storage characteristics by web repository and resource type)
C = canonical version is stored
M = modified version is stored (modified images are thumbnails, all others are HTML conversions)
R = indexed but not retrievable
S = indexed but not stored
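The C, M, R, S legend can be read as a small lookup that a repository crawler consults before requesting a cached copy. Below is a minimal Python sketch of that idea; the enum mirrors the legend, but the example table entries are placeholders for illustration, not the characteristics tabulated on the slide.

from enum import Enum

class StorageClass(Enum):
    C = "canonical version is stored"
    M = "modified version is stored (thumbnail image or HTML conversion)"
    R = "indexed but not retrievable"
    S = "indexed but not stored"

# Hypothetical example entries: (repository, MIME type) -> storage class.
# The real characteristics are the ones on the slide's table; these values
# are placeholders for illustration only.
CHARACTERISTICS = {
    ("ExampleRepo", "text/html"):       StorageClass.C,
    ("ExampleRepo", "image/png"):       StorageClass.M,
    ("ExampleRepo", "application/pdf"): StorageClass.M,
}

def worth_requesting(repo: str, mime: str) -> bool:
    """A cached copy is useful only if some version (C or M) can be retrieved."""
    cls = CHARACTERISTICS.get((repo, mime), StorageClass.S)
    return cls in (StorageClass.C, StorageClass.M)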
10
Timeline of Web Resource
11
Web Caching Experiment
  • Create 4 websites composed of HTML, PDF, and images
  • http://www.owenbrau.com/
  • http://www.cs.odu.edu/~fmccown/lazy/
  • http://www.cs.odu.edu/~jsmit/
  • http://www.cs.odu.edu/~mln/lazp/
  • Remove pages each day
  • Query Google, MSN, and Yahoo (GMY) each day using identifiers (see the sketch below)
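One way to implement the daily query step is a short script that asks each engine for a cached copy of every test URL and logs the answer. This is a hedged sketch only: the endpoint patterns, the is_cached helper, and the test URL are illustrative assumptions, not the interfaces actually used in the experiment.

import csv
import datetime
import urllib.request

# Hypothetical cache-lookup URL patterns; the experiment queried Google, MSN,
# and Yahoo through their own interfaces, which are not reproduced here.
REPOSITORIES = {
    "google": "https://example-google-cache.invalid/search?q=cache:{url}",
    "msn":    "https://example-msn-cache.invalid/cache?u={url}",
    "yahoo":  "https://example-yahoo-cache.invalid/cache?u={url}",
}

# Illustrative test resource from one of the four experiment sites.
TEST_URLS = ["http://www.owenbrau.com/page1.html"]

def is_cached(url_template: str, url: str) -> bool:
    """Return True if the repository answers with a cached copy (HTTP 200)."""
    try:
        with urllib.request.urlopen(url_template.format(url=url), timeout=30) as resp:
            return resp.status == 200
    except Exception:
        return False

def run_daily_check(log_path: str = "cache_log.csv") -> None:
    """Append one row per (day, repository, URL) to a CSV log."""
    today = datetime.date.today().isoformat()
    with open(log_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for url in TEST_URLS:
            for name, template in REPOSITORIES.items():
                writer.writerow([today, name, url, is_cached(template, url)])

if __name__ == "__main__":
    run_daily_check()   # intended to run once per day, e.g. from cron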

12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Crawling the Web and web repositories
17
  • First developed in fall of 2005
  • Available for download at http://www.cs.odu.edu/~fmccown/warrick/
  • www2006.org: first lost website reconstructed (Nov 2005)
  • DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
  • www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
  • Internet Archive officially endorses Warrick (mid Mar 2006)

18
How Much Did We Reconstruct?
(Diagram: lost web site vs. reconstructed web site, showing resources A through G)
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
The link to D is missing; added resource G points to an old resource; F can't be found.
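These four categories can be computed by comparing the lost site's resources with the reconstructed ones, using content hashes to separate identical from changed files. A minimal sketch, assuming both versions are available as local directories; the helper names are ours, not Warrick's.

import hashlib
from pathlib import Path

def _hashes(root: Path) -> dict:
    """Map each resource's relative path to an MD5 hash of its contents."""
    return {
        str(p.relative_to(root)): hashlib.md5(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def categorize(lost_dir: str, reconstructed_dir: str) -> dict:
    """Split resources into the four categories from the slide."""
    lost = _hashes(Path(lost_dir))
    recon = _hashes(Path(reconstructed_dir))
    common = lost.keys() & recon.keys()
    return {
        "identical": {p for p in common if lost[p] == recon[p]},   # e.g. A, E
        "changed":   {p for p in common if lost[p] != recon[p]},   # e.g. B, C
        "missing":   set(lost.keys() - recon.keys()),              # e.g. D, F
        "added":     set(recon.keys() - lost.keys()),              # e.g. G
    }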
19
Reconstruction Diagram
added: 20%
changed: 33%
missing: 17%
identical: 50%
20
Reconstruction Experiment
  • Crawl and reconstruct 24 sites of various sizes:
    1. small (1-150 resources)
    2. medium (151-499 resources)
    3. large (500+ resources)
  • Perform 5 reconstructions for each website:
  • One using all four repositories together
  • Four using each repository separately
  • Calculate the reconstruction vector (changed, missing, added) for each reconstruction (see the sketch below)
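The reconstruction vector is the three non-identical categories expressed as fractions. The sketch below uses one plausible normalization (changed and missing relative to the original site, added relative to the reconstructed site); that choice is our assumption, not something stated on the slide.

def reconstruction_vector(identical, changed, missing, added):
    """(changed, missing, added) as fractions.

    Assumed normalization: changed and missing relative to the original
    site's size, added relative to the reconstructed site's size.
    """
    original = identical + changed + missing        # resources in the lost site
    reconstructed = identical + changed + added     # resources in the recovered site
    return (changed / original,
            missing / original,
            added / reconstructed)

# Using the counts from the A-G example on the earlier slide:
# identical = 2 (A, E), changed = 2 (B, C), missing = 2 (D, F), added = 1 (G)
print(reconstruction_vector(2, 2, 2, 1))   # -> (0.333..., 0.333..., 0.2)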

21
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. "Reconstructing Websites for the Lazy Webmaster," Technical Report, arXiv cs.IR/0512069, 2005.
22
Recovery Success by MIME Type
23
Repository Contributions
24
Current & Future Work
  • Building a web interface for Warrick
  • Currently crawling and reconstructing 300 randomly sampled websites each week
  • Move from descriptive model to proscriptive & predictive model
  • Injecting server-side functionality into the WI (web infrastructure)
  • Recover the PHP code, not just the HTML

25
Time Queries
26
Traditional Web Crawler
27
Web-Repository Crawler
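The contrast between the two crawlers is where the next resource comes from: a traditional crawler dereferences URLs on the live site, while a web-repository crawler asks each repository for its copy of a URL, keeps a recovered version, and queues the links found in it. A high-level sketch of that loop follows, with a hypothetical fetch_from_repo stub standing in for the per-repository query logic.

from collections import deque
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from a recovered HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_from_repo(repo: str, url: str) -> Optional[str]:
    """Hypothetical stub: a real version would query the repository's
    cache/API for its copy of url and return the content, or None."""
    return None

def reconstruct(start_url: str, repos: list) -> dict:
    """Web-repository crawl: recover pages from caches, not from the live site."""
    recovered = {}
    frontier = deque([start_url])
    host = urlparse(start_url).netloc
    while frontier:
        url = frontier.popleft()
        if url in recovered:
            continue
        # Ask every repository for this URL and keep the first copy returned.
        copies = [c for c in (fetch_from_repo(r, url) for r in repos) if c]
        if not copies:
            continue            # no repository holds this resource
        recovered[url] = copies[0]
        # Queue the links found in the recovered page (a traditional crawler
        # would instead follow links on the live site).
        parser = LinkExtractor()
        parser.feed(recovered[url])
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host:   # stay within the lost site
                frontier.append(absolute)
    return recovered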
28
Limitations
  • Web crawling
    • Limit hit rate per host
    • Websites periodically unavailable
    • Portions of website off-limits (robots.txt, passwords)
    • Deep web
    • Spam
    • Duplicate content
    • Flash and JavaScript interfaces
    • Crawler traps
  • Web-repository crawling
    • Limit hit rate per repository
    • Limited hits per day (API query quotas)
    • Repositories periodically unavailable
    • Flash and JavaScript interfaces
    • Can only recover what the repositories have stored
    • Lossy format conversions (thumbnail images, HTML-ized PDFs, etc.)
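The hit-rate and quota limitations on the web-repository side translate directly into throttling logic: track how many queries each repository has answered today, pause between requests, and stop asking once the daily allowance is spent. A small sketch with made-up numbers; real limits come from each repository's API terms.

import time
from dataclasses import dataclass

@dataclass
class RepoBudget:
    name: str
    daily_quota: int      # placeholder value; real limits come from each API's terms
    min_interval: float   # minimum seconds between requests to this repository
    used: int = 0
    last_request: float = 0.0

    def acquire(self) -> bool:
        """Return True if a query may be issued now, False if the quota is spent."""
        if self.used >= self.daily_quota:
            return False                      # daily quota exhausted: resume tomorrow
        wait = self.min_interval - (time.monotonic() - self.last_request)
        if wait > 0:
            time.sleep(wait)                  # respect the per-repository hit rate
        self.used += 1
        self.last_request = time.monotonic()
        return True

# Example budgets; the numbers are illustrative, not the engines' actual limits.
budgets = [RepoBudget("google", daily_quota=1000, min_interval=1.0),
           RepoBudget("yahoo",  daily_quota=5000, min_interval=1.0)]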