International Collaboration on Web Archiving

1
International Collaboration on Web Archiving
  • Michele Kimpton
  • Internet Archive
  • Julien Masanès
  • Bibliothèque Nationale
  • de France

2
Who is Internet Archive?
  • Our Library
  • 400 machines, 200 TB
  • 100mb/sec connectivity
  • A global collection of pages from the publicly
    available web, from 1996 onward

3
Why archive the Web?
  • What we do not save now will be lost forever

4
Why archiving the Web is Difficult
  • It is massive and growing
  • IA is currently archiving 2 billion pages per
    month, 10 TB of data from over 2 million domains
  • The Web continues to grow by more than 300k new
    domains each year
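
As a rough check on the scale quoted above, the monthly figures can be turned into per-second rates. This is a back-of-envelope sketch, not IA's own accounting: it assumes a 30-day month and decimal terabytes (10^12 bytes), neither of which the slide states.

```python
# Back-of-envelope rates derived from the slide's monthly figures.
# Assumptions (not from the slides): 30-day month, 1 TB = 10**12 bytes.
PAGES_PER_MONTH = 2_000_000_000
BYTES_PER_MONTH = 10 * 10**12
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

pages_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
bytes_per_page = BYTES_PER_MONTH / PAGES_PER_MONTH
megabits_per_second = BYTES_PER_MONTH * 8 / SECONDS_PER_MONTH / 10**6

print(f"{pages_per_second:.0f} pages/s")
print(f"{bytes_per_page:.0f} bytes per page on average")
print(f"{megabits_per_second:.1f} Mbit/s sustained")
```

Under these assumptions the crawl averages roughly 770 pages per second and about 5 KB of stored data per page.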

5
It is constantly changing
  • Average page changes every 100 days
  • Median age of a site is 19 months
  • IA currently takes a snapshot every 2 months, but
    this is not enough in some cases
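
To see why a two-month snapshot interval can miss content, one can model page changes as a Poisson process with a mean of one change per 100 days. That model is an assumption made here for illustration; the slides give only the average. The sketch estimates how often a page changes more than once between snapshots, in which case at least one version is never captured.

```python
import math

MEAN_DAYS_BETWEEN_CHANGES = 100  # from the slide
SNAPSHOT_INTERVAL_DAYS = 60      # "every 2 months", taken as 60 days (assumption)

def change_probabilities(interval_days, mean_days):
    """Return (P(at least one change), P(two or more changes)) in one interval,
    assuming changes follow a Poisson process (an illustrative assumption)."""
    expected = interval_days / mean_days
    p_none = math.exp(-expected)
    p_one = expected * math.exp(-expected)
    return 1 - p_none, 1 - p_none - p_one

p_changed, p_missed_version = change_probabilities(
    SNAPSHOT_INTERVAL_DAYS, MEAN_DAYS_BETWEEN_CHANGES
)
print(f"P(page changed since last snapshot): {p_changed:.2f}")
print(f"P(an intermediate version was missed): {p_missed_version:.2f}")
```

Under this model roughly 45% of pages change between two snapshots, and about 12% change more than once. Real change rates vary widely from page to page, so these numbers only indicate the order of magnitude.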

6
It is without boundaries and interconnected
  • On average, 20 links per page, 5 of which point
    off-site
  • Cannot be characterized as a discrete object,
    like traditional library materials
  • The user can seamlessly move through the
    Universal Internet landscape
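
The on-site versus off-site distinction above can be sketched in a few lines. This is a minimal illustration, not the crawler's actual logic: the function name `split_links` and the example URLs are hypothetical, and comparing hostnames is a simplification that treats subdomains as separate sites.

```python
from urllib.parse import urljoin, urlsplit

def split_links(page_url, hrefs):
    """Separate a page's links into on-site and off-site by comparing hostnames.
    Simplification: a different subdomain counts as a different site."""
    page_host = urlsplit(page_url).hostname
    on_site, off_site = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        host = urlsplit(absolute).hostname
        (on_site if host == page_host else off_site).append(absolute)
    return on_site, off_site

# Hypothetical page with two relative links and two links to other hosts.
on_site, off_site = split_links(
    "http://example.org/news/index.html",
    ["/about.html", "story1.html", "http://example.com/partner", "http://example.net/"],
)
print(len(on_site), "on-site,", len(off_site), "off-site")
```

Every off-site link is a point where an archive limited to one site, or one domain, either follows the link or leaves a hole.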

7
Crawling within a Country Domain is difficult
  • Currently difficult to get a complete list of all
    national sites
  • Discovery of new URLs: without global crawling,
    how do you find them?
  • Identification of sites created locally but
    registered or hosted outside the country domain
  • How to decide where to put the boundaries?
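
One way to picture the boundary problem is a scope test that a national crawler might apply to each URL. This is a hedged sketch: the function `in_national_scope`, the `extra_hosts` parameter, and the example hosts are all hypothetical, and matching on the country-code suffix is only one possible rule.

```python
from urllib.parse import urlsplit

def in_national_scope(url, country_tld, extra_hosts=frozenset()):
    """Return True if the URL's host is under the country-code domain, or is
    on a hand-maintained list of national sites hosted elsewhere."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith("." + country_tld) or host in extra_hosts

# Hypothetical list of locally created sites registered outside the country domain.
known_national_hosts = frozenset({"www.example.com"})

for url in ["http://www.example.fr/", "http://example.com/", "http://www.example.com/"]:
    print(url, in_national_scope(url, "fr", known_national_hosts))
```

The suffix test is easy to automate, but the extra list has to be discovered and maintained by hand, which is exactly the difficulty the slide describes.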

8
Benefits of global crawling
  • Harvesting of new undiscovered URLs
  • Preservation of linkages which capture the essence of
    the Internet
[Diagram labels: National domain A, National domain B; markers 1, 2, 3]

9
Necessity of cross-access to collections
  • Future users won't be happy if they have to take
    a flight to follow a link

10
Possible levels of collaboration
  • Interoperability: common tools (crawler and
    access tools) and standards (archiving format and
    metadata)
  • Share URL lists to draw a general topology map of
    the Web (distributed crawl)
  • Share crawling facilities
  • Share collections (distributed archives)
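
As an illustration of the URL-sharing level, the sketch below merges URL lists from several institutions and divides the work among them. The slides do not say how a distributed crawl would be split; hashing the hostname is a common technique chosen here as an assumption, and the function names and partner labels are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def assign_partner(url, partners):
    """Pick a partner for a URL by hashing its hostname, so that every URL on
    the same host goes to the same crawler."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return partners[int.from_bytes(digest[:8], "big") % len(partners)]

def merge_and_partition(url_lists, partners):
    """Merge shared URL lists, drop duplicates, and group URLs by partner."""
    work = {partner: [] for partner in partners}
    for url in sorted(set().union(*url_lists)):
        work[assign_partner(url, partners)].append(url)
    return work

# Hypothetical partners and URL lists.
partners = ["library-a", "library-b", "library-c"]
shared = [
    ["http://example.org/", "http://example.org/a"],
    ["http://example.org/a", "http://example.net/"],
]
for partner, urls in merge_and_partition(shared, partners).items():
    print(partner, urls)
```

Keeping a whole host with one partner avoids two crawlers hitting the same server, at the cost of uneven load when a few hosts are very large.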


11
Status
  • 8 National Libraries have decided to set up
    a working group on web archiving
  • Specification of archiving crawler and tools
    (open source)
  • Explore the possibility of collecting web sites
    in collaboration over 3 years

12
Questions?
  • Discussion list: web-archive_at_cru.fr
  • http://listes.cru.fr/wws/info/web-archive