Archive-It Architecture Introduction

1
Archive-It Architecture Introduction
  • April 3, 2006
  • Dan Avery
  • Internet Archive

2
Archive-It Components
  • Crawling
  • User Interface
  • Storage
  • Playback
  • Text Indexing
  • Integration

3
Component Integration
4
Crawling
  • Heritrix (http://crawler.archive.org/)
  • Java application
  • Open source (LGPL)
  • Crawls for completeness/depth
  • Highly configurable

5
Crawling - Distributed Crawling
  • Heritrix Cluster Controller
  • Java component - open source - developed by IA
  • http://crawler.archive.org/hcc
  • Provides proxy access to pool of Heritrix
    instances through JMX interface
  • Provides crawler control and status
  • Currently controlling 33 crawler instances on
    three commodity dual-Opteron machines; upper bound unknown
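
The controller's role as a single front door to a pool of crawlers can be illustrated with a small sketch. This is not the HCC or its JMX interface: the `Crawler` and `ClusterController` classes, the host naming scheme, and the job name are all hypothetical, and the real component talks to Heritrix over JMX rather than holding Python objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Crawler:
    """Hypothetical stand-in for one Heritrix instance (not the real JMX bean)."""
    name: str
    job: Optional[str] = None  # name of the job currently running, if any

    @property
    def idle(self) -> bool:
        return self.job is None


@dataclass
class ClusterController:
    """Hypothetical single point of contact in front of a crawler pool."""
    pool: list[Crawler] = field(default_factory=list)

    def start_job(self, job: str) -> str:
        # Hand the job to the first idle crawler; fail loudly if none is free.
        for crawler in self.pool:
            if crawler.idle:
                crawler.job = job
                return crawler.name
        raise RuntimeError("no idle crawler available")

    def status(self) -> dict[str, str]:
        # One status entry per instance, gathered in one place.
        return {c.name: c.job or "idle" for c in self.pool}


# 33 instances spread over three machines, matching the figures on the slide.
controller = ClusterController(
    [Crawler(f"host{i % 3 + 1}-crawler{i // 3 + 1}") for i in range(33)]
)
assigned = controller.start_job("partner-weekly-crawl")
print(assigned, len(controller.status()))
```

Callers only ever see the controller, so instances can be added to or removed from the pool without changing the code that submits crawls.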

6
Archive-It Web Application
  • User Interface and Crawl Scheduling
  • Gets seed URLs and crawl parameters from users
  • Schedules new periodic crawls
  • Talks to crawler pool through HCC
  • Provides access, search, and crawl history UI

7
Storage
  • archive.org ARC repository
  • Custom Perl system
  • Simple storage on primary/backup pairs
  • Monthly MD5 digest verification
  • Robust, non-proprietary file format
  • Alexandria (Egypt)/Amsterdam
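
A digest check over a primary/backup pair might look like the sketch below. The production system is described as custom Perl, so this Python version is only an illustration: the function names, file names, and the idea of a digest recorded at ingest time are assumptions.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large ARC files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary: Path, backup: Path, recorded_md5: str) -> dict[str, bool]:
    """Compare both copies against the digest recorded earlier (assumed to exist)."""
    return {
        "primary_ok": md5_of(primary) == recorded_md5,
        "backup_ok": md5_of(backup) == recorded_md5,
    }


# Demonstration with throwaway files; the backup copy is deliberately corrupted.
with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary.arc"
    backup = Path(tmp) / "backup.arc"
    primary.write_bytes(b"archived bytes")
    backup.write_bytes(b"archived bytez")
    recorded = hashlib.md5(b"archived bytes").hexdigest()
    result = verify_pair(primary, backup, recorded)

print(result)
```

When exactly one copy fails, the intact copy can be used to repair the other, which is the point of keeping pairs.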

8
Access
  • Internet Archive Wayback Machine
  • Replaying archived web pages since 2001
  • Current IA version written in Perl and C, with
    components distributed across various machines
  • Not open source, but an open source beta (in Java)
    is available now

9
Full-Text Indexing
  • Nutch (http://nutch.org)
  • NutchWAX (http://archive-access.sf.net) additions
    create and search indexes of stored ARC files
  • Standard text search plus link analysis
  • Can search by date instead of relevance, useful
    for individual archives
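
The difference between relevance ordering and date ordering can be shown on a handful of made-up hits. The field names, scores, and URLs below are hypothetical and do not reflect the NutchWAX result format.

```python
from datetime import date

# Hypothetical search hits; a real result would carry more fields.
hits = [
    {"url": "http://example.org/a", "score": 0.91, "captured": date(2006, 1, 15)},
    {"url": "http://example.org/b", "score": 0.42, "captured": date(2006, 3, 2)},
    {"url": "http://example.org/c", "score": 0.67, "captured": date(2005, 11, 30)},
]

# Relevance ordering: highest score first.
by_relevance = sorted(hits, key=lambda h: h["score"], reverse=True)

# Date ordering: most recent capture first, ignoring the score.
by_date = sorted(hits, key=lambda h: h["captured"], reverse=True)

print([h["url"] for h in by_relevance])
print([h["url"] for h in by_date])
```

Within a single partner's archive, where most hits come from the same few sites, the date ordering is often the more useful of the two.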

10
Text Indexing Challenges
  • Some parts are distributable, some are not
  • Incremental indexing - goal of new crawls in
    index within 72 hours
  • Working on a map/reduce version usable for
    Archive-It (July)
  • In the meantime, a lot of workarounds

11
Integration
  • Group of Perl and bash scripts - planning more
    complex than the execution
  • Most components available individually
  • Decentralized control, centralized monitoring
  • Each component operates almost entirely
    independently

12
The Big Picture
13
Future Challenges
  • Crawler trap detection
  • Scalability
  • Current setup can accommodate 300 partners at
    current crawling rates
  • During pilot we crawled/indexed/stored just over
    100,000,000 documents (4TB) in eight weeks
  • More machines can be easily added to storage and
    crawling clusters
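
The pilot figures translate into rough per-document and per-second rates. The calculation below assumes decimal terabytes and treats "just over 100,000,000" as exactly 100,000,000, so the results are approximations.

```python
documents = 100_000_000
total_bytes = 4 * 10**12          # 4 TB, assuming decimal terabytes
seconds = 8 * 7 * 24 * 60 * 60    # eight weeks

avg_doc_kb = total_bytes / documents / 1000   # average document size in KB
docs_per_second = documents / seconds         # sustained crawl/index/store rate
tb_per_week = total_bytes / 10**12 / 8        # storage growth per week

print(f"{avg_doc_kb:.0f} KB per document")
print(f"{docs_per_second:.1f} documents per second")
print(f"{tb_per_week:.1f} TB per week")
```

That works out to about 40 KB per document, roughly 20.7 documents per second sustained, and 0.5 TB of new storage per week.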

14
Scalability
  • Current Nutch is between versions
  • Old version has some non-distributable pieces
  • New version is much more distributable and
    scalable (map/reduce - Hadoop), but not ready for
    incremental indexing
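
The map/reduce style distributes well because the map step handles each document independently and only the grouping step needs to see everything. A toy inverted index in plain Python makes the shape visible. This is not Nutch or Hadoop code, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs; each document can be mapped on any machine."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(pairs):
    """Group the emitted pairs by term into sorted posting lists."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)


docs = {"d1": "web archive crawl", "d2": "archive index"}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index)
```

Adding a new crawl to an index built this way means rerunning or merging the reduce output, which hints at why incremental indexing is the harder part.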

15
Looking ahead
  • After basic UI/archiving/indexing...
  • Time-based search UI
  • Analyzing archives for research and ongoing
    collection improvement
  • Content classification
  • Rate of change
  • New site suggestions

16
http://www.archive-it.org
17
RLG's Web Archiving Program
  • Collaborative collection development
  • Descriptive metadata for web archives
  • Usability/user studies
  • Intellectual property concerns
  • Web Archiving 101
  • Web archiving services and software