Internet Archive - PowerPoint PPT Presentation

About This Presentation
Title:

Internet Archive

Description:

Transitioned from 'Archive of the Internet' to 'Archive on the Internet' ... Upload your movie to the Archive. Build a movie at the Archive! Texts. Have 20K books ... – PowerPoint PPT presentation

Slides: 27
Provided by: raym153
Learn more at: http://www.fdis.org
Tags: archive | internet


Transcript and Presenter's Notes

Title: Internet Archive


1
Internet Archive: Web Datamining
  • Raymie Stata
  • UC Santa Cruz, Internet Archive

2
Agenda
  • State of the Archive
  • Collections
  • Infrastructure (freecache)
  • Internet Analytics
  • Information carnivores

3
Archive Overview
  • Started in 1996
  • Transitioned from 'Archive of the Internet' to
    'Archive on the Internet'
  • Transitioning to 'Digital Library of the Future'
  • Funding from private foundations, plus lots of
    volunteers

4
Digital Library of the Future
Universal Access to Human Knowledge
  • Information is accessible to anyone from anywhere
  • The best and broadest information is available
  • We imagine a small network of very large,
    regional, mega digital libraries

5
Web collection
  • Over 10B pages, 200TB, 50M sites
  • Broad crawls (20TB snapshot/2 months)
  • Narrow crawls (elections, 9/11)
  • Heritage crawls
  • Writing new crawler :-(
  • Wayback machine
  • Success! 4M hits/day
  • Have search engine, but hidden!
  • Policy has been tested, remains same

6
Moving images
  • 2500 Movies
  • Open source movies
  • Upload your movie to the Archive
  • Build a movie at the Archive!

7
Texts
  • Have > 20K books
  • Actively involved in 1M Book and ICDL
  • Bookmobile
  • Protest of Eldred
  • Real interest turned out to be overseas
  • India (30!), Egypt, Uganda
  • Spun into separate non-profit

8
Audio - eTree
  • Around 5,000 concerts from 250 bands
  • Growing 30 concerts, 1 band/day
  • Largest consumer of bandwidth
  • Consistent 85Mbps (downloads)
  • Same policy as Wayback
  • We respect requests
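
To put the "consistent 85Mbps" figure in storage terms, the rate can be converted to a daily volume. This is a small arithmetic sketch, assuming decimal units (1 Mbps = 10^6 bits/s, 1 GB = 10^9 bytes); the function name is illustrative and not from the talk.

```python
def bytes_per_day(megabits_per_second: float) -> float:
    """Convert a sustained rate in Mbps (10**6 bits/s) to bytes per day."""
    bits_per_second = megabits_per_second * 1_000_000
    seconds_per_day = 24 * 60 * 60
    return bits_per_second * seconds_per_day / 8


daily = bytes_per_day(85)
print(f"{daily / 1e9:.0f} GB/day")  # prints "918 GB/day"
```

At that rate the eTree downloads alone move a little under 1 TB per day.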

9
Infrastructure
  • Infinite bandwidth and storage
  • Core competency of the Archive
  • Vision, not reality
  • But striving for it makes us better
  • Recent challenges
  • Moving from 250TB to 1PB
  • Supporting eTree bandwidth

10
The Petabyte challenge
  • Finally having problems predicted
  • Power, cooling, disk failures dominating
  • Need larger staff, real software engineering
  • BUT
  • Took much longer than anticipated
  • Sticking to our philosophies
  • Commodity hardware
  • Widely used software, simple scripts

11
The Petabyte architecture
  • New datacenter
  • To solve our power and cooling problems
  • Better procurement process
  • File-level mirroring
  • Use basic FS, simple scripts
  • Preparing for geoplexing (vs. file-level RAID)
  • Elimination of inter-crawl copies
  • This is currently our backup
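
The slide does not show the scripts themselves. As a rough illustration of what "file-level mirroring" on a basic file system can look like, here is a minimal sketch; the function name, the directory layout, and the size-plus-mtime staleness check are all assumptions, not the Archive's actual tooling.

```python
import shutil
from pathlib import Path


def mirror_tree(primary: Path, mirror: Path) -> list[Path]:
    """Copy every file under `primary` that is missing or stale on `mirror`.

    Returns the relative paths that were copied. A file counts as up to date
    when the mirror copy has the same size and is not older (assumed rule).
    """
    copied = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = mirror / rel
        if dst.exists():
            s, d = src.stat(), dst.stat()
            if d.st_size == s.st_size and d.st_mtime >= s.st_mtime:
                continue  # mirror copy is current, nothing to do
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(rel)
    return copied
```

Because each file is mirrored whole, either copy can be read with ordinary file system tools, which is the appeal of this approach over block-level schemes.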

12
The (eTree) bandwidth challenge
  • Can we do better than simply buying more
    bandwidth?
  • Yes! Find other people willing to help
  • Cooperative/open-source CDN

13
Freecache.org
  • It shouldn't cost you to give away content
  • To distribute using freecache, simply
  • Replace href="http://X/Y"
  • With href="http://freecache.org/http://X/Y"
  • To be a distribution node, simply install a 1K
    perl-script on your Apache server
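
The link rewrite described on this slide amounts to prefixing the original URL with the freecache host. A minimal sketch follows; the helper name and the decision to leave already-rewritten links alone are assumptions for illustration.

```python
FREECACHE_PREFIX = "http://freecache.org/"


def to_freecache(url: str) -> str:
    """Rewrite http://X/Y into http://freecache.org/http://X/Y."""
    if not url.startswith("http://"):
        raise ValueError(f"expected an http:// URL, got {url!r}")
    if url.startswith(FREECACHE_PREFIX):
        return url  # already points at freecache (assumed behaviour)
    return FREECACHE_PREFIX + url


print(to_freecache("http://X/Y"))  # prints "http://freecache.org/http://X/Y"
```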

14
Freecache design
  • Content routing done centrally
  • Right now, routing is random
  • Working on closeness-driven routing
  • LRU eviction policy
  • Throttles cheaters
  • Broken browsers have been a problem
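
Two of the design points above, random routing and LRU eviction, are easy to sketch. The code below is an illustration only: the class and function names are invented here, and the real freecache node was a Perl script whose internals the slides do not show.

```python
import random
from collections import OrderedDict


def pick_node(nodes, rng=random):
    """Random routing: any distribution node may serve any request."""
    return rng.choice(nodes)


class LRUCache:
    """Byte-bounded cache that evicts the least recently used entry first."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # url -> body, oldest entry first

    def get(self, url):
        body = self._items.get(url)
        if body is not None:
            self._items.move_to_end(url)  # mark as most recently used
        return body

    def put(self, url, body: bytes) -> None:
        if len(body) > self.capacity_bytes:
            return  # too large to cache at all
        if url in self._items:
            self.used_bytes -= len(self._items.pop(url))
        self._items[url] = body
        self.used_bytes += len(body)
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)  # drop the oldest
            self.used_bytes -= len(evicted)
```

Closeness-driven routing would replace `pick_node` with a choice based on some distance measure between client and node; the slide does not say which measure is planned.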

15
Web scale datamining
Apps
  • Use data
  • Wayback, Wayback search
  • Web characterization
  • Story lifecycle analyzer

Access
Feature Datamarts
  • Access subsets of data fast
  • Full-text index, shingleprints
  • Connectivity, Term vectors
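
The slide does not define "shingleprints". Assuming the term refers to fingerprints of word shingles in the style of Broder's resemblance work, a datamart entry for a page could be built roughly as follows. The shingle width, the use of SHA-1, and the number of fingerprints kept are all illustrative choices.

```python
import hashlib


def shingles(text: str, w: int = 4) -> set:
    """Return the set of w-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def fingerprint(shingle: str) -> int:
    """64-bit fingerprint of one shingle (first 8 bytes of its SHA-1)."""
    return int.from_bytes(hashlib.sha1(shingle.encode("utf-8")).digest()[:8], "big")


def shingleprint(text: str, w: int = 4, keep: int = 8) -> list:
    """Compact sketch of a page: its `keep` smallest shingle fingerprints."""
    return sorted(fingerprint(s) for s in shingles(text, w))[:keep]


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Exact Jaccard overlap of the two shingle sets, for comparison."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Storing only the small sketch per page is what lets a datamart answer near-duplicate questions without touching the full page text in the warehouse.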

Warehouse
  • Store and access pages
  • Page cache
  • Feature extractor

Data collection
  • Download web pages
  • Donations, crawling

16
Tools for Web mining
  • Very similar to the Astronomy project
  • Need indexes, parallelism
  • Need to move computation to the data
  • Strategies to deal with different result-set
    sizes
  • Current focus is on the warehouse

17
Web datamining using Web Carnivores

18
The Carnivore Analogy [Etzioni96]
Web pages

19
The Carnivore Analogy
Search engines
Web pages

20
The Carnivore Analogy
Carnivore apps
Search engines
Web pages

21
Carnivores
  • Search engines have what you want
  • Google has 3B pages: it's in there
  • No need to crawl anymore
  • However, their general-purpose interfaces do not
    always yield good results for specific
    information needs

22
Googlisms: a fun carnivore
Googlism for: scott kirkpatrick
  • scott kirkpatrick is an associate for ross
  • scott kirkpatrick is an awesome drummer with
    many fine credits to his name
  • scott kirkpatrick is 17 but certified as an
    adult
  • scott kirkpatrick is listed as one of the
    executors in the will of george hankins dated 1
    october 1838 in jackson county
  • scott kirkpatrick is the new chairperson
  • scott kirkpatrick is joining the flett
    chiropractic clinic
Googlism for: john kubiatowicz
  • john kubiatowicz is a professor in computer
    science at uc berkeley
  • john kubiatowicz is currently an assistant
    professor at the university of california at
    berkeley
  • john kubiatowicz is designing a
  • john kubiatowicz is working on oceanstore
  • john kubiatowicz is a researcher at berkeley
    exploring the space of introspective computing
  • john kubiatowicz is a doctoral candidate in the
    department of electrical engineering and
    computer science at mit
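
The output above suggests a simple pattern: collect every "<name> is ..." statement from search-result text. How Googlism actually worked is not described in the talk, so the following is only a guess at the idea. It takes snippets as plain strings (no real search API is called), and the function name is made up.

```python
import re


def extract_is_statements(name: str, snippets: list) -> list:
    """Collect distinct '<name> is ...' statements, lower-cased, in order seen."""
    pattern = re.compile(
        r"\b" + re.escape(name) + r"\s+is\s+[^.!?\n]+", re.IGNORECASE
    )
    seen = set()
    statements = []
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            # Normalise case and whitespace so duplicates collapse.
            statement = " ".join(match.group(0).lower().split())
            if statement not in seen:
                seen.add(statement)
                statements.append(statement)
    return statements
```

The carnivore never crawls anything itself: it feeds on whatever snippets the search engine returns for the name.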
23
A carnivore for genre search
  • Genre classifies documents by their intent
  • Why was the document written?
  • Search engines search by topic, not genre
  • Idea: build a carnivore for genre search

24
Genre search engine
[Diagram; box labels: Topic (from user), Query Generation, Filter, Results, Genre (static), Term-vector generation]
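
One plausible reading of the boxes named on this slide: the user's topic is expanded into search-engine queries, and the returned pages are filtered by comparing their term vectors with a fixed vector for the genre. The sketch below follows that reading. The wiring, the cosine measure, the threshold, and every name in it are assumptions; the slide gives only the box labels.

```python
import math
import re
from collections import Counter


def generate_queries(topic: str, templates: list) -> list:
    """Query generation: fill the user's topic into each query template."""
    return [template.format(topic=topic) for template in templates]


def term_vector(text: str) -> Counter:
    """Term-vector generation: bag of lower-cased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def genre_filter(pages: list, genre_vector: Counter, threshold: float) -> list:
    """Filter: keep pages whose term vector is close enough to the genre."""
    return [p for p in pages if cosine(term_vector(p), genre_vector) >= threshold]
```

The queries from `generate_queries` would be sent to an ordinary search engine, and `genre_filter` applied to what comes back.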
25
Making it work
  • Query templates
  • Details of the query matter
  • PMI-IR for genre terms
  • Discrimination as well as genre vector
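
PMI-IR estimates pointwise mutual information between two terms from search-engine hit counts. The slide does not say how the scores were used to pick genre terms, so the sketch below only shows the score itself; the function name and the numbers in the usage line are made up for illustration.

```python
import math


def pmi_from_hits(hits_both: int, hits_a: int, hits_b: int, total_pages: int) -> float:
    """PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), with p = hits / total_pages."""
    if hits_both == 0 or hits_a == 0 or hits_b == 0:
        return float("-inf")  # the terms never co-occur in the index
    return math.log2((hits_both * total_pages) / (hits_a * hits_b))


# Hypothetical counts: both terms appear on 1,000 of 1,000,000 pages and
# co-occur on 100 of them, which is 100 times more often than chance.
print(round(pmi_from_hits(100, 1_000, 1_000, 1_000_000), 2))  # prints 6.64
```

A candidate genre term with a high score against the genre name co-occurs with it far more often than independence would predict.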

26
User study
  • Genre: Buying guides
  • Education for product selection
  • Lots on the Web, but hard to find
  • (Agreement on what they are)
  • Results
  • Topic by itself: 0% P@10 (i.e., none in top 10)
  • Topic + "buying guide": 33%
  • Our carnivore: 51%
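
The metric in these results is precision at 10: the fraction of the top ten results that judges marked as belonging to the genre. A minimal sketch, with made-up judgments rather than data from the study:

```python
def precision_at_k(relevant_flags: list, k: int = 10) -> float:
    """Fraction of the top k results judged relevant.

    `relevant_flags` holds one boolean per result, in ranked order. Missing
    results (fewer than k returned) count as not relevant.
    """
    return sum(1 for flag in relevant_flags[:k] if flag) / k


# Illustrative judgments only: 5 of the first 10 results are on-genre.
example = [True, False, True, True, False, False, True, False, True, False]
print(precision_at_k(example))  # prints 0.5
```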