Transcript and Presenter's Notes

Title: William Y. Arms


1

Cornell Information Science Research Seminar:
The Web Lab
http://weblab.infosci.cornell.edu/
  • William Y. Arms
  • Manuel Calimlim
  • Lucy Walle
  • Felix Weigel
  • January 23, 2007

2
The Web Lab: A Joint Project of Cornell University and the Internet Archive
  • Faculty: William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang,...
  • Researchers: Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix Weigel,...
  • Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 50 M.Eng. and undergraduate students from Information Science and Computer Science
  • Internet Archive: Brewster Kahle, Tracey Jacquith, Michael Stack, Kris Carpenter,...
3
Introduction to the Web Lab: Mining the History of the Web
  • The Internet Archive's Web Collection
  • Complete crawls of the Web, every two months
    since 1996
  • Total archive is about 110,000,000,000 pages
    (110 billion)
  • Recent crawls are about 60 TByte (compressed)
  • Total archive is about 1,900 TByte (compressed)
  • Metadata contains format, links, anchor text

4
The Library Stacks: the Internet Archive
5
The Wayback Machine
  • Demo
  • http://www.archive.org/

6
Research using Metadata about Web Pages
  • Current NSF grant
  • Research using anchor text
  • links to microsoft.com and google.com
  • Changes to the link structure of the Web
  • differences between crawls
  • densification (increases in average node
    degree)
  • Formation of online groups

7
Example of Past Work: Social and Information Networks, Joining a Community
  • Close to one billion (user, community)
    instances
  • Work by Lars Backstrom, Dan Huttenlocher, Jon
    Kleinberg, and Xiangyang Lan

8
The Never-ending Research Dialog
[Slide diagram: an ongoing exchange between RESEARCHER and INFORMATION SCIENTIST]
  • "Here's an analysis we would like to do..."
  • "Not as you suggest it, but here's another idea..."
  • "We don't know how to do that analysis. Would this be any use to you?"
  • "That might be possible, with the following modification..."
  • "Let's try it and see."
9
The Role of Web Data for Social Science Research
  • Social networks are an important research topic
  • Emergence of global phenomena from local effects
  • Viral spreading of rumors
  • Behavior of individuals in a community
  • Roles in discussion threads, herd behavior in
    opinion polls
  • Network structure and dynamics
  • Strength of weak ties, triangle relations,
    homophily

10
How to Observe a Social Network?
  • Social network research before the web
  • Talk to people, make notes
  • Distribute questionnaires, gather statistics
  • Problems with this approach
  • Tedious task
  • Small scale
  • The Internet Archive is a great resource for
    research
  • Contains web pages with social networks
  • Records the history of the pages

11
Social Networks on the Web
  • The web contains many social networks
  • Sites for social networking, social bookmarking,
    file sharing
  • MySpace, Facebook, Flickr, Delicious
  • Community portals
  • Yahoo Groups, DBLife
  • Encyclopedia and folksonomy projects
  • Wikipedia, Wikia
  • Review sites and customer comments
  • Amazon, Netflix
  • Blogs, web forums, Usenet

12
The Bliss and Curse of Digital Data
  • Opportunities
  • Collecting network data at an unprecedented scale
  • Verifying hypotheses in many different networks
  • Monitoring communities at a finer granularity
  • Mining and searching social networks
  • Challenges
  • Finding suitable information on the web
  • Extracting information from web pages
  • Making web data persistent
  • Processing very large data sets
  • Access rights and privacy

13
Web Lab and Social Science Research
  • Collaboration with Cornell's Institute for the Social Sciences
  • Our goal: make data available to researchers
  • Large web graph database with multiple crawls
  • Packaged subsets of crawls for analysis
  • Visual extraction tool for creating new data sets
    (ongoing)
  • Small-scale crawling for adding new web sites
    (starting)
  • Full-text indexing (planned)
  • Demo of the extraction tool available at
  • http://www.cs.cornell.edu/weigel/WrapperDemo/

14
Web Data Extraction
  • Researchers often don't care about web pages, but about specific substructures inside the pages (a toy extraction sketch follows this list)
  • Blog postings
  • Web forums
  • Social tagging
  • News headlines
  • Tables of content
  • Bibliographies
  • Product details
  • Customer reviews
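
As a toy illustration of extracting such substructures, here is a short Python sketch using the standard html.parser module. The markup and the "headline" class name are invented for the example; the Web Lab's visual tool creates extraction rules interactively rather than by hand-written code like this.

from html.parser import HTMLParser

class HeadlineExtractor(HTMLParser):
    """Collect text inside <h2 class="headline"> elements (invented markup)."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())

page = '<h2 class="headline">Web Lab opens</h2><p>...</p>'
parser = HeadlineExtractor()
parser.feed(page)
print(parser.headlines)   # ['Web Lab opens']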

15
Web Data Collaboration Server
  • Data extraction
  • Writing extraction code is a tedious task
  • Create tools to make the data easily accessible
    in a structured format (e.g., tables in a
    database)
  • Data sharing
  • Extracting the same data repeatedly is a waste
    of time and storage space
  • Let users share their data and extraction rules
  • Data curation
  • Web data is often incomplete and erroneous
  • Let users collaborate to correct and complete
    the data

16
Demonstration
  • Demo of the extraction tool available at
  • http://www.cs.cornell.edu/weigel/WrapperDemo/

17
The Web Lab System
[Architecture diagram: the Internet Archive side comprises the Web Collection, the Wayback Machine, and text indexes; the Cornell University side comprises a file server, a computer cluster, national supercomputers, the structure database, text indexes, and a page store.]
18
Technical Processing in the Web Lab
  • Networking: Internet 2, National Lambda Rail
  • Wayback Machine: commodity computers with local file systems
  • Structure database: relational database system on a large shared-memory computer
  • Data analysis: specialized Linux cluster with the Hadoop distributed file system and MapReduce programming
Different types of computer for different functions.
19
The Research Process
  • Select a subset for analysis
  • SQL: query the relational database directly
  • Use the GetPages tool on the Web site to send an SQL query
  • Download the subset
  • To the researcher's computer
  • To the Web Lab file server
  • Clean-up the data
  • MapReduce tasks on the Hadoop cluster
  • Data analysis
  • MapReduce tasks on the Hadoop cluster

20
Selection Methods
  • By known identifier (Wayback Machine)
  • web pages with the URL http://www.nsf.gov/
  • By character string (full text indexing) -- future
  • all pages containing "Internet is doubling every six months"
  • all pages containing the SARS-CoV genetic sequence
  • By metadata criteria
  • all web pages that link to microsoft.com but not
    to google.com
  • all email addresses that I used to receive mail
    from but have not had mail from recently
  • Example provided by Marc Smith

21
Benefits of Using a Relational Database
  • Simple query language for retrieving data
  • Transaction support
  • Concurrency control for parallel queries
  • Multiple indices for high performance
  • Reliability since databases have built-in
    recovery functionality

22
Metadata Loading
  • The crawler outputs compressed metadata files
    (DAT files).
  • Each DAT file has a set of crawled pages with page metadata, including crawl time, IP address, MIME type, language encoding, etc.
  • Most importantly, the outgoing links from each
    page are parsed, including the full URL and
    associated anchor text.
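
A minimal sketch of this loading step, assuming the DAT records have already been parsed into Python dictionaries. The field names, record layout, and SQLite stand-in are illustrative only, not the actual DAT format or the production relational database.

import sqlite3

# Illustrative records only; the real DAT layout and field names differ.
records = [
    {"url": "http://www.example.edu/", "crawl_time": "2006-11-01T12:00:00",
     "ip": "192.0.2.10", "mime": "text/html",
     "links": [("http://www.example.org/a", "anchor text of the link")]},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (url TEXT, crawl_time TEXT, ip TEXT, mime TEXT)")
conn.execute("CREATE TABLE link (from_url TEXT, to_url TEXT, anchor TEXT)")

for rec in records:
    conn.execute("INSERT INTO page VALUES (?, ?, ?, ?)",
                 (rec["url"], rec["crawl_time"], rec["ip"], rec["mime"]))
    # Store each outgoing link with its full URL and associated anchor text.
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)",
                     [(rec["url"], to, a) for to, a in rec["links"]])
conn.commit()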

23
Database Schema
  • Crawl: name of the crawl from which data is loaded
  • Page: metadata about each web page, plus fields to help find and extract the full HTML text
  • Link: the outgoing links from crawled pages
  • Url: lookup table for unique URLs
  • Host: lookup table for unique hostnames
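
As an illustration of the metadata criterion from the earlier selection slide ("all web pages that link to microsoft.com but not to google.com"), here is a hedged sketch against a radically simplified single-table version of this schema; the production schema normalizes URLs and hostnames through the Url and Host lookup tables.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (from_url TEXT, to_host TEXT)")
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("http://a.example/", "microsoft.com"),
    ("http://a.example/", "google.com"),
    ("http://b.example/", "microsoft.com"),
])

# Pages with at least one link to microsoft.com and none to google.com.
rows = conn.execute("""
    SELECT DISTINCT from_url FROM link
    WHERE to_host = 'microsoft.com'
      AND from_url NOT IN (SELECT from_url FROM link
                           WHERE to_host = 'google.com')
""").fetchall()
print(rows)   # [('http://b.example/',)]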

24
Crawls Loaded Into SQL DB
25
Selection from the Database
  • SQL: query the relational database directly
  • (Contact Manuel Calimlim)
  • Use the GetPages tool on the Web site to send
    an SQL query -- work in progress

26
Demonstration
  • Demonstration of the Web Lab web site
  • http://weblab.infosci.cornell.edu/
  • and the GetPages tool

27
Massive Data Analysis by Non-Specialists
  • A typical scientist or social scientist
  • Has deep domain knowledge
  • Has good algorithmic understanding
  • Is often a competent computer user, or has a research assistant who is familiar with languages such as Fortran, Python, and Matlab, or application packages such as SAS and Excel
  • But...
  • Has limited understanding of large-scale data
    analysis
  • Is not skilled at any form of computing that
    requires parallel computing or concurrency
  • Typical problem of scale: given 100 billion URLs, how do you identify duplicates?

28
Hadoop and MapReduce Programming
Hadoop: an open source distributed file system similar to the Google File System; it supports MapReduce programming. http://lucene.apache.org/hadoop/
MapReduce: a functional programming style to support large-scale data analysis without the need for global data structures.
In the 1960s, Fortran gave scientists a simple way to translate mathematical problems into efficient computer codes. MapReduce programming gives researchers a simple way to run massive data analysis on large computer clusters.
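
To make the style concrete, here is a minimal single-machine simulation of the MapReduce pattern, applied to the previous slide's problem of identifying duplicate URLs. This is a sketch of the paradigm, not Hadoop code; on Hadoop the same map and reduce functions would run in parallel across a cluster, with the framework performing the grouping.

from collections import defaultdict

def map_fn(url):
    yield (url, 1)          # the URL itself is the grouping key

def reduce_fn(url, counts):
    if sum(counts) > 1:     # all values for one key arrive together
        yield url           # a key seen more than once is a duplicate

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:                   # map phase
        for key, value in map_fn(item):
            groups[key].append(value)     # single-machine stand-in for the shuffle
    for key, values in groups.items():    # reduce phase
        yield from reduce_fn(key, values)

urls = ["http://a/", "http://b/", "http://a/"]
print(list(run_mapreduce(urls, map_fn, reduce_fn)))   # ['http://a/']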
29
The MapReduce Paradigm
[Diagram: input data split into files (split 0 ... split 4) feeds M map tasks; map output goes to intermediate files, each divided into R partitions; R reduce tasks, one per partition, produce the output files (Output 0, Output 1).]
30
A Web Graph Example
[Figure: an example web graph with six numbered nodes.]
31
Building the Web Graph
URLs, pages, and links:
  • URLs contained in web pages may link to pages never crawled
  • URLs are not canonicalized: different URLs may refer to the same page
  • Links are from a page to a URL
Web graph from crawl data:
  • Nodes are the union of pages crawled and URLs seen
  • Each node and edge has time interval(s) over which it exists
32
Web Graph Example
Problem: given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph.
  • Replace each u0 or v0 with its canonicalized form u or v.
  • Create a list of all nodes of the graph, i.e., the set of unique u.
  • Discard all (u, v) pairs where u = v, or where v is not a node of the graph.
  • Discard all duplicate edges.
  • For each node v, create a list (v, {u}), where {u} is the set of nodes that have edges to node v.
Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?
33
MapReduce Example
Map task:
Input: (u0, v0)
Output:
  • (u, d) // indicates that u is a from-URL
  • (v, u) // indicates that v is a to-URL with a link from u
Here d is a dummy marker. Do not output if u = v. This is simple application code to write.
34
A MapReduce Example
Merge: the input to the reduce process merges the output values from the map tasks that correspond to each URL. For each URL w, it creates a list (w, [d, ..., d, u1, ..., uk]). This merge is performed automatically by the system libraries.
35
A MapReduce Example
Reduce:
Input: (w, [d, ..., d, u1, ..., uk]), where w is any URL.
Output: if there is no marker d in the list, discard and do not output; this corresponds to a URL that never appears as the first element of a (u, v) pair. Otherwise, remove duplicates from u1, ..., uk and output. The output is a to-URL and the list of nodes that link to it: (v, [u1, ..., uk]). This is simple application code to write.
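
Putting the last few slides together, here is a single-machine sketch of the map/merge/reduce pipeline. The canonicalize() function is a crude placeholder (real URL canonicalization is much more involved), and the dictionary stands in for the merge that Hadoop performs automatically.

from collections import defaultdict

D = "<d>"   # dummy marker d: flags that a URL appeared as a from-URL

def canonicalize(u0):
    # Placeholder only; real URL canonicalization is much more involved.
    return u0.lower().rstrip("/")

def map_fn(pair):
    u, v = canonicalize(pair[0]), canonicalize(pair[1])
    if u != v:                 # do not output if u = v
        yield (u, D)           # u is a from-URL, hence a node of the graph
        yield (v, u)           # v is a to-URL with a link from u

def reduce_fn(w, values):
    if D not in values:        # w never appears as a from-URL: not a node
        return
    sources = sorted(set(x for x in values if x != D))   # drop duplicate edges
    if sources:
        yield (w, sources)     # the to-URL and the nodes linking to it

pairs = [("http://A/", "http://b/"), ("http://a/", "http://B/"),
         ("http://b/", "http://a/"), ("http://c/", "http://c/")]

groups = defaultdict(list)     # the merge Hadoop performs automatically
for p in pairs:
    for key, value in map_fn(p):
        groups[key].append(value)
for w, values in groups.items():
    for result in reduce_fn(w, values):
        print(result)   # ('http://a', ['http://b']) and ('http://b', ['http://a'])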
36
For the Future: Examples of Tools and Services
  • The Web Lab is steadily building a set of tools
    for researchers
  • API and Web services
  • GetPages: Web forms to select a dataset by query of a relational database, with indexes by date, URL, domain name, file type, anchor text, etc.
  • Focused Web crawling (modification of Heritrix
    crawler)
  • Extraction of the Web graph from a subset, and calculations such as PageRank (see the sketch after this list) and hubs and authorities
  • Graph visualization
  • Natural language processing of anchor text
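
As an illustration of one calculation named above, here is a textbook power-iteration PageRank. This is a sketch of the standard algorithm on a toy graph, not the Web Lab's tool; at Web scale the same iteration would be expressed as MapReduce steps.

def pagerank(graph, damping=0.85, iters=50):
    # graph: {node: [outlinked nodes]}; standard power iteration.
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Toy graph in the spirit of the six-node example earlier in the deck.
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6], 6: [1]}
print(pagerank(graph))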

37
The Web Lab is Ready for Use
  • We are ready to work with a number of
    researchers
  • Systems
  • Relational database operational
  • Hadoop pilot cluster (large cluster soon)
  • File server and web server operational
  • People
  • Manuel Calimlim (database)
  • Lucy Walle (Hadoop MapReduce)
  • Tools
  • A variety of tools in prototype
  • Experience with large volumes of anchor text and
    URLs

38
Thanks
This work would not be possible without the
forethought and long-standing commitment of
Brewster Kahle and the Internet Archive to
capture and preserve the content of the Web for
future generations. This work has been funded in
part by the National Science Foundation, grants
CNS-0403340, DUE-0127308, SES-0537606,
IIS-0634677, and IIS-0705774.