Transcript and Presenter's Notes

Title: William Y. Arms


1

Cornell Information Science Research Seminar:
The Web Lab
http://weblab.infosci.cornell.edu/
  • William Y. Arms
  • Manuel Calimlim
  • Lucy Walle
  • Felix Weigel
  • January 23, 2007

2
The Web Lab: A Joint Project of Cornell University and the Internet Archive
  • Faculty: William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang,...
  • Researchers: Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix Weigel,...
  • Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 50 M.Eng. and undergraduate students from Information Science and Computer Science
  • Internet Archive: Brewster Kahle, Tracey Jacquith, Michael Stack, Kris Carpenter,...
3
Introduction to the Web Lab: Mining the History of the Web
  • The Internet Archive's Web Collection
  • Complete crawls of the Web, every two months
    since 1996
  • Total archive is about 110,000,000,000 pages
    (110 billion)
  • Recent crawls are about 60 TByte (compressed)
  • Total archive is about 1,900 TByte (compressed)
  • Metadata contains format, links, anchor text

4
The Library Stacks: the Internet Archive
5
The Wayback Machine
  • Demo
  • http://www.archive.org/

6
Research using Metadata about Web Pages
  • Current NSF grant
  • Research using anchor text
  • links to microsoft.com and google.com
  • Changes to the link structure of the Web
  • differences between crawls
  • densification (increases in average node
    degree)
  • Formation of online groups

7
Example of Past Work: Social and Information Networks, Joining a Community
  • Close to one billion (user, community)
    instances
  • Work by Lars Backstrom, Dan Huttenlocher, Jon
    Kleinberg, and Xiangyang Lan

8
The Never-ending Research Dialog
[Slide diagram: an ongoing exchange between RESEARCHER and INFORMATION SCIENTIST]
  • "Here's an analysis we would like to do..."
  • "Not as you suggest it, but here's another idea..."
  • "We don't know how to do that analysis. Would this be any use to you?"
  • "That might be possible, with the following modification..."
  • "Let's try it and see."
9
The Role of Web Data for Social Science Research
  • Social networks are an important research topic
  • Emergence of global phenomena from local effects
  • Viral spreading of rumors
  • Behavior of individuals in a community
  • Roles in discussion threads, herd behavior in
    opinion polls
  • Network structure and dynamics
  • Strength of weak ties, triangle relations,
    homophily

10
How to Observe a Social Network?
  • Social network research before the web
  • Talk to people, make notes
  • Distribute questionnaires, gather statistics
  • Problems with this approach
  • Tedious task
  • Small scale
  • The Internet Archive is a great resource for
    research
  • Contains web pages with social networks
  • Records the history of the pages

11
Social Networks on the Web
  • The web contains many social networks
  • Sites for social networking, social bookmarking,
    file sharing
  • MySpace, Facebook, Flickr, Delicious
  • Community portals
  • Yahoo Groups, DBLife
  • Encyclopedia and folksonomy projects
  • Wikipedia, Wikia
  • Review sites and customer comments
  • Amazon, Netflix
  • Blogs, web forums, Usenet

12
The Bliss and Curse of Digital Data
  • Opportunities
  • Collecting network data at an unprecedented scale
  • Verifying hypotheses in many different networks
  • Monitoring communities at a finer granularity
  • Mining and searching social networks
  • Challenges
  • Finding suitable information on the web
  • Extracting information from web pages
  • Making web data persistent
  • Processing very large data sets
  • Access rights and privacy

13
Web Lab and Social Science Research
  • Collaboration with Cornell's Institute for the Social Sciences
  • Our goal: make data available to researchers
  • Large web graph database with multiple crawls
  • Packaged subsets of crawls for analysis
  • Visual extraction tool for creating new data sets
    (ongoing)
  • Small-scale crawling for adding new web sites
    (starting)
  • Full-text indexing (planned)
  • Demo of the extraction tool available at
  • http://www.cs.cornell.edu/weigel/WrapperDemo/

14
Web Data Extraction
  • Researchers often don't care about web pages, but about specific substructures inside the pages (a toy extraction sketch follows this list)
  • Blog postings
  • Web forums
  • Social tagging
  • News headlines
  • Tables of content
  • Bibliographies
  • Product details
  • Customer reviews
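
As a toy illustration of extracting such substructures, here is a short Python sketch using the standard html.parser module. The markup and the "headline" class name are invented for the example; the Web Lab's visual tool creates extraction rules interactively rather than by hand-written code like this.

from html.parser import HTMLParser

class HeadlineExtractor(HTMLParser):
    """Collect text inside <h2 class="headline"> elements (invented markup)."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())

page = '<h2 class="headline">Web Lab opens</h2><p>...</p>'
parser = HeadlineExtractor()
parser.feed(page)
print(parser.headlines)   # ['Web Lab opens']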

15
Web Data Collaboration Server
  • Data extraction
  • Writing extraction code is a tedious task
  • Create tools to make the data easily accessible
    in a structured format (e.g., tables in a
    database)
  • Data sharing
  • Extracting the same data repeatedly is a waste
    of time and storage space
  • Let users share their data and extraction rules
  • Data curation
  • Web data is often incomplete and erroneous
  • Let users collaborate to correct and complete
    the data

16
Demonstration
  • Demo of the extraction tool available at
  • http://www.cs.cornell.edu/weigel/WrapperDemo/

17
The Web Lab System
[Architecture diagram: the Internet Archive side comprises the Web Collection, the Wayback Machine, and text indexes; the Cornell University side comprises a file server, a computer cluster, national supercomputers, the structure database, text indexes, and a page store.]
18
Technical Processing in the Web Lab
  • Networking: Internet 2, National Lambda Rail
  • Wayback Machine: commodity computers with local file systems
  • Structure database: relational database system on a large shared-memory computer
  • Data analysis: specialized Linux cluster with the Hadoop distributed file system and MapReduce programming
Different types of computer for different functions.
19
The Research Process
  • Select a subset for analysis
  • SQL: query the relational database directly
  • Use the GetPages tool on the Web site to send an SQL query
  • Download the subset
  • To the researcher's computer
  • To the Web Lab file server
  • Clean-up the data
  • MapReduce tasks on the Hadoop cluster
  • Data analysis
  • MapReduce tasks on the Hadoop cluster

20
Selection Methods
  • By known identifier (Wayback Machine)
  • web pages with the URL http://www.nsf.gov/
  • By character string (full text indexing) -- future
  • all pages containing "Internet is doubling every six months"
  • all pages containing the SARS-CoV genetic sequence
  • By metadata criteria
  • all web pages that link to microsoft.com but not
    to google.com
  • all email addresses that I used to receive mail
    from but have not had mail from recently
  • Example provided by Marc Smith

21
Benefits of Using a Relational Database
  • Simple query language for retrieving data
  • Transaction support
  • Concurrency control for parallel queries
  • Multiple indices for high performance
  • Reliability since databases have built-in
    recovery functionality

22
Metadata Loading
  • The crawler outputs compressed metadata files
    (DAT files).
  • Each DAT file has a set of crawled pages with page metadata, including crawl time, IP address, MIME type, language encoding, etc.
  • Most importantly, the outgoing links from each
    page are parsed, including the full URL and
    associated anchor text.
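
A minimal sketch of this loading step, assuming the DAT records have already been parsed into Python dictionaries. The field names, record layout, and SQLite stand-in are illustrative only, not the actual DAT format or the production relational database.

import sqlite3

# Illustrative records only; the real DAT layout and field names differ.
records = [
    {"url": "http://www.example.edu/", "crawl_time": "2006-11-01T12:00:00",
     "ip": "192.0.2.10", "mime": "text/html",
     "links": [("http://www.example.org/a", "anchor text of the link")]},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (url TEXT, crawl_time TEXT, ip TEXT, mime TEXT)")
conn.execute("CREATE TABLE link (from_url TEXT, to_url TEXT, anchor TEXT)")

for rec in records:
    conn.execute("INSERT INTO page VALUES (?, ?, ?, ?)",
                 (rec["url"], rec["crawl_time"], rec["ip"], rec["mime"]))
    # Store each outgoing link with its full URL and associated anchor text.
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)",
                     [(rec["url"], to, a) for to, a in rec["links"]])
conn.commit()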

23
Database Schema
  • Crawl: name of the crawl from which data is loaded
  • Page: metadata about each web page, plus fields to help find and extract the full HTML text
  • Link: the outgoing links from crawled pages
  • Url: lookup table for unique URLs
  • Host: lookup table for unique hostnames
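
As an illustration of the metadata criterion from the earlier selection slide ("all web pages that link to microsoft.com but not to google.com"), here is a hedged sketch against a radically simplified single-table version of this schema; the production schema normalizes URLs and hostnames through the Url and Host lookup tables.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (from_url TEXT, to_host TEXT)")
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("http://a.example/", "microsoft.com"),
    ("http://a.example/", "google.com"),
    ("http://b.example/", "microsoft.com"),
])

# Pages with at least one link to microsoft.com and none to google.com.
rows = conn.execute("""
    SELECT DISTINCT from_url FROM link
    WHERE to_host = 'microsoft.com'
      AND from_url NOT IN (SELECT from_url FROM link
                           WHERE to_host = 'google.com')
""").fetchall()
print(rows)   # [('http://b.example/',)]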

24
Crawls Loaded Into SQL DB
25
Selection from the Database
  • SQL: query the relational database directly
  • (Contact Manuel Calimlim)
  • Use the GetPages tool on the Web site to send
    an SQL query -- work in progress

26
Demonstration
  • Demonstration of the Web Lab web site
  • http://weblab.infosci.cornell.edu/
  • and the GetPages tool

27
Massive Data Analysis by Non-Specialists
  • A typical scientist or social scientist
  • Has deep domain knowledge
  • Has good algorithmic understanding
  • Is often a competent computer user, or has a research assistant who is familiar with languages such as Fortran, Python, and Matlab, or application packages such as SAS and Excel
  • But...
  • Has limited understanding of large-scale data
    analysis
  • Is not skilled at any form of computing that
    requires parallel computing or concurrency
  • Typical problem of scale: given 100 billion URLs, how do you identify duplicates?

28
Hadoop and MapReduce Programming
Hadoop: an open source distributed file system similar to the Google File System; it supports MapReduce programming. http://lucene.apache.org/hadoop/
MapReduce: a functional programming style to support large-scale data analysis without the need for global data structures.
In the 1960s, Fortran gave scientists a simple way to translate mathematical problems into efficient computer codes. MapReduce programming gives researchers a simple way to run massive data analysis on large computer clusters.
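
To make the style concrete, here is a minimal single-machine simulation of the MapReduce pattern, applied to the previous slide's problem of identifying duplicate URLs. This is a sketch of the paradigm, not Hadoop code; on Hadoop the same map and reduce functions would run in parallel across a cluster, with the framework performing the grouping.

from collections import defaultdict

def map_fn(url):
    yield (url, 1)          # the URL itself is the grouping key

def reduce_fn(url, counts):
    if sum(counts) > 1:     # all values for one key arrive together
        yield url           # a key seen more than once is a duplicate

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:                   # map phase
        for key, value in map_fn(item):
            groups[key].append(value)     # single-machine stand-in for the shuffle
    for key, values in groups.items():    # reduce phase
        yield from reduce_fn(key, values)

urls = ["http://a/", "http://b/", "http://a/"]
print(list(run_mapreduce(urls, map_fn, reduce_fn)))   # ['http://a/']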
29
The MapReduce Paradigm
[Diagram: input data split into files (split 0 ... split 4) feeds M map tasks; map output goes to intermediate files, each divided into R partitions; R reduce tasks, one per partition, produce the output files (Output 0, Output 1).]
30
A Web Graph Example
[Figure: an example web graph with six numbered nodes.]
31
Building the Web Graph
URLs, pages, and links:
  • URLs contained in web pages may link to pages never crawled
  • URLs are not canonicalized: different URLs may refer to the same page
  • Links are from a page to a URL
Web graph from crawl data:
  • Nodes are the union of pages crawled and URLs seen
  • Each node and edge has time interval(s) over which it exists
32
Web Graph Example
Problem: given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph.
  • Replace each u0 or v0 with its canonicalized form u or v.
  • Create a list of all nodes of the graph, i.e., the set of unique u.
  • Discard all (u, v) pairs where u = v, or where v is not a node of the graph.
  • Discard all duplicate edges.
  • For each node v, create a list (v, {u}), where {u} is the set of nodes that have edges to node v.
Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?
33
MapReduce Example
Map task:
Input: (u0, v0)
Output:
  • (u, d) // indicates that u is a from-URL
  • (v, u) // indicates that v is a to-URL with a link from u
Here d is a dummy marker. Do not output if u = v. This is simple application code to write.
34
A MapReduce Example
Merge: the input to the reduce process merges the output values from the map tasks that correspond to each URL. For each URL w, it creates a list (w, [d, ..., d, u1, ..., uk]). This merge is performed automatically by the system libraries.
35
A MapReduce Example
Reduce:
Input: (w, [d, ..., d, u1, ..., uk]), where w is any URL.
Output: if there is no marker d in the list, discard and do not output; this corresponds to a URL that never appears as the first element of a (u, v) pair. Otherwise, remove duplicates from u1, ..., uk and output. The output is a to-URL and the list of nodes that link to it: (v, [u1, ..., uk]). This is simple application code to write.
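
Putting the last few slides together, here is a single-machine sketch of the map/merge/reduce pipeline. The canonicalize() function is a crude placeholder (real URL canonicalization is much more involved), and the dictionary stands in for the merge that Hadoop performs automatically.

from collections import defaultdict

D = "<d>"   # dummy marker d: flags that a URL appeared as a from-URL

def canonicalize(u0):
    # Placeholder only; real URL canonicalization is much more involved.
    return u0.lower().rstrip("/")

def map_fn(pair):
    u, v = canonicalize(pair[0]), canonicalize(pair[1])
    if u != v:                 # do not output if u = v
        yield (u, D)           # u is a from-URL, hence a node of the graph
        yield (v, u)           # v is a to-URL with a link from u

def reduce_fn(w, values):
    if D not in values:        # w never appears as a from-URL: not a node
        return
    sources = sorted(set(x for x in values if x != D))   # drop duplicate edges
    if sources:
        yield (w, sources)     # the to-URL and the nodes linking to it

pairs = [("http://A/", "http://b/"), ("http://a/", "http://B/"),
         ("http://b/", "http://a/"), ("http://c/", "http://c/")]

groups = defaultdict(list)     # the merge Hadoop performs automatically
for p in pairs:
    for key, value in map_fn(p):
        groups[key].append(value)
for w, values in groups.items():
    for result in reduce_fn(w, values):
        print(result)   # ('http://a', ['http://b']) and ('http://b', ['http://a'])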
36
For the Future: Examples of Tools and Services
  • The Web Lab is steadily building a set of tools
    for researchers
  • API and Web services
  • GetPages: Web forms to select a dataset by query of a relational database, with indexes by date, URL, domain name, file type, anchor text, etc.
  • Focused Web crawling (modification of Heritrix
    crawler)
  • Extraction of the Web graph from a subset, and calculations such as PageRank (see the sketch after this list) and hubs and authorities
  • Graph visualization
  • Natural language processing of anchor text
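
As an illustration of one calculation named above, here is a textbook power-iteration PageRank. This is a sketch of the standard algorithm on a toy graph, not the Web Lab's tool; at Web scale the same iteration would be expressed as MapReduce steps.

def pagerank(graph, damping=0.85, iters=50):
    # graph: {node: [outlinked nodes]}; standard power iteration.
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Toy graph in the spirit of the six-node example earlier in the deck.
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6], 6: [1]}
print(pagerank(graph))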

37
The Web Lab is Ready for Use
  • We are ready to work with a number of
    researchers
  • Systems
  • Relational database operational
  • Hadoop pilot cluster (large cluster soon)
  • File server and web server operational
  • People
  • Manuel Calimlim (database)
  • Lucy Walle (Hadoop MapReduce)
  • Tools
  • A variety of tools in prototype
  • Experience with large volumes of anchor text and
    URLs

38
Thanks
This work would not be possible without the
forethought and long-standing commitment of
Brewster Kahle and the Internet Archive to
capture and preserve the content of the Web for
future generations. This work has been funded in
part by the National Science Foundation, grants
CNS-0403340, DUE-0127308, SES-0537606,
IIS-0634677, and IIS-0705774.