Computational Social Science: Temporal Evolution of Social and Information Networks


Provided by: DanHutte5

1
Computational Social Science: Temporal Evolution
of Social and Information Networks

Jon Kleinberg

2
Research Team
Faculty: William Arms, Geri Gay, Dan Huttenlocher,
Jon Kleinberg, Michael Macy, David Strang
Cornell Theory Center: Manuel Calimlim, Dave Lifka,
Ruth Mitchell, and the Petabyte Data Store team
Ph.D. Students: Selcuk Aya, Pavel Dmitriev, Blazej
Kot, with more than 20 M.Eng. and undergraduate
students
Internet Archive: Brewster Kahle, Tracey Jacquith,
John Berry
3
Computational Social Science

Computational ideas and methods for use in the
social sciences
Beyond tabulation and statistics
Semi-structured data, implicit structure
Multiple sources, millions of independent agents
Scale that defies manual techniques
Recent advances in and new challenges for CS
Data mining, machine learning, natural language

4
Illustrative Questions

Emergence and diffusion of norms
In online communities, on Web sites
Enable better study of early adoption
Network dynamics of polarization
Negative as well as positive opinion
Diffusion across organizations
Studied from public statements, news items and
discussion
Current data sets such as corporate board overlap
provide limited view

5
Comp Soc Sci at Cornell

Institute for Social Sciences (ISS) selected as
the theme project for 2005-08
Getting Connected: Social Science in the Age of
Networks
Large NSF Cybertools project
Co-PIs from Sociology, Computer Science,
Information Science, Communication
Web Lab
Computational resources for research access to
Internet Archive data

6
Research Goals

Bring together research communities separately
studying related problems
Both at Cornell and nationally
Focus on problems where computation and new data
sources can be fundamental
Help transform social sciences in manner
analogous to supercomputing for physical sciences
and engineering
Computational models, simulation and large-scale
datasets

7
Research Goals

Potential paradigm-shifting methodology using
data-intensive computing resources
A move from data collection to information
discovery
Motivating examples for many researchers in data
mining and machine learning
Often these problems have long history in social
sciences but at small scale
Investigation of widely studied topic areas at a
Web scale

8
Research Issues

Diffusion of innovation
Study both adoption and abandonment using
snapshots of Web over time
Questions that have proven difficult to answer
with small hand coded datasets
E.g., effects of early adopters, network
structures, polarization within and between
topics
Dynamic networks
Behavior and characteristics of individuals
relative to their network neighbors
Studies of interaction data from online
communities

9
Social and Information Networks

Studying social networks intertwined with online
information networks
A major area of research at Cornell
In sociology, communication, economics,
information science and computer science
People mediated by information artifacts and vice
versa
Not just social connections or just links between
documents

10
Sources of Data

Model systems
Cornell e-print arXiv (scientific topics,
coauthorship/collaboration, trends over time)
Usenet, w/ Marc Smith at Microsoft
(conversational structure, topic dynamics)
Medium-scale systems
On-line communities (LiveJournal, MySpace)
Web-scale
Internet Archive data, showing evolution of Web
1996-2006.

11
Good Model Systems

Address longstanding problem of satisfying only
two of three desirable criteria at once
Large-scale, realistic and completely mapped
arXiv has become valuable model system
Subject of 2003 KDD cup competition
Data has been used in many subsequent database
and machine learning papers
Allows researchers to pose questions about the
behavior of a professional community
Collaborations, events, trends, diffusion of
ideas

12
Some Research on the arXiv

Link prediction [Liben-Nowell, Kleinberg 2003]
Given the collaboration network up to a cut-off
date, can new collaborations be predicted?
Text classification to detect emerging areas
[Ginsparg, Houle, Joachims, Sul 2004]
Led to the creation of a new quantitative
biology area in the arXiv
Network evolution [Leskovec, Kleinberg,
Faloutsos 2005]
How does the structure of the arXiv citation and
collaboration networks change over time?
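
The link prediction question above can be illustrated with a minimal sketch. It assumes the simplest kind of predictor, counting shared coauthors ("common neighbors"), and is not claimed to be the method or the data used in the cited work. The paper records, author names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(papers, cutoff_year):
    """Coauthorship graph from (year, [authors]) records with year <= cutoff_year."""
    neighbors = defaultdict(set)
    for year, authors in papers:
        if year <= cutoff_year:
            for a, b in combinations(sorted(set(authors)), 2):
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors

def predict_links(neighbors, top_k):
    """Rank author pairs that have not yet collaborated by number of shared coauthors."""
    scores = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b not in neighbors[a]:
            common = len(neighbors[a] & neighbors[b])
            if common > 0:
                scores[(a, b)] = common
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:top_k]]

# Hypothetical toy data: the 2003 paper falls after the cut-off date.
papers = [
    (2000, ["ann", "bob"]),
    (2001, ["bob", "cat"]),
    (2001, ["ann", "dan"]),
    (2002, ["dan", "cat"]),
    (2003, ["ann", "cat"]),
]
train = build_graph(papers, cutoff_year=2002)
predicted = predict_links(train, top_k=1)
print(predicted)
```

Predictions would then be scored against the collaborations that actually appear after the cut-off date; here the held-out 2003 paper is the pair to recover.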

13
Network Evolution

We are acquiring a large vocabulary of patterns
for static networks
small-world, scale-free, preferential attachment,
PageRank, hubs and authorities, bow-ties,
bipartite cores, network motifs, ...
We know very little about the characteristic ways
in which networks grow over time
What are the analogous collections of patterns?

14
Evolution of the arXiv Networks

arXiv citation network over time [LKF05]
Number of edges grows superlinearly in the number
of nodes: e ∝ n^1.69
Have similarly observed densification laws for
many other networks
Avg distance between nodes decreases over time
Challenges theoretical models in which diameter
is a slowly growing function of number of nodes
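
A densification exponent such as the 1.69 quoted above is the slope of log(edges) against log(nodes) across snapshots. The sketch below fits that slope by ordinary least squares on synthetic data generated to follow the power law exactly; the snapshot sizes and the function name are assumptions for illustration, not the arXiv measurements.

```python
import math

def fit_power_law_exponent(nodes, edges):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic snapshots (hypothetical): edge counts follow e = n^1.69 exactly.
nodes = [1000, 2000, 4000, 8000, 16000]
edges = [n ** 1.69 for n in nodes]
exponent = fit_power_law_exponent(nodes, edges)
print(round(exponent, 2))
```

An exponent of 1 would mean constant average degree; any value above 1 means the average degree rises as the network grows.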

15
Medium-Scale Online Social Networking Systems

A fundamental question in the diffusion of
innovation
What is the probability an individual will adopt
a new behavior, as a function of the number of
his/her friends who are adopters?
New behavior could be adopting a new technology,
joining a social movement, believing a rumor, ...
Large online social networks can have 100,000
explicitly defined user communities.
New behavior: choosing to join a community.
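
One way to estimate this probability from data is to compare two snapshots of a community. The sketch below is a simplified illustration under that assumption: for each user who was not a member at the first snapshot, it counts how many friends were already members, then records whether the user had joined by the second snapshot. The friendship lists, snapshot sets, and function name are hypothetical.

```python
from collections import defaultdict

def adoption_curve(friends, members_t1, members_t2):
    """P(member at t2 | k friends were members at t1), over users not yet members at t1."""
    exposed = defaultdict(int)
    joined = defaultdict(int)
    for user, user_friends in friends.items():
        if user in members_t1:
            continue  # already an adopter, so not at risk of adopting
        k = sum(1 for f in user_friends if f in members_t1)
        exposed[k] += 1
        if user in members_t2:
            joined[k] += 1
    return {k: joined[k] / exposed[k] for k in sorted(exposed)}

# Hypothetical friendship lists and membership snapshots for one community.
friends = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d", "e"},
    "d": {"b", "c", "e"},
    "e": {"c", "d", "f"},
    "f": {"e"},
}
members_t1 = {"a", "b"}
members_t2 = {"a", "b", "c", "d"}
curve = adoption_curve(friends, members_t1, members_t2)
print(curve)
```

Pooling these counts over every community in the network gives one (user, community) instance per pair, which is how a single curve can rest on a very large number of observations.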

16
Joining a Community

Unprecedented scale for such a curve
Close to one billion (user,community) instances
Most standard models predict S-shaped probability
curves
Further machine learning to predict joining

17
The Internet Archive Web Collection

The Data
Complete crawls of the Web, every two months
since 1996, with some gaps
Range of formats and depth of crawl have
increased with time.
No data from sites that are protected by
robots.txt or where owners have requested not to
be archived
Some missing or lost data
Metadata contains format, links, anchor text
Organized to facilitate historical access to
known URL (wayback machine)

18
Archive of www.microsoft.com
19
www.microsoft.com in 1996
20
The Internet Archive Web Collection
Sizes
Current crawls are about 40-60 TByte (compressed)
Total archive is about 600 TByte (compressed)
Compression ratio up to 25:1; best estimate of
overall average is 10:1
Rate of increase is about 1 TByte/day (compressed)
Total storage requirement at Cornell will differ
because of
Elimination of data that is duplicated between
crawls
Expansion of metadata for research
Database indexes
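
Eliminating data duplicated between crawls can be sketched with content hashing: each distinct page body is stored once, and every crawl keeps only a URL-to-hash index. This is a minimal illustration, not the Cornell design; SHA-256, the in-memory dictionary, and the example URLs are assumptions.

```python
import hashlib

def store_crawl(store, pages):
    """Add one crawl's pages to the store, keeping a single copy per distinct content.

    Returns this crawl's index mapping each URL to its content hash.
    """
    index = {}
    for url, content in pages.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # no-op if this content is already stored
        index[url] = digest
    return index

# Hypothetical pages: the front page changed between crawls, the about page did not.
crawl_a = {
    "http://example.org/": b"<html>front page v1</html>",
    "http://example.org/about": b"<html>about</html>",
}
crawl_b = {
    "http://example.org/": b"<html>front page v2</html>",
    "http://example.org/about": b"<html>about</html>",
}
store = {}
index_a = store_crawl(store, crawl_a)
index_b = store_crawl(store, crawl_b)
print(len(store), "distinct page bodies stored for 4 crawled pages")
```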
21
Web Library/Laboratory

The Cornell petabyte data store will allow us to
mount several large portions of the Web online
for a broad range of Web research.
Copy snapshots of the Web from the Internet
Archive
Store online at Cornell
Extract feature sets for researchers
Provide APIs for researchers (program
interface, download of datasets, Web Services
API)
Provide Web GUI for social science researchers

22
Scale of Data Processing
Balance of Resources

                  Ideal           Realistic
Networking        500 Mbit/sec    100 Mbit/sec
Data online       all             few crawls/year
Metadata online   all             all?
Disk              750 TB          240 TB
Tape archive      all             few crawls/year
Computers research shared separate with storage
23
Equipment

Scidata1 -- Initial Configuration
16-Processor Unisys ES7000 Servers
32 GByte RAM
8 GByte/sec aggregate I/O bandwidth
100 TByte RAID Online Storage
ADIC Scalar 10K robotic tape library for archive
Separate Web server
Near-term Expansion
Disk capacity will expand to 240 TByte by end
of 2007
Network
Internet2 with dedicated 100 Mbit/sec link to
Internet Archive

24
Data Processing Overview
25
User Services
26
Data Processing
Transfer 300-500 GByte per day
Internet2 -- 100 Mbit/sec maximum throughput
Archive raw data to tape
Process raw data
Uncompress and unpack Web pages (ARC) and
metadata (DAT) files
Create IDs for pages and content hashes
Database load
Load batches of metadata about pages and links
(MS SQL Server 2000)
Store compressed page files
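
The transfer figures can be checked with a short calculation. The sketch below assumes decimal units (1 GByte = 10^9 bytes) and a fully saturated link for the ceiling; the 600 TByte total is the compressed archive size quoted on the earlier slide, and the function names are illustrative.

```python
BITS_PER_BYTE = 8
SECONDS_PER_DAY = 24 * 60 * 60

def max_gbytes_per_day(link_mbit_per_sec):
    """Upper bound on daily transfer for a fully saturated link (decimal units)."""
    bytes_per_day = link_mbit_per_sec * 1_000_000 / BITS_PER_BYTE * SECONDS_PER_DAY
    return bytes_per_day / 1e9

def days_to_transfer(total_tbytes, gbytes_per_day):
    """Days needed to move total_tbytes at a sustained daily rate."""
    return total_tbytes * 1000 / gbytes_per_day

ceiling = max_gbytes_per_day(100)       # 1080.0 GByte/day
slowest = days_to_transfer(600, 300)    # 2000.0 days
fastest = days_to_transfer(600, 500)    # 1200.0 days
print(ceiling, slowest, fastest)
```

Under these assumptions the quoted 300-500 GByte per day is roughly a third to a half of the theoretical ceiling, and copying the full compressed archive at that rate would take several years.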
27
Metadata

URLs, pages and links
URLs contained in metadata may link to pages
never crawled
URLs are not canonicalized: different URLs may
refer to the same page
Links are from a page to a URL
Web graph
Nodes are union of pages crawled and URLs seen
Each node and edge has time interval(s) over
which it exists
Content
Anchor text in more recent crawls
File and mime types
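
The Web graph described above can be sketched as a small data structure. This is an illustrative in-memory version only (the slides describe a relational database); the class name, the crawl-date strings, and the rule that an interval is a maximal run of consecutive crawls containing the edge are all assumptions.

```python
from collections import defaultdict

class TemporalWebGraph:
    """Nodes are the union of pages crawled and URLs seen; edges carry crawl dates."""

    def __init__(self):
        self.crawled = set()                  # pages actually fetched
        self.nodes = set()                    # crawled pages plus link targets
        self.edge_crawls = defaultdict(set)   # (page, url) -> crawl dates observed

    def add_crawl(self, crawl_date, links_by_page):
        for page, urls in links_by_page.items():
            self.crawled.add(page)
            self.nodes.add(page)
            for url in urls:
                self.nodes.add(url)           # target may never have been crawled
                self.edge_crawls[(page, url)].add(crawl_date)

    def intervals(self, edge, all_crawl_dates):
        """Maximal runs of consecutive crawls in which the edge was observed."""
        seen = self.edge_crawls.get(edge, set())
        runs, start, prev = [], None, None
        for date in sorted(all_crawl_dates):
            if date in seen:
                if start is None:
                    start = date
                prev = date
            elif start is not None:
                runs.append((start, prev))
                start = None
        if start is not None:
            runs.append((start, prev))
        return runs

# Hypothetical crawls: the link from page A to URL B disappears in one crawl.
crawl_dates = ["1996-10", "1996-12", "1997-02", "1997-04"]
graph = TemporalWebGraph()
graph.add_crawl("1996-10", {"A": ["B"]})
graph.add_crawl("1996-12", {"A": ["B"]})
graph.add_crawl("1997-02", {"A": []})
graph.add_crawl("1997-04", {"A": ["B"]})
edge_intervals = graph.intervals(("A", "B"), crawl_dates)
print(edge_intervals)
```

Because URLs are not canonicalized, two node keys in such a structure may still refer to the same underlying page; the sketch does not attempt to merge them.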

28
Current Status
Data Capture
Connection of Internet Archive to Internet2
(October 2005)
Parallel loading of crawls of DAT and ARC files
(January 2006)
Storage
Relational database and preload system under test
Page store: preliminary design work
User Services
Basic access DLL and download of data to users
under test
29
Research Team
Faculty: William Arms, Geri Gay, Dan Huttenlocher,
Jon Kleinberg, Michael Macy, David Strang
Cornell Theory Center: Manuel Calimlim, Dave Lifka,
Ruth Mitchell, and the Petabyte Data Store team
Ph.D. Students: Selcuk Aya, Pavel Dmitriev, Blazej
Kot, with more than 20 M.Eng. and undergraduate
students
Internet Archive: Brewster Kahle, Tracey Jacquith,
John Berry
30
Thanks
This work would not be possible without the
forethought and longstanding commitment of the
Internet Archive to capture and preserve the
content of the Web for future generations. This
work has been funded in part by the National
Science Foundation, grants CNS-0403340,
DUE-0127308, and SES-0537606, with equipment
support from Unisys and by an E-Science grant and
a gift from Microsoft Corporation.
