Title: Computational Social Science: Temporal Evolution of Social and Information Networks
1Computational Social Science: Temporal Evolution of Social and Information Networks
2Research Team
Faculty: William Arms, Geri Gay, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang
Cornell Theory Center: Manuel Calimlim, Dave Lifka, Ruth Mitchell, and the Petabyte Data Store team
Ph.D. Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 20 M.Eng. and undergraduate students
Internet Archive: Brewster Kahle, Tracey Jacquith, John Berry
3Computational Social Science
- Computational ideas and methods for use in the social sciences
  - Beyond tabulation and statistics
  - Semi-structured data, implicit structure
  - Multiple sources, millions of independent agents
  - Scale that defies manual techniques
- Recent advances in and new challenges for CS
  - Data mining, machine learning, natural language
4Illustrative Questions
- Emergence and diffusion of norms
  - In online communities, on Web sites
  - Enable better study of early adoption
- Network dynamics of polarization
  - Negative as well as positive opinion
- Diffusion across organizations
  - Studied from public statements, news items and discussion
  - Current data sets such as corporate board overlap provide limited view
5Comp Soc Sci at Cornell
- Institute for Social Sciences (ISS) selected as the theme project for 2005-08
  - Getting Connected: Social Science in the Age of Networks
- Large NSF Cybertools project
  - Co-PIs from Sociology, Computer Science, Information Science, Communication
- Web Lab
  - Computational resources for research access to Internet Archive data
6Research Goals
- Bring together research communities separately studying related problems
  - Both at Cornell and nationally
- Focus on problems where computation and new data sources can be fundamental
- Help transform the social sciences in a manner analogous to supercomputing for the physical sciences and engineering
  - Computational models, simulation and large-scale datasets
7Research Goals
- Potential paradigm-shifting methodology using data-intensive computing resources
  - A move from data collection to information discovery
- Motivating examples for many researchers in data mining and machine learning
  - Often these problems have a long history in the social sciences, but at small scale
  - Investigation of widely studied topic areas at Web scale
8Research Issues
- Diffusion of innovation
  - Study both adoption and abandonment using snapshots of Web over time
  - Questions that have proven difficult to answer with small hand-coded datasets
  - E.g., effects of early adopters, network structures, polarization within and between topics
- Dynamic networks
  - Behavior and characteristics of individuals relative to their network neighbors
  - Studies of interaction data from online communities
9Social and Information Networks
- Studying social networks intertwined with online information networks
- A major area of research at Cornell
  - In sociology, communication, economics, information science and computer science
- People mediated by information artifacts and vice versa
  - Not just social connections or just links between documents
10Sources of Data
- Model systems
  - Cornell e-print arXiv (scientific topics, coauthorship/collaboration, trends over time)
  - Usenet, w/ Marc Smith at Microsoft (conversational structure, topic dynamics)
- Medium-scale systems
  - On-line communities (LiveJournal, MySpace)
- Web-scale
  - Internet Archive data, showing evolution of Web 1996-2006.
11Good Model Systems
- Address the longstanding problem that a dataset typically meets only two of three desirable criteria
  - Large-scale, realistic and completely mapped
- arXiv has become a valuable model system
  - Subject of 2003 KDD Cup competition
  - Data has been used in many subsequent database and machine learning papers
- Allows researchers to pose questions about the behavior of a professional community
  - Collaborations, events, trends, diffusion of ideas
12Some Research on the arXiv
- Link prediction [Liben-Nowell, Kleinberg 2003]
  - Given the collaboration network up to a cut-off date, can new collaborations be predicted?
- Text classification to detect emerging areas [Ginsparg, Houle, Joachims, Sul 2004]
  - Led to the creation of a new quantitative biology area in the arXiv
- Network evolution [Leskovec, Kleinberg, Faloutsos 2005]
  - How does the structure of the arXiv citation and collaboration networks change over time?
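The link prediction question above can be illustrated with a minimal sketch. It assumes a common-neighbors score, which is only one of several possible predictors and is not claimed to be the method used in the cited work. The paper list, author names and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(papers, cutoff_year):
    """Coauthorship graph built only from papers dated on or before cutoff_year."""
    neighbors = defaultdict(set)
    for year, authors in papers:
        if year <= cutoff_year:
            for a, b in combinations(sorted(set(authors)), 2):
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors

def predict_links(neighbors, top_k=3):
    """Rank author pairs that have not yet collaborated by number of shared coauthors."""
    scores = []
    for a, b in combinations(sorted(neighbors), 2):
        if b not in neighbors[a]:
            common = len(neighbors[a] & neighbors[b])
            if common > 0:
                scores.append((common, a, b))
    # Highest score first; ties broken alphabetically for a stable order
    scores.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(a, b, c) for c, a, b in scores[:top_k]]

# Toy data (hypothetical): (year, authors)
papers = [
    (2001, ["ann", "bob"]),
    (2001, ["bob", "cat"]),
    (2002, ["ann", "dan"]),
    (2002, ["dan", "cat"]),
    (2003, ["ann", "cat"]),  # after the cut-off: a collaboration to be predicted
    (2003, ["bob", "eve"]),
]
train = build_graph(papers, cutoff_year=2002)
predictions = predict_links(train)
print(predictions)
```

In this toy example the pair (ann, cat) shares two coauthors before the cut-off and does collaborate afterwards, so a prediction can be checked against the later data.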
13Network Evolution
- We are acquiring a large vocabulary of patterns for static networks
  - small-world, scale-free, preferential attachment, PageRank, hubs and authorities, bow-ties, bipartite cores, network motifs, ...
- We know very little about the characteristic ways in which networks grow over time
  - What are the analogous collections of patterns?
14Evolution of the arXiv Networks
- arXiv citation network over time [LKF05]
  - The number of edges grows superlinearly in the number of nodes: $e \propto n^{1.69}$
  - Have similarly observed densification laws for many other networks
- Average distance between nodes decreases over time
  - Challenges theoretical models in which diameter is a slowly growing function of number of nodes
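A densification exponent such as the 1.69 quoted above can be estimated as the slope of a straight-line fit of log(edges) against log(nodes) over successive snapshots. The sketch below assumes an ordinary least-squares fit and uses synthetic snapshots, not arXiv data. The function name is hypothetical.

```python
import math

def densification_exponent(snapshots):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n, e in snapshots]
    ys = [math.log(e) for n, e in snapshots]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic (nodes, edges) snapshots generated exactly as edges = nodes ** 1.69
snapshots = [(n, n ** 1.69) for n in (1000, 2000, 5000, 10000, 20000)]
a = densification_exponent(snapshots)
print(round(a, 2))
```

An exponent of 1 would mean the average degree stays constant as the network grows. Any value above 1 means the network becomes denser over time.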
15Medium-Scale Online Social Networking Systems
- A fundamental question in the diffusion of innovation
  - What is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters?
  - New behavior could be adopting a new technology, joining a social movement, believing a rumor, ...
- Large online social networks can have 100,000 explicitly defined user communities.
  - New behavior: choosing to join a community.
16Joining a Community
- Unprecedented scale for such a curve
  - Close to one billion (user, community) instances
- Most standard models predict S-shaped probability curves
- Further machine learning to predict joining
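The curve discussed here is the empirical probability of joining as a function of k, the number of friends already in the community. A minimal sketch of how such a curve could be tabulated from (user, community) instances follows. The record layout and the toy observations are assumptions for illustration only.

```python
from collections import Counter

def adoption_curve(observations):
    """Map k -> fraction of instances with k member-friends in which the user joined."""
    exposed = Counter()
    joined = Counter()
    for k, did_join in observations:
        exposed[k] += 1
        if did_join:
            joined[k] += 1
    return {k: joined[k] / exposed[k] for k in sorted(exposed)}

# Hypothetical instances: (friends already in the community, joined later?)
observations = [
    (0, False), (0, False), (0, False), (0, True),
    (1, False), (1, True),
    (2, True), (2, True), (2, False), (2, True),
]
curve = adoption_curve(observations)
print(curve)
```

Plotting the resulting values against k shows whether the curve is S-shaped, as most standard models predict, or has some other form.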
17The Internet Archive Web Collection
- The Data
  - Complete crawls of the Web, every two months since 1996, with some gaps
  - Range of formats and depth of crawl have increased with time.
  - No data from sites that are protected by robots.txt or where owners have requested not to be archived
  - Some missing or lost data
- Metadata contains format, links, anchor text
- Organized to facilitate historical access to known URL (Wayback Machine)
18Archive of www.microsoft.com
19www.microsoft.com in 1996
20The Internet Archive Web Collection
Sizes
- Current crawls are about 40-60 TByte (compressed)
- Total archive is about 600 TByte (compressed)
- Compression ratio: up to 25:1; best estimate of overall average is 10:1
- Rate of increase is about 1 TByte/day (compressed)
Total storage requirement at Cornell will differ because of:
- Elimination of data that is duplicated between crawls
- Expansion of metadata for research
- Database indexes
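Eliminating data duplicated between crawls can be sketched with a content-addressed store: each distinct page body is kept once, keyed by a hash of its bytes, and every capture records only the hash. This is a minimal illustration under assumptions. The choice of SHA-256, the record layout and the example URLs are hypothetical and do not describe the actual Web Lab design.

```python
import hashlib

def store_unique(pages):
    """Keep one copy of each distinct page body, keyed by its content hash."""
    store = {}   # content hash -> page body (stored once)
    index = []   # (crawl, url, content hash) for every capture
    for crawl, url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        store.setdefault(digest, body)
        index.append((crawl, url, digest))
    return store, index

# Hypothetical captures of one URL across three crawls
pages = [
    (1, "http://example.org/", b"<html>hello</html>"),
    (2, "http://example.org/", b"<html>hello</html>"),        # unchanged between crawls
    (3, "http://example.org/", b"<html>hello again</html>"),
]
store, index = store_unique(pages)
print(len(index), "captures,", len(store), "stored bodies")
```

Only byte-identical pages are merged in this sketch. Pages that differ in a timestamp or counter would still be stored separately.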
21Web Library/Laboratory
- The Cornell petabyte data store will allow us to mount several large portions of the Web online for a broad range of Web research.
  - Copy snapshots of the Web from the Internet Archive
  - Store online at Cornell
  - Extract feature sets for researchers
  - Provide APIs for researchers (program interface, download of datasets, Web Services API)
  - Provide Web GUI for social science researchers
22Scale of Data Processing
Balance of Resources

                  Ideal               Realistic
Networking        500 Mbit/sec        100 Mbit/sec
Data online       all                 few crawls/year
Metadata online   all                 all?
Disk              750 TB              240 TB
Tape archive      all                 few crawls/year
Computers         research separate   shared with storage
23Equipment
- Scidata1 -- Initial Configuration
  - 16-Processor Unisys ES7000 Servers
  - 32 GByte RAM
  - 8 GByte/sec aggregate I/O bandwidth
  - 100 TByte RAID Online Storage
  - ADIC Scalar 10K robotic tape library for archive
  - Separate Web server
- Near-term Expansion
  - Disk capacity will expand to 240 TByte by end of 2007
- Network
  - Internet2 with dedicated 100 Mbit/sec link to Internet Archive
24Data Processing Overview
25User Services
26Data Processing
- Transfer 300-500 GByte per day
  - Internet2 -- 100 Mbit/sec maximum throughput
- Archive raw data to tape
- Process raw data
  - Uncompress and unpack Web pages (ARC) and metadata (DAT) files
  - Create IDs for pages and content hashes
- Database load
  - Load batches of metadata about pages and links (MS SQL Server 2000)
- Store compressed page files
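The transfer figures on this slide can be checked with simple arithmetic. The sketch below reads "100 mbs" as 100 Mbit/sec and assumes decimal units (1 GByte = 10^9 bytes, 1 TByte = 1000 GByte). The 600 TByte total is the compressed archive size quoted earlier in the deck. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def gbyte_per_day(link_mbit_per_sec):
    """Upper bound on daily transfer over a link, in decimal GByte."""
    bytes_per_sec = link_mbit_per_sec * 1_000_000 / 8
    return bytes_per_sec * SECONDS_PER_DAY / 1e9

def days_to_transfer(total_tbyte, rate_gbyte_per_day):
    """Days needed to move total_tbyte at a sustained daily rate."""
    return total_tbyte * 1000 / rate_gbyte_per_day

ceiling = gbyte_per_day(100)            # theoretical maximum of the link
days_fast = days_to_transfer(600, 500)  # whole archive at 500 GByte/day
days_slow = days_to_transfer(600, 300)  # whole archive at 300 GByte/day
print(ceiling, days_fast, days_slow)
```

Under these assumptions the link tops out at 1080 GByte/day, so 300-500 GByte/day is roughly 28-46% of the theoretical maximum. Copying the entire 600 TByte archive at that rate would take 1200 to 2000 days.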
27Metadata
- URLs, pages and links
  - URLs contained in metadata may link to pages never crawled
  - URLs not canonicalized: different URLs may refer to same page
  - Links are from a page to a URL
- Web graph
  - Nodes are union of pages crawled and URLs seen
  - Each node and edge have time interval(s) over which they exist
- Content
  - Anchor text in more recent crawls
  - File and MIME types
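A Web graph whose edges carry the time intervals over which they exist might be represented as sketched below. This is an illustration under assumptions: the class and method names are hypothetical, crawls are identified by consecutive integers, observations are assumed to arrive in crawl order, and only edge intervals are shown (node intervals could be handled the same way). It does not describe the actual database schema.

```python
from collections import defaultdict

class TemporalWebGraph:
    """Directed graph whose edges carry the crawl intervals in which they were seen."""

    def __init__(self):
        # (source page, target URL) -> list of [first_crawl, last_crawl] intervals
        self.edge_intervals = defaultdict(list)

    def observe(self, source, target, crawl):
        """Record that the link source -> target was present in the given crawl."""
        intervals = self.edge_intervals[(source, target)]
        if intervals and intervals[-1][1] == crawl - 1:
            intervals[-1][1] = crawl          # seen in consecutive crawls: extend
        elif not intervals or intervals[-1][1] < crawl:
            intervals.append([crawl, crawl])  # first sighting or reappearance

    def snapshot(self, crawl):
        """All edges that existed at the given crawl."""
        return sorted(edge for edge, ivs in self.edge_intervals.items()
                      if any(lo <= crawl <= hi for lo, hi in ivs))

g = TemporalWebGraph()
for crawl in (1, 2, 3):
    g.observe("a.example/", "b.example/", crawl)
g.observe("a.example/", "c.example/", 1)
g.observe("a.example/", "c.example/", 4)  # link absent in crawls 2-3, then back
print(g.snapshot(2))
print(g.snapshot(4))
```

Storing intervals rather than one edge list per crawl keeps a link that persists across many crawls as a single record, while still allowing the graph at any crawl to be reconstructed.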
28Current Status
- Data Capture
  - Connection of Internet Archive to Internet2 (October 2005)
  - Parallel loading of crawls of DAT and ARC files (January 2006)
- Storage
  - Relational database and preload system: under test
  - Page store: preliminary design work
- User Services
  - Basic access DLL and download of data to users: under test
30Thanks
This work would not be possible without the
forethought and longstanding commitment of the
Internet Archive to capture and preserve the
content of the Web for future generations. This
work has been funded in part by the National
Science Foundation, grants CNS-0403340,
DUE-0127308, and SES-0537606, with equipment
support from Unisys and by an E-Science grant and
a gift from Microsoft Corporation.