Computational Social Science: Temporal Evolution of Social and Information Networks


1
Computational Social Science: Temporal Evolution
of Social and Information Networks
  • Jon Kleinberg

2
Research Team
  • Faculty: William Arms, Geri Gay, Dan Huttenlocher, Jon Kleinberg,
    Michael Macy, David Strang
  • Cornell Theory Center: Manuel Calimlim, Dave Lifka, Ruth Mitchell,
    and the Petabyte Data Store team
  • Ph.D. Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more
    than 20 M.Eng. and undergraduate students
  • Internet Archive: Brewster Kahle, Tracey Jacquith, John Berry

3
Computational Social Science
  • Computational ideas and methods for use in the
    social sciences
  • Beyond tabulation and statistics
  • Semi-structured data, implicit structure
  • Multiple sources, millions of independent agents
  • Scale that defies manual techniques
  • Recent advances in and new challenges for CS
  • Data mining, machine learning, natural language processing

4
Illustrative Questions
  • Emergence and diffusion of norms
  • In online communities, on Web sites
  • Enable better study of early adoption
  • Network dynamics of polarization
  • Negative as well as positive opinion
  • Diffusion across organizations
  • Studied from public statements, news items and
    discussion
  • Current data sets such as corporate board overlap
    provide limited view

5
Comp Soc Sci at Cornell
  • Selected by the Institute for Social Sciences (ISS) as
    the theme project for 2005-08
  • Getting Connected: Social Science in the Age of
    Networks
  • Large NSF Cybertools project
  • Co-PIs from Sociology, Computer Science,
    Information Science, Communication
  • Web Lab
  • Computational resources for research access to
    Internet Archive data

6
Research Goals
  • Bring together research communities separately
    studying related problems
  • Both at Cornell and nationally
  • Focus on problems where computation and new data
    sources can be fundamental
  • Help transform the social sciences in a manner
    analogous to supercomputing for the physical sciences
    and engineering
  • Computational models, simulation and large-scale
    datasets

7
Research Goals
  • Potential paradigm-shifting methodology using
    data-intensive computing resources
  • A move from data collection to information
    discovery
  • Motivating examples for many researchers in data
    mining and machine learning
  • Often these problems have a long history in the social
    sciences, but at small scale
  • Investigation of widely studied topic areas at a
    Web scale

8
Research Issues
  • Diffusion of innovation
  • Study both adoption and abandonment using
    snapshots of Web over time
  • Questions that have proven difficult to answer
    with small hand-coded datasets
  • E.g., effects of early adopters, network
    structures, polarization within and between
    topics
  • Dynamic networks
  • Behavior and characteristics of individuals
    relative to their network neighbors
  • Studies of interaction data from online
    communities

9
Social and Information Networks
  • Studying social networks intertwined with online
    information networks
  • A major area of research at Cornell
  • In sociology, communication, economics,
    information science and computer science
  • People mediated by information artifacts and vice
    versa
  • Not just social connections or just links between
    documents

10
Sources of Data
  • Model systems
  • Cornell e-print arXiv (scientific topics,
    coauthorship/collaboration, trends over time)
  • Usenet, w/ Marc Smith at Microsoft
    (conversational structure, topic dynamics)
  • Medium-scale systems
  • On-line communities (LiveJournal, MySpace)
  • Web-scale
  • Internet Archive data, showing evolution of Web
    1996-2006.

11
Good Model Systems
  • Address the longstanding problem of datasets meeting
    only two of three desirable criteria
  • Large-scale, realistic and completely mapped
  • arXiv has become a valuable model system
  • Subject of 2003 KDD cup competition
  • Data has been used in many subsequent database
    and machine learning papers
  • Allows researchers to pose questions about the
    behavior of a professional community
  • Collaborations, events, trends, diffusion of
    ideas

12
Some Research on the arXiv
  • Link prediction [Liben-Nowell & Kleinberg 2003]
  • Given the collaboration network up to a cut-off
    date, can new collaborations be predicted?
  • Text classification to detect emerging areas
    [Ginsparg, Houle, Joachims & Sul 2004]
  • Led to the creation of a new quantitative
    biology area in the arXiv
  • Network evolution [Leskovec, Kleinberg &
    Faloutsos 2005]
  • How does the structure of the arXiv citation and
    collaboration networks change over time?
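
A minimal sketch of the link-prediction question above, assuming a simple common-neighbors score: pairs of authors who have not yet collaborated are ranked by how many collaborators they share before the cut-off date. This is one baseline heuristic chosen for illustration, not necessarily the method used in the cited work; the author labels and edges are hypothetical toy data.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (author_a, author_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def common_neighbor_scores(adj):
    """Score every non-adjacent pair by its number of shared collaborators."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already collaborators; nothing to predict
        shared = len(adj[u] & adj[v])
        if shared > 0:
            scores[(u, v)] = shared
    return scores


# Hypothetical coauthorships observed before the cut-off date.
training_edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("B", "D"), ("C", "D"), ("D", "E"),
]
adj = build_graph(training_edges)

# Highest score first; ties broken alphabetically for a stable ranking.
ranked = sorted(common_neighbor_scores(adj).items(),
                key=lambda kv: (-kv[1], kv[0]))
for pair, score in ranked:
    print(pair, score)
```

The predictions would then be compared against collaborations that actually appear after the cut-off date.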

13
Network Evolution
  • We are acquiring a large vocabulary of patterns
    for static networks
  • small-world, scale-free, preferential attachment,
    PageRank, hubs and authorities, bow-ties,
    bipartite cores, network motifs, ...
  • We know very little about the characteristic ways
    in which networks grow over time
  • What are the analogous collections of patterns?

14
Evolution of the arXiv Networks
  • arXiv citation network over time [LKF05]
  • Number of edges grows superlinearly in number of nodes: e ∝ n^1.69
  • Have similarly observed densification laws for
    many other networks
  • Avg distance between nodes decreases over time
  • Challenges theoretical models in which diameter
    is a slowly growing function of number of nodes
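
A densification exponent such as the 1.69 above can be estimated as the slope of a least-squares line fitted to log(edges) against log(nodes) across snapshots. The sketch below assumes that fitting approach; the snapshot values are synthetic, generated to follow the power law exactly, and are not arXiv measurements.

```python
import math


def densification_exponent(snapshots):
    """Least-squares slope of log(edges) against log(nodes).

    snapshots: iterable of (node_count, edge_count) pairs over time.
    """
    xs = [math.log(n) for n, _ in snapshots]
    ys = [math.log(e) for _, e in snapshots]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic snapshots that follow e = n^1.69 exactly (illustrative only).
snapshots = [(n, n ** 1.69) for n in (1_000, 5_000, 10_000, 20_000)]
alpha = densification_exponent(snapshots)
print(round(alpha, 2))
```

An exponent of 1 would mean constant average degree; values above 1 mean the network becomes denser as it grows.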

15
Medium-Scale Online Social Networking Systems
  • A fundamental question in the diffusion of
    innovation
  • What is the probability an individual will adopt
    a new behavior, as a function of the number of
    his/her friends who are adopters?
  • New behavior could be adopting a new technology,
    joining a social movement, believing a rumor, ...
  • Large online social networks can have 100,000
    explicitly defined user communities.
  • New behavior: choosing to join a community.

16
Joining a Community
  • Unprecedented scale for such a curve
  • Close to one billion (user,community) instances
  • Most standard models predict S-shaped probability
    curves
  • Further machine learning to predict joining
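
The joining curve can be tabulated directly from (user, community) instances: for each count k of friends already in the community, take the fraction of instances in which the user went on to join. A minimal sketch, assuming each instance has been reduced to a hypothetical `(k, joined)` record; the sample records are invented for illustration.

```python
from collections import Counter


def adoption_curve(instances):
    """Return {k: fraction of instances with k member-friends that joined}."""
    seen = Counter()
    joined = Counter()
    for k, did_join in instances:
        seen[k] += 1
        if did_join:
            joined[k] += 1
    return {k: joined[k] / seen[k] for k in sorted(seen)}


# Hypothetical records: k = friends already in the community,
# joined = whether the user later joined it.
instances = [
    (0, False), (0, False),
    (1, False), (1, True), (1, False), (1, False),
    (2, True), (2, False),
    (3, True), (3, True),
]
curve = adoption_curve(instances)
for k, p in curve.items():
    print(k, p)
```

Plotting p against k shows whether the empirical curve has the S shape that standard models predict.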

17
The Internet Archive Web Collection
  • The Data
  • Complete crawls of the Web, every two months
    since 1996, with some gaps
  • Range of formats and depth of crawl have
    increased with time.
  • No data from sites that are protected by
    robots.txt or where owners have requested not to
    be archived
  • Some missing or lost data
  • Metadata contains format, links, anchor text
  • Organized to facilitate historical access to
    known URLs (Wayback Machine)

18
Archive of www.microsoft.com
19
www.microsoft.com in 1996
20
The Internet Archive Web Collection
  • Sizes
  • Current crawls are about 40-60 TByte (compressed)
  • Total archive is about 600 TByte (compressed)
  • Compression ratio up to 25:1; best estimate of overall
    average is 10:1
  • Rate of increase is about 1 TByte/day (compressed)
  • Total storage requirement at Cornell will differ because of
  • Elimination of data that is duplicated between crawls
  • Expansion of metadata for research
  • Database indexes

21
Web Library/Laboratory
  • The Cornell petabyte data store will allow us to
    mount several large portions of the Web online for
    a broad range of Web research.
  • Copy snapshots of the Web from the Internet
    Archive
  • Store online at Cornell
  • Extract feature sets for researchers
  • Provide APIs for researchers (program
    interface, download of datasets, Web Services
    API)
  • Provide Web GUI for social science researchers

22
Scale of Data Processing
Balance of Resources

                   Ideal               Realistic
  Networking       500 Mbit/sec        100 Mbit/sec
  Data online      all                 few crawls/year
  Metadata online  all                 all?
  Disk             750 TB              240 TB
  Tape archive     all                 few crawls/year
  Computers        research separate   shared with storage

23
Equipment
  • Scidata1 -- Initial Configuration
  • 16-Processor Unisys ES7000 Servers
  • 32 GByte RAM
  • 8 GByte/sec aggregate I/O bandwidth
  • 100 TByte RAID Online Storage
  • ADIC Scalar 10K robotic tape library for archive
  • Separate Web server
  • Near-term Expansion
  • Disk capacity will expand to 240 TByte by end
    of 2007
  • Network
  • Internet2 with dedicated 100 Mbit/sec link to
    Internet Archive
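
A back-of-envelope check on what the dedicated link can carry, assuming it is 100 megabits per second, is fully utilized around the clock, and that 1 TByte means 10^12 bytes. The 600 TByte figure is the compressed archive size quoted earlier in the deck; the constant names are illustrative.

```python
# Assumed link speed: 100 megabits per second, fully utilized.
LINK_MBIT_PER_SEC = 100
SECONDS_PER_DAY = 86_400
BYTES_PER_TBYTE = 1e12  # decimal units assumed

# Bits per second -> bytes per second -> bytes per day.
bytes_per_day = LINK_MBIT_PER_SEC * 1_000_000 / 8 * SECONDS_PER_DAY
tbyte_per_day = bytes_per_day / BYTES_PER_TBYTE

# Compressed size of the total archive, from the sizes slide.
ARCHIVE_TBYTE = 600
days_for_full_archive = ARCHIVE_TBYTE / tbyte_per_day

print(f"{tbyte_per_day:.2f} TByte/day at full utilization")
print(f"{days_for_full_archive:.0f} days to copy {ARCHIVE_TBYTE} TByte")
```

Under these assumptions the link's ceiling is close to the archive's stated growth rate of about 1 TByte/day, which is consistent with the plan to copy selected snapshots rather than the entire collection.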

24
Data Processing Overview
25
User Services
26
Data Processing
  • Transfer 300-500 GByte per day
  • Internet 2 -- 100 Mbit/sec maximum throughput
  • Archive raw data to tape
  • Process raw data
  • Uncompress and unpack Web pages (ARC) and metadata
    (DAT) files
  • Create IDs for pages and content hashes
  • Database load
  • Load batches of metadata about pages and links
    (MS SQL Server 2000)
  • Store compressed page files

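
The step that creates page IDs and content hashes can be sketched as follows. The slides do not name a hash function or record layout, so SHA-1 and the hypothetical `PageStore` class are assumptions for illustration. Hashing page content lets identical pages captured in different crawls share one stored copy.

```python
import hashlib


def content_hash(page_bytes):
    """Hex digest of the page content (SHA-1 is an assumed choice)."""
    return hashlib.sha1(page_bytes).hexdigest()


class PageStore:
    """Hypothetical in-memory stand-in for the page and content tables."""

    def __init__(self):
        self._id_by_hash = {}   # content hash -> content ID
        self.page_records = []  # (page_id, url, crawl_date, content_id)

    def add(self, url, crawl_date, page_bytes):
        digest = content_hash(page_bytes)
        # Reuse the content ID when identical bytes were already stored.
        content_id = self._id_by_hash.setdefault(
            digest, len(self._id_by_hash))
        page_id = len(self.page_records)
        self.page_records.append((page_id, url, crawl_date, content_id))
        return page_id, content_id


store = PageStore()
first = store.add("http://example.org/", "1996-11", b"<html>hello</html>")
second = store.add("http://example.org/", "1997-01", b"<html>hello</html>")
third = store.add("http://example.org/", "1997-03", b"<html>changed</html>")
print(first, second, third)
```

Each capture gets its own page ID, while the first two captures share a content ID because their bytes are identical.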
27
Metadata
  • URLs, pages and links
  • URLs contained in metadata may link to pages
    never crawled
  • URLs are not canonicalized: different URLs may
    refer to the same page
  • Links are from a page to a URL
  • Web graph
  • Nodes are union of pages crawled and URLs seen
  • Each node and edge have time interval(s) over
    which they exist
  • Content
  • Anchor text in more recent crawls
  • File and mime types
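
One way to represent a Web graph whose nodes and edges each carry time intervals is to store a list of (start, end) intervals per node and per edge, then rebuild the graph as of any date by filtering. The `TemporalGraph` class, the integer years, and the example URLs below are hypothetical, a minimal sketch rather than the project's schema.

```python
from collections import defaultdict


class TemporalGraph:
    """Nodes and edges annotated with closed (start, end) intervals."""

    def __init__(self):
        self.node_intervals = defaultdict(list)  # url -> [(start, end)]
        self.edge_intervals = defaultdict(list)  # (src, dst) -> [(start, end)]

    def add_node(self, url, start, end):
        self.node_intervals[url].append((start, end))

    def add_edge(self, src, dst, start, end):
        self.edge_intervals[(src, dst)].append((start, end))

    @staticmethod
    def _alive(intervals, t):
        """True if t falls inside any of the intervals."""
        return any(start <= t <= end for start, end in intervals)

    def snapshot(self, t):
        """Return the sets of nodes and edges that exist at time t."""
        nodes = {u for u, iv in self.node_intervals.items()
                 if self._alive(iv, t)}
        edges = {e for e, iv in self.edge_intervals.items()
                 if self._alive(iv, t)}
        return nodes, edges


g = TemporalGraph()
g.add_node("a.example/", 1996, 2006)
g.add_node("b.example/", 1998, 2001)   # disappears, then returns
g.add_node("b.example/", 2004, 2006)
g.add_edge("a.example/", "b.example/", 1999, 2000)

nodes_1999, edges_1999 = g.snapshot(1999)
nodes_2003, edges_2003 = g.snapshot(2003)
print(sorted(nodes_1999), sorted(edges_1999))
print(sorted(nodes_2003), sorted(edges_2003))
```

Allowing several intervals per node or edge covers pages and links that vanish and later reappear between crawls.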

28
Current Status
  • Data Capture
  • Connection of Internet Archive to Internet 2
    (October 2005)
  • Parallel loading of crawls of DAT and ARC files
    (January 2006)
  • Storage
  • Relational database and preload system under test
  • Page store: preliminary design work
  • User Services
  • Basic access DLL and download of data to users
    under test

29
Research Team
  • Faculty: William Arms, Geri Gay, Dan Huttenlocher, Jon Kleinberg,
    Michael Macy, David Strang
  • Cornell Theory Center: Manuel Calimlim, Dave Lifka, Ruth Mitchell,
    and the Petabyte Data Store team
  • Ph.D. Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more
    than 20 M.Eng. and undergraduate students
  • Internet Archive: Brewster Kahle, Tracey Jacquith, John Berry

30
Thanks
This work would not be possible without the
forethought and longstanding commitment of the
Internet Archive to capture and preserve the
content of the Web for future generations. This
work has been funded in part by the National
Science Foundation, grants CNS-0403340,
DUE-0127308, and SES-0537606, with equipment
support from Unisys and by an E-Science grant and
a gift from Microsoft Corporation.