Scalable Methods for the Analysis of Network-Based Data. MURI Project: University of California, Irvine. Principal Investigator: Padhraic Smyth. Kick-off Meeting, November 18th 2008

Transcript and Presenter's Notes


1
Scalable Methods for the Analysis of Network-Based Data
MURI Project: University of California, Irvine
Principal Investigator: Padhraic Smyth
Kick-off Meeting, November 18th 2008
2
Goals for Today's Meeting
  • Review overall goals and research of MURI project
  • University research groups
      • Learn about each other's research
      • See opportunities for collaboration
  • MURI team and ONR/Navy
      • MURI team learns about ONR interests
      • ONR learns about expertise and plans of MURI team
  • Action items
      • Future meetings and collaborative activities
      • Review Year 1 research goals

3
Outline
  • Introductions
  • Review today's agenda
      • Schedule of talks
      • Logistics
  • Overview of our MURI research project
      • Themes and goals
      • Tasks

4
MURI Investigators
Carter Butts (UCI)
Michael Goodrich (UCI)
David Eppstein (UCI)
Padhraic Smyth (UCI)
Mark Handcock (U Washington)
Dave Hunter (Penn State)
Dave Mount (U Maryland)
5
MURI Project Participants
  • Postdocs
      • UC Irvine
          • Romain Thibaux (Computer Science)
  • Graduate Students (all UC Irvine)
      • Computer Science
          • Darren Strash
          • Lowell Trott
      • Statistics
          • Chris DuBois
      • Social Science
          • Chris Marcum
          • Lorien Jasny
          • Emma Spiro
          • Zack Almquist

6
Outline
  • Introductions
  • Review today's agenda
      • Schedule of talks
      • Logistics
  • Overview of our MURI research project
      • Themes and goals
      • Tasks

7
Agenda for MURI Kickoff Meeting at UC Irvine
November 18th 2008
Location: UC Irvine, Bren Hall, Room 4011
MORNING SESSION
8:30   Arrive, continental breakfast available
9:00   Introductions and overview of MURI proposal
         Padhraic Smyth (UCI)
9:30   Research and application challenges from ONR's perspective
         Martin Kruger (ONR, MURI Program Manager)
10:00  Brief discussion/Q&A session between ONR representatives and PIs
8
Agenda for MURI Kickoff Meeting at UC Irvine
10:15  Break
10:30  Tutorial Session: Statistical models for network data
         Mark Handcock (U Washington), Carter Butts (UCI), Dave Hunter (Penn State)
         - Fundamentals of exponential family random graph models (ERGMs)
         - Parameter estimation in ERGMs: principles and computational challenges
         - Alternative statistical frameworks such as latent space models
LUNCH
12:00  Lunch for PIs and UCI visitors at the University Club (next door to Bren Hall)
9
Agenda for MURI Kickoff Meeting at UC Irvine
AFTERNOON SESSION: 15-minute brief Research Presentations
1:30   Studying networks through an algorithmic lens
         Michael Goodrich (UCI)
1:45   Fast algorithms for computing network statistics
         David Eppstein (UCI)
2:00   Data structures for dynamic and kinetic multidimensional point sets
         Dave Mount (U Maryland)
2:15   Modeling dynamic and relational event data
         Carter Butts (UCI)
2:30   Statistical modeling of large text collections
         Padhraic Smyth (UCI)
2:45   Modeling partially observed network data
         Mark Handcock (U Washington)
10
Agenda for MURI Kickoff Meeting at UC Irvine
3:00   Break and Informal Discussion
3:30   Brief talks on Software and Data Sets
         - R software for network analysis
             Dave Hunter (Penn State)
         - Experimental results on real-world networks
             PhD students from Carter Butts's group (Sociology Dept, UCI)
         - Large network data sets for experimentation
             Chris DuBois (PhD student, Statistics Department, UCI)
11
Agenda for MURI Kickoff Meeting at UC Irvine
4:15   Open Discussion on Research Plans
         - Relation of MURI research topics to military applications
         - Further opportunities for collaboration within the team
         - Year 1 research goals
5:00   Organizational Issues and Wrap-up
         - Future meetings (frequency, location)
         - Encouraging interaction between team members
           (conference calls, weekly research meetings, etc.)
         - Use of collaborative Web pages
         - Action items
5:30   Adjourn
12
Logistics
  • Meals
      • Lunch at University Club, for PIs and non-UCI
        folks
      • Coffee breaks
  • Wireless
      • Should be able to get 24-hour guest access
  • Slides
      • Will be available online by the end of today

13
Outline
  • Introductions
  • Review today's agenda
      • Schedule of talks
      • Logistics
  • Overview of our MURI research project
      • Themes and goals
      • Tasks

14
MURI Project: Scalable Methods for Analysis of
Network-Based Data
  • 4 universities collaborating, 7 PIs
  • Support for (approx.) 8 graduate students and 3
    postdocs or research associates
  • 3-year project with possible extension to 5 years
  • Time Period
      • Funding arrived at UCI in September 2008
      • At other universities in Sept/Oct 2008
      • Official project start/end
          • June 1 2008 to May 30 2011/2013

15
A Small Social Network
[Figure: network diagram of the MURI investigators: Eppstein, Butts, Hunter, Mount, Smyth, Handcock, Goodrich]
16
A Small Social Network
[Figure: network diagram of the MURI investigators (Eppstein, Butts, Hunter, Mount, Smyth, Handcock, Goodrich), with the group label "Statistics"]
17
A Small Social Network
[Figure: network diagram of the MURI investigators (Eppstein, Butts, Hunter, Mount, Smyth, Handcock, Goodrich), with the group labels "Statistics" and "Algorithms & Data Structures"]
18
(No Transcript)
19
Figure from Carter Butts
20
Statistical Modeling of Network Data
  • Statistics: a principled approach for inference
    from data
  • Basis for optimal prediction
      • Querying: computation of conditional
        probabilities/expectations
  • Principles for handling noisy measurements
      • e.g., noisy edge observation process
  • Integration of different sources of information
      • e.g., combining edge information with node
        covariates
  • Quantification of uncertainty
      • e.g., which model is a better explanation of the
        data

21
Limitations of Existing Methods
  • Network data over time
      • Relatively few statistical models for dynamic
        network data
  • Heterogeneous data
      • e.g., few techniques for incorporating text,
        spatial information, etc., into network models
  • Computational tractability
      • Many network modeling algorithms scale
        exponentially in the number of nodes N

22
Example
  • $G = \{V, E\}$
      • $V$ = set of $N$ nodes
      • $E$ = set of directed binary edges
  • Exponential random graph model (ERGM)
      • $P(G \mid \theta) = f(G; \theta) / \text{normalization constant}$
      • The normalization constant = sum over all
        possible graphs
  • How many graphs? $2^{N(N-1)}$
      • e.g., for $N = 20$, we have $2^{380} \approx 10^{114}$ graphs to
        sum over
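
The counting argument on this slide can be checked directly for a tiny graph. The following is a minimal sketch, not the project's code: it assumes the simplest possible ERGM, whose only statistic is the edge count, chosen here because its normalization constant has a closed form to compare against. The function names and the parameter name `theta` are illustrative.

```python
from itertools import product
from math import exp


def count_directed_graphs(n):
    """Number of directed binary graphs on n labeled nodes (no self-loops)."""
    return 2 ** (n * (n - 1))


def brute_force_normalizer(n, theta):
    """Sum exp(theta * edge_count) over every directed graph on n nodes.

    Assumption: the model's only statistic is the number of edges.
    """
    dyads = n * (n - 1)
    total = 0.0
    # Each tuple of 0/1 indicators is one possible graph.
    for edges in product((0, 1), repeat=dyads):
        total += exp(theta * sum(edges))
    return total


n, theta = 3, 0.5
z_brute = brute_force_normalizer(n, theta)
# For the edges-only model the sum factorizes over the n(n-1) dyads.
z_closed = (1.0 + exp(theta)) ** (n * (n - 1))
print(count_directed_graphs(n), z_brute, z_closed)

# Number of decimal digits in the graph count for N = 20.
print(len(str(count_directed_graphs(20))))
```

Enumeration already visits 64 graphs for N = 3 and grows as 2 to the power N(N-1), which is why exact summation is out of reach for all but the smallest networks.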

23
Key Themes of our MURI Project
  • Foundational research on new statistical
    estimation techniques for network data
      • e.g., principles of modeling with missing data
  • New algorithms for heterogeneous network data
      • Incorporating time, space, text, other covariates
  • Faster algorithms
      • e.g., approximation methods for very large data
        sets
  • Software
      • Make network inference software
        publicly available (in R)

24
Key Themes of our MURI Project
Fast Algorithms
Statistical Methods
Richer models
Large Heterogeneous Data Sets
Software
25
Tasks
  • A: Fast network estimation algorithms
      • Eppstein, Butts
  • B: Spatial representations and network data
      • Goodrich, Eppstein, Mount
  • C: Advanced network estimation techniques
      • Handcock, Hunter
  • D: Scalable methods for relational events
      • Butts
  • E: Network models with text data
      • Smyth
  • F: Software for network inference and prediction
      • Hunter

26
Task A: Fast Network Estimation Algorithms
Investigators: Eppstein, Butts
  • Problem
      • Statistical inference algorithms can be slow
        because of repeated computation of various
        statistics on graphs
  • Goal
      • Leverage ideas from computational graph
        algorithms to enable much faster computation,
        also enabling computation of more complex and
        realistic statistics
  • Projects
      • Dynamic graph methods for change-score
        computation
      • Rapid subgraph automorphism detection for feature
        counting
      • Dynamic connectivity
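
To illustrate what a change score is, here is a minimal sketch that assumes an undirected graph stored as adjacency sets and uses the triangle count as the example statistic. Both choices are assumptions made for illustration; the slide does not say which statistics or data structures the task targets. Instead of recounting all triangles after an edge is toggled, the change is read off from the common neighbors of the two endpoints.

```python
from itertools import combinations


def triangle_count(adj):
    """Count triangles by checking every node triple (slow reference version)."""
    nodes = sorted(adj)
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )


def triangle_change_score(adj, i, j):
    """Change in triangle count if the edge {i, j} were toggled."""
    common = len((adj[i] & adj[j]) - {i, j})
    # Removing an existing edge destroys triangles; adding one creates them.
    return -common if j in adj[i] else common


def toggle(adj, i, j):
    """Add the edge {i, j} if absent, remove it if present."""
    if j in adj[i]:
        adj[i].discard(j)
        adj[j].discard(i)
    else:
        adj[i].add(j)
        adj[j].add(i)


adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
before = triangle_count(adj)
delta = triangle_change_score(adj, 1, 3)
toggle(adj, 1, 3)
print(before, delta, triangle_count(adj))
```

The full recount inspects every triple of nodes, while the change score only intersects two neighbor sets, which is the kind of saving that matters when a sampler proposes many single-edge toggles.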

27
Task B: Spatial Representations and Network Data
Investigators: Goodrich, Eppstein, Mount
  • Problem
      • Spatial representations of network data can be
        quite useful (both latent embeddings and actual
        spatial information) but current statistical
        modeling algorithms scale poorly
  • Goal
      • Build on recent efficient geometric data
        indexing techniques in computer science to
        develop much faster and more efficient algorithms
  • Projects
      • Improved algorithms for latent-space embeddings
      • Fast implementations for high-dimensional latent
        space models
      • Techniques for integrating actual and latent
        space geometry
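
A minimal sketch of why latent-space models scale poorly, assuming a common distance-based formulation in which the log-odds of a tie between nodes i and j is an intercept minus the Euclidean distance between their latent positions. This specific form, and the names `alpha` and `positions`, are assumptions for illustration and are not taken from the slides.

```python
import math


def edge_probability(alpha, z_i, z_j):
    """Logistic link: closer latent positions give a higher tie probability."""
    dist = math.dist(z_i, z_j)
    return 1.0 / (1.0 + math.exp(-(alpha - dist)))


def log_likelihood(alpha, positions, edges):
    """Log-likelihood of an undirected graph given latent positions.

    `edges` is a set of pairs (i, j) with i < j. The double loop visits all
    N(N-1)/2 pairs, so one evaluation costs O(N^2).
    """
    n = len(positions)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probability(alpha, positions[i], positions[j])
            total += math.log(p) if (i, j) in edges else math.log(1.0 - p)
    return total


positions = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
within_clusters = {(0, 1), (2, 3)}
across_clusters = {(0, 2), (1, 3)}
print(log_likelihood(1.0, positions, within_clusters))
print(log_likelihood(1.0, positions, across_clusters))
```

Because every pair of nodes contributes a term, each likelihood evaluation is quadratic in the number of nodes; geometric indexing that groups far-apart pairs is one plausible way to reduce that cost.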

28
Task C: Advanced Estimation Techniques
Investigators: Handcock, Hunter
  • Problem
      • Current statistical network inference models
        often make unrealistic assumptions, e.g.,
          • Assume complete (non-missing) data
          • Assume that exact computation is possible
  • Goal
      • Develop new theories and techniques that relax
        these assumptions, i.e., methods for handling
        missing data and techniques for approximate
        inference
  • Projects
      • Inference with partially observed network data
      • Approximation methods
          • Approximate likelihood techniques
          • Approximate MCMC algorithms
  • Will leverage new techniques developed in Tasks A
    and B

29
Figure from Dave Hunter, Penn State
30
Task D: Scalable Temporal Models
Investigator: Butts
  • Problem
      • Few statistical methods for modeling temporal
        sequences of events among a network of actors
  • Goal
      • Develop new statistical relational event models
        to handle an evolving set of events over time in
        a network context
  • Projects
      • Specification of relational event statistics
      • Rapid likelihood computation for relational event
        models
      • Predictive event system queries
      • Interventions, forecasting, and network
        steering
  • Can build on ideas from Tasks A, B, C
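
As a rough illustration of a likelihood over an ordered sequence of events, the sketch below assumes a deliberately simplified model: each possible (sender, receiver) event has a rate that depends on a single hypothetical reciprocity statistic, and the probability of the next event is its rate divided by the sum of rates over all ordered pairs. The statistic, the parameter `theta`, and the function names are assumptions for illustration, not the project's specification.

```python
import math
from itertools import permutations


def log_rate(sender, receiver, seen, theta):
    """Log event rate: theta if this event would reciprocate an earlier one."""
    return theta if (receiver, sender) in seen else 0.0


def sequence_log_likelihood(events, n_actors, theta):
    """Log-likelihood of an ordered list of (sender, receiver) events."""
    seen = set()  # directed pairs that have already occurred
    total = 0.0
    for sender, receiver in events:
        # Normalize over every ordered pair of distinct actors: O(N^2) per event.
        log_norm = math.log(sum(
            math.exp(log_rate(s, r, seen, theta))
            for s, r in permutations(range(n_actors), 2)
        ))
        total += log_rate(sender, receiver, seen, theta) - log_norm
        seen.add((sender, receiver))
    return total


events = [(0, 1), (1, 0), (2, 0)]
baseline = sequence_log_likelihood(events, n_actors=3, theta=0.0)
reciprocal = sequence_log_likelihood(events, n_actors=3, theta=1.0)
print(baseline, reciprocal)
```

In this naive form every event requires a pass over all ordered pairs of actors, which gives a sense of why faster likelihood computation is listed as a project in its own right.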

31
(No Transcript)
32
Task E: Network Models and Text Data
Investigator: Smyth
  • Problem
      • Lack of statistical techniques that can combine
        network and text data within a single framework
        (e.g., email communication)
  • Goal
      • Leverage recent advances in both statistical
        text mining and statistical network modeling to
        create new combined models
  • Projects
      • Latent variable models for text and network data
      • Text as exogenous data for statistical network
        models
      • Modeling of text and network data over time
      • Fast algorithms for statistical modeling of
        text/networks
  • Can build on ideas from Tasks A, B, C and D

33
Network of email communication patterns in a
corporate research lab
34
Task F: Software for Network Inference and
Prediction
Investigator: Hunter
  • Goal
      • Disseminate algorithms and software to research
        and practitioner communities
  • How?
      • By incorporating our new algorithms into the R
        statistical package
      • R: open-source language for statistical
        computing/graphics
      • MURI team has significant prior experience with
        developing statistical network modeling packages
        in R
          • network (Butts et al., 2007)
          • latentnet (Handcock et al., 2004)
          • ergm (Handcock et al., 2003)
          • sna (Butts, 2000)
  • Will integrate algorithms and techniques from
    earlier tasks

35
Data Sets
  • Traditional social network data sets
  • Next generation data sets
      • Dynamic network data
          • e.g., WTC communications
      • Network data with text
          • e.g., political blogs, Enron emails
      • Often much larger and richer than traditional
        data sets
  • See afternoon talks by PhD students Lorien Jasny
    and Chris DuBois

36
Evaluation Methods
  • Traditional statistical metrics
      • Log-likelihood on training data
      • Model comparisons using penalized and marginal
        likelihood
  • Predictive metrics
      • e.g., for dynamic networks, prediction of edge
        and node properties out of sample, and
        assessment of the accuracy of these predictions
      • Classification accuracy, precision-recall (ROC),
        etc.
  • Computational metrics
      • Worst-case and average-case analysis
      • Empirical evaluations of computation time
      • Trade-offs of statistical/predictive accuracy
        with computation time
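
One of the predictive metrics listed above can be sketched for edge prediction as follows. This is a minimal example that assumes predictions and held-out observations are both given as sets of (sender, receiver) pairs; the function name and data are hypothetical.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted edge set against the observed one."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical held-out edges and model predictions.
predicted_edges = {(0, 1), (1, 2), (2, 3)}
observed_edges = {(0, 1), (2, 3), (3, 4), (4, 5)}
print(precision_recall(predicted_edges, observed_edges))
```

Precision measures how many predicted edges actually occurred, and recall measures how many of the edges that occurred were predicted; reporting both avoids rewarding a model that simply predicts very few or very many edges.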

37
Summary
  • Statistical modeling is a key approach for
    quantitative analysis and prediction using
    network data
  • Existing statistical network modeling techniques
    are potentially very powerful
      • But are currently computationally limited to
        small networks
  • Leverage ideas from computer science to extend
    the reach of statistical network modeling to
    larger networks
  • Benefits
      • Computationally tractable modeling of much larger
        networks
      • More sophisticated representations for network
        models
      • New applications of statistical network modeling
