1 Stanford University, 2 MPI for Biological Cybernetics, 3 California Institute of Technology - PowerPoint PPT Presentation

Inferring Networks of Diffusion and Influence. Manuel Gomez Rodriguez (1,2), Jure Leskovec (1), Andreas Krause (3). 1 Stanford University, 2 MPI for Biological Cybernetics, 3 California Institute of Technology

Transcript and Presenter's Notes

1
Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1,2), Jure Leskovec (1), Andreas Krause (3)
1 Stanford University, 2 MPI for Biological Cybernetics, 3 California Institute of Technology
2
Networks and Processes
  • It is often hard to directly observe a social or
    information network
  • Hidden/hard-to-reach populations
  • Injection drug users
  • Implicit connections
  • Network of information sharing in online media
  • Often easier to observe results of the processes
    taking place on such (invisible) networks
  • Virus propagation
  • People get sick, they see the doctor
  • Information networks
  • Blogs mention information

3
Information Diffusion Network
  • Information diffuses through the network
  • We only see who mentions but not where they got
    the information from
  • Can we reconstruct the (hidden) diffusion network?

4
More Examples
  • Virus propagation
  • Process: viruses propagate through the network
  • We observe: when people get sick
  • It's hidden: who infected them
  • Word of mouth and viral marketing
  • Process: recommendations and influence propagate
  • We observe: when people buy products
  • It's hidden: who influenced them

Can we infer the underlying social or information
network?
5
Inferring the Network
  • There is a hidden directed network
  • We only see times when nodes get infected
  • Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4)
  • Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6)
  • Want to infer the who-infects-whom network

6
Our Problem Formulation
  • Plan for the talk
  • Define a continuous time model of diffusion
  • Define the likelihood of the observed propagation
    data given a graph
  • Show how to efficiently compute the likelihood
  • Show how to efficiently optimize the likelihood
  • Find a graph G that maximizes the likelihood
  • There is a super-exponential number of graphs G
  • Our method finds a near-optimal solution in $O(N^2)$!

7
Cascade Generation Model
  • Cascade generation model
  • Cascade reaches node u at time $t_u$ and spreads to
    u's neighbors v
  • With probability $\beta$ the cascade propagates along
    (u, v), and $t_v = t_u + \Delta$, with $\Delta \sim f(\Delta)$
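The generation model above can be sketched as a small simulation. This is only an illustration under stated assumptions: the delay distribution f is taken to be exponential with a hypothetical `mean_delay`, the graph is a plain adjacency dictionary, and when several neighbors try to infect the same node the earliest attempt wins, so each node keeps a single parent. The function name `simulate_cascade` is made up for this sketch.

```python
import heapq
import itertools
import random

def simulate_cascade(graph, source, beta, mean_delay, rng):
    """Simulate one cascade; return (infection times, parent of each node)."""
    times, parent = {}, {}
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    events = [(0.0, next(tie), source, None)]  # (time, tie, node, parent)
    while events:
        t, _, v, u = heapq.heappop(events)
        if v in times:
            continue  # already infected earlier: a node has only one parent
        times[v], parent[v] = t, u
        for w in graph.get(v, []):
            # With probability beta the cascade propagates along (v, w).
            if w not in times and rng.random() < beta:
                delay = rng.expovariate(1.0 / mean_delay)  # assumed f
                heapq.heappush(events, (t + delay, next(tie), w, v))
    return times, parent

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
times, parent = simulate_cascade(graph, "a", beta=0.5, mean_delay=1.0,
                                 rng=random.Random(0))
```

Only `times` would be visible to an observer; `parent` is the hidden who-infected-whom information that the inference method tries to recover.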

[Figure: example cascade over nodes a to f, with infection times $t_a$, $t_b$, $t_c$, $t_e$, $t_f$ and transmission delays $\Delta_1$ to $\Delta_4$]
We assume each node v has only one parent!
8
Likelihood of a Cascade
  • If u infected v in a cascade c, its transmission
    probability is
  • $P_c(u, v) \propto f(t_v - t_u)$, with $t_v > t_u$ and (u,
    v) neighbors
  • To model that in reality any node v in a cascade
    can have been infected by an external influence
    m: $P_c(m, v) = \varepsilon$
  • Prob. that cascade c propagates in a tree T

Tree pattern T on cascade c: (a, 1), (b, 2), (c,
4), (e, 8)
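As a rough illustration of scoring a tree pattern on this cascade, the sketch below assumes that the likelihood of a tree is the product of the transmission probabilities of its edges, that f is an exponential density with a hypothetical parameter `alpha`, and that a node without a parent is attached to the external source m with probability `EPSILON`. These choices and all names are assumptions made for the example, not taken from the slides.

```python
import math

EPSILON = 1e-4  # assumed probability of infection by the external source m

def edge_prob(t_u, t_v, alpha=1.0):
    """Assumed exponential f evaluated at t_v - t_u; zero unless t_v > t_u."""
    if t_v <= t_u:
        return 0.0
    return math.exp(-(t_v - t_u) / alpha) / alpha

def tree_log_likelihood(times, parent, alpha=1.0):
    """Sum of log edge probabilities (log of the assumed product form)."""
    total = 0.0
    for v, u in parent.items():
        p = EPSILON if u is None else edge_prob(times[u], times[v], alpha)
        if p == 0.0:
            return float("-inf")  # tree contradicts the observed time order
        total += math.log(p)
    return total

times = {"a": 1, "b": 2, "c": 4, "e": 8}
chain = {"a": None, "b": "a", "c": "b", "e": "c"}  # a -> b -> c -> e
star = {"a": None, "b": "a", "c": "a", "e": "a"}   # a infects everyone
chain_score = tree_log_likelihood(times, chain)
star_score = tree_log_likelihood(times, star)
```

Under these assumptions the chain scores higher than the star, because its edges span shorter delays.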
9
Finding the Diffusion Network
Good news: computing $P(c \mid G)$ is tractable.
Bad news: we actually want to find the graph G that
maximizes the likelihood, and there is a
super-exponential number of graphs!
  • There are many possible propagation trees
  • c: (a, 1), (c, 2), (b, 3), (e, 4)
  • Need to consider all possible propagation trees T
    supported by G

For each c, consider all $O(n^n)$ possible
transmission trees of G. The Matrix Tree Theorem can
compute this sum in $O(n^3)$!
  • Likelihood of a set of cascades C on G:
    $P(C \mid G) = \prod_{c \in C} P(c \mid G)$
  • Want to find $\hat{G} = \arg\max_{|G| \le k} P(C \mid G)$
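To make the Matrix Tree Theorem step concrete, here is a small sketch that sums the product of edge weights over all spanning trees rooted at an external source, once with a determinant and once by brute-force enumeration. It uses the directed form of the theorem (in-weight Laplacian with the root's row and column removed). The toy graph, the weights, and all function names are assumptions chosen for the example.

```python
from fractions import Fraction
from itertools import product

def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    det = Fraction(1)
    for col in range(len(m)):
        pivot = next((r for r in range(col, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            for c in range(col, len(m)):
                m[r][c] -= factor * m[col][c]
    return det

def tree_sum_matrix(nodes, weights, root):
    """Sum over spanning trees rooted at `root` via the Matrix Tree Theorem."""
    others = [v for v in nodes if v != root]
    def entry(i, j):
        if i == j:  # total weight of edges entering node i
            return sum(w for (u, v), w in weights.items() if v == i)
        return -weights.get((i, j), 0)
    return determinant([[entry(i, j) for j in others] for i in others])

def reaches_root(v, parent, root):
    seen = set()
    while v != root:
        if v in seen:
            return False  # the chosen parents form a cycle
        seen.add(v)
        v = parent[v]
    return True

def tree_sum_brute(nodes, weights, root):
    """Same sum by enumerating one parent per non-root node."""
    others = [v for v in nodes if v != root]
    choices = [[u for u in nodes if (u, v) in weights] for v in others]
    total = Fraction(0)
    for parents in product(*choices):
        parent = dict(zip(others, parents))
        if all(reaches_root(v, parent, root) for v in others):
            term = Fraction(1)
            for v, u in parent.items():
                term *= weights[(u, v)]
            total += term
    return total

nodes = ["m", "a", "b", "c"]  # m plays the role of the external source
weights = {("m", "a"): Fraction(1, 10), ("m", "b"): Fraction(1, 10),
           ("m", "c"): Fraction(1, 10), ("a", "b"): Fraction(1, 2),
           ("a", "c"): Fraction(1, 4), ("b", "c"): Fraction(1, 3)}
via_determinant = tree_sum_matrix(nodes, weights, "m")
via_enumeration = tree_sum_brute(nodes, weights, "m")
```

The enumeration grows exponentially with the number of nodes, while the determinant needs only cubic time, which is the point made on the slide.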

10
An Alternative Formulation
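  • We consider only the most likely tree
  • Maximum log-likelihood for a cascade c under a
    graph G: $F_c(G) = \max_{T \in \mathcal{T}(G)} \log P(c \mid T)$,
    where $\mathcal{T}(G)$ is the set of propagation trees
    supported by G
  • Log-likelihood of G given a set of cascades C:
    $F_C(G) = \sum_{c \in C} F_c(G)$

The problem is NP-hard (by reduction from
MAX-k-COVER). Our algorithm can do it near-optimally
in $O(N^2)$.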
11
Max Directed Spanning Tree
  • Given a cascade c, what is the most likely
    propagation tree?
  • A maximum directed spanning tree (MDST)
  • Just need to compute the MDST of the sub-graph
    of G induced by c (i.e., a DAG)
  • For each node, just pick an in-edge of
    maximum weight

[Figure: weighted sub-graph of G induced by c over nodes a, b, c, d]
Subgraph of G induced by c doesn't have loops
(DAG)
Local greedy selection gives optimal tree!
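The local selection can be sketched in a few lines. The example assumes edge weights equal to the negative delay between the two infection times (the log of a unit-rate exponential, up to a constant) and a made-up edge set for G; both are assumptions for illustration. A node with no earlier in-neighbor gets parent `None`, standing for the external source.

```python
def most_likely_tree(times, edges, weight):
    """Pick, for every node of the cascade, its maximum-weight in-edge.

    times:  node -> infection time in cascade c
    edges:  set of directed edges (u, v) of G
    weight: function (u, v) -> edge weight
    """
    parent = {}
    for v in times:
        # Only earlier-infected in-neighbors qualify, so no cycle can form.
        candidates = [u for u in times
                      if (u, v) in edges and times[u] < times[v]]
        parent[v] = max(candidates, key=lambda u: weight(u, v), default=None)
    return parent

times = {"a": 1, "c": 2, "b": 3, "e": 4}
edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e"), ("c", "e"), ("e", "a")}
tree = most_likely_tree(times, edges, lambda u, v: -(times[v] - times[u]))
```

Because every admissible edge points forward in time, any combination of per-node choices is a tree, which is why the independent choices together give the optimum.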
12
Great News: Submodularity
  • Theorem
  • Log-likelihood $F_c(G)$ is monotonic and
    submodular in the edges of the graph G
  • Proof
  • Take edge sets $A \subseteq B \subseteq V \times V$
  • Single cascade c, edge e into node s with weight x
  • Let $w$ be the max-weight in-edge of s in A
  • Let $w'$ be the max-weight in-edge of s in B
  • We know $w \le w'$
  • Now $F_c(A \cup \{e\}) - F_c(A) = \max(w, x) - w$
  • $\ge \max(w', x) - w' = F_c(B \cup \{e\}) - F_c(B)$

Then the log-likelihood $F_C(G)$, a sum of the $F_c(G)$
over cascades, is monotonic and submodular too
13
Finding the Diffusion Graph
  • Use the greedy hill-climbing to maximize FC(G)
  • At every step, pick the edge that maximizes the
    marginal improvement
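A minimal sketch of this greedy loop follows, using the two contagions from the earlier slide as input. It assumes that each node's contribution to the objective is the weight of its best in-edge within each cascade, that weights are negative delays, and that every node starts attached to an external source with log-probability `LOG_EPS`. These modelling choices and all names are assumptions for illustration, not the authors' implementation.

```python
import math

LOG_EPS = math.log(1e-4)  # assumed log-probability of the external-source edge

def edge_weight(t_u, t_v):
    """Assumed log of a unit-rate exponential delay density."""
    return -(t_v - t_u)

def marginal_gain(edge, cascades, best):
    """Increase in the summed best-parent weights if `edge` is added."""
    u, v = edge
    gain = 0.0
    for times, best_in in zip(cascades, best):
        if u in times and v in times and times[u] < times[v]:
            gain += max(0.0, edge_weight(times[u], times[v]) - best_in[v])
    return gain

def greedy_network(cascades, k):
    """Add up to k edges, each time taking the largest marginal gain."""
    # best[i][v]: weight of the best in-edge so far for node v in cascade i.
    best = [{v: LOG_EPS for v in times} for times in cascades]
    candidates = sorted({(u, v) for times in cascades
                         for u in times for v in times if times[u] < times[v]})
    chosen, gains = [], []
    for _ in range(k):
        if not candidates:
            break
        edge = max(candidates, key=lambda e: marginal_gain(e, cascades, best))
        gain = marginal_gain(edge, cascades, best)
        if gain <= 0:
            break
        u, v = edge
        for times, best_in in zip(cascades, best):
            if u in times and v in times and times[u] < times[v]:
                best_in[v] = max(best_in[v], edge_weight(times[u], times[v]))
        candidates.remove(edge)
        chosen.append(edge)
        gains.append(gain)
    return chosen, gains

cascades = [
    {"a": 1, "c": 2, "b": 3, "e": 4},  # contagion c1
    {"c": 1, "a": 4, "b": 5, "d": 6},  # contagion c2
]
chosen, gains = greedy_network(cascades, k=4)
```

In this toy run the edge (a, b) is picked first because it is supported by both cascades. This version re-evaluates every candidate at every step.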

[Figure: marginal gains of candidate edges over successive greedy steps, each followed by a localized update]
14
Experimental Setup
  • We validate our method on
  • Synthetic data: generate a graph G on k edges,
    generate cascades, record node infection times,
    reconstruct G
  • Real data: MemeTracker, 172m news articles (Aug 08
    to Sept 09), 343m textual phrases (quotes)
  • How well do we optimize Fc(G)?
  • How many edges of G can we find?
  • Precision-Recall
  • Break-even point
  • How fast are we?
  • How many cascades do we need?
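For the precision-recall questions above, a small sketch of how edge recovery could be scored. It assumes the inferred edges come as a ranked list of distinct edges (for example, in the order a greedy method added them) that is at least as long as the true edge set, and it takes the break-even point to be the precision at the cut-off where precision equals recall. The function names and the toy edges are hypothetical.

```python
def precision_recall(inferred, true_edges):
    """Precision and recall of a set of inferred edges."""
    inferred, true_edges = set(inferred), set(true_edges)
    hits = len(inferred & true_edges)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    return precision, recall

def break_even_point(ranked_edges, true_edges):
    """Precision (= recall) when keeping as many edges as the true graph has."""
    k = len(set(true_edges))
    return precision_recall(ranked_edges[:k], true_edges)[0]

true_edges = {("a", "b"), ("b", "c"), ("c", "d")}
ranked = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
top2 = precision_recall(ranked[:2], true_edges)
bep = break_even_point(ranked, true_edges)
```

Keeping exactly as many edges as the true graph contains makes the two denominators equal, so precision and recall coincide at that cut-off.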

15
Small Synthetic Example
  • Small synthetic network

[Figure panels: true network, baseline network, our method. Caption: pick k strongest edges]
16
Synthetic Networks
1024-node hierarchical Kronecker network, exponential
transmission model
1000-node Forest Fire network ($\alpha = 1.1$), power-law
transmission model
  • Our performance does not depend on the network
    structure
  • Synthetic networks: Forest Fire, Kronecker, etc.
  • Prob. of transmission: exponential, power law
  • Break-even points of > 90% when the baseline gets
    30-50%!

17
How good is our graph?
  • We achieve 90% of the best possible network!

18
How many cascades do we need?
  • With 2x as many infections as edges, the
    break-even point is already 0.8 - 0.9!

19
Running Time
  • Lazy evaluation and localized updates give a speed-up of 2
    orders of magnitude!
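The idea behind lazy evaluation can be sketched generically. For a submodular objective, a marginal gain computed in an earlier round is an upper bound on the current gain, so only the top of a priority queue needs re-evaluation. The code below is a generic lazy greedy on a toy set-coverage objective, used here as a stand-in; it is not the authors' code and does not include the localized updates.

```python
import heapq

def lazy_greedy(items, marginal_gain, k):
    """Select up to k items, re-evaluating gains only when needed.

    marginal_gain(item, selected) must be submodular: gains can only
    shrink as `selected` grows, so stale gains are valid upper bounds.
    """
    selected = []
    # Heap of (-gain, index, item); the index breaks ties deterministically.
    heap = [(-marginal_gain(x, selected), i, x) for i, x in enumerate(items)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        _, i, x = heapq.heappop(heap)
        fresh = marginal_gain(x, selected)
        if not heap or fresh >= -heap[0][0]:
            # No stale upper bound beats the fresh gain: x is the best item.
            if fresh <= 0:
                break
            selected.append(x)
        else:
            heapq.heappush(heap, (-fresh, i, x))
    return selected

sets = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5, 6}, "s4": {1, 2}}

def coverage_gain(name, selected):
    covered = set().union(*(sets[s] for s in selected))
    return len(sets[name] - covered)

picked = lazy_greedy(list(sets), coverage_gain, k=2)
```

On large candidate sets most stale entries never reach the top of the queue, which is where the saving comes from.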

20
Real Data
  • MemeTracker dataset
  • 172m news articles
  • Aug 08 to Sept 09
  • 343m textual phrases (quotes)
  • Times $t_c(w)$ when site w mentions phrase (quote) c

http://memetracker.org
  • Given times when sites mention phrases
  • We infer the network of information diffusion
  • Who tends to copy (repeat after) whom

21
Real Network
  • We use the hyperlinks in the MemeTracker dataset
    to generate the edges of a ground truth G
  • From the MemeTracker dataset, we have the
    timestamps of
  • 1. cascades of hyperlinks
  • sites link other sites
  • 2. cascades of (MemeTracker) quotes
  • sites copy quotes from other sites

Are they correlated?
Can we infer the hyperlink network from cascades of
hyperlinks? From cascades of MemeTracker quotes?
22
Real Network
500-node hyperlink network using hyperlink
cascades
500-node hyperlink network using MemeTracker
cascades
  • Break-even points of 50% for hyperlink cascades
    and 30% for MemeTracker cascades!

23
Diffusion Network
  • 5,000 news sites

Blogs Mainstream media
24
Diffusion Network (small part)
Blogs Mainstream media
25
Networks and Processes
  • We infer hidden networks based on diffusion data
    (timestamps)
  • Problem formulation in a maximum likelihood
    framework
  • NP-hard problem to solve exactly
  • We develop an approximation algorithm
  • It is efficient: it runs in $O(N^2)$
  • It is invariant to the structure of the
    underlying network
  • It gives a sub-optimal network with a tight bound
  • Future work
  • Learn both the network and the diffusion model
  • Extensions to other processes taking place on
    networks
  • Applications to other domains: biology,
    neuroscience, etc.

26
Thanks!
For more (code and data): http://snap.stanford.edu/netinf