Title: Proximity on Large Graphs: definitions, fast solutions and applications


1
Proximity on Large Graphs: definitions, fast solutions and applications
  • Speaker: Hanghang Tong
  • Carnegie Mellon University

IBM T.J. Watson
2008-7-31
2
Joint work with
Jia-Yu Pan (Google)
Christos Faloutsos (CMU)
Yehuda Koren (AT&T Labs)
IBM
Spiros Papadimitriou
Philip S. Yu
Huiming Qu
Hani Jamjoom
Tina Eliassi-Rad (LLNL)
Brian Gallagher (LLNL)
Kensuke Oonuma (Sony Corp.)
Yasushi Sakurai (NTT Labs)
3
Graphs are everywhere!
4
Graph Mining: Big Picture
  • Graph Level
    • Patterns
    • Laws
    • Generators
  • Subgraph Level
    • Community
  • Node Level
    • Association
    • Correlation
    • Causality
    • Proximity (We are here!)
[Figure: example graph of people (Smith, Alan, Adam, John, Jones, Tom, Peter, Beck, Jack, Amy, Dan, Anna, Alice); label: Cell Phone]
5
Proximity on Graphs: What?
a.k.a. Relevance, Closeness, Similarity
6
Proximity on Graphs: Why?
  • Link prediction [Liben-Nowell], [Tong]
  • Ranking [Haveliwala], [Chakrabarti]
  • Email Management [Minkov]
  • Image caption [Pan]
  • Neighborhood Formation [Sun]
  • Conn. subgraph [Faloutsos], [Tong], [Koren]
  • Pattern match [Tong]
  • Collaborative Filtering [Fouss]
  • Many more

Will return to this later
7
Link Prediction
[Figure: two density histograms of Prox(i→j) + Prox(j→i), one for a set of deleted links and one for a set of absent links]
Prox. is effective at telling deleted edges apart from absent edges!
Q: How to predict the existence of the link?
A: Proximity! [Liben-Nowell 2003]
8
Neighborhood Search on graphs
[Figure: bipartite graph of Conference and Author nodes]
Q: What is the most related conference to ICDM?
A: Proximity! [Sun ICDM 2005]
9
Automatic Image Caption
[Figure: graph linking Image, Region, and Keyword nodes (Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass), plus a Test Image]
Q: How to assign keywords to the test image?
A: Proximity! [Pan 2004]
10
Center-Piece Subgraph (CePS)
[Figure: Input (Original Graph) and Output (CePS), with the CePS guy marked]
Q: How to find the hub for the black nodes?
A: Proximity! [Tong KDD 2006]
11
Best-Effort Pattern Match
[Figure: Input (Query Graph and Data Graph) and Output (Matching Subgraph)]
Q: How to find the matching subgraph?
A: Proximity! [Tong KDD 2007 b]
12
Roadmap
  • Motivation
  • Part I: Definitions
    • Basic: RWR
    • Variants
    • Properties
    • Generalizations
  • Part II: Fast Solutions
  • Part III: Applications
  • Conclusion

13
Why not shortest path?
Some bad proximities
pizza delivery guy problem
multi-facet relationship
14
Why not max. netflow?
Some bad proximities
No punishment on long paths
15
What is a good Proximity?
  • Multiple Connections
  • Quality of connection
  • Direct & indirect connections
  • Length, Degree, Weight


16
Random walk with restart
17
Random walk with restart
Nearby nodes, higher scores
Ranking vector
More red, more relevant
18
Why is RWR a good score?
Notation: adjacency matrix; c: damping factor
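
The equation on the RWR slides did not survive in this transcript. As a rough sketch only, the block below assumes the commonly stated form r = c W r + (1 - c) e, where W is the column-normalized adjacency matrix, e is the starting vector for the query node, and c is the damping factor. The function name `rwr`, its defaults, and the toy path graph are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def rwr(A, seed, c=0.9, tol=1e-10, max_iter=1000):
    """Random walk with restart by power iteration (assumed form, see lead-in).

    A    : (n, n) nonnegative adjacency matrix with no all-zero column
    seed : index of the query node
    c    : probability of following an edge; 1 - c is the restart probability
    """
    A = np.asarray(A, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)   # column-normalize: each column sums to 1
    e = np.zeros(A.shape[0])
    e[seed] = 1.0                          # starting vector: all mass on the query
    r = e.copy()
    for _ in range(max_iter):
        r_next = c * (W @ r) + (1.0 - c) * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy example (hypothetical): the path graph 0 - 1 - 2 - 3, queried at node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
r = rwr(A, seed=0)
```

On this toy path the scores of nodes 1, 2, 3 decrease with distance from the query, which is the "nearby nodes, higher scores" behaviour the slide describes.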
19
Roadmap
  • Motivation
  • Part I: Definitions
    • Basic: RWR
    • Variants
    • Properties
    • Generalizations
  • Part II: Fast Solutions
  • Part III: Applications
  • Conclusion

20
Variant: escape probability
  • Define Random Walk (RW) on the graph
  • Esc_Prob(CMU→IBM)
    • Prob (starting at CMU, reaches IBM before returning to CMU)
[Figure: CMU and IBM connected through the remaining graph]
Esc_Prob = Pr (smile before cry)
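
The definition above can be turned into a small linear solve. This is a minimal sketch, assuming an unweighted or weighted graph given as a dense adjacency matrix in which every node can reach the source or the target; the function name `escape_probability` and the star-graph example are illustrative, not from the talk. It first computes, for every other node, the probability of hitting the target before the source, then averages over the first step out of the source.

```python
import numpy as np

def escape_probability(A, i, j):
    """Pr(a walk started at i reaches j before returning to i)."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    n = A.shape[0]
    inner = [k for k in range(n) if k not in (i, j)]
    v = np.zeros(n)                           # v[k] = Pr(hit j before i | start at k)
    v[j] = 1.0                                # boundary values: v[i] = 0, v[j] = 1
    if inner:
        # v is harmonic on the inner nodes: v_in = P_in,in v_in + P_in,j
        M = np.eye(len(inner)) - P[np.ix_(inner, inner)]
        v[inner] = np.linalg.solve(M, P[inner, j])
    return float(P[i] @ v)                    # average over the first step out of i

# Hypothetical star graph: hub 0 with leaves 1, 2, 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)
leaf_to_hub = escape_probability(star, 1, 0)
hub_to_leaf = escape_probability(star, 0, 1)
```

Even though the star graph is undirected, the two directions give different values, since a leaf can only step to the hub while the hub has three choices.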
21
Other Variants
  • Other measures by RWs
    • Commute Time/Hitting Time [Fouss]
    • SimRank [Jeh]
  • Equivalence of Random Walks
    • Electric Networks: EC [Doyle]; SAEC [Faloutsos]; CFEC [Koren]
    • String Systems
    • Katz [Katz], [Huang], [Scholkopf]
    • Matrix-Forest-based Alg [Chebotarev]

All are related to or similar to random walk with restart!
22
Chaptering different measurements
[Diagram relating the measures. Nodes: RWR; Katz; Esc_Prob, Sink; Hitting Time/Commute Time; Effective Conductance; String System. Edge labels: Normalize; relax; X out-degree; 4 ssp decides 1 esc_prob; voltage, position. Groupings: Regularized Un-constrained Quad Opt.; Harmonic Func., Constrained Quad Opt.; Physical Models; Mathematic Tools]
23
Roadmap
  • Motivation
  • Part I: Definitions
    • Basic: RWR
    • Variants
    • Properties
    • Generalizations
  • Part II: Fast Solutions
  • Part III: Applications
  • Conclusion

24
Property: Monotonicity
  • We want

A: degree preserving! [Koren KDD06] [Tong KDD07a] [Tong SDM08]
25
Property: Asymmetry [Tong KDD07 a]
What is Prox from A to B? What is Prox from B to A?
What is Prox between A and B?
26
Asymmetry in un-directed graphs
  • Hanghang's #1 employer is IBM
  • The #1 employee of IBM is ...

[Figure: Hanghang and IBM]
So is love
27
Roadmap
  • Motivation
  • Part I: Definitions
    • Basic: RWR
    • Variants
    • Properties
    • Generalizations
  • Part II: Fast Solutions
  • Part III: Applications
  • Conclusion

28
Group Proximity [Tong KDD07 a]
  • Q: How close are Accountants to SECs?
  • A: Prob (starting at any RED, reaches any GREEN before touching any RED again)

29
Proximity on Attributed Graphs [Tong KDD07 b]
What is the proximity from node 7 to 10?
If we know that
30
A: Augmented graphs [Tong KDD07 b]
31
More on Generalizations
  • Attributes on edges [Chakrabarti KDD 06]
  • Proximity w/ Time
    • [Minkov], [Tong SDM 2008], [Tong CIKM 2008]
  • Proximity w/ Side Information [Tong 2008]

32
Summary of Part I
  • Goal: Summarize multiple relationships
  • Solutions
    • Basic: Random Walk with Restart
      • [Pan 2004] [Sun 2006] [Tong 2006]
    • Properties: Asymmetry, monotonicity
      • [Koren 2006] [Tong 2007] [Tong 2008]
    • Variants: Esc_Prob and many others
      • [Faloutsos 2004] [Koren 2006] [Tong 2007]
    • Generalizations: Group Prox, w/ Attr., w/ Time, w/ Side Information
      • [Chakrabarti 2006] [Tong 2007] [Tong 2008]

33
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
  • Part III: Applications
  • Conclusion

34
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
    • B_Lin: RWR
    • BB_Lin: Skewed BGs
    • FastUpdate: Time-Evolving
  • Part III: Applications
  • Conclusion

35
Preliminary: Sherman-Morrison Lemma
36
SM: The block-form
37
SM Lemma: Applications
  • RLS (Recursive least squares)
    • and almost any algorithm in time series!
  • Leave-one-out cross validation for LSR
  • Kalman filtering
  • Incremental matrix decomposition
  • and all the fast solutions we will introduce!
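
The lemma's formula is not preserved in this transcript. For reference, the standard rank-one statement is inv(A + u vᵀ) = inv(A) - inv(A) u vᵀ inv(A) / (1 + vᵀ inv(A) u), and the sketch below checks it numerically. The function name `sherman_morrison_update` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return inv(A + u v^T) given A_inv = inv(A), in O(n^2) instead of O(n^3)."""
    Au = A_inv @ u                       # inv(A) u
    vA = v @ A_inv                       # v^T inv(A)
    denom = 1.0 + v @ Au
    if abs(denom) < 1e-12:
        raise ValueError("A + u v^T is singular (or nearly so)")
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)) + n * np.eye(n)   # strictly diagonally dominant, so invertible
u = rng.random(n)
v = rng.random(n)
updated = sherman_morrison_update(np.linalg.inv(A), u, v)
direct = np.linalg.inv(A + np.outer(u, v))
```

The point of the lemma for the fast solutions that follow is that a small change to a matrix does not require re-inverting it from scratch.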

38
Computing RWR
[Equation figure: Ranking vector (n x 1) written in terms of the Adjacency matrix (n x n), the Ranking vector (n x 1), the Starting vector (n x 1), and the Restart p]
39
Q: Given query i, how to solve it?
[Equation figure: the same equation with the Starting vector set by the Query and the Ranking vector unknown (?), given the Adjacency matrix]
40
OnTheFly
No pre-computation/ light storage
Slow on-line response
O(mE)
41
PreCompute
[Figure: example graph with nodes 1-12; R; c x Q; Q] [Haveliwala 2002]
42
PreCompute
Fast on-line response
Heavy pre-computation/storage cost
O(n^3) pre-computation, O(n^2) storage
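
The trade-off between the two strategies can be seen on a toy graph. This is only a sketch, assuming the form r = c W r + (1 - c) e with a column-normalized W (the slide's own equation is not preserved); the names `on_the_fly` and `pre_compute` and the 5-node graph are hypothetical.

```python
import numpy as np

def column_normalize(A):
    A = np.asarray(A, dtype=float)
    return A / A.sum(axis=0, keepdims=True)

def on_the_fly(W, i, c=0.9, n_iter=300):
    """No pre-computation: n_iter matrix-vector products for every query."""
    e = np.zeros(W.shape[0])
    e[i] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        r = c * (W @ r) + (1.0 - c) * e
    return r

def pre_compute(W, c=0.9):
    """One O(n^3) inversion up front; all n ranking vectors kept in O(n^2) space."""
    n = W.shape[0]
    return (1.0 - c) * np.linalg.inv(np.eye(n) - c * W)

# Hypothetical 5-node graph with edges 0-1, 0-2, 1-2, 2-3, 3-4.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W = column_normalize(A)
R = pre_compute(W)
slow = on_the_fly(W, 2)     # iterate at query time
fast = R[:, 2]              # read one stored column at query time
```

Both routes give the same ranking vector; they differ only in where the cost is paid, which is the balance the next slides address.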
43
Q: How to balance?
On-line
Off-line
44
B_Lin: Basic Idea [Tong ICDM 2006]
Find Community
Combine
Fix the remaining
45
B_Lin: details
Cross-community
46
B_Lin: details
47
B_Lin: summary
  • Pre-Computational Stage
    • Q
    • A: A few small, instead of ONE BIG, matrix inversions
  • On-Line Stage
    • Q: Efficiently recover one column of Q
    • A: A few, instead of MANY, matrix-vector multiplications
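
The two stages summarized above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the communities are taken as given (the talk's partitioning step is not reproduced), the cross-community part is approximated by a truncated SVD, the pieces are combined with the Woodbury (block Sherman-Morrison) identity, and the RWR form r = c W r + (1 - c) e is assumed. The names `b_lin_precompute` and `b_lin_query` are hypothetical.

```python
import numpy as np

def b_lin_precompute(W, blocks, rank, c=0.9):
    """Invert each community block separately; low-rank fix for cross-community edges.

    `rank` must not exceed the number of nonzero singular values of the
    cross-community part.
    """
    W1 = np.zeros_like(W)
    Q1_inv = np.zeros_like(W)
    for idx in blocks:
        ix = np.ix_(idx, idx)
        W1[ix] = W[ix]                                   # within-community part
        Q1_inv[ix] = np.linalg.inv(np.eye(len(idx)) - c * W[ix])
    W2 = W - W1                                          # cross-community part
    U, s, Vt = np.linalg.svd(W2)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]          # keep the top `rank` terms
    Lam = np.linalg.inv(np.diag(1.0 / s) - c * (Vt @ Q1_inv @ U))
    return Q1_inv, U, Lam, Vt

def b_lin_query(pre, i, c=0.9):
    """Recover one ranking vector with a handful of matrix-vector products."""
    Q1_inv, U, Lam, Vt = pre
    x = Q1_inv[:, i].copy()                              # Q1_inv @ e_i
    return (1.0 - c) * (x + c * (Q1_inv @ (U @ (Lam @ (Vt @ x)))))

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
W = A / A.sum(axis=0, keepdims=True)
pre = b_lin_precompute(W, blocks=[[0, 1, 2], [3, 4, 5]], rank=2)
approx = b_lin_query(pre, 0)
```

In this toy case the cross-community part has rank exactly 2, so the result matches the exact solution; on real graphs a smaller rank trades accuracy for speed.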

48
Query Time vs. Pre-Compute Time
[Figure: Log Query Time vs. Log Pre-compute Time]
  • Quality: 90%
  • On-line
    • Up to 150x speedup
  • Pre-computation
    • Two orders of magnitude saving
49
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
    • B_Lin: RWR
    • BB_Lin: Skewed BGs
    • FastUpdate: Time-Evolving
  • Part III: Applications
  • Conclusion

50
RWR on Bipartite Graph
[Figure: Author-Conf. Matrix with n authors and m conferences]
Observation: n >> m!
Examples:
1. DBLP: 400k authors, 3.5k conferences
2. NetFlix: 2.7M users, 18k movies
51
RWR on Skewed bipartite graphs
  • Q: Given query i, how to solve it?

[Equation figure: block adjacency matrix over n authors and m conferences, with zero diagonal blocks and off-diagonal blocks Ar and Ac; the ranking vector is unknown (?)]
52
BB_Lin: Pre-Computation [Tong ICDM 06]
2-step RWR for Conferences
  • Step 1
  • Step 2
  • Cost
  • Examples
    • NetFlix: 1.5 hr for pre-computation
    • DBLP: a few minutes

[Figure: n authors, m conferences; All Conf-Conf Prox. Scores]
53
BB_Lin: Pre-Computation [Tong ICDM 06]
2-step RWR for Conferences
  • Step 1
  • Step 2

[Figure: n authors, m conferences; All Conf-Conf Prox. Scores]
54
BB_Lin: Pre-Computation [Tong ICDM 06]
2-step RWR for Conferences
  • Step 1
  • Step 2
  • Cost
  • Examples
    • NetFlix: 1.5 hr for pre-computation
    • DBLP: a few minutes

[Figure: All Conf-Conf Prox. Scores form an m x m matrix; Ac/Ar have E edges]
55
BB_Lin: On-Line Stage
[Figure: authors and Conferences]
(Base) Case 1: Conf - Conf
Read out!
56
BB_Lin: On-Line Stage
[Figure: authors and Conferences]
Case 2: Au - Conf
1 matrix-vec!
57
BB_Lin: On-Line Stage
[Figure: authors and Conferences]
Case 3: Au - Au
2 matrix-vec!
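
The three on-line cases follow from eliminating the author side of the RWR equations. The sketch below is an assumption-laden illustration: it takes r = c W r + (1 - c) e on the full (n + m)-node graph with column normalization, writes P for the conference-to-author step and R for the author-to-conference step, and pre-computes the m x m core inv(I - c² R P). The normalization used in the talk may differ, and all function names are hypothetical.

```python
import numpy as np

def bb_lin_precompute(B, c=0.9):
    """B is the n x m author-by-conference matrix (every row and column nonzero)."""
    B = np.asarray(B, dtype=float)
    P = B / B.sum(axis=0, keepdims=True)       # conference -> author step (n x m)
    R = B.T / B.sum(axis=1)                    # author -> conference step (m x n)
    core = np.linalg.inv(np.eye(B.shape[1]) - c * c * (R @ P))   # only m x m
    return P, R, core

def query_conference(pre, j, c=0.9):
    P, R, core = pre
    r_conf = (1.0 - c) * core[:, j]            # Case 1: conf -> conf is a read-out
    r_auth = c * (P @ r_conf)                  # conf -> authors: one matrix-vector
    return r_auth, r_conf

def query_author(pre, i, c=0.9):
    P, R, core = pre
    r_conf = (1.0 - c) * c * (core @ R[:, i])  # Case 2: author -> conf, 1 matrix-vec
    r_auth = c * (P @ r_conf)                  # Case 3: author -> author, 2nd matrix-vec
    r_auth[i] += 1.0 - c                       # restart mass on the query author
    return r_auth, r_conf

# Hypothetical data: 5 authors, 2 conferences.
B = np.array([[1, 0], [1, 1], [0, 1], [2, 1], [1, 0]], dtype=float)
pre = bb_lin_precompute(B)
auth_scores, conf_scores = query_author(pre, 3)
```

Only the m x m core is inverted and stored, which is why the approach pays off when n is much larger than m.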
58
BB_Lin: Examples
400k authors x 3.5k conf.s
2.7M users x 18k movies
59
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
    • B_Lin: RWR
    • BB_Lin: Skewed BGs
    • FastUpdate: Time-Evolving
  • Part III: Applications
  • Conclusion

60
Challenges
  • BB_Lin is good for skewed bipartite graphs
    • for NetFlix (2.7M nodes and 100M edges)
    • On-line cost for a query: a fraction of a second
    • w/ 1.5 hr pre-computation for the m x m core matrix
  • But what if the graph is evolving over time?
    • New edges/nodes arrive; edge weights increase
    • On-line cost: the 1.5 hr itself becomes a part of this!

61
Q: How to update the core matrix?
62
Update the core matrix
  • Step 1
  • Step 2



X
M

Rank 2 update
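
The slide's two steps are not preserved, so the following is only a sketch of why a single edge change can be handled as a rank-2 update. It assumes the core matrix inv(I - c² R P) with column-normalized steps P (conference to author) and R (author to conference); under that assumption, adding weight to one edge changes one column of P and one column of R, so R P changes by a product X Y of an m x 2 and a 2 x m matrix, and the Woodbury identity updates the inverse. The normalization in the talk may differ, and the function names are hypothetical.

```python
import numpy as np

def core_matrix(B, c=0.9):
    """m x m core matrix inv(I - c^2 R P), recomputed from scratch."""
    P = B / B.sum(axis=0, keepdims=True)       # conference -> author step (n x m)
    R = B.T / B.sum(axis=1)                    # author -> conference step (m x n)
    return np.linalg.inv(np.eye(B.shape[1]) - c * c * (R @ P))

def fast_single_update(core, B_old, i, j, w, c=0.9):
    """Update the core after weight w > 0 is added to edge (author i, conf j)."""
    B_new = B_old.copy()
    B_new[i, j] += w
    # Only column j of P and column i of R change.
    dp = B_new[:, j] / B_new[:, j].sum() - B_old[:, j] / B_old[:, j].sum()
    dr = B_new[i] / B_new[i].sum() - B_old[i] / B_old[i].sum()
    R_old = B_old.T / B_old.sum(axis=1)
    P_old_row_i = B_old[i] / B_old.sum(axis=0)
    e_j = np.zeros(B_old.shape[1])
    e_j[j] = 1.0
    # New R P = old R P + X @ Y: a rank-2 change.
    X = np.column_stack([R_old @ dp + dr * dp[i], dr])   # m x 2
    Y = np.vstack([e_j, P_old_row_i])                    # 2 x m
    small = np.linalg.inv(np.eye(2) - c * c * (Y @ core @ X))   # 2 x 2 inverse only
    return core + c * c * (core @ X @ small @ Y @ core), B_new

# Hypothetical data: 5 authors, 2 conferences; author 0 gains a paper in conf 1.
B = np.array([[1, 0], [1, 1], [0, 1], [2, 1], [1, 0]], dtype=float)
core = core_matrix(B)
updated, B_new = fast_single_update(core, B, i=0, j=1, w=1.0)
```

The update inverts a 2 x 2 matrix instead of an m x m one, which is where the saving over re-running the pre-computation comes from.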


63
Update: General Case
  • E edges changed
  • Involves n authors, m confs.
  • Observation

[Figure: n authors, m Conferences]
64
Update: General Case
  • Observation
    • the rank of update is small!
    • Real Example (DBLP Post)
      • 1258 time steps
      • E up to 20,000!
      • min(n, m) < 132
  • Our Algorithm

[Figure: m Conferences]
65
Fast-Single-Update
[Figure: log(Time) (Seconds) across Datasets; our method is annotated with 176x and 40x speedups]
66
Fast-Batch-Update
[Figure: two panels of Time (Seconds) for our method, one against Min (n, m) and one against E]
15x speed-up on average!
67
More on Fast Solutions
  • FastAllDAP
    • Simultaneously solve multiple linear systems [Tong KDD 2007 a]
  • MT3
    • Multiple-Resolution Analysis on Time [Tong CIKM 2008]
  • Fast-ProSIN
    • On-line response for users' feedback [Tong 2008]

68
Summary of Part II
  • Goal: Efficiently Solve Linear System(s)
  • Solutions
    • B_Lin: one large linear system [Tong ICDM06]
    • BB_Lin: the intrinsic complexity is small [Tong ICDM06]
    • FastUpdate: dynamic linear system [Tong SDM08]
    • FastAllDAP: multiple linear systems [Tong KDD07 a]
    • MT3 [Tong CIKM 2008]
    • Fast-ProSIN [Tong 2008]

69
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
  • Part III: Applications
  • Conclusion

70
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
  • Part III: Applications
    • Link Prediction
    • Ranking Related Tasks
    • User Specific Patterns
    • Time Related Tasks
  • Conclusion

71
Link Prediction: existence
[Figure: two density histograms of Prox(i→j) + Prox(j→i), one for a set of deleted links and one for a set of absent links]
Prox. is effective at telling deleted edges apart from absent edges!
Q: How to predict the existence of the link?
A: Proximity! [Liben-Nowell 2003]
72
Link Prediction: direction [Tong KDD 07 a]
  • Q: Given the existence of the link, what is the direction of the link?
  • A: Compare Prox(i→j) and Prox(j→i)

>70%
[Figure: density histogram of Prox(i→j) - Prox(j→i)]
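
A comparison of the two directed proximities can be sketched as follows. The slide only says to compare Prox(i→j) and Prox(j→i); the decision rule here (predict the direction with the larger proximity), the use of plain RWR as the proximity, and the names `proximity_matrix` and `predict_direction` are assumptions made for illustration.

```python
import numpy as np

def proximity_matrix(A, c=0.9):
    """Q[j, i] = RWR score of node j for query i, i.e. Prox(i -> j).

    A[j, i] is the weight of edge i -> j; every node needs an outgoing edge.
    """
    A = np.asarray(A, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)
    return (1.0 - c) * np.linalg.inv(np.eye(A.shape[0]) - c * W)

def predict_direction(Q, i, j):
    """Return (source, target) for the direction with the larger proximity."""
    forward, backward = Q[j, i], Q[i, j]
    return (i, j) if forward >= backward else (j, i)

# Hypothetical star graph: hub 0 with leaves 1, 2, 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)
Q = proximity_matrix(star)
direction = predict_direction(Q, 1, 0)
```

On the star graph the leaf-to-hub proximity exceeds the hub-to-leaf proximity, so the rule points from the leaf to the hub.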
73
Beyond Link Prediction
  • Collaborative Filtering [Fouss]
  • Name Disambiguation [Minkov SIGIR 06]
  • Anomaly Nodes/Edges [Sun ICDM 2005]
    • a is abnormal if the neighborhood of a is so different

74
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
  • Part III: Applications
    • Link Prediction
    • Ranking Related Tasks
    • User Specific Patterns
    • Time Related Tasks
  • Conclusion

75
Neighborhood Search on graphs
[Figure: bipartite graph of Conference and Author nodes]
Q: What is the most related conference to ICDM?
A: Proximity! [Sun ICDM 2005]
76
NF example
77
gCaP: Automatic Image Caption
  • Q
[Figure: test image with candidate keywords Sea, Sun, Sky, Wave, and an unknown caption (?)]
A: Proximity! [Pan KDD 2004]
78
[Figure: graph with Region, Image, Test Image, and Keyword nodes]
79
[Figure: the same graph with Region, Image, Test Image, and Keyword nodes (Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass); the test image is captioned Grass, Forest, Cat, Tiger]
80
C-DEM: Multi-Modal Query System for Drosophila Embryo Databases [Fan VLDB 2008]
81
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
  • Part III: Applications
    • Link Prediction
    • Ranking Related Tasks
    • User Specific Patterns
    • Time Related Tasks
  • Conclusion

82
Center-Piece Subgraph (CePS)
[Figure: Input (Original Graph) and Output (CePS), with the CePS guy marked]
Q: How to find the hub for the black nodes?
A: Proximity! [Tong KDD 2006]
Red = Max (Prox(A, Red) x Prox(B, Red) x Prox(C, Red))
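
The product rule above can be sketched directly: score every candidate node by the product of its proximities from each query node and take the maximum. This is a minimal illustration only; it uses plain RWR as the proximity and excludes the query nodes themselves from the candidates, both of which are assumptions, and it finds a single center node rather than extracting the full subgraph. The name `center_piece` and the toy graph are hypothetical.

```python
import numpy as np

def proximity_matrix(A, c=0.9):
    """Q[t, q] = RWR score of node t for query node q."""
    A = np.asarray(A, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)
    return (1.0 - c) * np.linalg.inv(np.eye(A.shape[0]) - c * W)

def center_piece(A, queries, c=0.9):
    """Node maximizing the product of proximities from all query nodes (AND query)."""
    Q = proximity_matrix(A, c)
    score = np.prod(Q[:, list(queries)], axis=1)   # one factor per query node
    score[list(queries)] = -np.inf                 # do not return a query node
    return int(np.argmax(score)), score

# Hypothetical graph: hub 0 linked to 1, 2, 3; node 4 hangs off node 3.
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
center, score = center_piece(A, queries=[1, 2, 3])
```

With queries 1, 2, 3 the hub 0 wins over the peripheral node 4, because it is close to all three queries at once.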
83
CePS Example
84
K_SoftAnd: Relaxation of AND
Disconnected Communities
Noise
  • Asking AND query? → No Answer!

85
CePS 2 SoftAND
DB
Stat.
86
Best-Effort Pattern Match
[Figure: Input (Query Graph and Data Graph) and Output (Matching Subgraph)]
Q: How to find the matching subgraph?
A: Proximity! [Tong KDD 2007 b]
87
G-Ray: How to?
[Figure: query graph with four matching nodes]
Goodness = Prox (12, 4) x Prox (4, 12) x
           Prox (7, 4) x Prox (4, 7) x
           Prox (11, 7) x Prox (7, 11) x
           Prox (12, 11) x Prox (11, 12)
88
Effectiveness: star-query
Query
Result
89
Effectiveness: line-query
Query
Result
90
Effectiveness: loop-query
Query
Result
91
Roadmap
  • Motivation
  • Part I: Definitions
  • Part II: Fast Solutions
  • Part III: Applications
    • Link Prediction
    • Ranking Related Tasks
    • User Specific Patterns
    • Time Related Tasks
  • Conclusion

92
Challenge
  • Graphs are evolving over time!
    • New nodes/edges show up
    • Existing nodes/edges die out
    • Edge weights change

Q: How to generalize everything?
A: Track Proximity! [Tong SDM 2008]
93
pTrack/cTrack: Trend analysis on graph level
[Figure: Rank of Influential-ness vs. Year for T. Sejnowski, G. Hinton, C. Koch, M. Jordan]
94
pTrack: Philip S. Yu's Top-5 conferences up to each year
DBLP (Au. x Conf.): 400k authors, 3.5k conferences, 20 yrs
[Figure: timeline annotated "Databases, Performance, Distributed Sys." and "Databases, Data Mining"]
95
KDD's Rank wrt. VLDB over years
[Figure: Prox. Rank vs. Year]
Data Mining and Databases are more and more relevant!
96
cTrack: 10 most influential authors in NIPS community up to each year
[Figure: ranking over the years, including T. Sejnowski and M. Jordan]
Author-paper bipartite graph from NIPS 1987-1999. 3k. 1740 papers, 2037 authors, spreading over 13 years
97
T3: Understand Time in Complex Context [Tong CIKM 2008]
[Figure: Input and Output]
  • Time Cluster: rep. entities b7, b6, b8
  • Abnormal Time: rep. entities b5, b4
  • Time Cluster: rep. entities b3, b2, b1
98
T3: Time-to-Time Proximity Matrix
99
More Applications
  • Clustering
    • Proximity as input [Ding KDD 2007]
  • Email management [Minkov CEAS 06]
  • Business Process Management [Qu 2008]
  • ProSIN
    • Listen to clients' comments [Tong 2008]
  • TANGENT
    • Broaden Users' Horizon [Oonuma & Tong 2008]
  • Ghost Edge
    • Within Network Classification [Gallagher & Tong KDD08 b]

100
[Summary diagram]
Applications: Use Proximity as Building block
Computations: Efficiently Solve Linear System(s)
  • MT3 [Tong 2008]
  • Fast-ProSIN [Tong 2008]
  • FastUpdate [Tong 2008]
  • FastAllDAP [Tong 2007]
  • BB_Lin [Tong 2006]
  • B_Lin [Tong 2006]
Definitions: Weighted Multiple Relationship (Proximity On Graphs)
  • RWR [Pan 2004] [Sun 2006] [Tong 2006]
  • Properties [Koren 2006] [Tong 2007, 2008]
  • Variants [Faloutsos 2004] [Koren 2006] [Tong 2007]
  • Generalizations [Chakrabarti 2006] [Tong 2007, 2008]
101
Take-home messages
  • Proximity Definitions
    • RWR
    • and a lot of variants
  • Computations
    • Find out smoothness
    • SM Lemma
  • Applications
    • Proximity as a building block

102
References
  • L. Page, S. Brin, R. Motwani, T. Winograd. (1998) The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Library.
  • T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW, 517-526, 2002.
  • J.Y. Pan, H.J. Yang, C. Faloutsos, P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD, 653-658, 2004.
  • C. Faloutsos, K. S. McCurley, A. Tomkins. (2004) Fast discovery of connection subgraphs. In KDD, 118-127, 2004.
  • J. Sun, H. Qu, D. Chakrabarti, C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, 418-425, 2005.
  • W. Cohen. (2007) Graph Walks and Graphical Models. Draft.

103
References
  • P. Doyle, J. Snell. (1984) Random walks and electric networks, volume 22. Mathematical Association of America, New York.
  • Y. Koren, S. C. North, C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245-255, 2006.
  • A. Agarwal, S. Chakrabarti, S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14-23, 2006.
  • S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW, 571-580, 2007.
  • F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355-369, 2007.

104
References
  • H. Tong, C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413, 2006.
  • H. Tong, C. Faloutsos, J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622, 2006.
  • H. Tong, Y. Koren, C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756, 2007.
  • H. Tong, B. Gallagher, C. Faloutsos, T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007.
  • H. Tong, S. Papadimitriou, P.S. Yu, C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. To appear in SDM 2008.

105
References
  • B. Gallagher, H. Tong, T. Eliassi-Rad, C. Faloutsos. Using Ghost Edges for Classification in Sparsely Labeled Networks. KDD 2008.
  • H. Tong, Y. Sakurai, T. Eliassi-Rad, C. Faloutsos. Fast Mining of Complex Time-Stamped Events. CIKM 2008.
  • H. Tong, H. Qu, H. Jamjoom. Measuring Proximity on Graphs with Side Information. Submitted.
  • K. Oonuma, H. Tong, C. Faloutsos. TANGENT: A Novel, Surprise-me, Recommendation Algorithm. Submitted.

106
  • Thank you!
  • htong_at_cs.cmu.edu
  • www.cs.cmu.edu/htong