Title: Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
1 Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
- Yizhou Sun, Yintao Yu, and Jiawei Han
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- 9/1/2009
2 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
3 Homogeneous vs. Heterogeneous Networks
- Information networks are ubiquitous
- Homogeneous networks
- Collaboration networks, friendship networks, citation networks, and so on
- Usually converted from a heterogeneous network
- Heterogeneous networks
- Bibliographic networks, movie networks, tagging networks, and so on
- Represent the real relations
4 Why Clustering on Heterogeneous Information Networks?
- Why clustering on heterogeneous networks?
- Understand the hidden structure
- Understand the role each object plays in the network
- Existing work
- Clustering on homogeneous networks
- SimRank: clustering methods on a homogeneous network
- Time consuming
- Meaning of similarity becomes controversial
- RankClus (EDBT'09)
- Clustering on one type of objects
- Experiments are on two-typed heterogeneous networks
5 Better and More Efficient Clustering
- Motivation 1: generate clusters that are
- More meaningful
- Propose a new definition of cluster, called net-cluster, which follows the schema of the original network and is comprised of different types of objects
- More understandable
- Provide ranking information for each type of objects in each cluster
- Motivation 2: provide an efficient algorithm
- NetClus: linear in the number of links in the network
6 SubNetwork-Clusters: An Illustration
- Database net-cluster of a bibliographic network (figure showing conference, author, and term types)
7 NetClus Methodology: An Illustration
- Split a network into different layers, each represented by a net-cluster
8 Outline
- Background and Motivation
- Preliminaries
- Star Network Schema
- Ranking Functions
- Net-Cluster Definition
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
9 Star Network Schema
- Addresses a specific type of heterogeneous networks
- Special type: star network schema
- Center type = target type
- E.g., a paper, a movie, a tagging event
- A center object is a co-occurrence of a bag of different-typed objects, and stands for a multi-relation among different types of objects
- Surrounding types = attribute types
13 Ranking Functions
- Ranking objects in a network, denoted as p(x | T_x, G)
- Gives a score to each object according to its importance
- Different rules define different ranking functions
- Simple ranking
- Ranking score is assigned according to the degree of an object
- Authority ranking
- Ranking score is assigned according to the mutual enhancement by propagating scores through links
- E.g., according to the rules that (1) highly ranked conferences accept many good papers published by many highly ranked authors, and (2) highly ranked authors publish many good papers in highly ranked conferences
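The two ranking rules above can be sketched on a toy bipartite author-conference network. The link matrix `W` and all scores are hypothetical; authority ranking is shown as the kind of mutual-reinforcement iteration the slide describes, not the paper's exact update equations.

```python
import numpy as np

# Toy author-conference link matrix: W[i][j] = number of papers
# author i published in conference j (hypothetical data).
W = np.array([[3.0, 0.0],
              [2.0, 1.0],
              [0.0, 4.0]])

def simple_ranking(degrees):
    """Simple ranking: score proportional to an object's degree."""
    d = np.asarray(degrees, dtype=float)
    return d / d.sum()

def authority_ranking(W, iters=50):
    """Authority ranking sketch: author and conference scores
    mutually enhance each other through links, normalized each
    round so they stay probability distributions."""
    n_auth, n_conf = W.shape
    auth = np.ones(n_auth) / n_auth
    conf = np.ones(n_conf) / n_conf
    for _ in range(iters):
        conf = W.T @ auth       # conferences gather author scores
        conf /= conf.sum()
        auth = W @ conf         # authors gather conference scores
        auth /= auth.sum()
    return auth, conf

author_scores, conf_scores = authority_ranking(W)
```

With simple ranking, only degree matters; with authority ranking, publishing in a highly ranked conference raises an author's score even at equal degree.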
14 Ranking Function (Cont.)
- Ranking distribution
- Normalize ranking scores to sum to 1, giving them a probabilistic meaning
- Similar to the idea of PageRank
- Priors can be added
- P_P(X | T_x, G_k) = (1 - λ_P) P(X | T_x, G_k) + λ_P P_0(X | T_x, G_k)
- P_0(X | T_x, G_k) is the prior knowledge, usually given as a distribution specified by only several words
- λ_P is the parameter expressing how much we believe in the prior distribution
15 Net-Cluster
- Given an information network G, a net-cluster C contains two sorts of information
- Topology: node set and link set as a sub-network of G
- Statistics: a membership indicator for each node x, P(x | C)
- Given an information network G and a cluster number K, a net-clustering for G, C = {C_1, C_2, ..., C_K}, is defined such that
- Each C_k is a net-cluster of G
16 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Framework of NetClus
- Net-Cluster Generative Model
- Posterior Probability Estimation (PPE)
- Impact of Ranking Functions
- Experiments
- Conclusions and Future Work
17 Framework of NetClus
- General idea: map each target object into a new low-dimensional feature space according to the current net-clustering, and adjust the clustering further in the new measure space
- Step 0: generate initial random clusters
- Step 1: build a ranking-based generative model for target objects for each net-cluster
- Step 2: calculate posterior probabilities for target objects, which serve as the new measure, and assign each target object to the nearest cluster accordingly
- Step 3: repeat Steps 1 and 2 until the clusters do not change significantly
- Step 4: calculate posterior probabilities for attribute objects in each net-cluster
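The iterative loop above can be sketched on a toy corpus. This is a simplified stand-in, not the full NetClus model: the per-cluster "generative model" here is just a smoothed within-cluster term distribution, the documents and term ids are made up, and the initial labels are fixed instead of random so the run is reproducible.

```python
import numpy as np

# Toy corpus: each "target object" (paper) is a bag of attribute
# object ids (terms). Hypothetical data for illustration only.
docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3], [0, 1], [3, 3]]
n_terms, K = 4, 2

# Step 0: initial hard clustering of target objects (fixed here
# instead of random, for reproducibility).
labels = np.array([0, 1, 0, 1, 0, 1])

for _ in range(20):
    # Step 1: per-cluster generative model -- a smoothed
    # within-cluster term distribution stands in for the
    # conditional ranking distribution.
    model = np.ones((K, n_terms))            # add-one smoothing
    for d, k in zip(docs, labels):
        for t in d:
            model[k, t] += 1
    model /= model.sum(axis=1, keepdims=True)

    # Step 2: score each target object under each cluster model
    # and reassign it to its most likely cluster.
    new_labels = np.array([
        int(np.argmax([sum(np.log(model[k, t]) for t in d)
                       for k in range(K)]))
        for d in docs
    ])

    # Step 3: stop when the clustering no longer changes.
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
```

On this toy data the loop converges in a few rounds, separating the {0,1}-term documents from the {2,3}-term documents.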
18 Generative Model for Target Objects Given a Net-Cluster
- Recall that each target object stands for a co-occurrence of a bag of attribute objects
- Define the probability of a target object as the probability of the co-occurrence of all its associated attribute objects
- Generative probability for target object d in cluster G_k: p(d | G_k) = ∏_{x in d} p(x | T_x, G_k) · p(T_x | G_k)
- where p(x | T_x, G_k) is the ranking distribution and p(T_x | G_k) is the type probability
- Two independence assumptions
- The probabilities to visit objects of different types are independent of each other
- The probabilities to visit two objects within the same type are independent of each other
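A minimal sketch of this generative probability, under the two independence assumptions: the ranking distributions, type probabilities, author/conference/term names, and the example paper are all made up for illustration.

```python
# Hypothetical cluster-conditional distributions for a tiny
# bibliographic example: a ranking distribution per type, and
# type-visit probabilities p(T_x | G_k).
ranking = {
    "author": {"alice": 0.6, "bob": 0.4},
    "conf":   {"vldb": 0.7, "sigir": 0.3},
    "term":   {"query": 0.5, "index": 0.3, "web": 0.2},
}
type_prob = {"author": 0.3, "conf": 0.2, "term": 0.5}

def p_target(paper):
    """p(d | G_k): product over the paper's attribute objects of
    ranking probability times type probability. Independence lets
    the joint probability factor into this product."""
    p = 1.0
    for obj_type, obj in paper:
        p *= ranking[obj_type][obj] * type_prob[obj_type]
    return p

# A paper as a bag of typed attribute objects (repeats allowed).
paper = [("author", "alice"), ("conf", "vldb"),
         ("term", "query"), ("term", "index")]
```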
19 PPE: Smoothing and Background Generative Model
- Smoothing on ranking distributions for each type of objects in each net-cluster
- Smooth each conditional ranking distribution with the global ranking distribution
- P_S(X | T_x, G_k) = (1 - λ_S) P(X | T_x, G_k) + λ_S P(X | T_x, G)
- Goal: avoid zero probabilities for some objects
- Background generative model (BG)
- The probability to generate target object d in the original network: P(d | G)
- Target objects that are not highly related to any specific cluster should have high probability in the background model
20 PPE: Posterior Probability Estimation for Target Objects
- Now we have K net-clusters, corresponding to K generative models, plus a background model
- Given p(d | G_1), p(d | G_2), ..., p(d | G_K), and p(d | G), what are the posterior probabilities p(k | d), k = 1, 2, ..., K, K+1?
- Estimation solution
- Maximize the log-likelihood of the whole collection: log L = Σ_d log Σ_k p(z = k) p(d | G_k)
- Use the EM algorithm to estimate the best p(z = k)
- Hidden variable: the cluster z that generates each target object
- Iterative formulas: p(z = k | d) ∝ p(z = k) p(d | G_k), and p(z = k) is reestimated as the average of p(z = k | d) over all target objects
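The EM iteration above can be sketched directly: the per-component likelihood matrix `L` below (K clusters plus one background column) uses made-up numbers, and only the mixture weights p(z = k) are free, matching the slide's setup where the generative models are fixed.

```python
import numpy as np

# Fixed likelihoods p(d | G_k) for each target object (rows);
# columns = K = 2 net-clusters plus one background model.
# Toy numbers for illustration.
L = np.array([[0.08, 0.01, 0.02],
              [0.01, 0.09, 0.02],
              [0.07, 0.02, 0.02],
              [0.02, 0.08, 0.02]])
n_docs, n_comp = L.shape

# Start from uniform mixture weights p(z = k).
pz = np.full(n_comp, 1.0 / n_comp)
for _ in range(100):
    # E-step: posterior p(k | d) proportional to p(z = k) p(d | G_k).
    post = L * pz
    post /= post.sum(axis=1, keepdims=True)
    # M-step: reestimate p(z = k) as the average posterior.
    pz = post.mean(axis=0)
```

After convergence, `post[d]` is the posterior membership vector used as the target object's new feature representation.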
21 PPE: Posterior Probability Estimation for Attribute Objects
- Posterior probabilities for attribute objects are only needed once the sub-networks of the net-clustering are stable (Step 4)
- Aim: calculate the membership for each attribute object
- Solution: use the target object information for each attribute object, i.e., the average of its linked target objects' memberships
- E.g., a conference's membership indicator is the percentage of its papers in each cluster
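A minimal sketch of this averaging step; the paper ids and posterior values are hypothetical:

```python
def attribute_posterior(paper_posteriors, linked_papers):
    """Membership of an attribute object = average membership of
    the target objects linked to it. paper_posteriors maps each
    paper id to its posterior vector [p(k | d) for each cluster k];
    linked_papers lists the paper ids linked to this object."""
    K = len(next(iter(paper_posteriors.values())))
    avg = [0.0] * K
    for pid in linked_papers:
        for k, p in enumerate(paper_posteriors[pid]):
            avg[k] += p / len(linked_papers)
    return avg

# E.g., a conference linked to three papers (toy posteriors):
posteriors = {1: [0.9, 0.1], 2: [0.8, 0.2], 3: [0.1, 0.9]}
conf_membership = attribute_posterior(posteriors, [1, 2, 3])
```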
22 Cluster Adjustment
- Use posterior probabilities of target objects as the new feature space
- Each target object -> a K-dimensional vector
- Each net-cluster center -> a K-dimensional vector
- Average over the objects in the cluster
- Assign each target object to the nearest cluster center (e.g., by cosine similarity)
- A sub-network corresponding to the new net-cluster is then built
- by extracting all the target objects in that cluster and all the linked attribute objects
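The adjustment step above can be sketched as follows; the posterior vectors and current labels are made-up toy values, and cosine similarity is used as the slide suggests:

```python
import numpy as np

def assign_to_nearest_center(posteriors, labels, K):
    """Treat each target object's K posterior probabilities as its
    feature vector, average vectors per cluster to get centers,
    then reassign each object to the center with the highest
    cosine similarity."""
    X = np.asarray(posteriors, dtype=float)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Row-normalize both sides so dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return (Xn @ Cn.T).argmax(axis=1)

posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = np.array([0, 0, 1, 1])
new_labels = assign_to_nearest_center(posteriors, labels, K=2)
```

(For brevity the sketch assumes every cluster is non-empty; a real implementation would also rebuild the sub-network from each cluster's objects and their linked attribute objects.)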
23 Time Complexity Analysis
- Global ranking for attribute objects: O(t1 |E|)
- During each iteration
- Conditional ranking: O(t1 |E|)
- Conditional probabilities for target objects: O(|E|)
- Posteriors for target objects: O(t2 (K+1) |N|)
- Cluster adjustment: O(K^2 |N|)
- Posteriors for attribute objects: O(K |E|)
- In all: O(|E|) for fixed K
24 Impact of Ranking Functions
- Which ranking function is better?
- For a simple 3-type star network on object set {X, Y, Z}, where Z is the center type
- The joint probability estimated under simple ranking has an error of I(X, Y), the mutual information between X and Y
- The joint probability estimated under authority ranking gives the best estimation of the propagation matrix between X and Y under the Frobenius norm
25 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
26 Experiments
- Data sets
- DBLP all-area data set
- All conferences
- Top 50K authors
- DBLP four-area data set
- 20 conferences from DB, DM, ML, and IR
- All authors from these conferences
- All papers published in these conferences
- Running case illustration
27 NetClus: Database System Cluster
- Top authors: Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Åke Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185
- Top terms: database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707
- Top conferences: VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849
- Ranking authors in XML
28 Clustering and Ranking Performance along Iterations
29 Parameter Study: Parameter Setting
- The prior parameter is relatively stable when it is > 0.4; the bigger, the better
- The smoothing parameter is relatively stable; the smaller, the better (except at no smoothing)
30 Accuracy Study: Experiments
- Accuracy study, comparing with
- PLSA, a pure text model; no other types of objects or links are used; uses the same prior as NetClus
- RankClus, a clustering method on a bi-typed network, clustering only one type of objects
31 NetClus: Distinguishing Conferences
- Each row lists a conference's posterior probabilities over the clusters
- AAAI: 0.0022667, 0.00899168, 0.934024, 0.0300042, 0.0247133
- CIKM: 0.150053, 0.310172, 0.00723807, 0.444524, 0.0880127
- CVPR: 0.000163812, 0.00763072, 0.931496, 0.0281342, 0.032575
- ECIR: 3.47023e-05, 0.00712695, 0.00657402, 0.978391, 0.00787288
- ECML: 0.00077477, 0.110922, 0.814362, 0.0579426, 0.015999
- EDBT: 0.573362, 0.316033, 0.00101442, 0.0245591, 0.0850319
- ICDE: 0.529522, 0.376542, 0.00239152, 0.0151113, 0.0764334
- ICDM: 0.000455028, 0.778452, 0.0566457, 0.113184, 0.0512633
- ICML: 0.000309624, 0.050078, 0.878757, 0.0622335, 0.00862134
- IJCAI: 0.00329816, 0.0046758, 0.94288, 0.0303745, 0.0187718
- KDD: 0.00574223, 0.797633, 0.0617351, 0.067681, 0.0672086
- PAKDD: 0.00111246, 0.813473, 0.0403105, 0.0574755, 0.0876289
- PKDD: 5.39434e-05, 0.760374, 0.119608, 0.052926, 0.0670379
- PODS: 0.78935, 0.113751, 0.013939, 0.00277417, 0.0801858
- SDM: 0.000172953, 0.841087, 0.058316, 0.0527081, 0.0477156
- SIGIR: 0.00600399, 0.00280013, 0.00275237, 0.977783, 0.0106604
- SIGMOD Conference: 0.689348, 0.223122, 0.0017703, 0.00825455, 0.0775055
- VLDB: 0.701899, 0.207428, 0.00100012, 0.0116966, 0.0779764
- WSDM: 0.00751654, 0.269259, 0.0260291, 0.683646, 0.0135497
32 Case Study: DBLP
- All-area data set
- K = 8
- An XML net-cluster derived from the database net-cluster
33 NetClus: KDD Field
- Top terms: mining 0.0790963, data 0.0509959, association 0.0424484, frequent 0.0413659, rule 0.0223015, pattern 0.0221282, based 0.012448, clustering 0.00915418, efficient 0.00870164, databases 0.00654573, rules 0.00638362, web 0.00618587, approach 0.00558388, patterns 0.00546508, time 0.00532743, discovery 0.00520791, queries 0.00512735, large 0.00505302, algorithm 0.00495221, classification 0.00477521
- Top authors: Philip S. Yu 0.00984668, Jiawei Han 0.0080883, Charu C. Aggarwal 0.00688184, Christos Faloutsos 0.00534601, Wei Wang 0.0039633, Hans-Peter Kriegel 0.0036941, Rakesh Agrawal 0.00352178, Jian Pei 0.00352033, Nick Koudas 0.00326135, Heikki Mannila 0.00302283, Eamonn J. Keogh 0.00285453, Haixun Wang 0.00277766, Divesh Srivastava 0.00275084, Beng Chin Ooi 0.00270741, Ming-Syan Chen 0.00252245, Johannes Gehrke 0.00248227, Mohammed Javeed Zaki 0.0024233, Ke Wang 0.00237186, Yufei Tao 0.00234508, H. V. Jagadish 0.0023317
- Top conferences: ICDE 0.193106, KDD 0.177786, SIGMOD Conf. 0.116497, VLDB 0.112015, ICDM 0.0968135
34 NetClus: ML
- Top terms: learning 0.0785149, recognition 0.0616076, pattern 0.0569329, machine 0.0210515, based 0.012122, knowledge 0.0062703, model 0.00563725, system 0.00538452, approach 0.00534144, reasoning 0.00518959, models 0.00482448, data 0.00428022, analysis 0.00427453, planning 0.00416088, search 0.00414499, systems 0.00407711, logic 0.00371819, multi 0.00349816, algorithm 0.0034679, classification 0.00321972
- Top authors: Richard E. Korf 0.00299098, Craig Boutilier 0.00246557, Tuomas Sandholm 0.00244961, Judea Pearl 0.00242606, Hector J. Levesque 0.00234726, Yoav Shoham 0.00230554, Kenneth D. Forbus 0.00211045, Rina Dechter 0.00208683, Stuart J. Russell 0.00188014, Johan de Kleer 0.00187524, Toby Walsh 0.00186112, Benjamin Kuipers 0.00185742, Subbarao Kambhampati 0.00175271, Peter Stone 0.00170711, Kurt Konolige 0.00170513, James P. Delgrande 0.00167945, Joseph Y. Halpern 0.00164386, Jeffrey S. Rosenschein 0.00161199, Brian C. Williams 0.00157864, Daniel S. Weld 0.00156658
- Top conferences: IJCAI 0.427665, AAAI 0.403056, ICML 0.0899892, ECML 0.0245488, CVPR 0.0229665
35 NetClus: IR
- Top terms: retrieval 0.0833119, information 0.0777979, text 0.0689247, search 0.0306999, web 0.0145188, based 0.0143753, document 0.00950089, query 0.00783011, system 0.0064804, classification 0.00618953, model 0.00614568, language 0.00540877, data 0.00517338, learning 0.0050341, analysis 0.00480311, approach 0.00462792, models 0.0046184, clustering 0.00460905, documents 0.00453735, user 0.00449431
- Top authors: W. Bruce Croft 0.0141826, James Allan 0.00630046, Norbert Fuhr 0.00547785, ChengXiang Zhai 0.00493936, James P. Callan 0.00481386, C. J. van Rijsbergen 0.00471779, Ellen M. Voorhees 0.00467488, Gerard Salton 0.00462283, Mark Sanderson 0.00437391, K. L. Kwok 0.00427169, Chris Buckley 0.00404819, Abraham Bookstein 0.00383661, Justin Zobel 0.00374904, Tetsuya Sakai 0.0035631, Yiming Yang 0.0034947, Donna Harman 0.00335238, Clement T. Yu 0.00330327, Alistair Moffat 0.003292, Ian Soboroff 0.00324063, Nicholas J. Belkin 0.00313201
- Top conferences: SIGIR 0.638595, CIKM 0.14482, ECIR 0.0726454, WWW 0.0366223, KDD 0.015487
36 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
37 Conclusions and Future Work
- A new kind of cluster, the net-cluster, is proposed for heterogeneous information networks comprised of multiple types of objects
- An effective and efficient algorithm, NetClus, is proposed that detects net-clusters in a star network with an arbitrary number of types
- Experiments on real data sets show that our algorithm gives quite reasonable clustering and ranking results, with clustering accuracy much higher than the baseline methods
- See our iNextCube system demo at VLDB'09
- http://inextcube.cs.uiuc.edu/DBLP
- http://inextcube.cs.uiuc.edu/netclus
- Future work
- How to automatically set the cluster number K?
- How to select a good ranking function in a complex network?
38 Q & A