Title: Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
1 Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
- Yizhou Sun, Yintao Yu, and Jiawei Han
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- 9/1/2009
2 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
3 Homogeneous vs. Heterogeneous Networks
- Information networks are ubiquitous
- Homogeneous networks
- Collaboration networks, friendship networks, citation networks, and so on
- Usually converted from a heterogeneous network
- Heterogeneous networks
- Bibliographic networks, movie networks, tagging networks, and so on
- Represent the real relations
4 Why Clustering on Heterogeneous Information Networks?
- Why clustering on heterogeneous networks?
- Understand the hidden structure
- Understand the role each object plays in the network
- Existing work
- Clustering on homogeneous networks
- SimRank: clustering methods on a homogeneous network
- Time consuming
- Meaning of similarity becomes controversial
- RankClus (EDBT'09)
- Clustering on one type of objects
- Experiments are on two-typed heterogeneous networks
5 Better and More Efficient Clustering
- Motivation 1: generate clusters that are
- More meaningful
- Propose a new definition of cluster, called net-cluster, which follows the schema of the original network and is comprised of different types of objects
- More understandable
- Provide ranking information for each type of objects in each cluster
- Motivation 2: provide an efficient algorithm
- NetClus: linear in the number of links in the network
6 SubNetwork-Clusters: An Illustration
- Database net-cluster of a bibliographic network (figure showing conference, author, and term types)
7 NetClus Methodology: An Illustration
- Split a network into different layers, each represented by a net-cluster
8 Outline
- Background and Motivation
- Preliminaries
- Star Network Schema
- Ranking Functions
- Net-Cluster Definition
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
9 Star Network Schema
- Addresses a specific type of heterogeneous networks
- Special type: star network schema
- Center type = target type
- E.g., a paper, a movie, a tagging event
- A center object is a co-occurrence of a bag of different-typed objects, and stands for a multi-relation among different types of objects
- Surrounding types = attribute types
13 Ranking Functions
- Ranking objects in a network, denoted as p(x | T_x, G)
- Gives a score to each object according to its importance
- Different rules define different ranking functions
- Simple ranking
- Ranking score is assigned according to the degree of an object
- Authority ranking
- Ranking score is assigned according to the mutual enhancement by propagating scores through links
- E.g., according to the rules that (1) highly ranked conferences accept many good papers published by many highly ranked authors, and (2) highly ranked authors publish many good papers in highly ranked conferences
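The two ranking rules above can be sketched on a toy bipartite author-conference network. The link matrix `W` and all scores are hypothetical; authority ranking is shown as the kind of mutual-reinforcement iteration the slide describes, not the paper's exact update equations.

```python
import numpy as np

# Toy author-conference link matrix: W[i][j] = number of papers
# author i published in conference j (hypothetical data).
W = np.array([[3.0, 0.0],
              [2.0, 1.0],
              [0.0, 4.0]])

def simple_ranking(degrees):
    """Simple ranking: score proportional to an object's degree."""
    d = np.asarray(degrees, dtype=float)
    return d / d.sum()

def authority_ranking(W, iters=50):
    """Authority ranking sketch: author and conference scores
    mutually enhance each other through links, normalized each
    round so they stay probability distributions."""
    n_auth, n_conf = W.shape
    auth = np.ones(n_auth) / n_auth
    conf = np.ones(n_conf) / n_conf
    for _ in range(iters):
        conf = W.T @ auth       # conferences gather author scores
        conf /= conf.sum()
        auth = W @ conf         # authors gather conference scores
        auth /= auth.sum()
    return auth, conf

author_scores, conf_scores = authority_ranking(W)
```

With simple ranking, only degree matters; with authority ranking, publishing in a highly ranked conference raises an author's score even at equal degree.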
14 Ranking Function (Cont.)
- Ranking distribution
- Normalize ranking scores to sum to 1, giving them a probabilistic meaning
- Similar to the idea of PageRank
- Priors can be added
- P_P(X | T_x, G_k) = (1 - λ_P) P(X | T_x, G_k) + λ_P P_0(X | T_x, G_k)
- P_0(X | T_x, G_k) is the prior knowledge, usually given as a distribution specified by only several words
- λ_P is the parameter expressing how much we believe in the prior distribution
15 Net-Cluster
- Given an information network G, a net-cluster C contains two sorts of information
- Topology: node set and link set as a sub-network of G
- Statistics: a membership indicator for each node x, P(x | C)
- Given an information network G and a cluster number K, a net-clustering for G, C = {C_1, C_2, ..., C_K}, is defined such that
- Each C_k is a net-cluster of G
16 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Framework of NetClus
- Net-Cluster Generative Model
- Posterior Probability Estimation (PPE)
- Impact of Ranking Functions
- Experiments
- Conclusions and Future Work
17 Framework of NetClus
- General idea: map each target object into a new low-dimensional feature space according to the current net-clustering, and adjust the clustering further in the new measure space
- Step 0: generate initial random clusters
- Step 1: build a ranking-based generative model for target objects for each net-cluster
- Step 2: calculate posterior probabilities for target objects, which serve as the new measure, and assign each target object to the nearest cluster accordingly
- Step 3: repeat Steps 1 and 2 until the clusters do not change significantly
- Step 4: calculate posterior probabilities for attribute objects in each net-cluster
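The iterative loop above can be sketched on a toy corpus. This is a simplified stand-in, not the full NetClus model: the per-cluster "generative model" here is just a smoothed within-cluster term distribution, the documents and term ids are made up, and the initial labels are fixed instead of random so the run is reproducible.

```python
import numpy as np

# Toy corpus: each "target object" (paper) is a bag of attribute
# object ids (terms). Hypothetical data for illustration only.
docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3], [0, 1], [3, 3]]
n_terms, K = 4, 2

# Step 0: initial hard clustering of target objects (fixed here
# instead of random, for reproducibility).
labels = np.array([0, 1, 0, 1, 0, 1])

for _ in range(20):
    # Step 1: per-cluster generative model -- a smoothed
    # within-cluster term distribution stands in for the
    # conditional ranking distribution.
    model = np.ones((K, n_terms))            # add-one smoothing
    for d, k in zip(docs, labels):
        for t in d:
            model[k, t] += 1
    model /= model.sum(axis=1, keepdims=True)

    # Step 2: score each target object under each cluster model
    # and reassign it to its most likely cluster.
    new_labels = np.array([
        int(np.argmax([sum(np.log(model[k, t]) for t in d)
                       for k in range(K)]))
        for d in docs
    ])

    # Step 3: stop when the clustering no longer changes.
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
```

On this toy data the loop converges in a few rounds, separating the {0,1}-term documents from the {2,3}-term documents.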
18 Generative Model for Target Objects Given a Net-Cluster
- Recall that each target object stands for a co-occurrence of a bag of attribute objects
- Define the probability of a target object as the probability of the co-occurrence of all its associated attribute objects
- Generative probability for target object d in cluster G_k: p(d | G_k) = ∏_{x in d} p(x | T_x, G_k) · p(T_x | G_k)
- where p(x | T_x, G_k) is the ranking distribution and p(T_x | G_k) is the type probability
- Two independence assumptions
- The probabilities to visit objects of different types are independent of each other
- The probabilities to visit two objects within the same type are independent of each other
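A minimal sketch of this generative probability, under the two independence assumptions: the ranking distributions, type probabilities, author/conference/term names, and the example paper are all made up for illustration.

```python
# Hypothetical cluster-conditional distributions for a tiny
# bibliographic example: a ranking distribution per type, and
# type-visit probabilities p(T_x | G_k).
ranking = {
    "author": {"alice": 0.6, "bob": 0.4},
    "conf":   {"vldb": 0.7, "sigir": 0.3},
    "term":   {"query": 0.5, "index": 0.3, "web": 0.2},
}
type_prob = {"author": 0.3, "conf": 0.2, "term": 0.5}

def p_target(paper):
    """p(d | G_k): product over the paper's attribute objects of
    ranking probability times type probability. Independence lets
    the joint probability factor into this product."""
    p = 1.0
    for obj_type, obj in paper:
        p *= ranking[obj_type][obj] * type_prob[obj_type]
    return p

# A paper as a bag of typed attribute objects (repeats allowed).
paper = [("author", "alice"), ("conf", "vldb"),
         ("term", "query"), ("term", "index")]
```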
19 PPE: Smoothing and Background Generative Model
- Smoothing on ranking distributions for each type of objects in each net-cluster
- Smooth each conditional ranking distribution with the global ranking distribution
- P_S(X | T_x, G_k) = (1 - λ_S) P(X | T_x, G_k) + λ_S P(X | T_x, G)
- Goal: avoid zero probabilities for some objects
- Background generative model (BG)
- The probability to generate target object d in the original network: P(d | G)
- Target objects that are not highly related to any specific cluster should have high probability in the background model
20 PPE: Posterior Probability Estimation for Target Objects
- Now we have K net-clusters, corresponding to K generative models, plus a background model
- Given p(d | G_1), p(d | G_2), ..., p(d | G_K), and p(d | G), what are the posterior probabilities p(k | d), k = 1, 2, ..., K, K+1?
- Estimation solution
- Maximize the log-likelihood of the whole collection: log L = Σ_d log Σ_k p(z = k) p(d | G_k)
- Use the EM algorithm to estimate the best p(z = k)
- Hidden variable: the cluster z that generates each target object
- Iterative formulas: p(z = k | d) ∝ p(z = k) p(d | G_k), and p(z = k) is reestimated as the average of p(z = k | d) over all target objects
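The EM iteration above can be sketched directly: the per-component likelihood matrix `L` below (K clusters plus one background column) uses made-up numbers, and only the mixture weights p(z = k) are free, matching the slide's setup where the generative models are fixed.

```python
import numpy as np

# Fixed likelihoods p(d | G_k) for each target object (rows);
# columns = K = 2 net-clusters plus one background model.
# Toy numbers for illustration.
L = np.array([[0.08, 0.01, 0.02],
              [0.01, 0.09, 0.02],
              [0.07, 0.02, 0.02],
              [0.02, 0.08, 0.02]])
n_docs, n_comp = L.shape

# Start from uniform mixture weights p(z = k).
pz = np.full(n_comp, 1.0 / n_comp)
for _ in range(100):
    # E-step: posterior p(k | d) proportional to p(z = k) p(d | G_k).
    post = L * pz
    post /= post.sum(axis=1, keepdims=True)
    # M-step: reestimate p(z = k) as the average posterior.
    pz = post.mean(axis=0)
```

After convergence, `post[d]` is the posterior membership vector used as the target object's new feature representation.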
21 PPE: Posterior Probability Estimation for Attribute Objects
- Posterior probabilities for attribute objects are only needed once the sub-networks of the net-clustering are stable (Step 4)
- Aim: calculate the membership for each attribute object
- Solution: use the target object information for each attribute object, i.e., the average of its linked target objects' memberships
- E.g., a conference's membership indicator is the percentage of its papers in each cluster
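A minimal sketch of this averaging step; the paper ids and posterior values are hypothetical:

```python
def attribute_posterior(paper_posteriors, linked_papers):
    """Membership of an attribute object = average membership of
    the target objects linked to it. paper_posteriors maps each
    paper id to its posterior vector [p(k | d) for each cluster k];
    linked_papers lists the paper ids linked to this object."""
    K = len(next(iter(paper_posteriors.values())))
    avg = [0.0] * K
    for pid in linked_papers:
        for k, p in enumerate(paper_posteriors[pid]):
            avg[k] += p / len(linked_papers)
    return avg

# E.g., a conference linked to three papers (toy posteriors):
posteriors = {1: [0.9, 0.1], 2: [0.8, 0.2], 3: [0.1, 0.9]}
conf_membership = attribute_posterior(posteriors, [1, 2, 3])
```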
22 Cluster Adjustment
- Use posterior probabilities of target objects as the new feature space
- Each target object -> a K-dimensional vector
- Each net-cluster center -> a K-dimensional vector
- Average over the objects in the cluster
- Assign each target object to the nearest cluster center (e.g., by cosine similarity)
- A sub-network corresponding to the new net-cluster is then built
- by extracting all the target objects in that cluster and all the linked attribute objects
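The adjustment step above can be sketched as follows; the posterior vectors and current labels are made-up toy values, and cosine similarity is used as the slide suggests:

```python
import numpy as np

def assign_to_nearest_center(posteriors, labels, K):
    """Treat each target object's K posterior probabilities as its
    feature vector, average vectors per cluster to get centers,
    then reassign each object to the center with the highest
    cosine similarity."""
    X = np.asarray(posteriors, dtype=float)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Row-normalize both sides so dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return (Xn @ Cn.T).argmax(axis=1)

posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = np.array([0, 0, 1, 1])
new_labels = assign_to_nearest_center(posteriors, labels, K=2)
```

(For brevity the sketch assumes every cluster is non-empty; a real implementation would also rebuild the sub-network from each cluster's objects and their linked attribute objects.)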
23 Time Complexity Analysis
- Global ranking for attribute objects: O(t1 |E|)
- During each iteration
- Conditional ranking: O(t1 |E|)
- Conditional probabilities for target objects: O(|E|)
- Posteriors for target objects: O(t2 (K+1) |N|)
- Cluster adjustment: O(K^2 |N|)
- Posteriors for attribute objects: O(K |E|)
- In all: O(|E|) for fixed K
24 Impact of Ranking Functions
- Which ranking function is better?
- For a simple 3-type star network on object set {X, Y, Z}, where Z is the center type
- The joint probability estimated under simple ranking has an error of I(X, Y), the mutual information between X and Y
- The joint probability estimated under authority ranking gives the best estimation of the propagation matrix between X and Y under the Frobenius norm
25 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
26 Experiments
- Data sets
- DBLP all-area data set
- All conferences
- Top 50K authors
- DBLP four-area data set
- 20 conferences from DB, DM, ML, and IR
- All authors from these conferences
- All papers published in these conferences
- Running case illustration
27 NetClus: Database System Cluster
- Top authors: Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Åke Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185
- Top terms: database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707
- Top conferences: VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849
- Ranking authors in XML
28 Clustering and Ranking Performance along Iterations
29 Parameter Study: Parameter Setting
- The prior parameter is relatively stable when it is > 0.4; the bigger, the better
- The smoothing parameter is relatively stable; the smaller, the better (except at no smoothing)
30 Accuracy Study: Experiments
- Accuracy study, comparing with
- PLSA, a pure text model; no other types of objects or links are used; uses the same prior as NetClus
- RankClus, a clustering method on a bi-typed network, clustering only one type of objects
31 NetClus: Distinguishing Conferences
- Each row lists a conference's posterior probabilities over the clusters
- AAAI: 0.0022667, 0.00899168, 0.934024, 0.0300042, 0.0247133
- CIKM: 0.150053, 0.310172, 0.00723807, 0.444524, 0.0880127
- CVPR: 0.000163812, 0.00763072, 0.931496, 0.0281342, 0.032575
- ECIR: 3.47023e-05, 0.00712695, 0.00657402, 0.978391, 0.00787288
- ECML: 0.00077477, 0.110922, 0.814362, 0.0579426, 0.015999
- EDBT: 0.573362, 0.316033, 0.00101442, 0.0245591, 0.0850319
- ICDE: 0.529522, 0.376542, 0.00239152, 0.0151113, 0.0764334
- ICDM: 0.000455028, 0.778452, 0.0566457, 0.113184, 0.0512633
- ICML: 0.000309624, 0.050078, 0.878757, 0.0622335, 0.00862134
- IJCAI: 0.00329816, 0.0046758, 0.94288, 0.0303745, 0.0187718
- KDD: 0.00574223, 0.797633, 0.0617351, 0.067681, 0.0672086
- PAKDD: 0.00111246, 0.813473, 0.0403105, 0.0574755, 0.0876289
- PKDD: 5.39434e-05, 0.760374, 0.119608, 0.052926, 0.0670379
- PODS: 0.78935, 0.113751, 0.013939, 0.00277417, 0.0801858
- SDM: 0.000172953, 0.841087, 0.058316, 0.0527081, 0.0477156
- SIGIR: 0.00600399, 0.00280013, 0.00275237, 0.977783, 0.0106604
- SIGMOD Conference: 0.689348, 0.223122, 0.0017703, 0.00825455, 0.0775055
- VLDB: 0.701899, 0.207428, 0.00100012, 0.0116966, 0.0779764
- WSDM: 0.00751654, 0.269259, 0.0260291, 0.683646, 0.0135497
32 Case Study: DBLP
- All-area data set
- K = 8
- An XML net-cluster derived from the database net-cluster
33 NetClus: KDD Field
- Top terms: mining 0.0790963, data 0.0509959, association 0.0424484, frequent 0.0413659, rule 0.0223015, pattern 0.0221282, based 0.012448, clustering 0.00915418, efficient 0.00870164, databases 0.00654573, rules 0.00638362, web 0.00618587, approach 0.00558388, patterns 0.00546508, time 0.00532743, discovery 0.00520791, queries 0.00512735, large 0.00505302, algorithm 0.00495221, classification 0.00477521
- Top authors: Philip S. Yu 0.00984668, Jiawei Han 0.0080883, Charu C. Aggarwal 0.00688184, Christos Faloutsos 0.00534601, Wei Wang 0.0039633, Hans-Peter Kriegel 0.0036941, Rakesh Agrawal 0.00352178, Jian Pei 0.00352033, Nick Koudas 0.00326135, Heikki Mannila 0.00302283, Eamonn J. Keogh 0.00285453, Haixun Wang 0.00277766, Divesh Srivastava 0.00275084, Beng Chin Ooi 0.00270741, Ming-Syan Chen 0.00252245, Johannes Gehrke 0.00248227, Mohammed Javeed Zaki 0.0024233, Ke Wang 0.00237186, Yufei Tao 0.00234508, H. V. Jagadish 0.0023317
- Top conferences: ICDE 0.193106, KDD 0.177786, SIGMOD Conf. 0.116497, VLDB 0.112015, ICDM 0.0968135
34 NetClus: ML
- Top terms: learning 0.0785149, recognition 0.0616076, pattern 0.0569329, machine 0.0210515, based 0.012122, knowledge 0.0062703, model 0.00563725, system 0.00538452, approach 0.00534144, reasoning 0.00518959, models 0.00482448, data 0.00428022, analysis 0.00427453, planning 0.00416088, search 0.00414499, systems 0.00407711, logic 0.00371819, multi 0.00349816, algorithm 0.0034679, classification 0.00321972
- Top authors: Richard E. Korf 0.00299098, Craig Boutilier 0.00246557, Tuomas Sandholm 0.00244961, Judea Pearl 0.00242606, Hector J. Levesque 0.00234726, Yoav Shoham 0.00230554, Kenneth D. Forbus 0.00211045, Rina Dechter 0.00208683, Stuart J. Russell 0.00188014, Johan de Kleer 0.00187524, Toby Walsh 0.00186112, Benjamin Kuipers 0.00185742, Subbarao Kambhampati 0.00175271, Peter Stone 0.00170711, Kurt Konolige 0.00170513, James P. Delgrande 0.00167945, Joseph Y. Halpern 0.00164386, Jeffrey S. Rosenschein 0.00161199, Brian C. Williams 0.00157864, Daniel S. Weld 0.00156658
- Top conferences: IJCAI 0.427665, AAAI 0.403056, ICML 0.0899892, ECML 0.0245488, CVPR 0.0229665
35 NetClus: IR
- Top terms: retrieval 0.0833119, information 0.0777979, text 0.0689247, search 0.0306999, web 0.0145188, based 0.0143753, document 0.00950089, query 0.00783011, system 0.0064804, classification 0.00618953, model 0.00614568, language 0.00540877, data 0.00517338, learning 0.0050341, analysis 0.00480311, approach 0.00462792, models 0.0046184, clustering 0.00460905, documents 0.00453735, user 0.00449431
- Top authors: W. Bruce Croft 0.0141826, James Allan 0.00630046, Norbert Fuhr 0.00547785, ChengXiang Zhai 0.00493936, James P. Callan 0.00481386, C. J. van Rijsbergen 0.00471779, Ellen M. Voorhees 0.00467488, Gerard Salton 0.00462283, Mark Sanderson 0.00437391, K. L. Kwok 0.00427169, Chris Buckley 0.00404819, Abraham Bookstein 0.00383661, Justin Zobel 0.00374904, Tetsuya Sakai 0.0035631, Yiming Yang 0.0034947, Donna Harman 0.00335238, Clement T. Yu 0.00330327, Alistair Moffat 0.003292, Ian Soboroff 0.00324063, Nicholas J. Belkin 0.00313201
- Top conferences: SIGIR 0.638595, CIKM 0.14482, ECIR 0.0726454, WWW 0.0366223, KDD 0.015487
36 Outline
- Background and Motivation
- Preliminaries
- NetClus Algorithm
- Experiments
- Conclusions and Future Work
37 Conclusions and Future Work
- A new kind of cluster, the net-cluster, is proposed for heterogeneous information networks comprised of multiple types of objects
- An effective and efficient algorithm, NetClus, is proposed that detects net-clusters in a star network with an arbitrary number of types
- Experiments on real data sets show that our algorithm gives quite reasonable clustering and ranking results, with clustering accuracy much higher than the baseline methods
- See our iNextCube system demo at VLDB'09
- http://inextcube.cs.uiuc.edu/DBLP
- http://inextcube.cs.uiuc.edu/netclus
- Future work
- How to automatically set the cluster number K?
- How to select a good ranking function in a complex network?
38 Q & A