Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema (PowerPoint transcript)
1
Ranking-Based Clustering of Heterogeneous
Information Networks with Star Network Schema
  • Yizhou Sun, Yintao Yu and Jiawei Han
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • 9/1/2009

2
Outline
  • Background and Motivation
  • Preliminaries
  • NetClus Algorithm
  • Experiments
  • Conclusions and Future Work

3
Homogeneous vs. Heterogeneous Networks
  • Information networks are ubiquitous
  • Homogeneous network
  • Collaboration network, friendship network,
    citation network, and so on
  • Usually converted from a heterogeneous network
  • Heterogeneous network
  • Bibliographic network, movie network, tagging
    network, and so on
  • Represents the real relations

4
Why Clustering on Heterogeneous Information
Networks?
  • Why clustering on heterogeneous networks?
  • Understand the hidden structure
  • Understand the individual role each object plays
    in the network
  • Existing work
  • Clustering on homogeneous networks
  • SimRank: clustering by pairwise similarity on a
    homogeneous network
  • Time consuming
  • Meaning of similarity becomes controversial
  • RankClus (EDBT'09)
  • Clustering on one type of objects
  • Experiments are on two-typed heterogeneous
    networks

5
Better and More Efficient Clustering
  • Motivation 1: Generate clusters that are
  • More meaningful
  • Propose a new kind of cluster, the net-cluster,
    which follows the schema of the original network
    and is comprised of different types of objects
  • More understandable
  • Provide ranking information for each type of
    objects in each cluster
  • Motivation 2: Provide an efficient algorithm
  • NetClus: linear in the number of links in the
    network

6
SubNetwork-Clusters: An Illustration
  • Database net-cluster of a bibliographic network
  • (Figure: sub-network over the conference, author,
    and term types)
7
NetClus Methodology: An Illustration
  • Split a network into different layers, each
    represented by a net-cluster

8
Outline
  • Background and Motivation
  • Preliminaries
  • Star Network Schema
  • Ranking Functions
  • Net-Cluster Definition
  • NetClus Algorithm
  • Experiments
  • Conclusions and Future Work

9
Star Network Schema
  • Addresses a specific type of heterogeneous
    networks
  • Special type: star network schema
  • Center type = target type
  • E.g., a paper, a movie, a tagging event
  • A center object is a co-occurrence of a bag of
    different types of objects, which stands for a
    multi-relation among different types of objects
  • Surrounding types = attribute types

10-12
  • (Slides 10-12 contained figures only; images not
    transcribed)
13
Ranking Functions
  • Ranking objects in a network, denoted as
    p(x | Tx, G)
  • Gives a score to each object according to its
    importance
  • Different rules define different ranking
    functions
  • Simple Ranking
  • Ranking score is assigned according to the degree
    of an object
  • Authority Ranking
  • Ranking score is assigned according to mutual
    enhancement by propagation of scores through
    links
  • E.g., according to the rules that (1) highly ranked
    conferences accept many good papers published by
    many highly ranked authors, and (2) highly ranked
    authors publish many good papers in highly ranked
    conferences
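The mutual-enhancement rules above can be sketched as a simple power iteration over a toy conference-author matrix. This is only a hedged illustration: the matrix W and its counts are invented, and the actual NetClus propagation goes through the center objects (papers) rather than a direct conference-author matrix.

```python
import numpy as np

# Hypothetical toy data: W[i, j] counts papers that author j
# published in conference i.
W = np.array([[5.0, 2.0, 0.0],
              [1.0, 0.0, 4.0]])

def authority_ranking(W, iters=50):
    """Mutual enhancement: propagate scores back and forth through
    links and renormalize, so highly ranked conferences and highly
    ranked authors reinforce each other."""
    n_conf, n_auth = W.shape
    conf = np.ones(n_conf) / n_conf     # start from uniform scores
    auth = np.ones(n_auth) / n_auth
    for _ in range(iters):
        conf = W @ auth                 # conferences score via their authors
        conf /= conf.sum()              # keep it a ranking distribution
        auth = W.T @ conf               # authors score via their conferences
        auth /= auth.sum()
    return conf, auth

conf_rank, auth_rank = authority_ranking(W)
```

Simple ranking, by contrast, would just normalize the row and column sums of W in one pass.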

14
Ranking Function (Cont.)
  • Ranking distribution
  • Normalize ranking scores to sum to 1, giving them
    a probabilistic meaning
  • Similar to the idea of PageRank
  • Priors can be added
  • P_P(X | Tx, Gk) = (1 - λP) P(X | Tx, Gk) + λP P0(X | Tx, Gk)
  • P0(X | Tx, Gk) is the prior knowledge, usually
    given as a distribution over only several words
  • λP is the parameter expressing how much we believe
    in the prior distribution
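As a quick sketch of the prior mixture above (the distributions and the choice λP = 0.9 are toy values for illustration):

```python
import numpy as np

def mix_with_prior(p, p0, lam=0.9):
    """P_P = (1 - lam) * P + lam * P0: blend the estimated ranking
    distribution with prior knowledge; lam controls trust in the prior."""
    return (1.0 - lam) * p + lam * p0

p = np.array([0.7, 0.2, 0.1])    # estimated ranking distribution
p0 = np.array([0.5, 0.5, 0.0])   # prior given over a few seed words
mixed = mix_with_prior(p, p0)
```

Because both inputs are distributions, the convex combination is again a distribution, so the probabilistic meaning is preserved.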

15
Net-Cluster
  • Given an information network G, a net-cluster C
    contains two sorts of information
  • Topology: node set and link set, as a sub-network
    of G
  • Statistical info: a membership indicator
    P(x ∈ C) for each node x
  • Given an information network G and a cluster
    number K, a clustering for G is defined as
    C = {C_1, ..., C_K}
  • Each C_k is a net-cluster of G

16
Outline
  • Background and Motivation
  • Preliminaries
  • NetClus Algorithm
  • Framework of NetClus
  • Net-Cluster Generative Model
  • Posterior Probability Estimation (PPE)
  • Impact of Ranking Functions
  • Experiments
  • Conclusions and Future Work

17
Framework of NetClus
  • General idea: map each target object into a new
    low-dimensional feature space according to the
    current net-clustering, and adjust the clustering
    further in the new measure space
  • Step 0: generate initial random clusters
  • Step 1: build a ranking-based generative model
    for target objects for each net-cluster
  • Step 2: calculate posterior probabilities for
    target objects, which serve as the new measure,
    and assign each target object to the nearest
    cluster accordingly
  • Step 3: repeat Steps 1 and 2 until clusters do
    not change significantly
  • Step 4: calculate posterior probabilities for
    attribute objects in each net-cluster
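The loop above can be sketched on a toy corpus. This is a heavily simplified, hedged illustration (one attribute type only, papers as bags of term ids, hard assignments, simple ranking, no background model), not the paper's implementation; the documents and the deliberately imperfect initial clustering are made up.

```python
import numpy as np

# Toy target objects: each "paper" is a bag of attribute-object (term) ids.
docs = [[0, 0, 1], [0, 1, 1], [4, 5, 5], [4, 4, 5]]
n_terms, K = 6, 2

# Step 0: an initial (imperfect) hard clustering of target objects.
assign = np.array([0, 0, 1, 0])

for _ in range(20):
    # Step 1: per-cluster ranking distribution over terms
    # (normalized counts), with a tiny floor to avoid zeros.
    rank = np.full((K, n_terms), 1e-6)
    for d, k in zip(docs, assign):
        for t in d:
            rank[k, t] += 1
    rank /= rank.sum(axis=1, keepdims=True)

    # Step 2: score each doc under each cluster model and reassign
    # it to the best-fitting cluster.
    new_assign = np.array([
        int(np.argmax([np.log(rank[k, d]).sum() for k in range(K)]))
        for d in docs
    ])

    # Step 3: stop when the clustering no longer changes.
    if np.array_equal(new_assign, assign):
        break
    assign = new_assign
```

On this toy data the loop separates the {term 0, term 1} papers from the {term 4, term 5} papers within two iterations.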

18
Generative Model for Target Objects Given a
Net-cluster
  • Recall that each target object stands for a
    co-occurrence of a bag of attribute objects
  • Defining the probability of a target object is
    equivalent to defining the probability of the
    co-occurrence of all the associated attribute
    objects
  • Generative probability for target object d in
    cluster Gk:
    p(d | Gk) = ∏ p(x | Tx, Gk) · p(Tx | Gk),
    over attribute objects x linked to d
  • where p(x | Tx, Gk) is the ranking distribution
    and p(Tx | Gk) is the type probability
  • Two independence assumptions
  • The probabilities of visiting objects of different
    types are independent of each other
  • The probabilities of visiting two objects within
    the same type are independent of each other
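Under those two independence assumptions, the generative probability is a plain product over the linked attribute objects. A minimal sketch, where every name and number below is an invented toy value, not data from the paper:

```python
# Toy ranking distributions p(x | Tx, Gk) for one net-cluster,
# and toy type probabilities p(Tx | Gk).
rank = {
    "author": {"alice": 0.6, "bob": 0.4},
    "term":   {"query": 0.7, "index": 0.3},
}
type_prob = {"author": 0.4, "term": 0.6}

def paper_prob(paper):
    """p(d | Gk) = product over attribute objects x linked to d of
    p(x | Tx, Gk) * p(Tx | Gk) -- independent across and within types."""
    p = 1.0
    for obj_type, objs in paper.items():
        for x in objs:
            p *= rank[obj_type][x] * type_prob[obj_type]
    return p

# A target object: one paper linked to one author and two terms.
d = {"author": ["alice"], "term": ["query", "index"]}
p = paper_prob(d)
```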

19
PPE: Smoothing and Background Generative Model
  • Smoothing on ranking distributions for each type
    of objects in each net-cluster
  • Smooth each conditional ranking distribution with
    the global ranking distribution
  • P_S(X | Tx, Gk) = (1 - λS) P(X | Tx, Gk) + λS P(X | Tx, G)
  • Goal: avoid zero probabilities for some objects
  • Background generative model (BG)
  • The probability of generating target object d in
    the original network: P(d | G)
  • Target objects that are not highly related to any
    specific cluster should have high probability in
    the background model

20
PPE: Posterior Probability Estimation for Target
Objects
  • Now we have K net-clusters, corresponding to K
    generative models, plus a background model
  • Given p(d|G1), p(d|G2), ..., p(d|GK), and p(d|G),
    what are the posterior probabilities p(k|d),
    for k = 1, 2, ..., K, K+1?
  • Estimation solution
  • Maximize the log-likelihood of the whole
    collection
  • Use the EM algorithm to estimate the best p(z = k)
  • Hidden variable: the cluster label z of each
    target object
  • Iterative formula: the E-step computes
    p(k|d) ∝ p(z = k) p(d|Gk); the M-step updates
    p(z = k) as the average of the posteriors
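The estimation can be sketched with the standard mixture-weight EM updates. The matrix P of per-model probabilities p(d|Gk) below is toy data (three target objects, two clusters, and a last column standing in for the background model):

```python
import numpy as np

# Rows = target objects, columns = K cluster models + background model.
P = np.array([[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

def em_posteriors(P, iters=50):
    """Estimate mixture weights p(z=k) and posteriors p(k|d) given
    fixed per-model probabilities p(d|Gk)."""
    n, K = P.shape
    pz = np.ones(K) / K                       # uniform initial p(z=k)
    for _ in range(iters):
        post = P * pz                         # E-step: p(k|d) up to a constant
        post /= post.sum(axis=1, keepdims=True)
        pz = post.mean(axis=0)                # M-step: re-estimate p(z=k)
    return post, pz

post, pz = em_posteriors(P)
```

Each row of `post` is the K+1-dimensional posterior vector that serves as the target object's new feature representation.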

21
PPE: Posterior Probability Estimation for
Attribute Objects
  • Posterior probabilities for attribute objects are
    only needed once the sub-networks of the
    net-clustering are stable (Step 4)
  • Aim: calculate the membership for each attribute
    object
  • Solution: use the target-object information for
    each attribute object, i.e., average the
    memberships of its linked target objects
  • E.g., a conference's membership indicator is the
    percentage of its papers in each cluster
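This averaging is a one-liner; the posteriors and paper-conference links below are invented toy values:

```python
import numpy as np

# Posterior vectors p(k|d) for three papers over two clusters.
paper_post = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.2, 0.8]])

# Which papers (row indices) each conference published.
conf_papers = {"VLDB": [0, 1], "KDD": [2]}

# A conference's membership = average membership of its papers.
conf_post = {c: paper_post[ids].mean(axis=0)
             for c, ids in conf_papers.items()}
```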

22
Cluster Adjustment
  • Use posterior probabilities of target objects as
    the new feature space
  • Each target object → a K-dimensional vector
  • Each net-cluster center → a K-dimensional vector,
    the average over the objects in the cluster
  • Assign each target object to the nearest cluster
    center (e.g., by cosine similarity)
  • A sub-network corresponding to each new
    net-cluster is then built by extracting all the
    target objects in that cluster and all the linked
    attribute objects
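The adjustment step can be sketched with toy posterior vectors (K = 2 and all numbers are made up for illustration):

```python
import numpy as np

# Each row is a target object's K-dimensional posterior vector.
X = np.array([[0.9, 0.1],
              [0.7, 0.3],
              [0.2, 0.8],
              [0.1, 0.9]])
assign = np.array([0, 0, 1, 1])        # current cluster labels

# Cluster centers: average over the objects in each cluster.
centers = np.vstack([X[assign == k].mean(axis=0) for k in range(2)])

# Cosine similarity of every object to every center, then reassign
# each object to its most similar center.
sims = (X @ centers.T) / (
    np.linalg.norm(X, axis=1, keepdims=True) * np.linalg.norm(centers, axis=1)
)
new_assign = sims.argmax(axis=1)
```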

23
Time Complexity Analysis
  • Global ranking for attribute objects: O(t1|E|)
  • During each iteration
  • Conditional ranking: O(t1|E|)
  • Conditional probabilities for target objects:
    O(|E|)
  • Posteriors for target objects: O(t2(K+1)|N|)
  • Cluster adjustment: O(K²|N|)
  • Posteriors for attribute objects: O(K|E|)
  • In all: O(|E|) for fixed K

24
Impact of Ranking Functions
  • Which ranking function is better?
  • Consider a simple 3-typed star network on object
    set {X, Y, Z}, where Z is the center type
  • The joint probability estimated under simple
    ranking has an error of I(X; Y), the mutual
    information between X and Y
  • The joint probability estimated under authority
    ranking gives the best estimation of the
    propagation matrix between X and Y under the
    Frobenius norm

25
Outline
  • Background and Motivation
  • Preliminaries
  • NetClus Algorithm
  • Experiments
  • Conclusions and Future Work

26
Experiments
  • Data Set
  • DBLP all-area data set
  • All conferences
  • Top 50K authors
  • DBLP four-area data set
  • 20 conferences from DB, DM, ML, IR
  • All authors from these conferences
  • All papers published in these conferences
  • Running case illustration

27
NetClus: Database System Cluster

Top authors: Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Åke Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185

Top terms: database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707

Top conferences: VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849

Ranking authors in XML
28
Clustering and Ranking Performance along Iterations
  • (Figure not transcribed)
29
Parameter Study: Parameter Setting
  • The prior parameter is relatively stable when it
    is > 0.4; the bigger, the better
  • The smoothing parameter is relatively stable; the
    smaller, the better (except with no smoothing at
    all)

30
Accuracy Study: Experiments
  • Accuracy study, compared with
  • PLSA, a pure text model: no other types of
    objects or links are used; it uses the same prior
    as NetClus
  • RankClus, a bi-typed clustering method that
    clusters only one type of objects

31
NetClus: Distinguishing Conferences
  • Posterior probabilities per conference (the
    columns appear to correspond to the DB, DM, ML,
    and IR clusters, plus the background model)
  • AAAI 0.0022667 0.00899168 0.934024 0.0300042 0.0247133
  • CIKM 0.150053 0.310172 0.00723807 0.444524 0.0880127
  • CVPR 0.000163812 0.00763072 0.931496 0.0281342 0.032575
  • ECIR 3.47023e-05 0.00712695 0.00657402 0.978391 0.00787288
  • ECML 0.00077477 0.110922 0.814362 0.0579426 0.015999
  • EDBT 0.573362 0.316033 0.00101442 0.0245591 0.0850319
  • ICDE 0.529522 0.376542 0.00239152 0.0151113 0.0764334
  • ICDM 0.000455028 0.778452 0.0566457 0.113184 0.0512633
  • ICML 0.000309624 0.050078 0.878757 0.0622335 0.00862134
  • IJCAI 0.00329816 0.0046758 0.94288 0.0303745 0.0187718
  • KDD 0.00574223 0.797633 0.0617351 0.067681 0.0672086
  • PAKDD 0.00111246 0.813473 0.0403105 0.0574755 0.0876289
  • PKDD 5.39434e-05 0.760374 0.119608 0.052926 0.0670379
  • PODS 0.78935 0.113751 0.013939 0.00277417 0.0801858
  • SDM 0.000172953 0.841087 0.058316 0.0527081 0.0477156
  • SIGIR 0.00600399 0.00280013 0.00275237 0.977783 0.0106604
  • SIGMOD Conference 0.689348 0.223122 0.0017703 0.00825455 0.0775055
  • VLDB 0.701899 0.207428 0.00100012 0.0116966 0.0779764
  • WSDM 0.00751654 0.269259 0.0260291 0.683646 0.0135497

32
Case Study: DBLP
  • All-area data set
  • K = 8
  • An XML net-cluster derived from the database
    net-cluster

33
NetClus: KDD Field

Top terms: mining 0.0790963, data 0.0509959, association 0.0424484, frequent 0.0413659, rule 0.0223015, pattern 0.0221282, based 0.012448, clustering 0.00915418, efficient 0.00870164, databases 0.00654573, rules 0.00638362, web 0.00618587, approach 0.00558388, patterns 0.00546508, time 0.00532743, discovery 0.00520791, queries 0.00512735, large 0.00505302, algorithm 0.00495221, classification 0.00477521

Top authors: Philip S. Yu 0.00984668, Jiawei Han 0.0080883, Charu C. Aggarwal 0.00688184, Christos Faloutsos 0.00534601, Wei Wang 0.0039633, Hans-Peter Kriegel 0.0036941, Rakesh Agrawal 0.00352178, Jian Pei 0.00352033, Nick Koudas 0.00326135, Heikki Mannila 0.00302283, Eamonn J. Keogh 0.00285453, Haixun Wang 0.00277766, Divesh Srivastava 0.00275084, Beng Chin Ooi 0.00270741, Ming-Syan Chen 0.00252245, Johannes Gehrke 0.00248227, Mohammed Javeed Zaki 0.0024233, Ke Wang 0.00237186, Yufei Tao 0.00234508, H. V. Jagadish 0.0023317

Top conferences: ICDE 0.193106, KDD 0.177786, SIGMOD Conf. 0.116497, VLDB 0.112015, ICDM 0.0968135
34
NetClus: ML Field

Top terms: learning 0.0785149, recognition 0.0616076, pattern 0.0569329, machine 0.0210515, based 0.012122, knowledge 0.0062703, model 0.00563725, system 0.00538452, approach 0.00534144, reasoning 0.00518959, models 0.00482448, data 0.00428022, analysis 0.00427453, planning 0.00416088, search 0.00414499, systems 0.00407711, logic 0.00371819, multi 0.00349816, algorithm 0.0034679, classification 0.00321972

Top authors: Richard E. Korf 0.00299098, Craig Boutilier 0.00246557, Tuomas Sandholm 0.00244961, Judea Pearl 0.00242606, Hector J. Levesque 0.00234726, Yoav Shoham 0.00230554, Kenneth D. Forbus 0.00211045, Rina Dechter 0.00208683, Stuart J. Russell 0.00188014, Johan de Kleer 0.00187524, Toby Walsh 0.00186112, Benjamin Kuipers 0.00185742, Subbarao Kambhampati 0.00175271, Peter Stone 0.00170711, Kurt Konolige 0.00170513, James P. Delgrande 0.00167945, Joseph Y. Halpern 0.00164386, Jeffrey S. Rosenschein 0.00161199, Brian C. Williams 0.00157864, Daniel S. Weld 0.00156658

Top conferences: IJCAI 0.427665, AAAI 0.403056, ICML 0.0899892, ECML 0.0245488, CVPR 0.0229665
35
NetClus: IR Field

Top terms: retrieval 0.0833119, information 0.0777979, text 0.0689247, search 0.0306999, web 0.0145188, based 0.0143753, document 0.00950089, query 0.00783011, system 0.0064804, classification 0.00618953, model 0.00614568, language 0.00540877, data 0.00517338, learning 0.0050341, analysis 0.00480311, approach 0.00462792, models 0.0046184, clustering 0.00460905, documents 0.00453735, user 0.00449431

Top authors: W. Bruce Croft 0.0141826, James Allan 0.00630046, Norbert Fuhr 0.00547785, ChengXiang Zhai 0.00493936, James P. Callan 0.00481386, C. J. van Rijsbergen 0.00471779, Ellen M. Voorhees 0.00467488, Gerard Salton 0.00462283, Mark Sanderson 0.00437391, K. L. Kwok 0.00427169, Chris Buckley 0.00404819, Abraham Bookstein 0.00383661, Justin Zobel 0.00374904, Tetsuya Sakai 0.0035631, Yiming Yang 0.0034947, Donna Harman 0.00335238, Clement T. Yu 0.00330327, Alistair Moffat 0.003292, Ian Soboroff 0.00324063, Nicholas J. Belkin 0.00313201

Top conferences: SIGIR 0.638595, CIKM 0.14482, ECIR 0.0726454, WWW 0.0366223, KDD 0.015487
36
Outline
  • Background and Motivation
  • Preliminaries
  • NetClus Algorithm
  • Experiments
  • Conclusions and Future Work

37
Conclusions and Future Work
  • A new kind of cluster, the net-cluster, is
    proposed for heterogeneous information networks
    comprised of multiple types of objects.
  • An effective and efficient algorithm, NetClus, is
    proposed that detects net-clusters in a star
    network with an arbitrary number of types.
  • Experiments on real data sets show that our
    algorithm gives quite reasonable clustering and
    ranking results, with clustering accuracy much
    higher than the baseline methods.
  • See our iNextCube system demo at VLDB'09
  • http://inextcube.cs.uiuc.edu/DBLP
  • http://inextcube.cs.uiuc.edu/netclus
  • Future work
  • How to automatically set the cluster number K?
  • How to select a good ranking function in a
    complex network?

38
  • END.

Q & A