1
A General Probabilistic Framework for Clustering
Individuals
  • Technical Report No. 00-09
  • Igor Cadez, Scott Gaffney, and Padhraic Smyth
  • University of California, Irvine

2
Overview
  • Scope of standard vector-based clustering
  • Probabilistic framework of generative mixture
    models
  • Generative Mixture-Based Cluster Model
  • Web Browsing example
  • EM-based Clustering Algorithm for Clustering
    Individuals
  • Web Browsing example
  • Experimental results for the Web Browsing example

3
Standard Vector Based Clustering
  • Not directly applicable to data in non-vector
    form (sequences, time-series)
  • Different individuals can have different
    amounts of observed data (e.g., Web browsing
    behavior, gene expression data)
  • The usual workaround:
  • Reduce the observed data to vectors of fixed
    dimensionality
  • Apply standard clustering techniques
  • But this loses information for sequential/temporal
    data

4
Probabilistic Framework of Generative Mixture
Models
  • Models non-vector data in its native form
  • Handles differing data sizes and data
    types across individuals
  • The Expectation-Maximization (EM) algorithm is quite
    similar to the standard EM algorithm, but
    different individuals have different effects on
    the estimation depending on their number of
    observations

5
Generative Mixture-Based Cluster Model
  • Draw an individual i from the overall population.
  • The individual is assigned to one of K clusters,
    1 ≤ k ≤ K, with probability p(c_i = k), where c_i
    indicates the cluster membership.
  • Each cluster k, 1 ≤ k ≤ K, has a data-generating
    model p_k(D_i | θ_k), where θ_k are the
    parameters of p_k.
  • D_i is then generated for the individual from
    p_k(D_i | θ_k) once the cluster membership c_i = k
    is known and θ_k is given.
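To make the generative process concrete, the following is a minimal sketch in Python. The function name, the use of NumPy, and the two toy cluster models are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def sample_individual(cluster_probs, cluster_models, rng):
    """Draw one individual's data under the generative mixture model.

    cluster_probs  : length-K array with p(c_i = k)
    cluster_models : list of K callables; cluster_models[k](rng) draws a
                     data set D_i from p_k(D_i | theta_k)
    """
    k = rng.choice(len(cluster_probs), p=cluster_probs)   # assign cluster c_i = k
    D_i = cluster_models[k](rng)                           # generate D_i given c_i = k
    return k, D_i

# Toy usage: two hypothetical clusters emitting Gaussian "observations"
rng = np.random.default_rng(0)
models = [lambda r: r.normal(0.0, 1.0, size=r.integers(1, 6)),
          lambda r: r.normal(5.0, 1.0, size=r.integers(1, 6))]
c_i, D_i = sample_individual(np.array([0.3, 0.7]), models, rng)
```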

6
Web browsing example
  • Each individual i has a set D_i = {s_1, s_2, ..., s_{n_i}},
    where each s is a sequence representing the observed
    record of page requests for individual i, and the
    different sequences represent the different
    sessions.
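For concreteness, one individual's data might look like the following Python structure; the integer page codes and session lengths are made up purely for illustration.

```python
# D_i: the set of sessions for individual i, each session a sequence of
# page-category requests (integer-coded pages; values are illustrative only).
D_i = [
    [3, 7, 7, 1],       # session s_1: four page requests
    [2, 2],             # session s_2
    [5, 1, 4, 4, 4],    # session s_3
]
n_i = len(D_i)          # number of sessions observed for individual i
```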

7
Web Browsing example (continued)
  • The population of Web users is divided into K
    clusters, and a user i has a probability p(c_i = k)
    of belonging to cluster k.
  • Each cluster k is modeled by a finite-state Markov
    model with parameters θ_k. The probability of a sequence
    s = (s_1, ..., s_L) for cluster k is
    p_k(s | θ_k) = π_k(s_1) ∏_{t=2}^{L} T_k(s_t | s_{t-1}).
  • The number of sessions n_i for each individual i follows a
    geometric distribution with parameter λ_k.
  • The overall parameters consist of
    Θ = {p(c_i = k), θ_k, λ_k : k = 1, ..., K}.
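A sketch of the two per-cluster ingredients just described, assuming integer-coded page states; the function names and the particular geometric parameterization p(n) = λ(1−λ)^(n−1) for n ≥ 1 are assumptions made here for illustration.

```python
import numpy as np

def markov_log_prob(seq, init_k, trans_k):
    """log p_k(s | theta_k) for one session under cluster k's Markov model.

    init_k  : length-M vector of initial-state probabilities pi_k(s)
    trans_k : M x M matrix with trans_k[s1, s2] = T_k(s2 | s1)
    """
    logp = np.log(init_k[seq[0]])
    for s1, s2 in zip(seq[:-1], seq[1:]):
        logp += np.log(trans_k[s1, s2])
    return logp

def geometric_log_prob(n, lam_k):
    """log p(n_i | lambda_k), assuming p(n) = lambda * (1 - lambda)**(n - 1)."""
    return np.log(lam_k) + (n - 1) * np.log(1.0 - lam_k)
```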

8
Web Browsing example (continued)
  • The probability of D_i from individual i, assuming
    that i is a member of cluster k, is
    p(D_i | c_i = k, θ_k, λ_k) = p(n_i | λ_k) ∏_{j=1}^{n_i} p_k(s_{i,j} | θ_k).
  • The probability of D_i for i when c_i is unknown is
    p(D_i | Θ) = Σ_{k=1}^{K} p(c_i = k) p(D_i | c_i = k, θ_k, λ_k).
  • The probability that i belongs to cluster k (Bayes rule) is
    p(c_i = k | D_i, Θ) = p(c_i = k) p(D_i | c_i = k, θ_k, λ_k) / Σ_{k'} p(c_i = k') p(D_i | c_i = k', θ_{k'}, λ_{k'}).
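Putting these pieces together, a hedged sketch of the membership computation, reusing the hypothetical markov_log_prob and geometric_log_prob helpers sketched earlier; the log-space normalization is an implementation detail added here for numerical stability.

```python
import numpy as np

def membership_posterior(D_i, cluster_probs, inits, transes, lams):
    """p(c_i = k | D_i, Theta) for one individual, computed in log space."""
    K = len(cluster_probs)
    log_joint = np.empty(K)
    for k in range(K):
        # log p(c_i = k) + log p(D_i | c_i = k, theta_k, lambda_k)
        log_joint[k] = (np.log(cluster_probs[k])
                        + geometric_log_prob(len(D_i), lams[k])
                        + sum(markov_log_prob(s, inits[k], transes[k]) for s in D_i))
    log_joint -= log_joint.max()          # guard against underflow
    post = np.exp(log_joint)
    return post / post.sum()              # Bayes rule: normalize over the K clusters
```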

9
EM-Based Clustering Algorithm for Clustering
Individuals.
  • Consider N individuals, each having a data set D_i.
    Let each D_i consist of n_i observations d_{ij},
    where each d_{ij} may itself be a smaller data subset.
  • According to the generative cluster model, each i
    has a pdf p(D_i | Θ) = Σ_{k=1}^{K} p(c_i = k) p_k(D_i | θ_k),
    where Θ = {θ_1, θ_2, ..., θ_K}
  • and each c_i is the cluster identity of the
    i-th individual.
  • Assuming that the observations are conditionally
    independent, the probability that i belongs to cluster c_i is
    p(c_i = k | D_i, Θ) ∝ p(c_i = k) ∏_{j=1}^{n_i} p(d_{ij} | θ_k).


10
EM Based Clustering Algorithm (continued)
  • To learn the ML or MAP estimates of Θ given D, under
    the assumption that data from different individuals
    are conditionally independent, we get
    p(D | Θ) = ∏_{i=1}^{N} p(D_i | Θ) = ∏_{i=1}^{N} Σ_{k=1}^{K} p(c_i = k) p_k(D_i | θ_k),
  • where Θ_ML = arg max_Θ p(D | Θ) and
  • Θ_MAP = arg max_Θ p(D | Θ) p(Θ).
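As a sketch of the quantity Θ_ML maximizes, the full-data log-likelihood can be accumulated over individuals as below; it reuses the hypothetical per-cluster helpers from the earlier sketches, and Θ_MAP would simply add log p(Θ) to this objective.

```python
import numpy as np

def total_log_likelihood(data, cluster_probs, inits, transes, lams):
    """log p(D | Theta) = sum_i log sum_k p(c_i = k) p(D_i | c_i = k, theta_k, lambda_k)."""
    total = 0.0
    for D_i in data:                                   # independent individuals
        log_joint = np.array([
            np.log(cluster_probs[k])
            + geometric_log_prob(len(D_i), lams[k])
            + sum(markov_log_prob(s, inits[k], transes[k]) for s in D_i)
            for k in range(len(cluster_probs))
        ])
        m = log_joint.max()
        total += m + np.log(np.exp(log_joint - m).sum())   # log-sum-exp over clusters
    return total
```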

11
Web Browsing example
  • Each individual i has n_i sequences, and we
    consider a generative Markov model with states
    from 1 to M.
  • The model parameters for each Markov cluster, θ_k,
    consist of an initial-state probability π_k(s) and an M×M
    transition matrix T_k(s_2 | s_1), where s, s_1, s_2 denote
    discrete states.
  • Let D_i = {s_{i,1}, ..., s_{i,j}, ..., s_{i,n_i}} be the data for the
    i-th individual, where the subscript j denotes the
    j-th sequence for i.
  • The likelihood of s_{i,j} conditioned on a cluster c_i with
    parameters θ_{c_i} is
    p(s_{i,j} | θ_{c_i}) = π_{c_i}(s_{i,j,1}) ∏_{t=2}^{L_{i,j}} T_{c_i}(s_{i,j,t} | s_{i,j,t-1}),
    where L_{i,j} is the length of sequence s_{i,j}.
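As a small usage example of the Markov likelihood above, with a hypothetical 3-state cluster whose parameter values are chosen purely for illustration:

```python
import numpy as np

# Hypothetical cluster k with M = 3 states (values are illustrative only)
pi_k = np.array([0.5, 0.3, 0.2])               # initial-state probabilities pi_k(s)
T_k = np.array([[0.7, 0.2, 0.1],               # T_k[s1, s2] = T_k(s2 | s1); rows sum to 1
                [0.1, 0.8, 0.1],
                [0.3, 0.3, 0.4]])

s_ij = [0, 0, 1, 2]                            # one session s_{i,j} for individual i
log_p = markov_log_prob(s_ij, pi_k, T_k)       # log p(s_{i,j} | theta_k)
```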

12
Web Browsing example (continued)
  • The probability of all of the data D_i from individual i,
    conditioned on cluster c_i, is
    p(D_i | c_i, θ_{c_i}) = ∏_{j=1}^{n_i} p(s_{i,j} | θ_{c_i}).
  • The marginal probability given the model
    parameters can be written as
    p(D_i | Θ) = Σ_{k=1}^{K} α_k p(D_i | c_i = k, θ_k),
  • where α_k = p(c_i = k) defines the proportion of the
    k-th component model.

13
Web Browsing example (continued)
  • Given the definition of the likelihood function,
    the EM procedure becomes:
  • E step: compute the membership probabilities
    p(c_i = k | D_i, Θ) for every individual i and cluster k,
    using the current parameter estimates.
  • M step: re-estimate the parameters (the cluster
    proportions α_k and the cluster-specific parameters θ_k, λ_k)
    as weighted estimates, with each individual's contribution
    weighted by its membership probabilities.
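A compact sketch of one EM iteration for this mixture of Markov models. The weighted-count updates below are the standard mixture-model estimates under the stated assumptions, not formulas copied from the slides; updates for the initial-state probabilities and the geometric session-count parameters are analogous and omitted for brevity.

```python
import numpy as np

def em_iteration(data, cluster_probs, inits, transes, lams, M):
    """One EM pass over all individuals; returns updated weights and transition matrices."""
    N, K = len(data), len(cluster_probs)

    # E step: membership probabilities p(c_i = k | D_i, Theta) for every individual
    resp = np.zeros((N, K))
    for i, D_i in enumerate(data):
        resp[i] = membership_posterior(D_i, cluster_probs, inits, transes, lams)

    # M step: mixture proportions alpha_k = (1/N) * sum_i resp[i, k]
    new_probs = resp.mean(axis=0)

    # M step: transition matrices from membership-weighted transition counts
    new_transes = []
    for k in range(K):
        counts = np.full((M, M), 1e-6)            # tiny smoothing to avoid zero rows
        for i, D_i in enumerate(data):
            for s in D_i:
                for s1, s2 in zip(s[:-1], s[1:]):
                    counts[s1, s2] += resp[i, k]  # weight each count by resp[i, k]
        new_transes.append(counts / counts.sum(axis=1, keepdims=True))
    return new_probs, new_transes
```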

14
Results of clustering
  • (Figure: each square represents the joint probability p(s_i, s_j).)
  • Clustering Genes using Gene Expression Data:
    real-valued sequences as a function of time
  • Clustering Patients based on Red Blood Cell
    Cytograms: vector data with a variable number of data
    points per person

15
Conclusion
  • If one individual has more observations than
    another, that individual's data carries more weight in the
    parameter estimation.
  • Models heterogeneous data across different
    individuals in a general framework.
  • All available data on an individual can be used
    without having to develop specialized algorithms.