1
A General Probabilistic Framework for Clustering
Individuals
  • Technical Report No. 00-09
  • Igor Cadez, Scott Gaffney, and Padhraic Smyth
  • University of California, Irvine

2
Overview
  • Scope of standard vector-based clustering
  • Probabilistic framework of generative mixture
    models
  • Generative Mixture-Based Cluster Model
  • Web Browsing example
  • EM-based Clustering Algorithm for Clustering
    Individuals
  • Web Browsing example
  • Experimental results for the Web Browsing example

3
Standard Vector Based Clustering
  • Not directly applicable to data in non-vector
    form (sequences, time-series)
  • Different individuals can have different
    amounts of observed data (e.g., Web browsing
    behavior, gene expression data)
  • The usual workaround:
  • Reduce the observed data to vectors of fixed
    dimensionality
  • Apply standard clustering techniques
  • But this loses information for sequential/temporal
    data

4
Probabilistic Framework of Generative Mixture
Models
  • Models non-vector data in its native form
  • Handles differing data sizes and data
    types across individuals
  • The Expectation-Maximization (EM) algorithm is quite
    similar to the standard EM algorithm, but
    different individuals have different effects on
    the estimation depending on their number of
    observations

5
Generative Mixture-Based Cluster Model
  • Draw an individual i from the overall population.
  • The individual is assigned to one of K clusters,
    1 ≤ k ≤ K, with probability p(c_i = k), where c_i
    indicates the cluster membership.
  • Each cluster k, 1 ≤ k ≤ K, has a data-generating
    model p_k(D_i | θ_k), where θ_k are the
    parameters of p_k.
  • D_i is then generated for the individual from
    p_k(D_i | θ_k) once the cluster membership c_i = k
    is known and θ_k is given.
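To make the generative process concrete, the following is a minimal sketch in Python. The function name, the use of NumPy, and the two toy cluster models are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def sample_individual(cluster_probs, cluster_models, rng):
    """Draw one individual's data under the generative mixture model.

    cluster_probs  : length-K array with p(c_i = k)
    cluster_models : list of K callables; cluster_models[k](rng) draws a
                     data set D_i from p_k(D_i | theta_k)
    """
    k = rng.choice(len(cluster_probs), p=cluster_probs)   # assign cluster c_i = k
    D_i = cluster_models[k](rng)                           # generate D_i given c_i = k
    return k, D_i

# Toy usage: two hypothetical clusters emitting Gaussian "observations"
rng = np.random.default_rng(0)
models = [lambda r: r.normal(0.0, 1.0, size=r.integers(1, 6)),
          lambda r: r.normal(5.0, 1.0, size=r.integers(1, 6))]
c_i, D_i = sample_individual(np.array([0.3, 0.7]), models, rng)
```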

6
Web browsing example
  • Each individual i has a set D_i = {s_1, s_2, ..., s_{n_i}},
    where each s is a sequence representing the observed
    record of page requests for individual i, and the
    different sequences represent the different
    sessions.
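For concreteness, one individual's data might look like the following Python structure; the integer page codes and session lengths are made up purely for illustration.

```python
# D_i: the set of sessions for individual i, each session a sequence of
# page-category requests (integer-coded pages; values are illustrative only).
D_i = [
    [3, 7, 7, 1],       # session s_1: four page requests
    [2, 2],             # session s_2
    [5, 1, 4, 4, 4],    # session s_3
]
n_i = len(D_i)          # number of sessions observed for individual i
```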

7
Web Browsing example (continued)
  • The population of Web users is divided into K
    clusters, and a user i has a probability p(c_i = k)
    of belonging to cluster k.
  • Each cluster k is modeled by a finite-state Markov
    model with parameters θ_k. The probability of a sequence
    s = (s_1, ..., s_L) for cluster k is
    p_k(s | θ_k) = π_k(s_1) ∏_{t=2}^{L} T_k(s_t | s_{t-1}).
  • The number of sessions n_i for each individual i follows a
    geometric distribution with parameter λ_k.
  • The overall parameters consist of
    Θ = {p(c_i = k), θ_k, λ_k : k = 1, ..., K}.
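A sketch of the two per-cluster ingredients just described, assuming integer-coded page states; the function names and the particular geometric parameterization p(n) = λ(1−λ)^(n−1) for n ≥ 1 are assumptions made here for illustration.

```python
import numpy as np

def markov_log_prob(seq, init_k, trans_k):
    """log p_k(s | theta_k) for one session under cluster k's Markov model.

    init_k  : length-M vector of initial-state probabilities pi_k(s)
    trans_k : M x M matrix with trans_k[s1, s2] = T_k(s2 | s1)
    """
    logp = np.log(init_k[seq[0]])
    for s1, s2 in zip(seq[:-1], seq[1:]):
        logp += np.log(trans_k[s1, s2])
    return logp

def geometric_log_prob(n, lam_k):
    """log p(n_i | lambda_k), assuming p(n) = lambda * (1 - lambda)**(n - 1)."""
    return np.log(lam_k) + (n - 1) * np.log(1.0 - lam_k)
```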

8
Web Browsing example (continued)
  • The probability of D_i from individual i, assuming
    that i is a member of cluster k, is
    p(D_i | c_i = k, θ_k, λ_k) = p(n_i | λ_k) ∏_{j=1}^{n_i} p_k(s_{i,j} | θ_k).
  • The probability of D_i for i when c_i is unknown is
    p(D_i | Θ) = Σ_{k=1}^{K} p(c_i = k) p(D_i | c_i = k, θ_k, λ_k).
  • The probability that i belongs to cluster k (Bayes rule) is
    p(c_i = k | D_i, Θ) = p(c_i = k) p(D_i | c_i = k, θ_k, λ_k) / Σ_{k'} p(c_i = k') p(D_i | c_i = k', θ_{k'}, λ_{k'}).
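Putting these pieces together, a hedged sketch of the membership computation, reusing the hypothetical markov_log_prob and geometric_log_prob helpers sketched earlier; the log-space normalization is an implementation detail added here for numerical stability.

```python
import numpy as np

def membership_posterior(D_i, cluster_probs, inits, transes, lams):
    """p(c_i = k | D_i, Theta) for one individual, computed in log space."""
    K = len(cluster_probs)
    log_joint = np.empty(K)
    for k in range(K):
        # log p(c_i = k) + log p(D_i | c_i = k, theta_k, lambda_k)
        log_joint[k] = (np.log(cluster_probs[k])
                        + geometric_log_prob(len(D_i), lams[k])
                        + sum(markov_log_prob(s, inits[k], transes[k]) for s in D_i))
    log_joint -= log_joint.max()          # guard against underflow
    post = np.exp(log_joint)
    return post / post.sum()              # Bayes rule: normalize over the K clusters
```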

9
EM-Based Clustering Algorithm for Clustering
Individuals.
  • Consider N individuals, each having a data set D_i.
    Let each D_i consist of n_i observations d_{ij},
    where each d_{ij} may itself be a smaller data subset.
  • According to the generative cluster model, each i
    has a pdf p(D_i | Θ) = Σ_{k=1}^{K} p(c_i = k) p_k(D_i | θ_k),
    where Θ = {θ_1, θ_2, ..., θ_K}
  • and each c_i is the cluster identity of the
    i-th individual.
  • Assuming that the observations are conditionally
    independent, the probability that i belongs to cluster c_i is
    p(c_i = k | D_i, Θ) ∝ p(c_i = k) ∏_{j=1}^{n_i} p(d_{ij} | θ_k).


10
EM Based Clustering Algorithm (continued)
  • To learn the ML or MAP estimates of Θ given D, under
    the assumption that data from different individuals
    are conditionally independent, we get
    p(D | Θ) = ∏_{i=1}^{N} p(D_i | Θ) = ∏_{i=1}^{N} Σ_{k=1}^{K} p(c_i = k) p_k(D_i | θ_k),
  • where Θ_ML = arg max_Θ p(D | Θ) and
  • Θ_MAP = arg max_Θ p(D | Θ) p(Θ).
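As a sketch of the quantity Θ_ML maximizes, the full-data log-likelihood can be accumulated over individuals as below; it reuses the hypothetical per-cluster helpers from the earlier sketches, and Θ_MAP would simply add log p(Θ) to this objective.

```python
import numpy as np

def total_log_likelihood(data, cluster_probs, inits, transes, lams):
    """log p(D | Theta) = sum_i log sum_k p(c_i = k) p(D_i | c_i = k, theta_k, lambda_k)."""
    total = 0.0
    for D_i in data:                                   # independent individuals
        log_joint = np.array([
            np.log(cluster_probs[k])
            + geometric_log_prob(len(D_i), lams[k])
            + sum(markov_log_prob(s, inits[k], transes[k]) for s in D_i)
            for k in range(len(cluster_probs))
        ])
        m = log_joint.max()
        total += m + np.log(np.exp(log_joint - m).sum())   # log-sum-exp over clusters
    return total
```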

11
Web Browsing example
  • Each individual i has n_i sequences, and we
    consider a generative Markov model with states
    from 1 to M.
  • The model parameters for each Markov cluster, θ_k,
    consist of an initial-state probability π_k(s) and an M×M
    transition matrix T_k(s_2 | s_1), where s, s_1, s_2 denote
    discrete states.
  • Let D_i = {s_{i,1}, ..., s_{i,j}, ..., s_{i,n_i}} be the data for the
    i-th individual, where the subscript j denotes the
    j-th sequence for i.
  • The likelihood of s_{i,j} conditioned on a cluster c_i with
    parameters θ_{c_i} is
    p(s_{i,j} | θ_{c_i}) = π_{c_i}(s_{i,j,1}) ∏_{t=2}^{L_{i,j}} T_{c_i}(s_{i,j,t} | s_{i,j,t-1}),
    where L_{i,j} is the length of sequence s_{i,j}.
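As a small usage example of the Markov likelihood above, with a hypothetical 3-state cluster whose parameter values are chosen purely for illustration:

```python
import numpy as np

# Hypothetical cluster k with M = 3 states (values are illustrative only)
pi_k = np.array([0.5, 0.3, 0.2])               # initial-state probabilities pi_k(s)
T_k = np.array([[0.7, 0.2, 0.1],               # T_k[s1, s2] = T_k(s2 | s1); rows sum to 1
                [0.1, 0.8, 0.1],
                [0.3, 0.3, 0.4]])

s_ij = [0, 0, 1, 2]                            # one session s_{i,j} for individual i
log_p = markov_log_prob(s_ij, pi_k, T_k)       # log p(s_{i,j} | theta_k)
```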

12
Web Browsing example (continued)
  • The probability of all of the data D_i from individual i,
    conditioned on cluster c_i, is
    p(D_i | c_i, θ_{c_i}) = ∏_{j=1}^{n_i} p(s_{i,j} | θ_{c_i}).
  • The marginal probability given the model
    parameters can be written as
    p(D_i | Θ) = Σ_{k=1}^{K} α_k p(D_i | c_i = k, θ_k),
  • where α_k = p(c_i = k) defines the proportion of the
    k-th component model.

13
Web Browsing example (continued)
  • Given the definition of the likelihood function,
    the EM procedure becomes:
  • E step: compute the membership probabilities
    p(c_i = k | D_i, Θ) for every individual i and cluster k,
    using the current parameter estimates.
  • M step: re-estimate the parameters (the cluster
    proportions α_k and the cluster-specific parameters θ_k, λ_k)
    as weighted estimates, with each individual's contribution
    weighted by its membership probabilities.
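A compact sketch of one EM iteration for this mixture of Markov models. The weighted-count updates below are the standard mixture-model estimates under the stated assumptions, not formulas copied from the slides; updates for the initial-state probabilities and the geometric session-count parameters are analogous and omitted for brevity.

```python
import numpy as np

def em_iteration(data, cluster_probs, inits, transes, lams, M):
    """One EM pass over all individuals; returns updated weights and transition matrices."""
    N, K = len(data), len(cluster_probs)

    # E step: membership probabilities p(c_i = k | D_i, Theta) for every individual
    resp = np.zeros((N, K))
    for i, D_i in enumerate(data):
        resp[i] = membership_posterior(D_i, cluster_probs, inits, transes, lams)

    # M step: mixture proportions alpha_k = (1/N) * sum_i resp[i, k]
    new_probs = resp.mean(axis=0)

    # M step: transition matrices from membership-weighted transition counts
    new_transes = []
    for k in range(K):
        counts = np.full((M, M), 1e-6)            # tiny smoothing to avoid zero rows
        for i, D_i in enumerate(data):
            for s in D_i:
                for s1, s2 in zip(s[:-1], s[1:]):
                    counts[s1, s2] += resp[i, k]  # weight each count by resp[i, k]
        new_transes.append(counts / counts.sum(axis=1, keepdims=True))
    return new_probs, new_transes
```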

14
Results of clustering
  • (Figure: each square represents the joint probability p(s_i, s_j).)
  • Clustering Genes using Gene Expression Data:
    real-valued sequences as a function of time
  • Clustering Patients based on Red Blood Cell
    Cytograms: vector data with a variable number of data
    points per person

15
Conclusion
  • If one individual has more observations than
    another, that individual's data carries more weight in the
    parameter estimation.
  • Models heterogeneous data across different
    individuals in a general framework.
  • All available data on an individual can be used
    without having to develop specialized algorithms.