Title: Clustering in Generalized Linear Mixed Model Using Dirichlet Process Mixtures
1. Clustering in Generalized Linear Mixed Model Using Dirichlet Process Mixtures
- Ya Xue, Xuejun Liao
- April 1, 2005
2. Introduction
- Concept drift fits within the framework of the generalized linear mixed model, but raises a new question: how to exploit the structure of the auxiliary data.
- Mixtures with a countably infinite number of components can be handled in a Bayesian framework by employing Dirichlet process priors.
3. Outline
- Part I: generalized linear mixed model
  - Generalized linear model (GLM)
  - Generalized linear mixed model (GLMM)
  - Advanced applications
  - Bayesian feature selection in GLMM
- Part II: nonparametric method
  - Chinese restaurant process
  - Dirichlet process (DP)
  - Dirichlet process mixture models
  - Variational inference for Dirichlet process mixtures
4. Part I: Generalized Linear Mixed Model
5. Generalized Linear Model (GLM)
- A linear model specifies the relationship between a dependent (or response) variable Y and a set of predictor variables X, so that $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$.
- GLM is a generalization of normal linear regression models to the exponential family (normal, Poisson, Gamma, binomial, etc.).
6. Generalized Linear Model (GLM)
- GLM differs from the linear model in two major respects:
- The distribution of Y can be non-normal and does not have to be continuous.
- Y can still be predicted from a linear combination of the Xs, but they are "connected" via a link function.
7. Generalized Linear Model (GLM)
- DDE example: binomial distribution.
- Scientific interest: does DDE exposure increase the risk of cancer? Tested on rats; let i index rat.
- Dependent variable: tumor indicator $y_i \in \{0, 1\}$ for rat i.
- Independent variable: dose of DDE exposure, denoted by $x_i$.
8. Generalized Linear Model (GLM)
- Likelihood function of $y_i$: $y_i \sim \mathrm{Bernoulli}(p_i)$.
- Choosing the canonical (logit) link, $\eta_i = \log\frac{p_i}{1-p_i} = x_i^T \beta$, the likelihood function becomes
  $p(y_i \mid x_i, \beta) = \frac{\exp(y_i\, x_i^T \beta)}{1 + \exp(x_i^T \beta)}$.
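As a concrete illustration of this likelihood, the Python sketch below evaluates the Bernoulli log-likelihood under the logit link; the dose values, responses, and coefficients are hypothetical, not from the DDE study.

```python
import numpy as np

def glm_bernoulli_loglik(beta, X, y):
    """Log-likelihood of a logistic GLM: y_i ~ Bernoulli(sigmoid(x_i' beta))."""
    eta = X @ beta                            # linear predictor eta_i = x_i' beta
    # sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Hypothetical data: rows are rats, columns are (intercept, DDE dose).
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])                    # tumor indicators
print(glm_bernoulli_loglik(np.array([-1.0, 1.5]), X, y))
```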
9. GLMM: Basic Model
- Returning to the DDE example, 19 labs all over the world participated in this bioassay.
- There are unmeasured factors that vary between the different labs, for example, rodent diet.
- GLMM is an extension of the generalized linear model obtained by adding random effects to the linear predictor (Schall 1991).
10. GLMM: Basic Model
- The previous linear predictor is modified as
  $\eta_{ij} = x_{ij}^T \beta + z_{ij}^T b_i$,
  where i indexes lab and j indexes rat within lab i.
- $\beta$ are fixed effects: parameters common to all rats.
- $b_i$ are random effects: deviations for lab i.
11. GLMM: Basic Model
- If we choose $z_{ij} = x_{ij}$, then all the regression coefficients are assumed to vary across the different labs.
- If we choose $z_{ij} = 1$, then only the intercept varies across the different labs (random intercept model).
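To make the two choices above concrete, here is a minimal Python sketch of the GLMM linear predictor $\eta_{ij} = x_{ij}^T \beta + z_{ij}^T b_i$; the lab assignments, doses, and effect values are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def glmm_linear_predictor(X, Z, beta, b, lab):
    """eta_j = x_j' beta + z_j' b_{lab[j]}, with lab[j] the lab of observation j."""
    return X @ beta + np.einsum("jk,jk->j", Z, b[lab])

# Random intercept model: z_ij = 1, so only the intercept varies by lab.
n_labs, n_rats = 3, 12
lab = rng.integers(0, n_labs, size=n_rats)            # lab membership of each rat
X = np.column_stack([np.ones(n_rats), rng.uniform(0, 2, n_rats)])  # (1, dose)
Z = np.ones((n_rats, 1))
beta = np.array([-1.0, 1.5])                          # fixed effects (all rats)
b = rng.normal(0.0, 0.5, size=(n_labs, 1))            # random intercept per lab
print(glmm_linear_predictor(X, Z, beta, b, lab))
```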
12. GLMM: Implementation
- Gibbs sampling
  - Disadvantage: slow convergence.
  - Solution: hierarchical centering reparametrisation (Gelfand 1994; Gelfand 1995).
- Deterministic methods are only available for logit and probit models:
  - EM algorithm (Anderson 1985)
  - Simplex method (Im 1988)
13. GLMM: Advanced Applications
- Nested GLMM: within each lab, rats were group-housed with three rats per cage.
  - Let i index lab, j index cage, and k index rat.
- Crossed GLMM: for all labs, four dose protocols were applied to different rats.
  - Let i index lab, j index rat, and k indicate the protocol applied to rat (i, j).
14. GLMM: Advanced Applications
- Nested GLMM: within each lab, rats were group-housed with three rats per cage.
  - Two-level GLMM: level I is lab, level II is cage.
- Crossed GLMM: for all labs, four dose protocols were applied to different rats.
  - Rats are sorted into 19 groups by lab.
  - Rats are sorted into 4 groups by protocol.
15. GLMM: Advanced Applications
- Temporal/spatial statistics: account for correlation between the random effects at different times/locations.
- Dynamic latent variable model (Dunson 2003): let i index patient and t index follow-up time.
16. GLMM: Advanced Applications
- Spatially varying coefficient processes (Gelfand 2003): random effects are modeled as a spatially correlated process.
- Possible application: a landmine field where landmines tend to be close together.
17. Bayesian Feature Selection in GLMM
- Simultaneous selection of fixed and random effects in GLMM (Cai and Dunson 2005).
- Mixture prior: a point mass at zero mixed with a continuous density, e.g. $p(\beta_k) = \pi_0\, \delta_0(\beta_k) + (1 - \pi_0)\, N(\beta_k; 0, \sigma^2)$.
18. Bayesian Feature Selection in GLMM
- Fixed effects: choose mixture priors for the fixed-effects coefficients.
- Random effects: reparameterization.
  - LDU decomposition of the random-effect covariance.
  - Choose mixture priors for the elements of the diagonal matrix.
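A minimal sketch of the spike-and-slab idea behind such mixture priors: each coefficient is either set exactly to zero (selected out) or drawn from a continuous density. The hyperparameters pi0 and sigma are illustrative, not the specific choices of Cai and Dunson (2005).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mixture_prior(n, pi0=0.5, sigma=1.0):
    """Draw n coefficients from pi0 * delta_0 + (1 - pi0) * N(0, sigma^2)."""
    keep = rng.random(n) >= pi0                # False -> point mass at zero
    return np.where(keep, rng.normal(0.0, sigma, n), 0.0)

print(sample_mixture_prior(10))                # zeros mark excluded effects
```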
19. Missing Identification in GLMM
- Data table of the DDE bioassay.
- What if the first column is missing?
- This is an unusual case in statistics, so few people work on it.
- But it is the problem we have to solve for concept drift.
20. Concept Drift
- Primary data
- Auxiliary data
- If we treat the drift variable as a random variable, concept drift is a random intercept model, a special case of GLMM. A sketch of this reading follows below.
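A minimal sketch of the random-intercept reading of concept drift, assuming a logistic link; the drift values and coefficients are illustrative.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def predict(x, beta, delta=0.0):
    """P(y = 1) with drift treated as a random intercept delta:
    delta = 0 for primary data; auxiliary source i has its own delta_i."""
    return sigmoid(x @ beta + delta)

beta = np.array([-1.0, 1.5])
x = np.array([1.0, 0.8])
print(predict(x, beta))            # primary data
print(predict(x, beta, -0.7))      # auxiliary data with drift -0.7
```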
21. Clustering in Concept Drift
- Figure: histogram of drift values over the 300 auxiliary data points (bin resolution 1), yielding K = 51 clusters (including 0).
22. Clustering in Concept Drift
- There are intrinsic clusters in the auxiliary data with respect to drift value.
- The simplest explanation is best (Occam's razor).
- Why don't we give each cluster its own random-effect variable instead?
23. Clustering in Concept Drift
- In usual statistics applications, we know which individuals share the same random effect.
- However, in concept drift, we do not know which individuals (data points or features) share the same random intercept.
- Can we train the classifier and cluster the auxiliary data simultaneously? This is a new problem we aim to solve.
24. Clustering in Concept Drift
- How many clusters (K) should we include in our model?
- Does choosing K actually make sense?
- Is there a better way?
25. Part II: Nonparametric Method
26. Nonparametric Method
- Parametric methods: the forms of the underlying density functions are known.
- Nonparametric methods are a wide category, e.g., NN, minimax, bootstrapping, ...
- Nonparametric Bayesian methods: make use of the Bayesian calculus without prior parameterized knowledge.
27. Cornerstones of Nonparametric Bayesian Methods
- Dirichlet process (DP): allows flexible structures to be learned and allows sharing of statistical strength among sets of related structures.
- Gaussian process (GP): allows sharing in the context of multiple nonparametric regressions.
- (We suggest a separate seminar on GP.)
28. Chinese Restaurant Process
- The Chinese restaurant process (CRP) is a distribution on partitions of integers.
- The CRP is used to represent uncertainty over the number of components in a mixture model.
29. Chinese Restaurant Process
- Unlimited number of tables
- Each table has an unlimited capacity to seat
customers.
30. Chinese Restaurant Process
The (m+1)th customer sits at a table drawn from the following distribution:
$p(\text{occupied table } i) = \frac{m_i}{m + \alpha}, \qquad p(\text{next unoccupied table}) = \frac{\alpha}{m + \alpha}$,
where $m_i$ is the number of previous customers at table i and $\alpha$ is a parameter.
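The seating rule is straightforward to simulate; below is a minimal Python sketch (the concentration parameter and number of customers are arbitrary).

```python
import numpy as np

def sample_crp(n_customers, alpha, rng):
    """Seat customers sequentially: customer m+1 joins table i w.p. m_i/(m+alpha),
    or a new table w.p. alpha/(m+alpha). Returns each customer's table index."""
    tables, assignment = [], []
    for m in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float) / (m + alpha)
        i = rng.choice(len(probs), p=probs)
        if i == len(tables):
            tables.append(1)                   # open a new table
        else:
            tables[i] += 1
        assignment.append(i)
    return assignment

print(sample_crp(20, alpha=1.0, rng=np.random.default_rng(2)))
```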
31. Chinese Restaurant Process
Example: the probability that the next customer sits at a given occupied table is proportional to the number of customers already seated there, and a new table is opened with probability proportional to $\alpha$.
32. Chinese Restaurant Process
- CRP yields an exchangeable distribution on partitions of integers, i.e., the specific ordering of the customers is irrelevant.
- An infinite set of random variables is said to be infinitely exchangeable if for every finite subset $\{x_1, \dots, x_n\}$ we have
  $p(x_1, \dots, x_n) = p(x_{\sigma(1)}, \dots, x_{\sigma(n)})$
  for any permutation $\sigma$.
33. Dirichlet Process
Let $G_0$ be any probability measure on the reals and $(B_1, \dots, B_k)$ any finite partition of the reals.
A random measure $G$ is a Dirichlet process if the following equation holds for all partitions:
$(G(B_1), \dots, G(B_k)) \sim \mathrm{Dir}(\alpha G_0(B_1), \dots, \alpha G_0(B_k))$,
where $\alpha$ is a concentration parameter.
Note: Dir = Dirichlet distribution, DP = Dirichlet process.
34. Dirichlet Process
- Denote a sample from the Dirichlet process as $G \sim \mathrm{DP}(\alpha, G_0)$; $G$ is a distribution.
- Denote a sample from the distribution $G$ as $\theta \sim G$.
- Graphical model for a DP generating the parameters $\theta_n$.
36. Dirichlet Process
- The marginal probabilities for a new $\theta$:
  $\theta_{n+1} \mid \theta_1, \dots, \theta_n \;\sim\; \frac{\alpha}{n+\alpha}\, G_0 + \frac{1}{n+\alpha} \sum_{i=1}^{n} \delta_{\theta_i}$.
- This is the Chinese restaurant process.
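These marginals can be sampled directly, one $\theta$ at a time (the Polya urn scheme); a minimal Python sketch, assuming $G_0 = N(0, 1)$ for illustration.

```python
import numpy as np

def polya_urn(n, alpha, sample_g0, rng):
    """theta_{m+1} is a fresh draw from G0 w.p. alpha/(m+alpha),
    or a copy of a uniformly chosen earlier theta w.p. m/(m+alpha)."""
    thetas = []
    for m in range(n):
        if rng.random() < alpha / (m + alpha):
            thetas.append(sample_g0(rng))           # new value from base measure
        else:
            thetas.append(thetas[rng.integers(m)])  # repeat an earlier value
    return thetas

rng = np.random.default_rng(3)
thetas = polya_urn(20, alpha=1.0, sample_g0=lambda r: r.normal(0.0, 1.0), rng=rng)
print(len(set(thetas)), "distinct values among 20 draws")
```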
37. DP Mixtures
- $G \sim \mathrm{DP}(\alpha, G_0)$, $\theta_n \mid G \sim G$, $y_n \mid \theta_n \sim F(\theta_n)$.
- If F is a normal distribution, this is a Gaussian mixture model.
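Combining the urn with a Gaussian likelihood gives a generative sketch of a DP Gaussian mixture; $G_0 = N(0, 1)$ and the within-cluster noise scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_dp_gaussian_mixture(n, alpha, sigma=0.3):
    """DP mixture with F = N(theta, sigma^2): cluster means follow the Polya urn,
    and each observation y_m is drawn around its cluster mean."""
    means, y = [], []
    for m in range(n):
        if rng.random() < alpha / (m + alpha):
            means.append(rng.normal(0.0, 1.0))      # new cluster mean from G0
        else:
            means.append(means[rng.integers(m)])    # join an existing cluster
        y.append(rng.normal(means[-1], sigma))
    return np.array(y), np.array(means)

y, means = sample_dp_gaussian_mixture(50, alpha=1.0)
print(len(np.unique(means)), "clusters among 50 points")
```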
38. Applications of DP
- Infinite Gaussian Mixture Model (Rasmussen 2000)
- Infinite Hidden Markov Model (Beal 2002)
- Hierarchical Topic Models and the Nested Chinese
Restaurant Process (Blei 2004)
39. Implementation of DP
- Gibbs sampling:
  - If $G_0$ is a conjugate prior for the likelihood given by F (Escobar 1995).
  - Non-conjugate prior (Neal 1998).
40. Variational Inference for DPM
- The goal is to compute the predictive density under a DP mixture.
- To do so, we minimize the KL divergence between the true posterior p and a variational distribution q.
- The algorithm is based on the stick-breaking representation of the DP.
- (We suggest a separate seminar on the stick-breaking view of DP and variational DP.)
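Since the algorithm rests on the stick-breaking representation, here is a minimal sketch of a truncated stick-breaking draw of $G$, assuming $G_0 = N(0, 1)$ and an arbitrary truncation level T.

```python
import numpy as np

def stick_breaking(alpha, T, rng):
    """Truncated stick-breaking: v_t ~ Beta(1, alpha),
    pi_t = v_t * prod_{s<t} (1 - v_s); atoms theta_t ~ G0 = N(0, 1)."""
    v = rng.beta(1.0, alpha, size=T)
    v[-1] = 1.0                                  # give remaining mass to last stick
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = rng.normal(0.0, 1.0, size=T)
    return pi, atoms

pi, atoms = stick_breaking(alpha=1.0, T=20, rng=np.random.default_rng(5))
print(pi.sum())                                  # exactly 1 due to truncation
```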
41. Open Questions
- Can we apply ideas of infinite models beyond identifying the number of states or components in a mixture?
- Under what conditions can we expect these models to give consistent estimates of densities?
- ...
- Specific to our problem: the prior is non-conjugate due to the sigmoid link function.