Title: A Guided Tour of Finite Mixture Models: From Pearson to the Web
1. A Guided Tour of Finite Mixture Models: From Pearson to the Web
ICML 2001 Keynote Talk, Williams College, MA, June 29th, 2001
- Padhraic Smyth
- Information and Computer Science
- University of California, Irvine
- www.datalab.uci.edu
2. Outline
- What are mixture models?
- Definitions and examples
- How can we learn mixture models?
- A brief history and illustration
- What are mixture models useful for?
- Applications in Web and transaction data
- Recent research in mixtures?
3. Acknowledgements
- Students
  - Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
- Collaborators
  - David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
- Funding
  - NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software
4-8. Finite Mixture Models

p(x) = Σ_k α_k p_k(x | θ_k)

- α_k: weight of component k
- p_k: component model k
- θ_k: parameters of component k
9-10. Example: Mixture of Gaussians
Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance shape Σ_k.
11. Example: Mixture of Gaussians
Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance shape Σ_k.
e.g., K = 2, 1-dim: θ = {μ1, σ1, μ2, σ2, α1} (see the sketch below)
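A minimal sketch of this K = 2, one-dimensional case. The parameter values below are invented purely for illustration; they are not from the talk.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, mu1, s1, mu2, s2, a1):
    """Two-component Gaussian mixture: a1*N(mu1, s1^2) + (1-a1)*N(mu2, s2^2)."""
    return a1 * gaussian_pdf(x, mu1, s1) + (1 - a1) * gaussian_pdf(x, mu2, s2)

# Hypothetical parameter values, chosen only for illustration
x = np.linspace(-5, 10, 200)
density = mixture_pdf(x, mu1=0.0, s1=1.0, mu2=5.0, s2=2.0, a1=0.3)
print(density[:5])
```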
15-17. Example: Mixture of Naïve Bayes
Conditional independence model for each component (often quite useful to first order); see the sketch below.
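A minimal sketch of the conditional-independence (naïve Bayes) component model for binary features. The component weights and per-term probabilities below are made up for illustration.

```python
import numpy as np

def component_prob(x, theta_k):
    """p(x | component k) under conditional independence:
    product over features j of p(x_j | k), with x_j in {0, 1}."""
    return np.prod(np.where(x == 1, theta_k, 1.0 - theta_k))

def mixture_prob(x, weights, thetas):
    """p(x) = sum_k alpha_k * p(x | component k)."""
    return sum(a * component_prob(x, t) for a, t in zip(weights, thetas))

# Hypothetical example: 2 components, 4 binary features (e.g., term occurrences)
weights = [0.6, 0.4]
thetas = [np.array([0.9, 0.8, 0.1, 0.1]),   # component 1 favors terms 1-2
          np.array([0.1, 0.2, 0.9, 0.7])]   # component 2 favors terms 3-4
x = np.array([1, 1, 0, 0])
print(mixture_prob(x, weights, thetas))
```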
18. Mixtures of Naïve Bayes
[Figure: sparse binary document-term matrix — rows are documents, columns are terms, 1s mark term occurrences]
19. Mixtures of Naïve Bayes
[Figure: the same document-term matrix with the documents grouped into Component 1 and Component 2]
20. Other Component Models
- Mixtures of Rectangles
- Pelleg and Moore (ICML, 2001)
- Mixtures of Trees
- Meila and Jordan (2000)
- Mixtures of Curves
- Quandt and Ramsey (1978)
- Mixtures of Sequences
- Poulsen (1990)
21. Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish, C = {male, female}
22. Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish, C = {male, female}
- 2. C might have an interpretation
  - e.g., clusters of Web surfers
23. Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish, C = {male, female}
- 2. C might have an interpretation
  - e.g., clusters of Web surfers
- 3. C is just a convenient latent variable
  - e.g., flexible density estimation
24-26. Graphical Models for Mixtures
E.g., mixtures of naïve Bayes
[Diagram: a discrete, hidden class node C with observed children X1, X2, X3]
27-28. Sequential Mixtures
[Diagram: at each time step t-1, t, t+1, a hidden class node C with observed children X1, X2, X3; the C nodes are linked across time]
- Markov mixtures: C has Markov dependence
- Hidden Markov Model (here with a naïve Bayes emission model)
- C is a discrete state that couples the observables through time (see the sketch below)
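To make the "couples observables through time" point concrete, here is a minimal forward-algorithm sketch for a discrete-state HMM. The emission model is left abstract (with naïve Bayes it would be a product over X1, X2, X3), and all matrices below are invented for illustration.

```python
import numpy as np

def hmm_log_likelihood(obs_probs, pi, A):
    """Scaled forward algorithm for an HMM with K discrete states.
    obs_probs[t, k] = p(observations at time t | state k),
    pi[k] = initial state probabilities, A[j, k] = p(state k at t | state j at t-1)."""
    T, K = obs_probs.shape
    alpha = pi * obs_probs[0]            # forward messages at t = 0
    log_lik = 0.0
    for t in range(1, T):
        c = alpha.sum()                  # rescale to avoid numerical underflow
        log_lik += np.log(c)
        alpha = (alpha / c) @ A * obs_probs[t]
    log_lik += np.log(alpha.sum())
    return log_lik

# Hypothetical 2-state example with made-up emission probabilities
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
obs_probs = np.array([[0.7, 0.1], [0.6, 0.2], [0.1, 0.5]])
print(hmm_log_likelihood(obs_probs, pi, A))
```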
29. Dynamic Mixtures
- Computer Vision
  - mixtures of Kalman filters for tracking
- Atmospheric Science
  - mixtures of curves and dynamical models for cyclones
- Economics
  - mixtures of switching regressions for the US economy
30. Limitations of Mixtures
- Discrete state space
- not always appropriate
- e.g., in modeling dynamical systems
- Training
- no closed form solution, can be tricky
- Interpretability
- many different mixture solutions may explain the
same data
31. Learning of Mixture Models
32. Learning Mixtures from Data
- Consider fixed K
  - e.g., unknown parameters θ = {μ1, σ1, μ2, σ2, α1}
- Given data D = {x1, ..., xN}, we want to find the parameters θ that best fit the data
33. Early Attempts
- Weldon's data, 1893
  - n = 1000 crabs from the Bay of Naples
  - ratio of forehead to body length
  - suspected existence of 2 separate species
34. Early Attempts
- Karl Pearson, 1894
  - JRSS paper
  - proposed a mixture of 2 Gaussians
  - 5 parameters: θ = {μ1, σ1, μ2, σ2, α1}
  - parameter estimation -> method of moments
  - involved the solution of 9th-order equations!
- (see Chapter 10, Stigler (1986), The History of Statistics)
35. "The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task. But I fear he will have few successors." — Charlier (1906)
36-37. Maximum Likelihood Principle
- Fisher, 1922
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely (see the formula below)
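Written out for the mixture setting (this formula is implied by the slide but not shown explicitly):

```latex
\mathcal{L}(\theta) = p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
  = \prod_{i=1}^{N} \sum_{k=1}^{K} \alpha_k \, p_k(x_i \mid \theta_k),
\qquad
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \log \mathcal{L}(\theta)
```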
38. 1977: The EM Algorithm
- Dempster, Laird, and Rubin
- general framework for likelihood-based parameter estimation with missing data
  - start with initial guesses of the parameters
  - E-step: estimate memberships given the parameters
  - M-step: estimate parameters given the memberships
  - repeat until convergence
- converges to a (local) maximum of the likelihood
- E-step and M-step are often computationally simple
- generalizes to maximum a posteriori (with priors)
(see the sketch below)
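A minimal E-step/M-step sketch for the two-component, one-dimensional Gaussian mixture from the earlier example. The initialization and stopping rule are simplified for illustration, and the synthetic data are invented; this is not the implementation used in the talk.

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture; returns (a1, mu1, s1, mu2, s2)."""
    # crude initial guesses
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std()
    a1 = 0.5
    for _ in range(n_iter):
        # E-step: membership probabilities (responsibilities) given current params
        p1 = a1 * np.exp(-0.5 * ((x - mu1) / s1) ** 2) / s1
        p2 = (1 - a1) * np.exp(-0.5 * ((x - mu2) / s2) ** 2) / s2
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1
        # M-step: weighted parameter estimates given memberships
        a1 = r1.mean()
        mu1 = np.sum(r1 * x) / np.sum(r1)
        mu2 = np.sum(r2 * x) / np.sum(r2)
        s1 = np.sqrt(np.sum(r1 * (x - mu1) ** 2) / np.sum(r1))
        s2 = np.sqrt(np.sum(r2 * (x - mu2) ** 2) / np.sum(r2))
    return a1, mu1, s1, mu2, s2

# Synthetic data drawn from a known mixture, for illustration only
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 700)])
print(em_gaussian_mixture(x))
```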
40. Example of a Log-Likelihood Surface
[Figure: log-likelihood surface plotted against mean 2 and a log scale for sigma 2]
50. [Figure: data for the Control Group vs. the Anemia Group]
51. Data for an Individual Patient
[Figure: healthy state vs. anemic state]
(Cadez et al., Machine Learning, in press)
52. Alternatives to EM
- Method of Moments
- EM is more efficient
- Direct optimization
- e.g., gradient descent, Newton methods
- EM is simpler to implement
- Sampling (e.g., MCMC)
- Minimum distance, e.g.,
53. How Many Components?
- 2 general approaches
  - 1. Best density estimator
    - e.g., what predicts best on new data
  - 2. True number of components
    - cannot be answered from data alone
54. [Figure: the data-generating process ("truth") and the closest model, in terms of KL distance, within the K = 2 model class]
55. [Figure: the same, for the K = 10 model class]
56. Prescriptions for Model Selection
- Minimize distance to truth
  - maximize the predictive log p score
  - gives an estimate of KL(model, truth)
  - pick the model that predicts best on validation data (see the sketch below)
- Bayesian techniques
  - p(K | data): impossible to compute exactly
  - closed-form approximations: BIC, AutoClass, etc.
  - sampling: Monte Carlo techniques are quite tricky for mixtures
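A minimal sketch of the held-out predictive log p prescription, with BIC shown alongside as a closed-form alternative. It uses scikit-learn's GaussianMixture as a stand-in fitting routine and a synthetic data set; neither is from the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data for illustration; in practice this would be the real data set
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 2, 500)]).reshape(-1, 1)
rng.shuffle(data)
train, valid = data[:700], data[700:]

for K in range(1, 6):
    model = GaussianMixture(n_components=K, random_state=0).fit(train)
    # predictive log p score on held-out data (higher is better;
    # estimates the KL distance to the truth up to an additive constant)
    heldout = model.score(valid) * len(valid)
    # BIC as a closed-form Bayesian-style criterion (lower is better)
    print(K, round(heldout, 1), round(model.bic(train), 1))
```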
57. Mixture Model Applications
58. What Are Mixtures Used For?
- Modeling heterogeneity
  - e.g., inferring multiple species in biology
- Handling missing data
  - e.g., variables and cases missing during model-building
- Density estimation
  - e.g., as flexible priors in Bayesian statistics
- Clustering
  - components as clusters
- Model averaging
  - combining density models
59. Mixtures of Non-Vector Data
- Example
  - N individuals, and sets of sequences for each
  - e.g., Web session data
- Clustering of the N individuals?
  - Vectorize the data and apply vector methods?
  - Pairwise distances between sets of sequences?
  - Parameter estimates for each individual, then cluster?
60-61. Mixtures of Sequences, Curves, ...
Generative model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(D_i | c_k), we can define an EM algorithm (see the sketch below)
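A minimal sketch of this generative process, taking p(D_i | c_k) to be a first-order Markov chain over page categories (one plausible choice for Web session data). The component weights and transition matrices below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture over sequences of 3 page categories
weights = [0.7, 0.3]
transition = [np.array([[0.8, 0.1, 0.1],     # component 1: tends to stay on page 0
                        [0.3, 0.6, 0.1],
                        [0.3, 0.1, 0.6]]),
              np.array([[0.1, 0.1, 0.8],     # component 2: drifts toward page 2
                        [0.1, 0.2, 0.7],
                        [0.1, 0.1, 0.8]])]

def generate_individual(length=10):
    """Select a component c_k for the individual, then generate D_i ~ p(D_i | c_k)."""
    k = rng.choice(len(weights), p=weights)
    state = rng.integers(3)                   # uniform initial page, for simplicity
    seq = [state]
    for _ in range(length - 1):
        state = rng.choice(3, p=transition[k][state])
        seq.append(state)
    return k, seq

print([generate_individual(5) for _ in range(3)])
```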
62. Application 1: Web Log Visualization
(Cadez, Heckerman, Meek, Smyth, KDD 2000)
- MSNBC Web logs
  - 2 million individuals per day
  - different session lengths per individual
  - a difficult visualization and clustering problem
- WebCanvas
  - uses mixtures of SFSMs to cluster individuals based on their observed sequences
  - software tool: EM mixture modeling plus visualization
64. Example: Mixtures of SFSMs
- Simple model for traversal on a Web site
  - (equivalent to a first-order Markov model with an end state)
- Generative model for large sets of Web users
  - different behaviors <-> different components in a mixture of SFSMs
- The EM algorithm is quite simple: weighted counts (see the sketch below)
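A minimal sketch of one EM iteration for a mixture of first-order Markov chains, illustrating the "weighted counts" idea. The initial- and end-state distributions are omitted to keep it short, and the toy sequences are invented; this is not the WebCanvas implementation.

```python
import numpy as np

def em_step(sequences, weights, transitions):
    """One EM iteration for a K-component mixture of first-order Markov chains
    over S states. sequences: list of lists of state indices."""
    K, S = len(weights), transitions[0].shape[0]
    counts = np.zeros((K, S, S))
    new_weights = np.zeros(K)
    for seq in sequences:
        # E-step: membership probability of this sequence under each component
        log_p = np.log(weights).copy()
        for a, b in zip(seq[:-1], seq[1:]):
            log_p += np.log([transitions[k][a, b] for k in range(K)])
        resp = np.exp(log_p - log_p.max())
        resp /= resp.sum()
        # accumulate fractionally weighted transition counts
        for a, b in zip(seq[:-1], seq[1:]):
            counts[:, a, b] += resp
        new_weights += resp
    # M-step: re-estimate parameters from the weighted counts (lightly smoothed)
    new_transitions = [(counts[k] + 1e-6) / (counts[k] + 1e-6).sum(axis=1, keepdims=True)
                       for k in range(K)]
    return new_weights / len(sequences), new_transitions

# Hypothetical tiny example: 2 components, 2 states
seqs = [[0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1]]
w, T = em_step(seqs, np.array([0.5, 0.5]),
               [np.full((2, 2), 0.5), np.array([[0.9, 0.1], [0.1, 0.9]])])
print(w)
```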
65. WebCanvas (Cadez, Heckerman, et al., KDD 2000)
67. Timing Results
68. Transaction Data
69. Profiling for Transaction Data
- Profiling
  - given transactions for a set of individuals
  - infer a predictive model for the future transactions of individual i
- Typical applications
  - automated recommender systems, e.g., Amazon.com
  - personalization, e.g., wireless information services
- Existing techniques
  - collaborative filtering, association rules
  - poorly suited to prediction, e.g., to incorporating seasonality
70. Application 2: Transaction Data
(Cadez, Smyth, Mannila, KDD 2001)
- Retail data set
  - 200,000 individuals
  - all market baskets over 2 years
  - 50 departments, 50k items
- Problem
  - predict what an individual will purchase in the future
  - want to generalize across products
  - want to allow heterogeneity
71. Transaction Data
72. Examples of Mixture Components
Components encode typical combinations of clothes.
73. Mixture-Based Profiles
Predictive profile for individual i: a weighted combination of component basket models,
p(basket | individual i) = Σ_k α_ik p_k(basket)
- α_ik: probability that individual i engages in behavior k, given that they enter the store
- p_k: basket model for component k (basis function)
(see the sketch below)
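A minimal sketch of such a profile for one individual. The component "basis functions" and the individual's weights below are made up for illustration.

```python
import numpy as np

# Hypothetical component basket models: probability of purchasing in each of 5 departments
basket_models = np.array([[0.70, 0.20, 0.05, 0.03, 0.02],   # component 1: mostly dept 0
                          [0.05, 0.10, 0.50, 0.25, 0.10],   # component 2: depts 2-3
                          [0.10, 0.10, 0.10, 0.10, 0.60]])  # component 3: dept 4

# alpha_ik: probability that individual i engages in behavior k on a store visit (made up)
alpha_i = np.array([0.6, 0.3, 0.1])

# Predictive profile for individual i: weighted combination of the component basket models
profile_i = alpha_i @ basket_models
print(profile_i)   # expected purchase distribution over departments for individual i
```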
74. Hierarchical Bayes Model
- Empirical prior on the mixture weights
[Diagram: individuals 1 ... i ... N, each with observed baskets B1, B2, B3, ..., sharing a common prior on their mixture weights]
- Individuals with little data get shrunk toward the prior
- Individuals with a lot of data are more data-driven
(see the sketch below)
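One simple way to see the shrinkage effect: a MAP-style estimate of an individual's mixture weights under a Dirichlet-like prior built from the population-level weights. This is a sketch of the idea only, not the exact hierarchical model in the talk; the prior, the strength gamma, and the counts are all invented.

```python
import numpy as np

def smoothed_weights(component_counts, prior_weights, gamma=10.0):
    """Blend an individual's fractional component counts with an empirical prior.
    Individuals with few transactions stay close to the prior; individuals with
    many transactions are dominated by their own counts."""
    counts = np.asarray(component_counts, dtype=float)
    return (counts + gamma * prior_weights) / (counts.sum() + gamma)

prior = np.array([0.5, 0.3, 0.2])           # population-level mixture weights (made up)
print(smoothed_weights([1, 0, 0], prior))   # little data -> close to the prior
print(smoothed_weights([80, 5, 5], prior))  # lots of data -> close to the empirical fractions
```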
75-79. Data and Profile Example
[Figures: observed purchase data and fitted profiles for several individuals; annotations: "No training data for 14", "No purchases above Dept 25"]
82. Transaction Mixtures
- Mixture-based profiles
  - interpretable and flexible
  - more accurate than non-mixture approaches
  - training time is linear in the number of items
- Applications
  - early detection of high-value customers
  - visualization and exploration
  - forecasting customer behavior
83. Extensions of Mixtures
84-85. Extensions: Multiple Causes
- Single cause variable
  [Diagram: C -> X]
- Multiple causes (factors)
  [Diagram: C1, C2, C3 -> X]
- Examples
  - Hofmann (1999, 2000) for text
  - ICA for signals
86-89. Extensions: High Dimensions
- Density estimation is difficult in high dimensions
- Global PCA
- Mixtures of PCA (Tipping and Bishop, 1999)
90-91. Extensions: Predictive Mixtures
- Standard mixture model
- Conditional mixtures: the mixture weights are input-dependent
  - e.g., mixtures of experts, Jordan and Jacobs (1994)
92. Extensions: Learning Algorithms
- Fast algorithms
  - kd-trees (Moore, 1998)
  - caching (Bradley et al.)
- Random projections
  - Dasgupta (2000)
- Mean-squared-error criteria
  - Scott (2000)
- Bayesian techniques
  - reversible-jump MCMC (Green et al.)
93. Classic References
- Statistical Analysis of Finite Mixture Distributions, Titterington, Smith, and Makov, Wiley, 1985
- Finite Mixture Models, McLachlan and Peel, Wiley, 2000
94. Conclusions
- Mixtures
  - a flexible tool in the machine learner's toolbox
- Beyond mixtures of Gaussians
  - mixtures of sequences, curves, trees, etc.
- Applications
  - numerous and broad
97. Concavity of Likelihood (Cadez and Smyth, NIPS 2000)
98. Application 3: Model Averaging
Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty:
p(x | D) = Σ_k p(x | M_k, D) p(M_k | D)
- p(x | M_k, D): prediction of model k
- p(M_k | D): weight of model k
- p(x | D): prediction of x given data D
99. Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
Simple idea: use cross-validation to estimate the weights, rather than a Bayesian scheme.
Two-phase learning:
1. Learn each model M_k on D_train
2. Learn the mixture-model weights on D_validation
   - components are fixed
   - EM just learns the weights (see the sketch below)
Outperforms any single model selection technique; even outperforms "cheating".
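A minimal sketch of phase 2: with the component models fixed, EM reduces to re-estimating the weights from each model's density evaluated at the validation points. The density matrix below is invented for illustration.

```python
import numpy as np

def stack_weights(p_valid, n_iter=100):
    """p_valid[i, k] = p(x_i | model M_k) on validation data; returns mixture weights."""
    n, K = p_valid.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        resp = w * p_valid                      # E-step: unnormalized memberships
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                   # M-step: weights only (components fixed)
    return w

# Hypothetical densities of 4 validation points under 3 previously trained models
p_valid = np.array([[0.20, 0.05, 0.01],
                    [0.15, 0.10, 0.02],
                    [0.01, 0.30, 0.05],
                    [0.02, 0.25, 0.20]])
print(stack_weights(p_valid))
```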
100-102. Query Approximation (Pavlov and Smyth, KDD 2001)
[Diagram: Large Database, Approximate Models, Query Generator]
- Construct probability models offline, e.g., mixtures, belief networks, etc.
- Provide fast query answers online (see the sketch below)
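A minimal sketch of the online step, using a naïve-Bayes-style mixture over binary attributes as the offline model (all numbers are invented): the count for a conjunctive query is estimated as N times the model's probability of the conjunction, with no scan of the database.

```python
import numpy as np

# Offline model (made up): 2 components over 4 binary attributes
weights = np.array([0.6, 0.4])
attr_probs = np.array([[0.8, 0.7, 0.1, 0.2],    # p(attribute j = 1 | component 1)
                       [0.1, 0.2, 0.9, 0.6]])   # p(attribute j = 1 | component 2)
N = 32000                                        # number of records in the database

def query_count(query):
    """Estimate the count for a conjunctive query, e.g., {0: 1, 2: 1} means A0=1 AND A2=1."""
    p = 0.0
    for k, w in enumerate(weights):
        comp = w
        for j, value in query.items():
            comp *= attr_probs[k, j] if value == 1 else 1.0 - attr_probs[k, j]
        p += comp
    return N * p

print(query_count({0: 1, 2: 1}))   # fast approximate answer from the offline model
```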
103. Stacking for Query Model Combining
Conjunctive queries on Microsoft Web data: 32k records, 294 attributes (available online at the UCI KDD Archive)
104. [Figure: sparse binary data matrix — rows are observation vectors, columns are attributes, 1s mark attribute occurrences]
105. Treat as Missing
[Figure: the same binary data matrix with a hidden cluster-label column (C1 or C2 for each observation vector), treated as missing data]
106. Treat as Missing
[Figure: each observation vector now carries membership probabilities P(C1|x) and P(C2|x) in place of a hard cluster label]
E-step: estimate the component membership probabilities given the current parameter estimates
107. Treat as Missing
[Figure: the same matrix of membership probabilities P(C1|x), P(C2|x)]
M-step: use the fractionally weighted data to get new estimates of the parameters (see the sketch below)
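A minimal sketch of these E- and M-steps for a two-component naïve Bayes (Bernoulli) mixture on a binary data matrix, using fractional weighted counts; the data and initialization below are invented for illustration.

```python
import numpy as np

def em_bernoulli_mixture(X, n_iter=50):
    """EM for a 2-component mixture of independent Bernoullis on a binary matrix X."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.3, 0.7, size=(2, d))   # p(attribute j = 1 | component k)
    alpha = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: P(C_k | x_i) for every row, given current parameter estimates
        like = np.array([np.prod(np.where(X == 1, theta[k], 1 - theta[k]), axis=1)
                         for k in range(2)]).T          # shape (n, 2)
        resp = alpha * like
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: fractionally weighted counts give new parameter estimates
        alpha = resp.mean(axis=0)
        theta = (resp.T @ X + 1e-6) / (resp.sum(axis=0)[:, None] + 2e-6)
    return alpha, theta

X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
              [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]])
print(em_bernoulli_mixture(X))
```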