Title: A Guided Tour of Finite Mixture Models: From Pearson to the Web
1. A Guided Tour of Finite Mixture Models: From Pearson to the Web
ICML 2001 Keynote Talk, Williams College, MA, June 29th, 2001
- Padhraic Smyth
- Information and Computer Science
- University of California, Irvine
- www.datalab.uci.edu
2. Outline
- What are mixture models?
- Definitions and examples
- How can we learn mixture models?
- A brief history and illustration
- What are mixture models useful for?
- Applications in Web and transaction data
- Recent research in mixtures?
3. Acknowledgements
- Students
  - Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
- Collaborators
  - David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
- Funding
  - NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software
4-8. Finite Mixture Models

p(x) = Σ_k α_k p_k(x | θ_k)

- α_k: weight of component k
- p_k: component model k
- θ_k: parameters of component k
9-10. Example: Mixture of Gaussians
Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance shape Σ_k.
11. Example: Mixture of Gaussians
Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance shape Σ_k.
e.g., K = 2, 1-dim: θ = {μ1, σ1, μ2, σ2, α1} (see the sketch below)
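A minimal sketch of this K = 2, one-dimensional case. The parameter values below are invented purely for illustration; they are not from the talk.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, mu1, s1, mu2, s2, a1):
    """Two-component Gaussian mixture: a1*N(mu1, s1^2) + (1-a1)*N(mu2, s2^2)."""
    return a1 * gaussian_pdf(x, mu1, s1) + (1 - a1) * gaussian_pdf(x, mu2, s2)

# Hypothetical parameter values, chosen only for illustration
x = np.linspace(-5, 10, 200)
density = mixture_pdf(x, mu1=0.0, s1=1.0, mu2=5.0, s2=2.0, a1=0.3)
print(density[:5])
```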
15-17. Example: Mixture of Naïve Bayes
Conditional independence model for each component (often quite useful to first order); see the sketch below.
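A minimal sketch of the conditional-independence (naïve Bayes) component model for binary features. The component weights and per-term probabilities below are made up for illustration.

```python
import numpy as np

def component_prob(x, theta_k):
    """p(x | component k) under conditional independence:
    product over features j of p(x_j | k), with x_j in {0, 1}."""
    return np.prod(np.where(x == 1, theta_k, 1.0 - theta_k))

def mixture_prob(x, weights, thetas):
    """p(x) = sum_k alpha_k * p(x | component k)."""
    return sum(a * component_prob(x, t) for a, t in zip(weights, thetas))

# Hypothetical example: 2 components, 4 binary features (e.g., term occurrences)
weights = [0.6, 0.4]
thetas = [np.array([0.9, 0.8, 0.1, 0.1]),   # component 1 favors terms 1-2
          np.array([0.1, 0.2, 0.9, 0.7])]   # component 2 favors terms 3-4
x = np.array([1, 1, 0, 0])
print(mixture_prob(x, weights, thetas))
```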
18. Mixtures of Naïve Bayes
[Figure: sparse binary document-term matrix — rows are documents, columns are terms, 1s mark term occurrences]
19. Mixtures of Naïve Bayes
[Figure: the same document-term matrix with the documents grouped into Component 1 and Component 2]
20. Other Component Models
- Mixtures of Rectangles
- Pelleg and Moore (ICML, 2001)
- Mixtures of Trees
- Meila and Jordan (2000)
- Mixtures of Curves
- Quandt and Ramsey (1978)
- Mixtures of Sequences
- Poulsen (1990)
21. Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish, C = {male, female}
22. Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish, C = {male, female}
- 2. C might have an interpretation
  - e.g., clusters of Web surfers
23. Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish, C = {male, female}
- 2. C might have an interpretation
  - e.g., clusters of Web surfers
- 3. C is just a convenient latent variable
  - e.g., flexible density estimation
24-26. Graphical Models for Mixtures
E.g., mixtures of naïve Bayes
[Diagram: a discrete, hidden class node C with observed children X1, X2, X3]
27-28. Sequential Mixtures
[Diagram: at each time step t-1, t, t+1, a hidden class node C with observed children X1, X2, X3; the C nodes are linked across time]
- Markov mixtures: C has Markov dependence
- Hidden Markov Model (here with a naïve Bayes emission model)
- C is a discrete state that couples the observables through time (see the sketch below)
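To make the "couples observables through time" point concrete, here is a minimal forward-algorithm sketch for a discrete-state HMM. The emission model is left abstract (with naïve Bayes it would be a product over X1, X2, X3), and all matrices below are invented for illustration.

```python
import numpy as np

def hmm_log_likelihood(obs_probs, pi, A):
    """Scaled forward algorithm for an HMM with K discrete states.
    obs_probs[t, k] = p(observations at time t | state k),
    pi[k] = initial state probabilities, A[j, k] = p(state k at t | state j at t-1)."""
    T, K = obs_probs.shape
    alpha = pi * obs_probs[0]            # forward messages at t = 0
    log_lik = 0.0
    for t in range(1, T):
        c = alpha.sum()                  # rescale to avoid numerical underflow
        log_lik += np.log(c)
        alpha = (alpha / c) @ A * obs_probs[t]
    log_lik += np.log(alpha.sum())
    return log_lik

# Hypothetical 2-state example with made-up emission probabilities
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
obs_probs = np.array([[0.7, 0.1], [0.6, 0.2], [0.1, 0.5]])
print(hmm_log_likelihood(obs_probs, pi, A))
```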
29. Dynamic Mixtures
- Computer Vision
  - mixtures of Kalman filters for tracking
- Atmospheric Science
  - mixtures of curves and dynamical models for cyclones
- Economics
  - mixtures of switching regressions for the US economy
30. Limitations of Mixtures
- Discrete state space
- not always appropriate
- e.g., in modeling dynamical systems
- Training
- no closed form solution, can be tricky
- Interpretability
- many different mixture solutions may explain the
same data
31. Learning of Mixture Models
32. Learning Mixtures from Data
- Consider fixed K
  - e.g., unknown parameters θ = {μ1, σ1, μ2, σ2, α1}
- Given data D = {x1, ..., xN}, we want to find the parameters θ that best fit the data
33. Early Attempts
- Weldon's data, 1893
  - n = 1000 crabs from the Bay of Naples
  - ratio of forehead to body length
  - suspected existence of 2 separate species
34. Early Attempts
- Karl Pearson, 1894
  - JRSS paper
  - proposed a mixture of 2 Gaussians
  - 5 parameters: θ = {μ1, σ1, μ2, σ2, α1}
  - parameter estimation -> method of moments
  - involved the solution of 9th-order equations!
- (see Chapter 10, Stigler (1986), The History of Statistics)
35. "The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task. But I fear he will have few successors." — Charlier (1906)
36-37. Maximum Likelihood Principle
- Fisher, 1922
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely (see the formula below)
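Written out for the mixture setting (this formula is implied by the slide but not shown explicitly):

```latex
\mathcal{L}(\theta) = p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
  = \prod_{i=1}^{N} \sum_{k=1}^{K} \alpha_k \, p_k(x_i \mid \theta_k),
\qquad
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \log \mathcal{L}(\theta)
```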
38. 1977: The EM Algorithm
- Dempster, Laird, and Rubin
- general framework for likelihood-based parameter estimation with missing data
  - start with initial guesses of the parameters
  - E-step: estimate memberships given the parameters
  - M-step: estimate parameters given the memberships
  - repeat until convergence
- converges to a (local) maximum of the likelihood
- E-step and M-step are often computationally simple
- generalizes to maximum a posteriori (with priors)
(see the sketch below)
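A minimal E-step/M-step sketch for the two-component, one-dimensional Gaussian mixture from the earlier example. The initialization and stopping rule are simplified for illustration, and the synthetic data are invented; this is not the implementation used in the talk.

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture; returns (a1, mu1, s1, mu2, s2)."""
    # crude initial guesses
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std()
    a1 = 0.5
    for _ in range(n_iter):
        # E-step: membership probabilities (responsibilities) given current params
        p1 = a1 * np.exp(-0.5 * ((x - mu1) / s1) ** 2) / s1
        p2 = (1 - a1) * np.exp(-0.5 * ((x - mu2) / s2) ** 2) / s2
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1
        # M-step: weighted parameter estimates given memberships
        a1 = r1.mean()
        mu1 = np.sum(r1 * x) / np.sum(r1)
        mu2 = np.sum(r2 * x) / np.sum(r2)
        s1 = np.sqrt(np.sum(r1 * (x - mu1) ** 2) / np.sum(r1))
        s2 = np.sqrt(np.sum(r2 * (x - mu2) ** 2) / np.sum(r2))
    return a1, mu1, s1, mu2, s2

# Synthetic data drawn from a known mixture, for illustration only
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 700)])
print(em_gaussian_mixture(x))
```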
40. Example of a Log-Likelihood Surface
[Figure: log-likelihood surface plotted against mean 2 and a log scale for sigma 2]
50. [Figure: data for the Control Group vs. the Anemia Group]
51. Data for an Individual Patient
[Figure: healthy state vs. anemic state]
(Cadez et al., Machine Learning, in press)
52. Alternatives to EM
- Method of Moments
- EM is more efficient
- Direct optimization
- e.g., gradient descent, Newton methods
- EM is simpler to implement
- Sampling (e.g., MCMC)
- Minimum distance, e.g.,
53. How Many Components?
- 2 general approaches
  - 1. Best density estimator
    - e.g., what predicts best on new data
  - 2. True number of components
    - cannot be answered from data alone
54. [Figure: the data-generating process ("truth") and the closest model, in terms of KL distance, within the K = 2 model class]
55. [Figure: the same, for the K = 10 model class]
56. Prescriptions for Model Selection
- Minimize distance to truth
  - maximize the predictive log p score
  - gives an estimate of KL(model, truth)
  - pick the model that predicts best on validation data (see the sketch below)
- Bayesian techniques
  - p(K | data): impossible to compute exactly
  - closed-form approximations: BIC, AutoClass, etc.
  - sampling: Monte Carlo techniques are quite tricky for mixtures
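A minimal sketch of the held-out predictive log p prescription, with BIC shown alongside as a closed-form alternative. It uses scikit-learn's GaussianMixture as a stand-in fitting routine and a synthetic data set; neither is from the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data for illustration; in practice this would be the real data set
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 2, 500)]).reshape(-1, 1)
rng.shuffle(data)
train, valid = data[:700], data[700:]

for K in range(1, 6):
    model = GaussianMixture(n_components=K, random_state=0).fit(train)
    # predictive log p score on held-out data (higher is better;
    # estimates the KL distance to the truth up to an additive constant)
    heldout = model.score(valid) * len(valid)
    # BIC as a closed-form Bayesian-style criterion (lower is better)
    print(K, round(heldout, 1), round(model.bic(train), 1))
```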
57. Mixture Model Applications
58. What Are Mixtures Used For?
- Modeling heterogeneity
  - e.g., inferring multiple species in biology
- Handling missing data
  - e.g., variables and cases missing during model-building
- Density estimation
  - e.g., as flexible priors in Bayesian statistics
- Clustering
  - components as clusters
- Model averaging
  - combining density models
59. Mixtures of Non-Vector Data
- Example
  - N individuals, and sets of sequences for each
  - e.g., Web session data
- Clustering of the N individuals?
  - Vectorize the data and apply vector methods?
  - Pairwise distances between sets of sequences?
  - Parameter estimates for each individual, then cluster?
60-61. Mixtures of Sequences, Curves, ...
Generative model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(D_i | c_k), we can define an EM algorithm (see the sketch below)
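A minimal sketch of this generative process, taking p(D_i | c_k) to be a first-order Markov chain over page categories (one plausible choice for Web session data). The component weights and transition matrices below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture over sequences of 3 page categories
weights = [0.7, 0.3]
transition = [np.array([[0.8, 0.1, 0.1],     # component 1: tends to stay on page 0
                        [0.3, 0.6, 0.1],
                        [0.3, 0.1, 0.6]]),
              np.array([[0.1, 0.1, 0.8],     # component 2: drifts toward page 2
                        [0.1, 0.2, 0.7],
                        [0.1, 0.1, 0.8]])]

def generate_individual(length=10):
    """Select a component c_k for the individual, then generate D_i ~ p(D_i | c_k)."""
    k = rng.choice(len(weights), p=weights)
    state = rng.integers(3)                   # uniform initial page, for simplicity
    seq = [state]
    for _ in range(length - 1):
        state = rng.choice(3, p=transition[k][state])
        seq.append(state)
    return k, seq

print([generate_individual(5) for _ in range(3)])
```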
62. Application 1: Web Log Visualization
(Cadez, Heckerman, Meek, Smyth, KDD 2000)
- MSNBC Web logs
  - 2 million individuals per day
  - different session lengths per individual
  - a difficult visualization and clustering problem
- WebCanvas
  - uses mixtures of SFSMs to cluster individuals based on their observed sequences
  - software tool: EM mixture modeling plus visualization
64. Example: Mixtures of SFSMs
- Simple model for traversal on a Web site
  - (equivalent to a first-order Markov model with an end state)
- Generative model for large sets of Web users
  - different behaviors <-> different components in a mixture of SFSMs
- The EM algorithm is quite simple: weighted counts (see the sketch below)
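A minimal sketch of one EM iteration for a mixture of first-order Markov chains, illustrating the "weighted counts" idea. The initial- and end-state distributions are omitted to keep it short, and the toy sequences are invented; this is not the WebCanvas implementation.

```python
import numpy as np

def em_step(sequences, weights, transitions):
    """One EM iteration for a K-component mixture of first-order Markov chains
    over S states. sequences: list of lists of state indices."""
    K, S = len(weights), transitions[0].shape[0]
    counts = np.zeros((K, S, S))
    new_weights = np.zeros(K)
    for seq in sequences:
        # E-step: membership probability of this sequence under each component
        log_p = np.log(weights).copy()
        for a, b in zip(seq[:-1], seq[1:]):
            log_p += np.log([transitions[k][a, b] for k in range(K)])
        resp = np.exp(log_p - log_p.max())
        resp /= resp.sum()
        # accumulate fractionally weighted transition counts
        for a, b in zip(seq[:-1], seq[1:]):
            counts[:, a, b] += resp
        new_weights += resp
    # M-step: re-estimate parameters from the weighted counts (lightly smoothed)
    new_transitions = [(counts[k] + 1e-6) / (counts[k] + 1e-6).sum(axis=1, keepdims=True)
                       for k in range(K)]
    return new_weights / len(sequences), new_transitions

# Hypothetical tiny example: 2 components, 2 states
seqs = [[0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1]]
w, T = em_step(seqs, np.array([0.5, 0.5]),
               [np.full((2, 2), 0.5), np.array([[0.9, 0.1], [0.1, 0.9]])])
print(w)
```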
65. WebCanvas (Cadez, Heckerman, et al., KDD 2000)
67. Timing Results
68. Transaction Data
69. Profiling for Transaction Data
- Profiling
  - given transactions for a set of individuals
  - infer a predictive model for the future transactions of individual i
- Typical applications
  - automated recommender systems, e.g., Amazon.com
  - personalization, e.g., wireless information services
- Existing techniques
  - collaborative filtering, association rules
  - poorly suited to prediction, e.g., to incorporating seasonality
70. Application 2: Transaction Data
(Cadez, Smyth, Mannila, KDD 2001)
- Retail data set
  - 200,000 individuals
  - all market baskets over 2 years
  - 50 departments, 50k items
- Problem
  - predict what an individual will purchase in the future
  - want to generalize across products
  - want to allow heterogeneity
71. Transaction Data
72. Examples of Mixture Components
Components encode typical combinations of clothes.
73. Mixture-Based Profiles
Predictive profile for individual i: a weighted combination of component basket models,
p(basket | individual i) = Σ_k α_ik p_k(basket)
- α_ik: probability that individual i engages in behavior k, given that they enter the store
- p_k: basket model for component k (basis function)
(see the sketch below)
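A minimal sketch of such a profile for one individual. The component "basis functions" and the individual's weights below are made up for illustration.

```python
import numpy as np

# Hypothetical component basket models: probability of purchasing in each of 5 departments
basket_models = np.array([[0.70, 0.20, 0.05, 0.03, 0.02],   # component 1: mostly dept 0
                          [0.05, 0.10, 0.50, 0.25, 0.10],   # component 2: depts 2-3
                          [0.10, 0.10, 0.10, 0.10, 0.60]])  # component 3: dept 4

# alpha_ik: probability that individual i engages in behavior k on a store visit (made up)
alpha_i = np.array([0.6, 0.3, 0.1])

# Predictive profile for individual i: weighted combination of the component basket models
profile_i = alpha_i @ basket_models
print(profile_i)   # expected purchase distribution over departments for individual i
```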
74. Hierarchical Bayes Model
- Empirical prior on the mixture weights
[Diagram: individuals 1 ... i ... N, each with observed baskets B1, B2, B3, ..., sharing a common prior on their mixture weights]
- Individuals with little data get shrunk toward the prior
- Individuals with a lot of data are more data-driven
(see the sketch below)
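One simple way to see the shrinkage effect: a MAP-style estimate of an individual's mixture weights under a Dirichlet-like prior built from the population-level weights. This is a sketch of the idea only, not the exact hierarchical model in the talk; the prior, the strength gamma, and the counts are all invented.

```python
import numpy as np

def smoothed_weights(component_counts, prior_weights, gamma=10.0):
    """Blend an individual's fractional component counts with an empirical prior.
    Individuals with few transactions stay close to the prior; individuals with
    many transactions are dominated by their own counts."""
    counts = np.asarray(component_counts, dtype=float)
    return (counts + gamma * prior_weights) / (counts.sum() + gamma)

prior = np.array([0.5, 0.3, 0.2])           # population-level mixture weights (made up)
print(smoothed_weights([1, 0, 0], prior))   # little data -> close to the prior
print(smoothed_weights([80, 5, 5], prior))  # lots of data -> close to the empirical fractions
```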
75-79. Data and Profile Example
[Figures: observed purchase data and fitted profiles for several individuals; annotations: "No training data for 14", "No purchases above Dept 25"]
82. Transaction Mixtures
- Mixture-based profiles
  - interpretable and flexible
  - more accurate than non-mixture approaches
  - training time is linear in the number of items
- Applications
  - early detection of high-value customers
  - visualization and exploration
  - forecasting customer behavior
83. Extensions of Mixtures
84-85. Extensions: Multiple Causes
- Single cause variable
  [Diagram: C -> X]
- Multiple causes (factors)
  [Diagram: C1, C2, C3 -> X]
- Examples
  - Hofmann (1999, 2000) for text
  - ICA for signals
86-89. Extensions: High Dimensions
- Density estimation is difficult in high dimensions
- Global PCA
- Mixtures of PCA (Tipping and Bishop, 1999)
90-91. Extensions: Predictive Mixtures
- Standard mixture model
- Conditional mixtures: the mixture weights are input-dependent
  - e.g., mixtures of experts, Jordan and Jacobs (1994)
92. Extensions: Learning Algorithms
- Fast algorithms
  - kd-trees (Moore, 1998)
  - caching (Bradley et al.)
- Random projections
  - Dasgupta (2000)
- Mean-squared-error criteria
  - Scott (2000)
- Bayesian techniques
  - reversible-jump MCMC (Green et al.)
93. Classic References
- Statistical Analysis of Finite Mixture Distributions, Titterington, Smith, and Makov, Wiley, 1985
- Finite Mixture Models, McLachlan and Peel, Wiley, 2000
94. Conclusions
- Mixtures
  - a flexible tool in the machine learner's toolbox
- Beyond mixtures of Gaussians
  - mixtures of sequences, curves, trees, etc.
- Applications
  - numerous and broad
97. Concavity of Likelihood (Cadez and Smyth, NIPS 2000)
98. Application 3: Model Averaging
Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty:
p(x | D) = Σ_k p(x | M_k, D) p(M_k | D)
- p(x | M_k, D): prediction of model k
- p(M_k | D): weight of model k
- p(x | D): prediction of x given data D
99. Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
Simple idea: use cross-validation to estimate the weights, rather than a Bayesian scheme.
Two-phase learning:
1. Learn each model M_k on D_train
2. Learn the mixture-model weights on D_validation
   - components are fixed
   - EM just learns the weights (see the sketch below)
Outperforms any single model selection technique; even outperforms "cheating".
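A minimal sketch of phase 2: with the component models fixed, EM reduces to re-estimating the weights from each model's density evaluated at the validation points. The density matrix below is invented for illustration.

```python
import numpy as np

def stack_weights(p_valid, n_iter=100):
    """p_valid[i, k] = p(x_i | model M_k) on validation data; returns mixture weights."""
    n, K = p_valid.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        resp = w * p_valid                      # E-step: unnormalized memberships
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                   # M-step: weights only (components fixed)
    return w

# Hypothetical densities of 4 validation points under 3 previously trained models
p_valid = np.array([[0.20, 0.05, 0.01],
                    [0.15, 0.10, 0.02],
                    [0.01, 0.30, 0.05],
                    [0.02, 0.25, 0.20]])
print(stack_weights(p_valid))
```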
100-102. Query Approximation (Pavlov and Smyth, KDD 2001)
[Diagram: Large Database, Approximate Models, Query Generator]
- Construct probability models offline, e.g., mixtures, belief networks, etc.
- Provide fast query answers online (see the sketch below)
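A minimal sketch of the online step, using a naïve-Bayes-style mixture over binary attributes as the offline model (all numbers are invented): the count for a conjunctive query is estimated as N times the model's probability of the conjunction, with no scan of the database.

```python
import numpy as np

# Offline model (made up): 2 components over 4 binary attributes
weights = np.array([0.6, 0.4])
attr_probs = np.array([[0.8, 0.7, 0.1, 0.2],    # p(attribute j = 1 | component 1)
                       [0.1, 0.2, 0.9, 0.6]])   # p(attribute j = 1 | component 2)
N = 32000                                        # number of records in the database

def query_count(query):
    """Estimate the count for a conjunctive query, e.g., {0: 1, 2: 1} means A0=1 AND A2=1."""
    p = 0.0
    for k, w in enumerate(weights):
        comp = w
        for j, value in query.items():
            comp *= attr_probs[k, j] if value == 1 else 1.0 - attr_probs[k, j]
        p += comp
    return N * p

print(query_count({0: 1, 2: 1}))   # fast approximate answer from the offline model
```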
103. Stacking for Query Model Combining
Conjunctive queries on Microsoft Web data: 32k records, 294 attributes (available online at the UCI KDD Archive)
104. [Figure: sparse binary data matrix — rows are observation vectors, columns are attributes, 1s mark attribute occurrences]
105. Treat as Missing
[Figure: the same binary data matrix with a hidden cluster-label column (C1 or C2 for each observation vector), treated as missing data]
106. Treat as Missing
[Figure: each observation vector now carries membership probabilities P(C1|x) and P(C2|x) in place of a hard cluster label]
E-step: estimate the component membership probabilities given the current parameter estimates
107. Treat as Missing
[Figure: the same matrix of membership probabilities P(C1|x), P(C2|x)]
M-step: use the fractionally weighted data to get new estimates of the parameters (see the sketch below)
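A minimal sketch of these E- and M-steps for a two-component naïve Bayes (Bernoulli) mixture on a binary data matrix, using fractional weighted counts; the data and initialization below are invented for illustration.

```python
import numpy as np

def em_bernoulli_mixture(X, n_iter=50):
    """EM for a 2-component mixture of independent Bernoullis on a binary matrix X."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.3, 0.7, size=(2, d))   # p(attribute j = 1 | component k)
    alpha = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: P(C_k | x_i) for every row, given current parameter estimates
        like = np.array([np.prod(np.where(X == 1, theta[k], 1 - theta[k]), axis=1)
                         for k in range(2)]).T          # shape (n, 2)
        resp = alpha * like
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: fractionally weighted counts give new parameter estimates
        alpha = resp.mean(axis=0)
        theta = (resp.T @ X + 1e-6) / (resp.sum(axis=0)[:, None] + 2e-6)
    return alpha, theta

X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
              [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]])
print(em_bernoulli_mixture(X))
```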