Transcript and Presenter's Notes

Title: A Guided Tour of Finite Mixture Models: From Pearson to the Web (ICML '01 Keynote)


1
A Guided Tour of Finite Mixture Models: From Pearson to the Web
ICML '01 Keynote Talk
Williams College, MA, June 29th 2001
  • Padhraic Smyth
  • Information and Computer Science
  • University of California, Irvine
  • www.datalab.uci.edu

2
Outline
  • What are mixture models?
  • Definitions and examples
  • How can we learn mixture models?
  • A brief history and illustration
  • What are mixture models useful for?
  • Applications in Web and transaction data
  • Recent research in mixtures?

3
Acknowledgements
  • Students
  • Igor Cadez, Scott Gaffney, Xianping Ge, Dima
    Pavlov
  • Collaborators
  • David Heckerman, Chris Meek, Heikki Mannila,
    Christine McLaren, Geoff McLachlan, David Wolpert
  • Funding
  • NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center,
    Microsoft Research, IBM Research, HNC Software.

4
Finite Mixture Models
5
Finite Mixture Models
6
Finite Mixture Models
7
Finite Mixture Models
8
Finite Mixture Models
p(x | Θ) = Σ_k α_k p_k(x | θ_k)
where α_k is the weight, p_k the component model, and θ_k the parameters of component k
9
Example Mixture of Gaussians
  • Gaussian mixtures


10
Example Mixture of Gaussians
  • Gaussian mixtures

Each mixture component is a multidimensional
Gaussian with its own mean μ_k and covariance
shape Σ_k
11
Example Mixture of Gaussians
  • Gaussian mixtures

Each mixture component is a multidimensional
Gaussian with its own mean μ_k and covariance
shape Σ_k
e.g., K = 2, 1-dim: Θ = {μ_1, σ_1, μ_2, σ_2, α_1}
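As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates the K = 2, 1-dimensional Gaussian mixture density with the parameters Θ = {μ_1, σ_1, μ_2, σ_2, α_1}; the parameter values below are made up for the example.

```python
# Minimal sketch: density of a two-component 1-D Gaussian mixture,
# p(x) = a1 * N(x; m1, s1^2) + (1 - a1) * N(x; m2, s2^2).
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, m1, s1, m2, s2, a1):
    """Evaluate the K = 2 Gaussian mixture density at the points x."""
    return a1 * norm.pdf(x, loc=m1, scale=s1) + (1.0 - a1) * norm.pdf(x, loc=m2, scale=s2)

x = np.linspace(-5.0, 5.0, 7)
print(mixture_pdf(x, m1=-1.0, s1=0.5, m2=2.0, s2=1.0, a1=0.3))
```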
12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Example Mixture of Naïve Bayes

16
Example Mixture of Naïve Bayes

17
Example Mixture of Naïve Bayes

Conditional Independence model for each
component (often quite useful to first-order)
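To make the conditional-independence component concrete, here is a short sketch (notation assumed, not taken from the slides) of the log-likelihood of a binary term vector under a mixture of naïve Bayes components with Bernoulli term probabilities.

```python
# Sketch: log p(x) under a naive Bayes mixture for a binary term vector x,
# where each component k has Bernoulli parameters theta[k, j] per term j.
import numpy as np

def mixture_nb_loglik(x, weights, theta):
    """x: (D,) 0/1 vector; weights: (K,) mixture weights; theta: (K, D)."""
    # log p(x | component k) = sum_j x_j*log(theta_kj) + (1 - x_j)*log(1 - theta_kj)
    log_comp = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T   # shape (K,)
    return np.logaddexp.reduce(np.log(weights) + log_comp)

theta = np.array([[0.9, 0.8, 0.1],     # component 1 (hypothetical values)
                  [0.1, 0.2, 0.7]])    # component 2
print(mixture_nb_loglik(np.array([1, 1, 0]), np.array([0.5, 0.5]), theta))
```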
18
Mixtures of Naïve Bayes
[Figure: binary document-term matrix; rows are documents, columns are terms, entries 1 where a term occurs in a document]
19
Mixtures of Naïve Bayes
[Figure: the same document-term matrix with the rows reordered into two blocks, Component 1 and Component 2]
20
Other Component Models
  • Mixtures of Rectangles
  • Pelleg and Moore (ICML, 2001)
  • Mixtures of Trees
  • Meila and Jordan (2000)
  • Mixtures of Curves
  • Quandt and Ramsey (1978)
  • Mixtures of Sequences
  • Poulsen (1990)

21
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C = {male, female}

22
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C = {male, female}
  • 2. C might have an interpretation
  • e.g., clusters of Web surfers

23
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C = {male, female}
  • 2. C might have an interpretation
  • e.g., clusters of Web surfers
  • 3. C is just a convenient latent variable
  • e.g., flexible density estimation

24
Graphical Models for Mixtures
E.g., Mixtures of Naïve Bayes
[Diagram: a single node C]
25
Graphical Models for Mixtures
E.g., Mixtures of Naïve Bayes
[Diagram: node C with arrows to X1, X2, X3]
26
Graphical Models for Mixtures
E.g., Mixtures of Naïve Bayes
[Diagram: C (discrete, hidden) with arrows to X1, X2, X3 (observed)]
27
Sequential Mixtures
[Diagram: the naïve Bayes mixture replicated at times t-1, t, and t+1; at each time a hidden node C with observed children X1, X2, X3]
28
Sequential Mixtures
[Diagram: as above, with arrows linking C at time t-1 → C at time t → C at time t+1]
Markov mixtures: C has Markov dependence
Hidden Markov Model (here with naïve Bayes): C is a
discrete state that couples the observables through time
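For concreteness, here is a small sketch (illustrative only, with made-up parameters) of the forward recursion that computes the likelihood of an observed sequence under such a hidden Markov model, where the discrete state C couples the observables through time.

```python
# Sketch: scaled forward recursion for an HMM with discrete state C.
import numpy as np

def hmm_loglik(pi, A, B, obs):
    """pi: (S,) initial state probs; A: (S, S) transition matrix;
    B: (S, M) emission probs; obs: list of observation indices."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()          # rescale to avoid numerical underflow
        log_p += np.log(scale)
        alpha /= scale
    return log_p + np.log(alpha.sum())

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])       # Markov dependence of C
B = np.array([[0.7, 0.3], [0.1, 0.9]])       # observation model per state
print(hmm_loglik(pi, A, B, obs=[0, 0, 1, 1]))
```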
29
Dynamic Mixtures
  • Computer Vision
  • mixtures of Kalman filters for tracking
  • Atmospheric Science
  • mixtures of curves and dynamical models for
    cyclones
  • Economics
  • mixtures of switching regressions for the US
    economy

30
Limitations of Mixtures
  • Discrete state space
  • not always appropriate
  • e.g., in modeling dynamical systems
  • Training
  • no closed form solution, can be tricky
  • Interpretability
  • many different mixture solutions may explain the
    same data

31
Learning of mixture models
32
Learning Mixtures from Data
  • Consider fixed K
  • e.g., unknown parameters Θ = {μ_1, σ_1, μ_2, σ_2, α_1}
  • Given data D = {x_1, ..., x_N}, we want to find the
    parameters Θ that best fit the data

33
Early Attempts
  • Weldon's data, 1893
  • - n = 1000 crabs from the Bay of Naples
  • - ratio of forehead to body length
  • - suspected existence of 2 separate species

34
Early Attempts
  • Karl Pearson, 1894
  • - paper in the Philosophical Transactions of the Royal Society
  • - proposed a mixture of 2 Gaussians
  • - 5 parameters: Θ = {μ_1, σ_1, μ_2, σ_2, α_1}
  • - parameter estimation → method of moments
  • - involved solution of a 9th-degree equation!
  • (see Chapter 10, Stigler (1986), The History of
    Statistics)

35
"The solution of an equation of the ninth degree,
where almost all powers, to the ninth, of the
unknown quantity are existing, is, however, a
very laborious task. Mr. Pearson has indeed
possessed the energy to perform his heroic task.
But I fear he will have few successors."
— Charlier (1906)
36
Maximum Likelihood Principle
  • Fisher, 1922
  • assume a probabilistic model
  • likelihood = p(data | parameters, model)
  • find the parameters that make the data most likely

37
Maximum Likelihood Principle
  • Fisher, 1922
  • assume a probabilistic model
  • likelihood = p(data | parameters, model)
  • find the parameters that make the data most likely

38
1977: The EM Algorithm
  • Dempster, Laird, and Rubin
  • general framework for likelihood-based parameter
    estimation with missing data
  • start with initial guesses of parameters
  • E-step: estimate memberships given params
  • M-step: estimate params given memberships
  • repeat until convergence
  • converges to a (local) maximum of likelihood
  • E-step and M-step are often computationally simple (sketched below)
  • generalizes to maximum a posteriori (with priors)
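A minimal sketch of the E-step/M-step loop just described, for a two-component 1-D Gaussian mixture (illustrative only; the initialization and the toy data are made up, not from the talk).

```python
# Sketch: EM for a two-component 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = rng.choice(x, size=2, replace=False)    # initial guesses of the means
    s = np.array([x.std(), x.std()])            # initial standard deviations
    a = np.array([0.5, 0.5])                    # initial mixture weights
    for _ in range(n_iter):
        # E-step: estimate memberships given the current parameters
        dens = a[:, None] * norm.pdf(x[None, :], loc=m[:, None], scale=s[:, None])
        resp = dens / dens.sum(axis=0, keepdims=True)        # shape (2, N)
        # M-step: estimate parameters given the (soft) memberships
        nk = resp.sum(axis=1)
        a = nk / len(x)
        m = (resp * x).sum(axis=1) / nk
        s = np.sqrt((resp * (x[None, :] - m[:, None]) ** 2).sum(axis=1) / nk)
    return m, s, a

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
print(em_two_gaussians(x))
```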

39
(No Transcript)
40
Example of a Log-Likelihood Surface

[Figure: log-likelihood surface; axes are the mean of component 2 and sigma of component 2 (log scale)]
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
(No Transcript)
50
[Figure: Control Group vs. Anemia Group]
51
Data for an Individual Patient
[Figure: healthy-state and anemic-state components]
(Cadez et al., Machine Learning, in press)
52
Alternatives to EM
  • Method of Moments
  • EM is more efficient
  • Direct optimization
  • e.g., gradient descent, Newton methods
  • EM is simpler to implement
  • Sampling (e.g., MCMC)
  • Minimum distance, e.g.,

53
How many components?
  • 2 general approaches
  • 1. Best density estimator
  • e.g., what predicts best on new data
  • 2. True number of components
  • - cannot be answered from data alone

54
[Figure: the data-generating process (truth) and the closest model, in terms of KL distance, within the K = 2 model class]
55
[Figure: the data-generating process (truth) and the closest model within the K = 10 model class]
56
Prescriptions for Model Selection
  • Minimize distance to truth
  • Maximize predictive log p score
  • gives an estimate of KL(model, truth)
  • pick the model that predicts best on validation data (see the sketch below)
  • Bayesian techniques
  • p(k | data) impossible to compute exactly
  • closed-form approximations
  • BIC, AutoClass, etc.
  • sampling
  • Monte Carlo techniques quite tricky for mixtures
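As a sketch of the "predict best on validation data" prescription (assuming scikit-learn's GaussianMixture, which is not mentioned in the talk), one can compare the held-out log p score across candidate values of K:

```python
# Sketch: choose the number of components K by held-out log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(4.0, 1.0, (500, 2))])
rng.shuffle(data)
train, valid = data[:700], data[700:]

scores = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(train)
    scores[k] = gm.score(valid)      # mean log p(x) on the validation data
print(scores, "-> chosen K =", max(scores, key=scores.get))
```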

57
Mixture model applications
58
What are Mixtures used for?
  • Modeling heterogeneity
  • e.g., inferring multiple species in biology
  • Handling missing data
  • e.g., variables and cases missing in
    model-building
  • Density estimation
  • e.g., as flexible priors in Bayesian statistics
  • Clustering
  • components as clusters
  • Model Averaging
  • combining density models

59
Mixtures of non-vector data
  • Example
  • N individuals, and sets of sequences for each
  • e.g., Web session data
  • Clustering of the N individuals?
  • Vectorize data and apply vector methods?
  • Pairwise distances of sets of sequences?
  • Parameter estimate for each individual and then
    cluster?

60
Mixtures of Sequences, Curves, ...
61
Mixtures of Sequences, Curves, ...
Generative model:
  • select a component c_k for individual i
  • generate data according to p(D_i | c_k)
  • p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(D_i | c_k), we can define an EM algorithm
62
Application 1: Web Log Visualization
(Cadez, Heckerman, Meek, Smyth, KDD 2000)
  • MSNBC Web logs
  • 2 million individuals per day
  • different session lengths per individual
  • difficult visualization and clustering problem
  • WebCanvas
  • uses mixtures of SFSMs to cluster individuals
    based on their observed sequences
  • software tool: EM mixture modeling + visualization

63
(No Transcript)
64
Example Mixtures of SFSMs
  • Simple model for traversal on a Web site
  • (equivalent to first-order Markov with an end state)
  • Generative model for large sets of Web users
  • - different behaviors <=> mixture of SFSMs
  • EM algorithm is quite simple: weighted counts (see the sketch below)
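A sketch of what "weighted counts" means here (illustrative, not the WebCanvas implementation): the M-step for a mixture of first-order Markov chains re-estimates each component's transition matrix from membership-weighted transition counts.

```python
# Sketch: M-step for a mixture of first-order Markov chains over page states.
import numpy as np

def mstep_transitions(sequences, resp, n_states, smooth=1e-3):
    """sequences: list of state-index sequences (one per user);
    resp: (N, K) membership probabilities from the E-step."""
    N, K = resp.shape
    counts = np.full((K, n_states, n_states), smooth)
    for i, seq in enumerate(sequences):
        for s, t in zip(seq[:-1], seq[1:]):
            counts[:, s, t] += resp[i]       # each transition counted with user i's weights
    return counts / counts.sum(axis=2, keepdims=True)   # row-normalize per component

seqs = [[0, 1, 1, 2], [2, 2, 0], [0, 1, 2, 2]]           # toy sessions
resp = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])    # toy memberships
print(mstep_transitions(seqs, resp, n_states=3))
```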

65
WebCanvas (Cadez, Heckerman, et al., KDD 2000)
66
(No Transcript)
67
Timing Results
68
Transaction Data
69
Profiling for Transaction Data
  • Profiling
  • given transactions for a set of individuals
  • infer a predictive model for future transactions
    of individual i
  • typical applications
  • automated recommender systems e.g., Amazon.com
  • personalization e.g., wireless information
    services
  • existing techniques
  • collaborative filtering, association rules
  • not well suited to prediction, e.g., to incorporating
    seasonality

70
Application 2: Transaction Data
(Cadez, Smyth, Mannila, KDD 2001)
  • Retail Data Set
  • 200,000 individuals
  • all market baskets over 2 years
  • 50 departments, 50k items
  • Problem
  • predict what an individual will purchase in the
    future
  • want to generalize across products
  • want to allow heterogeneity

71
Transaction Data
72
Examples of Mixture Components
Components encode typical combinations of
clothes
73
Mixture-based Profiles
Predictive profile for individual i:
p(basket | individual i) = Σ_k α_ik p(basket | component k)
where α_ik is the probability that individual i engages in behavior k, given that
they enter the store, and p(basket | component k) is the basket model for
component k (basis function)
74
Hierarchical Bayes Model
Empirical prior on mixture weights
[Diagram: the empirical prior sits above Individual 1, ..., Individual i, ..., Individual N; each individual's observed baskets B1, B2, B3, ... hang below them]
Individuals with little data get shrunk to the
prior; individuals with a lot of data are more
data-driven
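A toy sketch of the shrinkage behaviour described above (a simplified stand-in, not the exact hierarchical Bayes model of the paper): an individual's mixture weights are a Dirichlet-style compromise between their own soft counts and the empirical population prior.

```python
# Toy sketch: shrinkage of individual-level mixture weights toward the prior.
import numpy as np

def individual_weights(soft_counts_i, prior_weights, prior_strength=10.0):
    """soft_counts_i: (K,) expected number of individual i's baskets generated
    by each component; prior_weights: (K,) empirical population-level weights."""
    post = soft_counts_i + prior_strength * prior_weights
    return post / post.sum()

prior = np.array([0.5, 0.3, 0.2])
print(individual_weights(np.zeros(3), prior))                 # no data -> the prior
print(individual_weights(np.array([40.0, 2.0, 0.0]), prior))  # lots of data -> data-driven
```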
75
Data and Profile Example
76
Data and Profile Example
77
Data and Profile Example
78
Data and Profile Example
79
Data and Profile Example
[Figure annotations: "No Training Data for 14"; "No Purchases above Dept 25"]
80
(No Transcript)
81
(No Transcript)
82
Transaction mixtures
  • Mixture-based profiles
  • interpretable and flexible
  • more accurate than non-mixture approaches
  • training time linear in the number of items
  • Applications
  • early detection of high-value customers
  • visualization and exploration
  • forecasting customer behavior

83
Extensions of Mixtures
84
Extensions Multiple Causes
  • Single cause variable

[Diagram: a single cause C with arrow to X]
85
Extensions Multiple Causes
  • Single cause variable
  • Multiple Causes (Factors)
  • examples
  • Hofmann (1999, 2000) for text
  • ICA for signals

[Diagrams: single cause C → X; multiple causes C1, C2, C3 → X]
86
Extensions High Dimensions
  • Density estimation is difficult in high-d

87
Extensions High Dimensions
  • Density estimation is difficult in high-d

88
Extensions High Dimensions
  • Density estimation is difficult in high-d

Global PCA
89
Extensions High Dimensions
  • Density estimation is difficult in high-d

Mixtures of PCA (Tipping and Bishop 1999)
90
Extensions Predictive Mixtures
  • Standard mixture model

91
Extensions Predictive Mixtures
  • Standard mixture model
  • Conditional Mixtures
  • e.g., mixtures of experts, Jordan and Jacobs
    (1994)

Input-dependent
92
Extensions Learning Algorithms
  • Fast algorithms
  • kd-trees (Moore, 1998)
  • Caching (Bradley et al.)
  • Random projections
  • Dasgupta (2000)
  • Mean-squared error criteria
  • Scott (2000)
  • Bayesian techniques
  • reversible-jump MCMC (Green et al.)

93
Classic References
Statistical Analysis of Finite Mixture Distributions, Titterington, Smith, and Makov, Wiley, 1985
Finite Mixture Models, McLachlan and Peel, Wiley, 2000
94
Conclusions
  • Mixtures
  • flexible tool in the machine learner's toolbox
  • Beyond mixtures of Gaussians
  • mixtures of sequences, curves, trees, etc.
  • Applications
  • numerous and broad

95
(No Transcript)
96
(No Transcript)
97
Concavity of Likelihood (Cadez and Smyth, NIPS
2000)
98
Application 3: Model Averaging
Bayesian model averaging for p(x): since we don't
know which model (if any) is the true one, average
out this uncertainty:
p(x | D) = Σ_k p(x | M_k, D) p(M_k | D)
where p(x | M_k, D) is the prediction of model k, p(M_k | D) is the weight of
model k, and p(x | D) is the prediction of x given data D
99
Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
Simple idea: use cross-validation to estimate the
weights, rather than using a Bayesian scheme
Two-phase learning:
  1. Learn each model M_k on D_train
  2. Learn the mixture-model weights on D_validation
     - components are fixed
     - EM just learns the weights (see the sketch below)
Outperforms any single model selection technique
Even outperforms "cheating"
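A sketch of the two-phase idea (illustrative; it assumes each previously fitted model M_k can be evaluated as a density p_k(x) on the validation points): with the components held fixed, EM only re-estimates the combination weights.

```python
# Sketch: stacking density models by learning only the mixture weights on D_validation.
import numpy as np

def stack_weights(p_valid, n_iter=200):
    """p_valid: (N, K) matrix of p_k(x_n) for validation points n and fixed models k."""
    N, K = p_valid.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        resp = w * p_valid                        # E-step: memberships (components fixed)
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                     # M-step: update the weights only
    return w

# toy example: three fitted density models evaluated on four validation points
p_valid = np.array([[0.20, 0.05, 0.10],
                    [0.15, 0.02, 0.12],
                    [0.01, 0.30, 0.05],
                    [0.18, 0.04, 0.09]])
print(stack_weights(p_valid))
```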

100
Query Approximation (Pavlov and Smyth, KDD 2001)
Large Database
Approximate Models
Query Generator
101
Query Approximation
Large Database
Approximate Models
Query Generator
Construct Probability Models Offline e.g.,
mixtures, belief networks, etc
102
Query Approximation
Large Database
Query Generator
Approximate Models
Construct Probability Models Offline e.g.,
mixtures, belief networks, etc
Provide Fast Query Answers Online
103
Stacking for Query Model Combining
Conjunctive queries on Microsoft Web data (32k records, 294 attributes),
available online at the UCI KDD Archive
104
[Figure: binary data matrix; rows are observation vectors, columns are attributes]
105
Treat as Missing
[Figure: the binary data matrix augmented with a hidden component label (C1 or C2) per row, treated as missing data]
106
Treat as Missing
[Figure: the data matrix with the hidden labels replaced by membership probabilities P(C1 | x_i) and P(C2 | x_i) for each row]
E-step: estimate component membership
probabilities given current parameter estimates
107
Treat as Missing
[Figure: the same matrix of membership probabilities, now used as fractional weights]
M-step: use the fractionally weighted data to get new
estimates of the parameters