Large Data Set Analysis using Mixture Models
Seminar at IBM Watson Research Center, June 27th 2001
Padhraic Smyth
Information and Computer Science, University of California, Irvine
www.datalab.uci.edu
Slide 2: Outline
- Part 1: Basic concepts in mixture modeling
  - representational capabilities of mixtures
  - learning mixtures from data
  - extensions of mixtures to non-vector data
- Part 2: New applications of mixtures
  - 1. visualization and clustering of Web log data
  - 2. predictive profiles from transaction data
  - 3. query approximation problems
Slide 3: Acknowledgements
- Students
  - Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
- Collaborators
  - David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
- Funding
  - NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software
Slides 4-7: Finite Mixture Models

p(x) = Σ_{k=1..K} α_k p_k(x | θ_k)

where α_k is the weight of component k, p_k is the component model, and θ_k its parameters.
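As a concrete sketch of this definition, a finite mixture density can be evaluated and sampled from in a few lines; the two-component parameter values below are hypothetical, chosen only for illustration:

```python
import math
import random

# Finite mixture: p(x) = sum_k alpha_k * p_k(x | theta_k)
# Hypothetical 1-D example with two Gaussian components.
weights = [0.3, 0.7]                 # alpha_k, must sum to 1
params = [(0.0, 1.0), (4.0, 0.5)]    # (mean, sigma) per component

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x):
    # Weighted sum of the component densities.
    return sum(a * gaussian_pdf(x, mu, s)
               for a, (mu, s) in zip(weights, params))

def sample_mixture():
    # Generative view: first pick a component, then draw from it.
    k = random.choices(range(len(weights)), weights=weights)[0]
    mu, s = params[k]
    return random.gauss(mu, s)
```

The sampling function makes the generative reading of the model explicit: the weights are the probabilities of selecting each component.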
Slides 8-10: Example: Mixture of Gaussians
Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance shape Σ_k.
e.g., K = 2, 1-dimensional: θ = {μ1, σ1, μ2, σ2, α1}
Slides 14-16: Example: Mixture of Naïve Bayes
Conditional independence model for each component (often quite useful to first-order).
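Under this conditional-independence assumption, each component's probability for a binary term vector is a product of per-term Bernoulli probabilities. A minimal sketch with hypothetical parameters (2 components, 3 terms):

```python
# Mixture of naive-Bayes (Bernoulli) components for binary term vectors:
# p(x) = sum_k alpha_k * prod_j theta_kj^x_j * (1 - theta_kj)^(1 - x_j)
alphas = [0.5, 0.5]
thetas = [[0.9, 0.8, 0.1],   # component 1 favours terms 1 and 2
          [0.1, 0.2, 0.9]]   # component 2 favours term 3

def component_prob(x, theta):
    # Product of independent Bernoulli terms for one component.
    p = 1.0
    for xj, tj in zip(x, theta):
        p *= tj if xj else (1.0 - tj)
    return p

def mixture_prob(x):
    return sum(a * component_prob(x, t) for a, t in zip(alphas, thetas))

def responsibility(x):
    # Posterior p(component k | x), the quantity the E-step computes.
    joint = [a * component_prob(x, t) for a, t in zip(alphas, thetas)]
    z = sum(joint)
    return [j / z for j in joint]
```

The `responsibility` function is what links this model to the document-term pictures that follow: documents dominated by terms 1 and 2 get assigned to component 1, and so on.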
Slide 17: Mixtures of Naïve Bayes
[Figure: binary document-term matrix; rows are documents, columns are terms]
Slide 18: Mixtures of Naïve Bayes
[Figure: the same document-term matrix with rows reordered, revealing two block structures labelled Component 1 and Component 2]
Slides 19-21: Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish; C ∈ {male, female}
- 2. C might have an interpretation
  - e.g., clusters of Web surfers
- 3. C is just a convenient latent variable
  - e.g., flexible density estimation
Slide 22: Learning Mixtures from Data
- Consider fixed K
  - e.g., unknown parameters Θ = {μ1, σ1, μ2, σ2, α1}
- Given data D = {x1, ..., xN}, we want to find the parameters Θ that best fit the data
Slides 23-24: Maximum Likelihood Principle
- Fisher, 1922
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely
Slide 26: Example of a Log-Likelihood Surface
[Figure: log-likelihood surface; axes are mean 2 and a log scale for sigma 2]
Slide 27: 1977, The EM Algorithm
- Dempster, Laird, and Rubin
- general framework for likelihood-based parameter estimation with missing data
- start with initial guesses of parameters
- E-step: estimate memberships given params
- M-step: estimate params given memberships
- repeat until convergence
- converges to a (local) maximum of likelihood
- E-step and M-step are often computationally simple
- generalizes to maximum a posteriori (with priors)
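The E-step/M-step loop can be written out directly for a one-dimensional Gaussian mixture. This is an illustrative sketch, not production code: initialization simply spreads the means over the data range, and a fixed iteration count stands in for a convergence check.

```python
import math

def em_gmm_1d(data, K=2, iters=50):
    """EM for a 1-D mixture of K Gaussians (illustrative sketch)."""
    # Crude initial guesses: spread the means across the data range.
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * (k + 0.5) / K for k in range(K)]
    sigmas = [1.0] * K
    alphas = [1.0 / K] * K
    for _ in range(iters):
        # E-step: membership probabilities (responsibilities) given params.
        resp = []
        for x in data:
            w = [a * math.exp(-0.5 * ((x - m) / s) ** 2) / s
                 for a, m, s in zip(alphas, mus, sigmas)]
            tot = sum(w)
            resp.append([wi / tot for wi in w])
        # M-step: re-estimate params from the weighted counts.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            alphas[k] = nk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2
                      for r, x in zip(resp, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return alphas, mus, sigmas
```

Note how both steps are simple closed-form updates, which is the point made on the slide; convergence is only guaranteed to a local maximum, so restarts from different initializations are common in practice.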
Slide 36: [Figure: mixture model fits, Control Group vs. Anemia Group]
Slide 37: Alternatives to EM
- Direct optimization
  - e.g., gradient descent, Newton methods
  - EM is simpler to implement
- Sampling (e.g., MCMC)
  - computationally intensive
- Minimum-distance methods
Slide 38: How many components?
- Two general approaches:
  - 1. best density estimator
    - e.g., what predicts best on new data
  - 2. true number of components
    - typically cannot be determined from data alone
Slides 39-45: [Figure sequence: the data-generating process (truth) compared against model classes of increasing size]
- K = 1 model class: the model closest to the truth in terms of log p scores on new data is still relatively far from the truth => high bias
- K = 10 model class: the best model in the class is closer to the truth => low bias
- However, with limited data the K = 10 model that best fits the observed data may be far from the best model in its class => high variance
Slide 46: Prescriptions for Model Selection
- minimize distance to truth
- Method 1: predictive log p scores
  - calculate log p(test data | model k)
  - select the model that predicts best
- Method 2: Bayesian techniques
  - p(k | data) impossible to compute exactly
  - closed-form approximations: BIC, AutoClass, MDL, etc.
  - sampling: Monte Carlo techniques are quite tricky for mixtures
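Method 1 above is easy to operationalize: score each candidate model by its total log-probability on held-out data and keep the winner. A sketch for 1-D Gaussian mixture candidates (the candidate parameter sets in the usage example are hypothetical):

```python
import math

def gaussian_mixture_logp(data, alphas, mus, sigmas):
    """Total log p(data | model) for a 1-D Gaussian mixture."""
    total = 0.0
    for x in data:
        p = sum(a * math.exp(-0.5 * ((x - m) / s) ** 2)
                / (s * math.sqrt(2 * math.pi))
                for a, m, s in zip(alphas, mus, sigmas))
        total += math.log(p)
    return total

def select_model(test_data, candidates):
    """candidates: dict name -> (alphas, mus, sigmas).
    Returns the name with the best held-out score, plus all scores."""
    scores = {name: gaussian_mixture_logp(test_data, *m)
              for name, m in candidates.items()}
    return max(scores, key=scores.get), scores
```

For example, on held-out data drawn from two well-separated Gaussians, a correctly placed two-component candidate will beat any single Gaussian on this score.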
Slide 47: Mixtures of non-vector data
- Example
  - N individuals, and sets of sequences for each
  - e.g., Web session data
- Clustering of the N individuals?
  - vectorize the data and apply vector methods?
  - estimate parameters for each sequence and cluster in parameter space?
  - pairwise distances between sequences?
Slides 48-49: Mixtures of Sequences, Curves, ...
Generative model:
- pick individual i
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(D_i | c_k), we can define an EM algorithm.
Slide 50: Outline
- Part 1: Basic concepts in mixture modeling
  - representational capabilities of mixtures
  - learning mixtures from data
  - extensions of mixtures to non-vector data
- Part 2: New applications of mixtures
  - 1. visualization and clustering of Web log data
  - 2. predictive profiles from transaction data
  - 3. query approximation problems
Slide 51: Application 1: Web Log Visualization and Clustering (Cadez, Heckerman, Meek, Smyth, White, KDD 2000)
Slide 52: Web Log Visualization
- MSNBC Web logs
  - 2 million individuals per day
  - different session lengths per individual
  - difficult visualization and clustering problem
- WebCanvas
  - uses mixtures of finite state machines to cluster individuals
  - software tool: EM mixture modeling plus visualization
Slide 54: Web Log Files

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -
[Figure: the log records above converted into sessions, shown as sequences of page-category numbers for User 1 through User 5]
Slide 55: Model: Mixtures of SFSMs
- SFSM = stochastic finite state machine
- simple model for traversal on a Web site
  - (equivalent to first-order Markov with end-state)
- generative model for large sets of Web users
  - different behaviors <-> mixture of SFSMs
- EM algorithm is quite simple: weighted counts
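The component model can be sketched as follows: a session's log-likelihood under one first-order Markov chain, and the MAP cluster assignment under a mixture of such chains (the hard-assignment flavour of the E-step). The end-state is omitted for brevity, and the two-category chain parameters in the test are hypothetical:

```python
import math

def markov_logp(session, init, trans):
    """log p(session) under one first-order Markov chain:
    p(s_1) * prod_t p(s_t | s_{t-1}).  End-state omitted for brevity."""
    lp = math.log(init[session[0]])
    for prev, cur in zip(session, session[1:]):
        lp += math.log(trans[prev][cur])
    return lp

def map_cluster(session, weights, chains):
    """MAP component assignment for one session under a mixture of chains."""
    scores = [math.log(w) + markov_logp(session, init, trans)
              for w, (init, trans) in zip(weights, chains)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The mixture EM alternates these per-session assignments (soft, in the full algorithm) with re-estimating each chain's initial and transition probabilities from the weighted counts.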
Slide 57: Timing Results [figure]
Slide 58: WebCanvas (Cadez et al., KDD 2000) [screenshot]
Slide 59: Application 2: Building Predictive Profiles from Transaction Data (Cadez, Smyth, Mannila, KDD 2001)
Slide 60: Example of Transaction Data [figure]
Slide 61: Profiling Approaches
- Predictive profile
  - a predictive model of an individual's behavior
- Histograms
  - simple but inefficient
  - no generalization: p(product you did not buy) = 0
- Collaborative filtering
  - your profile = function(k most similar other individuals)
  - ad hoc, no statistical basis (e.g., cannot incorporate covariates, seasonality, etc.)
- Proposed approach: generative probabilistic models
  - mixtures of baskets (captures heterogeneity)
  - hierarchical Bayes (helps with sparseness)
Slide 62: The Nature of Transaction Data
- Large and sparse
  - N = number of individuals, can be of order millions
  - P = number of items, can be in the 1000s
- Very sparse
  - each transaction may contain only a few items
  - most individuals have only a few transactions
- Implications for modeling
  - effectively modeling the joint distribution of a set of very high-dimensional binary/categorical random variables
  - relatively little information on any single individual
  - typically want to start inferring a profile even after an individual purchases a single item
  - the volume of data and the nature of the applications (e.g., e-commerce) dictate that inference methods must be computationally efficient
Slide 63: Mixture-based Profiles

p(basket | individual i) = Σ_k α_ik p_k(basket)

- the α_ik form the predictive profile for individual i: the probability that individual i engages in behavior k, given that they enter the store
- p_k is the basket model for component k (a multinomial that emphasizes certain products)
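Concretely, an individual's predictive distribution over items is their weight vector pushed through the shared components. A tiny sketch with hypothetical parameters (2 components, 4 items):

```python
# Shared multinomial basket components, p(item j | component k).
# Hypothetical values: component 1 emphasises items 0-1, component 2 items 2-3.
components = [
    [0.70, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]

def predictive_profile(individual_weights):
    """p(item j | individual i) = sum_k alpha_ik * p(item j | component k)."""
    n_items = len(components[0])
    return [sum(a * comp[j]
                for a, comp in zip(individual_weights, components))
            for j in range(n_items)]
```

An individual with weights [0.5, 0.5] gets a profile that blends both behaviors, while weights [1.0, 0.0] reproduce component 1 exactly.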
Slide 64: Hierarchical Model
[Figure: a population prior on mixture weights sits above Individual 1 ... Individual N, each with their own baskets B1, B2, B3, ...]
- individuals with little data get shrunk to the prior
- individuals with a lot of data are more data-driven
Slide 65: Inference Algorithms
- MAP/empirical Bayes vs. full Bayes
  - full Bayesian analysis is computationally impractical
  - 59k transactions even for this small study data set
- we use a maximum a posteriori (MAP) inference approach
  - the prior is matched to the data, i.e., empirical Bayes
Slide 66: Inference Algorithm
- 3-phase estimation algorithm
  - 1. Use EM (MAP version) to learn a K-component mixture model
    - ignore individual grouping; just find K components for all transactions
  - 2. Empirical Bayes prior
    - use the resulting global mixture weights to determine the mean of the population prior (Dirichlet)
  - 3. Fit the individual weights (α_k for each individual)
    - use EM (MAP) again on each individual, with the population prior
    - mixture components are fixed; EM just finds the weights (very fast)
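Phase 3 can be sketched as below. The components stay fixed and the Dirichlet population prior enters as pseudo-counts, so individuals with few purchases are shrunk toward the prior. This is a posterior-mean-style shrinkage sketch rather than the paper's exact update, and the helper name and all parameter values in the test are hypothetical:

```python
def fit_individual_weights(items, components, prior_counts, iters=20):
    """EM for one individual's mixture weights, components held fixed.
    items: purchased item indices; components[k][j] = p(item j | k);
    prior_counts: Dirichlet-style pseudo-counts from the population prior."""
    K = len(components)
    weights = [1.0 / K] * K
    for _ in range(iters):
        # E-step: responsibility of each component for each purchased item.
        counts = list(prior_counts)  # prior acts like pseudo-counts
        for j in items:
            post = [w * comp[j] for w, comp in zip(weights, components)]
            z = sum(post)
            for k in range(K):
                counts[k] += post[k] / z
        # M-step: normalised pseudo-counts give the new weights.
        total = sum(counts)
        weights = [c / total for c in counts]
    return weights
```

With an empty purchase history the weights collapse to the (normalised) prior, which is exactly the shrinkage behavior described on the hierarchical-model slide.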
Slide 67: Experiments on Real Data
- Retail transaction data set
  - 2 years' worth of transactions from a chain of 9 stores
  - 1 million transactions in total
  - 200,000 individuals, product tree of 50k items
- Experiments described here
  - data used for model training (months 1 to 6)
    - 4300 individuals with at least 10 transactions (10 store visits)
    - 58,886 transactions, 164,000 items purchased
  - out-of-sample data used for model testing (months 7 and 8)
    - 4040 individuals, 25,292 transactions, and 69,103 items
- Predictive accuracy on out-of-sample data
  - Logp: log p score on new data; higher is better
  - -Logp/n: predictive entropy; lower is better
Slide 68: Transaction Data [figure]
Slide 69: Examples of Mixture Components
- components encode typical combinations of clothes
Slides 70-74: Data and Profile Example [figure sequence]
- no training data for 14
- no purchases above Dept 25
Slide 78: Time taken to fit a model with 4300 individuals, 59,000 transactions, and 164,000 items [figure]
Slide 79: Ongoing Work
- Applications
  - interactive visualization and exploration tool
  - early identification of high-value customers
  - segmentation
- Extensions
  - factored mixtures: multiple behaviors in one transaction
  - time-rate of purchases (e.g., Poisson, seasonal)
  - covariate information (e.g., demographics, etc.)
  - outlier detection, clustering, forecasting, cross-selling
- Other applications
  - sequential profiles for Web users
  - component models integrating time and content
  - hierarchical models
Slide 80: Summary of Transaction Results
- Predictive performance out-of-sample
  - hierarchical mixtures are better than both global mixture weights and hierarchical multinomials
  - predictive power continues to improve up to about K = 50 to 100 mixture components
- Computational efficiency
  - model fitting is relatively fast
  - estimation time scales at roughly 10 to 100 transactions per second
- Predictive profiles are interpretable, fast, and accurate
Slide 81: Application 3: Fast Approximate Querying (Pavlov and Smyth, KDD 2001)
Slides 82-84: Query Approximation
[Figure: a query generator issues queries against a large database, with approximate probability models in between]
- construct probability models offline, e.g., mixtures, belief networks, etc.
- provide fast query answers online
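The simplest such offline model is full independence over attributes; richer choices (mixtures, belief networks) follow the same pattern of replacing a scan of the records with a probability computation. A minimal independence-model sketch, with a hypothetical data layout where records are tuples of categorical attribute values:

```python
def true_count(records, query):
    """Exact answer by scanning; query maps attribute index -> value."""
    return sum(all(r[j] == v for j, v in query.items()) for r in records)

def approx_count(marginals, n_records, query):
    """Independence-model estimate: N * prod_j p(attr_j = value_j).
    marginals[j] is a dict mapping each value of attribute j to its
    probability, estimated offline from the database."""
    p = 1.0
    for j, v in query.items():
        p *= marginals[j].get(v, 0.0)
    return n_records * p
```

The online query cost is proportional to the number of query clauses, not the number of records; the price is approximation error whenever attributes are correlated.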
Slide 85: Model Averaging
Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty:

p(x | D) = Σ_k p(x | M_k, D) p(M_k | D)

where p(x | M_k, D) is the prediction of model k and p(M_k | D) is its weight.
Slide 86: Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
- simple idea: use cross-validation to estimate the weights, rather than a Bayesian scheme
- two-phase learning:
  - 1. learn each model M_k on D_train
  - 2. learn the mixture-model weights on D_validation
    - components are fixed; EM just learns the weights
- outperforms any single model-selection technique
- even outperforms "cheating"
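The second phase, learning only the combination weights by EM on validation-set predictive densities, is a few lines. The `val_preds` matrix here is a hypothetical stand-in for each fixed model's density evaluated at each validation point:

```python
def stack_weights(val_preds, iters=100):
    """EM for stacking weights over fixed models.
    val_preds[m][n] = p_m(x_n): density of model m at validation point n."""
    M = len(val_preds)
    N = len(val_preds[0])
    w = [1.0 / M] * M
    for _ in range(iters):
        # E-step: posterior over models for each validation point.
        counts = [0.0] * M
        for n in range(N):
            post = [w[m] * val_preds[m][n] for m in range(M)]
            z = sum(post)
            for m in range(M):
                counts[m] += post[m] / z
        # M-step: weights are the normalised expected counts.
        w = [c / N for c in counts]
    return w
```

Because the component models are frozen, each iteration is cheap: the densities are computed once and only the weight vector is updated.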
Slide 87: Model Averaging for Queries
- the best model is a function of (a) the data and (b) the query distribution λ(Q), the long-run frequency with which each query Q occurs
- choose the model weights to minimize the expected error over queries drawn from λ(Q)
Slide 88: Stacking for Query Model Combining
- conjunctive queries on Microsoft Web data: 32k records, 294 attributes
- available online at the UCI KDD Archive
Slide 89: Other Work in Our Group
- Pattern recognition in time series
  - semi-Markov models for time-series pattern matching
  - Ge and Smyth, KDD 2000
  - applications: semiconductor manufacturing, NASA space station data
- Pattern discovery in categorical sequences
  - unsupervised hidden Markov learning of patterns embedded in background
  - preliminary results at the KDD 2001 temporal data mining workshop
Slide 90: Other Work in Our Group
- Trajectory modeling and mixtures
  - general frameworks for modeling/clustering/prediction with sets of trajectories
  - application to cyclone-tracking and fluid-flow data
  - Gaffney and Smyth, KDD 1999
- Spatial data models for pattern classification
  - learn priors from human-labeled images
  - applications: biological cell image segmentation, detecting double-bent galaxies
- Medical diagnosis with mixtures
  - see Cadez et al., Machine Learning, in press