Large Data Set Analysis using Mixture Models
Seminar at IBM Watson Research Center, June 27th 2001
Padhraic Smyth
Information and Computer Science, University of California, Irvine
www.datalab.uci.edu
Slide 2: Outline
- Part 1: Basic concepts in mixture modeling
  - representational capabilities of mixtures
  - learning mixtures from data
  - extensions of mixtures to non-vector data
- Part 2: New applications of mixtures
  - 1. visualization and clustering of Web log data
  - 2. predictive profiles from transaction data
  - 3. query approximation problems
Slide 3: Acknowledgements
- Students
  - Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
- Collaborators
  - David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
- Funding
  - NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software
Slides 4-7: Finite Mixture Models

p(x) = Σ_{k=1..K} α_k p_k(x | θ_k)

where α_k is the weight of component k, p_k is the component model, and θ_k its parameters.
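As a concrete sketch of this definition, a finite mixture density can be evaluated and sampled from in a few lines; the two-component parameter values below are hypothetical, chosen only for illustration:

```python
import math
import random

# Finite mixture: p(x) = sum_k alpha_k * p_k(x | theta_k)
# Hypothetical 1-D example with two Gaussian components.
weights = [0.3, 0.7]                 # alpha_k, must sum to 1
params = [(0.0, 1.0), (4.0, 0.5)]    # (mean, sigma) per component

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x):
    # Weighted sum of the component densities.
    return sum(a * gaussian_pdf(x, mu, s)
               for a, (mu, s) in zip(weights, params))

def sample_mixture():
    # Generative view: first pick a component, then draw from it.
    k = random.choices(range(len(weights)), weights=weights)[0]
    mu, s = params[k]
    return random.gauss(mu, s)
```

The sampling function makes the generative reading of the model explicit: the weights are the probabilities of selecting each component.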
Slides 8-10: Example: Mixture of Gaussians
Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance shape Σ_k.
e.g., K = 2, 1-dimensional: θ = {μ1, σ1, μ2, σ2, α1}
Slides 14-16: Example: Mixture of Naïve Bayes
Conditional independence model for each component (often quite useful to first-order).
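Under this conditional-independence assumption, each component's probability for a binary term vector is a product of per-term Bernoulli probabilities. A minimal sketch with hypothetical parameters (2 components, 3 terms):

```python
# Mixture of naive-Bayes (Bernoulli) components for binary term vectors:
# p(x) = sum_k alpha_k * prod_j theta_kj^x_j * (1 - theta_kj)^(1 - x_j)
alphas = [0.5, 0.5]
thetas = [[0.9, 0.8, 0.1],   # component 1 favours terms 1 and 2
          [0.1, 0.2, 0.9]]   # component 2 favours term 3

def component_prob(x, theta):
    # Product of independent Bernoulli terms for one component.
    p = 1.0
    for xj, tj in zip(x, theta):
        p *= tj if xj else (1.0 - tj)
    return p

def mixture_prob(x):
    return sum(a * component_prob(x, t) for a, t in zip(alphas, thetas))

def responsibility(x):
    # Posterior p(component k | x), the quantity the E-step computes.
    joint = [a * component_prob(x, t) for a, t in zip(alphas, thetas)]
    z = sum(joint)
    return [j / z for j in joint]
```

The `responsibility` function is what links this model to the document-term pictures that follow: documents dominated by terms 1 and 2 get assigned to component 1, and so on.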
Slide 17: Mixtures of Naïve Bayes
[Figure: binary document-term matrix; rows are documents, columns are terms]
Slide 18: Mixtures of Naïve Bayes
[Figure: the same document-term matrix with rows reordered, revealing two block structures labelled Component 1 and Component 2]
Slides 19-21: Interpretation of Mixtures
- 1. C has a direct (physical) interpretation
  - e.g., C = age of fish; C ∈ {male, female}
- 2. C might have an interpretation
  - e.g., clusters of Web surfers
- 3. C is just a convenient latent variable
  - e.g., flexible density estimation
Slide 22: Learning Mixtures from Data
- Consider fixed K
  - e.g., unknown parameters Θ = {μ1, σ1, μ2, σ2, α1}
- Given data D = {x1, ..., xN}, we want to find the parameters Θ that best fit the data
Slides 23-24: Maximum Likelihood Principle
- Fisher, 1922
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely
Slide 26: Example of a Log-Likelihood Surface
[Figure: log-likelihood surface; axes are mean 2 and a log scale for sigma 2]
Slide 27: 1977, The EM Algorithm
- Dempster, Laird, and Rubin
- general framework for likelihood-based parameter estimation with missing data
- start with initial guesses of parameters
- E-step: estimate memberships given params
- M-step: estimate params given memberships
- repeat until convergence
- converges to a (local) maximum of likelihood
- E-step and M-step are often computationally simple
- generalizes to maximum a posteriori (with priors)
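The E-step/M-step loop can be written out directly for a one-dimensional Gaussian mixture. This is an illustrative sketch, not production code: initialization simply spreads the means over the data range, and a fixed iteration count stands in for a convergence check.

```python
import math

def em_gmm_1d(data, K=2, iters=50):
    """EM for a 1-D mixture of K Gaussians (illustrative sketch)."""
    # Crude initial guesses: spread the means across the data range.
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * (k + 0.5) / K for k in range(K)]
    sigmas = [1.0] * K
    alphas = [1.0 / K] * K
    for _ in range(iters):
        # E-step: membership probabilities (responsibilities) given params.
        resp = []
        for x in data:
            w = [a * math.exp(-0.5 * ((x - m) / s) ** 2) / s
                 for a, m, s in zip(alphas, mus, sigmas)]
            tot = sum(w)
            resp.append([wi / tot for wi in w])
        # M-step: re-estimate params from the weighted counts.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            alphas[k] = nk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2
                      for r, x in zip(resp, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return alphas, mus, sigmas
```

Note how both steps are simple closed-form updates, which is the point made on the slide; convergence is only guaranteed to a local maximum, so restarts from different initializations are common in practice.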
Slide 36: [Figure: mixture model fits, Control Group vs. Anemia Group]
Slide 37: Alternatives to EM
- Direct optimization
  - e.g., gradient descent, Newton methods
  - EM is simpler to implement
- Sampling (e.g., MCMC)
  - computationally intensive
- Minimum-distance methods
Slide 38: How many components?
- Two general approaches:
  - 1. best density estimator
    - e.g., what predicts best on new data
  - 2. true number of components
    - typically cannot be determined from data alone
Slides 39-45: [Figure sequence: the data-generating process (truth) compared against model classes of increasing size]
- K = 1 model class: the model closest to the truth in terms of log p scores on new data is still relatively far from the truth => high bias
- K = 10 model class: the best model in the class is closer to the truth => low bias
- However, with limited data the K = 10 model that best fits the observed data may be far from the best model in its class => high variance
Slide 46: Prescriptions for Model Selection
- minimize distance to truth
- Method 1: predictive log p scores
  - calculate log p(test data | model k)
  - select the model that predicts best
- Method 2: Bayesian techniques
  - p(k | data) impossible to compute exactly
  - closed-form approximations: BIC, AutoClass, MDL, etc.
  - sampling: Monte Carlo techniques are quite tricky for mixtures
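Method 1 above is easy to operationalize: score each candidate model by its total log-probability on held-out data and keep the winner. A sketch for 1-D Gaussian mixture candidates (the candidate parameter sets in the usage example are hypothetical):

```python
import math

def gaussian_mixture_logp(data, alphas, mus, sigmas):
    """Total log p(data | model) for a 1-D Gaussian mixture."""
    total = 0.0
    for x in data:
        p = sum(a * math.exp(-0.5 * ((x - m) / s) ** 2)
                / (s * math.sqrt(2 * math.pi))
                for a, m, s in zip(alphas, mus, sigmas))
        total += math.log(p)
    return total

def select_model(test_data, candidates):
    """candidates: dict name -> (alphas, mus, sigmas).
    Returns the name with the best held-out score, plus all scores."""
    scores = {name: gaussian_mixture_logp(test_data, *m)
              for name, m in candidates.items()}
    return max(scores, key=scores.get), scores
```

For example, on held-out data drawn from two well-separated Gaussians, a correctly placed two-component candidate will beat any single Gaussian on this score.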
Slide 47: Mixtures of non-vector data
- Example
  - N individuals, and sets of sequences for each
  - e.g., Web session data
- Clustering of the N individuals?
  - vectorize the data and apply vector methods?
  - estimate parameters for each sequence and cluster in parameter space?
  - pairwise distances between sequences?
Slides 48-49: Mixtures of Sequences, Curves, ...
Generative model:
- pick individual i
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(D_i | c_k), we can define an EM algorithm.
Slide 50: Outline
- Part 1: Basic concepts in mixture modeling
  - representational capabilities of mixtures
  - learning mixtures from data
  - extensions of mixtures to non-vector data
- Part 2: New applications of mixtures
  - 1. visualization and clustering of Web log data
  - 2. predictive profiles from transaction data
  - 3. query approximation problems
Slide 51: Application 1: Web Log Visualization and Clustering (Cadez, Heckerman, Meek, Smyth, White, KDD 2000)
Slide 52: Web Log Visualization
- MSNBC Web logs
  - 2 million individuals per day
  - different session lengths per individual
  - difficult visualization and clustering problem
- WebCanvas
  - uses mixtures of finite state machines to cluster individuals
  - software tool: EM mixture modeling plus visualization
Slide 54: Web Log Files

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -
[Figure: the log records above converted into sessions, shown as sequences of page-category numbers for User 1 through User 5]
Slide 55: Model: Mixtures of SFSMs
- SFSM = stochastic finite state machine
- simple model for traversal on a Web site
  - (equivalent to first-order Markov with end-state)
- generative model for large sets of Web users
  - different behaviors <-> mixture of SFSMs
- EM algorithm is quite simple: weighted counts
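The component model can be sketched as follows: a session's log-likelihood under one first-order Markov chain, and the MAP cluster assignment under a mixture of such chains (the hard-assignment flavour of the E-step). The end-state is omitted for brevity, and the two-category chain parameters in the test are hypothetical:

```python
import math

def markov_logp(session, init, trans):
    """log p(session) under one first-order Markov chain:
    p(s_1) * prod_t p(s_t | s_{t-1}).  End-state omitted for brevity."""
    lp = math.log(init[session[0]])
    for prev, cur in zip(session, session[1:]):
        lp += math.log(trans[prev][cur])
    return lp

def map_cluster(session, weights, chains):
    """MAP component assignment for one session under a mixture of chains."""
    scores = [math.log(w) + markov_logp(session, init, trans)
              for w, (init, trans) in zip(weights, chains)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The mixture EM alternates these per-session assignments (soft, in the full algorithm) with re-estimating each chain's initial and transition probabilities from the weighted counts.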
Slide 57: Timing Results [figure]
Slide 58: WebCanvas (Cadez et al., KDD 2000) [screenshot]
Slide 59: Application 2: Building Predictive Profiles from Transaction Data (Cadez, Smyth, Mannila, KDD 2001)
Slide 60: Example of Transaction Data [figure]
Slide 61: Profiling Approaches
- Predictive profile
  - a predictive model of an individual's behavior
- Histograms
  - simple but inefficient
  - no generalization: p(product you did not buy) = 0
- Collaborative filtering
  - your profile = function(k most similar other individuals)
  - ad hoc, no statistical basis (e.g., cannot incorporate covariates, seasonality, etc.)
- Proposed approach: generative probabilistic models
  - mixtures of baskets (captures heterogeneity)
  - hierarchical Bayes (helps with sparseness)
Slide 62: The Nature of Transaction Data
- Large and sparse
  - N = number of individuals, can be of order millions
  - P = number of items, can be in the 1000s
- Very sparse
  - each transaction may contain only a few items
  - most individuals have only a few transactions
- Implications for modeling
  - effectively modeling the joint distribution of a set of very high-dimensional binary/categorical random variables
  - relatively little information on any single individual
  - typically want to start inferring a profile even after an individual purchases a single item
  - the volume of data and the nature of the applications (e.g., e-commerce) dictate that inference methods must be computationally efficient
Slide 63: Mixture-based Profiles

p(basket | individual i) = Σ_k α_ik p_k(basket)

- the α_ik form the predictive profile for individual i: the probability that individual i engages in behavior k, given that they enter the store
- p_k is the basket model for component k (a multinomial that emphasizes certain products)
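Concretely, an individual's predictive distribution over items is their weight vector pushed through the shared components. A tiny sketch with hypothetical parameters (2 components, 4 items):

```python
# Shared multinomial basket components, p(item j | component k).
# Hypothetical values: component 1 emphasises items 0-1, component 2 items 2-3.
components = [
    [0.70, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]

def predictive_profile(individual_weights):
    """p(item j | individual i) = sum_k alpha_ik * p(item j | component k)."""
    n_items = len(components[0])
    return [sum(a * comp[j]
                for a, comp in zip(individual_weights, components))
            for j in range(n_items)]
```

An individual with weights [0.5, 0.5] gets a profile that blends both behaviors, while weights [1.0, 0.0] reproduce component 1 exactly.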
Slide 64: Hierarchical Model
[Figure: a population prior on mixture weights sits above Individual 1 ... Individual N, each with their own baskets B1, B2, B3, ...]
- individuals with little data get shrunk to the prior
- individuals with a lot of data are more data-driven
Slide 65: Inference Algorithms
- MAP/empirical Bayes vs. full Bayes
  - full Bayesian analysis is computationally impractical
  - 59k transactions even for this small study data set
- we use a maximum a posteriori (MAP) inference approach
  - the prior is matched to the data, i.e., empirical Bayes
Slide 66: Inference Algorithm
- 3-phase estimation algorithm
  - 1. Use EM (MAP version) to learn a K-component mixture model
    - ignore individual grouping; just find K components for all transactions
  - 2. Empirical Bayes prior
    - use the resulting global mixture weights to determine the mean of the population prior (Dirichlet)
  - 3. Fit the individual weights (α_k for each individual)
    - use EM (MAP) again on each individual, with the population prior
    - mixture components are fixed; EM just finds the weights (very fast)
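Phase 3 can be sketched as below. The components stay fixed and the Dirichlet population prior enters as pseudo-counts, so individuals with few purchases are shrunk toward the prior. This is a posterior-mean-style shrinkage sketch rather than the paper's exact update, and the helper name and all parameter values in the test are hypothetical:

```python
def fit_individual_weights(items, components, prior_counts, iters=20):
    """EM for one individual's mixture weights, components held fixed.
    items: purchased item indices; components[k][j] = p(item j | k);
    prior_counts: Dirichlet-style pseudo-counts from the population prior."""
    K = len(components)
    weights = [1.0 / K] * K
    for _ in range(iters):
        # E-step: responsibility of each component for each purchased item.
        counts = list(prior_counts)  # prior acts like pseudo-counts
        for j in items:
            post = [w * comp[j] for w, comp in zip(weights, components)]
            z = sum(post)
            for k in range(K):
                counts[k] += post[k] / z
        # M-step: normalised pseudo-counts give the new weights.
        total = sum(counts)
        weights = [c / total for c in counts]
    return weights
```

With an empty purchase history the weights collapse to the (normalised) prior, which is exactly the shrinkage behavior described on the hierarchical-model slide.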
Slide 67: Experiments on Real Data
- Retail transaction data set
  - 2 years' worth of transactions from a chain of 9 stores
  - 1 million transactions in total
  - 200,000 individuals, product tree of 50k items
- Experiments described here
  - data used for model training (months 1 to 6)
    - 4300 individuals with at least 10 transactions (10 store visits)
    - 58,886 transactions, 164,000 items purchased
  - out-of-sample data used for model testing (months 7 and 8)
    - 4040 individuals, 25,292 transactions, and 69,103 items
- Predictive accuracy on out-of-sample data
  - Logp: log p score on new data; higher is better
  - -Logp/n: predictive entropy; lower is better
Slide 68: Transaction Data [figure]
Slide 69: Examples of Mixture Components
- components encode typical combinations of clothes
Slides 70-74: Data and Profile Example [figure sequence]
- no training data for 14
- no purchases above Dept 25
Slide 78: Time taken to fit a model with 4300 individuals, 59,000 transactions, and 164,000 items [figure]
Slide 79: Ongoing Work
- Applications
  - interactive visualization and exploration tool
  - early identification of high-value customers
  - segmentation
- Extensions
  - factored mixtures: multiple behaviors in one transaction
  - time-rate of purchases (e.g., Poisson, seasonal)
  - covariate information (e.g., demographics, etc.)
  - outlier detection, clustering, forecasting, cross-selling
- Other applications
  - sequential profiles for Web users
  - component models integrating time and content
  - hierarchical models
Slide 80: Summary of Transaction Results
- Predictive performance out-of-sample
  - hierarchical mixtures are better than both global mixture weights and hierarchical multinomials
  - predictive power continues to improve up to about K = 50 to 100 mixture components
- Computational efficiency
  - model fitting is relatively fast
  - estimation time scales at roughly 10 to 100 transactions per second
- Predictive profiles are interpretable, fast, and accurate
Slide 81: Application 3: Fast Approximate Querying (Pavlov and Smyth, KDD 2001)
Slides 82-84: Query Approximation
[Figure: a query generator issues queries against a large database, with approximate probability models in between]
- construct probability models offline, e.g., mixtures, belief networks, etc.
- provide fast query answers online
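The simplest such offline model is full independence over attributes; richer choices (mixtures, belief networks) follow the same pattern of replacing a scan of the records with a probability computation. A minimal independence-model sketch, with a hypothetical data layout where records are tuples of categorical attribute values:

```python
def true_count(records, query):
    """Exact answer by scanning; query maps attribute index -> value."""
    return sum(all(r[j] == v for j, v in query.items()) for r in records)

def approx_count(marginals, n_records, query):
    """Independence-model estimate: N * prod_j p(attr_j = value_j).
    marginals[j] is a dict mapping each value of attribute j to its
    probability, estimated offline from the database."""
    p = 1.0
    for j, v in query.items():
        p *= marginals[j].get(v, 0.0)
    return n_records * p
```

The online query cost is proportional to the number of query clauses, not the number of records; the price is approximation error whenever attributes are correlated.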
Slide 85: Model Averaging
Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty:

p(x | D) = Σ_k p(x | M_k, D) p(M_k | D)

where p(x | M_k, D) is the prediction of model k and p(M_k | D) is its weight.
Slide 86: Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
- simple idea: use cross-validation to estimate the weights, rather than a Bayesian scheme
- two-phase learning:
  - 1. learn each model M_k on D_train
  - 2. learn the mixture-model weights on D_validation
    - components are fixed; EM just learns the weights
- outperforms any single model-selection technique
- even outperforms "cheating"
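The second phase, learning only the combination weights by EM on validation-set predictive densities, is a few lines. The `val_preds` matrix here is a hypothetical stand-in for each fixed model's density evaluated at each validation point:

```python
def stack_weights(val_preds, iters=100):
    """EM for stacking weights over fixed models.
    val_preds[m][n] = p_m(x_n): density of model m at validation point n."""
    M = len(val_preds)
    N = len(val_preds[0])
    w = [1.0 / M] * M
    for _ in range(iters):
        # E-step: posterior over models for each validation point.
        counts = [0.0] * M
        for n in range(N):
            post = [w[m] * val_preds[m][n] for m in range(M)]
            z = sum(post)
            for m in range(M):
                counts[m] += post[m] / z
        # M-step: weights are the normalised expected counts.
        w = [c / N for c in counts]
    return w
```

Because the component models are frozen, each iteration is cheap: the densities are computed once and only the weight vector is updated.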
Slide 87: Model Averaging for Queries
- the best model is a function of (a) the data and (b) the query distribution λ(Q), the long-run frequency with which each query Q occurs
- choose the model weights to minimize the expected error over queries drawn from λ(Q)
Slide 88: Stacking for Query Model Combining
- conjunctive queries on Microsoft Web data: 32k records, 294 attributes
- available online at the UCI KDD Archive
Slide 89: Other Work in Our Group
- Pattern recognition in time series
  - semi-Markov models for time-series pattern matching
  - Ge and Smyth, KDD 2000
  - applications: semiconductor manufacturing, NASA space station data
- Pattern discovery in categorical sequences
  - unsupervised hidden Markov learning of patterns embedded in background
  - preliminary results at the KDD 2001 temporal data mining workshop
Slide 90: Other Work in Our Group
- Trajectory modeling and mixtures
  - general frameworks for modeling/clustering/prediction with sets of trajectories
  - application to cyclone-tracking and fluid-flow data
  - Gaffney and Smyth, KDD 1999
- Spatial data models for pattern classification
  - learn priors from human-labeled images
  - applications: biological cell image segmentation, detecting double-bent galaxies
- Medical diagnosis with mixtures
  - see Cadez et al., Machine Learning, in press