1
Large Data Set Analysis using Mixture Models
Seminar at IBM Watson Research Center, June 27th 2001
Padhraic Smyth
Information and Computer Science, University of California, Irvine
www.datalab.uci.edu
2
Outline
  • Part 1: Basic concepts in mixture modeling
  • representational capabilities of mixtures
  • learning mixtures from data
  • extensions of mixtures to non-vector data
  • Part 2: New applications of mixtures
  • 1. Visualization and clustering of Web log data
  • 2. predictive profiles from transaction data
  • 3. query approximation problems

3
Acknowledgements
  • Students
  • Igor Cadez, Scott Gaffney, Xianping Ge, Dima
    Pavlov
  • Collaborators
  • David Heckerman, Chris Meek, Heikki Mannila,
    Christine McLaren, Geoff McLachlan, David Wolpert
  • Funding
  • NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center,
    Microsoft Research, IBM Research, HNC Software.

4
Finite Mixture Models
5
Finite Mixture Models
6
Finite Mixture Models
7
Finite Mixture Models
[Equation figure: the mixture density written as a weighted sum over components, with labeled parts: Weight_k, Component Model_k, Parameters_k]
8
Example: Mixture of Gaussians
  • Gaussian mixtures


9
Example: Mixture of Gaussians
  • Gaussian mixtures

Each mixture component is a multidimensional Gaussian with its own mean μk and covariance shape Σk
10
Example: Mixture of Gaussians
  • Gaussian mixtures

Each mixture component is a multidimensional Gaussian with its own mean μk and covariance shape Σk, e.g., K=2, 1-dim: θ = {μ1, σ1, μ2, σ2, α1}
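As a concrete illustration of the K=2, 1-dimensional case above, here is a minimal Python sketch (added for this transcript, not part of the original slides; the parameter values are invented) that evaluates the mixture density p(x) = α1 N(x; μ1, σ1) + (1 − α1) N(x; μ2, σ2):

    import math

    def normal_pdf(x, mu, sigma):
        # Density of a 1-D Gaussian N(mu, sigma^2) evaluated at x.
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    def mixture_pdf(x, theta):
        # theta = (mu1, sigma1, mu2, sigma2, alpha1), matching the slide's parameter set.
        mu1, s1, mu2, s2, a1 = theta
        return a1 * normal_pdf(x, mu1, s1) + (1.0 - a1) * normal_pdf(x, mu2, s2)

    # Illustrative parameter values only:
    theta = (0.0, 1.0, 4.0, 0.5, 0.3)
    print(mixture_pdf(1.0, theta))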
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
Example: Mixture of Naïve Bayes

15
Example: Mixture of Naïve Bayes

16
Example: Mixture of Naïve Bayes

Conditional Independence model for each
component (often quite useful to first-order)
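To make the conditional-independence structure concrete, here is a small illustrative sketch (not from the slides; the probability tables are invented) of log p(document | component k) for a binary term vector, and the resulting mixture log-probability:

    import math

    def log_component(x, theta_k):
        # Conditional independence within a component:
        # p(x | k) = prod_j theta_kj^x_j * (1 - theta_kj)^(1 - x_j), for binary x_j.
        return sum(math.log(t if xj else 1.0 - t) for xj, t in zip(x, theta_k))

    def log_mixture(x, alphas, thetas):
        # log p(x) = log sum_k alpha_k p(x | k), via log-sum-exp for numerical stability.
        logs = [math.log(a) + log_component(x, t) for a, t in zip(alphas, thetas)]
        m = max(logs)
        return m + math.log(sum(math.exp(l - m) for l in logs))

    # Two illustrative components over four terms (invented numbers):
    alphas = [0.6, 0.4]
    thetas = [[0.9, 0.8, 0.1, 0.1], [0.1, 0.2, 0.9, 0.7]]
    print(log_mixture([1, 1, 0, 0], alphas, thetas))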
17
Mixtures of Naïve Bayes
[Figure: a sparse binary document-term matrix (rows = documents, columns = terms; 1 = term occurs in document)]
18
Mixtures of Naïve Bayes
[Figure: the same document-term matrix with rows and columns reordered by mixture component, showing block structure for Component 1 and Component 2]
19
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C = {male, female}

20
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C = {male, female}
  • 2. C might have an interpretation
  • e.g., clusters of Web surfers

21
Interpretation of Mixtures
  • 1. C has a direct (physical) interpretation
  • e.g., C = age of fish, C = {male, female}
  • 2. C might have an interpretation
  • e.g., clusters of Web surfers
  • 3. C is just a convenient latent variable
  • e.g., flexible density estimation

22
Learning Mixtures from Data
  • Consider fixed K
  • e.g., unknown parameters Θ = {μ1, σ1, μ2, σ2, α1}
  • Given data D = {x1, ..., xN}, we want to find the parameters Θ that best fit the data

23
Maximum Likelihood Principle
  • Fisher, 1922
  • assume a probabilistic model
  • likelihood = p(data | parameters, model)
  • find the parameters that make the data most likely

24
Maximum Likelihood Principle
  • Fisher, 1922
  • assume a probabilistic model
  • likelihood = p(data | parameters, model)
  • find the parameters that make the data most likely
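In code, the maximum-likelihood principle amounts to scoring candidate parameter settings by the log-likelihood of the observed data, log p(D | Θ) = Σi log p(xi | Θ), and preferring the setting with the higher score. A toy sketch (mine, not from the slides) for the two-component 1-D Gaussian mixture, with invented data and candidate parameters:

    import math

    def normal_pdf(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    def log_likelihood(data, theta):
        # log p(D | theta) = sum_i log p(x_i | theta) for a 2-component 1-D Gaussian mixture.
        mu1, s1, mu2, s2, a1 = theta
        return sum(math.log(a1 * normal_pdf(x, mu1, s1) +
                            (1.0 - a1) * normal_pdf(x, mu2, s2)) for x in data)

    # The ML principle prefers whichever parameter setting makes the data more likely.
    data = [-0.2, 0.1, 0.3, 3.8, 4.1, 4.3]
    theta_a = (0.0, 1.0, 4.0, 0.5, 0.5)
    theta_b = (1.0, 2.0, 5.0, 2.0, 0.5)
    print(log_likelihood(data, theta_a), log_likelihood(data, theta_b))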

25
(No Transcript)
26
Example of a Log-Likelihood Surface

[Figure: log-likelihood surface; axes: Mean 2 and log scale for Sigma 2]
27
1977: The EM Algorithm
  • Dempster, Laird, and Rubin
  • general framework for likelihood-based parameter estimation with missing data
  • start with initial guesses of parameters
  • E-step: estimate memberships given params
  • M-step: estimate params given memberships
  • repeat until convergence
  • converges to a (local) maximum of likelihood
  • E-step and M-step are often computationally simple
  • generalizes to maximum a posteriori (with priors)
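A minimal EM sketch for the two-component 1-D Gaussian mixture, illustrating the E-step/M-step cycle described above (my own sketch, not the original code; initialization and stopping are deliberately crude):

    import math, random

    def normal_pdf(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    def em_two_gaussians(data, n_iter=50):
        # Initial guesses of the parameters.
        mu, sigma, alpha = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]
        for _ in range(n_iter):
            # E-step: membership probabilities given the current parameters.
            resp = []
            for x in data:
                p = [alpha[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
                s = sum(p)
                resp.append([pk / s for pk in p])
            # M-step: re-estimate parameters given the memberships.
            for k in range(2):
                nk = sum(r[k] for r in resp)
                alpha[k] = nk / len(data)
                mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
                var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
                sigma[k] = math.sqrt(max(var, 1e-6))
        return mu, sigma, alpha

    random.seed(0)
    data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(4, 0.5) for _ in range(100)]
    print(em_two_gaussians(data))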

28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
(No Transcript)
36
[Figure: fitted mixtures for the Control Group and the Anemia Group]
37
Alternatives to EM
  • Direct optimization
  • e.g., gradient descent, Newton methods
  • EM is simpler to implement
  • Sampling (e.g., MCMC)
  • computationally intensive
  • Minimum distance, e.g.,

38
How many components?
  • 2 general approaches
  • 1. Best density estimator
  • e.g., what predicts best on new data
  • 2. True number of components
  • typically cannot be done with data alone

39
K=1 Model Class
40
Data-generating process (truth)
K=1 Model Class
41
Data-generating process (truth)
Closest model in terms of log p scores on new data
K=1 Model Class
42
Data-generating process (truth)
Closest model in terms of log p scores on new data
K=1 Model Class
Best model is relatively far from truth => High Bias
43
Data-generating process (truth)
K=1 Model Class
K=10 Model Class
44
Data-generating process (truth)
K=1 Model Class
Best model is closer to truth => Low Bias
K=10 Model Class
45
However, this could be the model that best fits the observed data => High Variance
Data-generating process (truth)
K=1 Model Class
K=10 Model Class
46
Prescriptions for Model Selection
  • Minimize distance to truth
  • Method 1: Predictive log p scores
  • calculate log p(test data | model k)
  • select the model that predicts best
  • Method 2: Bayesian techniques
  • p(k | data) impossible to compute exactly
  • closed-form approximations
  • BIC, Autoclass, MDL, etc.
  • sampling
  • Monte Carlo techniques quite tricky for mixtures
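Method 1 can be sketched in a few lines: fit a mixture for each candidate K on training data and keep the K with the best log p score on held-out data. The sketch below (mine) assumes scikit-learn's GaussianMixture and NumPy are available; the original work did not necessarily use these libraries, and the data here are synthetic.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    train = np.concatenate([rng.normal(0, 1, (300, 1)), rng.normal(4, 0.5, (150, 1))])
    test = np.concatenate([rng.normal(0, 1, (100, 1)), rng.normal(4, 0.5, (50, 1))])

    best_k, best_score = None, -np.inf
    for k in range(1, 6):
        model = GaussianMixture(n_components=k, random_state=0).fit(train)
        score = model.score(test)   # mean log p(test data | model with K=k)
        if score > best_score:
            best_k, best_score = k, score
    print("selected K =", best_k)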

47
Mixtures of non-vector data
  • Example
  • N individuals, and sets of sequences for each
  • e.g., Web session data
  • Clustering of the N individuals?
  • Vectorize data and apply vector methods?
  • Estimate parameters for each sequence and cluster
    in parameter space?
  • Pairwise distances of sequences?

48
Mixtures of Sequences, Curves, ...
49
Mixtures of Sequences, Curves, ...
Generative model:
  - pick individual i
  - select a component ck for individual i
  - generate data according to p(Di | ck)
  - p(Di | ck) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(Di | ck), we can define an EM algorithm
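The key point above is that EM only needs a way to evaluate p(Di | ck) for each individual's data Di. A schematic sketch (mine, not from the slides), with a deliberately simple per-symbol multinomial standing in for the component model, which in general could be a sequence, curve, or spatial model:

    import math, random
    from collections import defaultdict

    def log_p_individual(seqs, comp):
        # log p(D_i | c_k): here a per-symbol multinomial over a finite alphabet;
        # in general this could be any model of sequences, curves, etc.
        return sum(math.log(comp[s]) for seq in seqs for s in seq)

    def em_sequence_mixture(individuals, alphabet, K=2, n_iter=30, seed=0):
        # individuals: list of D_i, where each D_i is a list of sequences (lists of symbols).
        rng = random.Random(seed)
        alpha = [1.0 / K] * K
        comps = []
        for _ in range(K):
            w = [rng.random() + 0.1 for _ in alphabet]
            z = sum(w)
            comps.append({s: wi / z for s, wi in zip(alphabet, w)})
        for _ in range(n_iter):
            # E-step: membership probability of each individual in each component.
            resp = []
            for seqs in individuals:
                logs = [math.log(alpha[k]) + log_p_individual(seqs, comps[k]) for k in range(K)]
                m = max(logs)
                p = [math.exp(l - m) for l in logs]
                z = sum(p)
                resp.append([pk / z for pk in p])
            # M-step: weighted counts re-estimate the weights and component parameters.
            for k in range(K):
                nk = sum(r[k] for r in resp)
                alpha[k] = nk / len(individuals)
                counts = defaultdict(float)
                for r, seqs in zip(resp, individuals):
                    for seq in seqs:
                        for s in seq:
                            counts[s] += r[k]
                total = sum(counts.values()) + len(alphabet)  # add-one smoothing
                comps[k] = {s: (counts[s] + 1.0) / total for s in alphabet}
        return alpha, comps

    individuals = [[["a", "a", "b"]], [["a", "b", "a", "a"]], [["c", "c", "d"]], [["d", "c", "c", "c"]]]
    print(em_sequence_mixture(individuals, alphabet=["a", "b", "c", "d"]))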
50
Outline
  • Part 1: Basic concepts in mixture modeling
  • representational capabilities of mixtures
  • learning mixtures from data
  • extensions of mixtures to non-vector data
  • Part 2: New applications of mixtures
  • 1. predictive profiles from transaction data
  • 2. sequence clustering with mixtures of Markov
    models
  • 3. query approximation problems

51
Application 1: Web Log Visualization and Clustering
(Cadez, Heckerman, Meek, Smyth, White, KDD 2000)
52
Web Log Visualization
  • MSNBC Web logs
  • 2 million individuals per day
  • different session lengths per individual
  • difficult visualization and clustering problem
  • WebCanvas
  • uses mixtures of finite state machines to cluster
    individuals
  • software tool: EM mixture modeling + visualization

53
(No Transcript)
54
Web Log Files
128.195.36.195, -, 3/22/00, 103511, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 103516, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 103517, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 161850, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 161858, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 161859, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205437, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 205455, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205455, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205507, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205536, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205536, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205539, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205603, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 205604, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 205633, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 205652, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
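Going from raw server-log records like those above to per-user page sequences requires a parsing/sessionization step. A simplified sketch (mine, not from the talk; field positions follow the example records above, and real sessionization would also split by time gaps and map URLs to page categories):

    from collections import defaultdict

    # Three of the records from the log excerpt above:
    raw = """128.195.36.195, -, 3/22/00, 103511, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
    128.195.36.195, -, 3/22/00, 103516, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
    128.200.39.17, -, 3/22/00, 205437, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,"""

    def sessions_by_user(log_text):
        # Group requested URLs by client IP, preserving request order.
        sessions = defaultdict(list)
        for line in log_text.strip().splitlines():
            fields = [f.strip() for f in line.split(",")]
            client_ip, url = fields[0], fields[13]   # field positions as in the excerpt above
            sessions[client_ip].append(url)
        return sessions

    print(dict(sessions_by_user(raw)))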
[Figure: the same sessions represented as sequences of page-category numbers for User 1 through User 5]
55
Model: Mixtures of SFSMs
  • SFSM = stochastic finite state machine
  • Simple model for traversal on a Web site
  • (equivalent to first-order Markov with end-state)
  • Generative model for large sets of Web users
  • different behaviors <-> mixture of SFSMs
  • EM algorithm is quite simple (weighted counts)
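A minimal sketch of the per-component model (mine, not the WebCanvas implementation): a first-order Markov chain over page categories with an explicit end state, estimated from sessions by counts. Passing E-step membership weights in as `weights` gives the weighted-count M-step mentioned above; initial-state probabilities are omitted for brevity.

    from collections import defaultdict

    END = "end"   # explicit end state

    def estimate_markov(sessions, weights=None, smoothing=1.0):
        # sessions: list of page-category sequences, e.g. [["news", "sports"], ["weather"]].
        # weights: per-session membership weights from an E-step (all 1.0 if None).
        if weights is None:
            weights = [1.0] * len(sessions)
        counts = defaultdict(lambda: defaultdict(float))
        states = {END}
        for sess, w in zip(sessions, weights):
            states.update(sess)
            for a, b in zip(sess, sess[1:] + [END]):   # includes the transition to the end state
                counts[a][b] += w
        trans = {}
        for a in states - {END}:
            total = sum(counts[a].values()) + smoothing * len(states)
            trans[a] = {b: (counts[a][b] + smoothing) / total for b in states}
        return trans

    sessions = [["news", "news", "sports"], ["weather"], ["news", "weather", "weather"]]
    print(estimate_markov(sessions)["news"])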

56
(No Transcript)
57
Timing Results
58
WebCanvas (Cadez et al., KDD 2000)
59
Application 2: Building Predictive Profiles from Transaction Data
(Cadez, Smyth, Mannila, KDD 2001)
60
Example of Transaction Data
61
Profiling Approaches
  • Predictive profile
  • a predictive model of an individual's behavior
  • Histograms
  • simple but inefficient
  • no generalization: p(product you did not buy) = 0
  • Collaborative filtering
  • your profile = function(k most similar other individuals)
  • ad hoc: no statistical basis (e.g., cannot incorporate covariates, seasonality, etc.)
  • Proposed approach: generative probabilistic models
  • mixtures of baskets (captures heterogeneity)
  • hierarchical Bayes (helps with sparseness)

62
The Nature of Transaction Data
  • Large and sparse
  • N = number of individuals, can be of the order of millions
  • P = number of items, can be in the 1000s
  • Very sparse
  • each transaction may only have a few items
  • most individuals only have a few transactions
  • Implications for modeling
  • effectively modeling the joint distribution of a set of very high-dimensional binary/categorical random variables
  • relatively little information on any single individual
  • typically want to start inferring a profile even after an individual purchases a single item
  • volume of data and the nature of the applications (e.g., e-commerce) dictate that inference methods must be computationally efficient

63
Mixture-based Profiles
[Equation figure: the predictive profile for individual i is a weighted sum over components k of (the probability that individual i engages in behavior k, given that they enter the store) times (the basket model for component k: a multinomial that emphasizes certain products)]
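In code, this predictive profile is just a weighted sum of component multinomials, with weights specific to the individual. A tiny sketch (mine) with invented numbers and item names:

    def predictive_profile(individual_weights, components):
        # individual_weights[k]: probability that individual i engages in behavior k.
        # components[k][item]: p(item | component k), a multinomial over items.
        items = components[0].keys()
        return {item: sum(w * comp[item] for w, comp in zip(individual_weights, components))
                for item in items}

    # Invented two-component example:
    components = [
        {"shirts": 0.7, "pants": 0.2, "shoes": 0.1},   # behavior 1
        {"shirts": 0.1, "pants": 0.3, "shoes": 0.6},   # behavior 2
    ]
    alpha_i = [0.8, 0.2]   # weights for individual i
    print(predictive_profile(alpha_i, components))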
64
Hierarchical Model
[Figure: hierarchical model with a population prior on the mixture weights; each of Individual 1, ..., Individual i, ..., Individual N generates its own baskets B1, B2, B3, ...]
Individuals with little data get shrunk toward the prior; individuals with a lot of data are more data-driven.
65
Inference Algorithms
  • MAP/empirical Bayes vs. full Bayes
  • full Bayesian analysis is computationally impractical
  • 59k transactions even for this small study data set
  • We use a maximum a posteriori (MAP) inference approach
  • prior is matched to the data, i.e., empirical Bayes

66
Inference Algorithm
  • 3-phase estimation algorithm
  • 1. Use EM (MAP version) to learn a K-component
    mixture model
  • Ignore individual grouping, just find K
    components for all transactions
  • 2. Empirical Bayes prior
  • Use the resulting global mixture weights to
    determine the mean of the population prior
    (Dirichlet)
  • 3. Fitting of individual weights (αk for each individual)
  • Use EM (MAP) again on each individual, with
    population prior
  • Mixture components are fixed, just use EM to find
    the weights (very fast)
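A sketch of phase 3 (my own illustration, not the paper's code): with the K component multinomials held fixed, a few EM iterations per individual re-estimate only that individual's mixture weights, with Dirichlet pseudo-counts (mean set to the phase-2 global weights) pulling individuals with little data toward the population prior.

    def fit_individual_weights(items, components, prior_mean, prior_strength=5.0, n_iter=20):
        # items: the items purchased by this individual (with repeats).
        # components[k][item] = p(item | component k), fixed from phase 1.
        # prior_mean: global mixture weights from phase 2 (Dirichlet mean);
        # prior_strength: how many "pseudo-items" the prior is worth.
        K = len(components)
        alpha = list(prior_mean)
        for _ in range(n_iter):
            # E-step: responsibility of each component for each purchased item.
            counts = [0.0] * K
            for item in items:
                p = [alpha[k] * components[k].get(item, 1e-6) for k in range(K)]
                z = sum(p)
                for k in range(K):
                    counts[k] += p[k] / z
            # M-step: add Dirichlet pseudo-counts (a smoothed, MAP-style update) and normalize.
            post = [counts[k] + prior_strength * prior_mean[k] for k in range(K)]
            z = sum(post)
            alpha = [v / z for v in post]
        return alpha

    components = [
        {"shirts": 0.7, "pants": 0.2, "shoes": 0.1},
        {"shirts": 0.1, "pants": 0.3, "shoes": 0.6},
    ]
    # An individual with only three purchases stays fairly close to the 50/50 prior:
    print(fit_individual_weights(["shoes", "shoes", "pants"], components, prior_mean=[0.5, 0.5]))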

67
Experiments on Real Data
  • Retail transaction data set
  • 2 years' worth of transactions from a chain of 9 stores
  • 1 million transactions in total
  • 200,000 individuals, product tree of 50k items
  • Experiments described here
  • Data used for model training (months 1 to 6)
  • 4300 individuals with at least 10 transactions (10 store visits)
  • 58,886 transactions, 164,000 items purchased
  • Out-of-sample data used for model testing (months 7 and 8)
  • 4040 individuals, 25,292 transactions, and 69,103 items
  • Predictive accuracy on out-of-sample data
  • LogP = log p score on new data, higher is better
  • -LogP/n = predictive entropy, lower is better

68
Transaction Data
69
Examples of Mixture Components
Components encode typical combinations of
clothes
70
Data and Profile Example
71
Data and Profile Example
72
Data and Profile Example
73
Data and Profile Example
74
Data and Profile Example
No Training Data for 14
No Purchases above Dept 25
75
(No Transcript)
76
(No Transcript)
77
(No Transcript)
78
Time taken to fit a model with 4300 individuals, 59,000 transactions, and 164,000 items
79
Ongoing Work
  • Applications
  • interactive visualization and exploration tool
  • early identification of high-value customers
  • segmentation
  • Extensions
  • factored mixtures: multiple behaviors in one transaction
  • time-rate of purchases (e.g., Poisson, seasonal)
  • covariate information (e.g., demographics, etc.)
  • outlier detection, clustering, forecasting, cross-selling
  • Other applications
  • sequential profiles for Web users
  • component models integrate time and content
  • hierarchical models

80
Summary of Transaction Results
  • Predictive performance out-of-sample
  • hierarchical mixtures are better than both
  • global mixture weights
  • hierarchical multinomials
  • predictive power continues to improve up to about K = 50 to 100 mixture components
  • Computational efficiency
  • model fitting is relatively fast
  • estimation time scales roughly as 10 to 100 transactions per second
  • Predictive profiles are interpretable, fast, and accurate

81
Application 3: Fast Approximate Querying
(Pavlov and Smyth, KDD 2001)
82
Query Approximation
Large Database
Approximate Models
Query Generator
83
Query Approximation
Large Database
Approximate Models
Query Generator
Construct Probability Models Offline (e.g., mixtures, belief networks, etc.)
84
Query Approximation
Large Database
Query Generator
Approximate Models
Construct Probability Models Offline (e.g., mixtures, belief networks, etc.)
Provide Fast Query Answers Online
85
Model Averaging
Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty.
[Equation figure: prediction of x given data D = sum over k of (prediction of model k) times (weight of model k)]
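Written as code, the averaged prediction is simply a weighted sum of the per-model predictions (a minimal sketch with invented numbers):

    def model_average(predictions, weights):
        # predictions[k] = p(x | model k);  weights[k] = p(model k | data D).
        # p(x | D) = sum_k p(x | model k) * p(model k | D).
        return sum(p * w for p, w in zip(predictions, weights))

    print(model_average([0.02, 0.05, 0.01], [0.5, 0.3, 0.2]))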
86
Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
Simple idea: use cross-validation to estimate the weights, rather than using a Bayesian scheme.
Two-phase learning:
  1. Learn each model Mk on Dtrain
  2. Learn mixture model weights on Dvalidation
     - components are fixed
     - EM just learns the weights
Outperforms any single model selection technique; even outperforms "cheating".
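A sketch of the second phase (mine, not the paper's code): with the component models Mk already fit on Dtrain, EM on the validation-set densities p(xi | Mk) learns only the combination weights.

    def stack_weights(val_densities, n_iter=100):
        # val_densities[i][k] = p(x_i | M_k) for validation point x_i under model k.
        # The component models are fixed; EM updates only the mixture weights.
        K = len(val_densities[0])
        w = [1.0 / K] * K
        for _ in range(n_iter):
            resp_sums = [0.0] * K
            for dens in val_densities:
                p = [w[k] * dens[k] for k in range(K)]
                z = sum(p)
                for k in range(K):
                    resp_sums[k] += p[k] / z
            w = [s / len(val_densities) for s in resp_sums]
        return w

    # Densities of four validation points under two candidate models (invented numbers):
    val = [[0.20, 0.05], [0.10, 0.02], [0.01, 0.30], [0.15, 0.12]]
    print(stack_weights(val))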

87
Model Averaging for Queries
Best model is a function of (a) the data and (b) the query distribution. Minimize:
[Equation figure, with annotations: weight of model k; distribution over queries, i.e., the long-run frequency with which Q occurs]
88
Stacking for Query Model Combining
Conjunctive queries on Microsoft Web data: 32k records, 294 attributes (available online at the UCI KDD Archive)
89
Other Work in Our Group
  • Pattern recognition in time series
  • semi-Markov models for time-series pattern
    matching
  • Ge and Smyth, KDD 2000
  • applications
  • semiconductor manufacturing
  • NASA space station data
  • Pattern discovery in categorical sequences
  • unsupervised hidden Markov learning of patterns
    embedded in background
  • preliminary results at KDD 2001 temporal DM
    workshop

90
Other Work in Our Group
  • Trajectory modeling and mixtures
  • general frameworks for modeling/clustering/prediction with sets of trajectories
  • application to cyclone-tracking and fluid-flow data
  • Gaffney and Smyth, KDD 1999
  • Spatial data models for pattern classification
  • learn priors from human-labeled images; applications:
  • biological cell image segmentation
  • detecting double-bent galaxies
  • Medical diagnosis with mixtures
  • see Cadez et al., Machine Learning, in press