Model Compression
Transcript and Presenter's Notes
1
Model Compression
  • Rich Caruana
  • Computer Science
  • Cornell University
joint work with Cristian Bucila and Alex
    Niculescu-Mizil

2
Outline
  • Motivation
  • Ensemble learning usually most accurate
  • Ensemble models can be large and slow
  • Model compression
  • Where does data come from?
  • Experimental results
  • Related work
  • Future work
  • Summary

3
Supervised Learning
  • Major Goals
  • Accurate Models
  • Easy to train
  • Fast to train
  • Can deal with many data types
  • Can deal with many performance criteria
  • Does not require too much human expertise
  • Compact, easy to use models
  • Intelligible models
  • Fast predictions
  • Confidences for predictions
  • Explanations for predictions

4
Normalized Scores for ES
5
Ensemble Selection Works, But Is It Worth It?
  • Best of best of best yields 20% reduction in
    loss compared to boosted trees
  • Accuracy or AUC increase from 88% to 90%
  • RMS decrease from 0.25 to 0.20
  • Typically 10% reduction in loss compared to best
    model above
  • Accuracy or AUC increase from 90% to 91%
  • RMS decrease from 0.20 to 0.18
  • Overall reduction in loss can be 30%, which is
    significant

6
Computational Cost
  • Have to train multiple models anyway
  • models can be trained in parallel
  • different packages, different machines, at
    different times, by different people
  • just generate and collect (no optimization
    necessary, no test sets)
  • saves human effort -- no need to examine/optimize
    models
  • 48 hours on 10 workstations to train 2000
    models with 5k train sets
  • model library can be built before optimization
    metric is known
  • anytime selection -- no need to wait for all
    models
  • Ensemble Selection itself is cheap (a sketch of
    this greedy loop follows the list)
  • each iteration, consider adding any of the 2000
    models to the ensemble
  • adding a model is simple unweighted averaging of
    predictions
  • caching predictions makes this very efficient
  • compute the performance metric as each model is
    added
  • over 250 iterations, evaluate 250 × 2000 = 500,000
    ensembles
  • about 1 minute on a workstation if the metric is
    not expensive
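
A minimal sketch of the greedy loop just described, assuming model predictions on a hillclimb set are cached as a NumPy array and RMSE is the metric; the names (model_preds, rmse) are illustrative, not from the talk:

    import numpy as np

    def rmse(pred, y):
        return np.sqrt(np.mean((pred - y) ** 2))

    def ensemble_selection(model_preds, y, iterations=250):
        """Greedy forward selection with replacement.
        model_preds: (n_models, n_examples) cached predictions on a hillclimb set
        y:           (n_examples,) true targets
        """
        n_models, n_examples = model_preds.shape
        selected = []
        running_sum = np.zeros(n_examples)   # cached sum of selected models' predictions
        for _ in range(iterations):
            # Unweighted averaging: candidate ensemble = (running_sum + preds) / (k + 1)
            k = len(selected)
            scores = [rmse((running_sum + model_preds[m]) / (k + 1), y)
                      for m in range(n_models)]
            best = int(np.argmin(scores))
            selected.append(best)
            running_sum += model_preds[best]
        return selected

Caching the running sum is what makes each of the 500,000 candidate evaluations a single vector add and a metric call.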

7
Ensemble Selection
  • Good news
  • A carefully selected ensemble that combines many
    models outperforms boosting, bagging, random
    forests, SVMs, and neural nets (because it
    builds on top of them)
  • Bad news
  • The ensembles are too big, too slow, too
    cumbersome to use for most applications

8
Best Ensembles are Big & Ugly!
  • Best ensemble for one problem/metric has 422
    models
  • 72 boosted trees (28,642 individual decision
    trees!)
  • 1 random forest (1024 decision trees)
  • 5 bagged trees (100 decision trees in each model)
  • 44 neural nets (2,200 hidden units total,
    >100,000 weights)
  • 115 kNN models (both large and expensive!)
  • 38 SVMs (100s of support vectors in each model)
  • 26 boosted stump models (36,184 stumps total --
    could compress)
  • 122 individual decision trees

9
Best Ensembles are Big & Slow!
  • Size
  • Best single model: 1.41 MB
  • Ensemble selection: 550.29 MB
  • Speed (to classify 10,000 examples)
  • Best single model: 93.37 secs / 10k
  • Ensemble selection: 5,396.27 secs / 10k

10
  • Can't we make the ensembles smaller, faster, and
    easier to use by eliminating some base-level
    models?

11-12
What Models are Used in Ensembles?
13
Summary of Models Used by ES
  • Most ensembles use 10-100 of the 2000 models
  • Different models are selected for different
    problems
  • Different models are selected for different
    metrics
  • Most ensembles use a diversity of model types
  • Most ensembles use different parameter settings
  • Selected Models often make sense
  • Neural nets for RMS, Cross-Entropy
  • Max-margin methods for Accuracy
  • Large k in knn for AUC

14
Motivation: Model Compression
  • Unfortunately, not suitable for many
    applications
  • PDAs (storage space is important)
  • Cell phones (storage space)
  • Hearing aids (storage space & speed are important
    because of power restrictions)
  • Search engines like Google (speed)
  • Image recognition applications (speed)
  • Our solution: Model Compression
  • Models that perform as well as the best ensembles,
    but are small and fast enough to be used

15
Solution: Model Compression
  • Train a simple model to mimic the complex model
    (a pipeline sketch follows this list)
  • Pass large amounts of unlabeled data (synthetic
    data points or real unlabeled data) through the
    ensemble and collect its predictions
  • 100,000 to 10,000,000 synthetic training points
  • An extensional representation of the ensemble model
  • Train a copycat model on this large synthetic train
    set to mimic the high-performance ensemble
  • Train a neural net to mimic the ensemble
  • Potential to not only perform as well as the target
    ensemble, but possibly outperform it
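
A minimal sketch of this pipeline, assuming a scikit-learn-style target ensemble and a small net regressed onto the ensemble's predicted probabilities (soft targets); the ensemble type, data names, and net size are illustrative assumptions, not the talk's exact setup:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPRegressor

    def compress(ensemble, unlabeled_X, hidden_units=128):
        """Label unlabeled/synthetic points with the ensemble, then train a
        small net to regress onto the ensemble's soft predictions."""
        # Soft targets: predicted probability of class 1, not hard 0/1 labels --
        # this table of (input, prediction) pairs is the extensional representation.
        soft_y = ensemble.predict_proba(unlabeled_X)[:, 1]
        mimic = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=500)
        mimic.fit(unlabeled_X, soft_y)
        return mimic

    # Usage sketch with a stand-in target model:
    # target = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
    # mimic = compress(target, X_synthetic)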

16
Why Mimic with Neural Nets?
  • Decision trees do not work well
  • synthetic data must be very large because of
    recursive partitioning
  • mimic decision trees are enormous (depth > 1000
    and > 10^6 nodes), making them expensive to store
    and compute
  • a single tree does not seem to model the ensemble
    accurately enough
  • SVMs
  • number of support vectors increases quickly with
    complexity
  • Artificial neural nets
  • can model complex functions with a modest number of
    hidden units
  • can compress millions of training cases into
    thousands of weights
  • expensive to train, but execution cost is low (just
    matrix multiplies)
  • models with a few thousand weights have a small
    footprint

17
Unlabeled Data?
  • Assume original labeled training set is small
  • But we need a large train set to train the mimic
    ANN
  • Should come from same distribution as train data
  • Learned model must focus on most important
    regions in space
  • For some domains unlabeled data is available
  • Text, web, images, ...
  • If not available, we need to generate synthetic
    data
  • Random
  • NBE
  • Munge

18
Synthetic Data: True Distribution
19
Synthetic Data: Small Sample
20-22
Synthetic Data: Random
  • Values for attributes are generated randomly from
    their univariate distributions (a sampling sketch
    follows this list)
  • The conditional structure of the data is lost
  • Many generated examples cover uninteresting
    regions of the space
23-24
Synthetic Data: NBE
  • Estimate the joint distribution from the train set
    (a toy sampling sketch follows this list)
  • NBE (Naïve Bayes Estimation) algorithm
    (Lowd and Domingos, 2005)
  • Code for learning and sampling is available
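
For intuition only, a toy sketch of sampling from a naive-Bayes-style joint estimate. The real NBE learns a latent naive-Bayes mixture (Lowd and Domingos, 2005); using the class label as the mixing variable here is an illustrative simplification, not their algorithm:

    import numpy as np

    def naive_bayes_sample(X, y, n_samples, seed=0):
        """Toy sampler: pick a class by its prior, then draw each attribute
        from that class's empirical marginal (independence given the class)."""
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        priors = counts / counts.sum()
        rows = []
        for _ in range(n_samples):
            c = rng.choice(classes, p=priors)
            Xc = X[y == c]                       # training rows of the chosen class
            rows.append([rng.choice(Xc[:, j]) for j in range(X.shape[1])])
        return np.array(rows)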

25-26
  • These don't work well enough.
  • Had to develop a new, better method.
  • Munging: 1. To imperfectly transform information.
    2. To modify data in a way that cannot be described
    succinctly.

27-54
Munging
(step-by-step animation of the MUNGE procedure; no
transcript text -- a code sketch follows)
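
A minimal sketch of MUNGE for continuous attributes, following the procedure described in the accompanying paper (Bucila, Caruana, and Niculescu-Mizil, KDD 2006): repeatedly perturb each train example toward its nearest neighbor. The parameter names (p_swap, local_var) and default values are illustrative assumptions:

    import numpy as np

    def munge(X, n_passes=4, p_swap=0.5, local_var=5.0, seed=0):
        """MUNGE-style generation for continuous attributes.
        Each pass: for every example, find its nearest neighbor; then, with
        probability p_swap per attribute, replace the value with a draw from
        a normal centered on the neighbor's value, sd = |gap| / local_var."""
        rng = np.random.default_rng(seed)
        out = []
        for _ in range(n_passes):
            Xp = X.copy()
            for i in range(len(X)):
                d = np.linalg.norm(X - X[i], axis=1)
                d[i] = np.inf                    # exclude the point itself
                nn = X[int(np.argmin(d))]        # nearest neighbor
                sd = np.abs(Xp[i] - nn) / local_var
                mask = rng.random(X.shape[1]) < p_swap
                Xp[i, mask] = rng.normal(nn[mask], sd[mask])
            out.append(Xp)
        return np.vstack(out)

Because new values are drawn near a real neighbor, the generated points stay in the populated regions of the space, preserving local conditional structure that RANDOM throws away.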
55-58
Synthetic Data: Munge
(No Transcript)
59
Synthetic Data
60
Now That We Have a Method to Generate
Data, Let's Do Some Compression
61
Experimental Setup: Datasets
62
Experimental Setup
  • Target model: Ensemble Selection
  • Mimic model: neural net
  • Up to 256 hidden units
  • Synthetic data
  • Up to 400,000 examples
  • Methods
  • Random
  • NBE
  • Munge
  • Unlabeled vs. Synthetic

63-68
Average Results by Size
(No Transcript)
69
Letter.P1 Results
70
Hs Results
71
Average Results by HU
72
Letter.P1 Results
73
Letter.P2 Results
74
Letter Results
  • Letter.p1: Distinguish the letter O from the rest
  • Letter.p2: Distinguish letters A-M from N-Z

75
It Doesn't Always Work As Well As We'd Like, Yet!
76-79
Covtype Results
  • More hidden units are necessary to get a better
    mimic model
  • More Munge data is also needed
  • Performance on TRUE DIST data is very good, so we
    may get better performance if better synthetic
    data can be generated

80-82
Adult Results
  • More Munge data or more hidden units doesn't
    seem to help much
  • Adult has a few high-arity nominal attributes
    that, when binarized, increase the number of
    attributes from 14 to 104 sparse binary
    attributes
  • Neural nets may not be well suited for this
    problem?
  • Munge may not be effective at generating good
    pseudo data for Adult?

83
RMSE Results: 400K examples, 256 HU
RATIO = (MUNGE ANN) / (ENSEMBLE ANN)
84
We're Retaining 97% of the Accuracy of the Target
Model, but How Are We Doing on Compression?
85
Size of Models (MB)
RATIO = ENSEMBLE / MUNGE
86
Execution Time of Models
Time in seconds to classify 10,000 examples
RATIO = ENSEMBLE / MUNGE
87
Summary of Compression Results
  • Neural nets trained to mimic high-performing
    ensemble selection models
  • on average, capture more than 97% of the
    performance of the target model
  • perform much better than any ANN we could train
    on the original data
  • More than 2000 times smaller than the target ensemble
  • More than 1000 times faster than the target ensemble

88
Related Work
  • Neural Nets Approximator (Zeng and Martinez, 2000)
  • Used the same general approach
  • Only pseudo data used to train the neural net
  • Trained a neural net to model an ensemble of
    neural nets
  • Target model not nearly as complex as ES

89
Related Work
  • CMM (Combine Multiple Models) (Domingos, 1997)
  • Goal: improve accuracy and stability of the base
    classifier (C4.5 rules) without losing
    comprehensibility
  • Create an ensemble of base classifiers
  • Train a base classifier on original data + extra
    data
  • Generate extra data to be labeled by the ensemble
  • Method for generating extra data is specific to
    C4.5 rules

90
Related Work
  • TREPAN (Craven and Shavlik, 1996)
  • Extract tree-structured representations of
    trained neural nets
  • Used the original train set the nets were trained
    on
  • Generated synthetic data at every node in the
    tree
  • Learning rules from neural nets (Towell and
    Shavlik, 1992; Craven and Shavlik, 1993, 1994)

91
Related Work
  • Pruning adaptive boosting (Margineantu and
    Dietterich, 2000)
  • To compress the ensemble, retain only some of the
    models it contains
  • DECORATE (Melville and Mooney, 2003)
  • Use extra data to increase the diversity of base
    classifiers in order to build a better ensemble
  • Data generated randomly from each attribute's
    marginal distribution (similar to our Random
    algorithm)

92
What Still Needs to Be Done?
93
Future Work: Other Mimic Models
  • Neural nets are not the only possible mimic models
  • Other learning methods may provide insight into
    the effectiveness of model compression
  • Things to do
  • Use decision trees, SVMs, and k-nearest neighbor
    models to mimic Ensemble Selection
  • Expect to see
  • Decision trees grow too large, need too much data
  • kNN too slow
  • SVMs need too many support vectors

94
Future Work: Other Target Metrics
  • Key feature of Ensemble Selection: it can be
    optimized for different metrics (RMSE, ROC, ACC,
    Precision, ...)
  • Important that compressed models are good on the
    target metric
  • If the squared error between the target model and
    the mimic neural net is small enough, performance
    on the target metric should be similar
  • Things to do
  • Use neural nets to mimic ES optimized for
    accuracy, area under the ROC curve
  • May need to adapt the model compression approach
    for metrics other than RMSE
  • Expect to see
  • good performance for other metrics as well

95
Future Work: Model Complexity
  • Complexity of model varies from problem to
    problem
  • To accurately approximate a model, the mimic
    model needs to have similar complexity
  • For neural nets, number of hidden units is a
    measure of complexity
  • Things to do
  • For some problems, experiments with more hidden
    units
  • Experiments with more than one hidden layer
    (ADULT)
  • Expect to see
  • For some problems, more hidden units will help
  • For ADULT ???

96
Future Work: Munge
  • Two free parameters that must be set
  • We might not have picked optimal values
  • Different problems may have different optimal
    values
  • Compression experiments are very expensive
  • Things to do
  • Experiment with different parameter values
  • Try to find distance metric between datasets that
    expresses quality of data generated
  • Expect to see
  • Better synthetic data yields better compression
    with less data

97
Future Work: Active Learning
  • Too many examples → labeling is expensive
  • Too many examples → training is expensive
  • Things to do
  • Choose the most important synthetic examples (a
    redundancy-filtering sketch follows this list)
  • Retain only non-redundant examples generated by
    Munge
  • Modify Munge so that it generates fewer redundant
    examples
  • Expect to see
  • Active learning reduces the amount of train data
    needed
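
One plausible redundancy filter, sketched for illustration; greedy farthest-point selection is an assumption here, not the talk's method:

    import numpy as np

    def select_diverse(X_synth, k, seed=0):
        """Keep k mutually far-apart synthetic points; drop near-duplicates."""
        rng = np.random.default_rng(seed)
        chosen = [int(rng.integers(len(X_synth)))]          # random first point
        # distance from every point to its nearest already-chosen point
        dist = np.linalg.norm(X_synth - X_synth[chosen[0]], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(dist))                      # most non-redundant point
            chosen.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(X_synth - X_synth[nxt], axis=1))
        return X_synth[chosen]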

98
Summary
  • Ensemble learning yields the most accurate models
  • Ensemble selection is the best ensemble method
  • Ensembles sometimes are too big and too slow
  • Compress the complex ensemble into a simpler ANN
  • 97% of accuracy retained
  • 2000 times smaller
  • 1000 times faster
  • Potentially a useful measure of model complexity?
  • Compression separates how the function is learned
    from data from the model used at runtime to make
    predictions

99
Thank You. Questions?
100
(No Transcript)
101
Hs Results
102
Letter.P2 Results
103-104
Medis Results
105-106
Mg Results
107-108
Slac Results