1
Spooky Stuff in Metric Space
2
Spooky Stuff: Data Mining in Metric Space
  • Rich Caruana
  • Alex Niculescu
  • Cornell University

3
Motivation #1
4
Motivation #1: Pneumonia Risk Prediction
5
Motivation #1: Many Learning Algorithms
  • Neural nets
  • Logistic regression
  • Linear perceptron
  • K-nearest neighbor
  • Decision trees
  • ILP (Inductive Logic Programming)
  • SVMs (Support Vector Machines)
  • Bagging X
  • Boosting X
  • Rule learners (C2, ...)
  • Ripper
  • Random Forests (forests of decision trees)
  • Gaussian Processes
  • Bayes Nets
  • No one/few learning methods dominate the others

6
Motivation #2
7
Motivation #2: SLAC B/Bbar
  • Particle accelerator generates B/Bbar particles
  • Use machine learning to classify tracks as B or
    Bbar
  • Domain-specific performance measure: SLQ score
  • A 5% increase in SLQ can save $1M in accelerator time
  • SLAC researchers tried various DM/ML methods
  • Good, but not great, SLQ performance
  • We tried standard methods, got similar results
  • We studied SLQ metric
  • similar to probability calibration
  • tried bagged probabilistic decision trees (good
    on C-Section)

8
Motivation #2: Bagged Probabilistic Trees
  • Draw N bootstrap samples of data
  • Train a tree on each sample => N trees
  • Final prediction = average prediction of the N trees


Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + ... + 0.26 + 0.31) / #Trees = 0.24
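A minimal sketch of this procedure in Python, assuming scikit-learn, binary 0/1 labels, and NumPy arrays (the helper name and arrays are hypothetical, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_tree_probs(X_train, y_train, X_test, n_trees=100, seed=0):
    """Average the predicted probabilities of n_trees trees,
    each trained on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    avg = np.zeros(len(X_test))
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        avg += tree.predict_proba(X_test)[:, 1]     # leaf class frequency as the tree's probability
    return avg / n_trees                            # final prediction = average over the N trees
```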
9
Motivation #2: Improves Calibration by an Order of Magnitude
[Calibration plots: single tree = poor calibration; 100 bagged trees = excellent calibration]
10
Motivation #2: Significantly Improves SLQ
[SLQ scores: 100 bagged trees vs. single tree]
11
Motivation #2
  • Can we automate this analysis of performance metrics so that it's easier to recognize which metrics are similar to each other?

12
Motivation #3
13
Motivation #3
14
Scary Stuff
  • In an ideal world:
  • Learn a model that predicts the correct conditional probabilities (Bayes optimal)
  • Yields optimal performance on any reasonable metric
  • In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • Don't have good metrics for recognizing the ideal model
  • Ideal model isn't always needed
  • In practice:
  • Do learning using many different metrics: ACC, AUC, CXE, RMS, ...
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric

15
Scary Stuff
16
Scary Stuff
17
In this work we compare nine commonly used
performance metrics by applying data mining to
the results of a massive empirical study
  • Goals
  • Discover relationships between performance
    metrics
  • Are the metrics really that different?
  • If you optimize to metric X, do you also get good performance on metric Y?
  • If you need good performance on metric Y, which metric X should you optimize to?
  • Which metrics are more/less robust?
  • Design new, better metrics?

18
10 Binary Classification Performance Metrics
  • Threshold Metrics
  • Accuracy
  • F-Score
  • Lift
  • Ordering/Ranking Metrics
  • ROC Area
  • Average Precision
  • Precision/Recall Break-Even Point
  • Probability Metrics
  • Root-Mean-Squared-Error
  • Cross-Entropy
  • Probability Calibration
  • SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
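A sketch of how SAR might be computed from a model's predictions with scikit-learn, interpreting the squared-error term as RMS on probabilistic predictions (the function name, threshold, and arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

def sar(y_true, y_prob, threshold=0.5):
    """SAR combines one metric from each family:
    threshold (ACC), ordering (AUC), and probability (RMS)."""
    acc = accuracy_score(y_true, y_prob >= threshold)   # threshold metric
    auc = roc_auc_score(y_true, y_prob)                 # ordering metric
    rms = np.sqrt(mean_squared_error(y_true, y_prob))   # probability metric
    return ((1.0 - rms) + acc + auc) / 3.0
```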

19
Accuracy
                 Predicted 1    Predicted 0
    True 1            a              b
    True 0            c              d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split;
 a and d are correct, b and c are incorrect)

accuracy = (a + d) / (a + b + c + d)
20
Lift
  • not interested in accuracy on the entire dataset
  • want accurate predictions for 5%, 10%, or 20% of the dataset
  • don't care about the remaining 95%, 90%, or 80%, respectively
  • typical application: marketing
  • how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)

21
Lift
                 Predicted 1    Predicted 0
    True 1            a              b
    True 0            c              d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split)
22
lift = 3.5 if mailings are sent to 20% of the customers
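A sketch of lift at a given fraction of the dataset, using the usual definition (positive rate among the top-scoring fraction divided by the overall positive rate); the function name and arrays are hypothetical:

```python
import numpy as np

def lift_at_fraction(y_true, y_score, fraction=0.20):
    """Lift when the top `fraction` of cases (by predicted score)
    are predicted positive, e.g. the 20% of customers mailed."""
    y_true = np.asarray(y_true)
    n_top = max(1, int(round(fraction * len(y_true))))
    top = np.argsort(y_score)[::-1][:n_top]    # highest-scoring cases
    precision_top = y_true[top].mean()         # positive rate among those mailed
    base_rate = y_true.mean()                  # positive rate in the whole dataset
    return precision_top / base_rate
```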
23
Precision/Recall, F, Break-Even Pt
F-score = harmonic average of precision and recall = 2 * precision * recall / (precision + recall)
24
[Precision/Recall curves: higher curve = better performance, lower curve = worse performance]
25
                 Predicted 1             Predicted 0
    True 1   true positive (TP)      false negative (FN)
    True 0   false positive (FP)     true negative (TN)

                 Predicted 1                    Predicted 0
    True 1   hits          P(pr=1 | tr=1)    misses               P(pr=0 | tr=1)
    True 0   false alarms  P(pr=1 | tr=0)    correct rejections   P(pr=0 | tr=0)
26
ROC Plot and ROC Area
  • Receiver Operating Characteristic
  • Developed in WWII to statistically model false
    positive and false negative detections of radar
    operators
  • Better statistical foundations than most other
    measures
  • Standard measure in medicine and biology
  • Becoming more popular in ML
  • Sweep the threshold and plot:
  • TPR vs. FPR
  • Sensitivity vs. 1 - Specificity
  • P(true | true) vs. P(true | false)
  • Sensitivity = a/(a+b) = Recall = LIFT numerator
  • 1 - Specificity = 1 - d/(c+d)
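A minimal sketch of the threshold sweep (a hypothetical helper; in practice sklearn.metrics.roc_curve does this directly):

```python
import numpy as np

def roc_points(y_true, y_score, n_thresholds=100):
    """Sweep a threshold over the scores and collect (FPR, TPR) pairs."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pts = []
    for t in np.linspace(y_score.min(), y_score.max(), n_thresholds):
        pred = y_score > t
        tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)   # sensitivity = a/(a+b)
        fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)   # 1 - specificity
        pts.append((fpr, tpr))
    return pts
```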

27
diagonal line is random prediction
28
Calibration
  • Good calibration
  • If 1000 x's have pred(x) = 0.2, 200 should be positive

29
Calibration
  • Model can be accurate but poorly calibrated
  • a good threshold, even with uncalibrated probabilities
  • Model can have good ROC but be poorly calibrated
  • ROC is insensitive to scaling/stretching
  • only the ordering has to be correct, not the probabilities themselves
  • Model can have very high variance, but be well
    calibrated
  • Model can be stupid, but be well calibrated
  • Calibration is a real oddball

30
Measuring Calibration
  • Bucket method
  • In each bucket:
  • measure the observed c-sec rate
  • and the predicted c-sec rate (average of the predicted probabilities)
  • if the observed c-sec rate is similar to the predicted c-sec rate => good calibration in that bucket (see the sketch after the bucket diagram)

[Bucket diagram: predictions on [0.0, 1.0] are divided into ten equal-width buckets with midpoints 0.05, 0.15, ..., 0.95]
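A sketch of the bucket method under these assumptions (ten equal-width buckets; the function name and arrays are hypothetical):

```python
import numpy as np

def calibration_table(y_true, y_prob, n_buckets=10):
    """For each probability bucket, compare the mean predicted
    probability with the observed positive rate."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bucket = (y_prob >= lo) & (y_prob < hi)
        if in_bucket.any():
            rows.append((lo, hi,
                         y_prob[in_bucket].mean(),    # predicted rate in bucket
                         y_true[in_bucket].mean()))   # observed rate in bucket
    return rows  # predicted close to observed => good calibration in that bucket
```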
31
Calibration Plot
32
Experiments
33
Base-Level Learning Methods
  • Decision trees
  • K-nearest neighbor
  • Neural nets
  • SVMs
  • Bagged Decision Trees
  • Boosted Decision Trees
  • Boosted Stumps
  • Each optimizes different things
  • Each best in different regimes
  • Each algorithm has many variations and free
    parameters
  • Generate about 2000 models on each test problem

34
Data Sets
  • 7 binary classification data sets
  • Adult
  • Cover Type
  • Letter.p1 (balanced)
  • Letter.p2 (unbalanced)
  • Pneumonia (University of Pittsburgh)
  • Hyper Spectral (NASA Goddard Space Center)
  • Particle Physics (Stanford Linear Accelerator)
  • 4k train sets
  • Large final test sets (usually 20k)

35
Massive Empirical Comparison
  • 7 base-level learning methods
  • ×
  • 100s of parameter settings per method
  • = 2,000 models per problem
  • ×
  • 7 test problems
  • = 14,000 models
  • ×
  • 10 performance metrics
  • = 140,000 model performance evaluations

36
COVTYPE Calibration vs. Accuracy
37
Multi Dimensional Scaling
38
Scaling, Ranking, and Normalizing
  • Problem
  • for some metrics, 1.00 is best (e.g., ACC)
  • for some metrics, 0.00 is best (e.g., RMS)
  • for some metrics, baseline is 0.50 (e.g., AUC)
  • for some problems/metrics, 0.60 is excellent performance
  • for some problems/metrics, 0.99 is poor performance
  • Solution 1: Normalized Scores (sketched after this list)
  • baseline performance => 0.00
  • best observed performance => 1.00 (proxy for Bayes optimal)
  • puts all metrics on an equal footing
  • Solution 2: Scale by Standard Deviation
  • Solution 3: Rank Correlation
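A sketch of the normalized-score idea described under Solution 1 (the baseline and best-observed values are inputs here; the function name is hypothetical):

```python
def normalized_score(raw, baseline, best_observed, lower_is_better=False):
    """Map raw performance so that baseline -> 0.0 and best observed -> 1.0.
    For metrics where smaller is better (e.g. RMS), flip the direction first."""
    if lower_is_better:
        raw, baseline, best_observed = -raw, -baseline, -best_observed
    return (raw - baseline) / (best_observed - baseline)
```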

39
Multi Dimensional Scaling
  • Find a low-dimensional embedding of the 10 x 14,000 data
  • The 10 metrics span a 2-5 dimensional subspace
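A sketch of one way to produce such an embedding, treating each metric as a point and using 1 - rank correlation as the dissimilarity (the scores array here is random placeholder data, not the study's results):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# scores: (n_models, n_metrics) performance table, e.g. 14,000 x 10 in the study.
rng = np.random.default_rng(0)
scores = rng.random((14000, 10))          # placeholder data

# Distance between two metrics = 1 - rank correlation of their
# scores across all models (one of the scalings used in the talk).
rho, _ = spearmanr(scores)                # 10 x 10 rank-correlation matrix
dist = 1.0 - rho

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(dist)       # one 2-D point per metric
```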

40
Multi Dimensional Scaling
  • Look at 2-D MDS plots
  • Scaled by standard deviation
  • Normalized scores
  • MDS of rank correlations
  • MDS on each problem individually
  • MDS averaged across all problems

41
2-D Multi-Dimensional Scaling
42
2-D Multi-Dimensional Scaling
[2-D MDS plots: one using normalized-scores scaling, one using rank-correlation distance]
43
[Per-problem 2-D MDS plots: Adult, Covertype, Hyper-Spectral, Letter, Medis, SLAC]
44
Correlation Analysis
  • 2000 performances for each metric on each problem
  • Correlation between all pairs of metrics
  • 10 metrics => 45 pairwise correlations
  • Average of correlations over 7 test problems
  • Standard correlation
  • Rank correlation
  • Rank correlations presented here
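A sketch of the averaging step, assuming a hypothetical list of per-problem score tables (one array of model-by-metric performances per test problem):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlations(score_tables):
    """Average the metric-by-metric rank-correlation matrix over a list of
    per-problem score tables, each of shape (n_models, n_metrics)."""
    mats = []
    for scores in score_tables:
        rho, _ = spearmanr(scores)   # rank correlation between metric columns
        mats.append(rho)
    return np.mean(mats, axis=0)     # element-wise average over the problems
```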

45
Rank Correlations
  • Correlation analysis consistent with MDS analysis
  • Ordering metrics have high correlations to each
    other
  • ACC, AUC, RMS have best correlations of metrics
    in each metric class
  • RMS has good correlation to other metrics
  • SAR has best correlation to other metrics

46
Summary
  • The 10 metrics span a 2-5 dimensional subspace
  • Consistent results across problems and scalings
  • Ordering metrics cluster: AUC, APR, BEP
  • CAL is far from the ordering metrics
  • CAL is nearest to RMS/MXE
  • RMS ~ MXE, but RMS is much more centrally located
  • Threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
  • Lift behaves more like ordering than threshold metrics
  • Old friends ACC, AUC, and RMS are the most representative
  • New SAR metric is good, but not much better than
    RMS

47
New Resources
  • Want to borrow 14,000 models?
  • margin analysis
  • comparison to new algorithm X
  • PERF code: software that calculates 2 dozen performance metrics
  • Accuracy (at different thresholds)
  • ROC Area and ROC plots
  • Precision and Recall plots
  • Break-even-point, F-score, Average Precision
  • Squared Error
  • Cross-Entropy
  • Lift
  • Currently, most metrics are for boolean
    classification problems
  • We are willing to add new metrics and new
    capabilities
  • Available at http://www.cs.cornell.edu/~caruana

48
Future Work
49
Future/Related Work
  • Ensemble method optimizes any metric (ICML'04)
  • Get good probs from Boosted Trees (AISTATS'05)
  • Comparison of learning algs on metrics (ICML'06)
  • First step in analyzing different performance
    metrics
  • Develop new metrics with better properties
  • SAR is a good general purpose metric
  • Does optimizing to SAR yield better models?
  • but RMS nearly as good
  • attempts to make SAR better did not help much
  • Extend to multi-class or hierarchical problems
    where evaluating performance is more difficult

50
Thank You.
51
Spooky Stuff in Metric Space
52
Which learning methods perform best on each
metric?
53
Normalized Scores: Best Single Models
  • SVM predictions transformed to posterior
    probabilities via Platt Scaling
  • SVM and ANN tied for first place; Bagged Trees nearly as good
  • Boosted Trees win 5 of 6 Threshold + Rank metrics, but yield lousy probs!
  • Boosting weaker stumps does not compare to
    boosting full trees
  • KNN and Plain Decision Trees usually not
    competitive (with 4k train sets)
  • Other interesting things. See papers.

54
Platt Scaling
  • SVM predictions lie in (-inf, +inf)
  • Probability metrics require [0, 1]
  • Platt scaling transforms SVM predictions by fitting a sigmoid
  • This gives the SVM good probability performance
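A sketch of the idea, approximating Platt's sigmoid fit with a one-feature logistic regression on held-out data (scikit-learn's CalibratedClassifierCV(method="sigmoid") implements the full version; the data and split here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=4000, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = LinearSVC(max_iter=5000).fit(X_tr, y_tr)

# Fit a sigmoid p = 1 / (1 + exp(-(A*f(x) + B))) that maps raw SVM
# margins in (-inf, +inf) to probabilities in [0, 1].
margins = svm.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression().fit(margins, y_cal)

probs = platt.predict_proba(margins)[:, 1]   # calibrated probabilities
```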

55
Outline
  • Motivation: The One True Model
  • Ten Performance Metrics
  • Experiments
  • Multidimensional Scaling (MDS) Analysis
  • Correlation Analysis
  • Learning Algorithm vs. Metric
  • Summary

56
Base-Level Learners
  • Each optimizes different things
  • ANN: minimize squared error or cross-entropy (good for probs)
  • SVM, Boosting: optimize margin (good for accuracy, poor for probs)
  • DT: optimize info gain
  • KNN: ?
  • Each best in different regimes
  • SVM: high-dimensional data
  • DT, KNN: large data sets
  • ANN: non-linear prediction from many correlated features
  • Each algorithm has many variations and free parameters
  • SVM: margin parameter, kernel, kernel parameters (gamma, ...)
  • ANN: hidden units, hidden layers, learning rate, early stopping point
  • DT: splitting criterion, pruning options, smoothing options, ...
  • KNN: K, distance metric, distance-weighted averaging, ...
  • Generate about 2000 models on each test problem

57
Motivation
  • Holy Grail of Supervised Learning
  • One True Model (a.k.a. Bayes Optimal Model)
  • Predicts correct conditional probability for each
    case
  • Yields optimal performance on all reasonable
    metrics
  • Hard to learn given finite data
  • train sets rarely have conditional probs, usually
    just 0/1 targets
  • Isn't always necessary
  • Many Different Performance Metrics
  • ACC, AUC, CXE, RMS, PRE/REC
  • Each represents different tradeoffs
  • Usually important to optimize to appropriate
    metric
  • Not all metrics are created equal

58
Motivation
  • In an ideal world:
  • Learn a model that predicts the correct conditional probabilities
  • Yields optimal performance on any reasonable metric
  • In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • Don't have good metrics for recognizing the ideal model
  • Ideal model isn't always necessary
  • In practice:
  • Do learning using many different metrics: ACC, AUC, CXE, RMS, ...
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric

59
Accuracy
  • Target: 0/1, -1/+1, True/False, ...
  • Prediction = f(inputs) = f(x): 0/1 or real-valued
  • Threshold: f(x) > thresh => 1, else => 0
  • threshold(f(x)) = 0/1
  • accuracy = # right / total
  •          = p(correct) = p(threshold(f(x)) = target)

60
Precision and Recall
  • Typically used in document retrieval
  • Precision
  • how many of the returned documents are correct
  • precision(threshold)
  • Recall
  • how many of the positives does the model return
  • recall(threshold)
  • Precision/Recall curve: sweep thresholds
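A sketch of precision and recall at a single threshold, using the confusion-matrix cells from the earlier slides (the helper name and arrays are hypothetical):

```python
import numpy as np

def precision_recall_at_threshold(y_true, y_score, threshold):
    """Precision and recall when cases with f(x) > threshold are predicted 1.
    With the slide's confusion matrix: precision = a/(a+c), recall = a/(a+b)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # a
    fp = np.sum((y_pred == 1) & (y_true == 0))   # c
    fn = np.sum((y_pred == 0) & (y_true == 1))   # b
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```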

61
Precision/Recall
                 Predicted 1    Predicted 0
    True 1            a              b
    True 0            c              d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split)
62