
Spooky Stuff: Data Mining in Metric Space

- Rich Caruana
- Alex Niculescu
- Cornell University

Motivation 1

Motivation 1: Pneumonia Risk Prediction

Motivation 1: Many Learning Algorithms

- Neural nets
- Logistic regression
- Linear perceptron
- K-nearest neighbor
- Decision trees
- ILP (Inductive Logic Programming)
- SVMs (Support Vector Machines)
- Bagging X
- Boosting X
- Rule learners (CN2, …)
- Ripper
- Random Forests (forests of decision trees)
- Gaussian Processes
- Bayes Nets
- No single learning method (or small set of methods) dominates the others

Motivation 2

Motivation 2: SLAC B/Bbar

- Particle accelerator generates B/Bbar particles
- Use machine learning to classify tracks as B or Bbar
- Domain-specific performance measure: the SLQ score
- A 5% increase in SLQ can save $1M in accelerator time
- SLAC researchers tried various DM/ML methods
  - Good, but not great, SLQ performance
- We tried standard methods, got similar results
- We studied the SLQ metric
  - similar to probability calibration
  - tried bagged probabilistic decision trees (good on C-Section)

Motivation 2: Bagged Probabilistic Trees

- Draw N bootstrap samples of the data
- Train a tree on each sample → N trees
- Final prediction = average prediction of the N trees

Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + 0.31 + …) / #Trees = 0.24
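A minimal sketch of the bagging recipe above, assuming scikit-learn; the dataset and names are illustrative stand-ins, not the SLAC data:

```python
# Sketch: bagged probabilistic decision trees (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, random_state=0)  # synthetic stand-in train set

rng = np.random.default_rng(0)
n_trees = 100
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample: rows drawn with replacement
    tree = DecisionTreeClassifier(random_state=0)   # unpruned tree -> probabilistic leaf estimates
    trees.append(tree.fit(X[idx], y[idx]))

def bagged_predict_proba(X_new):
    """Final prediction = average of the N trees' predicted probabilities."""
    return np.mean([t.predict_proba(X_new)[:, 1] for t in trees], axis=0)

print(bagged_predict_proba(X[:3]))
```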

Motivation 2: Improves Calibration by an Order of Magnitude

[Calibration plots: single tree (poor calibration) vs. 100 bagged trees (excellent calibration)]

Motivation 2: Significantly Improves SLQ

[SLQ plots: 100 bagged trees vs. single tree]

Motivation 2

- Can we automate this analysis of performance metrics so that it's easier to recognize which metrics are similar to each other?

Motivation 3

Motivation 3

Scary Stuff

- In an ideal world
  - Learn a model that predicts correct conditional probabilities (Bayes optimal)
  - Yields optimal performance on any reasonable metric
- In the real world
  - Finite data
  - 0/1 targets instead of conditional probabilities
  - Hard to learn this ideal model
  - Don't have good metrics for recognizing the ideal model
  - Ideal model isn't always needed
- In practice
  - Do learning using many different metrics: ACC, AUC, CXE, RMS, …
  - Each metric represents different tradeoffs
  - Because of this, it is usually important to optimize to the appropriate metric

Scary Stuff

Scary Stuff

In this work we compare nine commonly used performance metrics by applying data mining to the results of a massive empirical study.

- Goals
  - Discover relationships between the performance metrics
  - Are the metrics really that different?
  - If you optimize to metric X, do you also get good performance on metric Y?
  - If you need to optimize to metric Y, which metric X should you optimize to?
  - Which metrics are more/less robust?
  - Can we design new, better metrics?

10 Binary Classification Performance Metrics

- Threshold Metrics
  - Accuracy
  - F-Score
  - Lift
- Ordering/Ranking Metrics
  - ROC Area
  - Average Precision
  - Precision/Recall Break-Even Point
- Probability Metrics
  - Root-Mean-Squared-Error
  - Cross-Entropy
  - Probability Calibration
- SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
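As a concrete reading of the SAR definition above, a small sketch assuming numpy and scikit-learn, 0/1 labels y_true, and predicted probabilities p (the values below are illustrative):

```python
# Sketch: SAR = ((1 - RMS) + Accuracy + ROC Area) / 3
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, p, threshold=0.5):
    acc = accuracy_score(y_true, (p >= threshold).astype(int))  # accuracy at the threshold
    auc = roc_auc_score(y_true, p)                              # ROC area
    rms = np.sqrt(np.mean((p - y_true) ** 2))                   # RMS error vs. 0/1 targets
    return (acc + auc + (1.0 - rms)) / 3.0

y_true = np.array([0, 1, 1, 0, 1, 0])
p = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.3])
print(sar(y_true, p))
```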

Accuracy

Confusion matrix at a chosen threshold (predict 1 if f(x) > threshold, else 0):

           Predicted 1   Predicted 0
True 1         a             b
True 0         c             d

a and d are correct; b and c are incorrect.

accuracy = (a + d) / (a + b + c + d)

Lift

- not interested in accuracy on entire dataset
- want accurate predictions for 5, 10, or 20 of

dataset - dont care about remaining 95, 90, 80, resp.
- typical application marketing
- how much better than random prediction on the

fraction of the dataset predicted true (f(x) gt

threshold)

Lift

           Predicted 1   Predicted 0
True 1         a             b
True 0         c             d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split)

lift = 3.5 if mailings are sent to 20% of the customers
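A hedged sketch of computing lift at a chosen targeting fraction; the data and names are illustrative (the slides do not give code):

```python
# Sketch: lift = positive rate in the targeted top fraction / overall positive rate
import numpy as np

def lift_at_fraction(y, scores, fraction=0.2):
    n = len(y)
    k = max(1, int(round(fraction * n)))
    top = np.argsort(scores)[::-1][:k]     # the fraction of cases predicted most likely positive
    return y[top].mean() / y.mean()

y = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.4, 0.6, 0.2, 0.05])
print(lift_at_fraction(y, scores, fraction=0.2))   # 2.5: top 20% is all positive vs. a 0.4 base rate
```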

Precision/Recall, F-Score, Break-Even Point

F-score = harmonic average of precision and recall

           Predicted 1            Predicted 0
True 1     true positive (TP)     false negative (FN)
True 0     false positive (FP)    true negative (TN)

           Predicted 1                        Predicted 0
True 1     hits: P(pred=1 | true=1)           misses: P(pred=0 | true=1)
True 0     false alarms: P(pred=1 | true=0)   correct rejections: P(pred=0 | true=0)

[Precision/recall curves: better vs. worse performance]
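For concreteness, a small sketch (numpy assumed, illustrative data) of precision, recall, and F-score at a threshold, plus a precision/recall break-even point found by sweeping the ranking:

```python
# Sketch: precision/recall/F at a threshold, and the break-even point from a ranking.
import numpy as np

def precision_recall_f(y, s, threshold=0.5):
    pred = s >= threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def break_even_point(y, s):
    order = np.argsort(s)[::-1]            # sweep thresholds by walking down the ranking
    tp = np.cumsum(y[order])
    k = np.arange(1, len(y) + 1)
    precision, recall = tp / k, tp / y.sum()
    return precision[np.argmin(np.abs(precision - recall))]   # point where precision ~= recall

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
s = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.1])
print(precision_recall_f(y, s), break_even_point(y, s))
```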

ROC Plot and ROC Area

- Receiver Operating Characteristic
- Developed in WWII to statistically model false positive and false negative detections of radar operators
- Better statistical foundations than most other measures
- Standard measure in medicine and biology
- Becoming more popular in ML
- Sweep the threshold and plot:
  - TPR vs. FPR
  - Sensitivity vs. 1 - Specificity
  - P(pred=1 | true=1) vs. P(pred=1 | true=0)
- Sensitivity = a / (a + b) = Recall = LIFT numerator
- 1 - Specificity = 1 - d / (c + d)

The diagonal line is random prediction.
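A minimal sketch of that threshold sweep, computing (FPR, TPR) points and the trapezoidal area under them (scikit-learn's roc_curve and roc_auc_score give the same quantities; the data below is illustrative):

```python
# Sketch: ROC curve by sweeping thresholds, plus area under the curve.
import numpy as np

def roc_curve_and_auc(y, s):
    thresholds = np.r_[np.inf, np.sort(np.unique(s))[::-1]]
    pos, neg = (y == 1).sum(), (y == 0).sum()
    tpr = np.array([((s >= t) & (y == 1)).sum() / pos for t in thresholds])
    fpr = np.array([((s >= t) & (y == 0)).sum() / neg for t in thresholds])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)   # trapezoid rule
    return fpr, tpr, auc

y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])
fpr, tpr, auc = roc_curve_and_auc(y, s)
print(auc)   # 8/9 here: 8 of the 9 positive/negative pairs are ranked correctly
```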

Calibration

- Good calibration:
  - if 1000 x's have pred(x) = 0.2, then 200 should be positive

Calibration

- A model can be accurate but poorly calibrated
  - a good threshold with uncalibrated probabilities
- A model can have good ROC but be poorly calibrated
  - ROC is insensitive to scaling/stretching
  - only the ordering has to be correct, not the probabilities themselves
- A model can have very high variance, but be well calibrated
- A model can be stupid, but be well calibrated
- Calibration is a real oddball

Measuring Calibration

- Bucket method: group cases by predicted probability
- In each bucket:
  - measure the observed c-section rate
  - measure the predicted c-section rate (the average of the predicted probabilities)
  - if the observed c-section rate is similar to the predicted c-section rate → good calibration in that bucket

[Ten equal-width buckets with boundaries 0.0, 0.1, …, 1.0 and midpoints 0.05, 0.15, …, 0.95]
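A sketch of that bucket method, assuming predicted probabilities p and 0/1 outcomes y; the synthetic data below is illustrative (drawn to be well calibrated), not the C-section data:

```python
# Sketch: compare observed vs. predicted positive rates in ten equal-width buckets.
import numpy as np

def calibration_table(y, p, n_buckets=10):
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bucket = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if in_bucket.any():
            rows.append((lo, hi, p[in_bucket].mean(), y[in_bucket].mean()))
    return rows   # (bucket low edge, high edge, mean predicted rate, observed rate)

rng = np.random.default_rng(0)
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p).astype(int)         # synthetic, well calibrated by construction
for lo, hi, pred_rate, obs_rate in calibration_table(y, p):
    print(f"[{lo:.1f}, {hi:.1f}): predicted {pred_rate:.2f}  observed {obs_rate:.2f}")
```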

Calibration Plot

Experiments

Base-Level Learning Methods

- Decision trees
- K-nearest neighbor
- Neural nets
- SVMs
- Bagged Decision Trees
- Boosted Decision Trees
- Boosted Stumps
- Each optimizes different things
- Each best in different regimes
- Each algorithm has many variations and free parameters
- Generate about 2000 models on each test problem

Data Sets

- 7 binary classification data sets
- Adult
- Cover Type
- Letter.p1 (balanced)
- Letter.p2 (unbalanced)
- Pneumonia (University of Pittsburgh)
- Hyper Spectral (NASA Goddard Space Center)
- Particle Physics (Stanford Linear Accelerator)
- 4k train sets
- Large final test sets (usually 20k)

Massive Empirical Comparison

- 7 base-level learning methods
- × 100s of parameter settings per method
- ≈ 2000 models per problem
- × 7 test problems
- = 14,000 models
- × 10 performance metrics
- = 140,000 model performance evaluations

COVTYPE Calibration vs. Accuracy

Multi Dimensional Scaling

Scaling, Ranking, and Normalizing

- Problem
  - for some metrics, 1.00 is best (e.g. ACC)
  - for some metrics, 0.00 is best (e.g. RMS)
  - for some metrics, the baseline is 0.50 (e.g. AUC)
  - for some problems/metrics, 0.60 is excellent performance
  - for some problems/metrics, 0.99 is poor performance
- Solution 1: Normalized Scores (see the sketch after this list)
  - baseline performance → 0.00
  - best observed performance → 1.00 (proxy for Bayes optimal)
  - puts all metrics on an equal footing
- Solution 2: Scale by Standard Deviation
- Solution 3: Rank Correlation
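A sketch of the normalized-scores idea (Solution 1), assuming the baseline and best observed performance for a metric on a problem are known; the numbers here are illustrative:

```python
# Sketch: normalized score = (raw - baseline) / (best - baseline),
# so baseline -> 0.00 and best observed -> 1.00 for every metric.
import numpy as np

def normalized_scores(raw, baseline, best):
    return (np.asarray(raw, dtype=float) - baseline) / (best - baseline)

# Works in both orientations: for RMS (lower is better) baseline > best,
# so the sign flips and 1.00 still means "best observed performance".
acc = normalized_scores([0.75, 0.80, 0.92], baseline=0.70, best=0.92)
rms = normalized_scores([0.45, 0.40, 0.31], baseline=0.50, best=0.31)
print(acc, rms)
```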

Multi Dimensional Scaling

- Find a low-dimensional embedding of the 10 × 14,000 performance data
- The 10 metrics span a 2-5 dimensional subspace

Multi Dimensional Scaling

- Look at 2-D MDS plots
- Scaled by standard deviation
- Normalized scores
- MDS of rank correlations
- MDS on each problem individually
- MDS averaged across all problems
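A hedged sketch of one such analysis: embed the ten metrics in 2-D with MDS on rank-correlation distances. The scores matrix below is random stand-in data, not the study's 14,000 models, and the names are only illustrative:

```python
# Sketch: 2-D MDS of the metrics using 1 - Spearman rank correlation as distance.
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

metrics = ["ACC", "FSC", "LFT", "AUC", "APR", "BEP", "RMS", "MXE", "CAL", "SAR"]
rng = np.random.default_rng(0)
scores = rng.uniform(size=(2000, len(metrics)))    # stand-in for per-model metric scores

rho, _ = spearmanr(scores)                         # 10 x 10 rank-correlation matrix
dist = 1.0 - rho                                   # rank-correlation distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
for name, (x1, x2) in zip(metrics, coords):
    print(f"{name}: ({x1:.2f}, {x2:.2f})")
```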

2-D Multi-Dimensional Scaling

[2-D MDS plots of the 10 metrics: normalized-scores scaling and rank-correlation distance, shown per problem (Adult, Covertype, Hyper-Spectral, Letter, Medis, SLAC)]

Correlation Analysis

- 2000 performances for each metric on each problem
- Correlation between all pairs of metrics
- 10 metrics → 45 pairwise correlations
- Average of correlations over 7 test problems
- Standard correlation
- Rank correlation
- Present rank correlation here

Rank Correlations

- The correlation analysis is consistent with the MDS analysis
- Ordering metrics have high correlations to each other
- ACC, AUC, and RMS have the best correlations of the metrics in each metric class
- RMS has good correlation to the other metrics
- SAR has the best correlation to the other metrics

Summary

- The 10 metrics span a 2-5 dimensional subspace
- Consistent results across problems and scalings
- Ordering metrics cluster: AUC, APR, BEP
- CAL is far from the ordering metrics
- CAL is nearest to RMS/MXE
- RMS ≈ MXE, but RMS is much more centrally located
- Threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
- Lift behaves more like the ordering metrics than the threshold metrics
- Old friends ACC, AUC, and RMS are the most representative
- The new SAR metric is good, but not much better than RMS

New Resources

- Want to borrow 14,000 models?
  - margin analysis
  - comparison to new algorithm X
- PERF code: software that calculates 2 dozen performance metrics
  - Accuracy (at different thresholds)
  - ROC Area and ROC plots
  - Precision and Recall plots
  - Break-even point, F-score, Average Precision
  - Squared Error
  - Cross-Entropy
  - Lift
- Currently, most metrics are for boolean classification problems
- We are willing to add new metrics and new capabilities
- Available at http://www.cs.cornell.edu/~caruana

Future Work

Future/Related Work

- Ensemble method that optimizes any metric (ICML'04)
- Getting good probabilities from Boosted Trees (AISTATS'05)
- Comparison of learning algorithms on the metrics (ICML'06)
- This is a first step in analyzing different performance metrics
- Develop new metrics with better properties
  - SAR is a good general-purpose metric
  - Does optimizing to SAR yield better models?
  - but RMS is nearly as good
  - attempts to make SAR better did not help much
- Extend to multi-class or hierarchical problems where evaluating performance is more difficult

Thank You.

Spooky Stuff in Metric Space

Which learning methods perform best on each metric?

Normalized Scores: Best Single Models

- SVM predictions transformed to posterior probabilities via Platt Scaling
- SVM and ANN tied for first place; Bagged Trees nearly as good
- Boosted Trees win 5 of the 6 Threshold and Rank metrics, but yield lousy probabilities!
- Boosting weaker stumps does not compare to boosting full trees
- KNN and plain Decision Trees usually not competitive (with 4k train sets)
- Other interesting things: see the papers.

Platt Scaling

- SVM predictions lie in (-inf, +inf)
- Probability metrics require [0, 1]
- Platt scaling transforms SVM predictions by fitting a sigmoid
- This gives SVMs good probability performance
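A rough sketch of Platt scaling: fit a sigmoid (here, a logistic regression) on held-out SVM decision values. scikit-learn's CalibratedClassifierCV(method="sigmoid") packages the same idea; the data and names below are illustrative:

```python
# Sketch: map raw SVM scores in (-inf, inf) to probabilities in [0, 1].
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

svm = LinearSVC().fit(X_tr, y_tr)
f_cal = svm.decision_function(X_cal).reshape(-1, 1)    # raw SVM scores on held-out data

# Fit the sigmoid p = 1 / (1 + exp(-(A*f + B))) by logistic regression on the scores.
platt = LogisticRegression().fit(f_cal, y_cal)

f_new = svm.decision_function(X_cal[:5]).reshape(-1, 1)
print(platt.predict_proba(f_new)[:, 1])                # calibrated probabilities in [0, 1]
```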

Outline

- Motivation: The One True Model
- Ten Performance Metrics
- Experiments
- Multidimensional Scaling (MDS) Analysis
- Correlation Analysis
- Learning Algorithm vs. Metric
- Summary

Base-Level Learners

- Each optimizes different things
  - ANN: minimizes squared error or cross-entropy (good for probs)
  - SVM, Boosting: optimize margin (good for accuracy, poor for probs)
  - DT: optimizes info gain
  - KNN: ?
- Each best in different regimes
  - SVM: high-dimensional data
  - DT, KNN: large data sets
  - ANN: non-linear prediction from many correlated features
- Each algorithm has many variations and free parameters
  - SVM: margin parameter, kernel, kernel parameters (gamma, …)
  - ANN: hidden units, hidden layers, learning rate, early stopping point, …
  - DT: splitting criterion, pruning options, smoothing options, …
  - KNN: K, distance metric, distance-weighted averaging, …
- Generate about 2000 models on each test problem

Motivation

- Holy Grail of Supervised Learning
  - One True Model (a.k.a. Bayes Optimal Model)
  - Predicts the correct conditional probability for each case
  - Yields optimal performance on all reasonable metrics
  - Hard to learn given finite data
    - train sets rarely have conditional probs, usually just 0/1 targets
  - Isn't always necessary
- Many Different Performance Metrics
  - ACC, AUC, CXE, RMS, PRE/REC, …
  - Each represents different tradeoffs
  - Usually important to optimize to the appropriate metric
  - Not all metrics are created equal


Accuracy

- Target: 0/1, -1/+1, True/False, …
- Prediction = f(inputs) = f(x): 0/1 or real-valued
- Threshold: f(x) > thresh → 1, else → 0
- threshold(f(x)) ∈ {0, 1}
- accuracy = right / total = p(correct) = p(threshold(f(x)) = target)

Precision and Recall

- Typically used in document retrieval
- Precision
  - how many of the returned documents are correct
  - precision(threshold) = TP / (TP + FP)
- Recall
  - how many of the positives does the model return
  - recall(threshold) = TP / (TP + FN)
- Precision/Recall Curve: sweep thresholds
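A small sketch of that sweep using scikit-learn's precision_recall_curve (labels and scores below are illustrative):

```python
# Sketch: precision and recall at every threshold along the ranking.
import numpy as np
from sklearn.metrics import precision_recall_curve

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.1])

precision, recall, thresholds = precision_recall_curve(y, scores)
for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
```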

Precision/Recall

           Predicted 1   Predicted 0
True 1         a             b
True 0         c             d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split)
