1
Spooky Stuff in Metric Space
2
Spooky Stuff: Data Mining in Metric Space
  • Rich Caruana
  • Alex Niculescu
  • Cornell University

3
Motivation #1
4
Motivation #1: Pneumonia Risk Prediction
5
Motivation #1: Many Learning Algorithms
  • Neural nets
  • Logistic regression
  • Linear perceptron
  • K-nearest neighbor
  • Decision trees
  • ILP (Inductive Logic Programming)
  • SVMs (Support Vector Machines)
  • Bagging X
  • Boosting X
  • Rule learners (C2, ...)
  • Ripper
  • Random Forests (forests of decision trees)
  • Gaussian Processes
  • Bayes Nets
  • No one/few learning methods dominate the others

6
Motivation #2
7
Motivation #2: SLAC B/Bbar
  • Particle accelerator generates B/Bbar particles
  • Use machine learning to classify tracks as B or
    Bbar
  • Domain-specific performance measure: SLQ score
  • A 5% increase in SLQ can save $1M in accelerator time
  • SLAC researchers tried various DM/ML methods
  • Good, but not great, SLQ performance
  • We tried standard methods, got similar results
  • We studied SLQ metric
  • similar to probability calibration
  • tried bagged probabilistic decision trees (good
    on C-Section)

8
Motivation #2: Bagged Probabilistic Trees
  • Draw N bootstrap samples of data
  • Train a tree on each sample => N trees
  • Final prediction = average prediction of the N trees


Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + ... + 0.26 + 0.31) / #Trees = 0.24
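A minimal sketch of this procedure in Python, assuming scikit-learn, binary 0/1 labels, and NumPy arrays (the helper name and arrays are hypothetical, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_tree_probs(X_train, y_train, X_test, n_trees=100, seed=0):
    """Average the predicted probabilities of n_trees trees,
    each trained on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    avg = np.zeros(len(X_test))
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        avg += tree.predict_proba(X_test)[:, 1]     # leaf class frequency as the tree's probability
    return avg / n_trees                            # final prediction = average over the N trees
```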
9
Motivation #2: Improves Calibration by an Order of Magnitude
[Calibration plots: single tree = poor calibration; 100 bagged trees = excellent calibration]
10
Motivation #2: Significantly Improves SLQ
[SLQ scores: 100 bagged trees vs. single tree]
11
Motivation #2
  • Can we automate this analysis of performance metrics so that it's easier to recognize which metrics are similar to each other?

12
Motivation #3
13
Motivation #3
14
Scary Stuff
  • In an ideal world:
  • Learn a model that predicts the correct conditional probabilities (Bayes optimal)
  • Yields optimal performance on any reasonable metric
  • In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • Don't have good metrics for recognizing the ideal model
  • Ideal model isn't always needed
  • In practice:
  • Do learning using many different metrics: ACC, AUC, CXE, RMS, ...
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric

15
Scary Stuff
16
Scary Stuff
17
In this work we compare nine commonly used
performance metrics by applying data mining to
the results of a massive empirical study
  • Goals
  • Discover relationships between performance
    metrics
  • Are the metrics really that different?
  • If you optimize to metric X, do you also get good performance on metric Y?
  • If you need good performance on metric Y, which metric X should you optimize to?
  • Which metrics are more/less robust?
  • Design new, better metrics?

18
10 Binary Classification Performance Metrics
  • Threshold Metrics
  • Accuracy
  • F-Score
  • Lift
  • Ordering/Ranking Metrics
  • ROC Area
  • Average Precision
  • Precision/Recall Break-Even Point
  • Probability Metrics
  • Root-Mean-Squared-Error
  • Cross-Entropy
  • Probability Calibration
  • SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
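A sketch of how SAR might be computed from a model's predictions with scikit-learn, interpreting the squared-error term as RMS on probabilistic predictions (the function name, threshold, and arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

def sar(y_true, y_prob, threshold=0.5):
    """SAR combines one metric from each family:
    threshold (ACC), ordering (AUC), and probability (RMS)."""
    acc = accuracy_score(y_true, y_prob >= threshold)   # threshold metric
    auc = roc_auc_score(y_true, y_prob)                 # ordering metric
    rms = np.sqrt(mean_squared_error(y_true, y_prob))   # probability metric
    return ((1.0 - rms) + acc + auc) / 3.0
```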

19
Accuracy
                 Predicted 1    Predicted 0
    True 1            a              b
    True 0            c              d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split;
 a and d are correct, b and c are incorrect)

accuracy = (a + d) / (a + b + c + d)
20
Lift
  • not interested in accuracy on the entire dataset
  • want accurate predictions for 5%, 10%, or 20% of the dataset
  • don't care about the remaining 95%, 90%, or 80%, respectively
  • typical application: marketing
  • how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)

21
Lift
                 Predicted 1    Predicted 0
    True 1            a              b
    True 0            c              d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split)
22
lift = 3.5 if mailings are sent to 20% of the customers
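A sketch of lift at a given fraction of the dataset, using the usual definition (positive rate among the top-scoring fraction divided by the overall positive rate); the function name and arrays are hypothetical:

```python
import numpy as np

def lift_at_fraction(y_true, y_score, fraction=0.20):
    """Lift when the top `fraction` of cases (by predicted score)
    are predicted positive, e.g. the 20% of customers mailed."""
    y_true = np.asarray(y_true)
    n_top = max(1, int(round(fraction * len(y_true))))
    top = np.argsort(y_score)[::-1][:n_top]    # highest-scoring cases
    precision_top = y_true[top].mean()         # positive rate among those mailed
    base_rate = y_true.mean()                  # positive rate in the whole dataset
    return precision_top / base_rate
```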
23
Precision/Recall, F, Break-Even Pt
F-score = harmonic average of precision and recall = 2 * precision * recall / (precision + recall)
24
[Precision/Recall curves: higher curve = better performance, lower curve = worse performance]
25
                 Predicted 1             Predicted 0
    True 1   true positive (TP)      false negative (FN)
    True 0   false positive (FP)     true negative (TN)

                 Predicted 1                    Predicted 0
    True 1   hits          P(pr=1 | tr=1)    misses               P(pr=0 | tr=1)
    True 0   false alarms  P(pr=1 | tr=0)    correct rejections   P(pr=0 | tr=0)
26
ROC Plot and ROC Area
  • Receiver Operating Characteristic
  • Developed in WWII to statistically model false
    positive and false negative detections of radar
    operators
  • Better statistical foundations than most other
    measures
  • Standard measure in medicine and biology
  • Becoming more popular in ML
  • Sweep the threshold and plot:
  • TPR vs. FPR
  • Sensitivity vs. 1 - Specificity
  • P(true | true) vs. P(true | false)
  • Sensitivity = a/(a+b) = Recall = LIFT numerator
  • 1 - Specificity = 1 - d/(c+d)
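A minimal sketch of the threshold sweep (a hypothetical helper; in practice sklearn.metrics.roc_curve does this directly):

```python
import numpy as np

def roc_points(y_true, y_score, n_thresholds=100):
    """Sweep a threshold over the scores and collect (FPR, TPR) pairs."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pts = []
    for t in np.linspace(y_score.min(), y_score.max(), n_thresholds):
        pred = y_score > t
        tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)   # sensitivity = a/(a+b)
        fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)   # 1 - specificity
        pts.append((fpr, tpr))
    return pts
```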

27
diagonal line is random prediction
28
Calibration
  • Good calibration
  • If 1000 x's have pred(x) = 0.2, 200 should be positive

29
Calibration
  • Model can be accurate but poorly calibrated
  • a good threshold, even with uncalibrated probabilities
  • Model can have good ROC but be poorly calibrated
  • ROC is insensitive to scaling/stretching
  • only the ordering has to be correct, not the probabilities themselves
  • Model can have very high variance, but be well
    calibrated
  • Model can be stupid, but be well calibrated
  • Calibration is a real oddball

30
Measuring Calibration
  • Bucket method
  • In each bucket:
  • measure the observed c-sec rate
  • and the predicted c-sec rate (average of the predicted probabilities)
  • if the observed c-sec rate is similar to the predicted c-sec rate => good calibration in that bucket (see the sketch after the bucket diagram)

[Bucket diagram: predictions on [0.0, 1.0] are divided into ten equal-width buckets with midpoints 0.05, 0.15, ..., 0.95]
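A sketch of the bucket method under these assumptions (ten equal-width buckets; the function name and arrays are hypothetical):

```python
import numpy as np

def calibration_table(y_true, y_prob, n_buckets=10):
    """For each probability bucket, compare the mean predicted
    probability with the observed positive rate."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bucket = (y_prob >= lo) & (y_prob < hi)
        if in_bucket.any():
            rows.append((lo, hi,
                         y_prob[in_bucket].mean(),    # predicted rate in bucket
                         y_true[in_bucket].mean()))   # observed rate in bucket
    return rows  # predicted close to observed => good calibration in that bucket
```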
31
Calibration Plot
32
Experiments
33
Base-Level Learning Methods
  • Decision trees
  • K-nearest neighbor
  • Neural nets
  • SVMs
  • Bagged Decision Trees
  • Boosted Decision Trees
  • Boosted Stumps
  • Each optimizes different things
  • Each best in different regimes
  • Each algorithm has many variations and free
    parameters
  • Generate about 2000 models on each test problem

34
Data Sets
  • 7 binary classification data sets
  • Adult
  • Cover Type
  • Letter.p1 (balanced)
  • Letter.p2 (unbalanced)
  • Pneumonia (University of Pittsburgh)
  • Hyper Spectral (NASA Goddard Space Center)
  • Particle Physics (Stanford Linear Accelerator)
  • 4k train sets
  • Large final test sets (usually 20k)

35
Massive Empirical Comparison
  • 7 base-level learning methods
  • ×
  • 100s of parameter settings per method
  • = 2,000 models per problem
  • ×
  • 7 test problems
  • = 14,000 models
  • ×
  • 10 performance metrics
  • = 140,000 model performance evaluations

36
COVTYPE Calibration vs. Accuracy
37
Multi Dimensional Scaling
38
Scaling, Ranking, and Normalizing
  • Problem
  • for some metrics, 1.00 is best (e.g., ACC)
  • for some metrics, 0.00 is best (e.g., RMS)
  • for some metrics, baseline is 0.50 (e.g., AUC)
  • for some problems/metrics, 0.60 is excellent performance
  • for some problems/metrics, 0.99 is poor performance
  • Solution 1: Normalized Scores (sketched after this list)
  • baseline performance => 0.00
  • best observed performance => 1.00 (proxy for Bayes optimal)
  • puts all metrics on an equal footing
  • Solution 2: Scale by Standard Deviation
  • Solution 3: Rank Correlation
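A sketch of the normalized-score idea described under Solution 1 (the baseline and best-observed values are inputs here; the function name is hypothetical):

```python
def normalized_score(raw, baseline, best_observed, lower_is_better=False):
    """Map raw performance so that baseline -> 0.0 and best observed -> 1.0.
    For metrics where smaller is better (e.g. RMS), flip the direction first."""
    if lower_is_better:
        raw, baseline, best_observed = -raw, -baseline, -best_observed
    return (raw - baseline) / (best_observed - baseline)
```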

39
Multi Dimensional Scaling
  • Find a low-dimensional embedding of the 10 x 14,000 data
  • The 10 metrics span a 2-5 dimensional subspace
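A sketch of one way to produce such an embedding, treating each metric as a point and using 1 - rank correlation as the dissimilarity (the scores array here is random placeholder data, not the study's results):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# scores: (n_models, n_metrics) performance table, e.g. 14,000 x 10 in the study.
rng = np.random.default_rng(0)
scores = rng.random((14000, 10))          # placeholder data

# Distance between two metrics = 1 - rank correlation of their
# scores across all models (one of the scalings used in the talk).
rho, _ = spearmanr(scores)                # 10 x 10 rank-correlation matrix
dist = 1.0 - rho

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(dist)       # one 2-D point per metric
```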

40
Multi Dimensional Scaling
  • Look at 2-D MDS plots
  • Scaled by standard deviation
  • Normalized scores
  • MDS of rank correlations
  • MDS on each problem individually
  • MDS averaged across all problems

41
2-D Multi-Dimensional Scaling
42
2-D Multi-Dimensional Scaling
[2-D MDS plots: one using normalized-scores scaling, one using rank-correlation distance]
43
[Per-problem 2-D MDS plots: Adult, Covertype, Hyper-Spectral, Letter, Medis, SLAC]
44
Correlation Analysis
  • 2000 performances for each metric on each problem
  • Correlation between all pairs of metrics
  • 10 metrics => 45 pairwise correlations
  • Average of correlations over 7 test problems
  • Standard correlation
  • Rank correlation
  • Rank correlations presented here
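A sketch of the averaging step, assuming a hypothetical list of per-problem score tables (one array of model-by-metric performances per test problem):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlations(score_tables):
    """Average the metric-by-metric rank-correlation matrix over a list of
    per-problem score tables, each of shape (n_models, n_metrics)."""
    mats = []
    for scores in score_tables:
        rho, _ = spearmanr(scores)   # rank correlation between metric columns
        mats.append(rho)
    return np.mean(mats, axis=0)     # element-wise average over the problems
```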

45
Rank Correlations
  • Correlation analysis consistent with MDS analysis
  • Ordering metrics have high correlations to each
    other
  • ACC, AUC, RMS have best correlations of metrics
    in each metric class
  • RMS has good correlation to other metrics
  • SAR has best correlation to other metrics

46
Summary
  • The 10 metrics span a 2-5 dimensional subspace
  • Consistent results across problems and scalings
  • Ordering metrics cluster: AUC, APR, BEP
  • CAL is far from the ordering metrics
  • CAL is nearest to RMS/MXE
  • RMS ~ MXE, but RMS is much more centrally located
  • Threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
  • Lift behaves more like ordering than threshold metrics
  • Old friends ACC, AUC, and RMS are the most representative
  • New SAR metric is good, but not much better than
    RMS

47
New Resources
  • Want to borrow 14,000 models?
  • margin analysis
  • comparison to new algorithm X
  • PERF code: software that calculates 2 dozen performance metrics
  • Accuracy (at different thresholds)
  • ROC Area and ROC plots
  • Precision and Recall plots
  • Break-even-point, F-score, Average Precision
  • Squared Error
  • Cross-Entropy
  • Lift
  • Currently, most metrics are for boolean
    classification problems
  • We are willing to add new metrics and new
    capabilities
  • Available at http://www.cs.cornell.edu/~caruana

48
Future Work
49
Future/Related Work
  • Ensemble method optimizes any metric (ICML'04)
  • Get good probs from Boosted Trees (AISTATS'05)
  • Comparison of learning algs on metrics (ICML'06)
  • First step in analyzing different performance
    metrics
  • Develop new metrics with better properties
  • SAR is a good general purpose metric
  • Does optimizing to SAR yield better models?
  • but RMS nearly as good
  • attempts to make SAR better did not help much
  • Extend to multi-class or hierarchical problems
    where evaluating performance is more difficult

50
Thank You.
51
Spooky Stuff in Metric Space
52
Which learning methods perform best on each
metric?
53
Normalized Scores: Best Single Models
  • SVM predictions transformed to posterior
    probabilities via Platt Scaling
  • SVM and ANN tied for first place; Bagged Trees nearly as good
  • Boosted Trees win 5 of 6 Threshold + Rank metrics, but yield lousy probs!
  • Boosting weaker stumps does not compare to
    boosting full trees
  • KNN and Plain Decision Trees usually not
    competitive (with 4k train sets)
  • Other interesting things. See papers.

54
Platt Scaling
  • SVM predictions lie in (-inf, +inf)
  • Probability metrics require [0, 1]
  • Platt scaling transforms SVM predictions by fitting a sigmoid
  • This gives the SVM good probability performance
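A sketch of the idea, approximating Platt's sigmoid fit with a one-feature logistic regression on held-out data (scikit-learn's CalibratedClassifierCV(method="sigmoid") implements the full version; the data and split here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=4000, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = LinearSVC(max_iter=5000).fit(X_tr, y_tr)

# Fit a sigmoid p = 1 / (1 + exp(-(A*f(x) + B))) that maps raw SVM
# margins in (-inf, +inf) to probabilities in [0, 1].
margins = svm.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression().fit(margins, y_cal)

probs = platt.predict_proba(margins)[:, 1]   # calibrated probabilities
```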

55
Outline
  • Motivation: The One True Model
  • Ten Performance Metrics
  • Experiments
  • Multidimensional Scaling (MDS) Analysis
  • Correlation Analysis
  • Learning Algorithm vs. Metric
  • Summary

56
Base-Level Learners
  • Each optimizes different things
  • ANN: minimize squared error or cross-entropy (good for probs)
  • SVM, Boosting: optimize margin (good for accuracy, poor for probs)
  • DT: optimize info gain
  • KNN: ?
  • Each best in different regimes
  • SVM: high-dimensional data
  • DT, KNN: large data sets
  • ANN: non-linear prediction from many correlated features
  • Each algorithm has many variations and free parameters
  • SVM: margin parameter, kernel, kernel parameters (gamma, ...)
  • ANN: hidden units, hidden layers, learning rate, early stopping point
  • DT: splitting criterion, pruning options, smoothing options, ...
  • KNN: K, distance metric, distance-weighted averaging, ...
  • Generate about 2000 models on each test problem

57
Motivation
  • Holy Grail of Supervised Learning
  • One True Model (a.k.a. Bayes Optimal Model)
  • Predicts correct conditional probability for each
    case
  • Yields optimal performance on all reasonable
    metrics
  • Hard to learn given finite data
  • train sets rarely have conditional probs, usually
    just 0/1 targets
  • Isn't always necessary
  • Many Different Performance Metrics
  • ACC, AUC, CXE, RMS, PRE/REC
  • Each represents different tradeoffs
  • Usually important to optimize to appropriate
    metric
  • Not all metrics are created equal

58
Motivation
  • In an ideal world:
  • Learn a model that predicts the correct conditional probabilities
  • Yields optimal performance on any reasonable metric
  • In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • Don't have good metrics for recognizing the ideal model
  • Ideal model isn't always necessary
  • In practice:
  • Do learning using many different metrics: ACC, AUC, CXE, RMS, ...
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric

59
Accuracy
  • Target: 0/1, -1/+1, True/False, ...
  • Prediction = f(inputs) = f(x): 0/1 or real-valued
  • Threshold: f(x) > thresh => 1, else => 0
  • threshold(f(x)) = 0/1
  • accuracy = # right / total
  •          = p(correct) = p(threshold(f(x)) = target)

60
Precision and Recall
  • Typically used in document retrieval
  • Precision
  • how many of the returned documents are correct
  • precision(threshold)
  • Recall
  • how many of the positives does the model return
  • recall(threshold)
  • Precision/Recall curve: sweep thresholds
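A sketch of precision and recall at a single threshold, using the confusion-matrix cells from the earlier slides (the helper name and arrays are hypothetical):

```python
import numpy as np

def precision_recall_at_threshold(y_true, y_score, threshold):
    """Precision and recall when cases with f(x) > threshold are predicted 1.
    With the slide's confusion matrix: precision = a/(a+c), recall = a/(a+b)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # a
    fp = np.sum((y_pred == 1) & (y_true == 0))   # c
    fn = np.sum((y_pred == 0) & (y_true == 1))   # b
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```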

61
Precision/Recall
                 Predicted 1    Predicted 0
    True 1            a              b
    True 0            c              d

(the threshold on f(x) determines the Predicted 1 / Predicted 0 split)
62