A Black-Box approach to machine learning - PowerPoint PPT Presentation
Provided by: yoavf (https://cseweb.ucsd.edu)

Transcript and Presenter's Notes
1
A Black-Box approach to machine learning
  • Yoav Freund

2
Why do we need learning?
  • Computers need functions that map highly variable data:
  • Speech recognition: audio signal -> words
  • Image analysis: video signal -> objects
  • Bio-informatics: micro-array images -> gene function
  • Data mining: transaction logs -> customer classification
  • For accuracy, functions must be tuned to fit the data
    source.
  • For real-time processing, function computation has to be
    very fast.

3
The complexity/accuracy tradeoff
[Figure: error vs. complexity tradeoff curve]
4
The speed/flexibility tradeoff
[Figure: speed vs. flexibility; from most flexible to fastest: Matlab code, Java code, machine code, digital hardware, analog hardware]
5
Theory Vs. Practice
  • Theoretician: "I want a polynomial-time algorithm which is
    guaranteed to perform arbitrarily well in all situations."
  • - I prove theorems.
  • Practitioner: "I want a real-time algorithm that performs
    well on my problem."
  • - I experiment.
  • My approach: "I want combining algorithms whose performance
    and speed are guaranteed relative to the performance and
    speed of their components."
  • - I do both.

6
Plan of talk
  • The black-box approach
  • Boosting
  • Alternating decision trees
  • A commercial application
  • Boosting the margin
  • Confidence rated predictions
  • Online learning

7
The black-box approach
  • Statistical models are not generators; they are
    predictors.
  • A predictor is a function from observation X to action Z.
  • After the action is taken, an outcome Y is observed, which
    implies a loss L (a real-valued number).
  • Goal: find a predictor with small loss (in expectation,
    with high probability, or cumulatively).
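The predictor abstraction above can be sketched in a few lines of Python (all names here are illustrative, not from the talk): a predictor maps an observation X to an action Z, and the loss is measured after the outcome Y is revealed.

```python
# Illustrative sketch of the black-box predictor abstraction:
# a predictor is any function X -> Z; loss compares action Z to outcome Y.
def evaluate(predictor, loss, data):
    """Average loss of `predictor` over (observation, outcome) pairs."""
    total = 0.0
    for x, y in data:
        z = predictor(x)        # take the action based on the observation
        total += loss(z, y)     # the revealed outcome implies a loss
    return total / len(data)

# Example: binary classification with 0/1 loss.
sign_pred = lambda x: 1 if x > 0 else -1
zero_one = lambda z, y: 0.0 if z == y else 1.0
avg = evaluate(sign_pred, zero_one, [(2.0, 1), (-1.0, -1), (0.5, -1)])
# avg is 1/3: one of the three examples is misclassified
```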

8
Main software components
We assume the predictor will be applied to
examples similar to those on which it was trained
9
Learning in a system
[Diagram: sensor data -> learning system -> action]
10
Special case: classification
Observation X - arbitrary (measurable) space
Label Y - one of K classes; usually K=2 (binary classification)
11
Batch learning for binary classification
12
Boosting
  • Combining weak learners

13
A weighted training set
14
A weak learner
Weighted training set
Weak Learner
h
15
The boosting process
16
Adaboost
17
Main property of Adaboost
  • If the advantages of the weak rules over random guessing
    are g1, g2, ..., gT, then the training error of the final
    rule is at most exp(-2 (g1² + g2² + ... + gT²)).

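The boosting process on these slides can be written down compactly. Below is a minimal Adaboost sketch in plain Python (the stump weak learner and the toy data are invented for illustration; this is not the talk's own code):

```python
import math

# Minimal AdaBoost sketch: labels in {-1, +1};
# weak_learn(xs, ys, w) returns a hypothesis h(x) -> ±1.
def adaboost(xs, ys, weak_learn, T):
    n = len(xs)
    w = [1.0 / n] * n                       # uniform initial weights
    hyps, alphas = [], []
    for _ in range(T):
        h = weak_learn(xs, ys, w)
        # weighted training error of this weak rule
        eps = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
        if eps >= 0.5:                      # no better than random: stop
            break
        if eps == 0:                        # perfect rule: use it alone
            hyps.append(h); alphas.append(1.0)
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # re-weight: mistakes gain weight, correct answers lose weight
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
        hyps.append(h); alphas.append(alpha)
    # final rule: sign of the weighted vote of the weak rules
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1

def make_stump(t, s):
    return lambda x: s if x > t else -s

def stump_learner(xs, ys, w):
    """Exhaustively pick the threshold stump with smallest weighted error."""
    best, best_err = make_stump(0.0, 1), 2.0
    for t in xs:
        for s in (1, -1):
            h = make_stump(t, s)
            err = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
    return best

xs, ys = [1.0, 2.0, 3.0, 4.0], [-1, -1, 1, 1]
f = adaboost(xs, ys, stump_learner, 5)
train_err = sum(f(x) != y for x, y in zip(xs, ys)) / len(xs)  # 0.0 here
```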
18
Boosting block diagram
19
What is a good weak learner?
  • The set of weak rules (features) should be
  • flexible enough to be (weakly) correlated with most
    conceivable relations between feature vector and label,
  • simple enough to allow efficient search for a rule with
    non-trivial weighted training error,
  • and small enough to avoid over-fitting.
  • Calculation of the prediction from observations should be
    very fast.

20
Alternating decision trees
Freund, Mason 1997
21
Decision Trees
[Figure: a decision tree with splits X > 3 and Y > 5, leaves labeled +1 / -1]
22
A decision tree as a sum of weak rules.
[Figure: the same tree rewritten as a sum of weak rules, each contributing a small real-valued score]
23
An alternating decision tree
[Figure: an alternating decision tree; several splitter nodes can hang off one prediction node, and the scores along all satisfied paths are summed]
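The key difference from an ordinary decision tree is that an alternating decision tree's output is the sum of the prediction values along every path the instance satisfies, and sibling splitters all fire. A hypothetical sketch (tree structure and numbers are made up):

```python
# Hypothetical alternating decision tree: each node is (value, splitters);
# each splitter is (predicate, yes_subtree, no_subtree). The score is the
# SUM of all prediction values reached, across all parallel splitters.
def adt_score(x, node):
    value, splitters = node
    total = value
    for pred, yes, no in splitters:
        total += adt_score(x, yes if pred(x) else no)
    return total

# Illustrative tree: root score 0.5 plus two parallel splits.
tree = (0.5, [
    (lambda x: x["a"] > 3, (0.4, []), (-0.3, [])),
    (lambda x: x["b"] > 5, (-0.2, []), (0.1, [])),
])
score = adt_score({"a": 4, "b": 2}, tree)   # 0.5 + 0.4 + 0.1 = 1.0
label = 1 if score > 0 else -1
```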
24
Example: medical diagnostics
  • "Cleve" dataset from the UC Irvine database.
  • Heart-disease diagnostics (+1 = healthy, -1 = sick).
  • 13 features from tests (real-valued and discrete).
  • 303 instances.

25
AD-tree for heart-disease diagnostics
Score > 0: healthy; score < 0: sick
26
Commercial Deployment.
27
AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
  • Distinguish business/residence customers from call-detail
    information (time of day, length of call, ...).
  • 230M telephone numbers, label unknown for 30%.
  • 260M calls / day.
  • Required computer resources:
  • Huge: counting log entries to produce statistics -- use
    specialized I/O-efficient sorting algorithms (Hancock).
  • Significant: calculating the classification for 70M
    customers.
  • Negligible: learning (2 hours on 10K training examples on
    an off-line computer).

28
AD-tree for buisosity
29
AD-tree (Detail)
30
Quantifiable results
  • For accuracy 94%, increased coverage from 44% to 56%.
  • Saved AT&T $15M in the year 2000 in operations costs and
    missed opportunities.

31
Adaboost's resistance to over-fitting
  • Why statisticians find Adaboost interesting.

32
A very curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
33
Large margins
Thesis: large margins => reliable predictions.
Very similar to SVMs.
34
Experimental Evidence
35
Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998
H: a set of binary functions with VC-dimension d
No dependence on the number of combined functions!
36
Idea of Proof
37
Confidence rated predictions
  • Agreement gives confidence

38
A motivating example
39
The algorithm
Freund, Mansour, Schapire 2001
40
Suggested tuning
Suppose H is a finite set.
Yields
41
Confidence Rating block diagram
42
Face Detection
Viola & Jones, 1999
  • Paul Viola and Mike Jones developed a face
    detector that can work in real time (15 frames
    per second).

43
Using confidence to save time
The detector combines 6000 simple features using
Adaboost.
In most boxes, only 8-9 features are calculated.
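The early-rejection idea behind that speed-up can be sketched as follows (the stage weights and thresholds are toy values invented for illustration, not the actual Viola-Jones detector):

```python
# Cascade / early-exit sketch: features are evaluated in sequence, and a
# window is rejected as soon as its running score drops below that
# stage's threshold, so most windows touch only a few features.
def cascade_classify(x, stages):
    """stages: list of (feature_fn, weight, reject_threshold)."""
    score = 0.0
    for i, (feat, weight, reject_below) in enumerate(stages):
        score += weight * feat(x)
        if score < reject_below:       # confidently "not a face": stop early
            return -1, i + 1           # (label, number of features used)
    return 1, len(stages)

# Toy stages (all numbers illustrative).
stages = [
    (lambda x: x[0], 1.0, -0.5),
    (lambda x: x[1], 1.0, 0.0),
    (lambda x: x[2], 1.0, 0.5),
]
label, used = cascade_classify((-1.0, 1.0, 1.0), stages)  # rejected at stage 1
```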
44
Using confidence to train car detectors
45
Original image vs. difference image
46
Co-training
Blum and Mitchell 98
[Diagram: highway images split into two views, raw B/W and difference image; a partially trained classifier on each view labels examples for the other]
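A toy sketch of the co-training loop: two classifiers, each seeing a different view of the data, label the unlabeled examples they are confident about and hand them to the shared training set. The threshold classifier and confidence rule below are invented for illustration.

```python
# Toy 1-D classifier: threshold halfway between the two class means.
class ThresholdClf:
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == -1]
        self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    def predict(self, x):
        return 1 if x > self.t else -1
    def confidence(self, x):
        return abs(x - self.t)          # distance from the threshold

def cotrain(clf_a, clf_b, labeled, unlabeled, rounds=3, conf=1.0):
    """labeled: [((view_a, view_b), y)]; unlabeled: [(view_a, view_b)]."""
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit([v for (v, _), _ in labeled], [y for _, y in labeled])
        clf_b.fit([v for (_, v), _ in labeled], [y for _, y in labeled])
        keep = []
        for va, vb in pool:
            if clf_a.confidence(va) > conf:    # A is sure: label for both
                labeled.append(((va, vb), clf_a.predict(va)))
            elif clf_b.confidence(vb) > conf:  # B is sure
                labeled.append(((va, vb), clf_b.predict(vb)))
            else:
                keep.append((va, vb))          # nobody is sure yet
        pool = keep
    return clf_a, clf_b

labeled = [((0.0, 0.2), -1), ((10.0, 9.8), 1)]
unlabeled = [(9.0, 9.1), (1.0, 0.9)]
a, b = cotrain(ThresholdClf(), ThresholdClf(), labeled, unlabeled)
```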
47
Co-Training Results
Levin, Freund, Viola 2002
48
Selective sampling
[Diagram: a partially trained classifier selects informative unlabeled examples for labeling]
Query-by-committee: Seung, Opper & Sompolinsky;
Freund, Seung, Shamir & Tishby
49
Online learning
  • Adapting to changes

50
Online learning
So far, the only statistical assumption was that the data are
generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeated game.
Suppose we have a set of experts; we believe one is good, but
we don't know which one.
51
Online prediction game
52
A very simple example
  • Binary classification
  • N experts
  • One expert is known to be perfect
  • Algorithm: predict like the majority of the experts that
    have made no mistake so far.
  • Bound: at most log2(N) mistakes.

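This majority-of-consistent-experts ("halving") algorithm is easy to write down: each mistake eliminates at least half of the surviving experts, which is where the log2(N) mistake bound comes from. A sketch with invented toy data:

```python
# Halving algorithm: predict with the majority of the experts that are
# still consistent with the past; each mistake removes at least half of
# them, so a perfect expert guarantees at most log2(N) mistakes.
def halving_run(expert_preds, truths):
    """expert_preds[t][i]: prediction (+1/-1) of expert i at round t."""
    n = len(expert_preds[0])
    alive = set(range(n))                  # experts with no mistake so far
    mistakes = 0
    for preds, y in zip(expert_preds, truths):
        vote = sum(preds[i] for i in alive)
        guess = 1 if vote >= 0 else -1     # majority of surviving experts
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}
    return mistakes

# Toy run: 4 experts, expert 0 is perfect.
preds = [[1, -1, -1, -1], [-1, -1, 1, 1], [1, 1, -1, 1]]
truths = [1, -1, 1]
m = halving_run(preds, truths)             # m <= log2(4) = 2
```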
53
History of online learning
  • Littlestone & Warmuth
  • Vovk
  • Vovk and Shafer's recent book "Probability and Finance:
    It's Only a Game!"
  • Innumerable contributions from many fields: Hannan,
    Blackwell, Davison, Gallager, Cover, Barron, Foster &
    Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov,
    Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire,
    Valiant, Auer

54
Lossless compression
X - arbitrary input space.
Y - the outcome bit, in {0,1}.
  • Z - a predicted probability, in [0,1].

Entropy, Lossless compression, MDL.
  • Statistical likelihood, standard probability
    theory.

55
Bayesian averaging
Folk theorem in Information Theory
56
Game theoretical loss
X - arbitrary space
57
Learning in games
Freund and Schapire 94
An algorithm which knows T in advance guarantees
58
Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95
We describe an algorithm that guarantees
59
Why isn't online learning practical?
  • Prescriptions too similar to Bayesian approach.
  • Implementing low-level learning requires a large
    number of experts.
  • Computation increases linearly with the number of
    experts.
  • Potentially very powerful for combining a few
    high-level experts.

60
Online learning for detector deployment
The detector can be adaptive!
[Diagram: online-learning module in the deployment loop]
61
Summary
  • By combining predictors we can
  • improve accuracy,
  • estimate prediction confidence,
  • and adapt on-line.
  • To make machine learning practical, we should
  • speed up the predictors,
  • concentrate human feedback on hard cases,
  • fuse data from several sources,
  • and share predictor libraries.