A Black-Box approach to machine learning - PowerPoint PPT Presentation
Provided by: yoavf (https://cseweb.ucsd.edu)

Transcript and Presenter's Notes
1
A Black-Box approach to machine learning
  • Yoav Freund

2
Why do we need learning?
  • Computers need functions that map highly variable data:
  • Speech recognition: audio signal -> words
  • Image analysis: video signal -> objects
  • Bio-informatics: micro-array images -> gene function
  • Data mining: transaction logs -> customer classification
  • For accuracy, functions must be tuned to fit the data
    source.
  • For real-time processing, function computation has to be
    very fast.

3
The complexity/accuracy tradeoff
[Figure: error vs. complexity tradeoff curve]
4
The speed/flexibility tradeoff
[Figure: speed vs. flexibility; from most flexible to fastest: Matlab code, Java code, machine code, digital hardware, analog hardware]
5
Theory Vs. Practice
  • Theoretician: "I want a polynomial-time algorithm which is
    guaranteed to perform arbitrarily well in all situations."
  • - I prove theorems.
  • Practitioner: "I want a real-time algorithm that performs
    well on my problem."
  • - I experiment.
  • My approach: "I want combining algorithms whose performance
    and speed are guaranteed relative to the performance and
    speed of their components."
  • - I do both.

6
Plan of talk
  • The black-box approach
  • Boosting
  • Alternating decision trees
  • A commercial application
  • Boosting the margin
  • Confidence rated predictions
  • Online learning

7
The black-box approach
  • Statistical models are not generators; they are
    predictors.
  • A predictor is a function from observation X to action Z.
  • After the action is taken, an outcome Y is observed, which
    implies a loss L (a real-valued number).
  • Goal: find a predictor with small loss (in expectation,
    with high probability, or cumulatively).
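The predictor abstraction above can be sketched in a few lines of Python (all names here are illustrative, not from the talk): a predictor maps an observation X to an action Z, and the loss is measured after the outcome Y is revealed.

```python
# Illustrative sketch of the black-box predictor abstraction:
# a predictor is any function X -> Z; loss compares action Z to outcome Y.
def evaluate(predictor, loss, data):
    """Average loss of `predictor` over (observation, outcome) pairs."""
    total = 0.0
    for x, y in data:
        z = predictor(x)        # take the action based on the observation
        total += loss(z, y)     # the revealed outcome implies a loss
    return total / len(data)

# Example: binary classification with 0/1 loss.
sign_pred = lambda x: 1 if x > 0 else -1
zero_one = lambda z, y: 0.0 if z == y else 1.0
avg = evaluate(sign_pred, zero_one, [(2.0, 1), (-1.0, -1), (0.5, -1)])
# avg is 1/3: one of the three examples is misclassified
```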

8
Main software components
We assume the predictor will be applied to
examples similar to those on which it was trained
9
Learning in a system
[Diagram: sensor data -> learning system -> action]
10
Special case: classification
Observation X - arbitrary (measurable) space
Label Y - one of K classes; usually K=2 (binary classification)
11
Batch learning for binary classification
12
Boosting
  • Combining weak learners

13
A weighted training set
14
A weak learner
Weighted training set
Weak Learner
h
15
The boosting process
16
Adaboost
17
Main property of Adaboost
  • If the advantages of the weak rules over random guessing
    are g1, g2, ..., gT, then the training error of the final
    rule is at most exp(-2 (g1² + g2² + ... + gT²)).

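The boosting process on these slides can be written down compactly. Below is a minimal Adaboost sketch in plain Python (the stump weak learner and the toy data are invented for illustration; this is not the talk's own code):

```python
import math

# Minimal AdaBoost sketch: labels in {-1, +1};
# weak_learn(xs, ys, w) returns a hypothesis h(x) -> ±1.
def adaboost(xs, ys, weak_learn, T):
    n = len(xs)
    w = [1.0 / n] * n                       # uniform initial weights
    hyps, alphas = [], []
    for _ in range(T):
        h = weak_learn(xs, ys, w)
        # weighted training error of this weak rule
        eps = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
        if eps >= 0.5:                      # no better than random: stop
            break
        if eps == 0:                        # perfect rule: use it alone
            hyps.append(h); alphas.append(1.0)
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # re-weight: mistakes gain weight, correct answers lose weight
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
        hyps.append(h); alphas.append(alpha)
    # final rule: sign of the weighted vote of the weak rules
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1

def make_stump(t, s):
    return lambda x: s if x > t else -s

def stump_learner(xs, ys, w):
    """Exhaustively pick the threshold stump with smallest weighted error."""
    best, best_err = make_stump(0.0, 1), 2.0
    for t in xs:
        for s in (1, -1):
            h = make_stump(t, s)
            err = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
    return best

xs, ys = [1.0, 2.0, 3.0, 4.0], [-1, -1, 1, 1]
f = adaboost(xs, ys, stump_learner, 5)
train_err = sum(f(x) != y for x, y in zip(xs, ys)) / len(xs)  # 0.0 here
```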
18
Boosting block diagram
19
What is a good weak learner?
  • The set of weak rules (features) should be
  • flexible enough to be (weakly) correlated with most
    conceivable relations between feature vector and label,
  • simple enough to allow efficient search for a rule with
    non-trivial weighted training error,
  • and small enough to avoid over-fitting.
  • Calculation of the prediction from observations should be
    very fast.

20
Alternating decision trees
Freund, Mason 1997
21
Decision Trees
[Figure: a decision tree with splits X > 3 and Y > 5, leaves labeled +1 / -1]
22
A decision tree as a sum of weak rules.
[Figure: the same tree rewritten as a sum of weak rules, each contributing a small real-valued score]
23
An alternating decision tree
[Figure: an alternating decision tree; several splitter nodes can hang off one prediction node, and the scores along all satisfied paths are summed]
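The key difference from an ordinary decision tree is that an alternating decision tree's output is the sum of the prediction values along every path the instance satisfies, and sibling splitters all fire. A hypothetical sketch (tree structure and numbers are made up):

```python
# Hypothetical alternating decision tree: each node is (value, splitters);
# each splitter is (predicate, yes_subtree, no_subtree). The score is the
# SUM of all prediction values reached, across all parallel splitters.
def adt_score(x, node):
    value, splitters = node
    total = value
    for pred, yes, no in splitters:
        total += adt_score(x, yes if pred(x) else no)
    return total

# Illustrative tree: root score 0.5 plus two parallel splits.
tree = (0.5, [
    (lambda x: x["a"] > 3, (0.4, []), (-0.3, [])),
    (lambda x: x["b"] > 5, (-0.2, []), (0.1, [])),
])
score = adt_score({"a": 4, "b": 2}, tree)   # 0.5 + 0.4 + 0.1 = 1.0
label = 1 if score > 0 else -1
```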
24
Example: medical diagnostics
  • "Cleve" dataset from the UC Irvine database.
  • Heart-disease diagnostics (+1 = healthy, -1 = sick).
  • 13 features from tests (real-valued and discrete).
  • 303 instances.

25
AD-tree for heart-disease diagnostics
Score > 0: healthy; score < 0: sick
26
Commercial Deployment.
27
AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
  • Distinguish business/residence customers from call-detail
    information (time of day, length of call, ...).
  • 230M telephone numbers, label unknown for 30%.
  • 260M calls / day.
  • Required computer resources:
  • Huge: counting log entries to produce statistics -- use
    specialized I/O-efficient sorting algorithms (Hancock).
  • Significant: calculating the classification for 70M
    customers.
  • Negligible: learning (2 hours on 10K training examples on
    an off-line computer).

28
AD-tree for buisosity
29
AD-tree (Detail)
30
Quantifiable results
  • For accuracy 94%, increased coverage from 44% to 56%.
  • Saved AT&T $15M in the year 2000 in operations costs and
    missed opportunities.

31
Adaboost's resistance to over-fitting
  • Why statisticians find Adaboost interesting.

32
A very curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
33
Large margins
Thesis: large margins => reliable predictions.
Very similar to SVMs.
34
Experimental Evidence
35
Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998
H: a set of binary functions with VC-dimension d
No dependence on the number of combined functions!
36
Idea of Proof
37
Confidence rated predictions
  • Agreement gives confidence

38
A motivating example
39
The algorithm
Freund, Mansour, Schapire 2001
40
Suggested tuning
Suppose H is a finite set.
Yields
41
Confidence Rating block diagram
42
Face Detection
Viola & Jones, 1999
  • Paul Viola and Mike Jones developed a face
    detector that can work in real time (15 frames
    per second).

43
Using confidence to save time
The detector combines 6000 simple features using
Adaboost.
In most boxes, only 8-9 features are calculated.
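The early-rejection idea behind that speed-up can be sketched as follows (the stage weights and thresholds are toy values invented for illustration, not the actual Viola-Jones detector):

```python
# Cascade / early-exit sketch: features are evaluated in sequence, and a
# window is rejected as soon as its running score drops below that
# stage's threshold, so most windows touch only a few features.
def cascade_classify(x, stages):
    """stages: list of (feature_fn, weight, reject_threshold)."""
    score = 0.0
    for i, (feat, weight, reject_below) in enumerate(stages):
        score += weight * feat(x)
        if score < reject_below:       # confidently "not a face": stop early
            return -1, i + 1           # (label, number of features used)
    return 1, len(stages)

# Toy stages (all numbers illustrative).
stages = [
    (lambda x: x[0], 1.0, -0.5),
    (lambda x: x[1], 1.0, 0.0),
    (lambda x: x[2], 1.0, 0.5),
]
label, used = cascade_classify((-1.0, 1.0, 1.0), stages)  # rejected at stage 1
```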
44
Using confidence to train car detectors
45
Original image vs. difference image
46
Co-training
Blum and Mitchell 98
[Diagram: highway images split into two views, raw B/W and difference image; a partially trained classifier on each view labels examples for the other]
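A toy sketch of the co-training loop: two classifiers, each seeing a different view of the data, label the unlabeled examples they are confident about and hand them to the shared training set. The threshold classifier and confidence rule below are invented for illustration.

```python
# Toy 1-D classifier: threshold halfway between the two class means.
class ThresholdClf:
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == -1]
        self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    def predict(self, x):
        return 1 if x > self.t else -1
    def confidence(self, x):
        return abs(x - self.t)          # distance from the threshold

def cotrain(clf_a, clf_b, labeled, unlabeled, rounds=3, conf=1.0):
    """labeled: [((view_a, view_b), y)]; unlabeled: [(view_a, view_b)]."""
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit([v for (v, _), _ in labeled], [y for _, y in labeled])
        clf_b.fit([v for (_, v), _ in labeled], [y for _, y in labeled])
        keep = []
        for va, vb in pool:
            if clf_a.confidence(va) > conf:    # A is sure: label for both
                labeled.append(((va, vb), clf_a.predict(va)))
            elif clf_b.confidence(vb) > conf:  # B is sure
                labeled.append(((va, vb), clf_b.predict(vb)))
            else:
                keep.append((va, vb))          # nobody is sure yet
        pool = keep
    return clf_a, clf_b

labeled = [((0.0, 0.2), -1), ((10.0, 9.8), 1)]
unlabeled = [(9.0, 9.1), (1.0, 0.9)]
a, b = cotrain(ThresholdClf(), ThresholdClf(), labeled, unlabeled)
```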
47
Co-Training Results
Levin, Freund, Viola 2002
48
Selective sampling
[Diagram: a partially trained classifier selects informative unlabeled examples for labeling]
Query-by-committee: Seung, Opper & Sompolinsky;
Freund, Seung, Shamir & Tishby
49
Online learning
  • Adapting to changes

50
Online learning
So far, the only statistical assumption was that the data are
generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeated game.
Suppose we have a set of experts; we believe one is good, but
we don't know which one.
51
Online prediction game
52
A very simple example
  • Binary classification
  • N experts
  • One expert is known to be perfect
  • Algorithm: predict like the majority of the experts that
    have made no mistake so far.
  • Bound: at most log2(N) mistakes.

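This majority-of-consistent-experts ("halving") algorithm is easy to write down: each mistake eliminates at least half of the surviving experts, which is where the log2(N) mistake bound comes from. A sketch with invented toy data:

```python
# Halving algorithm: predict with the majority of the experts that are
# still consistent with the past; each mistake removes at least half of
# them, so a perfect expert guarantees at most log2(N) mistakes.
def halving_run(expert_preds, truths):
    """expert_preds[t][i]: prediction (+1/-1) of expert i at round t."""
    n = len(expert_preds[0])
    alive = set(range(n))                  # experts with no mistake so far
    mistakes = 0
    for preds, y in zip(expert_preds, truths):
        vote = sum(preds[i] for i in alive)
        guess = 1 if vote >= 0 else -1     # majority of surviving experts
        if guess != y:
            mistakes += 1
        alive = {i for i in alive if preds[i] == y}
    return mistakes

# Toy run: 4 experts, expert 0 is perfect.
preds = [[1, -1, -1, -1], [-1, -1, 1, 1], [1, 1, -1, 1]]
truths = [1, -1, 1]
m = halving_run(preds, truths)             # m <= log2(4) = 2
```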
53
History of online learning
  • Littlestone & Warmuth
  • Vovk
  • Vovk and Shafer's recent book "Probability and Finance:
    It's Only a Game!"
  • Innumerable contributions from many fields: Hannan,
    Blackwell, Davison, Gallager, Cover, Barron, Foster &
    Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov,
    Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire,
    Valiant, Auer

54
Lossless compression
X - arbitrary input space.
Y - the outcome bit, in {0,1}.
  • Z - a predicted probability, in [0,1].

Entropy, Lossless compression, MDL.
  • Statistical likelihood, standard probability
    theory.

55
Bayesian averaging
Folk theorem in Information Theory
56
Game theoretical loss
X - arbitrary space
57
Learning in games
Freund and Schapire 94
An algorithm which knows T in advance guarantees
58
Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95
We describe an algorithm that guarantees
59
Why isn't online learning practical?
  • Prescriptions too similar to Bayesian approach.
  • Implementing low-level learning requires a large
    number of experts.
  • Computation increases linearly with the number of
    experts.
  • Potentially very powerful for combining a few
    high-level experts.

60
Online learning for detector deployment
The detector can be adaptive!
[Diagram: online-learning module in the deployment loop]
61
Summary
  • By combining predictors we can
  • improve accuracy,
  • estimate prediction confidence,
  • and adapt on-line.
  • To make machine learning practical, we should
  • speed up the predictors,
  • concentrate human feedback on hard cases,
  • fuse data from several sources,
  • and share predictor libraries.