CS546: Machine Learning and Natural Language, Lecture 7: Introduction to Classification: Linear Learning Algorithms

Provided by: danr168
1
CS546: Machine Learning and Natural Language
Lecture 7: Introduction to Classification: Linear Learning Algorithms
2009
2
Linear Functions
  • Exclusive-OR:

y = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)

  • Non-trivial DNF:

y = (x1 ∧ x2) ∨ (x3 ∧ x4)

3
Linear Functions
[Figure: positive and negative points in the plane, separated by the linear functions w · x = θ and w · x = 0]
4
Perceptron learning rule
  • On-line, mistake-driven algorithm.
  • Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change its weights and learn to produce the output, using the Perceptron learning rule.
  • Perceptron = Linear Threshold Unit
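As a concrete illustration of the rule, here is a minimal sketch of the mistake-driven Perceptron in Python. The helper names (`predict`, `train_perceptron`) and the toy disjunction data are my own, not from the slides.

```python
# Minimal Perceptron with a learning rate of 1: update only on mistakes.

def predict(w, x):
    """Linear threshold unit: sgn(w . x), with sgn(0) taken as -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(examples, epochs=10):
    """On-line, mistake-driven training.
    examples is a list of (x, y) pairs with y in {-1, +1}."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            if predict(w, x) != y:                       # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]  # w += y * x
    return w

# Learn the disjunction x1 v x2, with an always-on bias feature appended.
data = [((1, 0, 1), 1), ((0, 1, 1), 1), ((1, 1, 1), 1), ((0, 0, 1), -1)]
w = train_perceptron(data)
print(all(predict(w, x) == y for x, y in data))  # True on this separable set
```

Note that the update touches every coordinate, but when x is Boolean only the active (non-zero) features actually change, exactly as slide 11 observes.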

5
Perceptron learning rule
  • We learn f : X → {-1, +1}, represented as f = sgn(w · x)
  • where X = {0,1}^n or X = R^n, and w ∈ R^n

6
Footnote About the Threshold
  • On the previous slide, the Perceptron has no threshold
  • But we don't lose generality: a threshold can be simulated by adding an always-active feature whose weight plays the role of -θ

7
Geometric View
8-10
[Image-only slides: no transcript available]
11
Perceptron learning rule
  • If x is Boolean, only weights of active features
    are updated.

12
Perceptron Learnability
  • Obviously, it can't learn what it can't represent:
  • only linearly separable functions.
  • Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations:
  • Parity functions can't be learned (XOR)
  • In vision, if patterns are represented with local features, it can't represent symmetry or connectivity
  • Research on Neural Networks stopped for years
  • Rosenblatt himself (1959) asked:
  • "What pattern recognition problems can be transformed so as to become linearly separable?"

13
(x1 ∧ x2) ∨ (x3 ∧ x4) becomes linearly separable under the transformation
y1 = x1 ∧ x2, y2 = x3 ∧ x4: the target is then y1 ∨ y2
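The same idea can be demonstrated on XOR itself. In the sketch below (my own construction, not from the slides), `lift` adds the conjunction x1 ∧ x2 as an extra feature; the Perceptron keeps making mistakes on raw XOR but converges on the lifted representation.

```python
# XOR is not linearly separable, but becomes separable after adding a
# conjunctive feature, in the spirit of the slide's transformation.

def run_perceptron(examples, n, epochs=20):
    """Return (total mistakes, final weights) of the on-line Perceptron."""
    w, m = [0.0] * n, 0
    for _ in range(epochs):
        for x, y in examples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if s > 0 else -1) != y:      # mistake
                m += 1
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return m, w

# XOR with an always-on bias feature as the third coordinate.
xor = [((0, 0, 1), -1), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), -1)]

def lift(x):
    x1, x2, b = x
    return (x1, x2, x1 * x2, b)   # add the conjunction x1 ^ x2 as a feature

raw_m, _ = run_perceptron(xor, 3)
lifted = [(lift(x), y) for x, y in xor]
lifted_m, w = run_perceptron(lifted, 4)
print(raw_m > lifted_m)  # True: raw XOR keeps cycling, lifted XOR converges
```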
14
Perceptron Convergence
  • Perceptron Convergence Theorem:
  • If there exists a set of weights that is consistent with the data
  • (i.e., the data is linearly separable), the Perceptron learning
  • algorithm will converge.
  • -- How long does it take to converge?
  • Perceptron Cycling Theorem: If the training data is not linearly
  • separable, the Perceptron learning algorithm will eventually repeat
  • the same set of weights, and therefore enter an infinite loop.
  • -- How can we provide robustness and more expressivity?

15
Perceptron Mistake Bound Theorem
  • Maintains a weight vector w ∈ R^N, w_0 = (0, ..., 0).
  • Upon receiving an example x ∈ R^N,
  • predicts according to the linear threshold function w · x ≥ 0.
  • Theorem [Novikoff, 1963]: Let (x_1, y_1), ..., (x_t, y_t) be a sequence of labeled examples with x_i ∈ R^N, ||x_i|| ≤ R, and y_i ∈ {-1, 1} for all i.
  • Let u ∈ R^N, γ > 0 be such that ||u|| = 1 and y_i (u · x_i) ≥ γ for all i.
  • Then Perceptron makes at most R² / γ² mistakes on this example sequence.
  • (see additional notes)

γ is the margin complexity parameter.
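The bound can be checked empirically. The sketch below is my own toy construction (a fixed unit-norm separator u, with points filtered to have margin at least 0.1): it counts the on-line Perceptron's mistakes and compares them with R²/γ².

```python
# Empirical check of Novikoff's mistake bound on random separable data.
import math
import random

random.seed(0)

# A unit-norm target separator u (||u|| = 1); keep points with margin >= 0.1.
u = [3 / 5, 4 / 5]
data = []
while len(data) < 50:
    x = [random.uniform(-1, 1) for _ in range(2)]
    margin = u[0] * x[0] + u[1] * x[1]
    if abs(margin) >= 0.1:
        data.append((x, 1 if margin > 0 else -1))

R = max(math.hypot(x[0], x[1]) for x, _ in data)            # max ||x_i||
gamma = min(y * (u[0] * x[0] + u[1] * x[1]) for x, y in data)  # min margin

w, mistakes = [0.0, 0.0], 0
for _ in range(100):  # enough passes for the on-line Perceptron to converge
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1]) <= 0:  # mistake (ties count)
            mistakes += 1
            w = [w[0] + y * x[0], w[1] + y * x[1]]

print(mistakes <= R**2 / gamma**2)  # True: the bound holds
```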
16
Perceptron-Mistake Bound
Proof: Let v_k be the hypothesis before the k-th mistake, and assume the k-th mistake occurs on example (x_i, y_i), so v_{k+1} = v_k + y_i x_i.
Assumptions: v_1 = 0, ||u|| = 1, y_i (u · x_i) ≥ γ.
Multiply by u: v_{k+1} · u = v_k · u + y_i (u · x_i) ≥ v_k · u + γ, by the definition of u.
By induction: v_{k+1} · u ≥ k γ.
Projection: ||v_{k+1}||² = ||v_k||² + 2 y_i (v_k · x_i) + ||x_i||² ≤ ||v_k||² + R², since the middle term is non-positive on a mistake; by induction, ||v_{k+1}||² ≤ k R².
Combining: k γ ≤ v_{k+1} · u ≤ ||v_{k+1}|| ≤ R √k, hence
k ≤ R² / γ²
17
Mistake Bound and PAC
  • We discussed the theory of generalization in terms of ε-δ:
  • the Probably Approximately Correct (PAC) theory.
  • Why are we talking now about Mistake Bounds?
  • What's the relation? Which is weaker?
  • In the mistake bound model, we don't know when we will make the mistakes.
  • In the PAC model, we want dependence on the number of examples seen, not the number of mistakes.
  • The key reasons we talk in terms of Mistake Bounds:
  • It's easier to think about it this way
  • Every Mistake-Bound algorithm can be converted efficiently to a PAC algorithm
  • To convert:
  • Wait for a long stretch without mistakes (there must be one)
  • Use the hypothesis at the end of this stretch.
  • Its PAC behavior is relative to the length of the stretch.

Averaged Perceptron is doing basically that.
18
Perceptron for Boolean Functions
  • How many mistakes will the Perceptron algorithm make when learning a k-disjunction?
  • It can make O(n) mistakes on a k-disjunction over n attributes.
  • Our bound: R² / γ²
  • w = 1 / k^(1/2) for each of the k relevant components, 0 for the others
  • γ = the margin when examples differ in only one variable = 1 / k^(1/2)
  • R = n^(1/2)
  • Thus, we get nk.
  • Is it possible to do better?
  • This is important if n, the number of features, is very large.

19
Winnow Algorithm
  • The Winnow Algorithm learns Linear Threshold Functions.
  • For the class of disjunctions,
  • instead of demotion we can use elimination (set the weight to zero).
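The transcript shows Winnow only by example, so the sketch below uses Littlestone's standard update rules as an assumption: weights start at 1, the threshold is θ = n, active weights are doubled on a positive mistake (promotion) and halved on a negative one (demotion). The power-of-two weights in the slide's example (256, 32, ...) are consistent with these rules; the function names are mine.

```python
# A sketch of Winnow for Boolean features with the multiplicative update.
import itertools

def winnow_predict(w, x, theta):
    """Predict +1 iff the total weight of the active features reaches theta."""
    return 1 if sum(wi for wi, xi in zip(w, x) if xi) >= theta else -1

def winnow_update(w, x, y):
    """Promote (x2) active weights on a positive mistake, demote (/2) on a negative one."""
    factor = 2.0 if y == 1 else 0.5
    return [wi * factor if xi else wi for wi, xi in zip(w, x)]

def train_winnow(examples, n, epochs=50):
    w, theta = [1.0] * n, float(n)
    for _ in range(epochs):
        for x, y in examples:
            if winnow_predict(w, x, theta) != y:
                w = winnow_update(w, x, y)
    return w

# Target: the disjunction x1 v x2 over n = 8 Boolean variables.
n = 8
examples = [(x, 1 if x[0] or x[1] else -1)
            for x in itertools.product([0, 1], repeat=n)]
w = train_winnow(examples, n)
print(all(winnow_predict(w, x, n) == y for x, y in examples))  # True
```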

20
Winnow - Example
21
Winnow - Example
  • Notice that the same algorithm will learn a conjunction over these variables (w = (256, 256, 0, 32, 256, 256))

22
Winnow - Mistake Bound
Claim: Winnow makes O(k log n) mistakes on k-disjunctions.
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
23
Winnow - Mistake Bound
Claim: Winnow makes O(k log n) mistakes on k-disjunctions.
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
1. u < k log(2n)
24
Winnow - Mistake Bound
Claim: Winnow makes O(k log n) mistakes on k-disjunctions.
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
1. u < k log(2n)
A weight that corresponds to a good variable is only promoted; when these weights reach n, there will be no more mistakes on positive examples.
25
Winnow - Mistake Bound
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
2. v < 2(u + 1)
26
Winnow - Mistake Bound
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
2. v < 2(u + 1)
Total weight: TW = n initially.
27
Winnow - Mistake Bound
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
2. v < 2(u + 1)
Total weight: TW = n initially.
Mistake on a positive example: TW(t+1) < TW(t) + n
28
Winnow - Mistake Bound
u - of mistakes on positive examples
(promotions) v - of mistakes on negative
examples (demotions) 2. v lt 2(u 1) Total
weight TWn initially Mistake on positive
TW(t1) lt TW(t) n Mistake on negative
TW(t1) lt TW(t) - n/2
29
Winnow - Mistake Bound
u - of mistakes on positive examples
(promotions) v - of mistakes on negative
examples (demotions) 2. v lt 2(u 1) Total
weight TWn initially Mistake on positive
TW(t1) lt TW(t) n Mistake on negative
TW(t1) lt TW(t) - n/2 0 lt TW lt n u n - v
n/2 ? v lt 2(u1)
30
Winnow - Mistake Bound
u = # of mistakes on positive examples (promotions)
v = # of mistakes on negative examples (demotions)
Total # of mistakes = u + v < 3u + 2 = O(k log n)
31
Winnow - Extensions
  • This algorithm learns monotone functions
  • (in the Boolean algebra sense)
  • For the general case:
  • - Duplicate variables:
  • for the negation of variable x, introduce a new variable y;
  • learn monotone functions over 2n variables
  • - Balanced version:
  • keep two weights for each variable; the effective weight is their difference
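The duplicate-variables trick is just a feature transformation. A minimal sketch (the function name is mine, and the interpretation y = 1 - x for the negation variable is my assumption for Boolean inputs):

```python
# Handle negations by duplicating variables: map an n-bit example to 2n bits
# (x1, ..., xn, 1-x1, ..., 1-xn), so any disjunction with negated literals
# becomes a monotone disjunction over the 2n new variables.

def with_negations(x):
    """Append the complement of each Boolean feature."""
    return tuple(x) + tuple(1 - xi for xi in x)

print(with_negations((1, 0, 1)))  # (1, 0, 1, 0, 1, 0)
```

For example, the target ¬x1 ∨ x2 over the original variables is the monotone disjunction of coordinate 4 and coordinate 2 in the transformed space.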

32
Winnow - A Robust Variation
  • Winnow is robust in the presence of various kinds of noise
  • (classification noise, attribute noise).
  • Importance: sometimes we learn under some distribution
  • but test under a slightly different one
  • (e.g., natural language applications).

33
Winnow - A Robust Variation
  • Modeling:
  • Adversary's turn: may change the target concept by adding or removing some variable from the target disjunction.
  • The cost of each addition move is 1.
  • Learner's turn: makes a prediction on the examples given, and is then told the correct answer (according to the current target function).
  • Winnow-R: same as Winnow, only it doesn't let weights go below 1/2.
  • Claim: Winnow-R makes O(c log n) mistakes (c = cost of the adversary)
  • (a generalization of the previous claim).

34
Algorithmic Approaches
  • Focus: two families of algorithms
  • (with an on-line representative of each)

Which Algorithm to choose?
35
Algorithm Descriptions
  • Multiplicative weight-update algorithms
  • (Winnow, Littlestone, 1988; variations exist)

36
How to Compare?
  • Generalization:
  • (since the representation is the same)
  • how many examples are needed to get to a given level of accuracy?
  • Efficiency:
  • how long does it take to learn a hypothesis and evaluate it (per example)?
  • Robustness; adaptation to a new domain, ...

37
Sentence Representation
  • S = "I don't know whether to laugh or cry"

- Define a set of features:
  features are relations that hold in the sentence
- Map a sentence to its feature-based representation:
  the feature-based representation will give some of the information in the sentence
- Use this as an example for your algorithm
38
Sentence Representation
  • S = "I don't know whether to laugh or cry"
  • - Define a set of features:
  • features are relations that hold in the sentence
  • - Conceptually, there are two steps in coming up with a feature-based representation:
  • 1. What are the information sources available?
  • Sensors: words, order of words, properties (?) of words
  • 2. What features to construct based on these?
Why needed?
39
Embedding
The new discriminator is functionally simpler.
40
Domain Characteristics
  • The number of potential features is very large
  • The instance space is sparse
  • Decisions depend on a small set of features (sparse)
  • We want to learn from a number of examples that is small relative to the dimensionality

41
Which Algorithm to Choose?
  • Generalization:
  • Multiplicative algorithms:
  • bounds depend on ||u||, the separating hyperplane:
  • M = 2 ln n · ||u||_1² · max_i ||x^(i)||_∞² / min_i (u · x^(i))²
  • advantage with few relevant features in the concept
  • Additive algorithms:
  • bounds depend on ||x|| (Kivinen / Warmuth, 95):
  • M = ||u||_2² · max_i ||x^(i)||_2² / min_i (u · x^(i))²
  • advantage with few active features per example

The l_1 norm: ||x||_1 = Σ_i |x_i|.  The l_2 norm: ||x||_2 = (Σ_{i=1..n} x_i²)^(1/2).
The l_p norm: ||x||_p = (Σ_{i=1..n} |x_i|^p)^(1/p).  The l_∞ norm: ||x||_∞ = max_i |x_i|.
42
Generalization
  • Dominated by the sparseness of the function space
  • Most features are irrelevant
  • The # of examples required by multiplicative algorithms
  • depends mostly on the # of relevant features
  • (generalization bounds depend on ||w||)
  • Lesser issue: sparseness of the feature space:
  • advantage to additive algorithms; generalization depends on ||x||
  • (Kivinen/Warmuth 95); see additional notes.

43
Mistakes bounds for 10 of 100 of n
Function: at least 10 out of a fixed 100 variables are active; dimensionality is n.
[Figure: # of mistakes to convergence vs. n, the total # of variables (dimensionality), for Perceptron/SVMs and for Winnow]
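A small experiment in the spirit of this slide (entirely my own setup, not the one behind the original figure): count on-line mistakes of Perceptron and Winnow on a 3-disjunction as the dimensionality n grows. Winnow's mistakes are bounded by O(k log n), so they stay small; the Perceptron's bound grows with n.

```python
# Compare on-line mistakes of Perceptron and Winnow on a k-disjunction.
import random

random.seed(1)

def count_mistakes(update, predict, w, examples, epochs=30):
    """Run an on-line learner and count its prediction mistakes."""
    m = 0
    for _ in range(epochs):
        for x, y in examples:
            if predict(w, x) != y:
                m += 1
                w = update(w, x, y)
    return m

def make_data(n, k=3, size=300):
    """Random Boolean examples labeled by the disjunction x1 v ... v xk."""
    return [(x, 1 if any(x[:k]) else -1)
            for x in (tuple(random.randint(0, 1) for _ in range(n))
                      for _ in range(size))]

for n in (50, 200):
    data = make_data(n)
    # Perceptron, with an always-on bias feature appended.
    pdata = [(x + (1,), y) for x, y in data]
    pm = count_mistakes(
        lambda w, x, y: [wi + y * xi for wi, xi in zip(w, x)],
        lambda w, x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1,
        [0.0] * (n + 1), pdata)
    # Winnow: weights start at 1, threshold n, promote x2 / demote /2.
    wm = count_mistakes(
        lambda w, x, y: [wi * (2.0 if y == 1 else 0.5) if xi else wi
                         for wi, xi in zip(w, x)],
        lambda w, x: 1 if sum(wi for wi, xi in zip(w, x) if xi) >= len(w)
                     else -1,
        [1.0] * n, data)
    print(n, pm, wm)
```

For n = 200 the bound of the previous slides gives u + v < 3 · k log(2n) + 2, which is well under 100 mistakes for Winnow here.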
44
Dual Perceptron
  • We can replace xi · xj with K(xi, xj), which can be regarded as a dot product in some large (or infinite-dimensional) space
  • K(x, y) can often be computed efficiently without computing the mapping to this space
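A minimal sketch of the dual (kernel) Perceptron: keep one coefficient per training example and predict with kernel evaluations instead of an explicit weight vector. The quadratic kernel used here is one common choice for illustration, not something the slide mandates, and the function names are mine.

```python
# Dual (kernel) Perceptron: alpha[i] counts mistakes on example i.

def k_quad(x, z):
    """K(x, z) = (1 + x . z)^2: a dot product in a higher-dimensional space."""
    return (1 + sum(xi * zi for xi, zi in zip(x, z))) ** 2

def dual_predict(alpha, data, x):
    s = sum(a * y * k_quad(xi, x) for a, (xi, y) in zip(alpha, data))
    return 1 if s > 0 else -1

def train_dual(data, epochs=20):
    alpha = [0] * len(data)
    for _ in range(epochs):
        for i, (x, y) in enumerate(data):
            if dual_predict(alpha, data, x) != y:
                alpha[i] += 1          # dual form of w += y * phi(x)
    return alpha

# XOR, unlearnable by the primal Perceptron, is learnable with this kernel.
xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
alpha = train_dual(xor)
print(all(dual_predict(alpha, xor, x) == y for x, y in xor))  # True
```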

45
Efficiency
  • Dominated by the size of the feature space
  • Most features are functions (e.g., conjunctions) of raw attributes
  • Additive algorithms allow the use of Kernels:
  • no need to explicitly generate the complex features;
  • could be more efficient, since work is done in the original feature space.
  • In practice, explicit kernels (blowing up the feature space) are often more efficient.

46
Practical Issues and Extensions
  • There are many extensions that can be made to these basic algorithms.
  • Some are necessary for them to perform well:
  • Infinite attribute domain
  • Regularization

47
Extensions Regularization
  • In general, regularization is used to bias the learner toward a low-expressivity (low VC-dimension) separator
  • Thick Separator (Perceptron or Winnow):
  • Promote if
  • w · x > θ + γ
  • Demote if
  • w · x < θ - γ

[Figure: positive and negative points with a thick separator: a margin band of width γ around w · x = θ, above w · x = 0]
48
Regularization Via Averaged Perceptron
  • An Averaged Perceptron algorithm is motivated by the
  • Mistake-Bound to PAC conversion.
  • To convert:
  • Wait for a long stretch without mistakes (there must be one)
  • Use the hypothesis at the end of this stretch.
  • Its PAC behavior is relative to the length of the stretch.
  • Averaged Perceptron:
  • returns a weighted average of a number of earlier hypotheses;
  • the weights are a function of the length of the no-mistake stretch.

The two most important extensions for Winnow/Perceptron turn out to be the Thick Separator and the Averaged Perceptron.
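A minimal sketch of the Averaged Perceptron. Accumulating the current weight vector after every example weights each hypothesis by how long it survives without a mistake, which matches the slide's description; the function name and toy data are mine.

```python
# Averaged Perceptron: return the running average of all weight vectors.

def averaged_perceptron(examples, epochs=10):
    n = len(examples[0][0])
    w = [0.0] * n          # current (mistake-driven) hypothesis
    avg = [0.0] * n        # sum of all hypotheses seen, one per example
    count = 0
    for _ in range(epochs):
        for x, y in examples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y * s <= 0:                              # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
            # a hypothesis that survives longer contributes more terms
            avg = [ai + wi for ai, wi in zip(avg, w)]
            count += 1
    return [ai / count for ai in avg]

# Toy disjunction data with an always-on bias feature.
data = [((1, 0, 1), 1), ((0, 1, 1), 1), ((0, 0, 1), -1)]
w = averaged_perceptron(data)
print(all((1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) == y
          for x, y in data))  # True
```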
49
SNoW
  • A learning architecture that supports several linear update rules (Winnow, Perceptron, naïve Bayes)
  • Allows regularization; voted Winnow/Perceptron; pruning; many options
  • True multi-class classification
  • Variable-size examples; very good support for large-scale domains in terms of the number of examples and the number of features.
  • Explicit kernels (blowing up the feature space).
  • Very efficient (1-2 orders of magnitude faster than SVMs)
  • Stands alone; implemented in LBJ
  • Download from http://L2R.cs.uiuc.edu/cogcomp

50
COLT approach to explaining Learning
  • No distributional assumption
  • Training distribution is the same as the test distribution
  • Generalization bounds depend on this view and affect model selection:
  • Err_D(h) < Err_TR(h) + P(VC(H), log(1/δ), 1/m)
  • This is also called the Structural Risk Minimization principle.

51
COLT approach to explaining Learning
  • No distributional assumption
  • Training distribution is the same as the test distribution
  • Generalization bounds depend on this view and affect model selection:
  • Err_D(h) < Err_TR(h) + P(VC(H), log(1/δ), 1/m)
  • As presented, the VC dimension is a combinatorial parameter associated with a class of functions.
  • We know that the class of linear functions has a lower VC dimension than the class of quadratic functions.
  • But this notion can be refined to depend on a given data set, and in this way directly affect the hypothesis chosen for that data set.

52
Data Dependent VC dimension
  • Consider the class of linear functions, parameterized by their margin.
  • Although both classifiers separate the data, the distance with which the separation is achieved is different.
  • Intuitively, we can agree that Large Margin ⇒ Small VC dimension.

53
Margin and VC dimension
54
Margin and VC dimension
  • Theorem (Vapnik): If H is the space of all linear classifiers in R^n that separate the training data with margin at least γ, then
  • VC(H) ≤ R² / γ²
  • where R is the radius of the smallest sphere (in R^n) that contains the data.
  • This is the first observation that will lead to an algorithmic approach.
  • The second one is that
  • Small ||w|| ⇒ Large Margin
  • Consequently, the algorithm will be: from among all those w's that agree with the data, find the one with the minimal size ||w||.

55
Margin and Weight Vector
  • Consequently, the algorithm will be: from among all those w's that agree with the data, find the one with the minimal size ||w||. This leads to the SVM optimization algorithm.

56
Key Problems
  • Computational issues:
  • A lot of effort has been spent on trying to optimize SVMs.
  • Gradually, algorithms became more on-line and more similar to Perceptron and Stochastic Gradient Descent.
  • Algorithms like SMO have decomposed the quadratic programming problem.
  • More recent algorithms have become almost identical to earlier algorithms we have seen.
  • Is it really optimal?
  • Experimental results are very good.
  • Issues with the tradeoff between the # of examples and the # of features are similar to other linear classifiers.

57
Support Vector Machines
  • SVM = Linear Classifier + Regularization + Kernel Trick.
  • This leads to an algorithm: from among all those w's that agree with the data, find the one with the minimal size ||w||:
  • Minimize ½ ||w||²
  • subject to y_i (w · x_i + b) ≥ 1, for all x_i ∈ S
  • This is an optimization problem that can be solved using techniques from optimization theory. By introducing Lagrange multipliers α, we can write the dual formulation of this optimization problem as
  • w = Σ_i α_i y_i x_i
  • where the α's are such that the following functional is maximized:
  • L(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
  • The optimum setting of the α's turns out to satisfy
  • α_i [y_i (w · x_i + b) - 1] = 0  ∀i

58
Support Vector Machines
  • SVM = Linear Classifier + Regularization + Kernel Trick.
  • Minimize ½ ||w||²
  • subject to y_i (w · x_i + b) ≥ 1, for all x_i ∈ S
  • The dual formulation of this optimization problem gives
  • w = Σ_i α_i y_i x_i
  • Optimum setting of the α's: α_i [y_i (w · x_i + b) - 1] = 0  ∀i
  • That is, α_i ≠ 0 only when y_i (w · x_i + b) - 1 = 0.
  • Those are the points sitting on the margin, called support vectors.
  • We get:
  • f(x, w, b) = w · x + b = Σ_i α_i y_i (x_i · x) + b
  • The value of the function depends on the support vectors, and only on their dot product with the point of interest x.
  • The dependence on dot products leads to the ability to introduce kernels (just like in the Perceptron).
  • What if the data is not linearly separable?
  • What is the difference from regularized Perceptron/Winnow?
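The slides derive the dual QP; as an illustration of the slide's remark that SVM solvers have drifted toward Perceptron-like stochastic gradient methods, here is a simple primal sub-gradient sketch in the spirit of Pegasos. It is my own sketch, not the dual solver above, and the toy data and names are assumptions.

```python
# Pegasos-style SGD on the primal SVM objective:
#   lam/2 ||w||^2 + average hinge loss.
import random

def svm_sgd(data, lam=0.01, epochs=500, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # one shuffled pass
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            s = sum(wi * xi for wi, xi in zip(w, x))
            if y * s < 1:  # margin violation: hinge sub-gradient is -y x
                w = [(1 - eta * lam) * wi + eta * y * xi
                     for wi, xi in zip(w, x)]
            else:          # only the regularizer contributes
                w = [(1 - eta * lam) * wi for wi in w]
    return w

# Separable toy data; the third coordinate is an always-on bias feature.
data = [((2, 2, 1), 1), ((2, 0, 1), 1), ((-2, -2, 1), -1), ((0, -2, 1), -1)]
w = svm_sgd(data)
print([sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, _ in data])
```

Note the update is almost a Perceptron step plus weight decay, which is exactly the convergence of methods the slide points out.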

59
Summary
  • Described examples of linear algorithms
  • Perceptron, Winnow, SVM
  • Additive vs. Multiplicative versions
  • Basic theory behind these methods
  • Robust modifications