Title: CS546: Machine Learning and Natural Language. Lecture 7: Introduction to Classification: Linear Learning Algorithms
1. CS546 Machine Learning and Natural Language. Lecture 7: Introduction to Classification: Linear Learning Algorithms (2009)
2. Linear Functions
- y = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)   (exclusive-or)
- y = (x1 ∧ x2) ∨ (x3 ∧ x4)   (non-trivial DNF)
3. Linear Functions
[Figure: a linear threshold function; the separating hyperplane is w·x = θ, or w·x = 0 once the threshold is folded into the weights.]
4. Perceptron Learning Rule
- Online, mistake-driven algorithm.
- Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
- Perceptron = Linear Threshold Unit.
5. Perceptron Learning Rule
- We learn f: X → {−1, +1}, represented as f = sgn(w·x)
- where X = {0,1}^n or X = R^n, and w ∈ R^n.
6. Footnote About the Threshold
- On the previous slide, the Perceptron has no threshold.
- But we don't lose generality: a threshold can be simulated by adding a constant feature x0 = 1 and learning its weight.
7. Geometric View
8-10. [Image-only slides: geometric view of the Perceptron update.]
11. Perceptron Learning Rule
- If x is Boolean, only weights of active features are updated.
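The rule can be sketched as follows, a minimal sketch assuming labels in {−1, +1} and a fixed learning rate (names such as `rate` are illustrative, not from the lecture): on a mistake, w ← w + rate·y·x, so for Boolean x only the active features change.

```python
# A minimal Perceptron sketch (parameter names are illustrative).
def perceptron(examples, n, rate=1.0, epochs=10):
    """examples: list of (x, y) with x a length-n list and y in {-1, +1}."""
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if pred != y:  # mistake-driven: update only on errors
                for i in range(n):
                    w[i] += rate * y * x[i]  # Boolean x: only active features change
    return w
```

With a constant bias feature x0 = 1, this converges on any linearly separable Boolean function, e.g. a disjunction.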
12. Perceptron Learnability
- Obviously, it can't learn what it can't represent: only linearly separable functions.
- Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations:
  - Parity functions can't be learned (XOR).
  - In vision, if patterns are represented with local features, it can't represent symmetry or connectivity.
- Research on Neural Networks stopped for years.
- Rosenblatt himself (1959) asked: "What pattern recognition problems can be transformed so as to become linearly separable?"
13. (x1 ∧ x2) ∨ (x3 ∧ x4)
- With new features y1 = x1 ∧ x2 and y2 = x3 ∧ x4, the function becomes the linearly separable disjunction y1 ∨ y2.
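A tiny sketch of this transformation (feature names y1, y2 as on the slide; helper names are illustrative): after mapping x through the new features, a linear threshold expresses the original DNF.

```python
# Sketch: mapping x -> (y1, y2) with y1 = x1 AND x2, y2 = x3 AND x4
# makes (x1 ^ x2) v (x3 ^ x4) expressible as the linear threshold y1 + y2 >= 1.
def phi(x):
    """x = (x1, x2, x3, x4) in {0,1}^4; returns the new features (y1, y2)."""
    x1, x2, x3, x4 = x
    return (x1 & x2, x3 & x4)

def target(x):
    """The original DNF over the raw variables."""
    x1, x2, x3, x4 = x
    return 1 if (x1 and x2) or (x3 and x4) else 0

def linear_over_phi(x):
    """A linear threshold function in the new feature space."""
    y1, y2 = phi(x)
    return 1 if y1 + y2 >= 1 else 0
```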
14. Perceptron Convergence
- Perceptron Convergence Theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge.
  - How long will it take to converge?
- Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
  - How to provide robustness, more expressivity?
15. Perceptron Mistake Bound Theorem
- The Perceptron maintains a weight vector w ∈ R^N, initialized to w0 = (0, …, 0).
- Upon receiving an example x ∈ R^N, it predicts according to the linear threshold function w·x ≥ 0.
- Theorem (Novikoff, 1963): Let (x1, y1), …, (xt, yt) be a sequence of labeled examples with xi ∈ R^N, ||xi|| ≤ R, and yi ∈ {−1, 1} for all i. Let u ∈ R^N, γ > 0 be such that ||u|| = 1 and yi (u · xi) ≥ γ for all i. Then the Perceptron makes at most R² / γ² mistakes on this example sequence. (See additional notes.)
- γ is the margin complexity parameter.
16. Perceptron Mistake Bound: Proof
Let vk be the hypothesis before the k-th mistake, and assume the k-th mistake occurs on input example (xi, yi).
Assumptions: v1 = 0, ||u|| = 1, yi (u · xi) ≥ γ.
- On a mistake, the update is v(k+1) = vk + yi xi.
- Multiply by u: u · v(k+1) = u · vk + yi (u · xi) ≥ u · vk + γ, by the definition of u.
- By induction: u · v(k+1) ≥ k γ.
- Projection: ||v(k+1)|| ≥ u · v(k+1) ≥ k γ, since ||u|| = 1.
- Norm growth: ||v(k+1)||² = ||vk||² + 2 yi (vk · xi) + ||xi||² ≤ ||vk||² + R², since yi (vk · xi) ≤ 0 on a mistake; by induction, ||v(k+1)||² ≤ k R².
- Combining: k² γ² ≤ ||v(k+1)||² ≤ k R², hence k ≤ R² / γ².
17. Mistake Bound and PAC
- We discussed the theory of generalization in terms of ε and δ: the Probably Approximately Correct (PAC) theory.
- Why are we talking now about Mistake Bound? What's the relation? Which is weaker?
  - In the mistake bound model we don't know when we will make the mistakes.
  - In the PAC model we want dependence on the number of examples seen, not the number of mistakes.
- The key reasons we talk in terms of Mistake Bound:
  - It's easier to think about it this way.
  - Every Mistake-Bound algorithm can be converted efficiently to a PAC algorithm. To convert:
    - Wait for a long stretch without mistakes (there must be one).
    - Use the hypothesis at the end of this stretch. Its PAC behavior is relative to the length of the stretch.
- The Averaged Perceptron is doing basically that.
18. Perceptron for Boolean Functions
- How many mistakes will the Perceptron algorithm make when learning a k-disjunction?
  - It can make O(n) mistakes on a k-disjunction over n attributes.
- Our bound: R² / γ²
  - w: 1/k^(1/2) for the k relevant components, 0 for the others.
  - γ: a difference in only one variable gives γ = 1/k^(1/2).
  - R = n^(1/2).
  - Thus, we get R² / γ² = n·k.
- Is it possible to do better?
  - This is important if n, the number of features, is very large.
19. Winnow Algorithm
- The Winnow Algorithm learns Linear Threshold Functions.
- For the class of disjunctions, instead of demotion we can use elimination.
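A minimal Winnow sketch, assuming Boolean features, labels in {0, 1}, threshold θ = n, and promotion/demotion factor 2 (function and parameter names are illustrative, not from the lecture):

```python
# A minimal Winnow sketch for Boolean inputs (names illustrative).
def winnow(examples, n, epochs=10):
    """examples: list of (x, y) with x in {0,1}^n and y in {0, 1}; threshold = n."""
    w = [1.0] * n
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
            if pred == 0 and y == 1:    # mistake on a positive: promote active weights
                w = [wi * 2 if xi else wi for wi, xi in zip(w, x)]
            elif pred == 1 and y == 0:  # mistake on a negative: demote active weights
                w = [wi / 2 if xi else wi for wi, xi in zip(w, x)]
    return w
```

For disjunctions, the demotion step `wi / 2` can be replaced by elimination (`wi = 0`), as the slide notes.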
20. Winnow: Example
21. Winnow: Example
- Notice that the same algorithm will learn a conjunction over these variables (w = (256, 256, 0, 32, 256, 256)).
22. Winnow: Mistake Bound
Claim: Winnow makes O(k log n) mistakes on k-disjunctions.
- u = # of mistakes on positive examples (promotions)
- v = # of mistakes on negative examples (demotions)

23-24. Winnow: Mistake Bound
1. u < k log(2n): a weight that corresponds to a good (relevant) variable is only promoted; once these weights reach n, there will be no more mistakes on positives.

25-29. Winnow: Mistake Bound
2. v < 2(u + 1):
- Total weight: TW = n initially.
- Mistake on a positive example: TW(t+1) < TW(t) + n.
- Mistake on a negative example: TW(t+1) < TW(t) − n/2.
- Since 0 < TW at all times: 0 < n + u·n − v·n/2, hence v < 2(u + 1).

30. Winnow: Mistake Bound
Total # of mistakes = u + v < 3u + 2 = O(k log n).
31. Winnow: Extensions
- This algorithm learns monotone functions (in the Boolean algebra sense).
- For the general case:
  - Duplicate variables: for the negation of variable x, introduce a new variable y; learn monotone functions over 2n variables.
  - Balanced version: keep two weights for each variable; the effective weight is the difference.
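The duplicate-variables trick can be sketched in one line (the helper name is illustrative): each input is extended with the complements of its bits, and Winnow then runs unchanged over the 2n monotone variables.

```python
# Sketch of the "duplicate variables" trick for handling negations:
# map x in {0,1}^n to {0,1}^(2n) by appending the complemented bits.
def with_negations(x):
    """Returns the original bits followed by their complements 1 - x_i."""
    return list(x) + [1 - xi for xi in x]
```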
32. Winnow: A Robust Variation
- Winnow is robust in the presence of various kinds of noise (classification noise, attribute noise).
- Importance: sometimes we learn under some distribution but test under a slightly different one (e.g., natural language applications).
33. Winnow: A Robust Variation
- Modeling:
  - Adversary's turn: may change the target concept by adding or removing some variable from the target disjunction. The cost of each addition move is 1.
  - Learner's turn: makes a prediction on the examples given, and is then told the correct answer (according to the current target function).
- Winnow-R: same as Winnow, only it doesn't let weights go below 1/2.
- Claim: Winnow-R makes O(c log n) mistakes, where c is the cost of the adversary (a generalization of the previous claim).
34. Algorithmic Approaches
- Focus: two families of algorithms (we look at one online representative of each).
- Which algorithm to choose?

35. Algorithm Descriptions
- Multiplicative weight-update algorithm (Winnow, Littlestone 1988; variations exist).
- Additive weight-update algorithm (Perceptron; variations exist).
36. How to Compare?
- Generalization (since the representation is the same): how many examples are needed to get to a given level of accuracy?
- Efficiency: how long does it take to learn a hypothesis and evaluate it (per example)?
- Robustness; adaptation to a new domain, ...
37. Sentence Representation
- S = "I don't know whether to laugh or cry"
- Define a set of features: features are relations that hold in the sentence.
- Map a sentence to its feature-based representation; this representation will give some of the information in the sentence.
- Use this as an example for your algorithm.

38. Sentence Representation
- S = "I don't know whether to laugh or cry"
- Define a set of features: features are relations that hold in the sentence.
- Conceptually, there are two steps in coming up with a feature-based representation:
  1. What are the information sources available? Sensors: words, order of words, properties (?) of words.
  2. What features to construct based on these? Why needed?
39. Embedding
- The new discriminator is functionally simpler.
40. Domain Characteristics
- The number of potential features is very large.
- The instance space is sparse.
- Decisions depend on a small set of features (the function is sparse).
- Want to learn from a number of examples that is small relative to the dimensionality.
41. Which Algorithm to Choose?
- Generalization:
  - Multiplicative algorithms: bounds depend on u, the separating hyperplane; M = 2 ln n · ||u||_1² · max_i ||x^(i)||_∞² / min_i (u · x^(i))². Advantage with few relevant features in the concept.
  - Additive algorithms: bounds depend on ||x|| (Kivinen / Warmuth, 1995); M = ||u||_2² · max_i ||x^(i)||_2² / min_i (u · x^(i))². Advantage with few active features per example.
- Norms: the l1 norm ||x||_1 = Σ_i |x_i|; the l2 norm ||x||_2 = (Σ_i x_i²)^(1/2); the lp norm ||x||_p = (Σ_i |x_i|^p)^(1/p); the l∞ norm ||x||_∞ = max_i |x_i|.
42. Generalization
- Dominated by the sparseness of the function space: most features are irrelevant.
- The number of examples required by multiplicative algorithms depends mostly on the number of relevant features (generalization bounds depend on ||w||).
- Lesser issue: sparseness of the feature space gives an advantage to additive algorithms; generalization depends on ||x|| (Kivinen/Warmuth, 1995). See additional notes.
43. Mistake bounds for "10 out of 100 of n"
- Function: at least 10 out of a fixed 100 variables are active; dimensionality is n.
- [Figure: # of mistakes to convergence as a function of n, the total # of variables (dimensionality), for Perceptron/SVMs and for Winnow.]
44. Dual Perceptron
- We can replace xi · xj with K(xi, xj), which can be regarded as a dot product in some large (or infinite) space.
- K(x, y) can often be computed efficiently without computing the mapping to this space.
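The dual view can be sketched directly: the weight vector is kept implicitly as per-example mistake counts α_j, and prediction uses only kernel evaluations. A minimal sketch (function names are illustrative, not from the lecture); a degree-2 polynomial kernel makes XOR learnable:

```python
# A minimal dual (kernel) Perceptron sketch (names illustrative).
def kernel_perceptron(examples, kernel, epochs=10):
    """examples: list of (x, y) with y in {-1, +1}. Returns mistake counts and a predictor."""
    alpha = [0] * len(examples)  # alpha[j]: number of mistakes made on example j
    def predict(x):
        # Implicitly w = sum_j alpha_j y_j phi(x_j); only kernel values are needed.
        s = sum(a * yj * kernel(xj, x)
                for a, (xj, yj) in zip(alpha, examples))
        return 1 if s >= 0 else -1
    for _ in range(epochs):
        for j, (x, y) in enumerate(examples):
            if predict(x) != y:
                alpha[j] += 1  # dual update: add y_j phi(x_j) to the implicit w
    return alpha, predict

# A degree-2 polynomial kernel: dot product in a quadratic feature space.
def quad_kernel(a, b):
    return (1 + sum(ai * bi for ai, bi in zip(a, b))) ** 2
```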
45. Efficiency
- Dominated by the size of the feature space.
- Most features are functions (e.g., conjunctions) of raw attributes.
- Additive algorithms allow the use of kernels: no need to explicitly generate the complex features. This could be more efficient, since the work is done in the original feature space.
- In practice, explicit kernels (blowing up the feature space) are often more efficient.
46. Practical Issues and Extensions
- There are many extensions that can be made to these basic algorithms; some are necessary for them to perform well.
- Infinite attribute domain.
- Regularization.
47. Extensions: Regularization
- In general, regularization is used to bias the learner in the direction of a low-expressivity (low VC dimension) separator.
- Thick Separator (Perceptron or Winnow):
  - Promote if w·x > θ + γ
  - Demote if w·x < θ − γ
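The thick-separator rule can be sketched for the Perceptron case, a minimal sketch with the threshold folded into w and an illustrative margin parameter `gamma` (names are not from the lecture): update unless the example is classified correctly with margin greater than γ.

```python
# A thick-separator Perceptron sketch (gamma and other names illustrative).
def thick_perceptron(examples, n, gamma=0.5, rate=1.0, epochs=20):
    """Update whenever an example falls inside the thick band or is misclassified."""
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= gamma:  # inside the band, or wrong side: update
                for i in range(n):
                    w[i] += rate * y * x[i]
    return w
```

After convergence on separable data, every training example is classified with margin greater than `gamma`.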
[Figure: the separator w·x = θ (equivalently w·x = 0 with the threshold folded in), surrounded by a thick margin band of width γ.]
48. Regularization via Averaged Perceptron
- An Averaged Perceptron algorithm is motivated by the Mistake Bound to PAC conversion. To convert:
  - Wait for a long stretch without mistakes (there must be one).
  - Use the hypothesis at the end of this stretch. Its PAC behavior is relative to the length of the stretch.
- Averaged Perceptron:
  - Returns a weighted average of a number of earlier hypotheses.
  - The weights are a function of the length of the no-mistakes stretch.
- The two most important extensions for Winnow/Perceptron turn out to be the Thick Separator and the Averaged Perceptron.
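A common way to implement this can be sketched as follows (a minimal sketch; names are illustrative): accumulate the current hypothesis at every step, so each intermediate hypothesis contributes in proportion to how long it survived without a mistake.

```python
# An Averaged Perceptron sketch: each intermediate hypothesis is weighted
# by the length of its no-mistake stretch (names illustrative).
def averaged_perceptron(examples, n, rate=1.0, epochs=10):
    w = [0.0] * n    # current hypothesis
    avg = [0.0] * n  # running sum of hypotheses, one copy per step
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if pred != y:
                for i in range(n):
                    w[i] += rate * y * x[i]
            for i in range(n):  # surviving longer => more weight in the average
                avg[i] += w[i]
    return avg
```

The returned `avg` is used for prediction in place of the final `w`.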
49. SNoW
- A learning architecture that supports several linear update rules (Winnow, Perceptron, naïve Bayes).
- Allows regularization; voted Winnow/Perceptron; pruning; many options.
- True multiclass classification.
- Variable-size examples; very good support for large-scale domains in terms of number of examples and number of features.
- Explicit kernels (blowing up the feature space).
- Very efficient (1-2 orders of magnitude faster than SVMs).
- Stand-alone, implemented in LBJ.
- Download from http://L2R.cs.uiuc.edu/cogcomp
50. COLT Approach to Explaining Learning
- No distributional assumption.
- The training distribution is the same as the test distribution.
- Generalization bounds depend on this view and affect model selection:
  - ErrD(h) < ErrTR(h) + P(VC(H), log(1/δ), 1/m)
- This is also called the Structural Risk Minimization principle.
51. COLT Approach to Explaining Learning
- No distributional assumption; the training distribution is the same as the test distribution.
- Generalization bounds depend on this view and affect model selection: ErrD(h) < ErrTR(h) + P(VC(H), log(1/δ), 1/m)
- As presented, the VC dimension is a combinatorial parameter that is associated with a class of functions.
- We know that the class of linear functions has a lower VC dimension than the class of quadratic functions.
- But this notion can be refined to depend on a given data set, and in this way directly affect the hypothesis chosen for that data set.
52. Data-Dependent VC Dimension
- Consider the class of linear functions, parameterized by their margin.
- Although both classifiers separate the data, the distance with which the separation is achieved is different.
- Intuitively, we can agree that Large Margin ⇒ Small VC dimension.
53. Margin and VC Dimension
54. Margin and VC Dimension
- Theorem (Vapnik): If H is the space of all linear classifiers in R^n that separate the training data with margin at least γ, then VC(H) ≤ R²/γ², where R is the radius of the smallest sphere (in R^n) that contains the data.
- This is the first observation that will lead to an algorithmic approach. The second one is:
  - Small ||w|| ⇒ Large Margin.
- Consequently, the algorithm will be: from among all those w's that agree with the data, find the one with the minimal size ||w||.
55. Margin and Weight Vector
- Consequently, the algorithm will be: from among all those w's that agree with the data, find the one with the minimal size ||w||. This leads to the SVM optimization algorithm.
56. Key Problems
- Computational issues:
  - A lot of effort has been spent on trying to optimize SVMs.
  - Gradually, algorithms became more online and more similar to Perceptron and Stochastic Gradient Descent.
  - Algorithms like SMO have decomposed the quadratic programming problem.
  - More recent algorithms have become almost identical to the earlier algorithms we have seen.
- Is it really optimal?
  - Experimental results are very good.
  - Issues with the tradeoff between the number of examples and the number of features are similar to other linear classifiers.
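The Perceptron-like online SVM solvers mentioned above can be sketched as stochastic subgradient descent on the regularized hinge loss (in the spirit of Pegasos-style solvers; the hyperparameters `lam`, `rate`, and `epochs` are illustrative, not values from the lecture):

```python
# A sketch of subgradient descent on lam/2 ||w||^2 + average hinge loss
# (hyperparameter values are illustrative, not from the lecture).
def svm_sgd(examples, n, lam=0.01, rate=0.1, epochs=200):
    """examples: list of (x, y) with y in {-1, +1}; bias folded into x."""
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] -= rate * lam * w[i]    # regularization (shrink) step
                if margin < 1:
                    w[i] += rate * y * x[i]  # hinge-loss subgradient step
    return w
```

Note how close this is to the thick-separator Perceptron: the only differences are the shrink step and the fixed margin target of 1.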
57. Support Vector Machines
- SVM = Linear Classifier + Regularization + Kernel Trick.
- This leads to an algorithm: from among all those w's that agree with the data, find the one with the minimal size ||w||:
  - Minimize ½ ||w||²
  - Subject to y(w·x + b) ≥ 1, for all x ∈ S
- This is an optimization problem that can be solved using techniques from optimization theory. By introducing Lagrange multipliers α we can rewrite the dual formulation of this optimization problem as:
  - w = Σ_i α_i y_i x_i
- where the α's are such that the following functional is maximized:
  - L(α) = −½ Σ_i Σ_j α_i α_j (x_i · x_j) y_i y_j + Σ_i α_i
- The optimal setting of the α's turns out to satisfy:
  - α_i [y_i (w · x_i + b) − 1] = 0, ∀i
58. Support Vector Machines
- SVM = Linear Classifier + Regularization + Kernel Trick.
- Minimize ½ ||w||², subject to y(w·x + b) ≥ 1 for all x ∈ S.
- The dual formulation of this optimization problem gives w = Σ_i α_i y_i x_i.
- Optimal setting of the α's: α_i [y_i (w · x_i + b) − 1] = 0, ∀i.
  - That is, α_i > 0 only when y_i (w · x_i + b) − 1 = 0.
  - Those are the points sitting on the margin, called support vectors.
- We get: f(x, w, b) = w·x + b = Σ_i α_i y_i (x_i · x) + b.
- The value of the function depends on the support vectors, and only on their dot product with the point of interest x.
- Dependence on the dot product leads to the ability to introduce kernels (just like in the Perceptron).
- What if the data is not linearly separable?
- What is the difference from regularized Perceptron/Winnow?
59. Summary
- Described examples of linear algorithms: Perceptron, Winnow, SVM.
- Additive vs. multiplicative versions.
- Basic theory behind these methods.
- Robust modifications.