Title: Machine Learning
1Learning
Improving the performance of the agent w.r.t.
the external performance measure
Dimensions What can be learned? Any of
the boxes representing the agents
knowledge action description, effect
probabilities, causal relations in the
world (and the probabilities of
causation), utility models (sort of through
credit assignment), sensor data
interpretation models What feedback is
available? Supervised, unsupervised,
reinforcement learning Credit
assignment problem What prior knowledge is
available?  Tabularasa (agents head is
a blank slate) or preexisting knowledge
2(No Transcript)
3Dimensions of Learning
 Representation of the knowledge
 Degree of Guidance
 Supervised
 Teacher provides training examples solutions
 E.g. Classification
 Unsupervised
 No assistance from teacher
 E.g. Clustering Inducing hidden variables
 Inbetween
 Either feedback is given only for some of the
examples  Semisupervised Learning
 Or feedback is provided after a sequence of
decision are made  Reinforcement Learning
 Degree of Background Knowledge
 Tabula Rasa
 No background knowledge other than the training
examples  Knowledgebased learning
 Examples are interpreted in the context of
existing knowledge  Knowledge Level vs. Speedup Learning
 If you do have background knowledge, then a
question is whether the learned knowledge is
entailed by the background knowledge or not  (Entailment can be logical or probabilistic)
 If it is entailed, then it is called deductive
learning  If it is not entailed, then it is called
inductive learning
4Inductive Learning(Classification Learning)
 Given a set of labeled examples
 Find the rule that underlies the labeling
 (so you can use it to predict future unlabeled
examples)  Tabularasa, fully supervised
 Too hard as given.. Need to constrain the space
of rules  Bias Start with a specific form of hypothesis
space  With a given bias, Inductive learning reduces to
winnowing through the hypotheses spacechecking
to see which of them fit the data best
similar to predicting credit card fraud,
predicting who are likely to respond to junk
mail predicting what items you are likely to
buy
Closely related to Function learning
or curvefitting (regression)
5Inductive Learning(Classification Learning)
 Given a set of labeled training examples
 Find the rule that underlies the labeling
 (so you can use it to predict future unlabeled
examples)  Tabula Rasa, fully supervised
 Qns
 How do we test a learner?
 Can learning ever work?
 How do we compare learners?
similar to predicting credit card fraud,
predicting who are likely to respond to junk
mail predicting what items you are likely to
buy
Closely related to Function learning
or curvefitting (regression)
6Inductive Learning(Classification Learning)
 How are learners tested?
 Performance on the test data (not the training
data)  Performance measured in terms of positive
 (when) Can learning work?
 Training and test examples the same?
 Training and test examples have no connection?
 Training and Test examples from the same
distribution
7Uses different biases in predicting Russels
waiting habbits
Decision Trees Examples are used to Learn
topology Order of questions
Knearest neighbors
If patronsfull and dayFriday then wait
(0.3/0.7) If waitgt60 and Reservationno then
wait (0.4/0.9)
Association rules Examples are used to
Learn support and confidence of
association rules
SVMs
Neural Nets Examples are used to Learn
topology Learn edge weights
Naïve bayes (bayesnet learning) Examples are
used to Learn topology Learn CPTs
8Inductive Learning(Classification Learning)
 Given a set of labeled examples, and a space of
hypotheses  Find the rule that underlies the labeling
 (so you can use it to predict future unlabeled
examples)  Tabularasa, fully supervised
 Idea
 Loop through all hypotheses
 Rank each hypothesis in terms of its match to
data  Pick the best hypothesis
 Main variations
 Bias the sort of rule are you looking for?
 If you are looking for only conjunctive
hypotheses, there are just 3n  Search
 Greedy search
 Decision tree learner
 Systematic search
 Version space learner
 Iterative search
 Neural net learner
It can be shown that sample complexity of PAC
learning is proportional to 1/e, 1/d AND log H
The main problem is that the space of
hypotheses is too large Given examples described
in terms of n boolean variables There are 2
different hypotheses For 6 features, there are
18,446,744,073,709,551,616 hypotheses
2n
9A good hypothesis will have fewest false
positives (Fh) and fewest false negatives
(Fh) Ideally, we want them to be zero Rank(h)
f(Fh, Fh) f depends on the domain
by default fSum but can give different
weights to different errors (Costbased
learning)
False ve The learner classifies the example
as ve, but it is actually ve
Ranking hypotheses
Medical domain Higher cost for F But
also high cost for F Spam Mailer Very low
cost for F higher cost for
F Terrorist/Criminal Identification High
cost for F (for the individual) High cost
for F (for the society)
H1 Russell waits only in italian restaurants
false ves X10, false ves
X1,X3,X4,X8,X12 H2 Russell waits only in cheap
french restaurants False ves False
ves X1,X3,X4,X6,X8,X12
10KNearest Neighbor
 An unseen instances class is determined by its
nearest neighbor  Or the majority label of its nearest k neighbors
 Real Issue Getting the right distance metric to
decide who are the neighbors  One of the most obvious classification algorithms
 Skips the middle stage and lets the examples be
their own pattern  A variation is to cluster the training examples
and remember the prototypes for each cluster
(reduces the number of things remembered)
11A classification learning example Predicting when
Rusell will wait for a table
similar to book preferences, predicting credit
card fraud, predicting when people are likely
to respond to junk mail
1211/28
Final exam options (vote) Option 1. Takehome
(that will be given after reading day
and will be due ¼
10th) Option 2. In class exam (on 10th)
Option 3. No exam. Rao gets rest, and everybody
gets A
 ?Blog questions
 ?Decision Tree learning
 ?Perceptron learning
13What is a reasonable goal in designing a learner?
Complexity measured in number of Samples
required to PAClearn
 (Idea) Learner must classify all new instances
(test cases) correctly always  Any test cases?
 Test cases drawn from the same distribution as
the training cases  Always?
 May be the training samples are not completely
representative of the test samples  So, we go with probably
 Correctly?
 May be impossible if the training data has noise
(the teacher may make mistakes too)  So, we go with approximately
 The goal of a learner then is to produce a
probably approximately correct (PAC) hypothesis,
for a given approximation (error rate) e and
probability d.  When is a learner A better than learner B?
 For the same e,d bounds, A needs fewer trailing
samples than B to reach PAC.
N 1/² ( log 1/ log H)
14PAC learning
Note This result only holds for finite
hypothesis spaces (e.g. not valid for the space
of line hypotheses!)
N 1/² ( log 1/ log H)
15Deriving Sample Complexity for PAC Learning
 IDEA We want to compute the probability that a
bad hypothesis (that makes more than ² error on
the test cases) is chosen for being consistent
with the training examples, and constrain it to
be less than  Probability that an hb 2 Hbad is consistent with
a single training example is (1  ²) (since
error rate of hb gt ²).  This holds ONLY because we assume training and
test instances are drawn with the same
distribution  The probability that it is consistent with all N
training examples is (1²)N  The probability that at least one bad hypothesis
does this is Hbad (1²)N H (1²)N ( since
Hbad H)  We want this probability be less than .
 That is H (1²)N
 Since (1  ²) e² we can have it if H e²N
or N 1/² (log 1/ log H)
hb

hb











16Uses different biases in predicting Russels
waiting habbits
Decision Trees Examples are used to Learn
topology Order of questions
Knearest neighbors
If patronsfull and dayFriday then wait
(0.3/0.7) If waitgt60 and Reservationno then
wait (0.4/0.9)
Association rules Examples are used to
Learn support and confidence of
association rules
SVMs
Neural Nets Examples are used to Learn
topology Learn edge weights
Naïve bayes (bayesnet learning) Examples are
used to Learn topology Learn CPTs
17Inductive Learning(Classification Learning)
 Given a set of labeled examples, and a space of
hypotheses  Find the rule that underlies the labeling
 (so you can use it to predict future unlabeled
examples)  Tabularasa, fully supervised
 Idea
 Loop through all hypotheses
 Rank each hypothesis in terms of its match to
data  Pick the best hypothesis
 Main variations
 Bias the sort of rule are you looking for?
 If you are looking for only conjunctive
hypotheses, there are just 3n  Search
 Greedy search
 Decision tree learner
 Systematic search
 Version space learner
 Iterative search
 Neural net learner
It can be shown that sample complexity of PAC
learning is proportional to 1/e, 1/d AND log H
The main problem is that the space of
hypotheses is too large Given examples described
in terms of n boolean variables There are 2
different hypotheses For 6 features, there are
18,446,744,073,709,551,616 hypotheses
2n
18Learning Decision TreesHow?
Basic Idea Pick an attribute Split
examples in terms of that attribute
If all examples are ve label Yes.
Terminate If all examples are ve
label No. Terminate If some are ve, some
are ve continue splitting
recursively (Special case Decision Stumps If
you dont feel like splitting any further,
return the majority label )
20 Questions AI Style
19Decision Trees Sample Complexity
 Decision Trees can Represent any boolean function
 ..So PAClearning decision trees should be
exponentially hard (since there are 22n
hypotheses)  ..however, decision tree learning algorithms use
greedy approaches for learning a good (rather
than the optimal) decision tree  Thus, using greedy rather than exhaustive search
of hypotheses space is another way of keeping
complexity low (at the expense of losing PAC
guarantees)
20Depending on the order we pick, we can get
smaller or bigger trees
Which tree is better? Why do you think so??
21Basic Idea Pick an attribute Split
examples in terms of that attribute
If all examples are ve label Yes.
Terminate If all examples are ve
label No. Terminate If some are ve, some
are ve continue splitting recursively
if no attributes left to split?
(label with majority element)
22The Information Gain Computation
P N /(NN) P N /(NN) I(P ,, P)
P log(P)  P log(P )
The difference is the information gain So, pick
the feature with the largest Info Gain I.e.
smallest residual info
Given k mutually exclusive and exhaustive events
E1.Ek whose probabilities are p1.pk The
information content (entropy) is defined as
S i pi log2 pi A split is good if it
reduces the entropy..
23The Information Gain Computation
P N /(NN) P N /(NN) I(P ,, P)
P log(P)  P log(P )
The difference is the information gain So, pick
the feature with the largest Info Gain I.e.
smallest residual info
Given k mutually exclusive and exhaustive events
E1.Ek whose probabilities are p1.pk The
information content (entropy) is defined as
S i pi log2 pi A split is good if it
reduces the entropy..
24I(1/2,1/2) 1/2 log 1/2 1/2 log 1/2
1/2 1/2 1 I(1,0) 1log 1 0
log 0 0
A simple example
V(M) 2/4 I(1/2,1/2) 2/4 I(1/2,1/2)
1 V(A) 2/4 I(1,0) 2/4 I(0,1)
0 V(N) 2/4 I(1/2,1/2) 2/4
I(1/2,1/2) 1
So Anxious is the best attribute to split on Once
you split on Anxious, the problem is solved
25(No Transcript)
26mfold crossvalidation Split N examples into
m equal sized parts for i1..m train with
all parts except ith test with the ith part
Evaluating the Decision Trees
Lesson Every bias makes some concepts easier
to learn and others harder to learn
Learning curves Given N examples, partition
them into Ntr the training set and Ntest the test
instances Loop for i1 to Ntr Loop for
Ns in subsets of Ntr of size I Train the
learner over Ns Test the learned
pattern over Ntest and compute the accuracy
(correct)
27Decision Stumps
This was used in the class but the next one is
the correct replacement
 Decision stumps are decision trees where the leaf
nodes do not necessarily have all ve or all ve
training examples  In general, with each leaf node, we can associate
a probability p that if we reach that leaf node,
the example is classified ve  When you reach that node, you toss a biased coin
(whose probability of heads is p and output ve
if the coin comes heads)  In normal decision trees, p is 0 or 1
 In decision stumps, 0 lt p lt 1
Splitting on feature fk
P N1 / N1N1
Majority vote is better than tossing coin
Sometimes, the best decision tree for a problem
could be a decision stump (see coin toss example
next)
28Problems with Info. Gain. Heuristics
 Feature correlation We are splitting on one
feature at a time  The Costanza party problem
 No obvious easy solution
 Overfitting We may look too hard for patterns
where there are none  E.g. Coin tosses classified by the day of the
week, the shirt I was wearing, the time of the
day etc.  Solution Dont consider splitting if the
information gain given by the best feature is
below a minimum threshold  Can use the c2 test for statistical significance
 Will also help when we have noisy samples
 We may prefer features with very high branching
 e.g. Branch on the universal time string for
Russell restaurant example  Branch on social security number to look
for patterns on who will get A  Solution gain ratio ratio of information
gain with the attribute A to the information
content of answering the question What is the
value of A?  The denominator is smaller for attributes with
smaller domains.
29Decision Stumps
 Decision stumps are decision trees where the leaf
nodes do not necessarily have all ve or all ve
training examples  Could happen either because examples are noisy
and misclassified or because you want to stop
before reaching pure leafs  When you reach that node, you return the majority
label as the decision.  (We can associate a confidence with that decision
using the P and P)
Splitting on feature fk
P N1 / N1N1
Sometimes, the best decision tree for a problem
could be a decision stump (see coin toss example
next)
30Bias Learning Accuracy
 Having weak bias (large hypothesis space)
 Allows us to capture more concepts
 ..increases learning cost
 May lead to overfitting
31Tastes Great/Less Filling
 Biases are essential for survival of an agent!
 You must need biases to just make learning
tractable  Whole object bias used by kids in language
acquisition  Biases put blinders on the learnerfiltering away
(possibly more accurate) hypotheses  God doesnt play dice with the universe
(Einstein)  Color of Skin relevant to predicting crime
(Billy BennettFormer Education Secretary)
32Uses different biases in predicting Russels
waiting habbits
Decision Trees Examples are used to Learn
topology Order of questions
If patronsfull and dayFriday then wait
(0.3/0.7) If waitgt60 and Reservationno then
wait (0.4/0.9)
Association rules Examples are used to
Learn support and confidence of
association rules
Neural Nets Examples are used to Learn
topology Learn edge weights
Naïve bayes (bayesnet learning) Examples are
used to Learn topology Learn CPTs
33Mirror, Mirror, on the wall Which learning
bias is the best of all?
Well, there is no such thing, silly! Each
bias makes it easier to learn some patterns and
harder (or impossible) to learn others A
linefitter can fit the best line to the data
very fast but wont know what to do if the data
doesnt fall on a line A curve fitter can
fit lines as well as curves but takes longer
time to fit lines than a line fitter. 
Different types of bias classes (Decision trees,
NNs etc) provide different ways of naturally
carving up the space of all possible
hypotheses So a more reasonable question is 
What is the bias class that has a specialization
corresponding to the type of patterns that
underlie my data? ?Bias can be seen as a
sneaky way of letting background knowledge
in..  In this bias class, what is the most
restrictive bias that still can capture the true
pattern in the data?
Decision trees can capture all boolean
functions but are faster at capturing
conjunctive boolean functions Neural nets can
capture all boolean or realvalued functions
but are faster at capturing linearly separable
functions Bayesian learning can capture all
probabilistic dependencies But are faster at
capturing single level dependencies (naïve bayes
classifier)
34Fitting test cases vs. predicting future
cases The BIG TENSION.
Review
2
1
3
Why not the 3rd?
3512/3 The Last Class??
 ?Fill Return the participation forms
 ?Todays Agenda Perceptrons (until 1130)
 ?Interactive Review
 ?Take Home Final will be delivered by email Wed
 Will be due 12/10
36Uses different biases in predicting Russels
waiting habbits
Decision Trees Examples are used to Learn
topology Order of questions
Knearest neighbors
If patronsfull and dayFriday then wait
(0.3/0.7) If waitgt60 and Reservationno then
wait (0.4/0.9)
Association rules Examples are used to
Learn support and confidence of
association rules
SVMs
Neural Nets Examples are used to Learn
topology Learn edge weights
Naïve bayes (bayesnet learning) Examples are
used to Learn topology Learn CPTs
37Decision Surface Learning(aka Neural Network
Learning)
 Idea Since classification is really a question
of finding a surface to separate the ve examples
from the ve examples, why not directly search in
the space of possible surfaces?  Mathematically, a surface is a function
 Need a way of learning functions
 Threshold units
38Neural Net is a collection of with
interconnections
threshold units
differentiable
39The Brain Connection
A Threshold Unit
Threshold Functions
differentiable
is sort of like a neuron
40Perceptron Networks
What happened to the Threshold? Can model
as an extra weight with static input
41Perceptron Learning
 Perceptron learning algorithm
 Loop through training examples
 If the activation level of the output unit is 1
when it should be 0, reduce the weight on the
link to the jth input unit by aIj, where Ii is
the ith input value and a a learning rate  If the activation level of the output unit is 0
when it should be 1, increase the weight on the
link to the ith input unit by aIj  Otherwise, do nothing
 Until convergence
Iterative search! node gt network weights
goodness gt error Actually a gradient
descent search
A nice applet at
http//neuron.eng.wayne.edu/java/Perceptron/New38.
html
42Perceptron Learning as Gradient Descent Search in
the weightspace
Often a constant learning rate parameter is
used instead
Ij
I
43Perceptron Training in Action
A nice applet at
http//neuron.eng.wayne.edu/java/Perceptron/New38.
html
44Can Perceptrons Learn All Boolean Functions?
Are all boolean functions linearly separable?
45Comparing Perceptrons and Decision Trees in
Majority Function and Russell Domain
Decision Trees
Perceptron
Decision Trees
Perceptron
Majority function
Russell Domain
Majority function is linearly seperable..
Russell domain is apparently not....
Encoding one input unit per attribute. The unit
takes as many distinct real values as the size
of attribute domain
46MaxMargin Classification Support Vector
Machines
 Any line that separates the ve ve examples
is a solution  And perceptron learning finds one of them
 But could we have a preference among these?
 may want to get the line that provides maximum
margin (equidistant from the nearest ve/ve)  The nereast ve and ve holding up the line are
called support vectors  This changes the problem into an optimization
one  Quadratic Programming can be used to directly
find such a line
Learning is Optimization after all!
47Lagrangian Dual
48Two ways to learn nonlinear decision surfaces
 First transform the data into higher dimensional
space  Find a linear surface
 Which is guaranteed to exist
 Transform it back to the original space
 TRICK is to do this without explicitly doing a
transformation
 Learn nonlinear surfaces directly (as
multilayer neural nets)  Trick is to do training efficiently
 Back Propagation to the rescue..
49Linear Separability in High Dimensions
Kernels allow us to consider separating
surfaces in highD without first converting
all points to highD
50Kernelized Support Vector Machines
 Turns out that it is not always necessary to
first map the data into highD, and then do
linear separation  The quadratic programming formulation for SVM
winds up using only the pairwise dot product of
training vectors  Dot product is a form of similarity metric
between points  If you replace that dot product by any nonlinear
function, you will, in essence, be transforming
data into some highdimensional space and then
finding the maxmargin linear classifier in that
space  Which will correspond to some wiggly surface in
the original dimension  The trick is to find the RIGHT similarity
function  Which is a form of prior knowledge
51Kernelized Support Vector Machines
 Turns out that it is not always necessary to
first map the data into highD, and then do
linear separation  The quadratic programming formulation for SVM
winds up using only the pairwise dot product of
training vectors  Dot product is a form of similarity metric
between points  If you replace that dot product by any nonlinear
function, you will, in essence, be tranforming
data into some highdimensional space and then
finding the maxmargin linear classifier in that
space  Which will correspond to some wiggly surface in
the original dimension  The trick is to find the RIGHT similarity
function  Which is a form of prior knowledge
52Domainknowledge Learning
Those who ignore easily available domain
knowledge are doomed to relearn it
Santayanas brother
 Classification learning is a problem addressed by
both people from AI (machine learning) and
Statistics  Statistics folks tend to distrust
domainspecific bias.  Let the data speak for itself
 ..but this is often futile. The very act of
describing the data points introduces bias (in
terms of the features you decided to use to
describe them..)  but much human learning occurs because of strong
domainspecific bias..  Machine learning is torn by these competing
influences..  In most current state of the art algorithms,
domain knowledge is allowed to influence
learning only through relatively narrow
avenues/formats (E.g. through kernels)  Okay in domains where there is very little (if
any) prior knowledge (e.g. what part of proteins
are doing what cellular function)  ..restrictive in domains where there already
exists human expertise..
53Multilayer Neural Nets
How come backprop doesnt get stuck in local
minima? One answer It is actually hard for
local minimas to form in highD, as the
trough has to be closed in all dimensions
54(No Transcript)
55MultiNetwork Learning can learn Russell Domains
Decision Trees
Decision Trees
Multilayer networks
Perceptron
Russell Domain
but does it slowly
56Practical Issues in Multilayer network learning
 For multilayer networks, we need to learn both
the weights and the network topology  Topology is fixed for perceptrons
 If we go with too many layers and connections, we
can get overfitting as well as sloooow
convergence  Optimal brain damage
 Start with more than needed hidden layers as well
as connections after a network is learned,
remove the nodes and connections that have very
low weights retrain
57Humans make 0.2 Neumans (postmen) make 2
Other impressive applications nohands
across america learning to speak
Knearestneighbor The test examples class is
determined by the class of the majority of
its k nearest neighbors Need to define an
appropriate distance measure sort of easy
for real valued vectors harder for
categorical attributes
58Decision Trees vs. Neural Nets
 Can handle realvalued attributes
 Can learn any nonlinear decision surface
 Incremental as new examples arrive, the network
can adapt.  Good at handling noise
 Convergence is quite slow
 Faster at learning linear ones
 Learned concept is represented by the weights and
topology of the network (so hard to understand)  Consider understanding Einstein by dissecting his
brain.  Double edged argumentthere are many learning
tasks for whion ch we do not know how to
articulated what we have learned. Eg. Face
recognition word recognition
 Work well for discrete attributes.
 Converge fast for conjunctive concepts
 Nonincremental (looks at all the examples at
once)  Not very good at handling noise
 Generally good at avoiding irrelevant attributes
 Easy to understand the learned concept
Why is it important to understand what is
learned? The military hidden tank photos
example
59(No Transcript)
60(No Transcript)
61(No Transcript)
62True hypothesis eventually dominates
probability of indefinitely producing
uncharacteristic data ?0
63Bayesian prediction is optimal (Given the
hypothesis prior, all other predictions are
less likely)
64Also, remember the Economist article that shows
that humans have strong priors..
65..note that the Economist article says humans
are able to learn from few examples only because
of priors..
66So, BN learning is just probability estimation!
(as long as data is complete!)
67How Well (and WHY) DOES NBC WORK?
 Naïve bayes classifier is darned easy to
implement  Good learning speed, classification speed
 Modest space storage
 Supports incrementality
 It seems to work very well in many scenarios
 Lots of recommender systems (e.g. Amazon books
recommender) use it  Peter Norvig, the director of Machine Learning at
GOOGLE said, when asked about what sort of
technology they use Naïve bayes  But WHY?
 NBCs estimate of class probability is quite bad
 BUT classification accuracy is different from
probability estimate accuracy  Domingoes/Pazzani 1996 analyze this
68(No Transcript)
69(No Transcript)
70(No Transcript)
71(No Transcript)
72Bayes Network Learning
 Bias The relation between the class label and
class attributes is specified by a Bayes Network.  Approach
 Guess Topology
 Estimate CPTs
 Simplest case Naïve Bayes
 Topology of the network is class label causes
all the attribute values independently  So, all we need to do is estimate CPTs
P(attribClass)  In Russell domain, P(Patronswillwait)
 P(Patronsfullwillwaityes)
 training examples where patronsfull and
will waityes  training examples where will waityes
 Given a new case, we use bayes rule to compute
the class label
Class label is the disease attributes are
symptoms
73Naïve Bayesian Classification
 Problem Classify a given example E into one of
the classes among C1, C2 ,, Cn  E has k attributes A1, A2 ,, Ak and each Ai can
take d different values  Bayes Classification Assign E to class Ci that
maximizes P(Ci E)  P(Ci E) P(E Ci) P(Ci) / P(E)
 P(Ci) and P(E) are a priori knowledge (or can be
easily extracted from the set of data)  Estimating P(ECi) is harder
 Requires P(A1v1 A2v2.AkvkCi)
 Assuming d values per attribute, we will need ndk
probabilities  Naïve Bayes Assumption Assume all attributes are
independent P(E Ci) P P(Aivj Ci )  The assumption is BOGUS, but it seems to WORK
(and needs only ndk probabilities
74NBC in terms of BAYES networks..
NBC assumption
More realistic assumption
75Estimating the probabilities for NBC
 Given an example E described as A1v1
A2v2.Akvk we want to compute the class of E  Calculate P(Ci A1v1 A2v2.Akvk) for all
classes Ci and say that the class of E is the
one for which P(.) is maximum  P(Ci A1v1 A2v2.Akvk)
 P P(vj Ci ) P(Ci) / P(A1v1
A2v2.Akvk)  Given a set of training N examples that have
already been classified into n classes Ci  Let (Ci) be the number of
examples that are labeled as Ci  Let (Ci, Aivi) be the number of
examples labeled as Ci  that have attribute Ai
set to value vj  P(Ci) (Ci)/N
 P(Aivj Ci) (Ci, Aivi) /
(Ci) 
76Example
P(willwaityes) 6/12 .5 P(Patronsfullwillw
aityes) 2/60.333 P(Patronssomewillwaityes
) 4/60.666
Similarly we can show that P(Patronsfullwillw
aitno) 0.6666
P(willwaityesPatronsfull) P(patronsfullwill
waityes) P(willwaityes)


P(Patronsfull)
k
.333.5 P(willwaitnoPatronsfull) k 0.666.5
77Using Mestimates to improve probablity estimates
 The simple frequency based estimation of
P(AivjCk) can be inaccurate, especially when
the true value is close to zero, and the number
of training examples is small (so the probability
that your examples dont contain rare cases is
quite high)  Solution Use Mestimate
 P(Aivj Ci) (Ci, Aivi)
mp / (Ci) m  p is the prior probability of Ai taking the value
vi  If we dont have any background information,
assume uniform probability (that is 1/d if Ai can
take d values)  m is a constantcalled equivalent sample size
 If we believe that our sample set is large
enough, we can keep m small. Otherwise, keep it
large.  Essentially we are augmenting the (Ci) normal
samples with m more virtual samples drawn
according to the prior probability on how Ai
takes values  Popular values p1/V and mV where V is the
size of the vocabulary
Also, to avoid overflow errors do addition of
logarithms of probabilities (instead of
multiplication of probabilities)
78How Well (and WHY) DOES NBC WORK?
 Naïve bayes classifier is darned easy to
implement  Good learning speed, classification speed
 Modest space storage
 Supports incrementality
 It seems to work very well in many scenarios
 Lots of recommender systems (e.g. Amazon books
recommender) use it  Peter Norvig, the director of Machine Learning at
GOOGLE said, when asked about what sort of
technology they use Naïve bayes  But WHY?
 NBCs estimate of class probability is quite bad
 BUT classification accuracy is different from
probability estimate accuracy  Domingoes/Pazzani 1996 analyze this
79Sahami et als Solution for SPAM detection
 use standard Term Vector Space model developed
by Information Retrieval field (similar to
AdEater)  1 email message ? single fixedwidth feature
vector  have 1 bit in this vector for each term that
occurs in some message in E (plus a bunch of
domainspecific featureseg, when message was
sent)  learning algorithm
 use standard Naive Bayes algorithm
80Feature Selection
 A problem  too many features  each vector x
contains several thousand features.  Most come from word features  include a word
if any email contains it (eg, every x contains
an opossum feature even though this word occurs
in only one message).  Slows down learning and predictoins
 May cause lower performance
 The Naïve Bayes Classifier makes a huge
assumption  the independence assumption.  A good strategy is to have few features, to
minimize the chance that the assumption is
violated.  Ideally, discard all features that violate the
assumption. (But if we knew these features, we
wouldnt need to make the naive independence
assumption!)  Feature selection a few thousand ? 500
features
81FeatureSelection approach
 Lots of ways to perform feature selection
 FEATURE SELECTION DIMENSIONALITY REDUCTION
 One simple strategy mutual information
 Suppose we have two random variables A and B.
 Mutual information MI(A,B) is a numeric measure
of what we can conclude about A if we know B, and
viceversa.  MI(A,B) Pr(AB) log(Pr(AB)/(Pr(A)Pr(B)))
 Example If A and B are independent, then we
cant conclude anything MI(A, B) 0  Note that MI can be calculated without needing
conditional probabilities.
82Mutual Information, continued
 Check our intuition independence gt MI(A,B)0
MI(A,B) Pr(AB) log(Pr(AB)/(Pr(A)Pr(B)))
Pr(AB) log(Pr(A)Pr(B)/(Pr(A)Pr(B
))) Pr(AB) log 1
0  Fully correlated, it becomes the information
content  MI(A,A)  Pr(A)log(Pr(A))
 it depends on how uncertain the event is
notice that the expression becomes maximum (1)
when Pr(A).5 this makes sense since the most
uncertain event is one whose probability is .5
(if it is .3 then we know it is likely not to
happen if it is .7 we know it is likely to
happen).
83MI and Feature Selection
 Back to feature selection Pick features Xi that
have high mutual information with the junk/legit
classification C.  These are exactly the features that are good for
prediction  Pick 500 features Xi with highest value MI(Xi, C)
 NOTE NBCs estimate of probabilities is
actually quite a bit wrong but they still got by
with those..  Also, note that this analysis looks at each
feature in isolation and may thus miss highly
predictive word groups whose individual words are
quite nonpredictive  e.g. free and money may have low MI, but
Free money may have higher MI.  A way to handle this is to look at MI of not just
words but subsets of words  (in the worst case, you will need to compute 2n
MIs ?)  So instead, Sahami et. Al. add domain specific
phrases separately..  Note Theres no reason that the highestMI
features are the ones that least violate the
independence assumption  this is just a
heuristic!
84MI based feature selection vs. LSI
 Both MI and LSI are dimensionality reduction
techniques  MI is looking to reduce dimensions by looking at
a subset of the original dimensions  LSI looks instead at a linear combination of the
subset of the original dimensions (Good Can
automatically capture sets of dimensions that are
more predictive. Bad the new features may not
have any significance to the user)  MI does feature selection w.r.t. a classification
task (MI is being computed between a feature and
a class)  LSI does dimensionality reduction independent of
the classes (just looks at data variance)
85Reinforcement Learning
 Based on slides from Bill Smart
 http//www.cse.wustl.edu/wds/
86What is RL?
 a way of programming agents by reward and
punishment without needing to specify how the
task is to be achieved  Kaelbling, Littman, Moore, 96
87Basic RL Model
 Observe state, st
 Decide on an action, at
 Perform action
 Observe new state, st1
 Observe reward, rt1
 Learn from experience
 Repeat
 Goal Find a control policy that will maximize
the observed rewards over the lifetime of the
agent
A
S
R
88An Example Gridworld
 Canonical RL domain
 States are grid cells
 4 actions N, S, E, W
 Reward for entering top right cell
 0.01 for every other move
 Minimizing sum of rewards ? Shortest path
 In this instance
1
89The Promise of Learning
90The Promise of RL
 Specify what to do, but not how to do it
 Through the reward function
 Learning fills in the details
 Better final solutions
 Based of actual experiences, not programmer
assumptions  Less (human) time needed for a good solution
91Learning Value Functions
 We still want to learn a value function
 Were forced to approximate it iteratively
 Based on direct experience of the world
 Four main algorithms
 Certainty equivalence
 Temporal Difference (TD) learning
 Qlearning
 SARSA
92Certainty Equivalence
 Collect experience by moving through the world
 s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4,
s4, a4, r5, s5, ...  Use these to estimate the underlying MDP
 Transition function, T S?A ? S
 Reward function, R S?A?S ? ?
 Compute the optimal value function for this MDP
 And then compute the optimal policy from it
93Temporal Difference (TD)
Sutton, 88
 TDlearning estimates the value function directly
 Dont try to learn the underlying MDP
 Keep an estimate of Vp(s) in a table
 Update these estimates as we gather more
experience  Estimates depend on exploration policy, p
 TD is an onpolicy method
94TDLearning Algorithm
 Initialize Vp(s) to 0, ?s
 Observe state, s
 Perform action, p(s)
 Observe new state, s, and reward, r
 Vp(s) ? (1a)Vp(s) a(r gVp(s))
 Go to 2
 0 a 1 is the learning rate
 How much attention do we pay to new experiences
95TDLearning
 Vp(s) is guaranteed to converge to V(s)
 After an infinite number of experiences
 If we decay the learning rate
 will work
 In practice, we often dont need value
convergence  Policy convergence generally happens sooner
96ActorCritic Methods
Barto, Sutton, Anderson, 83
 TD only evaluates a particular policy
 Does not learn a better policy
 We can change the policy as we learn V
 Policy is the actor
 Valuefunction estimate is the critic
 Success is generally dependent on the starting
policy being good enough
Policy (actor)
a
V
Value Function (critic)
r
s
World
97QLearning
Watkins Dayan, 92
 Qlearning iteratively approximates the
stateaction value function, Q  Again, were not going to estimate the MDP
directly  Learns the value function and policy
simultaneously  Keep an estimate of Q(s, a) in a table
 Update these estimates as we gather more
experience  Estimates do not depend on exploration policy
 Qlearning is an offpolicy method
98QLearning Algorithm
 Initialize Q(s, a) to small random values, ?s, a
 Observe state, s
 Pick an action, a, and do it
 Observe next state, s, and reward, r
 Q(s, a) ? (1  a)Q(s, a) a(r gmaxaQ(s, a))
 Go to 2
 0 a 1 is the learning rate
 We need to decay this, just like TD
99Picking Actions
 We want to pick good actions most of the time,
but also do some exploration  Exploring means that we can learn better policies
 But, we want to balance known good actions with
exploratory ones  This is called the exploration/exploitation
problem
100Picking Actions
 egreedy
 Pick best (greedy) action with probability e
 Otherwise, pick a random action
 Boltzmann (SoftMax)
 Pick an action based on its Qvalue
 , where t is
the temperature
101SARSA
 SARSA iteratively approximates the stateaction
value function, Q  Like Qlearning, SARSA learns the policy and the
value function simultaneously  Keep an estimate of Q(s, a) in a table
 Update these estimates based on experiences
 Estimates depend on the exploration policy
 SARSA is an onpolicy method
 Policy is derived from current value estimates
102SARSA Algorithm
 Initialize Q(s, a) to small random values, ?s, a
 Observe state, s
 Pick an action, a, and do it (just like
Qlearning)  Observe next state, s, and reward, r
 Q(s, a) ? (1a)Q(s, a) a(r gQ(s, p(s)))
 Go to 2
 0 a 1 is the learning rate
 We need to decay this, just like TD
103OnPolicy vs. Off Policy
 Onpolicy algorithms
 Final policy is influenced by the exploration
policy  Generally, the exploration policy needs to be
close to the final policy  Can get stuck in local maxima
 Offpolicy algorithms
 Final policy is independent of exploration policy
 Can use arbitrary exploration policies
 Will not get stuck in local maxima
Given enough experience
104Convergence Guarantees
 The convergence guarantees for RL are in the
limit  The word infinite crops up several times
 Dont let this put you off
 Value convergence is different than policy
convergence  Were more interested in policy convergence
 If one action is really better than the others,
policy convergence will happen relatively quickly
105Rewards
 Rewards measure how well the policy is doing
 Often correspond to events in the world
 Current load on a machine
 Reaching the coffee machine
 Program crashing
 Everything else gets a 0 reward
 Things work better if the rewards are incremental
 For example, distance to goal at each step
 These reward functions are often hard to design
These are sparse rewards
These are dense rewards
106The Markov Property
 RL needs a set of states that are Markov
 Everything you need to know to make a decision is
included in the state  Not allowed to consult the past
 Ruleofthumb
 If you can calculate the reward
function from the state without
any additional information,
youre OK
K
S
G
107But, Whats the Catch?
 RL will solve all of your problems, but
 We need lots of experience to train from
 Taking random actions can be dangerous
 It can take a long time to learn
 Not all problems fit into the MDP framework
108Learning Policies Directly
 An alternative approach to RL is to reward whole
policies, rather than individual actions  Run whole policy, then receive a single reward
 Reward measures success of the whole policy
 If there are a small number of policies, we can
exhaustively try them all  However, this is not possible in most interesting
problems
109Policy Gradient Methods
 Assume that our policy, p, has a set of n
realvalued parameters, q q1, q2, q3, ... , qn
 Running the policy with a particular q results in
a reward, rq  Estimate the reward gradient, , for each
qi 
110Policy Gradient Methods
 This results in hillclimbing in policy space
 So, its subject to all the problems of
hillclimbing  But, we can also use tricks from search, like
random restarts and momentum terms  This is a good approach if you have a
parameterized policy  Typically faster than valuebased methods
 Safe exploration, if you have a good policy
 Learns locallybest parameters for that policy
111An Example Learning to Walk
Kohl Stone, 04
 RoboCup legged league
 Walking quickly is a big advantage
 Robots have a parameterized gait controller
 11 parameters
 Controls step length, height, etc.
 Robots walk across soccer pitch and are timed
 Reward is a function of the time taken
112An Example Learning to Walk
 Basic idea
 Pick an initial q q1, q2, ... , q11
 Generate N testing parameter settings by
perturbing q  qj q1 d1, q2 d2, ... , q11 d11, di ?
e, 0, e  Test each setting, and observe rewards
 qj ? rj
 For each qi ? q
 Calculate q1, q10, q1 and set
 Set q ? q, and go to 2
Average reward when qni qi  di
113An Example Learning to Walk
Initial
Final
Video Nate Kohl Peter Stone, UT Austin
114Value Function or Policy Gradient?
 When should I use policy gradient?
 When theres a parameterized policy
 When theres a highdimensional state space
 When we expect the gradient to be smooth
 When should I use a valuebased method?
 When there is no parameterized policy
 When we have no idea how to solve the problem
115Summary for Part I
 Background
 MDPs, and how to solve them
 Solving MDPs with dynamic programming
 How RL is different from DP
 Algorithms
 Certainty equivalence
 TD
 Qlearning
 SARSA
 Policy gradient