Transcript and Presenter's Notes

Title: Machine Learning Chapter 6. Bayesian Learning


1
Machine Learning, Chapter 6. Bayesian Learning
  • Tom M. Mitchell

2
Bayesian Learning
  • Bayes Theorem
  • MAP, ML hypotheses
  • MAP learners
  • Minimum description length principle
  • Bayes optimal classifier
  • Naive Bayes learner
  • Example: Learning over text data
  • Bayesian belief networks
  • Expectation Maximization algorithm

3
Two Roles for Bayesian Methods
  • Provides practical learning algorithms
  • Naive Bayes learning
  • Bayesian belief network learning
  • Combine prior knowledge (prior probabilities)
    with observed data
  • Requires prior probabilities
  • Provides useful conceptual framework
  • Provides a "gold standard" for evaluating other
    learning algorithms
  • Additional insight into Occam's razor

4
Bayes Theorem
  • P(h|D) = P(D|h) P(h) / P(D)
  • P(h) = prior probability of hypothesis h
  • P(D) = prior probability of training data D
  • P(h|D) = probability of h given D
  • P(D|h) = probability of D given h
5
Choosing Hypotheses
  • Generally want the most probable hypothesis given
    the training data
  • Maximum a posteriori hypothesis hMAP:
    hMAP = argmax_{h in H} P(h|D)
         = argmax_{h in H} P(D|h) P(h) / P(D)
         = argmax_{h in H} P(D|h) P(h)
  • If we assume P(hi) = P(hj) for all i, j, then we can
    further simplify and choose the maximum likelihood
    (ML) hypothesis:
    hML = argmax_{h in H} P(D|h)

6
Bayes Theorem
  • Does the patient have cancer or not?
  • A patient takes a lab test and the result comes
    back positive. The test returns a correct
    positive result in only 98% of the cases in which
    the disease is actually present, and a correct
    negative result in only 97% of the cases in which
    the disease is not present. Furthermore, 0.008 of
    the entire population has this cancer.
  • P(cancer) = .008        P(¬cancer) = .992
  • P(+|cancer) = .98       P(−|cancer) = .02
  • P(+|¬cancer) = .03      P(−|¬cancer) = .97
    (worked posterior computation in the sketch below)
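A quick numerical check of the MAP decision for this example, as a minimal Python sketch; the probabilities come straight from the problem statement above:

    # Cancer-test example: is hMAP = cancer or ¬cancer after a positive test?
    p_cancer = 0.008
    p_pos_given_cancer = 0.98        # correct positive rate
    p_pos_given_no_cancer = 0.03     # false-positive rate (1 - 0.97)

    # Unnormalized posteriors via Bayes theorem (the shared P(+) cancels in argmax)
    cancer_and_pos = p_pos_given_cancer * p_cancer              # ≈ 0.0078
    no_cancer_and_pos = p_pos_given_no_cancer * (1 - p_cancer)  # ≈ 0.0298

    # Normalize to get P(cancer | +)
    posterior = cancer_and_pos / (cancer_and_pos + no_cancer_and_pos)
    print(f"P(cancer | +) ≈ {posterior:.3f}")   # ≈ 0.21, so hMAP = ¬cancer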

7
Basic Formulas for Probabilities
  • Product rule: probability P(A ∧ B) of a
    conjunction of two events A and B:
    P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
  • Sum rule: probability of a disjunction of two
    events A and B:
    P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
  • Theorem of total probability: if events A1, ..., An
    are mutually exclusive with Σi P(Ai) = 1, then
    P(B) = Σi P(B|Ai) P(Ai)

8
Brute Force MAP Hypothesis Learner
  • For each hypothesis h in H, calculate the
    posterior probability
    P(h|D) = P(D|h) P(h) / P(D)
  • Output the hypothesis hMAP with the highest
    posterior probability
    hMAP = argmax_{h in H} P(h|D)
    (minimal implementation sketch below)
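A minimal sketch of the brute-force MAP learner above; the hypothesis space, prior, and likelihood here are illustrative placeholders, not from the slides:

    def brute_force_map(hypotheses, prior, likelihood, data):
        """Return the hypothesis maximizing P(D|h) P(h), which is proportional to P(h|D)."""
        best_h, best_score = None, -1.0
        for h in hypotheses:
            score = likelihood(data, h) * prior(h)   # unnormalized posterior
            if score > best_score:
                best_h, best_score = h, score
        return best_h

    # Example with the three-hypothesis toy posterior used later in these slides
    # (the "likelihood * prior" values are just given directly here):
    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    print(brute_force_map(posterior, prior=lambda h: 1.0,
                          likelihood=lambda D, h: posterior[h], data=None))  # -> h1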

9
Relation to Concept Learning (1/2)
  • Consider our usual concept learning task
  • instance space X, hypothesis space H, training
    examples D
  • Consider the Find-S learning algorithm (outputs the
    most specific hypothesis from the version space
    VS_H,D)
  • What would Bayes rule produce as the MAP
    hypothesis?
  • Does Find-S output a MAP hypothesis?

10
Relation to Concept Learning (2/2)
  • Assume a fixed set of instances <x1, ..., xm>
  • Assume D is the set of classifications
    D = <c(x1), ..., c(xm)>
  • Choose P(D|h):
  • P(D|h) = 1 if h is consistent with D
  • P(D|h) = 0 otherwise
  • Choose P(h) to be the uniform distribution:
  • P(h) = 1/|H| for all h in H
  • Then
    P(h|D) = 1/|VS_H,D| if h is consistent with D
    P(h|D) = 0 otherwise

11
Evolution of Posterior Probabilities
12
Characterizing Learning Algorithms by Equivalent
MAP Learners
13
Learning A Real Valued Function (1/2)
  • Consider any real-valued target function f
  • Training examples <xi, di>, where di is a noisy
    training value
  • di = f(xi) + ei
  • ei is a random variable (noise) drawn independently
    for each xi according to some Gaussian
    distribution with mean 0
  • Then the maximum likelihood hypothesis hML is the
    one that minimizes the sum of squared errors:
    hML = argmin_{h in H} Σi (di − h(xi))²
    (least-squares sketch below)
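A minimal sketch of the point above: under zero-mean Gaussian noise the ML hypothesis is simply the least-squares fit. The data and the linear hypothesis class here are illustrative, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)   # noisy targets di = f(xi) + ei

    # Fit a line by minimizing the sum of squared errors (ML under Gaussian noise)
    slope, intercept = np.polyfit(x, d, deg=1)
    print(f"hML(x) ≈ {slope:.2f} x + {intercept:.2f}")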

14
Learning A Real Valued Function (2/2)
  • Maximize the natural log of the likelihood instead...
    (derivation reconstructed below)
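A brief reconstruction (in LaTeX notation) of the derivation this slide refers to; it is the standard argument, written here under the assumption that the noise ei is Gaussian with mean 0 and a fixed variance sigma^2:

    h_{ML} = \arg\max_{h \in H} \prod_i p(d_i \mid h)
           = \arg\max_{h \in H} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}
             \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right)
           = \arg\max_{h \in H} \sum_i \left[-\tfrac{1}{2}\ln(2\pi\sigma^2)
             - \frac{(d_i - h(x_i))^2}{2\sigma^2}\right]
           = \arg\min_{h \in H} \sum_i (d_i - h(x_i))^2

The log is a monotonic transform, so the argmax is unchanged, and the terms that do not depend on h drop out, leaving the sum of squared errors.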

15
Learning to Predict Probabilities
  • Consider predicting survival probability from
    patient data
  • Training examples <xi, di>, where di is 1 or 0
  • Want to train a neural network to output a
    probability given xi (not a 0 or 1)
  • In this case one can show
    hML = argmax_{h in H} Σi [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
  • Weight update rule for a sigmoid unit:
    wjk ← wjk + Δwjk
  • where
    Δwjk = η Σi (di − h(xi)) xijk
    (gradient-ascent sketch below)
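A minimal sketch of the update rule above for a single sigmoid unit h(x) = sigmoid(w·x), trained by gradient ascent on the log likelihood. The toy data is illustrative, not from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                              # 100 instances, 3 inputs
    d = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)     # 0/1 targets

    w = np.zeros(3)
    eta = 0.1
    for _ in range(200):
        h = sigmoid(X @ w)
        w += eta * X.T @ (d - h)      # Δw_jk = η Σ_i (d_i − h(x_i)) x_ijk

    print("learned weights:", np.round(w, 2))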

16
Minimum Description Length Principle (1/2)
  • Occam's razor: prefer the shortest hypothesis
  • MDL: prefer the hypothesis h that minimizes
    hMDL = argmin_{h in H} LC1(h) + LC2(D|h)     (1)
  • where LC(x) is the description length of x under
    encoding C
  • Example: H = decision trees, D = training data
    labels
  • LC1(h) is the number of bits to describe tree h
  • LC2(D|h) is the number of bits to describe D given h
  • Note: LC2(D|h) = 0 if the examples are classified
    perfectly by h. Need only describe exceptions
  • Hence hMDL trades off tree size for training
    errors

17
Minimum Description Length Principle (2/2)
  • Interesting fact from information theory:
  • The optimal (shortest expected coding length)
    code for an event with probability p is
    −log2 p bits.
  • So interpret (1):
  • −log2 P(h) is the length of h under the optimal code
  • −log2 P(D|h) is the length of D given h under the
    optimal code
  • → prefer the hypothesis that minimizes
    length(h) + length(misclassifications)
    (see the note below)
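The connection to hMAP, written out in LaTeX notation (standard derivation; under the optimal codes above the two terms are exactly the description lengths LC2(D|h) and LC1(h)):

    h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h)
            = \arg\max_{h \in H} \left[\log_2 P(D \mid h) + \log_2 P(h)\right]
            = \arg\min_{h \in H} \left[-\log_2 P(D \mid h) - \log_2 P(h)\right]

So choosing the MAP hypothesis is equivalent to minimizing the combined description length of the hypothesis and of the data given the hypothesis.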

18
Most Probable Classification of New Instances
  • So far we've sought the most probable hypothesis
    given the data D (i.e., hMAP)
  • Given a new instance x, what is its most probable
    classification?
  • hMAP(x) is not the most probable classification!
  • Consider:
  • Three possible hypotheses
  • P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
  • Given new instance x,
  • h1(x) = +, h2(x) = −, h3(x) = −
  • What's the most probable classification of x?

19
Bayes Optimal Classifier
  • Bayes optimal classification:
    argmax_{vj in V} Σ_{hi in H} P(vj|hi) P(hi|D)
  • Example:
  • P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
  • P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
  • P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
  • therefore
    Σ_hi P(+|hi) P(hi|D) = .4
    Σ_hi P(−|hi) P(hi|D) = .6
  • and
    argmax_{vj in {+,−}} Σ_hi P(vj|hi) P(hi|D) = −
    (numeric check in the sketch below)
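A minimal Python sketch of the Bayes optimal classification for the example above:

    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(hi | D)
    predicts = {"h1": "+", "h2": "-", "h3": "-"}         # each hypothesis's label for x

    # P(vj | D) ∝ Σ_hi P(vj | hi) P(hi | D); here P(vj|hi) is 1 for hi's prediction, else 0
    scores = {"+": 0.0, "-": 0.0}
    for h, p in posterior.items():
        scores[predicts[h]] += p

    print(scores)                       # {'+': 0.4, '-': 0.6}
    print(max(scores, key=scores.get))  # '-', which differs from hMAP(x) = h1(x) = '+'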

20
Gibbs Classifier
  • Bayes optimal classifier provides the best result,
    but can be expensive if there are many hypotheses.
  • Gibbs algorithm:
  • 1. Choose one hypothesis at random, according to
    P(h|D)
  • 2. Use this to classify the new instance
  • Surprising fact: assume target concepts are drawn
    at random from H according to the priors on H. Then
    E[errorGibbs] ≤ 2 E[errorBayesOptimal]
  • Suppose a correct, uniform prior distribution over
    H; then
  • Pick any hypothesis from VS, with uniform
    probability
  • Its expected error is no worse than twice Bayes
    optimal

21
Naive Bayes Classifier (1/2)
  • Along with decision trees, neural networks, and
    nearest neighbor, one of the most practical
    learning methods.
  • When to use
  • Moderate or large training set available
  • Attributes that describe instances are
    conditionally independent given classification
  • Successful applications
  • Diagnosis
  • Classifying text documents

22
Naive Bayes Classifier (2/2)
  • Assume target function f: X → V, where each
    instance x is described by attributes <a1, a2 ... an>.
  • Most probable value of f(x) is
    vMAP = argmax_{vj in V} P(vj | a1, a2 ... an)
         = argmax_{vj in V} P(a1, a2 ... an | vj) P(vj)
  • Naive Bayes assumption:
    P(a1, a2 ... an | vj) = ∏i P(ai | vj)
  • which gives the Naive Bayes classifier:
    vNB = argmax_{vj in V} P(vj) ∏i P(ai | vj)

23
Naive Bayes Algorithm
  • Naive_Bayes_Learn(examples)
  • For each target value vj
  • P(vj) ← estimate P(vj)
  • For each attribute value ai of each attribute a
  • P(ai|vj) ← estimate P(ai|vj)
  • Classify_New_Instance(x)
    vNB = argmax_{vj in V} P(vj) ∏_{ai in x} P(ai|vj)
    (Python sketch below)
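A minimal Python sketch of the algorithm above for categorical attributes (no smoothing; the m-estimate slide later addresses that). The data format is an assumption of this sketch:

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, target_value)."""
        class_counts = Counter(v for _, v in examples)
        attr_counts = defaultdict(Counter)            # per class: (attr_index, value) -> count
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                attr_counts[v][(i, a)] += 1
        n = len(examples)
        priors = {v: c / n for v, c in class_counts.items()}
        def cond_prob(i, a, v):                       # estimate of P(ai | vj)
            return attr_counts[v][(i, a)] / class_counts[v]
        return priors, cond_prob

    def classify_new_instance(x, priors, cond_prob):
        def score(v):
            p = priors[v]
            for i, a in enumerate(x):
                p *= cond_prob(i, a, v)
            return p
        return max(priors, key=score)

The PlayTennis example on the next slide can be run through classify_new_instance directly, given the full 14-example table.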



24
Naive Bayes Example
  • Consider PlayTennis again, and the new instance
    <Outlk = sun, Temp = cool, Humid = high, Wind = strong>
  • Want to compute
    vNB = argmax_{vj in V} P(vj) ∏i P(ai|vj)
  • P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
  • P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
  • → vNB = n
    (verified in the sketch below)
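A quick check of the two numbers above. The counts used here are taken from the standard 14-example PlayTennis table in the textbook, which is not reproduced on this slide:

    p_yes, p_no = 9/14, 5/14
    yes = (2/9) * (3/9) * (3/9) * (3/9)   # P(sun|y) P(cool|y) P(high|y) P(strong|y)
    no  = (3/5) * (1/5) * (4/5) * (3/5)   # P(sun|n) P(cool|n) P(high|n) P(strong|n)
    print(round(p_yes * yes, 3), round(p_no * no, 3))   # 0.005, 0.021 -> vNB = n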

25
Naive Bayes Subtleties (1/2)
  • 1. The conditional independence assumption is often
    violated
  • ...but it works surprisingly well anyway. Note that
    we don't need the estimated posteriors to be
    correct; we need only that the argmax over vj of the
    estimated P(vj) ∏i P(ai|vj) equals the argmax of the
    true P(vj) P(a1, ..., an|vj)
  • see [Domingos & Pazzani, 1996] for analysis
  • Naive Bayes posteriors are often unrealistically
    close to 1 or 0

26
Naive Bayes Subtleties (2/2)
  • 2. What if none of the training instances with
    target value vj have attribute value ai? Then
    P(ai|vj) = 0, and so P(vj) ∏i P(ai|vj) = 0
  • Typical solution is the Bayesian (m-)estimate for P(ai|vj):
    P(ai|vj) = (nc + m·p) / (n + m)
  • where
  • n is the number of training examples for which v = vj
  • nc is the number of examples for which v = vj and a = ai
  • p is the prior estimate for P(ai|vj)
  • m is the weight given to the prior (i.e., the number of
    "virtual" examples)
    (sketch below)
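A minimal sketch of the m-estimate above, with illustrative values:

    def m_estimate(nc, n, p, m):
        """Smoothed estimate of P(ai|vj) = (nc + m*p) / (n + m)."""
        return (nc + m * p) / (n + m)

    # e.g. an attribute value never seen with this class (nc = 0),
    # a uniform prior p = 1/3, and m = 3 "virtual" examples:
    print(m_estimate(nc=0, n=9, p=1/3, m=3))   # 0.0833... instead of 0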

27
Learning to Classify Text (1/4)
  • Why?
  • Learn which news articles are of interest
  • Learn to classify web pages by topic
  • Naive Bayes is among the most effective algorithms
  • What attributes shall we use to represent text
    documents?

28
Learning to Classify Text (2/4)
  • Target concept Interesting?: Document → {+, −}
  • 1. Represent each document by a vector of words
  • one attribute per word position in the document
  • 2. Learning: use training examples to estimate
  • P(+) and P(−)
  • P(doc|+) and P(doc|−)
  • Naive Bayes conditional independence assumption:
    P(doc|vj) = ∏_{i=1..length(doc)} P(ai = wk | vj)
  • where P(ai = wk | vj) is the probability that the word
    in position i is wk, given vj
  • one more assumption:
    P(ai = wk | vj) = P(am = wk | vj), for all i, m

29
Learning to Classify Text (3/4)
  • LEARN_NAIVE_BAYES_TEXT (Examples, V)
  • 1. Collect all words and other tokens that occur
    in Examples
  • Vocabulary ← all distinct words and other tokens
    in Examples
  • 2. Calculate the required P(vj) and P(wk|vj)
    probability terms
  • For each target value vj in V do
  • docsj ← subset of Examples for which the target
    value is vj
  • P(vj) ← |docsj| / |Examples|
  • Textj ← a single document created by
    concatenating all members of docsj

30
Learning to Classify Text (4/4)
  • n ← total number of words in Textj (counting
    duplicate words multiple times)
  • for each word wk in Vocabulary
  • nk ← number of times word wk occurs in Textj
  • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
  • CLASSIFY_NAIVE_BAYES_TEXT (Doc)
  • positions ← all word positions in Doc that
    contain tokens found in Vocabulary
  • Return vNB, where
    vNB = argmax_{vj in V} P(vj) ∏_{i in positions} P(ai|vj)
    (Python sketch below)
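A minimal Python sketch of the two procedures above. Documents are plain token lists; log-probabilities are used to avoid underflow, which is an implementation choice rather than something on the slide:

    import math
    from collections import Counter

    def learn_naive_bayes_text(examples):
        """examples: list of (tokens, target_value). Returns (vocabulary, priors, cond)."""
        vocabulary = {w for tokens, _ in examples for w in tokens}
        priors, cond = {}, {}
        for v in {t for _, t in examples}:
            docs_v = [tokens for tokens, t in examples if t == v]
            priors[v] = len(docs_v) / len(examples)
            text_v = [w for tokens in docs_v for w in tokens]      # concatenate docs_v
            counts, n = Counter(text_v), len(text_v)
            cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
        return vocabulary, priors, cond

    def classify_naive_bayes_text(doc, vocabulary, priors, cond):
        positions = [w for w in doc if w in vocabulary]
        def log_score(v):
            return math.log(priors[v]) + sum(math.log(cond[v][w]) for w in positions)
        return max(priors, key=log_score)

    # Tiny illustrative usage (two made-up one-document classes):
    train = [("the game went to overtime".split(), "rec.sport.hockey"),
             ("orbit and launch of the shuttle".split(), "sci.space")]
    vocab, pri, cnd = learn_naive_bayes_text(train)
    print(classify_naive_bayes_text("overtime game".split(), vocab, pri, cnd))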

31
Twenty NewsGroups
  • Given 1000 training documents from each group,
    learn to classify new documents according to
    which newsgroup they came from
  • Naive Bayes: 89% classification accuracy

comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey alt.atheism soc.religion.christian talk.religion.misc talk.politics.mideast talk.politics.misc talk.politics.guns sci.space sci.crypt sci.electronics sci.med
32
Learning Curve for 20 Newsgroups
  • Accuracy vs. Training set size (1/3 withheld for
    test)

33
Bayesian Belief Networks
  • Interesting because:
  • Naive Bayes assumption of conditional
    independence is too restrictive
  • But it's intractable without some such
    assumptions...
  • Bayesian belief networks describe conditional
    independence among subsets of variables
  • → allows combining prior knowledge about
    (in)dependencies among variables with observed
    training data
  • (also called Bayes Nets)

34
Conditional Independence
  • Definition: X is conditionally independent of Y
    given Z if the probability distribution governing
    X is independent of the value of Y given the
    value of Z; that is, if
    (∀xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
  • more compactly, we write
    P(X|Y, Z) = P(X|Z)
  • Example: Thunder is conditionally independent of
    Rain, given Lightning
    P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
  • Naive Bayes uses conditional independence to justify
    P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)

35
Bayesian Belief Network (1/2)
  • Network represents a set of conditional
    independence assertions
  • Each node is asserted to be conditionally
    independent of its nondescendants, given its
    immediate predecessors.
  • Directed acyclic graph

36
Bayesian Belief Network (2/2)
  • Represents the joint probability distribution over
    all variables
  • e.g., P(Storm, BusTourGroup, . . . , ForestFire)
  • in general,
    P(y1, ..., yn) = ∏_{i=1..n} P(yi | Parents(Yi))
  • where Parents(Yi) denotes the immediate predecessors
    of Yi in the graph
  • so, the joint distribution is fully defined by the
    graph, plus the P(yi|Parents(Yi))
    (see the sketch below)
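A minimal sketch of the factored joint distribution above, on a tiny hypothetical three-node chain Storm -> Lightning -> Thunder. The structure and CPT numbers here are made up for illustration; they are not taken from the slides' network figure:

    cpt_storm = {True: 0.2, False: 0.8}                              # P(Storm)
    cpt_lightning = {True: {True: 0.7, False: 0.3},                  # P(Lightning | Storm)
                     False: {True: 0.05, False: 0.95}}
    cpt_thunder = {True: {True: 0.9, False: 0.1},                    # P(Thunder | Lightning)
                   False: {True: 0.01, False: 0.99}}

    def joint(storm, lightning, thunder):
        """P(Storm, Lightning, Thunder) = P(Storm) P(Lightning|Storm) P(Thunder|Lightning)."""
        return (cpt_storm[storm]
                * cpt_lightning[storm][lightning]
                * cpt_thunder[lightning][thunder])

    print(joint(True, True, True))    # 0.2 * 0.7 * 0.9 = 0.126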

37
Inference in Bayesian Networks
  • How can one infer the (probabilities of) values
    of one or more network variables, given observed
    values of others?
  • Bayes net contains all information needed for
    this inference
  • If only one variable has an unknown value, it is
    easy to infer
  • In the general case, the problem is NP-hard
  • In practice, can succeed in many cases
  • Exact inference methods work well for some
    network structures
  • Monte Carlo methods simulate the network
    randomly to calculate approximate solutions

38
Learning of Bayesian Networks
  • Several variants of this learning task
  • Network structure might be known or unknown
  • Training examples might provide values of all
    network variables, or just some
  • If structure known and we observe all variables
  • Then it's as easy as training a Naive Bayes
    classifier

39
Learning Bayes Nets
  • Suppose structure known, variables partially
    observable
  • e.g., observe ForestFire, Storm, BusTourGroup,
    Thunder, but not Lightning, Campfire...
  • Similar to training neural network with hidden
    units
  • In fact, can learn network conditional
    probability tables using gradient ascent!
  • Converge to the network h that (locally) maximizes
    P(D|h)

40
Gradient Ascent for Bayes Nets
  • Let wijk denote one entry in the conditional
    probability table for variable Yi in the network
    wijk = P(Yi = yij | Parents(Yi) = the list uik of
    values)
  • e.g., if Yi = Campfire, then uik might be
    <Storm = T, BusTourGroup = F>
  • Perform gradient ascent by repeatedly
  • 1. updating all wijk using training data D:
    wijk ← wijk + η Σ_{d in D} Ph(yij, uik | d) / wijk
  • 2. then renormalizing the wijk to assure
    Σj wijk = 1 and 0 ≤ wijk ≤ 1

41
More on Learning Bayes Nets
  • The EM algorithm can also be used. Repeatedly:
  • 1. Calculate probabilities of unobserved
    variables, assuming h
  • 2. Calculate new wijk to maximize E[ln P(D|h)],
    where D now includes both observed and
    (calculated probabilities of) unobserved
    variables
  • When structure unknown...
  • Algorithms use greedy search to add/subtract
    edges and nodes
  • Active research topic

42
Summary Bayesian Belief Networks
  • Combine prior knowledge with observed data
  • Impact of prior knowledge (when correct!) is to
    lower the sample complexity
  • Active research area
  • Extend from boolean to real-valued variables
  • Parameterized distributions instead of tables
  • Extend to first-order instead of propositional
    systems
  • More effective inference methods

43
Expectation Maximization (EM)
  • When to use
  • Data is only partially observable
  • Unsupervised clustering (target value
    unobservable)
  • Supervised learning (some instance attributes
    unobservable)
  • Some uses
  • Train Bayesian Belief Networks
  • Unsupervised clustering (AUTOCLASS)
  • Learning Hidden Markov Models

44
Generating Data from Mixture of k Gaussians
  • Each instance x generated by
  • 1. Choosing one of the k Gaussians with uniform
    probability
  • 2. Generating an instance at random according to
    that Gaussian

45
EM for Estimating k Means (1/2)
  • Given:
  • Instances from X generated by a mixture of k
    Gaussian distributions
  • Unknown means <μ1, ..., μk> of the k Gaussians
  • Don't know which instance xi was generated by
    which Gaussian
  • Determine:
  • Maximum likelihood estimates of <μ1, ..., μk>
  • Think of the full description of each instance as
    yi = <xi, zi1, zi2>, where
  • zij is 1 if xi was generated by the jth Gaussian
  • xi is observable
  • zij is unobservable

46
EM for Estimating k Means (2/2)
  • EM algorithm: pick a random initial h = <μ1, μ2>,
    then iterate
  • E step: Calculate the expected value E[zij] of each
    hidden variable zij, assuming the current hypothesis
    h = <μ1, μ2> holds:
    E[zij] = exp(−(xi − μj)² / 2σ²) / Σ_{n=1..2} exp(−(xi − μn)² / 2σ²)
  • M step: Calculate a new maximum likelihood
    hypothesis h' = <μ'1, μ'2>, assuming the value taken
    on by each hidden variable zij is its expected value
    E[zij] calculated above:
    μj ← Σi E[zij] xi / Σi E[zij]
    Replace h = <μ1, μ2> by h' = <μ'1, μ'2>.
    (see the sketch below)
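A minimal Python sketch of the two-Gaussian EM procedure above, assuming a known, shared variance σ² and unknown means; the synthetic data is illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0
    x = np.concatenate([rng.normal(-3.0, sigma, 200),     # component 1
                        rng.normal( 4.0, sigma, 200)])    # component 2

    mu = rng.normal(0.0, 1.0, size=2)          # random initial h = <mu1, mu2>
    for _ in range(50):
        # E step: E[z_ij] ∝ exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

    print(np.round(mu, 2))    # close to the true means -3 and 4 (order may vary)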

47
EM Algorithm
  • Converges to a local maximum likelihood h and
    provides estimates of the hidden variables zij
  • In fact, a local maximum of E[ln P(Y|h)]
  • Y is the complete (observable plus unobservable
    variables) data
  • The expected value is taken over possible values of
    the unobserved variables in Y

48
General EM Problem
  • Given:
  • Observed data X = {x1, ..., xm}
  • Unobserved data Z = {z1, ..., zm}
  • Parameterized probability distribution P(Y|h),
    where
  • Y = {y1, ..., ym} is the full data, yi = xi ∪ zi
  • h are the parameters
  • Determine: h that (locally) maximizes E[ln P(Y|h)]
  • Many uses
  • Train Bayesian belief networks
  • Unsupervised clustering (e.g., k means)
  • Hidden Markov Models

49
General EM Method
  • Define the likelihood function Q(h'|h), which
    calculates Y = X ∪ Z using the observed X and the
    current parameters h to estimate Z:
    Q(h'|h) ← E[ln P(Y|h') | h, X]
  • EM Algorithm:
  • Estimation (E) step: Calculate Q(h'|h) using the
    current hypothesis h and the observed data X to
    estimate the probability distribution over Y:
    Q(h'|h) ← E[ln P(Y|h') | h, X]
  • Maximization (M) step: Replace hypothesis h by
    the hypothesis h' that maximizes this Q function.