1
Bayesian Learning
2
Bayesian Learning
  • Probabilistic approach to inference
  • Assumption
  • Quantities of interest are governed by
    probability distributions
  • Optimal decisions can be made by reasoning about
    probabilities and observations
  • Provides quantitative approach to weighing how
    evidence supports alternative hypotheses

3
Why is Bayesian Learning Important?
  • Some Bayesian approaches (like naive Bayes) are
    very practical learning approaches and
    competitive with other approaches
  • Provides a useful perspective for understanding
    many learning algorithms that do not explicitly
    manipulate probabilities

4
Important Features
  • Model is incrementally updated with training
    examples
  • Prior knowledge can be combined with observed
    data to determine the final probability of the
    hypothesis
  • Asserting prior probability of candidate
    hypotheses
  • Asserting a probability distribution over
    observations for each hypothesis
  • Can accommodate methods that make probabilistic
    predictions
  • New instances can be classified by combining
    predictions of multiple hypotheses
  • Can provide a gold standard for evaluating
    hypotheses

5
Practical Problems
  • Typically requires initial knowledge of many
    probabilities, which can be estimated from
  • Background knowledge
  • Previously available data
  • Assumptions about distribution
  • Significant computational cost of determining
    Bayes optimal hypothesis in general
  • linear in number of hypotheses in general case
  • Significantly lower for certain situations

6
Bayes Theorem
  • Goal: learn the best hypothesis
  • Assumption in Bayesian learning: the best
    hypothesis is the most probable hypothesis
  • Bayes theorem allows computation of the most probable
  • Bayes theorem allows computation of most probable
    hypothesis based on
  • Prior probability of hypothesis
  • Probability of observing certain data given the
    hypothesis
  • Observed data itself

7
Notation
  • P(h): prior probability of h
  • P(D): prior probability of D
  • P(D|h): probability of D given h
  • also called the likelihood of the data given h
  • P(h|D): probability that h holds, given the data
  • also called the posterior probability of h

8
Bayes Theorem
  • Based on the definitions of P(D|h) and P(h|D):
  • P(h|D) = P(D|h) P(h) / P(D)
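A minimal numeric sketch of the theorem for two hypotheses; the priors and likelihoods below are illustrative assumptions, not values from the slides:

```python
# Minimal sketch of Bayes theorem for two hypotheses.
# The priors and likelihoods are illustrative only.
priors = {"h1": 0.5, "h2": 0.5}        # P(h)
likelihoods = {"h1": 0.8, "h2": 0.3}   # P(D|h)

p_d = sum(priors[h] * likelihoods[h] for h in priors)               # P(D), by total probability
posteriors = {h: likelihoods[h] * priors[h] / p_d for h in priors}  # P(h|D)
print(posteriors)   # h1: ~0.727, h2: ~0.273
```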
9
Maximum A Posteriori Hypothesis
  • Many learning algorithms try to identify the most
    probable hypothesis h ∈ H given observations D
  • This is the maximum a posteriori hypothesis (MAP
    hypothesis): hMAP = argmax over h ∈ H of P(h|D)

10
Identifying the MAP Hypothesis using Bayes Theorem
  • hMAP = argmax P(h|D) = argmax P(D|h) P(h) / P(D)
         = argmax P(D|h) P(h)
    (P(D) is constant across hypotheses, so it can be
    dropped)
11
Equally Probable Hypotheses
If every hypothesis in H is equally probable a priori,
hMAP reduces to maximizing P(D|h). Any hypothesis that
maximizes P(D|h) is a Maximum Likelihood (ML)
hypothesis: hML = argmax P(D|h)
12
Bayes Theorem and Concept Learning
  • Concept Learning Task
  • H: hypothesis space
  • X: instance space
  • c: X → {0,1}

13
Brute-Force MAP Learning Algorithm
  • For each hypothesis h in H, calculate the
    posterior probability P(h|D) = P(D|h) P(h) / P(D)
  • Output the hypothesis hMAP with the highest
    posterior probability

14
To Apply Brute Force MAP Learning
  • Specify P(h)
  • Specify P(D|h)

15
An Example
  • Assume
  • Training data D is noise free (di = c(xi))
  • The target concept is contained in H
  • We have no a priori reason to believe one
    hypothesis is more likely than any other

16
Probability of Data Given Hypothesis
  • P(D|h) = 1 if di = h(xi) for every di in D
  • P(D|h) = 0 otherwise
17
Apply the algorithm
  • Step 1 (2 cases)
  • Case 1 (D is inconsistent with h):
    P(h|D) = 0 · P(h) / P(D) = 0
  • Case 2 (D is consistent with h):
    P(h|D) = 1 · (1/|H|) / (|VSH,D| / |H|) = 1/|VSH,D|

18
Step 2
  • Every consistent hypothesis has posterior
    probability 1/|VSH,D|
  • Every inconsistent hypothesis has posterior
    probability 0

19
MAP hypothesis and consistent learners
  • Under these assumptions, every consistent
    hypothesis is a MAP hypothesis, so any consistent
    learner outputs a MAP hypothesis
  • FIND-S (finds the maximally specific consistent
    hypothesis)
  • Candidate-Elimination (finds all consistent
    hypotheses)

20
Maximum Likelihood and Least-Squared Error
Learning
  • New problem: learning a continuous-valued target
    function
  • Will show that, under certain assumptions, any
    learning algorithm that minimizes the squared
    error between its output hypothesis and the
    training data will output a maximum likelihood
    hypothesis

21
Problem Setting
  • Learner L
  • Instance space X
  • Hypothesis space H = {h : X → R}
  • Task of L is to learn an unknown target function
    f : X → R
  • Have m training examples <xi, di>
  • The target value of each example is corrupted by
    random noise drawn from a Normal distribution:
    di = f(xi) + ei

22
Work Through Derivation
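A reconstruction of the standard derivation under the assumptions above (independent examples, zero-mean Gaussian noise with variance σ²); taking the logarithm turns the product into a sum, and the constant terms drop out:

```latex
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \; p(D \mid h)
        = \arg\max_{h \in H} \prod_{i=1}^{m}
          \frac{1}{\sqrt{2\pi\sigma^2}}\,
          e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m}
          \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}}
          - \frac{(d_i - h(x_i))^2}{2\sigma^2} \right)
        = \arg\min_{h \in H} \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)^2
\end{aligned}
```

So maximizing the likelihood is equivalent to minimizing the sum of squared errors.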
23
Why Normal Distribution for Noise?
  • It's easy to work with
  • Good approximation of many physical processes
  • Important point: we are only dealing with noise in
    the target function, not in the attribute values

24
Bayes Optimal Classifier
  • Two Questions
  • What is the most probable hypothesis given the
    training data?
  • Find MAP hypothesis
  • What is the most probable classification given
    the training data?

25
Example
  • Three hypotheses
  • P(h1|D) = 0.35
  • P(h2|D) = 0.45
  • P(h3|D) = 0.20
  • New instance x
  • h1 predicts negative
  • h2 predicts positive
  • h3 predicts negative
  • What is the predicted class using hMAP?
  • What is the predicted class using all hypotheses?

26
Bayes Optimal Classification
  • The most probable classification of a new
    instance is obtained by combining the predictions
    of all hypotheses, weighted by their posterior
    probabilities.
  • Suppose the set of possible classification values
    is V (each possible value is vj)
  • The probability that vj is the correct
    classification for the new instance is
    P(vj|D) = Σ over hi in H of P(vj|hi) P(hi|D)
  • Pick the vj with the max probability as the
    predicted class

27
Bayes Optimal Classifier
Apply this to the previous example
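Working through the weighted vote with the posteriors and predictions from the example two slides earlier:

```latex
\begin{aligned}
P(\oplus \mid D)  &= \sum_{h_i} P(\oplus \mid h_i)\,P(h_i \mid D)
  = 0 \cdot 0.35 + 1 \cdot 0.45 + 0 \cdot 0.20 = 0.45 \\
P(\ominus \mid D) &= 1 \cdot 0.35 + 0 \cdot 0.45 + 1 \cdot 0.20 = 0.55
\end{aligned}
```

So the Bayes optimal classification is negative, even though hMAP = h2 predicts positive.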
28
Bayes Optimal Classification
  • Gives the optimal error-minimizing solution to
    prediction and classification problems.
  • Requires probability of exact combination of
    evidence
  • All classification methods can be viewed as
    approximations of Bayes rule with varying
    assumptions about conditional probabilities
  • Assume they come from some distribution
  • Assume conditional independence
  • Assume underlying model of specific format
    (linear combination of evidence, decision tree)

29
Simplifications of Bayes Rule
  • Given observations of attribute values a1, a2,
    ..., an, compute the most probable target value
    vMAP
  • Use Bayes Theorem to rewrite:
    vMAP = argmax over vj of P(vj | a1, ..., an)
         = argmax over vj of P(a1, ..., an | vj) P(vj)

30
Naïve Bayes
  • The most common simplification of Bayes Rule is to
    assume conditional independence of the
    observations
  • Because it is approximately true
  • Because it is computationally convenient
  • Assume the probability of observing the
    conjunction a1, a2, ..., an is the product of the
    probabilities of the individual attributes:
    P(a1, ..., an | vj) = Π over i of P(ai | vj)
  • This gives the naive Bayes classifier
    vNB = argmax over vj of P(vj) Π over i of P(ai | vj)
  • Learning consists of estimating the probabilities
    P(vj) and P(ai | vj) from the training data

31
Simple Example
  • Two classes, C1 and C2
  • Two features
  • a1: Male, Female
  • a2: Blue eyes, Brown eyes
  • Instance (Male with blue eyes): what is the class?

Probability        C1    C2
P(Ci)              0.4   0.6
P(Male|Ci)         0.1   0.2
P(BlueEyes|Ci)     0.3   0.2
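Under the conditional-independence assumption, the class scores for this instance work out as:

```latex
\begin{aligned}
P(C_1)\,P(\text{Male} \mid C_1)\,P(\text{BlueEyes} \mid C_1) &= 0.4 \times 0.1 \times 0.3 = 0.012 \\
P(C_2)\,P(\text{Male} \mid C_2)\,P(\text{BlueEyes} \mid C_2) &= 0.6 \times 0.2 \times 0.2 = 0.024
\end{aligned}
```

so naive Bayes assigns the instance to C2.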
32
Estimating Probabilities (Classifying Executables)
  • Two classes (Malicious, Benign)
  • Features
  • a1: GUI present (yes/no)
  • a2: Deletes files (yes/no)
  • a3: Allocates memory (yes/no)
  • a4: Length (<1K, 1-10K, >10K)

33
Instance a1 a2 a3 a4 Class
1 Yes No No Yes B
2 Yes No No No B
3 No Yes Yes No M
4 No No Yes Yes M
5 Yes No No Yes B
6 Yes No No No M
7 Yes Yes Yes No M
8 Yes Yes No Yes M
9 No No No Yes B
10 No No Yes No M
34
Classify the Following Instance
  • <Yes, No, Yes, Yes>
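A minimal Python sketch that classifies this instance with naive Bayes, estimating each probability by the simple relative frequency nc/n from the table above:

```python
# Naive Bayes on the executable-classification table.
# (a1, a2, a3, a4, class) rows copied from the table.
data = [
    ("Yes","No","No","Yes","B"), ("Yes","No","No","No","B"),
    ("No","Yes","Yes","No","M"), ("No","No","Yes","Yes","M"),
    ("Yes","No","No","Yes","B"), ("Yes","No","No","No","M"),
    ("Yes","Yes","Yes","No","M"), ("Yes","Yes","No","Yes","M"),
    ("No","No","No","Yes","B"), ("No","No","Yes","No","M"),
]

def nb_score(instance, cls):
    rows = [r for r in data if r[-1] == cls]
    score = len(rows) / len(data)                               # P(class)
    for i, value in enumerate(instance):
        score *= sum(r[i] == value for r in rows) / len(rows)   # P(ai | class)
    return score

x = ("Yes", "No", "Yes", "Yes")
print({c: nb_score(x, c) for c in ("B", "M")})
# B gets score 0 because P(a3=Yes | B) = 0/4 -- the zero-estimate problem
# discussed on the next slide; M wins with 0.6 * 3/6 * 3/6 * 4/6 * 2/6 ≈ 0.033
```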

35
Estimating Probabilities
  • To estimate P(C|D)
  • Let n be the number of training examples labeled D
  • Let nc be the number labeled D that are also
    labeled C
  • P(C|D) was estimated as nc/n
  • Problems
  • This is a biased underestimate of the probability
    when nc is small
  • When the estimate is 0, it dominates all the other
    terms in the product

36
Use m-estimate of probability
  • Estimate the probability as (nc + m·p) / (n + m)
  • p is a prior estimate of what we are trying to
    estimate (often assume attribute values are
    equally probable)
  • m is a constant (called the equivalent sample
    size); view this as augmenting the n observed
    examples with m virtual samples distributed
    according to p

37
Repeat Estimates
  • Use equal priors for attribute values
  • Use m value of 1
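For instance, the zero estimate P(a3 = Yes | B) from the executables data becomes nonzero: a3 has two values, so the equal prior gives p = 0.5, and with m = 1:

```latex
P(a_3 = \text{Yes} \mid B) = \frac{n_c + m\,p}{n + m}
  = \frac{0 + 1 \times 0.5}{4 + 1} = 0.1
```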

38
Bayesian Belief Networks
  • Naïve Bayes is based on assumption of conditional
    independence
  • Bayesian networks provide a tractable method for
    specifying dependencies among variables

39
Terminology
  • A Bayesian Belief Network describes the
    probability distribution over a set of random
    variables Y1, Y2, ..., Yn
  • Each variable Yi can take on the set of values
    V(Yi)
  • The joint space of the set of variables Y is the
    cross product V(Y1) × V(Y2) × ... × V(Yn)
  • Each item in the joint space corresponds to one
    possible assignment of values to the tuple of
    variables <Y1, ..., Yn>
  • The joint probability distribution specifies the
    probabilities of the items in the joint space
  • A Bayesian Network provides a way to describe the
    joint probability distribution in a compact
    manner.

40
Conditional Independence
  • Let X, Y, and Z be three discrete-valued random
    variables.
  • We say that X is conditionally independent of Y
    given Z if the probability distribution governing
    X is independent of the value of Y given a value
    for Z, that is,
    P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
    for all values xi, yj, zk

41
Bayesian Belief Network
  • A set of random variables makes up the nodes of
    the network
  • A set of directed links or arrows connects pairs
    of nodes. The intuitive meaning of an arrow from
    X to Y is that X has a direct influence on Y.
  • Each node has a conditional probability table
    that quantifies the effects that the parents have
    on the node. The parents of a node are all those
    nodes that have arrows pointing to it.
  • The graph has no directed cycles (it is a DAG)

42
Example (from Judea Pearl)
  • You have a new burglar alarm installed at home.
    It is fairly reliable at detecting a burglary,
    but also responds on occasion to minor
    earthquakes. You also have two neighbors, John
    and Mary, who have promised to call you at work
    when they hear the alarm. John always calls when
    he hears the alarm, but sometimes confuses the
    telephone ringing with the alarm and calls then,
    too. Mary, on the other hand, likes rather loud
    music and sometimes misses the alarm altogether.
    Given the evidence of who has or has not called,
    we would like to estimate the probability of a
    burglary.

43
Step 1
  • Determine what the propositional (random)
    variables should be
  • Determine causal (or another type of influence)
    relationships and develop the topology of the
    network

44
Topology of Belief Network
  • Burglary → Alarm ← Earthquake
  • Alarm → JohnCalls
  • Alarm → MaryCalls
45
Step 2
  • Specify a conditional probability table or CPT
    for each node.
  • Each row in the table contains the conditional
    probability of each node value for a conditioning
    case (possible combinations of values for parent
    nodes).
  • In the example, the possible values for each node
    are true/false.
  • The sum of the probabilities for each value of a
    node given a particular conditioning case is 1.

46
Example: CPT for Alarm Node
P(Alarm | Burglary, Earthquake)

Burglary  Earthquake  P(Alarm=True)  P(Alarm=False)
True      True        0.950          0.050
True      False       0.940          0.060
False     True        0.290          0.710
False     False       0.001          0.999
47
Complete Belief Network
  • Burglary: P(B) = 0.001
  • Earthquake: P(E) = 0.002
  • Alarm CPT:
    B     E     P(A|B,E)
    T     T     0.95
    T     F     0.94
    F     T     0.29
    F     F     0.001
  • JohnCalls CPT:
    A     P(J|A)
    T     0.90
    F     0.05
  • MaryCalls CPT:
    A     P(M|A)
    T     0.70
    F     0.01
48
Semantics of Belief Networks
  • View 1: A belief network is a representation of
    the joint probability distribution (the joint) of
    a domain.
  • The joint completely specifies an agent's
    probability assignments to all propositions in the
    domain (both simple and complex).

49
Network as representation of joint
  • A generic entry in the joint probability
    distribution is the probability of a conjunction
    of particular assignments to each variable, such
    as P(Y1 = y1 ∧ ... ∧ Yn = yn)
  • Each entry in the joint is represented by the
    product of appropriate elements of the CPTs in
    the belief network:
    P(y1, ..., yn) = Π over i of P(yi | Parents(Yi))

50
Example Calculation
  • Calculate the probability of the event that the
    alarm has sounded but neither a burglary nor an
    earthquake has occurred, and both John and Mary
    call.
  • P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
    = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
    = 0.90 × 0.70 × 0.001 × 0.999 × 0.998
    ≈ 0.00062
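A minimal Python sketch that encodes the CPTs of this network as dictionaries and evaluates the same joint-probability entry:

```python
# The burglary network's CPTs, and one entry of the joint distribution
# computed as a product of CPT entries.
p_b = 0.001                      # P(Burglary)
p_e = 0.002                      # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
p_j = {True: 0.90, False: 0.05}  # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}  # P(MaryCalls | Alarm)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) as a product of CPT entries."""
    def pr(p_true, value):       # P(var = value) from P(var = True)
        return p_true if value else 1.0 - p_true
    return (pr(p_j[a], j) * pr(p_m[a], m) * pr(p_a[(b, e)], a)
            * pr(p_b, b) * pr(p_e, e))

print(joint(j=True, m=True, a=True, b=False, e=False))   # ≈ 0.00062
```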

51
Semantics
  • View 2: Encoding of a collection of conditional
    independence statements.
  • JohnCalls is conditionally independent of other
    variables in the network given the value of Alarm
  • This view is useful for understanding inference
    procedures for the networks.

52
Inference Methods for Bayesian Networks
  • We may want to infer the value of some target
    variable (Burglary) given observed values for
    other variables.
  • What we generally want is the probability
    distribution of the target variable given the
    observed values, e.g.
    P(Burglary | JohnCalls, MaryCalls)
  • Inference is straightforward if all other values
    in the network are known
  • In the more general case, if we know the values of
    only a subset of the variables, we can infer a
    probability distribution over the other variables.
  • Exact inference in the general case is NP-hard
  • But approximate methods work well in practice

53
Learning Bayesian Belief Networks
  • Focus of a great deal of research
  • Several situations of varying complexity
  • Network structure may be given or not
  • All variables may be observable or you may have
    some variables that cannot be observed
  • If the network structure is known and all
    variables can be observed, the CPTs can be
    computed like they were for Naïve Bayes

54
Gradient Ascent Training of Bayesian Networks
  • Method developed by Russell
  • Maximizes P(D|h) by following the gradient of
    ln P(D|h)
  • Let wijk be a single CPT entry: the probability
    that variable Yi takes on value yij given that its
    immediate parents Ui take on the values given by
    uik

55
Illustration
  • wijk = P(Yi = yij | Ui = uik)
  • (the parent node, with value Ui = uik, points to
    the child node, with value Yi = yij)
56
Result
  • ∂ ln P(D|h) / ∂ wijk
    = Σ over d in D of Ph(Yi = yij, Ui = uik | d) / wijk
  • Gradient ascent update:
    wijk ← wijk + η Σ over d of Ph(yij, uik | d) / wijk,
    then renormalize so that Σ over j of wijk = 1 for
    each parent configuration uik
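A minimal Python sketch of one such update step, assuming a hypothetical inference routine posterior(i, j, k, d) that returns Ph(Yi = yij, Ui = uik | d); the CPT entries are stored as a nested dict w[i][j][k]:

```python
# One gradient-ascent step on the CPT entries w[i][j][k].
# posterior(i, j, k, d) is a hypothetical helper returning
# Ph(Yi = yij, Ui = uik | d); its implementation (network inference) is not shown.
def gradient_ascent_step(w, data, posterior, eta=0.1):
    for i in w:                               # each variable Yi
        for j in w[i]:                        # each value yij
            for k in w[i][j]:                 # each parent configuration uik
                grad = sum(posterior(i, j, k, d) / w[i][j][k] for d in data)
                w[i][j][k] += eta * grad
    # Renormalize so that, for each parent configuration, the entries over j
    # form a valid conditional probability distribution.
    for i in w:
        parent_configs = {k for j in w[i] for k in w[i][j]}
        for k in parent_configs:
            total = sum(w[i][j][k] for j in w[i] if k in w[i][j])
            for j in w[i]:
                if k in w[i][j]:
                    w[i][j][k] /= total
    return w
```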
57
Example
  • Network: Burglary → Alarm ← Earthquake,
    Alarm → JohnCalls, Alarm → MaryCalls
  • To compute the CPT entry P(A|B,E) we would need
    Ph(A, B, E | d) for each training example d
58
EM Algorithm
  • The EM algorithm is a general purpose algorithm
    that is used in many settings including
  • Unsupervised learning
  • Learning CPTs for Bayesian networks
  • Learning Hidden Markov models
  • Two-step algorithm for learning hidden variables

59
Two Step Process
  • For a specific problem we have three quantities
  • X: observed data for the instances
  • Z: unobserved data for the instances (this is
    usually what we are trying to learn)
  • Y: the full data (X together with Z)
  • General approach
  • Determine an initial hypothesis for the values
    governing Z
  • Step 1: Estimation
  • Compute a function Q(h'|h) using the current
    hypothesis h and the observed data X to estimate
    the probability distribution over Y
  • Step 2: Maximization
  • Replace hypothesis h by the hypothesis h' that
    maximizes the Q function

60
K-means algorithm
Assume the data comes from a mixture of 2 Gaussian
distributions whose means (μ) are unknown
[figure: the mixture density P(x) plotted against x]
61
Generation of data
  • Select one of the normal distributions at random
  • Generate a single random instance xi using this
    distribution

62
Example: Select initial values for h
h = <μ1, μ2>
[figure: data points with the initial means μ1, μ2
marked]
63
E-step: Compute the probability that each datum xi
was generated by each mixture component
h = <μ1, μ2>
[figure: the same data, with responsibilities computed
under the current means μ1, μ2]
64
M-step: Replace hypothesis h with the h' that
maximizes Q
h = <μ1, μ2>
[figure: the means updated to their new positions]
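A minimal Python sketch of this E-step/M-step loop for two 1-D Gaussians, assuming a known common variance and equal mixing weights; the synthetic data, σ, and initial means are illustrative assumptions:

```python
# EM for estimating the means of two 1-D Gaussians
# (known common variance, equal mixing weights). Data is synthetic.
import math, random

random.seed(0)
sigma = 1.0
data = [random.gauss(0.0, sigma) for _ in range(100)] + \
       [random.gauss(4.0, sigma) for _ in range(100)]

def gaussian(x, mu):
    # Unnormalized density; the normalizing constant cancels in the E-step.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

mu1, mu2 = -1.0, 1.0                      # initial hypothesis h = <mu1, mu2>
for _ in range(20):
    # E-step: responsibility of component 1 for each xi, E[z_i1]
    r1 = [gaussian(x, mu1) / (gaussian(x, mu1) + gaussian(x, mu2)) for x in data]
    # M-step: new means are responsibility-weighted averages (maximize Q)
    mu1 = sum(r * x for r, x in zip(r1, data)) / sum(r1)
    mu2 = sum((1 - r) * x for r, x in zip(r1, data)) / sum(1 - r for r in r1)

print(mu1, mu2)   # should approach the true means (about 0 and 4)
```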