1
Bayesian Decision Theory
2
Learning Objectives
  • Review the probabilistic framework
  • Understand what Bayesian networks are
  • Understand how Bayesian networks work
  • Understand inference in Bayesian networks

3
Other Names
  • Belief networks
  • Probabilistic networks
  • Graphical models
  • Causal networks

4
Probabilistic Framework
  • Bayesian inference is a method based on
    probabilities for drawing conclusions in the presence
    of uncertainty.
  • It is an inductive method (contrast the pair
    deduction/induction):
  • A ⇒ B: IF A (is true) THEN B (is true) (deduction)
    IF B (is true) THEN A is plausible (induction)

5
Probabilistic Framework
  • Statistical framework:
  • State hypotheses and models (sets of hypotheses)
  • Assign prior probabilities to the hypotheses
  • Draw inferences using probability calculus:
    evaluate posterior probabilities (or degrees of
    belief) for the hypotheses given the available
    data, and derive unique answers.

6
Probabilistic Framework
  • Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
  • Complementary events: P(¬A) = 1 - P(A)
  • Union of two events: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

7
Probabilistic Framework
  • Conditional probability P(X|Y) (read: probability of X
    given Y): P(X|Y) = P(X,Y) / P(Y)
  • Events X and Y are said to be independent if
    P(X|Y) = P(X)

or, equivalently, P(X,Y) = P(X) P(Y)
8
Probabilistic Framework
  • More generally, with mutually independent events
    X1, ..., Xk: P(X1, ..., Xk) = P(X1) P(X2) ... P(Xk)

9
Probabilistic Framework
  • Axioms of probabilities
  • > is an ordering relationship on degrees of
    confidence (belief): P(X|I) > P(Y|I) AND P(Y|I) >
    P(Z|I) ⇒ P(X|I) > P(Z|I)
  • Complementary events: P(A|I) + P(¬A|I) = 1
  • There is a function G such that P(X,Y|I) =
    G( P(X|I), P(Y|X,I) )
  • Bayes' theorem: P(X|Y,I) = P(Y|X,I) P(X|I) / P(Y|I)
    (a numeric check follows)
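As a concrete check of these identities, here is a minimal Python sketch; the joint probabilities are hypothetical numbers chosen only so that X and Y come out independent.

    # Hypothetical joint distribution over two binary events X and Y.
    joint = {(True, True): 0.12, (True, False): 0.28,
             (False, True): 0.18, (False, False): 0.42}

    p_x = sum(p for (x, _), p in joint.items() if x)   # P(X) = 0.4
    p_y = sum(p for (_, y), p in joint.items() if y)   # P(Y) = 0.3
    p_x_and_y = joint[(True, True)]                    # P(X,Y) = 0.12

    p_x_given_y = p_x_and_y / p_y                      # P(X|Y)
    p_y_given_x = p_x_and_y / p_x                      # P(Y|X)

    # Bayes' theorem: P(X|Y) = P(Y|X) P(X) / P(Y)
    assert abs(p_x_given_y - p_y_given_x * p_x / p_y) < 1e-9
    # Independence here: P(X,Y) = P(X) P(Y)
    assert abs(p_x_and_y - p_x * p_y) < 1e-9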

10
Bayesian Inference
  • Bayesian inference and induction consist in
    deriving a model from a set of data
  • M is a model (hypothesis), D is the data
  • P(M|D) = P(D|M) P(M) / P(D)
  • P(M|D) is the posterior: the updated belief that M
    is correct
  • P(M) is the prior: our estimate that M is correct
    before seeing any data
  • P(D|M) is the likelihood
  • Logarithms are helpful to represent small numbers
  • This permits combining prior knowledge with the
    observed data.

11
Bayesian Inference
  • To be able to infer a model, we need to evaluate:
  • The prior P(M)
  • The likelihood P(D|M)
  • P(D) is calculated as the sum of the
    numerators P(D|M) P(M) over all the hypotheses,
    and thus we do not need to evaluate it separately
    (a sketch follows)
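A minimal sketch of this update; the model names, priors, and likelihoods below are made-up placeholders, not values from the slides.

    import math

    # Hypothetical priors P(M) and likelihoods P(D|M) for two competing models.
    prior = {"M1": 0.5, "M2": 0.5}
    likelihood = {"M1": 1e-4, "M2": 2.5e-12}

    # P(D) is the normalizing constant: the sum of P(D|M) P(M) over all models.
    p_data = sum(likelihood[m] * prior[m] for m in prior)

    # Posterior P(M|D) = P(D|M) P(M) / P(D)
    posterior = {m: likelihood[m] * prior[m] / p_data for m in prior}

    # Logarithms help when the likelihoods are very small numbers.
    log_odds = (math.log(likelihood["M1"] * prior["M1"])
                - math.log(likelihood["M2"] * prior["M2"]))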

12
Bayesian Inference
  • Example (from MacKay, Bayesian Methods for Neural
    Networks: Theory and Applications, 1995): Given
    the sequence of numbers -1, 3, 7, 11, predict
    what the next two numbers are likely to be, and
    infer what the underlying process probably was
    that gave rise to this sequence.

H1
  • 15, 19 (add four to the preceding number)

H2
  • -19.9, 1043.8 (evaluate -x³/11 + 9/11 x² + 23/11,
    where x is the preceding number)
13
Bayesian Inference
  • H1: arithmetic progression ("add n")
  • H2: cubic function, x → c x³ + d x² + e
  • PRIOR
  • H2 is less plausible than H1 because it is less
    simple and less frequent, so we could give
    higher prior odds to H1.
  • But we decide to give the same prior probability
    to both H1 and H2.

14
Bayesian Inference
  • Likelihood: we suppose that our numbers can vary
    between -50 and 50 (a range of 101 values).
  • H1 depends on a first number and on n.
  • H2 depends on c, d, e, and the first number. Each
    parameter c, d, e can be represented by a fraction
    whose numerator is an integer in the range [-50, 50]
    and whose denominator is in the range [1, 50].
    -1/11 has four representations (-1/11, -2/22,
    -3/33, -4/44), 9/11 has four, and 23/11 has two.

15
Bayesian Inference
  • Thus the ratio between the two likelihoods,
    P(D|H1) / P(D|H2), is about 40 million to one in
    favor of H1 (see the sketch below).
  • Bayesian inference favors simple models because
    they have fewer parameters, each parameter
    introducing sources of variation and error.
  • Occam's razor
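Under the parameter space described above (first number and n each uniform over 101 integer values; each fraction with a numerator in [-50, 50] and a denominator in [1, 50]), the ratio can be checked directly. A minimal sketch; the counting assumptions follow the slide and are not an exact reproduction of MacKay's calculation.

    # H1 is fixed by the first number and the increment n, each uniform over 101 values.
    p_d_h1 = (1 / 101) * (1 / 101)

    # H2 is fixed by the first number and the fractions c, d, e.
    # Each fraction gets probability (#representations) / (101 * 50).
    p_c = 4 / (101 * 50)   # c = -1/11 has 4 representations (-1/11, -2/22, -3/33, -4/44)
    p_d = 4 / (101 * 50)   # d = 9/11 has 4 representations
    p_e = 2 / (101 * 50)   # e = 23/11 has 2 representations
    p_d_h2 = (1 / 101) * p_c * p_d * p_e

    print(p_d_h1 / p_d_h2)   # roughly 4e7: about 40 million to one in favor of H1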

16
Probabilistic Belief
  • There are several possible worlds that are
    indistinguishable to a system given some prior
    evidence.
  • The system believes that a logic sentence B is
    True with probability p and False with
    probability 1-p. B is called a belief.
  • In the frequency interpretation of probabilities,
    this means that the system believes that the
    fraction of possible worlds that satisfy B is p.
  • The distribution (p, 1-p) is the strength of B.

17
Problem
  • At a certain time t, the knowledge of a system is
    some collection of beliefs
  • At time t the system gets an observation that
    changes the strength of one of its beliefs
  • How should it update the strengths of its other
    beliefs?

18
Toothache Example
  • A certain dentist is only interested in two
    things about any patient, whether he has a
    toothache and whether he has a cavity
  • Over years of practice, she has constructed the
    following joint distribution

19
Toothache Example
  • Using the joint distribution, the dentist can
    compute the strength of any logic sentence built
    with the propositions Toothache and Cavity

20
New Evidence
  • She now makes an observation E that indicates
    that a specific patient x has high probability
    (0.8) of having a toothache, but is not directly
    related to whether he has a cavity

21
Adjusting Joint Distribution
  • She now makes an observation E that indicates
    that a specific patient x has high probability
    (0.8) of having a toothache, but is not directly
    related to whether he has a cavity
  • She can use this additional information to create
    a joint distribution (specific to x) conditional
    on E, by keeping the same probability ratios
    between Cavity and ¬Cavity

22
Corresponding Calculus
  • P(C|T) = P(C∧T) / P(T) = 0.04/0.05

23
Corresponding Calculus
  • P(C|T) = P(C∧T) / P(T) = 0.04/0.05
  • P(C∧T|E) = P(C|T,E) P(T|E)
    = P(C|T) P(T|E)

24
Corresponding Calculus
  • P(C|T) = P(C∧T) / P(T) = 0.04/0.05
  • P(C∧T|E) = P(C|T,E) P(T|E)
    = P(C|T) P(T|E) = (0.04/0.05) × 0.8
    = 0.64
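A minimal sketch of this soft-evidence update. P(C∧T) = 0.04 and P(T) = 0.05 come from the slides; the remaining joint entries (0.06 and 0.89) are assumed here only so that the table sums to 1.

    # Prior joint distribution over (Cavity, Toothache).
    # The entries 0.06 and 0.89 are assumed for illustration; 0.04 and 0.01
    # are consistent with P(C∧T) = 0.04 and P(T) = 0.05 from the slides.
    joint = {(True, True): 0.04, (True, False): 0.06,
             (False, True): 0.01, (False, False): 0.89}

    # Evidence E: P(Toothache | E) = 0.8 for this patient.
    p_t_given_e = 0.8
    p_t = joint[(True, True)] + joint[(False, True)]   # P(T) = 0.05

    # Rescale each Toothache column to the new totals (0.8 and 0.2), keeping the
    # probability ratios between Cavity and ¬Cavity unchanged within each column.
    updated = {(c, t): p * (p_t_given_e / p_t if t else (1 - p_t_given_e) / (1 - p_t))
               for (c, t), p in joint.items()}

    print(updated[(True, True)])   # P(C∧T|E) = (0.04/0.05) * 0.8 ≈ 0.64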

25
Generalization
  • n beliefs X1, ..., Xn
  • The joint distribution can be used to update
    probabilities when new evidence arrives
  • But:
  • The joint distribution contains 2^n probabilities
  • Useful independence is not made explicit

26
Purpose of Belief Networks
  • Facilitate the description of a collection of
    beliefs by making explicit causality relations
    and conditional independence among beliefs
  • Provide a more efficient way (than by using joint
    distribution tables) to update belief strengths
    when new evidence is observed

27
Alarm Example
  • Five beliefs
  • A = Alarm
  • B = Burglary
  • E = Earthquake
  • J = JohnCalls
  • M = MaryCalls

28
A Simple Belief Network
Intuitive meaning of an arrow from x to y: x has
direct influence on y
Directed acyclic graph (DAG)
Nodes are random variables
29
Assigning Probabilities to Roots
30
Conditional Probability Tables
Size of the CPT for a node with k parents: 2^k
31
Conditional Probability Tables
32
What the BN Means
P(x1, x2, ..., xn) = Π_{i=1,...,n} P(xi | Parents(Xi))
33
Calculation of Joint Probability
P(J∧M∧A∧¬B∧¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062
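A minimal sketch of this factorization for the alarm network; only the CPT entries appearing in the calculation above are used (P(B) = 0.001 and P(E) = 0.002 follow from P(¬B) = 0.999 and P(¬E) = 0.998).

    # CPT entries for the alarm network used in this calculation.
    p_b = 0.001                            # P(Burglary)
    p_e = 0.002                            # P(Earthquake)
    p_a_given_nb_ne = 0.001                # P(Alarm | ¬B, ¬E)
    p_j_given_a = 0.90                     # P(JohnCalls | Alarm)
    p_m_given_a = 0.70                     # P(MaryCalls | Alarm)

    # P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
    joint = (p_j_given_a * p_m_given_a * p_a_given_nb_ne
             * (1 - p_b) * (1 - p_e))
    print(joint)                           # ≈ 0.00062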
34
What The BN Encodes
  • Each of the beliefs JohnCalls and MaryCalls is
    independent of Burglary and Earthquake given
    Alarm or ¬Alarm
  • The beliefs JohnCalls and MaryCalls are
    independent given Alarm or ¬Alarm

36
Structure of BN
  • The relation P(x1, x2, ..., xn) =
    Π_{i=1,...,n} P(xi | Parents(Xi)) means that each belief
    is independent of its predecessors in the BN
    given its parents
  • In other words, the parents of a belief Xi are
    all the beliefs that directly influence Xi
  • Usually (but not always) the parents of Xi are
    its causes and Xi is the effect of these causes

E.g., JohnCalls is influenced by Burglary, but
not directly. JohnCalls is directly influenced
by Alarm
37
Construction of BN
  • Choose the relevant sentences (random variables)
    that describe the domain
  • Select an ordering X1, ..., Xn so that all the
    beliefs that directly influence Xi come before Xi
  • For j = 1, ..., n do:
  • Add a node in the network labeled by Xj
  • Connect the nodes of its parents to Xj
  • Define the CPT of Xj
  • The ordering guarantees that the BN will have
    no cycles
  • The CPTs guarantee that exactly the correct
    number of probabilities will be defined: none
    missing, none extra

Use a canonical distribution, e.g., noisy-OR, to
fill CPTs (a sketch follows)
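A minimal noisy-OR sketch for filling a CPT. The parent names (Cold, Flu, Malaria), the effect (Fever), and the per-cause inhibition probabilities are hypothetical illustration values, not taken from the slides.

    from itertools import product

    # Noisy-OR: each true parent independently fails to cause the effect with
    # its "inhibition" probability; the effect is false only if every active
    # cause is inhibited.
    inhibition = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}   # hypothetical values

    def noisy_or_cpt(inhibition):
        parents = list(inhibition)
        cpt = {}
        for values in product([False, True], repeat=len(parents)):
            q = 1.0
            for parent, is_true in zip(parents, values):
                if is_true:
                    q *= inhibition[parent]
            cpt[values] = 1.0 - q          # P(Effect = true | parent assignment)
        return cpt

    cpt = noisy_or_cpt(inhibition)
    print(cpt[(True, True, False)])        # P(Fever | Cold, Flu, ¬Malaria) = 0.88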
38
Locally Structured Domain
  • Size of a CPT: 2^k, where k is the number of parents
  • In a locally structured domain, each belief is
    directly influenced by relatively few other
    beliefs, and k is small
  • BNs are better suited for locally structured
    domains

39
Inference In BN
P(X|obs) = Σ_e P(X|e) P(e|obs), where e is an
assignment of values to the evidence variables
  • Set E of evidence variables that are observed
    with a new probability distribution, e.g.,
    JohnCalls, MaryCalls
  • Query variable X, e.g., Burglary, for which we
    would like to know the posterior probability
    distribution P(X|E)

40
Inference Patterns
  • Basic use of a BN: given new
    observations, compute the new strengths of some
    (or all) beliefs
  • Other use: given the strength of
    a belief, which observation should
    we gather to make the greatest
    change in this belief's strength?

41
Singly Connected BN
  • A BN is singly connected if there is at most one
    undirected path between any two nodes

(The example network shown on the slide is singly connected)
42
Types Of Nodes On A Path
(A node on an undirected path can be linear, diverging,
or converging; these types are used in the next slides)
43
Independence Relations In BN
Given a set E of evidence nodes, two beliefs
connected by an undirected path are independent
if one of the following three conditions
holds 1. A node on the path is linear and in
E 2. A node on the path is diverging and in E 3.
A node on the path is converging and neither
this node, nor any descendant is in E
44
Independence Relations In BN
Given a set E of evidence nodes, two beliefs
connected by an undirected path are independent
if one of the following three conditions holds:
1. A node on the path is linear and in E
2. A node on the path is diverging and in E
3. A node on the path is converging and neither
this node nor any of its descendants is in E

Gas and Radio are independent given evidence on
SparkPlugs
45
Independence Relations In BN
Given a set E of evidence nodes, two beliefs
connected by an undirected path are independent
if one of the following three conditions holds:
1. A node on the path is linear and in E
2. A node on the path is diverging and in E
3. A node on the path is converging and neither
this node nor any of its descendants is in E

Gas and Radio are independent given evidence on
Battery
46
Independence Relations In BN
Given a set E of evidence nodes, two beliefs
connected by an undirected path are independent
if one of the following three conditions holds:
1. A node on the path is linear and in E
2. A node on the path is diverging and in E
3. A node on the path is converging and neither
this node nor any of its descendants is in E

Gas and Radio are independent given no evidence,
but they are dependent given evidence on Starts
or Moves
47
BN Inference
  • Simplest case: a single arc A → B

P(B) = ?
P(B) = P(a) P(B|a) + P(¬a) P(B|¬a)
P(C) = ?
48
BN Inference
  • Chain: X1 → X2 → ... → Xn

What is the time complexity to compute P(Xn)?
What is the time complexity if we computed the full
joint? (See the sketch below.)
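A sketch of the first question: along a chain, P(Xn) can be computed by a forward pass that is linear in n, instead of summing the full joint over all exponentially many assignments. The CPT values below are hypothetical.

    # Chain X1 -> X2 -> ... -> Xn of binary variables.
    # cpts[i] maps the value of X_i to P(X_{i+1} = True | X_i).
    n = 50
    p_x1 = 0.3                                              # hypothetical P(X1 = True)
    cpts = [{True: 0.9, False: 0.2} for _ in range(n - 1)]  # hypothetical CPTs

    # Forward pass: O(n) work instead of the O(2^n) full-joint summation.
    p = p_x1
    for cpt in cpts:
        p = p * cpt[True] + (1 - p) * cpt[False]            # P(X_{i+1} = True)

    print(p)   # P(Xn = True)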
49
Inference Ex. 2
The algorithm computes not individual
probabilities, but entire tables
  • Two ideas are crucial to avoiding exponential blowup:
  • Because of the structure of the BN, some
    subexpressions in the joint depend only on a
    small number of variables
  • By computing them once and caching the results,
    we can avoid generating them exponentially many
    times

50
Variable Elimination
  • General idea:
  • Write the query as a sum, over the non-query
    variables, of a product of CPT factors
  • Iteratively:
  • Move all irrelevant terms outside of the innermost
    sum
  • Perform the innermost sum, getting a new term
  • Insert the new term into the product

51
A More Complex Example
  • Asia network, with variables v, s, t, l, b, a, x, d

52
  • We want to compute P(d)
  • Need to eliminate v, s, x, t, l, a, b
  • Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

53
  • We want to compute P(d)
  • Need to eliminate v, s, x, t, l, a, b
  • Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate v: f_v(t) = Σ_v P(v) P(t|v)
Note f_v(t) = P(t). In general, the result of
elimination is not necessarily a probability term
54
  • We want to compute P(d)
  • Need to eliminate s, x, t, l, a, b
  • Remaining factors:
    f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate s: f_s(b,l) = Σ_s P(s) P(b|s) P(l|s)
Summing over s results in a factor with two
arguments, f_s(b,l). In general, the result of
elimination may be a function of several variables
55
  • We want to compute P(d)
  • Need to eliminate x, t, l, a, b
  • Remaining factors:
    f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)

Eliminate x: f_x(a) = Σ_x P(x|a)
Note f_x(a) = 1 for all values of a!
56
  • We want to compute P(d)
  • Need to eliminate t, l, a, b
  • Remaining factors:
    f_v(t) f_s(b,l) P(a|t,l) f_x(a) P(d|a,b)

Eliminate t: f_t(a,l) = Σ_t f_v(t) P(a|t,l)
57
  • We want to compute P(d)
  • Need to eliminate l, a, b
  • Remaining factors:
    f_s(b,l) f_t(a,l) f_x(a) P(d|a,b)

Eliminate l: f_l(a,b) = Σ_l f_s(b,l) f_t(a,l)
58
  • We want to compute P(d)
  • Need to eliminate a, b
  • Remaining factors:
    f_l(a,b) f_x(a) P(d|a,b)

Eliminate a, then b:
f_a(b,d) = Σ_a f_l(a,b) f_x(a) P(d|a,b)
f_b(d) = Σ_b f_a(b,d) = P(d)
59
Variable Elimination
  • We now understand variable elimination as a
    sequence of rewriting operations
  • The actual computation is done in the elimination steps
  • The computation depends on the order of elimination

60
Dealing with evidence
  • How do we deal with evidence?
  • Suppose we get evidence V = t, S = f, D = t
  • We want to compute P(L, V = t, S = f, D = t)

61
Dealing with Evidence
  • We start by writing the factors
  • Since we know that V = t, we don't need to
    eliminate V
  • Instead, we can replace the factors P(V) and
    P(T|V) with their restrictions f_P(V) = P(V = t)
    and f_P(T|V)(T) = P(T | V = t)
  • These select the appropriate parts of the
    original factors given the evidence
  • Note that f_P(V) is a constant, and thus does not
    appear in the elimination of other variables

62
Dealing with Evidence
  • Given evidence V = t, S = f, D = t
  • Compute P(L, V = t, S = f, D = t)
  • Initial factors, after setting the evidence

63
Variable Elimination Algorithm
  • Let X1, ..., Xm be an ordering of the non-query
    variables
  • For i = m, ..., 1:
  • Leave in the summation for Xi only the factors
    mentioning Xi
  • Multiply those factors, getting a factor that
    contains a number for each value of the variables
    mentioned, including Xi
  • Sum out Xi, getting a factor f that contains a
    number for each value of the variables mentioned,
    not including Xi
  • Replace the multiplied factors in the summation
    (a sketch of this algorithm follows)
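A compact sketch of this algorithm with factors represented as pairs (variable list, table mapping value tuples to numbers). The two-node network and its numbers at the end are hypothetical; this is not the Asia network.

    from itertools import product

    def multiply(f, g):
        # Pointwise product of two factors over the union of their variables.
        f_vars, f_tab = f
        g_vars, g_tab = g
        vars_ = list(dict.fromkeys(f_vars + g_vars))
        table = {}
        for values in product([False, True], repeat=len(vars_)):
            assign = dict(zip(vars_, values))
            table[values] = (f_tab[tuple(assign[v] for v in f_vars)]
                             * g_tab[tuple(assign[v] for v in g_vars)])
        return vars_, table

    def sum_out(var, f):
        # Sum the factor over all values of one variable.
        f_vars, f_tab = f
        keep = [v for v in f_vars if v != var]
        table = {}
        for values, p in f_tab.items():
            key = tuple(v for name, v in zip(f_vars, values) if name != var)
            table[key] = table.get(key, 0.0) + p
        return keep, table

    def eliminate(factors, order):
        for var in order:                                   # non-query variables
            related = [f for f in factors if var in f[0]]   # factors mentioning var
            rest = [f for f in factors if var not in f[0]]
            prod = related[0]
            for f in related[1:]:
                prod = multiply(prod, f)                    # multiply the factors
            factors = rest + [sum_out(var, prod)]           # sum out var, reinsert
        result = factors[0]
        for f in factors[1:]:
            result = multiply(result, f)
        return result

    # Tiny hypothetical example: A -> B, query P(B) by eliminating A.
    fa = (["A"], {(True,): 0.3, (False,): 0.7})                    # P(A)
    fba = (["B", "A"], {(True, True): 0.9, (True, False): 0.2,
                        (False, True): 0.1, (False, False): 0.8})  # P(B|A)
    print(eliminate([fa, fba], ["A"]))   # P(B=True) = 0.3*0.9 + 0.7*0.2 = 0.41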

64
Understanding Variable Elimination
  • We want to select "good" elimination orderings
    that reduce complexity
  • This can be done by examining a graph-theoretic
    property of the "induced graph"; we will not
    cover this in class.
  • This reduces the problem of finding a good ordering
    to a graph-theoretic operation that is
    well understood. Unfortunately, computing it is
    NP-hard!

65
Bayesian Networks: Classification
Bayes' rule inverts the arc:
diagnostic P(C | x) = p(x | C) P(C) / p(x)
66
Naive Bayes Classifier
Given C, the xj are independent:
p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
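A minimal naive Bayes sketch for binary features; the training data, class labels, and the use of Laplace smoothing are hypothetical illustration choices, not taken from the slides.

    import math

    # Hypothetical training data: (binary feature vector, class label).
    data = [((1, 0, 1), "spam"), ((1, 1, 1), "spam"),
            ((0, 0, 1), "ham"), ((0, 0, 0), "ham"), ((0, 1, 0), "ham")]
    classes = {c for _, c in data}
    d = len(data[0][0])

    # Priors P(C) and per-feature conditionals P(xj = 1 | C), with Laplace
    # smoothing so no probability is exactly zero.
    prior = {c: sum(1 for _, y in data if y == c) / len(data) for c in classes}
    cond = {c: [(sum(x[j] for x, y in data if y == c) + 1)
                / (sum(1 for _, y in data if y == c) + 2)
                for j in range(d)]
            for c in classes}

    def classify(x):
        # argmax over C of log P(C) + sum_j log p(xj | C)
        scores = {}
        for c in classes:
            logp = math.log(prior[c])
            for j in range(d):
                p1 = cond[c][j]
                logp += math.log(p1 if x[j] else 1 - p1)
            scores[c] = logp
        return max(scores, key=scores.get)

    print(classify((1, 0, 1)))   # expected: "spam"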
67
Summary
  • Probabilistic framework
  • Role of conditional independence
  • Belief networks
  • Causality ordering
  • Inference in BN
  • Naïve Bayes classifier