Transcript and Presenter's Notes

Title: Bayesian Network


1
Bayesian Network
  • CVPR Winter seminar
  • Jaemin Kim

2
Outline
  • Concepts in Probability
  • Probability
  • Random variables
  • Basic properties (Bayes rule)
  • Bayesian Networks
  • Inference
  • Decision making
  • Learning networks from data
  • Reasoning over time
  • Applications

3
Probabilities
  • Probability distribution P(X = x | ξ)
  • X is a random variable
  • Discrete
  • Continuous
  • ξ is the background state of information

4
Discrete Random Variables
  • Finite set of possible outcomes

X binary
5
Continuous Random Variables
  • Probability distribution (density function) over
    continuous values

6
More Probabilities
  • Joint
  • Probability that both X = x and Y = y
  • Conditional
  • Probability that X = x given we know that Y = y

7
Rule of Probabilities
  • Product Rule: P(X, Y) = P(X | Y) P(Y)
  • Marginalization: P(Y) = Σx P(X = x, Y)

For X binary: P(Y) = P(x, Y) + P(¬x, Y)
8
Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
9
Graph Model
  • A model that represents each variable's probability
    distribution and the dependencies among the variables
  • Definition
  • A collection of variables (nodes) with a set of
    dependencies (edges) between the variables, and
  • a set of probability distribution functions
    for each variable
  • A Bayesian network is a special type of graph
    model which is a directed acyclic graph (DAG)

10
Bayesian Networks
  • A Graph
  • nodes represent the random variables
  • directed edges (arrows) between pairs of nodes
  • it must be a Directed Acyclic Graph (DAG)
  • the graph represents relationships between
    variables
  • Conditional probability specifications
  • the conditional probability distribution (CPD)
    of each variable
  • given its parents
  • discrete variable table (CPT)

11
Bayesian Networks (Belief Networks)
  • A Graph
  • directed edges (arrows) between pairs of nodes
  • causality: A causes B
  • AI and statistics communities

Markov Random fields (MRF)
  • A Graph
  • undirected edges between pairs of nodes
  • a simple definition of independence
  • if all paths between the nodes in A and B
    are separated by the nodes in a third set C,
  • then A and B are conditionally independent given
    C
  • physics and vision communities

12
Bayesian Networks
13
Bayesian networks
  • Basics
  • Structured representation
  • Conditional independence
  • Naïve Bayes model
  • Independence facts

14
Bayesian networks
Smoking
Cancer
P(S)
P(C | S)
15
Product Rule
  • P(C,S) = P(C | S) P(S)

P(C=none, S=no) = P(C=none | S=no) P(S=no)
= 0.96 × 0.8 = 0.768
16
Product Rule
  • P(C,S) = P(C | S) P(S)

P(C=none, S=no) = P(C=none | S=no) P(S=no)
= 0.96 × 0.8 = 0.768
17
Marginalization
P(Smoke)
P(Cancer)
P(S=no) = P(S=no, C=none) + P(S=no, C=benign)
+ P(S=no, C=malignant)
P(C=mal) = P(C=mal, S=no) + P(C=mal, S=light)
+ P(C=mal, S=heavy)
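The product rule and marginalization above can be checked with a few lines of Python. Only P(S=no) = 0.8 and P(C=none | S=no) = 0.96 come from the slides; the other CPT entries below are hypothetical placeholders chosen so each distribution sums to 1.

```python
# Minimal sketch of the product rule and marginalization for the
# Smoking -> Cancer fragment.  Only P(S=no)=0.8 and P(C=none|S=no)=0.96
# are taken from the slides; every other number is a hypothetical placeholder.

P_S = {"no": 0.8, "light": 0.15, "heavy": 0.05}          # prior P(S)
P_C_given_S = {                                          # CPT P(C | S)
    "no":    {"none": 0.96, "benign": 0.03, "malignant": 0.01},
    "light": {"none": 0.88, "benign": 0.08, "malignant": 0.04},
    "heavy": {"none": 0.60, "benign": 0.25, "malignant": 0.15},
}

# Product rule: P(C=c, S=s) = P(C=c | S=s) * P(S=s)
def joint(c, s):
    return P_C_given_S[s][c] * P_S[s]

print(joint("none", "no"))          # 0.96 * 0.8 = 0.768, as on the slide

# Marginalization: P(C=c) = sum over s of P(C=c, S=s)
def marginal_cancer(c):
    return sum(joint(c, s) for s in P_S)

print(marginal_cancer("malignant"))
```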
18
Bayes Rule Revisited
P(S | C) = P(C | S) P(S) / P(C)
19
A Bayesian Network
Gender
Age
Exposure to Toxics
Smoking
Cancer
Serum Calcium
Lung Tumor
20
Problems with Large Instances
  • The joint probability distribution,
    P(A,G,E,S,C,L,SC)
  • For the seven binary variables there are 2^7 = 128
    values in the joint distribution (for 100
    variables there are over 10^30 values)
  • How are these values to be obtained?
  • Inference
  • To obtain posterior distributions once some
    evidence is available requires summation over an
    exponential number of terms, e.g. 2^2 in the
    calculation of a single posterior,

which increases to 2^97 if there are 100 variables.
21
Independence
Age and Gender are independent.
Gender
Age
P(A,G) = P(G) P(A)
P(A | G) = P(A)      (A ⊥ G)
P(G | A) = P(G)      (G ⊥ A)
P(A,G) = P(G | A) P(A) = P(G) P(A)
P(A,G) = P(A | G) P(G) = P(A) P(G)
22
Conditional Independence
Cancer is independent of Age and Gender given
Smoking.
Gender
Age
Smoking
P(C | A,G,S) = P(C | S)      (C ⊥ A,G | S)
Cancer
  • Knowing Age or Gender tells us something about (Smoking=heavy)
  • Knowing (Smoking=heavy) tells us something about Cancer
  • But once (Smoking=heavy) is known, Age and Gender give
    no further information about Cancer

23
More Conditional Independence: Naïve Bayes
Serum Calcium and Lung Tumor are dependent
Cancer
Serum Calcium
Lung Tumor
  • but Serum Calcium and Lung Tumor are independent given Cancer
24
More Conditional Independence: Explaining Away
Exposure to Toxics and Smoking are independent
Exposure to Toxics
Smoking
E ⊥ S
Cancer
Exposure to Toxics is dependent on Smoking, given
Cancer
25
More Conditional Independence: Explaining Away
Exposure to Toxics
Exposure to Toxics
Smoking
Smoking
Cancer
Cancer
Exposure to Toxics is dependent on Smoking,
given Cancer
Moralize the graph.
26
Put it all together
27
General Product (Chain) Rule for Bayesian
Networks
P(X1, …, Xn) = ∏i P(Xi | Pai),  where Pai = parents(Xi)
28
Conditional Independence
A variable (node) is conditionally independent of
its non-descendants given its parents.
Gender
Age
Non-Descendants
Exposure to Toxics
Smoking
Parents
Cancer is independent of Age and Gender given
Exposure to Toxics and Smoking.
Cancer
Serum Calcium
Lung Tumor
Descendants
29
Another non-descendant
Gender
Age
Cancer is independent of Diet given Exposure to
Toxics and Smoking.
Exposure to Toxics
Smoking
Diet
Cancer
Serum Calcium
Lung Tumor
30
Representing the Joint Distribution
In general, for a network with nodes X1, X2, …,
Xn,
P(X1, …, Xn) = ∏i P(Xi | parents(Xi)).
An enormous saving can be made in the
number of values required for the joint
distribution. To determine the joint
distribution directly for n binary variables, 2^n − 1
values are required. For a BN with n binary
variables where each node has at most k parents,
fewer than 2^k · n values are required.
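As a quick arithmetic check of that saving (not from the slides): the full joint over n binary variables needs 2^n − 1 numbers, while a BN whose nodes have at most k parents needs at most n · 2^k.

```python
# Joint-distribution size vs. BN size for n binary variables, each node
# having at most k parents (pure arithmetic, matching the bound on the slide).
def joint_size(n):
    return 2 ** n - 1

def bn_size(n, k):
    return n * 2 ** k          # upper bound: one value per parent configuration, per node

print(joint_size(7), bn_size(7, 2))      # 127 vs. 28
print(joint_size(100), bn_size(100, 4))  # about 1.27e30 vs. 1600
```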
31
An Example
P(s1) = 0.2
P(l1 | s1) = 0.003    P(l1 | s2) = 0.00005
P(b1 | s1) = 0.25     P(b1 | s2) = 0.05
P(f1 | b1,l1) = 0.75  P(f1 | b1,l2) = 0.10  P(f1 | b2,l1) = 0.5
P(f1 | b2,l2) = 0.05
P(x1 | l1) = 0.6      P(x1 | l2) = 0.02
32
Solution
Note that our joint distribution with 5 variables
can be represented as
P(s,l,b,f,x) = P(s) P(l | s) P(b | s) P(f | b,l) P(x | l).
Consequently any entry of the joint probability
distribution can now be computed directly from the CPTs.
For example, the probability that someone has a
smoking history (s1), lung cancer (l1) but not bronchitis (b2),
suffers from fatigue (f1) and tests positive in an
X-ray test (x1) is
P(s1, l1, b2, f1, x1) = 0.2 × 0.003 × 0.75 × 0.5 × 0.6 = 0.000135.
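The same calculation as a short Python sketch, using the CPT values from slide 31; the factorization mirrors the network structure (L and B depend on S, F on B and L, X on L).

```python
# P(s1, l1, b2, f1, x1) for the 5-variable example, using the CPTs from slide 31.
# Subscript 1 means "true", subscript 2 means "false".
P_s1 = 0.2
P_l1_given_s1 = 0.003
P_b1_given_s1 = 0.25
P_f1_given_b2_l1 = 0.5
P_x1_given_l1 = 0.6

p = (P_s1
     * P_l1_given_s1
     * (1 - P_b1_given_s1)      # P(b2 | s1) = 1 - P(b1 | s1) = 0.75
     * P_f1_given_b2_l1
     * P_x1_given_l1)
print(p)                        # 0.000135
```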
33
Independence and Graph Separation
  • Given a set of observations, is one set of
    variables dependent on another set?
  • Observing effects can induce dependencies.
  • d-separation (Pearl 1988) allows us to check
    conditional independence graphically.

34
Bayesian networks
  • Additional structure
  • Nodes as functions
  • Causal independence
  • Context specific dependencies
  • Continuous variables
  • Hierarchy and model construction

35
Nodes as functions
  • A BN node is a conditional distribution function
  • its parent values are the inputs
  • its output is a distribution over its values

A, B → X
(CPT figure: for each combination of the parent values A and B,
a distribution over the values of X, e.g. 0.5, 0.3, 0.2)
36
Nodes as functions
A, B → X
Any type of function from Val(A,B) to
distributions over Val(X)
37
Continuous variables
(Figure: A/C Setting = hi, a discrete variable;
Outdoor Temperature = 97°, a continuous variable)
38
Gaussian (normal) distributions
N(μ, σ)
39
Gaussian networks
Each variable is a linear function of its
parents, with Gaussian noise.
Joint probability density function: multivariate Gaussian.
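Concretely, a linear Gaussian node is usually written X = a1·Pa1 + … + ak·Pak + ε with ε ~ N(0, σ²). Below is a minimal Python sampling sketch; all coefficients, means, and standard deviations are hypothetical illustration values.

```python
import random

# Minimal sketch of a linear Gaussian network: each variable is a linear
# function of its parents plus Gaussian noise.  All numbers are hypothetical.
def sample_linear_gaussian():
    a = random.gauss(0.0, 1.0)                 # root node A ~ N(0, 1)
    b = random.gauss(5.0, 2.0)                 # root node B ~ N(5, 2)
    x = random.gauss(0.7 * a + 0.2 * b, 0.5)   # X = 0.7*A + 0.2*B + N(0, 0.5)
    return a, b, x

print(sample_linear_gaussian())
```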
40
Composing functions
  • Recall a BN node is a function
  • We can compose functions to get more complex
    functions.
  • The result: a hierarchically structured BN.
  • Since functions can be called more than once, we
    can reuse a BN model fragment in multiple
    contexts.

41
Owner
Maintenance
Age
Original-value
Mileage
Brakes
Car
Fuel-efficiency
Braking-power
42
Bayesian Networks
  • Knowledge acquisition
  • Variables
  • Structure
  • Numbers

43
What is a variable?
  • Collectively exhaustive, mutually exclusive values

Error Occurred
No Error
44
Clarity Test: Knowable in Principle
  • Weather: Sunny, Cloudy, Rain, Snow
  • Gasoline: cents per gallon
  • Temperature: ≥ 100°F, < 100°F
  • User needs help on Excel Charting: Yes, No
  • User's personality: dominant, submissive

45
Structuring
Network structure corresponding to causality is
usually good.
Extending the conversation.
Lung Tumor
46
Course Contents
  • Concepts in Probability
  • Bayesian Networks
  • Inference
  • Decision making
  • Learning networks from data
  • Reasoning over time
  • Applications

47
Inference
  • Patterns of reasoning
  • Basic inference
  • Exact inference
  • Exploiting structure
  • Approximate inference

48
Predictive Inference
Gender
Age
How likely are elderly males to get malignant
cancer?
Exposure to Toxics
Smoking
P(C=malignant | Age > 60, Gender = male)
Cancer
Serum Calcium
Lung Tumor
49
Combined
Gender
Age
How likely is an elderly male patient with high
Serum Calcium to have malignant cancer?
Exposure to Toxics
Smoking
Cancer
P(C=malignant | Age > 60, Gender = male,
Serum Calcium = high)
Serum Calcium
Lung Tumor
50
Explaining away
Gender
Age
  • If we see a lung tumor, the probability of heavy
    smoking and of exposure to toxics both go up.

Exposure to Toxics
Smoking
Cancer
Serum Calcium
Lung Tumor
51
Inference in Belief Networks
  • Find P(Q = q | E = e)
  • Q: the query variable
  • E: set of evidence variables

X1, …, Xn are the network variables except Q, E
P(q, e) = Σx1,…,xn P(q, e, x1, …, xn)
52
Basic Inference
A → B
P(b) = Σa P(b | a) P(a)
53
Inference in trees
Y1, Y2 → X
P(x) = Σy1,y2 P(x | y1, y2) P(y1, y2)
54
Polytrees
  • A network is singly connected (a polytree) if it
    contains no undirected loops.

D
C
Theorem: Inference in a singly connected network
can be done in linear time in network size,
including table sizes.
Main idea: in variable elimination, we need only
maintain distributions over single nodes.
55
The problem with loops
(Figure: Cloudy → Sprinkler, Cloudy → Rain, and Sprinkler, Rain → Grass-wet;
P(c) = 0.5, and the tables P(s | c), P(r | c) contain the values 0.01 and 0.99.)
Grass-wet is a deterministic OR:
the grass is dry only if no rain and no
sprinklers.
56
The problem with loops contd.
(Figure: the computation gives P(g) = 0, which is the problem.)
57
Variable elimination
A → B → C
P(c) = Σb P(c | b) Σa P(b | a) P(a)
58
Inference as variable elimination
  • A factor over X is a function from val(X) to
    numbers in [0, 1]
  • A CPT is a factor
  • A joint distribution is also a factor
  • BN inference
  • factors are multiplied to give new ones
  • variables in factors are summed out
  • A variable can be summed out as soon as all
    factors mentioning it have been multiplied.
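A minimal Python sketch of the two factor operations just listed (multiplying factors and summing a variable out), applied to the two-node chain A → B. The dictionary-based representation and the numbers are illustrative, not any particular library's API.

```python
from itertools import product

# A factor is (variables, table): `variables` is a tuple of names and `table`
# maps a tuple of values (one per variable) to a number.

def multiply(f, g):
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    out_tab = {}
    for assignment in product([0, 1], repeat=len(out_vars)):   # binary variables
        env = dict(zip(out_vars, assignment))
        fv = f_tab[tuple(env[v] for v in f_vars)]
        gv = g_tab[tuple(env[v] for v in g_vars)]
        out_tab[assignment] = fv * gv
    return out_vars, out_tab

def sum_out(f, var):
    f_vars, f_tab = f
    out_vars = tuple(v for v in f_vars if v != var)
    out_tab = {}
    for assignment, value in f_tab.items():
        key = tuple(val for v, val in zip(f_vars, assignment) if v != var)
        out_tab[key] = out_tab.get(key, 0.0) + value
    return out_vars, out_tab

# Chain A -> B with hypothetical CPTs: P(B) = sum over a of P(B | a) P(a).
P_A = (("A",), {(0,): 0.4, (1,): 0.6})
P_B_given_A = (("B", "A"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8})
P_B = sum_out(multiply(P_B_given_A, P_A), "A")
print(P_B)   # (('B',), {(0,): 0.48, (1,): 0.52})
```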

59
Variable Elimination with loops
Gender
Age
Exposure to Toxics
Smoking
Cancer
Serum Calcium
Lung Tumor
Complexity is exponential in the size of the
factors
60
Inference in BNs and Junction Tree
  • The main point of BNs is to enable probabilistic
    inference to be performed. Inference is the task
    of computing the probability of each value of a
    node in a BN when the other variables' values are
    known.
  • The general idea is to do inference by
    representing the joint probability distribution
    on an undirected graph called the junction tree.
  • The junction tree has the following
    characteristics
  • it is an undirected tree; its nodes are
    clusters of variables
  • given two clusters, C1 and C2, every node on
    the path between them contains their
    intersection C1 ∩ C2
  • a separator, S, is associated with each edge
    and contains the variables in the
    intersection between neighbouring nodes

61
Inference in BNs
  • Moralize the Bayesian network
  • Triangulate the moralized graph
  • Let the cliques of the triangulated graph be the
    nodes of a tree, and construct the junction tree
  • Belief propagation throughout the junction tree
    to do inference

62
Constructing the Junction Tree (1)
Step 1. Form the moral graph from the
DAG. Consider the BN in our example.
Moral graph: marry parents and remove arrows.
DAG
63
Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph An undirected
graph is triangulated if every cycle of length
greater than 3 possesses a chord
64
Constructing the Junction Tree (3)
Step 3. Identify the cliques. A clique is a subset
of nodes which is complete (i.e. there is an edge
between every pair of nodes) and maximal.
Cliques: {B,S,L}, {B,L,F}, {L,X}
65
Constructing the Junction Tree (4)
Step 4. Build the junction tree. The cliques should be
ordered (C1, C2, …, Ck) so they possess the running
intersection property: for all 1 < j ≤ k, there
is an i < j such that Cj ∩ (C1 ∪ … ∪ Cj−1) ⊆ Ci.
To build the junction tree, choose one such i for
each j and add an edge between Cj and Ci.
Junction Tree
Cliques: {B,S,L}, {B,L,F}, {L,X}
Separators: {B,L}, {L}
66
Potentials Initialization
To initialize the potential functions: 1. set all
potentials to unity; 2. for each variable, Xi,
select one node in the junction tree (i.e. one
clique) containing both that variable and its
parents, pa(Xi), in the original DAG; 3. multiply
that clique's potential by P(xi | pa(xi)).
(Figure: the junction tree with separators {B,L} and {L})
67
Potential Representation
The joint probability distribution can now be
represented in terms of potential functions, ψ,
defined on each clique and each separator of the
junction tree: the joint distribution is the product
of the clique potentials divided by the product of
the separator potentials (see the formula below).
The idea is to transform one representation of
the joint distribution into another in which, for
each clique C, the potential function gives the
marginal distribution for the variables in C, i.e.
ψC(C) = P(C).
This will also apply for the separators, S.
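In standard junction-tree notation (the formulas themselves are not reproduced in this transcript), the representation described above is:

```latex
P(U) \;=\; \frac{\prod_{C}\,\psi_C(C)}{\prod_{S}\,\psi_S(S)},
\qquad\text{and after propagation}\qquad
\psi_C(C) = P(C), \quad \psi_S(S) = P(S).
```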
68
Triangulation
  • Given a numbered graph, proceed from node n
    down to node 1
  • Determine the lower-numbered nodes which are
    adjacent to the current node, including those
    which may have been made adjacent to this node
    earlier in this algorithm
  • Connect these nodes to each other.

69
Triangulation
  • Numbering the nodes
  • Arbitrarily number the nodes
  • Maximum cardinality search
  • Give any node a value of 1
  • For each subsequent number, pick a new
    unnumbered node that neighbors the most already
    numbered nodes
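A minimal Python sketch of maximum cardinality search as described above; the adjacency-dictionary graph is a hypothetical example and ties are broken arbitrarily.

```python
# Maximum cardinality search: number the nodes so that each newly numbered
# node has the most already-numbered neighbours.  `graph` maps a node to
# its set of neighbours.
def max_cardinality_order(graph):
    unnumbered = set(graph)
    order = []
    while unnumbered:
        # pick the unnumbered node with the most already-numbered neighbours
        best = max(unnumbered, key=lambda v: len(graph[v] & set(order)))
        order.append(best)
        unnumbered.remove(best)
    return order

# Hypothetical small moral graph.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
print(max_cardinality_order(graph))
```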

70
Triangulation
Moralized graph
BN
71
Triangulation
(Figure: the moralized graph with its nodes given an arbitrary numbering 1-8)
Arbitrary numbering
72
Triangulation
Maximum cardinality search
73
Course Contents
  • Concepts in Probability
  • Bayesian Networks
  • Inference
  • Decision making
  • Learning networks from data
  • Reasoning over time
  • Applications

74
Decision making
  • Decision - an irrevocable allocation of domain
    resources
  • Decision should be made so as to maximize
    expected utility.
  • View decision making in terms of
  • Beliefs/Uncertainties
  • Alternatives/Decisions
  • Objectives/Utilities

75
Course Contents
  • Concepts in Probability
  • Bayesian Networks
  • Inference
  • Decision making
  • Learning networks from data
  • Reasoning over time
  • Applications

76
Learning networks from data
  • The learning task
  • Parameter learning
  • Fully observable
  • Partially observable
  • Structure learning
  • Hidden variables

77
The learning task
B E A C N    (data cases over Burglary, Earthquake, Alarm, Call, Newscast)
...
Input: training data
  • Input: fully or partially observable data cases?
  • Output: parameters only, or also structure?

78
Parameter learning: one variable
  • Unfamiliar coin
  • Let θ = bias of coin (long-run fraction of heads)
  • If θ is known (given), then
  • P(X = heads | θ) = θ
  • Different coin tosses are independent given θ
  • P(X1, …, Xn | θ) = θ^h (1−θ)^t
    (for h heads and t tails)
79
Maximum likelihood
  • Input: a set of previous coin tosses
  • X1, …, Xn = H, T, H, H, H, T, T, H, . . ., H
  • Goal: estimate θ
  • The likelihood P(X1, …, Xn | θ) = θ^h (1−θ)^t
  • The maximum likelihood solution is θ* = h / (h + t)
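A tiny Python sketch of the maximum likelihood estimate θ* = h / (h + t); the toss sequence here is a hypothetical example, not the one listed on the slide.

```python
# Maximum likelihood estimate of the coin bias theta: theta_hat = h / (h + t).
# The toss sequence is a hypothetical example.
tosses = ["H", "T", "H", "H", "H", "T", "T", "H"]
h = tosses.count("H")
t = tosses.count("T")
theta_hat = h / (h + t)
print(theta_hat)   # 5 heads out of 8 tosses = 0.625
```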

80
Conditioning on data
P(θ | D) ∝ P(D | θ) P(θ) = θ^h (1−θ)^t P(θ)
81
Conditioning on data
82
General parameter learning
  • A multi-variable BN is composed of several
    independent parameters (coins).

Three parameters
  • Can use same techniques as one-variable case to
    learn each one separately

83
Partially observable data
Burglary, Earthquake → Alarm → Call;  Earthquake → Newscast
(Table: data cases over B E A C N in which some values are missing, marked "?")
...
  • Fill in missing data with expected value
  • expected distribution over possible values
  • use best guess BN to estimate distribution

84
Intuition
  • In the fully observable case the indicator counts I are observed directly.
  • In the partially observable case I is unknown.

Best estimate for I is its expected value given the data.
Problem: θ is unknown.
85
Expectation Maximization (EM)
  • Expectation (E) step
  • Use current parameters θ to estimate the filled-in
    data.
  • Maximization (M) step
  • Use the filled-in data to do maximum likelihood
    estimation of θ.
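A minimal sketch of these two steps for the single-coin case with some tosses unrecorded ("?"); the data and the starting value of θ are hypothetical. With no evidence about the missing tosses, EM converges to the estimate based on the observed tosses, which is enough to illustrate the mechanics of the E and M steps.

```python
# EM for a coin with missing tosses ('?').  Hypothetical data and start value.
data = ["H", "T", "?", "H", "?", "T", "H", "?"]
theta = 0.5                                    # initial guess for P(heads)

for _ in range(20):
    # E step: expected number of heads, counting each '?' as theta heads.
    expected_heads = sum(1.0 if x == "H" else theta if x == "?" else 0.0
                         for x in data)
    # M step: maximum likelihood re-estimate of theta from the filled-in counts.
    theta = expected_heads / len(data)

print(theta)   # converges to 3/5 = 0.6, the observed-data estimate
```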

86
Structure learning
Goal: find a good BN structure (relative to the
data).
Solution: do heuristic search over the space of
network structures.
87
Search space
Space: network structures.  Operators:
add / reverse / delete edges
88
Heuristic search
Use scoring function to do heuristic search (any
algorithm). Greedy hill-climbing with randomness
works pretty well.
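A skeleton of that greedy hill-climbing search in Python: repeatedly try adding, deleting, or reversing one edge, keep the acyclic neighbour that most improves the score, and stop at a local maximum. The `score(edges, data)` function is assumed to be supplied (e.g. a likelihood, MDL, or Bayesian score) and is not defined here.

```python
from itertools import permutations

# Greedy hill-climbing over network structures.  A structure is a set of
# directed edges (parent, child).  `score(edges, data)` is assumed given.

def neighbours(edges, nodes):
    """All structures reachable by adding, deleting, or reversing one edge."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                       # delete
            yield (edges - {(a, b)}) | {(b, a)}          # reverse
        else:
            yield edges | {(a, b)}                       # add

def is_acyclic(edges, nodes):
    """Cycle check: repeatedly remove nodes that have no remaining parents."""
    remaining, es = set(nodes), set(edges)
    while remaining:
        roots = {n for n in remaining if not any(c == n for _, c in es)}
        if not roots:
            return False
        remaining -= roots
        es = {(p, c) for p, c in es if p in remaining and c in remaining}
    return True

def hill_climb(nodes, data, score):
    current = frozenset()                                # start with no edges
    while True:
        candidates = [frozenset(e) for e in neighbours(current, nodes)
                      if is_acyclic(e, nodes)]
        best = max(candidates, key=lambda e: score(e, data), default=current)
        if score(best, data) <= score(current, data):
            return current                               # local maximum
        current = best

# With a trivial score that penalizes edges, the search stops at the empty graph.
print(hill_climb(["A", "B", "C"], None, lambda edges, data: -len(edges)))
```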
89
Scoring
  • Fill in parameters using previous techniques, and
    score the completed networks.
  • One possibility for the score:

the likelihood function Score(B) = P(data | B)
Example: X, Y independent coin tosses, typical
data (27 h-h, 22 h-t, 25 t-h, 26 t-t)
The max. likelihood network is typically fully connected.
This is not surprising: maximum likelihood always
overfits.
90
Better scoring functions
  • MDL formulation: balance fit to data and model
    complexity (number of parameters)

Score(B) = P(data | B) − model complexity
  • Full Bayesian formulation
  • prior on network structures and parameters
  • more parameters → higher dimensional space
  • the balancing effect is obtained as a byproduct

With a Dirichlet parameter prior, MDL is an
approximation to the full Bayesian score.
91
Hidden variables
  • There may be interesting variables that we never
    get to observe
  • topic of a document in information retrieval
  • user's current task in an online help system.
  • Our learning algorithm should
  • hypothesize the existence of such variables
  • learn an appropriate state space for them.

92
(Figure: evidence variables E1, E2, E3; randomly scattered data)
93
(Figure: evidence variables E1, E2, E3; actual data)
94
Bayesian clustering (Autoclass)
Class
naïve Bayes model
...
E1
E2
En
  • the (hypothetical) class variable is never observed
  • if we know that there are k classes, just run EM
  • learned classes = clusters
  • Bayesian analysis allows us to choose k, trading
    off fit to data against model complexity

95
(Figure: evidence variables E1, E2, E3; resulting cluster distributions)
96
Detecting hidden variables
  • Unexpected correlations ⇒ hidden variables.

97
Course Contents
  • Concepts in Probability
  • Bayesian Networks
  • Inference
  • Decision making
  • Learning networks from data
  • Reasoning over time
  • Applications

98
Reasoning over time
  • Dynamic Bayesian networks
  • Hidden Markov models
  • Decision-theoretic planning
  • Markov decision problems
  • Structured representation of actions
  • The qualification problem and the frame problem
  • Causality (and the frame problem revisited)

99
Dynamic environments
State(t)
  • Markov property
  • the past is independent of the future given the current state
  • a conditional independence assumption
  • implied by the fact that there are no arcs t → t+2.

100
Dynamic Bayesian networks
  • State described via random variables.

...
101
Hidden Markov model
  • An HMM is a simple model for a partially
    observable stochastic domain.

102
Hidden Markov model
Partially observable stochastic environment
  • Mobile robots
  • states: location
  • observations: sensor input
  • Speech recognition
  • states: phonemes
  • observations: acoustic signal
  • Biological sequencing
  • states: protein structure
  • observations: amino acids
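For the partially observable domains above, the standard computation is the HMM forward algorithm, which sums over all hidden state paths; here is a minimal Python sketch with hypothetical two-state numbers.

```python
# Forward algorithm for a 2-state HMM (hypothetical numbers).
# states: 0, 1;  observations: "a", "b"
initial = [0.6, 0.4]                                   # P(state_0)
transition = [[0.7, 0.3],                              # P(state_t | state_{t-1})
              [0.4, 0.6]]
emission = [{"a": 0.9, "b": 0.1},                      # P(obs | state)
            {"a": 0.2, "b": 0.8}]

def forward(observations):
    """Return P(observations) by summing over all hidden state paths."""
    alpha = [initial[s] * emission[s][observations[0]] for s in (0, 1)]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * transition[p][s] for p in (0, 1)) * emission[s][obs]
                 for s in (0, 1)]
    return sum(alpha)

print(forward(["a", "b", "b"]))
```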

103
Acting under uncertainty
Markov Decision Problem (MDP)
  • Overall utility = sum of momentary rewards.
  • Allows a rich preference model, e.g.

rewards corresponding to "get to goal as soon as possible"
104
Partially observable MDPs
  • The optimal action at time t depends on the
    entire history of previous observations.
  • Instead, a distribution over State(t) suffices.

105
Structured representation
  • Probabilistic action model
  • allows for exceptions and qualifications
  • persistence arcs: a solution to the frame
    problem

106
Applications
  • Medical expert systems
  • Pathfinder
  • Parenting MSN
  • Fault diagnosis
  • Ricoh FIXIT
  • Decision-theoretic troubleshooting
  • Vista
  • Collaborative filtering