Title: An introduction to machine learning and probabilistic graphical models
1An introduction to machine learning and
probabilistic graphical models
Presented at Intel's workshop on Machine learning for the life sciences, Berkeley, CA, 3 November 2003
2Overview
- Supervised learning
- Unsupervised learning
- Graphical models
- Learning relational models
Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
3Supervised learning
(Figure: example objects labeled yes / no)
Color Shape Size Output
Blue Torus Big Y
Blue Square Small Y
Blue Star Small Y
Red Arrow Small N
4Supervised learning
Training data:
X1 X2 X3 T
B  T  B  Y
B  S  S  Y
B  S  S  Y
R  A  S  N
Testing data:
X1 X2 X3 T
B  A  S  ?
Y  C  S  ?
(Diagram: training data -> Learner -> Hypothesis -> prediction of T for the testing data)
5Key issue: generalization
(Figure: labeled training examples (yes / no) and new query points marked "?")
Can't just memorize the training set (overfitting)
6Hypothesis spaces
- Decision trees
- Neural networks
- K-nearest neighbors
- Naïve Bayes classifier
- Support vector machines (SVMs)
- Boosted decision stumps
7Perceptron (neural net with no hidden layers)
Linearly separable data
8Which separating hyperplane?
9The linear separator with the largest margin is
the best one to pick
10What if the data is not linearly separable?
11Kernel trick
Kernel implicitly maps from 2D to 3D, making the problem linearly separable
12Support Vector Machines (SVMs)
- Two key ideas
- Large margins
- Kernel trick
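To make these two ideas concrete, here is a minimal sketch (not from the talk) using scikit-learn's SVC on synthetic 2-D data that is not linearly separable; the RBF kernel and the parameter values are illustrative choices.

```python
# SVM sketch: large-margin classifier made non-linear via the kernel trick.
# Synthetic (illustrative) two-class data: a central blob inside a ring.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.5, size=(100, 2))                   # class 0: central blob
theta = rng.uniform(0, 2 * np.pi, size=100)
ring = np.c_[2 * np.cos(theta), 2 * np.sin(theta)] + rng.normal(0, 0.2, size=(100, 2))
X = np.vstack([inner, ring])                                   # class 1: surrounding ring
y = np.r_[np.zeros(100), np.ones(100)]

# The RBF kernel implicitly maps the 2-D points to a higher-dimensional space
# in which a large-margin linear separator exists.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```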
13Boosting
Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations
Boosting maximizes the margin
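As an illustration (not part of the original slides), boosted decision stumps can be built with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree; the dataset and settings below are arbitrary.

```python
# Boosting sketch: a weighted combination of decision stumps (illustrative data).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each boosting round reweights the training examples to focus on past
# mistakes; the final classifier is a weighted vote over all the stumps.
boosted = AdaBoostClassifier(n_estimators=100).fit(X, y)
print("training accuracy:", boosted.score(X, y))
```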
14Supervised learning success stories
- Face detection
- Steering an autonomous car across the US
- Detecting credit card fraud
- Medical diagnosis
15Unsupervised learning
- What if there are no output labels?
16K-means clustering
- Guess number of clusters, K
- Guess initial cluster centers, μ1, μ2
- Assign data points xi to nearest cluster center
- Re-compute cluster centers based on assignments, and repeat (a minimal sketch follows below)
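A minimal NumPy sketch of these steps (illustrative only; K and the initial centers are guesses, as the slide says):

```python
# K-means sketch: alternate between assigning points to the nearest center
# and recomputing each center as the mean of its assigned points.
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # guess initial centers
    for _ in range(n_iters):
        # Assign each data point x_i to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Re-compute the cluster centers from the assignments.
        centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else centers[k] for k in range(K)])
    return centers, assign

# Illustrative data: two well-separated blobs.
X = np.random.default_rng(1).normal(size=(200, 2))
X[:100] += 4.0
centers, assign = kmeans(X, K=2)
print(centers)
```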
17AutoClass (Cheeseman et al, 1986)
- EM algorithm for mixtures of Gaussians
- Soft version of K-means
- Uses Bayesian criterion to select K
- Discovered new types of stars from spectral data
- Discovered new classes of proteins and introns
from DNA/protein sequence databases
18Hierarchical clustering
19Principal Component Analysis (PCA)
- PCA seeks a projection that best represents the
data in a least-squares sense.
PCA reduces the dimensionality of feature space
by restricting attention to those directions
along which the scatter of the cloud is greatest.
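A small sketch of PCA via the singular value decomposition of the centered data (an illustrative implementation, not prescribed by the talk):

```python
# PCA sketch: project onto the directions along which the scatter is greatest.
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                       # center the data cloud
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components          # projected data, directions

# Illustrative data: an elongated 2-D cloud.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, dirs = pca(X, n_components=1)
print("principal direction:", dirs[0])
```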
20Discovering nonlinear manifolds
21Combining supervised and unsupervised learning
22Discovering rules (data mining)
Occup. Income Educ. Sex Married Age
Student 10k MA M S 22
Student 20k PhD F S 24
Doctor 80k MD M M 30
Retired 30k HS F M 60
Find the most frequent patterns (association rules):
Num in household = 1 & num children = 0 => language = English
Language = English & Income < 40k & Married = false & num children = 0 => education in {college, grad school}
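A toy sketch of the underlying idea, counting how often attribute=value conditions co-occur (hypothetical records and threshold; real association-rule miners such as Apriori add support/confidence pruning):

```python
# Frequent-pattern sketch: count co-occurring attribute=value pairs
# over a few hypothetical records.
from collections import Counter
from itertools import combinations

records = [
    {"occup": "student", "income": "10k", "married": "S"},
    {"occup": "student", "income": "20k", "married": "S"},
    {"occup": "doctor",  "income": "80k", "married": "M"},
    {"occup": "retired", "income": "30k", "married": "M"},
]

counts = Counter()
for rec in records:
    items = sorted(f"{k}={v}" for k, v in rec.items())
    for pair in combinations(items, 2):           # patterns of size 2
        counts[pair] += 1

min_support = 2                                   # keep patterns seen at least twice
for pattern, n in counts.items():
    if n >= min_support:
        print(pattern, n)
```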
23Unsupervised learning summary
- Clustering
- Hierarchical clustering
- Linear dimensionality reduction (PCA)
- Non-linear dim. reduction
- Learning rules
24Discovering networks
From data visualization to causal discovery
25Networks in biology
- Most processes in the cell are controlled by networks of interacting molecules:
- Metabolic networks
- Signal transduction networks
- Regulatory networks
- Networks can be modeled at multiple levels of detail / realism:
- Molecular level
- Concentration level
- Qualitative level
(decreasing detail)
26Molecular level: Lysis-Lysogeny circuit in Lambda phage
Arkin et al. (1998), Genetics 149(4):1633-48
5 genes, 67 parameters based on 50 years of research. Stochastic simulation required a supercomputer.
27Concentration level: metabolic pathways
- Usually modeled with differential equations
28Qualitative level: Boolean Networks
29Probabilistic graphical models
- Supports graph-based modeling at various levels of detail
- Models can be learned from noisy, partial data
- Can model inherently stochastic phenomena, e.g., molecular-level fluctuations
- But can also model deterministic, causal processes.
"The actual science of logic is conversant at
present only with things either certain,
impossible, or entirely doubtful. Therefore the
true logic for this world is the calculus of
probabilities." -- James Clerk Maxwell
"Probability theory is nothing but common sense
reduced to calculation." -- Pierre Simon Laplace
30Graphical models outline
- What are graphical models?
- Inference
- Structure learning
31Simple probabilistic model: linear regression
Deterministic (functional) relationship: Y = α + βX + noise
(Figure: scatter plot of (X, Y) data with a fitted line)
32Simple probabilistic model: linear regression
Deterministic (functional) relationship: Y = α + βX + noise
Learning = estimating the parameters α, β, σ from (x, y) pairs:
- α, β can be estimated by least squares (the fitted line passes through the empirical mean of the data)
- σ² is the residual variance
(Figure: the same scatter plot, with the fitted line and residuals)
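A minimal sketch of the least-squares estimate with NumPy (synthetic data; the "true" parameter values below exist only to generate the example):

```python
# Least-squares sketch for Y = alpha + beta * X + noise (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=100)    # alpha=2, beta=0.5, sigma=1

A = np.c_[np.ones_like(x), x]                        # design matrix [1, x]
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (alpha_hat + beta_hat * x)
sigma_hat = residuals.std(ddof=2)                    # residual standard deviation

print(alpha_hat, beta_hat, sigma_hat)
```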
33Piecewise linear regression
Latent switch variable: a hidden process at work
34Probabilistic graphical model for piecewise linear regression
- Hidden variable Q chooses which set of parameters to use for predicting the output Y.
- The value of Q depends on the value of the input X.
- This is an example of mixtures of experts.
Learning is harder because Q is hidden, so we don't know which data points to assign to each line; this can be solved with EM (c.f. K-means).
35Classes of graphical models
(Diagram: graphical models are a subset of probabilistic models; directed graphical models include Bayes nets and DBNs, undirected ones include MRFs)
36Bayesian Networks
Compact representation of probability
distributions via conditional independence
- Qualitative part
- Directed acyclic graph (DAG)
- Nodes - random variables
- Edges - direct influence
(Example DAG: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call)
- Quantitative part
- Set of conditional probability distributions (one per node, given its parents)
Together they define a unique distribution in a factored form
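For this example, assuming the usual textbook edges (Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call), the factored form is

P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)

so the joint over five variables is specified by five small conditional distributions rather than one large table.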
37Example: ICU Alarm network
- Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters
- instead of ~2^54 for the full joint distribution
38Success stories for graphical models
- Multiple sequence alignment
- Forensic analysis
- Medical and fault diagnosis
- Speech recognition
- Visual tracking
- Channel coding at Shannon limit
- Genetic pedigree analysis
39Graphical models outline
- What are graphical models? ✓
- Inference
- Structure learning
40Probabilistic Inference
- Posterior probabilities
- Probability of any event given any evidence
- P(X | E)
(Figure: alarm network with evidence nodes, e.g., Radio and Call)
41Viterbi decoding
Compute most probable explanation (MPE) of
observed data
Hidden Markov Model (HMM)
(Figure: HMM with hidden states X1, X2, X3 and observed outputs Y1, Y2, Y3; speech example: "Tomato")
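A compact sketch of Viterbi decoding for a small discrete HMM (illustrative parameters, not the speech model pictured on the slide):

```python
# Viterbi sketch: most probable hidden state sequence X1..XT given Y1..YT.
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial, A: transition, B: emission probs."""
    T, n_states = len(obs), len(pi)
    delta = np.zeros((T, n_states))            # best log-prob of a path ending in each state
    psi = np.zeros((T, n_states), dtype=int)   # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)     # scores[i, j]: reach state j via i
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrack the most probable explanation (MPE).
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Illustrative 2-state, 2-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])         # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])         # emission matrix
print(viterbi([0, 1, 1, 0], pi, A, B))
```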
42Inference computational issues
Easy
Hard
Dense, loopy graphs
Chains
Trees
Grids
43Inference computational issues
Easy
Hard
Dense, loopy graphs
Chains
Trees
Grids
Many different inference algorithms, both exact and approximate
44Bayesian inference
- Bayesian probability treats parameters as random variables
- Learning / parameter estimation is replaced by probabilistic inference of P(θ | D)
- Example: Bayesian linear regression; the parameters are θ = (α, β, σ)
Parameters are tied (shared) across repetitions of the data
(Figure: model with shared parameter θ and repeated data pairs X1,Y1 ... Xn,Yn)
45Bayesian inference
- Elegant: no distinction between parameters and other hidden variables
- Can use priors to learn from small data sets (c.f., one-shot learning by humans)
- - Math can get hairy
- - Often computationally intractable
46Graphical models outline
- What are graphical models? ✓
- Inference ✓
- Structure learning
47Why Struggle for Accurate Structure?
Missing an arc:
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure
Adding an arc:
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure
48Score-based Learning
Define scoring function that evaluates how well a
structure matches the data
(Figure: candidate structures over E, B, A, each assigned a score)
Search for a structure that maximizes the score
49Learning Trees
- Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (a sketch follows below)
- If some of the variables are hidden, the problem becomes hard again, but we can use EM to fit mixtures of trees
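The sketch below illustrates the max-weight spanning tree idea (the Chow-Liu algorithm) with empirical pairwise mutual information as edge weights; the synthetic binary data and the use of scikit-learn's mutual_info_score are illustrative assumptions, not part of the talk.

```python
# Tree-structure learning sketch: weight every pair of variables by mutual
# information, then keep a maximum-weight spanning tree over those weights.
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

# Illustrative data: 500 samples of 4 binary variables forming a chain.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))
X[:, 1] = X[:, 0] ^ (rng.random(500) < 0.1)        # variable 1 depends on variable 0
X[:, 2] = X[:, 1] ^ (rng.random(500) < 0.1)        # variable 2 depends on variable 1

n = X.shape[1]
W = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    W[i, j] = W[j, i] = mutual_info_score(X[:, i], X[:, j])

# Greedy (Prim-style) construction of the maximum-weight spanning tree.
in_tree, edges = {0}, []
while len(in_tree) < n:
    i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
               key=lambda e: W[e])
    edges.append((i, j))
    in_tree.add(j)
print("tree edges:", edges)
```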
50Heuristic Search
- Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search
- Define a search space:
- search states are possible structures
- operators make small changes to structure
- Traverse the space looking for high-scoring structures
- Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...
51Local Search Operations
Add C → D
Δscore = S(C,E → D) - S(E → D)
Reverse C → E
Delete C → E
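These local moves are cheap to evaluate because typical structure scores (e.g., BIC/BDe) decompose over families, so only the terms of the nodes whose parent sets changed need to be recomputed. In the slide's notation (a standard identity, not specific to this talk):

score(G ; D) = Σ_i S(Pa(Xi) → Xi)
Δscore(add C → D) = S(C, E → D) - S(E → D)

i.e., adding the arc C → D only changes the local score of node D.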
52Problems with local search
Easy to get stuck in local optima
(Figure: score landscape S(G | D) with a local optimum ("you") far from the global optimum ("truth"))
53Problems with local search II
Picking a single best model can be misleading
54Problems with local search II
Picking a single best model can be misleading
- Small sample size ⇒ many high-scoring models
- Answer based on one model often useless
- Want features common to many models
55Bayesian Approach to Structure Learning
- Posterior distribution over structures
- Estimate probability of features
- Edge X → Y
- Path X → ... → Y
- ...
P(f | D) = Σ_G f(G) P(G | D), where P(G | D) is the Bayesian score for G and f(G) is the indicator function for the feature f (e.g., the edge X → Y)
56Bayesian approach computational issues
- Posterior distribution over structures
How can we compute the sum over a super-exponential number of graphs?
- MCMC over networks
- MCMC over node-orderings (Rao-Blackwellisation)
57Structure learning other issues
- Discovering latent variables
- Learning causal models
- Learning from interventional data
- Active learning
58Discovering latent variables
a) 17 parameters
b) 59 parameters
There are some techniques for automatically detecting the possible presence of latent variables
59Learning causal models
- So far, we have only assumed that X → Y → Z means that Z is independent of X given Y.
- However, we often want to interpret directed arrows causally.
- This is uncontroversial for the arrow of time.
- But can we infer causality from static observational data?
60Learning causal models
- We can infer causality from static observational data if we have at least four measured variables and certain tetrad conditions hold.
- See the books by Pearl and by Spirtes et al.
- However, we can only learn up to Markov equivalence, no matter how much data we have.
(Figure: several structures over X, Y, Z, illustrating Markov equivalence)
61Learning from interventional data
- The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
- We need to (slightly) modify our learning algorithms.
(Figure: Smoking → Yellow fingers; cut the arcs coming into nodes which were set by intervention)
P(smoker | do(paint yellow)) = prior
P(smoker | observe(yellow)) >> prior
62Active learning
- Which experiments (interventions) should we perform to learn structure as efficiently as possible?
- This problem can be modeled using decision theory.
- Exact solutions are wildly computationally intractable.
- Can we come up with good approximate decision-making techniques?
- Can we implement hardware to automatically perform the experiments?
- AB = Automated Biologist
63Learning from relational data
Can we learn concepts from a set of relations between objects, instead of / in addition to just their attributes?
64Learning from relational data approaches
- Probabilistic relational models (PRMs)
- Reify a relationship (arcs) between nodes (objects) by making it into a node (hypergraph)
- Inductive Logic Programming (ILP)
- Top-down, e.g., FOIL (a generalization of C4.5)
- Bottom-up, e.g., PROGOL (inverse deduction)
65ILP for learning protein folding: input
(Figure: positive ("yes") and negative ("no") example protein structures)
TotalLength(D2mhr, 118), NumberHelices(D2mhr, 6), ...
100 conjuncts describing the structure of each pos/neg example
66ILP for learning protein folding: results
- PROGOL learned the following rule to predict if a protein will form a four-helical up-and-down bundle.
- In English: the protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3, and h1 is next to a second helix.
67ILP: Pros and Cons
- Can discover new predicates (concepts) automatically
- Can learn relational models from relational (or flat) data
- - Computationally intractable
- - Poor handling of noise
68The future of machine learning for bioinformatics?
(Figure: the learner as an oracle)
69The future of machine learning for bioinformatics
(Diagram: the learner combines prior knowledge, biological literature, and replicated experiments to propose hypotheses and experiment designs, which are tested against the real world)
- Computer-assisted pathway refinement
70The end
71Decision trees
(Figure: decision tree with tests "blue?", "oval?", "big?" at internal nodes and yes/no leaves)
72Decision trees
(Figure: the same decision tree)
Handles mixed variables
Handles missing data
Efficient for large data sets
Handles irrelevant attributes
Easy to understand
- Predictive power
73Feedforward neural network
(Figure: network with an input layer, a hidden layer, and an output; a sigmoid function at each node and a weight on each arc)
74Feedforward neural network
(Figure: the same feedforward network)
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predicts poorly
75Nearest Neighbor
- Remember all your data
- When someone asks a question,
- find the nearest old data point
- return the answer associated with it
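A tiny NumPy sketch of this procedure (illustrative data):

```python
# Nearest-neighbor sketch: remember the data; answer queries with the
# label of the closest stored point.
import numpy as np

def nearest_neighbor(X_train, y_train, query):
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every old point
    return y_train[dists.argmin()]                    # answer of the nearest one

# Illustrative training set with two labeled clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["no", "no", "yes", "yes"])
print(nearest_neighbor(X_train, y_train, np.array([4.8, 5.1])))   # -> "yes"
```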
76Nearest Neighbor
(Figure: query point "?" among labeled training points)
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predictive power
77Support Vector Machines (SVMs)
- Two key ideas
- Large margins are good
- Kernel trick
78SVM mathematical details
- Training data: l-dimensional vectors, each with a flag of true or false
79Replace all inner products with kernels
Kernel function
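The general form of the substitution (the standard kernelized decision rule, stated here for completeness rather than taken from the slide):

K(xi, xj) = φ(xi) · φ(xj)
f(x) = sign( Σ_i αi yi K(xi, x) + b )

so the classifier never needs the high-dimensional feature map φ explicitly, only the inner products supplied by the kernel.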
80SVMs summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predictive power
General lessons from SVM success:
- The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
- Large margin classifiers are good
81Boosting summary
- Can boost any weak learner
- Most commonly: boosted decision stumps
Handles mixed variables
Handles missing data
Efficient for large data sets
Handles irrelevant attributes
- Easy to understand
Predictive power
82Supervised learning summary
- Learn a mapping F from inputs to outputs using a training set of (x, t) pairs
- F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
- Algorithms offer a variety of tradeoffs
- Many good books, e.g.,
- The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2001
- Pattern Classification, Duda, Hart, Stork, 2001
83Inference
- Posterior probabilities
- Probability of any event given any evidence
- Most likely explanation
- Scenario that explains evidence
- Rational decision making
- Maximize expected utility
- Value of Information
- Effect of intervention
(Figure: alarm network with evidence nodes, e.g., Radio and Call)
84Assumption needed to make learning work
- We need to assume "future futures will resemble past futures" (B. Russell)
- Unlearnable hypothesis: "All emeralds are grue", where grue means green if observed before time t, blue afterwards.
85Structure learning success stories: gene regulation network (Friedman et al.)
- Yeast data [Hughes et al., 2000]
- 600 genes
- 300 experiments
86Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)
- Input: biological sequences
- Human CGTTGC
- Chimp CCTAGG
- Orang CGAACG
- ...
- Output: a phylogeny
Uses structural EM, with a max-spanning-tree computation in the inner loop
(Figure: reconstructed phylogeny)
87Instances of graphical models
(Diagram: graphical models are a subset of probabilistic models; directed examples include Bayes nets (e.g., the naïve Bayes classifier, mixtures of experts) and DBNs (e.g., HMMs, the Kalman filter model); undirected examples (MRFs) include the Ising model)
88ML enabling technologies
- Faster computers
- More data
- The web
- Parallel corpora (machine translation)
- Multiple sequenced genomes
- Gene expression arrays
- New ideas
- Kernel trick
- Large margins
- Boosting
- Graphical models