Title: An introduction to machine learning and probabilistic graphical models
1An introduction to machine learning and
probabilistic graphical models
Presented at Intel's workshop on Machine learning for the life sciences, Berkeley, CA, 3 November 2003
2Overview
- Supervised learning
- Unsupervised learning
- Graphical models
- Learning relational models
Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
3Supervised learning
(Figure: example objects labeled yes / no)
Color Shape Size Output
Blue Torus Big Y
Blue Square Small Y
Blue Star Small Y
Red Arrow Small N
4Supervised learning
Training data:
X1 X2 X3 T
B  T  B  Y
B  S  S  Y
B  S  S  Y
R  A  S  N
Testing data:
X1 X2 X3 T
B  A  S  ?
Y  C  S  ?
(Diagram: training data -> Learner -> Hypothesis -> prediction of T for the testing data)
5Key issue: generalization
(Figure: labeled training examples (yes / no) and new query points marked "?")
Can't just memorize the training set (overfitting)
6Hypothesis spaces
- Decision trees
- Neural networks
- K-nearest neighbors
- Naïve Bayes classifier
- Support vector machines (SVMs)
- Boosted decision stumps
7Perceptron (neural net with no hidden layers)
Linearly separable data
8Which separating hyperplane?
9The linear separator with the largest margin is
the best one to pick
10What if the data is not linearly separable?
11Kernel trick
Kernel implicitly maps from 2D to 3D, making the problem linearly separable
12Support Vector Machines (SVMs)
- Two key ideas
- Large margins
- Kernel trick
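To make these two ideas concrete, here is a minimal sketch (not from the talk) using scikit-learn's SVC on synthetic 2-D data that is not linearly separable; the RBF kernel and the parameter values are illustrative choices.

```python
# SVM sketch: large-margin classifier made non-linear via the kernel trick.
# Synthetic (illustrative) two-class data: a central blob inside a ring.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.5, size=(100, 2))                   # class 0: central blob
theta = rng.uniform(0, 2 * np.pi, size=100)
ring = np.c_[2 * np.cos(theta), 2 * np.sin(theta)] + rng.normal(0, 0.2, size=(100, 2))
X = np.vstack([inner, ring])                                   # class 1: surrounding ring
y = np.r_[np.zeros(100), np.ones(100)]

# The RBF kernel implicitly maps the 2-D points to a higher-dimensional space
# in which a large-margin linear separator exists.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```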
13Boosting
Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations
Boosting maximizes the margin
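As an illustration (not part of the original slides), boosted decision stumps can be built with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree; the dataset and settings below are arbitrary.

```python
# Boosting sketch: a weighted combination of decision stumps (illustrative data).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each boosting round reweights the training examples to focus on past
# mistakes; the final classifier is a weighted vote over all the stumps.
boosted = AdaBoostClassifier(n_estimators=100).fit(X, y)
print("training accuracy:", boosted.score(X, y))
```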
14Supervised learning success stories
- Face detection
- Steering an autonomous car across the US
- Detecting credit card fraud
- Medical diagnosis
15Unsupervised learning
- What if there are no output labels?
16K-means clustering
- Guess number of clusters, K
- Guess initial cluster centers, μ1, μ2
- Assign data points xi to nearest cluster center
- Re-compute cluster centers based on assignments, and repeat (a minimal sketch follows below)
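A minimal NumPy sketch of these steps (illustrative only; K and the initial centers are guesses, as the slide says):

```python
# K-means sketch: alternate between assigning points to the nearest center
# and recomputing each center as the mean of its assigned points.
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # guess initial centers
    for _ in range(n_iters):
        # Assign each data point x_i to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Re-compute the cluster centers from the assignments.
        centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else centers[k] for k in range(K)])
    return centers, assign

# Illustrative data: two well-separated blobs.
X = np.random.default_rng(1).normal(size=(200, 2))
X[:100] += 4.0
centers, assign = kmeans(X, K=2)
print(centers)
```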
17AutoClass (Cheeseman et al, 1986)
- EM algorithm for mixtures of Gaussians
- Soft version of K-means
- Uses Bayesian criterion to select K
- Discovered new types of stars from spectral data
- Discovered new classes of proteins and introns
from DNA/protein sequence databases
18Hierarchical clustering
19Principal Component Analysis (PCA)
- PCA seeks a projection that best represents the
data in a least-squares sense.
PCA reduces the dimensionality of feature space
by restricting attention to those directions
along which the scatter of the cloud is greatest.
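A small sketch of PCA via the singular value decomposition of the centered data (an illustrative implementation, not prescribed by the talk):

```python
# PCA sketch: project onto the directions along which the scatter is greatest.
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                       # center the data cloud
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components          # projected data, directions

# Illustrative data: an elongated 2-D cloud.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, dirs = pca(X, n_components=1)
print("principal direction:", dirs[0])
```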
20Discovering nonlinear manifolds
21Combining supervised and unsupervised learning
22Discovering rules (data mining)
Occup. Income Educ. Sex Married Age
Student 10k MA M S 22
Student 20k PhD F S 24
Doctor 80k MD M M 30
Retired 30k HS F M 60
Find the most frequent patterns (association rules):
Num in household = 1 & num children = 0 => language = English
Language = English & Income < 40k & Married = false & num children = 0 => education in {college, grad school}
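A toy sketch of the underlying idea, counting how often attribute=value conditions co-occur (hypothetical records and threshold; real association-rule miners such as Apriori add support/confidence pruning):

```python
# Frequent-pattern sketch: count co-occurring attribute=value pairs
# over a few hypothetical records.
from collections import Counter
from itertools import combinations

records = [
    {"occup": "student", "income": "10k", "married": "S"},
    {"occup": "student", "income": "20k", "married": "S"},
    {"occup": "doctor",  "income": "80k", "married": "M"},
    {"occup": "retired", "income": "30k", "married": "M"},
]

counts = Counter()
for rec in records:
    items = sorted(f"{k}={v}" for k, v in rec.items())
    for pair in combinations(items, 2):           # patterns of size 2
        counts[pair] += 1

min_support = 2                                   # keep patterns seen at least twice
for pattern, n in counts.items():
    if n >= min_support:
        print(pattern, n)
```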
23Unsupervised learning summary
- Clustering
- Hierarchical clustering
- Linear dimensionality reduction (PCA)
- Non-linear dim. reduction
- Learning rules
24Discovering networks
From data visualization to causal discovery
25Networks in biology
- Most processes in the cell are controlled by networks of interacting molecules:
- Metabolic networks
- Signal transduction networks
- Regulatory networks
- Networks can be modeled at multiple levels of detail / realism:
- Molecular level
- Concentration level
- Qualitative level
(decreasing detail)
26Molecular level: Lysis-Lysogeny circuit in Lambda phage
Arkin et al. (1998), Genetics 149(4):1633-48
5 genes, 67 parameters based on 50 years of research. Stochastic simulation required a supercomputer.
27Concentration level: metabolic pathways
- Usually modeled with differential equations
28Qualitative level: Boolean Networks
29Probabilistic graphical models
- Supports graph-based modeling at various levels of detail
- Models can be learned from noisy, partial data
- Can model inherently stochastic phenomena, e.g., molecular-level fluctuations
- But can also model deterministic, causal processes.
"The actual science of logic is conversant at
present only with things either certain,
impossible, or entirely doubtful. Therefore the
true logic for this world is the calculus of
probabilities." -- James Clerk Maxwell
"Probability theory is nothing but common sense
reduced to calculation." -- Pierre Simon Laplace
30Graphical models outline
- What are graphical models?
- Inference
- Structure learning
31Simple probabilistic model: linear regression
Deterministic (functional) relationship: Y = α + βX + noise
(Figure: scatter plot of (X, Y) data with a fitted line)
32Simple probabilistic model: linear regression
Deterministic (functional) relationship: Y = α + βX + noise
Learning = estimating the parameters α, β, σ from (x, y) pairs:
- α, β can be estimated by least squares (the fitted line passes through the empirical mean of the data)
- σ² is the residual variance
(Figure: the same scatter plot, with the fitted line and residuals)
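A minimal sketch of the least-squares estimate with NumPy (synthetic data; the "true" parameter values below exist only to generate the example):

```python
# Least-squares sketch for Y = alpha + beta * X + noise (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=100)    # alpha=2, beta=0.5, sigma=1

A = np.c_[np.ones_like(x), x]                        # design matrix [1, x]
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (alpha_hat + beta_hat * x)
sigma_hat = residuals.std(ddof=2)                    # residual standard deviation

print(alpha_hat, beta_hat, sigma_hat)
```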
33Piecewise linear regression
Latent switch variable: a hidden process at work
34Probabilistic graphical model for piecewise linear regression
- Hidden variable Q chooses which set of parameters to use for predicting the output Y.
- The value of Q depends on the value of the input X.
- This is an example of mixtures of experts.
Learning is harder because Q is hidden, so we don't know which data points to assign to each line; this can be solved with EM (c.f. K-means).
35Classes of graphical models
(Diagram: graphical models are a subset of probabilistic models; directed graphical models include Bayes nets and DBNs, undirected ones include MRFs)
36Bayesian Networks
Compact representation of probability
distributions via conditional independence
- Qualitative part
- Directed acyclic graph (DAG)
- Nodes - random variables
- Edges - direct influence
(Example DAG: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call)
- Quantitative part
- Set of conditional probability distributions (one per node, given its parents)
Together they define a unique distribution in a factored form
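For this example, assuming the usual textbook edges (Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call), the factored form is

P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)

so the joint over five variables is specified by five small conditional distributions rather than one large table.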
37Example: ICU Alarm network
- Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters
- instead of ~2^54 for the full joint distribution
38Success stories for graphical models
- Multiple sequence alignment
- Forensic analysis
- Medical and fault diagnosis
- Speech recognition
- Visual tracking
- Channel coding at Shannon limit
- Genetic pedigree analysis
39Graphical models outline
- What are graphical models? ✓
- Inference
- Structure learning
40Probabilistic Inference
- Posterior probabilities
- Probability of any event given any evidence
- P(X | E)
(Figure: alarm network with evidence nodes, e.g., Radio and Call)
41Viterbi decoding
Compute most probable explanation (MPE) of
observed data
Hidden Markov Model (HMM)
(Figure: HMM with hidden states X1, X2, X3 and observed outputs Y1, Y2, Y3; speech example: "Tomato")
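A compact sketch of Viterbi decoding for a small discrete HMM (illustrative parameters, not the speech model pictured on the slide):

```python
# Viterbi sketch: most probable hidden state sequence X1..XT given Y1..YT.
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial, A: transition, B: emission probs."""
    T, n_states = len(obs), len(pi)
    delta = np.zeros((T, n_states))            # best log-prob of a path ending in each state
    psi = np.zeros((T, n_states), dtype=int)   # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)     # scores[i, j]: reach state j via i
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrack the most probable explanation (MPE).
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Illustrative 2-state, 2-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])         # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])         # emission matrix
print(viterbi([0, 1, 1, 0], pi, A, B))
```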
42Inference computational issues
Easy
Hard
Dense, loopy graphs
Chains
Trees
Grids
43Inference computational issues
Easy
Hard
Dense, loopy graphs
Chains
Trees
Grids
Many different inference algorithms, both exact and approximate
44Bayesian inference
- Bayesian probability treats parameters as random variables
- Learning / parameter estimation is replaced by probabilistic inference of P(θ | D)
- Example: Bayesian linear regression; the parameters are θ = (α, β, σ)
Parameters are tied (shared) across repetitions of the data
(Figure: model with shared parameter θ and repeated data pairs X1,Y1 ... Xn,Yn)
45Bayesian inference
- Elegant: no distinction between parameters and other hidden variables
- Can use priors to learn from small data sets (c.f., one-shot learning by humans)
- - Math can get hairy
- - Often computationally intractable
46Graphical models outline
- What are graphical models? ✓
- Inference ✓
- Structure learning
47Why Struggle for Accurate Structure?
Missing an arc:
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure
Adding an arc:
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure
48Score-based Learning
Define scoring function that evaluates how well a
structure matches the data
(Figure: candidate structures over E, B, A, each assigned a score)
Search for a structure that maximizes the score
49Learning Trees
- Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (a sketch follows below)
- If some of the variables are hidden, the problem becomes hard again, but we can use EM to fit mixtures of trees
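The sketch below illustrates the max-weight spanning tree idea (the Chow-Liu algorithm) with empirical pairwise mutual information as edge weights; the synthetic binary data and the use of scikit-learn's mutual_info_score are illustrative assumptions, not part of the talk.

```python
# Tree-structure learning sketch: weight every pair of variables by mutual
# information, then keep a maximum-weight spanning tree over those weights.
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

# Illustrative data: 500 samples of 4 binary variables forming a chain.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))
X[:, 1] = X[:, 0] ^ (rng.random(500) < 0.1)        # variable 1 depends on variable 0
X[:, 2] = X[:, 1] ^ (rng.random(500) < 0.1)        # variable 2 depends on variable 1

n = X.shape[1]
W = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    W[i, j] = W[j, i] = mutual_info_score(X[:, i], X[:, j])

# Greedy (Prim-style) construction of the maximum-weight spanning tree.
in_tree, edges = {0}, []
while len(in_tree) < n:
    i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
               key=lambda e: W[e])
    edges.append((i, j))
    in_tree.add(j)
print("tree edges:", edges)
```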
50Heuristic Search
- Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search
- Define a search space:
- search states are possible structures
- operators make small changes to structure
- Traverse the space looking for high-scoring structures
- Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...
51Local Search Operations
Add C → D
Δscore = S(C,E → D) - S(E → D)
Reverse C → E
Delete C → E
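These local moves are cheap to evaluate because typical structure scores (e.g., BIC/BDe) decompose over families, so only the terms of the nodes whose parent sets changed need to be recomputed. In the slide's notation (a standard identity, not specific to this talk):

score(G ; D) = Σ_i S(Pa(Xi) → Xi)
Δscore(add C → D) = S(C, E → D) - S(E → D)

i.e., adding the arc C → D only changes the local score of node D.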
52Problems with local search
Easy to get stuck in local optima
(Figure: score landscape S(G | D) with a local optimum ("you") far from the global optimum ("truth"))
53Problems with local search II
Picking a single best model can be misleading
54Problems with local search II
Picking a single best model can be misleading
- Small sample size ⇒ many high-scoring models
- Answer based on one model often useless
- Want features common to many models
55Bayesian Approach to Structure Learning
- Posterior distribution over structures
- Estimate probability of features
- Edge X → Y
- Path X → ... → Y
- ...
P(f | D) = Σ_G f(G) P(G | D), where P(G | D) is the Bayesian score for G and f(G) is the indicator function for the feature f (e.g., the edge X → Y)
56Bayesian approach computational issues
- Posterior distribution over structures
How can we compute the sum over a super-exponential number of graphs?
- MCMC over networks
- MCMC over node-orderings (Rao-Blackwellisation)
57Structure learning other issues
- Discovering latent variables
- Learning causal models
- Learning from interventional data
- Active learning
58Discovering latent variables
a) 17 parameters
b) 59 parameters
There are some techniques for automatically detecting the possible presence of latent variables
59Learning causal models
- So far, we have only assumed that X → Y → Z means that Z is independent of X given Y.
- However, we often want to interpret directed arrows causally.
- This is uncontroversial for the arrow of time.
- But can we infer causality from static observational data?
60Learning causal models
- We can infer causality from static observational data if we have at least four measured variables and certain tetrad conditions hold.
- See the books by Pearl and by Spirtes et al.
- However, we can only learn up to Markov equivalence, no matter how much data we have.
(Figure: several structures over X, Y, Z, illustrating Markov equivalence)
61Learning from interventional data
- The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
- We need to (slightly) modify our learning algorithms.
(Figure: Smoking → Yellow fingers; cut the arcs coming into nodes which were set by intervention)
P(smoker | do(paint yellow)) = prior
P(smoker | observe(yellow)) >> prior
62Active learning
- Which experiments (interventions) should we perform to learn structure as efficiently as possible?
- This problem can be modeled using decision theory.
- Exact solutions are wildly computationally intractable.
- Can we come up with good approximate decision-making techniques?
- Can we implement hardware to automatically perform the experiments?
- AB = Automated Biologist
63Learning from relational data
Can we learn concepts from a set of relations between objects, instead of / in addition to just their attributes?
64Learning from relational data approaches
- Probabilistic relational models (PRMs)
- Reify a relationship (arcs) between nodes (objects) by making it into a node (hypergraph)
- Inductive Logic Programming (ILP)
- Top-down, e.g., FOIL (a generalization of C4.5)
- Bottom-up, e.g., PROGOL (inverse deduction)
65ILP for learning protein folding: input
(Figure: positive ("yes") and negative ("no") example protein structures)
TotalLength(D2mhr, 118), NumberHelices(D2mhr, 6), ...
100 conjuncts describing the structure of each pos/neg example
66ILP for learning protein folding: results
- PROGOL learned the following rule to predict if a protein will form a four-helical up-and-down bundle.
- In English: the protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3, and h1 is next to a second helix.
67ILP: Pros and Cons
- Can discover new predicates (concepts) automatically
- Can learn relational models from relational (or flat) data
- - Computationally intractable
- - Poor handling of noise
68The future of machine learning for bioinformatics?
(Figure: the learner as an oracle)
69The future of machine learning for bioinformatics
(Diagram: the learner combines prior knowledge, biological literature, and replicated experiments to propose hypotheses and experiment designs, which are tested against the real world)
- Computer-assisted pathway refinement
70The end
71Decision trees
(Figure: decision tree with tests "blue?", "oval?", "big?" at internal nodes and yes/no leaves)
72Decision trees
(Figure: the same decision tree)
Handles mixed variables
Handles missing data
Efficient for large data sets
Handles irrelevant attributes
Easy to understand
- Predictive power
73Feedforward neural network
(Figure: network with an input layer, a hidden layer, and an output; a sigmoid function at each node and a weight on each arc)
74Feedforward neural network
(Figure: the same feedforward network)
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predicts poorly
75Nearest Neighbor
- Remember all your data
- When someone asks a question,
- find the nearest old data point
- return the answer associated with it
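A tiny NumPy sketch of this procedure (illustrative data):

```python
# Nearest-neighbor sketch: remember the data; answer queries with the
# label of the closest stored point.
import numpy as np

def nearest_neighbor(X_train, y_train, query):
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every old point
    return y_train[dists.argmin()]                    # answer of the nearest one

# Illustrative training set with two labeled clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["no", "no", "yes", "yes"])
print(nearest_neighbor(X_train, y_train, np.array([4.8, 5.1])))   # -> "yes"
```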
76Nearest Neighbor
(Figure: query point "?" among labeled training points)
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predictive power
77Support Vector Machines (SVMs)
- Two key ideas
- Large margins are good
- Kernel trick
78SVM mathematical details
- Training data: l-dimensional vectors, each with a flag of true or false
79Replace all inner products with kernels
Kernel function
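The general form of the substitution (the standard kernelized decision rule, stated here for completeness rather than taken from the slide):

K(xi, xj) = φ(xi) · φ(xj)
f(x) = sign( Σ_i αi yi K(xi, x) + b )

so the classifier never needs the high-dimensional feature map φ explicitly, only the inner products supplied by the kernel.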
80SVMs summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predictive power
General lessons from SVM success:
- The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
- Large margin classifiers are good
81Boosting summary
- Can boost any weak learner
- Most commonly: boosted decision stumps
Handles mixed variables
Handles missing data
Efficient for large data sets
Handles irrelevant attributes
- Easy to understand
Predictive power
82Supervised learning summary
- Learn a mapping F from inputs to outputs using a training set of (x, t) pairs
- F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
- Algorithms offer a variety of tradeoffs
- Many good books, e.g.,
- The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2001
- Pattern Classification, Duda, Hart, Stork, 2001
83Inference
- Posterior probabilities
- Probability of any event given any evidence
- Most likely explanation
- Scenario that explains evidence
- Rational decision making
- Maximize expected utility
- Value of Information
- Effect of intervention
(Figure: alarm network with evidence nodes, e.g., Radio and Call)
84Assumption needed to make learning work
- We need to assume "future futures will resemble past futures" (B. Russell)
- Unlearnable hypothesis: "All emeralds are grue", where grue means green if observed before time t, blue afterwards.
85Structure learning success stories: gene regulation network (Friedman et al.)
- Yeast data [Hughes et al., 2000]
- 600 genes
- 300 experiments
86Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)
- Input: biological sequences
- Human CGTTGC
- Chimp CCTAGG
- Orang CGAACG
- ...
- Output: a phylogeny
Uses structural EM, with a max-spanning-tree computation in the inner loop
(Figure: reconstructed phylogeny)
87Instances of graphical models
(Diagram: graphical models are a subset of probabilistic models; directed examples include Bayes nets (e.g., the naïve Bayes classifier, mixtures of experts) and DBNs (e.g., HMMs, the Kalman filter model); undirected examples (MRFs) include the Ising model)
88ML enabling technologies
- Faster computers
- More data
- The web
- Parallel corpora (machine translation)
- Multiple sequenced genomes
- Gene expression arrays
- New ideas
- Kernel trick
- Large margins
- Boosting
- Graphical models