An introduction to machine learning and probabilistic graphical models

- Kevin Murphy
- MIT AI Lab

Presented at Intel's workshop on Machine learning for the life sciences, Berkeley, CA, 3 November 2003

Overview

- Supervised learning
- Unsupervised learning
- Graphical models
- Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

Supervised learning

(figure: example shapes, each labeled yes/no)

Color Shape  Size  Output
Blue  Torus  Big   Y
Blue  Square Small Y
Blue  Star   Small Y
Red   Arrow  Small N

Supervised learning

Training data

X1 X2 X3 T
B  T  B  Y
B  S  S  Y
B  S  S  Y
R  A  S  N

Testing data

X1 X2 X3 T
B  A  S  ?
Y  C  S  ?

The learner fits a hypothesis to the training data, then uses it to predict the missing test labels (here, Y and N)

Key issue: generalization

(figure: labeled training examples, plus novel examples marked "?")

Can't just memorize the training set (overfitting)

Hypothesis spaces

- Decision trees
- Neural networks
- K-nearest neighbors
- Naïve Bayes classifier
- Support vector machines (SVMs)
- Boosted decision stumps

Perceptron (neural net with no hidden layers)

Linearly separable data

Which separating hyperplane?

The linear separator with the largest margin is the best one to pick

What if the data is not linearly separable?

Kernel trick


The kernel implicitly maps from 2D to 3D, making the problem linearly separable
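For instance, the degree-2 polynomial kernel k(x, y) = (x·y)² corresponds to an explicit map into 3D. A small check in plain Python (illustrative numbers, not from the slides):

```python
import math

def phi(x):
    """Explicit feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """k(x, y) = (x . y)^2, computed without ever leaving 2D."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in 3D
implicit = kernel(x, y)                                # same number, in 2D
```

The two quantities agree, which is why a classifier that only uses inner products never needs to build the 3D vectors.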

Support Vector Machines (SVMs)

- Two key ideas
- Large margins
- Kernel trick

Boosting

Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations

Boosting maximizes the margin

Supervised learning success stories

- Face detection
- Steering an autonomous car across the US
- Detecting credit card fraud
- Medical diagnosis

Unsupervised learning

- What if there are no output labels?

K-means clustering

- Guess number of clusters, K
- Guess initial cluster centers, μ1, μ2
- Assign data points xi to nearest cluster center
- Re-compute cluster centers based on assignments
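The four steps above can be sketched directly (plain Python; the points and initial centers are illustrative):

```python
def kmeans(points, centers, iters=20):
    """K-means: alternate (1) assign each point to the nearest center
    and (2) recompute each center as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Step (1): nearest center by squared Euclidean distance.
            j = min(range(len(centers)),
                    key=lambda k: sum((a - b) ** 2
                                      for a, b in zip(p, centers[k])))
            clusters[j].append(p)
        # Step (2): recompute centers (keep old center if cluster is empty).
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl
                   else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers
```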

AutoClass (Cheeseman et al, 1986)

- EM algorithm for mixtures of Gaussians
- Soft version of K-means
- Uses Bayesian criterion to select K
- Discovered new types of stars from spectral data
- Discovered new classes of proteins and introns from DNA/protein sequence databases

Hierarchical clustering

Principal Component Analysis (PCA)

- PCA seeks a projection that best represents the data in a least-squares sense.
- PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.
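One way to find that direction of greatest scatter is power iteration on the covariance matrix. A 2D sketch (my own illustration, not from the slides):

```python
import math

def principal_direction(points, iters=100):
    """Top principal component of 2D points: power iteration on the
    2x2 covariance matrix of the centered data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Covariance matrix entries.
    cxx = sum(x * x for x, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for x, y in centered) / n
    # Repeatedly multiply by the matrix; converges to the top eigenvector.
    v = (1.0, 1.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v
```

Projecting each centered point onto this unit vector gives the 1D representation with the smallest least-squares reconstruction error.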

Discovering nonlinear manifolds

Combining supervised and unsupervised learning

Discovering rules (data mining)

Occup.  Income Educ. Sex Married Age
Student 10k    MA    M   S       22
Student 20k    PhD   F   S       24
Doctor  80k    MD    M   M       30
Retired 30k    HS    F   M       60

Find the most frequent patterns (association rules):

num in household = 1, num children = 0 => language = English

language = English, income < 40k, married = false, num children = 0 => education in {college, grad school}
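The counting step behind such rules can be sketched as follows (toy transactions; real association-rule miners such as Apriori prune the candidate space, which this sketch omits):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support):
    """Count every 1- and 2-item pattern across the transactions and
    keep those occurring at least min_support times."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {pattern: c for pattern, c in counts.items() if c >= min_support}
```

A rule like "A => B" is then read off a frequent pair (A, B) whose count is close to the count of (A,) alone.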

Unsupervised learning summary

- Clustering
- Hierarchical clustering
- Linear dimensionality reduction (PCA)
- Non-linear dim. reduction
- Learning rules

Discovering networks


From data visualization to causal discovery

Networks in biology

- Most processes in the cell are controlled by networks of interacting molecules:
  - Metabolic networks
  - Signal transduction networks
  - Regulatory networks
- Networks can be modeled at multiple levels of detail/realism (in decreasing order of detail):
  - Molecular level
  - Concentration level
  - Qualitative level

Molecular level: lysis-lysogeny circuit in lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

5 genes, 67 parameters based on 50 years of research; stochastic simulation required a supercomputer

Concentration level: metabolic pathways

- Usually modeled with differential equations

Qualitative level: Boolean networks

Probabilistic graphical models

- Supports graph-based modeling at various levels of detail
- Models can be learned from noisy, partial data
- Can model inherently stochastic phenomena, e.g., molecular-level fluctuations
- But can also model deterministic, causal processes

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace

Graphical models outline

- What are graphical models?
- Inference
- Structure learning

Simple probabilistic model: linear regression

Y = α + βX + noise, noise ~ N(0, σ)

(figure: scatter of (x, y) points with fitted line; contrast with a purely deterministic, functional relationship)

Learning = estimating the parameters α, β, σ from (x, y) pairs:

- α̂ is the empirical mean
- β̂ can be estimated by least squares
- σ̂² is the residual variance
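The three estimates can be computed in a few lines (an illustrative sketch with made-up data):

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = alpha + beta*x, plus residual variance."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Least-squares slope: covariance over variance of x.
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    # Intercept from the empirical means.
    alpha = my - beta * mx
    # Residual variance around the fitted line.
    var = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys)) / n
    return alpha, beta, var
```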

Piecewise linear regression

Latent switch variable: hidden process at work

Probabilistic graphical model for piecewise linear regression

(figure: input X -> switch Q -> output Y)

- Hidden variable Q chooses which set of parameters to use for predicting Y.
- Value of Q depends on value of input X.
- This is an example of "mixtures of experts".

Learning is harder because Q is hidden, so we don't know which data points to assign to each line; can be solved with EM (c.f., K-means)

Classes of graphical models

- Probabilistic models
  - Graphical models
    - Undirected: MRFs
    - Directed: Bayes nets, DBNs

Bayesian Networks

Compact representation of probability distributions via conditional independence

- Qualitative part:
  - Directed acyclic graph (DAG)
  - Nodes - random variables
  - Edges - direct influence

(network: Earthquake -> Alarm, Burglary -> Alarm, Earthquake -> Radio, Alarm -> Call)

- Quantitative part: set of conditional probability distributions
- Together: define a unique distribution in a factored form
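As a concrete sketch, the factored joint for the Alarm example is P(B,E,A,R,C) = P(B) P(E) P(A|B,E) P(R|E) P(C|A). The CPT numbers below are made up for illustration, not taken from the talk:

```python
from itertools import product

# Hypothetical CPTs (probability that the variable is True).
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=1 | B, E)
P_R = {True: 0.9, False: 0.01}                       # P(R=1 | E)
P_C = {True: 0.7, False: 0.05}                       # P(C=1 | A)

def bernoulli(p_true, value):
    return p_true if value else 1 - p_true

def joint(b, e, a, r, c):
    """P(B,E,A,R,C) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)."""
    return (P_B[b] * P_E[e]
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_R[e], r)
            * bernoulli(P_C[a], c))

# The 2^5 factored entries form a proper distribution.
total = sum(joint(*vals) for vals in product([True, False], repeat=5))
```

Only 1 + 1 + 4 + 2 + 2 = 10 numbers are stored, instead of the 2^5 - 1 needed for an unstructured joint.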

Example ICU Alarm network

- Domain Monitoring Intensive-Care Patients
- 37 variables
- 509 parameters
- instead of 2^54

Success stories for graphical models

- Multiple sequence alignment
- Forensic analysis
- Medical and fault diagnosis
- Speech recognition
- Visual tracking
- Channel coding at Shannon limit
- Genetic pedigree analysis

Graphical models outline

- What are graphical models? ✓
- Inference
- Structure learning

Probabilistic Inference

- Posterior probabilities
- Probability of any event given any evidence
- P(X|E)


Viterbi decoding

Compute the most probable explanation (MPE) of the observed data

Hidden Markov Model (HMM)

hidden: X1 -> X2 -> X3
observed: Y1, Y2, Y3

(figure: HMM trellis for recognizing the word "Tomato")
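On a chain like this, the MPE is computed by the Viterbi recursion. A plain-Python sketch (the start/transition/emission tables would come from the model at hand):

```python
def viterbi(obs, states, start, trans, emit):
    """Most probable hidden state sequence for an HMM."""
    # best[s] = probability of the best path ending in state s.
    best = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev = best
        best, pointers = {}, {}
        for s in states:
            # Best predecessor state r for landing in s.
            p, arg = max((prev[r] * trans[r][s], r) for r in states)
            best[s] = p * emit[s][o]
            pointers[s] = arg
        back.append(pointers)
    # Trace the best final state back through the stored pointers.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))
```

The same dynamic program, with max replaced by sum, computes the likelihood of the observations (the forward algorithm).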

Inference: computational issues

Easy -> hard: chains, trees, grids, dense loopy graphs

Many different inference algorithms, both exact and approximate

Bayesian inference

- Bayesian probability treats parameters as random variables
- Learning/parameter estimation is replaced by probabilistic inference: P(θ|D)
- Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

Parameters are tied (shared) across repetitions of the data

(figure: plate model with θ as parent of X1..Xn and Y1..Yn)

Bayesian inference

- + Elegant: no distinction between parameters and other hidden variables
- + Can use priors to learn from small data sets (c.f., one-shot learning by humans)
- - Math can get hairy
- - Often computationally intractable

Graphical models outline

- What are graphical models? ✓
- Inference ✓
- Structure learning

Why Struggle for Accurate Structure?

Missing an arc:
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure

Adding an arc:
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure

Score-based Learning

Define a scoring function that evaluates how well a structure matches the data

(figure: candidate structures over E, B, A, each with a score)

Search for a structure that maximizes the score

Learning Trees

- Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree
- If some of the variables are hidden, the problem becomes hard again, but we can use EM to fit mixtures of trees
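The max-weight spanning tree step can be sketched with Kruskal's algorithm (in Chow-Liu tree learning the edge weights would be pairwise mutual informations; the numbers here are placeholders):

```python
def max_weight_spanning_tree(n, edges):
    """Kruskal's algorithm, taking heaviest edges first, so the result
    is a maximum-weight spanning tree. edges = [(weight, u, v), ...]."""
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    tree = []
    for w, u, v in sorted(edges, reverse=True):   # heaviest first
        ru, rv = find(u), find(v)
        if ru != rv:                              # skip cycle-forming edges
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```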

Heuristic Search

- Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search
- Define a search space:
  - search states are possible structures
  - operators make small changes to structure
- Traverse the space looking for high-scoring structures
- Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...

Local Search Operations

- Typical operations:
  - Add C -> D
  - Delete C -> E
  - Reverse C -> E

Δscore = S(C,E -> D) - S(E -> D)

Problems with local search

Easy to get stuck in local optima

truth

you

S(GD)

Problems with local search II

Picking a single best model can be misleading

- Small sample size => many high-scoring models
- Answer based on one model often useless
- Want features common to many models

Bayesian Approach to Structure Learning

- Posterior distribution over structures
- Estimate probability of features:
  - Edge X -> Y
  - Path X -> ... -> Y

P(f|D) = Σ_G f(G) P(G|D)

where P(G|D) is the Bayesian score for G, and f(G) is the indicator function for the feature f (e.g., edge X -> Y)

Bayesian approach computational issues

- Posterior distribution over structures

How do we compute the sum over a super-exponential number of graphs?

- MCMC over networks
- MCMC over node-orderings (Rao-Blackwellisation)

Structure learning other issues

- Discovering latent variables
- Learning causal models
- Learning from interventional data
- Active learning

Discovering latent variables

a) 17 parameters

b) 59 parameters

There are some techniques for automatically detecting the possible presence of latent variables

Learning causal models

- So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
- However, we often want to interpret directed arrows causally.
- This is uncontroversial for the arrow of time.
- But can we infer causality from static observational data?

Learning causal models

- We can infer causality from static observational data if we have at least four measured variables and certain tetrad conditions hold.
- See the books by Pearl and by Spirtes et al.
- However, we can only learn up to Markov equivalence, no matter how much data we have.

(figure: four Markov-equivalent structures over X, Y, Z)

Learning from interventional data

- The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
- We need to (slightly) modify our learning algorithms.

(figure: smoking -> yellow fingers, before and after intervention)

Cut arcs coming into nodes which were set by intervention

P(smoker | do(paint yellow)) = prior
P(smoker | observe(yellow)) >> prior

Active learning

- Which experiments (interventions) should we perform to learn structure as efficiently as possible?
- This problem can be modeled using decision theory.
- Exact solutions are wildly computationally intractable.
- Can we come up with good approximate decision-making techniques?
- Can we implement hardware to automatically perform the experiments?
- "AB: Automated Biologist"

Learning from relational data

Can we learn concepts from a set of relations between objects, instead of / in addition to just their attributes?

Learning from relational data approaches

- Probabilistic relational models (PRMs)
  - Reify a relationship (arcs) between nodes (objects) by making it into a node (hypergraph)
- Inductive Logic Programming (ILP)
  - Top-down, e.g., FOIL (generalization of C4.5)
  - Bottom-up, e.g., PROGOL (inverse deduction)

ILP for learning protein folding: input

(figure: positive ("yes") and negative ("no") fold examples)

TotalLength(D2mhr, 118), NumberHelices(D2mhr, 6), ...

100 conjuncts describing the structure of each pos/neg example

ILP for learning protein folding results

- PROGOL learned the following rule to predict if a protein will form a "four-helical up-and-down bundle"
- In English: the protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3, and h1 is next to a second helix
ILP Pros and Cons

- + Can discover new predicates (concepts) automatically
- + Can learn relational models from relational (or flat) data
- - Computationally intractable
- - Poor handling of noise

The future of machine learning for bioinformatics?

Oracle

The future of machine learning for bioinformatics

(figure: closed loop - the Learner combines prior knowledge, the biological literature, and replicated experiments to produce hypotheses; experiment design then probes the real world)

- Computer-assisted pathway refinement

The end

Decision trees

(figure: decision tree with internal tests blue?, oval?, big? and yes/no leaves)
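A tree like this is just nested conditionals. An illustrative sketch (the exact splits below are mine, chosen to match the earlier shapes table, not necessarily the figure's tree):

```python
def classify(color, shape, size):
    """Toy decision tree as nested conditionals (illustrative splits)."""
    if color == "blue":
        if shape == "oval":
            # Only big blue ovals are positive in this hypothetical split.
            return "yes" if size == "big" else "no"
        return "yes"
    return "no"
```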

Decision trees

(figure: the same decision tree)

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand

Predictive power

Feedforward neural network

input -> hidden layer -> output

- Sigmoid function at each node
- Weights on each arc
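A single forward pass through such a network is only a few lines (the weights below are placeholders, not trained values):

```python
import math

def sigmoid(z):
    """Squashing nonlinearity applied at each node."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """One forward pass: input -> sigmoid hidden units -> sigmoid output.
    w_hidden is one weight vector per hidden unit; w_out weights the
    hidden activations."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
              for w in w_hidden]
    return sigmoid(sum(wi * hi for wi, hi in zip(w_out, hidden)))
```

Training adjusts the arc weights, typically by gradient descent on the prediction error (backpropagation).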

Feedforward neural network

input -> hidden layer -> output

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand

Predicts poorly

Nearest Neighbor

- Remember all your data
- When someone asks a question,
- find the nearest old data point
- return the answer associated with it
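Those two steps are literally the whole algorithm. A 1-nearest-neighbour sketch with toy data:

```python
def nearest_neighbor(train, query):
    """1-NN: return the label of the stored point closest to the query
    (squared Euclidean distance)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    point, label = min(train, key=lambda pl: dist2(pl[0], query))
    return label

# "Remember all your data": just keep the labeled points around.
data = [((0.0, 0.0), "no"), ((1.0, 1.0), "yes")]
```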

Nearest Neighbor

(figure: query point "?" among labeled examples)

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand

Predictive power

Support Vector Machines (SVMs)

- Two key ideas
- Large margins are good
- Kernel trick

SVM mathematical details

- Training data: pairs (x_i, y_i), with x_i an l-dimensional vector and y_i ∈ {+1, -1} a true/false flag
- Separating hyperplane: w·x + b = 0
- Margin: 2 / ||w||
- Inequalities: y_i (w·x_i + b) >= 1
- Support vector expansion: w = Σ_i α_i y_i x_i
- Support vectors: the points with α_i > 0
- Decision: f(x) = sign(w·x + b) = sign(Σ_i α_i y_i (x_i·x) + b)

Replace all inner products with kernels: x·x' -> K(x, x')

SVMs summary

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand

Predictive power

General lessons from SVM success

- The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
- Large margin classifiers are good

Boosting summary

- Can boost any weak learner
- Most commonly boosted decision stumps

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand

Predictive power

Supervised learning summary

- Learn a mapping F from inputs to outputs using a training set of (x, t) pairs
- F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
- Algorithms offer a variety of tradeoffs
- Many good books, e.g.:
  - "The Elements of Statistical Learning", Hastie, Tibshirani, Friedman, 2001
  - "Pattern Classification", Duda, Hart, Stork, 2001

Inference

- Posterior probabilities
- Probability of any event given any evidence
- Most likely explanation
- Scenario that explains evidence
- Rational decision making
- Maximize expected utility
- Value of Information
- Effect of intervention


Assumption needed to make learning work

- We need to assume "future futures will resemble past futures" (B. Russell)
- Unlearnable hypothesis: "All emeralds are grue", where "grue" means green if observed before time t, blue afterwards

Structure learning success stories: gene regulation network (Friedman et al.)

- Yeast data [Hughes et al., 2000]
- 600 genes
- 300 experiments

Structure learning success stories II: phylogenetic tree reconstruction (Friedman et al.)

- Input: biological sequences
  - Human CGTTGC
  - Chimp CCTAGG
  - Orang CGAACG
  - ...
- Output: a phylogeny

Uses structural EM, with max-spanning-tree in the inner loop

(figure: reconstructed phylogeny spanning 10 billion years)

Instances of graphical models

- Probabilistic models
  - Graphical models
    - Directed: Bayes nets
      - Naïve Bayes classifier
      - Mixtures of experts
      - DBNs: Hidden Markov Model (HMM), Kalman filter model
    - Undirected: MRFs
      - Ising model

ML enabling technologies

- Faster computers
- More data
  - The web
  - Parallel corpora (machine translation)
  - Multiple sequenced genomes
  - Gene expression arrays
- New ideas
  - Kernel trick
  - Large margins
  - Boosting
  - Graphical models