An introduction to machine learning and probabilistic graphical models
1
An introduction to machine learning and
probabilistic graphical models
  • Kevin Murphy
  • MIT AI Lab

Presented at Intel's workshop on Machine learning for the life sciences, Berkeley, CA, 3 November 2003
2
Overview
  • Supervised learning
  • Unsupervised learning
  • Graphical models
  • Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
3
Supervised learning
no
yes
Color Shape Size Output
Blue Torus Big Y
Blue Square Small Y
Blue Star Small Y
Red Arrow Small N
4
Supervised learning
Training data
X1 X2 X3 T
B T B Y
B S S Y
B S S Y
R A S N
Learner
Prediction
T
Y
N
Testing data
X1 X2 X3 T
B A S ?
Y C S ?
Hypothesis
5
Key issue: generalization
(Figure: labeled training examples marked yes/no, plus new unlabeled examples marked ?)
Can't just memorize the training set (overfitting)
6
Hypothesis spaces
  • Decision trees
  • Neural networks
  • K-nearest neighbors
  • Naïve Bayes classifier
  • Support vector machines (SVMs)
  • Boosted decision stumps

7
Perceptron (neural net with no hidden layers)
Linearly separable data
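Below is a minimal sketch (not from the original slides) of the perceptron learning rule finding a separating hyperplane on a tiny, linearly separable 2-D dataset; the toy data, learning loop and stopping rule are illustrative choices.

# Minimal perceptron sketch (illustrative; not from the original slides).
# Learns a linear separator w.x + b = 0 on a tiny, linearly separable 2-D dataset.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy inputs
y = np.array([1, 1, -1, -1])                                        # labels in {-1, +1}

w = np.zeros(2)   # weight vector
b = 0.0           # bias
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi                    # perceptron update
            b += yi
            errors += 1
    if errors == 0:                         # converged: all points correctly classified
        break

print("weights:", w, "bias:", b)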
8
Which separating hyperplane?
9
The linear separator with the largest margin is
the best one to pick
10
What if the data is not linearly separable?
11
Kernel trick
The kernel implicitly maps from 2D to 3D, making the problem linearly separable
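A small sketch of this idea, assuming the polynomial kernel K(x, z) = (x·z)² and the explicit 2D-to-3D map φ(x) = (x1², √2·x1·x2, x2²); the particular kernel and numbers are illustrative, not taken from the slide.

# Sketch of the kernel idea (illustrative): an explicit 2D -> 3D feature map
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) whose inner product equals the
# polynomial kernel K(x, z) = (x . z)^2, so phi never has to be built explicitly.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(np.dot(phi(x), phi(z)))   # explicit mapping, then inner product
print(poly_kernel(x, z))        # same value, computed directly in 2D

# Data on a circle (class depends only on radius) is not linearly separable in 2D,
# but in phi-space the decision rule becomes linear: x1^2 + x2^2 = r^2.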
12
Support Vector Machines (SVMs)
  • Two key ideas
  • Large margins
  • Kernel trick

13
Boosting
Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations
Boosting maximizes the margin
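A minimal AdaBoost-style sketch with decision stumps as weak learners; this is an illustrative variant, not necessarily the exact algorithm behind the slide, and the toy 1-D "interval" data is made up.

# AdaBoost sketch with decision stumps as weak learners (illustrative).
import numpy as np

def best_stump(X, y, w):
    """Pick the (feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.ones(n) / n              # example weights
    ensemble = []                   # list of (alpha, feature, threshold, polarity)
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this weak learner
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)             # upweight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1) for a, j, t, p in ensemble)
    return np.sign(score)

# Toy 1-D "interval" target: no single stump can represent it; boosting re-weights
# the data so later stumps focus on the points earlier stumps got wrong.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, 1, 1, -1, -1])
model = adaboost(X, y, rounds=10)
pred = predict(model, X)
print(pred, "training accuracy:", np.mean(pred == y))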
14
Supervised learning success stories
  • Face detection
  • Steering an autonomous car across the US
  • Detecting credit card fraud
  • Medical diagnosis

15
Unsupervised learning
  • What if there are no output labels?

16
K-means clustering
  1. Guess number of clusters, K
  2. Guess initial cluster centers, μ1, ..., μK
  3. Assign data points xi to the nearest cluster center
  4. Re-compute cluster centers based on assignments (see the sketch below)
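A minimal sketch of the four steps above; the toy 2-D blob data and the convergence test are illustrative choices.

# Minimal K-means sketch following the four steps above (toy data).
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # step 2: initial centers
    for _ in range(iters):
        # step 3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels

# Toy data: two blobs in 2-D; K (step 1) is guessed to be 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centers, labels = kmeans(X, K=2)
print(centers)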

17
AutoClass (Cheeseman et al, 1986)
  • EM algorithm for mixtures of Gaussians
  • Soft version of K-means
  • Uses Bayesian criterion to select K
  • Discovered new types of stars from spectral data
  • Discovered new classes of proteins and introns
    from DNA/protein sequence databases

18
Hierarchical clustering
19
Principal Component Analysis (PCA)
  • PCA seeks a projection that best represents the
    data in a least-squares sense.

PCA reduces the dimensionality of feature space
by restricting attention to those directions
along which the scatter of the cloud is greatest.
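A small sketch of this least-squares projection, computed via the SVD of the centered data matrix; the toy 3-D data is made up.

# PCA sketch via the SVD of the centered data matrix (illustrative).
import numpy as np

def pca(X, num_components):
    Xc = X - X.mean(axis=0)                     # center the cloud of points
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:num_components]            # directions of greatest scatter
    projected = Xc @ components.T               # low-dimensional representation
    explained = (S**2) / np.sum(S**2)           # fraction of variance per direction
    return projected, components, explained[:num_components]

# Toy data: 3-D points that mostly vary along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.05 * rng.normal(size=(200, 3))
Z, comps, frac = pca(X, num_components=1)
print("principal direction:", comps[0], "variance explained:", frac[0])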
20
Discovering nonlinear manifolds
21
Combining supervised and unsupervised learning
22
Discovering rules (data mining)
Occup. Income Educ. Sex Married Age
Student 10k MA M S 22
Student 20k PhD F S 24
Doctor 80k MD M M 30
Retired 30k HS F M 60
Find the most frequent patterns (association
rules)
Num in household = 1 ∧ num children = 0 ⇒ language = English
Language = English ∧ Income < 40k ∧ Married = false ∧ num children = 0 ⇒ education ∈ {college, grad school}
23
Unsupervised learning summary
  • Clustering
  • Hierarchical clustering
  • Linear dimensionality reduction (PCA)
  • Non-linear dim. reduction
  • Learning rules

24
Discovering networks
?
From data visualization to causal discovery
25
Networks in biology
  • Most processes in the cell are controlled by
    networks of interacting molecules
  • Metabolic Network
  • Signal Transduction Networks
  • Regulatory Networks
  • Networks can be modeled at multiple levels of
    detail/ realism
  • Molecular level
  • Concentration level
  • Qualitative level

Decreasing detail
26
Molecular level: Lysis-Lysogeny circuit in Lambda phage
Arkin et al. (1998), Genetics 149(4):1633-48
5 genes, 67 parameters based on 50 years of research. Stochastic simulation required a supercomputer.
27
Concentration level: metabolic pathways
  • Usually modeled with differential equations

28
Qualitative level: Boolean Networks
29
Probabilistic graphical models
  • Supports graph-based modeling at various levels
    of detail
  • Models can be learned from noisy, partial data
  • Can model inherently stochastic phenomena,
    e.g., molecular-level fluctuations
  • But can also model deterministic, causal
    processes.

"The actual science of logic is conversant at
present only with things either certain,
impossible, or entirely doubtful. Therefore the
true logic for this world is the calculus of
probabilities." -- James Clerk Maxwell
"Probability theory is nothing but common sense
reduced to calculation." -- Pierre Simon Laplace
30
Graphical models outline
  • What are graphical models?
  • Inference
  • Structure learning

31
Simple probabilistic model: linear regression
Deterministic (functional) relationship Y ≈ α + β X, plus noise:
Y = α + β X + ε,  ε ~ N(0, σ²)
(Figure: Y vs. X scatter with a fitted line)
32
Simple probabilistic model: linear regression
Y = α + β X + ε,  ε ~ N(0, σ²)
Learning = estimating the parameters α, β, σ from (x, y) pairs:
α is related to the empirical means (α = mean(y) − β·mean(x)), β can be estimated by least squares, and σ² is the residual variance.
(Figure: Y vs. X scatter with a fitted line)
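A minimal sketch of this estimation, assuming the reconstructed model form Y = α + βX + ε; the synthetic (x, y) pairs and the true parameter values used to generate them are made up.

# Least-squares fit of the linear-Gaussian model Y = alpha + beta*X + noise.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, size=200)   # synthetic (x, y) pairs

A = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (alpha_hat + beta_hat * x)
sigma2_hat = residuals.var()                       # residual variance estimate

print(alpha_hat, beta_hat, sigma2_hat)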
33
Piecewise linear regression
Latent switch variable: hidden process at work
34
Probabilistic graphical model for piecewise
linear regression
input
  • Hidden variable Q chooses which set ofparameters
    to use for predicting Y.
  • Value of Q depends on value of input X.
  • This is an example of mixtures of experts

output
Learning is harder because Q is hidden, so we don't know which data points to assign to each line; this can be solved with EM (cf. K-means)
35
Classes of graphical models
Probabilistic models
Graphical models
Undirected
Directed
Bayes nets
MRFs
DBNs
36
Bayesian Networks
Compact representation of probability
distributions via conditional independence
  • Qualitative part
  • Directed acyclic graph (DAG)
  • Nodes - random variables
  • Edges - direct influence

Earthquake
Burglary
Radio
Alarm
Call
Together they define a unique distribution in factored form:
P(B, E, A, R, C) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)
Quantitative part: a set of conditional probability distributions (CPDs)
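A small sketch of using the factored form above to evaluate the joint distribution; the CPT numbers are made-up illustrative values, not taken from the slide.

# Sketch: the DAG above factors the joint as
#   P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
# The CPT numbers below are made-up illustrative values.
from itertools import product

P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A = {  # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_R = {True: 0.9, False: 0.01}    # P(Radio report | Earthquake)
P_C = {True: 0.7, False: 0.05}    # P(Call | Alarm)

def joint(b, e, a, r, c):
    """Probability of one complete assignment, using the factored form."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pr = P_R[e] if r else 1 - P_R[e]
    pc = P_C[a] if c else 1 - P_C[a]
    return P_B[b] * P_E[e] * pa * pr * pc

# e.g. probability of: no burglary, no earthquake, alarm, no radio report, a call
print(joint(False, False, True, False, True))

# Sanity check: the factored distribution sums to 1 over all 2^5 assignments.
print(sum(joint(*vals) for vals in product([True, False], repeat=5)))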
37
Example: ICU Alarm network
  • Domain: monitoring intensive-care patients
  • 37 variables
  • 509 parameters
  • instead of 2^54

38
Success stories for graphical models
  • Multiple sequence alignment
  • Forensic analysis
  • Medical and fault diagnosis
  • Speech recognition
  • Visual tracking
  • Channel coding at Shannon limit
  • Genetic pedigree analysis

39
Graphical models outline
  • What are graphical models? ✓
  • Inference
  • Structure learning

40
Probabilistic Inference
  • Posterior probabilities
  • Probability of any event given any evidence
  • P(X|E)

Radio
Call
41
Viterbi decoding
Compute most probable explanation (MPE) of
observed data
Hidden Markov Model (HMM)
(Figure: hidden state chain X1 → X2 → X3 with observed outputs Y1, Y2, Y3; e.g., a pronunciation model for the word "tomato")
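A minimal Viterbi sketch for a small HMM; the two-state transition, emission and initial-state numbers are made up for illustration.

# Minimal Viterbi sketch for a toy HMM (numbers are illustrative).
import numpy as np

states = ["s0", "s1"]
A = np.array([[0.7, 0.3],      # transition probabilities A[i, j] = P(X_t+1 = j | X_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # emission probabilities B[i, k] = P(Y_t = k | X_t = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial state distribution

def viterbi(obs):
    T, N = len(obs), len(pi)
    logdelta = np.zeros((T, N))            # best log-prob of any path ending in state j at time t
    backptr = np.zeros((T, N), dtype=int)
    logdelta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = logdelta[t - 1][:, None] + np.log(A)        # (from_state, to_state)
        backptr[t] = scores.argmax(axis=0)
        logdelta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace back the most probable explanation (MPE) of the observations.
    path = [int(logdelta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return list(reversed(path))

obs = [0, 0, 1, 1, 0]                      # observed symbol indices
print([states[i] for i in viterbi(obs)])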
42
Inference computational issues
Easy
Hard
Dense, loopy graphs
Chains
Trees
Grids
43
Inference computational issues
Easy
Hard
Dense, loopy graphs
Chains
Trees
Grids
Many different inference algorithms, both exact and approximate
44
Bayesian inference
  • Bayesian probability treats parameters as random
    variables
  • Learning / parameter estimation is replaced by probabilistic inference: P(θ|D)
  • Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

Parameters are tied (shared) across repetitions of the data
(Figure: plate model in which θ is shared across all n cases (X1, Y1), ..., (Xn, Yn))
45
Bayesian inference
  • + Elegant: no distinction between parameters and other hidden variables
  • + Can use priors to learn from small data sets (cf. one-shot learning by humans)
  • - Math can get hairy
  • - Often computationally intractable

46
Graphical models outline
  • What are graphical models? ✓
  • Inference ✓
  • Structure learning

47
Why Struggle for Accurate Structure?
Missing an arc:
  • Cannot be compensated for by fitting parameters
  • Wrong assumptions about domain structure
Adding an arc:
  • Increases the number of parameters to be estimated
  • Wrong assumptions about domain structure

48
Score-based Learning
Define a scoring function that evaluates how well a structure matches the data
(Figure: candidate network structures over the variables E, B, A, compared by score)
Search for a structure that maximizes the score
49
Learning Trees
  • Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (sketched below)
  • If some of the variables are hidden, problem
    becomes hard again, but can use EM to fit
    mixtures of trees
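A small sketch of the max-weight spanning tree (Chow-Liu) idea from the first bullet, assuming fully observed binary data; the empirical mutual-information weights and the toy chain data are illustrative.

# Tree structure learning sketch (Chow-Liu): weight each edge by empirical
# mutual information, then take a max-weight spanning tree (Prim-style).
import numpy as np

def mutual_information(x, y):
    """Empirical MI between two discrete (here binary) variables."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(data):
    """Return the edges of a max-weight spanning tree over the variables."""
    d = data.shape[1]
    W = np.array([[mutual_information(data[:, i], data[:, j]) for j in range(d)]
                  for i in range(d)])
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        best = max(((W[i, j], i, j) for i in in_tree for j in range(d) if j not in in_tree))
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# Toy data: X1 drives X2, X2 drives X3; X4 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 1000)
x2 = (x1 ^ (rng.random(1000) < 0.1)).astype(int)    # mostly copies x1
x3 = (x2 ^ (rng.random(1000) < 0.1)).astype(int)    # mostly copies x2
x4 = rng.integers(0, 2, 1000)
print(chow_liu_tree(np.column_stack([x1, x2, x3, x4])))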

50
Heuristic Search
  • Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search (a greedy hill-climbing sketch follows this list)
  • Define a search space
  • search states are possible structures
  • operators make small changes to structure
  • Traverse space looking for high-scoring
    structures
  • Search techniques
  • Greedy hill-climbing
  • Best first search
  • Simulated Annealing
  • ...
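A minimal greedy hill-climbing sketch with add/delete/reverse operators and a BIC-style score, assuming fully observed binary variables; the scoring details and the toy chain data are illustrative simplifications, not the slide's exact procedure.

# Greedy hill-climbing over DAGs with a BIC-style score (illustrative sketch).
import numpy as np
from itertools import product

def family_score(data, child, parents):
    """BIC-style score of one node given its parent set (binary variables)."""
    n, score = len(data), 0.0
    for pa_vals in product([0, 1], repeat=len(parents)):
        mask = np.ones(n, dtype=bool)
        for p, v in zip(parents, pa_vals):
            mask &= data[:, p] == v
        n_pa = mask.sum()
        if n_pa == 0:
            continue
        for v in (0, 1):
            n_xv = np.sum(data[mask, child] == v)
            if n_xv > 0:
                score += n_xv * np.log(n_xv / n_pa)
    return score - 0.5 * np.log(n) * (2 ** len(parents))   # one free parameter per parent config

def bic(data, parents):
    return sum(family_score(data, i, parents[i]) for i in range(data.shape[1]))

def is_acyclic(parents):
    """Cycle check on the parent-set representation of the graph (DFS)."""
    d, color = len(parents), [0] * len(parents)
    def visit(u):
        color[u] = 1
        for v in parents[u]:          # edge v -> u, walked backwards
            if color[v] == 1 or (color[v] == 0 and not visit(v)):
                return False
        color[u] = 2
        return True
    return all(color[u] == 2 or visit(u) for u in range(d))

def hill_climb(data):
    d = data.shape[1]
    parents = [set() for _ in range(d)]
    best, improved = bic(data, parents), True
    while improved:
        improved = False
        for i in range(d):
            for j in range(d):
                if i == j:
                    continue
                for move in ("add", "delete", "reverse"):   # the local search operators
                    cand = [set(p) for p in parents]
                    if move == "add" and i not in parents[j]:
                        cand[j].add(i)
                    elif move == "delete" and i in parents[j]:
                        cand[j].discard(i)
                    elif move == "reverse" and i in parents[j]:
                        cand[j].discard(i)
                        cand[i].add(j)
                    else:
                        continue
                    if not is_acyclic(cand):
                        continue
                    s = bic(data, cand)
                    if s > best + 1e-9:
                        parents, best, improved = cand, s, True
    return parents, best

# Toy data generated from the chain X0 -> X1 -> X2.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = (x0 ^ (rng.random(2000) < 0.1)).astype(int)
x2 = (x1 ^ (rng.random(2000) < 0.1)).astype(int)
parents, score = hill_climb(np.column_stack([x0, x1, x2]))
print("learned parent sets:", parents, "score:", round(score, 1))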

51
Local Search Operations
  • Typical operations

Add C → D
Δscore = S({C,E} → D) − S({E} → D)
Reverse C → E
Delete C → E
52
Problems with local search
Easy to get stuck in local optima
(Figure: score landscape S(G|D); greedy search ends at a local optimum ("you") far from the global optimum ("truth"))
53
Problems with local search II
Picking a single best model can be misleading
54
Problems with local search II
Picking a single best model can be misleading
  • Small sample size ⇒ many high-scoring models
  • Answer based on one model often useless
  • Want features common to many models

55
Bayesian Approach to Structure Learning
  • Posterior distribution over structures
  • Estimate probability of features
  • Edge X → Y
  • Path X → … → Y

P(f|D) = Σ_G f(G) P(G|D), where P(G|D) is the Bayesian score for G and f(G) is the indicator function for the feature f (e.g., edge X → Y)
56
Bayesian approach: computational issues
  • Posterior distribution over structures

How do we compute the sum over a super-exponential number of graphs?
  • MCMC over networks
  • MCMC over node-orderings (Rao-Blackwellisation)

57
Structure learning: other issues
  • Discovering latent variables
  • Learning causal models
  • Learning from interventional data
  • Active learning

58
Discovering latent variables
a) 17 parameters
b) 59 parameters
There are some techniques for automatically detecting the possible presence of latent variables
59
Learning causal models
  • So far, we have only assumed that X → Y → Z means that Z is independent of X given Y.
  • However, we often want to interpret directed
    arrows causally.
  • This is uncontroversial for the arrow of time.
  • But can we infer causality from static
    observational data?

60
Learning causal models
  • We can infer causality from static observational
    data if we have at least four measured variables
    and certain tetrad conditions hold.
  • See books by Pearl and Spirtes et al.
  • However, we can only learn up to Markov equivalence, no matter how much data we have.

(Figure: four candidate structures over X, Y, Z illustrating Markov equivalence)
61
Learning from interventional data
  • The only way to distinguish between Markov
    equivalent networks is to perform interventions,
    e.g., gene knockouts.
  • We need to (slightly) modify our learning
    algorithms.

(Figure: smoking → yellow fingers, before and after intervention)
Cut arcs coming into nodes which were set by intervention.
P(smoker | do(paint yellow)) = prior
P(smoker | observe(yellow)) >> prior
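A tiny sketch of this arc-cutting ("graph surgery") step; the graph and node names are illustrative.

# Sketch: before scoring or doing inference with do(X = x), remove all arcs
# coming into the intervened nodes. Graphs are {node: set of parents}.
def intervene(parents, intervened_nodes):
    """Return a mutilated copy of the graph with incoming arcs to intervened nodes cut."""
    return {node: (set() if node in intervened_nodes else set(pa))
            for node, pa in parents.items()}

graph = {"smoking": set(), "yellow_fingers": {"smoking"}, "cancer": {"smoking"}}

# Painting fingers yellow is not evidence about smoking: under do(yellow_fingers)
# the arc smoking -> yellow_fingers is cut, so the posterior on smoking stays at
# the prior, whereas merely observing yellow fingers raises it.
print(intervene(graph, {"yellow_fingers"}))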
62
Active learning
  • Which experiments (interventions) should we
    perform to learn structure as efficiently as
    possible?
  • This problem can be modeled using decision
    theory.
  • Exact solutions are wildly computationally
    intractable.
  • Can we come up with good approximate decision
    making techniques?
  • Can we implement hardware to automatically
    perform the experiments?
  • AB = Automated Biologist

63
Learning from relational data
Can we learn concepts from a set of relations between objects, instead of / in addition to, just their attributes?
64
Learning from relational data: approaches
  • Probabilistic relational models (PRMs)
  • Reify a relationship (arc) between nodes (objects) by making it into a node (hypergraph)
  • Inductive Logic Programming (ILP)
  • Top-down, e.g., FOIL (generalization of C4.5)
  • Bottom up, e.g., PROGOL (inverse deduction)

65
ILP for learning protein folding: input
yes
no
TotalLength(D2mhr, 118) NumberHelices(D2mhr, 6)

100 conjuncts describing structure of each
pos/neg example
66
ILP for learning protein folding: results
  • PROGOL learned the following rule to predict if a
    protein will form a four-helical up-and-down
    bundle
  • In English: the protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3, and h1 is next to a second helix

67
ILP Pros and Cons
  • + Can discover new predicates (concepts) automatically
  • + Can learn relational models from relational (or flat) data
  • - Computationally intractable
  • - Poor handling of noise

68
The future of machine learning for bioinformatics?
Oracle
69
The future of machine learning for bioinformatics
Prior knowledge
Hypotheses
Replicated experiments
Learner
Biological literature
Expt.design
Real world
  • Computer assisted pathway refinement

70
The end
71
Decision trees
blue?
oval?
yes
big?
no
no
yes
72
Decision trees
blue?
oval?
yes
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power
big?
no
no
yes
73
Feedforward neural network
input
Hidden layer
Output
Sigmoid function at each node
Weights on each arc
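A minimal forward-pass sketch for one hidden layer, with a sigmoid at each node and a weight on each arc; the layer sizes and random weights are illustrative, and no training is shown.

# One-hidden-layer forward pass: sigmoid at each node, a weight on each arc.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input (3 units) -> hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden layer -> output (1 unit)

def forward(x):
    h = sigmoid(x @ W1 + b1)      # hidden-layer activations
    return sigmoid(h @ W2 + b2)   # network output in (0, 1)

print(forward(np.array([0.5, -1.0, 2.0])))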
74
Feedforward neural network
input
Hidden layer
Output
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
Predicts poorly
75
Nearest Neighbor
  • Remember all your data
  • When someone asks a question,
  • find the nearest old data point
  • return the answer associated with it (see the sketch below)
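A minimal sketch of exactly this procedure: 1-nearest-neighbour with Euclidean distance on toy data.

# Nearest neighbour: remember all the data, answer with the closest stored point.
import numpy as np

class NearestNeighbor:
    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)   # just remember everything
        return self

    def predict(self, x):
        dists = np.linalg.norm(self.X - np.asarray(x), axis=1)
        return self.y[dists.argmin()]                   # answer of the nearest old point

nn = NearestNeighbor().fit([[0, 0], [0, 1], [5, 5], [6, 5]], ["blue", "blue", "red", "red"])
print(nn.predict([0.2, 0.4]))   # nearest stored point is blue
print(nn.predict([5.5, 4.9]))   # nearest stored point is red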

76
Nearest Neighbor
?
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
77
Support Vector Machines (SVMs)
  • Two key ideas
  • Large margins are good
  • Kernel trick

78
SVM mathematical details
  • Training data: {(x_i, y_i)}, each x_i an l-dimensional vector with a flag y_i ∈ {−1, +1} (true or false)
  • Separating hyperplane: w · x + b = 0
  • Margin: 2 / ||w||
  • Inequalities: y_i (w · x_i + b) ≥ 1 for all i
  • Support vector expansion: w = Σ_i α_i y_i x_i
  • Support vectors: the training points with α_i > 0
  • Decision: f(x) = sign(w · x + b) = sign(Σ_i α_i y_i (x_i · x) + b)
79
Replace all inner products with kernels: K(x_i, x_j) = Φ(x_i) · Φ(x_j)
80
SVMs summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
General lessons from SVM success
  • Kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
  • Large margin classifiers are good

81
Boosting summary
  • Can boost any weak learner
  • Most commonly boosted decision stumps

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power
82
Supervised learning summary
  • Learn mapping F from inputs to outputs using a
    training set of (x,t) pairs
  • F can be drawn from different hypothesis spaces,
    e.g., decision trees, linear separators, linear
    in high dimensions, mixtures of linear
  • Algorithms offer a variety of tradeoffs
  • Many good books, e.g.,
  • The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2001
  • Pattern Classification, Duda, Hart, Stork, 2001

83
Inference
  • Posterior probabilities
  • Probability of any event given any evidence
  • Most likely explanation
  • Scenario that explains evidence
  • Rational decision making
  • Maximize expected utility
  • Value of Information
  • Effect of intervention

Radio
Call
84
Assumption needed to make learning work
  • We need to assume: future futures will resemble past futures (B. Russell)
  • Unlearnable hypothesis: "All emeralds are grue", where grue means green if observed before time t, blue afterwards.

85
Structure learning success stories: gene regulation network (Friedman et al.)
  • Yeast data (Hughes et al., 2000)
  • 600 genes
  • 300 experiments

86
Structure learning success stories II: phylogenetic tree reconstruction (Friedman et al.)
  • Input: biological sequences
  • Human CGTTGC
  • Chimp CCTAGG
  • Orang CGAACG
  • ...
  • Output: a phylogeny

Uses structural EM, with max-spanning-tree in the inner loop
10 billion years
87
Instances of graphical models
Probabilistic models
Graphical models
Naïve Bayes classifier
Undirected
Directed
Bayes nets
MRFs
Mixturesof experts
DBNs
Kalman filtermodel
Ising model
Hidden Markov Model (HMM)
88
ML enabling technologies
  • Faster computers
  • More data
  • The web
  • Parallel corpora (machine translation)
  • Multiple sequenced genomes
  • Gene expression arrays
  • New ideas
  • Kernel trick
  • Large margins
  • Boosting
  • Graphical models