MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
  • Elena Marchiori
  • IBIVU
  • Vrije Universiteit Amsterdam

2
Summary
  • Machine Learning
  • Supervised Learning: classification
  • Unsupervised Learning: clustering

3
Machine Learning (ML)
  • Construct a computational model from a dataset
    describing properties of an unknown (but
    existent) system.

[Diagram: observations and properties of the unknown system are fed to an ML
algorithm, which produces a computational model used for prediction.]
4
Supervised Learning
  • The dataset describes examples of the input-output
    behaviour of an unknown (but existent) system.
  • The algorithm tries to find a function
    equivalent to the system.
  • ML techniques for classification: K-nearest
    neighbour, decision trees, Naïve Bayes, Support
    Vector Machines.

5
Supervised Learning
[Diagram: a supervisor labels observations of the unknown system with the
property of interest to form training data; the ML algorithm builds a model
from the training data, and the model predicts the property for a new
observation.]
6
Example A Classification Problem
  • Categorize images of fish, say Atlantic salmon
    vs. Pacific salmon
  • Use features such as length, width, lightness,
    fin shape and number, mouth position, etc.
  • Steps:
  • Preprocessing (e.g., background subtraction)
  • Feature extraction
  • Classification

Example from Duda & Hart
7
Classification in Bioinformatics
  • Computational diagnostics: early cancer detection
  • Tumor biomarker discovery
  • Protein folding prediction
  • Protein-protein binding site prediction

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
8
Classification Techniques
  • Naïve Bayes
  • K Nearest Neighbour
  • Support Vector Machines (next lesson)

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
9
Bayesian Approach
  • Each observed training example can incrementally
    decrease or increase the probability of a hypothesis,
    instead of eliminating it
  • Prior knowledge can be combined with observed
    data to determine a hypothesis
  • Bayesian methods can accommodate hypotheses that
    make probabilistic predictions
  • New instances can be classified by combining the
    predictions of multiple hypotheses, weighted by
    their probabilities

Kathleen McKeown's slides
10
Bayesian Approach
  • Assign the most probable target value, given
    the attribute values <a1, a2, ..., an>
  • V_MAP = argmax_{vj in V} P(vj | a1, a2, ..., an)
  • Using Bayes' theorem:
  • V_MAP = argmax_{vj in V}
    P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
    = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj)
  • Bayesian learning is optimal
  • Easy to estimate P(vj) by counting in the training
    data
  • Estimating the different P(a1, a2, ..., an | vj) is not
    feasible
  • (we would need a training set of size
    proportional to the number of possible instances
    times the number of classes)

Kathleen McKeown's slides
11
Bayes' Rules
  • Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
  • Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
  • In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X)
    = α P(X|Y) P(Y)

Kathleen McKeown's slides
12
Naïve Bayes
  • Assume independence of the attributes:
    P(a1, a2, ..., an | vj) = Π_i P(ai | vj)
  • Substitute into the V_MAP formula:
  • V_NB = argmax_{vj in V} P(vj) Π_i P(ai | vj)

Kathleen McKeown's slides
13
V_NB = argmax_{vj in V} P(vj) Π_i P(ai | vj)

  #  S-length  S-width  P-length  Class
  1  high      high     high      Versicolour
  2  low       high     low       Setosa
  3  low       high     low       Virginica
  4  low       high     med       Virginica
  5  high      high     high      Versicolour
  6  high      high     med       Setosa
  7  high      high     low       Setosa
  8  high      high     high      Versicolour
  9  high      high     high      Versicolour

Kathleen McKeown's slides
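As a concrete illustration of the V_NB formula above, here is a minimal Python sketch that estimates P(vj) and P(ai | vj) by counting over the small table and classifies a new flower. The table rows are taken from the slide; the query and the function name are illustrative choices of ours.

```python
from collections import Counter

# Training data from the table: (S-length, S-width, P-length, class)
data = [
    ("high", "high", "high", "Versicolour"),
    ("low",  "high", "low",  "Setosa"),
    ("low",  "high", "low",  "Virginica"),
    ("low",  "high", "med",  "Virginica"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "med",  "Setosa"),
    ("high", "high", "low",  "Setosa"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "high", "Versicolour"),
]

def naive_bayes(query):
    """Return argmax_vj P(vj) * prod_i P(ai | vj), with probabilities estimated by counting."""
    classes = Counter(row[-1] for row in data)          # class counts n(vj)
    best_class, best_score = None, -1.0
    for vj, n_vj in classes.items():
        score = n_vj / len(data)                        # P(vj)
        for i, ai in enumerate(query):
            n_ai = sum(1 for row in data if row[-1] == vj and row[i] == ai)
            score *= n_ai / n_vj                        # P(ai | vj)
        if score > best_score:
            best_class, best_score = vj, score
    return best_class, best_score

print(naive_bayes(("high", "high", "med")))   # Setosa wins for this query
```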
14
Estimating Probabilities
  • What happens when the number of data elements is
    small?
  • Suppose the true P(S-length = low | Virginica) = .05
  • There are only 2 instances with C = Virginica
  • We estimate the probability by nc/n using the
    training set
  • Then the count nc for S-length = low with Virginica
    may well be 0
  • Then, instead of .05, we use an estimated probability
    of 0
  • Two problems:
  • Biased underestimate of the probability
  • This probability term will dominate if a future
    query contains S-length = low

Kathleen McKeown's slides
15
Instead use the m-estimate
  • Use priors as well:
  • (nc + m·p) / (n + m)
  • where p = prior estimate of P(S-length = low | Virginica)
  • m is a constant called the equivalent sample size
  • It determines how heavily to weight p relative to
    the observed data
  • Typical method: assume a uniform prior over the
    attribute values (e.g., if the values are low, med, high,
    then p = 1/3)

Kathleen McKeown's slides
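A minimal sketch of the m-estimate just described; nc, n, the prior p and the equivalent sample size m are the quantities defined on the slide, and the numbers in the example call are only illustrative.

```python
def m_estimate(nc, n, p, m):
    """Smoothed probability estimate (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# Example: 0 of the 2 Virginica instances have S-length = low,
# uniform prior p = 1/3 over {low, med, high}, equivalent sample size m = 3.
print(m_estimate(nc=0, n=2, p=1/3, m=3))   # 0.2 instead of 0
```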
16
K-Nearest Neighbour
  • Memorize the training data
  • Given a new example, find its k nearest
    neighbours, and output the majority vote class.
  • Choices:
  • How many neighbours?
  • What distance measure? (a minimal sketch follows below)
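A minimal k-NN sketch along the lines just described, assuming numeric feature vectors and Euclidean distance (both of these choices, and the toy data, are ours, not prescribed by the slide):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the majority class among the k nearest training examples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage with made-up data
X_train = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 8.0], [7.9, 8.2]])
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, np.array([1.2, 2.1]), k=3))  # -> "A"
```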

17
Application in Bioinformatics
  • A Regression-based K nearest neighbor algorithm
    for gene function prediction from heterogeneous
    data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics
    2006, 7
  • For each dataset k and each pair of genes p,
    compute the similarity f_k(p) of p with respect to
    the k-th dataset
  • Construct a predictor of gene-pair similarity, e.g.
    by logistic regression:
  • H: (f(p,1), ..., f(p,m)) -> H(f(p,1), ..., f(p,m)) such
    that
  • H has a high value if the genes of p have similar
    functions.
  • Given a new gene g, find its kNN using H as the distance
  • Predict the functional classes C1, ..., Cn of g
    with confidence equal to
  • Confidence(Ci) = 1 - Π (1 - Pij), with gj a neighbour
    of g and Ci in the set of classes of gj
    (the probability that at least one prediction is
    correct, that is, 1 minus the probability that all
    predictions are wrong); see the sketch below
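A small sketch of the confidence combination above: given probabilities Pij that neighbour gj correctly predicts class Ci, the confidence is 1 minus the probability that all of them are wrong. The probability values in the example are illustrative.

```python
def confidence(p_ij):
    """Confidence(Ci) = 1 - prod_j (1 - Pij) over neighbours gj annotated with Ci."""
    result = 1.0
    for p in p_ij:
        result *= (1.0 - p)
    return 1.0 - result

print(confidence([0.6, 0.5, 0.3]))   # 1 - 0.4*0.5*0.7 = 0.86
```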

18
Classification CV error
  • Training error = empirical error
  • Error on an independent test set = test error
  • Cross-validation (CV) error:
  • Leave-one-out (LOO)
  • N-fold CV

[Diagram: the N samples are repeatedly split, with 1/n of the samples used for
testing and the remaining (n-1)/n for training; errors are counted on each test
split and summarized as the CV error rate.]
Supervised learning
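A sketch of leave-one-out and n-fold cross-validation error estimation, using scikit-learn (assumed to be available); the classifier and the synthetic data are placeholders, not the deck's example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                               # 40 samples, 5 features (synthetic)
y = (X[:, 0] + rng.normal(scale=0.5, size=40) > 0).astype(int)

clf = KNeighborsClassifier(n_neighbors=3)

loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
cv10_acc = cross_val_score(clf, X, y,
                           cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean()

print("LOO error:     %.3f" % (1 - loo_acc))
print("10-fold error: %.3f" % (1 - cv10_acc))
```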
19
Two schemes of cross validation
[Diagram: two schemes on the same N samples. In CV1, the LOO loop trains and
tests both the gene selector and the classifier in each fold. In CV2, gene
selection is done once on all samples, and the LOO loop trains and tests only
the classifier. Errors are counted in both schemes.]
Supervised learning
20
Difference between CV1 and CV2
  • CV1: gene selection within LOOCV
  • CV2: gene selection before LOOCV
  • CV2 can yield an optimistic estimate of the
    true classification error
  • CV2 was used in the paper by Golub et al.:
  • 0 training errors
  • 2 CV errors (5.26%)
  • 5 test errors (14.7%)
  • CV error different from test error!

Supervised learning
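A sketch of the difference between the two schemes, again with scikit-learn (assumed) and synthetic data: CV2 selects genes once on all samples before the LOO loop, while CV1 repeats the selection inside every fold. With many noisy features, CV2 typically looks optimistically good even on pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 500))          # 30 samples, 500 "genes", pure noise
y = np.array([0] * 15 + [1] * 15)

clf = KNeighborsClassifier(n_neighbors=3)
loo = LeaveOneOut()

# CV2: gene selection BEFORE LOOCV (the selector sees the test samples) -> optimistic
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
cv2_acc = cross_val_score(clf, X_sel, y, cv=loo).mean()

# CV1: gene selection WITHIN each LOOCV fold (the pipeline refits the selector per fold)
pipe = make_pipeline(SelectKBest(f_classif, k=10), clf)
cv1_acc = cross_val_score(pipe, X, y, cv=loo).mean()

print("CV2 (selection outside CV) error: %.2f" % (1 - cv2_acc))
print("CV1 (selection inside CV)  error: %.2f" % (1 - cv1_acc))
```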
21
Significance of classification results
  • Permutation test:
  • Permute the class labels of the samples
  • Compute the LOOCV error on the data with permuted labels
  • Repeat the process a large number of times
  • Compare with the LOOCV error on the original data
  • P-value = (# of times the LOOCV error on permuted data is <=
    the LOOCV error on the original data) / (total number of
    permutations considered)

Supervised learning
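A sketch of the permutation test just described: the class labels are shuffled many times, the LOOCV error is recomputed each time, and the p-value is the fraction of permutations whose error is at most the error on the original labels. The classifier, the data, and the number of permutations are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def loocv_error(clf, X, y):
    return 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

def permutation_p_value(clf, X, y, n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = loocv_error(clf, X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)                  # permute the class labels
        if loocv_error(clf, X, y_perm) <= observed:  # at least as good as the original
            count += 1
    return count / n_permutations

# Toy usage with synthetic data
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 10))
y = np.array([0] * 10 + [1] * 10)
print(permutation_p_value(KNeighborsClassifier(3), X, y, n_permutations=50))
```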
22
Unsupervised Learning

ML for unsupervised learning attempts to
discover interesting structure in the available
data
Unsupervised learning
23
Unsupervised Learning
  • The dataset describes the structure of an unknown
    (but existent) system.
  • The computer program tries to identify the structure
    of the system (clustering, data compression).
  • ML techniques: hierarchical clustering, k-means,
    Self-Organizing Maps (SOM), fuzzy clustering
    (described in a future lesson).

24
Clustering
  • Clustering is one of the most important
    unsupervised learning processes for organizing
    objects into groups whose members are similar in
    some way.
  • Clustering finds structures in a collection of
    unlabeled data.
  • A cluster is a collection of objects which are
    similar to one another and dissimilar to the
    objects belonging to other clusters.

25
Clustering Algorithms
  • Start with a collection of n objects, each
    represented by a p-dimensional feature vector xi,
    i = 1, ..., n.
  • The goal is to associate the n objects with k
    clusters so that objects within a cluster are
    more similar to each other than objects in different
    clusters. k is usually unknown.
  • Popular methods: hierarchical, k-means, SOM, ...

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
26
Hierarchical Clustering
Venn Diagram of Clustered Data
Dendrogram
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
27
Hierarchical Clustering (Cont.)
  • Multilevel clustering: level 1 has n clusters,
    level n has one cluster.
  • Agglomerative HC starts with singletons and merges
    clusters.
  • Divisive HC starts with one cluster and splits
    clusters.

28
Nearest Neighbor Algorithm
  • Nearest Neighbor Algorithm is an agglomerative
    approach (bottom-up).
  • Starts with n nodes (n is the size of our
    sample), merges the 2 most similar nodes at each
    step, and stops when the desired number of
    clusters is reached.

From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
29
Nearest Neighbor, Level 2, k = 7 clusters.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
30
Nearest Neighbor, Level 3, k = 6 clusters.
31
Nearest Neighbor, Level 4, k = 5 clusters.
32
Nearest Neighbor, Level 5, k = 4 clusters.
33
Nearest Neighbor, Level 6, k = 3 clusters.
34
Nearest Neighbor, Level 7, k = 2 clusters.
35
Nearest Neighbor, Level 8, k = 1 cluster.
36
Hierarchical Clustering
Calculate the similarity between all possible
pairs of profiles
  • Keys:
  • Similarity
  • Clustering

The two most similar clusters are grouped together to
form a new cluster
Calculate the similarity between the new cluster
and all remaining clusters (a sketch of this loop follows below).
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
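A sketch of the agglomerative procedure described above using SciPy's hierarchical-clustering routines (assumed to be available); the expression-like profiles are synthetic, and average linkage with Euclidean distance is just one possible choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = np.vstack([rng.normal(0, 1, size=(5, 8)),    # two synthetic groups
                      rng.normal(3, 1, size=(5, 8))])   # of expression profiles

# Repeatedly merge the two most similar clusters (here: average linkage,
# Euclidean distance) until everything is in one cluster.
Z = linkage(profiles, method="average", metric="euclidean")

# Cut the tree at the level that gives k = 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree shown on the earlier slides.
```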
37
Clustering in Bioinformatics
  • Microarray data quality checking:
  • Do replicates cluster together?
  • Do similar conditions, time points, and tissue
    types cluster together?
  • Cluster genes → prediction of the functions of
    unknown genes from known ones
  • Cluster samples → discovery of clinical
    characteristics (e.g. survival, marker status)
    shared by samples
  • Promoter analysis of commonly regulated genes

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
38
Functionally significant gene clusters
Two-way clustering
Sample clusters
Gene clusters
39
Bhattacharjee et al. (2001) Human lung carcinomas
mRNA expression profiling reveals distinct
adenocarcinoma subclasses. Proc. Natl. Acad. Sci.
USA, Vol. 98, 13790-13795.
40
Similarity Measurements
  • Pearson Correlation

Two profiles (vectors)
-1 ≤ Pearson Correlation ≤ 1
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
41
Similarity Measurements
  • Pearson Correlation: trend similarity

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
42
Similarity Measurements
  • Euclidean Distance

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
43
Similarity Measurements
  • Euclidean Distance: absolute difference

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
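A small sketch of the two similarity measures above on a pair of profiles: Pearson correlation rewards profiles with the same trend, while Euclidean distance rewards profiles with small absolute differences. The profiles and function names are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two profiles; ranges from -1 to 1."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return float(np.sqrt(((x - y) ** 2).sum()))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a + 10.0            # same trend, very different level
print(pearson(a, b))    # 1.0  -> highly similar by correlation
print(euclidean(a, b))  # 20.0 -> far apart by Euclidean distance
```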
44
Clustering
[Diagram: three clusters C1, C2 and C3; which pair of clusters should be merged?]
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
45
Clustering
Single Linkage
Dissimilarity between two clusters = minimum
dissimilarity between the members of the two clusters.

Tends to generate long chains.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
46
Clustering
Complete Linkage
Dissimilarity between two clusters = maximum
dissimilarity between the members of the two clusters.

Tends to generate clumps.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
47
Clustering
Average Linkage
Dissimilarity between two clusters = average of the
distances of all pairs of objects (one from each
cluster).
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
48
Clustering
Average Group Linkage
Dissimilarity between two clusters = distance
between the two cluster means.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
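A compact sketch of the four between-cluster dissimilarities from the last slides, written directly from their definitions; clusters are simply arrays of points here, and the example coordinates are invented.

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_linkage(c1, c2):         # minimum pairwise dissimilarity
    return cdist(c1, c2).min()

def complete_linkage(c1, c2):       # maximum pairwise dissimilarity
    return cdist(c1, c2).max()

def average_linkage(c1, c2):        # mean over all pairs (one from each cluster)
    return cdist(c1, c2).mean()

def average_group_linkage(c1, c2):  # distance between the two cluster means
    return np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_linkage(C1, C2), complete_linkage(C1, C2),
      average_linkage(C1, C2), average_group_linkage(C1, C2))
```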
49
Considerations
  • What genes are used to cluster samples?
  • Expression variation
  • Inherent variation
  • Prior knowledge (irrelevant genes)
  • Etc.

From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
50
K-means Clustering
  • Initialize the K cluster representatives w,
    e.g. to randomly chosen examples.
  • Assign each input example x to the cluster c(x)
    with the nearest corresponding weight vector.
  • Update the weights (e.g. set each representative to the
    mean of the examples currently assigned to its cluster).
  • Increment n by 1 and repeat until no noticeable
    changes of the cluster representatives occur.

Unsupervised learning
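A minimal k-means sketch following the steps above: random examples as the initial representatives, nearest-representative assignment, mean update, and repetition until the representatives stop moving. The data and the function name are synthetic and illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=k, replace=False)]        # init: randomly chosen examples
    for _ in range(n_iter):
        # assign each example to the cluster with the nearest representative
        labels = np.argmin(np.linalg.norm(X[:, None] - W[None, :], axis=2), axis=1)
        # update each representative to the mean of its assigned examples
        new_W = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else W[j]
                          for j in range(k)])
        if np.allclose(new_W, W):          # no noticeable change -> stop
            break
        W = new_W
    return labels, W

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```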
51
Example I
Initial Data and Seeds
Final Clustering
Unsupervised learning
52
Example II
Initial Data and Seeds
Final Clustering
Unsupervised learning
53
SOM: the Brain's self-organization
The brain maps the external multidimensional
representation of the world into a similar 1- or
2-dimensional internal representation. That
is, the brain processes the external signals in a
topology-preserving way. Mimicking the way the
brain learns, our clustering system should be
able to do the same thing.
Unsupervised learning
54
Self-Organizing Map: idea
Data vectors X^T = (X1, ..., Xd) from
d-dimensional space. Grid of nodes, with a local
processor (called a neuron) in each node. Local
processor j has d adaptive parameters W(j).
Goal: change the W(j) parameters to recover the data
clusters in X space.
Unsupervised learning
55
Training process
Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Unsupervised learning
56
Concept of the SOM
[Diagram: data points in the input space are represented by cluster centers
(code vectors); these code vectors are placed in a reduced feature space and
are clustered and ordered on a two-dimensional grid.]
Unsupervised learning
57
Concept of the SOM
[Diagram: the same trained SOM grid can be used for visualization, for
classification, and for clustering.]
Unsupervised learning
58
SOM learning algorithm
  • Initialization: n = 0. Choose small random values
    for the weight-vector components.
  • Sampling: select an x from the input examples.
  • Similarity matching: find the winning neuron i(x)
    at iteration n (the neuron whose weight vector is
    closest to x).
  • Updating: adjust the weight vectors of all
    neurons using the update rule (each weight vector is
    moved towards x, scaled by the learning rate and the
    neighborhood function).
  • Continuation: n = n + 1. Go to the Sampling step
    until no noticeable changes in the weights are
    observed.

Unsupervised learning
59
Neighborhood Function
  • Gaussian neighborhood function
  • d_ji = lateral distance of neurons i and j:
  • in a 1-dimensional lattice, |j - i|
  • in a 2-dimensional lattice, ||r_j - r_i||, where
    r_j is the position of neuron j in the lattice.

Unsupervised learning
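A compact sketch of the SOM learning loop and Gaussian neighbourhood just described, for a small 2-D lattice. The decaying learning-rate and neighbourhood-width schedules, the grid size, and the data are simple choices of ours, not taken from the slides.

```python
import numpy as np

def train_som(X, grid=(5, 5), n_iter=2000, eta0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(grid[0] * grid[1], d))   # small random weights
    # lattice position r_j of each neuron j
    pos = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    for n in range(n_iter):
        eta = eta0 * np.exp(-n / n_iter)        # decaying learning rate (assumed schedule)
        sigma = sigma0 * np.exp(-n / n_iter)    # decaying neighbourhood width
        x = X[rng.integers(len(X))]                        # sampling
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # similarity matching i(x)
        d2 = np.sum((pos - pos[winner]) ** 2, axis=1)      # lattice distances ||r_j - r_i||^2
        h = np.exp(-d2 / (2 * sigma ** 2))                 # Gaussian neighbourhood
        W += eta * h[:, None] * (x - W)                    # updating step
    return W

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
codebook = train_som(X)
print(codebook[:3])
```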
60
Initial h function (Example )
Unsupervised learning
61
Some examples of real-life applications
  • Helsinki University of Technology web site
  • http://www.cis.hut.fi/research/refs/
  • Contains > 5000 papers on SOM and its
    applications
  • Brain research: modeling of the formation of various
    topographical maps in motor, auditory, visual and
    somatotopic areas.
  • Clustering of genes, protein properties,
    chemical compounds, speech phonemes, sounds of
    birds and insects, astronomical objects,
    economic data, business and financial data,
    ...
  • Data compression (images and audio), information
    filtering.
  • Medical and technical diagnostics.

Unsupervised learning
62
Issues in Clustering
  • How many clusters?
  • User parameter
  • Use model selection criteria (e.g. the Bayesian
    Information Criterion) with a penalization term
    that accounts for model complexity. See e.g.
    X-means: http://www2.cs.cmu.edu/dpelleg/kmeans.html
    (a simple sketch of choosing k follows below)
  • What similarity measure?
  • Euclidean distance
  • Correlation coefficient
  • Ad-hoc similarity measures

Unsupervised learning
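The slide mentions BIC-based selection (X-means). As a simpler stand-in, here is a sketch that sweeps k and scores each clustering with the silhouette coefficient from scikit-learn; the silhouette criterion and the synthetic data are our substitutions, not the method named above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 4, 8)])  # 3 synthetic clusters

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)      # internal validity of the k-clustering

best_k = max(scores, key=scores.get)
print(scores, "-> chosen k =", best_k)
```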
63
Validation of clustering results
  • External measures
  • According to some external knowledge
  • Consideration of bias and subjectivity
  • Internal measures
  • Quality of clusters according to the data
  • Compactness and separation
  • Stability
  • See e.g. J. Handl, J. Knowles, D.B. Kell,
    Computational cluster validation in postgenomic
    data analysis, Bioinformatics, 21(15):3201-3212,
    2005

Unsupervised learning
64
Molecular Classification of Cancer: Class
Discovery and Class Prediction by Gene Expression
Monitoring
Bioinformatics Application
  • T.R. Golub et al., Science 286, 531 (1999)

Unsupervised learning
65
Identification of cancer types
  • Why is identification of the cancer class (tumor
    sub-type) important?
  • Cancers of identical grade can have widely
    variable clinical courses (e.g. acute
    lymphoblastic leukemia vs. acute myeloid
    leukemia).
  • Traditional methods:
  • Morphological appearance.
  • Enzyme-based histochemical analyses.
  • Immunophenotyping.
  • Cytogenetic analysis.

Golub et al. 1999
Unsupervised learning
66
Class Prediction
  • How could one use an initial collection of
    samples belonging to known classes to create a
    class predictor?
  • Identification of informative genes
  • Weighted vote (a sketch follows below)

Golub et al slides
Unsupervised learning
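A sketch of a Golub-style weighted-voting predictor, written from the usual description of the method (signal-to-noise gene selection, per-gene votes, and a prediction strength); the details and variable names here are our reading for illustration, not a verbatim reproduction of the paper, and the data are synthetic rather than the leukemia dataset.

```python
import numpy as np

def train_weighted_vote(X, y, n_genes=50):
    """Select informative genes by a signal-to-noise statistic and store vote parameters."""
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    sd0, sd1 = X[y == 0].std(0) + 1e-9, X[y == 1].std(0) + 1e-9
    snr = (mu0 - mu1) / (sd0 + sd1)                 # correlation with the class distinction
    genes = np.argsort(np.abs(snr))[-n_genes:]      # most informative genes
    return genes, snr[genes], (mu0[genes] + mu1[genes]) / 2

def predict_weighted_vote(x, genes, weights, midpoints):
    votes = weights * (x[genes] - midpoints)        # each gene votes for class 0 (+) or 1 (-)
    v_win, v_lose = votes[votes > 0].sum(), -votes[votes < 0].sum()
    label = 0 if v_win >= v_lose else 1
    ps = (max(v_win, v_lose) - min(v_win, v_lose)) / (v_win + v_lose + 1e-9)
    return label, ps                                # ps = prediction strength

# Toy usage with synthetic "expression" data (class 0 ~ ALL, class 1 ~ AML)
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 1000))
y = np.array([0] * 27 + [1] * 11)
X[y == 0, :20] += 1.5                               # make 20 genes informative
model = train_weighted_vote(X, y)
print(predict_weighted_vote(X[0], *model))
```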
67
Data
  • Initial sample: 38 bone marrow samples (27 ALL,
    11 AML) obtained at the time of diagnosis.
  • Independent sample: 34 leukemia samples, consisting
    of 24 bone marrow and 10 peripheral blood samples
    (20 ALL and 14 AML).

Golub et al slides
Unsupervised learning
68
Validation of Gene Voting
  • Initial samples: 36 of the 38 samples were predicted
    as either AML or ALL, and two as uncertain. All 36
    predictions agree with the clinical diagnosis.
  • Independent samples: 29 of the 34 samples are
    strongly predicted, with 100% accuracy.

Golub et al slides
Unsupervised learning
69
Class Discovery
  • Can cancer classes be discovered automatically
    based on gene expression?
  • Cluster tumors by gene expression
  • Determine whether the putative classes produced
    are meaningful.

Golub et al slides
Unsupervised learning
70
Cluster tumors
  • Self-Organizing Map (SOM)
  • Mathematical cluster analysis for recognizing and
    classifying features in complex, multidimensional
    data (similar to the K-means approach)
  • Chooses a geometry of nodes
  • Nodes are mapped into K-dimensional space,
    initially at random.
  • Iteratively adjusts the nodes.

Golub et al slides
Unsupervised learning
71
Validation of SOM
  • Prediction based on clusters A1 and A2:
  • 24/25 of the ALL samples from the initial dataset
    were clustered in group A1
  • 10/13 of the AML samples from the initial dataset
    were clustered in group A2

Golub et al slides
Unsupervised learning
72
Validation of SOM
  • How could one evaluate the putative clusters if
    the right answer were not known?
  • Assumption: class discovery can be tested by
    class prediction.
  • Testing the assumption:
  • Construct predictors based on clusters A1 and A2.
  • Construct predictors based on random clusters.

Golub et al slides
Unsupervised learning
73
Validation of SOM
  • Predictions using the predictors based on clusters A1
    and A2 yield 34 accurate predictions, one error,
    and three uncertain calls.

Golub et al slides
Unsupervised learning
74
Validation of SOM
Golub et al slides
Unsupervised learning
75
CONCLUSION
  • In Machine Learning, every technique has its
    assumptions and constraints, advantages and
    limitations
  • My view:
  • First perform simple data analysis before
    applying fancy high-tech ML methods
  • Possibly use different ML techniques and then
    ensemble the results
  • Apply the correct cross-validation method!
  • Check the significance of the results (permutation
    test, stability of selected genes)
  • Work in collaboration with the data producer
    (biologist, pathologist) whenever possible!

ML in bioinformatics