1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

3
Module 2: Clustering, Classification and Feature Selection
Sohrab Shah, Centre for Translational and Applied Genomics, Molecular Oncology, Breast Cancer Research Program, BC Cancer Agency. sshah@bccrc.ca
4
Module Overview
  • Introduction to clustering
  • distance metrics
  • hierarchical, partitioning and model based
    clustering
  • Introduction to classification
  • building a classifier
  • avoiding overfitting
  • cross validation
  • Feature Selection in clustering and
    classification

5
Introduction to clustering
  • What is clustering?
  • unsupervised learning
  • discovery of patterns in data
  • class discovery
  • Grouping together objects that are most similar
    (or least dissimilar)
  • objects may be genes, or samples, or both
  • Example question: Are there samples in my cohort that can be subgrouped based on molecular profiling?
  • Do these groups correlate with clinical outcome?

6
Distance metrics
  • In order to perform clustering, we need to have a
    way to measure how similar (or dissimilar) two
    objects are
  • Euclidean distance
  • Manhattan distance
  • 1 - correlation
  • proportional to Euclidean distance, but invariant to the range of measurement from one sample to the next (see the sketch below)

(Figure: example objects ranging from dissimilar to similar)
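All three metrics are available in base R. A minimal sketch, assuming a small toy matrix x with genes in rows and samples in columns (not the workshop data):

  ## Toy expression matrix: 4 genes (rows) x 3 samples (columns)
  x <- matrix(c(1, 2, 3, 4,
                2, 4, 6, 8,
                4, 3, 2, 1),
              nrow = 4, dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))

  ## dist() measures distances between rows, so transpose to compare samples
  d.euclidean <- dist(t(x), method = "euclidean")
  d.manhattan <- dist(t(x), method = "manhattan")

  ## 1 - Pearson correlation as a dissimilarity between samples
  d.correlation <- as.dist(1 - cor(x, method = "pearson"))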
7
Distance metrics compared
(Figure: the same data clustered using Euclidean, Manhattan, and 1 - Pearson distances)
Conclusion: distance matters!
8
Other distance metrics
  • Hamming distance: for ordinal, binary or categorical data, the number of positions at which two objects differ (see the sketch below)
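A minimal sketch of the idea; the helper function here is illustrative, not from the module:

  ## Hamming distance between two binary/categorical vectors:
  ## the number of positions at which they differ
  hamming <- function(a, b) sum(a != b)

  g1 <- c(0, 1, 1, 0, 1)
  g2 <- c(1, 1, 0, 0, 1)
  hamming(g1, g2)   # 2: the vectors differ at positions 1 and 3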

9
Approaches to clustering
  • Partitioning methods
  • K-means
  • K-medoids (partitioning around medoids)
  • Model based approaches
  • Hierarchical methods
  • nested clusters
  • start with pairs
  • build a tree up to the root

10
Partitioning methods
  • Anatomy of a partitioning based method
  • data matrix
  • distance function
  • number of groups
  • Output
  • group assignment of every object

11
Partitioning based methods
  • Choose K groups
  • initialise the group centers
  • a.k.a. centroids or medoids
  • assign each object to the nearest centroid according to the distance metric
  • reassign (or recompute) the centroids
  • repeat the last two steps until the assignments stabilize (see the kmeans() sketch below)
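A minimal kmeans() sketch of these steps, reusing the toy matrix x from the distance-metric example (samples in columns):

  ## K-means on the samples (columns) of x, with K = 2 groups
  set.seed(1)                                   # results depend on the random initialisation
  km <- kmeans(t(x), centers = 2, nstart = 10)  # nstart: multiple random restarts
  km$cluster                                    # group assignment of every object
  km$centers                                    # the final centroids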

12
K-medoids in action
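A corresponding K-medoids sketch with pam() from the cluster package, again using the toy matrix x:

  library(cluster)

  ## pam() accepts a distance matrix directly
  pm <- pam(dist(t(x), method = "euclidean"), k = 2)
  pm$clustering    # group assignment of every object
  pm$medoids       # the objects chosen as medoids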
13
K-means vs K-medoids
  • K-means: centroids are the means of the clusters; they must be recomputed at every iteration; initialisation is difficult, as the notion of a centroid may be unclear before beginning (R function: kmeans)
  • K-medoids: centroids are actual objects that minimize the total within-cluster distance; a centroid can be determined by a quick lookup into the distance matrix; initialisation is simply K randomly selected objects (R function: pam)
14
Partitioning based methods
  • Advantages: the number of groups is well defined; every object gets a clear, deterministic assignment to a group; simple algorithms for inference
  • Disadvantages: the number of groups has to be chosen; sometimes objects do not fit well into any cluster; the algorithms can converge on locally optimal solutions and often require multiple restarts with random initialisations
15
Agglomerative hierarchical clustering
16
Hierarchical clustering
  • Anatomy of hierarchical clustering
  • distance matrix
  • linkage method
  • Output
  • dendrogram
  • a tree that defines the relationships between
    objects and the distance between clusters
  • a nested sequence of clusters

17
Linkage methods
  • single
  • complete
  • average
  • distance between centroids
18
Linkage methods
  • Ward (1963)
  • form partitions that minimize the loss associated with each grouping
  • loss is defined as the error sum of squares (ESS)
  • consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0)

Treated as a single group:
ESS_one group = (2 - 2.5)^2 + (6 - 2.5)^2 + ... + (0 - 2.5)^2 = 50.5
If, on the other hand, the 10 objects are classified according to their scores into the four sets {0, 0, 0}, {2, 2, 2, 2}, {5}, {6, 6}, the ESS can be evaluated as the sum of four separate error sums of squares:
ESS = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
Thus, clustering the 10 scores into these 4 clusters results in no loss of information.
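A short R check of the arithmetic above:

  ## Error sum of squares of a vector around its own mean
  ess <- function(v) sum((v - mean(v))^2)

  scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
  ess(scores)                                    # 50.5 when treated as one group

  groups <- list(c(0, 0, 0), c(2, 2, 2, 2), 5, c(6, 6))
  sum(sapply(groups, ess))                       # 0: each group is homogeneous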
19
Linkage methods in action
  • clustering based on single linkage
  • single <- hclust(dist(t(exprMatSub), method="euclidean"), method="single")
  • plot(single)

20
Linkage methods in action
  • clustering based on complete linkage
  • complete <- hclust(dist(t(exprMatSub), method="euclidean"), method="complete")
  • plot(complete)

21
Linkage methods in action
  • clustering based on centroid linkage
  • centroid <- hclust(dist(t(exprMatSub), method="euclidean"), method="centroid")
  • plot(centroid)

22
Linkage methods in action
  • clustering based on average linkage
  • average <- hclust(dist(t(exprMatSub), method="euclidean"), method="average")
  • plot(average)

23
Linkage methods in action
  • clustering based on Ward linkage
  • ward <- hclust(dist(t(exprMatSub), method="euclidean"), method="ward")
  • plot(ward)

24
Linkage methods in action
Conclusion: linkage matters!
25
Hierarchical clustering analyzed
  • Advantages: small clusters can be nested inside larger ones; there is no need to specify the number of groups ahead of time; flexible choice of linkage methods
  • Disadvantages: clusters might not be naturally represented by a hierarchical structure; it is necessary to cut the dendrogram to produce clusters (see the cutree() sketch below); bottom-up clustering can result in poor structure at the top of the tree, and early joins cannot be undone
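Cutting the dendrogram into a fixed number of groups is a one-liner; a minimal sketch, assuming the complete-linkage hclust object complete from the earlier slide:

  ## Cut the tree into K = 3 groups (alternatively cut at a height with h = ...)
  groups <- cutree(complete, k = 3)
  table(groups)    # number of objects in each cluster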
26
Model based approaches
  • Assume the data are generated from a mixture of
    K distributions
  • What cluster assignment and parameters of the K
    distributions best explain the data?
  • Fit a model to the data
  • Try to get the best fit
  • Classical example: a mixture of Gaussians (mixture of normals); see the sketch below
  • Take advantage of probability theory and
    well-defined distributions in statistics
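A minimal sketch of fitting a mixture of Gaussians with the mclust package (not part of the module labs; toy one-dimensional data, shown only to illustrate the idea):

  library(mclust)

  ## Toy data drawn from two normal components
  set.seed(1)
  y <- c(rnorm(50, mean = 0), rnorm(50, mean = 4))

  fit <- Mclust(y, G = 1:4)        # BIC is used to choose the number of components
  fit$G                            # selected number of groups
  head(fit$classification)         # cluster assignment of each observation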

27
Model based clustering: array CGH
28
Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs (HMM-Mix).
(Figure: raw data, CNA calls, and the distribution of calls in a group)
Shah et al (Bioinformatics, 2009)
29
Advantages of model based approaches
  • In addition to clustering patients into groups,
    we output a model that best represents the
    patients in a group
  • We can then associate each model with clinical
    variables and simply output a classifier to be
    used on new patients
  • Choosing the number of groups becomes a model selection problem (e.g. using the Bayesian Information Criterion)
  • see Yeung et al Bioinformatics (2001)

30
Clustering 106 follicular lymphoma patients with
HMM-Mix
(Figure panels: initialisation, converged, and clinical annotations)
  • Recapitulates known FL subgroups
  • Subgroups have clinical relevance

31
Feature selection
  • Most features (genes, SNP probesets, BAC clones)
    in high dimensional datasets will be
    uninformative
  • examples: unexpressed genes, housekeeping genes, passenger alterations
  • Clustering (and classification) has a much higher
    chance of success if uninformative features are
    removed
  • Simple approaches
  • select intrinsically variable genes (see the sketch below)
  • require a minimum level of expression in a proportion of samples
  • genefilter package (Bioconductor), Lab 1
  • We will return to feature selection in the context of classification
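A simple variance-based filter, sketched on a made-up matrix exprMat (the genefilter package provides more refined filters; see Lab 1):

  ## Keep the most variable genes (rows) of an expression matrix
  set.seed(1)
  exprMat <- matrix(rnorm(200 * 10), nrow = 200,
                    dimnames = list(paste0("g", 1:200), paste0("s", 1:10)))

  gene.var   <- apply(exprMat, 1, var)
  keep       <- rank(-gene.var) <= 50        # top 50 most variable genes
  exprMatSub <- exprMat[keep, ]
  dim(exprMatSub)                            # 50 genes x 10 samples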

32
Advanced topics in clustering
  • Top down clustering
  • Bi-clustering or two-way clustering
  • Principal components analysis
  • Choosing the number of groups
  • model selection
  • AIC, BIC
  • Silhouette coefficient
  • The Gap curve
  • Joint clustering and feature selection
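A minimal sketch of one of these options, the average silhouette width, using pam() on the toy exprMatSub from the feature-selection sketch above:

  library(cluster)

  d       <- dist(t(exprMatSub), method = "euclidean")
  avg.sil <- sapply(2:5, function(k) pam(d, k = k)$silinfo$avg.width)
  (2:5)[which.max(avg.sil)]    # the K with the widest average silhouette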

33
What Have We Learned?
  • There are three main types of clustering
    approaches
  • hierarchical
  • partitioning
  • model based
  • Feature selection is important
  • reduces computational time
  • more likely to identify well-separated groups
  • The distance metric matters
  • The linkage method matters in hierarchical
    clustering
  • Model based approaches offer principled
    probabilistic methods

34
Module Overview
  • Clustering
  • Classification
  • Feature Selection

35
Classification
  • What is classification?
  • Supervised learning
  • discriminant analysis
  • Work from a set of objects with predefined classes
  • e.g. basal vs luminal, or good responder vs poor responder
  • Task: learn from the features of the objects; what is the basis for discrimination?
  • Statistically and mathematically heavy

36
Classification
(Figure: samples labelled as poor or good responders)
37
Example DLBCL subtypes
Wright et al, PNAS (2003)
38
DLBCL subtypes
Wright et al, PNAS (2003)
39
Classification approaches
  • Wright et al PNAS (2003)
  • Weighted features combined in a linear predictor score: LPS(X) = Σj aj Xj (sum over genes j)
  • aj: weight of gene j, determined by the t-test statistic
  • Xj: expression value of gene j
  • Assume there are 2 distinct distributions of LPS: one for ABC, one for GCB

40
Wright et al, DLBCL, contd
  • Use Bayes' rule to determine the probability that a sample comes from group 1
  • using a probability density function that represents each group (see the sketch below)
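A sketch of that computation in R, with normal densities fitted to each group's LPS values; the function name and parameter values are illustrative, not from the paper:

  ## Posterior probability of group 1 for a new sample's LPS value,
  ## given normal fits (mean, sd) to the LPS values of the two groups
  p.group1 <- function(lps, mu1, sd1, mu2, sd2) {
    d1 <- dnorm(lps, mean = mu1, sd = sd1)
    d2 <- dnorm(lps, mean = mu2, sd = sd2)
    d1 / (d1 + d2)
  }

  p.group1(lps = 1.2, mu1 = 1.0, sd1 = 0.5, mu2 = -1.0, sd2 = 0.5)   # close to 1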

41
Learning the classifier, Wright et al
  • Choosing the genes (feature selection)
  • use cross validation
  • Leave-one-out cross validation (see the sketch below)
  • Pick a set of samples
  • Use all but one of the samples as training,
    leaving one out for test
  • Fit the model using the training data
  • Can the classifier correctly pick the class of
    the remaining case?
  • Repeat exhaustively for leaving out each sample
    in turn
  • Repeat using different sets and numbers of genes
    based on t-statistic
  • Pick the set of genes that give the highest
    accuracy
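A generic leave-one-out loop, sketched with a simple nearest-centroid classifier rather than the Wright et al LPS model; the toy matrix and class labels are made up for illustration:

  ## Toy data: 100 genes x 20 samples, 10 per class
  set.seed(1)
  expr    <- matrix(rnorm(100 * 20), nrow = 100)
  classes <- factor(rep(c("ABC", "GCB"), each = 10))

  predicted <- sapply(seq_along(classes), function(i) {
    train <- expr[, -i]                  # leave sample i out
    test  <- expr[, i]
    ## nearest-centroid prediction using Euclidean distance
    cents <- sapply(levels(classes), function(cl)
      rowMeans(train[, classes[-i] == cl, drop = FALSE]))
    levels(classes)[which.min(colSums((cents - test)^2))]
  })

  mean(predicted == classes)             # LOOCV accuracy estimate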

42
Overfitting
  • In many cases in biology, the number of features
    is much larger than the number of samples
  • Important features may not be represented in the
    training data
  • This can result in overfitting
  • when a classifier discriminates well on its
    training data, but does not generalise to
    orthogonally derived data sets
  • Validation is required in at least one external
    cohort to believe the results
  • example: the expression subtypes of breast cancer have been repeatedly validated in numerous data sets

43
Overfitting
  • To reduce the problem of overfitting, one can use
    Bayesian priors to regularize the parameter
    estimates of the model
  • Some methods now integrate feature selection and
    classification in a unified analytical framework
  • see Law et al IEEE (2005), Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
  • Cross validation should always be used in
    training a classifier

44
Evaluating a classifier
  • The receiver operating characteristic (ROC) curve
  • plots the true positive rate (TPR) vs the false positive rate (FPR)
  • Given ground truth and a probabilistic classifier (see the sketch below):
  • for a series of probability thresholds
  • compute the TPR: the proportion of actual positives that are predicted positive
  • compute the FPR: the proportion of actual negatives that are predicted positive
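A minimal sketch of tracing the curve by hand, on made-up labels and probabilities (in practice a package such as ROCR or pROC would be used):

  ## Toy ground truth (1 = positive) and classifier probabilities
  set.seed(1)
  truth <- rep(c(1, 0), each = 50)
  prob  <- c(rnorm(50, mean = 0.7, sd = 0.15), rnorm(50, mean = 0.3, sd = 0.15))

  ## TPR and FPR over a grid of probability thresholds
  roc.points <- t(sapply(seq(0, 1, by = 0.05), function(th) {
    pred <- as.numeric(prob >= th)
    c(FPR = sum(pred == 1 & truth == 0) / sum(truth == 0),
      TPR = sum(pred == 1 & truth == 1) / sum(truth == 1))
  }))

  plot(roc.points, type = "l", xlab = "False positive rate",
       ylab = "True positive rate")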

45
Other methods for classification
  • Support vector machines
  • Linear discriminant analysis
  • Logistic regression
  • Random forests
  • See
  • Ma and Huang Briefings in Bioinformatics (2008)
  • Saeys et al Bioinformatics (2007)

46
Questions?
47
Lab Clustering and feature selection
  • Get familiar with clustering tools and plotting
  • Feature selection methods
  • Distance matrices
  • Linkage methods
  • Partition methods
  • Try to reproduce some of the figures from Chin et
    al using the freely available data

48
Module 2 Lab
  • Coffee break
  • Back at 1500