1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

3
Module 2: Clustering, Classification and Feature Selection
Sohrab Shah, Centre for Translational and Applied Genomics, Molecular Oncology, Breast Cancer Research Program, BC Cancer Agency. sshah@bccrc.ca
4
Module Overview
  • Introduction to clustering
  • distance metrics
  • hierarchical, partitioning and model based
    clustering
  • Introduction to classification
  • building a classifier
  • avoiding overfitting
  • cross validation
  • Feature Selection in clustering and
    classification

5
Introduction to clustering
  • What is clustering?
  • unsupervised learning
  • discovery of patterns in data
  • class discovery
  • Grouping together objects that are most similar
    (or least dissimilar)
  • objects may be genes, or samples, or both
  • Example question: Are there samples in my cohort that can be subgrouped based on molecular profiling?
  • Do these groups correlate with clinical outcome?

6
Distance metrics
  • In order to perform clustering, we need to have a
    way to measure how similar (or dissimilar) two
    objects are
  • Euclidean distance
  • Manhattan distance
  • 1 - correlation
  • proportional to Euclidean distance, but invariant to the range of measurement from one sample to the next (see the sketch below)

(Figure: example objects ranging from dissimilar to similar)
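All three metrics are available in base R. A minimal sketch, assuming a small toy matrix x with genes in rows and samples in columns (not the workshop data):

  ## Toy expression matrix: 4 genes (rows) x 3 samples (columns)
  x <- matrix(c(1, 2, 3, 4,
                2, 4, 6, 8,
                4, 3, 2, 1),
              nrow = 4, dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))

  ## dist() measures distances between rows, so transpose to compare samples
  d.euclidean <- dist(t(x), method = "euclidean")
  d.manhattan <- dist(t(x), method = "manhattan")

  ## 1 - Pearson correlation as a dissimilarity between samples
  d.correlation <- as.dist(1 - cor(x, method = "pearson"))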
7
Distance metrics compared
(Figure: the same data clustered using Euclidean, Manhattan, and 1 - Pearson distances)
Conclusion: distance matters!
8
Other distance metrics
  • Hamming distance: for ordinal, binary or categorical data, the number of positions at which two objects differ (see the sketch below)
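A minimal sketch of the idea; the helper function here is illustrative, not from the module:

  ## Hamming distance between two binary/categorical vectors:
  ## the number of positions at which they differ
  hamming <- function(a, b) sum(a != b)

  g1 <- c(0, 1, 1, 0, 1)
  g2 <- c(1, 1, 0, 0, 1)
  hamming(g1, g2)   # 2: the vectors differ at positions 1 and 3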

9
Approaches to clustering
  • Partitioning methods
  • K-means
  • K-medoids (partitioning around medoids)
  • Model based approaches
  • Hierarchical methods
  • nested clusters
  • start with pairs
  • build a tree up to the root

10
Partitioning methods
  • Anatomy of a partitioning based method
  • data matrix
  • distance function
  • number of groups
  • Output
  • group assignment of every object

11
Partitioning based methods
  • Choose K groups
  • initialise the group centers
  • a.k.a. centroids or medoids
  • assign each object to the nearest centroid according to the distance metric
  • reassign (or recompute) the centroids
  • repeat the last two steps until the assignments stabilize (see the kmeans() sketch below)
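A minimal kmeans() sketch of these steps, reusing the toy matrix x from the distance-metric example (samples in columns):

  ## K-means on the samples (columns) of x, with K = 2 groups
  set.seed(1)                                   # results depend on the random initialisation
  km <- kmeans(t(x), centers = 2, nstart = 10)  # nstart: multiple random restarts
  km$cluster                                    # group assignment of every object
  km$centers                                    # the final centroids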

12
K-medoids in action
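A corresponding K-medoids sketch with pam() from the cluster package, again using the toy matrix x:

  library(cluster)

  ## pam() accepts a distance matrix directly
  pm <- pam(dist(t(x), method = "euclidean"), k = 2)
  pm$clustering    # group assignment of every object
  pm$medoids       # the objects chosen as medoids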
13
K-means vs K-medoids
  • K-means: centroids are the means of the clusters; they must be recomputed at every iteration; initialisation is difficult, as the notion of a centroid may be unclear before beginning (R function: kmeans)
  • K-medoids: centroids are actual objects that minimize the total within-cluster distance; a centroid can be determined by a quick lookup into the distance matrix; initialisation is simply K randomly selected objects (R function: pam)
14
Partitioning based methods
  • Advantages: the number of groups is well defined; every object gets a clear, deterministic assignment to a group; simple algorithms for inference
  • Disadvantages: the number of groups has to be chosen; sometimes objects do not fit well into any cluster; the algorithms can converge on locally optimal solutions and often require multiple restarts with random initialisations
15
Agglomerative hierarchical clustering
16
Hierarchical clustering
  • Anatomy of hierarchical clustering
  • distance matrix
  • linkage method
  • Output
  • dendrogram
  • a tree that defines the relationships between
    objects and the distance between clusters
  • a nested sequence of clusters

17
Linkage methods
  • single
  • complete
  • average
  • distance between centroids
18
Linkage methods
  • Ward (1963)
  • form partitions that minimize the loss associated with each grouping
  • loss is defined as the error sum of squares (ESS)
  • consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0)

Treated as a single group:
ESS_one group = (2 - 2.5)^2 + (6 - 2.5)^2 + ... + (0 - 2.5)^2 = 50.5
If, on the other hand, the 10 objects are classified according to their scores into the four sets {0, 0, 0}, {2, 2, 2, 2}, {5}, {6, 6}, the ESS can be evaluated as the sum of four separate error sums of squares:
ESS = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
Thus, clustering the 10 scores into these 4 clusters results in no loss of information.
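A short R check of the arithmetic above:

  ## Error sum of squares of a vector around its own mean
  ess <- function(v) sum((v - mean(v))^2)

  scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
  ess(scores)                                    # 50.5 when treated as one group

  groups <- list(c(0, 0, 0), c(2, 2, 2, 2), 5, c(6, 6))
  sum(sapply(groups, ess))                       # 0: each group is homogeneous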
19
Linkage methods in action
  • clustering based on single linkage
  • single <- hclust(dist(t(exprMatSub), method="euclidean"), method="single")
  • plot(single)

20
Linkage methods in action
  • clustering based on complete linkage
  • complete <- hclust(dist(t(exprMatSub), method="euclidean"), method="complete")
  • plot(complete)

21
Linkage methods in action
  • clustering based on centroid linkage
  • centroid <- hclust(dist(t(exprMatSub), method="euclidean"), method="centroid")
  • plot(centroid)

22
Linkage methods in action
  • clustering based on average linkage
  • average <- hclust(dist(t(exprMatSub), method="euclidean"), method="average")
  • plot(average)

23
Linkage methods in action
  • clustering based on Ward linkage
  • ward <- hclust(dist(t(exprMatSub), method="euclidean"), method="ward")
  • plot(ward)

24
Linkage methods in action
Conclusion: linkage matters!
25
Hierarchical clustering analyzed
  • Advantages: small clusters can be nested inside larger ones; there is no need to specify the number of groups ahead of time; flexible choice of linkage methods
  • Disadvantages: clusters might not be naturally represented by a hierarchical structure; it is necessary to cut the dendrogram to produce clusters (see the cutree() sketch below); bottom-up clustering can result in poor structure at the top of the tree, and early joins cannot be undone
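Cutting the dendrogram into a fixed number of groups is a one-liner; a minimal sketch, assuming the complete-linkage hclust object complete from the earlier slide:

  ## Cut the tree into K = 3 groups (alternatively cut at a height with h = ...)
  groups <- cutree(complete, k = 3)
  table(groups)    # number of objects in each cluster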
26
Model based approaches
  • Assume the data are generated from a mixture of
    K distributions
  • What cluster assignment and parameters of the K
    distributions best explain the data?
  • Fit a model to the data
  • Try to get the best fit
  • Classical example: a mixture of Gaussians (mixture of normals); see the sketch below
  • Take advantage of probability theory and
    well-defined distributions in statistics
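A minimal sketch of fitting a mixture of Gaussians with the mclust package (not part of the module labs; toy one-dimensional data, shown only to illustrate the idea):

  library(mclust)

  ## Toy data drawn from two normal components
  set.seed(1)
  y <- c(rnorm(50, mean = 0), rnorm(50, mean = 4))

  fit <- Mclust(y, G = 1:4)        # BIC is used to choose the number of components
  fit$G                            # selected number of groups
  head(fit$classification)         # cluster assignment of each observation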

27
Model based clustering: array CGH
28
Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs (HMM-Mix).
(Figure: raw data, CNA calls, and the distribution of calls in a group)
Shah et al (Bioinformatics, 2009)
29
Advantages of model based approaches
  • In addition to clustering patients into groups,
    we output a model that best represents the
    patients in a group
  • We can then associate each model with clinical
    variables and simply output a classifier to be
    used on new patients
  • Choosing the number of groups becomes a model selection problem (e.g. using the Bayesian Information Criterion)
  • see Yeung et al Bioinformatics (2001)

30
Clustering 106 follicular lymphoma patients with
HMM-Mix
(Figure panels: initialisation, converged, and clinical annotations)
  • Recapitulates known FL subgroups
  • Subgroups have clinical relevance

31
Feature selection
  • Most features (genes, SNP probesets, BAC clones)
    in high dimensional datasets will be
    uninformative
  • examples: unexpressed genes, housekeeping genes, passenger alterations
  • Clustering (and classification) has a much higher
    chance of success if uninformative features are
    removed
  • Simple approaches
  • select intrinsically variable genes (see the sketch below)
  • require a minimum level of expression in a proportion of samples
  • genefilter package (Bioconductor), Lab 1
  • We will return to feature selection in the context of classification
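A simple variance-based filter, sketched on a made-up matrix exprMat (the genefilter package provides more refined filters; see Lab 1):

  ## Keep the most variable genes (rows) of an expression matrix
  set.seed(1)
  exprMat <- matrix(rnorm(200 * 10), nrow = 200,
                    dimnames = list(paste0("g", 1:200), paste0("s", 1:10)))

  gene.var   <- apply(exprMat, 1, var)
  keep       <- rank(-gene.var) <= 50        # top 50 most variable genes
  exprMatSub <- exprMat[keep, ]
  dim(exprMatSub)                            # 50 genes x 10 samples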

32
Advanced topics in clustering
  • Top down clustering
  • Bi-clustering or two-way clustering
  • Principal components analysis
  • Choosing the number of groups
  • model selection
  • AIC, BIC
  • Silhouette coefficient
  • The Gap curve
  • Joint clustering and feature selection
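A minimal sketch of one of these options, the average silhouette width, using pam() on the toy exprMatSub from the feature-selection sketch above:

  library(cluster)

  d       <- dist(t(exprMatSub), method = "euclidean")
  avg.sil <- sapply(2:5, function(k) pam(d, k = k)$silinfo$avg.width)
  (2:5)[which.max(avg.sil)]    # the K with the widest average silhouette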

33
What Have We Learned?
  • There are three main types of clustering
    approaches
  • hierarchical
  • partitioning
  • model based
  • Feature selection is important
  • reduces computational time
  • more likely to identify well-separated groups
  • The distance metric matters
  • The linkage method matters in hierarchical
    clustering
  • Model based approaches offer principled
    probabilistic methods

34
Module Overview
  • Clustering
  • Classification
  • Feature Selection

35
Classification
  • What is classification?
  • Supervised learning
  • discriminant analysis
  • Work from a set of objects with predefined classes
  • e.g. basal vs luminal, or good responder vs poor responder
  • Task: learn from the features of the objects; what is the basis for discrimination?
  • Statistically and mathematically heavy

36
Classification
(Figure: samples labelled as poor or good responders)
37
Example DLBCL subtypes
Wright et al, PNAS (2003)
38
DLBCL subtypes
Wright et al, PNAS (2003)
39
Classification approaches
  • Wright et al PNAS (2003)
  • Weighted features combined in a linear predictor score: LPS(X) = Σj aj Xj (sum over genes j)
  • aj: weight of gene j, determined by the t-test statistic
  • Xj: expression value of gene j
  • Assume there are 2 distinct distributions of LPS: one for ABC, one for GCB

40
Wright et al, DLBCL, contd
  • Use Bayes' rule to determine the probability that a sample comes from group 1
  • using a probability density function that represents each group (see the sketch below)
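A sketch of that computation in R, with normal densities fitted to each group's LPS values; the function name and parameter values are illustrative, not from the paper:

  ## Posterior probability of group 1 for a new sample's LPS value,
  ## given normal fits (mean, sd) to the LPS values of the two groups
  p.group1 <- function(lps, mu1, sd1, mu2, sd2) {
    d1 <- dnorm(lps, mean = mu1, sd = sd1)
    d2 <- dnorm(lps, mean = mu2, sd = sd2)
    d1 / (d1 + d2)
  }

  p.group1(lps = 1.2, mu1 = 1.0, sd1 = 0.5, mu2 = -1.0, sd2 = 0.5)   # close to 1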

41
Learning the classifier, Wright et al
  • Choosing the genes (feature selection)
  • use cross validation
  • Leave-one-out cross validation (see the sketch below)
  • Pick a set of samples
  • Use all but one of the samples as training,
    leaving one out for test
  • Fit the model using the training data
  • Can the classifier correctly pick the class of
    the remaining case?
  • Repeat exhaustively for leaving out each sample
    in turn
  • Repeat using different sets and numbers of genes
    based on t-statistic
  • Pick the set of genes that give the highest
    accuracy
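A generic leave-one-out loop, sketched with a simple nearest-centroid classifier rather than the Wright et al LPS model; the toy matrix and class labels are made up for illustration:

  ## Toy data: 100 genes x 20 samples, 10 per class
  set.seed(1)
  expr    <- matrix(rnorm(100 * 20), nrow = 100)
  classes <- factor(rep(c("ABC", "GCB"), each = 10))

  predicted <- sapply(seq_along(classes), function(i) {
    train <- expr[, -i]                  # leave sample i out
    test  <- expr[, i]
    ## nearest-centroid prediction using Euclidean distance
    cents <- sapply(levels(classes), function(cl)
      rowMeans(train[, classes[-i] == cl, drop = FALSE]))
    levels(classes)[which.min(colSums((cents - test)^2))]
  })

  mean(predicted == classes)             # LOOCV accuracy estimate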

42
Overfitting
  • In many cases in biology, the number of features
    is much larger than the number of samples
  • Important features may not be represented in the
    training data
  • This can result in overfitting
  • when a classifier discriminates well on its
    training data, but does not generalise to
    orthogonally derived data sets
  • Validation is required in at least one external
    cohort to believe the results
  • example: the expression subtypes of breast cancer have been repeatedly validated in numerous data sets

43
Overfitting
  • To reduce the problem of overfitting, one can use
    Bayesian priors to regularize the parameter
    estimates of the model
  • Some methods now integrate feature selection and
    classification in a unified analytical framework
  • see Law et al IEEE (2005), Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
  • Cross validation should always be used in
    training a classifier

44
Evaluating a classifier
  • The receiver operating characteristic (ROC) curve
  • plots the true positive rate (TPR) vs the false positive rate (FPR)
  • Given ground truth and a probabilistic classifier (see the sketch below):
  • for a series of probability thresholds
  • compute the TPR: the proportion of actual positives that are predicted positive
  • compute the FPR: the proportion of actual negatives that are predicted positive
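A minimal sketch of tracing the curve by hand, on made-up labels and probabilities (in practice a package such as ROCR or pROC would be used):

  ## Toy ground truth (1 = positive) and classifier probabilities
  set.seed(1)
  truth <- rep(c(1, 0), each = 50)
  prob  <- c(rnorm(50, mean = 0.7, sd = 0.15), rnorm(50, mean = 0.3, sd = 0.15))

  ## TPR and FPR over a grid of probability thresholds
  roc.points <- t(sapply(seq(0, 1, by = 0.05), function(th) {
    pred <- as.numeric(prob >= th)
    c(FPR = sum(pred == 1 & truth == 0) / sum(truth == 0),
      TPR = sum(pred == 1 & truth == 1) / sum(truth == 1))
  }))

  plot(roc.points, type = "l", xlab = "False positive rate",
       ylab = "True positive rate")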

45
Other methods for classification
  • Support vector machines
  • Linear discriminant analysis
  • Logistic regression
  • Random forests
  • See
  • Ma and Huang Briefings in Bioinformatics (2008)
  • Saeys et al Bioinformatics (2007)

46
Questions?
47
Lab Clustering and feature selection
  • Get familiar with clustering tools and plotting
  • Feature selection methods
  • Distance matrices
  • Linkage methods
  • Partition methods
  • Try to reproduce some of the figures from Chin et
    al using the freely available data

48
Module 2 Lab
  • Coffee break
  • Back at 1500