Title: Canadian Bioinformatics Workshops

1 Canadian Bioinformatics Workshops

3 Module 6: Classification
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe, Toronto, September 8-9, 2011
[Image: The Judgement of Paris, Archaic (560-550 BCE)]
DEPARTMENT OF BIOCHEMISTRY / DEPARTMENT OF MOLECULAR GENETICS
Includes material originally developed by Sohrab Shah
4 Classification
- What is classification?
- Supervised learning
  - discriminant analysis
- Work from a set of objects with predefined classes, e.g. basal vs. luminal, or good responder vs. poor responder
- Task: learn from the features of the objects what the basis for discrimination is
- Statistically and mathematically heavy
5 Classification
[Figure: example samples labelled "poor response" vs. "good response"]
6 Classification
- Input: a set of measures, variables, or features
- Output: a discrete label for what the set of features most closely resembles
- Classification can be probabilistic or deterministic
- How is classification different from clustering?
  - We know the groups or classes a priori
  - Classification is a prediction of what an object is, not what other objects it most closely resembles
  - Clustering is finding patterns in data
  - Classification is using known patterns to predict an object's type
7 Example: DLBCL subtypes
Wright et al., PNAS (2003)

8 DLBCL subtypes
Wright et al., PNAS (2003)
9 Classification approaches
- Wright et al., PNAS (2003)
- Weighted features in a linear predictor score: LPS(X) = Σ_j a_j X_j
- a_j: weight of gene j, determined by its t-test statistic
- X_j: expression value of gene j
- Assume there are 2 distinct distributions of LPS: one for ABC, one for GCB
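The linear predictor score is just a weighted sum over genes. A minimal sketch in Python; the gene names, weights, and expression values below are invented for illustration:

```python
# LPS(X) = sum_j a_j * X_j -- a_j are t-statistic-derived gene weights,
# X_j are expression values. All values here are hypothetical.

weights = {"geneA": 2.1, "geneB": -1.4, "geneC": 0.8}    # a_j (invented)
expression = {"geneA": 5.0, "geneB": 3.2, "geneC": 7.5}  # X_j (invented)

def lps(weights, expression):
    """Linear predictor score: weighted sum of gene-expression values."""
    return sum(a * expression[g] for g, a in weights.items())

score = lps(weights, expression)
```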
10 Wright et al., DLBCL, cont'd
- Use Bayes' rule to determine the probability that a sample comes from group 1, given a probability density function that represents group 1
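Assuming a Gaussian density is fitted to the LPS values of each group (the means and standard deviations below are invented), the Bayes-rule step can be sketched as:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal probability density at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_group1(x, mu1, s1, mu2, s2):
    """P(group 1 | x) = p1(x) / (p1(x) + p2(x)), assuming equal priors."""
    p1 = gaussian_pdf(x, mu1, s1)
    p2 = gaussian_pdf(x, mu2, s2)
    return p1 / (p1 + p2)

# A score midway between the two (invented) group means is ambiguous: 0.5.
p = posterior_group1(0.0, -1.0, 1.0, 1.0, 1.0)
```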
11 Learning the classifier (Wright et al.)
- Choosing the genes (feature selection): use cross-validation
- Leave-one-out cross-validation:
  - Pick a set of samples
  - Use all but one of the samples as training, leaving one out for testing
  - Fit the model using the training data
  - Can the classifier correctly pick the class of the remaining case?
  - Repeat exhaustively, leaving out each sample in turn
- Repeat using different sets and numbers of genes, ranked by t-statistic
- Pick the set of genes that gives the highest accuracy
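The leave-one-out loop can be sketched as follows; the data and the nearest-centroid classifier are toy stand-ins for the real gene-expression model:

```python
from collections import defaultdict

def nearest_centroid_predict(train, x):
    """Predict the label whose class mean is closest to x (1-D toy model)."""
    sums = defaultdict(lambda: [0.0, 0])
    for xi, yi in train:
        sums[yi][0] += xi
        sums[yi][1] += 1
    centroids = {y: s / n for y, (s, n) in sums.items()}
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def loocv_accuracy(data):
    """Hold each sample out in turn, train on the rest, score the held-out case."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]          # all but one sample
        correct += nearest_centroid_predict(train, x) == y
    return correct / len(data)

# Toy "expression scores" for two DLBCL-like classes (values invented).
data = [(1.0, "GCB"), (1.2, "GCB"), (0.9, "GCB"),
        (3.0, "ABC"), (3.1, "ABC"), (2.8, "ABC")]
acc = loocv_accuracy(data)
```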
12 Overfitting
- In many cases in biology, the number of features is much larger than the number of samples
- Important features may not be represented in the training data
- This can result in overfitting:
  - the classifier discriminates well on its training data, but does not generalise to orthogonally derived data sets
- Validation in at least one external cohort is required before the results can be believed
  - example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets
13 Overfitting
- To reduce the problem of overfitting, one can use Bayesian priors to regularize the parameter estimates of the model
- Some methods now integrate feature selection and classification in a unified analytical framework
  - see Law et al., IEEE (2005), Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/amink/software/smlr/
- Cross-validation should always be used in training a classifier
14 Evaluating a classifier
- The receiver operating characteristic (ROC) curve plots the true positive rate vs. the false positive rate
- Given ground truth and a probabilistic classifier:
  - for some number of probability thresholds:
    - compute the TPR: the proportion of true positives that were predicted as hits
    - compute the FPR: the proportion of true negatives that were incorrectly predicted as hits
15 Evaluating a classifier
- Important terms:
  - Prediction: classifier says the object is a hit
  - Rejection: classifier says the object is a miss
  - True Positive (TP): prediction that is a true hit
  - True Negative (TN): rejection that is a true miss
  - False Positive (FP): prediction that is a true miss
  - False Negative (FN): rejection that is a true hit
- False Positive Rate: FPR = FP / (FP + TN)
  - specificity = 1 - FPR
- True Positive Rate: TPR = TP / (TP + FN)
  - sensitivity = TPR
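These definitions translate directly into code; the truth and prediction vectors below are invented for illustration:

```python
def rates(truth, pred):
    """TPR (sensitivity) and FPR (1 - specificity) from boolean vectors."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    return tp / (tp + fn), fp / (fp + tn)

truth = [True, True, True, False, False]   # ground truth (invented)
pred  = [True, False, True, True, False]   # classifier output (invented)
tpr, fpr = rates(truth, pred)              # TPR = 2/3, FPR = 1/2
```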
16 Evaluating a classifier
- Use the area under the ROC curve (AUC) as a single measure
  - encodes the trade-off between FPR and TPR
  - only possible for probabilistic outputs
  - requires ground truth and probabilities as inputs
  - at a fixed number of ordered probability thresholds, calculate FPR and TPR and plot
  - for deterministic methods, it is possible to calculate a single point in (FPR, TPR) space
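A minimal sketch of the thresholding procedure: sweep thresholds over the predicted probabilities, collect (FPR, TPR) points, and integrate with the trapezoid rule to get the AUC. The scores below are invented:

```python
def roc_points(truth, probs, thresholds):
    """One (FPR, TPR) point per threshold: predict 'hit' when prob >= t."""
    pos = sum(truth)
    neg = len(truth) - pos
    pts = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(truth, probs) if y and p >= t)
        fp = sum(1 for y, p in zip(truth, probs) if (not y) and p >= t)
        pts.append((fp / neg, tp / pos))
    return sorted(pts)

def auc(points):
    """Trapezoid-rule area under the sorted (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

truth = [True, True, False, False]
probs = [0.9, 0.8, 0.3, 0.1]   # invented scores, perfectly separated
pts = roc_points(truth, probs, [0.0, 0.2, 0.5, 0.85, 1.1])
```

Because the invented scores separate the classes perfectly, this toy curve passes through (0, 1) and the AUC is 1.0.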
17 Evaluating a classifier
- ROC curves are useful for comparing classifiers
[Figure: ROC curves using depth thresholds of 0-7 and 10; breast cancer data using Affy SNP 6.0 genotypes as truth]
18 All you need to know about ROC analysis
Tutorial by Tom Fawcett at HP (2003): http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
19 Practical example: detecting single nucleotide variants from next-gen sequencing data
20 Aligning billions of short reads to the genome
- MAQ, BWA, SOAP, SHRiMP, Mosaik, BowTie, Rmap, ELAND
- Chopping, hashing and indexing the genome; string matching with mismatch tolerance

aattcaggaccca-----------------------------
aattcaggacccacacga------------------------
aattcaggacccacacgacgggaagacaa-------------
-attcaggacaaacacgaagggaagacaagttcatgtacttt
----caggacccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
----------------gacgggaagacaagttcatgtacttt
---------------------------------atgtacttt
21 SNVMix1: modeling allelic counts
22 Querying SNVMix1
- Given the model parameters, what is the probability that genotype k gave rise to the observed data at each position?
- SNVMix1 is a mixture of Binomial distributions with component weights, one component per genotype: aa, ab, bb
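A sketch of this query (not the published SNVMix1 code; the Binomial parameters and component weights below are illustrative stand-ins): compute the posterior over genotypes aa/ab/bb given a reference-allele reads out of depth d:

```python
from math import comb

# Illustrative stand-in parameters, not fitted values:
mu = {"aa": 0.99, "ab": 0.5, "bb": 0.01}   # P(reference read | genotype)
pi = {"aa": 0.95, "ab": 0.03, "bb": 0.02}  # component (genotype) weights

def genotype_posterior(a, d):
    """Posterior over genotypes given a reference reads out of depth d."""
    lik = {k: pi[k] * comb(d, a) * mu[k] ** a * (1 - mu[k]) ** (d - a)
           for k in mu}
    z = sum(lik.values())                   # normalising constant
    return {k: v / z for k, v in lik.items()}

post = genotype_posterior(a=10, d=20)      # half the reads carry the reference allele
```

With half the reads supporting each allele, essentially all of the posterior mass lands on the heterozygous (ab) component, regardless of depth.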
23 SNVMix1 obviates the need for depth-based thresholding
[Figure: output of SNVMix1 on simulated data with increasing depth]
24 Learning parameters by model fitting: cancer genomes
- Li et al. (2008) MAQ and Li et al. (2009) SOAP use parameters of the Binomial assuming normal diploid genomes
- Cancer genomes:
  - are often not diploid
  - are mixed with normal cells
  - exhibit intra-tumoral heterogeneity
- Need to fit the model to data in order to learn more representative parameters
25 Fitting the model to data using EM
- Recall: the genotypes are unobserved
- Use Expectation-Maximization to fit the model to data
  - E-step: for each position i, compute the expected assignment to each genotype k
  - M-step: update the model parameters given those expected assignments
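A minimal EM loop for a Binomial mixture of this form; this is a sketch under simulated counts and generic initial parameters, not the SNVMix1 implementation:

```python
from math import comb

def em(counts, mu, pi, iters=50):
    """counts: list of (a, d) = (reference reads, depth) per position."""
    K = len(mu)
    for _ in range(iters):
        # E-step: responsibility of each genotype component k for each position i
        resp = []
        for a, d in counts:
            w = [pi[k] * comb(d, a) * mu[k] ** a * (1 - mu[k]) ** (d - a)
                 for k in range(K)]
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: re-estimate Binomial parameters and component weights
        for k in range(K):
            num = sum(r[k] * a for r, (a, d) in zip(resp, counts))
            den = sum(r[k] * d for r, (a, d) in zip(resp, counts))
            if den > 0:
                mu[k] = num / den
            pi[k] = sum(r[k] for r in resp) / len(counts)
    return mu, pi

# Simulated counts: three homozygous-reference and two heterozygous positions.
counts = [(20, 20), (19, 20), (20, 20), (10, 20), (11, 20)]
mu, pi = em(counts, mu=[0.9, 0.5, 0.1], pi=[1 / 3, 1 / 3, 1 / 3])
```

On this toy data the fitted parameters move toward the empirical allele frequencies of the two occupied components (about 0.98 for homozygous-reference, about 0.53 for heterozygous).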
26 Fitting the model confers increased accuracy
- 16 ovarian transcriptomes
- 144,271 coding positions with matched Affy SNP 6.0 data
- 10 repeats of 4-fold cross-validation to estimate parameters
  - Run EM on 3/4 of the data to estimate parameters
  - Evaluate on the remaining positions
- p < 0.00001
27 Other methods for classification
- Support vector machines
- Linear discriminant analysis
- Logistic regression
- Random forests
- See:
  - Ma and Huang, Briefings in Bioinformatics (2008)
  - Saeys et al., Bioinformatics (2007)

28 We are on a Coffee Break / Networking Session