Title: Canadian Bioinformatics Workshops

1 Canadian Bioinformatics Workshops

3 Module 6: Classification
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe, Toronto, September 8-9, 2011
[Image: The Judgement of Paris, Archaic (560-550 BCE)]
DEPARTMENT OF BIOCHEMISTRY / DEPARTMENT OF MOLECULAR GENETICS
Includes material originally developed by Sohrab Shah
4 Classification
- What is classification?
- Supervised learning
  - discriminant analysis
- Work from a set of objects with predefined classes, e.g. basal vs. luminal, or good responder vs. poor responder
- Task: learn from the features of the objects what the basis for discrimination is
- Statistically and mathematically heavy
5 Classification
[Figure: example samples labelled "poor response" vs. "good response"]
6 Classification
- Input: a set of measures, variables, or features
- Output: a discrete label for what the set of features most closely resembles
- Classification can be probabilistic or deterministic
- How is classification different from clustering?
  - We know the groups or classes a priori
  - Classification is a prediction of what an object is, not what other objects it most closely resembles
  - Clustering is finding patterns in data
  - Classification is using known patterns to predict an object's type
7 Example: DLBCL subtypes
Wright et al., PNAS (2003)

8 DLBCL subtypes
Wright et al., PNAS (2003)
9 Classification approaches
- Wright et al., PNAS (2003)
- Weighted features in a linear predictor score: LPS(X) = Σ_j a_j X_j
- a_j: weight of gene j, determined by its t-test statistic
- X_j: expression value of gene j
- Assume there are 2 distinct distributions of LPS: one for ABC, one for GCB
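The linear predictor score is just a weighted sum over genes. A minimal sketch in Python; the gene names, weights, and expression values below are invented for illustration:

```python
# LPS(X) = sum_j a_j * X_j -- a_j are t-statistic-derived gene weights,
# X_j are expression values. All values here are hypothetical.

weights = {"geneA": 2.1, "geneB": -1.4, "geneC": 0.8}    # a_j (invented)
expression = {"geneA": 5.0, "geneB": 3.2, "geneC": 7.5}  # X_j (invented)

def lps(weights, expression):
    """Linear predictor score: weighted sum of gene-expression values."""
    return sum(a * expression[g] for g, a in weights.items())

score = lps(weights, expression)
```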
10 Wright et al., DLBCL, cont'd
- Use Bayes' rule to determine the probability that a sample comes from group 1, given a probability density function that represents group 1
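Assuming a Gaussian density is fitted to the LPS values of each group (the means and standard deviations below are invented), the Bayes-rule step can be sketched as:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal probability density at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_group1(x, mu1, s1, mu2, s2):
    """P(group 1 | x) = p1(x) / (p1(x) + p2(x)), assuming equal priors."""
    p1 = gaussian_pdf(x, mu1, s1)
    p2 = gaussian_pdf(x, mu2, s2)
    return p1 / (p1 + p2)

# A score midway between the two (invented) group means is ambiguous: 0.5.
p = posterior_group1(0.0, -1.0, 1.0, 1.0, 1.0)
```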
11 Learning the classifier (Wright et al.)
- Choosing the genes (feature selection): use cross-validation
- Leave-one-out cross-validation:
  - Pick a set of samples
  - Use all but one of the samples as training, leaving one out for testing
  - Fit the model using the training data
  - Can the classifier correctly pick the class of the remaining case?
  - Repeat exhaustively, leaving out each sample in turn
- Repeat using different sets and numbers of genes, ranked by t-statistic
- Pick the set of genes that gives the highest accuracy
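The leave-one-out loop can be sketched as follows; the data and the nearest-centroid classifier are toy stand-ins for the real gene-expression model:

```python
from collections import defaultdict

def nearest_centroid_predict(train, x):
    """Predict the label whose class mean is closest to x (1-D toy model)."""
    sums = defaultdict(lambda: [0.0, 0])
    for xi, yi in train:
        sums[yi][0] += xi
        sums[yi][1] += 1
    centroids = {y: s / n for y, (s, n) in sums.items()}
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def loocv_accuracy(data):
    """Hold each sample out in turn, train on the rest, score the held-out case."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]          # all but one sample
        correct += nearest_centroid_predict(train, x) == y
    return correct / len(data)

# Toy "expression scores" for two DLBCL-like classes (values invented).
data = [(1.0, "GCB"), (1.2, "GCB"), (0.9, "GCB"),
        (3.0, "ABC"), (3.1, "ABC"), (2.8, "ABC")]
acc = loocv_accuracy(data)
```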
12 Overfitting
- In many cases in biology, the number of features is much larger than the number of samples
- Important features may not be represented in the training data
- This can result in overfitting:
  - the classifier discriminates well on its training data, but does not generalise to orthogonally derived data sets
- Validation in at least one external cohort is required before the results can be believed
  - example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets
13 Overfitting
- To reduce the problem of overfitting, one can use Bayesian priors to regularize the parameter estimates of the model
- Some methods now integrate feature selection and classification in a unified analytical framework
  - see Law et al., IEEE (2005), Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/amink/software/smlr/
- Cross-validation should always be used in training a classifier
14 Evaluating a classifier
- The receiver operating characteristic (ROC) curve plots the true positive rate vs. the false positive rate
- Given ground truth and a probabilistic classifier:
  - for some number of probability thresholds:
    - compute the TPR: the proportion of true positives that were predicted as hits
    - compute the FPR: the proportion of true negatives that were incorrectly predicted as hits
15 Evaluating a classifier
- Important terms:
  - Prediction: classifier says the object is a hit
  - Rejection: classifier says the object is a miss
  - True Positive (TP): prediction that is a true hit
  - True Negative (TN): rejection that is a true miss
  - False Positive (FP): prediction that is a true miss
  - False Negative (FN): rejection that is a true hit
- False Positive Rate: FPR = FP / (FP + TN)
  - specificity = 1 - FPR
- True Positive Rate: TPR = TP / (TP + FN)
  - sensitivity = TPR
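These definitions translate directly into code; the truth and prediction vectors below are invented for illustration:

```python
def rates(truth, pred):
    """TPR (sensitivity) and FPR (1 - specificity) from boolean vectors."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    return tp / (tp + fn), fp / (fp + tn)

truth = [True, True, True, False, False]   # ground truth (invented)
pred  = [True, False, True, True, False]   # classifier output (invented)
tpr, fpr = rates(truth, pred)              # TPR = 2/3, FPR = 1/2
```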
16 Evaluating a classifier
- Use the area under the ROC curve (AUC) as a single measure
  - encodes the trade-off between FPR and TPR
  - only possible for probabilistic outputs
  - requires ground truth and probabilities as inputs
  - at a fixed number of ordered probability thresholds, calculate FPR and TPR and plot
  - for deterministic methods, it is possible to calculate a single point in (FPR, TPR) space
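A minimal sketch of the thresholding procedure: sweep thresholds over the predicted probabilities, collect (FPR, TPR) points, and integrate with the trapezoid rule to get the AUC. The scores below are invented:

```python
def roc_points(truth, probs, thresholds):
    """One (FPR, TPR) point per threshold: predict 'hit' when prob >= t."""
    pos = sum(truth)
    neg = len(truth) - pos
    pts = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(truth, probs) if y and p >= t)
        fp = sum(1 for y, p in zip(truth, probs) if (not y) and p >= t)
        pts.append((fp / neg, tp / pos))
    return sorted(pts)

def auc(points):
    """Trapezoid-rule area under the sorted (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

truth = [True, True, False, False]
probs = [0.9, 0.8, 0.3, 0.1]   # invented scores, perfectly separated
pts = roc_points(truth, probs, [0.0, 0.2, 0.5, 0.85, 1.1])
```

Because the invented scores separate the classes perfectly, this toy curve passes through (0, 1) and the AUC is 1.0.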
17 Evaluating a classifier
- ROC curves are useful for comparing classifiers
[Figure: ROC curves using depth thresholds of 0-7 and 10; breast cancer data using Affy SNP 6.0 genotypes as truth]
18 All you need to know about ROC analysis
Tutorial by Tom Fawcett at HP (2003): http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
19 Practical example: detecting single nucleotide variants from next-gen sequencing data
20 Aligning billions of short reads to the genome
- MAQ, BWA, SOAP, SHRiMP, Mosaik, BowTie, Rmap, ELAND
- Chopping, hashing and indexing the genome; string matching with mismatch tolerance

aattcaggaccca-----------------------------
aattcaggacccacacga------------------------
aattcaggacccacacgacgggaagacaa-------------
-attcaggacaaacacgaagggaagacaagttcatgtacttt
----caggacccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
----------------gacgggaagacaagttcatgtacttt
---------------------------------atgtacttt
21 SNVMix1: modeling allelic counts
22 Querying SNVMix1
- Given the model parameters, what is the probability that genotype k gave rise to the observed data at each position?
- SNVMix1 is a mixture of Binomial distributions with component weights, one component per genotype: aa, ab, bb
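A sketch of this query (not the published SNVMix1 code; the Binomial parameters and component weights below are illustrative stand-ins): compute the posterior over genotypes aa/ab/bb given a reference-allele reads out of depth d:

```python
from math import comb

# Illustrative stand-in parameters, not fitted values:
mu = {"aa": 0.99, "ab": 0.5, "bb": 0.01}   # P(reference read | genotype)
pi = {"aa": 0.95, "ab": 0.03, "bb": 0.02}  # component (genotype) weights

def genotype_posterior(a, d):
    """Posterior over genotypes given a reference reads out of depth d."""
    lik = {k: pi[k] * comb(d, a) * mu[k] ** a * (1 - mu[k]) ** (d - a)
           for k in mu}
    z = sum(lik.values())                   # normalising constant
    return {k: v / z for k, v in lik.items()}

post = genotype_posterior(a=10, d=20)      # half the reads carry the reference allele
```

With half the reads supporting each allele, essentially all of the posterior mass lands on the heterozygous (ab) component, regardless of depth.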
23 SNVMix1 obviates the need for depth-based thresholding
[Figure: output of SNVMix1 on simulated data with increasing depth]
24 Learning parameters by model fitting: cancer genomes
- Li et al. (2008) MAQ and Li et al. (2009) SOAP use parameters of the Binomial assuming normal diploid genomes
- Cancer genomes:
  - are often not diploid
  - are mixed with normal cells
  - exhibit intra-tumoral heterogeneity
- Need to fit the model to data in order to learn more representative parameters
25 Fitting the model to data using EM
- Recall: the genotypes are unobserved
- Use Expectation-Maximization to fit the model to data
  - E-step: for each position i, compute the expected assignment to each genotype k
  - M-step: update the model parameters given those expected assignments
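A minimal EM loop for a Binomial mixture of this form; this is a sketch under simulated counts and generic initial parameters, not the SNVMix1 implementation:

```python
from math import comb

def em(counts, mu, pi, iters=50):
    """counts: list of (a, d) = (reference reads, depth) per position."""
    K = len(mu)
    for _ in range(iters):
        # E-step: responsibility of each genotype component k for each position i
        resp = []
        for a, d in counts:
            w = [pi[k] * comb(d, a) * mu[k] ** a * (1 - mu[k]) ** (d - a)
                 for k in range(K)]
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: re-estimate Binomial parameters and component weights
        for k in range(K):
            num = sum(r[k] * a for r, (a, d) in zip(resp, counts))
            den = sum(r[k] * d for r, (a, d) in zip(resp, counts))
            if den > 0:
                mu[k] = num / den
            pi[k] = sum(r[k] for r in resp) / len(counts)
    return mu, pi

# Simulated counts: three homozygous-reference and two heterozygous positions.
counts = [(20, 20), (19, 20), (20, 20), (10, 20), (11, 20)]
mu, pi = em(counts, mu=[0.9, 0.5, 0.1], pi=[1 / 3, 1 / 3, 1 / 3])
```

On this toy data the fitted parameters move toward the empirical allele frequencies of the two occupied components (about 0.98 for homozygous-reference, about 0.53 for heterozygous).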
26 Fitting the model confers increased accuracy
- 16 ovarian transcriptomes
- 144,271 coding positions with matched Affy SNP 6.0 data
- 10 repeats of 4-fold cross-validation to estimate parameters
  - Run EM on 3/4 of the data to estimate parameters
  - Evaluate on the remaining positions
- p < 0.00001
27 Other methods for classification
- Support vector machines
- Linear discriminant analysis
- Logistic regression
- Random forests
- See:
  - Ma and Huang, Briefings in Bioinformatics (2008)
  - Saeys et al., Bioinformatics (2007)

28 We are on a Coffee Break / Networking Session