1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Module 6: Classification
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe, Toronto, September 8-9, 2011

The judgement of Paris. Archaic (560-550 BCE)
DEPARTMENT OF BIOCHEMISTRY
DEPARTMENT OF MOLECULAR GENETICS
Includes material originally developed by Sohrab Shah

4
Classification
  • What is classification?
  • Supervised learning
  • discriminant analysis
  • Work from a set of objects with predefined classes
  • e.g. basal vs. luminal, or good responder vs. poor responder
  • Task: learn from the features of the objects what the basis for discrimination is
  • Statistically and mathematically heavy

5
Classification
(Figure: samples labelled "poor response" vs. "good response")
6
Classification
  • Input: a set of measures, variables, or features
  • Output: a discrete label for what the set of features most closely resembles
  • Classification can be probabilistic or deterministic
  • How is classification different from clustering?
  • We know the groups or classes a priori
  • Classification is a prediction of what an object is, not what other objects it most closely resembles
  • Clustering is finding patterns in data
  • Classification is using known patterns to predict an object's type

7
Example: DLBCL subtypes
Wright et al, PNAS (2003)
8
DLBCL subtypes
Wright et al, PNAS (2003)
9
Classification approaches
  • Wright et al, PNAS (2003)
  • Weighted features in a linear predictor score: LPS(X) = Σj aj Xj
  • aj: weight of gene j, determined by its t-test statistic
  • Xj: expression value of gene j
  • Assume there are 2 distinct distributions of LPS: 1 for ABC, 1 for GCB
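The linear predictor score is just a weighted sum over genes. A minimal sketch, with made-up weights and expression values:

```python
# LPS(X) = sum_j a_j * X_j: a_j is the t-statistic weight of gene j,
# X_j its expression value. The numbers below are hypothetical.

def linear_predictor_score(weights, expression):
    """Weighted sum of per-gene expression values."""
    return sum(a * x for a, x in zip(weights, expression))

a = [2.1, -1.4, 0.8]   # per-gene t-statistic weights (hypothetical)
x = [1.0, 0.5, 2.0]    # per-gene expression values (hypothetical)
print(linear_predictor_score(a, x))  # 2.1 - 0.7 + 1.6 = 3.0
```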

10
Wright et al, DLBCL, cont'd
  • Use Bayes' rule to determine the probability that a sample comes from group 1:
    P(group 1 | LPS) = f1(LPS) / (f1(LPS) + f2(LPS))
  • f1: probability density function that represents group 1
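The Bayes-rule step can be sketched as follows, with one LPS density per group and equal priors, so P(group 1 | LPS) = f1(LPS) / (f1(LPS) + f2(LPS)). The Gaussian form and the parameter values here are illustrative assumptions, not the published fits.

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_group1(lps, mu1, sd1, mu2, sd2):
    """Posterior probability of group 1 given an LPS value (equal priors)."""
    f1 = normal_pdf(lps, mu1, sd1)
    f2 = normal_pdf(lps, mu2, sd2)
    return f1 / (f1 + f2)

# An LPS exactly halfway between the two group means gives probability 0.5.
print(p_group1(0.0, mu1=-1.0, sd1=1.0, mu2=1.0, sd2=1.0))  # 0.5
```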

11
Learning the classifier, Wright et al
  • Choosing the genes (feature selection)
  • use cross-validation
  • Leave-one-out cross-validation:
  • Pick a set of samples
  • Use all but one of the samples for training, leaving one out for testing
  • Fit the model using the training data
  • Can the classifier correctly pick the class of the remaining case?
  • Repeat exhaustively, leaving out each sample in turn
  • Repeat using different sets and numbers of genes based on the t-statistic
  • Pick the set of genes that gives the highest accuracy
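The leave-one-out loop above can be sketched as follows; `fit` and `predict` are placeholders for any classifier (here a toy nearest-neighbour rule on made-up data):

```python
# Leave-one-out cross-validation: hold out each sample in turn, fit on
# the rest, and check whether the held-out sample is predicted correctly.

def loocv_accuracy(samples, labels, fit, predict):
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy classifier: predict the label of the nearest training sample.
def fit(xs, ys):
    return list(zip(xs, ys))

def predict(model, x):
    return min(model, key=lambda xy: abs(xy[0] - x))[1]

print(loocv_accuracy([0.1, 0.2, 5.0, 5.1], ["a", "a", "b", "b"], fit, predict))  # 1.0
```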

12
Overfitting
  • In many cases in biology, the number of features is much larger than the number of samples
  • Important features may not be represented in the training data
  • This can result in overfitting:
  • when a classifier discriminates well on its training data, but does not generalise to orthogonally derived data sets
  • Validation in at least one external cohort is required before the results can be believed
  • example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets

13
Overfitting
  • To reduce the problem of overfitting, one can use Bayesian priors to regularize the parameter estimates of the model
  • Some methods now integrate feature selection and classification in a unified analytical framework
  • see Law et al, IEEE (2005): Sparse Multinomial Logistic Regression (SMLR), http://www.cs.duke.edu/~amink/software/smlr/
  • Cross-validation should always be used in training a classifier

14
Evaluating a classifier
  • The receiver operating characteristic (ROC) curve
  • plots the true positive rate vs. the false positive rate
  • Given ground truth and a probabilistic classifier:
  • for some number of probability thresholds
  • compute the TPR
  • proportion of positives that were predicted as positive
  • compute the FPR
  • proportion of negatives that were predicted as positive

15
Evaluating a classifier
  • Important terms
  • Prediction: classifier says the object is a hit
  • Rejection: classifier says the object is a miss
  • True Positive (TP): a prediction that is a true hit
  • True Negative (TN): a rejection that is a true miss
  • False Positive (FP): a prediction that is a true miss
  • False Negative (FN): a rejection that is a true hit
  • False Positive Rate: FPR = FP / (FP + TN)
  • specificity = 1 - FPR
  • True Positive Rate: TPR = TP / (TP + FN)
  • sensitivity = TPR
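These definitions translate directly into code; the confusion-matrix counts below are hypothetical:

```python
# Rates from confusion-matrix counts, exactly as defined on the slide.

def rates(tp, fp, tn, fn):
    fpr = fp / (fp + tn)      # false positive rate
    tpr = tp / (tp + fn)      # true positive rate = sensitivity
    specificity = 1 - fpr
    return fpr, tpr, specificity

fpr, tpr, spec = rates(tp=90, fp=5, tn=95, fn=10)
print(fpr, tpr, spec)  # 0.05 0.9 0.95
```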

16
Evaluating a classifier
  • Use area under the ROC curve (AUC) as a single measure
  • encodes the trade-off between FPR and TPR
  • only possible for probabilistic outputs
  • requires ground truth and probabilities as inputs
  • at a fixed number of ordered probability thresholds, calculate FPR and TPR and plot
  • for deterministic methods, it is possible to calculate a single point in FPR-TPR space
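A sketch of the threshold sweep and a trapezoid-rule AUC as described above; the probabilities and labels are synthetic:

```python
# Sweep thresholds over the predicted probabilities, compute (FPR, TPR)
# at each, then accumulate the area under the curve by the trapezoid rule.

def roc_points(probs, truth):
    """truth: 1 = positive, 0 = negative."""
    thresholds = sorted(set(probs), reverse=True)
    pos = sum(truth)
    neg = len(truth) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, truth) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, truth) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

probs = [0.9, 0.8, 0.4, 0.2]
truth = [1, 1, 0, 0]          # perfectly separated classes -> AUC 1.0
print(auc(roc_points(probs, truth)))  # 1.0
```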


17
Evaluating a classifier
  • ROC curves are useful for comparing classifiers

(Figure: ROC curves using depth thresholds of 0-7, 10; breast cancer data using Affy SNP 6.0 genotypes as truth)
18
All you need to know about ROC analysis:
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
Tutorial by Tom Fawcett at HP (2003)
19
Practical example: detecting single nucleotide variants from next-generation sequencing data
20
Aligning billions of short reads to the genome
  • MAQ, BWA, SOAP, SHRiMP, Mosaik, BowTie, Rmap,
    ELAND
  • Chopping, hashing and indexing the genome; string matching with mismatch tolerance

aattcaggaccca-----------------------------
aattcaggacccacacga------------------------
aattcaggacccacacgacgggaagacaa-------------
-attcaggacaaacacgaagggaagacaagttcatgtacttt
----caggacccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
----------------gacgggaagacaagttcatgtacttt
---------------------------------atgtacttt
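The simplest form of string matching with mismatch tolerance is a sliding-window Hamming-distance scan, sketched below; real aligners such as BWA or MAQ use hashed seeds and indexes rather than this brute-force O(n·m) scan. The reference and reads are toy strings taken from the pileup above.

```python
# Slide a read along the reference and report every offset whose
# Hamming distance is within the mismatch budget.

def align_with_mismatches(reference, read, max_mismatches):
    hits = []
    for i in range(len(reference) - len(read) + 1):
        mismatches = sum(1 for a, b in zip(reference[i:i + len(read)], read)
                         if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits

ref = "aattcaggacccacacgacgggaagacaagttcatgtacttt"
print(align_with_mismatches(ref, "cccacacga", 1))   # exact hit at offset 9
print(align_with_mismatches(ref, "cccacacta", 1))   # offset 9 again, 1 mismatch
```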
21
SNVMix1: Modeling allelic counts
22
Querying SNVMix1
  • Given the model parameters, what is the
    probability that genotype k gave rise to the
    observed data at each position?

SNVMix1 is a mixture of Binomial distributions with component weights for the genotypes aa, ab, and bb
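The query on this slide, the probability that genotype k gave rise to the observed allelic counts, can be sketched as a Binomial mixture posterior: P(k | a, N) ∝ pi_k · Binomial(a | N, mu_k). The parameter values below are illustrative assumptions, not the published SNVMix1 estimates.

```python
import math

def binom_pmf(a, n, p):
    """Binomial probability of a reference reads out of depth n."""
    return math.comb(n, a) * p ** a * (1 - p) ** (n - a)

def genotype_posterior(a, n, mu, pi):
    """Posterior over genotypes given a reference reads out of n."""
    unnorm = {k: pi[k] * binom_pmf(a, n, mu[k]) for k in mu}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

mu = {"aa": 0.99, "ab": 0.5, "bb": 0.01}   # P(reference base | genotype)
pi = {"aa": 0.8, "ab": 0.15, "bb": 0.05}   # genotype mixture weights
post = genotype_posterior(a=10, n=20, mu=mu, pi=pi)
print(max(post, key=post.get))  # "ab": 10 of 20 reads look heterozygous
```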
23
SNVMix1 obviates the need for depth-based
thresholding
Output of SNVMix1 on simulated data with
increasing depth
24
Learning parameters by model fitting: cancer genomes
  • Li et al (2008) MAQ and Li et al (2009) SOAP
  • use parameters of the Binomial assuming normal diploid genomes
  • Cancer genomes:
  • are often not diploid
  • are mixed in with normal cells
  • exhibit intra-tumoral heterogeneity
  • Need to fit the model to data in order to learn more representative parameters

25
Fitting the model to data using EM
  • Recall: the genotype at each position is unobserved
  • Use Expectation Maximization to fit the model to data
  • E-step: compute the posterior probability of each genotype k at each position i, given the current parameters
  • M-step: update the model parameters given these posteriors
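Under the Binomial-mixture reading of SNVMix1, the E- and M-steps can be sketched as below. The data and starting parameters are synthetic, and this is an illustration of the EM recipe rather than the published implementation.

```python
import math

def binom_pmf(a, n, p):
    return math.comb(n, a) * p ** a * (1 - p) ** (n - a)

def em(counts, depths, mu, pi, iterations=50):
    genotypes = list(mu)
    for _ in range(iterations):
        # E-step: responsibility gamma[i][k] = P(genotype k | data at position i)
        gamma = []
        for a, n in zip(counts, depths):
            w = {k: pi[k] * binom_pmf(a, n, mu[k]) for k in genotypes}
            z = sum(w.values())
            gamma.append({k: w[k] / z for k in genotypes})
        # M-step: re-estimate mu_k and pi_k from the responsibilities
        for k in genotypes:
            mu[k] = (sum(g[k] * a for g, a in zip(gamma, counts))
                     / sum(g[k] * n for g, n in zip(gamma, depths)))
            pi[k] = sum(g[k] for g in gamma) / len(counts)
    return mu, pi

counts = [20, 19, 10, 11, 1, 0]    # reference-allele read counts (synthetic)
depths = [20] * 6                  # total depth at each position
mu, pi = em(counts, depths,
            {"aa": 0.9, "ab": 0.6, "bb": 0.1},
            {"aa": 1 / 3, "ab": 1 / 3, "bb": 1 / 3})
print(round(mu["ab"], 2))  # near 0.525, driven by the two ~10/20 positions
```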
26
Fitting the model confers increased accuracy
  • 16 ovarian transcriptomes
  • 144,271 coding positions with matched Affy SNP 6.0 data
  • 10 repeats of 4-fold cross-validation to estimate parameters
  • Run EM on ¾ of the data to estimate parameters
  • Compute accuracy on the remaining positions

p < 0.00001
27
Other methods for classification
  • Support vector machines
  • Linear discriminant analysis
  • Logistic regression
  • Random forests
  • See:
  • Ma and Huang, Briefings in Bioinformatics (2008)
  • Saeys et al, Bioinformatics (2007)

28
We are on a Coffee Break / Networking Session