Title: DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers
1 DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers
Bogdan Pasaniuc, Sotirios Kentros and Ion Mândoiu
Computer Science & Engineering Department,
University of Connecticut
2 Outline
- Motivation & Problem Definition
- Methods used
  - Hamming Distance (MIN-HD and AVG-HD)
  - Amino acid Similarity (MAX-AA-SIM and AVG-AA-SIM)
  - Convex-score similarity (MAX-CS-SIM)
  - Trinucleotide frequency (MIN-3FREQ)
  - Positional weight matrix (MAX-PWM)
  - Character-based pairwise species discrimination (k-BEST)
- Combining the Methods
- Results
  - Species Classification
  - New Species Recognition
- Future Work & Conclusions
3 Motivation
- DNA barcoding was recently proposed as a tool for differentiating species
- Goal: to make a fingerprint for species, using a short sequence of DNA
- Assumption: high interspecies variability while retaining low intraspecies sequence variability
- Barcode of choice: cytochrome c oxidase subunit 1 mitochondrial region ("COI", 648 base pairs long)
4 Problem definition
- The scope of our project is to explore whether combining simple classification methods can increase the classification accuracy
- We address two problems
  - Classification of barcodes given a training set of species
  - Identification of barcodes that belong to new species
- Assumption: all the barcode DNA sequences are aligned
5 Problem definition (1)
- Scenario
  - Barcodes with known species
  - New barcode
- Species Classification
  - INPUT: a set S of barcodes for which the species is known, and a new barcode x
  - OUTPUT: the most likely species of x
6 Problem definition (2)
- Species Classification & New Species Detection
  - INPUT: a set S of barcodes for which the species is known, and a new barcode x
  - OUTPUT: the most likely species of x, or a determination that x is likely to belong to a new species
7 Distance-based methods
- Find a distance between barcodes that is able to distinguish between species
  - Low intraspecies variability
  - High interspecies variability
- Hamming Distance
- Amino acid Similarity
- Convex-score similarity
- Trinucleotide frequency
8 Methods
[Figure: new barcode x and its distances d(x,S1), d(x,S2), ..., d(x,Sn) to species S1, S2, ..., Sn]
- Minimum Method Classifier: d(x,Si) = minimum of d(x,y) over all sequences y belonging to species Si
- Average Method Classifier: d(x,Si) = average of d(x,y) over all sequences y belonging to species Si
9 Hamming Distance
- Percentage of base pair differences between two aligned barcodes
- Average
  - Given barcode x, find the species S that minimizes the average Hamming distance from x to y (y in S); then species(x) = S
- Minimum
  - Given barcode x, find the barcode y that minimizes the Hamming distance from x to y; then species(x) = species(y)
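The two Hamming-distance classifiers can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the `(barcode, species)` training format, and the toy sequences are all assumptions, and barcodes are assumed to be aligned to equal length.

```python
from collections import defaultdict

def hamming(x, y):
    """Fraction of aligned positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("barcodes must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def classify_hd(x, training, mode="min"):
    """Assign x to a species with MIN-HD (mode="min") or AVG-HD (mode="avg").

    training is a list of (barcode, species) pairs (hypothetical format).
    """
    distances = defaultdict(list)
    for barcode, species in training:
        distances[species].append(hamming(x, barcode))
    if mode == "min":
        score = {s: min(d) for s, d in distances.items()}
    elif mode == "avg":
        score = {s: sum(d) / len(d) for s, d in distances.items()}
    else:
        raise ValueError("mode must be 'min' or 'avg'")
    return min(score, key=score.get)

# Toy data (hypothetical, not taken from the datasets used in the talk).
training = [("ACGTAC", "sp1"), ("TTTTTT", "sp1"),
            ("ACGAAA", "sp2"), ("ACGAAC", "sp2")]
print(classify_hd("ACGTAC", training, "min"))
print(classify_hd("ACGTAC", training, "avg"))
```

On this toy input the two variants disagree: the nearest single barcode is in sp1, while sp2 is closer on average, which is the kind of disagreement the combined classifier later arbitrates.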
10 Amino acid Similarity
- Genetic code
  - Rules that map DNA sequences to proteins
  - Codon: trinucleotide unit that encodes one amino acid
  - Divide the DNA sequence into codons and substitute each one by its corresponding amino acid
- Blosum62 (BLOcks SUbstitution Matrix)
  - 20x20 matrix that gives a score for each pair of amino acids based on amino acid properties
  - The higher the score, the more likely it is that the substitution causes no functional change in the protein
11 Amino acid Similarity
- Measures how similar the two amino acid sequences encoded by the barcodes are
- Similarity(x,y)
  - Barcodes x, y -> amino acid sequences x', y' (using the genetic code)
  - Score of the amino acid alignment of x' and y' using Blosum62
- Average
  - Find the species with maximum average similarity
- Maximum
  - Find the barcode with maximum similarity
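The similarity computation above can be sketched as follows. To stay self-contained, the sketch takes the codon table and the scoring function as parameters: `CODON_TABLE` is only a six-codon excerpt, and `toy_score` is a hypothetical stand-in (match +4, mismatch -1), not Blosum62. A real run would plug in the full genetic code appropriate for the organism and the actual Blosum62 scores.

```python
# Small excerpt of a codon table; a real run needs all 64 codons.
CODON_TABLE = {"ATG": "M", "GCT": "A", "GCC": "A",
               "AAA": "K", "AAG": "K", "CGT": "R"}

def toy_score(a, b):
    """Hypothetical stand-in for a Blosum62 lookup: +4 match, -1 mismatch."""
    return 4 if a == b else -1

def translate(barcode, codon_table):
    """Split an in-frame barcode into codons and map each to an amino acid."""
    usable = len(barcode) - len(barcode) % 3
    codons = [barcode[i:i + 3] for i in range(0, usable, 3)]
    return [codon_table.get(c, "X") for c in codons]  # "X" = unknown codon

def aa_similarity(x, y, codon_table, score):
    """Sum the substitution scores of the aligned amino acid pairs."""
    xs, ys = translate(x, codon_table), translate(y, codon_table)
    return sum(score(a, b) for a, b in zip(xs, ys))

def max_aa_sim(x, training, codon_table, score):
    """MAX-AA-SIM: species of the training barcode most similar to x."""
    _, species = max(
        training,
        key=lambda item: aa_similarity(x, item[0], codon_table, score))
    return species

# Toy training set (hypothetical). The sp1 barcode differs from x only by a
# synonymous change (GCT vs GCC), so it encodes the same amino acids.
training = [("ATGGCCAAA", "sp1"), ("ATGCGTAAA", "sp2")]
print(max_aa_sim("ATGGCTAAA", training, CODON_TABLE, toy_score))
```

Passing the table and the score function in as arguments keeps the sketch independent of any particular genetic code or substitution matrix.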
12 Convex-score Similarity
- Long runs of consecutive base pair matches indicate that the encoded amino acid sequence plays an important role -> the two barcodes are close in evolutionary distance
- The longer the run of base pair matches, the higher the score
- The contribution of a run increases convexly with its length
- The new sequence is assigned to the species containing the highest scoring sequence
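A minimal sketch of a convex run score follows. The slides do not state which convex function is used, so the choice of run length raised to a power (`exponent=2.0` by default) is an assumption made only for illustration, as are the function names and toy sequences.

```python
def convex_score(x, y, exponent=2.0):
    """Sum run_length ** exponent over maximal runs of consecutive matches.

    The power function is an assumed convex contribution; any exponent > 1
    rewards one long run more than several short runs of the same total length.
    """
    total, run = 0.0, 0
    for a, b in zip(x, y):
        if a == b:
            run += 1
        else:
            total += run ** exponent
            run = 0
    return total + run ** exponent  # close the final run

def max_cs_sim(x, training, exponent=2.0):
    """MAX-CS-SIM: species of the highest scoring training barcode."""
    _, species = max(training,
                     key=lambda item: convex_score(x, item[0], exponent))
    return species

# Both toy barcodes match x at three positions, but only the first has
# the matches in a single run.
training = [("AAATTT", "sp1"), ("ATATAT", "sp2")]
print(convex_score("AAAAAA", "AAATTT"), convex_score("AAAAAA", "ATATAT"))
print(max_cs_sim("AAAAAA", training))
```

With equal match counts, Hamming distance would tie these two candidates; the convex score separates them because it depends on how the matches are grouped.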
13 Trinucleotide Distance
- For each species, compute the vector of trinucleotide frequencies
- For the new sequence x, compute the vector of trinucleotide frequencies
- Find the closest species
- To measure the distance between two frequency vectors we use the mean square distance, and pick the species at minimum distance
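The trinucleotide classifier can be sketched as below. Two details are assumptions not stated on the slide: trinucleotides are counted in overlapping windows, and a species vector is built by pooling counts over all of its barcodes. Names and toy data are hypothetical.

```python
from itertools import product

# All 64 trinucleotides over the alphabet A, C, G, T, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

def trinucleotide_freq(seqs):
    """Frequency vector of overlapping trinucleotides pooled over seqs."""
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for seq in seqs:
        for i in range(len(seq) - 2):
            tri = seq[i:i + 3]
            if tri in counts:  # skip windows with gaps or ambiguous bases
                counts[tri] += 1
                total += 1
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]

def mean_square_distance(u, v):
    """Mean of the squared componentwise differences of two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def min_3freq(x, by_species):
    """MIN-3FREQ: species whose frequency vector is closest to that of x."""
    fx = trinucleotide_freq([x])
    return min(by_species,
               key=lambda s: mean_square_distance(
                   fx, trinucleotide_freq(by_species[s])))

# Toy data (hypothetical): species name -> list of barcodes.
by_species = {"sp1": ["ACGACGACG"], "sp2": ["TTTTTTTTT"]}
print(min_3freq("CGACGACGA", by_species))
```

Because the frequency vector ignores position, this classifier does not require the barcodes to be aligned, unlike the Hamming-based ones.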
14 New species detection
- Distance-based Methods
  - Find the most likely species S for the new barcode
  - Compute the highest distance between two barcodes in S (dist(S))
  - If the distance from the new barcode to S is higher than dist(S), then the new barcode is likely to belong to a new species
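The threshold rule above can be sketched for one concrete choice of distance. The sketch assumes Hamming distance and the minimum-distance variant of d(x,S); the rule itself is stated for any distance-based method. Returning `None` for "new species" and all names below are conventions chosen for this example only.

```python
from itertools import combinations

def hamming(x, y):
    """Fraction of aligned positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def dist_to_species(x, members):
    """d(x, S) as the minimum distance from x to any barcode of S."""
    return min(hamming(x, y) for y in members)

def diameter(members):
    """dist(S): the highest distance between two barcodes of S."""
    return max((hamming(a, b) for a, b in combinations(members, 2)),
               default=0.0)  # a species with one barcode has diameter 0

def classify_or_new(x, by_species):
    """Return the most likely species, or None if x looks like a new one."""
    best = min(by_species, key=lambda s: dist_to_species(x, by_species[s]))
    if dist_to_species(x, by_species[best]) > diameter(by_species[best]):
        return None
    return best

# Toy data (hypothetical).
by_species = {"sp1": ["AAAAAA", "AAAAAT"], "sp2": ["CCCCCC", "CCCCCG"]}
print(classify_or_new("AAAATT", by_species))
print(classify_or_new("GGGGGG", by_species))
```

Note that a species represented by a single training barcode has diameter 0, so under this rule any non-identical query nearest to it is flagged as new.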
15 Positional weight matrix
- We assume independence of loci
- For each species, compute a positional weight matrix (PWM)
- For each locus, the PWM gives the probability of seeing each nucleotide at that locus in that species
- For a barcode x, we can compute the probability that x belongs to species S as the product, over all loci, of the probability of observing the respective nucleotide of x at that locus
- Assign x to the species that gives the highest probability
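A minimal sketch of the PWM classifier follows. Two choices here are assumptions that are not on the slide: a pseudocount is added so that an unseen nucleotide does not force the whole product to zero, and the product is evaluated as a sum of logarithms to avoid numerical underflow on long barcodes. Training barcodes are assumed to contain only A, C, G, T.

```python
import math

ALPHABET = "ACGT"

def build_pwm(barcodes, pseudocount=1.0):
    """One dict per locus mapping nucleotide -> smoothed probability."""
    pwm = []
    for i in range(len(barcodes[0])):
        column = [b[i] for b in barcodes]
        denom = len(column) + pseudocount * len(ALPHABET)
        pwm.append({n: (column.count(n) + pseudocount) / denom
                    for n in ALPHABET})
    return pwm

def log_prob(x, pwm):
    """Log of the product of per-locus probabilities (loci independent)."""
    # Characters of x outside ALPHABET (e.g. gaps) are skipped.
    return sum(math.log(pwm[i][n]) for i, n in enumerate(x) if n in pwm[i])

def max_pwm(x, by_species):
    """MAX-PWM: species whose PWM gives x the highest probability."""
    return max(by_species,
               key=lambda s: log_prob(x, build_pwm(by_species[s])))

# Toy data (hypothetical).
by_species = {"sp1": ["ACGT", "ACGA"], "sp2": ["TTTT", "TTTA"]}
print(max_pwm("ACGT", by_species))
```

Since the logarithm is monotone, maximizing the log probability selects the same species as maximizing the product itself.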
16 Character-based pairwise species discrimination
- Given species S1, S2 and a new barcode x, we find the k most discriminating characters
  - A locus -> character
  - Nucleotides -> possible values for the character
- Idea: if at a given locus there is a nucleotide that appears in S1 and not in S2, and x contains that nucleotide at that locus -> x is more likely to belong to S1 than to S2
17 Character-based pairwise species discrimination
- The two species (red, blue) are discriminated by character i with 100% accuracy
- The nucleotide present at position i in the new barcode x tells us to which species x is more likely to belong
- i is a pure character (there is no nucleotide appearing in both species)
[Figure: nucleotides at character i across the barcodes of the two species: A A C C C T T T G G; w(i) = 1]
18 Character-based pairwise species discrimination
- The two species (red, blue) are discriminated by character i with 90% accuracy
  - If the new barcode x has a C, T, or G at i, we guess the species of x correctly
  - If the new barcode x has an A at i, we choose as the species of x the one containing the highest number of As at i (the red species)
[Figure: nucleotides at character i across the barcodes of the two species: A A C C C A T T G G; w(i) = 0.9]
19 Character-based pairwise species discrimination
- Finding the k most discriminative characters
- The discriminative power w(i) of character i is given by [formula not recoverable], defined in terms of:
  - Cnt(i,X,S1): the number of times we see nucleotide X at position i in species S1
  - Size(S1): the number of barcodes in species S1
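The exact formula for the discriminative power is not given in this text. One definition that reproduces both worked examples (w(i) = 1 and w(i) = 0.9) is the fraction of training barcodes classified correctly when each nucleotide at position i is attributed to the species in which it occurs more often. The sketch below implements that definition as an assumption, not as the authors' formula; the split of each example column into five red and five blue barcodes is likewise inferred.

```python
from collections import Counter

def discriminative_power(col1, col2):
    """Assumed w(i): share of barcodes classified correctly by majority.

    col1 and col2 hold the nucleotides seen at position i in species S1
    and S2, so Counter(col1)[X] plays the role of Cnt(i, X, S1) and
    len(col1) the role of Size(S1).
    """
    cnt1, cnt2 = Counter(col1), Counter(col2)
    correct = sum(max(cnt1[n], cnt2[n]) for n in set(cnt1) | set(cnt2))
    return correct / (len(col1) + len(col2))

# Columns from the two worked examples (red/blue split is an assumption).
print(discriminative_power("AACCC", "TTTGG"))  # pure character
print(discriminative_power("AACCC", "ATTGG"))  # A occurs in both species
```

In the second column only the single A on the blue side is misattributed, which gives 9 correct out of 10.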
20 Character-based pairwise species discrimination
1. Given species S1, S2 and a new barcode x, find the k most discriminating characters
2. Compute how many times species S1 is favored over S2 and output the most favored species
3. Repeat steps 1 and 2 for all pairs of species and the new barcode x
- The species S that is favored the most in all these pairwise discriminations is assigned to barcode x
- Threshold for new species
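The pairwise procedure can be sketched as a round-robin over species pairs. Several details are assumptions: the ranking function `power` (majority-vote accuracy, chosen because it matches the worked examples), the per-character vote by raw nucleotide counts, and breaking ties toward the first species of a pair. The new-species threshold is omitted, and at least two species are required.

```python
from collections import Counter
from itertools import combinations

def power(col1, col2):
    """Assumed discriminative power of one character (majority accuracy)."""
    cnt1, cnt2 = Counter(col1), Counter(col2)
    correct = sum(max(cnt1[n], cnt2[n]) for n in set(cnt1) | set(cnt2))
    return correct / (len(col1) + len(col2))

def favored(x, s1, s2, by_species, k):
    """Steps 1 and 2: vote over the k most discriminating characters."""
    b1, b2 = by_species[s1], by_species[s2]
    cols = [([b[i] for b in b1], [b[i] for b in b2]) for i in range(len(x))]
    best = sorted(range(len(x)), key=lambda i: power(*cols[i]),
                  reverse=True)[:k]
    votes1 = votes2 = 0
    for i in best:
        c1, c2 = cols[i][0].count(x[i]), cols[i][1].count(x[i])
        if c1 > c2:
            votes1 += 1
        elif c2 > c1:
            votes2 += 1
    return s1 if votes1 >= votes2 else s2  # tie goes to s1 (assumption)

def k_best(x, by_species, k=10):
    """Step 3: the species winning the most pairwise comparisons."""
    wins = Counter()
    for s1, s2 in combinations(sorted(by_species), 2):
        wins[favored(x, s1, s2, by_species, k)] += 1
    return wins.most_common(1)[0][0]

# Toy data (hypothetical).
by_species = {"sp1": ["AAAA", "AAAT"], "sp2": ["CCCC", "CCCT"],
              "sp3": ["GGGG", "GGGT"]}
print(k_best("AAAA", by_species, k=2))
```

The loop over all pairs grows quadratically with the number of species, which is consistent with k-BEST being the one method singled out as not scaling.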
21 Combining the Methods
- Run every classifier independently
- Simple Voting
  - The species returned by each classifier has a weight of 1
  - The species with the most votes is the candidate species
- Weighted Voting
  - Each classifier has a different weight
- The results we present are obtained using the Simple Voting scheme
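Both voting schemes fit in one small function. The predictions and the weights below are hypothetical values made up for the example, and ties are broken by whichever species was seen first, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def vote(predictions, weights=None):
    """Combine classifier outputs by simple or weighted voting.

    predictions maps classifier name -> predicted species. With
    weights=None every classifier counts 1 (simple voting); otherwise
    weights maps classifier name -> weight, defaulting to 1.
    """
    tally = Counter()
    for name, species in predictions.items():
        tally[species] += 1.0 if weights is None else weights.get(name, 1.0)
    return max(tally, key=tally.get)

# Hypothetical outputs of three classifiers for one barcode.
predictions = {"MIN-HD": "sp1", "MAX-AA-SIM": "sp2", "MAX-PWM": "sp1"}
print(vote(predictions))                        # simple voting
print(vote(predictions, {"MAX-AA-SIM": 3.0}))   # weighted voting
```

Simple voting is the special case in which all weights are equal, so one function covers both schemes.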
22 Datasets (1)
- Dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges, consisting of 1623 aligned sequences classified into 150 species, with each sequence consisting of 590 nucleotides on average
- We randomly deleted from each species 10 to 50 percent of the sequences
  - Deleted sequences -> test
  - Remaining sequences -> train
- Every species is represented in the training dataset
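The split described above can be sketched as a per-species random partition. How the number of removed barcodes is rounded and how the guarantee of at least one training barcode per species is enforced are not stated, so the rounding and the cap below are assumptions, as are the names and the fixed seed.

```python
import random

def split_by_species(by_species, fraction, seed=0):
    """Move about `fraction` of each species' barcodes to the test set.

    At least one barcode per species is always kept for training.
    Returns (train, test), both mapping species -> list of barcodes.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for species, barcodes in by_species.items():
        shuffled = barcodes[:]
        rng.shuffle(shuffled)
        n_test = min(round(fraction * len(shuffled)), len(shuffled) - 1)
        test[species] = shuffled[:n_test]
        train[species] = shuffled[n_test:]
    return train, test

# Toy data (hypothetical barcode identifiers).
by_species = {"sp1": [f"b{i}" for i in range(10)], "sp2": ["c0"]}
train, test = split_by_species(by_species, 0.5)
print(len(train["sp1"]), len(test["sp1"]), train["sp2"], test["sp2"])
```

A species with a single barcode contributes nothing to the test set under this rule, since removing it would leave the species unrepresented in training.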
23 Classification Accuracy (in %) (DAWG train dataset)
Columns: percentage of barcodes removed from each species and used for testing
Classifier     10    20    30    40    50
MIN-HD       98.8  98.0  97.8  97.2  96.0
AVG-HD       97.2  97.2  96.6  96.2  95.6
MAX-AA-SIM   99.0  99.0  99.2  98.4  96.8
AVG-AA-SIM   94.6  94.2  94.8  94.2  93.0
MAX-CS-SIM   98.2  98.2  98.6  97.6  97.4
MIN-3FREQ    94.6  93.8  94.2  92.0  92.4
MAX-PWM      98.0  98.6  97.8  95.4  94.6
10-BEST      98.6  97.0  97.6  96.2  96.2
COMBINED     99.4  99.4  99.6  98.6  98.0
24 Datasets (2)
- Cowries dataset in (MeyerPaulay05)
- We removed the species containing fewer than 4 barcodes
- We randomly deleted from each species 10 to 50 percent of the sequences
  - Deleted sequences -> test
  - Remaining sequences -> train
- We made sure that every species has at least one barcode in the training set
25 Classification Accuracy (in %) (COWRIES dataset)
Columns: percentage of barcodes removed from each species and used for testing
Classifier     10    20    30    40    50
MIN-HD       96.6  96.0  96.2  96.4  96.3
AVG-HD       95.0  95.4  94.4  95.2  94.8
MAX-AA-SIM   96.4  95.2  95.6  95.8  96.2
AVG-AA-SIM   93.8  94.0  92.6  92.8  92.8
MAX-CS-SIM   96.2  95.6  95.6  96.0  95.6
MIN-3FREQ    89.2  90.1  89.4  89.0  89.0
MAX-PWM      91.2  91.4  90.4  90.8  90.4
10-BEST      92.6  91.4  91.2  91.2  91.8
COMBINED     96.6  96.4  96.2  96.0  96.2
26 Datasets (3)
- To test the accuracy of new species detection and classification, we devised a regular leave-one-out procedure
  - Delete a whole species
  - Randomly delete from each remaining species 0 to 50 percent of the barcodes
  - Deleted sequences -> test
  - Remaining sequences -> train
- The following table gives accuracy results averaged over 150x6 different test cases
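The leave-one-out procedure can be sketched as a generator of test cases, one per held-out species and removal fraction. The six fractions 0 to 50 percent in steps of 10 are inferred from the columns of the results table; labeling held-out barcodes with `None` to mean "new species", the rounding rule, and all names are assumptions of this sketch. Every species is assumed to have at least one barcode.

```python
import random

FRACTIONS = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)

def leave_one_species_out(by_species, fractions=FRACTIONS, seed=0):
    """Yield (held_out, fraction, train, test) for every test case.

    train maps species -> barcodes; test is a list of
    (barcode, expected_label) pairs, with None marking a new species.
    """
    rng = random.Random(seed)
    for held_out in by_species:
        for fraction in fractions:
            train = {}
            test = [(b, None) for b in by_species[held_out]]
            for species, barcodes in by_species.items():
                if species == held_out:
                    continue
                shuffled = barcodes[:]
                rng.shuffle(shuffled)
                # Keep at least one barcode of each species for training.
                n_test = min(round(fraction * len(shuffled)),
                             len(shuffled) - 1)
                test += [(b, species) for b in shuffled[:n_test]]
                train[species] = shuffled[n_test:]
            yield held_out, fraction, train, test

# Toy data (hypothetical): 3 species x 6 fractions = 18 test cases.
toy = {s: [f"{s}_{i}" for i in range(4)] for s in ("sp1", "sp2", "sp3")}
cases = list(leave_one_species_out(toy))
print(len(cases))
```

With 150 species and six fractions this enumeration yields the 150x6 test cases mentioned above.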
27 Leave-one-out Accuracy (in %) (DAWG train dataset)
Columns: percentage of additional barcodes removed from each species and used for testing
Classifier      0    10    20    30    40    50
MIN-HD       80.9  91.7  92.8  91.6  90.3  88.4
AVG-HD       81.1  91.5  92.3  91.0  89.9  87.8
MAX-AA-SIM   83.4  82.7  82.9  80.2  78.4  74.8
AVG-AA-SIM   83.1  89.5  89.3  88.8  88.3  88.2
MAX-CS-SIM   94.3  94.4  94.0  92.9  91.7  89.7
MIN-3FREQ    82.9  70.3  69.6  67.8  65.8  63.0
MAX-PWM      91.2  91.7  91.6  89.8  88.0  85.4
10-BEST      93.3  94.7  93.8  92.6  91.6  89.6
COMBINED     93.7  97.6  97.8  97.8  97.4  97.0
28 Conclusions
- Every method shows a tradeoff between new species detection and classification accuracy
- Hamming distance performs very well when no new species are present, but its accuracy gets worse for new species detection
- The combined method yields good accuracy on both sequence classification and new species detection
- The runtime of all methods is within the same order of magnitude
- All methods except k-BEST are scalable to millions of species
29 Ongoing Work
- Confidence measures for all the methods
- Further investigate threshold selection and weighting schemes
- Scalable character-based method
- Possibly ignoring parts of the given sequences could improve accuracy. Are there redundant/noisy regions?
- New species clustering: determining the different new species present