DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers

Slides: 30

Provided by: hydrag

1
DNA Barcode Data Analysis: Boosting Assignment
Accuracy by Combining Distance- and
Character-Based Classifiers
Bogdan Pasaniuc, Sotirios Kentros and Ion Mândoiu
Computer Science & Engineering Department,
University of Connecticut
2
Outline

Motivation & Problem Definition
Methods used
Hamming Distance (MIN-HD and AVG-HD)
Amino Acid Similarity (MAX-AA-SIM and AVG-AA-SIM)
Convex-score Similarity (MAX-CS-SIM)
Trinucleotide Frequency (MIN-3FREQ)
Positional Weight Matrix (MAX-PWM)
Character-based pairwise species discrimination (k-BEST)
Combining the Methods
Results
Species Classification
New Species Recognition
Future Work & Conclusions

3
Motivation

DNA barcoding was recently proposed as a tool for
differentiating species
Goal: to make a fingerprint for each species, using
a short sequence of DNA
Assumption: high interspecies variability combined
with low intraspecies sequence variability
Barcode of choice: cytochrome c oxidase subunit 1
mitochondrial region ("COI", 648 base pairs long).

4
Problem definition

The goal of our project is to explore whether
combining simple classification methods can
increase classification accuracy.
We address two problems:
Classification of barcodes given a training set
of species.
Identification of barcodes that belong to new
species.
Assumption: all the barcode DNA sequences are
aligned

5
Problem definition (1)

Scenario
Barcodes with known species
New barcode
Species Classification
INPUT: a set S of barcodes for which the species
is known, and a new barcode x
OUTPUT: the most likely species of x

6
Problem definition (2)

Species Classification and New Species Detection
INPUT: a set S of barcodes for which the species
is known, and a new barcode x
OUTPUT: the most likely species of x, or a
determination that x is likely to belong to a new species

7
Distance based methods

Find a distance between barcodes that is able
to distinguish between species
Low intraspecies variability
High interspecies variability
Hamming Distance
Amino Acid Similarity
Convex-score Similarity
Trinucleotide Frequency

8
Methods
[Figure: a new barcode x and its distances d(x,S1), d(x,S2), ..., d(x,Sn) to species S1, S2, ..., Sn]

Minimum Method Classifier
d(x,Si) = minimum of d(x,y) over all sequences y that belong to species Si
Average Method Classifier
d(x,Si) = average of d(x,y) over all sequences y that belong to species Si

9
Hamming Distance

Percentage of base-pair differences
Average
Given barcode x, find the species S that minimizes
the average Hamming distance from x to y (y in S)
species(x) = S
Minimum
Given barcode x, find the barcode y that minimizes the
Hamming distance from x to y
species(x) = species(y)
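
The two Hamming-distance classifiers can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy training set are hypothetical, and the distance is normalized to a fraction of positions, which is an assumption about how "percentage" is computed.

```python
from collections import defaultdict

def hamming(x, y):
    """Fraction of aligned positions at which barcodes x and y differ."""
    if len(x) != len(y):
        raise ValueError("barcodes must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def classify_min_hd(x, training):
    """MIN-HD: return the species of the single closest training barcode."""
    return min(training, key=lambda item: hamming(x, item[0]))[1]

def classify_avg_hd(x, training):
    """AVG-HD: return the species with the smallest average distance to x."""
    by_species = defaultdict(list)
    for seq, species in training:
        by_species[species].append(hamming(x, seq))
    return min(by_species, key=lambda s: sum(by_species[s]) / len(by_species[s]))

# Toy training set of (barcode, species) pairs; the sequences are made up.
training = [("ACGTAC", "sp1"), ("ACGTAA", "sp1"),
            ("TTGTAC", "sp2"), ("TTGTCC", "sp2")]
print(classify_min_hd("ACGTAC", training), classify_avg_hd("TTGTAC", training))
```

Both functions break ties by returning the first candidate encountered; a real pipeline would need an explicit tie-breaking rule.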

10
Amino Acid Similarity

Genetic code: rules that map DNA sequences to proteins
Codon: tri-nucleotide unit that encodes one
amino acid
Divide the DNA sequence into codons and substitute each
one with its corresponding amino acid
BLOSUM62 (BLOcks SUbstitution Matrix)
20x20 matrix that gives a score for each pair of
amino acids based on amino acid properties
The higher the score, the less likely the substitution
is to cause a functional change in the protein

11
Amino Acid Similarity

Measures how similar the two amino acid sequences
encoded by the barcodes are
Similarity(x,y)
Barcodes x, y -> amino acid sequences x', y'
(using the genetic code)
Score of the amino acid alignment using
BLOSUM62
Average
Find the species with maximum average similarity
Maximum
Find the barcode with maximum similarity
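
A rough sketch of the similarity computation is given below. The codon table and score table are deliberately tiny, hypothetical stand-ins: a real run would use the full genetic code appropriate for the organism's mitochondria and the complete BLOSUM62 matrix, neither of which is reproduced here. The function names are also assumptions.

```python
# Tiny illustrative codon table (hypothetical excerpt, not a full genetic code).
CODON_TABLE = {"ATG": "M", "CTG": "L", "TTA": "L", "GGA": "G"}

# Toy substitution scores for illustration only; load the real BLOSUM62
# matrix in practice. Keys are unordered pairs stored in one orientation.
TOY_SCORES = {("M", "M"): 5, ("L", "L"): 4, ("G", "G"): 6,
              ("L", "M"): 2, ("G", "L"): -4, ("G", "M"): -3}

def translate(barcode, codon_table):
    """Split an aligned barcode into codons and map each to an amino acid."""
    usable = len(barcode) - len(barcode) % 3  # drop a trailing partial codon
    codons = [barcode[i:i + 3] for i in range(0, usable, 3)]
    return [codon_table.get(c, "X") for c in codons]  # "X" marks unknown codons

def aa_similarity(x, y, codon_table, scores):
    """Sum of substitution scores over aligned amino-acid positions."""
    pairs = zip(translate(x, codon_table), translate(y, codon_table))
    # Unknown pairs contribute 0 (an assumption of this sketch).
    return sum(scores.get((a, b), scores.get((b, a), 0)) for a, b in pairs)

print(aa_similarity("ATGCTG", "ATGTTA", CODON_TABLE, TOY_SCORES))
```

MAX-AA-SIM would then pick the species of the training barcode with the highest `aa_similarity` to x, and AVG-AA-SIM the species with the highest average.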

12
Convex-score Similarity

Long runs of consecutive base-pair matches
indicate that the encoded amino acid sequence
plays an important role -> the two barcodes are
close in evolutionary distance
The longer the run of base-pair matches, the
higher the score
The contribution of a run increases convexly
with its length
The new sequence is assigned to the species
containing the highest-scoring sequence
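
One way to realize a convex run score is sketched below. The slides do not state which convex function is used, so raising the run length to a power (default 2) is purely an assumption, as are the function names.

```python
def convex_score(x, y, power=2):
    """Score two aligned barcodes: each maximal run of matching positions
    of length r contributes r ** power (convex in r for power > 1)."""
    score, run = 0, 0
    for a, b in zip(x, y):
        if a == b:
            run += 1
        else:
            score += run ** power  # close the current run
            run = 0
    return score + run ** power    # account for a run that reaches the end

def classify_max_cs_sim(x, training):
    """MAX-CS-SIM: species of the training barcode with the highest score."""
    return max(training, key=lambda item: convex_score(x, item[0]))[1]

# Toy (barcode, species) pairs; the sequences are made up.
training = [("ACGTAC", "sp1"), ("TTGTAC", "sp2")]
print(classify_max_cs_sim("ACGTAA", training))
```

With `power=2`, one run of four matches scores 16, whereas two separate runs of two matches score only 8, which is the intended preference for long uninterrupted matches.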

13
Trinucleotide Distance

For each species, compute the vector of
trinucleotide frequencies
For the new sequence x, compute the vector of
trinucleotide frequencies
Find the closest species.
To measure the distance between two frequency
vectors we use the mean square distance
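
The trinucleotide-frequency classifier might look like the sketch below. Two details are assumptions not stated in the slides: trinucleotides are counted with an overlapping sliding window, and a species vector is obtained by pooling the counts of all its barcodes. Windows containing gaps or ambiguous symbols are simply ignored.

```python
from collections import Counter
from itertools import product

TRINUCS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 trinucleotides

def trinuc_freq(seqs):
    """Frequency vector of the 64 trinucleotides over a list of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + 3] for i in range(len(s) - 2))
    total = sum(counts[t] for t in TRINUCS)  # ignores windows with non-ACGT symbols
    return [counts[t] / total if total else 0.0 for t in TRINUCS]

def mean_square_distance(u, v):
    """Mean of squared coordinate differences between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def classify_min_3freq(x, species_seqs):
    """MIN-3FREQ: species whose frequency vector is closest to that of x."""
    fx = trinuc_freq([x])
    return min(species_seqs,
               key=lambda s: mean_square_distance(fx, trinuc_freq(species_seqs[s])))

# Toy data (made up): species name -> list of barcodes.
species_seqs = {"sp1": ["AAAAAA", "AAAAAC"], "sp2": ["GTGTGT", "GTGTGA"]}
print(classify_min_3freq("AAAACA", species_seqs))
```

Unlike the pairwise distances above, this method compares x to one summary vector per species, so its cost does not grow with the number of barcodes per species once the vectors are precomputed.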

14
New species detection

Distance-based Methods
Find the most likely species S for the new
barcode
Compute the largest distance between two barcodes
in S ( dist(S) )
If the distance from the new barcode to S is
larger than dist(S), then the new barcode is
likely to belong to a new species
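
The threshold rule can be sketched as below, using Hamming distance and the minimum-distance definition of d(x, S); both choices are assumptions made for the example, since the rule applies to any of the distance-based methods. Returning `None` to signal a new species is likewise a convention of this sketch.

```python
from itertools import combinations

def hamming(x, y):
    """Fraction of aligned positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def species_diameter(seqs):
    """dist(S): the largest distance between any two barcodes of a species."""
    return max((hamming(a, b) for a, b in combinations(seqs, 2)), default=0.0)

def classify_or_flag_new(x, species_seqs):
    """Return the most likely species of x, or None if x looks like a new species."""
    def dist_to(species):
        return min(hamming(x, y) for y in species_seqs[species])
    best = min(species_seqs, key=dist_to)
    if dist_to(best) > species_diameter(species_seqs[best]):
        return None  # farther from S than any two members of S are from each other
    return best

# Toy data (made up): species name -> list of barcodes.
species_seqs = {"sp1": ["ACGTAC", "ACGTAA"], "sp2": ["TTGTAC", "TTGTCC"]}
print(classify_or_flag_new("ACGTAC", species_seqs))
print(classify_or_flag_new("GGCCGG", species_seqs))
```

Note that a species represented by a single barcode has dist(S) = 0 under this rule, so any non-identical query assigned to it would be flagged as new.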

15
Positional weight matrix

We assume independence of loci
For each species, compute a positional weight
matrix (PWM)
For each locus, the PWM gives the probability of
seeing each nucleotide at that locus in that
species
For a barcode x, we can compute the probability
that x belongs to species S as the product, over
all loci, of the probability of observing the
respective nucleotide of x at that locus
Assign x to the species that gives the highest
probability
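
A minimal PWM classifier is sketched below. Two implementation choices are assumptions that the slides do not mention: a pseudocount is added so that an unseen nucleotide does not force the product to zero, and the product of probabilities is computed as a sum of logarithms to avoid numerical underflow on barcodes several hundred positions long.

```python
import math

ALPHABET = "ACGT"

def build_pwm(seqs, pseudocount=1.0):
    """One dict per locus mapping nucleotide -> estimated probability."""
    pwm = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        counts = {n: column.count(n) + pseudocount for n in ALPHABET}
        total = sum(counts.values())
        pwm.append({n: c / total for n, c in counts.items()})
    return pwm

def log_prob(x, pwm):
    """Log of the product of per-locus probabilities of x under the PWM.
    Symbols outside ACGT (e.g. gaps) are skipped."""
    return sum(math.log(col[n]) for n, col in zip(x, pwm) if n in col)

def classify_max_pwm(x, species_seqs):
    """MAX-PWM: species whose PWM gives x the highest probability."""
    pwms = {s: build_pwm(seqs) for s, seqs in species_seqs.items()}
    return max(pwms, key=lambda s: log_prob(x, pwms[s]))

# Toy data (made up): species name -> list of aligned barcodes.
species_seqs = {"sp1": ["ACGTAC", "ACGTAA"], "sp2": ["TTGTAC", "TTGTCC"]}
print(classify_max_pwm("TTGTCC", species_seqs))
```

In practice the PWMs would be built once per species and reused for every query rather than rebuilt inside the classifier.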

16
Character-based pairwise species discrimination

Given species S1, S2 and a new barcode x, we find
the k most discriminating characters
A locus -> character
Nucleotides -> possible values for the character
Idea: if at a given locus there is a nucleotide
that appears in S1 and not in S2, and x contains
that nucleotide at that locus -> x is more likely
to belong to S1 than to S2

17
Character-based pairwise species discrimination

The two species (red, blue) are discriminated by
character i with 100% accuracy
The nucleotide present at position i in the new
barcode x tells us to which species x is more
likely to belong
i is a pure character (no nucleotide appears
in both species)

[Figure: nucleotides at character i across the barcodes of the two species: A A C C C T T T G G; w(i) = 1]
18
Character-based pairwise species discrimination

The two species (red, blue) are discriminated by
character i with 90% accuracy
If the new barcode x has a C, T or G at i, we guess
the species of x correctly
If the new barcode x has an A at i, then we choose
as the species of x the one containing the
highest number of As at i (the red species)

[Figure: nucleotides at character i across the barcodes of the two species: A A C C C A T T G G; w(i) = 0.9]
19
Character-based pairwise species discrimination

Finding the k most discriminative characters
The discriminative power w(i) of character i is given
by [formula image not preserved in the transcript], where
Cnt(i,X,S1) is the number of times we see
nucleotide X at position i in species S1
Size(S1) is the number of barcodes in species S1
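
Since the formula itself did not survive in the transcript, the sketch below uses one plausible definition, stated here as an assumption: w(i) is the fraction of training barcodes from the two species that would be classified correctly by looking only at character i and choosing the species with the larger count of the observed nucleotide. This reproduces the values w(i) = 1 and w(i) = 0.9 from the two preceding examples if the ten nucleotides are read as five per species, which is also an assumption about the figures.

```python
from collections import Counter

def discriminative_power(i, s1, s2):
    """Assumed w(i): sum over nucleotides X of max(Cnt(i,X,S1), Cnt(i,X,S2)),
    divided by Size(S1) + Size(S2)."""
    c1 = Counter(seq[i] for seq in s1)
    c2 = Counter(seq[i] for seq in s2)
    correct = sum(max(c1[n], c2[n]) for n in set(c1) | set(c2))
    return correct / (len(s1) + len(s2))

# Column i of each species, written as one-character "barcodes".
red = list("AACCC")
blue_pure = list("TTTGG")   # no nucleotide shared with red
blue_mixed = list("ATTGG")  # shares an A with red

print(discriminative_power(0, red, blue_pure))
print(discriminative_power(0, red, blue_mixed))
```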

20
Character-based pairwise species discrimination

1. Given species S1, S2 and a new barcode x, find
the k most discriminating characters
2. Compute how many times species S1 is favored over
S2 and output the most favored species
3. Repeat steps 1 and 2 for all pairs of species and
the new barcode x
4. The species S that is favored the most in all
these pairwise discriminations is assigned to
barcode x
Threshold for new species
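
The pairwise procedure can be sketched as below. The scoring function `power` is an assumed definition of w(i) (fraction of training barcodes classified correctly from character i alone), and the handling of ties is a choice made for this sketch; neither is confirmed by the slides. The new-species threshold is not modeled.

```python
from collections import Counter
from itertools import combinations

def power(i, s1, s2):
    """Assumed discriminative power of character i for species s1 vs s2."""
    c1, c2 = Counter(q[i] for q in s1), Counter(q[i] for q in s2)
    return sum(max(c1[n], c2[n]) for n in set(c1) | set(c2)) / (len(s1) + len(s2))

def pairwise_winner(x, name1, s1, name2, s2, k):
    """Steps 1-2: pick the k best characters, count which species each favors."""
    best = sorted(range(len(x)), key=lambda i: power(i, s1, s2), reverse=True)[:k]
    favor1 = favor2 = 0
    for i in best:
        n1 = sum(q[i] == x[i] for q in s1)  # times x's nucleotide occurs in s1
        n2 = sum(q[i] == x[i] for q in s2)
        favor1 += n1 > n2
        favor2 += n2 > n1
    if favor1 == favor2:
        return None  # tie: no species favored for this pair
    return name1 if favor1 > favor2 else name2

def classify_k_best(x, species_seqs, k=10):
    """Steps 3-4: run all pairwise contests and return the most favored species."""
    wins = Counter()
    for a, b in combinations(species_seqs, 2):
        winner = pairwise_winner(x, a, species_seqs[a], b, species_seqs[b], k)
        if winner is not None:
            wins[winner] += 1
    return wins.most_common(1)[0][0] if wins else None

# Toy data (made up): species name -> list of aligned barcodes.
species_seqs = {"sp1": ["ACGTAC", "ACGTAA"],
                "sp2": ["TTGTAC", "TTGTCC"],
                "sp3": ["GGCCGG", "GGCCGA"]}
print(classify_k_best("TTGTAC", species_seqs, k=3))
```

The loop over all species pairs is quadratic in the number of species, which is consistent with k-BEST being the one method singled out in the conclusions as not scaling.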

21
Combining the Methods

Run every classifier independently
Simple Voting
The species returned by each classifier has a weight
of 1
The species with the most votes is the candidate
species
Weighted Voting
Each classifier has a different weight
The results we present are obtained using the
Simple Voting scheme
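
Both voting schemes fit in a few lines. The sketch below is illustrative: the function name, the dictionary-based interface and the example weights are assumptions, and ties are broken in favor of the species that received its first vote earliest.

```python
from collections import Counter

def combine_votes(predictions, weights=None):
    """predictions: classifier name -> predicted species.
    weights: optional classifier name -> weight; omitted means simple voting
    (every classifier counts as 1)."""
    tally = Counter()
    for name, species in predictions.items():
        tally[species] += 1.0 if weights is None else weights.get(name, 1.0)
    return tally.most_common(1)[0][0]

predictions = {"MIN-HD": "sp1", "MAX-PWM": "sp2", "10-BEST": "sp1"}
print(combine_votes(predictions))                        # simple voting
print(combine_votes(predictions, {"MAX-PWM": 3.0}))      # hypothetical weights
```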

22
Datasets (1)

Dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges
consisting of 1623 aligned sequences classified into 150
species, with each sequence consisting of 590
nucleotides on average.
We randomly deleted from each species 10 to 50
percent of the sequences
Deleted seq -> test
Remaining seq -> train
Every species is represented in the training
dataset
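
The per-species split can be sketched as follows. The rounding rule and the fixed random seed are assumptions; the only constraint taken from the slides is that every species keeps at least one barcode in the training set.

```python
import random

def split_per_species(species_seqs, fraction, seed=0):
    """Move roughly `fraction` of each species' barcodes to the test set,
    always leaving at least one barcode per species for training."""
    rng = random.Random(seed)
    train, test = {}, []
    for species, seqs in species_seqs.items():
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        n_test = max(0, min(round(fraction * len(seqs)), len(seqs) - 1))
        test.extend((s, species) for s in shuffled[:n_test])
        train[species] = shuffled[n_test:]
    return train, test

# Toy data (made up): ten barcodes for sp1, a single barcode for sp2.
data = {"sp1": ["ACGT" + c + d for c in "AC" for d in "ACGTA"], "sp2": ["TTTTTT"]}
train, test = split_per_species(data, 0.3)
print(len(train["sp1"]), len(train["sp2"]), len(test))
```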

23

Classification Accuracy (in %) (DAWG train dataset)

Columns: percentage of barcodes removed from each species and used for testing

Classifier      10    20    30    40    50
MIN-HD        98.8  98.0  97.8  97.2  96.0
AVG-HD        97.2  97.2  96.6  96.2  95.6
MAX-AA-SIM    99.0  99.0  99.2  98.4  96.8
AVG-AA-SIM    94.6  94.2  94.8  94.2  93.0
MAX-CS-SIM    98.2  98.2  98.6  97.6  97.4
MIN-3FREQ     94.6  93.8  94.2  92.0  92.4
MAX-PWM       98.0  98.6  97.8  95.4  94.6
10-BEST       98.6  97.0  97.6  96.2  96.2
COMBINED      99.4  99.4  99.6  98.6  98.0
24
Datasets (2)

Cowries dataset from (MeyerPaulay05)
We removed the species containing fewer than 4
barcodes
We randomly deleted from each species 10 to 50
percent of the sequences
Deleted seq -> test
Remaining seq -> train
We made sure that every species has at least
one barcode in the training set

25

Classification Accuracy (in %) (COWRIES dataset)

Columns: percentage of barcodes removed from each species and used for testing

Classifier      10    20    30    40    50
MIN-HD        96.6  96.0  96.2  96.4  96.3
AVG-HD        95.0  95.4  94.4  95.2  94.8
MAX-AA-SIM    96.4  95.2  95.6  95.8  96.2
AVG-AA-SIM    93.8  94.0  92.6  92.8  92.8
MAX-CS-SIM    96.2  95.6  95.6  96.0  95.6
MIN-3FREQ     89.2  90.1  89.4  89.0  89.0
MAX-PWM       91.2  91.4  90.4  90.8  90.4
10-BEST       92.6  91.4  91.2  91.2  91.8
COMBINED      96.6  96.4  96.2  96.0  96.2
26
Datasets (3)

To test the accuracy of new species detection
and classification, we devised a
leave-one-out procedure:
Delete a whole species
Randomly delete from each remaining species 0 to
50 percent of the barcodes
Deleted seq -> test
Remaining seq -> train
The following table gives average accuracy
results over 150 x 6 different test cases
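
The species-level leave-one-out loop can be sketched as below. For brevity it models only the first step (holding out a whole species); the additional 0 to 50 percent deletion from the remaining species is omitted. The convention that a classifier returns `None` for "new species" and the helper names are assumptions of this sketch.

```python
def leave_one_species_out(species_seqs):
    """Yield (held_out, train, test): the held-out species' barcodes form the
    test set and are expected to be flagged as new (expected label None)."""
    for held_out in species_seqs:
        train = {s: seqs for s, seqs in species_seqs.items() if s != held_out}
        test = [(seq, None) for seq in species_seqs[held_out]]
        yield held_out, train, test

def accuracy(classifier, train, test):
    """Percentage of test barcodes for which the classifier's answer
    matches the expected label."""
    correct = sum(classifier(x, train) == expected for x, expected in test)
    return 100.0 * correct / len(test)

# Toy data (made up) and a trivial stand-in classifier that always says "new".
data = {"sp1": ["ACGTAC", "ACGTAA"], "sp2": ["TTGTAC"], "sp3": ["GGCCGG"]}
always_new = lambda x, train: None
for held_out, train, test in leave_one_species_out(data):
    print(held_out, accuracy(always_new, train, test))
```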

27
Leave-one-out Accuracy (in %) (DAWG train dataset)

Columns: percentage of additional barcodes removed from each species and used for testing

Classifier       0    10    20    30    40    50
MIN-HD        80.9  91.7  92.8  91.6  90.3  88.4
AVG-HD        81.1  91.5  92.3  91.0  89.9  87.8
MAX-AA-SIM    83.4  82.7  82.9  80.2  78.4  74.8
AVG-AA-SIM    83.1  89.5  89.3  88.8  88.3  88.2
MAX-CS-SIM    94.3  94.4  94.0  92.9  91.7  89.7
MIN-3FREQ     82.9  70.3  69.6  67.8  65.8  63.0
MAX-PWM       91.2  91.7  91.6  89.8  88.0  85.4
10-BEST       93.3  94.7  93.8  92.6  91.6  89.6
COMBINED      93.7  97.6  97.8  97.8  97.4  97.0
28
Conclusions

Every method shows a tradeoff between new species
detection and classification accuracy
Hamming distance performs very well when no new
species are present, but its accuracy gets
worse for new species detection
The combined method yields good accuracy on both
sequence classification and new species detection
The runtimes of all methods are within the same
order of magnitude
All methods except k-BEST are scalable
to millions of species

29
Ongoing Work

Confidence measures for all the methods
Further investigation of threshold selection and
weighting schemes
Scalable character-based method
Ignoring parts of the given sequences
could possibly improve accuracy. Are there redundant/noisy
regions?
New species clustering: determining the different
new species present
