DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers

1
DNA Barcode Data Analysis: Boosting Assignment
Accuracy by Combining Distance- and
Character-Based Classifiers
Bogdan Pasaniuc, Sotirios Kentros and Ion Mândoiu
Computer Science & Engineering Department,
University of Connecticut
2
Outline
  • Motivation & Problem Definition
  • Methods used
  • Hamming Distance (MIN-HD and AVG-HD)
  • Amino acid Similarity (MAX-AA-SIM and AVG-AA-SIM)
  • Convex-score similarity (MAX-CS-SIM)
  • Trinucleotide frequency (MIN-3FREQ)
  • Positional weight matrix (MAX-PWM)
  • Character-based pairwise species discrimination
    (k-BEST)
  • Combining the Methods
  • Results
  • Species Classification
  • New Species Recognition
  • Future Work & Conclusions

3
Motivation
  • DNA barcoding was recently proposed as a tool for
    differentiating species
  • Goal: to make a fingerprint for species, using
    a short sequence of DNA
  • Assumption: high interspecies variability while
    retaining low intraspecies sequence variability
  • Barcode of choice: cytochrome c oxidase subunit 1
    mitochondrial region ("COI", 648 base pairs long).

4
Problem definition
  • The scope of our project is to explore whether
    combining simple classification methods can
    increase the classification accuracy.
  • We address two problems:
  • Classification of barcodes given a training set
    of species.
  • Identification of barcodes that belong to new
    species.
  • Assumption: all the barcode DNA sequences are
    aligned

5
Problem definition (1)
  • Scenario
  • Barcodes with known species
  • New barcode
  • Species Classification
  • INPUT: a set S of barcodes for which the species
    is known, and a new barcode x
  • OUTPUT: the most likely species of x

6
Problem definition (2)
  • Species Classification and New Species Detection
  • INPUT: a set S of barcodes for which the species
    is known, and a new barcode x
  • OUTPUT: the most likely species of x, or a
    determination that x is likely to belong to a
    new species

7
Distance based methods
  • Find a distance between barcodes that is able
    to distinguish between species
  • Low intraspecies variability
  • High interspecies variability
  • Hamming Distance
  • Amino acid Similarity
  • Convex-score similarity
  • Trinucleotide frequency

8
Methods
[Figure: a new barcode x and its distances d(x,S1), d(x,S2), ..., d(x,Sn) to species S1, S2, ..., Sn]
  • d(x,Si) = minimum of d(x,y) over sequences y
    belonging to species Si
  • Minimum Method Classifier
  • d(x,Si) = average of d(x,y) over sequences y
    belonging to species Si
  • Average Method Classifier

9
Hamming Distance
  • Percentage of base pairs at which two barcodes
    differ
  • Average
  • Given barcode x, find the species S such that the
    average Hamming distance from x to y (y in S)
    is minimized
  • species(x) = S.
  • Minimum
  • Given barcode x, find the barcode y that minimizes
    the Hamming distance from x to y
  • species(x) = species(y)
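
The two Hamming-distance classifiers can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(barcode, species)` training format, and the toy sequences are assumptions, and gaps or ambiguous bases are compared like any other character.

```python
from collections import defaultdict

def hamming(x, y):
    """Fraction of aligned positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("barcodes must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def group_by_species(training):
    """Turn (barcode, species) pairs into {species: [barcodes]}."""
    groups = defaultdict(list)
    for barcode, species in training:
        groups[species].append(barcode)
    return groups

def min_hd(x, training):
    """MIN-HD: species of the single closest training barcode."""
    return min(training, key=lambda pair: hamming(x, pair[0]))[1]

def avg_hd(x, training):
    """AVG-HD: species with the smallest mean distance to x."""
    groups = group_by_species(training)
    def mean_distance(species):
        seqs = groups[species]
        return sum(hamming(x, y) for y in seqs) / len(seqs)
    return min(groups, key=mean_distance)

# Toy aligned barcodes (hypothetical data).
training = [("ACGTAC", "sp1"), ("ACGTAA", "sp1"),
            ("TTGTCC", "sp2"), ("TTGTCA", "sp2")]
print(min_hd("ACGTAT", training), avg_hd("ACGTAT", training))
```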

10
Amino acid Similarity
  • Genetic code
  • rules that map DNA sequences to proteins
  • Codon: trinucleotide unit that encodes one
    amino acid
  • Divide the DNA sequence into codons and substitute
    each one by its corresponding amino acid
  • BLOSUM62 (BLOcks SUbstitution Matrix)
  • 20x20 matrix that gives a score for each pair of
    amino acids, reflecting how readily one
    substitutes for the other
  • The higher the score, the less likely the
    substitution changes the function of the protein

11
Amino acid Similarity
  • Measures how similar the two amino acid sequences
    encoded by the barcodes are
  • Similarity(x,y)
  • barcodes x, y -> amino acid sequences x', y'
    (using the genetic code)
  • Score of the amino acid alignment using
    BLOSUM62
  • Average
  • Find the species with maximum average similarity
  • Maximum
  • Find the barcode with maximum similarity
12
Convex-score Similarity
  • Long runs of consecutive base pair matches
    indicate that the encoded amino acid sequence
    plays an important role -> the two barcodes are
    close in evolutionary distance
  • The longer the run of base pair matches, the
    higher the score
  • The contribution of a run is convexly increasing
    with its length
  • The new sequence is assigned to the species
    containing the highest-scoring sequence
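
One way to realize such a score is sketched below. The slides only say that a run's contribution is convex in its length; squaring the run length is an assumed choice for illustration, and the function names are hypothetical.

```python
def convex_score(x, y, contribution=lambda n: n * n):
    """Sum a convex function of the length of each maximal run of
    consecutive matching positions between aligned barcodes x and y."""
    total, run = 0, 0
    for a, b in zip(x, y):
        if a == b:
            run += 1
        else:
            total += contribution(run)  # close the current run
            run = 0
    return total + contribution(run)    # close the final run

def max_cs_sim(x, training):
    """MAX-CS-SIM: species of the highest-scoring training barcode.
    training is a list of (barcode, species) pairs."""
    return max(training, key=lambda pair: convex_score(x, pair[0]))[1]

# Runs of length 3 and 2 give 3*3 + 2*2.
print(convex_score("ACGTAC", "ACGAAC"))
```

With a convex contribution, one long run scores higher than the same number of matches scattered across several short runs, which is the behaviour the slide describes.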

13
Trinucleotide Distance
  • For each species compute the vector of
    trinucleotide frequencies
  • For the new sequence x we compute the vector of
    trinucleotide frequencies
  • Find the closest species.
  • To measure the distance between two vectors of
    frequencies we use the mean square distance
14
New species detection
  • Distance-based Methods
  • Find the most likely species S for the new
    barcode
  • Compute the largest distance between two barcodes
    in S, dist(S)
  • If the distance from the new barcode to S is
    higher than dist(S), then the new barcode is
    likely to belong to a new species
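
The threshold rule can be sketched like this, using the Hamming distance and the minimum-distance variant of d(x,S). Both choices are assumptions made for illustration; the same rule applies to any of the distance-based methods. Returning `None` to signal a likely new species is also a convention of this sketch.

```python
from itertools import combinations

def hamming(x, y):
    """Fraction of aligned positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def diameter(barcodes):
    """dist(S): largest pairwise distance inside a species
    (0.0 when the species has a single barcode)."""
    return max((hamming(a, b) for a, b in combinations(barcodes, 2)), default=0.0)

def classify_or_new(x, species_barcodes):
    """Return the closest species, or None if x is farther from that
    species than any two of its members are from each other."""
    dist_to = {s: min(hamming(x, y) for y in seqs)
               for s, seqs in species_barcodes.items()}
    best = min(dist_to, key=dist_to.get)
    if dist_to[best] > diameter(species_barcodes[best]):
        return None
    return best

species = {"sp1": ["AAAAAA", "AAAAAT"], "sp2": ["CCCCCC", "CCCCCG"]}
print(classify_or_new("AAAATA", species), classify_or_new("GGGGGG", species))
```

Note that a species with a single training barcode has dist(S) = 0, so under this sketch any non-identical barcode assigned to it is flagged as new.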

15
Positional weight matrix
  • We assume independence of loci
  • For each species compute a positional weight
    matrix (PWM)
  • For each locus the PWM gives the probability of
    seeing each nucleotide at that locus in that
    species
  • For a barcode x we can compute the probability
    that x belongs to species S as the product of the
    probabilities of observing at every locus the
    respective nucleotide in x
  • Assign x to the species that gives the highest
    probability
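
The PWM classifier might look like the sketch below. Two details are assumptions not stated on the slide: a pseudocount is added so that an unseen nucleotide does not force the whole product to zero, and the product is evaluated as a sum of logarithms to avoid underflow over hundreds of loci.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(barcodes, pseudocount=1.0):
    """One {nucleotide: probability} dict per locus for a single species."""
    pwm = []
    for i in range(len(barcodes[0])):
        counts = Counter(seq[i] for seq in barcodes)
        total = sum(counts[n] for n in ALPHABET) + pseudocount * len(ALPHABET)
        pwm.append({n: (counts[n] + pseudocount) / total for n in ALPHABET})
    return pwm

def log_prob(x, pwm):
    """Log of the product of per-locus probabilities (loci assumed
    independent); characters outside ACGT are skipped."""
    return sum(math.log(col[n]) for n, col in zip(x, pwm) if n in col)

def max_pwm(x, species_barcodes):
    """MAX-PWM: species under which x is most probable."""
    pwms = {s: build_pwm(seqs) for s, seqs in species_barcodes.items()}
    return max(pwms, key=lambda s: log_prob(x, pwms[s]))

data = {"sp1": ["ACGT", "ACGA"], "sp2": ["TTTT", "TTTA"]}
print(max_pwm("ACGT", data))
```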

16
Character-based pairwise species discrimination
  • Given species S1, S2 and new barcode x we find
    the k most discriminating characters
  • A locus -> character
  • Nucleotides -> possible values for character
  • Idea: if, at a given locus, there is a nucleotide
    that appears in S1 and not in S2, and x contains
    that nucleotide at that locus -> x is more likely
    to belong to S1 than to S2

17
Character-based pairwise species discrimination
  • The two species (red, blue) are discriminated by
    character i with 100% accuracy
  • The nucleotide present at position i in the new
    barcode x tells us to which species x is more
    likely to belong
  • i is a pure character (there is no nucleotide
    appearing in both species)

[Figure: nucleotides at character i in the barcodes of the two species: A A C C C T T T G G; w(i) = 1]
18
Character-based pairwise species discrimination
  • The two species (red, blue) are discriminated by
    character i with 90% accuracy
  • If the new barcode x has a C, T, or G at i, we
    guess the species of x correctly
  • If the new barcode x has an A at i, then we choose
    the species of x as the species containing the
    highest number of As at i (red sp.)

[Figure: nucleotides at character i in the barcodes of the two species: A A C C C A T T G G; w(i) = 0.9]
19
Character-based pairwise species discrimination
  • Finding the k most discriminative characters
  • The discriminative power w(i) of character i is
    computed from
  • Cnt(i,X,S1) - the number of times we see
    nucleotide X at position i in species S1
  • Size(S1) - number of barcodes in species S1
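
The formula itself is not preserved in this transcript. One expression that is consistent with both worked examples (w(i) = 1 and w(i) = 0.9) and with the quantities defined above is the fraction of training barcodes that are classified correctly when each nucleotide at locus i is attributed to the species in which it occurs more often. This is a reconstruction under that assumption, not necessarily the authors' definition:

```latex
w(i) \;=\; \frac{\displaystyle\sum_{X \in \{A,C,G,T\}} \max\bigl(\mathrm{Cnt}(i,X,S_1),\ \mathrm{Cnt}(i,X,S_2)\bigr)}
               {\mathrm{Size}(S_1) + \mathrm{Size}(S_2)}
```

For the 90% example, assuming the ten nucleotides split as A A C C C for one species and A T T G G for the other, the numerator is max(2,1) + max(3,0) + max(0,2) + max(0,2) = 9, so w(i) = 9/10 = 0.9.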

20
Character-based pairwise species discrimination
  1. Given species S1, S2 and new barcode x, we find
    the k most discriminating characters
  2. Compute how many times species S1 is favored over
    S2 and output the most favored species
  3. Repeat steps 1 and 2 for all pairs of species and
    the new barcode x
  4. The species S that is favored the most in all
    these pairwise discriminations is assigned to
    barcode x
  5. Threshold for new species
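
Steps 1 to 4 can be sketched as below. The weight function is an assumption: it scores a locus by the fraction of training barcodes classified correctly when each nucleotide is attributed to the species where it is more frequent, which reproduces the 100% and 90% examples but may differ from the authors' definition. Tie handling is also assumed, and the new-species threshold of step 5 is left out.

```python
from collections import Counter
from itertools import combinations

def char_weight(i, s1, s2):
    """Assumed discriminative power of locus i for species s1 vs s2."""
    c1 = Counter(seq[i] for seq in s1)
    c2 = Counter(seq[i] for seq in s2)
    return sum(max(c1[n], c2[n]) for n in set(c1) | set(c2)) / (len(s1) + len(s2))

def pairwise_winner(x, name1, s1, name2, s2, k):
    """Steps 1-2: vote over the k best loci; None on a tie."""
    loci = sorted(range(len(x)), key=lambda i: char_weight(i, s1, s2),
                  reverse=True)[:k]
    votes1 = votes2 = 0
    for i in loci:
        n1 = sum(seq[i] == x[i] for seq in s1)  # how often s1 shows x's nucleotide
        n2 = sum(seq[i] == x[i] for seq in s2)
        if n1 > n2:
            votes1 += 1
        elif n2 > n1:
            votes2 += 1
    if votes1 == votes2:
        return None
    return name1 if votes1 > votes2 else name2

def k_best(x, species_barcodes, k=10):
    """Steps 3-4: species that wins the most pairwise discriminations."""
    wins = Counter()
    for a, b in combinations(species_barcodes, 2):
        winner = pairwise_winner(x, a, species_barcodes[a], b, species_barcodes[b], k)
        if winner is not None:
            wins[winner] += 1
    return wins.most_common(1)[0][0] if wins else None

data = {"red": ["AAC", "AAC"], "blue": ["TTG", "TTG"], "green": ["CCA", "CCA"]}
print(k_best("AAC", data, k=2))
```

Because every pair of species is compared, the cost grows quadratically with the number of species.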

21
Combining the Methods
  • Run every classifier independently
  • Simple Voting
  • The species returned by each classifier has a
    weight of 1
  • The species with the most votes is the candidate
    species
  • Weighted Voting
  • Each classifier has a different weight
  • The results we present are obtained using the
    Simple Voting Scheme
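
Both voting schemes fit in a few lines. In this sketch the function names are hypothetical and ties are broken in favor of the species that was voted for first, which is an assumption; the slides do not specify a tie-breaking rule.

```python
from collections import Counter

def simple_vote(predictions):
    """Each classifier's predicted species counts as one vote."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Each classifier's vote counts with that classifier's weight."""
    totals = Counter()
    for species, weight in zip(predictions, weights):
        totals[species] += weight
    return totals.most_common(1)[0][0]

# Hypothetical outputs of three classifiers for one barcode.
print(simple_vote(["sp1", "sp2", "sp1"]))
print(weighted_vote(["sp1", "sp2", "sp2"], [3.0, 1.0, 1.0]))
```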

22
Datasets (1)
  • Dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges
    consisting of 1623 aligned sequences classified
    into 150 species, with each sequence consisting
    of 590 nucleotides on average.
  • We randomly deleted from each species 10 to 50
    percent of the sequences
  • Deleted seq. -> test
  • Remaining seq. -> train
  • Every species is represented in the training
    dataset
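
The per-species split can be sketched as follows. The rounding rule and the way at least one barcode is kept for training are assumptions; the slides only state that every species stays represented in the training set.

```python
import random

def split_species(barcodes, fraction, rng):
    """Move roughly `fraction` of a species' barcodes to the test set,
    always leaving at least one barcode for training."""
    n_test = max(0, min(round(fraction * len(barcodes)), len(barcodes) - 1))
    shuffled = barcodes[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def split_dataset(species_barcodes, fraction, seed=0):
    """Return ({species: train barcodes}, [(test barcode, true species)])."""
    rng = random.Random(seed)
    train, test = {}, []
    for species, seqs in species_barcodes.items():
        kept, removed = split_species(seqs, fraction, rng)
        train[species] = kept
        test.extend((b, species) for b in removed)
    return train, test

# Hypothetical toy data: one species with 10 barcodes, one with a single barcode.
data = {"sp1": ["seq%d" % i for i in range(10)], "sp2": ["only"]}
train, test = split_dataset(data, 0.5)
print(len(train["sp1"]), len(train["sp2"]), len(test))
```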

23

Classification Accuracy (in %) (DAWG train dataset)

Columns: percentage of barcodes removed from each species and used for testing

Classifier      10     20     30     40     50
MIN-HD        98.8   98.0   97.8   97.2   96.0
AVG-HD        97.2   97.2   96.6   96.2   95.6
MAX-AA-SIM    99.0   99.0   99.2   98.4   96.8
AVG-AA-SIM    94.6   94.2   94.8   94.2   93.0
MAX-CS-SIM    98.2   98.2   98.6   97.6   97.4
MIN-3FREQ     94.6   93.8   94.2   92.0   92.4
MAX-PWM       98.0   98.6   97.8   95.4   94.6
10-BEST       98.6   97.0   97.6   96.2   96.2
COMBINED      99.4   99.4   99.6   98.6   98.0
24
Datasets (2)
  • Cowries dataset from (MeyerPaulay05)
  • We removed the species containing fewer than 4
    barcodes
  • We randomly deleted from each species 10 to 50
    percent of the sequences
  • Deleted seq. -> test
  • Remaining seq. -> train
  • We made sure that every species has at least
    one barcode in the training set

25

Classification Accuracy (in %) (COWRIES dataset)

Columns: percentage of barcodes removed from each species and used for testing

Classifier      10     20     30     40     50
MIN-HD        96.6   96.0   96.2   96.4   96.3
AVG-HD        95.0   95.4   94.4   95.2   94.8
MAX-AA-SIM    96.4   95.2   95.6   95.8   96.2
AVG-AA-SIM    93.8   94.0   92.6   92.8   92.8
MAX-CS-SIM    96.2   95.6   95.6   96.0   95.6
MIN-3FREQ     89.2   90.1   89.4   89.0   89.0
MAX-PWM       91.2   91.4   90.4   90.8   90.4
10-BEST       92.6   91.4   91.2   91.2   91.8
COMBINED      96.6   96.4   96.2   96.0   96.2
26
Datasets (3)
  • In order to test the accuracy of new species
    detection and classification we devised a
    leave-one-out procedure
  • delete a whole species
  • randomly delete from each remaining species 0 to
    50 percent of the barcodes
  • Deleted seq. -> test
  • Remaining seq. -> train
  • The following table gives accuracy results
    averaged over 150 x 6 different test cases
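
A sketch of how such test cases could be generated is shown below. It reads "150 x 6" as one case per held-out species and per removal percentage (0, 10, ..., 50), which is an inference from the table header rather than something the slides spell out. Labelling held-out barcodes with `None` (meaning "new species") is a convention of this sketch.

```python
import random

FRACTIONS = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)

def leave_one_species_out(species_barcodes, fractions=FRACTIONS, seed=0):
    """Yield (held_out, fraction, train, test) for every held-out species
    and every removal fraction. test holds (barcode, expected label)
    pairs, where None stands for 'new species'."""
    rng = random.Random(seed)
    for held_out in species_barcodes:
        for fraction in fractions:
            train = {}
            # Every barcode of the held-out species should be flagged as new.
            test = [(b, None) for b in species_barcodes[held_out]]
            for species, seqs in species_barcodes.items():
                if species == held_out:
                    continue
                shuffled = seqs[:]
                rng.shuffle(shuffled)
                # Keep at least one training barcode per remaining species.
                n_test = max(0, min(round(fraction * len(seqs)), len(seqs) - 1))
                train[species] = shuffled[n_test:]
                test.extend((b, species) for b in shuffled[:n_test])
            yield held_out, fraction, train, test

# Hypothetical toy data: 3 species with 4 barcodes each -> 3 x 6 = 18 cases.
toy = {s: ["%s_%d" % (s, i) for i in range(4)] for s in ("sp1", "sp2", "sp3")}
print(sum(1 for _ in leave_one_species_out(toy)))
```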

27
Leave-one-out Accuracy (in %) (DAWG train dataset)

Columns: percentage of additional barcodes removed from each species and used for testing

Classifier       0     10     20     30     40     50
MIN-HD        80.9   91.7   92.8   91.6   90.3   88.4
AVG-HD        81.1   91.5   92.3   91.0   89.9   87.8
MAX-AA-SIM    83.4   82.7   82.9   80.2   78.4   74.8
AVG-AA-SIM    83.1   89.5   89.3   88.8   88.3   88.2
MAX-CS-SIM    94.3   94.4   94.0   92.9   91.7   89.7
MIN-3FREQ     82.9   70.3   69.6   67.8   65.8   63.0
MAX-PWM       91.2   91.7   91.6   89.8   88.0   85.4
10-BEST       93.3   94.7   93.8   92.6   91.6   89.6
COMBINED      93.7   97.6   97.8   97.8   97.4   97.0
28
Conclusions
  • Every method shows a tradeoff between new species
    detection and classification accuracy
  • Hamming distance performs very well when no new
    species are present but the accuracy results get
    worse for new species detection
  • The combined method yields good accuracy both on
    seq. classification and new species detection
  • The runtime of all methods is within the same
    order of magnitude
  • All methods except k-BEST are scalable
    to millions of species

29
Ongoing Work
  • Confidence measures for all the methods
  • Further investigate threshold selection and
    weighting schemes
  • Scalable character-based method
  • Possibly, ignoring parts of the given sequences
    could improve accuracy. Are there redundant/noisy
    regions?
  • New species clustering: determining the different
    new species present