Prediction of human mRNA donor and acceptor sites from the DNA sequence

Provided by: Barb403
1
Prediction of human mRNA donor and acceptor sites
from the DNA sequence
  • Article by S. Brunak, J. Engelbrecht, S. Knudsen

2
Outline
  • Reminder: donor and acceptor sites
  • Reminder: standard feedforward networks
  • Experiments
  • Another classical alternative: rule insertion and
    extraction
  • Discussion

3
Reminder: donor and acceptor sites
4
Donor and acceptor sites
  • DNA is copied to (pre-m)RNA; non-coding regions
    (introns) are spliced out within the nucleus,
    coding regions (exons) form the gene
  • donor: exon/intron boundary; acceptor:
    intron/exon boundary
  • It can be expected that splice sites can be
    predicted up to a certain accuracy from a
    local window around the candidate splice site
    only
  • We have already seen that SVMs achieve good
    classification accuracy

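[Figure: consensus sequences around an intron; the number after each base is its frequency in percent. Donor site: A64 G73 G100 T100 G62 A68 G84 T63; branch site; 18-40 bp pyrimidines, i.e. T, C; acceptor site: C65 A100 G100]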
5
Donor and acceptor sites
  • seminal paper which establishes neural networks
    as a very good method for this task
  • previous alternatives: finite automata (<50%);
    scoring schemes based on nucleotide weight
    tables, open reading frames, free energy of mRNA
    and snRNP, coding statistics (80%); previous
    FNNs (<95%)
  • Data from GenBank (annotated collection of all
    publicly available DNA,
    http://www.ncbi.nlm.nih.gov/Genbank): only human
    splice sites with sufficient information →
    449 donor sites, 449 acceptor sites; 2/3
    training set, 1/3 test set; sequences of
    training/test set have small overlap; negatives
    via shifting the window
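Generating negatives "via shifting the window" can be sketched as follows. This is only an illustration, not the authors' pipeline: the function name `make_examples`, the toy sequence, the site position and the window half-width are all assumptions made for the example.

```python
def make_examples(seq, site_positions, half_width):
    """Cut a window around every admissible position.

    A window centred on an annotated site is labelled 1; every other
    (shifted) window is labelled 0 and serves as a negative example.
    """
    sites = set(site_positions)
    examples = []
    for center in range(half_width, len(seq) - half_width):
        window = seq[center - half_width:center + half_width + 1]
        examples.append((window, 1 if center in sites else 0))
    return examples


# Toy sequence with one hypothetical donor site at index 5 (the G of "GT").
toy = "ACAAGGTAAGTC"
examples = make_examples(toy, [5], half_width=2)
positives = [w for w, label in examples if label == 1]
negatives = [w for w, label in examples if label == 0]
```

Every true site yields one positive and many negatives, which is where the strong class imbalance of the test sets comes from.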

6
Reminder: standard feedforward networks
7
Standard feedforward networks
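.. are based on simple neurons
[Figure: a neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n and threshold θ, computing]
$\sigma(w^T x - \theta)$
with $\sigma(t) = \mathrm{sgd}(t) = (1 + e^{-t})^{-1}$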
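A single neuron of this kind, i.e. the logistic function applied to the weighted input sum minus a threshold, can be written in a few lines. The names `sgd` and `neuron` and the numeric values are chosen for illustration only.

```python
import math


def sgd(t):
    """Logistic sigmoid: sgd(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def neuron(weights, inputs, theta):
    """Output of one neuron: sgd(w^T x - theta)."""
    activation = sum(w * x for w, x in zip(weights, inputs)) - theta
    return sgd(activation)


# Weighted sum is 1.0 + 0.5 = 1.5, equal to the threshold, so the
# neuron sits exactly at the midpoint of the sigmoid.
out = neuron([1.0, -2.0, 0.5], [1.0, 0.0, 1.0], theta=1.5)
```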
8
Standard feedforward networks
.. combine the neurons in a network architecture
[Figure: network mapping an input x to an output y]
$f_w: \mathbb{R}^n \to \mathbb{R}^o$
9
Standard feedforward networks
  • .. can be trained efficiently
  • goal: learn an unknown $f: \mathbb{R}^n \to \mathbb{R}^o$ given examples
    $(x_1, f(x_1)), \ldots, (x_m, f(x_m))$
  • Training:
  • Choose an architecture (n input neurons, o output
    neurons, number of hidden neurons determined by
    trial and error)
  • Choose the weights (regression such that the
    examples from the training set are matched as
    accurately as possible)
  • Test the resulting function on the test set
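The "choose the weights" step can be sketched with plain gradient descent on a tiny network. This is a toy under stated assumptions, not the setup used in the paper: one hidden layer, a single output, squared error, online updates, and the AND function as stand-in data. All names and hyperparameters are illustrative.

```python
import math
import random


def sgd(t):
    return 1.0 / (1.0 + math.exp(-t))


def forward(x, W1, b1, W2, b2):
    """Return hidden activations and the single network output."""
    hidden = [sgd(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    out = sgd(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, out


def loss(data, params):
    """Sum of squared errors over the data set."""
    return sum((forward(x, *params)[1] - y) ** 2 for x, y in data)


def train(data, n_hidden=2, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
          for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in data:
            hidden, out = forward(x, W1, b1, W2, b2)
            # Gradient of 0.5 * (out - y)^2 w.r.t. the output pre-activation.
            d_out = (out - y) * out * (1.0 - out)
            for j, h in enumerate(hidden):
                # Backpropagate through the old W2[j] before updating it.
                d_h = d_out * W2[j] * h * (1.0 - h)
                W2[j] -= lr * d_out * h
                b1[j] -= lr * d_h
                for i, xi in enumerate(x):
                    W1[j][i] -= lr * d_h * xi
            b2 -= lr * d_out
    return W1, b1, W2, b2


data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
untrained = train(data, epochs=0)   # same seed, weights left at initialization
trained = train(data, epochs=2000)
```

In a real experiment the final check would of course use a held-out test set, as the last training step above says; here `loss(data, trained)` is only compared with `loss(data, untrained)` to show that the fit improves.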

10
Standard feedforward networks
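[Figure: a window of the sequence is fed into the network, which outputs e.g. (1,0)]
i.e. $f: \mathbb{R}^{4k} \to \mathbb{R}^2$ is to be learned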
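The input dimension 4k suggests that each of the k nucleotides in the window is represented by four inputs. A common way to do this is a one-hot (orthogonal) encoding, sketched below; the base order A, C, G, T and the function name `encode_window` are assumptions.

```python
# One input per possible base; exactly one of the four is set to 1.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_window(window):
    """Map a window of k nucleotides to a vector of length 4k."""
    vec = []
    for base in window:
        vec.extend(ONE_HOT[base])
    return vec


x = encode_window("AGGTA")  # k = 5, so the vector has 20 entries
```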
11
Experiments
12
Experiments
  • Donor sites: test set contains 118 donors, 190987
    non-donors
  • Evaluation: highly unbalanced distribution →
    Matthews correlation coefficient C(x) (in [-1,1])
  • ($P_x$: true positives, $P_x^f$: false positives, $N_x$: true
    negatives, $N_x^f$: false negatives)
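For reference, the Matthews correlation coefficient is the product of true positives and true negatives minus the product of false negatives and false positives, divided by the square root of (TN + FN)(TN + FP)(TP + FN)(TP + FP). A small sketch follows; the function name and the convention of returning 0 for an empty denominator are assumptions.

```python
import math


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient, a value in [-1, 1]."""
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denom == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fn * fp) / denom


# A classifier that predicts "non-donor" everywhere is right on more than
# 99.9% of the 118 + 190987 test windows, yet its correlation is 0.
always_negative = matthews_cc(tp=0, fp=0, tn=190987, fn=118)

# Counts quoted on the next slide: 111 of 118 donors and
# 11789 of 11800 non-donors classified correctly.
reported = matthews_cc(tp=111, fp=11, tn=11789, fn=7)
```

This is why plain accuracy is a poor measure here: it rewards the trivial classifier, while the correlation coefficient does not.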

13
Experiments
  • Value of C(x) for different sizes of the window
    (w) and different numbers of hidden neurons

111 of 118 donors correctly classified (94.1%)
11789 of 11800 non-donors correctly classified (99.9%)
14
Experiments
  • Effect of resetting the output cutoff level

Probability of misclassified donors compared to
the distance from a true site
15
Experiments
  • Acceptor sites: test set contains 118 acceptors,
    190987 non-acceptors

100 acceptors classified correctly (87.4%)
11800 non-acceptors classified correctly (99.8%)
optimum window size includes polypyrimidine tract
16
Experiments
  • Effect of resetting the output cutoff level

Probability of misclassified acceptors compared
to the distance from a true site
17
Experiments
comparison of methods on another human splice set
"GeneSplicer: a new computational method for
splice site recognition", Pertea, Lin, Salzberg,
Nucleic Acids Research 29(5):1216-1221, 2001
18
Experiments
comparison of methods on the UCI benchmark, from the
SVM paper (Sonnenburg et al.)
[Table annotations: some NN; rules]
19
Rule insertion and extraction
20
Rule insertion and extraction
  • seminal approach: Knowledge-Based Artificial
    Neural Networks (KBANN) by G. Towell, J. Shavlik
  • assumption: a set of approximate propositional,
    acyclic if/then rules (acyclic Horn clauses) is
    given (e.g. a :- b, c, d, e)
  • transfer it into an FNN

true/false are encoded as activations 1/0
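How a conjunctive rule such as "a if b, c, d, e" becomes a neuron can be sketched as follows. The scheme shown (weight ω for each positive antecedent, -ω for each negated one, threshold (P - 0.5)·ω for P positive antecedents) follows the usual description of KBANN; the value ω = 4 and all function names are assumptions made for this example.

```python
import math

OMEGA = 4.0  # assumed link weight for rule-derived connections


def rule_to_neuron(positive, negated):
    """Translate a conjunctive rule into weights and a threshold."""
    weights = {name: OMEGA for name in positive}
    weights.update({name: -OMEGA for name in negated})
    # The unit should switch on only if all positive antecedents are true
    # and no negated antecedent is true.
    theta = (len(positive) - 0.5) * OMEGA
    return weights, theta


def fire(weights, theta, truth):
    """Sigmoid output of the rule neuron for a 0/1 truth assignment."""
    t = sum(w * truth[name] for name, w in weights.items()) - theta
    return 1.0 / (1.0 + math.exp(-t))


# The rule  a :- b, c, d, e
weights, theta = rule_to_neuron(["b", "c", "d", "e"], [])
all_true = fire(weights, theta, {"b": 1, "c": 1, "d": 1, "e": 1})
one_false = fire(weights, theta, {"b": 1, "c": 1, "d": 1, "e": 0})
```

With all antecedents true the output lies above 0.5 (read as "true"); with one antecedent false it lies below 0.5.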
22
Rule insertion and extraction
  • find nice rules
  • transfer the rules into a network
  • extend the network by additional neurons (and
    possibly additional inputs) and initialize the
    additional weights with small values
  • find additional nice training examples, possibly
    not yet covered by the rules
  • train the network

23
Rule insertion and extraction
24
Rule insertion and extraction
25
Rule insertion and extraction
Training: backprop with cross-entropy error
$-\sum_i \left( (1 - y_i) \lg (1 - o_i) + y_i \lg o_i \right)$
Structure of the network and initialization improve the
performance. However, it might happen that FNNs
become better in the long run; rules might get
lost. Here: KBANN is better for acceptors, but
not for donors and the "neither" class.
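The cross-entropy error sums, over all outputs, the negative log-likelihood of the target under the network output. A minimal sketch, assuming targets y_i in {0, 1}, outputs o_i in (0, 1) and the natural logarithm (another base only rescales the error); the clipping constant `eps` is an addition for numerical safety.

```python
import math


def cross_entropy(targets, outputs, eps=1e-12):
    """-sum_i [(1 - y_i) * log(1 - o_i) + y_i * log(o_i)]."""
    total = 0.0
    for y, o in zip(targets, outputs):
        o = min(max(o, eps), 1.0 - eps)  # keep log() away from 0
        total -= (1 - y) * math.log(1 - o) + y * math.log(o)
    return total


confident = cross_entropy([1, 0], [0.9, 0.1])
hesitant = cross_entropy([1, 0], [0.6, 0.4])
```

Outputs closer to the targets give a smaller error, so `confident` is below `hesitant`.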
26
Experiments
27
Rule insertion and extraction
  • Extraction of rules:
  • each neuron represents a variable
  • the weights of each neuron are clustered into
    groups of nearly identical weights; irrelevant
    weights are pruned
  • m-of-n rules are extracted from each neuron via
    enumeration of (some of) the possibilities

if 2 of (2,8,9) and 2 of (1,3,5,7) and none of
(4,6) then on, or ...
This usually yields a large and incomprehensible set
of rules
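An m-of-n rule like the one above is easy to evaluate once the set of active inputs is known. The sketch below reads "2 of" as "at least 2 of", which is an assumption, and the function names are illustrative.

```python
def m_of_n(m, indices, active):
    """True if at least m of the listed inputs are active."""
    return sum(1 for i in indices if i in active) >= m


def none_of(indices, active):
    """True if none of the listed inputs is active."""
    return not any(i in active for i in indices)


def rule(active):
    """if 2 of (2,8,9) and 2 of (1,3,5,7) and none of (4,6) then on."""
    return (m_of_n(2, (2, 8, 9), active)
            and m_of_n(2, (1, 3, 5, 7), active)
            and none_of((4, 6), active))


fires = rule({2, 8, 1, 3})          # both groups satisfied, 4 and 6 inactive
blocked = rule({2, 8, 1, 3, 4})     # input 4 is active
too_few = rule({2, 1, 3})           # only one of (2,8,9) is active
```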
28
Rule insertion and extraction
  • Just a remark: there exists a very nice
    alternative from Duch et al. (not applied to
    splice sets) to infer logical rules with the help
    of NNs from scratch
  • MLP2LN

for each output class separately: rules of
increasing complexity
train s.t. a formula arises:
$E + \lambda_1 \sum_{i,j} w_{ij}^2 + \lambda_2 \sum_{i,j} w_{ij}^2 (w_{ij} - 1)^2 (w_{ij} + 1)^2$

[Figure: network with L units and R units]
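The two penalty terms added to the error E can be sketched as below: the first is ordinary weight decay, the second vanishes exactly at the weights 0, +1 and -1 and so pushes the network towards weights that can be read as logical rules. The function name and the flat weight list are assumptions; any constant factors used in the original formulation are not reproduced here.

```python
def mlp2ln_penalty(weights, lam1, lam2):
    """lam1 * sum(w^2) + lam2 * sum(w^2 * (w - 1)^2 * (w + 1)^2)."""
    decay = sum(w ** 2 for w in weights)
    snap = sum(w ** 2 * (w - 1) ** 2 * (w + 1) ** 2 for w in weights)
    return lam1 * decay + lam2 * snap


# Weights already in {0, +1, -1} cost nothing under the second term ...
logical = mlp2ln_penalty([0.0, 1.0, -1.0], lam1=0.0, lam2=1.0)
# ... while an intermediate weight such as 0.5 is penalized.
intermediate = mlp2ln_penalty([0.5], lam1=0.0, lam2=1.0)
```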
29
Discussion
30
Discussion
  • FNNs constitute a very simple and efficient tool
    for splice site recognition
  • interesting: the sets are often highly unbalanced,
    hence adequate error measures are important
    (correlation coefficient); even >99% correct
    prediction of non-splice sites might lead to a
    ratio of 5:1 (incorrectly predicted splice sites :
    correct splice sites)
  • prior knowledge is available, hence rule
    insertion is a striking alternative (with some
    benefits)
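The effect of the class imbalance on the predictions can be checked with a little arithmetic. The test set sizes (118 sites, 190987 non-sites) are taken from the experiments; the sensitivity and specificity values below are illustrative assumptions, not reported results.

```python
def false_to_true_ratio(n_pos, n_neg, sensitivity, specificity):
    """Expected false positives per correctly predicted site."""
    true_pos = n_pos * sensitivity
    false_pos = n_neg * (1.0 - specificity)
    return false_pos / true_pos


# Even with 99.7% of the non-sites rejected, roughly five false
# predictions remain for every correctly predicted site.
ratio = false_to_true_ratio(118, 190987, sensitivity=0.95, specificity=0.997)
```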