Title: Prediction of human mRNA donor and acceptor sites from the DNA sequence
1. Prediction of human mRNA donor and acceptor sites from the DNA sequence
- Article by S. Brunak, J. Engelbrecht, S. Knudsen
2. Outline
- Reminder: donor and acceptor sites
- Reminder: standard feedforward networks
- Experiments
- Another classical alternative: rule insertion and extraction
- Discussion
3. Reminder: donor and acceptor sites
4. Donor and acceptor sites
- DNA is copied to (pre-m)RNA; non-coding regions (introns) are spliced out within the nucleus; the coding regions (exons) form the gene
- donor: exon/intron boundary; acceptor: intron/exon boundary
- It can be expected that splice sites can be predicted up to a certain accuracy based only on a local window around the possible splice site
- We have already seen that SVMs achieve good classification accuracy
[Figure: splice site consensus. Donor: A64 G73 | G100 T100 G62 A68 G84 T63. Acceptor: branch site, 18-40 bp of pyrimidines (T, C), then C65 A100 G100 |. Subscripts give the percent frequency of each nucleotide.]
5. Donor and acceptor sites
- seminal paper which establishes neural networks as a very good method for this task
- previous alternatives: finite automata (<50%), scoring schemes based on nucleotide weight tables, open reading frames, free energy of mRNA and snRNP, coding statistics (80%), previous FNNs (<95%)
- Data from GenBank (annotated collection of all publicly available DNA, http://www.ncbi.nlm.nih.gov/Genbank): only human splice sites with sufficient information → 449 donor sites, 449 acceptor sites; 2/3 training set, 1/3 test set; sequences of training/test set have small overlap; negatives via shifting the window (sketched below)
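A sketch of how such positives and shifted negatives could be generated; this is a hypothetical helper, not the paper's actual preprocessing:

```python
def windows(seq, sites, w, shifts=range(-10, 11)):
    """Build (window, label) pairs around annotated splice positions.

    seq: DNA string; sites: indices of true splice sites; w: half-width.
    The window at the annotated position is labelled 1; shifted copies
    (label 0) provide the negatives, mirroring 'negatives via shifting'.
    """
    data = []
    for p in sites:
        for d in shifts:
            q = p + d
            if w <= q <= len(seq) - w:
                data.append((seq[q - w:q + w], int(d == 0)))
    return data

# toy example with a hypothetical donor site at index 20
seq = "ACGTACGTACGTACGTACGTGTAAGTACGTACGTACGTAC"
print(windows(seq, sites=[20], w=5)[:3])
```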
6. Reminder: standard feedforward networks
7. Standard feedforward networks
... are based on simple neurons
[Figure: a neuron with inputs x1, ..., xn, weights w1, ..., wn, and threshold θ computes s(wᵀx − θ)]
s(t) = sgd(t) = (1 + e^(-t))^(-1)
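A minimal Python sketch of such a neuron (the input and weight values are illustrative):

```python
import numpy as np

def neuron(x, w, theta):
    """Weighted sum of the inputs, shifted by the threshold theta,
    squashed through the sigmoid s(t) = (1 + e^(-t))^(-1)."""
    t = np.dot(w, x) - theta
    return 1.0 / (1.0 + np.exp(-t))

print(neuron(x=np.array([1.0, 0.0, 1.0]),
             w=np.array([0.5, -0.3, 0.8]),
             theta=0.2))
```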
8. Standard feedforward networks
... combine the neurons in a network architecture
[Figure: network mapping input x to output y, realizing a function f_w : R^n → R^o]
9. Standard feedforward networks
- ... can be trained efficiently
- goal: learn an unknown f : R^n → R^o given examples f(x1), ..., f(xm)
- Training (a toy version follows below):
- Choose an architecture (n input neurons, o output neurons, number of hidden neurons determined by trial and error)
- Choose the weights (regression such that the examples from the training set are matched as accurately as possible)
- Test the resulting function on the test set
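A toy version of this recipe on a stand-in task (XOR), with one hidden layer sized by trial and error and plain gradient descent; this is a sketch of the general training scheme, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
s = lambda t: 1.0 / (1.0 + np.exp(-t))   # sigmoid

# toy data: learn XOR, f : R^2 -> R^1, from its four examples
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Y = np.array([[0], [1], [1], [0]], float)

# architecture: 2 inputs, 4 hidden neurons (trial and error), 1 output
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

for _ in range(10000):                   # plain gradient descent
    H = s(X @ W1 + b1)                   # hidden layer
    O = s(H @ W2 + b2)                   # output layer
    dO = (O - Y) * O * (1 - O)           # backprop: output deltas
    dH = (dO @ W2.T) * H * (1 - H)       # backprop: hidden deltas
    W2 -= 0.5 * H.T @ dO; b2 -= 0.5 * dO.sum(0)
    W1 -= 0.5 * X.T @ dH; b1 -= 0.5 * dH.sum(0)

print(np.round(O.ravel(), 2))            # typically approaches [0, 1, 1, 0]
```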
10. Standard feedforward networks
[Figure: a sequence window around a potential splice site is fed to the network and mapped to (1, 0) for a true site]
i.e. f : R^(4k) → R^2 is to be learned (4 inputs per nucleotide of the k-window; a coding sketch follows below)
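A sketch of the implied input coding, assuming the common four-bits-per-nucleotide scheme; `CODE` and `encode` are illustrative names, not from the paper:

```python
import numpy as np

CODE = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(window):
    """Map a k-nucleotide window to a vector in R^(4k)."""
    return np.array([bit for nt in window for bit in CODE[nt]], float)

x = encode("CAGGTAAGT")           # k = 9 -> 36 inputs
y = np.array([1.0, 0.0])          # target (1, 0): 'true donor site'
print(x.shape, y)
```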
11. Experiments
12. Experiments
- Donor sites: test set contains 118 donors, 190987 non-donors
- Evaluation: highly unbalanced distribution → Matthews correlation coefficient (in [-1, 1]), see the formula below
- (P_x = true positives, P_x^f = false positives, N_x = true negatives, N_x^f = false negatives)
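In this notation the coefficient is the standard Matthews formula; the equation itself did not survive on the slide, so it is reconstructed here:

```latex
C(x) = \frac{P_x N_x - P_x^f N_x^f}
            {\sqrt{(N_x + N_x^f)(N_x + P_x^f)(P_x + N_x^f)(P_x + P_x^f)}}
```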
13. Experiments
- Value of C(x) for different sizes of the window (w) and different numbers of hidden neurons
- 111 of 118 donors correctly classified (94.1%); 11789 of 11800 non-donors correctly classified (99.9%)
14. Experiments
- Effect of resetting the output cutoff level (illustrated below)
[Figure: probability of misclassified donors vs. distance from a true site]
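Conceptually, resetting the cutoff just moves the decision threshold on the network output; a toy illustration of the resulting sensitivity/specificity trade-off (scores and labels are invented):

```python
import numpy as np

def rates(scores, labels, cutoff):
    """Sensitivity and specificity at a given output cutoff.

    Raising the cutoff suppresses false positives at the price of
    missed true sites, the trade-off shown on the slide."""
    pred = scores > cutoff
    sens = np.mean(pred[labels == 1])
    spec = np.mean(~pred[labels == 0])
    return sens, spec

scores = np.array([0.9, 0.7, 0.4, 0.6, 0.2, 0.1])  # illustrative outputs
labels = np.array([1,   1,   1,   0,   0,   0])
for c in (0.3, 0.5, 0.8):
    print(c, rates(scores, labels, c))
```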
15. Experiments
- Acceptor sites: test set contains 118 acceptors, 190987 non-acceptors
- 100 acceptors classified correctly (87.4%); 11800 non-acceptors classified correctly (99.8%)
- optimum window size includes the polypyrimidine tract
16. Experiments
- Effect of resetting the output cutoff level
[Figure: probability of misclassified acceptors vs. distance from a true site]
17. Experiments
- comparison of methods on another human splice set
- from: GeneSplicer: a new computational method for splice site recognition, Pertea, Lin, Salzberg, Nucleic Acids Research 29(5):1216-1221, 2001
18. Experiments
- comparison of methods on the UCI benchmark, from the SVM paper (Sonnenburg et al.)
[Table: method comparison; labels on the slide point out the "some NN" and "rules" entries]
19. Rule insertion and extraction
20. Rule insertion and extraction
- seminal approach: Knowledge Based Artificial Neural Networks (KBANN) by G. Towell, J. Shavlik
- assumption: a set of approximate propositional, acyclic if/then rules (acyclic Horn clauses) is given (e.g. a :- b, c, d, e)
- transfer it into an FNN, coding true/false as 1/0 (see the sketch below)
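A minimal sketch of the rule-to-network translation, assuming the usual KBANN scheme: positive antecedents get weight +ω, negated ones −ω, and the bias is set so the unit fires only when the clause is satisfied. The value ω = 4 and the helper names are illustrative:

```python
import numpy as np

OMEGA = 4.0  # link weight magnitude (illustrative choice)

def rule_to_neuron(positive, negated, n_inputs):
    """Translate one Horn clause into neuron weights and a bias.

    The clause fires iff all positive antecedents are 1 and all
    negated ones are 0; the bias (p - 0.5) * OMEGA puts the
    threshold just below the satisfied-clause net input.
    """
    w = np.zeros(n_inputs)
    w[list(positive)] = OMEGA
    w[list(negated)] = -OMEGA
    bias = (len(positive) - 0.5) * OMEGA
    return w, bias

def neuron(x, w, bias):
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) - bias)))

# clause: a :- b, c, not d   with inputs ordered [b, c, d]
w, b = rule_to_neuron(positive=[0, 1], negated=[2], n_inputs=3)
print(neuron(np.array([1.0, 1.0, 0.0]), w, b))  # high: clause satisfied
print(neuron(np.array([1.0, 1.0, 1.0]), w, b))  # low: 'd' violates it
```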
22. Rule insertion and extraction
- find nice rules
- transfer the rules into a network
- extend the network by additional neurons (and possibly additional inputs) and initialize the additional weights with small values
- find additional nice training examples possibly not yet covered by the rules
- train the network
23. Rule insertion and extraction
24. Rule insertion and extraction
25. Rule insertion and extraction
- Training: backprop with cross-entropy error E = -Σ_i ( (1 - y_i) lg(1 - o_i) + y_i lg o_i ) (see below)
- Structure of the network and initialization improve the performance
- However it might happen that FNNs become better in the long run; the rules might get lost here
- KBANN is better for acceptors, but not for donors or for the "neither" class
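For concreteness, the cross-entropy error as reconstructed above (a standard definition; `lg` read as log base 2, and `eps` is an illustrative guard):

```python
import numpy as np

def cross_entropy(y, o, eps=1e-12):
    """E = -sum_i ( (1 - y_i) lg(1 - o_i) + y_i lg o_i ), lg = log2.

    eps clips the outputs away from 0 and 1 to avoid lg(0)."""
    o = np.clip(o, eps, 1 - eps)
    return -np.sum((1 - y) * np.log2(1 - o) + y * np.log2(o))

print(cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
```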
26. Experiments
27. Rule insertion and extraction
- Extraction of rules (a toy version follows below):
- each neuron represents a variable
- the weights of each neuron are clustered into nearly identical values; irrelevant weights are pruned
- m-of-n rules are extracted from each neuron via enumeration of (some of) the possibilities, e.g. "if 2 of {2, 8, 9} and 2 of {1, 3, 5, 7} and none of {4, 6} then on"
- usually yields a large and incomprehensible set of rules
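A toy illustration of the m-of-n extraction step, heavily simplified to one weight cluster of positive weights; the published procedure clusters and enumerates far more carefully:

```python
import numpy as np

def m_of_n_rule(w, bias):
    """Extract an m-of-n rule from one trained neuron (toy version).

    Keep only significant positive weights (pruning), treat them as
    one cluster of near-identical weights, and find the smallest m
    such that ANY m active antecedents push the net input past the
    bias; that is the m-of-n reading of the neuron.
    """
    idx = [i for i, wi in enumerate(w) if wi > 0.1 * np.max(np.abs(w))]
    for m in range(1, len(idx) + 1):
        weakest_m = sorted(w[i] for i in idx)[:m]
        if sum(weakest_m) > bias:
            return m, idx
    return None, idx

w = np.array([3.9, 0.1, 4.1, 4.0, -0.05])
m, ants = m_of_n_rule(w, bias=6.0)
print(f"if {m} of {ants} then on")  # -> "if 2 of [0, 2, 3] then on"
```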
28. Rule insertion and extraction
- Just a remark: there exists a very nice alternative from Duch et al. (not applied to splice sets) to infer logical rules with the help of NNs from scratch: MLP2LN
- for each output class separately: rules of increasing complexity
- train s.t. a formula arises: the error E is augmented by λ1 Σ w_ij² + λ2 Σ w_ij² (w_ij − 1)² (w_ij + 1)², driving the weights toward 0 and ±1 (sketched below)
[Figure: MLP2LN architecture with L units and R units]
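A sketch of the MLP2LN regularizer as stated above; the λ values are placeholder hyperparameters, not values from Duch et al.:

```python
import numpy as np

def mlp2ln_penalty(W, lam1=1e-4, lam2=1e-3):
    """Regularization term of MLP2LN (sketch).

    lam1 * sum(w^2) pushes weights toward 0; the second factor
    w^2 (w - 1)^2 (w + 1)^2 vanishes exactly at w in {-1, 0, +1},
    so trained weights approach logical (absent/on/negated) links.
    """
    W = np.asarray(W)
    return (lam1 * np.sum(W ** 2)
            + lam2 * np.sum(W ** 2 * (W - 1) ** 2 * (W + 1) ** 2))

print(mlp2ln_penalty([0.0, 1.0, -1.0]))  # only the lam1 terms remain
print(mlp2ln_penalty([0.5, -0.4]))       # intermediate weights are penalized
```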
29. Discussion
30. Discussion
- FNNs constitute a very simple and efficient tool for splice site recognition
- interesting: often highly unbalanced sets, hence adequate error measures are important (correlation coefficient); even >99% correct prediction of non-splice sites might lead to 5:1 (non-correct splice sites : correct splice sites)
- prior knowledge is available, hence rule insertion is a striking alternative (with some benefits)