Slides: 56

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu


Transcript and Presenter's Notes

Title: CS5263 Bioinformatics

1
CS5263 Bioinformatics

Lecture 21
RNA Secondary Structure Prediction

2
Road map

Biological roles for RNA
What's secondary structure?
How is it represented?
Why is it important?
How to predict?

3
Central dogma
The flow of genetic information:
DNA -> RNA (transcription)
RNA -> Protein (translation)
DNA -> DNA (replication)
4
Classical Roles for RNA

mRNA - Messenger RNA
tRNA - Transfer RNA (61 kinds, 75nt)
rRNA - Ribosomal RNA (4 kinds, 120-5k nt)

[Figure labels: RNA, Protein, Ribosome]
5
Classical Roles for RNA

mRNA
tRNA
rRNA

Ribosome
6
Semi-classical RNA

snRNA - small nuclear RNA (splicing: U1, etc.; 60-300 nt)
RNaseP - tRNA processing (300 nt)
SRP - signal recognition particle; membrane targeting (100-300 nt)
tmRNA - resets stalled ribosomes, destroys aberrant mRNA
Telomerase - (200-400 nt)
snoRNA - small nucleolar RNA (many varieties; 80-200 nt)

7
New Roles for RNA

Riboswitch: an mRNA regulates its own activity
siRNA (Nobel Prize 2006, Fire and Mello)
microRNAs
saRNA: small activating RNA
Hundreds of families
Rfam release 1, 1/2003: 25 families, 55k instances
Rfam release 7, 3/2005: 503 families, 300k instances

8
Example Riboswitch
9
Non-coding RNAs

Dramatic discoveries in the last 5 years
100s of new families
Many roles: regulation, transport, stability, catalysis, ...
1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA

10
Take-home message

RNAs play many important roles in the cell beyond
the classical roles
Many of these roles are yet to be discovered
RNA functions are determined by structures

11
RNA structure

Primary: sequence
Secondary: base-pairing
Tertiary: 3D shape

12
RNA base-pairing

Watson-Crick pairing
C-G: 3 kcal/mole
A-U: 2 kcal/mole
Wobble pair G-U: 1 kcal/mole
Non-canonical Pairs

13
tRNA structure
14
Secondary structure prediction

Given: CAUUUGUGUACCU.
Goal: [Figure: the folded secondary structure of the sequence]
How can we compute that?

15
Terminology
Hairpin Loops
Interior loops
Stems
Multi-branched loop
Bulge loop
16
Pseudoknot
5'-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3'
[Figure: the sequence drawn with position ticks at 10, 20, 30, 40 and crossing base pairs]

Makes structure prediction hard. Not considered in most algorithms.

17
The Nussinov algorithm

Goal: maximize the number of base-pairs
Idea: dynamic programming
Loop matching
Nussinov, Pieczenik, Griggs, Kleitman '78
Too simple for accurate prediction, but a stepping-stone for later algorithms

18
The Nussinov algorithm

Problem:
Find the RNA structure with the maximum (weighted) number of nested pairings
Nested: no pseudoknots

ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG
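
The "nested" condition can be checked mechanically. The following is a minimal sketch, assuming a structure is given as a list of 1-based position pairs (i, j) with i < j; the function name `is_nested` and the example pair lists are illustrative and not from the slides. Two pairs (i, j) and (k, l) cross, forming a pseudoknot, when i < k < j < l.

```python
from itertools import combinations

def is_nested(pairs):
    """Return True if no two base pairs cross, i.e. the structure has no pseudoknot."""
    # After sorting, the first pair of each combination starts no later than the second.
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        # The pairs cross when k falls inside (i, j) but its partner l falls outside.
        if i < k < j < l:
            return False
    return True

# Hypothetical example structures (1-based positions).
hairpin = [(1, 9), (2, 8), (3, 7)]   # stacked pairs, fully nested
knotted = [(1, 6), (4, 9)]           # 1 < 4 < 6 < 9, so the pairs cross

print(is_nested(hairpin))  # True
print(is_nested(knotted))  # False
```

Pairs that sit side by side, such as (1, 4) and (5, 9), also count as nested under this test, which matches what the dynamic programming recurrences can represent.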
19
The Nussinov algorithm

Given sequence $X = x_1 \ldots x_N$,
Define DP matrix: $F(i, j)$ = maximum number of base-pairs if $x_i \ldots x_j$ folds optimally
Matrix is symmetric, so let $i < j$

20
The Nussinov algorithm

Can be summarized into two cases:
(i, j) paired: optimal score is $1 + F(i+1, j-1)$
(i, j) unpaired: optimal score is $\max_k [ F(i, k) + F(k+1, j) ]$
There are a number of other ways to summarize, all equivalent

21
The Nussinov algorithm

$F(i, i) = 0$
$F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$
$S(x_i, x_j) = 1$ if $x_i$, $x_j$ can form a base-pair, and 0 otherwise
Generalize: $S(A, U) = 2$, $S(C, G) = 3$, $S(G, U) = 1$
Or other types of scores (later)
$F(1, N)$ gives the optimal score for the whole sequence
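
The recurrence can be transcribed almost literally as a memoized recursion. This is only a sketch under stated assumptions: it uses the generalized weights S(A,U) = 2, S(C,G) = 3, S(G,U) = 1 from this slide, 0-based indices internally, treats an empty interval as score 0, and does not yet enforce a minimum loop length. The names `PAIR_SCORE`, `S`, and `nussinov_score` are chosen here for illustration.

```python
from functools import lru_cache

# Weighted pair scores from the slide; keys are stored in alphabetical order.
PAIR_SCORE = {("A", "U"): 2, ("C", "G"): 3, ("G", "U"): 1}

def S(a, b):
    """Score for pairing bases a and b (0 if they cannot pair)."""
    return PAIR_SCORE.get(tuple(sorted((a, b))), 0)

def nussinov_score(x):
    """Return F(1, N) for sequence x, computed top-down from the recurrence."""
    @lru_cache(maxsize=None)
    def F(i, j):
        if j <= i:  # single base or empty interval
            return 0
        paired = F(i + 1, j - 1) + S(x[i], x[j])
        split = max(F(i, k) + F(k + 1, j) for k in range(i, j))
        return max(paired, split)

    return F(0, len(x) - 1) if x else 0

print(nussinov_score("GC"))         # 3
print(nussinov_score("GGGAAAUCC"))  # 8 (one A-U pair plus two G-C pairs)
```

The recursion depth grows with the sequence length, so this form suits short sequences; the iterative fill over increasing interval length avoids that limit.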

22
How to fill in the DP matrix?

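$F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$

[Figure: upper-triangular DP matrix with zeros on the main diagonal; cell (i, j) is highlighted, with rows i, i+1 and columns j-1, j marked]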
23
How to fill in the DP matrix?

$F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$

[Figure: DP matrix with zeros on the main diagonal; cells with $j - i = 1$ are filled]
24
How to fill in the DP matrix?

$F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$

[Figure: DP matrix with zeros on the main diagonal; cells with $j - i = 2$ are filled]
25
How to fill in the DP matrix?

$F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$

[Figure: DP matrix with zeros on the main diagonal; cells with $j - i = 3$ are filled]
26
How to fill in the DP matrix?

$F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$

[Figure: DP matrix with zeros on the main diagonal; the last cell, with $j - i = N - 1$, is filled]
27
Minimum Loop length

Sharp turns unlikely
Let minimum length of hairpin loop be 1
$F(i, j) = 0$ for $j - i < 2$

[Figure: DP matrix with zeros on the main diagonal and the first off-diagonal, next to an example hairpin with pairs U-A, G-C, C-G]
28
Algorithm

Initialization:
  $F(i, i) = 0$ for i = 1 to N
  $F(i, i+1) = 0$ for i = 1 to N-1
Iteration:
  For L = 1 to N-1
    For i = 1 to N - L
      j = min(i + L, N)
      $F(i, j) = \max \{ F(i+1, j-1) + s(x_i, x_j),\ \max_{i \le k < j} [ F(i, k) + F(k+1, j) ] \}$
Termination:
  Best score is given by $F(1, N)$
  (Need to trace back; refer to the Durbin book)
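
The fill and the traceback can be sketched together in a few lines. This is an illustrative sketch rather than the lecture's own code: it assumes unit scores for A-U, G-C and G-U pairs, starts the span loop at `min_loop + 1` so that F(i, j) stays 0 whenever j - i < 2, and recovers just one of possibly several optimal structures. The names `can_pair`, `nussinov` and `min_loop` are chosen here.

```python
def can_pair(a, b):
    """True for Watson-Crick pairs and the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def nussinov(x, min_loop=1):
    """Return (best score, list of 1-based pairs) for sequence x."""
    n = len(x)
    F = [[0] * n for _ in range(n)]
    # Fill by increasing span L = j - i; shorter spans keep their initial 0.
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            best = max(F[i][k] + F[k + 1][j] for k in range(i, j))
            if can_pair(x[i], x[j]):
                best = max(best, F[i + 1][j - 1] + 1)
            F[i][j] = best
    # Traceback: re-derive which case produced each F[i][j].
    pairs = []
    stack = [(0, n - 1)] if n else []
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if can_pair(x[i], x[j]) and F[i][j] == F[i + 1][j - 1] + 1:
            pairs.append((i + 1, j + 1))  # report 1-based positions
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return (F[0][n - 1] if n else 0), sorted(pairs)

score, pairs = nussinov("GGGAAAUCC")
print(score)  # 3
print(pairs)  # [(1, 9), (2, 8), (3, 7)]
```

Checking the paired case first biases the traceback toward stacked pairs at the ends of an interval; other tie-breaking orders would return a different, equally scored structure.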

29
Complexity

For L = 1 to N-1
  For i = 1 to N - L
    j = min(i + L, N)
    $F(i, j) = \max \{ F(i+1, j-1) + s(x_i, x_j),\ \max_{i \le k < j} [ F(i, k) + F(k+1, j) ] \}$
Time complexity: $O(N^3)$
Memory: $O(N^2)$
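
The cubic bound can be made explicit by counting the split points examined by the two loops. As a rough count, assuming each cell with $j - i = L$ scans exactly $L$ values of $k$ and there are $N - L$ such cells:

```latex
\sum_{L=1}^{N-1} \sum_{i=1}^{N-L} L
  \;=\; \sum_{L=1}^{N-1} (N - L)\,L
  \;=\; N \cdot \frac{(N-1)N}{2} - \frac{(N-1)N(2N-1)}{6}
  \;=\; \frac{(N-1)\,N\,(N+1)}{6}
  \;=\; O(N^3)
```

The matrix itself has $N(N+1)/2$ cells on and above the diagonal, which gives the $O(N^2)$ memory bound.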

30
Example

RNA sequence: GGGAAAUCC
Only count the number of base-pairs
A-U = 1
G-C = 1
G-U = 1
Minimum hairpin loop length = 1

31
   G G G A A A U C C
G  0 0
G    0 0
G      0 0
A        0 0
A          0 0
A            0 0
U              0 0
C                0 0
C                  0
32
   G G G A A A U C C
G  0 0 0
G    0 0 0
G      0 0 0
A        0 0 0
A          0 0 1
A            0 0 0
U              0 0 0
C                0 0
C                  0
33
   G G G A A A U C C
G  0 0 0 0
G    0 0 0 0
G      0 0 0 0
A        0 0 0 1
A          0 0 1 1
A            0 0 0 0
U              0 0 0
C                0 0
C                  0
34
   G G G A A A U C C
G  0 0 0 0 0
G    0 0 0 0 0
G      0 0 0 0 1
A        0 0 0 1 1
A          0 0 1 1 1
A            0 0 0 0
U              0 0 0
C                0 0
C                  0
35
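   G G G A A A U C C
G  0 0 0 0 0 0 1 2 3
G    0 0 0 0 0 1 2 3
G      0 0 0 0 1 2 2
A        0 0 0 1 1 1
A          0 0 1 1 1
A            0 0 0 0
U              0 0 0
C                0 0
C                  0

[Figure: three alternative optimal structures found by traceback, each with 3 base pairs: one with pairs G-U, G-C, G-C around hairpin loop AAA, and two with pairs A-U, G-C, G-C around hairpin loop AA, leaving one G unpaired]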
39
Energy minimization

For L = 1 to N-1
  For i = 1 to N - L
    j = min(i + L, N)
    $E(i, j) = \min \{ E(i+1, j-1) + e(x_i, x_j),\ \min_{i \le k < j} [ E(i, k) + E(k+1, j) ] \}$
$e(x_i, x_j)$ represents the energy of $x_i$ base-pairing with $x_j$
Energies are negative values, hence minimization rather than maximization.
More complex energy rules: energy depends on neighboring bases

40
Terminology
Hairpin Loops
Interior loops
Stems
Multi-branched loop
Bulge loop
41
The Zuker algorithm main ideas

Instead of base pairs, score pairs of base pairs (more accurate)
Separate score for bulges
Separate score for loops of different size and composition
Separate score for interactions between the stem and the beginning of the loop
Use an additional matrix to remember the current state, similar to affine-gap alignment.

42
Two popular implementations

mFold by Zuker
RNAfold in the Vienna package (Hofacker)
Includes several useful utilities, such as structure comparison, searching, base-pairing probability from partition functions, etc.

43
Accuracy

50-70% for sequences up to 300 nt
Not perfect, but useful
Possible reasons:
Energy rule not perfect: 5-10% error
Many alternative structures within this error range
Alternative structures do exist
Structure may change in the presence of other molecules

44
Comparative structure prediction

Given K homologous aligned RNA sequences:
Human  aagacuucggaucuggcgacaccc
Mouse  uacacuucggaugacaccaaagug
Worm   aggucuucggcacgggcaccauuc
Fly    ccaacuucggauuuugcuaccaua
Orc    aagccuucggagcgggcguaacuc
If the ith and jth positions are always base-paired and covary, then they are likely to be paired

45
Mutual information

$f_{ab}(i,j)$: fraction of sequences in which the pair a, b is in positions i, j
$f_a(i)$: fraction of sequences in which the base a is in position i

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

$f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$
$f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$
$f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$
46
Mutual information

Also called covariance score
M is high if base a in position i is always followed by base b in position j
Does not require a to base-pair with b
Advantage: can detect non-canonical base-pairs
However, M = 0 if there is no mutation at all, even with perfect base-pairs

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

One way around this is to combine covariance and energy scores
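
The frequencies and the covariance score can be computed directly from the five aligned sequences. The formula for M did not survive in this transcript, so the sketch below assumes the standard definition $M_{ij} = \sum_{a,b} f_{ab}(i,j) \log_2 \frac{f_{ab}(i,j)}{f_a(i) f_b(j)}$; the function names are illustrative, and columns are 1-based as on the slide.

```python
from collections import Counter
from math import log2

ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",  # Human
    "uacacuucggaugacaccaaagug",  # Mouse
    "aggucuucggcacgggcaccauuc",  # Worm
    "ccaacuucggauuuugcuaccaua",  # Fly
    "aagccuucggagcgggcguaacuc",  # Orc
]

def column_freqs(aln, i):
    """f_a(i): fraction of sequences with base a at 1-based column i."""
    counts = Counter(seq[i - 1] for seq in aln)
    return {a: c / len(aln) for a, c in counts.items()}

def pair_freqs(aln, i, j):
    """f_ab(i, j): fraction of sequences with bases (a, b) at columns (i, j)."""
    counts = Counter((seq[i - 1], seq[j - 1]) for seq in aln)
    return {ab: c / len(aln) for ab, c in counts.items()}

def mutual_information(aln, i, j):
    """M_ij under the assumed standard definition, in bits."""
    fi, fj = column_freqs(aln, i), column_freqs(aln, j)
    fij = pair_freqs(aln, i, j)
    return sum(f * log2(f / (fi[a] * fj[b])) for (a, b), f in fij.items())

print(pair_freqs(ALIGNMENT, 3, 13))                    # gc: 0.6, cg: 0.2, au: 0.2
print(round(mutual_information(ALIGNMENT, 3, 13), 3))  # 1.371
print(mutual_information(ALIGNMENT, 5, 6))             # 0.0 (both columns conserved)
```

Columns 3 and 13 covary perfectly, so M equals the entropy of column 3, while the fully conserved columns 5 and 6 score 0, which is the weakness noted above.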
47
Comparative structure prediction

Given a multiple alignment, can infer the structure that maximizes the sum of mutual information, by DP
However, alignment is hard, since structure is often more important than sequence

48
Comparative structure prediction

In practice:
1. Get multiple alignment
2. Find covarying bases, deduce structure
3. Improve multiple alignment (by hand)
4. Go to 2
A manual EM process!!

49
Comparative structure prediction

Align then fold
Align and fold
Fold then align

50
Context-free Grammar for RNA Secondary Structure

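$S \to SS \mid aSu \mid cSg \mid uSa \mid gSc \mid L$
$L \to aL \mid cL \mid gL \mid uL \mid \epsilon$

[Figure: example derivations and a parse tree over the nonterminals S and L for the sequence a c g g a g u g c c c g u]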
51
Stochastic Context-free Grammar (SCFG)

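Probabilistic context-free grammar
Probabilities can be converted into weights
CFG vs SCFG is similar to RG vs HMM
Rules and weights:
$S \to SS$ (weight 0)
$S \to aSu \mid uSa$ (weight 2)
$S \to cSg \mid gSc$ (weight 3)
$S \to uSg \mid gSu$ (weight 1)
$S \to L$, $L \to aL \mid cL \mid gL \mid uL \mid \epsilon$ (weight 0)

$F(i, j) = \max \{ e(x_i, x_j) + F(i+1, j-1),\ L(i, j),\ \max_k [ F(i, k) + F(k+1, j) ] \}$
$L(i, j) = 0$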
52
SCFG Decoding

Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score)
CYK algorithm (Viterbi)
The Nussinov and Zuker algorithms are essentially special cases of CYK
CYK and SCFG are also used in other domains (NLP, compilers, etc.).

53
SCFG Evaluation

Given a sequence and an SCFG model
Estimate P(seq is generated by model), summing over all possible paths
Inside-outside algorithm
Analogous to forward-backward
Inside: bottom-up parsing, $P(x_i \ldots x_j)$
Outside: top-down parsing, $P(x_1 \ldots x_{i-1},\ x_{j+1} \ldots x_N)$
Can calculate base-pairing probability
Analogous to posterior decoding
Essentially the same idea is implemented in the Vienna RNAfold package

54
SCFG Learning

Covariance model: similar to profile HMMs
Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states
EM on SCFG
Inside-outside algorithm
Efficiency is a bottleneck
Has been successfully applied to predict tRNA genes and structures
tRNAScan

55
Future directions