Title: CS5263 Bioinformatics
1CS5263 Bioinformatics
- Lecture 21
- RNA Secondary Structure Prediction
2Road map
- Biological roles for RNA
- What's secondary structure?
- How is it represented?
- Why is it important?
- How to predict?
3Central dogma
The flow of genetic information: DNA -> RNA (transcription), RNA -> Protein (translation), DNA -> DNA (replication)
4Classical Roles for RNA
- mRNA - Messenger RNA
- tRNA - Transfer RNA (61 kinds, 75nt)
- rRNA - Ribosomal RNA (4 kinds, 120-5k nt)
[Figure: RNA, protein, and the ribosome]
5Classical Roles for RNA
Ribosome
6Semi-classical RNA
- snRNA - small nuclear RNA (splicing: U1, etc.; 60-300 nt)
- RNaseP - tRNA processing (300 nt)
- SRP - signal recognition particle; membrane targeting (100-300 nt)
- tmRNA - resetting stalled ribosomes, destroying aberrant mRNA
- Telomerase - (200-400 nt)
- snoRNA - small nucleolar RNA (many varieties; 80-200 nt)
7New Roles for RNA
- Riboswitch: an mRNA regulates its own activity
- siRNA (Nobel Prize 2006, Fire and Mello)
- microRNAs
- saRNA: small activating RNA
- Hundreds of families
- Rfam release 1, 1/2003: 25 families, 55k instances
- Rfam release 7, 3/2005: 503 families, 300k instances
8Example Riboswitch
9Non-coding RNAs
- Dramatic discoveries in last 5 years
- 100s of new families
- Many roles: regulation, transport, stability, catalysis, ...
- 1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA
10Take-home message
- RNAs play many important roles in the cell beyond the classical roles
- Many of which are yet to be discovered
- RNA functions are determined by structures
11RNA structure
- Primary: sequence
- Secondary: base-pairing
- Tertiary: 3D shape
12RNA base-pairing
- Watson-Crick Pairing
- C-G: 3 kcal/mole
- A-U: 2 kcal/mole
- Wobble pair G-U: 1 kcal/mole
- Non-canonical Pairs
13tRNA structure
14Secondary structure prediction
- Given: CAUUUGUGUACCU.
- Goal
- How can we compute that?
15Terminology
Hairpin Loops
Interior loops
Stems
Multi-branched loop
Bulge loop
16Pseudoknot
5'-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3' (positions 10, 20, 30, 40 marked in the figure)
- Makes structure prediction hard. Not considered
in most algorithms.
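A pseudoknot can be recognized as a pair of crossing base pairs: pairs (i, j) and (k, l) with i < k < j < l. The following is a minimal sketch of that check, assuming a structure is given as a list of 1-based position pairs; the function name `has_pseudoknot` is hypothetical and not from the lecture.

```python
from itertools import combinations


def has_pseudoknot(pairs):
    """Return True if any two base pairs cross (i < k < j < l).

    `pairs` is a list of (position, position) tuples; this input format
    is an assumption made for illustration.
    """
    # Normalize each pair so the smaller position comes first, then sort
    # so that the first pair of every combination starts no later.
    norm = sorted((min(a, b), max(a, b)) for a, b in pairs)
    for (i, j), (k, l) in combinations(norm, 2):
        if i < k < j < l:
            return True
    return False


if __name__ == "__main__":
    print(has_pseudoknot([(1, 9), (2, 8), (3, 7)]))  # nested stem
    print(has_pseudoknot([(1, 5), (3, 8)]))          # crossing pairs
```

Nested structures never trigger the crossing condition, which is why the dynamic programming algorithms that follow can restrict themselves to them.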
17The Nussinov algorithm
- Goal: maximizing the number of base-pairs
- Idea: dynamic programming
- Loop matching
- Nussinov, Pieczenik, Griggs, Kleitman '78
- Too simple for accurate prediction, but a stepping-stone for later algorithms
18The Nussinov algorithm
- Problem
- Find the RNA structure with the maximum (weighted) number of nested pairings
- Nested: no pseudoknot
ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACC
GCGAGAGGGAAGACUCGUAUAAGCG
19The Nussinov algorithm
- Given sequence $X = x_1 \ldots x_N$,
- Define DP matrix: $F(i, j)$ = maximum number of base-pairs if $x_i \ldots x_j$ folds optimally
- Matrix is symmetric, so let $i < j$
20The Nussinov algorithm
- Can be summarized into two cases
- $(i, j)$ paired: optimal score is $1 + F(i+1, j-1)$
- $(i, j)$ unpaired: optimal score is $\max_{i \le k < j} [F(i, k) + F(k+1, j)]$
- There are a number of other ways to summarize, all equivalent
21The Nussinov algorithm
- $F(i, i) = 0$
- $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
- $S(x_i, x_j) = 1$ if $x_i$, $x_j$ can form a base-pair, and 0 otherwise
- Generalize: $S(A, U) = 2$, $S(C, G) = 3$, $S(G, U) = 1$
- Or other types of scores (later)
- $F(1, N)$ gives the optimal score for the whole seq
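The recurrence above can be transcribed almost literally as a memoized recursion. This is a minimal sketch, not the lecture's code: it uses 0-based indices, treats an empty or single-base interval as score 0, uses the unit score S = 1 for A-U, G-C and G-U, and does not yet enforce a minimum loop length. The names `PAIRS`, `S` and `nussinov_score` are chosen here for illustration.

```python
from functools import lru_cache

# Pairs that may form (Watson-Crick plus the G-U wobble), unit score.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def S(a, b):
    """Score 1 if bases a and b can pair, 0 otherwise."""
    return 1 if (a, b) in PAIRS else 0


def nussinov_score(x):
    """Maximum number of nested base pairs in sequence x (no loop constraint)."""

    @lru_cache(maxsize=None)
    def F(i, j):
        if i >= j:  # empty interval or a single base
            return 0
        # Case 1: combine the inner interval with the (i, j) pair score.
        paired = F(i + 1, j - 1) + S(x[i], x[j])
        # Case 2: split the interval at every k with i <= k < j.
        split = max(F(i, k) + F(k + 1, j) for k in range(i, j))
        return max(paired, split)

    return F(0, len(x) - 1) if x else 0


if __name__ == "__main__":
    print(nussinov_score("GGGAAAUCC"))
```

The recursion is convenient for checking the recurrence on short sequences; a bottom-up table avoids deep recursion on longer ones.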
22How to fill in the DP matrix?
- $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
- [Figure: upper-triangular DP matrix with zeros on the diagonal and cell $(i, j)$ marked.]
- Fill order, one diagonal at a time: $j = i + 1$, then $j = i + 2$, then $j = i + 3$, ..., up to $j = i + N - 1$.
27Minimum Loop length
- Sharp turns unlikely
- Let minimum length of hairpin loop be 1
- $F(i, j) = 0$ for $j - i < 2$
- [Figure: DP matrix with the main diagonal and the first off-diagonal set to 0, and an example hairpin with pairs U-A, G-C, C-G.]
28Algorithm
- Initialization
  - $F(i, i) = 0$ for $i = 1$ to $N$
  - $F(i, i+1) = 0$ for $i = 1$ to $N-1$
- Iteration
  - For $L = 2$ to $N-1$ ($L = 1$ is covered by the initialization)
    - For $i = 1$ to $N - L$
      - $j = \min(i + L, N)$
      - $F(i, j) = \max \{ F(i+1, j-1) + s(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
- Termination
  - Best score is given by $F(1, N)$
  - (Need to trace back; refer to the Durbin book)
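The initialization and iteration steps above can be sketched as a bottom-up table fill. This is an illustrative version under stated assumptions: 0-based indices, unit pair scores for A-U, G-C and G-U, and a `min_loop` parameter (hypothetical name) that keeps F(i, j) = 0 whenever j - i <= min_loop.

```python
def pair_score(a, b):
    """Unit score for A-U, G-C and G-U pairs, 0 otherwise (an assumption)."""
    return 1 if {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"}) else 0


def nussinov_fill(x, min_loop=1):
    """Return the DP table F, where F[i][j] is the best score for x[i..j]."""
    n = len(x)
    # Initialization: every cell starts at 0, which covers j - i <= min_loop.
    F = [[0] * n for _ in range(n)]
    # Iteration: fill one diagonal (span L = j - i) at a time.
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            best = F[i + 1][j - 1] + pair_score(x[i], x[j])
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F


if __name__ == "__main__":
    table = nussinov_fill("GGGAAAUCC")
    print(table[0][-1])  # termination: score of the whole sequence
```

Starting the outer loop at `min_loop + 1` is what prevents two bases that are too close from being paired.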
29Complexity
- For $L = 2$ to $N-1$ ($L = 1$ is covered by the initialization)
  - For $i = 1$ to $N - L$
    - $j = \min(i + L, N)$
    - $F(i, j) = \max \{ F(i+1, j-1) + s(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
- Time complexity: $O(N^3)$
- Memory: $O(N^2)$
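As a rough check on the cubic bound, the loop nest can be counted directly. This sketch assumes one unit of work per split point $k$: for a span $L$ there are $N - L$ start positions $i$, and the inner max examines $L$ values of $k$, so the total work is at most

```latex
\sum_{L=1}^{N-1} (N-L)\,L
  = N \sum_{L=1}^{N-1} L \;-\; \sum_{L=1}^{N-1} L^{2}
  = \frac{N^{2}(N-1)}{2} - \frac{(N-1)\,N\,(2N-1)}{6}
  = \frac{(N-1)\,N\,(N+1)}{6}
  = O(N^{3})
```

For $N = 9$ this gives $8 \cdot 9 \cdot 10 / 6 = 120$ inner steps. The table itself has $N^2$ cells, of which only the upper triangle is used, hence $O(N^2)$ memory.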
30Example
- RNA sequence GGGAAAUCC
- Only count # of base-pairs
- A-U = 1
- G-C = 1
- G-U = 1
- Minimum hairpin loop length = 1
31G G G A A A U C C
    G  G  G  A  A  A  U  C  C
G   0  0
G      0  0
G         0  0
A            0  0
A               0  0
A                  0  0
U                     0  0
C                        0  0
C                           0
32G G G A A A U C C
    G  G  G  A  A  A  U  C  C
G   0  0  0
G      0  0  0
G         0  0  0
A            0  0  0
A               0  0  1
A                  0  0  0
U                     0  0  0
C                        0  0
C                           0
33G G G A A A U C C
    G  G  G  A  A  A  U  C  C
G   0  0  0  0
G      0  0  0  0
G         0  0  0  0
A            0  0  0  1
A               0  0  1  1
A                  0  0  0  0
U                     0  0  0
C                        0  0
C                           0
34G G G A A A U C C
    G  G  G  A  A  A  U  C  C
G   0  0  0  0  0
G      0  0  0  0  0
G         0  0  0  0  1
A            0  0  0  1  1
A               0  0  1  1  1
A                  0  0  0  0
U                     0  0  0
C                        0  0
C                           0
35G G G A A A U C C
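    G  G  G  A  A  A  U  C  C
G   0  0  0  0  0  0  1  2  3
G      0  0  0  0  0  1  2  3
G         0  0  0  0  1  2  2
A            0  0  0  1  1  1
A               0  0  1  1  1
A                  0  0  0  0
U                     0  0  0
C                        0  0
C                           0
Alternative structures shown, each with 3 base pairs:
- G-U, G-C, G-C, with loop AAA
- A-U, G-C, G-C, with loop AA and one unpaired G at the end
- A-U, G-C, G-C, with loop AA and one unpaired G between the A-U and G-C pairs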
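One of the optimal structures in the example can be recovered by tracing back through the filled matrix. The sketch below is an illustrative implementation, not the procedure from the Durbin book verbatim: it refills the table with unit scores and minimum hairpin loop length 1, then walks back from the top-right cell, preferring the "paired" case whenever it explains the score. The function names are hypothetical.

```python
def pair_score(a, b):
    """Unit score for A-U, G-C and G-U pairs, 0 otherwise."""
    return 1 if {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"}) else 0


def fill(x, min_loop=1):
    """Bottom-up Nussinov table with a minimum hairpin loop length."""
    n = len(x)
    F = [[0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            F[i][j] = max(
                F[i + 1][j - 1] + pair_score(x[i], x[j]),
                max(F[i][k] + F[k + 1][j] for k in range(i, j)),
            )
    return F


def traceback(x, F, min_loop=1):
    """Return one optimal set of base pairs as sorted 1-based (i, j) tuples."""
    pairs, stack = [], [(0, len(x) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop or F[i][j] == 0:
            continue  # nothing left to pair in this interval
        sc = pair_score(x[i], x[j])
        if sc > 0 and F[i][j] == F[i + 1][j - 1] + sc:
            pairs.append((i + 1, j + 1))  # report 1-based positions
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):  # otherwise find a split that explains F[i][j]
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return sorted(pairs)


if __name__ == "__main__":
    seq = "GGGAAAUCC"
    print(traceback(seq, fill(seq)))
```

Because ties are broken in favor of pairing the outermost bases, this sketch returns the first structure listed (G1-C9, G2-C8, G3-U7); a different tie-breaking rule would return one of the other equally scored structures.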
39Energy minimization
- For $L = 2$ to $N-1$ ($L = 1$ is covered by the initialization)
  - For $i = 1$ to $N - L$
    - $j = \min(i + L, N)$
    - $E(i, j) = \min \{ E(i+1, j-1) + e(x_i, x_j),\ \min_{i \le k < j} [E(i, k) + E(k+1, j)] \}$
- $e(x_i, x_j)$ represents the energy for $x_i$ base-paired with $x_j$
- Energies are negative values; therefore we minimize rather than maximize.
- More complex energy rules: energy depends on neighboring bases
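The same loop structure works for energies by swapping max for min. The sketch below assumes simple per-pair energies taken from the magnitudes on the base-pairing slide with a negative sign (C-G = -3, A-U = -2, G-U = -1 kcal/mole) and a minimum hairpin loop length of 1; real energy rules depend on neighboring bases, which this toy version ignores. The names `ENERGY` and `min_energy` are illustrative.

```python
# Assumed per-pair energies (kcal/mole); bases that cannot pair are absent.
ENERGY = {frozenset("CG"): -3, frozenset("AU"): -2, frozenset("GU"): -1}


def min_energy(x, min_loop=1):
    """Minimum total pairing energy over nested structures of x."""
    n = len(x)
    E = [[0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            # Unpaired/split case: best combination of two sub-intervals.
            best = min(E[i][k] + E[k + 1][j] for k in range(i, j))
            # Paired case, only when x[i] and x[j] can actually pair.
            e = ENERGY.get(frozenset((x[i], x[j])))
            if e is not None:
                best = min(best, E[i + 1][j - 1] + e)
            E[i][j] = best
    return E[0][n - 1] if n else 0


if __name__ == "__main__":
    print(min_energy("GGGAAAUCC"))
```

With weighted energies the three structures that tied under pair counting are no longer equivalent: the ones containing the A-U pair score lower than the one containing the G-U pair.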
40Terminology
Hairpin Loops
Interior loops
Stems
Multi-branched loop
Bulge loop
41The Zuker algorithm main ideas
- Instead of base pairs, pairs of base pairs (more accurate)
- Separate score for bulges
- Separate score for different size and composition of loops
- Separate score for interactions between stem and beginning of loop
- Use additional matrix to remember current state, similar to affine-gap alignment.
42Two popular implementations
- mFold by Zuker
- RNAfold in the Vienna package (Hofacker)
- Includes several useful utilities, such as structure comparison, searching, base-pairing probability from partition functions, etc.
43Accuracy
- 50-70% for sequences up to 300 nt
- Not perfect, but useful
- Possible reasons
- Energy rule not perfect: 5-10% error
- Many alternative structures within this error range
- Alternative structures do exist
- Structure may change in presence of other molecules
44Comparative structure prediction
- Given K homologous aligned RNA sequences
- Human aagacuucggaucuggcgacaccc
- Mouse uacacuucggaugacaccaaagug
- Worm aggucuucggcacgggcaccauuc
- Fly ccaacuucggauuuugcuaccaua
- Orc aagccuucggagcgggcguaacuc
- If ith and jth positions are always base paired
and covary, then they are likely to be paired
45Mutual information
- $f_{ab}(i, j)$: fraction of times the pair a, b is in positions i, j
- $f_a(i)$: fraction of times the base a is in position i
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc
- $f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$
- $f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$
- $f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$
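The frequencies above can be reproduced from the five aligned sequences with a few lines of counting. This is a small sketch using 1-based column positions as on the slide; `base_freq` and `pair_freq` are illustrative names, and exact fractions are used so the values match the slide directly.

```python
from collections import Counter
from fractions import Fraction

# The five aligned sequences from the slide (Human, Mouse, Worm, Fly, Orc).
ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",
    "uacacuucggaugacaccaaagug",
    "aggucuucggcacgggcaccauuc",
    "ccaacuucggauuuugcuaccaua",
    "aagccuucggagcgggcguaacuc",
]


def base_freq(aln, i):
    """f_a(i): fraction of sequences with base a at 1-based column i."""
    counts = Counter(seq[i - 1] for seq in aln)
    return {a: Fraction(c, len(aln)) for a, c in counts.items()}


def pair_freq(aln, i, j):
    """f_ab(i, j): fraction of sequences with a at column i and b at column j."""
    counts = Counter((seq[i - 1], seq[j - 1]) for seq in aln)
    return {ab: Fraction(c, len(aln)) for ab, c in counts.items()}


if __name__ == "__main__":
    print(pair_freq(ALIGNMENT, 3, 13))
    print(base_freq(ALIGNMENT, 3))
    print(base_freq(ALIGNMENT, 13))
```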
46Mutual information
- Also called covariance score
- M is high if base a in position i is always followed by base b in position j
- Does not require a to base-pair with b
- Advantage: can detect non-canonical base-pairs
- However, M = 0 if there is no mutation at all, even if the bases pair perfectly
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc
One way to get around this is to combine covariance and energy scores
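The slide text refers to M without a formula. The sketch below assumes the usual mutual information definition, $M(i, j) = \sum_{a,b} f_{ab}(i, j) \log_2 \frac{f_{ab}(i, j)}{f_a(i) f_b(j)}$, and applies it to the alignment above; the function name is illustrative.

```python
import math
from collections import Counter

ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",
    "uacacuucggaugacaccaaagug",
    "aggucuucggcacgggcaccauuc",
    "ccaacuucggauuuugcuaccaua",
    "aagccuucggagcgggcguaacuc",
]


def mutual_information(aln, i, j):
    """Mutual information (in bits) between 1-based columns i and j."""
    n = len(aln)
    col_i = Counter(seq[i - 1] for seq in aln)
    col_j = Counter(seq[j - 1] for seq in aln)
    joint = Counter((seq[i - 1], seq[j - 1]) for seq in aln)
    m = 0.0
    for (a, b), count in joint.items():
        f_ab = count / n
        # Only observed pairs contribute; unobserved pairs have f_ab = 0.
        m += f_ab * math.log2(f_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return m


if __name__ == "__main__":
    print(round(mutual_information(ALIGNMENT, 3, 13), 3))  # covarying columns
    print(mutual_information(ALIGNMENT, 5, 9))             # conserved columns
```

Columns 3 and 13 covary (g-c, c-g, a-u) and score about 1.37 bits, while columns 5 and 9 are identical in every sequence and score 0, which illustrates the limitation noted above.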
47Comparative structure prediction
- Given a multiple alignment, can infer the structure that maximizes the sum of mutual information, by DP
- However, alignment is hard, since structure is often more important than sequence
48Comparative structure prediction
- In practice:
  1. Get multiple alignment
  2. Find covarying bases, deduce structure
  3. Improve multiple alignment (by hand)
  4. Go to 2
- A manual EM process!!
49Comparative structure prediction
- Align then fold
- Align and fold
- Fold then align
50Context-free Grammar for RNA Secondary Structure
- $S \to SS \mid aSu \mid cSg \mid uSa \mid gSc \mid L$
- $L \to aL \mid cL \mid gL \mid uL \mid \epsilon$
- [Figure: parse tree built from S and L nodes, deriving the sequence a c g g a g u g c c c g u]
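To see how the grammar ties sequence to structure, one can expand S at random and record which bases were emitted together by a pairing rule. The sketch below is illustrative only: the choice probabilities, the depth limit, and the dot-bracket annotation (parentheses for paired bases, dots for unpaired ones) are assumptions made here, not part of the slides.

```python
import random

# Pairing productions S -> aSu | cSg | uSa | gSc, as (left, right) bases.
PAIR_RULES = [("a", "u"), ("c", "g"), ("u", "a"), ("g", "c")]


def sample_L(rng):
    """Expand L -> aL | cL | gL | uL | epsilon; all bases are unpaired."""
    seq = ""
    while rng.random() < 0.6:  # assumed probability of emitting another base
        seq += rng.choice("acgu")
    return seq, "." * len(seq)


def sample_S(rng, depth=0, max_depth=6):
    """Expand S and return (sequence, dot-bracket structure)."""
    # Beyond max_depth, force S -> L so the derivation terminates.
    choice = "L" if depth >= max_depth else rng.choice(["SS", "pair", "pair", "L"])
    if choice == "SS":  # S -> SS: two structures side by side
        s1, t1 = sample_S(rng, depth + 1, max_depth)
        s2, t2 = sample_S(rng, depth + 1, max_depth)
        return s1 + s2, t1 + t2
    if choice == "pair":  # S -> aSu etc.: a base pair around an inner S
        left, right = rng.choice(PAIR_RULES)
        s, t = sample_S(rng, depth + 1, max_depth)
        return left + s + right, "(" + t + ")"
    return sample_L(rng)  # S -> L: an unpaired stretch


if __name__ == "__main__":
    seq, struct = sample_S(random.Random(0))
    print(seq)
    print(struct)
```

Every pair in the output is nested by construction, which is why a context-free grammar of this form cannot generate pseudoknots.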
51Stochastic Context-free Grammar (SCFG)
- Probabilistic context-free grammar
- Probabilities can be converted into weights
- CFG vs SCFG is similar to RG vs HMM
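- $S \to SS$ (weight 0)
- $S \to aSu \mid uSa$ (weight 2)
- $S \to cSg \mid gSc$ (weight 3)
- $S \to uSg \mid gSu$ (weight 1)
- $S \to L$, $L \to aL \mid cL \mid gL \mid uL \mid \epsilon$ (weight 0)
- Equivalent recurrence: $F(i, j) = \max \{ e(x_i, x_j) + F(i+1, j-1),\ L(i, j),\ \max_k [F(i, k) + F(k+1, j)] \}$, with $L(i, j) = 0$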
52SCFG Decoding
- Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score)
- CYK algorithm (Viterbi)
- The Nussinov and Zuker algorithms are essentially special cases of CYK
- CYK and SCFG are also used in other domains (NLP, compilers, etc.)
53SCFG Evaluation
- Given a sequence and a SCFG model
- Estimate P(seq is generated by model), summing over all possible paths
- Inside-outside algorithm
- Analogous to forward-backward
- Inside: bottom-up parsing ($P(x_i \ldots x_j)$)
- Outside: top-down parsing ($P(x_1 \ldots x_{i-1},\ x_{j+1} \ldots x_N)$)
- Can calculate base-pairing probability
- Analogous to posterior decoding
- Essentially the same idea is implemented in the Vienna RNAfold package
54SCFG Learning
- Covariance model: similar to profile HMMs
- Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states
- EM on SCFG
- Inside-outside algorithm
- Efficiency is a bottleneck
- Has been successfully applied to predict tRNA genes and structures
- tRNAScan
55Future directions
- Structure prediction
- Secondary
- Tertiary
- Structural comparison tools
- Structural alignment
- Structure search tools
- RNA-BLAST
- Structural motif finding
- RNA-MEME