CS5263 Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Title:

CS5263 Bioinformatics

Description:

CS5263 Bioinformatics. Lecture 21. RNA Secondary Structure Prediction ... tmRNA - resetting stalled ribosomes, destroy aberrant mRNA. Telomerase - (200-400nt) ... – PowerPoint PPT presentation

Slides: 56
Provided by: jianhu
Learn more at: http://www.cs.utsa.edu

Transcript and Presenter's Notes

Title: CS5263 Bioinformatics


1
CS5263 Bioinformatics
  • Lecture 21
  • RNA Secondary Structure Prediction

2
Road map
  • Biological roles for RNA
  • What's secondary structure?
  • How is it represented?
  • Why is it important?
  • How to predict?

3
Central dogma
The flow of genetic information
  • DNA → RNA: transcription
  • RNA → Protein: translation
  • DNA → DNA: replication
4
Classical Roles for RNA
  • mRNA - Messenger RNA
  • tRNA - Transfer RNA (61 kinds, 75nt)
  • rRNA - Ribosomal RNA (4 kinds, 120-5k nt)

[Figure labels: RNA, Protein, Ribosome]
5
Classical Roles for RNA
  • mRNA
  • tRNA
  • rRNA

Ribosome
6
Semi-classical RNA
  • snRNA - small nuclear RNA (splicing: U1, etc.;
    60-300 nt)
  • RNaseP - tRNA processing (300 nt)
  • SRP - signal recognition particle: membrane
    targeting (100-300 nt)
  • tmRNA - resetting stalled ribosomes, destroying
    aberrant mRNA
  • Telomerase - (200-400 nt)
  • snoRNA - small nucleolar RNA (many varieties;
    80-200 nt)

7
New Roles for RNA
  • Riboswitch: an mRNA that regulates its own activity
  • siRNA (Nobel Prize 2006, Fire and Mello)
  • microRNAs
  • saRNA: small activating RNA
  • Hundreds of families
  • Rfam release 1, 1/2003: 25 families, 55k
    instances
  • Rfam release 7, 3/2005: 503 families, 300k
    instances

8
Example: Riboswitch
9
Non-coding RNAs
  • Dramatic discoveries in the last 5 years
  • 100s of new families
  • Many roles: regulation, transport, stability,
    catalysis, ...
  • 1% of DNA codes for protein, but 30% of it is
    copied into RNA, i.e. ncRNA >> mRNA

10
Take-home message
  • RNAs play many important roles in the cell beyond
    the classical roles
  • Many of which are yet to be discovered
  • RNA functions are determined by structures

11
RNA structure
  • Primary: sequence
  • Secondary: base-pairing
  • Tertiary: 3D shape

12
RNA base-pairing
  • Watson-Crick pairing
  • C-G: 3 kcal/mole
  • A-U: 2 kcal/mole
  • Wobble pair G-U: 1 kcal/mole
  • Non-canonical pairs

13
tRNA structure
14
Secondary structure prediction
  • Given: CAUUUGUGUACCU...
  • Goal: its folded secondary structure [figure]
  • How can we compute that?

15
Terminology
Hairpin Loops
Interior loops
Stems
Multi-branched loop
Bulge loop
16
Pseudoknot
5'-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3'
  • Makes structure prediction hard. Not considered
    in most algorithms.
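
To make "pseudoknot" concrete: a structure is nested when no two base pairs cross, that is, there are no pairs (i, j) and (k, l) with i < k < j < l. The sketch below checks that condition for a list of pairs. The function name `has_pseudoknot` and the example pair lists are illustrative assumptions, not part of the lecture.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalize each pair so the smaller index comes first, then sort.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # After sorting, i <= k. The pairs cross when k opens inside (i, j)
        # but its partner l closes outside.
        if i < k < j < l:
            return True
    return False

nested = [(1, 10), (2, 9), (3, 8)]   # a simple stem: no crossing
crossing = [(1, 10), (5, 15)]        # second pair opens inside, closes outside
print(has_pseudoknot(nested), has_pseudoknot(crossing))  # False True
```

The pairwise check is quadratic in the number of pairs, which is fine for illustration.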

17
The Nussinov algorithm
  • Goal: maximize the number of base pairs
  • Idea: dynamic programming
  • Loop matching
  • Nussinov, Pieczenik, Griggs, Kleitman '78
  • Too simple for accurate prediction, but a
    stepping-stone for later algorithms

18
The Nussinov algorithm
  • Problem:
  • Find the RNA structure with the maximum
    (weighted) number of nested pairings
  • Nested: no pseudoknots

ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACC
GCGAGAGGGAAGACUCGUAUAAGCG
19
The Nussinov algorithm
  • Given sequence $X = x_1 \ldots x_N$
  • Define DP matrix: $F(i, j)$ = maximum number of
    base pairs if $x_i \ldots x_j$ folds optimally
  • Matrix is symmetric, so let $i < j$

20
The Nussinov algorithm
  • Can be summarized into two cases
  • $(i, j)$ paired: optimal score is $1 + F(i+1, j-1)$
  • $(i, j)$ unpaired: optimal score is
    $\max_{i \le k < j} [F(i, k) + F(k+1, j)]$
  • There are a number of other ways to summarize, all
    equivalent

21
The Nussinov algorithm
  • $F(i, i) = 0$
  • $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
  • $S(x_i, x_j) = 1$ if $x_i$, $x_j$ can form a base pair, and
    0 otherwise
  • Generalize: $S(A, U) = 2$, $S(C, G) = 3$, $S(G, U) = 1$
  • Or other types of scores (later)
  • $F(1, N)$ gives the optimal score for the whole sequence
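
A minimal sketch of the recurrence on this slide, assuming the unit score S = 1 for A-U, C-G and G-U pairs and no minimum loop length (that constraint comes on a later slide). The names `pair_score` and `nussinov_score` are illustrative, and indices are 0-based rather than the 1-based indices used on the slides.

```python
def pair_score(a, b):
    """S(a, b): 1 if the two bases can pair (A-U, C-G or G-U wobble), else 0."""
    return 1 if {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"}) else 0

def nussinov_score(seq):
    """Fill F(i, j) by increasing span j - i and return the score of the whole sequence."""
    n = len(seq)
    if n == 0:
        return 0
    F = [[0] * n for _ in range(n)]          # F(i, i) = 0 on the diagonal
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: x_i pairs with x_j; the interior is empty when j = i + 1.
            inner = F[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = inner + pair_score(seq[i], seq[j])
            # Case 2: split x_i..x_j into two independent sub-structures.
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1]

print(nussinov_score("GGGAAAUCC"))  # 3
```

Swapping `pair_score` for the weighted scores (2, 3, 1) gives the generalized version mentioned in the bullets.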

22
How to fill in the DP matrix?
  • $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$

[Figure: DP matrix with 0 on the diagonal and cell (i, j) marked; axis labels i, i+1, j-1, j]
23
How to fill in the DP matrix?
  • $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$

[Figure: DP matrix with 0 on the diagonal; filling the cells with j - i = 1]
24
How to fill in the DP matrix?
  • $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$

[Figure: DP matrix with 0 on the diagonal; filling the cells with j - i = 2]
25
How to fill in the DP matrix?
  • $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$

[Figure: DP matrix with 0 on the diagonal; filling the cells with j - i = 3]
26
How to fill in the DP matrix?
  • $F(i, j) = \max \{ F(i+1, j-1) + S(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$

[Figure: DP matrix with 0 on the diagonal; filling the last cell, j - i = N - 1]
27
Minimum Loop length
  • Sharp turns are unlikely
  • Let minimum length of hairpin loop be 1
  • $F(i, j) = 0$ for $j - i < 2$

[Figure: DP matrix with the diagonal and the first off-diagonal set to 0; example stem with pairs U-A, G-C, C-G]
28
Algorithm
  • Initialization:
  • $F(i, i) = 0$ for $i = 1$ to $N$
  • $F(i, i+1) = 0$ for $i = 1$ to $N-1$
  • Iteration:
  • For $L = 1$ to $N-1$
  • For $i = 1$ to $N - L$
  • $j = \min(i + L, N)$
  • $F(i, j) = \max \{ F(i+1, j-1) + s(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
  • Termination:
  • Best score is given by $F(1, N)$
  • (Need to trace back; refer to the Durbin book)
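
The slide defers the traceback to the Durbin book. As a rough sketch of one way it can be done, the code below fills the matrix with a minimum hairpin loop length and then walks back from the top-right cell to recover a single optimal structure in dot-bracket notation. The names `can_pair`, `fold` and `min_loop`, the 0-based indexing, and the tie-breaking order (try the pairing case first) are assumptions of this sketch, not the lecture's method.

```python
def can_pair(a, b):
    """True for Watson-Crick (A-U, C-G) and wobble (G-U) pairs."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})

def fold(seq, min_loop=1):
    """Return (score, dot-bracket string) for one optimal nested structure."""
    n = len(seq)
    F = [[0] * n for _ in range(n)]           # cells with j - i <= min_loop stay 0
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = F[i + 1][j - 1] + 1 if can_pair(seq[i], seq[j]) else 0
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best

    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:                 # too short to contain a pair
            continue
        if can_pair(seq[i], seq[j]) and F[i][j] == F[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):                 # otherwise some split explains F[i][j]
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return (F[0][n - 1] if n else 0), "".join(structure)

print(fold("GGGAAAUCC"))  # (3, '(((...)))')
```

When several structures share the optimal score, this returns only one of them. Enumerating all of them would need a branching traceback.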

29
Complexity
  • For $L = 1$ to $N-1$
  • For $i = 1$ to $N - L$
  • $j = \min(i + L, N)$
  • $F(i, j) = \max \{ F(i+1, j-1) + s(x_i, x_j),\ \max_{i \le k < j} [F(i, k) + F(k+1, j)] \}$
  • Time complexity: $O(N^3)$
  • Memory: $O(N^2)$
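
To see where the cubic time comes from, one can count the split terms that the triple loop evaluates. Each cell with j - i = L tries L values of k, and there are N - L such cells, so the total is the sum over L of (N - L) * L = (N^3 - N) / 6. The sketch below checks that closed form numerically. The function name is illustrative.

```python
def count_split_evaluations(n):
    """Count how many (i, k, j) split terms the Nussinov loops evaluate."""
    count = 0
    for span in range(1, n):          # L = 1 .. N-1
        for i in range(n - span):     # N - L start positions
            j = i + span
            count += j - i            # k ranges over i <= k < j
    return count

for n in (10, 100, 1000):
    # The loop count matches the closed form (N^3 - N) / 6 exactly.
    assert count_split_evaluations(n) == (n**3 - n) // 6
    print(n, count_split_evaluations(n))
```

The matrix itself has on the order of N^2 / 2 cells, which is the quadratic memory on the slide.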

30
Example
  • RNA sequence: GGGAAAUCC
  • Only count # of base pairs
  • A-U = 1
  • G-C = 1
  • G-U = 1
  • Minimum hairpin loop length = 1

31
DP matrix after initialization (diagonal and first off-diagonal set to 0):
    G  G  G  A  A  A  U  C  C
G   0  0
G      0  0
G         0  0
A            0  0
A               0  0
A                  0  0
U                     0  0
C                        0  0
C                           0
32
DP matrix after filling the cells with j - i = 2:
    G  G  G  A  A  A  U  C  C
G   0  0  0
G      0  0  0
G         0  0  0
A            0  0  0
A               0  0  1
A                  0  0  0
U                     0  0  0
C                        0  0
C                           0
33
DP matrix after filling the cells with j - i = 3:
    G  G  G  A  A  A  U  C  C
G   0  0  0  0
G      0  0  0  0
G         0  0  0  0
A            0  0  0  1
A               0  0  1  1
A                  0  0  0  0
U                     0  0  0
C                        0  0
C                           0
34
DP matrix after filling the cells with j - i = 4:
    G  G  G  A  A  A  U  C  C
G   0  0  0  0  0
G      0  0  0  0  0
G         0  0  0  0  1
A            0  0  0  1  1
A               0  0  1  1  1
A                  0  0  0  0
U                     0  0  0
C                        0  0
C                           0
35
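Completed DP matrix:
    G  G  G  A  A  A  U  C  C
G   0  0  0  0  0  0  1  2  3
G      0  0  0  0  0  1  2  3
G         0  0  0  0  1  2  2
A            0  0  0  1  1  1
A               0  0  1  1  1
A                  0  0  0  0
U                     0  0  0
C                        0  0
C                           0
[Figures: optimal structures with 3 pairs: stem G-U, G-C, G-C closing the loop AAA; stem A-U, G-C, G-C with one unpaired G and the loop AA (two variants)]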
39
Energy minimization
  • For $L = 1$ to $N-1$
  • For $i = 1$ to $N - L$
  • $j = \min(i + L, N)$
  • $E(i, j) = \min \{ E(i+1, j-1) + e(x_i, x_j),\ \min_{i \le k < j} [E(i, k) + E(k+1, j)] \}$
  • $e(x_i, x_j)$ represents the energy of $x_i$ base-pairing
    with $x_j$
  • Energies are negative values, so we minimize
    rather than maximize.
  • More complex energy rules: energy depends on
    neighboring bases
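
A minimal sketch of the minimization form of the recurrence. The pairing energies are an assumption: the magnitudes (3, 2, 1 kcal/mole) come from the base-pairing slide, and the negative sign follows the remark that energies are negative. The name `min_energy`, the `min_loop` parameter and the 0-based indexing are illustrative; real energy rules depend on neighboring bases and are not modeled here.

```python
# Assumed pairing energies in kcal/mole (negative because pairing is favorable).
ENERGY = {("A", "U"): -2, ("U", "A"): -2,
          ("C", "G"): -3, ("G", "C"): -3,
          ("G", "U"): -1, ("U", "G"): -1}

def min_energy(seq, min_loop=1):
    """Return the minimum total pairing energy over nested structures."""
    n = len(seq)
    if n == 0:
        return 0
    E = [[0] * n for _ in range(n)]           # short spans cannot pair: energy 0
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Split case: two independent sub-structures.
            best = min(E[i][k] + E[k + 1][j] for k in range(i, j))
            # Pairing case: only if x_i and x_j can form a pair.
            e = ENERGY.get((seq[i], seq[j]))
            if e is not None:
                best = min(best, E[i + 1][j - 1] + e)
            E[i][j] = best
    return E[0][n - 1]

print(min_energy("GGGAAAUCC"))  # -8
```

With these weights the optimum for GGGAAAUCC uses two G-C pairs and one A-U pair rather than the G-U pair, which shows how weighting can change which structure wins.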

40
Terminology
Hairpin Loops
Interior loops
Stems
Multi-branched loop
Bulge loop
41
The Zuker algorithm main ideas
  1. Instead of base pairs, pairs of base pairs (more
    accurate)
  2. Separate score for bulges
  3. Separate score for different sizes and compositions
    of loops
  4. Separate score for interactions between stem and
    beginning of loop
  5. Use an additional matrix to remember the current state,
    similar to affine-gap alignment.

42
Two popular implementations
  • mFold by Zuker
  • RNAfold in the Vienna package (Hofacker)
  • Includes several useful utilities, such as
    structure comparison, searching, base-pairing
    probability from partition functions, etc.

43
Accuracy
  • 50-70% for sequences up to 300 nt
  • Not perfect, but useful
  • Possible reasons:
  • Energy rules are not perfect: 5-10% error
  • Many alternative structures within this error
    range
  • Alternative structures do exist
  • Structure may change in the presence of other
    molecules

44
Comparative structure prediction
  • Given K homologous aligned RNA sequences
  • Human aagacuucggaucuggcgacaccc
  • Mouse uacacuucggaugacaccaaagug
  • Worm aggucuucggcacgggcaccauuc
  • Fly ccaacuucggauuuugcuaccaua
  • Orc aagccuucggagcgggcguaacuc
  • If the i-th and j-th positions are always base-paired
    and covary, then they are likely to be paired

45
Mutual information
  • $f_{ab}(i, j)$: fraction of times the pair a, b is in
    positions i, j
  • $f_a(i)$: fraction of times the base a is in position i

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc
$f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$
$f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$
$f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$
46
Mutual information
  • Also called covariance score
  • M is high if base a in position i is always accompanied
    by base b in position j
  • Does not require a to base-pair with b
  • Advantage: can detect non-canonical base pairs
  • However, M = 0 if there is no mutation at all, even for
    perfect base pairs

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc
One way to get around is to combine covariance
and energy scores
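
The formula for M did not survive in this transcript. The sketch below assumes the standard definition, $M_{ij} = \sum_{a,b} f_{ab}(i,j) \log_2 \frac{f_{ab}(i,j)}{f_a(i) f_b(j)}$, and applies it to the five-sequence alignment from the slides. The function name `mutual_information` is illustrative. Columns are 1-based to match the slide's example of positions 3 and 13.

```python
from collections import Counter
from math import log2

ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",  # Human
    "uacacuucggaugacaccaaagug",  # Mouse
    "aggucuucggcacgggcaccauuc",  # Worm
    "ccaacuucggauuuugcuaccaua",  # Fly
    "aagccuucggagcgggcguaacuc",  # Orc
]

def mutual_information(alignment, i, j):
    """M(i, j) in bits for 1-based alignment columns i and j."""
    n = len(alignment)
    col_i = [s[i - 1] for s in alignment]
    col_j = [s[j - 1] for s in alignment]
    f_i, f_j = Counter(col_i), Counter(col_j)      # single-column counts
    f_ij = Counter(zip(col_i, col_j))              # joint counts of (a, b)
    return sum((c / n) * log2((c / n) / ((f_i[a] / n) * (f_j[b] / n)))
               for (a, b), c in f_ij.items())

print(round(mutual_information(ALIGNMENT, 3, 13), 3))  # 1.371 (covarying columns)
print(round(mutual_information(ALIGNMENT, 5, 9), 3))   # 0.0 (both columns fully conserved)
```

Columns 5 and 9 are c and g in every sequence, so they could well be paired, yet M is 0. This is the limitation noted in the last bullet.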
47
Comparative structure prediction
  • Given a multiple alignment, we can infer the structure
    that maximizes the sum of mutual information, by
    DP
  • However, alignment is hard, since structure is often
    more important than sequence

48
Comparative structure prediction
  • In practice:
  1. Get multiple alignment
  2. Find covarying bases, deduce structure
  3. Improve multiple alignment (by hand)
  4. Go to 2
  • A manual EM process!!

49
Comparative structure prediction
  • Align then fold
  • Align and fold
  • Fold then align

50
Context-free Grammar for RNA Secondary Structure
  • S → SS | aSu | cSg | uSa | gSc | L
  • L → aL | cL | gL | uL | ε

[Figure: parse tree deriving the sequence a c g g a g u g c c c g u]
51
Stochastic Context-free Grammar (SCFG)
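  • Probabilistic context-free grammar
  • Probabilities can be converted into weights
  • CFG vs SCFG is similar to RG vs HMM
  • S → SS
  • S → aSu | uSa | L
  • S → cSg | gSc | L
  • S → uSg | gSu | L
  • L → aL | cL | gL | uL | ε

[Figure annotations: rule weights 0, 2, 3, 1, 0, linked to the terms of the recurrence]
$F(i, j) = \max \{ e(x_i, x_j) + F(i+1, j-1),\ L(i, j),\ \max_k [F(i, k) + F(k+1, j)] \}$, with $L(i, j) = 0$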
52
SCFG Decoding
  • Decoding: given a grammar (SCFG/HMM) and a
    sequence, find the best parse (highest
    probability or score)
  • CYK algorithm (Viterbi)
  • The Nussinov and Zuker algorithms are essentially
    special cases of CYK
  • CYK and SCFG are also used in other domains (NLP,
    compilers, etc.).
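
To illustrate the claim that Nussinov is a special case of CYK, here is a sketch of best-parse scoring for the weighted grammar from the SCFG slide. The assignment of weights to rules is an assumption: 2 for a-u pairs, 3 for c-g, 1 for g-u, and 0 for S → SS and for the unpaired L rules. The name `cyk_best_score` and the memoized recursion over half-open intervals are illustrative choices. The grammar has no minimum loop length, so adjacent bases may pair.

```python
from functools import lru_cache

# Assumed rule weights for S -> x S y, keyed by the unordered pair {x, y}.
PAIR_WEIGHT = {frozenset("au"): 2, frozenset("cg"): 3, frozenset("gu"): 1}

def cyk_best_score(seq):
    """Score of the best parse of seq from S under the weighted grammar."""
    seq = seq.lower()

    @lru_cache(maxsize=None)
    def S(i, j):
        """Best score of any derivation of seq[i:j] from nonterminal S."""
        if j - i == 0:
            return 0                          # S -> L -> epsilon
        best = 0                              # S -> L: all of seq[i:j] unpaired
        if j - i >= 2:
            w = PAIR_WEIGHT.get(frozenset((seq[i], seq[j - 1])))
            if w is not None:
                best = max(best, w + S(i + 1, j - 1))   # S -> x S y
            for k in range(i + 1, j):
                best = max(best, S(i, k) + S(k, j))     # S -> S S
        return best

    return S(0, len(seq))

print(cyk_best_score("GGGAAAUCC"))  # 8
```

The recursion visits the same (i, j) cells as the Nussinov matrix. For long sequences a bottom-up table avoids Python's recursion limit.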

53
SCFG Evaluation
  • Given a sequence and an SCFG model
  • Estimate P(seq is generated by model), summing
    over all possible paths
  • Inside-outside algorithm
  • Analogous to forward-backward
  • Inside: bottom-up parsing ($P(x_i \ldots x_j)$)
  • Outside: top-down parsing ($P(x_1 \ldots x_{i-1}, x_{j+1} \ldots x_N)$)
  • Can calculate base-pairing probability
  • Analogous to posterior decoding
  • Essentially the same idea is implemented in the
    Vienna RNAfold package
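
As a rough illustration of "summing over all possible paths", the sketch below replaces max with sum in a Nussinov-style recursion. It is not the SCFG inside algorithm itself, and it uses an unambiguous decomposition (the last base is either unpaired or paired with some earlier base k) so that each structure is counted exactly once. The names `inside_sum`, `pair_weight` and `min_loop` are assumptions. With `pair_weight=1.0` the result is simply the number of nested structures.

```python
def can_pair(a, b):
    """True for Watson-Crick (A-U, C-G) and wobble (G-U) pairs."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})

def inside_sum(seq, pair_weight=1.0, min_loop=1):
    """Sum over all nested structures of pair_weight ** (number of pairs)."""
    n = len(seq)
    # Z[i][j] covers seq[i:j] (half-open); an empty interval has one structure.
    Z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            total = Z[i][j - 1]                      # last base seq[j-1] unpaired
            for k in range(i, j - 1 - min_loop):     # seq[k] pairs with seq[j-1]
                if can_pair(seq[k], seq[j - 1]):
                    # Left part seq[i:k] and enclosed part seq[k+1:j-1] are independent.
                    total += Z[i][k] * pair_weight * Z[k + 1][j - 1]
            Z[i][j] = total
    return Z[0][n]

print(inside_sum("GAAAC"))  # 2.0: the open chain and the single G-C hairpin
```

Using Boltzmann factors as pair weights would turn this sum into a partition function of the kind the RNAfold bullet refers to. That extension is not shown here.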

54
SCFG Learning
  • Covariance model: similar to profile HMMs
  • Given a set of sequences with common structures,
    simultaneously learn SCFG parameters and
    optimally parse sequences into states
  • EM on SCFG
  • Inside-outside algorithm
  • Efficiency is a bottleneck
  • Has been successfully applied to predict tRNA
    genes and structures
  • tRNAScan

55
Future directions
  • Structure prediction
  • Secondary
  • Tertiary
  • Structural comparison tools
  • Structural alignment
  • Structure search tools
  • RNA-BLAST
  • Structural motif finding
  • RNA-MEME