Introduction to Bioinformatics

Slides: 33

Provided by: jaaphe

Transcript and Presenter's Notes

1
Introduction to bioinformatics
Lecture 7: Pairwise Sequence Alignment (III) and Deriving Amino Acid Substitution Matrices (I)
2
Global or Local Pairwise alignment
[Figure: sequences composed of segments labelled A, B and C, shown aligned locally and globally]
3
Globin fold: α protein, myoglobin (PDB 1MBN)
Alpha-helices are labelled A (blue) to H
(red). The D helix can be missing in some
globins. What happens with the alignment if
D-helix-containing globin sequences are aligned
with D-less ones?
4
β sandwich: β protein, immunoglobulin (PDB 7FAB)
Immunoglobulin structures have variable regions
where the number of amino acids can vary
substantially
5
TIM barrel: α/β protein, Triose phosphate
IsoMerase (PDB 1TIM)
The evolutionary history of this protein family
has been the subject of rigorous debate.
Arguments have been made in favor of both
convergent and divergent evolution. Because of
the general lack of sequence homology, the
ancestry of this molecule is still a mystery.
6
Pyruvate kinase (Phosphotransferase)
β barrel: regulatory domain
α/β barrel: catalytic substrate binding domain
α/β: nucleotide binding domain
7
What does all this mean for alignments?

Alignments need to be able to skip anything from
secondary structural elements to complete domains
(i.e. putting gaps opposite these motifs in the
shorter sequence).
Depending on the gap penalties chosen, the algorithm
might have difficulty making such long gaps
(for example when using high affine gap
penalties), resulting in an incorrect alignment.
Alignments are only meaningful for homologous
sequences (with a common ancestor).

8
There are three kinds of pairwise alignments

Global alignment: align all residues in both
sequences; all gaps are penalised.
Semi-global alignment: align all residues in
both sequences; end gaps are not penalised (zero
end gap penalties).
Local alignment: align one part of each
sequence; end gaps are not applicable.

9
Easy global DP recipe for using affine gap
penalties (after Gotoh)

Penalty = Pi + gap_length × Pe

\[
S_{i,j} = s_{i,j} + \max
\begin{cases}
\max_{0<x<i-1} \left( S_{x,\,j-1} - P_\mathrm{i} - (i-x-1)\,P_\mathrm{e} \right) \\
S_{i-1,\,j-1} \\
\max_{0<y<j-1} \left( S_{i-1,\,y} - P_\mathrm{i} - (j-y-1)\,P_\mathrm{e} \right)
\end{cases}
\]

[Figure: search matrix with row i-1 and column j-1 marked]

S(i,j) is the score of the optimal alignment
(highest scoring alignment up to i, j).
At each cell (i, j) in the search matrix, check the
Max coming from:
any cell in the preceding row until j-2: add score
for cell (i, j) minus appropriate gap penalties;
any cell in the preceding column until i-2: add score
for cell (i, j) minus appropriate gap penalties;
or cell (i-1, j-1): add score for cell (i, j).
Select the highest scoring cell in the bottom row and
rightmost column and do the trace-back.
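
The global recipe above can be sketched in Python as follows. This is a minimal illustration, not the lecturer's code: the function name, the toy scoring function (a stand-in for PAM250) and the choice to let the gap loops also visit row and column 0 (so that N-terminal end gaps are charged) are assumptions. Only the score is returned; trace-back is omitted.

```python
def global_affine_score(a, b, score, p_open=10, p_ext=2):
    """Score of the best global alignment of a and b (slide-style recipe)."""
    n, m = len(a), len(b)

    def gap(length):
        # Affine penalty: Pi + gap_length * Pe
        return p_open + length * p_ext

    # Row and column 0 hold the N-terminal end-gap penalties: 0, -12, -14, ...
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap(i)
    for j in range(1, m + 1):
        S[0][j] = -gap(j)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]              # diagonal cell, no gap
            for x in range(0, i - 1):           # preceding column, rows up to i-2
                best = max(best, S[x][j - 1] - gap(i - x - 1))
            for y in range(0, j - 1):           # preceding row, columns up to j-2
                best = max(best, S[i - 1][y] - gap(j - y - 1))
            S[i][j] = score(a[i - 1], b[j - 1]) + best

    # "Extra row and column": charge C-terminal end gaps, then take the maximum.
    final = S[n][m]
    for x in range(0, n):
        final = max(final, S[x][m] - gap(n - x))
    for y in range(0, m):
        final = max(final, S[n][y] - gap(m - y))
    return final


def toy_score(p, q):
    # Hypothetical stand-in for PAM250: +5 for identity, -4 otherwise.
    return 5 if p == q else -4


print(global_affine_score("ACGT", "AGT", toy_score))
```

With the toy scores, aligning ACGT against AGT needs one gap of length 1 (penalty 12) and allows three matches, so the best score is 15 - 12 = 3. The triple loop makes this formulation cubic in sequence length; it is written for readability, not speed.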

10
Let's do an example: global alignment
Gotoh's DP algorithm with affine gap penalties
(PAM250, Pi=10, Pe=2)
[Figure: worked search matrix, scored with PAM250]
Cell (D2, T4) can alternatively come from two
cells (same score): 'high-road' or 'low-road'.
Row and column 0 are filled with 0, -12, -14,
-16, ... if global alignment is used (for
N-terminal end-gaps); there is also an extra row
and column at the end to calculate the score including
C-terminal end-gap penalties. Note that only
non-diagonal arrows are indicated for clarity
(no arrow means that you go back to the earlier
diagonal cell).
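
As a quick check of the boundary values quoted above, the short sketch below computes row (or column) 0 from the penalty definition Pi + gap_length × Pe. The function name is hypothetical.

```python
def end_gap_row(length, p_open=10, p_ext=2):
    """Values for row/column 0 in a global alignment: 0, then -(Pi + k*Pe)."""
    return [0] + [-(p_open + k * p_ext) for k in range(1, length + 1)]


print(end_gap_row(3))  # [0, -12, -14, -16]
```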
11
Let's do another example: semi-global alignment
Gotoh's DP algorithm with affine gap
penalties (PAM250, Pi=10, Pe=2)
[Figure: worked search matrix, scored with PAM250]
A starting row and column 0, and an extra column at
the right or extra row at the bottom, are not necessary
when using semi-global alignment (zero end-gaps).
The rest works as under global alignment.
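
A semi-global variant can be sketched under the same assumptions (hypothetical function names, toy identity scores instead of PAM250). A zero-filled row and column 0 is kept here only as an indexing convenience; it has the same effect as leaving it out, because end gaps cost nothing.

```python
def semiglobal_affine_score(a, b, score, p_open=10, p_ext=2):
    """Best alignment score with free (zero-penalty) end gaps."""
    n, m = len(a), len(b)

    def gap(length):
        # Affine penalty for internal gaps: Pi + gap_length * Pe
        return p_open + length * p_ext

    # All-zero boundary: leading end gaps are not penalised.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]              # diagonal cell, no gap
            for x in range(1, i - 1):           # preceding column, rows up to i-2
                best = max(best, S[x][j - 1] - gap(i - x - 1))
            for y in range(1, j - 1):           # preceding row, columns up to j-2
                best = max(best, S[i - 1][y] - gap(j - y - 1))
            S[i][j] = score(a[i - 1], b[j - 1]) + best

    # Trailing end gaps are free: take the best cell in the bottom row
    # or the rightmost column.
    return max(max(S[n]), max(row[m] for row in S))


def toy_score(p, q):
    # Hypothetical stand-in for PAM250: +5 for identity, -4 otherwise.
    return 5 if p == q else -4


print(semiglobal_affine_score("ACGT", "TTACGTTT", toy_score))
```

Here the short sequence matches the middle of the long one, so all four residues align without any penalised gap and the score is 4 × 5 = 20.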
12
Easy local DP recipe for using affine gap
penalties (after Gotoh)

Penalty = Pi + gap_length × Pe

\[
S_{i,j} = \max
\begin{cases}
s_{i,j} + \max_{0<x<i-1} \left( S_{x,\,j-1} - P_\mathrm{i} - (i-x-1)\,P_\mathrm{e} \right) \\
s_{i,j} + S_{i-1,\,j-1} \\
s_{i,j} + \max_{0<y<j-1} \left( S_{i-1,\,y} - P_\mathrm{i} - (j-y-1)\,P_\mathrm{e} \right) \\
0
\end{cases}
\]

[Figure: search matrix with row i-1 and column j-1 marked]

S(i,j) is the score of the optimal alignment
(highest scoring alignment up to i, j).
At each cell (i, j) in the search matrix, check the
Max coming from:
any cell in the preceding row until j-2: add score
for cell (i, j) minus appropriate gap penalties;
any cell in the preceding column until i-2: add score
for cell (i, j) minus appropriate gap penalties;
or cell (i-1, j-1): add score for cell (i, j).
Select the highest scoring cell anywhere in the matrix
and do the trace-back until a zero-valued cell or the
start of the sequence(s).
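
The local recipe differs from the global one in two places: negative cells are reset to zero, and the final score is the maximum over the whole matrix. A minimal sketch, assuming a toy identity score in place of PAM250 and hypothetical function names:

```python
def local_affine_score(a, b, score, p_open=10, p_ext=2):
    """Score of the best local alignment of a and b (slide-style recipe)."""
    n, m = len(a), len(b)

    def gap(length):
        # Affine penalty: Pi + gap_length * Pe
        return p_open + length * p_ext

    S = [[0] * (m + 1) for _ in range(n + 1)]
    best_cell = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]              # diagonal cell, no gap
            for x in range(1, i - 1):           # preceding column, rows up to i-2
                best = max(best, S[x][j - 1] - gap(i - x - 1))
            for y in range(1, j - 1):           # preceding row, columns up to j-2
                best = max(best, S[i - 1][y] - gap(j - y - 1))
            # Each negative scoring cell is set to zero.
            S[i][j] = max(0, score(a[i - 1], b[j - 1]) + best)
            best_cell = max(best_cell, S[i][j])
    # The highest scoring cell may lie anywhere in the matrix.
    return best_cell


def toy_score(p, q):
    # Hypothetical stand-in for PAM250: +5 for identity, -4 otherwise.
    return 5 if p == q else -4


print(local_affine_score("GGACGTCC", "TTACGTAA", toy_score))
```

The two toy sequences share only the stretch ACGT, so the best local score is 4 × 5 = 20; two sequences with nothing in common score 0.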

13
Let's do yet another example: local alignment
Gotoh's DP algorithm with affine gap
penalties (PAM250, Pi=10, Pe=2)
[Figure: worked search matrix, scored with PAM250]
Extra start/end columns/rows are not necessary (no
end-gaps). Each negative scoring cell is set to
zero. The highest scoring cell may be found anywhere
in the search matrix after calculating it. Trace
the highest scoring cell back to the first cell with
zero value (or the beginning of one or both sequences).
14
For your first exam D1: Make sure you
understand and can carry out
1. the simple DP algorithm (for linear gap penalties,
but with the extension for affine gap penalties) and
2. Gotoh's algorithm for global, semi-global and
local alignment!
Gotoh, O. An Improved Algorithm for Matching
Biological Sequences. J. Mol. Biol., 162,
pp. 705-708, 1982.
15
Introduction to Bioinformatics

Deriving Amino Acid Substitution matrices (I)

16
Sequence Analysis: finding relationships between
genes and gene products of different species,
including those at large evolutionary distances
17
Archaea
Domain Archaea is mostly composed of cells that
live in extreme environments. While they are able
to live elsewhere, they are usually not found
there because outside of extreme environments
they are competitively excluded by other
organisms. Species of the domain Archaea are not
inhibited by antibiotics, lack peptidoglycan in
their cell wall (unlike bacteria, which have this
sugar/polypeptide compound), and can have
branched carbon chains in their membrane lipids
of the phospholipid bilayer. It is believed
that Archaea are very similar to prokaryotes
(e.g. bacteria) that inhabited the earth billions
of years ago. It is also believed that eukaryotes
evolved from Archaea, because they share many
mRNA sequences, have similar RNA polymerases, and
have introns. Therefore, it is believed that the
domains Archaea and Bacteria branched from each
other very early in history, and membrane
infolding produced eukaryotic cells in the
archaean branch approximately 1.7 billion years
ago. There are three main groups of Archaea:
extreme halophiles (salt), methanogens
(methane-producing anaerobes), and hyperthermophiles (e.g.
living at temperatures >100 °C!). Membrane
infolding is believed to have led to the nucleus
of eukaryotic cells, which is a
membrane-enveloped cell organelle that holds the
cellular DNA. Prokaryotic cells are more
primitive and do not have a nucleus.
18
The 20 common amino acids
19
Example of sequence database entry for GenBank
LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryote; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKK
                     PSKSDANR LGYDAYYCHGKCPFPLADHFNSTNAV
                     VQTLVNNMNPGKVPKACCVPTQLDSVAMLYL NDQSTBVVLKNYQEM
                     TBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
        .
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
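
To illustrate how the sequence part of such an entry is read, here is a minimal sketch that joins the numbered lines of the ORIGIN block into one string. It assumes the usual one-line-per-60-bases layout of the flat file; the function name is hypothetical and only the first two sequence lines of the entry above are used as input.

```python
def parse_origin(lines):
    """Join GenBank ORIGIN lines into one sequence string.

    Each sequence line starts with a position number followed by
    blocks of ten bases; anything else is ignored.
    """
    chunks = []
    for line in lines:
        parts = line.split()
        if parts and parts[0].isdigit():
            chunks.append("".join(parts[1:]))
    return "".join(chunks)


origin_excerpt = [
    "ORIGIN",
    "        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc",
    "       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg",
    "//",
]
seq = parse_origin(origin_excerpt)
print(len(seq), seq[:10])
```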
20
Example of sequence database entry for SWISS-PROT
(now UniProt)
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84(1987).
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
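
The feature table (FT lines) of such an entry is easy to read programmatically. The sketch below assumes the flat-file layout with one feature per line (line code, key, start, end, optional description) and treats "?" as an unknown position; the function name is hypothetical.

```python
def parse_ft(line):
    """Split one SWISS-PROT FT line into (key, start, end, description).

    Unknown positions, written as '?', are returned as None.
    """
    parts = line.split(None, 4)
    key, start, end = parts[1], parts[2], parts[3]
    description = parts[4] if len(parts) > 4 else ""

    def to_pos(text):
        return int(text) if text.isdigit() else None

    return key, to_pos(start), to_pos(end), description


chain = parse_ft("FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.")
signal = parse_ft("FT   SIGNAL        1      ?       POTENTIAL.")
print(chain, signal)
# Length of the mature chain: residues 457..588 inclusive
print(chain[2] - chain[1] + 1)
```

For the entry above, the mature chain spans residues 457 to 588, i.e. 132 residues of the 588-residue precursor.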
21
What to align, nucleotide or amino acid sequences?

If you think you have an Open Reading Frame (ORF),
then align at protein level:
(i) Many mutations within DNA are synonymous,
leading to overestimation of sequence divergence
if compared at the DNA level.
(ii) Evolutionary relationships can be more
finely expressed using a 20×20 amino acid
exchange table than using nucleotide exchanges.
(iii) DNA sequences contain non-coding regions,
which should be avoided in homology searches.
This is still an issue when translating into (six)
protein sequences through a codon table.
(iv) When searching at protein level, frameshifts can
occur, leading to stretches of incorrect amino
acids and possibly elongation of sequences due to
missed stop codons. But frameshifts normally
result in stretches of highly unlikely amino
acids, which can be used as a signal to trace them.
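
Point (i) can be made concrete with a small sketch. The two DNA fragments below are invented for illustration; they differ at four of twelve positions, yet translate (standard genetic code) to the same peptide, so divergence looks larger at the DNA level than at the protein level.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1 (length must be a multiple of 3)."""
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, len(dna), 3))


def identity(s, t):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(s, t)) / len(s)


dna1 = "ATGCTGTCTGGA"   # hypothetical fragment
dna2 = "ATGTTATCAGGT"   # same peptide, different codons
print(identity(dna1, dna2))                        # 8 of 12 bases identical
print(translate(dna1), translate(dna2))            # both MLSG
print(identity(translate(dna1), translate(dna2)))  # 1.0
```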

30
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
PAM250 matrix
The W-R exchange score is too large (due to paucity of data)
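
Because the matrix is symmetric, only the lower triangle is listed. A lookup therefore has to order the two residues before indexing. The sketch below uses just the first four rows (A, R, N, D) of the PAM250 values above; the names are hypothetical.

```python
ORDER = "ARND"
# First four rows of the lower-triangular PAM250 matrix shown above
LOWER = [
    [2],
    [-2, 6],
    [0, 0, 2],
    [0, -1, 2, 4],
]


def pam250_score(p, q):
    """Look up the exchange score for residues p and q (order-independent)."""
    i, j = ORDER.index(p), ORDER.index(q)
    if i < j:
        i, j = j, i          # always index with row >= column
    return LOWER[i][j]


print(pam250_score("A", "R"), pam250_score("R", "A"))  # same value both ways
print(pam250_score("D", "N"))
```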
31
PAM model

The scores derived through the PAM model are an
accurate description of the information content
(or the relative entropy) of an alignment
(Altschul, 1991).
PAM-1 corresponds to about 1 million years of
evolution
PAM-120 has the largest information content of
the PAM matrix series
PAM-250 is the traditionally most popular matrix

PAM / MDM / Dayhoff -- summary
The late Margaret Dayhoff was a pioneer in
protein databasing and comparison. She and her
coworkers developed a model of protein evolution
which resulted in the development of a set of
widely used substitution matrices. These are
frequently called Dayhoff, MDM (Mutation Data
Matrix), or PAM (Percent Accepted Mutation)
matrices
Derived from global alignments of closely related
sequences.
Matrices for greater evolutionary distances are
extrapolated from those for lesser ones.
The number associated with the matrix (PAM40,
PAM100) refers to the evolutionary distance;
greater numbers correspond to greater distances.
Several later groups have attempted to extend
Dayhoff's methodology or re-apply her analysis
using later databases with more examples.
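
The extrapolation step can be illustrated with a toy model. In Dayhoff's scheme the PAM-1 mutation probability matrix is multiplied by itself n times to obtain the PAM-n probabilities, which are then converted to log-odds scores. The sketch below assumes a hypothetical two-letter alphabet with a 1% change per PAM unit, not the real 20×20 data, and shows only the matrix-power step.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate a PAM-1 probability matrix to PAM-n by repeated multiplication."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy PAM-1: each of two residue types stays unchanged with probability 0.99.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = pam_n(toy_pam1, 250)
print(toy_pam250[0][0])  # probability of seeing the same residue after 250 PAMs
```

In this toy model the probability of observing the original residue after 250 PAM units is close to 0.5, even though far more than one substitution per site has occurred on average. This is why distant matrices carry little identity signal.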
Extensions
Jones, Thornton and coworkers used the same
methodology as Dayhoff but with modern databases
(CABIOS 8:275, 1992)
Gonnet and coworkers (Science 256:1443, 1992)
used a slightly different (but theoretically
equivalent) methodology
Henikoff & Henikoff (Proteins 17:49, 1993)
compared these two newer versions of the PAM
matrices with Dayhoff's originals.