Multiple Sequence Alignment - PowerPoint PPT Presentation

Provided by: publicGe

Title: Multiple Sequence Alignment


1
Multiple Sequence Alignment: An Inexact Science
2
Why Do Multiple Sequence Alignments?
Biological sequences often occur in families, so we may want to know
which family our sequence of interest belongs to. For the most part we
are interested in families of proteins, since homologous sequences
retain similar structures and functions. Thus, if we know a feature of
one of the proteins, we can often identify similar features in
homologous proteins and predict that they have similar functions.
Since most proteins have been identified by the sequencing of genomic
DNA, the functions of most proteins have been assigned on the basis of
homology to other known proteins rather than on the basis of results
from biochemical or functional (cell biological) assays.
Aligned residues tend to occupy corresponding positions in the 3D
structure of each aligned protein. Protein structures evolve over
time, but sequences evolve more rapidly than structures, so a good
multiple sequence alignment can tell us something about the evolution
of the protein.
3
For these reasons, multiple sequence alignments are used to:
  • provide insight into the likely function and
    structure of a protein
  • provide a more sensitive method for finding
    distantly related members of a protein family
  • find conserved residues or motifs in a subset of
    the results from a database search
  • gain understanding of the evolution of a
    particular protein
  • help understand the gene products of a newly
    sequenced genome

4
Any multiple sequence alignment starts with a primary sequence. In the
following we start with hMSH3, Swiss-Prot accession number P20585.

  sp|P20585|MSH3_HUMAN  GRGTSTHDGIAIAYATLEYFIRDVKS--LTLFVTHYPPVCELEKNYSHQVGNYHMGFLVS 1035
  sp|P43246|MSH2_HUMAN  GRGTSTYDGFGLAWAISEYIATKIGA--FCMFATHFHELTALAN-QIPTVNNLHVTALTT 807
  sp|O15457|MSH4_HUMAN  GRGTNTEEGIGICYAVCEYLLSLKAF---TLFATHFLELCHIDALYPNVENMHFEVQHVK 818
  sp|O43196|MSH5_HUMAN  GKGTNTVDGLALLAAVLRHWLARGPTCPHIFVATNFLSLVQLQLLPQGPLVQYLTMETCE 733
  sp|P52701|MSH6_HUMAN  GRGTATFDGTAIANAVVKELAETIKC--RTLFSTHYHSLVEDYSQNVAVRLGHMACMVEN 1273
  sp|P13705|MSH3_MOUSE  GRGTSTHDGIAIAYATLEYFIRDVKS--LTLFVTHYPPVCELEKCYPEQVGNYHMGFLVN 989
  sp|P25336|MSH3_YEAST  GRGTGTHDGIAISYALIKYFSELSDCP-LILFTTHFPMLGEIKS---PLIRNYHMDYVEE 957
  tr|Q73MQ8             KKGKWLVAVDNLKLTIAEDDIEVCENQEKLKLSKPIVSIISDTSAPSSRPS--------- 740
                        . . . .

Interpreting the notation at the bottom of the alignment:
  *  entirely conserved column
  :  all residues have approximately the same size and hydropathy
  .  one of either size or hydropathy has been preserved

In general, our lab manual says that a good block in an alignment is
at least 10-30 aa long and contains at least one to three *s, five to
seven :s, and a few .s. If this is the case, we suspect that we have
identified a conserved region within the alignment.
5
The previous example illustrates the following observations:
  • We cannot expect two protein structures with
    different sequences to be completely superposable
    (i.e. we cannot completely superimpose one upon
    the other).
  • Research by Chothia & Lesk in 1986 found that,
    given two protein sequences that were clearly
    homologous (30% identical), usually only about
    50% of the individual residues were
    superposable.

6
  • Evolutionarily correct alignment is more
    difficult to infer than structural alignment.
  • Evolutionary history of the residues from a
    sequence of a family is not usually known from
    any source.
  • It must be inferred from sequence alignment.
  • Sequence alignment has an independent source of
    reference; this may not be the common ancestor.

This does not mean that multiple sequence alignment is
straightforward: there is no way to define an
unambiguously correct alignment.
7
In general, humans can do better than machines at
aligning sequences. Many biologists can produce
high-quality alignments by hand.
Some of the factors to be considered are:
  • Highly conserved sequences
  • Buried hydrophobic residues
  • Expected patterns of insertions and deletions
    that tend to alternate with conserved sequences
  • Phylogenetic relationships
  • Influence of secondary and tertiary structure,
    e.g. alternation of hydrophobic and
    hydrophilic columns in an exposed β sheet

But it is HARD and TEDIOUS work! Most of the time it
is best to start with a machine-generated
sequence alignment.
8
The basic algorithm is called progressive alignment. It was developed
in the late 1980s by Da-Fei Feng and Russell Doolittle.
The algorithm follows these steps:
  • First, a group of proteins is chosen to be
    aligned.
  • Every protein in the group is globally aligned,
    using Needleman-Wunsch, with every other protein
    in the group.
  • A distance matrix is used to score the pairwise
    alignments. These scores are used to generate a
    guide tree for constructing the alignment.
  • The guide tree, which reflects the relatedness of
    the sequences to be aligned, shows the order in
    which sequences should be added to the multiple
    alignment, starting with the two most closely
    related sequences.
  • One by one, the next most closely related
    sequences are added to the alignment. For
    example, one naïve way is to create a consensus
    alignment for the first two, which may include
    gaps, then align the next sequence with this
    consensus alignment, and continue the process
    until all of the sequences are aligned.
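
The first steps above (all-against-all global alignment, then an order of addition) can be sketched in a few lines of Python. This is only an illustration, not the Feng-Doolittle implementation: the match/mismatch/gap values are placeholder assumptions standing in for a real substitution matrix, the helper names `nw_score`, `pairwise_scores` and `addition_order` are hypothetical, and a greedy "most similar next" ordering replaces a proper guide tree.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not BLOSUM/PAM values.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every sequence against every other one (names are dict keys)."""
    return {
        (x, y): nw_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


def addition_order(seqs):
    """Greedy stand-in for a guide tree: start with the most similar pair,
    then repeatedly add the sequence most similar to one already placed."""
    scores = pairwise_scores(seqs)

    def score(x, y):
        return scores[(x, y)] if (x, y) in scores else scores[(y, x)]

    order = list(max(scores, key=scores.get))
    remaining = sorted(set(seqs) - set(order))
    while remaining:
        nxt = max(remaining, key=lambda r: max(score(r, o) for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Calling `addition_order({"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTTTTT"})` places the two near-identical sequences first and the outlier last. Building the actual multiple alignment (aligning each new sequence to the growing alignment) is deliberately left out of this sketch.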

9
We will illustrate this process for the following three sequences:

>gi|62901522|sp|P01865|GCAM_MOUSE Ig gamma-2A chain C region, membrane-bound form
KTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVT
SSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPCPPCKCPAPNLLGGPSVFIFPPKIKDVLMISLSPI
VTCVVVDVSEDDPDVQISWFVNNVEVHTAQTQTHREDYNSTLRVVSALPIQHQDWMSGKEFKCKVNNKDL
PAPIERTISKPKGSVRAPQVYVLPPPEEEMTKKQVTLTCMVTDFMPEDIYVEWTNNGKTELNYKNTEPVL
DSDGSYFMYSKLRVEKKNWVERNSYSCSVVHEGLHNHHTTKSFSRTPGLDLDDVCAEAQDGELDGLWTTI
TIFISLFLLSVCYSASVTLFKVKWIFSSVVELKQTISPDYRNMIGQGA

>gi|121048|sp|P01863|GCAA_MOUSE Ig gamma-2A chain C region, A allele
AKTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTV
TSSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPCPPCKCPAPNLLGGPSVFIFPPKIKDVLMISLSP
IVTCVVVDVSEDDPDVQISWFVNNVEVHTAQTQTHREDYNSTLRVVSALPIQHQDWMSGKEFKCKVNNKD
LPAPIERTISKPKGSVRAPQVYVLPPPEEEMTKKQVTLTCMVTDFMPEDIYVEWTNNGKTELNYKNTEPV
LDSDGSYFMYSKLRVEKKNWVERNSYSCSVVHEGLHNHHTTKSFSRTPGK

>gi|113588|sp|P01878|IGHA_MOUSE Ig alpha chain C region
ESARNPTIYPLTLPPALSSDPVIIGCLIHDYFPSGTMNVTWGKSGKDITTVNFPPALASGGRYTMSNQLT
LPAVECPEGESVKCSVQHDSNPVQELDVNCSGPTPPPPITIPSCQPSLSLQRPALEDLLLGSDASITCTL
NGLRNPEGAVFTWEPSTGKDAVQKKAVQNSCGCYSVSSVLPGCAERWNSGASFKCTVTHPESGTLTGTIA
KVTVNTFPPQVHLLPPPSEELALNELLSLTCLVRAFNPKEVLVRWLHGNEELSPESYLVFEPLKEPGEGA
TTYLVTSVLRVSAETWKQGDQYSCMVGHEALPMNFTQKTIDRLSGKPTNVSVSVIMSEGDGICY
10
Pairwise alignments

  SeqA  Name                            Len(aa)  SeqB  Name                            Len(aa)  Score
  1     gi|62901522|sp|P01865|GCAM_MOU  398      2     gi|121048|sp|P01863|GCAA_MOUSE  330      99
  1     gi|62901522|sp|P01865|GCAM_MOU  398      3     gi|113588|sp|P01878|IGHA_MOUSE  344      24
  2     gi|121048|sp|P01863|GCAA_MOUSE  330      3     gi|113588|sp|P01878|IGHA_MOUSE  344      25

Guide Tree

  (gi|62901522|sp|P01865|GCAM_MOU:0.00966,
   gi|121048|sp|P01863|GCAA_MOUSE:-0.00360,
   gi|113588|sp|P01878|IGHA_MOUSE:0.74906)
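
To see how the pairwise scores determine the order of addition, here is a small sketch using the three scores from the table above. It assumes the Score column is a percent identity and converts it to a distance as 1 - score/100; that conversion is one simple choice among several and is not taken from the slides.

```python
# Pairwise scores from the table above, keyed by (SeqA, SeqB) number.
scores = {(1, 2): 99, (1, 3): 24, (2, 3): 25}

# Assumption: treat the score as percent identity and use 1 - identity
# as the distance between the two sequences.
distances = {pair: 1 - s / 100 for pair, s in scores.items()}

# The closest pair is aligned first; the remaining sequence joins last.
closest_pair = min(distances, key=distances.get)
all_seqs = {n for pair in scores for n in pair}
added_last = sorted(all_seqs - set(closest_pair))
```

Here `closest_pair` is `(1, 2)` and `added_last` is `[3]`, which agrees with the guide tree: the two gamma-2A sequences sit on very short branches and the alpha chain on a long one.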
11
The Alignment
12
  • The basic problem is assigning a score to the
    alignment.
  • Producing a meaningful score is a very inexact
    science at this point and may remain so.
A good scoring system should take into account:
  • The fact that some positions are more conserved
    than others; thus, we infer that scoring needs
    to be position specific.
  • The fact that the sequences are not independent,
    but related by a phylogenetic tree. Sometimes
    there is a small set of sequences that can be
    aligned unambiguously and used as a starting
    point for the complete alignment.

13
We will tend to ignore the phylogenetic tree and
concentrate on the first criterion.
Simplifying assumption: individual columns of the alignment are
statistically independent.
Note: gaps will be aligned with gaps. The only restriction is that
every column must have at least one non-gap character.
Typical scoring function: let $m$ be some alignment. Then
  $S(m) = G + \sum_i S(m_i)$
where $m_i$ is column $i$ in the alignment, $S(m_i)$ is the score of
that column, and $G$ is a function for scoring the comparison of
gaps.
14
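The score for a column, $S(m_i)$, is usually
computed as a Sum of Pairs (SP), i.e. each residue in
the column is paired with every other residue in the
same column and each pair is evaluated using a
standard scoring scheme (BLOSUMn or PAMn, where n is
your favorite number). A gap paired with a residue is
usually scored using the gap penalty, and a gap paired
with a gap is usually scored as 0. The standard
sum-of-pairs formula for the score of a column is
  $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
where $m_i^k$ is the symbol of sequence $k$ in column $i$ and $s$ is
the pairwise score from the chosen scheme.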
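
A minimal sketch of sum-of-pairs column scoring follows. The substitution table holds only the two BLOSUM50 values quoted later in this transcript (L/L = 5 and L/G = -4); a real run would load the full matrix. The gap penalty of -8 and the names `SUBST`, `pair_score` and `sp_column_score` are assumptions made for illustration.

```python
from itertools import combinations

# Only the BLOSUM50 entries quoted in this transcript; keys are sorted pairs.
SUBST = {("L", "L"): 5, ("G", "L"): -4}


def pair_score(a, b, gap_penalty=-8):
    """Score one pair of symbols from the same column.

    gap_penalty=-8 is an illustrative assumption, not a value from the slides.
    """
    if a == "-" and b == "-":
        return 0  # gap against gap scores 0
    if a == "-" or b == "-":
        return gap_penalty  # gap against residue
    return SUBST[tuple(sorted((a, b)))]


def sp_column_score(column, gap_penalty=-8):
    """Sum the pair scores over all unordered pairs of rows in a column."""
    return sum(pair_score(a, b, gap_penalty) for a, b in combinations(column, 2))
```

For example, `sp_column_score("LLLL")` covers six pairs of leucines, and `sp_column_score("LLGL")` covers three L/L pairs and three L/G pairs.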

15
Problem with sum-of-pairs scoring
Suppose you are comparing N sequences and using BLOSUM50 as
your comparison matrix. Suppose all N rows in a
column have L (leucine). The BLOSUM50 score for
comparing L to L is 5, for a total column score of
$5N(N-1)/2$ (5 times the number of
symbol pairs in the column).

  Seq 1  ..L.
  Seq 2  ..L.
  Seq 3  ..L.
  Seq 4  ..L.

  Score(Seq 2, Seq 1) = 5    Score(Seq 4, Seq 1) = 5
  Score(Seq 3, Seq 1) = 5    Score(Seq 4, Seq 2) = 5
  Score(Seq 3, Seq 2) = 5    Score(Seq 4, Seq 3) = 5

  Column Score = 30
16
Now suppose one of the Ls is replaced by a G
(glycine). The BLOSUM50 score for comparing L to G
is -4 instead of 5, so $(N-1)$ pairs are each reduced
by 9 (in our example, 3 pairs are reduced by 9).

  Seq 1  ..L.
  Seq 2  ..L.
  Seq 3  ..G.
  Seq 4  ..L.

  Score(Seq 2, Seq 1) = 5     Score(Seq 4, Seq 1) = 5
  Score(Seq 3, Seq 1) = -4    Score(Seq 4, Seq 2) = 5
  Score(Seq 3, Seq 2) = -4    Score(Seq 4, Seq 3) = -4

  Column Score = 3

The fraction reduction in the SP score is
  $(30 - 3)/30 = 27/30 = 90\%$

17
However, if we were comparing a similar
collection of 60 sequences, the score for a
column of all Ls is $1770 \times 5 = 8850$. If one of
the Ls were a G, then the score would be reduced
by a rescoring of 59 pairs, or $59 \times 9 = 531$. The
fraction reduction in the SP score is only
  $531/8850 = 6\%$
In general, if we
have N sequences to be aligned and they
all contain an L in the same column except for
one G, then the fraction reduction in the SP score
as a result of the G is
  $\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$
The problem is the N in the denominator. As more
sequences are added (and say they all have Ls),
the effect of this G is reduced. It should be
amplified to show a disparity.
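
The arithmetic above is easy to check for any N. The sketch below uses exact fractions and takes the L/L and L/G scores (5 and -4) from the slides; the function names `sp_all_match` and `sp_fraction_reduction` are hypothetical.

```python
from fractions import Fraction


def sp_all_match(n, match=5):
    """SP score of a column in which all n rows hold the same residue."""
    return match * n * (n - 1) // 2


def sp_fraction_reduction(n, match=5, mismatch=-4):
    """Fraction of the all-match SP score lost when one row is substituted.

    The n - 1 pairs involving the odd residue each drop by match - mismatch.
    """
    lost = (match - mismatch) * (n - 1)
    return Fraction(lost, sp_all_match(n, match))
```

`sp_fraction_reduction(4)` gives 9/10 and `sp_fraction_reduction(60)` gives 3/50, matching the 90% and 6% figures; doubling N halves the relative penalty for the same single mismatch.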
18
Nonetheless, SP is commonly used. But the
problem does not stop here. To find the
best-scoring alignment, multidimensional dynamic
programming, similar to what we did earlier, is
used. This becomes intractable very fast.
Suppose we are aligning only 3 sequences.
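
To get a feel for how fast this grows, the sketch below counts the work for full multidimensional dynamic programming. It relies on the standard facts that aligning sequences of lengths L1, ..., LN needs a table with (L1+1)...(LN+1) cells and that each cell has 2^N - 1 predecessor cells; these are not stated in the slides, and the function names are hypothetical.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in the N-dimensional dynamic programming table."""
    return prod(n + 1 for n in lengths)


def predecessors_per_cell(num_seqs):
    """Each sequence either contributes a residue or a gap to a column,
    excluding the all-gap column, so there are 2^N - 1 possible moves."""
    return 2 ** num_seqs - 1
```

For the three immunoglobulin sequences used earlier (398, 330 and 344 residues), `dp_cells([398, 330, 344])` is already roughly 45 million cells with 7 predecessors each, and every additional sequence multiplies the table by another factor of a few hundred.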