# Multiple Sequence Alignment

Slides: 49
Provided by: resi80

1
Multiple Sequence Alignment
2
Definition
• Homology: related by descent
• Homologous sequence positions

[Figure: diagram relating the sequences ATTGCGC and ATACGC, with unknown states marked "?"; the sequences line up as ATTGCGC / AT-ACGC]
3
Reasons for aligning sets of sequences
• Organise data to reflect sequence homology
• Estimate evolutionary distance
• Infer phylogenetic trees from homologous sites
• Highlight conserved sites/regions
• Highlight variable sites/regions
• Uncover changes in gene structure
• Look for evidence of selection
• Summarise information

4
Alignments help to
Organise
Visualise
Analyze
Sequence Data
5
• The process of aligning sequences is a game
involving playing off gaps and mismatches

6
Ways of aligning multiple sequences
• By hand
• Automated
• Combination

7
Definition
• Optimality criteria: some kind of rule or scoring
scheme that defines the best alignment

8
Pairwise vs Multiple Sequences
• Pairs of sequences typically aligned using
exhaustive algorithms (dynamic programming)
• complexity of exhaustive methods is $O(2^n m^n)$, where n =
number of sequences and m = sequence length
• Multiple sequence alignment usually performed
using heuristic methods
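
As a minimal sketch of the exhaustive pairwise case, the function below fills a dynamic programming table for two sequences and traces back one optimal alignment. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are illustrative assumptions, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming (linear gap cost)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # substitution score for a[i-1] aligned with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ATTGCGC", "ATACGC")
print(best)
print(top)
print(bottom)
```

The table has (n+1)(m+1) cells for two sequences; extending the same idea to many sequences multiplies the table dimensions, which is why heuristics are used for multiple alignment.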

9
The Correct Alignment
[Figure: diagram relating the sequences ATTGCGC and ATACGC, with unknown states marked "?"; the sequences line up as ATTGCGC / AT-ACGC]
10
The Correct Alignment
| | Correct according to optimality criteria | Correct according to homology |
|---|---|---|
| Exhaustive methods | Always | Not always |
| Heuristic methods | Not always | Not always |
11
• Sequence alignment is easy with sufficiently
closely related sequences
• Below a certain level of identity sequence
alignment may become meaningless
• twilight zone for amino acid sequences: about 30% identity
• In the twilight zone it is good to make use of
other information (e.g. structure)

12
Consensus Sequences
• Simplest form: a single sequence which represents
the most common amino acid/base at each position
• Y D D G A V - E A L
• Y D G G - - - E A L
• F E G G I L V E A L
• F D - G I L V Q A V
• Y E G G A V V Q A L
• Consensus: Y D G G A/I V/L V E A L
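
The consensus above can be reproduced with a short sketch: count the residues in each column, ignore gaps, and report ties joined by "/". The function name and the tie-handling rule (ties listed in order of first appearance) are assumptions made for this illustration.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most common residue per column; ties are joined with '/'."""
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            # column contains only gaps
            result.append(gap)
            continue
        top = max(counts.values())
        # Counter keeps first-appearance order, so ties come out top to bottom
        winners = [c for c, n in counts.items() if n == top]
        result.append("/".join(winners))
    return result


rows = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(rows)))
```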

13
Multiple Alignment Formats
• e.g. Clustal, Phylip, MSF, MEGA etc. etc.

14
Clustal Format
• CLUSTAL X (1.81) multiple sequence alignment
• CAS1_BOVIN MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
• CAS1_SHEEP MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
• CAS1_PIG MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
• CAS1_HUMAN MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
• CAS1_RABBIT MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
• CAS1_MOUSE MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
• CAS1_RAT MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
• .. .

15
Phylip Format (Interleaved)
• 7 100
• SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA
MSLSGLFANA VLRAQHLHQL
• SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA
MSLSGLFANA VLRAQHLHQL
MPLSSLFANA VLRAQHLHQL
• SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA
MPLSSLFSNA VLRAQHLHQL
• SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA
MPLSSLFANA VLRAQHLHQL
• SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA
MPLSSLFANA VLRAQHLHQL
• SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT
IPLSRLFDNA MLRAHRLHQL
FSETIPAPTG KNEAQQKSDL
FSETIPAPTG KNEAQQKSDL
FSETIPAPTG KEEAQQRTDM
FSETIPAPTG KEEAQQRTDM
FSETIPAPTG KDEAQQRSDM
FSETIPAPTG KDEAQQRSDV
• AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC
FSESIPTPSN REETQQKSNL

16
Phylip Format (Sequential)
• 3 100
• Rat
• ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
• TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
• Mouse
• ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
• TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
• Rabbit
• ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
• TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG

17
Mega Format
• #mega
• TITLE: No title
• #Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
• #Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
• #Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
• #Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
• #Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
• #Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
• #Frog ---ATGGGTTTGACAGCACATGATCGT---CAGCT

18
Progressive Multiple Alignment
• Heuristic
• Perform pairwise alignments
• Align sequences to alignments or alignments to
existing alignments (profile alignments)
• Do the alignments in some sensible order

19
Progressive versus Simultaneous
• speed versus accuracy
• simultaneous methods are capable of working out
an exact solution to the problem of multiple
sequence alignment (e.g. NCBI's MSA; user
interface: QAlign)

20
Iterative methods
• Several progressive alignment methods can be
iterated
• e.g. Barton-Sternberg, ClustalX

21
ClustalX Algorithm
• Perform pairwise alignments and calculate
distances for all pairs of sequences
• Construct guide tree (dendrogram) joining the
most similar sequences using Neighbour Joining
• Align sequences, starting at the leaves of the
guide tree. This involves pair-wise
comparisons as well as comparison of a single
sequence with a group of sequences (profile)
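
The first step, distances for all pairs, can be sketched as below. This is only an illustration under stated assumptions: the sequences are taken as already aligned, the distance is a simple proportion of differing sites (p-distance), and only the closest pair is picked out. It does not implement Neighbour Joining or ClustalX's own distance calculation.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(seqs):
    """Distances for every unordered pair of named sequences."""
    return {
        (n1, n2): p_distance(seqs[n1], seqs[n2])
        for n1, n2 in combinations(seqs, 2)
    }


# Fragments copied from the Mega format slide
seqs = {
    "Rat":    "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Mouse":  "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Rabbit": "ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC",
}
dist = distance_matrix(seqs)
closest = min(dist, key=dist.get)  # the pair a guide tree would join first
print(closest, dist[closest])
```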

22
• ClustalX is not optimal
• There are known areas in which ClustalX performs poorly:
• errors introduced early cannot be corrected by
subsequent information
• alignments of sequences of differing lengths
cause strange guide trees and unpredictable
effects
• edges: ClustalX does not penalise gaps at edges
• There are alternatives to ClustalX available

23
T-Coffee
• JMB 2000
• Also a progressive alignment method
• Designed to solve some of the problems with
Clustal (in particular Clustal's
inability to correct errors that appear early in
the process of alignment)
• Can consider global and local pair-wise
alignments

24
Using ClustalX
existing alignment in Clustal format
• Do Alignment on the alignment menu

26
ClustalX Parameters
• Scoring Matrix
• Gap opening penalty
• Gap extension penalty
• Protein gap parameters
• Secondary structure penalties

27
Score Matrices
• Pairwise matrices and multiple alignment matrix
series
• PAM (Dayhoff), BLOSUM (Henikoff), GONNET
(default), user defined
• Transition (A↔G, C↔T)/transversion (purine↔pyrimidine) ratio:
low for distantly related sequences

28
Gap Penalties
• Linear gap penalties; affine gap penalties
• p = o + l·e, where l is the gap length
• o: gap opening
• e: gap extension
• Protein specific penalties (on by default)
• Increase the probability of gaps associated with
certain residues
• Increase the chances of gaps in loop regions (> 5
hydrophilic residues)
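
A small sketch of the difference between the two schemes, using arbitrary illustrative values (opening 10, extension 0.5) rather than any program's defaults. Under an affine penalty one long gap is much cheaper than several short ones; under a linear penalty they cost the same.

```python
def linear_gap_penalty(length, extend):
    """Linear scheme: every gap position costs the same."""
    return length * extend


def affine_gap_penalty(length, gap_open, extend):
    """Affine scheme: p = o + l*e, charged once per gap."""
    if length == 0:
        return 0
    return gap_open + length * extend


# One gap of length 6 versus six separate gaps of length 1
one_long = affine_gap_penalty(6, gap_open=10, extend=0.5)
six_short = 6 * affine_gap_penalty(1, gap_open=10, extend=0.5)
print(one_long, six_short)

# With a linear penalty the two layouts are indistinguishable
print(linear_gap_penalty(6, 0.5), 6 * linear_gap_penalty(1, 0.5))
```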

29
Algorithm parameters
• Slow-accurate pair-wise alignment
• Do alignment from guide tree
• Reset gaps before aligning (iteration)
• Delay Divergent sequences (%)

30
• Column Scores
• Low quality regions
• Exceptional residues

31
Multiple Alignment Tips
• Align pairs of sequences using an optimal method
• Progressive alignment programs such as ClustalX
for multiple alignment
• Choose representative sequences to align
carefully
• Choose sequences of comparable lengths
• Progressive alignment programs may be combined
• Review alignment by eye and edit
• If you have a choice align amino acid sequences
rather than nucleotides

32
Alignment of coding regions
• Nucleotide sequences much harder to align
accurately than proteins
• Protein coding sequences can be aligned using the
protein sequences
• e.g. in BioEdit: toggle translation to amino acid,
call ClustalW to align, edit alignment by hand,
toggle back to nucleotide
• In-frame nucleotide alignments can be used, e.g.
to determine non-synonymous and synonymous
distances separately
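
The "align as protein, then go back to nucleotides" step can be sketched as follows. This is an illustration, not BioEdit's implementation: the function name is hypothetical, the coding sequence is assumed to be in frame with the stop codon removed, and the code does not check that the codons actually translate to the given residues.

```python
def codon_alignment(aligned_protein, cds, gap="-"):
    """Thread an unaligned coding sequence onto its aligned protein sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if n_residues * 3 != len(cds):
        raise ValueError("protein and coding sequence lengths disagree")
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    # each residue consumes one codon; each protein gap becomes three nucleotide gaps
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


print(codon_alignment("MV-HL", "ATGGTGCACCTG"))
```

Because gaps are always inserted in multiples of three, the resulting nucleotide alignment stays in frame.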

33
Multiple Alignments and Phylogenetic Trees
• You can make a more accurate multiple sequence
alignment if you know the tree already
• A phylogenetic tree is only as good as the
alignment from which it was produced
• The process of constructing a multiple alignment
(unlike pair-wise) needs to take account of
phylogenetic relationships

34
Editing a multiple sequence alignment
• It is NOT fraud to edit a multiple sequence
alignment
• Incorporate additional knowledge if possible
• Alignment editors help to keep the data organised
and help to prevent unwanted mistakes

35
Alignment Editors
• e.g. GDE, BioEdit, Seaview, Jalview etc.
• Some alignment editors have begun to function as
sequence analysis platforms (e.g. tools on
BioEdit, GDE)
• Construct sub-sequences (GDE, Seaview)
• Annotate sequences (Seaview)

36
Aligning weakly similar sequences
37
Sequence contains conserved regions
• e.g. DIALIGN (Morgenstern, Dress, Werner)
• re-aligns regions between conserved blocks
• http://bibiserv.techfak.uni-bielefeld.de/
• useful if sequences contain consistent conserved
blocks
• Block Maker searches for conserved words that
may be inconsistent: http://blocks.fhcrc.org/

38
Profile Alignment
• Gribskov et al. 1987
• Position specific scores
• Allows addition of extra sequence(s) to an
alignment
• Allows alignment of alignments
• Gaps introduced as whole columns in the separate
alignments
• Optimal alignment in time $O(a^2 l^2)$
• a = alphabet size, l = sequence length
• Information about the degree of conservation of
sequence positions is included
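
A minimal sketch of position-specific scores: each alignment column becomes a table of residue frequencies, and a sequence is scored by summing log-odds against a uniform background. The pseudocount, the background value and the function names are assumptions for illustration; a real profile alignment would also place gaps by dynamic programming, which is not shown here.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet, pseudocount=1.0):
    """One frequency table per column; gaps are ignored, pseudocounts avoid zeros."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({c: (counts[c] + pseudocount) / total for c in alphabet})
    return profile


def score_against_profile(seq, profile, background):
    """Sum of log-odds scores for an ungapped sequence of the profile's length."""
    return sum(math.log(col[c] / background) for c, col in zip(seq, profile))


alignment = ["ATTGCGC", "ATTGCGC", "AT-ACGC"]
profile = build_profile(alignment, alphabet="ACGT")
good = score_against_profile("ATTGCGC", profile, background=0.25)
bad = score_against_profile("CCCCCCC", profile, background=0.25)
print(good, bad)
```

Conserved columns contribute large scores for the expected residue and strongly negative scores for anything else, which is how the degree of conservation enters the alignment.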

39
Good reasons to use profile alignments
• Adding a new sequence to an existing multiple
alignment that you want to keep fixed (align
sequence to profile)
• Searching a database for new members of your
protein family (pfsearch)
• Searching a database of profiles to find out
which one your sequence belongs to (pfscan)
• Combining two multiple sequence
alignments (profile to profile)

40
Profile Alignment Using ClustalX
• Profile Alignment Mode
• Align sequence to profile
• Align profile 1 to profile 2
• Secondary structure parameters

42
Profile searching using PSI-BLAST
• Position Specific Iterative
• Perform search → construct profile → perform
search
• Convergence (hopefully)
• Increased sensitivity for distantly related
sequences
• Available on-line (NCBI)

43
Databases of Aligned Sequences
• Hovergen: http://pbil.univ-lyon1.fr/databases/hovergen.html
(vertebrate alignments)
• Pfam: http://www.sanger.ac.uk/Software/Pfam/
(protein domain alignments and profile HMMs)
• BLOCKS: http://blocks.fhcrc.org/
• Ribosomal Database Project: http://rdp.cme.msu.edu/html/
(alignments and trees derived from rRNA sequences)
• InterPro: combines information from other
sources
• Many more

44
Probabilistic Models of Sequence Alignment
• Hidden Markov Models
• sequence of states and associated symbol
probabilities
• Produces a probabilistic model of a sequence
alignment
• Align a sequence to a Profile Hidden Markov Model
• Algorithms exist to find the most efficient
pathway through the model

45
• Markov Chain: a chain of things. The
probability of the next thing depends only on the
current thing
• Hidden Markov Model: a sequence of states which
form a Markov Chain. The states are not
observable. The observable characters have
emission probabilities which depend on the
current state.
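
One standard algorithm for finding the most probable path of hidden states is the Viterbi algorithm. The sketch below applies it to a toy two-state model (AT-rich versus GC-rich regions); the states and all probabilities are invented for illustration and are not a profile HMM.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable sequence of hidden states for the observed characters."""
    # best[t][s] = log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s] = state at step t-1 on the best path into s
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


states = ["AT", "GC"]
start_p = {"AT": 0.5, "GC": 0.5}
trans_p = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emit_p = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
print(viterbi("AAAAGGGG", states, start_p, trans_p, emit_p))
```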

46
Some more recent developments
• The need to align genomes
• alignment tools required that can align very
large regions of genomes
• poses a computational challenge
• programmes such as DIALIGN can be run in parallel
on multiprocessor machines

47
Some more recent developments
• MUSCLE
• Faster (uses a k-mer frequency to calculate first
pair-wise alignments)
• Progressive (repeats the MSA using the more
accurate Kimura distance between aligned amino
acid sequences)
• Has a third optimisation stage that involves
making profile alignments of sub-trees and
accepting the new alignment if it improves the SP
score.
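
The accept-if-better rule in the third stage relies on the sum-of-pairs (SP) score, which can be sketched as below. The scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) are toy assumptions; MUSCLE itself scores pairs with a substitution matrix and its own gap penalties.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score for one pair of characters in the same column."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(alignment):
    """Sum-of-pairs score: add up pair scores over all columns and all row pairs."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


current = ["AC-", "ACG", "A-G"]
candidate = ["ACG", "ACG", "A-G"]
print(sp_score(current), sp_score(candidate))
# Keep the candidate only if it improves the SP score
accepted = candidate if sp_score(candidate) > sp_score(current) else current
```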

48
• MuSiC - multiple sequence alignment with
constraints
• web server that allows a user to enter a set of