Slides: 49
Provided by: resi80
1
Multiple Sequence Alignment
2
Definition
  • Homology: related by descent
  • Homologous sequence positions

[Figure: how should ATACGC be aligned against ATTGCGC? One candidate
places a gap, giving AT-ACGC under ATTGCGC]
3
Reasons for aligning sets of sequences
  • Organise data to reflect sequence homology
  • Estimate evolutionary distance
  • Infer phylogenetic trees from homologous sites
  • Highlight conserved sites/regions
  • Highlight variable sites/regions
  • Uncover changes in gene structure
  • Look for evidence of selection
  • Summarise information

4
Alignments help to organise, visualise and analyse sequence data
5
  • The process of aligning sequences is a game
    involving playing off gaps and mismatches

6
Ways of aligning multiple sequences
  • By hand
  • Automated
  • Combination

7
Definition
  • Optimality criteria: some kind of rule or scoring
    scheme that helps you decide which alignment you
    consider to be the best

8
Pairwise vs Multiple Sequences
  • Pairs of sequences typically aligned using
    exhaustive algorithms (dynamic programming)
  • complexity of exhaustive methods is O(2^n m^n),
    where n = number of sequences, m = sequence length
  • Multiple sequence alignment usually performed
    using heuristic methods
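
The exhaustive pairwise case can be sketched as a small dynamic programming routine. This is a minimal illustration in the style of Needleman-Wunsch, assuming a simple match/mismatch score and a linear gap cost; the function name and the scoring values are placeholders for illustration and are not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment by dynamic programming (linear gap cost)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("ATTGCGC", "ATACGC")
print(best)
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells for two sequences; extending the same idea to many sequences adds one dimension per sequence, which is why exhaustive multiple alignment quickly becomes impractical.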

9
The Correct Alignment
[Figure: how should ATACGC be aligned against ATTGCGC? One candidate
places a gap, giving AT-ACGC under ATTGCGC]
10
The Correct Alignment
                     Correct according to    Correct according
                     optimality criteria     to homology
Exhaustive methods   Always                  Not always
Heuristic methods    Not always              Not always
11
  • Sequence alignment is easy with sufficiently
    closely related sequences
  • Below a certain level of identity sequence
    alignment may become meaningless
  • twilight zone for amino acid sequences: about 30% identity
  • In the twilight zone it is good to make use of
    additional information if possible (e.g.
    structure)

12
Consensus Sequences
  • Simplest form: a single sequence which represents
    the most common amino acid/base in each position

    Y D D G A   V   - E A L
    Y D G G -   -   - E A L
    F E G G I   L   V E A L
    F D - G I   L   V Q A V
    Y E G G A   V   V Q A L

    Y D G G A/I V/L V E A L   (consensus)
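
A minimal sketch of how such a consensus could be computed, assuming gaps are ignored when counting and tied residues are joined with "/" in order of first appearance (both are assumptions chosen to reproduce the slide's example, not a documented standard). The function name `consensus` is illustrative.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Return the most common residue per column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append(gap)
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so ties appear in row order.
        result.append("/".join(r for r, c in counts.items() if c == top))
    return result


alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(alignment)))
```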

13
Multiple Alignment Formats
  • e.g. Clustal, Phylip, MSF, MEGA, etc.

14
Clustal Format
    CLUSTAL X (1.81) multiple sequence alignment

    CAS1_BOVIN   MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
    CAS1_SHEEP   MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
    CAS1_PIG     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
    CAS1_HUMAN   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
    CAS1_RABBIT  MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
    CAS1_MOUSE   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
    CAS1_RAT     MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
                 .. .

15
Phylip Format (Interleaved)
    7 100
    SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
    SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
    SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
    SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
    SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
    SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
    SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

               AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
               AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
               AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
               AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
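
As a rough illustration of how the interleaved layout is read, the sketch below parses a small hand-made example. It assumes the strict convention of a 10-character name field in the first block, and the parser name and the toy data are invented for this example. Real files vary (relaxed names, different block widths), so an established parser is preferable in practice.

```python
def parse_interleaved_phylip(text):
    """Parse strict interleaved PHYLIP text into a {name: sequence} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_sites = (int(x) for x in lines[0].split())
    names, parts = [], []
    for k, line in enumerate(lines[1:]):
        if k < n_seqs:
            # First block: the first 10 characters are the sequence name.
            names.append(line[:10].strip())
            parts.append([])
            line = line[10:]
        # Later blocks repeat the sequences in the same order, without names.
        parts[k % n_seqs].append(line.replace(" ", ""))
    records = {name: "".join(chunks) for name, chunks in zip(names, parts)}
    for name, seq in records.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return records


example = """2 20
SeqA      ACGTACGTAC
SeqB      ACG-ACGTAC

GGTTAACCGG
GGTTAACC-G
"""
records = parse_interleaved_phylip(example)
print(records)
```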

16
Phylip Format (Sequential)
    3 100
    Rat
    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
    TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
    Mouse
    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
    TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
    Rabbit
    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
    TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG

17
Mega Format
    #mega
    TITLE: No title

    #Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
    #Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
    #Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
    #Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
    #Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
    #Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
    #Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

18
Progressive Multiple Alignment
  • Heuristic
  • Perform pairwise alignments
  • Align sequences to alignments or alignments to
    existing alignments (profile alignments)
  • Do the alignments in some sensible order

19
Progressive versus Simultaneous
  • speed versus accuracy
  • simultaneous methods are capable of working out
    an exact solution to the problem of multiple
    sequence alignment (e.g. NCBI's MSA, with the user
    interface QAlign)

20
Iterative methods
  • Several progressive alignment methods can be
    iterated
  • e.g. Barton-Sternberg, ClustalX

21
ClustalX Algorithm
  • Perform pairwise alignments and calculate
    distances for all pairs of sequences
  • Construct guide tree (dendrogram) joining the
    most similar sequences using Neighbour Joining
  • Align sequences, starting at the leaves of the
    guide tree. This involves the pair-wise
    comparisons as well as comparison of single
    sequence with a group of seqs (Profile)
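
The first two steps can be sketched roughly as follows. To keep the example short it makes two simplifying assumptions that differ from ClustalX itself: distances are taken from sequences that are already pairwise-comparable (equal length, gaps skipped) rather than from fresh pairwise alignments, and the joining order comes from simple average-linkage clustering rather than Neighbour Joining. All names and the toy sequences are illustrative.

```python
from itertools import combinations


def p_distance(x, y):
    """Fraction of differing sites among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(a != b for a, b in pairs) / len(pairs)


def guide_order(seqs):
    """Return the merge order from average-linkage clustering (not NJ)."""
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    active = [(name,) for name in seqs]
    merges = []
    while len(active) > 1:
        # Join the two closest clusters; they would be aligned next.
        c1, c2 = min(combinations(active, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2))
        active = [c for c in active if c not in (c1, c2)] + [c1 + c2]
    return merges


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACCA", "D": "TCGAACCT"}
merges = guide_order(toy)
for left, right in merges:
    print(left, "+", right)
```

The merge list gives the order in which a progressive aligner would combine sequences and profiles: the closest pairs first, then groups of increasing divergence.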

22
  • ClustalX is not optimal
  • There are known areas in which ClustalX performs
    badly e.g.
  • errors introduced early cannot be corrected by
    subsequent information
  • alignments of sequences of differing lengths
    cause strange guide trees and unpredictable
    effects
  • edges: ClustalX does not penalise gaps at edges
  • There are alternatives to ClustalX available

23
T-Coffee
  • JMB 2000
  • Also a progressive alignment method
  • Designed to solve some of the problems with
    Clustal (in particular Clustal's inability to
    correct errors that appear early in the process
    of alignment)
  • Can consider global and local pair-wise
    alignments

24
Using ClustalX
  • Start with sequences in FASTA format (or an
    existing alignment in Clustal format)
  • Choose Do Alignment from the Alignment menu

26
ClustalX Parameters
  • Scoring Matrix
  • Gap opening penalty
  • Gap extension penalty
  • Protein gap parameters
  • Additional algorithm parameters
  • Secondary structure penalties

27
Score Matrices
  • Pairwise matrices and multiple alignment matrix
    series
  • PAM (Dayhoff), BLOSUM (Henikoff), Gonnet
    (default), user defined
  • Transition (A<->G, C<->T) / transversion ratio:
    low for distantly related sequences
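
To make the transition/transversion distinction concrete, here is a small sketch that counts both kinds of difference between two aligned nucleotide rows. Transitions are changes within purines (A, G) or within pyrimidines (C, T); everything else between unambiguous bases is a transversion. The function name is illustrative, and ambiguous codes and gaps are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_counts(x, y):
    """Count transitions and transversions between two aligned DNA rows."""
    transitions = transversions = 0
    for a, b in zip(x, y):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, gap, or ambiguity code
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


print(ts_tv_counts("ACGTAC", "GCTTAT"))
```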

28
Gap Penalties
  • Linear gap penalties versus affine gap penalties
  • Affine penalty: p = o + l × e, where o is the gap
    opening penalty, e the gap extension penalty and
    l the gap length
  • Protein specific penalties (on by default)
  • Increase the probability of gaps associated with
    certain residues
  • Increase the chances of gaps in loop regions (> 5
    hydrophilic residues)
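
A short sketch of the difference between the two schemes, using the affine form p = o + l × e. The penalty values are arbitrary illustrations, not ClustalX defaults, and some programs define the affine penalty slightly differently (for example charging the extension only from the second gap position).

```python
def linear_gap_penalty(length, per_residue):
    """Every gap position costs the same, however the gaps are grouped."""
    return length * per_residue


def affine_gap_penalty(length, opening, extension):
    """p = o + l * e: opening a gap is costly, extending it is cheap."""
    return opening + length * extension


# Six gap positions, either as one long gap or as three short ones.
one_long_linear = linear_gap_penalty(6, per_residue=1)
three_short_linear = 3 * linear_gap_penalty(2, per_residue=1)
one_long_affine = affine_gap_penalty(6, opening=10, extension=1)
three_short_affine = 3 * affine_gap_penalty(2, opening=10, extension=1)

print(one_long_linear, three_short_linear)
print(one_long_affine, three_short_affine)
```

Under the linear scheme both layouts cost the same, while the affine scheme favours a single long gap over several scattered ones, which tends to match how insertions and deletions actually occur.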

29
Algorithm parameters
  • Slow-accurate pair-wise alignment
  • Do alignment from guide tree
  • Reset gaps before aligning (iteration)
  • Delay divergent sequences (%)

30
Additional displays
  • Column Scores
  • Low quality regions
  • Exceptional residues

31
Multiple Alignment Tips
  • Align pairs of sequences using an optimal method
  • Use progressive alignment programs such as
    ClustalX for multiple alignment
  • Choose representative sequences to align
    carefully
  • Choose sequences of comparable lengths
  • Progressive alignment programs may be combined
  • Review alignment by eye and edit
  • If you have a choice, align amino acid sequences
    rather than nucleotides

32
Alignment of coding regions
  • Nucleotide sequences are much harder to align
    accurately than proteins
  • Protein coding sequences can be aligned using the
    protein sequences
  • e.g. in BioEdit: toggle translation to amino acid,
    call ClustalW to align, edit the alignment by
    hand, toggle back to nucleotide
  • In-frame nucleotide alignments can be used, e.g.
    to determine non-synonymous and synonymous
    distances separately
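
The "align as protein, then go back to nucleotides" step can be sketched as threading each unaligned coding sequence through its gapped protein row, so that every amino acid gap becomes a three-base gap and the reading frame is preserved. The function name and the toy sequences are illustrative; this is not how BioEdit implements its toggle.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map a gapped protein row back onto its unaligned coding sequence."""
    n_residues = sum(aa != gap for aa in aligned_protein)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for this protein row")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)           # one residue gap = one codon gap
        else:
            out.append(cds[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    return "".join(out)


print(thread_codons("M-VH", "ATGGTGCAC"))
```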

33
Multiple Alignments and Phylogenetic Trees
  • You can make a more accurate multiple sequence
    alignment if you know the tree already
  • A phylogenetic tree is only as good as the
    alignment from which it was produced
  • The process of constructing a multiple alignment
    (unlike pair-wise) needs to take account of
    phylogenetic relationships

34
Editing a multiple sequence alignment
  • It is NOT fraud to edit a multiple sequence
    alignment
  • Incorporate additional knowledge if possible
  • Alignment editors help to keep the data organised
    and help to prevent unwanted mistakes

35
Alignment Editors
  • e.g. GDE, Bioedit, Seaview, Jalview etc.
  • Some alignment editors have begun to function as
    sequence analysis platforms (e.g. tools on
    BioEdit, GDE)
  • Construct sub-sequences (GDE, Seaview)
  • Annotate sequences (Seaview)

36
Aligning weakly similar sequences
37
Sequence contains conserved regions
  • e.g. DIALIGN (Morgenstern, Dress, Werner)
  • re-aligns regions between conserved blocks
  • http://bibiserv.techfak.uni-bielefeld.de/
  • useful if sequences contain consistent conserved
    blocks
  • Block Maker searches for conserved words that
    may be inconsistent: http://blocks.fhcrc.org/

38
Profile Alignment
  • Gribskov et al. 1987
  • Position specific scores
  • Allows addition of extra sequence(s) to an
    alignment
  • Allows alignment of alignments
  • Gaps introduced as whole columns in the separate
    alignments
  • Optimal alignment in time O(a^2 l^2), where
    a = alphabet size, l = sequence length
  • Information about the degree of conservation of
    sequence positions is included
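
As a simplified illustration of position specific scores, the sketch below turns an alignment into per-column residue frequencies and scores a sequence against them with log-odds values. This is a frequency-based stand-in, assuming a uniform background and a pseudocount of 1; the original profile method of Gribskov et al. derives its column scores from a substitution matrix instead. All names are illustrative.

```python
import math
from collections import Counter


def build_profile(rows, alphabet, pseudocount=1.0):
    """Per-column residue frequencies, smoothed with a pseudocount."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile


def log_odds_score(profile, seq, background):
    """Sum of log(column frequency / background frequency), without gaps."""
    return sum(math.log(col[res] / background[res])
               for col, res in zip(profile, seq))


rows = ["ACGT", "ACGA", "ACGT"]
alphabet = "ACGT"
background = {a: 0.25 for a in alphabet}
profile = build_profile(rows, alphabet)

print(log_odds_score(profile, "ACGT", background))
print(log_odds_score(profile, "TGCA", background))
```

Conserved columns give strongly positive or negative scores, while variable columns contribute little, which is how a profile carries information about the degree of conservation at each position.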

39
Good reasons to use profile alignments
  • Adding a new sequence to an existing multiple
    alignment that you want to keep fixed (align
    sequence to profile)
  • Searching a database for new members of your
    protein family (pfsearch)
  • Searching a database of profiles to find out
    which one your sequence belongs to (pfscan)
  • Combining two multiple sequence alignments
    (profile to profile)

40
Profile Alignment Using ClustalX
  • Profile Alignment Mode
  • Align sequence to profile
  • Align profile 1 to profile 2
  • Secondary structure parameters

42
Profile searching using PSI-BLAST
  • Position Specific Iterative
  • Perform search -> construct profile -> perform
    search
  • Convergence (hopefully)
  • Increased sensitivity for distantly related
    sequences
  • Available on-line (NCBI)

43
Databases of Aligned Sequences
  • Hovergen: http://pbil.univ-lyon1.fr/databases/hovergen.html
    (vertebrate alignments)
  • Pfam: http://www.sanger.ac.uk/Software/Pfam/
    (protein domain alignments and profile HMMs)
  • BLOCKS: http://blocks.fhcrc.org/
  • Ribosomal Database Project: http://rdp.cme.msu.edu/html/
    (alignments and trees derived from rRNA sequences)
  • InterPro: combines information from other
    sources
  • Many more

44
Probabilistic Models of Sequence Alignment
  • Hidden Markov Models
  • sequence of states and associated symbol
    probabilities
  • Produces a probabilistic model of a sequence
    alignment
  • Align a sequence to a Profile Hidden Markov Model
  • Algorithms exist to find the most probable path
    through the model efficiently (e.g. the Viterbi
    algorithm)

45
  • Markov chain: a chain of things. The
    probability of the next thing depends only on the
    current thing
  • Hidden Markov Model: a sequence of states which
    form a Markov chain. The states are not
    observable. The observable characters have
    emission probabilities which depend on the
    current state.
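
A toy example may help here. The sketch below defines a two-state hidden Markov model over DNA (an AT-rich and a GC-rich state, both invented for illustration) and uses the Viterbi algorithm to recover the most probable hidden state path for an observed sequence. Profile HMMs for alignments use match, insert and delete states instead, but the decoding idea is the same. All probabilities are made-up values and must be non-zero for the log transform.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state path for obs, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    pointers = []
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states,
                       key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][symbol]))
            ptr[s] = prev
        scores.append(row)
        pointers.append(ptr)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))


states = ("AT_rich", "GC_rich")
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

path = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path)
```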

46
Some more recent developments
  • The need to align genomes
  • alignment tools are required that can align very
    large regions of genomes
  • this poses a computational challenge
  • programs such as DIALIGN can be run in parallel
    on multiprocessor machines

47
Some more recent developments
  • MUSCLE
  • Faster (uses k-mer frequencies to estimate the
    first pair-wise distances)
  • Progressive (repeats the MSA using the more
    accurate Kimura distance between aligned amino
    acid sequences)
  • Has a third optimisation stage that involves
    making profile alignments of sub-trees and
    accepting the new alignment if it improves the SP
    score.
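
The SP (sum-of-pairs) score used as the acceptance test in that third stage can be sketched as follows. This version assumes a flat match/mismatch/gap scheme for brevity, whereas real programs score residue pairs with a substitution matrix and affine gap penalties; the function name and values are illustrative.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-1):
    """Sum the pairwise scores over every pair of rows in every column."""
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue          # gap against gap is not scored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


before = ["AC-T", "ACGT", "A-GT"]
score = sp_score(before)
print(score)
```

A refinement step would realign part of the alignment, recompute this score, and keep the new alignment only if the score went up.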

48
  • MuSiC - multiple sequence alignment with
    constraints
  • web server that allows a user to enter a set of