COT 6930 HPC and Bioinformatics Multiple Sequence Alignment - PowerPoint PPT Presentation

Slides: 49
Provided by: Anne1206
Learn more at: https://www.cse.fau.edu

1
COT 6930 HPC and Bioinformatics: Multiple Sequence Alignment
  • Xingquan Zhu
  • Dept. of Computer Science and Engineering

2
Outline
  • Multiple Sequence Alignment
  • What, Why, and How
  • Multiple Sequence Alignment Methods
  • Multidimensional dynamic programming
  • Star Alignment
  • Tree Alignment
  • Progressive Alignment
  • ClustalW: a widely used algorithm
  • Iterative Alignment
  • Genetic Algorithm

3
What is a Multiple Sequence Alignment?
  • Pairwise alignments involve two sequences
  • Multiple sequence alignments involve more than 2
    sequences (often 100s, either nucleotide or
    protein).
  • A formal definition
  • A multiple alignment of strings $S_1, \dots, S_k$ is a
    series of strings with spaces $S_1', \dots, S_k'$ such that
    $|S_1'| = \dots = |S_k'|$
  • $S_j'$ is an extension of $S_j$ by insertion of spaces
  • Goal: find an optimal multiple alignment.

Hs ---MK----- --LSLVAAML LLLSAARAEE EDKK-EDVGT VVGIDLGTTY
Sp ---MKKFQLF SILSYFVALF LLPMAFASGD DNST-ESYGT VIGIDLGTTY
Tg MTAAKKLSLF SLAALFCLLS VATLRPVAAS DAEEGKVKDV VIGIDLGTTY
Pf --------MN QIRPYILLLI VSLLKFISAV DSN---IEGP VIGIDLGTTY
4
Why do we do multiple alignments?
  • In order to reveal the relationship between a
    group of sequences (homology)
  • Simultaneous alignment of similar gene sequences
    may:
  • Discover the conserved regions in genes
  • Determine the consensus sequence of these aligned
    sequences
  • Help define a protein family that may share a
    common biochemical function or evolutionary
    origin, and thus reveal an evolutionary history
    of the sequences.
  • Help predict the secondary and tertiary
    structures of new sequences

5
MSA Methods
  • Multidimensional dynamic programming
  • Extension of DP to multiple (3) sequences
  • Star Alignment, Tree Alignment, Progressive
    Alignment
  • Starting with an alignment of the most alike
    sequences and building an alignment by adding
    more sequences
  • Iterative methods
  • Making an initial alignment of groups of
    sequences and revising the alignment to achieve a
    more reasonable result

7
Multiple Sequence Alignment by DP
  • Pairwise sequence alignment: a scoring matrix where
    each position provides the best alignment up to that point
  • Extension to 3 sequences: the lattice of a cube that is
    to be filled with calculated dynamic programming scores
  • Scoring positions on the 3 surfaces of the cube represent
    the alignment of a pair of sequences

8
Scoring of MSA: Sum of Pairs
  • Score = summation over all possible combinations
    of amino acid pairs in a column
  • Using BLOSUM62 matrix, gap penalty -8
  • In column 1, we have the pairs
  • (-, S)
  • (-, S)
  • (S, S)
  • k(k-1)/2 pairs per column

- I K
S I K
S S E
Column 1: -8 - 8 + 4 = -12
9
Sum of Pairs
  • Given 5 sequences
  • N C C E
  • N N C E
  • N - C N
  • S C S N
  • S C S E
  • How many possible combinations of pairwise
    alignments for each position?

10
Sum of Pairs
  • Assume match/mismatch/gap = 1/0/-1
  • N C C E
  • N N C E
  • N - C N
  • S C S N
  • S C S E
  • The 1st position: # of N-N pairs (3), # of S-S pairs (1),
    # of N-S pairs (6)
  • SP(1) = 4*1 + 0*6 + (-1)*0 = 4
  • The 2nd position: # of C-C pairs (3), # of N-C pairs (3),
    # of pairs with a gap (4)
  • SP(2) = 3*1 + 0*3 + (-1)*4 = -1
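
The column arithmetic above can be checked with a short sketch. It assumes the 1/0/-1 scheme from this slide; the function names are illustrative, and scoring a gap paired with another gap as 0 is an assumption (no such pair occurs in this example).

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap paired with a gap scores nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum the scores of all k(k-1)/2 pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

alignment = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
column_scores = [sp_column(col) for col in zip(*alignment)]
print(column_scores)  # [4, -1, 4, 4]
```

The first two entries reproduce SP(1) = 4 and SP(2) = -1; the sum-of-pairs score of the whole alignment is the sum of the column scores.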

11
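Dynamic programming matrix: pairwise alignment
[Figure: 2-D DP matrix with Seq 1 and Seq 2 along the axes, using the example sequences G T G C T T G A and T G G C C T. A cell is reached by one of three moves: gap in sequence 1, match/mismatch, or gap in sequence 2.]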
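
As a rough illustration of the three moves in the pairwise matrix (match/mismatch, gap in sequence 1, gap in sequence 2), here is a minimal global alignment sketch in the Needleman-Wunsch style. The 1/0/-1 scoring and the linear gap penalty are assumptions chosen for simplicity, not values given on this slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap          # prefix of a aligned to gaps only
    for j in range(1, cols):
        F[0][j] = j * gap          # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # match/mismatch
                          F[i - 1][j] + gap,       # gap in sequence 2
                          F[i][j - 1] + gap)       # gap in sequence 1
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, out_a, out_b = len(a), len(b), [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))

score, x, y = needleman_wunsch("MSKE", "MKE")
print(score, x, y)
```

Each cell stores the best score for aligning the two prefixes that end there, which is why the bottom-right cell holds the score of the full alignment.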
12
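Dynamic programming matrix: multiple sequence alignment
[Figure: 3-D DP lattice with Seq 1, Seq 2, and Seq 3 along the axes (residues shown: S M V, V M A, S M T). Each cell can be reached in many possible ways.]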
13
DP Alignment Examples
  • All three match/mismatch
  • Sequences 1 & 2 match/mismatch with gap in 3
  • Sequences 1 & 3 match/mismatch with gap in 2
  • Sequences 2 & 3 match/mismatch with gap in 1
  • Sequence 1 with gaps in 2 & 3
  • Sequence 2 with gaps in 1 & 3
  • Sequence 3 with gaps in 1 & 2
  • Choose the largest value among the above seven
    possibilities
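
The seven possibilities correspond to the seven non-empty ways of advancing along the three sequences. A minimal sketch of the cube-filling recurrence follows; it assumes sum-of-pairs column scoring with match/mismatch/gap = 1/0/-1 and a gap-gap pair scored as 0, none of which is fixed by this slide, and it returns only the optimal score (no traceback).

```python
from itertools import product

def column_score(a, b, c, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score of one three-symbol column."""
    def pair(x, y):
        if x == "-" and y == "-":
            return 0  # assumption: gap-gap pairs are free
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch
    return pair(a, b) + pair(a, c) + pair(b, c)

def align3_score(s1, s2, s3):
    """Optimal sum-of-pairs score for aligning three sequences."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    D = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # Seven moves: every non-empty choice of which sequences emit a residue.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = (s1[pi] if di else "-",
                   s2[pj] if dj else "-",
                   s3[pk] if dk else "-")
            D[i][j][k] = max(D[i][j][k], D[pi][pj][pk] + column_score(*col))
    return D[n1][n2][n3]

print(align3_score("MKE", "MKE", "MKE"))  # 9
```

Cells are visited in increasing index order, so every predecessor of a cell is already filled when the cell is computed.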

14
Computational Complexity
  • For protein sequences, each 300 amino acids in
    length excluding gaps, with the DP algorithm
  • Two sequences, $300^2$ comparisons
  • Three sequences, $300^3$ comparisons
  • N sequences, $300^N$ comparisons
  • $O(L^N)$, where L is the length of the sequences and N is the number of
    sequences
  • The number of comparisons and the memory required are
    too large for N > 3, so the method is not practical
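
To get a feel for the growth, the short sketch below tabulates the number of lattice cells for L = 300. The helper names are illustrative; the count of 2^N - 1 predecessor moves per cell generalizes the seven possibilities of the three-sequence case and is a standard property of the full DP rather than something stated on this slide.

```python
def dp_cells(length, n_sequences):
    """Number of lattice cells, following the slide's L^N estimate."""
    return length ** n_sequences

def moves_per_cell(n_sequences):
    """Non-empty subsets of sequences that can advance in one step."""
    return 2 ** n_sequences - 1

for n in (2, 3, 4, 5):
    print(n, dp_cells(300, n), moves_per_cell(n))
```

Going from 3 to 4 sequences multiplies the number of cells by 300 and roughly doubles the work per cell.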

16
Star Alignments
  • Heuristic method for multiple sequence alignments
  • Select a sequence $s_c$ as the center of the star
  • For each sequence $s_i$ in $s_1, \dots, s_k$ with index $i \neq c$,
    perform a global alignment with $s_c$ (using DP)
  • Aggregate the alignments with the principle "once a
    gap, always a gap".

17
Star Alignments Example
s1 = MPE, s2 = MKE, s3 = MSKE, s4 = SKE (center of the star: s2)
Pairwise alignments with the center:
  MPE    MSKE    MKE
  MKE    -MKE    SKE
Merging step by step:
  MPE    -MPE    -MPE
  MKE    -MKE    -MKE
         MSKE    MSKE
                 -SKE
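
The "once a gap, always a gap" merge can be sketched as follows. This is one possible way to do it, not necessarily the procedure used in the slides: `merge_star` is a hypothetical helper that takes the center sequence and a list of pairwise alignments (aligned center, aligned other sequence) and pads every row so that all gaps ever opened in the center are kept.

```python
def merge_star(center, pairwise):
    """Merge pairwise alignments against a common center sequence."""
    n = len(center)

    def gap_profile(aligned_center):
        # profile[j] = gaps opened in the center before center[j]
        # (profile[n] counts trailing gaps)
        profile, j = [0] * (n + 1), 0
        for ch in aligned_center:
            if ch == "-":
                profile[j] += 1
            else:
                j += 1
        return profile

    profiles = [gap_profile(ac) for ac, _ in pairwise]
    max_gaps = [max(p[j] for p in profiles) for j in range(n + 1)]

    def expand(aligned_other, profile):
        out, pos = [], 0
        for j in range(n + 1):
            g = profile[j]
            out.append("-" * (max_gaps[j] - g))      # gaps opened by other rows
            out.append(aligned_other[pos:pos + g])   # symbols facing center gaps
            pos += g
            if j < n:
                out.append(aligned_other[pos])       # symbol facing center[j]
                pos += 1
        return "".join(out)

    master = "".join("-" * max_gaps[j] + (center[j] if j < n else "")
                     for j in range(n + 1))
    rows = [expand(ao, p) for (_, ao), p in zip(pairwise, profiles)]
    return master, rows

master, rows = merge_star("MKE", [("MKE", "MPE"), ("-MKE", "MSKE"), ("MKE", "SKE")])
print(master, rows)  # -MKE ['-MPE', 'MSKE', '-SKE']
```

With the pairwise alignments from this example, the gap that MSKE forces in front of MKE is carried into every other row.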
18
Choosing a center
  • Try them all and pick the one with the best score
  • Calculate all $O(k^2)$ alignments, and pick the
    sequence $s_c$ that maximizes $\sum_{i \neq c} \mathrm{score}(s_c, s_i)$

19
Star Alignment Example
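  • S1: ATTGCCATT
  • S2: ATGGCCATT
  • S3: ATCCAATTTT
  • S4: ATCTTCTT
  • S5: ATTGCCGATT

Pairwise alignment scores and row sums:
      s1   s2   s3   s4   s5   Sum
s1          7   -2    0   -3     2
s2     7        -2    0   -4     1
s3    -2   -2         0   -7   -11
s4     0    0    0        -3    -3
s5    -3   -4   -7   -3        -17
s1 has the largest row sum (2), so it is chosen as the center.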
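
The row sums in this example can be reproduced with a few lines. The sketch below takes the pairwise scores from the table as given (it does not recompute the alignments); `choose_center` is an illustrative name.

```python
scores = {
    ("s1", "s2"): 7, ("s1", "s3"): -2, ("s1", "s4"): 0, ("s1", "s5"): -3,
    ("s2", "s3"): -2, ("s2", "s4"): 0, ("s2", "s5"): -4,
    ("s3", "s4"): 0, ("s3", "s5"): -7,
    ("s4", "s5"): -3,
}

def choose_center(names, scores):
    """Pick the sequence whose summed score against all others is largest."""
    def pair(a, b):
        return scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    totals = {c: sum(pair(c, other) for other in names if other != c)
              for c in names}
    return max(totals, key=totals.get), totals

center, totals = choose_center(["s1", "s2", "s3", "s4", "s5"], scores)
print(center, totals)
```

The totals come out as 2, 1, -11, -3, and -17, so s1 is selected.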
20
Star Alignments Example
Merging Pairwise Alignment
21
Star Alignment Example
Merging Pairwise Alignment
22
Analysis
  • Assuming all sequences have length n
  • $O(n^2)$ to calculate one global alignment
  • $O(k^2)$ global alignments to calculate
  • Using a reasonable data structure for joining
    alignments, no worse than $O(kl)$, where l is an upper
    bound on alignment lengths
  • $O(k^2 n^2 + kl) = O(k^2 n^2)$ overall cost

24
Tree Alignment
  • Compute the overall similarity based on the pairwise
    alignment along each edge of the tree
  • The sum of all these weights is the score of the
    tree

Consensus String
The consensus string derived from a multiple
alignment is the concatenation of the consensus
characters for each column. The consensus
character for a column is the character that
minimizes the summed distance to it from all the
characters in that column.
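
A small sketch of the consensus definition above. The distance function is an assumption (0 for identical characters, 1 otherwise), under which the consensus character is simply the most frequent one; ties are broken by alphabetical order, which is also an arbitrary choice. The alignment is the five-sequence sum-of-pairs example (NCCE, NNCE, N-CN, SCSN, SCSE).

```python
def consensus(rows, distance=lambda a, b: 0 if a == b else 1):
    """Build the consensus string column by column."""
    alphabet = sorted(set("".join(rows)))  # candidate consensus characters
    result = []
    for column in zip(*rows):
        # Pick the character with the smallest summed distance to the column.
        best = min(alphabet, key=lambda c: sum(distance(c, x) for x in column))
        result.append(best)
    return "".join(result)

print(consensus(["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]))  # NCCE
```

Passing a different `distance` (for example one derived from a substitution matrix) changes which character wins in each column.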
25
Tree Alignment Example
  • Scoring system used is

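[Figure: tree with nodes labelled CAT, CTG, CG, and GT, shown with the pairwise alignments CAT / -GT and CTG / C-G. Edge scores: 3, 3, 1, 0, 1.]
We have a score of 3 + 3 + 1 + 0 + 1 = 8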
26-33
Tree Alignment Example (continued; figures only)
34
Analysis
  • We don't know the correct tree
  • Without the tree, the tree alignment problem is
    NP-complete
  • Likely only exponential time solution available
    (for optimal answers)

36
Progressive Methods
  • DP-based MSA programs are limited to 3 sequences or
    to a small number of relatively short sequences
  • Progressive alignment uses DP to build an MSA,
    starting with the most related sequences and then
    progressively adding less-related sequences or
    groups of sequences to the initial alignment
  • Most commonly used approach

37
Progressive Methods
  • Progressive alignment is heuristic.
  • It does not separate the process of scoring an
    alignment from the optimization algorithm
  • It does not directly optimize any global
    scoring function of alignment correctness.
  • It is fast and efficient, and the results are
    reasonable.
  • We will illustrate this using ClustalW.

38
Progressive MSA occurs in 3 stages
  1. Do a set of global pairwise alignments (Needleman
    and Wunsch)
  2. Create a guide tree
  3. Progressively align the sequences
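
Stage 2 can be illustrated with a small clustering sketch. Note the assumptions: it uses simple average-linkage (UPGMA-style) clustering, which is a stand-in and not necessarily the tree-building method ClustalW uses, and the distances in the example are made-up toy values rather than results of real pairwise alignments.

```python
from itertools import combinations

def guide_tree(names, dist):
    """Build a nested-tuple guide tree by average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, leaf names)

    def average(i, j):
        a, b = clusters[i][1], clusters[j][1]
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Join the two closest clusters; they are aligned first in stage 3.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average(*p))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

toy = {frozenset(p): d for p, d in [
    (("A", "B"), 1), (("C", "D"), 2),
    (("A", "C"), 6), (("A", "D"), 6), (("B", "C"), 6), (("B", "D"), 6),
]}
print(guide_tree(["A", "B", "C", "D"], toy))  # (('A', 'B'), ('C', 'D'))
```

Reading the nested tuples from the innermost pair outward gives the order in which sequences or groups would be aligned in stage 3.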

39
ClustalW Procedure
40
Progressive Methods ClustalW
  • http://www.ebi.ac.uk/clustalw/
  • ClustalW is a general purpose multiple alignment
    program for DNA or proteins.
  • ClustalW: the W stands for "weighting", which
    represents the ability of the program to assign
    weights to the sequences and program parameters.
  • ClustalX provides a graphical interface

41
Use Clustal W to do a progressive MSA
42
Progressive MSA stage 3 of 3: progressive alignment
  • Make a MSA based on the order in the guide tree
  • Start with the two most closely related sequences
  • Then add the next closest sequence
  • Continue until all sequences are added to the MSA

43
Problems w/ Progressive Alignment
  • Highly sensitive to the choice of the initial pair to
    align.
  • The very first sequences to be aligned are the
    most closely related on the sequence tree. If
    they align well, there are few errors in the initial
    alignment
  • The more distantly related these sequences, the
    more errors
  • Errors in the initial alignment are propagated to the MSA

45
Iterative Methods
  • Results do NOT depend on the initial pairwise
    alignment (recall progressive methods)
  • Starting with an initial alignment and repeatedly
    realigning groups of the sequences
  • Repeat until one MSA doesn't change significantly
    from the next.
  • Alignments improve over successive
    iterations.
  • An example is the genetic algorithm approach.

46
Genetic Algorithms
  • A general problem solving method modeled on
    evolutionary change.
  • Inspired by the biological evolution process
  • Uses concepts of Natural Selection and Genetic
    Inheritance (Darwin 1859)
  • Create a set of candidate solutions to your
    problem, and cause these solutions to evolve and
    become more and more fit over repeated
    generations.
  • Use survival of the fittest, mutation, and
    crossover to guide evolution.

47
Genetic Search Algorithms
  1. Random generation (candidate solutions)
  2. Evaluation (fitness function)
  3. Crossover and mutation (change some selected
    candidate solutions to converge to the optimal
    solution and to avoid a local extreme)
  4. Selection (candidate solutions with larger
    fitness values have a larger chance of being
    included)
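
The four steps can be sketched as a toy genetic search over alignments. Everything specific here is an assumption made for illustration: fitness is the sum-of-pairs score with match/mismatch/gap = 1/0/-1, mutation moves one gap, crossover picks each row from one of two parents, and selection keeps the better half of the population. It is not the operator set of any particular published program.

```python
import random
from itertools import combinations

def sp_score(rows, match=1, mismatch=0, gap=-1):
    """Fitness: sum-of-pairs score of an alignment (gap-gap pairs ignored)."""
    total = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def random_alignment(seqs, width, rng):
    """Random generation: scatter gaps so every row has the same width."""
    rows = []
    for s in seqs:
        gaps = set(rng.sample(range(width), width - len(s)))
        residues = iter(s)
        rows.append("".join("-" if i in gaps else next(residues)
                            for i in range(width)))
    return rows

def mutate(rows, rng):
    """Mutation: move one gap of one row to a random position."""
    rows = list(rows)
    r = rng.randrange(len(rows))
    row = list(rows[r])
    gaps = [i for i, ch in enumerate(row) if ch == "-"]
    if gaps:
        row.pop(rng.choice(gaps))
        row.insert(rng.randrange(len(row) + 1), "-")
        rows[r] = "".join(row)
    return rows

def crossover(a, b, rng):
    """Crossover: take each row from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def evolve(seqs, generations=50, pop_size=30, seed=0):
    rng = random.Random(seed)
    width = max(len(s) for s in seqs) + 2
    pop = [random_alignment(seqs, width, rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=sp_score, reverse=True)        # evaluation
        history.append(sp_score(pop[0]))
        parents = pop[: pop_size // 2]              # selection (fitter half survives)
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=sp_score), history

best, history = evolve(["MPE", "MKE", "MSKE", "SKE"])
print("\n".join(best), history[-1])
```

Because the fitter half of each generation is carried over unchanged, the best score never decreases, but there is no guarantee that the search reaches the optimal alignment.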