MSA- multiple sequence alignment - PowerPoint PPT Presentation

Provided by: Jan5155
Learn more at: http://physics.gmu.edu

1
MSA- multiple sequence alignment
  • Aligning many sequences is often preferable to
    pairwise comparisons.
  • Problem- Computational complexity of multiple
    alignments grows rapidly with the number of
    sequences being aligned.

2
Even using supercomputers or networks of
workstations, multiple sequence alignment is an
intractable problem for more than 20 or so
sequences of average length and complexity.
3
As a result, alignment methods using heuristics
have been developed. These methods (including
ClustalW) cannot guarantee an optimal alignment,
but can find near-optimal alignments for larger
numbers of sequences.
4
CLUSTALW
  • Clustal was first developed in 1988 (ClustalW
    followed in 1994)
  • Begins by aligning closely related sequences and
    then adds increasingly divergent sequences to
    produce a complete msa.

5
  • http://www.ncbi.nlm.nih.gov/
  • http://www.ebi.ac.uk/clustalw/

6
Introduction to Molecular Phylogeny
  • Phylogeny- the evolutionary history of a group

7
Mutations Happen!
  • 3 types possible
  • Deleterious
  • Advantageous
  • ???

8
Important Point
  • Much of the variation that is observed among
    individuals must have little beneficial or
    detrimental effect and be essentially selectively
    neutral.
  • Deleterious mutations are screened out.
    Advantageous mutations are rare.

9
Functional Constraints?
  • Portions of genes that are especially important
    are said to be under functional constraint and
    tend to accumulate changes very slowly.
  • Ex. histone proteins- practically every amino
    acid is important. A yeast histone can replace a
    human histone.

10
Relative Rate of Change within β-globin gene (4
mammals)
11
Basis of Molecular Phylogenetics
  • The evolution of species can be modeled as a
    bifurcating process- speciation is initiated
    when two populations become reproductively
    isolated.

12
Basis of Molecular Phylogenetics
  • Once these two populations cease to interbreed,
    it is inevitable that they diverge due to random
    mutational processes.

13
Basis of Molecular Phylogenetics
  • Over time, this branching process may repeat
    itself.
  • A species is said to be related to some other
    species with which it shares a direct common
    ancestor.

15
Basis of Molecular Phylogenetics
  • The amount of DNA sequence difference between a
    pair of organisms should indicate how recently
    those two organisms shared a common ancestor.

17
Basis of Molecular Phylogenetics
  • The longer two populations remain reproductively
    isolated, the more DNA divergence will occur.
  • The longer two populations remain reproductively
    isolated, the more protein divergence will occur.

18
Molecular Phylogeny is relatively new.
  • Evolution by Natural Selection- Darwin/Wallace
    1858
  • Molecular Phylogeny 1960s ??

19
How it started . . .
  • In 1959, scientists determined the
    three-dimensional structures of two proteins that
    are found in almost every animal: hemoglobin and
    myoglobin.
  • During the next two decades, myoglobin and
    hemoglobin sequences were determined for dozens
    of mammals, birds, reptiles, amphibians, fish,
    etc.

20
What they found . . .
  • This tree agreed completely with observations
    derived from paleontology and anatomy about the
    common descent of the corresponding organisms.
  • From Science and Creationism: A View from the
    National Academy of Sciences, 2nd ed., 1999.

21
Organisms with high degrees of molecular
similarity are expected to be more closely
related than those that are dissimilar.
22
Advantages of Molecular Phylogeny
  • Can be used to decipher relationships between all
    living things
  • Relying on anatomy can be misleading: similar
    traits can evolve in organisms that are not
    closely related (e.g., convergent evolution led
    to eyes in vertebrates, insects, and molluscs).

23
Word of Caution
  • Phylogenetic analysis is controversial. There
    is a wide variety of methods for analyzing the
    data, and even the experts often disagree on
    which method is best.

24
Why so controversial??
  • 2 Reasons

25
1 - Molecular vs. Classical
  • How much weight should be given to molecular
    phylogenetic data when it conflicts with the
    findings of traditional taxonomists?

26
. . .
  • The phylogeny of whales

27
  • The phylogeny of whales

28
How many cars changed spaces during this 2 hour
interval?
  • Parking lot A at 2:00
  • Parking lot A at 4:00

29
2- Molecular Phylogeny requires statistical
estimations.
  • Parking lot A at 2:00
  • Parking lot A at 4:00

30
Phylogenetic Data Analysis requires 4 steps
  • 1) Alignment
  • 2) Determine the substitution model
  • 3) Tree Building
  • 4) Tree Evaluation

31
STEP 1- Alignment
  • Molecular phylogenetic analysis is dependent on a
    good alignment. An evolutionary tree based on an
    improper alignment is an erroneous tree.

33
Homology
  • It is critical to phylogenetic analysis that
    homologous characters be compared across species.
  • Webster's New Collegiate: fundamental
    similarity of structure due to descent from a
    common ancestral form.

36
Compare homologous genes and homologous
characters
  • For DNA and proteins, this means that gaps must
    be placed correctly in multiple alignments to
    ensure that the same position is being compared
    for each species.

37
Homologous Genes? When could you accidentally
compare nonhomologous genes?
  • Be careful if you are comparing genes that are
    members of a gene family.
  • Comparing a tubulin-3 from one species with a
    tubulin-6 from another will not generate accurate
    results.

38
What to align?
  • Phylogenetic trees are generated by comparing DNA
    or protein. The molecule of choice depends on
    the question you are attempting to answer.

39
DNA
  • contains more evolutionary information than
    protein
  • ATT GCG AAA CAC

  • ATA GCC AAG CTC

40
Protein
  • (same region analyzed: only 1 difference)
  • Ile-Ala-Lys- His
  • Ile-Ala-Lys- Leu
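
The DNA and protein comparisons on these two slides can be checked with a short sketch. It translates the two slide sequences and counts differences at each level. The codon table below is deliberately partial (only the codons that appear on the slide), and the function names are illustrative, not from any library.

```python
# Partial codon table: only the codons used in the slide example
# (an assumption for brevity; the full genetic code has 64 codons).
CODON_TABLE = {
    "ATT": "Ile", "ATA": "Ile",
    "GCG": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "CAC": "His", "CTC": "Leu",
}

def codons(dna):
    """Split a DNA string into consecutive 3-base codons."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]

def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

seq1 = "ATTGCGAAACAC"
seq2 = "ATAGCCAAGCTC"

protein1 = [CODON_TABLE[c] for c in codons(seq1)]
protein2 = [CODON_TABLE[c] for c in codons(seq2)]

dna_diffs = count_differences(seq1, seq2)              # compared base by base
protein_diffs = count_differences(protein1, protein2)  # compared residue by residue
print(dna_diffs, protein_diffs)  # 4 1
```

Three of the four base changes are silent at the protein level, which is why the DNA carries more information for this region.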

41
DNA
  • high rate of base substitution makes DNA best for
    very short term studies, e.g. closely-related
    species

43
Homoplasy
  • Return of a character to its original state, thus
    masking intervening mutational events. Every
    fourth mutation should result in a homoplasy.

44
Protein
  • more reliable alignment than DNA
  • fewer homoplasies than DNA
  • lower rate of substitution than DNA, so better
    for wide species comparisons

46
rRNA (ribosomal RNA)
  • Best for very long term evolutionary studies
    spanning biological kingdoms
  • Selective processes constraining sequence
    evolution should be roughly the same across
    species boundaries

47
STEP 2- Determine the substitution model.
48
A nucleotide substitution rate matrix
       A    T    C    G
  A    5   -4   -4   -4
  T   -4    5   -4   -4
  C   -4   -4    5   -4
  G   -4   -4   -4    5
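
As a rough illustration of how such a matrix is used, the sketch below scores an ungapped pairwise alignment column by column with the values shown above (+5 for an identity, -4 for a mismatch). The function name and the choice of test sequences (the two DNA strings from the earlier DNA slide) are assumptions made for the example.

```python
BASES = "ATCG"

# Matrix from the slide: +5 on the diagonal, -4 everywhere else.
SCORES = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}

def alignment_score(seq_a, seq_b):
    """Sum the matrix score over each column of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(SCORES[(a, b)] for a, b in zip(seq_a, seq_b))

score = alignment_score("ATTGCGAAACAC", "ATAGCCAAGCTC")
print(score)  # 8 identities * 5 + 4 mismatches * -4 = 24
```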
49
Step 3- Tree Building
50
Step 3- Tree Building
Tree terminology
  • Nodes: branching points
  • Branches: lines
  • Topology: branching pattern
51
Branches can be rotated at a node, without
changing the relationships.
53
Unrooted trees explain phylogenetic
relationships; they say nothing about the
direction of evolution (the order of descent).
56
There are two main tree drawing methods.
  • Character Methods
  • Distance Methods
  • Both approaches are widely used and work well
    with most data sets.

57
Distance methods
  • Distance- a measure of the overall pairwise
    difference between two data sets.
  • The raw material for tree reconstruction is
    tabular summaries of the pairwise differences
    between all data sets to be analyzed

58
In distance methods, the first step is to
calculate a matrix of all pairwise differences
between a set of sequences.
Species    A     B     C     D
B          9     -     -     -
C          8    11     -     -
D         12    15    10     -
E         15    18    13     5
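
A minimal sketch of how such a matrix could be computed is shown here: it counts differing columns for every pair of aligned sequences. The three sequences are hypothetical and invented for illustration; they are not the data behind the table above.

```python
from itertools import combinations

def pairwise_differences(aligned):
    """Return {(name_i, name_j): number of differing columns} for every pair."""
    dist = {}
    for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
        if len(s1) != len(s2):
            raise ValueError("sequences must be aligned to equal length")
        dist[(n1, n2)] = sum(a != b for a, b in zip(s1, s2))
    return dist

# Hypothetical aligned sequences (illustration only).
aligned = {
    "A": "ATTGCGAAACAC",
    "B": "ATAGCCAAGCTC",
    "C": "ATTGCGAAGCAC",
}

matrix = pairwise_differences(aligned)
for pair, d in matrix.items():
    print(pair, d)
```

Raw difference counts are the simplest possible distance; real analyses usually correct them with a substitution model.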
59
Distance methods
  • Identify the sequence pairs that have the
    smallest number of sequence changes between
    them; these are designated neighbors. On a tree,
    these sequences share a common ancestor and are
    joined by a short branch.

60
UPGMA, pairwise distance and neighbor joining are
distance methods.
  • They progressively group sequences, starting with
    those that are most alike.
  • UPGMA: unweighted pair-group method with
    arithmetic mean

61
Phylogenetic trees based on distance methods.
  • The two sequences that are closest together are
    connected at a node.
  • The process is repeated until all sequences are
    joined.
  • Addition of the last sequence defines the root of
    the tree.
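
The joining procedure above can be sketched with UPGMA. This is a minimal version, assuming the pairwise differences from the earlier five-species matrix; it returns the tree as nested tuples and ignores branch lengths. The function name and data layout are illustrative choices.

```python
from itertools import combinations

# Pairwise differences copied from the five-species distance matrix.
D = {
    ("A", "B"): 9, ("A", "C"): 8, ("A", "D"): 12, ("A", "E"): 15,
    ("B", "C"): 11, ("B", "D"): 15, ("B", "E"): 18,
    ("C", "D"): 10, ("C", "E"): 13, ("D", "E"): 5,
}

def upgma(distances):
    """Cluster by UPGMA; return (tree as nested tuples, list of joins)."""
    sizes = {}   # cluster -> number of leaves it contains
    dist = {}    # frozenset({cluster_x, cluster_y}) -> distance
    for (a, b), d in distances.items():
        sizes[a] = sizes[b] = 1
        dist[frozenset((a, b))] = float(d)
    joins = []
    while len(sizes) > 1:
        # Connect the two closest clusters at a new node.
        x, y = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        joins.append((x, y, dist[frozenset((x, y))]))
        merged = (x, y)
        # New distance = size-weighted mean, i.e. the mean over all leaf pairs.
        for z in sizes:
            if z not in (x, y):
                dist[frozenset((merged, z))] = (
                    sizes[x] * dist[frozenset((x, z))]
                    + sizes[y] * dist[frozenset((y, z))]
                ) / (sizes[x] + sizes[y])
        sizes[merged] = sizes.pop(x) + sizes.pop(y)
    (tree,) = sizes
    return tree, joins

tree, joins = upgma(D)
print(tree)
for x, y, d in joins:
    print("join", x, y, "at distance", round(d, 2))
```

D and E are joined first because they have the smallest distance (5); the last join fixes the root.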

62
The branch lengths may reflect the degree of
similarity (and theoretically reflect
evolutionary time).
  • Scaled trees: branch lengths are proportional
    to the number of differences between sequences.
  • In the best of cases, scaled trees are additive
    (the physical length of branches connecting any
    two nodes is an accurate representation of their
    accumulated differences).

64
Phylogenetic trees based on distance methods.
  • Relatively simple.
  • Problem
  • May not be accurate!!

65
Character Methods
  • There is no denying that distance-based methods
    look at the big picture and pointedly ignore
    much potentially valuable information.

66
Character Methods
  • Analyses of individual characters are translated
    into evolutionary trees.
  • Character- a well-defined feature that can exist
    in a limited number of different states. (Ex.
    DNA and protein sequences)

67
The concept of parsimony is at the heart of all
character-based methods of phylogenetic
reconstruction.
  • The process of attaching preference to one
    evolutionary pathway over another on the basis of
    which pathway requires the invocation of the
    smallest number of mutational events.

68
Character-based methods of phylogenetic
reconstruction.
  • The relationship that requires the fewest
    mutations to explain the current state of
    affairs is most likely to be correct.

69
First step in character methods: identify all
of the informative sites
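
One common working definition, assumed here, is that a site is informative for parsimony when at least two different states each occur in at least two sequences. The sketch below applies that rule to a hypothetical four-sequence alignment invented for the example.

```python
from collections import Counter

def informative_sites(aligned):
    """Return 0-based column indices that are parsimony-informative."""
    sites = []
    for i, column in enumerate(zip(*aligned)):
        counts = Counter(column)
        # Keep only states shared by two or more sequences.
        shared_states = [s for s, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            sites.append(i)
    return sites

# Hypothetical alignment (illustration only).
alignment = [
    "AAGTC",
    "AGGTT",
    "AGATC",
    "ATACT",
]

sites = informative_sites(alignment)
print(sites)  # [2, 4]
```

Column 0 is invariant, and columns 1 and 3 have only one shared state, so none of them can favor one tree over another.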
70
2nd step: calculate the minimum number of
substitutions at each informative site
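
The slide does not name an algorithm for this step. One standard choice is Fitch's method, sketched below for a single site on a rooted binary tree. The tree encoding (nested tuples of leaf names) and the example states are assumptions made for illustration.

```python
def fitch(tree, states):
    """Return (possible_states, min_changes) for one site.

    tree is a leaf name (str) or a (left, right) tuple;
    states maps each leaf name to its base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # No shared state: count one substitution below this node.
    return left_set | right_set, left_cost + right_cost + 1

# One hypothetical informative site in four sequences.
site = {"1": "G", "2": "G", "3": "A", "4": "A"}

candidates = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]
costs = [fitch(t, site)[1] for t in candidates]
print(costs)  # [1, 2, 2]
```

The first tree explains the site with a single change, so this site supports grouping sequences 1 and 2 together.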
71
Final step
  • After sequences are aligned, algorithms model
    each tree.

72
Maximum parsimony is a character method
  • Character methods require a multiple sequence
    alignment. Analysis of informative characters is
    used to construct an evolutionary tree.

73
Maximum Parsimony: a general scientific criterion
for choosing among competing hypotheses states
that we should accept the hypothesis that
explains the data most simply and efficiently.
  • The tree requiring the _______ number of nucleic
    acid or amino acid substitutions is selected.

74
Maximum Parsimony
  • The algorithm searches for a tree that requires
    the smallest number of changes to explain the
    differences observed among the groups under study.

75
Character methods are best suited for . . .
  • Sequences that are quite similar.
  • Small number of sequences
  • The method is computationally time consuming as
    all possible trees are examined.

76
Phylogenetic trees based on maximum likelihood
  • The aim is to find the tree (among all possible
    trees) that has the highest likelihood of
    producing the observed data (statistical
    methods).

77
Phylogenetic trees based on maximum likelihood
  • are similar to maximum parsimony methods but
    also take into account the likelihood of specific
    mutations (e.g., A → G).

78
Mutation Rates Vary
  • Transitions (purine to purine or pyrimidine to
    pyrimidine) occur more frequently than
    transversions (purine to pyrimidine or pyrimidine
    to purine).
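
The distinction can be made concrete with a small sketch that classifies each difference between two aligned DNA sequences. The function names are illustrative, and the input reuses the two sequences from the earlier DNA slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None for identical bases."""
    if a == b:
        return None
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a, seq_b):
        kind = classify_substitution(a, b)
        if kind:
            counts[kind] += 1
    return counts

counts = count_ts_tv("ATTGCGAAACAC", "ATAGCCAAGCTC")
print(counts)  # {'transition': 1, 'transversion': 3}
```

A likelihood method would give these two classes different probabilities instead of counting every change equally.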

79
Many of the methods described require significant
amounts of computer time.
  • Why?

80
Number of possible rooted and unrooted trees
# of Data Sets   # of Rooted Trees                  # of Unrooted Trees
 2               1                                  1
 3               3                                  1
 4               15                                 3
 5               105                                15
10               34,459,425                         2,027,025
15               213,458,046,676,875                7,905,853,580,625
20               8,200,794,532,637,891,559,375      221,643,095,476,699,771,875
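
These counts follow the standard closed forms for strictly bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n data sets, where !! is the double factorial. A short sketch (function names are illustrative):

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1; equals 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 data sets."""
    return double_factorial(2 * n - 3)

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 data sets."""
    return double_factorial(2 * n - 5)

for n in (2, 3, 4, 5, 10, 15, 20):
    print(n, f"{rooted_trees(n):,}", f"{unrooted_trees(n):,}")
```

Each added data set multiplies the count by the next odd number, so exhaustive search quickly becomes impractical.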
82
Programs take shortcuts.
  • When a large number of trees is being compared,
    it is impossible to score each tree. A shortcut
    algorithm establishes an upper limit. As it
    evaluates other trees, it throws out any tree
    exceeding the upper bound before the calculation
    is completed.

83
  • Here are some 194 of the phylogeny packages, and
    16 free servers, that I know about. Updates to
    these pages are made about twice a year.

84
Tree Evaluation
  • Every tree drawing program will generate a
    tree. The important question is whether or not
    the tree drawn is the right one.
  • In some cases, there are many trees of similar
    probabilities.

85
Vertebrate β-globins
87
Bootstrap method of assessing tree reliability
  • Inferred tree is constructed from data set.
  • Re-run the calculation on pseudo-replicate data
    sets drawn from the original data by resampling
    with replacement.
  • Resampling is repeated several (100-1000) times.

89
Bootstrap method
  • Bootstrap trees are constructed from the
    resampled data sets.
  • Bootstrap tree is compared to original inferred
    tree.
  • The % of bootstrap trees supporting a node is
    determined for each node in the tree.
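
A minimal sketch of the bootstrap idea follows, assuming the usual form in which alignment columns are resampled with replacement. The tree builder here (closest_pair_clades) is a deliberately trivial, hypothetical stand-in that reports only the closest pair of sequences as a clade; a real analysis would plug in a full tree-building method. The toy sequences are invented for illustration.

```python
import random
from itertools import combinations

def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(aligned.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in aligned.items()}

def closest_pair_clades(aligned):
    """Toy 'tree builder' (hypothetical): the closest pair forms the only clade."""
    def diff(pair):
        s, t = aligned[pair[0]], aligned[pair[1]]
        return sum(a != b for a, b in zip(s, t))
    return {frozenset(min(combinations(aligned, 2), key=diff))}

def bootstrap_support(aligned, build_clades, clade, replicates=100, seed=0):
    """Percentage of bootstrap replicates whose tree contains the clade."""
    rng = random.Random(seed)
    hits = sum(
        clade in build_clades(bootstrap_alignment(aligned, rng))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates

toy = {
    "A": "ATTGCGAAACAC",
    "B": "ATTGCGAAGCAC",
    "C": "ATAGCCAAGCTC",
}
pct = bootstrap_support(toy, closest_pair_clades, frozenset({"A", "B"}))
print(f"A+B supported in {pct:.0f}% of replicates")
```

The percentage is what gets written next to each node of the inferred tree.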

90
Molecular Clock
  • Addition of time to phylogenetic tree. Units of
    time are often in millions of years.
  • Assumption- substitution rates are constant over
    millions of years.
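
Under that constant-rate assumption, a distance can be converted to a time with the usual relation t = d / (2r), where d is the number of substitutions per site separating two sequences and r is the substitution rate per site per year along each lineage. The factor of 2 is there because both lineages accumulate changes. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def divergence_time(distance, rate):
    """Years since two lineages split, assuming a constant clock: t = d / (2r)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)

# Hypothetical values: 0.1 substitutions per site between the two sequences,
# 1e-9 substitutions per site per year along each lineage.
years = divergence_time(0.1, 1e-9)
print(f"about {years / 1e6:.0f} million years")
```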

91
Molecular Clock
  • Rates of molecular evolution for genes with
    similar functional constraints can be quite
    uniform. (Clock may run at different rates in
    different proteins.)

92
The End
93
  • Evolutionary biology also has benefited greatly
    from genome-sequencing projects. The wealth of
    new genome data is helping to better resolve the
    tree of life, particularly its major branches.
    This has been especially true for prokaryotes,
    where more than 80 genomes have been sequenced so
    far and the results have greatly improved our
    view of the early history of life.

94
Problem- As the # of sequences increases, the #
of possible trees increases dramatically
# of sequences   # of trees
 3               1
 4               3
 5               15
 6               105
 7               945
 8               10,395
 9               135,135
10               2,027,025
50               2.8 x 10^74
95
Phylogenetic trees based on neighbor joining.
  • Also utilizes a distance matrix
  • Neighbor joining algorithm searches for sets of
    neighbors that minimize the total length of the
    tree.
  • Can produce reasonable trees, especially when
    evolutionary distances are short.
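
The slide does not give the selection rule. In the standard neighbor-joining formulation, the pair to join is the one that minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other sequences. The sketch below performs only that first selection step, using the five-species distance matrix from the distance-methods slide; the function names are illustrative.

```python
from itertools import combinations

# Pairwise differences from the five-species distance matrix.
D = {
    ("A", "B"): 9, ("A", "C"): 8, ("A", "D"): 12, ("A", "E"): 15,
    ("B", "C"): 11, ("B", "D"): 15, ("B", "E"): 18,
    ("C", "D"): 10, ("C", "E"): 13, ("D", "E"): 5,
}

def distance(d, a, b):
    """Look up a symmetric distance stored under either key order."""
    return d.get((a, b), d.get((b, a)))

def nj_first_pair(d, taxa):
    """Return the pair minimizing the neighbor-joining Q criterion, and its Q."""
    n = len(taxa)
    totals = {t: sum(distance(d, t, u) for u in taxa if u != t) for t in taxa}
    q = {
        (a, b): (n - 2) * distance(d, a, b) - totals[a] - totals[b]
        for a, b in combinations(taxa, 2)
    }
    best = min(q, key=q.get)
    return best, q[best]

pair, q_value = nj_first_pair(D, ["A", "B", "C", "D", "E"])
print(pair, q_value)  # ('D', 'E') -78
```

A full implementation would replace the joined pair with a new node, recompute distances, and repeat until the tree is resolved.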

96
  • For vertebrates, many thorny issues remain to be
    resolved, such as the phylogeny of families and
    other major groups in the tree of life. For
    example, it is not yet known whether humans are
    closer to mice or to cattle because different
    results have been obtained with different gene
    analyses. On the other hand, there is no
    guarantee that complete genome sequences will
    immediately solve all phylogenetic questions, as
    evidenced by the continuing debate over the
    relationships among humans, flies, and nematodes.
    We will need to develop new statistical methods
    and bioinformatics tools to handle the greater
    volume of data and to unravel the complexities of
    molecular evolution.

97
Today
  • The examination of molecular structure offers an
    extremely powerful tool for studying evolutionary
    relationships. The quantity of information is
    huge--as large as the thousands of different
    proteins contained in living organisms, and
    limited only by the time and resources of
    molecular biologists.

98
  • Choice of individual genes or proteins.

99
Determine the substitution model
  • May be an amino acid substitution rate matrix
    such as PAM or BLOSUM. ADD DEMO.

100
Maximum parsimony and maximum likelihood are
character methods
  • Character methods attempt to reconstruct
    ancestral nodes of trees in order to fit the tree
    to an evolutionary model. They therefore use more
    of the information in the data, at the expense of
    longer execution time.

101
Distance matrices
  • Scoring matrices include values for all possible
    substitutions. Each mismatch between two
    sequences adds to the distance, and each identity
    subtracts from the distance.  