Pairwise and Multiple Sequence Alignment Lesson 2 - PowerPoint PPT Presentation

1 / 59
About This Presentation
Title:

Pairwise and Multiple Sequence Alignment Lesson 2

Description:

Pairwise and Multiple Sequence Alignment Lesson 2 Motivation What is sequence alignment? Alignment: Comparing two (pairwise) or more (multiple) sequences. – PowerPoint PPT presentation

Number of Views:511
Avg rating:3.0/5.0
Slides: 60
Provided by: ibisTauA
Category:

less

Transcript and Presenter's Notes

Title: Pairwise and Multiple Sequence Alignment Lesson 2


1
Pairwise and Multiple Sequence
AlignmentLesson 2
2
Motivation
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAA
CAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTAT
GGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGA
GACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTG
CTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTT
CTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATG
CTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGT
ATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCT
GCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAAT
ATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGG
ACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCA
TCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTAT
AGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTC
TCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATT
TTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACT
TTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTT
GCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGA
ATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTG
CATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGG
GCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAA
ACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCA
GCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGG
AATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTG
GCTAATGCCCTGGCTCACAAGTACCATTGA
MV
HLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE
3
What is sequence alignment?
  • Alignment Comparing two (pairwise) or more
    (multiple) sequences. Searching for a series of
    identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
MVHLTPEEKTAVNALWGKVNVDAVGGE
4
Why perform a pairwise sequence alignment?
Finding homology between two sequences
  • e.g., predicting characteristics of a protein
  • premised on
  • similar sequence (or structure)
  • similar function

5
Local vs. Global
  • Local alignment finds regions of high
    similarity in parts of the sequences
  • Global alignment finds the best alignment
    across the entire two sequences

ADLGAVFALCDRYFQ ADLGRTQN CDRYYQ
ADLGAVFALCDRYFQ ADLGRTQN-CDRYYQ
6
Evolutionary changes in sequences
  • Three types of nucleotide changes
  • Substitution a replacement of one (or more)
    sequence characters by another
  • Insertion - an insertion of one (or more)
    sequence characters
  • Deletion a deletion of one (or more) sequence
    characters

AAGA
AACA
?
T
A
AAG
GA
A
A
Insertion Deletion ? Indel
7
Choosing an alignment
  • Many different alignments between two sequences
    are possible

AAGCTGAATTCGAA AGGCTCATTTCTGA
A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
. . .
How do we determine which is the best alignment?
8
Toy exercise
Compute the scores of each of the following
alignments using this naïve scoring scheme
Scoring scheme
  • Match 1
  • Mismatch -2
  • Indel -1

-2 -2 -2 1
-2 -2 1 -2
-2 1 -2 -2
1 -2 -2 -2
Substitution matrix
Gap penalty (opening extending)
AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-
A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-
9
Substitution matrices accounting for biological
context
  • Which best reflects the biological reality
    regarding nucleotide mismatch penalty?
  • Tr gt Tv gt 0
  • Tv gt Tr gt 0
  • 0 gt Tr gt Tv
  • 0 gt Tv gt Tr

Tr Transition Tv Transversion
10
Scoring schemes accounting for biological context
  • Which best reflects the biological reality
    regarding these mismatch penalties?
  • Arg-gtLys gt Ala-gtPhe
  • Arg-gtLys gt Thr-gtAsp
  • Asp-gtVal gt Asp-gtGlu

11
PAM matrices
  • Family of matrices PAM 80, PAM 120, PAM 250,
  • The number with a PAM matrix (the n in PAMn)
    represents the evolutionary distance between the
    sequences on which the matrix is based
  • The (ith,jth) cell in a PAMn matrix denotes the
    probability that amino-acid i will be replaced by
    amino-acid j in time n Pi?j,n
  • Greater n numbers denote greater distances

12
PAM - limitations
  • Based on only one original dataset
  • Examines proteins with few differences (85
    identity)
  • Based mainly on small globular proteins so the
    matrix is biased

13
BLOSUM matrices
  • Different BLOSUMn matrices are calculated
    independently from BLOCKS (ungapped, manually
    created local alignments)
  • BLOSUMn is based on a cluster of BLOCKS of
    sequences that share at least n percent identity
  • The (ith,jth) cell in a BLOSUM matrix denotes the
    log of odds of the observed frequency and
    expected frequency of amino acids i and j in the
    same position in the data log(Pij/qiqj)
  • Higher n numbers denote higher identity between
    the sequences on which the matrix is based

14
PAM Vs. BLOSUM
  • PAM100 BLOSUM90
  • PAM120 BLOSUM80
  • PAM160 BLOSUM60
  • PAM200 BLOSUM52
  • PAM250 BLOSUM45

More distant sequences
  • BLOSUM62 for general use
  • BLOSUM80 for close relations
  • BLOSUM45 for distant relations
  • PAM120 for general use
  • PAM60 for close relations
  • PAM250 for distant relations

15
Substitution matrices exercise
  • Pick the best substitution matrix (PAM and
    BLOSUM) for each pairwise alignment
  • Human chimp
  • Human - yeast
  • Human fish

PAM options PAM60 PAM120 PAM250
BLOSUM options BLOSUM45 BLOSUM62 BLOSUM80
16
Substitution matrices
  • Nucleic acids
  • Transition-transversion
  • Amino acids
  • Evolutionary (empirical data) based (PAM,
    BLOSUM)
  • Physico-chemical properties based (Grantham,
    McLachlan)

17
Gap penalty
AAGCGAAATTCGAAC A-G-GAA-CTCGAAC
AAGCGAAATTCGAAC AGG---AACTCGAAC
  • Which alignment has a higher score?
  • Which alignment is more likely?

18
Pairwise alignment algorithm matrix
representation formulation
2 sequences S1 and S2 and a Scoring scheme
match 1, mismatch -1, gap -2
Vi,j value of the optimal alignment between
S11i and S21j
Vi,j S(S1i1,S2j1) Vi1,j1 max
Vi1,j S(gap) Vi,j1 S(gap)
Vi,j Vi1,j
Vi,j1 Vi1,j1
19
Pairwise alignment algorithm matrix
representation initialization
S1
S2
20
Pairwise alignment algorithm matrix
representation filling the matrix
S1
S2
21
Pairwise alignment algorithm matrix
representation trace back
S1
S2
22
Pairwise alignment algorithm matrix
representation trace back
S1
S2
AAAC AG-C
23
Assessing the significance of an alignment score
True
AAGCTGAATTC-GAA AGGCTCATTTCTGA-
AAGCTGAATTCGAA AGGCTCATTTCTGA
28.0
Random
AGATCAGTAGACTA GAGTAGCTATCTCT
AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT
26.0
. .
CGATAGATAGCATA GCATGTCATGATTC
CGATAGATAGCATA--------- ---------GCATGTCATGATTC
16.0
24
Web servers for pairwise alignment
25
BLAST 2 sequences (bl2Seq) at NCBI
  • Produces the local alignment of two given
    sequences using BLAST (Basic Local Alignment
    Search Tool) engine for local alignment
  • Does not use an exact algorithm but a heuristic

26
Back to NCBI
27
BLAST bl2seq
28
Bl2Seq - query
blastn nucleotide blastp protein
29
Bl2seq results
30
Bl2seq results
Dissimilarity
Low complexity
Similarity
Gaps
Match
31
Query type AA or DNA?
  • For coding sequences, AA (protein) data are
    better
  • Selection operates most strongly at the protein
    level ? the homology is more evident
  • AA 20 char alphabet DNA - 4 char alphabet
  • lower chance of random homology for AA

?
32
BLAST programs
33
BLAST Blastp
34
Blastp - results
35
Blastp results (cont)
36
Blast scores
  • Bits score A score for the alignment according
    to the number of similarities, identities, etc.
    It has a standard set of units and is thus
    independent of the scoring scheme
  • Expected-score (E-value) The number of
    alignments with the same or higher score one can
    expect to see by chance when searching a random
    database with a random sequence of particular
    sizes. The closer the e-value is to zero, the
    greater the confidence that the hit is really a
    homolog

37
Multiple Sequence Alignment (MSA)
38
Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
Similar to pairwise alignment BUT n sequences are
aligned instead of just 2
Each row represents an individual sequence Each
column represents the same position
39
Why perform an MSA?
MSAs are at the heart of comparative genomics
studies which seek to study evolutionary
histories, functional and structural aspects of
sequences, and to understand phenotypic
differences between species
40
Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI
GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S
eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE
DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se
q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD
--IAVEWWSNG--
41
Alignment methods
  • There is no available optimal solution for MSA
    all methods are heuristics
  • Progressive/hierarchical alignment (ClustalX)
  • Iterative alignment (MAFFT, MUSCLE)

42
Progressive alignment
A B C D E
First step compute pairwise distances
Compute the pairwise alignments for all against
all (10 pairwise alignments). The similarities
are converted to distances and stored in a table
E D C B A
A
8 B
17 15 C
10 14 16 D
32 31 31 32 E
43
E D C B A
A
8 B
17 15 C
10 14 16 D
32 31 31 32 E
Second step build a guide tree
  • Cluster the sequences to create a tree (guide
    tree)
  • represents the order in which pairs of
    sequences are to be aligned
  • similar sequences are neighbors in the tree
  • distant sequences are distant from each other in
    the tree

The guide tree is imprecise and is NOT the tree
which truly describes the evolutionary
relationship between the sequences!
44
Third step align sequences in a bottom up order
  1. Align the most similar (neighboring) pairs
  2. Align pairs of pairs
  3. Align sequences clustered to pairs of pairs
    deeper in the tree

45
Main disadvantages of progressive alignments
Guide-tree topology may be considerably wrong
Globally aligning pairs of sequences may create
errors that will propagate through to the final
result
46
Iterative alignment
A B C DE
Pairwise distance table
Iterate until the MSA does not change
(convergence)
Guide tree
47
Blastp acquiring sequences
48
blastp acquiring sequences
49
blastp acquiring sequences
50
MSA input multiple sequence Fasta file
gtgi4504351refNP_000510.1 delta globin Homo
sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR
FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS
ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH gtgi4504349refNP_000509.1 beta
globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA
LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA
HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV
QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1
epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK
MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV
LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT
HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN
P_000175.1 G-gamma globin Homo
sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR
FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS
ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH gtgi28302131refNP_000550.2 A-gamma
globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET
LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK
HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV
QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1
hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA
KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA
VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP
ADFTAEAHAAWDKFLSVVSSVLTEK YR
51
MSA using ClustalX
52
Step1 Load the sequences
53
A little unclear
54
Edit Fasta headers
gtgi4504351refNP_000510.1 delta globin Homo
sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR
FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS
ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH gtgi4504349refNP_000509.1 beta
globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA
LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA
HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV
QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1
epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK
MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV
LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT
HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN
P_000175.1 G-gamma globin Homo
sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR
FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS
ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH gtgi28302131refNP_000550.2 A-gamma
globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET
LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK
HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV
QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1
hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA
KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA
VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP
ADFTAEAHAAWDKFLSVVSSVLTEK YR
gt delta globin
gt beta globin
gt epsilon globin
gt G-gamma globin
gt A-gamma globin
gt hemoglobin zeta
55
Step2 Perform alignment
56
MSA and conservation view
57
(No Transcript)
58
Messing-up alignment of HIV-1 env
59
MSA tools
  • Progressive
  • CLUSTALX/CLUSTALW
  • Iterative
  • MUSCLE, MAFFT, PRANK
Write a Comment
User Comments (0)
About PowerShow.com