1 / 59

Pairwise and Multiple Sequence

AlignmentLesson 2

Motivation

ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAA

CAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTAT

GGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGA

GACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTG

CTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTT

CTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATG

CTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGT

ATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCT

GCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAAT

ATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGG

ACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCA

TCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTAT

AGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTC

TCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATT

TTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACT

TTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTT

GCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGA

ATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTG

CATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGG

GCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAA

ACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCA

GCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGG

AATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTG

GCTAATGCCCTGGCTCACAAGTACCATTGA

MV

HLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE

MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE

What is sequence alignment?

- Alignment Comparing two (pairwise) or more

(multiple) sequences. Searching for a series of

identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE

MVHLTPEEKTAVNALWGKVNVDAVGGE

Why perform a pairwise sequence alignment?

Finding homology between two sequences

- e.g., predicting characteristics of a protein
- premised on
- similar sequence (or structure)
- similar function

Local vs. Global

- Local alignment finds regions of high

similarity in parts of the sequences - Global alignment finds the best alignment

across the entire two sequences

ADLGAVFALCDRYFQ ADLGRTQN CDRYYQ

ADLGAVFALCDRYFQ ADLGRTQN-CDRYYQ

Evolutionary changes in sequences

- Three types of nucleotide changes
- Substitution a replacement of one (or more)

sequence characters by another - Insertion - an insertion of one (or more)

sequence characters - Deletion a deletion of one (or more) sequence

characters

AAGA

AACA

?

T

A

AAG

GA

A

A

Insertion Deletion ? Indel

Choosing an alignment

- Many different alignments between two sequences

are possible

AAGCTGAATTCGAA AGGCTCATTTCTGA

A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-

. . .

How do we determine which is the best alignment?

Toy exercise

Compute the scores of each of the following

alignments using this naïve scoring scheme

Scoring scheme

- Match 1
- Mismatch -2
- Indel -1

-2 -2 -2 1

-2 -2 1 -2

-2 1 -2 -2

1 -2 -2 -2

Substitution matrix

Gap penalty (opening extending)

AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA-

Substitution matrices accounting for biological

context

- Which best reflects the biological reality

regarding nucleotide mismatch penalty? - Tr gt Tv gt 0
- Tv gt Tr gt 0
- 0 gt Tr gt Tv
- 0 gt Tv gt Tr

Tr Transition Tv Transversion

Scoring schemes accounting for biological context

- Which best reflects the biological reality

regarding these mismatch penalties? - Arg-gtLys gt Ala-gtPhe
- Arg-gtLys gt Thr-gtAsp
- Asp-gtVal gt Asp-gtGlu

PAM matrices

- Family of matrices PAM 80, PAM 120, PAM 250,
- The number with a PAM matrix (the n in PAMn)

represents the evolutionary distance between the

sequences on which the matrix is based - The (ith,jth) cell in a PAMn matrix denotes the

probability that amino-acid i will be replaced by

amino-acid j in time n Pi?j,n - Greater n numbers denote greater distances

PAM - limitations

- Based on only one original dataset
- Examines proteins with few differences (85

identity) - Based mainly on small globular proteins so the

matrix is biased

BLOSUM matrices

- Different BLOSUMn matrices are calculated

independently from BLOCKS (ungapped, manually

created local alignments) - BLOSUMn is based on a cluster of BLOCKS of

sequences that share at least n percent identity - The (ith,jth) cell in a BLOSUM matrix denotes the

log of odds of the observed frequency and

expected frequency of amino acids i and j in the

same position in the data log(Pij/qiqj) - Higher n numbers denote higher identity between

the sequences on which the matrix is based

PAM Vs. BLOSUM

- PAM100 BLOSUM90
- PAM120 BLOSUM80
- PAM160 BLOSUM60
- PAM200 BLOSUM52
- PAM250 BLOSUM45

More distant sequences

- BLOSUM62 for general use
- BLOSUM80 for close relations
- BLOSUM45 for distant relations

- PAM120 for general use
- PAM60 for close relations
- PAM250 for distant relations

Substitution matrices exercise

- Pick the best substitution matrix (PAM and

BLOSUM) for each pairwise alignment - Human chimp
- Human - yeast
- Human fish

PAM options PAM60 PAM120 PAM250

BLOSUM options BLOSUM45 BLOSUM62 BLOSUM80

Substitution matrices

- Nucleic acids
- Transition-transversion
- Amino acids
- Evolutionary (empirical data) based (PAM,

BLOSUM) - Physico-chemical properties based (Grantham,

McLachlan)

Gap penalty

AAGCGAAATTCGAAC A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC AGG---AACTCGAAC

- Which alignment has a higher score?
- Which alignment is more likely?

Pairwise alignment algorithm matrix

representation formulation

2 sequences S1 and S2 and a Scoring scheme

match 1, mismatch -1, gap -2

Vi,j value of the optimal alignment between

S11i and S21j

Vi,j S(S1i1,S2j1) Vi1,j1 max

Vi1,j S(gap) Vi,j1 S(gap)

Vi,j Vi1,j

Vi,j1 Vi1,j1

Pairwise alignment algorithm matrix

representation initialization

S1

S2

Pairwise alignment algorithm matrix

representation filling the matrix

S1

S2

Pairwise alignment algorithm matrix

representation trace back

S1

S2

Pairwise alignment algorithm matrix

representation trace back

S1

S2

AAAC AG-C

Assessing the significance of an alignment score

True

AAGCTGAATTC-GAA AGGCTCATTTCTGA-

AAGCTGAATTCGAA AGGCTCATTTCTGA

28.0

Random

AGATCAGTAGACTA GAGTAGCTATCTCT

AGATCAGTAGACTA----- ----GAGTAG-CTATCTCT

26.0

. .

CGATAGATAGCATA GCATGTCATGATTC

CGATAGATAGCATA--------- ---------GCATGTCATGATTC

16.0

Web servers for pairwise alignment

BLAST 2 sequences (bl2Seq) at NCBI

- Produces the local alignment of two given

sequences using BLAST (Basic Local Alignment

Search Tool) engine for local alignment - Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST bl2seq

Bl2Seq - query

blastn nucleotide blastp protein

Bl2seq results

Bl2seq results

Dissimilarity

Low complexity

Similarity

Gaps

Match

Query type AA or DNA?

- For coding sequences, AA (protein) data are

better - Selection operates most strongly at the protein

level ? the homology is more evident - AA 20 char alphabet DNA - 4 char alphabet
- lower chance of random homology for AA

?

BLAST programs

BLAST Blastp

Blastp - results

Blastp results (cont)

Blast scores

- Bits score A score for the alignment according

to the number of similarities, identities, etc.

It has a standard set of units and is thus

independent of the scoring scheme - Expected-score (E-value) The number of

alignments with the same or higher score one can

expect to see by chance when searching a random

database with a random sequence of particular

sizes. The closer the e-value is to zero, the

greater the confidence that the hit is really a

homolog

Multiple Sequence Alignment (MSA)

Multiple sequence alignment

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI

GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S

eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE

DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se

q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD

--IAVEWWSNG--

Similar to pairwise alignment BUT n sequences are

aligned instead of just 2

Each row represents an individual sequence Each

column represents the same position

Why perform an MSA?

MSAs are at the heart of comparative genomics

studies which seek to study evolutionary

histories, functional and structural aspects of

sequences, and to understand phenotypic

differences between species

Multiple sequence alignment

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI

GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S

eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE

DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se

q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD

--IAVEWWSNG--

Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG Seq2 VTISCTGTSSNI

GS--ITVNWYQQLPG Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG S

eq4 LSLTCTVSGTSFDD--YYSTWVRQPPG Seq5 PEVTCVVVDVSHE

DPQVKFNWYVDG-- Seq6 ATLVCLISDFYPGA--VTVAWKADS-- Se

q7 AALGCLVKDYFPEP--VTVSWNSG--- Seq8 VSLTCLVKGFYPSD

--IAVEWWSNG--

Alignment methods

- There is no available optimal solution for MSA

all methods are heuristics - Progressive/hierarchical alignment (ClustalX)
- Iterative alignment (MAFFT, MUSCLE)

Progressive alignment

A B C D E

First step compute pairwise distances

Compute the pairwise alignments for all against

all (10 pairwise alignments). The similarities

are converted to distances and stored in a table

E D C B A

A

8 B

17 15 C

10 14 16 D

32 31 31 32 E

E D C B A

A

8 B

17 15 C

10 14 16 D

32 31 31 32 E

Second step build a guide tree

- Cluster the sequences to create a tree (guide

tree) - represents the order in which pairs of

sequences are to be aligned - similar sequences are neighbors in the tree
- distant sequences are distant from each other in

the tree

The guide tree is imprecise and is NOT the tree

which truly describes the evolutionary

relationship between the sequences!

Third step align sequences in a bottom up order

- Align the most similar (neighboring) pairs
- Align pairs of pairs
- Align sequences clustered to pairs of pairs

deeper in the tree

Main disadvantages of progressive alignments

Guide-tree topology may be considerably wrong

Globally aligning pairs of sequences may create

errors that will propagate through to the final

result

Iterative alignment

A B C DE

Pairwise distance table

Iterate until the MSA does not change

(convergence)

Guide tree

Blastp acquiring sequences

blastp acquiring sequences

blastp acquiring sequences

MSA input multiple sequence Fasta file

gtgi4504351refNP_000510.1 delta globin Homo

sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR

FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS

ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN

ALAHKYH gtgi4504349refNP_000509.1 beta

globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA

LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA

HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV

QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1

epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK

MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV

LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT

HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN

P_000175.1 G-gamma globin Homo

sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR

FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS

ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS

ALSSRYH gtgi28302131refNP_000550.2 A-gamma

globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET

LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK

HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV

QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1

hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA

KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA

VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP

ADFTAEAHAAWDKFLSVVSSVLTEK YR

MSA using ClustalX

Step1 Load the sequences

A little unclear

Edit Fasta headers

gtgi4504351refNP_000510.1 delta globin Homo

sapiens MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQR

FFESFGDLSSPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFSQLS

ELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN

ALAHKYH gtgi4504349refNP_000509.1 beta

globin Homo sapiens MVHLTPEEKSAVTALWGKVNVDEVGGEA

LGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLA

HLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPV

QAAYQKVVAGVAN ALAHKYH gtgi4885393refNP_005321.1

epsilon globin Homo sapiens MVHFTAEEKAAVTSLWSK

MNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKV

LT SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILAT

HFGKEFTPEVQAAWQKLVSAVAI ALAHKYH gtgi6715607refN

P_000175.1 G-gamma globin Homo

sapiens MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQR

FFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDAIKHLDDLKGTFAQLS

ELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS

ALSSRYH gtgi28302131refNP_000550.2 A-gamma

globin Homo sapiens MGHFTEEDKATITSLWGKVNVEDAGGET

LGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT SLGDATK

HLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEV

QASWQKMVTAVAS ALSSRYH gtgi4885397refNP_005323.1

hemoglobin, zeta Homo sapiens MSLTKTERTIIVSMWA

KISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAA

VGDA VKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFP

ADFTAEAHAAWDKFLSVVSSVLTEK YR

gt delta globin

gt beta globin

gt epsilon globin

gt G-gamma globin

gt A-gamma globin

gt hemoglobin zeta

Step2 Perform alignment

MSA and conservation view

(No Transcript)

Messing-up alignment of HIV-1 env

MSA tools

- Progressive
- CLUSTALX/CLUSTALW
- Iterative
- MUSCLE, MAFFT, PRANK