GLOBAL PAIRWISE ALIGNMENT: GLOBAL ALIGNMENT OF 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES

Assumptions

- Life is monophyletic.
- Biological entities (sequences, taxa) share common ancestry.
- Any two organisms share a common ancestor in their past.

[Figure: pairs of lineages whose common ancestors lived 5, 120, and 1,500 million years ago (MYA).]

Homologous sequences arise through (1) speciation events, (2) gene duplication, and (3) duplicative transposition.

Homology: a term coined by Richard Owen in 1843.

Definition: similarity resulting from common ancestry.

Homology

- There are three main types of molecular homology: orthology, paralogy (including ohnology), and xenology.

Homology: General Definition

- Homology designates a qualitative relationship of common descent between entities.
- Two genes are either homologous or they are not!
  - It doesn't make sense to say two genes are 43% homologous.
  - It doesn't make sense to say Linda is 43% pregnant.

Orthology, Paralogy

- Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes.
- Two genes are paralogs if they are related by gene duplication. Two genes are ohnologs if they are related by gene duplication due to genome duplication.


Gene death

Xenology is due to horizontal (lateral) gene

transfer (HGT or LGT)

XA and XB are xenologs

Distinguishing orthologs from xenologs is

impossible in pairwise genomic comparisons, but

possible when multiple genomes are compared

Orthology, Paralogy, Xenology (Fitch, Trends in Genetics, 2000, 16(5):227-231)

Homology

By comparing homologous characters, we can

reconstruct the evolutionary events that have led

to the formation of the extant sequences from the

common ancestor.

Homology

When comparing sequences, we are interested in

POSITIONAL HOMOLOGY. We identify POSITIONAL

HOMOLOGY through SEQUENCE ALIGNMENT.

Positional homology: in pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.

Alignment: a hypothesis concerning positional homology among residues from two or more sequences.

Sequence alignment involves the identification of

the correct location of deletions and insertions

that have occurred in either of the two lineages

since their divergence from a common ancestor.


Unknown sequence

Unknown events unknown sequence of events


The true alignment is unknown.

There are two modes of alignment.

- Global alignment: each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.
- Local alignment: determining if sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).

For reasons of computational complexity, sequence alignment is divided into two categories:

- Pairwise alignment (i.e., the alignment of two sequences).
- Multiple-sequence alignment (i.e., the alignment of three or more sequences).

Pairwise alignment problems have exact solutions. Multiple-sequence alignment problems can, in practice, only be solved approximately (heuristically).

A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:

(1) matches: the same nucleotide appears in both sequences.
(2) mismatches: different nucleotides are found in the two sequences.
(3) gaps: a base in one sequence and a null base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG

- Two DNA sequences, A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
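The quantities x, y, and z can be tallied directly from an alignment. The sketch below is a minimal illustration, not part of the original slides: the function name `count_pairs` is hypothetical, and the example uses the 24-column alignment shown earlier. Because every matched or mismatched pair uses one base from each sequence and every gap position uses one base in total, the definitions imply m + n = 2(x + y) + z, which the last line checks.

```python
def count_pairs(top, bottom):
    """Count matches (x), mismatches (y), and bases in gaps (z)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    x = y = z = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two null bases")
        if a == "-" or b == "-":
            z += 1  # a base opposite a null base
        elif a == b:
            x += 1  # match
        else:
            y += 1  # mismatch
    return x, y, z


top = "GCGGCCCATCAGGTAGTTGGTG-G"
bottom = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = count_pairs(top, bottom)

# Unaligned lengths m and n are recovered by dropping the gap symbols.
m = len(top.replace("-", ""))
n = len(bottom.replace("-", ""))

print(x, y, z)
print(m + n == 2 * (x + y) + z)
```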

There are internal and terminal gaps.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

A terminal gap may indicate missing data.


An internal gap indicates that a deletion or an

insertion has occurred in one of the two

lineages.


When sequences are compared through alignment, it

is impossible to tell whether a deletion has

occurred in one sequence or an insertion has

occurred in the other. Thus, deletions and

insertions are collectively referred to as indels

(short for insertion or deletion).


The alignment is the first step in many

functional and evolutionary studies. Errors in

alignment tend to amplify in later stages of the

study.

Motivation for sequence alignment

- Function
- Similarity may be indicative of similar function.
- Evolution
- Similarity may be indicative of common ancestry.

Some definitions

An example of pairwise alignment of an unknown protein with a known one

- (A) Glutaredoxin, bacteriophage T4 from E. coli, 87 aa
- (B) Unknown protein, bacteriophage 65 from Aeromonas sp., 93 aa

[Figure: alignment of glutaredoxin (Glutar) with the unknown protein (Unknow); position rulers and match marks omitted.]

Glutar  KVYGYDSNIHKCVYCDNAKRLLTVK KQPFEFINIMPEKGV---FDDEKIAELLTKLGR
Unknow  EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKK ANNQLGFDYILEKFDECKARANM

Glutar  DTQIGLTMPQVFAPDGSHIGGFDQLREYF
Unknow  QTR-PTSFPRIFV-DGQYI GSLKQFKDLY

Is the unknown protein a glutaredoxin?

Methods of alignment

1. Manual
2. Dot matrix
3. Distance matrix
4. Combined (distance + manual)

- Manual alignment. When there are few gaps and the

two sequences are not too different from each

other, a reasonable alignment can be obtained by

visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG

Advantages of manual alignment: (1) use of a powerful and trainable tool (the brain, well, some brains); (2) ability to integrate additional data, e.g., domain structure, biological function.


Protein Alignment may be guided by Secondary and

Tertiary Structures

Escherichia coli DjlA protein

Homo sapiens DjlA protein

Disadvantages of manual alignment

- subjectivity (the algorithm is unspecified)
- irreproducibility (the results cannot be independently reproduced)
- unscalability (inapplicable to long sequences)
- incommensurability (the results cannot be compared to those obtained by other methods)

The dot-matrix method (Gibbs and McIntyre, 1970)

The two sequences are written out as column and

row headings of a two-dimensional matrix. A dot

is put in the dot-matrix plot at a position where

the nucleotides in the two sequences are

identical.

The alignment is defined by a path from the

upper-left element to the lower-right element.

There are 4 possible steps in the path:

- (1) a diagonal step through a dot: a match.
- (2) a diagonal step through an empty element of the matrix: a mismatch.
- (3) a horizontal step: a gap in the sequence on the left of the matrix.
- (4) a vertical step: a gap in the sequence on the top of the matrix.

A dot matrix may become cluttered. With DNA sequences, 25% of the elements will be occupied by dots by chance alone.

window size = 1, stringency = 1, alphabet size = 4

The number of spurious matches is determined by window size (how many residues are compared), stringency (the minimum number of matches for a hit), and alphabet size (the number of character states). Window size must be an odd number.
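The effect of window size and stringency can be seen in a small sketch. This is an illustration under assumptions, not the original method's code: the function names `dot_matrix` and `render` are hypothetical, the window is centered on each cell and runs along the diagonal (which is why it must be odd), and window positions that fall outside the sequences simply count as non-matches.

```python
def dot_matrix(seq_rows, seq_cols, window=1, stringency=1):
    """Return a list of rows of booleans; True marks a dot.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    residue pairs on the diagonal centered at (i, j) are identical.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so it can be centered")
    half = window // 2
    dots = []
    for i in range(len(seq_rows)):
        row = []
        for j in range(len(seq_cols)):
            hits = 0
            for k in range(-half, half + 1):
                p, q = i + k, j + k
                if 0 <= p < len(seq_rows) and 0 <= q < len(seq_cols):
                    hits += seq_rows[p] == seq_cols[q]
            row.append(hits >= stringency)
        dots.append(row)
    return dots


def render(seq_rows, seq_cols, dots):
    """Draw the matrix as text: '*' for a dot, '.' for an empty element."""
    lines = ["  " + seq_cols]
    for residue, row in zip(seq_rows, dots):
        lines.append(residue + " " + "".join("*" if d else "." for d in row))
    return "\n".join(lines)


a, b = "GATTACA", "GATCACA"
print(render(a, b, dot_matrix(a, b)))                          # cluttered
print(render(a, b, dot_matrix(a, b, window=3, stringency=2)))  # filtered
```

With window size 1, every chance identity (for example, an A in one sequence against an unrelated A in the other) produces a dot. Raising the window to 3 with stringency 2 removes isolated dots whose diagonal neighbors do not also match, while keeping the main diagonal.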

window size = 1, stringency = 1, alphabet size = 4

window size = 3, stringency = 2, alphabet size = 4

window size = 1, stringency = 1, alphabet size = 20

Dot-matrix methods

Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.

Disadvantages: May not identify the best possible alignment.

Window size = 60 amino acids; stringency = 24 matches

Advantages: highlighting information


The two pairs of diagonally oriented parallel

lines most probably indicate that two small

internal duplications occurred in the bacterial

gene.

Disadvantages: Not possible to identify the best alignment.

Scoring Matrices and Gap Penalties

The true alignment between two sequences is the

one that reflects accurately the evolutionary

relationships between the sequences. Since the

true alignment is unknown, in practice we look

for the optimal alignment, which is the one in

which the numbers of mismatches and gaps are

minimized according to certain criteria.

Unfortunately, reducing the number of mismatches

results in an increase in the number of gaps, and

vice versa.

α = matches, β = mismatches, γ = nucleotides in gaps, δ = gaps

The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b). The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.

DNA scoring matrices are usually simple. In the simplest scheme, all mismatches are given the same penalty: M(a,b) is positive if a = b and negative otherwise. In more complicated matrices, a distinction may be made between transition and transversion mismatches, or each type of mismatch may be penalized differently.
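A DNA scoring function that distinguishes transitions from transversions might look like the sketch below. The names `dna_score` and `ungapped_score` are hypothetical, and the numeric values (5, -1, -3) are illustrative assumptions rather than values taken from any published matrix. Setting `transition` equal to `transversion` recovers the simplest scheme, in which all mismatches receive the same penalty.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=5, transition=-1, transversion=-3):
    """Return M(a, b) for two nucleotides (illustrative values only)."""
    for base in (a, b):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    if same_class:
        return transition    # A<->G or C<->T
    return transversion      # purine <-> pyrimidine


def ungapped_score(seq1, seq2, **scores):
    """Sum M(a, b) over the columns of a gap-free alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(dna_score(a, b, **scores) for a, b in zip(seq1, seq2))


print(dna_score("A", "A"), dna_score("A", "G"), dna_score("A", "C"))
print(ungapped_score("GATT", "GACT"))
```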

Further complications: distinguishing among different matches and mismatches. For example, a mismatched pair consisting of Leu and Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg and Glu, which are very dissimilar from each other.

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

B = Asx (Asp or Asn); X = unknown; Z = Glx (Glu or Gln); * = termination codon

- The matrix is symmetrical.
- There are positive numbers on the diagonal.
- Mismatches are usually penalized.
- Some mismatches are not penalized.
- A few mismatches are even rewarded.

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches. The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

Mismatches

Gaps

The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.

Three main gap-penalty systems:

(1) Fixed gap-penalty system: 0 gap-extension costs.
(2) Linear gap-penalty system: the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
(3) Logarithmic gap-penalty system: the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly.
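The three systems can be compared side by side with a short sketch. The function names and default values are hypothetical. The fixed and linear forms follow the descriptions above; the logarithmic form (extension constant times the natural logarithm of the gap length) is one plausible reading and is an assumption, since the exact formula is not given here.

```python
import math


def _check(length):
    if length < 1:
        raise ValueError("a gap must contain at least one position")


def fixed_gap_cost(length, opening=4.0):
    """Fixed system: the opening penalty only; extension costs nothing."""
    _check(length)
    return opening


def linear_gap_cost(length, opening=4.0, extension=1.0):
    """Linear system: opening plus (length - 1) times the extension penalty."""
    _check(length)
    return opening + (length - 1) * extension


def logarithmic_gap_cost(length, opening=4.0, extension=1.0):
    """Logarithmic system (assumed form): opening plus extension * ln(length)."""
    _check(length)
    return opening + extension * math.log(length)


for k in (1, 2, 5, 20):
    print(k, fixed_gap_cost(k), linear_gap_cost(k),
          round(logarithmic_gap_cost(k), 2))
```

All three agree for a gap of length 1; for longer gaps the logarithmic cost grows more slowly than the linear cost, so long indels are penalized less severely.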

Alignment algorithms

Aim: Given a predetermined set of criteria, find the alignment associated with the best score from among all possible alignments: the OPTIMAL ALIGNMENT.

The number of possible alignments may be astronomical.

[Formula for the number of possible alignments, where n and m are the lengths of the two sequences to be aligned.]

For example, when two DNA sequences 200 residues long each are compared, there are more than $10^{153}$ possible alignments. In comparison, the number of protons in the universe is only $10^{80}$.

FORTUNATELY: There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.

The Needleman-Wunsch (1970) algorithm uses Dynamic Programming

Dynamic programming: a computational technique.

It is applicable when large searches can be

divided into a succession of small stages, such

that (1) the solution of the initial search stage

is trivial, (2) each partial solution in a later

stage can be calculated by reference to only a

small number of solutions in an earlier stage,

and (3) the last stage contains the overall

solution.

Dynamic programming can be applied to problems of

alignment because ALIGNMENT SCORES obey the

following rules

Path Graph for aligning two sequences

allowed

not allowed


Scoring scheme: match = 5, mismatch = -3, gap-opening penalty = -4, gap-extension penalty = 0

Matrix initialization

- 0 + match = 5
- 0 + gap = -4
- 0 + gap = -4

Matrix fill

- 0 + match = 5
- 5 + gap = 1
- 0 + gap = -4

and so on and so forth

Complete matrix fill

Trace back

The alignment is produced by starting either at the highest score in the rightmost column or the bottom row (if terminal gaps are allowed) or at the bottom rightmost cell (if they are not), and proceeding from right to left by following the best pointers. This stage is called the traceback. The graph of pointers in the traceback is also referred to as the path graph because it defines the paths through the matrix that correspond to the optimal alignment or alignments.

Trace back (if we DO allow terminal gaps)

Trace back (if we DO NOT allow terminal gaps)

With match = 5, mismatch = -3, and gap = -4, each step checks which of the three neighboring cells yields the current score:

- 11: 10 + gap ≠ 11; 14 + mismatch = 11; 10 + gap ≠ 11
- 14: 10 + gap ≠ 14; 9 + match = 14; 5 + gap ≠ 14
- 9: 4 + mismatch ≠ 9; 13 + gap = 9; 0 + gap ≠ 9
- 13: 8 + match = 13; 4 + gap ≠ 13; 9 + gap ≠ 13
- 8: 1 + gap ≠ 8; 12 + gap = 8; 3 + match = 8 (two valid predecessors, so the path forks)
- 12: 7 + gap ≠ 12; 7 + match = 12; 3 + gap ≠ 12
- 3: 7 + gap = 3; 6 + gap ≠ 3; 2 + mismatch ≠ 3

high road / low road / middle road

Trace back (complete)

Two possible alignments:

GAATTCAGT
GGA-TC-GA

GAATTCAGT
GGAT-C-GA
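The matrix fill and traceback of the worked example can be reproduced with a compact sketch of Needleman-Wunsch global alignment. The function name `needleman_wunsch` is hypothetical, and two simplifying assumptions are made: every gap position is charged 4 (which coincides with an opening penalty of 4 whenever gaps have length 1, as they do in this example), and terminal gaps are penalized, so the traceback starts at the bottom rightmost cell. Only one of the tied optimal paths is returned.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: leading gaps
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):          # matrix fill
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),   # diagonal step
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Traceback from the bottom rightmost cell; ties prefer the diagonal.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)
print(row_a)
print(row_b)
```

The final score is 11, the value at which the traceback above begins, and the returned alignment is one of the two listed above. Recovering both would require following every tied pointer instead of the first one found.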

Scoring Matrices

Mismatch and gap penalties should be inversely

proportional to the frequencies with which

changes occur.

Transitions (68%) occur more frequently than transversions (32%). Mismatch penalties for transitions should be smaller than those for transversions.

Empirical substitution matrices

- PAM (Percent/Point Accepted Mutation)
- BLOSUM (BLOcks SUbstitution Matrix)

PAM

- Developed by Margaret Dayhoff in 1978.
- Based on comparisons of very similar protein

sequences.

Log-odds ratios

- A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment.
- The values in a scoring matrix are log ratios of two probabilities.
  - One is the probability of the pair occurring by chance. The other is the empirically observed probability of the pair.
- Because the scores are logarithms of probability ratios, they can be added to give a meaningful score for the entire alignment. The more positive the score, the better the alignment!

The PAM matrices (Percent accepted mutations)

- Align sequences that are at least 85% identical.
  - Minimizes ambiguity in alignments and the number of coincident mutations.
- Reconstruct phylogenetic trees and infer ancestral sequences.
- Tally replacements "accepted" by natural selection, in all pairwise comparisons.
  - Meaning, the number of times j was replaced by i in all comparisons.
- Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced).

The PAM matrices

- Combine data to produce a Mutation Probability Matrix for one PAM of evolutionary distance, which is used to calculate the Log Odds Matrix for similarity scoring.
- Thus, depending on the protein family used, various PAM matrices result, some of which are good at locating evolutionarily distant conserved mutations and some of which are good at locating evolutionarily close conserved mutations.

More on log-odds ratios

In PAM, log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM score of 2 actually corresponds to a log-odds ratio of 0.2.

$0.2 = \log_{10}\left(\frac{\text{observed } i \to j \text{ mutation rate}}{\text{expected rate}}\right)$

The value 0.2 is $\log_{10}$ of the relative expectation value of the mutation. Therefore, the expectation value is $10^{0.2} \approx 1.6$. So, a PAM score of 2 indicates that (in related sequences) the mutation would be expected to occur 1.6 times more frequently than random.
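The conversion between a scaled PAM score and the underlying odds ratio is a one-line calculation, sketched below. The function names are hypothetical, and the scale factor of 10 is the one stated above. The second function illustrates why log-odds scores can be added: summing scores over columns is equivalent to multiplying the per-column odds ratios.

```python
import math


def pam_score_to_odds(score, scale=10):
    """Convert a scaled log-odds score to an observed/expected ratio."""
    return 10 ** (score / scale)


def alignment_odds(column_scores, scale=10):
    """Odds ratio for a whole alignment, from the sum of its column scores."""
    return pam_score_to_odds(sum(column_scores), scale)


print(round(pam_score_to_odds(2), 2))    # a score of 2
print(pam_score_to_odds(0))              # a score of 0: as frequent as chance

# Adding scores is the same as multiplying the individual odds ratios.
scores = [2, -1, 3]
product = math.prod(pam_score_to_odds(s) for s in scores)
print(math.isclose(alignment_odds(scores), product))
```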

PAM250

- Calculated for families of related proteins (>85% identity).
- 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues.
- A positive score signifies a common replacement, whereas a negative score signifies an unlikely replacement.
- The PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e., 250 substitutions in 100 amino acids (longer evolutionary time).

PAM250

Sequence alignment matrix that allows 250

accepted point mutations per 100 amino acids.

PAM250 is suitable for comparing distantly

related sequences, while a lower PAM is suitable

for comparing more closely related sequences.

Selecting a PAM Matrix

- Low PAM numbers: short sequences, strong local similarities.
- High PAM numbers: long sequences, weak similarities.
  - PAM60 for close relations (60% identity)
  - PAM120 recommended for general use (40% identity)
  - PAM250 for distant relations (20% identity)
- If uncertain, try several different matrices.
  - PAM40, PAM120, PAM250 recommended.

BLOSUM

- Blocks Substitution Matrix
- Steven Henikoff and Jorja G. Henikoff (1992).
- Based on BLOCKS database (www.blocks.fhcrc.org)
- Families of proteins with identical function.
- Highly conserved protein domains.
- Ungapped local alignment to identify motifs
- Each motif is a block of local alignment.
- Counts amino acids observed in same column.
- Symmetrical model of substitution.

BLOSUM62

- BLOSUM matrices are based on local alignments (blocks or conserved amino acid patterns).
- BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity.
- All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
- BLOSUM 62 is the default matrix in BLAST 2.0.

BLOSUM Matrices

- Different BLOSUMn matrices are calculated independently from BLOCKS.
- BLOSUMn is based on sequences that are at most n percent identical.

BLOSUM62

The procedure for calculating a BLOSUM matrix is based on a likelihood method estimating the occurrence of each possible pairwise substitution. Only aligned blocks are used to calculate the BLOSUMs. The higher the score, the more closely related the sequences.

Why is BLOSUM62 called BLOSUM62?

Because all blocks whose members shared at least 62% identity with ANY other member of that block were averaged and represented as 1 sequence.

Selecting a BLOSUM Matrix

- For BLOSUMn, higher n is suitable for sequences which are more similar.
  - BLOSUM62 recommended for general use
  - BLOSUM80 for close relations
  - BLOSUM45 for distant relations

Equivalent PAM and Blosum matrices

The following matrices are roughly equivalent:

- PAM100 => Blosum90
- PAM120 => Blosum80
- PAM160 => Blosum60
- PAM200 => Blosum52
- PAM250 => Blosum45

Generally speaking:

- The Blosum matrices are best for detecting local alignments.
- The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
- The Blosum45 matrix is the best for detecting long and weak alignments.

Less divergent

More divergent

Comparison of PAM250 and BLOSUM62

The relationship between BLOSUM and PAM substitution matrices: BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins. If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.

Scoring matrices commonly used

- PAM250
  - Shown to be appropriate for searching for sequences of 17-27% identity.
- BLOSUM62
  - Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.
- BLOSUM50
  - Shown to be better for FASTA searches.

Effect of gap penalties on amino-acid alignment

Human pancreatic hormone precursor versus chicken pancreatic hormone.

(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is $w_k = 1 + 0.1k$.
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.

Alignments: things to keep in mind

- Optimal alignment means having the highest possible score, given a substitution matrix and a set of gap penalties.
- This is NOT necessarily the most meaningful alignment.
- The assumptions of the algorithm are often wrong:
  - substitutions are not equally frequent at all positions,
  - it is very difficult to realistically model insertions and deletions.
- Pairwise alignment programs ALWAYS produce an alignment (even when it does not make sense to align sequences).
