
ALIGNMENT OF NUCLEOTIDE AND AMINO-ACID SEQUENCES


Homology: The term was coined by Richard Owen in 1843.

Definition: Similarity resulting from common ancestry.

Homology: A qualitative statement

- Homology designates a relationship of common descent between entities.
- Two genes are either homologs or not.
- It doesn't make sense to say two genes are 43% homologous.
- It doesn't make sense to say Linda is 24% pregnant.

Homology

By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.

Homology

When dealing with sequences, we are interested in POSITIONAL HOMOLOGY. We identify positional homology by ALIGNMENT.

Ancestral sequence: ACTGGGCCCAAATC

- 1 insertion, 1 substitution: AACAGGGCCCAAATC
- 1 deletion, 1 substitution: CTGGGCCCAGATC

Correct alignment (dots mark mismatches)

--CTGGGCCCAGATC
AACAGGGCCCAAATC
   .       .

Incorrect alignment (dots mark mismatches)

CTGGGCCCAGATC--
AACAGGGCCCAAATC
.... .. .. ..

Ancestral sequence: Unknown!

- Unknown processes: AACAGGGCCCAAATC
- Unknown processes: CTGGGCCCAGATC

Correct alignment?

--CTGGGCCCAGATC
AACAGGGCCCAAATC
   .       .

Incorrect alignment?

CTGGGCCCAGATC--
AACAGGGCCCAAATC
.... .. .. ..

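Ancestral sequence: ACCTGAATTTGCCC

- Deletion of T9, substitution G5T, insertion of ACA at position 12: ACCTTAATTGCACACC
- Deletion of A6, deletion of A7, substitution T8A, insertion of G at position 2: AGCCTGATTGCCC

Alignment of the two descendants:

ACCTTAATTGCACACC
AGCCTGATTGCCC---

Events inferred from this alignment: C2G, T4C, A6G, A12C, -ACC14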

Positional homology: A pair of nucleotides from two aligned sequences that have descended from one nucleotide in the ancestor of the two sequences.

Alignment: A hypothesis concerning positional homology among residues in a sequence.

An alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:

(1) matches: the same nucleotide appears in both sequences.
(2) mismatches: different nucleotides are found in the two sequences.
(3) gaps: a base in one sequence and a null base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
   ..       . .

Sequence alignment: The identification of the locations of deletions or insertions that might have occurred in either of the two lineages since their divergence from a common ancestor.

Insertion + Deletion = Indel or Gap

Sequence alignment

1. Pairwise alignment
2. Multiple alignment

- Two DNA sequences, A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
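The quantities x, y, and z can be read directly off a gapped alignment. The following is a minimal sketch, assuming the alignment is given as two equal-length strings with `-` as the gap symbol; the function name `count_pairs` is hypothetical and not part of the original slides.

```python
def count_pairs(a: str, b: str, gap: str = "-"):
    """Return (x, y, z): matched pairs, mismatched pairs, bases opposite a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    x = y = z = 0
    for p, q in zip(a, b):
        if p == gap and q == gap:
            raise ValueError("a column cannot pair two gap symbols")
        if p == gap or q == gap:
            z += 1          # one real base facing a null base
        elif p == q:
            x += 1          # match
        else:
            y += 1          # mismatch
    return x, y, z


x, y, z = count_pairs("GCGGCCCATCAGGTAGTTGGTG-G",
                      "GCGTTCCATC--CTGGTTGGTGTG")
print(x, y, z)
```

Because every matched or mismatched pair uses one base from each sequence and every gap position uses exactly one base, the counts satisfy m + n = 2(x + y) + z, which is a quick sanity check on any alignment.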

There are terminal and internal gaps.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

A terminal gap may indicate missing data.

An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.

The alignment is the first step in many evolutionary and functional studies. Errors in alignment tend to amplify in later computational stages.

Methods of alignment

1. Manual
2. Dot matrix
3. Distance matrix
4. Combined (Distance + Manual)

- Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGTTCCATCAGGTGGTTGGTGTG
              .

Advantages of manual alignment:

(1) Use of a powerful and trainable tool (the brain, well, some brains).
(2) Ability to integrate additional data, e.g., domain structure, biological function.


Protein Alignment may be guided by Tertiary

Structures

Escherichia coli DjlA protein

Homo sapiens DjlA protein

Disadvantages of manual alignment:

(1) The method is subjective and unscalable.

The dot-matrix method: The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
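The construction just described is small enough to sketch in a few lines. This is an illustrative sketch only: the helper names `dot_matrix` and `render` are hypothetical, and the two sequences are the example pair used earlier in these slides.

```python
def dot_matrix(top: str, left: str):
    """Cell [i][j] is True when left[i] and top[j] are the same residue."""
    return [[row_base == col_base for col_base in top] for row_base in left]


def render(top: str, left: str) -> str:
    """Draw the matrix as text: column headings on top, row headings on the left."""
    lines = ["  " + top]
    for row_base, row in zip(left, dot_matrix(top, left)):
        lines.append(row_base + " " + "".join("." if hit else " " for hit in row))
    return "\n".join(lines)


print(render("AACAGGGCCCAAATC", "CTGGGCCCAGATC"))
```

Runs of dots along a diagonal correspond to stretches where the two sequences agree without any intervening gap.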

The alignment is defined by a path from the upper-left element to the lower-right element.

There are 4 possible steps in the path:

- (1) a diagonal step through a dot: a match.
- (2) a diagonal step through an empty element of the matrix: a mismatch.
- (3) a horizontal step: a gap in the sequence on the top of the matrix.
- (4) a vertical step: a gap in the sequence on the left of the matrix.

Forbidden directions

Allowed directions

A dot matrix may become cluttered. With DNA sequences, 25% of the elements will be occupied by dots by chance alone.

Window size = 1, stringency = 1, alphabet size = 4

The number of spurious matches is determined by window size, stringency, and alphabet size.

- Window size = 1, stringency = 1, alphabet size = 4
- Window size = 3, stringency = 2, alphabet size = 4
- Window size = 1, stringency = 1, alphabet size = 20
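To see how the three settings above change the amount of background noise, one can estimate the chance that a single cell receives a dot in two unrelated sequences. The sketch below rests on assumptions that are not stated in the slides: all residues are equally frequent, positions are independent, and "stringency" means the minimum number of matches required inside the window. The function name is hypothetical.

```python
from math import comb


def spurious_dot_probability(window: int, stringency: int, alphabet: int) -> float:
    """Binomial tail: P(at least `stringency` matches in `window` random positions)."""
    p = 1.0 / alphabet                      # chance that one random position matches
    return sum(
        comb(window, k) * p ** k * (1.0 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


for window, stringency, alphabet in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    prob = spurious_dot_probability(window, stringency, alphabet)
    print(window, stringency, alphabet, round(prob, 3))
```

Under these assumptions the three settings give probabilities of 0.25, roughly 0.156, and 0.05, so either a wider window with a stringency threshold or a larger alphabet (amino acids instead of nucleotides) thins out the chance dots.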

Dot-matrix methods. Advantages: May unravel information on the evolution of sequences.

Window size = 60 amino acids, stringency = 24 matches

Advantages: Highlighting information

The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene.

Dot-matrix methods. Disadvantage: May not identify the best alignment.

Distance and similarity methods

The best possible alignment (optimal alignment) is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.

Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.

α = matches, β = mismatches, γ = nucleotides in gaps, δ = gaps

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are multiplied to make the gaps equivalent in value to the mismatches. The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

Mismatch penalty is an assessment of how frequently substitutions occur.

- The distance (dissimilarity) index (D) between two sequences in an alignment is

$D = \sum_i m_i y_i + \sum_k w_k z_k$

where $y_i$ is the number of mismatches of type i, $m_i$ is the mismatch penalty for an i-type of mismatch, $z_k$ is the number of gaps of length k, and $w_k$ is a positive number representing the penalty for gaps of length k.

- The similarity index (S) between two sequences in an alignment is

$S = x - \sum_k w_k z_k$

where x is the number of matches, $z_k$ is the number of gaps of length k, and $w_k$ is a positive number representing the penalty for gaps of length k.
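Both indices can be computed from a finished alignment once the penalties are chosen. The sketch below assumes that D is the sum of penalised mismatches plus penalised gaps and that S is the number of matches minus the penalised gaps, as described in the definitions above. The function names, the unit mismatch penalty, and the gap penalty 1 + 0.1k used in the example call are illustrative choices only.

```python
from itertools import groupby


def column_kind(p: str, q: str, gap: str = "-") -> str:
    """Classify one alignment column."""
    if p == gap:
        return "gap_in_a"
    if q == gap:
        return "gap_in_b"
    return "match" if p == q else "mismatch"


def score_alignment(a, b, mismatch_penalty, gap_penalty):
    """Return (D, S) for two aligned strings of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    kinds = [column_kind(p, q) for p, q in zip(a, b)]
    x = kinds.count("match")
    mismatch_cost = sum(
        mismatch_penalty(p, q)
        for p, q, kind in zip(a, b, kinds)
        if kind == "mismatch"
    )
    # A gap of length k is a run of k consecutive gap columns in the same sequence.
    gap_cost = sum(
        gap_penalty(len(list(run)))
        for kind, run in groupby(kinds)
        if kind.startswith("gap")
    )
    return mismatch_cost + gap_cost, x - gap_cost


D, S = score_alignment(
    "GCGGCCCATCAGGTAGTTGGTG-G",
    "GCGTTCCATC--CTGGTTGGTGTG",
    mismatch_penalty=lambda p, q: 1.0,     # every mismatch costs the same (assumption)
    gap_penalty=lambda k: 1.0 + 0.1 * k,   # illustrative gap penalty
)
print(round(D, 2), round(S, 2))
```

With these settings the example alignment has 17 matches, 4 mismatches, one gap of length 2, and one gap of length 1, giving D of about 6.3 and S of about 14.7.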

The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.

Three main systems:

(1) Fixed gap-penalty system: 0 gap-extension costs.
(2) Linear gap-penalty system: the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
(3) Logarithmic gap-penalty system: the gap-extension penalty increases with the logarithm of the gap length, i.e., slower.
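The three systems differ only in how the penalty grows with gap length k. The functions below are a sketch under assumptions: the parameter names `opening` and `extension` are hypothetical, the penalty is taken to be the opening cost plus an extension term, and the logarithmic form uses the natural logarithm, which the slides do not specify.

```python
from math import log


def fixed_gap_penalty(k: int, opening: float) -> float:
    """Fixed system: extension is free, so every gap costs the opening penalty."""
    return opening


def linear_gap_penalty(k: int, opening: float, extension: float) -> float:
    """Linear system: each base beyond the first adds a constant extension cost."""
    return opening + extension * (k - 1)


def logarithmic_gap_penalty(k: int, opening: float, extension: float) -> float:
    """Logarithmic system: the extension cost grows with log(k), i.e. more slowly."""
    return opening + extension * log(k)


for k in (1, 2, 10, 100):
    print(
        k,
        fixed_gap_penalty(k, 2.0),
        linear_gap_penalty(k, 2.0, 0.5),
        round(logarithmic_gap_penalty(k, 2.0, 0.5), 3),
    )
```

All three agree for a gap of length 1; they diverge for long gaps, where the linear system penalises a single long indel far more heavily than the logarithmic one.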


Further complications: Distinguishing among different matches and mismatches. For example, a mismatched pair consisting of Leu and Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg and Glu, which are very dissimilar from each other.
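One simple way to express such residue-dependent penalties is to group amino acids by biochemical class and charge less for a mismatch within a class. The sketch below is purely illustrative: the class table covers only a handful of residues, and the penalty values 0.5 and 2.0 are hypothetical numbers, not values taken from any published substitution matrix.

```python
# One-letter codes grouped by side-chain chemistry (a deliberately tiny table).
RESIDUE_CLASS = {
    "L": "aliphatic", "I": "aliphatic", "V": "aliphatic",
    "R": "basic", "K": "basic",
    "D": "acidic", "E": "acidic",
}


def mismatch_penalty(p: str, q: str, similar: float = 0.5, dissimilar: float = 2.0) -> float:
    """Return 0 for a match, a small penalty within a class, a large one otherwise."""
    if p == q:
        return 0.0
    class_p, class_q = RESIDUE_CLASS.get(p), RESIDUE_CLASS.get(q)
    if class_p is not None and class_p == class_q:
        return similar
    return dissimilar


print(mismatch_penalty("L", "I"))  # Leu vs Ile: same class, lesser penalty
print(mismatch_penalty("R", "E"))  # Arg vs Glu: opposite charge, larger penalty
```

In practice, empirically derived substitution matrices play this role; the class-based table here only shows the shape of the idea.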

Alignment algorithms

Aim: Find the alignment associated with the smallest D (or largest S) from among all possible alignments.

The number of possible alignments may be astronomical. For example, when two sequences 300 residues long each are compared, there are 10^88 possible alignments. In comparison, the number of elementary particles in the universe is only 10^80.

There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.

The Needleman-Wunsch algorithm uses Dynamic Programming

Dynamic programming: a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.
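The three properties map directly onto a global-alignment score table: the first row and column are trivial, each cell depends on only three neighbouring cells, and the last cell holds the overall optimum. The sketch below is a common textbook form of Needleman-Wunsch-style global alignment, not necessarily the exact scheme of the original publication; the scoring values (match +1, mismatch -1, gap -1 per gapped position) are assumptions chosen for illustration.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # stage 1: trivial borders (all gaps)
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):            # stage 2: each cell uses three neighbours
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,              # a[i-1] opposite a gap
                score[i][j - 1] + gap,              # b[j-1] opposite a gap
            )

    # stage 3: the last cell is the optimum; trace back one optimal path
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("CTGGGCCCAGATC", "AACAGGGCCCAAATC")
print(best)
print(top)
print(bottom)
```

The table has (n + 1)(m + 1) cells, so the work grows with the product of the two lengths rather than with the number of possible alignments. When several alignments share the optimal score, the traceback returns just one of them.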

Multiple Sequence Alignment

Alignments can be easy or difficult

Easy:

GCGGCCCA TCAGGTAGTT GGTGG
GCGGCCCA TCAGGTAGTT GGTGG
GCGTTCCA TCAGCTGGTT GGTGG
GCGTCCCA TCAGCTAGTT GGTGG
GCGGCGCA TTAGCTAGTT GGTGA

Difficult:

TTGACATG CCGGGG---A AACCG
T-GACATG CCGGTG--GT AAGCC
TTGGCATG -CTAGG---A ACGCG
TTGACATG -CTAGGGAAC ACGCG
TTGACATC -CTCTG---A ACGCG


Multiple Alignment

- 2 methods:
- Dynamic programming (exhaustive, exact)
  - Consider 2 protein sequences of 100 amino acids in length.
  - If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 seconds to align 4 sequences, etc.
  - More time than the universe has existed to align 20 sequences exhaustively.
- Progressive alignment (heuristic, approximate)
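The growth described above is easy to check numerically. The sketch below takes the slide's premise at face value (aligning N sequences of length 100 costs 100^N seconds) and compares it with an age of the universe of about 13.8 billion years; the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_SECONDS = 13.8e9 * SECONDS_PER_YEAR   # roughly 4.35e17 seconds


def exhaustive_time_seconds(num_sequences: int, length: int = 100) -> int:
    """Cost of exhaustive alignment under the premise length ** num_sequences."""
    return length ** num_sequences


def first_infeasible_count(length: int = 100) -> int:
    """Smallest number of sequences whose cost exceeds the age of the universe."""
    n = 2
    while exhaustive_time_seconds(n, length) <= AGE_OF_UNIVERSE_SECONDS:
        n += 1
    return n


print(first_infeasible_count())
print(exhaustive_time_seconds(20) / AGE_OF_UNIVERSE_SECONDS)
```

Under this premise the limit is already crossed at 9 sequences (10^18 seconds), so 20 sequences at 10^40 seconds is far beyond reach, which is why heuristic methods are used for multiple alignment.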

Progressive Alignment

- Devised by Feng and Doolittle in 1987.
- Essentially a heuristic method and as such is not guaranteed to find the optimal alignment.
- Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) = n(n-1)/2 pairwise alignments as a starting point.
- Most successful implementation is Clustal (Des Higgins).
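The number of pairwise alignments needed as a starting point is simply the number of unordered pairs of sequences. A minimal sketch, using the five globin names that appear in the Clustal example of these slides; the function name `pairwise_jobs` is hypothetical.

```python
from itertools import combinations


def pairwise_jobs(names):
    """List every unordered pair of sequences that must be aligned first."""
    return list(combinations(names, 2))


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
jobs = pairwise_jobs(names)
print(len(jobs))            # n(n-1)/2 pairs for n sequences
for first, second in jobs:
    print(first, second)
```

For n = 5 this gives 10 pairs, one for each off-diagonal entry of a 5 x 5 lower-triangular distance matrix.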

Overview of Clustal Procedure

CLUSTAL

1. Quick pairwise alignments
2. Distances for each pair
3. Distance matrix

Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -

Neighbor-joining tree (guide tree)

[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, and Myg_Whale; internal nodes numbered 1 to 4]

Progressive alignment following guide tree

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
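To illustrate how a distance matrix can be turned into an order of joins for progressive alignment, the sketch below clusters the five globin sequences using the distances shown in the Clustal overview. Note the substitution: Clustal uses a neighbor-joining guide tree, whereas this sketch uses simple average-linkage clustering (UPGMA) because it fits in a few lines. The function and variable names are hypothetical, and the join order it prints need not match the node numbering in the original figure.

```python
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
# Lower triangle of the distance matrix, one row per sequence.
LOWER = [[], [0.17], [0.59, 0.60], [0.59, 0.59, 0.13], [0.77, 0.77, 0.75, 0.75]]
DIST = {
    frozenset((NAMES[i], NAMES[j])): d
    for i, row in enumerate(LOWER)
    for j, d in enumerate(row)
}


def upgma_join_order(names, dist):
    """Return the successive joins as (cluster_a, cluster_b, average_distance)."""
    clusters = [(name,) for name in names]

    def average(c1, c2):
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    joins = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (ties broken by position).
        d, i, j = min(
            (average(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        joins.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return joins


for left, right, d in upgma_join_order(NAMES, DIST):
    print(left, "+", right, round(d, 4))
```

With these distances the two alpha globins are joined first, then the two beta globins, then the alpha and beta clusters, and finally myoglobin; a progressive aligner would align sequences and partial alignments in an order of this kind.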

Clustal good points/bad points

- Advantages
- Speed.
- Disadvantages
- No way of knowing if the alignment is correct.

Effect of gap penalties on amino-acid alignment

Human pancreatic hormone precursor versus chicken pancreatic hormone

(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is w_k = 1 + 0.1k.
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.

An Alignment

GCGGCTCA TCAGGTAGTT GGTG-G  Spinach
GCGGCCCA TCAGGTAGTT GGTG-G  Rice
GCGTTCCA TC--CT-GTT GGTGTG  Mosquito
GCGTCCCA TCAGCTAGTT GTTG-G  Monkey
GCGGCGCA TTAGCTAGTT GGTG-A  Human