Homework assignments - PowerPoint PPT Presentation

1
Homework assignments
  • Kindly provide details on searches and other
    procedures.
  • Type it. Handwritten assignments will not be
    accepted.
  • Staple. Pages (and grades) may go missing
    otherwise.
  • Do not include questions. Waste of paper.
  • The assignments are individual. Collaborative
    efforts will be graded proportionally. That is,
    the grade will be divided by the number of
    participants in the effort.

2
PAIRWISE ALIGNMENT or ALIGNMENT OF TWO
NUCLEOTIDE OR AMINO-ACID SEQUENCES
3
Assumptions: Life is monophyletic. Biological
entities share common ancestry.
4
Any two organisms share a common ancestor in
their past
5
(5 MYA)
6
(120 MYA)
ancestor
7
(1,500 MYA)
ancestor
8
Gene duplication results in homologous entities
9
Homology: The term was coined by Richard Owen in
1843. Definition: Similarity resulting from
common ancestry.
10
Homology
  • There are three types of homology: orthology,
    paralogy, and xenology
  • The distinction among the three types of homology
    was introduced by Walter Fitch in 1970

11
Homology General Definition
  • Homology designates a relationship of common
    descent between entities
  • Two genes are either homologs or not
  • It doesn't make sense to say two genes are 43%
    homologous.
  • It doesn't make sense to say Linda is 24%
    pregnant.

12
Orthology vs. Paralogy
  • Two genes are orthologs if they originated from a
    single ancestral gene in the most recent common
    ancestor of their respective genomes
  • Two genes are paralogs if they are related by
    duplication

13
(No Transcript)
14
(No Transcript)
15
Xenology is due to horizontal (lateral) gene
transfer (HGT or LGT)
XA and XB are xenologs
Distinguishing orthologs from xenologs is
impossible in pairwise genomic comparisons, but
possible when multiple genomes are compared
16
Orthology vs. Paralogy (Figure from Fitch, Trends
in Genetics, 2000, 16(5):227-231)
17
Homology
By comparing homologous characters, we can
reconstruct the evolutionary events that have led
to the formation of the extant sequences from the
common ancestor.
18
Homology
When dealing with sequences, we are interested in
POSITIONAL HOMOLOGY. We identify positional
homology by ALIGNMENT.
19
Ancestor: ACTGGGCCCAAATC
Descendant 1 (1 insertion, 1 substitution): AACAGGGCCCAAATC
Descendant 2 (1 deletion, 1 substitution): CTGGGCCCAGATC
Correct alignment:
--CTGGGCCCAGATC
AACAGGGCCCAAATC
Incorrect alignment:
CTGGGCCCAGATC--
AACAGGGCCCAAATC
20
Ancestor: Unknown!
Descendant 1 (unknown processes): AACAGGGCCCAAATC
Descendant 2 (unknown processes): CTGGGCCCAGATC
Correct alignment? Incorrect alignment?
CTGGGCCCAGATC--
AACAGGGCCCAAATC

--CTGGGCCCAGATC
AACAGGGCCCAAATC
21
Ancestor: ACCTGAATTTGCCC
Lineage 1 (T9, G5T, ACA12): ACCTTAATTGCACACC
Lineage 2 (-A6, -A7, T8A, G2): AGCCTGATTGCCC
Alignment:
ACCTTAATTGCACACC
AGCCTGATTGCCC---
Changes inferred from this alignment: C2G, T4C, A6G, A12C, -ACC14
22
Positional homology: A pair of nucleotides from
two aligned sequences that have descended from
one nucleotide in the ancestor of the two
sequences.
Alignment: A hypothesis concerning positional
homology among residues in a sequence.
23
A pairwise alignment consists of a series of
paired bases, one base from each sequence. There
are three types of pairs:
  • (1) matches: the same nucleotide appears in both
    sequences.
  • (2) mismatches: different nucleotides are found in
    the two sequences.
  • (3) gaps: a base in one sequence and a null base
    in the other.
GCGGCCCATCAGGTAGTTGGTGG
GCGTTCCATCCTGGTTGGTGTG
24
Sequence alignment: The identification of the
location of deletions or insertions that might
have occurred in either of the two lineages since
their divergence from a common ancestor.
Insertion + Deletion = Indel or Gap
25
  • Two DNA sequences, A and B.
  • Lengths are m and n, respectively.
  • The number of matched pairs is x.
  • The number of mismatched pairs is y.
  • Total number of bases in gaps is z.
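The quantities x, y and z can be counted directly from two aligned rows. The sketch below is illustrative only: the function name `count_pairs` and the use of "-" as the gap symbol are assumptions, and the example rows are the 15-column alignment of CTGGGCCCAGATC and AACAGGGCCCAAATC from slide 19. Because every paired column uses two bases and every gap column uses one, the counts should satisfy m + n = 2(x + y) + z, which the last line checks.

```python
def count_pairs(row_a: str, row_b: str, gap: str = "-"):
    """Return (x, y, z): matches, mismatches, and bases facing a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair a gap with a gap")
        if a == gap or b == gap:
            z += 1          # one base opposite a null base
        elif a == b:
            x += 1          # match
        else:
            y += 1          # mismatch
    return x, y, z


row_a = "--CTGGGCCCAGATC"
row_b = "AACAGGGCCCAAATC"
x, y, z = count_pairs(row_a, row_b)
m = len(row_a.replace("-", ""))   # length of sequence A without gaps
n = len(row_b.replace("-", ""))   # length of sequence B without gaps
print(x, y, z, m, n)
assert m + n == 2 * (x + y) + z
```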
26
There are terminal and internal gaps.
GCGG-CCATCAGGTAGTTGGTG-
GCGTTCCATCCTGGTTGGTGTG
27
A terminal gap may indicate missing data.
GCGG-CCATCAGGTAGTTGGTG-
GCGTTCCATCCTGGTTGGTGTG
28
An internal gap indicates that a deletion or an
insertion has occurred in one of the two
lineages.
GCGG-CCATCAGGTAGTTGGTG-
GCGTTCCATCCTGGTTGGTGTG
29
The alignment is the first step in many
evolutionary and functional studies. Errors in
alignment tend to amplify in later computational
stages.
30
Motivation for sequence alignment
  • Study function
  • Sequences that are similar probably have similar
    functions.
  • Study evolution
  • Similarity is mostly indicative of common
    ancestry.

31
Motivation for sequence alignment
  • Identify functional parts
  • Strong conservation may be indicative of
    functional importance.
  • Identify disease etiology
  • Sequence differences may be indicative of cause
    of disease.

32
Some definitions
33
An example of pairwise alignment of an unknown
protein with a known one
  • (A) Glutaredoxin, Bacteriophage T4 from E. coli, 87
    aa
  • (B) Unknown protein, 93 aa

Glutaredoxin, Bacteriophage 65 from Aeromonas sp.,
93 aa
Glutar KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDDEKIAELLTKLGR
Unknow EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM
Glutar DTQIGLTMPQVFAPDGSHIGGFDQLREYF
Unknow QTR-PTSFPRIFV-DGQYIGSLKQFKDLY
Is the unknown protein a glutaredoxin?
34
Methods of alignment
  • 1. Manual
  • 2. Dot matrix
  • 3. Distance matrix
  • 4. Combined (Distance + Manual)
35
  • Manual alignment. When there are few gaps and the
    two sequences are not too different from each
    other, a reasonable alignment can be obtained by
    visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG
36
Advantages of manual alignment: (1) use of a
powerful and trainable tool (the brain, well,
some brains). (2) ability to integrate
additional data, e.g., domain structure,
biological function.
37
(No Transcript)
38
Protein Alignment may be guided by Secondary and
Tertiary Structures
Escherichia coli DjlA protein
Homo sapiens DjlA protein
39
Disadvantages of manual alignment: The method
is subjective and unscalable.
40
The dot-matrix method (Gibbs and McIntyre, 1970)
The two sequences are written out as column and
row headings of a two-dimensional matrix. A dot
is put in the dot-matrix plot at a position where
the nucleotides in the two sequences are
identical.
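A minimal sketch of the construction just described, assuming the first sequence labels the rows and the second labels the columns. The function names `dot_matrix` and `render` and the characters used for drawing ("*" for a dot, "." for an empty element) are illustrative choices, not part of the original method description.

```python
def dot_matrix(seq_rows: str, seq_cols: str):
    """Boolean matrix: True where the row and column residues are identical."""
    return [[a == b for b in seq_cols] for a in seq_rows]


def render(seq_rows: str, seq_cols: str) -> str:
    """Draw the matrix as text, with '*' for a dot and '.' for an empty element."""
    matrix = dot_matrix(seq_rows, seq_cols)
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# The two sequences from the slide 19 example.
print(render("CTGGGCCCAGATC", "AACAGGGCCCAAATC"))
```

Runs of "*" along a diagonal correspond to stretches of consecutive matches.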
41
The alignment is defined by a path from the
upper-left element to the lower-right element.
42
There are 4 possible steps in the path:
  • (1) a diagonal step through a dot: match.
  • (2) a diagonal step through an empty element of
    the matrix: mismatch.
  • (3) a horizontal step: a gap in the sequence on
    the top of the matrix.
  • (4) a vertical step: a gap in the sequence on
    the left of the matrix.

43
A dot matrix may become cluttered. With DNA
sequences, 25% of the elements will be occupied
by dots by chance alone.
44
window size = 1, stringency = 1, alphabet size = 4
The number of spurious matches is determined by
window size, stringency, and alphabet size.
45
window size = 1, stringency = 1, alphabet size = 4
window size = 3, stringency = 2, alphabet size = 4
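One way to read these settings, sketched here as an assumption rather than as the slides' exact procedure: a dot is drawn at (i, j) only if at least `stringency` positions match within the two windows of length `window` that start at i and j. The function name `windowed_dots` and the choice to anchor each window at its first position are illustrative.

```python
def windowed_dots(seq_a: str, seq_b: str, window: int = 3, stringency: int = 2):
    """Return the set of (i, j) window start positions that pass the filter."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues at corresponding window positions.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


a, b = "CTGGGCCCAGATC", "AACAGGGCCCAAATC"
unfiltered = windowed_dots(a, b, window=1, stringency=1)
filtered = windowed_dots(a, b, window=3, stringency=2)
print(len(unfiltered), "dots with window 1, stringency 1")
print(len(filtered), "dots with window 3, stringency 2")
```

With window = 1 and stringency = 1 this reduces to the plain dot matrix. Larger windows with a stringency threshold suppress isolated chance matches.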
46
window size = 1, stringency = 1, alphabet size = 20
47
Dot-matrix methods
Advantages: May unravel information on the
evolution of sequences.
Disadvantages: May not identify the best
possible alignment.
48
Window size = 60 amino acids; stringency = 24
matches
Advantages: Highlighting Information
49
Window size = 60 amino acids; stringency = 24
matches
Advantages: Highlighting Information
The two diagonally oriented parallel lines most
probably indicate that a small internal
duplication has occurred in the bacterial gene.
50
Disadvantages: Not possible to identify the
best alignment.
51
Distance and similarity methods
52
The best possible alignment (optimal alignment)
is the one in which the numbers of mismatches and
gaps are minimized according to certain criteria.
53
Unfortunately, reducing the number of mismatches
results in an increase in the number of gaps, and
vice versa.
54
α = matches, β = mismatches, γ = nucleotides in
gaps, δ = gaps
55
Gap penalty (or cost) is a factor (or a set of
factors) by which the gap values (numbers and
lengths of gaps) are multiplied to make the gaps
equivalent in value to the mismatches. The gap
penalties are based on our assessment of how
frequently different types of insertions and
deletions occur in evolution in comparison with
the frequency of occurrence of point
substitutions.
56
Mismatch penalty is an assessment of how
frequently substitutions occur.
57
  • The distance (dissimilarity) index (D) between
    two sequences in an alignment is

$$D = \sum_i m_i y_i + \sum_k w_k z_k$$

where y_i is the number of mismatches of type i,
m_i is the mismatch penalty for an i-type of
mismatch, z_k is the number of gaps of length k,
and w_k is a positive number representing the
penalty for gaps of length k.
58
  • The similarity index (S) between two sequences in
    an alignment is

$$S = x - \sum_k w_k z_k$$

where x is the number of matches, z_k is the
number of gaps of length k, and w_k is a positive
number representing the penalty for gaps of
length k.
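The two indices can be computed from a pair of aligned rows. This sketch assumes the usual forms D = Σ m_i y_i + Σ w_k z_k and S = x − Σ w_k z_k, which match the symbol definitions on slides 57 and 58. The function names, the flat mismatch penalty of 1, and the gap-penalty function w_k = 1 + 0.1k (the form suggested by slide 101) are illustrative assumptions.

```python
from collections import Counter


def gap_runs(row_a: str, row_b: str, gap: str = "-") -> Counter:
    """Return {k: z_k}: how many gaps of each length k the alignment contains."""
    runs = Counter()
    current, length = None, 0
    for a, b in zip(row_a, row_b):
        which = "a" if a == gap else "b" if b == gap else None
        if which != current and length:
            runs[length] += 1     # the previous gap run has ended
            length = 0
        current = which
        if which is not None:
            length += 1
    if length:
        runs[length] += 1         # gap run reaching the end of the alignment
    return runs


def gap_cost(row_a, row_b, w):
    return sum(w(k) * z_k for k, z_k in gap_runs(row_a, row_b).items())


def similarity(row_a, row_b, w):
    x = sum(a == b for a, b in zip(row_a, row_b) if "-" not in (a, b))
    return x - gap_cost(row_a, row_b, w)


def distance(row_a, row_b, w, mismatch_penalty=lambda a, b: 1.0):
    mismatches = sum(mismatch_penalty(a, b) for a, b in zip(row_a, row_b)
                     if "-" not in (a, b) and a != b)
    return mismatches + gap_cost(row_a, row_b, w)


w = lambda k: 1 + 0.1 * k         # assumed gap penalty for a gap of length k
rows = ("--CTGGGCCCAGATC", "AACAGGGCCCAAATC")
print(round(similarity(*rows, w), 2), round(distance(*rows, w), 2))
```

For these rows there are 11 matches, 2 mismatches and one gap of length 2, so S = 11 − 1.2 and D = 2 + 1.2 under the assumed penalties.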
59
The gap penalty has two components: a gap-opening
penalty and a gap-extension penalty.
60
Three main systems:
  • (1) Fixed gap-penalty system: 0 gap-extension
    costs.
  • (2) Linear gap-penalty system: the gap-extension
    cost is calculated by multiplying the gap length
    minus 1 by a constant representing the
    gap-extension penalty for increasing the gap by 1.
  • (3) Logarithmic gap-penalty system: the
    gap-extension penalty increases with the
    logarithm of the gap length, i.e., more slowly.
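The three systems can be written as functions of the gap length k. This is a sketch under stated assumptions: the opening penalty of 2.0 and extension penalty of 0.5 are hypothetical values chosen only for illustration, and the natural logarithm is one possible choice for the logarithmic system.

```python
import math


def fixed_gap(k: int, opening: float = 2.0) -> float:
    """Fixed system: the cost does not depend on gap length."""
    return opening


def linear_gap(k: int, opening: float = 2.0, extension: float = 0.5) -> float:
    """Linear system: opening cost plus (k - 1) extension steps."""
    return opening + (k - 1) * extension


def log_gap(k: int, opening: float = 2.0, extension: float = 0.5) -> float:
    """Logarithmic system: the extension cost grows with log(k)."""
    return opening + extension * math.log(k)


for k in (1, 2, 5, 10, 50):
    print(k, fixed_gap(k), linear_gap(k), round(log_gap(k), 2))
```

All three agree for a gap of length 1. For long gaps the linear cost keeps growing steadily while the logarithmic cost flattens out.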
61
(No Transcript)
62
Further complications: Distinguishing among
different matches and mismatches. For example, a
mismatched pair consisting of Leu and Ile, which
are very similar biochemically to each other, may
be given a lesser penalty than a mismatched pair
consisting of Arg and Glu, which are very
dissimilar from each other.
63
Lesser penalty than
64
Alignment algorithms
65
Aim: Given certain criteria, find the alignment
associated with the smallest D (or largest S)
from among all possible alignments.
66
The number of possible alignments may be
astronomical. For example, when two sequences
300 residues long each are compared, there are
10^88 possible alignments. In comparison, the
number of elementary particles in the universe is
only 10^80.
67
There are computer algorithms for finding the
optimal alignment between two sequences that do
not require an exhaustive search of all the
possibilities.
68
The Needleman-Wunsch algorithm uses Dynamic
Programming
69
Dynamic programming: a computational technique.
It is applicable when large searches can be
divided into a succession of small stages, such
that (1) the solution of the initial search stage
is trivial, (2) each partial solution in a later
stage can be calculated by reference to only a
small number of solutions in an earlier stage,
and (3) the last stage contains the overall
solution.
70
Dynamic programming can be applied to problems of
alignment because similarity indices obey the
following rules
71
Path Graph for aligning two sequences
72
allowed
73
not allowed
74
  • An alignment is calculated in two stages

75
The two sequences are arranged in a dot matrix.
For each element, the similarity index is
calculated.
76
A possible path (alignment)
Scoring scheme: match = 1, mismatch = 0, indel = -1
Scores along the path: 1, 1, 0, 1, 0, -1
Score for this path: 2
77
Another possible path (alignment)
Scoring scheme: match = 1, mismatch = 0, indel = -1
Scores along the path: 1, 1, 0, 1, -1, -1
Score for this path: 1
78
The two sequences are arranged in a dot matrix.
For each element, the similarity index is
calculated. At the same time, the position of the
best score in the previous row or column is
stored. This stored value is called a pointer.
The relationship between the new value and the
pointer is represented by an arrow.
79
(No Transcript)
80
The alignment is produced by starting at the
highest similarity score in either the rightmost
column or the bottom row, and proceeding from
right to left by following the best pointers.
This stage is called the traceback. The graph of
pointers in the traceback is also referred to as
the path graph because it defines the paths
through the matrix that correspond to the optimal
alignment or alignments.
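The two stages (filling the matrix while storing pointers, then the traceback) can be sketched as follows. This is a simplified illustration, not the exact procedure on the slides: it uses the scoring scheme of slides 76 and 77 (match 1, mismatch 0, indel -1), and it starts the traceback at the bottom-right cell, so terminal gaps are penalized like internal ones. The variant described above, which starts at the best score in the rightmost column or bottom row, would not charge for trailing terminal gaps. The function name and the tie-breaking order (diagonal, then up, then left) are assumptions.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=0, indel=-1):
    """Global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    pointer = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # first column: gaps in b
        score[i][0], pointer[i][0] = i * indel, "up"
    for j in range(1, m + 1):                 # first row: gaps in a
        score[0][j], pointer[0][j] = j * indel, "left"

    # Stage 1: fill the matrix, remembering where each best score came from.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + indel
            left = score[i][j - 1] + indel
            best = max(diag, up, left)
            score[i][j] = best
            pointer[i][j] = "diag" if best == diag else "up" if best == up else "left"

    # Stage 2: traceback from the bottom-right cell by following the pointers.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = pointer[i][j]
        if step == "diag":
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif step == "up":
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = needleman_wunsch("AACAGGGCCCAAATC", "CTGGGCCCAGATC")
print(total)
print(top)
print(bottom)
```

When several alignments share the optimal score, this sketch returns only one of them. Storing every tied pointer would recover the full path graph.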
81
Scoring Matrices
Mismatch and gap penalties should be inversely
proportional to the frequencies with which
changes occur.
82
Transitions (68%) occur more frequently than
transversions (32%). Mismatch penalties for
transitions should be smaller than those for
transversions.
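A small sketch of a nucleotide mismatch penalty that follows this rule. The penalty values (1.0 for a transition, 2.0 for a transversion) are hypothetical and only need to respect the ordering stated above. Transitions are purine-purine (A, G) or pyrimidine-pyrimidine (C, T) changes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x: str, y: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {x, y}
    return x != y and (pair <= PURINES or pair <= PYRIMIDINES)


def mismatch_penalty(x: str, y: str, transition=1.0, transversion=2.0) -> float:
    """Penalty for pairing x with y; identical bases cost nothing."""
    if x == y:
        return 0.0
    return transition if is_transition(x, y) else transversion


for pair in ("AG", "CT", "AC", "GT", "AA"):
    print(pair, mismatch_penalty(*pair))
```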
83
Empirical substitution matrices
  • PAM (Percent/Point Accepted Mutation)
  • BLOSUM (BLOcks SUbstitution Matrix)
84
PAM
  • Developed by Margaret Dayhoff in 1978.
  • Based on comparisons of very similar protein
    sequences.

85
Log-odds ratios
  • A scoring matrix is a table of values that
    describe the probability of a residue (amino acid
    or base) pair occurring in an alignment.
  • The values in a scoring matrix are log ratios of
    two probabilities.
  • One is the probability of the pair occurring at
    random. The other is the empirically observed
    probability of the pair occurring.
  • Because the scores are logarithms of probability
    ratios, they can be added to give a meaningful
    score for the entire alignment. The more
    positive the score, the better the alignment!

86
The PAM matrices (Percent accepted mutations)
  • Align sequences that are at least 85% identical.
  • Minimizes ambiguity in alignments and the number
    of coincident mutations.
  • Reconstruct phylogenetic trees and infer
    ancestral sequences.
  • Tally replacements "accepted" by natural
    selection, in all pairwise comparisons.
  • Meaning, the number of times j was replaced by i
    in all comparisons.
  • Compute amino acid mutability (i.e., the
    propensity of a given amino acid, j, to be
    replaced).

87
The PAM matrices
  • Combine data to produce a Mutation Probability
    Matrix for one PAM of evolutionary distance,
    which is used to calculate the Log Odds Matrix
    for similarity scoring.
  • Thus, depending on the protein family used,
    various PAM matrices result, some of which are
    good at locating evolutionarily distant conserved
    mutations and some of which are good at locating
    evolutionarily close conserved mutations.

88
More on log-odds ratios
In PAM, log-odds scores are multiplied by 10 to
avoid decimals. Therefore, a PAM score of 2
actually corresponds to a log-odds ratio of 0.2.
0.2 = score for substitution of i by j = log10
[(observed i-j mutation rate) / (expected rate)]
The value 0.2 is log10 of the relative
expectation value of the mutation. Therefore, the
expectation value is 10^0.2 ≈ 1.6. So, a PAM
score of 2 indicates that (in related sequences)
the mutation would be expected to occur 1.6 times
more frequently than random.
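The arithmetic above can be checked in a few lines. The function names are illustrative. The only facts used are the ones stated on the slide: scores are 10 times the base-10 log-odds ratio, rounded to avoid decimals.

```python
import math


def pam_score(observed_rate: float, expected_rate: float) -> int:
    """Scaled log-odds score: 10 * log10(observed / expected), rounded."""
    return round(10 * math.log10(observed_rate / expected_rate))


def relative_frequency(score: float) -> float:
    """Invert the scaling: how many times more frequent than random."""
    return 10 ** (score / 10)


print(round(relative_frequency(2), 1))    # a score of 2
print(round(relative_frequency(-2), 2))   # a negative score: rarer than random
print(pam_score(1.6, 1.0))                # back from the ratio to the score
```

Because the scores are logarithms, adding the scores of all aligned pairs corresponds to multiplying their odds ratios, which is why a summed alignment score is meaningful.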
89
PAM250
  • Calculated for families of related proteins (85%
    identity)
  • 1 PAM is the amount of evolutionary change that
    yields, on average, one substitution in 100 amino
    acid residues
  • A positive score signifies a common replacement
    whereas a negative score signifies an unlikely
    replacement
  • PAM250 matrix assumes/is optimized for sequences
    separated by 250 PAM, i.e. 250 substitutions in
    100 amino acids (longer evolutionary time)

90
PAM250
Sequence alignment matrix that allows 250
accepted point mutations per 100 amino acids.
PAM250 is suitable for comparing distantly
related sequences, while a lower PAM is suitable
for comparing more closely related sequences.
91
Selecting a PAM Matrix
  • Low PAM numbers: short sequences, strong local
    similarities.
  • High PAM numbers: long sequences, weak
    similarities.
  • PAM60 for close relations (60% identity)
  • PAM120 recommended for general use (40% identity)
  • PAM250 for distant relations (20% identity)
  • If uncertain, try several different matrices
  • PAM40, PAM120, PAM250 recommended.

92
BLOSUM
  • Blocks Substitution Matrix
  • Steven and Jorja G. Henikoff (1992).
  • Based on BLOCKS database (www.blocks.fhcrc.org)
  • Families of proteins with identical function.
  • Highly conserved protein domains.
  • Ungapped local alignment to identify motifs
  • Each motif is a block of local alignment.
  • Counts amino acids observed in same column.
  • Symmetrical model of substitution.


93
BLOSUM62
  • BLOSUM matrices are based on local alignments
    (blocks or conserved amino acid patterns).
  • BLOSUM 62 is a matrix calculated from comparisons
    of sequences that are no more than 62% identical.
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins.
  • BLOSUM 62 is the default matrix in BLAST 2.0.

94
BLOSUM Matrices
  • Different BLOSUMn matrices are calculated
    independently from BLOCKS
  • BLOSUMn is based on sequences that are at most n
    percent identical.

95
BLOSUM62
The procedure for calculating a BLOSUM matrix is
based on a likelihood method estimating the
occurrence of each possible pairwise
substitution. Only aligned blocks are used to
calculate the BLOSUMs. The higher the score, the
more closely related the sequences.
96
Why is BLOSUM62 called BLOSUM62?
Because all blocks whose members shared at least
62% identity with ANY other member of that block
were averaged and represented as 1 sequence.
97
Selecting a BLOSUM Matrix
  • For BLOSUMn, higher n suitable for sequences
    which are more similar
  • BLOSUM62 recommended for general use
  • BLOSUM80 for close relations
  • BLOSUM45 for distant relations

98
  • Equivalent PAM and Blosum matrices. The following
    matrices are roughly equivalent (listed from less
    divergent to more divergent):
  • PAM100 ≈ Blosum90
  • PAM120 ≈ Blosum80
  • PAM160 ≈ Blosum60
  • PAM200 ≈ Blosum52
  • PAM250 ≈ Blosum45
  • Generally speaking...
  • The Blosum matrices are best for detecting local
    alignments.
  • The Blosum62 matrix is the best for detecting the
    majority of weak protein similarities.
  • The Blosum45 matrix is the best for detecting
    long and weak alignments.

99
Comparison of PAM250 and BLOSUM62
The relationship between BLOSUM and PAM
substitution matrices: BLOSUM matrices with
higher numbers and PAM matrices with low numbers
are both designed for comparisons of closely
related sequences. BLOSUM matrices with low
numbers and PAM matrices with high numbers are
designed for comparisons of distantly related
proteins. If distant relatives of the query
sequence are specifically being sought, the
matrix can be tailored to that type of search.
100
Scoring matrices commonly used
  • PAM250
  • Shown to be appropriate for searching for
    sequences of 17-27% identity.
  • BLOSUM62
  • Though it is tailored for comparisons of
    moderately distant proteins, it performs well in
    detecting closer relationships.
  • BLOSUM50
  • Shown to be better for FASTA searches.

101
Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus
chicken pancreatic hormone.
  • (a) Penalty for gaps is 0.
  • (b) Penalty for a gap of size k nucleotides is
    w_k = 1 + 0.1k.
  • (c) The same alignment as in (b), only the
    similarity between the two sequences is further
    enhanced by showing pairs of biochemically
    similar amino acids.
102
An Alignment
103
Local vs. Global Alignment
  • The Global Alignment Problem tries to find the
    longest path between vertices (0,0) and (n,m) in
    the edit graph.
  • The Local Alignment Problem tries to find the
    longest path among paths between arbitrary
    vertices (i,j) and (i',j') in the edit graph.
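The local problem is commonly solved with the Smith-Waterman variant of the same dynamic programming scheme, which is not named on these slides and is used here as an assumption. The "longest path" is the highest-scoring one: allowing every cell to restart at 0 lets the path begin at an arbitrary vertex, and taking the best cell anywhere in the matrix lets it end at an arbitrary vertex. The scoring values (match 1, mismatch -1, indel -1) are hypothetical. Unlike the global scheme, mismatches must score below zero here, otherwise the local path would never be cut short.

```python
def smith_waterman(a: str, b: str, match=1, mismatch=-1, indel=-1):
    """Local alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local path start afresh at any vertex.
            H[i][j] = max(0, diag, H[i - 1][j] + indel, H[i][j - 1] + indel)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until the score drops to 0.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# The two sequences shown in the local-alignment example on slide 104.
s1 = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
s2 = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
print(smith_waterman(s1, s2))
```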

104
Local vs. Global Alignment
  • Global Alignment

--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C

  • Local Alignment: better alignment to find
    conserved segment

tccCAGTTATGTCAGgggacacgagcatgcagagac
aattgccgccgtcgttttcagCAGTTATGTCAGatc
105
Local Alignments Why?
  • Two genes in different species may be similar
    over short conserved regions and dissimilar over
    remaining regions.
  • Example
  • Homeobox genes have a short region called the
    homeodomain that is highly conserved between
    species.
  • A global alignment would not find the homeodomain
    because it would try to align the ENTIRE sequence

106
Link for Dynamic Programming tutorial
  • http://www.sbc.su.se/pjk/molbioinfo2001/dynprog/dynamic.html