Similar Sequence Similar Function

Slides: 39

Provided by: cyan4

1
Similar Sequence Similar Function

Charles Yan
Spring 2006

2
From Sequence to Function

Protein sequence determines protein function; thus
similar protein sequences tend to have similar functions.
One approach to predicting the function of a new
protein is to search for similar proteins
(homologues) whose functions are known. If the
similarity is high, it is likely that the new
protein has the same functions as its homologues.

3
Homologue Search

Basic Local Alignment Search Tool (BLAST) finds
regions of local similarity between sequences.
The program compares nucleotide or protein
sequences to sequence databases and calculates
the statistical significance of matches. BLAST
can be used to infer functional and evolutionary
relationships between sequences as well as help
identify members of gene families.

4
Dynamic Programming
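Sequences: $a_1 a_2 a_3 \ldots a_m$ and $b_1 b_2 b_3 \ldots b_n$

$$
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch)} \\
M_{i,j-1} + w & \text{(gap in sequence 1)} \\
M_{i-1,j} + w & \text{(gap in sequence 2)}
\end{cases}
$$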

5
Dynamic Programming

G A A T T C A G T T A (sequence 1)
G G A T C G A (sequence 2)

$S_{i,j} = 1$ (match score), $S_{i,j} = 0$ (mismatch score), $w = 0$ (gap penalty)
6
Dynamic Programming
$M_{1,1} = \max\{M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0\} = \max\{1, 0, 0\} = 1$
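
The matrix fill for this example can be sketched in a few lines of Python. This is a minimal illustration under the scoring used on the slides (match 1, mismatch 0, gap 0); the function name `fill_matrix` and the choice to initialise the first row and column with cumulative gap penalties are assumptions made for the example, not part of the original presentation.

```python
def fill_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the dynamic programming matrix M for two sequences.

    M[i][j] holds the best score for aligning the first i characters
    of seq1 with the first j characters of seq2.
    """
    m, n = len(seq1), len(seq2)
    M = [[0] * (n + 1) for _ in range(m + 1)]

    # First column and first row: prefixes aligned only against gaps.
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,  # match/mismatch
                M[i][j - 1] + gap,    # gap in sequence 1
                M[i - 1][j] + gap,    # gap in sequence 2
            )
    return M


M = fill_matrix("GAATTCAGTTA", "GGATCGA")
print(M[1][1])    # first cell computed on the slide
print(M[-1][-1])  # score of the best alignment of the full sequences
```

With these scores, `M[1][1]` is 1, as in the worked step, and the bottom-right cell is 6. Recovering the alignment itself would need a traceback from that cell, which is omitted here.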
7
Dynamic Programming
8
Dynamic Programming
9
Global and Local Alignment

A global alignment is an optimal alignment that
includes all characters from each sequence,
whereas a local alignment is an optimal alignment
that includes only the most similar local region
or regions.
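
The difference can be made concrete with a small sketch. The local variant below follows the usual Smith-Waterman idea (cell scores are never allowed to drop below zero, and the best cell anywhere in the matrix is reported), while the global variant scores the whole of both sequences. The function name and the scores (match 2, mismatch -1, gap -2) are assumptions chosen for illustration; they are not the scores used earlier in the slides.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Return the optimal global or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            M[i][0] = i * gap
        for j in range(1, cols):
            M[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)
            if local:
                # Local alignment: a poor prefix can always be dropped.
                cell = max(cell, 0)
            M[i][j] = cell
            best = max(best, cell)

    # Local: best cell anywhere. Global: bottom-right cell.
    return best if local else M[-1][-1]


print(alignment_score("ACGT", "TTACGTTT", local=True))
print(alignment_score("ACGT", "TTACGTTT", local=False))
```

Here the local score counts only the shared `ACGT` region, whereas the global score is pulled down by the gaps needed to cover the flanking characters.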

10
BLAST

The BLAST programs (Basic Local Alignment Search
Tools) are a set of sequence comparison
algorithms introduced in 1990 that are used to
search sequence databases for optimal local
alignments to a query.
Break the query and database sequences into
fragments ("words"), and initially seek matches
between fragments. The initial search is done for
a word of length "W" that scores at least "T"
when compared to the query using a given
substitution matrix.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S". The "T"
parameter dictates the speed and sensitivity of
the search.
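
The word-seeding step described above can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it uses a toy identity scoring (match 1, mismatch -1) in place of a real substitution matrix, a nucleotide alphabet, and hypothetical function names. The extension of word hits toward the score threshold "S" is left out.

```python
from itertools import product


def word_score(a, b, match=1, mismatch=-1):
    """Score two equal-length words position by position (toy scoring)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, alphabet, T):
    """All words of the same length that score at least T against `word`."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return {c for c in candidates if word_score(word, c) >= T}


def find_seeds(query, subject, W, T, alphabet="ACGT"):
    """Return (query_pos, subject_pos) pairs where a word hit occurs."""
    # Index every word of length W in the subject sequence.
    index = {}
    for j in range(len(subject) - W + 1):
        index.setdefault(subject[j:j + W], []).append(j)

    seeds = []
    for i in range(len(query) - W + 1):
        for word in neighborhood(query[i:i + W], alphabet, T):
            for j in index.get(word, []):
                seeds.append((i, j))
    return sorted(seeds)


print(find_seeds("ACGT", "TTACGTT", W=3, T=3))
```

Lowering `T` enlarges each neighborhood, which yields more seeds (higher sensitivity) at the cost of more extensions to evaluate (lower speed).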

14
BLAST

Web interface: http://www.ncbi.nlm.nih.gov/BLAST/
Download: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml

15
BLAST
16
BLAST
20
BLAST
21
Substitution Matrix

A substitution matrix contains values
proportional to the probability that amino acid i
mutates into amino acid j, for all pairs of amino
acids.

22
Substitution Matrix

The BLOSUM family
BLOSUM matrices are based on local alignments.
BLOSUM 62 is a matrix calculated from aligned blocks
in which sequences at least 62% identical are
clustered together, so that it reflects substitutions
between sequences sharing less than 62% identity.
All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins.
BLOSUM 62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of
moderately distant proteins, it performs well in
detecting closer relationships. A search for
distant relatives may be more sensitive with a
different matrix.

23
Substitution Matrix

The PAM family
PAM matrices are based on global alignments of
closely related proteins.
PAM1 is the matrix calculated from
comparisons of sequences with no more than 1%
divergence.
Other PAM matrices are extrapolated from PAM1.

24
Substitution Matrix

The relationship between BLOSUM and PAM
substitution matrices. BLOSUM matrices with
higher numbers and PAM matrices with low numbers
are both designed for comparisons of closely
related sequences. BLOSUM matrices with low
numbers and PAM matrices with high numbers are
designed for comparisons of distantly related
proteins. If distant relatives of the query
sequence are specifically being sought, the
matrix can be tailored to that type of search.

26
Raw Score S

The raw score S for an alignment is calculated by
summing the scores for each aligned position and
the scores for gaps.
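
As a small illustration, the sum can be written out for an alignment given as two gapped strings. The scores used here (match 2, mismatch -1, and a flat penalty of -2 per gap position) are assumptions for the example; a real search would take per-residue scores from a substitution matrix and typically use separate gap opening and extension penalties.

```python
def raw_score(aligned_a, aligned_b, match=2, mismatch=-1, gap=-2):
    """Sum per-column scores of an alignment; '-' marks a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap       # gap column
        elif x == y:
            total += match     # identical residues
        else:
            total += mismatch  # substitution
    return total


print(raw_score("GA-TC", "GAATC"))  # four matches and one gap
```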

27
Bit Score S'

Raw scores have little meaning without detailed
knowledge of the scoring system used, or more
simply its statistical parameters K and lambda.
Unless the scoring system is understood, citing a
raw score alone is like citing a distance without
specifying feet, meters, or light years. By
normalizing a raw score using the formula

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

one attains a "bit score" S', which has a
standard set of units.

28
Bit Score S'

The value S' is derived from the raw alignment
score S in which the statistical properties of
the scoring system used have been taken into
account. Because bit scores have been normalized
with respect to the scoring system, they can be
used to compare alignment scores from different
searches.
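
A minimal sketch of the normalization, using the standard relation S' = (lambda * S - ln K) / ln 2. The values of lambda and K in the usage lines are illustrative assumptions only; the actual values depend on the substitution matrix and gap penalties in use.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative parameters (assumed for this example).
LAMBDA, K = 0.267, 0.041

for S in (50, 100, 200):
    print(S, round(bit_score(S, LAMBDA, K), 1))
```

Because lambda and K are folded into S', two bit scores can be compared directly even when they come from searches that used different scoring systems.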

29
Significance

The significance of each alignment is computed as
a P value or an E value.
E value: Expectation value. The number of
different alignments with scores equivalent to or
better than S that are expected to occur in a
database search by chance. The lower the E value,
the more significant the score.
P value: The probability of an alignment
occurring with the score in question or better.
The P value is calculated by relating the
observed alignment score, S, to the expected
distribution of HSP scores from comparisons of
random sequences of the same length and
composition as the query to the database. The
most highly significant P values will be those
close to 0. P values and E values are different
ways of representing the significance of the
alignment.

30
E-value

In the limit of sufficiently large sequence
lengths m and n, the statistics of HSP scores are
characterized by two parameters, K and lambda.
Most simply, the expected number of HSPs with
score at least S is given by the formula

$$E = K m n e^{-\lambda S} \qquad (1)$$

We call this the E-value for the score S. This
formula makes eminently intuitive sense. Doubling
the length of either sequence should double the
number of HSPs attaining a given score. Also, for
an HSP to attain the score 2x it must attain the
score x twice in a row, so one expects E to
decrease exponentially with score. The parameters
K and lambda can be thought of simply as natural
scales for the search space size and the scoring
system respectively.
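
The two properties described above (linear growth with sequence length, exponential decay with score) can be checked numerically with a short sketch of E = K m n e^(-lambda S). The parameter values and sequence lengths below are illustrative assumptions, not values taken from the presentation.

```python
import math


def e_value(raw_score, m, n, lam, K):
    """Expected number of HSPs with score >= raw_score: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


# Illustrative parameters (assumed for this example).
LAMBDA, K = 0.267, 0.041
m, n = 200, 1_000_000  # query length and database length

base = e_value(50, m, n, LAMBDA, K)
print(base)
print(e_value(50, 2 * m, n, LAMBDA, K) / base)  # doubling m doubles E
print(e_value(60, m, n, LAMBDA, K) / base)      # raising S shrinks E exponentially
```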

31
P-value

The number of random HSPs with score $\geq S$ is
described by a Poisson distribution. This means
that the probability of finding exactly a HSPs
with score $\geq S$ is given by

$$e^{-E} \frac{E^a}{a!}$$

where E is the E-value of S given by equation (1) above.
Specifically, the chance of finding zero HSPs with
score $\geq S$ is $e^{-E}$, so the probability of finding
at least one such HSP is

$$P = 1 - e^{-E}$$

This is the P-value associated with the score S.
For example, if one expects to find three HSPs
with score $\geq S$, the probability of finding at
least one is 0.95. The BLAST programs report
E-values rather than P-values because it is easier
to understand the difference between, for example,
E-values of 5 and 10 than P-values of 0.993 and
0.99995.
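
The numbers quoted above can be reproduced with a short sketch of the Poisson relations. The function names are hypothetical; `math.expm1` is used only to keep precision when E is very small.

```python
import math


def prob_exactly(a, e_value):
    """Poisson probability of exactly `a` chance HSPs when `e_value` are expected."""
    return math.exp(-e_value) * e_value ** a / math.factorial(a)


def p_value(e_value):
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-e_value)


for E in (3, 5, 10):
    print(E, p_value(E))
```

For small E the two measures nearly coincide (P is approximately E), which is another reason the E-value is a convenient quantity to report.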

33
PSI-BLAST

Position specific iterative BLAST (PSI-BLAST)
refers to a feature of BLAST 2.0 in which a
profile (or position specific scoring matrix,
PSSM) is constructed (automatically) from a
multiple alignment of the highest scoring hits in
an initial BLAST search. The PSSM is generated by
calculating position-specific scores for each
position in the alignment. Highly conserved
positions receive high scores and weakly
conserved positions receive scores near zero. The
profile is used to perform a second (etc.) BLAST
search and the results of each "iteration" used
to refine the profile. This iterative searching
strategy results in increased sensitivity.
PSI-BLAST uses the blastp program exclusively, so
there is no need to select the program.
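
The idea of position-specific scores can be sketched with a toy profile builder. This is a simplified illustration and not the PSI-BLAST procedure: it uses plain column counts with a uniform pseudocount and a uniform background, whereas the real program applies sequence weighting and more elaborate pseudocounts. The function name, the nucleotide alphabet, and the example alignment are assumptions made for the example.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # uniform background (assumption)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column_scores = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            column_scores[residue] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm


# Column 0 is fully conserved; column 1 is completely variable.
alignment = ["AA", "AC", "AG", "AT"]
for position, scores in enumerate(build_pssm(alignment, "ACGT")):
    print(position, {r: round(s, 2) for r, s in scores.items()})
```

The conserved column gives a clearly positive score to its residue and negative scores to the others, while the variable column scores every residue at zero, matching the behaviour described above.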

34
PSI-BLAST
35
PSI-BLAST

The inclusion threshold is the E-value cutoff for
the position specific matrix used in PSI-BLAST
iterations. Hits with an E-value below this
threshold are used to construct the PSSM for the
next round.

36
PSI-BLAST
37
PHI-BLAST

PHI-BLAST (Pattern-Hit Initiated BLAST) is a
search program that combines matching of regular
expressions with local alignments surrounding the
match. Given a protein sequence S and a regular
expression pattern P occurring in S, PHI-BLAST
helps answer the question: What other protein
sequences both contain an occurrence of P and are
homologous to S in the vicinity of the pattern
occurrences? PHI-BLAST may be preferable to just
searching for pattern occurrences because it
filters out those cases where the pattern
occurrence is probably random and not indicative
of homology.
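
The pattern-matching half of this idea can be sketched with Python's `re` module. This is only an illustration of locating pattern occurrences and the flanking regions that a local alignment would then examine; it is not PHI-BLAST itself. A plain Python regular expression stands in for the pattern syntax the program accepts, and the function name, the example sequence, and the flank size are assumptions.

```python
import re


def pattern_windows(sequence, pattern, flank=3):
    """Find non-overlapping pattern occurrences and their flanking windows.

    Returns a list of (start, end, window) tuples, where window is the
    matched region extended by up to `flank` residues on each side.
    """
    hits = []
    for match in re.finditer(pattern, sequence):
        start, end = match.span()
        window = sequence[max(0, start - flank):end + flank]
        hits.append((start, end, window))
    return hits


# Hypothetical protein fragment and a pattern "C, any two residues, C".
print(pattern_windows("MKTAYCAACGHHLLE", r"C.{2}C"))
```

A search in the spirit of PHI-BLAST would keep only those database sequences whose window also aligns well with the corresponding window of the query, which is what filters out chance occurrences of the pattern.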

38
PHI-BLAST
