Similar Sequence Similar Function - PowerPoint PPT Presentation

Slides: 39
Provided by: cyan4

Transcript and Presenter's Notes

Title: Similar Sequence Similar Function


1
Similar Sequence Similar Function
  • Charles Yan
  • Spring 2006

2
From Sequence to Function
  • Protein sequence determines protein function. Thus,
    similar protein sequences have similar functions.
  • One approach to predicting the function of a new
    protein is to search for similar proteins
    (homologues) whose functions are known. If the
    similarity is high, it is likely that the new
    protein has the same functions as its homologues.

3
Homologue Search
  • Basic Local Alignment Search Tool (BLAST) finds
    regions of local similarity between sequences.
    The program compares nucleotide or protein
    sequences to sequence databases and calculates
    the statistical significance of matches. BLAST
    can be used to infer functional and evolutionary
    relationships between sequences as well as help
    identify members of gene families

4
Dynamic Programming
$a_1 a_2 a_3 \dots a_m$   $b_1 b_2 b_3 \dots b_n$
  • $M_{i,j} = \max$ of:
  • $M_{i-1,j-1} + S_{i,j}$ (match/mismatch)
  • $M_{i,j-1} + w$ (gap in sequence 1)
  • $M_{i-1,j} + w$ (gap in sequence 2)

5
Dynamic Programming
  • G A A T T C A G T T A (sequence 1)
  • G G A T C G A (sequence 2)

$S_{i,j} = 1$ (match), $S_{i,j} = 0$ (mismatch), $w = 0$ (gap penalty)
6
Dynamic Programming
$M_{1,1} = \max[M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0] = \max[1, 0, 0] = 1$
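
The matrix fill can be sketched in a few lines of Python. This is a minimal illustration of the recurrence, not a full aligner: it fills the score matrix but does no traceback. The function name `fill_matrix` and the convention that i indexes sequence 1 and j indexes sequence 2 are assumptions made for this sketch. The scores (match 1, mismatch 0, gap 0) and the two sequences are the ones from the example.

```python
def fill_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the dynamic programming matrix M for a global alignment.

    M[i][j] is the best score for aligning the first i characters of
    seq1 with the first j characters of seq2.
    """
    m, n = len(seq1), len(seq2)
    M = [[0] * (n + 1) for _ in range(m + 1)]

    # First column and first row: leading gaps only.
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + s,  # match/mismatch
                M[i][j - 1] + gap,    # gap in sequence 1
                M[i - 1][j] + gap,    # gap in sequence 2
            )
    return M


M = fill_matrix("GAATTCAGTTA", "GGATCGA")
print(M[1][1])    # first cell of the worked example
print(M[-1][-1])  # score of the best global alignment
```

With these particular scores the bottom-right cell simply counts the largest number of matching positions any alignment of the two sequences can have.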
7
Dynamic Programming
8
Dynamic Programming
9
Global and Local Alignment
  • A global alignment is an optimal alignment that
    includes all characters from each sequence,
    whereas a local alignment is an optimal alignment
    that includes only the most similar local region
    or regions.
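
One common way to compute a local alignment score is the Smith-Waterman variant of dynamic programming, which is not named on the slide and is used here only as an illustration. It differs from the global fill in that a cell may never drop below zero, and the answer is the largest cell anywhere in the matrix. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for this sketch.

```python
def local_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score (Smith-Waterman style)."""
    best = 0
    prev = [0] * (len(seq2) + 1)
    for a in seq1:
        curr = [0]
        for j, b in enumerate(seq2, start=1):
            s = match if a == b else mismatch
            cell = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # match/mismatch
                curr[j - 1] + gap, # gap in sequence 1
                prev[j] + gap,     # gap in sequence 2
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Only the shared "GGG" region contributes to the local score.
print(local_alignment_score("AAAGGGTTT", "CCCGGGCCC"))
```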

10
BLAST
  • The BLAST programs (Basic Local Alignment Search
    Tools) are a set of sequence comparison
    algorithms introduced in 1990 that are used to
    search sequence databases for optimal local
    alignments to a query.
  • Break the query and database sequences into
    fragments ("words"), and initially seek matches
    between fragments. The initial search is done for
    a word of length "W" that scores at least "T"
    when compared to the query using a given
    substitution matrix.
  • Word hits are then extended in either direction
    in an attempt to generate an alignment with a
    score exceeding the threshold of "S". The "T"
    parameter dictates the speed and sensitivity of
    the search.
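
The seed-and-extend idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLAST implementation: a toy score (match +2, mismatch -1) stands in for a substitution matrix, every query word is compared directly against every subject word instead of using a lookup table, and extension runs without gaps to the ends of the sequences. The function names are hypothetical.

```python
def score_pair(a, b, match=2, mismatch=-1):
    """Toy stand-in for a substitution matrix lookup."""
    return match if a == b else mismatch


def word_hits(query, subject, W=3, T=6):
    """Return (query_pos, subject_pos, score) for word pairs scoring >= T."""
    hits = []
    for q in range(len(query) - W + 1):
        for s in range(len(subject) - W + 1):
            sc = sum(score_pair(query[q + k], subject[s + k]) for k in range(W))
            if sc >= T:
                hits.append((q, s, sc))
    return hits


def extend_hit(query, subject, q, s, W):
    """Extend a word hit in both directions without gaps; return the best score."""
    score = sum(score_pair(query[q + k], subject[s + k]) for k in range(W))
    best = score
    # Extend to the right, remembering the best running score.
    i, j = q + W, s + W
    while i < len(query) and j < len(subject):
        score += score_pair(query[i], subject[j])
        best = max(best, score)
        i, j = i + 1, j + 1
    # Restart from the best right-hand score and extend to the left.
    score = best
    i, j = q - 1, s - 1
    while i >= 0 and j >= 0:
        score += score_pair(query[i], subject[j])
        best = max(best, score)
        i, j = i - 1, j - 1
    return best


query, subject = "ACDEFGH", "XXCDEFGXX"
hits = word_hits(query, subject, W=3, T=6)
S = 8  # hypothetical cutoff for reporting an extended alignment
for q, s, _ in hits:
    extended = extend_hit(query, subject, q, s, W=3)
    if extended >= S:
        print(q, s, extended)
```

Lowering T admits more word hits, so more extensions are attempted: the search becomes more sensitive but slower.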

14
BLAST
  • Web interface: http://www.ncbi.nlm.nih.gov/BLAST/
  • Download: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml

15
BLAST
16
BLAST
20
BLAST
21
Substitution Matrix
  • A substitution matrix contains values
    proportional to the probability that amino acid i
    mutates into amino acid j, for all pairs of amino
    acids.
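
In practice such values are usually stored as log-odds scores: the logarithm of how often a pair is observed in alignments relative to how often it would occur by chance. The sketch below assumes that form and a scale of two units per bit. The pair and background frequencies are made-up numbers for illustration, not values from any published matrix.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Score a residue pair from observed and expected frequencies.

    q_ij: observed frequency of the pair in trusted alignments
    p_i, p_j: background frequencies of the two residues
    scale: units per bit (2.0 gives half-bit units; an assumption here)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Hypothetical frequencies, chosen only to show the sign of the score.
print(log_odds_score(0.02, 0.05, 0.05))      # pair seen more often than chance
print(log_odds_score(0.0025, 0.05, 0.05))    # pair seen as often as chance
print(log_odds_score(0.000625, 0.05, 0.05))  # pair seen less often than chance
```

Pairs that substitute for each other more often than chance receive positive scores, and rarely observed substitutions receive negative ones.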

22
Substitution Matrix
  • The BLOSUM family
  • BLOSUM matrices are based on local alignments.
  • BLOSUM 62 is a matrix calculated from comparisons
    of sequences with no more than 62% identity.
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins.
  • BLOSUM 62 is the default matrix in BLAST 2.0.
    Though it is tailored for comparisons of
    moderately distant proteins, it performs well in
    detecting closer relationships. A search for
    distant relatives may be more sensitive with a
    different matrix.

23
Substitution Matrix
  • The PAM family
  • PAM matrices are based on global alignments of
    closely related proteins.
  • PAM1 is the matrix calculated from
    comparisons of sequences with no more than 1%
    divergence.
  • Other PAM matrices are extrapolated from PAM1.

24
Substitution Matrix
  • The relationship between BLOSUM and PAM
    substitution matrices. BLOSUM matrices with
    higher numbers and PAM matrices with low numbers
    are both designed for comparisons of closely
    related sequences. BLOSUM matrices with low
    numbers and PAM matrices with high numbers are
    designed for comparisons of distantly related
    proteins. If distant relatives of the query
    sequence are specifically being sought, the
    matrix can be tailored to that type of search.

26
Raw Score S
  • The raw score S for an alignment is calculated by
    summing the scores for each aligned position and
    the scores for gaps
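
As a rough sketch, the raw score of a finished alignment can be computed by walking its columns. The assumptions here are labeled: the alignment is given as two equal-length rows with "-" for gaps, a gap of length k costs `gap_open + k * gap_extend`, and `toy_score` (match +2, mismatch -1) stands in for a real substitution matrix. All names and values are hypothetical.

```python
def toy_score(a, b):
    """Hypothetical pair score standing in for a substitution matrix."""
    return 2 if a == b else -1


def raw_score(row1, row2, pair_score, gap_open=5, gap_extend=1):
    """Sum pair scores and gap penalties over the columns of an alignment."""
    total = 0
    gap_row = None  # which row holds the currently open gap (1, 2, or None)
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            this_row = 1 if a == "-" else 2
            if this_row != gap_row:
                total -= gap_open  # charged once per gap
            total -= gap_extend    # charged for every gap position
            gap_row = this_row
        else:
            total += pair_score(a, b)
            gap_row = None
    return total


# Four matches (+8) and one gap of length 2 (-7).
print(raw_score("GAATTC", "GA--TC", toy_score))
```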

27
Bit Score S'
  • Raw scores have little meaning without detailed
    knowledge of the scoring system used, or more
    simply its statistical parameters K and lambda.
    Unless the scoring system is understood, citing a
    raw score alone is like citing a distance without
    specifying feet, meters, or light years. By
    normalizing a raw score using the formula
    $S' = (\lambda S - \ln K) / \ln 2$
    one attains a "bit score" S', which has a
    standard set of units.

28
Bit Score S'
  • The value S' is derived from the raw alignment
    score S in which the statistical properties of
    the scoring system used have been taken into
    account. Because bit scores have been normalized
    with respect to the scoring system, they can be
    used to compare alignment scores from different
    searches.
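
A small sketch of the normalization, assuming the standard form $S' = (\lambda S - \ln K) / \ln 2$. The two parameter sets below are invented to stand for two different scoring systems; they are not values reported by any BLAST release.

```python
import math


def bit_score(raw, lam, K):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw - math.log(K)) / math.log(2)


# Two hypothetical scoring systems with different lambda and K.
system_a = {"lam": 0.30, "K": 0.10}
system_b = {"lam": 0.15, "K": 0.05}

# Raw scores of 60 and 100 are not directly comparable,
# but their bit scores are on the same scale.
print(bit_score(60, **system_a))
print(bit_score(100, **system_b))
```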

29
Significance
  • The significance of each alignment is computed as
    a P value or an E value.
  • E value: Expectation value. The number of
    different alignments with scores equivalent to or
    better than S that are expected to occur in a
    database search by chance. The lower the E value,
    the more significant the score.
  • P value: The probability of an alignment
    occurring by chance with the score in question or
    better. The P value is calculated by relating the
    observed alignment score, S, to the expected
    distribution of HSP scores from comparisons of
    random sequences of the same length and
    composition as the query to the database. The
    most highly significant P values will be those
    close to 0. P values and E values are different
    ways of representing the significance of the
    alignment.

30
E-value
  • In the limit of sufficiently large sequence
    lengths m and n, the statistics of HSP scores are
    characterized by two parameters, K and lambda.
    Most simply, the expected number of HSPs with
    score at least S is given by the formula
    $E = K m n e^{-\lambda S}$   (1)
    We call this the E-value for the score S. This
    formula makes eminently intuitive sense. Doubling
    the length of either sequence should double the
    number of HSPs attaining a given score. Also, for
    an HSP to attain the score 2x it must attain the
    score x twice in a row, so one expects E to
    decrease exponentially with score. The parameters
    K and lambda can be thought of simply as natural
    scales for the search space size and the scoring
    system respectively.
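
The two intuitions above can be checked numerically. The sketch assumes the standard form $E = K m n e^{-\lambda S}$ and its bit-score equivalent $E = m n 2^{-S'}$. The values of K, lambda, m and n are placeholders chosen for illustration.

```python
import math


def e_value(raw, m, n, lam, K):
    """Expected number of HSPs with score at least `raw`."""
    return K * m * n * math.exp(-lam * raw)


def bit_score(raw, lam, K):
    """Raw score normalized to bits."""
    return (lam * raw - math.log(K)) / math.log(2)


lam, K = 0.3, 0.1        # hypothetical statistical parameters
m, n = 250, 1_000_000    # hypothetical query and database lengths

e = e_value(50, m, n, lam, K)
print(e)
print(e_value(50, 2 * m, n, lam, K) / e)   # doubling a length doubles E
print(m * n * 2 ** -bit_score(50, lam, K)) # same E from the bit score
```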

31
P-value
  • The number of random HSPs with score $\ge S$ is
    described by a Poisson distribution. This means
    that the probability of finding exactly a HSPs
    with score $\ge S$ is given by
    $e^{-E} E^{a} / a!$
    where E is the E-value of S given by equation (1)
    above. Specifically, the chance of finding zero
    HSPs with score $\ge S$ is $e^{-E}$, so the
    probability of finding at least one such HSP is
    $P = 1 - e^{-E}$
    This is the P-value associated with the score S.
    For example, if one expects to find three HSPs
    with score $\ge S$, the probability of finding at
    least one is 0.95. The BLAST programs report
    E-values rather than P-values because it is easier
    to understand the difference between, for example,
    E-values of 5 and 10 than P-values of 0.993 and
    0.99995.
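
The numbers quoted above can be reproduced directly, assuming the Poisson form $P = 1 - e^{-E}$. The function names are chosen for this sketch.

```python
import math


def p_value(e):
    """Probability of at least one HSP with score >= S, given E-value e."""
    return -math.expm1(-e)  # equals 1 - exp(-e), stable for small e


def poisson(a, e):
    """Probability of finding exactly `a` HSPs when `e` are expected."""
    return math.exp(-e) * e ** a / math.factorial(a)


for e in (3, 5, 10):
    print(e, p_value(e))
```

For small E the two measures nearly coincide (for example, `p_value(0.001)` is very close to 0.001), which is another reason reporting E alone loses little information.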

33
PSI-BLAST
  • Position specific iterative BLAST (PSI-BLAST)
    refers to a feature of BLAST 2.0 in which a
    profile (or position specific scoring matrix,
    PSSM) is constructed (automatically) from a
    multiple alignment of the highest scoring hits in
    an initial BLAST search. The PSSM is generated by
    calculating position-specific scores for each
    position in the alignment. Highly conserved
    positions receive high scores and weakly
    conserved positions receive scores near zero. The
    profile is used to perform a second (etc.) BLAST
    search and the results of each "iteration" used
    to refine the profile. This iterative searching
    strategy results in increased sensitivity.
  • PSI-BLAST uses the blastp program exclusively, so
    there is no need to select the program.
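
The profile step can be illustrated with a toy PSSM builder. This is a simplified sketch, not the PSI-BLAST procedure: it assumes an ungapped alignment of equal-length sequences, a uniform background frequency for the 20 amino acids, and a fixed pseudocount, and it omits sequence weighting. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Column 0 is fully conserved (W); column 1 varies.
pssm = build_pssm(["WA", "WC", "WD", "WE"])
print(pssm[0]["W"], pssm[1]["A"])
print(score_sequence(pssm, "WA"), score_sequence(pssm, "GA"))
```

The conserved column gives W a clearly higher score than any residue receives in the variable column, so a candidate sequence is rewarded most for matching the conserved positions.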

34
PSI-BLAST
35
PSI-BLAST
  • The threshold value for inclusion in the position
    specific matrix used for PSI-BLAST iterations.
    Hits with an E-value less than this threshold will
    be used to construct the PSSM for the next round.

36
PSI-BLAST
37
PHI-BLAST
  • PHI-BLAST (Pattern-Hit Initiated BLAST) is a
    search program that combines matching of regular
    expressions with local alignments surrounding the
    match. Given a protein sequence S and a regular
    expression pattern P occurring in S, PHI-BLAST
    helps answer the question: What other protein
    sequences both contain an occurrence of P and are
    homologous to S in the vicinity of the pattern
    occurrences? PHI-BLAST may be preferable to just
    searching for pattern occurrences because it
    filters out those cases where the pattern
    occurrence is probably random and not indicative
    of homology.
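
The idea of combining a pattern match with similarity around the match can be sketched with Python's `re` module. This is only an illustration of the concept: the pattern is written as a Python regular expression, not in the pattern syntax PHI-BLAST itself accepts, and similarity is measured by counting identical residues in fixed-width flanks instead of computing a local alignment. The sequences, names and cutoff are invented.

```python
import re


def flanks(seq, start, end, width):
    """Return the residues just left and just right of a pattern match."""
    return seq[max(0, start - width):start], seq[end:end + width]


def count_matches(a, b):
    """Count identical residues at corresponding positions."""
    return sum(x == y for x, y in zip(a, b))


def pattern_hits(query, database, pattern, width=5, min_matches=6):
    """Find database sequences that contain the pattern and resemble
    the query in the region around it."""
    rx = re.compile(pattern)
    q_match = rx.search(query)
    if q_match is None:
        return []
    q_left, q_right = flanks(query, q_match.start(), q_match.end(), width)
    results = []
    for name, seq in database.items():
        for m in rx.finditer(seq):
            left, right = flanks(seq, m.start(), m.end(), width)
            # Compare left flanks from the pattern outwards.
            score = (count_matches(q_left[::-1], left[::-1])
                     + count_matches(q_right, right))
            if score >= min_matches:
                results.append((name, m.start(), score))
    return results


database = {
    "homolog": "GGKLMCHHCPQRWW",    # pattern plus similar flanks
    "random": "WWWWWCEECYYYYY",     # pattern only
    "nopattern": "AAKLMPQRST",      # similar residues, no pattern
}
print(pattern_hits("AAKLMCGTCPQRST", database, r"C..C"))
```

Only the first database entry is reported: the second contains the pattern but nothing similar around it, which is the kind of chance occurrence the combined search is meant to filter out.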

38
PHI-BLAST