Provided by: stephe78
1
http://creativecommons.org/licenses/by-sa/2.0/
2
Sequence Similarity Searching: Understanding
and Using Web-Based BLAST
Dr. Joanne Fox joanne_at_bioinformatics.ubc.ca
3
Concepts of Sequence Similarity Searching
  • The premise
  • The sequence itself is not informative; it must
    be analyzed by comparative methods against
    existing databases to develop hypotheses
    concerning relatives and function.

4
Important Terms for Sequence Similarity Searching
with very different meanings
  • Similarity
  • The extent to which nucleotide or protein
    sequences are related. The extent of similarity
    between two sequences can be based on percent
    sequence identity and/or conservation. In BLAST
    similarity refers to a positive matrix score.
  • Identity
  • The extent to which two (nucleotide or amino
    acid) sequences are invariant.
  • Homology
  • Similarity attributed to descent from a common
    ancestor.
  • It is your responsibility as an informed
    bioinformatician to use these terms correctly. A
    sequence is either homologous or not. Don't use
    % with this term!

5
Sequence Similarity Searching: The Approach
  • Sequence similarity searching involves the use of
    a set of algorithms (such as the BLAST programs)
    to compare a query sequence to all the sequences
    in a specified database.
  • Comparisons are made in a pairwise fashion. Each
    comparison is given a score reflecting the degree
    of similarity between the query and the sequence
    being compared.
  • The higher the score, the greater the degree of
    similarity. The similarity is measured and shown
    by aligning two sequences.

6
Sequence Similarity Searching: The Alignment
  • Alignments can be global or local (this is
    algorithm-specific).
  • A global alignment is an optimal alignment that
    includes all characters from each sequence
    (Clustal generates global alignments).
  • A local alignment is an optimal alignment that
    includes only the most similar local region or
    regions (BLAST generates local alignments).
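The difference between global and local alignment can be sketched with textbook dynamic programming. This is only an illustration under assumed scores (match = 1, mismatch = -1, gap = -1, all chosen here for the example); it is not the heuristic that BLAST runs, nor the progressive method used by Clustal.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Return (global_score, local_score) for sequences a and b.

    Global: every character of both sequences must be aligned.
    Local: only the best-scoring sub-region counts.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # Global table: leading gaps are penalized.
    g = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        g[i][0] = i * gap
    for j in range(1, cols):
        g[0][j] = j * gap

    # Local table: scores are never allowed to drop below zero.
    loc = [[0] * cols for _ in range(rows)]
    best_local = 0

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            g[i][j] = max(g[i - 1][j - 1] + s,
                          g[i - 1][j] + gap,
                          g[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    return g[-1][-1], best_local


# A short query embedded in a longer sequence: the local score stays high,
# while the global score is dragged down by the unmatched flanks.
print(alignment_scores("ACGT", "TTACGTTT"))
```

The local score ignores the unaligned flanking residues, which is why a local method suits database searching, where a query often matches only part of a longer entry.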

7
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
8
Topics
BLAST program
  • The different blast programs
  • Understanding the BLAST algorithm
  • Word size
  • HSPs
  • Understanding BLAST statistics
  • The alignment score (S)
  • Scoring Matrices
  • Dealing with gaps in an alignment
  • The expectation value (E)

9
The BLAST algorithm
  • The BLAST programs (Basic Local Alignment Search
    Tools) are a set of sequence comparison
    algorithms introduced in 1990 that are used to
    search sequence databases for optimal local
    alignments to a query.
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman
    DJ (1990) Basic local alignment search tool. J.
    Mol. Biol. 215:403-410.
  • Altschul SF, Madden TL, Schaeffer AA, Zhang J,
    Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST
    and PSI-BLAST: a new generation of protein
    database search programs. NAR 25:3389-3402.

10
http://www.ncbi.nlm.nih.gov/BLAST/
blastn
11
Several different BLAST programs
 

12
Other BLAST programs
  • BLAST 2 Sequences (bl2seq)
  • Aligns two sequences of your choice
  • Can do different types of comparison, e.g. blastx
  • Gives dot-plot like output
  • VecScreen
  • Compares query with sequences of known cloning
    vectors
  • Both very handy for sequencing!

13
More BLAST programs
  • BLAST against genomes
  • Many available
  • BLAST parameters pre-optimized
  • Handy for mapping query to genome
  • Search for short exact matches
  • BLAST parameters pre-optimized
  • Great for checking probes and primers

14
MegaBLAST
  • megaBLAST
  • For aligning sequences which differ slightly due
    to sequencing errors etc.
  • Very efficient for long query sequences
  • Uses big word (k-tuple) sizes to start search
  • Very fast
  • Accepts batch submissions of ESTs
  • Can upload files of sequences as queries
  • For more detailed info, see the megaBLAST pages

15
How Does BLAST Really Work?
  • The BLAST programs improved the overall speed of
    searches while retaining good sensitivity
    (important as databases continue to grow) by
    breaking the query and database sequences into
    fragments ("words"), and initially seeking
    matches between fragments.
  • Word hits are then extended in either direction
    in an attempt to generate an alignment with a
    score exceeding the threshold of "S".
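The word-then-extend idea can be sketched in a few lines. This is a simplified illustration, not the NCBI implementation: it uses exact word matches only, ungapped extension, and a made-up drop-off rule; the function names and the values `word_size=4`, `match=1`, `mismatch=-1`, and `drop=2` are assumptions chosen for the example.

```python
from collections import defaultdict


def find_word_hits(query, subject, word_size=4):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def _extend(pairs, match, mismatch, drop):
    """Walk along aligned character pairs; return (best_gain, length_at_best)."""
    score = best = 0
    best_len = 0
    for n, (x, y) in enumerate(pairs, start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, n
        elif best - score >= drop:
            break  # score fell too far below the best seen: stop extending
    return best, best_len


def extend_hit(query, subject, qpos, spos, word_size=4,
               match=1, mismatch=-1, drop=2):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end) of the extended segment.
    """
    right_gain, right_len = _extend(
        zip(query[qpos + word_size:], subject[spos + word_size:]),
        match, mismatch, drop)
    left_gain, left_len = _extend(
        zip(reversed(query[:qpos]), reversed(subject[:spos])),
        match, mismatch, drop)
    score = word_size * match + left_gain + right_gain
    return score, qpos - left_len, qpos + word_size + right_len


query, subject = "GGACGTACCA", "TTACGTACTT"
hits = find_word_hits(query, subject)
print(hits)                                   # word hits as (query, subject) offsets
print(extend_hit(query, subject, *hits[0]))   # (score, start, end) in the query
```

An extended segment would then be kept only if its score exceeds the threshold "S"; that comparison is left out of the sketch.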

16
Picture used with permission from Chapter 11 of
Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins
19
Each BLAST hit generates an alignment that can
contain one or more of these high-scoring segment
pairs (HSPs)
20
Where does the score (S) come from?
  • The quality of each pair-wise alignment is
    represented as a score and the scores are ranked.
  • Scoring matrices are used to calculate the score
    of the alignment base by base (DNA) or amino acid
    by amino acid (protein).
  • The alignment score will be the sum of the scores
    for each position.

21
What's a scoring matrix?
  • Substitution matrices are used for amino acid
    alignments. These are matrices in which each
    possible residue substitution is given a score
    reflecting the probability that it is related to
    the corresponding residue in the query.
  • A unitary matrix is used for DNA pairs because
    each position can be given a score of 1 if it
    matches and a score of zero if it does not.
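As a minimal sketch of position-by-position scoring, the unitary matrix for DNA can be written as a function that returns 1 for a match and 0 otherwise, and the alignment score as the sum over aligned positions. The function names are hypothetical, and the example handles ungapped alignments only.

```python
def unitary_score(x, y):
    """Unitary matrix: 1 if the two bases match, 0 if they do not."""
    return 1 if x == y else 0


def alignment_score(seq_a, seq_b, score_pair=unitary_score):
    """Sum the per-position scores of two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(score_pair(x, y) for x, y in zip(seq_a, seq_b))


# Six aligned positions, one mismatch (T vs A at the fourth position).
print(alignment_score("ACGTAC", "ACGAAC"))  # 5
```

For proteins, `score_pair` would instead look up each residue pair in a substitution matrix such as BLOSUM 62.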

22
PAM vs. BLOSUM scoring matrices
  • BLOSUM 62 is the default matrix in BLAST 2.0.
    Though it is tailored for comparisons of
    moderately distant proteins, it performs well in
    detecting closer relationships. A search for
    distant relatives may be more sensitive with a
    different matrix.

23
PAM vs BLOSUM scoring matrices
  • The PAM Family
  • PAM matrices are based on global alignments of
    closely related proteins.
  • PAM1 is the matrix calculated from comparisons
    of sequences with no more than 1% divergence.
  • Other PAM matrices are extrapolated from PAM1.
  • The BLOSUM family
  • BLOSUM matrices are based on local alignments.
  • BLOSUM 62 is a matrix calculated from comparisons
    of sequences sharing no more than 62% identity.
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins.

24
What happens if you have a gap in the alignment?
  • A gap is a position in the alignment at which a
    letter is paired with a null.
  • Gap scores are negative. Since a single
    mutational event may cause the insertion or
    deletion of more than one residue, the presence
    of a gap is frequently ascribed more significance
    than the length of the gap.
  • Hence opening a gap is penalized heavily, whereas
    a lesser penalty is assigned to each subsequent
    residue in the gap.
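This open-then-extend scheme is often called an affine gap penalty. The sketch below follows the wording above (one heavy penalty for the first gapped position, a lighter one for each subsequent position). The values -5 and -1 are illustrative assumptions, not BLAST defaults, and different tools count the first position differently.

```python
def gap_score(length, gap_open=-5, gap_extend=-1):
    """Score of a single gap spanning `length` positions.

    The first gapped position costs gap_open; each subsequent
    position costs gap_extend.
    """
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + (length - 1) * gap_extend


# One gap of three residues costs far less than three separate
# one-residue gaps, reflecting a single mutational event.
print(gap_score(3))      # -7
print(3 * gap_score(1))  # -15
```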

25
What do the Score and the E-value really mean?
  • The quality of the alignment is represented by
    the Score.
  • Score (S)
  • The score of an alignment is calculated as the
    sum of substitution and gap scores. Substitution
    scores are given by a look-up table (PAM, BLOSUM)
    whereas gap scores are assigned empirically.
  • The significance of each alignment is computed as
    an E value.
  • E value (E)
  • Expectation value. The number of different
    alignments with scores equivalent to or better
    than S that are expected to occur in a database
    search by chance. The lower the E value, the more
    significant the score.

26
Is the E-value the same as the P-value?
  • E value (E)
  • Expectation value. The number of different
    alignments with scores equivalent to or better
    than S that are expected to occur in a database
    search by chance. The lower the E value, the more
    significant the score.
  • When E < 0.01, P-values and E-values are nearly
    identical.
  • So, the E-value is the number of times you expect
    to see your hit occur in the database (with as
    good as or better score) due to random chance
    alone.
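Why the two values agree for small E can be checked numerically. The sketch assumes the standard relation P = 1 - exp(-E), which follows from treating the number of chance hits as Poisson-distributed with mean E; that relation is brought in here for illustration and is not stated on the slide.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit scoring >= S, given E.

    Assumes chance hits are Poisson-distributed with mean E,
    so P = 1 - exp(-E).
    """
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)


for e in (10.0, 1.0, 0.01, 1e-6):
    print(f"E = {e:<8g} P = {p_value_from_e(e):.6g}")
```

For large E the P-value saturates near 1 while E keeps growing, which is why E is the more informative number for weak hits.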

27
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
28
Topics
BLAST databases
  • The different blast databases provided by the
    NCBI
  • Protein databases
  • Nucleotide databases
  • Genomic databases
  • Considerations for choosing a BLAST database
  • Custom databases for BLAST

29
BLAST protein databases available through the
blastp web interface at NCBI
30
BLAST nucleotide databases available through the
blastn web interface at NCBI
31
Considerations for choosing a BLAST database
  • First consider your research question
  • Are you looking for an ortholog in a particular
    species?
  • BLAST against the genome of that species.
  • Are you looking for additional members of a
    protein family across all species?
  • BLAST against nr; if you can't find hits, check
    wgs, htgs, and the trace archives.
  • Are you looking to annotate genes in your species
    of interest?
  • BLAST against known genes (RefSeq) and/or ESTs
    from a closely related species.

32
When choosing a database for BLAST
  • It is important to know your reagents.
  • Changing your choice of database is changing your
    search space completely
  • Database size affects the BLAST statistics
  • Record BLAST parameters, database choice, and
    database size in your bioinformatics lab book,
    just as you would for your wet-bench experiments.
  • Databases change rapidly and are updated
    frequently
  • It may be necessary to repeat your analyses

33
Creating Custom Databases for BLAST
UBiC FAQ
34
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
35
Topics
BLAST results
  • Choosing the right BLAST program
  • Running a blastp search
  • BLAST parameters and options to consider
  • Viewing BLAST results
  • Look at your alignments
  • Using the BLAST taxonomy report

36
http://www.ncbi.nlm.nih.gov/BLAST/
blastn
37
http://www.ncbi.nlm.nih.gov/BLAST/
Program selection guide
38
What BLAST program should I use? Check the
NCBI's BLAST Program selection guide
39
http://www.ncbi.nlm.nih.gov/BLAST/
40
Input your query (gi231571) as FASTA, raw
sequence, or Accession/ID and choose your
database
41
Links to more information can be found on the
BLAST page
links
42
BLAST parameters and options to consider
conserved domains
Entrez query
E-value cutoff
Word size
43
More BLAST parameters and options to consider
filtering
gap penalties
matrix
44
Run your BLAST search
BLAST
45
The BLAST Queue
click for more info
Note your RID
46
Formatting and Retrieving your BLAST results
Results
options
47
A graphical view of your BLAST results
48
The BLAST hit list
Score
E-Value
GenBank
alignment
EntrezGene
49
The BLAST pairwise alignments
Identity
Similarity
50
Sorting BLAST results by Taxonomy
Taxonomy Report
51
Tax BLAST Report
Summary hits by lineage
BLAST hits by organism
52
BLAST statistics to record in your bioinformatics
lab book
Record the statistics that are found at the bottom
of your BLAST results page
53
Homology: Some Guidelines
  • Similarity can be indicative of homology
  • Generally, if two sequences are significantly
    similar over their entire length, they are likely
    homologous
  • Low complexity regions can be highly similar
    without being homologous
  • Homologous sequences are not always highly similar
  • Suggested BLAST Cutoffs
  • (source: Chapter 11, Bioinformatics: A Practical
    Guide to the Analysis of Genes and Proteins)
  • For nucleotide-based searches, one should look
    for hits with E-values of 10^-6 or less and
    sequence identity of 70% or more
  • For protein-based searches, one should look for
    hits with E-values of 10^-3 or less and sequence
    identity of 25% or more
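The suggested cutoffs can be applied as a simple filter over a hit list. The sketch assumes the thresholds quoted above (E-value at most 10^-6 with at least 70% identity for nucleotide searches; E-value at most 10^-3 with at least 25% identity for protein searches). The `Hit` class, its field names, and the example hits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit."""
    subject_id: str
    e_value: float
    percent_identity: float


# (maximum E-value, minimum percent identity) per search type.
CUTOFFS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_cutoff(hit, search_type):
    """True if the hit meets both the E-value and the identity guideline."""
    max_e, min_identity = CUTOFFS[search_type]
    return hit.e_value <= max_e and hit.percent_identity >= min_identity


# Made-up hits, for illustration only.
hits = [
    Hit("hit_1", 1e-10, 95.0),
    Hit("hit_2", 1e-4, 80.0),
    Hit("hit_3", 1e-8, 20.0),
]
kept = [h.subject_id for h in hits if passes_cutoff(h, "nucleotide")]
print(kept)
```

These are guidelines for a first pass; hits that pass still need their alignments inspected, since low complexity regions can score well without being homologous.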

54
Advanced BLAST programs
  • The NCBI BLAST pages have several advanced BLAST
    methods available
  • PSI-BLAST
  • PHI-BLAST
  • RPS-BLAST
  • All are powerful methods based on protein
    similarities

55
PSI-BLAST
  • Position Specific Iterated BLAST
  • A cycling/iterative method
  • Gives increased sensitivity for detecting
    distantly related proteins
  • Can give insight into functional relationships
  • Very refined statistical methods
  • Fast: still based on BLAST methods
  • Simple to use

56
How does PSI-BLAST work?
  • First, a standard blastp is performed
  • The highest scoring hits are used to generate a
    multiple alignment
  • A Position Specific Scoring Matrix (PSSM) is
    generated from the multiple alignment.
  • Highly conserved residues get high scores
  • Less conserved residues get lower scores
  • The PSSM describes the sequence similarity
    between your query and all significant blastp
    hits
  • Another similarity search is performed, this time
    using the new PSSM instead of the standard BLOSUM
    or PAM matrices
  • This PSSM (scoring matrix) is now customized to
    find sequences that are related to your original
    query
  • Steps 2-4 can be repeated until convergence
  • Convergence occurs when no new sequences appear
    after an iteration
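The PSSM-building step can be sketched as column-wise log-odds scores computed from a multiple alignment. This is a deliberately simplified illustration: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more refined statistics. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in ALPHABET
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment against the PSSM, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment: columns 1 and 2 are fully conserved, column 3 is variable.
pssm = build_pssm(["ACD", "ACE", "ACF"])
print(round(pssm[0]["A"], 3), round(pssm[2]["D"], 3), round(pssm[0]["W"], 3))
```

The conserved residue in the first column scores higher than any residue in the variable third column, and residues never seen in a column score below zero, matching the high and lower scores described above.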

57
http://www.ncbi.nlm.nih.gov/BLAST/
PSI-BLAST
58
Format results for PSI-BLAST with inclusion
E-value set at 0.005
PSI-BLAST
BLAST
59
Contributors
  • Special thanks to David Wishart, Andy Baxevanis,
    Stephanie Minnema, Sohrab Shah, and Francis
    Ouellette for their contributions to these
    materials
  • You are now ready to complete the BLAST
    assignment