Lab 3.2: Database Similarity Searching


1
Lab 3.2 Database Similarity Searching
  • The BLAST Buffet
  • Stephanie Minnema
  • University of Calgary

2
Our Goal
  • Take a tour of NCBI BLAST
  • Review practicalities of submitting BLAST queries
  • Understand BLAST output
  • Do sequence comparisons using basic and advanced
    BLAST methods

3
BLAST is Good For You
4
Database Similarity Searching
  • The method you'll use most!
  • Scans a database for alignments to a query
    sequence
  • Can get tons of information
  • functionality
  • evolutionary history
  • important residues
  • Basis for many forms of bioinformatic analysis

5
Most Common Tool
  • BLAST: Basic Local Alignment Search Tool
  • NCBI and others
  • Based on fast local alignment methods
  • Global alignment computationally intensive
  • Global alignment not always biologically
    significant
  • Breaks query down into words (K-tuples)
  • Finds regions of similarity
  • NCBI uses BLAST 2.0 (gapped BLAST)
  • Balances speed and sensitivity
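
The word step can be illustrated with a minimal sketch. It only shows how a query is cut into overlapping words (k-tuples); it is not BLAST's actual seeding code. The function name `query_words` and the example peptide are hypothetical, and a word size of 3 is assumed here because it is a common choice for protein queries.

```python
def query_words(query: str, word_size: int = 3) -> list[str]:
    """Return every overlapping word (k-tuple) of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


# Hypothetical six-residue peptide, used only to show the overlap.
words = query_words("MKSLLT", word_size=3)
print(words)  # ['MKS', 'KSL', 'SLL', 'LLT']
```

A query of length L yields L - k + 1 words of size k, so a larger word size gives fewer, more specific seeds.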

6
www.ncbi.nlm.nih.gov/BLAST/
7
(No Transcript)
8
Basic BLAST Flavors
  • blastp: protein query vs. protein sequence
    database
  • blastn: nucleotide query vs. nucleotide sequence
    database
  • blastx: translated nucleotide query vs. protein
    sequence database
  • tblastn: protein query vs. translated nucleotide
    sequence database
  • tblastx: translated nucleotide query vs.
    translated nucleotide sequence database
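
The five flavors above can be summarized as a lookup from query type and database type to program name. This is only a sketch for remembering the table; the names `PROGRAMS` and `candidate_programs` are made up for illustration and are not part of any BLAST interface.

```python
# (query type, database type) -> programs that accept that combination.
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def candidate_programs(query_type: str, database_type: str) -> list[str]:
    """Return the basic BLAST programs that fit the given query/database pair."""
    return PROGRAMS[(query_type, database_type)]


print(candidate_programs("nucleotide", "protein"))  # ['blastx']
```

For a nucleotide query against a nucleotide database there are two options: blastn compares the sequences directly, while tblastx compares their translations.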

9
What's Your Favorite Flavor?
  • Which program will best suit your query and
    desired output?
  • Protein comparisons give most meaningful results
  • Sequence complexity: 20 aa vs. 4 nt
  • Moderately similar nucleotide sequences could
    encode a highly similar protein sequence!
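
The last point can be checked with a small sketch. The two 12-nt sequences below are invented for illustration (they are not from the lab); they differ at most positions yet translate to the same peptide under the standard genetic code. The helper names `translate` and `identity` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "CTGAGCCGTGCA"  # made-up example sequence
seq_b = "TTATCAAGAGCT"  # made-up example sequence

print(identity(seq_a, seq_b))                        # 4 of 12 positions match
print(translate(seq_a), translate(seq_b))            # LSRA LSRA
print(identity(translate(seq_a), translate(seq_b)))  # 1.0
```

Because several codons encode the same amino acid, a search at the protein level can recover a relationship that a nucleotide-level search would score poorly.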

10
Takeout Message 1
  • Compare sequences on the protein level unless you
    know your query does not encode a protein product

11
Using Basic BLAST Methods
  • Example: MASH-1 protein sequence from mouse
  • Can I find similar proteins in human?

12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Submitting Your Query
  • Input query sequence
  • FASTA
  • Raw
  • Accession/ID
  • Choose Database
  • Many available; varies with program
  • For complete list follow the link to
    http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases
16
Finds Conserved Domains
Limit results with Entrez query
E-Value cut off
17
Submitting Your Query
  • CD Search
  • Finds conserved domains in query sequence
  • Compares to patterns and profiles of CDs
  • Limit by Entrez query
  • Restricts results to single organism etc.
  • E-value cut off
  • Restricts results to ones falling below defined
    e-value
  • Default: 10
  • Will revisit concept of e-value

18
Filtering
Matrix
Gap Penalties
19
Submitting Your Query
  • Low complexity filtering
  • Low complexity sequence can lead to spurious
    alignments
  • Filtering hides these regions
  • On by default
  • SEG (proteins) or DUST (nucleic acids)
  • Should turn it off in some cases: what if your
    entire sequence gets filtered?
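
To see why filtering can swallow a whole query, here is a rough sketch of low-complexity masking. It is not SEG or DUST: it swaps in a plain sliding-window Shannon entropy test, and the window size, threshold, function names, and example sequences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace every residue lying in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


diverse = "ACDEFGHIKLMN"   # made-up, compositionally diverse segment
repeat = "Q" * 20          # made-up poly-Q stretch

print(mask_low_complexity(diverse))  # unchanged
print(mask_low_complexity(repeat))   # every residue masked
```

A query that is almost entirely repetitive ends up fully masked, leaving nothing to align, which is the case where turning the filter off is worth considering.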

20
Submitting Your Query
  • Choice of scoring matrix
  • Different ones available
  • BLOSUM matrices based on observed frequencies of
    a.a. substitutions
  • Each tailored to different levels of sequence
    divergence and length
  • BLOSUM62 is the default
  • Shown to be best at detecting most protein
    similarities; don't usually need to change
  • Follow link for detailed information

21
Submitting Your Query
  • Gap Penalties
  • Accounts for insertions and deletions in
    different sequences
  • Scores are penalized for gaps to prevent aberrant
    alignments
  • Opening penalty is high; extension penalty is
    lower
  • Defaults may change depending on matrix choice
  • Rarely need to change default value
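
A small sketch of an affine gap cost shows why one long gap is penalized less than many short ones. It assumes the convention that a gap of length k costs existence + k × extension, and uses 11 and 1 as illustrative values (commonly paired with BLOSUM62); check the search form for the defaults that apply to your matrix. The function name `gap_cost` is hypothetical.

```python
def gap_cost(length: int, existence: int = 11, extension: int = 1) -> int:
    """Affine penalty for a single gap of the given length (0 if no gap)."""
    if length <= 0:
        return 0
    return existence + extension * length


one_long_gap = gap_cost(5)          # 11 + 5*1 = 16
five_short_gaps = 5 * gap_cost(1)   # 5 * (11 + 1) = 60
print(one_long_gap, five_short_gaps)
```

With these values a single five-residue insertion costs 16, while five scattered one-residue gaps cost 60, so the alignment favors a few contiguous indels over many isolated ones.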

22
(No Transcript)
23
Click for more info
Take note
24
Formatting Options
25
(No Transcript)
26
Understanding Your Results
  • Graphic representation of results
  • Top of graph represents query sequence
  • Underlying bars show where hits occur
  • Colors represent alignment scores
  • Grey areas represent non-similar regions
    surrounded by similar regions
  • Scrolling over bar shows accession and
    description of hit
  • Clicking on a bar takes you to its alignment with
    the query

27
(No Transcript)
28
Understanding Your Results
  • Bit scores
  • Normalized raw score
  • Raw score: sum of substitution scores and gap
    penalties
  • Normalized on basis of scoring method
  • Can compare searches scored using different
    matrices
  • Higher is better, but bit scores alone don't
    adequately represent significance of alignment

29
Understanding Your Results
  • E-values
  • Indicator of alignment significance
  • Expected number of alignments with the same or a
    better score that could have arisen by chance
  • Lower is better
  • E-values decrease exponentially as scores for an
    alignment increase
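
The relationship between raw score, bit score, and E-value can be sketched with the standard Karlin-Altschul formulas: bit score S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'). The λ and K values below are illustrative assumptions (they depend on the matrix and gap penalties), m and n stand for query and database lengths without any edge-effect correction, and the function names are hypothetical.

```python
import math

# Illustrative statistical parameters; real values depend on matrix and gap costs.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Normalize a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: float, db_len: float,
            lam: float = LAMBDA, k: float = K) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return query_len * db_len * 2 ** (-bit_score(raw_score, lam, k))


# Hypothetical search space: 250-residue query, 1,000,000-residue database.
for raw in (40, 60, 80, 100):
    print(raw, round(bit_score(raw), 1), e_value(raw, 250, 1e6))
```

Each additional bit halves the E-value, which is the exponential decrease described above. The E-value also scales with database size, so the same alignment is less significant in a larger database.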

30
Examine Results
31
Understanding Your Results
  • Alignments
  • Important to inspect them
  • Take note of percent identity and similarity
    between query and aligned sequence
  • Examine regions of similarity and gaps
  • What if a sub-optimal alignment is the most
    functionally significant one?
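
Percent identity, similarity, and gaps can be tallied from a pairwise alignment with a short sketch. BLAST counts a position as similar ("positive") when its substitution score is positive; as a stand-in, this sketch assumes a hand-picked set of conservative residue groups, so its similarity figure will not match BLAST output exactly. The groups, function name, and example alignment are all hypothetical.

```python
# Assumed conservative groups, standing in for positive substitution scores.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def alignment_stats(query_aln: str, subject_aln: str) -> dict[str, float]:
    """Percent identity, similarity, and gaps over the alignment length."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = gaps = 0
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identical += 1
            similar += 1  # identical positions also count as similar
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            similar += 1
    n = len(query_aln)
    return {
        "identity": 100 * identical / n,
        "similarity": 100 * similar / n,
        "gaps": 100 * gaps / n,
    }


# Made-up seven-column alignment: 4 identities, 1 conservative change, 1 gap.
stats = alignment_stats("MKV-LDE", "MRVALDQ")
print(stats)
```

Reading these three numbers next to the alignment itself helps show whether a good score comes from one well-conserved region or from scattered matches.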

32
Takeout Message 2
  • Don't trust your computer blindly! Examine and
    think about your results

33
Homology: Some Rules to Consider
  • Similarity can be indicative of homology
  • Generally, if two sequences are significantly
    similar over entire length they are likely
    homologous
  • 50% similarity over a short sequence often occurs
    by chance
  • Low complexity regions can be highly similar
    without being homologous
  • Homologous sequences not always highly similar

34
Takeout Message 3
  • Homology is like pregnancy

35
Basic BLAST Flavors for Special Occasions
  • BLAST 2 Sequences (bl2seq)
  • Aligns two sequences of your choice
  • Can do different types of comparison, e.g. blastx
  • Gives dot-plot like output
  • VecScreen
  • Compares query with sequences of known cloning
    vectors
  • Both very handy for sequencing!

36
Basic BLAST Flavors for Special Occasions
  • BLAST against genomes
  • Many available
  • BLAST parameters pre-optimized
  • Handy for mapping query to genome
  • Search for short exact matches
  • BLAST parameters pre-optimized
  • Great for checking probes and primers

37
Basic BLAST Flavors for Special Occasions
  • megaBLAST
  • For aligning sequences which differ slightly due
    to sequencing errors etc.
  • Very efficient for long query sequences
  • Uses big word (k-tuple) sizes to start search
  • Very fast
  • Accepts batch submissions of ESTs
  • Can upload files of sequences as queries
  • For more detailed info, see the megaBLAST pages

38
Time to Sample the Buffet
  • Try questions 1-4, found at the end of the lab
    notes accompanying this lecture.
  • We'll discuss them in 15-20 minutes

39
Advanced BLAST Methods
  • The NCBI BLAST pages have several advanced BLAST
    methods available
  • PSI-BLAST
  • PHI-BLAST
  • RPS-BLAST
  • All are powerful methods based on protein
    similarities

40
More Complex Flavor: PSI-BLAST
  • Position Specific Iterated BLAST
  • A cycling/iterative method
  • Gives increased sensitivity for detecting
    distantly related proteins
  • Can give insight into functional relationships
  • Very refined statistical methods
  • Fast: still based on BLAST methods
  • Simple to use

41
PSI-BLAST Principle
  • First, a standard blastp is performed
  • The highest scoring hits are used to generate a
    multiple alignment
  • A PSSM is generated from the multiple alignment.
  • Highly conserved residues get high scores
  • Less conserved residues get lower scores
  • Another similarity search is performed, this time
    using the new PSSM
  • Steps 2-4 can be repeated until convergence
  • Convergence: no new sequences appear after an
    iteration
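
The PSSM step can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual scoring: it assumes a uniform background frequency for all 20 amino acids, a flat pseudocount, and no sequence weighting. The alignment is invented, and the names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build log-odds scores (bits) per column from equal-length aligned sequences."""
    background = 1 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm: list[dict[str, float]], segment: str) -> float:
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[res] for column, res in zip(pssm, segment))


# Made-up alignment: column 1 is fully conserved (G), column 3 varies (I/V/L).
alignment = ["GHIGH", "GHVGH", "GHLGH", "GHIGN"]
pssm = build_pssm(alignment)

print(round(pssm[0]["G"], 2), round(pssm[2]["I"], 2))  # conserved vs. variable column
print(score_segment(pssm, "GHIGH") > score_segment(pssm, "WWWWW"))  # True
```

The conserved G in the first column scores higher than the most common residue in the variable third column, so the next search pass rewards matches at the positions the family actually conserves.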

42
Example: Aminoacyl-tRNA Synthetases
  • 20 enzymes for 20 amino acids
  • Each is very different
  • Big, small, monomers, tetramers, strange globs
  • All bind to their appropriate tRNAs, with high
    specificity
  • Each binds all tRNAs for its amino acid, but none
    of the others
  • TrpRS and TyrRS share only 13% sequence identity
  • BUT, overall structures of TrpRS and TyrRS are
    similar
  • Structure-function relationship

43
Same SCOP family based on catalytic domain
44
TyrRS and TrpRS are Similar
  • Sequence similarity expected, right?
  • BUT blastp of E. coli TyrRS against bacterial
    sequences in SwissProt does not show similarity
    with TrpRS
  • e-value cutoff of 10

45
No TrpRS!?
46
Try Using PSI-BLAST
  • PSI-BLAST available from BLAST main page
  • Query form just like for blastp
  • BUT one extra formatting option must be used
  • Format for PSI-BLAST: check it off!
  • Second e-value cutoff used to determine which
    alignments will be used for the PSSM build: the
    threshold for inclusion
  • First search using TyrRS as query
  • Db: SwissProt, limited to Bacteria[ORGN]
  • Threshold for inclusion: 0.005

47
(No Transcript)
48
(No Transcript)
49
After A Few Iterations
50
TyrRS Similarity to TrpRS!
51
Power of PSI-BLAST
  • We knew TyrRS and TrpRS were similar
  • Functionally and structurally
  • Blastp gave no indication
  • PSI-BLAST was able to detect their weak sequence
    similarity
  • A word of caution: be sure to inspect and think
    about the results included in the PSSM build.
  • Include/exclude sequences on basis of biological
    knowledge

52
Query
Does the query really have a relationship with
the results?
Results
53
Takeout Message 4
  • Use your biological knowledge when doing PSI-BLAST
    to yield the most significant results

54
Another Complex Flavor: PHI-BLAST
  • Pattern Hit Initiated BLAST
  • PHI-BLAST principle
  • Same method as PSI-BLAST
  • Starts first search with the query sequence plus
    a pattern for a motif in the query
  • PHI-BLAST finds sequences containing the motif
    and having significant sequence similarity in the
    vicinity of the motif occurrence
  • Highly specific

55
Example: TyrRS
  • TyrRS contains the aaRS class-I signature
  • Want to find sequences containing that motif, and
    regional similarity to TyrRS
  • First get the Prosite pattern for the class-I
    signature
  • Prosite: database of protein families and domains

56
http://ca.expasy.org/prosite
57
P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMY
AC]-G-[HNTG]-[LIVMFYSTAGPC]
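
A Prosite pattern reads much like a regular expression, which the sketch below makes explicit. It assumes standard PROSITE syntax, in which alternative residues are enclosed in square brackets, excluded residues in curly braces, and x means any residue; the class-I signature is written in that form. The converter `prosite_to_regex` is a hypothetical helper covering only this subset of the syntax, and the test peptide is invented, not a real TyrRS fragment.

```python
import re

# One pattern element: x, a residue, [alternatives] or {exclusions}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, repeat = match.groups()
        if residue == "x":
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any residue except these
        parts.append(residue + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


CLASS_I = "P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]"
regex = prosite_to_regex(CLASS_I)
print(regex)

# Made-up peptide containing one occurrence of the motif.
hit = re.search(regex, "MMPSGKLLHIGHLMM")
print(hit.group() if hit else None)  # PSGKLLHIGHL
```

PHI-BLAST goes further than a plain pattern match: it also requires significant sequence similarity around each occurrence of the motif.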
58
Insert Query Sequence
Insert PHI Pattern
59
PHI-BLAST Results
  • After first search, PHI-BLAST functions same as
    PSI-BLAST
  • Result page is the same
  • Can iterate in same way.
  • Try it later if you like

60
The Key to PHI- and PSI-BLAST
  • Generating the multiple alignments to create
    PSSMs
  • Refines scoring in searches
  • Annotated collections of multiple alignments
    defining domains exist
  • Conserved domain database (CDD)
  • Contains 18039 alignments (10013 last year)
  • Can search the CDD using CD search
  • Uses RPS-BLAST

61
RPS-BLAST
  • Reverse Position Specific BLAST
  • Opposite of PSI-BLAST
  • CDD multiple alignments converted to PSSMs
  • PSSMs are processed and turned into a searchable
    database
  • Queries are searched against PSSMs using
    RPS-BLAST
  • Output indicates conserved domains within the
    query sequence

62
Example: CRADD protein
63
Click on picture to see CDD multiple alignment
Click to see alignment with query
64
Summary of Advanced BLAST Methods
  • PSI-BLAST
  • Input: SEQUENCE
  • Database: SEQUENCES
  • Algorithm: constructs a PSSM from an initial pass
    and uses this in the next pass
  • Output: distantly related sequences
  • sensitive, -specific
  • PHI-BLAST
  • Input: PROFILE and SEQUENCE
  • Database: SEQUENCES
  • Algorithm: same as PSI-BLAST except start with a
    profile
  • Output: sequences containing the domain and that
    are similar in the domain region
  • sensitive, -gt -specific
  • RPS-BLAST
  • Input: SEQUENCE
  • Database: DOMAINS
  • Output: domains found in the sequence
  • sensitive, specific

65
Back for Another Helping
  • Try the remaining questions in the notes!

66
Enlightenment begins with a BLAST
Special Thanks to Sohrab Shah for the aaRS
example and further BLAST enlightenment