Sequence Similarity Searching - PowerPoint PPT Presentation

1 / 55
About This Presentation
Title:

Sequence Similarity Searching

Description:

What is this thing I just found? ... Guess functions for entire genomes full of new gene sequences ... if A~B and B~C, then A~C. Advanced Similarity Techniques ... – PowerPoint PPT presentation

Number of Views:527
Avg rating:3.0/5.0
Slides: 56
Provided by: researchco3
Category:

less

Transcript and Presenter's Notes

Title: Sequence Similarity Searching


1
Sequence Similarity Searching
Stuart M. Brown, Ph.D. Center for Health
Informatics and BioinformaticsNYU School of
Medicine
2
Why Compare Sequences?
  • Identify sequences found in lab experiments
  • What is this thing I just found?
  • Compare new genes to known ones
  • Compare genes from different species
  • information about evolution
  • Guess functions for entire genomes full of new
    gene sequences
  • Map sequence reads to a Reference
    Genome (ChIP-seq, RNA-seq, etc.)

3
Are there other sequences like this one?
  • 1) Huge public databases - GenBank, Swissprot,
    etc.
  • 2) Sequence comparison is the most powerful and
    reliable method to determine evolutionary
    relationships between genes
  • 3) Similarity searching is based on alignment
  • 4) BLAST and FASTA provide rapid similarity
    searching
  • a. rapid approximate (heuristic)
  • b. false and - scores

4
Similarity ? Homology
  • 1) 25 similarity 100 AAs is strong evidence
    for homology
  • 2) Homology is an evolutionary statement which
    means descent from a common ancestor
  • common 3D structure
  • usually common function
  • homology is all or nothing, you cannot say "50
    homologous"

5
How to Compare Sequences?
  • Manually line them up and count?
  • an alignment program can do it for you
  • or a just use a text editor
  • Dot Plot
  • shows regions of similarity as diagonals

GATGCCATAGAGCTGTAGTCGTACCCT lt gt
CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC
6
Percent Sequence Identity
  • The extent to which two nucleotide or amino acid
    sequences are invariant

A C C T G A G A G A C G T G G C
A G
mismatch
indel
70 identical
7
Global vs Local similarity
  • Global similarity uses complete aligned sequences
    - total matches
  • - Needleman Wunch algorithm
  • 2) Local similarity looks for best internal
    matching region between 2 sequences
  • Smith-Waterman algorithm,
  • BLAST and FASTA
  • 3) dynamic programming
  • optimal computer solution, not approximate

8
Global vs. Local Alignments
9
Global Alignment
Two sequences sharing several regions of local
similarity
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAA
TTAAAGAGGAGGTAGACCG.... 67


1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAG
CACTAAAGCGTCAGCGAGACCG 70
10
Similarity is Based on Dot Plots
  • 1) two sequences on vertical and horizontal axes
    of graph
  • 2) put dots wherever there is a match
  • 3) diagonal line is region of identity (local
    alignment)
  • 4) apply a window filter - look at a group of
    bases, must meet identity to get a dot

11
Simple Dot Plot
12
Dot plot filtered with 4 base window and 75
identity
13
Dot plot of real data
14
Dotplot (Window 130 / Stringency 9)
Hemoglobin?-chain
Hemoglobin ?-chain
15
Window / Stringency
Score 11
PTHPLASKTQILPEDLASEDLTI
?
PTHPLAGERAIGLARLAEEDFGM
Filtering
Score 11
Window 12 Stringency 9
PTHPLASKTQILPEDLASEDLTI
?
PTHPLAGERAIGLARLAEEDFGM
Score 7
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
16
Dotplot (Window 18 / Stringency 10)
Hemoglobin?-chain
Hemoglobin ?-chain
17
Scoring Similarity
  • 1) Can only score aligned sequences
  • 2) DNA is usually scored as identical or not
  • 3) modified scoring for gaps - single vs.
    multiple base gaps (gap extension)
  • 4) Protein AAs have varying degrees of similarity
  • a. of mutations to convert one to another
  • b. chemical similarity
  • c. observed mutation frequencies
  • 5) PAM matrix calculated from observed mutations
    in protein families

18
Search with Protein, not DNA Sequences
  • 1) 4 DNA bases vs. 20 amino acids - less chance
    similarity
  • 2) can have varying degrees of similarity between
    different AAs
  • - of mutations, chemical similarity, PAM matrix
  • 3) protein databanks are much smaller than DNA
    databanks

19
The PAM 250 scoring matrix
20
What program to use for searching?
  • 1) Smith-Waterman is slower, but more sensitive
  • known as a rigorous or exhaustive search
    optimal alignments
  • EMBOSS water program
  • 2) FASTA
  • more sensitive for DNA-DNA comparisons
  • FASTX and TFASTX can find similarities in
    sequences with frameshifts
  • 3) BLAST is fastest and easily accessed on the
    Web
  • limited sets of databases for web version
  • Free software to install on UNIX computers, make
    custom databases
  • nice translation tools (BLASTX, TBLASTN)

21
Smith-Waterman Method
Basic principles of dynamic programming
- Creation of an alignment path matrix -
Stepwise calculation of score values -
Backtracking (evaluation of the optimal path)
22
(No Transcript)
23
Creation of an alignment path matrix
Idea Build up an optimal alignment using
previous solutions for optimal alignments of
smaller subsequences
  • Construct matrix F indexed by i and j (one index
    for each sequence)
  • F(i,j) is the score of the best alignment between
    the initial segment x1...i of x up to xi and
    the initial segment y1...j of y up to yj
  • Build F(i,j) recursively beginning with F(0,0) 0

24
Creation of an alignment path matrix
  • If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we
    can calculate F(i,j)
  • Three possibilities
  • xi and yj are aligned, F(i,j) F(i-1,j-1)
    s(xi ,yj)
  • xi is aligned to a gap, F(i,j) F(i-1,j) - d
  • yj is aligned to a gap, F(i,j) F(i,j-1) - d
  • The best score up to (i,j) will be the largest of
    the three options

25
Backtracking
H E A G A W G H
E E 0 -8 -16 -24 -32 -40 -48
-56 -64 -72 -80 P
-8 -2 -9 -17 -25 -33 -42 -49 -57 -65
-73 A -16 -10 -3 -4 -12 -20 -28 -36
-44 -52 -60 W -24 -18 -11 -6 -7 -15
-5 -13 -21 -29 -37 H -32 -14 -18 -13
-8 -9 -13 -7 -3 -11 -19 E -40 -22
-8 -16 -16 -9 -12 -15 -7 3 -5 A -48
-30 -16 -3 -11 -11 -12 -12 -15 -5
2 E -56 -38 -24 -11 -6 -12 -14 -15
-12 -9 1
0
-8
-16
-25
-17
-20
-5
-13
-3
3
-5
1
- A
E E
H H
G -
W W
A A
G -
A P
E -
H -
Optimal global alignment
26
Smith-Waterman is OPTIMAL but computationally slow
  • SW search requires computing of matrix of scores
    at every possible alignment position with every
    possible gap.
  • Compute task increases with the product of the
    lengths of two sequence to be compared
  • Difficult for comparison of one small sequence
    to a much larger one, very difficult for two
    large sequences, essentially impossible to search
    very large databases.

27
FASTA
  • 1) Derived from logic of the dot plot
  • compute best diagonals from all frames of
    alignment
  • 2) Word method looks for exact matches between
    words in query and test sequence
  • hash tables (fast computer technique)
  • DNA words are usually 6 bases
  • protein words are 1 or 2 amino acids
  • only searches for diagonals in region of word
    matches faster searching

28
FASTA Format
  • simple format used by almost all programs
  • gtheader line with a return at end
  • Sequence (no specific requirements for line
    length, characters, etc)

gtURO1 uro1.seq Length 2018 November 9, 2000
1150 Type N Check 3854 .. CGCAGAAAGAGGAGGCGC
TTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA ACTCAAC
TGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTT
GTT GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCA
ACACAGCCTCTACC CACTGCTTGAAGCCACCGACAACGATGACATCTAT
GGGGCTGCCTGGATCGGCATATTTG TGGGCATCTGCCTCTTCTGCCTGT
CTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA GGAAAATTCTTCT
GGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT CT
TGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCT
GAAGCAGA TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGAT
GACCAGTGGAAAAACAATG GAGTCACCAAAACCTGGGACAGGCTCATGC
TCCAGGACAATTGCTGTGGCGTAAATGGTC CATCAGACTGGCAAAAATA
CACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC CCTGGCCT
CGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGC
TT
29
FASTA Algorithm
30
Makes Longest Diagonal
  • 3) after all diagonals found, tries to join
    diagonals by adding gaps
  • 4) computes alignments in regions of best
    diagonals

31
FASTA Alignments
32
FASTA Results - Alignment
  • SCORES Init1 1515 Initn 1565 Opt 1687
    z-score 1158.1 E() 2.3e-58
  • gtgtGB_IN3DMU09374
    (2038 nt)
  • initn 1565 init1 1515 opt 1687 Z-score
    1158.1 expect() 2.3e-58
  • 66.2 identity in 875 nt overlap
  • (83-957151-1022)
  • 60 70 80
    90 100 110
  • u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAG
    CGGAGGCGATGGCGCTGTTGGCC

  • DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAAC
    AGAAGGCGCTCCAACTGATGGCC
  • 130 140 150
    160 170 180
  • 120 130 140
    150 160 170
  • u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCT
    TCTCTGGCCTCTTTGGAGGCTCA

  • DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTC
    TGGGATCGCTGTTCGGAGGGTCC
  • 190 200 210
    220 230 240
  • 180 190 200
    210 220 230

33
BLAST Searches GenBank(or a custom database)
  • BLAST Basic Local Alignment Search Tool
  • The NCBI BLAST web server lets you compare your
    query sequence to various sections of GenBank
  • nr non-redundant (main sections)
  • month new sequences from the past few weeks
  • ESTs
  • human, drososphila, yeast, or E.coli genomes
  • proteins (by automatic translation)
  • This is a VERY fast and powerful computer.

34
BLAST
  • Uses word matching
  • Similarity matching of words (3 aas, 11 bases)
  • does not require identical words.
  • If no words are similar, then no alignment
  • wont find matches for very short sequences
  • gapped BLAST (BLAST 2) improved handling of
    gaps in alignment
  • BLAST searches can be sent to the NCBIs server
    from website, or a custom client program (Unix)

35
BLAST Algorithm
36
BLAST Word Matching
  • MEAAVKEEISVEDEAVDKNI
  • MEA
  • EAA
  • AAV
  • AVK
  • VKE
  • KEE
  • EEI
  • EIS
  • ISV
  • ...

Break query into words
Break database sequences into words
37
Compare Word Lists
  • Database Sequence Word Lists
  • RTT AAQ
  • SDG KSS
  • SRW LLN
  • QEL RWY
  • VKI GKG
  • DKI NIS
  • LFC WDV
  • AAV KVR
  • PFR DEI
  • Query Word List
  • MEA
  • EAA
  • AAV
  • AVK
  • VKL
  • KEE
  • EEI
  • EIS
  • ISV

?
Compare word lists by Hashing (allow near
matches)
38
Find locations of matching words in database
sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
MEA EAA AAV AVK KLV KEE EEI EIS ISV
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSR
WNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
39
Extend hits one base at a time
40
BLAST 2 algorithm
  • The NCBIs BLAST website and GCG (NETBLAST)
    now both use BLAST 2 (also known as gapped
    BLAST)
  • This algorithm is more complex than the original
    BLAST
  • It requires two word matches close to each other
    on a pair of sequences (i.e. with a gap) before
    it creates an alignment

41
Gapped BLAST
HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Seq_XYZ
QSVFDYIYYGCYCGWGLG_GK__PRDA
Query
E-val10-13
  • Use two word matches as anchors to build an
    alignment between the query and a database
    sequence.
  • Then score the alignment.

42
HSPs are Aligned Regions
  • The results of the word matching and attempts to
    extend the alignment are segments
  • - called HSPs (High-scoring Segment Pairs)
  • BLAST often produces several short HSPs rather
    than a single aligned region

43
  • gtgbBE588357.1BE588357 194087 BARC 5BOV Bos
    taurus cDNA 5'.
  • Length 369
  • Score 272 bits (137), Expect 4e-71
  • Identities 258/297 (86), Gaps 1/297 (0)
  • Strand Plus / Plus

  • Query 17 aggatccaacgtcgctccagctgctcttgacgactccac
    agataccccgaagccatggca 76

  • Sbjct 1 aggatccaacgtcgctgcggctacccttaaccact-cgc
    agaccccccgcagccatggcc 59

  • Query 77 agcaagggcttgcaggacctgaagcaacaggtggagggg
    accgcccaggaagccgtgtca 136

  • Sbjct 60 agcaagggcttgcaggacctgaagaagcaagtggagggg
    gcggcccaggaagcggtgaca 119

  • Query 137 gcggccggagcggcagctcagcaagtggtggaccaggcc
    acagaggcggggcagaaagcc 196

  • Sbjct 120 tcggccggaacagcggttcagcaagtggtggatcaggcc
    acagaagcagggcagaaagcc 179

  • Query 197 atggaccagctggccaagaccacccaggaaaccatcgac
    aagactgctaaccaggcctct 256

44
(No Transcript)
45
BLAST has Automatic Translation
  • BLASTX makes automatic translation (in all 6
    reading frames) of your DNA query sequence to
    compare with protein databanks
  • TBLASTN makes automatic translation of an entire
    DNA database to compare with your protein query
    sequence
  • Only make a DNA-DNA search if you are working
    with a sequence that does not code for protein.

46
BLAST Results - Summary
47
BLAST Results - List
48
BLAST Results - Alignment
gtgi17556182refNP_497582.1 Predicted CDS,
phosphatidylinositol transfer protein
Caenorhabditis elegans gi14574401gbAAK68521.
1AC024814_1 Hypothetical protein Y54F10AR.1
Caenorhabditis elegans Length 336
Score 283 bits (723), Expect 8e-75
Identities 144/270 (53), Positives 186/270
(68), Gaps 13/270 (4) Query 48
KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----
DGE--KGQYT 101 K RVLPSVEYQVGQLSVAE
ASK P G KGQYT Sbjct 70
KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLL
NGQFTKGQYT 129 Query 102 HKIYHLQSKVPTFVRMLAPEGAL
NIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
HKIYHLQSKVP R APGL IHEAWNAYPYCTVTN
YMKEF KIET H P Sbjct 130 HKIYHLQSKVPAILRKIAPKG
SLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
189 Query 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-
SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN
H L E V IIA L S D PKFS KTGRGPL
N Sbjct 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSG
DLHPDSTPSKFQSTKTGRGPLSGN 249 Query 220
WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHR
QLFCWLDKWV 279 WK P MCAYKLVTV
FKWG Q VEN H Q RLF FHRFCWDKW Sbjct 250
WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHR
EVFCWIDKWH 304 Query 280 DLTMDDIRRMEEETKRQLDEMRQ
KDPVKGM 309 LTM DIR E LE R
VGM Sbjct 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM
334
49
BLAST alignments are short segments
  • BLAST tends to break alignments into
    non-overlapping segments
  • can be confusing
  • reduces overall significance score

50
BLAST is Approximate
  • BLAST makes similarity searches very quickly
    because it takes shortcuts.
  • looks for short, nearly identical words (11
    bases)
  • It also makes errors
  • misses some important similarities
  • makes many incorrect matches
  • easily fooled by repeats or skewed composition

51
Web BLAST runs on a big computer at NCBI
  • Usually fast, but does get busy sometimes
  • Fixed choices of databases
  • problems with genome data clogging the system
  • ESTs are not part of the default NR dataset
  • Uses filtering of repeats
  • Graphical summary of output
  • Links to GenBank sequences

52
BLAST Statistics
  • E() value is equivalent to standard P value
  • Significant if E() lt 0.05 (smaller numbers are
    more significant)
  • The E-value represents the likelihood that the
    observed alignment is due to chance alone. A
    value of 1 indicates that an alignment this good
    would happen by chance with any random sequence
    searched against this database.

53
Interpretation of output
  • very low E() values (e-100) are homologs or
    identical genes
  • moderate E() values are related genes
  • long list of gradually declining of E() values
    indicates a large gene family
  • long regions of moderate similarity are more
    significant than short regions of high identity

54
Borderline similarity
  • What to do with matches with E() values in the
    0.5 -1.0 range?
  • this is the Twilight Zone
  • retest these sequences and look for related hits
    (not just your original query sequence)
  • similarity is transitive
  • if AB and BC, then AC

55
Biological Relevance
  • It is up to you, the biologist to scrutinize
    these alignments and determine if they are
    significant.
  • Were you looking for a short region of nearly
    identical sequence or a larger region of general
    similarity?
  • Are the mismatches conservative ones?
  • Are the matching regions important structural
    components of the genes or just introns and
    flanking regions?
Write a Comment
User Comments (0)
About PowerShow.com