Comparing biological sequences (3): Database searching and Multiple alignment - PowerPoint PPT Presentation

About This Presentation
Title:

Comparing biological sequences (3): Database searching and Multiple alignment

Slides: 71
Provided by: SophieDa4
Learn more at: https://www.eecs.ucf.edu

Transcript and Presenter's Notes

Title: Comparing biological sequences (3): Database searching and Multiple alignment


1
Comparing biological sequences (3) Database
searching and Multiple alignment
2
Database searching
  • Goal: find sequences similar (homologous) to a
    query sequence in a sequence database
  • Input: query sequence, database
  • Output: hits (pairwise alignments)

3
Database searching
  • Core pair-wise alignment algorithm
  • Speed (fast sequence comparison)
  • Relevance of the search results (statistical
    tests)
  • Recovering all information of interest
  • The results depend on the search parameters, such
    as gap penalty and scoring matrix.
  • Sometimes searches with more than one matrix
    should be performed

4
What program to use for searching?
  • 1) BLAST is fastest and easily accessed on the
    Web
  • limited sets of databases
  • nice translation tools (BLASTX, TBLASTN)
  • 2) FASTA
  • precise choice of databases
  • more sensitive for DNA-DNA comparisons
  • FASTX and TFASTX can find similarities in
    sequences with frameshifts
  • 3) Smith-Waterman is slower, but more sensitive
  • known as a rigorous or exhaustive search
  • SSEARCH in GCG and standalone FASTA

5
FASTA
  • 1) Derived from logic of the dot plot
  • compute best diagonals from all frames of
    alignment
  • 2) Word method looks for exact matches between
    words in query and test sequence
  • hash tables (fast computer technique)
  • DNA words are usually 6 bases
  • protein words are 1 or 2 amino acids
  • only searches for diagonals in regions of word
    matches, which makes searching faster
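
The word method above can be sketched in Python. This is a minimal illustration under stated assumptions, not the FASTA program itself: the function names `index_words` and `diagonal_hits`, the toy sequences, and the choice of exact matching only are all hypothetical. It hashes every word of the query, scans the other sequence, and counts word matches per diagonal (target position minus query position).

```python
from collections import Counter, defaultdict

def index_words(seq, k):
    """Hash table mapping each k-letter word to its start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table

def diagonal_hits(query, target, k):
    """Count exact word matches on each diagonal (target pos - query pos)."""
    table = index_words(query, k)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        # .get avoids inserting new keys into the defaultdict
        for i in table.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals

# Toy DNA example with 6-base words (the word size named on the slide)
hits = diagonal_hits("GATTACAGATTACA", "CCGATTACAGATTACACC", k=6)
best_diagonal, count = hits.most_common(1)[0]
print(best_diagonal, count)
```

Only the diagonals with many word matches would then be examined further, which is where the speed-up over a full dot plot comes from.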

6
FASTA Algorithm
7
Makes Longest Diagonal
  • 3) after all diagonals found, tries to join
    diagonals by adding gaps
  • 4) computes alignments in regions of best
    diagonals

8
FASTA Alignments
9
FASTA Results - Histogram
  • !!SEQUENCE_LIST 1.0
  • (Nucleotide) FASTA of b2.seq from 1 to 693
    December 9, 2002 1402
  • TO /u/browns02/Victor/Search-set/.seq
    Sequences 2,050 Symbols
  • 913,285 Word Size 6
  • Searching with both strands of the query.
  • Scoring matrix GenRunDatafastadna.cmp
  • Constant pamfactor used
  • Gap creation penalty 16 Gap extension penalty
    4
  • Histogram Key
  • Each histogram symbol represents 4 search set
    sequences
  • Each inset symbol represents 1 search set
    sequences
  • z-scores computed from opt scores
  • z-score obs exp
  • () ()
  • < 20 0 0
  • 22 0 0
  • 24 3 0
  • 26 2 0

10
FASTA Results - List
  • The best scores are init1
    initn opt z-sc E(1018780)..
  • SWPPI1_HUMAN Begin 1 End 269
  • ! Q00169 homo sapiens (human). phosph... 1854
    1854 1854 2249.3 1.8e-117
  • SWPPI1_RABIT Begin 1 End 269
  • ! P48738 oryctolagus cuniculus (rabbi... 1840
    1840 1840 2232.4 1.6e-116
  • SWPPI1_RAT Begin 1 End 270
  • ! P16446 rattus norvegicus (rat). pho... 1543
    1543 1837 2228.7 2.5e-116
  • SWPPI1_MOUSE Begin 1 End 270
  • ! P53810 mus musculus (mouse). phosph... 1542
    1542 1836 2227.5 2.9e-116
  • SWPPI2_HUMAN Begin 1 End 270
  • ! P48739 homo sapiens (human). phosph... 1533
    1533 1533 1861.0 7.7e-96
  • SPTREMBL_NEWBAC25830 Begin 1 End 270
  • ! Bac25830 mus musculus (mouse). 10, ... 1488
    1488 1522 1847.6 4.2e-95
  • SP_TREMBLQ8N5W1 Begin 1 End 268
  • ! Q8n5w1 homo sapiens (human). simila... 1477
    1477 1522 1847.6 4.3e-95
  • SWPPI2_RAT Begin 1 End 269
  • ! P53812 rattus norvegicus (rat). pho... 1482
    1482 1516 1840.4 1.1e-94

11
FASTA Results - Alignment
  • SCORES Init1 1515 Initn 1565 Opt 1687
    z-score 1158.1 E() 2.3e-58
  • gtgtGB_IN3DMU09374
    (2038 nt)
  • initn 1565 init1 1515 opt 1687 Z-score
    1158.1 expect() 2.3e-58
  • 66.2% identity in 875 nt overlap
  • (83-957:151-1022)
  • 60 70 80
    90 100 110
  • u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAG
    CGGAGGCGATGGCGCTGTTGGCC

  • DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAAC
    AGAAGGCGCTCCAACTGATGGCC
  • 130 140 150
    160 170 180
  • 120 130 140
    150 160 170
  • u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCT
    TCTCTGGCCTCTTTGGAGGCTCA

  • DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTC
    TGGGATCGCTGTTCGGAGGGTCC
  • 190 200 210
    220 230 240
  • 180 190 200
    210 220 230

12
FASTA on the Web
  • Many websites offer FASTA searches
  • Various databases and various other services
  • Be sure to use FASTA 3
  • Each server has its limits
  • Be aware that you are depending on the kindness
    of strangers.

13
  • Institut de Génétique Humaine, Montpellier, France, GeneStream server: http://www2.igh.cnrs.fr/bin/fasta-guess.cgi
  • Oak Ridge National Laboratory, GenQuest server: http://avalon.epm.ornl.gov/
  • European Bioinformatics Institute, Cambridge, UK: http://www.ebi.ac.uk/htbin/fasta.py?request
  • EMBL, Heidelberg, Germany: http://www.embl-heidelberg.de/cgi/fasta-wrapper-free
  • Munich Information Center for Protein Sequences (MIPS) at Max-Planck-Institut, Germany: http://speedy.mips.biochem.mpg.de/mips/programs/fasta.html
  • Institute of Biology and Chemistry of Proteins, Lyon, France: http://www.ibcp.fr/serv_main.html
  • Institut Pasteur, France: http://central.pasteur.fr/seqanal/interfaces/fasta.html
  • GenQuest at The Johns Hopkins University: http://www.bis.med.jhmi.edu/Dan/gq/gq.form.html
  • National Cancer Center of Japan: http://bioinfo.ncc.go.jp
14
BLAST Searches GenBank
  • BLAST: Basic Local Alignment Search Tool
  • The NCBI BLAST web server lets you compare your
    query sequence to various sections of GenBank
  • nr: non-redundant (main sections)
  • month: new sequences from the past few weeks
  • ESTs
  • human, Drosophila, yeast, or E. coli genomes
  • proteins (by automatic translation)
  • This is a VERY fast and powerful computer.

15
BLAST
  • Uses word matching like FASTA
  • Similarity matching of words (3 amino acids, 11
    bases)
  • does not require identical words.
  • If no words are similar, then no alignment
  • won't find matches for very short sequences
  • Does not handle gaps well
  • New gapped BLAST (BLAST 2) is better

16
BLAST Algorithm
17
BLAST Word Matching
  • MEAAVKEEISVEDEAVDKNI
  • MEA
  • EAA
  • AAV
  • AVK
  • VKE
  • KEE
  • EEI
  • EIS
  • ISV
  • ...

Break query into words
Break database sequences into words
18
Compare Word Lists
  • Database Sequence Word Lists
  • RTT AAQ
  • SDG KSS
  • SRW LLN
  • QEL RWY
  • VKI GKG
  • DKI NIS
  • LFC WDV
  • AAV KVR
  • PFR DEI
  • Query Word List
  • MEA
  • EAA
  • AAV
  • AVK
  • VKE
  • KEE
  • EEI
  • EIS
  • ISV

Compare word lists by hashing (allow near matches)
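
A minimal Python sketch of the word-list comparison, using the query and database words from the slides. One simplification is assumed here and is not BLAST's actual rule: a "near match" is taken to be a word that differs from the query word at no more than one position, whereas BLAST scores candidate words with a substitution matrix and keeps those above a threshold. The helper names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def words(seq, k=3):
    """All overlapping k-letter words of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def neighbors(word):
    """The word itself plus every word differing at exactly one position."""
    result = {word}
    for pos in range(len(word)):
        for aa in AMINO_ACIDS:
            result.add(word[:pos] + aa + word[pos + 1:])
    return result

def near_matches(query_words, database_words):
    """Pairs (query word, database word) found by hashed set lookup."""
    db = set(database_words)  # a Python set is a hash table
    pairs = []
    for qw in query_words:
        for candidate in sorted(neighbors(qw) & db):
            pairs.append((qw, candidate))
    return pairs

query_words = words("MEAAVKEEISV")
database_words = ["RTT", "AAQ", "SDG", "KSS", "SRW", "LLN", "QEL", "RWY", "VKI",
                  "GKG", "DKI", "NIS", "LFC", "WDV", "AAV", "KVR", "PFR", "DEI"]
matches = near_matches(query_words, database_words)
print(matches)
```

Under this assumption, the exact hit AAV/AAV is found together with near matches such as VKE/VKI and EEI/DEI.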
19
Find locations of matching words in database
sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
MEA EAA AAV AVK KLV KEE EEI EIS ISV
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSR
WNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
20
Extend hits one base at a time
21
Seq_XYZ HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Query QSVFDYIYYGCYCGWGLG_GK__PRDA
E-val = $10^{-13}$
  • Use two word matches as anchors to build an
    alignment between the query and a database
    sequence.
  • Then score the alignment.
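
Extending a hit one position at a time can be sketched as below. This is a simplified illustration with assumed details: it extends a single word hit without gaps, uses a hypothetical +1/-1 match/mismatch score instead of a substitution matrix, and stops when the running score drops more than `x_drop` below the best score seen. The function name and parameters are not taken from BLAST.

```python
def extend_hit(query, subject, q_pos, s_pos, k, x_drop=2, match=1, mismatch=-1):
    """Extend a k-letter word hit in both directions without gaps.

    Returns (score, q_start, q_end): the best segment score and the
    half-open range of the query that the segment covers.
    """
    def pair(a, b):
        return match if a == b else mismatch

    def extend(q_idx, s_idx, step):
        best = run = length = steps = 0
        while 0 <= q_idx < len(query) and 0 <= s_idx < len(subject):
            run += pair(query[q_idx], subject[s_idx])
            steps += 1
            if run > best:
                best, length = run, steps
            elif best - run > x_drop:
                break  # score fell too far below the best seen so far
            q_idx += step
            s_idx += step
        return best, length

    seed = sum(pair(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right, right_len = extend(q_pos + k, s_pos + k, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + right + left, q_pos - left_len, q_pos + k + right_len

# The word GTT occurs at position 2 of both toy sequences
result = extend_hit("ACGTTGCA", "TCGTTGCT", q_pos=2, s_pos=2, k=3)
print(result)
```

The returned segment is what the following slides call a high-scoring segment pair.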

22
HSPs are Aligned Regions
  • The results of the word matching and attempts to
    extend the alignment are segments
  • - called HSPs (High-scoring Segment Pairs)
  • BLAST often produces several short HSPs rather
    than a single aligned region

23
BLAST 2 algorithm
  • The NCBI BLAST website now uses BLAST 2
    (also known as gapped BLAST)
  • This algorithm is more complex than the original
    BLAST
  • It requires two word matches close to each other
    on a pair of sequences (i.e. with a gap) before
    it creates an alignment

24
Statistical tests
  • Evaluate the probability of an event taking place
    by chance (at random).
  • P-value
  • Randomized data
  • Distribution under the same setup
  • Z-score
  • Chebyshev Inequality

25
BLAST Statistics
  • For small values, the E value is approximately
    equal to the standard P value (based on the
    Karlin-Altschul theorem)
  • Significant if E < 0.05 (smaller numbers are more
    significant)
  • The E-value is the number of alignments scoring
    at least this well that are expected by chance
    alone. A value of 1 means that one alignment this
    good is expected by chance when a random sequence
    is searched against this database.
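
The relationship between E and P can be checked with a short Python sketch. It assumes the usual Karlin-Altschul forms, E = K·m·n·e^(-λS) and P = 1 - e^(-E); the values of `K` and `lam` below are placeholders chosen for illustration, not parameters of any real scoring matrix.

```python
import math

def expected_hits(score, query_len, db_len, K, lam):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S)."""
    return K * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of at least one chance hit this good: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

# Placeholder parameters (hypothetical, for illustration only)
K, lam = 0.1, 0.3
e = expected_hits(score=60, query_len=250, db_len=1_000_000, K=K, lam=lam)
print(e, p_value(e))
print(p_value(0.01), p_value(1.0))
```

For E = 0.01 the P value is about 0.00995, nearly the same number, while for E = 1 it is about 0.63, which is why the two are interchangeable only for small values.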

26
BLAST variants for different searches (after S.
Brenner, Trends Guide to Bioinformatics, 1998)
27
BLAST is Approximate
  • BLAST makes similarity searches very quickly
    because it takes shortcuts.
  • looks for short, nearly identical words (11
    bases)
  • It also makes errors
  • misses some important similarities
  • makes many incorrect matches
  • easily fooled by repeats or skewed composition

28
Interpretation of output
  • very low E values (e-100) are homologs or
    identical genes
  • moderate E values are related genes
  • a long list of gradually declining E values
    indicates a large gene family
  • long regions of moderate similarity are more
    significant than short regions of high identity

29
Biological Relevance
  • It is up to you, the biologist, to scrutinize
    these alignments and determine if they are
    significant.
  • Were you looking for a short region of nearly
    identical sequence or a larger region of general
    similarity?
  • Are the mismatches conservative ones?
  • Are the matching regions important structural
    components of the genes or just introns and
    flanking regions?

30
Borderline similarity
  • What to do with matches with E() values in the
    0.5-1.0 range?
  • this is the Twilight Zone
  • retest these sequences and look for related hits
    (not just your original query sequence)
  • similarity is transitive
  • if A ~ B and B ~ C, then A ~ C

31
Position Specific Iterated BLAST
  • Collect all database sequence segments that have
    been aligned with query sequence with E-value
    below set threshold (default 0.01)
  • Construct position specific scoring matrix for
    collected sequences. Rough idea:
  • Align all sequences to the query sequence as the
    template.
  • Assign weights to the sequences
  • Construct position specific scoring matrix
  • Iterate
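
The matrix-construction step can be sketched in Python as follows. This is a rough illustration under several assumptions that the slide does not specify: the collected segments are ungapped and of equal length, every sequence gets the same weight (the weighting step is omitted), the background distribution is uniform, and a pseudocount of 1 is added to every residue count. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(segments, pseudocount=1.0):
    """Log-odds score of each residue at each column of the aligned segments."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(segments[0])):
        counts = Counter(seg[col] for seg in segments)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_segment(pssm, segment):
    """Sum of position-specific scores for an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Toy segments standing in for database hits below the E-value threshold
pssm = build_pssm(["GKSG", "GKTG", "GKSA"])
print(score_segment(pssm, "GKSG"), score_segment(pssm, "AAAA"))
```

In the iteration, the matrix would replace the query in the next database search, and newly found segments would be added before rebuilding it.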

32
Motif finding
  • Observation: some regions have been better
    conserved than others during evolution
  • Idea: by analyzing the constant and variable
    properties of such groups of similar sequences,
    it is possible to derive a signature for a
    protein family or domain (motifs)

33
PROSITE patterns
  • PROSITE fingerprints are described by regular
    grammars
  • A number of programs can search databases for
    PROSITE patterns (for example, the GCG package)
  • Example: [EDQH]-x-K-x-[DN]-G-x-R-[GACV]
  • Rules:
  • Each position is separated by a hyphen
  • One character denotes the residue at a given
    position
  • [...] denotes a set of allowed residues
  • (n) denotes a repeat of n
  • (n,m) denotes a repeat of between n and m inclusive
  • Ex. ATP/GTP binding motif: [SG]-x(4)-G-K-[DT]
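
Because the patterns form a regular grammar, they translate directly into regular expressions. The sketch below is a hypothetical converter covering only the rules listed above, assuming the standard PROSITE notation in which square brackets enclose a set of allowed residues and `x` stands for any residue. The function name `prosite_to_regex` and the test sequence are made up for the example.

```python
import re

# One pattern element: a bracketed set or a single letter, with optional (n) or (n,m)
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a hyphen-separated PROSITE-style pattern to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."  # any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(residue + repeat)
    return "".join(parts)

regex = prosite_to_regex("[SG]-x(4)-G-K-[DT]")
print(regex)
print(re.search(regex, "MSAAAAGKTL").start())
```

The resulting expression can be run over every sequence in a database with `re.finditer` to list all occurrences of the motif.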

34
Multiple sequence alignment
35
Generalizing the Notion of Pairwise Alignment
  • Alignment of 2 sequences is represented as a
  • 2-row matrix
  • In a similar way, we represent alignment of 3
    sequences as a 3-row matrix
  • A T _ G C G _
  • A _ C G T _ A
  • A T C A C _ A
  • Score: the more conserved the columns, the
    better the alignment

36
Alignments Paths
  • Align 3 sequences ATGC, AATC,ATGC

A -- T G C
A A T -- C
-- A T G C
37
Alignment Paths

0 1 1 2 3 4
x coordinate
A -- T G C
A A T -- C
-- A T G C
38
Alignment Paths
  • Align the 3 sequences ATGC, AATC,ATGC

0 1 1 2 3 4
x coordinate
A -- T G C
y coordinate
0 1 2 3 3 4
A A T -- C
-- A T G C

39
Alignment Paths
0 1 1 2 3 4
x coordinate
A -- T G C
y coordinate
0 1 2 3 3 4
A A T -- C
z coordinate
0 0 1 2 3 4
-- A T G C
  • Resulting path in (x,y,z) space
  • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
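
The coordinates above can be reproduced with a few lines of Python. In this sketch (the function name is hypothetical) each row of the alignment is a string and a single `-` stands for a gap; a coordinate advances whenever its row contributes a letter to the column.

```python
def alignment_path(rows):
    """Path through the alignment grid: one coordinate per aligned sequence."""
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for r, symbol in enumerate(column):
            if symbol != "-":
                position[r] += 1  # this row consumed one letter
        path.append(tuple(position))
    return path

path = alignment_path(["A-TGC", "AAT-C", "-ATGC"])
print(path)
```

The same function works for any number of rows, giving a path in a grid with one dimension per sequence.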

40
Aligning Three Sequences
  • Same strategy as aligning two sequences
  • Use a 3-D Manhattan Cube, with each axis
    representing a sequence to align
  • For global alignments, go from source to sink

[Figure: 3-D Manhattan cube with the source and sink corners marked]
41
2-D vs 3-D Alignment Grid
[Figure: 2-D edit graph (axes V and W) beside a 3-D edit graph]
42
2-D cell versus 3-D Alignment Cell
In 2-D, 3 edges in each unit square
In 3-D, 7 edges in each unit cube
43
Architecture of 3-D Alignment Cell
(i-1,j,k-1)
(i-1,j-1,k-1)
(i-1,j,k)
(i-1,j-1,k)
(i,j,k-1)
(i,j-1,k-1)
(i,j,k)
(i,j-1,k)
44
Multiple Alignment Dynamic Programming
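  • $s_{i,j,k}$ = max, over the 7 edges entering cell
    (i,j,k), of the predecessor's score plus $\delta$ for
    the column that edge aligns
  • $\delta(x, y, z)$ is an entry in the 3-D scoring matrix
  • cube diagonal: no indels
  • face diagonal: one indel
  • edge diagonal: two indels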
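
A minimal Python sketch of the three-sequence dynamic program. The scoring function is an assumption made for the example, since the slides leave δ unspecified: here δ is the sum of the three pairwise scores in a column, with +1 for a match, -1 for a mismatch or a letter against a gap, and 0 for two gaps. Only the optimal score is computed, not the alignment itself.

```python
from itertools import product

def delta(x, y, z, match=1, mismatch=-1, gap=-1):
    """Assumed column score: sum of the three pairwise scores."""
    total = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == "-" and b == "-":
            continue            # gap against gap scores 0
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def align3_score(v, w, u):
    """Best global alignment score of three sequences via a 3-D table."""
    n, m, l = len(v), len(w), len(u)
    s = [[[float("-inf")] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 edges entering a cell: every non-zero step in {0,1}^3
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            column = (v[i - 1] if di else "-",
                      w[j - 1] if dj else "-",
                      u[k - 1] if dk else "-")
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(*column))
    return s[n][m][l]

print(align3_score("ATGC", "ATGC", "ATGC"))
print(align3_score("ATGC", "AATC", "ATGC"))
```

The triple loop visits (n+1)(m+1)(l+1) cells and tries 7 predecessors for each, which is the source of the cubic running time.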
45
Multiple Alignment Running Time
  • For 3 sequences of length n, the run time is
    $7n^3 = O(n^3)$
  • For k sequences, build a k-dimensional Manhattan,
    with run time $(2^k - 1)(n^k) = O(2^k n^k)$
  • Conclusion dynamic programming approach for
    alignment between two sequences is easily
    extended to k sequences but it is impractical due
    to exponential running time.

46
Profile Representation of Multiple Alignment
    -   A   G   G   C   T   A   T   C   A   C   C   T   G
    T   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   T   C   A   C   -   G   G
    C   A   G   -   C   T   A   T   C   G   C   -   G   G

A       1                   1          .8
C  .6               1          .4   1      .6  .2
G           1  .2                      .2          .4   1
T  .2                   1      .6                  .2
-  .2          .8                          .4  .8  .4
47
Profile Representation of Multiple Alignment
  • In the past we were aligning a sequence against a
    sequence
  • Can we align a sequence against a profile?
  • Can we align a profile against a profile?
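
A profile is just the table of per-column symbol frequencies, which a short Python sketch can compute. The five rows below are the slide's alignment with one assumption: the gap positions are placed where the profile frequencies on the slide require them, since the transcript does not preserve every gap. The function name is hypothetical.

```python
from collections import Counter

def profile(rows):
    """Frequency of each symbol (including the gap '-') in every column."""
    columns = []
    for column in zip(*rows):
        counts = Counter(column)
        columns.append({sym: n / len(rows) for sym, n in counts.items()})
    return columns

alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(alignment)
print(prof[0])   # first column
print(prof[3])   # fourth column
```

Aligning a sequence against a profile then only requires a column score, for example the frequency-weighted average of the pairwise scores between the sequence letter and each symbol in the column.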

48
Aligning alignments
  • Given two alignments, can we align them?

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
Alignment 2:
w GGACGTACC--
v GGACCT-----
49
Aligning alignments
  • Given two alignments, can we align them?
  • Hint use alignment of corresponding profiles

Combined Alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----
50
Multiple Alignment Greedy Approach
  • Choose the most similar pair of strings and combine
    them into a profile, thereby reducing the alignment
    of k sequences to an alignment of k-1
    sequences/profiles. Repeat
  • This is a heuristic greedy method

k sequences:
u1 ACGTACGTACGT
u2 TTAATTAATTAA
u3 ACTACTACTACT
...
uk CCGGCCGGCCGG
k-1 sequences/profiles:
u1 ACg/tTACg/tTACg/cT
u2 TTAATTAATTAA
...
uk CCGGCCGGCCGG
51
Greedy Approach Example
  • Consider these 4 sequences

s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
52
Greedy Approach Example (contd)
  • There are 6 possible pairwise alignments

s2 GTCTGA
s4 GTCAGC (score 2)

s1 GAT-TCA
s2 G-TCTGA (score 1)

s1 GAT-TCA
s3 GATAT-T (score 1)

s1 GATTCA--
s4 G-T-CAGC (score 0)

s2 G-TCTGA
s3 GATAT-T (score -1)

s3 GAT-ATT
s4 G-TCAGC (score -1)
53
Greedy Approach Example (contd)
s2 and s4 are closest: combine them
s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/c (profile)
new set of 3 sequences:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
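
The first greedy step can be sketched in Python: score every pair with a global alignment and pick the best pair. The scoring scheme is an assumption inferred from the example scores (+1 match, -1 mismatch, -1 indel), and the function name is hypothetical. Only the choice of the closest pair is shown, not the profile merge.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal global alignment score (Needleman-Wunsch, score only)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]

seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {(p, q): global_score(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
closest = max(scores, key=scores.get)
print(closest, scores[closest])
```

After merging the closest pair into a profile, the same selection would be repeated on the reduced set, with a profile-aware score in place of `global_score`.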
54
Progressive Alignment
  • Progressive alignment is a variation of the greedy
    algorithm with a somewhat more intelligent
    strategy for choosing the order of alignments.
  • Progressive alignment works well for close
    sequences, but deteriorates for distant sequences
  • Gaps in consensus string are permanent
  • Use profiles to compare sequences

55
ClustalW
  • Popular multiple alignment tool today
  • W stands for weighted (different parts of
    alignment are weighted differently).
  • Three-step process
  • 1.) Construct pairwise alignments
  • 2.) Build Guide Tree
  • 3.) Progressive Alignment guided by the tree

56
Step 1 Pairwise Alignment
  • Aligns each sequence against each other, giving a
    similarity matrix
  • Similarity = exact matches / sequence length
    (percent identity)

(.17 means 17% identical)
57
Step 2 Guide Tree
  • Create Guide Tree using the similarity matrix
  • ClustalW uses the neighbor-joining method
  • Guide tree roughly reflects evolutionary relations

58
Step 2 Guide Tree (contd)
v1 v3 v4 v2
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
59
Step 3 Progressive Alignment
  • Start by aligning the two most similar sequences
  • Following the guide tree, add in the next
    sequences, aligning to the existing alignment
  • Insert gaps as necessary

FOS_RAT    PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE  PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK  SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
. . . .. . .

Dots and stars show how well-conserved a column
is.
60
Multiple Alignments Scoring
  • Number of matches (multiple longest common
    subsequence score)
  • Entropy score
  • Sum of pairs (SP-Score)

61
Multiple LCS Score
  • A column is a match if all the letters in the
    column are the same
  • Only good for very similar sequences

AAA
AAA
AAT
ATC
62
Entropy
  • Define frequencies for the occurrence of each
    letter in each column of multiple alignment
  • $p_A = 1$, $p_T = p_G = p_C = 0$ (1st column)
  • $p_A = 0.75$, $p_T = 0.25$, $p_G = p_C = 0$ (2nd column)
  • $p_A = 0.50$, $p_T = 0.25$, $p_C = 0.25$, $p_G = 0$ (3rd column)
  • Compute entropy of each column

AAA
AAA
AAT
ATC
63
Entropy Example
Best case
Worst case
64
Multiple Alignment Entropy Score
Entropy for a multiple alignment is the sum of
entropies of its columns:
$\sum_{\text{columns}} \left( -\sum_{X \in \{A,T,G,C\}} p_X \log p_X \right)$
65
Entropy of an Alignment Example
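column entropy $= -(p_A \log p_A + p_C \log p_C + p_G \log p_G + p_T \log p_T)$
(logarithms are base 2, and $0 \log 0$ is taken as 0)
  • Column 1: $-[1 \log 1 + 0 \log 0 + 0 \log 0 + 0 \log 0] = 0$
  • Column 2: $-[(1/4)\log(1/4) + (3/4)\log(3/4) + 0 \log 0 + 0 \log 0] = -[(1/4)(-2) + (3/4)(-0.415)] = 0.811$
  • Column 3: $-[(1/4)\log(1/4) + (1/4)\log(1/4) + (1/4)\log(1/4) + (1/4)\log(1/4)] = -4 \cdot (1/4)(-2) = 2.0$
  • Alignment entropy $= 0 + 0.811 + 2.0 = 2.811$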

A A A
A C C
A C G
A C T
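
The entropy arithmetic for this example can be checked with a short Python sketch. It assumes base-2 logarithms and the convention that absent letters contribute nothing (0·log 0 = 0); the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (base 2) of the letters in one alignment column."""
    n = len(column)
    # Letters that do not occur are simply skipped, i.e. 0*log(0) = 0
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def alignment_entropy(rows):
    """Sum of the column entropies of a multiple alignment."""
    return sum(column_entropy(col) for col in zip(*rows))

rows = ["AAA", "ACC", "ACG", "ACT"]
per_column = [round(column_entropy(col), 3) for col in zip(*rows)]
print(per_column, round(alignment_entropy(rows), 3))
```

A lower total means more conserved columns, so an alignment procedure that uses this score tries to minimize it.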
66
Multiple Alignment Induces Pairwise Alignments
  • Every multiple alignment induces pairwise
    alignments
  • x AC-GCGG-C
  • y AC-GC-GAG
  • z GCCGC-GAG
  • Induces:
  • x ACGCGG-C    x AC-GCGG-C    y AC-GCGAG
  • y ACGC-GAG    z GCCGC-GAG    z GCCGCGAG
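
Inducing a pairwise alignment amounts to taking two rows of the multiple alignment and dropping the columns in which both rows have a gap. A minimal Python sketch, with a hypothetical function name:

```python
def induced_pairwise(row_a, row_b):
    """Pairwise alignment induced by two rows of a multiple alignment."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```

The induced alignments are valid pairwise alignments, but they need not be the optimal ones for each pair.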

67
Sum of Pairs Score(SP-Score)
  • Consider the pairwise alignment of sequences
  • $a_i$ and $a_j$
  • imposed by a multiple alignment of k
    sequences
  • Denote the score of this suboptimal (not
    necessarily optimal) pairwise alignment as
  • $s(a_i, a_j)$
  • Sum up the pairwise scores for a multiple
    alignment
  • $s(a_1, \ldots, a_k) = \sum_{i<j} s(a_i, a_j)$

68
Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given $a_1, a_2, a_3, a_4$:
$s(a_1 \ldots a_4) = \sum s(a_i, a_j) = s(a_1,a_2) + s(a_1,a_3) + s(a_1,a_4) + s(a_2,a_3) + s(a_2,a_4) + s(a_3,a_4)$
69
SP-Score Example
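a1 ATG-C-AAT
.  A-G-CATAT
ak ATCCCATTT
To calculate each column, sum s over all pairs of sequences in it:
  • Column 1 (A, A, A): three matching pairs, each scoring 1, so Score = 3
  • Column 3 (G, G, C): one matching pair scoring 1 and two mismatched pairs scoring $-\mu$ each, so Score = $1 - 2\mu$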
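
The column scores in this example can be reproduced with a small Python sketch. The scoring is partly assumed: a matching pair scores 1 and a mismatched pair scores -μ as on the slide, while the treatment of gaps (a letter against a gap scores -μ, two gaps score 0) is an assumption added so that every column can be scored. The function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, mu):
    """Score of one aligned pair of symbols."""
    if a == "-" and b == "-":
        return 0.0        # assumption: gap against gap is neutral
    if a == b:
        return 1.0        # match
    return -mu            # mismatch; assumption: letter vs gap costs the same

def sp_column(column, mu):
    """Sum-of-pairs score of a single column."""
    return sum(pair_score(a, b, mu) for a, b in combinations(column, 2))

def sp_score(rows, mu):
    """Sum-of-pairs score of the whole multiple alignment."""
    return sum(sp_column(col, mu) for col in zip(*rows))

rows = ["ATG-C-AAT", "A-G-CATAT", "ATCCCATTT"]
columns = list(zip(*rows))
print(sp_column(columns[0], mu=1.0))   # column 1: A, A, A
print(sp_column(columns[2], mu=1.0))   # column 3: G, G, C
```

Summing column by column gives the same total as summing the scores of the induced pairwise alignments, because both count every aligned pair of symbols exactly once.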
70
Multiple Alignment History
  • 1975 Sankoff
  • Formulated multiple alignment problem and gave
    dynamic programming solution
  • 1988 Carrillo-Lipman
  • Branch and Bound approach for MSA
  • 1990 Feng-Doolittle
  • Progressive alignment
  • 1994 Thompson-Higgins-Gibson-ClustalW
  • Most popular multiple alignment program
  • 1998 Morgenstern et al.-DIALIGN
  • Segment-based multiple alignment
  • 2000 Notredame-Higgins-Heringa-T-coffee
  • Using the library of pairwise alignments
  • 2004 MUSCLE