Title: Statistical Significance for Local Sequence Alignments, Multiple Sequence Alignments, Profile-based Sequence Search
1 Statistical Significance for Local Sequence Alignments, Multiple Sequence Alignments, Profile-based sequence search
- A. Brüngger, Labhead Bioinformatics, Novartis Pharma AG
- adrian.bruengger_at_pharma.novartis.com
2 Local Alignments, Multiple Sequence Alignments, Profile-based Searches
- Statistical Significance of Local Sequence Alignments
  - Fast identification of identical k-mers: implementation detail
  - Facts about the statistical significance of local alignments
  - Statistical significance of local alignments in BLAST
- Multiple Sequence Alignments
  - Dynamic Programming
  - Iterative approach for multiple sequence alignments: CLUSTALW
  - Construction of phylogenetic trees
- Profile-based sequence searches
  - Patterns, motifs, profiles as domain footprints
  - Databases of patterns and profiles: PROSITE
  - Example: sensitivity/selectivity of a pattern for ubiquitin conjugating enzymes
Outline follows David W. Mount, "Bioinformatics: Sequence and Genome Analysis", Cold Spring Harbor Laboratory Press, 2001. Online: http://www.bioinformaticsonline.org
3 Local Sequence Alignments: Fast identification of k-mers
query:
MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE
database:
ADAYYVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP
- Fast identification of k-mers
  - Indexing of all possible k-mers in database
  - Index of a k-mer is easily computed
    - A = 0, C = 1, D = 2, ..., Y = 19
    - Index of "ADAYY":
        residue   A      D      A      Y      Y
        code      0      2      0      19     19
        weight    20^4   20^3   20^2   20^1   20^0
        product   0      16000  0      380    19
        sum       16399
- Index table (k = 5): index, k-mer, position in database
    0          AAAAA   -
    1          AAAAC   -
    2          AAAAD   -
    ...
    16399      ADAYY   1
    ...        ALLLQ   7
    ...        VALLL   6
    ...
    20^5 - 1   YYYYY   -
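The indexing scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular search tool. It assumes the 20 amino acids are coded in alphabetical order of their one-letter codes (consistent with A = 0, C = 1, D = 2, Y = 19) and that positions are counted from 1. The function names `kmer_index`, `build_index` and `shared_kmers` are made up for this example.

```python
# One-letter amino acid codes in alphabetical order: A=0, C=1, D=2, ..., Y=19.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def kmer_index(kmer):
    """Read the k-mer as a base-20 number (first residue is most significant)."""
    index = 0
    for residue in kmer:
        index = index * 20 + CODE[residue]
    return index


def build_index(database, k=5):
    """Map k-mer index -> list of 1-based start positions in the database."""
    table = {}
    for pos in range(len(database) - k + 1):
        table.setdefault(kmer_index(database[pos:pos + k]), []).append(pos + 1)
    return table


def shared_kmers(query, table, k=5):
    """Return (query position, database position) pairs of identical k-mers."""
    hits = []
    for pos in range(len(query) - k + 1):
        for db_pos in table.get(kmer_index(query[pos:pos + k]), []):
            hits.append((pos + 1, db_pos))
    return hits


query = "MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE"
database = ("ADAYYVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLT"
            "RPRYGKRAEEENTGGLP")

table = build_index(database)
print(kmer_index("ADAYY"))          # index of the first database 5-mer
print(shared_kmers(query, table))   # identical 5-mers shared by query and database
```

Each query k-mer is looked up in the table once, so finding all identical k-mers takes time proportional to the query length plus the number of hits, independent of how the database was scanned to build the table.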
4 Statistical Significance of Local Sequence Alignments
- Probability that q (len = m) is contained in t (len = n)
  - a. There are about (n - m) 'words' of length m in t (exactly n - m + 1)
  - b. In total, there are 4^m sequences of length m
  - c. p = (n - m) / 4^m
This is wrong. Why?
- The (n - m) words in t are not independent: neighbouring words overlap!
- Therefore
  - probability p is in general hard to compute (no analytical solution known so far!)
  - and this is the simplest case! Complications: scoring, gapped alignment
  - approximation formulas are used in BLAST/FASTA
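A toy experiment makes the dependence between overlapping words visible. The sketch below is an illustration under assumptions, not part of the lecture: it uses a uniform 4-letter alphabet, a very short text (n = 8) so that all 4^8 texts can be enumerated, and it counts the n - m + 1 windows exactly. The function names are made up.

```python
from itertools import product


def naive_probability(m, n, alphabet_size=4):
    """Naive estimate: number of windows divided by number of possible words."""
    return (n - m + 1) / alphabet_size ** m


def exact_probability(word, n, alphabet="ACGT"):
    """Exact probability that `word` occurs at least once in a uniformly
    random text of length n, by enumerating every possible text."""
    hits = sum(1 for letters in product(alphabet, repeat=n)
               if word in "".join(letters))
    return hits / len(alphabet) ** n


n = 8
print("naive      :", naive_probability(2, n))
print("exact 'AC' :", exact_probability("AC", n))
print("exact 'AA' :", exact_probability("AA", n))
```

The naive value is the same for every word of length 2, but the exact values differ: a word that overlaps itself ("AA") tends to occur in clusters and is therefore found in fewer texts than a word that cannot overlap itself ("AC"). Both exact values are below the naive estimate.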
- Probability of a certain score of a local alignment between t and q
  - follows an extreme value distribution (not a normal distribution)
  - the extreme value distribution changes with the residue composition of the sequences
  - FASTA plots the distribution of observed scores vs. expected scores
  - BLAST assigns an E-value to each hit
    - expected number of chance hits with a score at least as high as that of the hit
    - considers residue composition of query and database
    - the same local alignment has different E-values if it occurs for different DBs!
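The dependence of the E-value on database size can be illustrated with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), on which BLAST statistics are based. The slides do not state this formula, so the sketch below is an illustration under assumptions: the values of `lam` and `k` are placeholders (in practice they depend on the scoring system and on residue composition), and the function names are made up.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """E-value: expected number of chance hits with at least this score."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Placeholder parameters, NOT taken from the slides.
LAM, K = 0.318, 0.134

for db_len in (1_000_000, 10_000_000, 100_000_000):
    e = expected_hits(score=50, query_len=60, db_len=db_len, lam=LAM, k=K)
    print(f"database length {db_len:>11,}: E = {e:.3g}, P = {p_value(e):.3g}")
```

The same alignment score yields an E-value ten times larger in a database ten times longer, which is why one and the same local alignment is reported with different E-values for different databases.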
5 Sample BLAST output
- visualization of hits
- summary of hits, containing E-values
6 Multiple Sequence Alignment
- Dynamic Programming for multiple sequence alignments
  - Dynamic Programming can theoretically be used for aligning k > 2 sequences
    - instead of a 2-dimensional scoring matrix, a k-dimensional scoring matrix is used
    - instead of a 2-dimensional dynamic programming scheme, a k-dimensional one is used
  - However, in practical terms, not feasible
    - the k-dimensional scoring matrix uses exponential space
    - example: 5 sequences, 100 residues each; the dynamic programming scheme has 100^5 = 10^10 cells, at least 10 GB
  - (Mathematically) optimal solution for aligning multiple sequences not feasible!
  - Approximations used
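The space estimate above is easy to reproduce. The sketch below is a back-of-the-envelope calculation only. It assumes one byte per cell (a real implementation needs more) and follows the slide in counting length^k cells. The function names are made up.

```python
def dp_cells(num_sequences, length):
    """Number of cells in a k-dimensional dynamic programming scheme."""
    return length ** num_sequences


def dp_memory_gb(num_sequences, length, bytes_per_cell=1):
    """Memory in GB (10^9 bytes), assuming `bytes_per_cell` bytes per cell."""
    return dp_cells(num_sequences, length) * bytes_per_cell / 1e9


for k in range(2, 7):
    print(f"{k} sequences of 100 residues: "
          f"{dp_cells(k, 100):.0e} cells, {dp_memory_gb(k, 100):g} GB")
```

Every additional sequence multiplies the requirement by the sequence length, so the scheme that is trivial for two sequences is already out of reach for five or six.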
- Idea of progressive methods
  1. align two sequences
  2. compute "consensus" sequence
  3. align consensus with third sequence
  4. repeat steps 2/3 for each further sequence
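The progressive idea can be sketched as follows. This is a toy illustration under assumptions, not the algorithm of any specific program. It uses global pairwise alignment with made-up scores (match +1, mismatch -1, gap -2), takes the most frequent residue per column as the "consensus", and adds the sequences in the order given. All function names are made up.

```python
from collections import Counter


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(rows):
    """Most frequent non-gap residue in each alignment column."""
    return "".join(
        Counter(c for c in column if c != "-").most_common(1)[0][0]
        for column in zip(*rows))


def progressive_align(seqs):
    """Align the first two sequences, then add the others one at a time."""
    rows = list(needleman_wunsch(seqs[0], seqs[1]))
    for new in seqs[2:]:
        gapped_cons, gapped_new = needleman_wunsch(consensus(rows), new)
        # Gaps opened in the consensus become gap columns in all existing rows.
        widened = [[] for _ in rows]
        col = 0
        for c in gapped_cons:
            for out, old in zip(widened, rows):
                out.append("-" if c == "-" else old[col])
            if c != "-":
                col += 1
        rows = ["".join(r) for r in widened] + [gapped_new]
    return rows


for row in progressive_align(["GATTACA", "GATACA", "GTTACA"]):
    print(row)
```

Gaps introduced in an early step are never revised later, which is the price paid for avoiding the k-dimensional scheme. The order in which sequences are added therefore matters.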
7 Multiple Sequence Alignments: CLUSTALW (1)
- Basic steps
  - step 1: Perform pairwise sequence alignments of all possible pairs
  - step 2: Use scores to produce a phylogenetic tree
  - step 3: Align the sequences sequentially, guided by the tree
    - the most closely related sequences are aligned first
    - other sequences or groups of sequences are added
- Example: Phylogenetic tree for 4 sequences
        A     B     C     D
  A   100    90    85    90
  B         100    95    80
  C               100    97
  D                     100
  Percentage identities in pairwise alignment
8 Multiple Sequence Alignments: CLUSTALW (2)
- Computation of Phylogenetic tree
        A     B     C     D
  A   100    95    80    90
  B         100    95    85
  C               100    97
  D                     100
  Percentage identities in pairwise alignment
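How a tree can be derived from such a table of pairwise identities is sketched below. Two things here are assumptions not taken from the slides: distance is defined as 100 minus percentage identity, and clusters are joined by average linkage (UPGMA-style), a simple stand-in. CLUSTALW itself builds its guide tree with the neighbor-joining method. The function names are made up.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
    return total / (len(c1[1]) * len(c2[1]))


def guide_tree(names, identity):
    """Repeatedly join the two closest clusters.

    `identity` maps frozenset({a, b}) to percentage identity.
    Returns the tree as nested tuples plus the list of joins in order.
    """
    dist = {pair: 100.0 - pct for pair, pct in identity.items()}
    clusters = [(name, [name]) for name in names]   # (subtree, leaves)
    joins = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]],
                                                   clusters[ij[1]], dist))
        left, right = clusters[i], clusters[j]
        joins.append((left[0], right[0]))
        merged = ((left[0], right[0]), left[1] + right[1])
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters[0][0], joins


# Percentage identities from the table on this slide.
identity = {frozenset(pair): pct for pair, pct in {
    ("A", "B"): 95, ("A", "C"): 80, ("A", "D"): 90,
    ("B", "C"): 95, ("B", "D"): 85, ("C", "D"): 97}.items()}

tree, joins = guide_tree(["A", "B", "C", "D"], identity)
print(tree)
```

With these numbers C and D (97% identical) are joined first, then A and B, and finally the two pairs. In a progressive alignment this join order is the order in which sequences and groups are aligned.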
9 Multiple Sequence Alignments: CLUSTALW (3)
10 Patterns/Motifs as domain footprints
- Available implementations
11 Multiple sequence alignment of human ubiquitin conjugating enzymes
UBE2D2    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697  FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1    FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1    FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9     FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N     LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016  IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10    FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34     FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156  FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1    FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B     FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I     FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5    LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1    FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6    FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H     LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12     VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
Pattern (from this alignment): [FYW]-H-[PC]-N-[IV]-x(3,4)-G-x-[IV]-C-[LIV]-x-[IV]
PROSITE: [FYWLSP]-H-[PC]-[NH]-[LIV]-x(3,4)-G-x-[LIV]-C-[LIV]-x-[LIV]
- A pattern can be used to scan sequence databases for its occurrence, or
- a sequence can be searched against a database of patterns
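Scanning a sequence for such a pattern amounts to a regular expression search. The sketch below is a minimal illustration with an assumption flagged in the code: the square brackets around the residue sets are restored here, since they are not visible in the slide text. It handles only the PROSITE syntax elements needed for this pattern (residue sets, `x`, and repeat counts), and the function name `prosite_to_regex` is made up.

```python
import re

ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                 # x = any residue
            core = "."
        elif core.startswith("{"):      # {ABC} = any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        if low is not None:             # (n) or (n,m) = repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


# Pattern derived from the alignment; brackets are an assumed restoration.
derived = "[FYW]-H-[PC]-N-[IV]-x(3,4)-G-x-[IV]-C-[LIV]-x-[IV]"
regex = re.compile(prosite_to_regex(derived))

# Two rows of the alignment (gaps are removed before scanning) and the
# query peptide from the k-mer slide as a sequence that should not match.
ube2d2 = "FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK"
ube2h = "LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN"
other = "MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE"

for name, seq in (("UBE2D2", ube2d2), ("UBE2H", ube2h), ("other", other)):
    hit = regex.search(seq.replace("-", ""))
    print(name, hit.group() if hit else "no match")
```

Sensitivity and selectivity of a pattern can be estimated the same way: count how many known family members it matches and how many matches it produces among unrelated sequences.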
12 Patterns as footprints of functional protein domains: PROSITE
- Example: Ubiquitin conjugating enzyme pattern entry
13 Scanning a sequence vs. profile/pattern database: InterProScan (EBI)
- Example: EphB2 (ephrin B2 receptor, receptor tyrosine kinase)
14 Conclusions
- Statistical significance of local alignments
  - no analytical formula for probability of a certain local alignment score
  - local alignment scores follow extreme value distribution (not normal)
  - BLAST reports expectation value, dependent on residue composition and size of sequences
- Multiple alignments
  - dynamic programming not feasible -> approximation used
  - progressive methods (align two seqs, consensus, align third)
  - CLUSTALW: all pairwise alignments -> guide tree
- Patterns, Motifs, Profiles
  - characteristics of domains
  - pattern/profile databases: PROSITE
  - search with pattern vs. sequence databases
  - search with sequence vs. pattern database: INTERPROSCAN
  - Hidden Markov Models (HMM), Generalized profiles