Statistical Significance for Local Sequence Alignments, Multiple Sequence Alignments, Profile-based Sequence Search

1
Statistical Significance for Local Sequence Alignments
Multiple Sequence Alignments
Profile-based sequence search

A. Brüngger, Labhead Bioinformatics
Novartis Pharma AG
adrian.bruengger_at_pharma.novartis.com

2
Local Alignments, Multiple Sequence Alignments,
Profile-based Searches

Statistical Significance of Local Sequence Alignments
  Fast identification of identical k-mers: implementation detail
  Facts about the statistical significance of local alignments
  Statistical significance of local alignments in BLAST
Multiple Sequence Alignments
  Dynamic Programming
  Iterative approach for multiple sequence alignments: CLUSTALW
  Construction of phylogenetic trees
Profile-based sequence searches
  Patterns, motifs, profiles as domain footprints
  Databases of patterns and profiles: PROSITE
  Example: sensitivity/selectivity of a pattern for ubiquitin conjugating enzymes

Outline follows David W. Mount, "Bioinformatics: Sequence and Genome Analysis", Cold Spring Harbor Laboratory Press, 2001. Online: http://www.bioinformaticsonline.org
3
Local Sequence Alignments: Fast identification of k-mers

query:
MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE

database:
ADAYYVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP

Fast identification of k-mers
Indexing of all possible k-mers in database
Index of a k-mer is easily computed: A = 0, C = 1, D = 2, ..., Y = 19

Index of "ADAYY":

letter        A        D        A        Y        Y
value         0        2        0       19       19
weight     20^4     20^3     20^2     20^1     20^0
product       0    16000        0      380       19     sum = 16399

index       k-mer    position in database
0           AAAAA    -
1           AAAAC    -
2           AAAAD    -
...
16399       ADAYY    1
...
...         ALLLQ    7
...
...         VALLL    6
...
20^5 - 1    YYYYY    -
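
The indexing scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular search tool: the function names (`kmer_index`, `build_index`, `shared_kmers`) are made up for this example, and a dictionary stands in for the full table of 20^5 entries. Positions are 1-based, as on the slide.

```python
from collections import defaultdict

# Alphabetical one-letter amino acid codes: A = 0, C = 1, D = 2, ..., Y = 19
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(ALPHABET)}

QUERY = "MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE"
DATABASE = "ADAYYVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP"


def kmer_index(kmer):
    """Read the k-mer as a base-20 number (hypothetical helper name)."""
    index = 0
    for aa in kmer:
        index = index * len(ALPHABET) + CODE[aa]
    return index


def build_index(database, k):
    """Map each k-mer index to the 1-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(database) - k + 1):
        table[kmer_index(database[pos:pos + k])].append(pos + 1)
    return table


def shared_kmers(query, table, k):
    """Return (k-mer, query position, database position) for identical k-mers."""
    hits = []
    for pos in range(len(query) - k + 1):
        kmer = query[pos:pos + k]
        for db_pos in table.get(kmer_index(kmer), []):
            hits.append((kmer, pos + 1, db_pos))
    return hits


table = build_index(DATABASE, 5)
hits = shared_kmers(QUERY, table, 5)
```

Each lookup is a single table access, so the query is scanned in time proportional to its length plus the number of hits, independent of how long the database is.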
4
Statistical Significance of Local Sequence
Alignments

Probability that q (length m) is contained in t (length n):
a. There are (n-m+1) 'words' of length m in t
b. In total, there are 4^m sequences of length m
c. p = (n-m+1) / 4^m

This is wrong. Why?

The (n-m+1) words in t are not independent!
Therefore:
  probability p is in general hard to compute (no analytical solution known so far!)
  and this is the simplest case! Complications: scoring, gapped alignment
  approximation formulas are used in BLAST/FASTA
    probability of a certain score of a local alignment between t and q follows an extreme value distribution (not a normal distribution)
    extreme value distribution changes with residue composition of sequences
FASTA plots distribution of observed scores vs. expected scores
BLAST assigns an E-value to each hit
  expected number of hits, by chance, with a score at least as high as that of the hit
  considers residue composition of query and database
  the same local alignment has different E-values if it occurs for different DBs!
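
The dependence of the E-value on database size can be illustrated with the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the alignment score. The sketch below is an assumption-laden illustration: the formula is the standard one for ungapped local alignments, but the values of `LAMBDA` and `K` are placeholders chosen for this example (in practice they depend on the scoring matrix and residue composition), and the function names are hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, lam, k_param):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k_param * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit with this score or better."""
    return 1.0 - math.exp(-e_value)


# Placeholder parameters for illustration only (assumed, not from the slides)
LAMBDA, K = 0.3, 0.1

# Same query (60 residues), same score, two database sizes
small_db = expected_hits(60, 60, 1_000_000, LAMBDA, K)
large_db = expected_hits(60, 60, 100_000_000, LAMBDA, K)
```

With everything else fixed, E grows linearly with the database length, which is why the same local alignment receives different E-values in different databases.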

5
Sample BLAST output
visualization of hits
summary of hits, containing E-Values
6
Multiple Sequence Alignment

Dynamic Programming for multiple sequence alignments
Dynamic Programming can theoretically be used for aligning k > 2 sequences
  instead of a 2-dimensional scoring matrix, a k-dimensional scoring matrix is used
  instead of a 2-dimensional dynamic programming scheme, a k-dimensional one is used
However, in practical terms, not feasible
  k-dimensional scoring matrix uses exponential space
  example: 5 sequences, 100 residues each: dynamic programming scheme has 100^5 cells, at least 10 GB
(Mathematically) optimal solution for aligning multiple sequences not feasible!
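
The memory estimate above is simple arithmetic and can be checked directly. The sketch assumes one byte per cell, which is the most optimistic case, and follows the slide in counting 100 cells per axis; the function names are made up for this example.

```python
def dp_cells(seq_len, num_seqs):
    """Cells in a num_seqs-dimensional scheme with seq_len cells per axis."""
    return seq_len ** num_seqs


def dp_gigabytes(seq_len, num_seqs, bytes_per_cell=1):
    """Memory in GB (10^9 bytes), assuming bytes_per_cell per cell."""
    return dp_cells(seq_len, num_seqs) * bytes_per_cell / 1e9


pairwise = dp_cells(100, 2)   # two sequences: 10^4 cells
five_way = dp_cells(100, 5)   # five sequences: 10^10 cells
```

Each additional sequence multiplies the table size by the sequence length, so the growth is exponential in the number of sequences.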
Approximations used
Idea of progressive methods:
  1. align two sequences
  2. compute "consensus" sequence
  3. align consensus with third sequence
  4. repeat steps 2/3
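
The progressive idea can be sketched as follows. This is a deliberately simplified illustration under several assumptions: pairwise alignment is a plain Needleman-Wunsch global alignment with a match/mismatch/linear-gap score (no substitution matrix), the "consensus" is the most frequent residue per column, and sequences are added in the order given rather than in a tree-guided order. All function names are hypothetical.

```python
from collections import Counter


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(rows):
    """Most frequent non-gap residue in each alignment column."""
    return "".join(
        Counter(c for c in column if c != "-").most_common(1)[0][0]
        for column in zip(*rows)
    )


def progressive_align(seqs):
    """Align two sequences, then add the others one by one via the consensus."""
    rows = list(needleman_wunsch(seqs[0], seqs[1]))
    for seq in seqs[2:]:
        cons_aln, seq_aln = needleman_wunsch(consensus(rows), seq)
        new_rows = [[] for _ in rows]
        col = 0
        for c in cons_aln:
            if c == "-":          # gap opened in the consensus: pad all old rows
                for r in new_rows:
                    r.append("-")
            else:                 # copy the existing alignment column
                for r, old in zip(new_rows, rows):
                    r.append(old[col])
                col += 1
        rows = ["".join(r) for r in new_rows] + [seq_aln]
    return rows


peptides = ["IYHPNINSNGSICLDILR", "IYHCNINSQGVICLDILK", "IFHPNIDEASGTVCLDVIN"]
alignment = progressive_align(peptides)
```

Gaps introduced at an early step are never revisited, which is the main reason the result is an approximation rather than the mathematically optimal alignment.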

7
Multiple Sequence Alignments CLUSTALW (1)

Basic steps
  step 1: Perform pairwise sequence alignments of all possible pairs
  step 2: Use scores to produce a phylogenetic tree
  step 3: Align the sequences sequentially, guided by the tree
    the most closely related sequences are aligned first
    other sequences or groups of sequences are added
Example: Phylogenetic tree for 4 sequences

Percentage identities in pairwise alignment:
      A     B     C     D
A   100    90    85    90
B         100    95    80
C               100    97
D                     100
8
Multiple Sequence Alignments CLUSTALW (2)

Computation of phylogenetic tree

Percentage identities in pairwise alignment:
      A     B     C     D
A   100    95    80    90
B         100    95    85
C               100    97
D                     100
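
How a tree can be derived from such a matrix is sketched below. Note the assumption: CLUSTALW builds its guide tree with neighbor joining, whereas this example uses simpler average-linkage clustering (UPGMA-style) as a stand-in, with distance defined as 100 minus percentage identity. The function name `upgma` and the nested-tuple tree representation are choices made for this illustration.

```python
from itertools import combinations


def upgma(labels, identity):
    """Average-linkage clustering; returns the tree as nested tuples.

    identity maps frozenset({a, b}) to the percentage identity of a and b.
    """
    clusters = {label: (label, 1) for label in labels}   # name -> (tree, size)
    dist = {frozenset(p): 100.0 - identity[frozenset(p)]
            for p in combinations(labels, 2)}
    while len(clusters) > 1:
        # Join the two closest clusters
        x, y = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        tree_x, size_x = clusters.pop(x)
        tree_y, size_y = clusters.pop(y)
        merged = f"({x},{y})"
        # Distance to the merged cluster is the size-weighted average
        for z in clusters:
            dist[frozenset((merged, z))] = (
                size_x * dist[frozenset((x, z))] + size_y * dist[frozenset((y, z))]
            ) / (size_x + size_y)
        clusters[merged] = ((tree_x, tree_y), size_x + size_y)
    return next(iter(clusters.values()))[0]


# Percentage identities from the matrix above
identity = {
    frozenset("AB"): 95, frozenset("AC"): 80, frozenset("AD"): 90,
    frozenset("BC"): 95, frozenset("BD"): 85, frozenset("CD"): 97,
}
tree = upgma("ABCD", identity)
```

For this matrix, C and D (97% identity) are joined first, then A and B, and finally the two pairs are joined, which is the order in which a progressive alignment would combine them.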
9
Multiple Sequence Alignments CLUSTALW (3)

CLUSTALW
seven globins

10
Patterns/Motifs as domain footprints

Available implementations

11
Multiple sequence alignment of human ubiquitin
conjugating enzymes

UBE2D2    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697  FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1    FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1    FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9     FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N     LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016  IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10    FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34     FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156  FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1    FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B     FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I     FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5    LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1    FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6    FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H     LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12     VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

Pattern: [FYW]-H-[PC]-N-[IV]-x(3,4)-G-x-[IV]-C-[LIV]-x-[IV]
PROSITE: [FYWLSP]-H-[PC]-[NH]-[LIV]-x(3,4)-G-x-[LIV]-C-[LIV]-x-[LIV]
The pattern can be used to scan databases for its occurrence, or a sequence can be searched against a database of patterns
12
Patterns as footprints of functional protein
domains PROSITE

Example Ubiquitin conjugating enzyme pattern
entry

13
Scanning a sequence vs. profile/pattern database
InterProScan (EBI)

Example: EphB2 (ephrin type-B receptor 2, a receptor tyrosine kinase)

14
Conclusions

Statistical significance of local alignments
  no analytical formula for probability of a certain local alignment score
  local alignment scores follow extreme value distribution (not normal)
  BLAST reports expectation value, dependent on residue composition and size of sequences
Multiple alignments
  dynamic programming not feasible -> approximation used
  progressive methods (align two seqs, consensus, align third)
  CLUSTALW: all pairwise alignments -> guide tree
Patterns, Motifs, Profiles
  characteristics of domains
  pattern/profile databases: PROSITE
  search with pattern vs. sequence databases
  search with sequence vs. pattern database: INTERPROSCAN
  Hidden Markov Models (HMM), generalized profiles