Statistical Significance for Local Sequence Alignments, Multiple Sequence Alignments, Profile-based Sequence Search

1
Statistical Significance for Local Sequence Alignments
Multiple Sequence Alignments
Profile-based Sequence Search
  • A. Brüngger, Lab Head Bioinformatics, Novartis
    Pharma AG
  • adrian.bruengger_at_pharma.novartis.com

2
Local Alignments, Multiple Sequence Alignments,
Profile-based Searches
  • Statistical Significance of Local Sequence
    Alignments
      • Fast identification of identical k-mers:
        implementation detail
      • Facts about the statistical significance of
        local alignments
      • Statistical significance of local alignments
        in BLAST
  • Multiple Sequence Alignments
      • Dynamic Programming
      • Iterative approach for multiple sequence
        alignments: CLUSTALW
      • Construction of phylogenetic trees
  • Profile-based sequence searches
      • Patterns, motifs, profiles as domain footprints
      • Databases of patterns and profiles: PROSITE
      • Example: sensitivity/selectivity of a pattern
        for ubiquitin conjugating enzymes

Outline follows David W. Mount, "Bioinformatics:
Sequence and Genome Analysis", Cold Spring
Harbor Laboratory Press, 2001. Online:
http://www.bioinformaticsonline.org
3
Local Sequence Alignments: Fast identification of
k-mers
query:
MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE
database:
ADAYYVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP
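  • Fast identification of k-mers
  • Indexing of all possible k-mers in the database
  • The index of a k-mer is easily computed: read the
    k-mer as a base-20 number with
    A = 0, C = 1, D = 2, ..., Y = 19
  • Index of "ADAYY":

    residue     A      D      A      Y      Y
    code        0      2      0     19     19
    weight   20^4   20^3   20^2   20^1   20^0
    product     0  16000      0    380     19    sum = 16399

  • Resulting index table (k = 5), listing the database
    position at which each k-mer starts:

    index     k-mer   position
    0         AAAAA   -
    1         AAAAC   -
    2         AAAAD   -
    ...
    16399     ADAYY   1
    ...       ALLLQ   7
    ...
    ...       VALLL   6
    ...
    20^5-1    YYYYY   -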
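The indexing scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular search tool: the function names (kmer_index, build_index, shared_kmers) are made up for this example, the lookup table is a dictionary rather than a dense array of 20^5 slots, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # A = 0, C = 1, D = 2, ..., Y = 19
CODE = {residue: i for i, residue in enumerate(ALPHABET)}

QUERY = "MAAAPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSE"
DATABASE = "ADAYYVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP"


def kmer_index(kmer):
    """Read the k-mer as a base-20 number (Horner's scheme)."""
    index = 0
    for residue in kmer:
        index = index * len(ALPHABET) + CODE[residue]
    return index


def build_index(sequence, k=5):
    """Map k-mer index -> list of 1-based start positions in the sequence."""
    table = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        table[kmer_index(sequence[start:start + k])].append(start + 1)
    return table


def shared_kmers(query, table, k=5):
    """Return (query position, database position, k-mer) for identical k-mers."""
    hits = []
    for start in range(len(query) - k + 1):
        kmer = query[start:start + k]
        for db_pos in table.get(kmer_index(kmer), []):
            hits.append((start + 1, db_pos, kmer))
    return hits


table = build_index(DATABASE)
hits = shared_kmers(QUERY, table)

print(kmer_index("ADAYY"))          # 16399, as computed on the slide
print(table[kmer_index("ADAYY")])   # [1]
print(hits[:3])
```

Each query k-mer costs one index computation and one table lookup, which is what makes the seed search fast compared with comparing every query word against every database word.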
4
Statistical Significance of Local Sequence
Alignments
  • Probability that q (length m) is contained in t
    (length n)
  • a. There are (n-m+1) 'words' of length m in t
  • b. In total, there are 4^m sequences of length m
  • c. p = (n-m+1) / 4^m

This is wrong. Why?
  • The (n-m+1) words in t overlap and are therefore
    not independent!
  • Therefore:
      • the probability p is in general hard to compute
        (no analytical solution known so far!)
      • and exact matching is the simplest case!
        Further complications: scoring, gapped
        alignments
  • Approximation formulas are used in BLAST/FASTA
      • the probability of a certain score of a local
        alignment between t and q follows an extreme
        value distribution (not a normal distribution)
      • the extreme value distribution changes with the
        residue composition of the sequences
  • FASTA plots the distribution of observed scores
    vs. expected scores
  • BLAST assigns an E-value to each hit
      • expected number of chance hits with a score at
        least as high as that of the hit
      • considers the residue composition of query and
        database
      • the same local alignment has different E-values
        if it occurs in different databases!
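The independence problem can be checked by brute force for very small cases. The sketch below compares the naive estimate (number of windows, n-m+1, divided by 4^m) with the exact fraction of all 4^n sequences that contain a given word. The function names are invented for this example, and a uniform four-letter alphabet is assumed.

```python
from itertools import product


def naive_probability(n, m, alphabet_size=4):
    """Naive estimate: (number of windows) / (number of possible words)."""
    return (n - m + 1) / alphabet_size ** m


def exact_probability(word, n, alphabet="ACGT"):
    """Exact fraction of all length-n sequences that contain the word."""
    total = len(alphabet) ** n
    containing = sum(
        1 for letters in product(alphabet, repeat=n) if word in "".join(letters)
    )
    return containing / total


print(naive_probability(3, 2))       # 0.125
print(exact_probability("AC", 3))    # 0.125
print(exact_probability("AA", 3))    # 0.109375
```

For n = 3 the word "AA" can occur in both windows at once (in "AAA"), so the two window events overlap and the naive estimate overcounts. "AC" cannot overlap itself, so here the estimate happens to be exact. The true probability therefore depends on the word itself, not only on n and m.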

5
Sample BLAST output
visualization of hits
summary of hits, containing E-Values
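The E-values in such a summary are commonly described by the Karlin-Altschul approximation E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below only illustrates how the same score yields different E-values for different database sizes. The values of LAMBDA and K are placeholders chosen for this example (real values depend on the scoring matrix, gap penalties and residue composition), and BLAST additionally uses length corrections that are omitted here.

```python
import math

# Placeholder statistical parameters, assumed for illustration only.
LAMBDA = 0.267
K = 0.041


def expected_hits(score, query_len, db_len, lam=LAMBDA, k=K):
    """E-value: expected number of chance hits scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one such chance hit: 1 - exp(-E)."""
    return -math.expm1(-e_value)


small_db = expected_hits(score=80, query_len=60, db_len=1_000_000)
large_db = expected_hits(score=80, query_len=60, db_len=100_000_000)

print(f"E (small database): {small_db:.3g}")
print(f"E (large database): {large_db:.3g}")
print(f"P (large database): {p_value(large_db):.3g}")
```

Because E grows linearly with the database length, a hit that looks significant against a small database can be unremarkable against a large one.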
6
Multiple Sequence Alignment
  • Dynamic programming for multiple sequence
    alignments
      • Dynamic programming can theoretically be used
        for aligning k > 2 sequences
      • instead of a 2-dimensional scoring matrix, a
        k-dimensional scoring matrix is used
      • instead of a 2-dimensional dynamic programming
        scheme, a k-dimensional one is used
  • However, in practical terms, not feasible
      • the k-dimensional scoring matrix uses
        exponential space
      • example: 5 sequences, 100 residues each: the
        dynamic programming scheme has 100^5 cells, at
        least 10 GB
  • The (mathematically) optimal solution for aligning
    multiple sequences is not feasible!
  • Approximations are used
  • Idea of progressive methods
      1. align two sequences
      2. compute a "consensus" sequence
      3. align the consensus with a third sequence
      4. repeat steps 2/3
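The space estimate in the example above is simple arithmetic and can be reproduced as follows. The sketch follows the slide in counting length^k cells and assumes one byte per cell, which is the most optimistic case; both function names are made up for this illustration.

```python
def dp_cells(seq_len, num_seqs):
    """Number of cells in a k-dimensional dynamic programming scheme."""
    return seq_len ** num_seqs


def dp_memory_gb(seq_len, num_seqs, bytes_per_cell=1):
    """Memory in GB (10^9 bytes), assuming a fixed number of bytes per cell."""
    return dp_cells(seq_len, num_seqs) * bytes_per_cell / 1e9


for k in range(2, 7):
    print(k, dp_cells(100, k), f"{dp_memory_gb(100, k):g} GB")
```

Every additional sequence of 100 residues multiplies the requirement by 100, so the 10 GB needed for five sequences becomes 1000 GB for six.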

7
Multiple Sequence Alignments: CLUSTALW (1)
  • Basic steps
      • Step 1: perform pairwise sequence alignments of
        all possible pairs
      • Step 2: use the scores to produce a phylogenetic
        tree
      • Step 3: align the sequences sequentially, guided
        by the tree
          • the most closely related sequences are
            aligned first
          • other sequences or groups of sequences are
            added
  • Example: phylogenetic tree for 4 sequences

Percentage identities in pairwise alignment:

         A     B     C     D
    A  100    90    85    90
    B        100    95    80
    C              100    97
    D                    100
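Step 1 can be illustrated with a small global aligner that produces a percentage identity for every pair. This is a sketch under simplifying assumptions, not CLUSTALW's own procedure: it uses a plain Needleman-Wunsch alignment with match/mismatch scores and a linear gap penalty, identity is counted over all alignment columns (other conventions exist), and the three sequences are hypothetical toy data.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring the diagonal.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of all alignment columns."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# Hypothetical toy sequences, for illustration only.
sequences = {"s1": "ACDEFG", "s2": "ACDFG", "s3": "ACDEYG"}

identities = {}
for name_a, name_b in combinations(sequences, 2):
    aligned = global_align(sequences[name_a], sequences[name_b])
    identities[(name_a, name_b)] = percent_identity(*aligned)
    print(name_a, name_b, aligned, round(identities[(name_a, name_b)], 1))
```

For k sequences this step needs k(k-1)/2 pairwise alignments, which stays cheap compared with a k-dimensional dynamic programming scheme.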
8
Multiple Sequence Alignments: CLUSTALW (2)
  • Computation of phylogenetic tree

Percentage identities in pairwise alignment:

         A     B     C     D
    A  100    95    80    90
    B        100    95    85
    C              100    97
    D                    100
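How such a matrix turns into a tree can be sketched with a simple clustering procedure. Note the substitution: CLUSTALW builds its guide tree with the neighbor-joining method, whereas the sketch below uses plain average-linkage clustering (UPGMA-style) directly on the percentage identities, because it is shorter and shows the same idea of joining the most similar sequences first. All function names are invented for this example; the identities are the ones from the matrix above.

```python
from itertools import combinations

# Pairwise percentage identities from the matrix on this slide.
IDENTITY = {
    frozenset("AB"): 95, frozenset("AC"): 80, frozenset("AD"): 90,
    frozenset("BC"): 95, frozenset("BD"): 85, frozenset("CD"): 97,
}


def average_identity(members_a, members_b, identity):
    """Mean identity over all pairs drawn from the two clusters."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(identity[frozenset(pair)] for pair in pairs) / len(pairs)


def guide_tree(names, identity):
    """Repeatedly join the two most similar clusters (average linkage)."""
    clusters = [(name, [name]) for name in names]  # (subtree, member names)
    merges = []
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_identity(
                clusters[ij[0]][1], clusters[ij[1]][1], identity),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merges.append((tree_i, tree_j))
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [((tree_i, tree_j), members_i + members_j)]
    return clusters[0][0], merges


tree, merges = guide_tree(["A", "B", "C", "D"], IDENTITY)
print(tree)     # (('C', 'D'), ('A', 'B'))
print(merges)
```

C and D (97% identity) are joined first, then A and B (95%), and finally the two pairs. The order of the merges is the order in which a progressive method would align the sequences.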
9
Multiple Sequence Alignments: CLUSTALW (3)
  • CLUSTALW
  • seven globins

10
Patterns/Motifs as domain footprints
  • Available implementations

11
Multiple sequence alignment of human ubiquitin
conjugating enzymes
UBE2D2    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697  FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1    FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1    FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9     FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N     LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016  IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10    FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34     FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156  FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1    FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B     FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I     FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5    LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1    FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6    FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H     LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12     VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

Pattern:
[FYW]-H-[PC]-N-[IV]-x(3,4)-G-x-[IV]-C-[LIV]-x-[IV]
PROSITE:
[FYWLSP]-H-[PC]-[NH]-[LIV]-x(3,4)-G-x-[LIV]-C-[LIV]-x-[LIV]
A pattern can be used to scan sequence databases for
its occurrence, or a sequence can be searched
against a database of patterns.
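Scanning a sequence for such a pattern amounts to translating it into a regular expression. The converter below is a minimal sketch that handles only the syntax elements appearing here (single residues, [allowed] and {forbidden} sets, x wildcards and repeat counts such as x(3,4)); anchors and other PROSITE features are not covered. The function names are invented for this example, and the two test sequences are rows of the alignment above with the gap characters removed.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # any residue except these
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # repeat count, e.g. x(3,4)
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def scan(pattern, gapped_sequence):
    """Return the first matching segment of the ungapped sequence, or None."""
    sequence = gapped_sequence.replace("-", "")
    found = re.search(prosite_to_regex(pattern), sequence)
    return found.group(0) if found else None


PATTERN = "[FYW]-H-[PC]-N-[IV]-x(3,4)-G-x-[IV]-C-[LIV]-x-[IV]"
PROSITE = "[FYWLSP]-H-[PC]-[NH]-[LIV]-x(3,4)-G-x-[LIV]-C-[LIV]-x-[LIV]"

UBE2D2 = "FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK"
UBE2H = "LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN"

print(prosite_to_regex(PATTERN))   # [FYW]H[PC]N[IV].{3,4}G.[IV]C[LIV].[IV]
print(scan(PATTERN, UBE2D2))       # YHPNINSNGSICLDI
print(scan(PATTERN, UBE2H))        # FHPNIDEASGTVCLDV
```

The x(3,4) element is what lets one pattern cover both rows: UBE2D2 has three residues before the conserved G, UBE2H has four.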
12
Patterns as footprints of functional protein
domains: PROSITE
  • Example: ubiquitin conjugating enzyme pattern
    entry

13
Scanning a sequence vs. profile/pattern database:
InterProScan (EBI)
  • Example: EphB2 (ephrin B2 receptor, receptor
    tyrosine kinase)

14
Conclusions
  • Statistical significance of local alignments
      • no analytical formula for the probability of a
        certain local alignment score
      • local alignment scores follow an extreme value
        distribution (not normal)
      • BLAST reports an expectation value, dependent on
        residue composition and size of the sequences
  • Multiple alignments
      • dynamic programming not feasible → approximations
        used
      • progressive methods (align two seqs, consensus,
        align third)
      • CLUSTALW: all pairwise alignments → guide tree
  • Patterns, Motifs, Profiles
      • characteristics of domains
      • pattern/profile databases: PROSITE
      • search with pattern vs. sequence databases
      • search with sequence vs. pattern database:
        InterProScan
      • Hidden Markov Models (HMM), generalized profiles