pattern matching - PowerPoint PPT Presentation

Slides: 43
Provided by: ibi7
Transcript and Presenter's Notes

1
Master Course Sequence Alignment
Lecture 11: Sequence Motif Searches
2
Pattern matching
  • Functional genomics: finding out the function of
    all genes (and other parts) in a genome
  • Ability to recognise protein function is paramount
  • Database searching is a crucial strategy
  • Trypsin has a catalytic triad (His, Asp, Ser). How
    to recognize this?
  • (local) alignments not always suitable
  • short patterns, too many don't-cares, etc.

3
Degenerate DNA codes
  • Four bases: A, C, G, T
  • Two-fold degenerate IUB codes:
  • R = A or G
  • Y = C or T
  • K = G or T
  • M = A or C
  • S = G or C
  • W = A or T
  • Four-fold degenerate: N = A, G, C or T
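
A minimal sketch of how the degenerate codes above can drive a search: each code is expanded into a character class and the result is used as a regular expression. The function names `iub_to_regex` and `find_sites` are illustrative assumptions, not part of the lecture.

```python
import re

# Degenerate nucleotide codes listed on the slide, as regex character classes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]",
    "M": "[AC]", "S": "[CG]", "W": "[AT]",
    "N": "[ACGT]",
}

def iub_to_regex(pattern: str) -> str:
    """Translate a degenerate DNA pattern into a regular expression."""
    return "".join(IUB[symbol] for symbol in pattern.upper())

def find_sites(pattern: str, sequence: str) -> list[int]:
    """Return 0-based start positions of all matches, overlaps included."""
    # The lookahead (?=...) consumes nothing, so overlapping hits are reported.
    regex = re.compile("(?=" + iub_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]

print(iub_to_regex("TATAWAW"))
print(find_sites("RY", "AGCT"))
```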

4
Degenerate protein codes
  • 20 amino acids: ACDEFGHIKLMNPQRSTVWY
  • Degenerate codes:
  • X = unknown, all (20-fold degenerate)
  • B = D or N
  • Z = E or Q

5
Transcription factors
  • Required but not a part of the RNA polymerase
    complex
  • Many different roles in gene regulation
  • Binding
  • Interaction
  • Initiation
  • Enhancing
  • Repressing
  • Various structural classes (e.g. zinc finger
    domains)
  • Consist of both a DNA-binding domain and an
    interactive domain

6
Transcription factors
[Figure: a TF bound at its TF binding site; the
transcription factor / polymerase interaction sets
off gene transcription (labels: TF, Pol II, TATA,
mRNA transcription). A second panel contrasts a
closed TF binding site with an open one.]
  • Nucleosomes (chromatin structures composed of
    histones) are structures round which DNA coils.
    This blocks access of TFs.
  • Many TFBSs are possible.
7
Motifs
  • Short sequences of DNA (or RNA or amino acids)
  • Often consist of 5-16 nucleotides
  • Protein motifs can be more variable
  • May contain gaps
  • Examples include:
  • Splice sites
  • Start/stop codons
  • Transmembrane domains
  • Centromeres
  • Phosphorylation sites
  • Coiled-coil domains
  • Transcription factor binding site (regulatory
    motifs)

8
pattern matching
  • This lecture
  • Regular expressions
  • Pre-processing data for pattern matching: suffix
    trees
  • Hidden Markov models (brief)

9
Rationale for regular expressions
  • I want to see all sequences that ...
  • ... contain a C
  • ... contain a C or an F
  • ... contain a C and an F
  • ... contain a C immediately followed by an F
  • ... contain a C later followed by an F
  • ... begin with a C
  • ... do not contain a C
  • ... contain at least three Cs
  • ... contain exactly three Cs
  • ... have a C at the seventh position
  • ... either contain a C, an E, and an F in any
    order except CFE, unless there are also at most
    three Ps, or there is a ....
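
Several of the questions above can be written directly as regular expressions. The sketch below uses Python's `re` syntax; the dictionary `QUERIES`, the helper `matching` and the sample sequences are illustrative assumptions. "Later followed by" is read here as "followed at any distance", which also admits the immediately-following case.

```python
import re

# One regex per question; all are applied with re.search (substring match),
# so anchors ^ and $ are written out where the whole string matters.
QUERIES = {
    "contains a C": r"C",
    "contains a C or an F": r"[CF]",
    "contains a C and an F": r"C.*F|F.*C",
    "C immediately followed by F": r"CF",
    "C later followed by F": r"C.*F",
    "begins with a C": r"^C",
    "does not contain a C": r"^[^C]*$",
    "at least three Cs": r"C.*C.*C",
    "exactly three Cs": r"^[^C]*(C[^C]*){3}$",
    "C at the seventh position": r"^.{6}C",
}

def matching(description: str, sequences: list[str]) -> list[str]:
    """Return the sequences that satisfy the named query."""
    regex = re.compile(QUERIES[description])
    return [s for s in sequences if regex.search(s)]

samples = ["ACDEFG", "CCC", "MKLFAC", "AAAAAACA", "GHIK"]
for description in QUERIES:
    print(description, "->", matching(description, samples))
```

The last question on the slide (any order except CFE, unless ...) is exactly the kind of query where a single regular expression becomes unwieldy.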

10
regular expressions
  • alphabet: set of symbols
  • {A, C, T, G}
  • string: sequence of symbols from alphabet
  • AACTG, CATG, GGA, ACFT, ε
  • regex: formal method to define (sub)set of
    strings
  • C.AG?T
  • used for pattern matching
  • check if database sequence ∈ regex

11
construction of a regex
  • regex contains
  • symbols from alphabet
  • C → {C}
  • operators
  • operations on regex(es) yield new regex
  • concatenation, union, repetition, ...

12
basic operators
  • r1r2: concatenation
    AC → {AC}
    AAC → {AAC}
  • [s1s2 ... sn]: union (of symbols)
    [ACG] → {A, C, G}
    [AC]G → {AG, CG}
  • r1|r2: union (of regexes)
    A|CC → {A, CC}
    A|C|AC → {A, C, AC}
  • r+: repeat once or more
    C+ → {C, CC, CCC, CCCC, ...}
    A[AC]+ → {AA, AC, AAA, AAC, ACA, ACC, AAAA,
    AAAC, ...}
13
derived operators
  • r?: optional
    C? → {ε, C}
    AC?G → {AG, ACG}
  • r*: repeat zero or more times
    C* → {ε, C, CC, CCC, CCCC, ...}
    A*C → {C, AC, AAC, AAAC, ...}
    [AC]* → {ε, A, C, AA, AC, CA, CC, AAA, AAC,
    ACA, ACC, ...}
  • r{n-m}: repeat n to m times
    C{4} → {CCCC}
    C{2-4} → {CC, CCC, CCCC}
    C{-3} → {ε, C, CC, CCC}
    C{3-} → {CCC, CCCC, CCCCC, ...}
14
miscellaneous
  • . : any symbol
    . → {A, C, G, T}
    A.C → {AAC, ACC, AGC, ATC}
    .? → {ε, A, C, G, T}
    .* → {ε, A, C, G, T, AA, AC, AG, AT, CA, CC,
    CG, CT, GA, ...}
  • [^s1s2 ... sn]: exclude symbols
    [^A] → {C, G, T}
    [^AC] → {G, T}
  • (r): grouping
    (AC)? → {ε, AC}
    AC? → {A, AC}
    (AC)* → {ε, AC, ACAC, ACACAC, ACACACAC, ...}
    AC* → {A, AC, ACC, ACCC, ...}
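
The operator examples can be checked mechanically by listing every short string that a regex matches in full. This is a brute-force sketch in Python; `language` is a hypothetical helper name, and Python writes bounded repetition with a comma, e.g. `C{2,4}`.

```python
import re
from itertools import product

def language(regex: str, alphabet: str = "ACGT", max_len: int = 3) -> list[str]:
    """List every string over `alphabet` of at most `max_len` symbols that
    the whole regex matches, shortest first ('' is the empty string)."""
    compiled = re.compile(regex)
    hits = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            candidate = "".join(symbols)
            if compiled.fullmatch(candidate):
                hits.append(candidate)
    return hits

print(language("A[AC]+"))           # concatenation, symbol union, repetition
print(language("AC?G"))             # optional symbol
print(language("(AC)?", max_len=4)) # grouping
print(language("[^AC]", max_len=1)) # excluded symbols
```

The enumeration is exponential in `max_len`, so it is only useful for checking one's reading of an operator on tiny alphabets.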
15
limitations
  • regex cannot remember indeterminate counts!
  • I want to see all sequences with ...
  • ... six Cs followed by six Ts
  • C{6}T{6}
  • ... any number of Cs followed by any number of
    Ts
  • C*T*
  • ... Cs followed by an equal number of Ts
  • C{n}T{n}
  • (CT|CCTT|CCCTTT|C{4}T{4}| ... )?
  • use (context-free) grammar
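
A small sketch of the limitation: the regex can only verify the shape "Cs then Ts"; comparing the two counts needs memory outside the regular expression. The function name `equal_cs_then_ts` is an illustrative assumption.

```python
import re

def equal_cs_then_ts(sequence: str) -> bool:
    """True if the sequence is n Cs followed by exactly n Ts (n >= 0)."""
    # The regex checks only the shape C*T* and captures both runs ...
    m = re.fullmatch(r"(C*)(T*)", sequence)
    # ... the counting itself happens here, outside the regex.
    return m is not None and len(m.group(1)) == len(m.group(2))

for s in ["", "CCTT", "CCCTT", "CTCT"]:
    print(repr(s), equal_cs_then_ts(s))
```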

16
regexes in pattern matching
  • pattern described by regex
  • check if sequence ∈ regex
  • matching done very efficiently
  • O(n)
  • using a state machine

17
state machines
ACTGGC
  • compile regex to state machine
  • match sequence with regex

18
Example from BLAST: Determining Query Words
  • Given
  • query sequence: QLNFSAGW
  • word length w = 3
  • word score threshold T = 8
  • Step 1 determine all words of length w in query
    sequence
  • QLN LNF NFS FSA SAG AGW
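
Step 1 is a sliding window over the query. A minimal Python sketch (the function name `query_words` is an assumption for illustration):

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return all overlapping words of length w, in order of position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

print(query_words("QLNFSAGW", w=3))
```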

19
Example from BLAST: Determining Query Words
  • Step 2: determine all words that score at least T
    when compared to a word in the query sequence
  • words from query sequence → words w/ score ≥ T = 8
  • QLN → QLN=11, QMD=9, HLN=8, ZLN=9, ...
  • LNF → LNF=9, LBF=9, LBY=8, FNW=8, ...
  • NFS → NFS=12, AFS=8, NYS=8, DFT=10, ...
  • SAG → none
  • ...

20
Example from BLAST: Scanning the Database
  • search database for all occurrences of query
    words
  • approach
  • build a DFA (deterministic finite-state
    automaton) that recognizes all query words
  • run DB sequences through DFA
  • remember hits

21
Example from BLAST: Scanning the Database
  • consider a DFA to recognize the query words QL,
    QM, ZL
  • All that a DFA does is read strings, and output
    "accept" or "reject."
  • use Mealy paradigm (accept on transitions) to
    save space and time

Moore paradigm: the alphabet is (a, b), the
states are q0, q1, and q2, the start state is q0
(denoted by the arrow coming from nowhere), the
only accepting state is q2 (denoted by the double
ring around the state), and the transitions are
the arrows. The machine works as follows. Given
an input string, we start at the start state, and
read in each character one at a time, jumping
from state to state as directed by the
transitions. When we run out of input, we check
to see if we are in an accept state. If we are,
then we accept. If not, we reject.
  • Moore paradigm: accept/reject states
  • Mealy paradigm: accept/reject transitions
22
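Example from BLAST: a DFA to recognize query
words QL, QM, ZL
[Figure: Mealy-paradigm DFA with a start state and
two further states, one reached on Q and one on Z.
Transition labels: Q, Z, L, "L or M",
"not (L or M or Q)", "not (L or Z)",
"not (Q or Z)".]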
Accept on red transitions (Mealy paradigm)
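
The automaton for QL, QM and ZL can be sketched as follows. This is one possible encoding rather than a transcription of the figure: the state is simply "start", "Q" or "Z", and a hit is reported on the transition that completes a word (Mealy paradigm). All names in the code are illustrative assumptions.

```python
# Two-letter query words from the slide.
WORDS = ("QL", "QM", "ZL")

# Symbols that can start a word; reading one of them moves to the state of
# the same name, any other symbol returns the machine to "start".
STARTS = {word[0] for word in WORDS}

# Accepting transitions: (current state, symbol read) pairs that complete a word.
ACCEPTING = {(word[0], word[1]) for word in WORDS}

def scan(sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start position, word) for every hit, in one pass."""
    hits = []
    state = "start"
    for i, symbol in enumerate(sequence):
        if (state, symbol) in ACCEPTING:
            hits.append((i - 1, state + symbol))
        state = symbol if symbol in STARTS else "start"
    return hits

print(scan("AQLQMZZLQ"))
```

Each database symbol is examined once, which is the O(n) behaviour that makes the state-machine approach attractive for scanning.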
23
other uses
  • many programs use regular expressions
  • command-line interpreter
  • del *.*
  • editor
  • search
  • replace
  • compilers
  • perl, grep, sed, awk
  • many different syntaxes

24
local vs. global matching
  • global: regex describes entire string to be
    matched
  • ACCCCTG ∉ C{3-}
  • local: regex describes substring to be matched
  • ACCCCTG ∈ C{3-}
  • ^ matches start-of-string
  • ^CG: match everything starting with CG
  • ^[^CG]: match everything not starting with C or G
  • $ matches end-of-string
  • AC$: match everything ending with AC
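
In Python's `re` module the two modes correspond to `re.fullmatch` (global) and `re.search` (local). The sketch below assumes Python syntax, where "three or more" is written `{3,}`; the helper names are illustrative.

```python
import re

def global_match(regex: str, sequence: str) -> bool:
    """The regex must describe the entire string."""
    return re.fullmatch(regex, sequence) is not None

def local_match(regex: str, sequence: str) -> bool:
    """The regex only has to describe some substring."""
    return re.search(regex, sequence) is not None

print(global_match("C{3,}", "ACCCCTG"))   # whole string is not just Cs
print(local_match("C{3,}", "ACCCCTG"))    # but it contains a run of Cs
print(local_match("^[^CG]", "ACCCCTG"))   # does not start with C or G
print(local_match("AC$", "GGAC"))         # ends with AC
```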

25
Regular expressions
Alignment:
  ADLGAVFALCDRYFQ
  SDVGPRSCFCERFYQ
  ADLGRTQNRCDRYYQ
  ADIGQPHSLCERYFQ
Regular expression:
  [AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
  {PG} = not (P or G)
For short sequence stretches, regular expressions
are often more suitable to describe the
information than alignments (or profiles)
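
A sketch of how such a pattern can be turned into a conventional regex and checked against the four aligned sequences. It assumes the standard PROSITE spelling with parenthesised repeat counts, `x(4)` and `[FY](2)`, and covers only a subset of the PROSITE syntax (no `<` or `>` anchors); `prosite_to_regex` is a hypothetical helper name.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, optionally
# followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            regex = core                       # residue or [allowed set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

regex = prosite_to_regex("[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q")
print(regex)
for seq in ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
            "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```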
26
Regular expressions
Regular expression: No. of exact matches in DB
  • D-A-V-I-D: 71
  • D-A-V-I-[DENQ]: 252
  • [DENQ]-A-V-I-[DENQ]: 925
  • [DENQ]-A-[VLI]-I-[DENQ]: 2739
  • [DENQ]-[AG]-[VLI]2-[DENQ]: 51506
  • D-A-V-E: 1088
27
Motif-based function prediction
  • Prediction of protein functions based on
    identified sequence motifs
  • PROSITE contains patterns specific for more than
    a thousand protein families.
  • ScanPROSITE scans a protein sequence for
    occurrences of patterns and profiles stored in
    PROSITE

http://www.expasy.org/prosite/
28
PROSITE example: Post-translational modification
ASN_GLYCOSYLATION, PS00001: N-glycosylation site (PATTERN with a high probability of occurrence!)

Consensus pattern: N-{P}-[ST]-{P}; N is the glycosylation site
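
A short sketch of scanning for this consensus, written as the regex `N[^P][ST][^P]` (N, then anything but P, then S or T, then anything but P). The test sequence is invented for illustration, and the function name is an assumption. Because the pattern is so short, hits are frequent and may overlap, which is why a lookahead is used.

```python
import re

# Zero-width lookahead so that overlapping candidate sites are all reported.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")

def glycosylation_sites(protein: str) -> list[int]:
    """Return 1-based positions of the N in each candidate site."""
    return [m.start() + 1 for m in N_GLYCOSYLATION.finditer(protein)]

print(glycosylation_sites("MNGSAANPTQNNSSK"))
```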
29
PROSITE example: extended profile
  • Acyl carrier protein phosphopantetheine domain
    profile.
  • /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ';
    LENGTH=71;
  • /DISJOINT: DEFINITION=PROTECT; N1=6; N2=66;
  • /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=2.3;
    R2=.02281121; TEXT='-LogE';
  • /CUT_OFF: LEVEL=0; SCORE=271; N_SCORE=8.5;
    MODE=1; TEXT='!';
  • /CUT_OFF: LEVEL=-1; SCORE=184; N_SCORE=6.5;
    MODE=1; TEXT='?';
  • /DEFAULT: D=-20; I=-20; B1=-80; E1=-80; MI=-105;
    MD=-105; IM=-105; DM=-105; MM=1; M0=-1;
  • Column order of the M= scores: A B C D E
    F G H I K L M N P Q R S T V W Y Z
  • /I: B1=0; BI=-105; BD=-105;
  • /M: SY='T'; M=-5,-15,-20,-17,-12,-10,-22,-18,
    2,-13, -1, 0,-13, -6,-10,-13, -5, 4, 1,-23,
    -9,-12;
  • /M: SY='E'; M=-6, -6,-22, -6, 9,-13,-21, -9,-11,
    0, -8, -7, -7,-13, 1, 1, -4, -3, -8,-24,-10, 4;
  • /M: SY='E'; M=-5, 9,-24, 11, 15,-24,-12, -3,-23,
    3,-20,-15, 6, -9, 5, 1, 4, -2,-19,-29,-16, 9;
  • /M: SY='E'; M=-5, 2,-26, 4, 8,-22,-13, -7,-21,
    7,-17,-12, 0,-13, 3, 7, -2, -6,-16,-22,-12, 5;
  • /M: SY='L'; M=-6,-27,-19,-30,-23, 4,-30,-23,
    26,-25, 28, 17,-25,-27,-21,-20,-19, -5, 23,-23,
    -3,-23;
  • /M: SY='R'; M=-3,-10,-10,-11, 2,-16,-19,-11,-13,
    -1, -8, -7, -8,-17, -1, 3, -5, -6, -9,-26,-13,
    -1;
  • /M: SY='E'; M=-1, 3,-23, 4, 9,-24,-11,
    -7,-22, 8,-19,-13, 2,-11, 5, 6, 2,
    -2,-17,-26,-15, 7;
  • /M: SY='I'; M=-5,-22,-20,-27,-19, -4,-29,-20,
    20,-19, 13, 10,-18,-21,-14,-17,-15, -6, 14,-20,
    -4,-18;
  • /M: SY='I'; M=-8,-30,-24,-33,-27, 8,-29,-26,
    19,-24, 15, 9,-28,-27,-23,-21,-22,-10, 17, 9,
    4,-25;
  • /M: SY='A'; M=11, -8, -8,-12, -5,-19,-11,-14,-14,
    -1,-14, -9, -6,-15, -4, -4, 2, -2, -6,-25,-15,
    -5;
  • /M: SY='E'; M=-5, 10,-26, 15, 22,-28,-12,
    -2,-26, 6,-21,-16, 4, -8, 10, 0, 2,
    -6,-23,-28,-16, 16;
  • /M: SY='V'; M=-5,-14,-15,-16, -6,-11,-23,-14,
    4,-11, 0, 1,-13,-19, -5,-12, -7, -5, 6,-24,
    -8, -6;
  • /M: SY='L'; M=-2,-24,-21,-26,-19, 5,-24,-20,
    10,-23, 22, 7,-23,-24,-18,-19,-18, -7, 6, -7,
    0,-18;
  • /M: SY='G'; M=3, -4,-25, -5, -4,-27, 24,-12,-29,
    -6,-25,-16, 1,-12, -4, -8, 5, -9,-23,-24,-22,
    -4;
  • /M: SY='V'; M=-1,-12,-19,-14, -8,-11,-20,-14,
    4,-12, 0, 1,-11,-18,-10,-13, -5, -2, 7,-25,
    -9,-10;
  • /I: I=-4; MI=0; MD=-15; IM=0;
  • /M: M=-2, -6,-13, -6, -5, -9,-11,-10, -2, -8,
    0, -2, -7, -7, -7, -8, -5, -3, -1,-18, -8, -7;
    D=-3;
  • /I: DM=-15;

. . .
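
The /NORMALIZATION line declares a linear function, which appears to map a raw profile score to the normalized score quoted in the /CUT_OFF lines as R1 + R2 × raw score. That reading is an assumption here, but it reproduces both cut-off pairs in the profile above, as this small check shows.

```python
# Parameters copied from the /NORMALIZATION line of the profile.
R1 = 2.3
R2 = 0.02281121

def normalized_score(raw_score: float) -> float:
    """Linear normalization (assumed form): R1 + R2 * raw score."""
    return R1 + R2 * raw_score

# Raw cut-off scores from the two /CUT_OFF lines.
for raw in (271, 184):
    print(raw, round(normalized_score(raw), 1))
```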
30
Suffix Trees
  • A suffix tree (also called PAT tree or, in an
    earlier form, position tree) is a data structure
    that presents the suffixes of a given string,
    allowing fast implementation of many important
    string operations.
  • A suffix is defined as the shortest sub-sequence
    starting at a given position that is unique in
    the complete sequence and can therefore be used
    to clearly identify that position.
  • The suffix tree for a string S is a tree whose
    edges are labelled with strings, such that each
    suffix of S corresponds to exactly one path from
    the tree's root to a leaf.
  • Constructing such a tree for the string S takes
    time and space linear in the length of S.
  • Once constructed, several operations can be
    performed quickly, for instance locating a
    substring in S, locating a substring if a certain
    number of mistakes are allowed, locating matches
    for a regular expression pattern etc.
  • Suffix trees also provided one of the first
    linear-time solutions for the longest common
    substring problem. These speedups come at a cost:
    storing a string's suffix tree typically requires
    significantly more space than storing the string
    itself. With what kind of strings would this not
    be the case?

31
Suffix Trees
Given the string 'mississippi', 'miss' is a
prefix, 'ippi' is a suffix, 'issi' is a
substring. Note that a substring is a prefix of a
suffix. If txt = t1 t2 ... ti ... tn is a string, then
Ti = ti ti+1 ... tn is the suffix of txt that starts
at position i, e.g.
  • T1 = mississippi = txt
  • T2 = ississippi
  • T3 = ssissippi
  • T4 = sissippi
  • T5 = issippi
  • T6 = ssippi
  • T7 = sippi
  • T8 = ippi
  • T9 = ppi
  • T10 = pi
  • T11 = i
  • T12 = (empty)
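
The suffixes, and the order in which they appear once sorted alphabetically, can be generated in a few lines. A sketch using 1-based positions to match the T1 ... T12 numbering (function names are illustrative):

```python
def suffixes(txt: str) -> dict[int, str]:
    """Map each 1-based position i to the suffix Ti (the last one is empty)."""
    return {i + 1: txt[i:] for i in range(len(txt) + 1)}

def sorted_suffix_positions(txt: str) -> list[int]:
    """1-based start positions of the non-empty suffixes, sorted by suffix."""
    return sorted(range(1, len(txt) + 1), key=lambda i: txt[i - 1:])

for position in sorted_suffix_positions("mississippi"):
    print(position, suffixes("mississippi")[position])
```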
32
Suffix Trees
  • 11: i
  • 8: ippi
  • 5: issippi
  • 2: ississippi
  • 1: mississippi
  • 10: pi
  • 9: ppi
  • 7: sippi
  • 4: sissippi
  • 6: ssippi
  • 3: ssissippi
[Figure: suffix tree for 'mississippi'. Recoverable
node labels: (1:mississippi) leaf, (6:ssippi) leaf,
(3:ssi), (9:ppi) leaf, (2:i), (9:ppi) leaf.]

From http://www.allisons.org/ll/AlgDS/Tree/Suffix/;
please study the information provided by this
link!
33
Building a Suffix Tree
  • Take a sequence
  • Group all positions according to base
    (nucleotide, a.a.) type leading to the first
    level in the tree
  • (a terminator symbol, conventionally $, is often
    included to indicate the end of the string)
  • Regroup each group according to the following
    base, giving the second row of the tree
  • Continue this process and stop when a (sub)group
    only contains one sequence position (but include
    complete suffix)
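
The grouping procedure above can be sketched recursively. This is a naive illustration of the steps as described, not a linear-time suffix tree construction: every edge carries a single symbol, a `$` terminator is assumed, and a group that shrinks to one position becomes a leaf holding that (1-based) position and the rest of its suffix. The name `build` and the nested-dict representation are assumptions.

```python
def build(txt: str, positions=None, depth: int = 0):
    """Group suffix start positions symbol by symbol.

    Returns a nested dict {symbol: subtree}; a leaf is a tuple
    (1-based start position, remainder of the suffix below this node).
    """
    if positions is None:
        txt += "$"                        # assumed end-of-string marker
        positions = list(range(len(txt)))
    if len(positions) == 1:               # group is unique: stop and keep the suffix
        p = positions[0]
        return (p + 1, txt[p + depth:])
    groups = {}
    for p in positions:                   # regroup by the next symbol
        groups.setdefault(txt[p + depth], []).append(p)
    return {symbol: build(txt, group, depth + 1)
            for symbol, group in sorted(groups.items())}

tree = build("mississippi")
print(tree["m"])        # only one suffix starts with m
print(tree["p"])        # two suffixes start with p, split on the next symbol
```

The terminator guarantees that no suffix is a prefix of another, so every group of two or more positions always has a next symbol to split on.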

34
Finding motifs in sequences
  • Many methods exist
  • MEME, Gibbs
  • Typically, these programs try and find motifs of
    a given length W in a set of N sequences.
  • Using probabilistic formalisms, they distinguish
    background residues, those not in the pattern,
    and residues at specific positions in the
    pattern.
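
To make the "background versus position-specific" distinction concrete, here is a sketch of the scoring model only. It is not MEME or Gibbs sampling: the motif instances are assumed to be known already, the background is assumed uniform, and all names and example sequences are invented for illustration.

```python
import math
from collections import Counter

def log_odds_matrix(sites: list[str], alphabet: str = "ACGT",
                    pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-position log2(P(residue at position) / P(residue in background))."""
    background = 1.0 / len(alphabet)              # uniform background (assumption)
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        matrix.append({a: math.log2((counts[a] + pseudocount) / total / background)
                       for a in alphabet})
    return matrix

def best_window(sequence: str, matrix: list[dict[str, float]]) -> int:
    """0-based start of the length-W window with the highest total score."""
    w = len(matrix)
    scores = [(sum(matrix[j][sequence[i + j]] for j in range(w)), i)
              for i in range(len(sequence) - w + 1)]
    return max(scores)[1]

sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]   # aligned motif instances, W = 6
matrix = log_odds_matrix(sites)
print(best_window("GGCGTATAATGCGC", matrix))
```

Motif finders have the harder task of estimating such a matrix without knowing where the sites are.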

35
Example of HMM repository: The PFAM Database
  • Pfam is a large collection of multiple sequence
    alignments and hidden Markov models covering many
    common protein domains and families. For each
    family in Pfam you can:
  • Look at multiple alignments
  • View protein domain architectures
  • Examine species distribution
  • Follow links to other databases
  • View known protein structures
  • Search with Hidden Markov Model (HMM) for each
    alignment

36
The PFAM Database
  • Pfam is a database of two parts, the first is the
    curated part of Pfam containing about 9000
    protein families (Pfam-A). Pfam-A comprises
    manually crafted multiple alignments and
    profile-HMMs.
  • To give Pfam a more comprehensive coverage of
    known proteins we automatically generate a
    supplement called Pfam-B. This contains a large
    number of small families taken from the PRODOM
    database that do not overlap with Pfam-A.
  • Although of lower quality Pfam-B families can be
    useful when no Pfam-A families are found.

37
The PFAM Database
  • Sequence coverage Pfam-A: 74% (Yellow)
  • Sequence coverage Pfam-B: 13% (Blue)
  • Other (Grey)
  • 74% of proteins have at least one match with
    Pfam.
  • Version 21.0 (November 2006): Pfam-A contains
    8957 families

38
Pfam Ig Family Alignment
39
Clan pages in Pfam. (A) A screenshot of a clan
summary page, containing the description,
annotation and membership of the clan. From this
page, the user can view the family relationship
diagram (B). Each family in the clan is
represented by a blue box and its relationship to
other families is represented by solid lines
(significant profile-profile comparison score) or
dashed lines (non-significant profile-profile
comparison score). Beside each line, the
profile-profile comparison E-value score is
presented. This score is also linked to a
visualization of the profile-profile comparison
alignment (C). The clan summary page also
provides a link to the clan alignment (D) (for
more details see text). The clan alignment is a
multiple sequence alignment of all of the clan
members' seed alignments (each set of seed
sequences is separated by the alternate
background shading). The alignments are coloured
using Jalview.
40
HMM-based homology searching
  • Transition probabilities and emission
    probabilities
  • Gapped HMMs also have insertion and deletion
    states
41
Profile HMM: m = match state, I = insert state,
d = delete state; go from left to right. I and m
states output amino acids; d states are silent.
  • Transition probabilities and emission
    probabilities
42
A hidden Markov model accompanying a PFAM
alignment
  • HMMER2.0 [2.2g]
  • NAME cytochrome_b_N
  • ACC PF00033
  • DESC Cytochrome b(N-terminal)/b6/petB
  • LENG 222
  • ALPH Amino
  • RF no
  • CS no
  • MAP yes
  • COM hmmbuild -F HMM_ls.ann SEED.ann
  • COM hmmcalibrate --seed 0 HMM_ls.ann
  • NSEQ 8
  • DATE Thu Dec 12 02:48:53 2002
  • CKSUM 8731
  • GA -41.9 -41.9
  • TC -41.9 -41.9
  • NC -42.4 -42.4
  • XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
  • NULT -4 -8455
  • NULE 595 -1558 85 338 -294 453 -1158 197
    249 902 -1085 -142 -21 -313 45 531 201 384 -1998
    -644
  • EVD -170.913223 0.138730
  • HMM A C D E F G H I K L M N P Q R S T V W Y
  • m->m m->i m->d i->m i->i d->m d->d b->m m->e
  • -300 -2414
  • 1 -2605 -2478 -3823 -3810 -1719 -3245 -3021 -1267
    -3347 -794 5096 -3495 -3580 -3266 -3203 -3106
    -2761 -1627 -2687 -2398 1
  • -564 -3141 -2265 -289 -2463 -701 -1378 -1300
    8788
  • 2 1405 -805 -1720 -1394 -2431 567 -1366 -2119
    -1298 -2328 -1482 -1065 -1670 -1131 -1568 1592
    2088 -1424 -2629 -2251 3
  • -148 -500 233 43 -381 399 106 -626 210 -466 -720
    275 394 45 96 359 117 -369 -294 -249

HMMs are good for profile searches but
optimising the many parameters when using HMMs to
do alignments from scratch is a problem.