Motif Discovery in Unaligned and Aligned Protein Sequences

1
Motif Discovery in Unaligned and Aligned Protein
Sequences
  • Sun Kim
  • sunkim2 at indiana.edu
  • School of Informatics; Center for Genomics and
    Bioinformatics
  • Indiana University Bloomington
  • November 2, 2006

2
OUTLINE
  • Introduction
  • Research Question and Motivation
  • Conserved region detection in aligned sequences
    using aggregated related column scores
  • iGibbs: an improved Gibbs sampler for proteins
  • Future work/Discussion

3
INTRODUCTION
  • (Sequence) Motifs
  • Short, conserved subsequences across a set of
    proteins that share similar function
  • Seq 1 KGGAKRHRKIL
  • Seq 2 KVGAKRHSKRS
  • Seq 3 KVGAKRHSRKS
  • Seq 4 KGGAKRHRKVL

4
Why do we want to identify motifs?
  • Useful for
  • Predicting a protein's function from its primary
    sequence
  • Predicting the family to which a protein belongs
  • Predicting protein structural features

5
Motifs
PS00577: [DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]
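
A PROSITE pattern of this kind can be matched against sequences by translating it into a regular expression. The following is a minimal sketch, assuming the standard PROSITE syntax (square brackets for allowed residues, curly braces for excluded residues, `x` for any residue, `(n)` or `(n,m)` for repeats). The function name `prosite_to_regex` and the test sequence are illustrative and not part of the talk; the `<` and `>` anchors are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        # [ABC] classes and single letters are already valid regex.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

PS00577 = "[DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]"
regex = prosite_to_regex(PS00577)
print(regex)  # [DENY].{2}[KRI][STA].{2}VG.[DN].[FW]T[KR]

# Hypothetical sequence with one occurrence starting at index 2.
hit = re.search(regex, "MMDAAKSAAVGADAFTKLL")
print(hit.start(), hit.group())  # 2 DAAKSAAVGADAFTK
```
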
6
Motif Discovery
  • In aligned sequences
  • The input sequences are aligned using a multiple
    sequence alignment algorithm such as ClustalW,
    T-Coffee, or WP.
  • From the alignment, determine motif regions,
    typically most conserved regions. How?
  • In unaligned sequences
  • A motif model M is typically an alignment of
    subsequences occurring in the input sequences S.
  • Then the motif search problem is to find a model
    M that maximizes a measure, typically log
    Pr(S | M).
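
To make the objective concrete, the sketch below scores a candidate model by the log-likelihood of its aligned subsequences under position-specific residue probabilities. This is a simplification: it scores only the motif sites and ignores the non-motif background, so it is one possible reading of log Pr(S | M), not the exact measure used by any particular tool. The function names and the pseudocount of 1 are assumptions; the sites are the four example sequences from slide 3.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(sites, pseudocount=1.0):
    """Estimate per-position residue probabilities from aligned sites."""
    total = len(sites) + pseudocount * len(AMINO_ACIDS)
    model = []
    for k in range(len(sites[0])):
        counts = Counter(site[k] for site in sites)
        model.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return model

def log_likelihood(sites, model):
    """Sum of log probabilities of every residue of every site under the model."""
    return sum(math.log(model[k][res]) for site in sites for k, res in enumerate(site))

sites = ["KGGAKRHRKIL", "KVGAKRHSKRS", "KVGAKRHSRKS", "KGGAKRHRKVL"]
model = column_probabilities(sites)
print(log_likelihood(sites, model))
```

A motif search procedure would compare this score across many candidate choices of subsequences and keep the model with the highest value.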

7
  • ARCS: An Aggregated Related Column Scoring Scheme
    for Aligned Sequences
  • A method for detecting conserved regions
    (hopefully motifs) in aligned sequences

8
Challenge in motif discovery in aligned sequences
  • Aligning input sequences globally often works
    very well when input sequences share enough
    similarity.
  • However, determining conserved regions or motifs
    in aligned sequences is poorly studied.

9
Motivation for ARCS
  • There are a few methods for detecting conserved
    regions in aligned sequences.
  • AL2CO (Pei and Grishin 2001): uses two
    strategies (unweighted and weighted
    frequencies) and three conceptually different
    approaches (entropy-based, variance-based, and
    matrix score-based).
  • ConFind (Smagala et al. 2005): uses several
    criteria, including minimum region length and
    maximum informational entropy (variability) per
    position.
  • These methods do not use correlation among
    columns in an alignment, thus they often fail to
    detect conserved regions.

10
How do we measure column correlation?
  • To measure column correlation, we use approximate
    functional dependency (Dalkilic and Robertson,
    2000; Giannella and Robertson, 2004).
  • Correlation score at column 2 with columns 1 and
    3.

11
Aggregated Related Column Score (ARCS) at column i
  • Measure LOGOS score at each column j.
  • Measure approximate functional dependency between
    column i and its neighbor column j.
  • ARCS at column i is the sum, over its neighbor
    columns j, of
  • LOGOS at neighbor column j, times
  • the approximate functional dependency between
    column i and its neighbor column j.

12
ARCS formal definition
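  • $LOGOS(i) = H_{Max} - H(i)$, where
  • $H(i) = -\sum_{e} F_{ie} \log_2(F_{ie})$
  • $H_{Max} = \log_2(\min(N_L, n))$
  • $FD_{i \to j} = 1 - H_{i \to j} / \log_2(n)$
  • $H_{i \to j} = \sum_{p} \frac{c_{ip}}{n} \left( \sum_{q} \frac{c_{ip,jq}}{c_{ip}} \log_2 \frac{c_{ip}}{c_{ip,jq}} \right)$
  • $ARCS(i) = \sum_{j \in N(i)} FD_{j \to i} \cdot LOGOS(j)$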

13
ARCS example
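  • $H_{Max} = \log_2(\min(N_L, n)) = \log_2(\min(5, 4)) = 2$
  • $H(1) = -(F_{1M} \log_2(F_{1M}) + F_{1W} \log_2(F_{1W})) = -(\tfrac{1}{2} \cdot (-1) + \tfrac{1}{2} \cdot (-1)) = 1$
  • $LOGOS(1) = H_{Max} - H(1) = 1$, $LOGOS(2) = 0.5$,
    $LOGOS(3) = 1.1887$, $LOGOS(4) = 1.1887$.
  • $H_{1 \to 4} = \tfrac{2}{4} (1 \cdot \log_2(1)) + \tfrac{2}{4} (\tfrac{1}{2} \log_2(2) + \tfrac{1}{2} \log_2(2)) = 0.5$
  • $FD_{1 \to 4} = 1 - 0.5 / \log_2(4) = 0.75$
  • $ARCS(1) = FD_{1 \to 1} \cdot LOGOS(1) + FD_{2 \to 1} \cdot LOGOS(2) = 1.375$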
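
The numbers in this example can be checked with a short script. The sketch below follows the definitions on the previous slide, with several assumptions: the four-sequence alignment is hypothetical (the slide's own alignment is not in the transcript) and was chosen so that its column counts reproduce the values above; the alphabet size of 5 is taken from the example's Min(5, 4); the neighborhood N(i) is read as the columns within one position of i, including i itself; and columns are 0-indexed in the code.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy H(i) of the residue frequencies in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def logos(column, alphabet_size):
    """LOGOS(i) = HMax - H(i), with HMax = log2(min(alphabet_size, n))."""
    return math.log2(min(alphabet_size, len(column))) - entropy(column)

def fd(col_i, col_j):
    """Approximate functional dependency FD(i -> j) = 1 - H(i -> j) / log2(n)."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_ij = Counter(zip(col_i, col_j))
    # H(i -> j) summed over co-occurring residue pairs (p in column i, q in column j).
    h = sum((c / n) * math.log2(count_i[p] / c) for (p, _q), c in count_ij.items())
    return 1 - h / math.log2(n)

def arcs(columns, i, alphabet_size, radius=1):
    """ARCS(i): sum over neighbors j of FD(j -> i) * LOGOS(j)."""
    lo, hi = max(0, i - radius), min(len(columns), i + radius + 1)
    return sum(fd(columns[j], columns[i]) * logos(columns[j], alphabet_size)
               for j in range(lo, hi))

# Hypothetical alignment, one string per sequence.
rows = ["MAKG", "MCKG", "WAKG", "WDRS"]
columns = list(zip(*rows))

print(logos(columns[0], 5))                       # 1.0
print(round(logos(columns[2], 5), 4))             # 1.1887
print(fd(columns[0], columns[3]))                 # 0.75
print(arcs(columns, 0, alphabet_size=5))          # 1.375
```
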

14
Smoothed ARCS
  • $\text{Smoothed ARCS}(i) = \sum_{0 \le j \le \lfloor (w-1)/2 \rfloor} ARCS(i \pm j) / w$
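
One way to read this formula is as a centered moving average over a window of width w. The sketch below assumes an odd w, counts the center position once, skips window positions that fall outside the alignment, and still divides by w (so scores near the ends are damped). These edge-handling choices and the function name `smooth` are assumptions, not details from the slide.

```python
def smooth(scores, w):
    """Centered moving average of width w over a list of per-column scores."""
    half = (w - 1) // 2
    smoothed = []
    for i in range(len(scores)):
        # Slice is truncated automatically at the ends of the list.
        window = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / w)
    return smoothed

print(smooth([1.0, 2.0, 3.0, 4.0, 5.0], 3))  # [1.0, 2.0, 3.0, 4.0, 3.0]
```
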

15
Experiments with PROSITE patterns
  • For each PROSITE pattern, we extracted from
    Swiss-Prot the set of sequences containing that
    pattern.
  • Among 1320 patterns, we randomly chose 709
    patterns where the number of sequences was not
    greater than 50.
  • Each sequence set was aligned using ClustalW.
  • We used 533 multiple sequence alignments to
    evaluate our method for the case that the
    alignment is correct.
  • Forty-seven alignments were tested for the case
    in which ClustalW aligned only part of the
    motif, or none of it.

16
Effect of smoothing window size
17
Smoothed LOGO, AL2CO, and ARCS for PS00702
18
Performance of ARCS on incorrectly aligned
sequences
Among the 47 incorrectly aligned protein families,
the first peak of ARCS corresponds to part of
the motif in up to 40.4% of test cases.
19
ARCS can be used for a new multiple sequence alignment
  • If ARCS can highlight true motif regions even
    when a multiple sequence alignment is only
    partially correct, ARCS can be used for
    generating a multiple sequence alignment.
  • Indeed, we are trying two different methods:
  • WP: A Weighted Position Approach for Multiple
    Sequence Alignment (manuscript in preparation).
  • An iterative alignment improvement method
    (similar to a method by Wang, Dalkilic, and Kim,
    ACM SAC 2004).

20
Pattern Complexity
  • We measured the sensitivity of ARCS with respect
    to pattern complexity, defined as 1 minus the
    ratio of the number of exact characters in the
    pattern to the length of the pattern.
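
As an illustration of this definition, the sketch below computes the complexity of a PROSITE pattern. It assumes that an "exact character" is a position that allows exactly one residue (a single uppercase letter), that `x(n)` contributes n positions to the pattern length, and that the pattern has no variable-length repeats. These are one reading of the slide; the function name is hypothetical.

```python
import re

def pattern_complexity(pattern: str) -> float:
    """Return 1 - (exact positions / total positions) for a PROSITE pattern."""
    exact = 0
    length = 0
    for element in pattern.strip(".").split("-"):
        match = re.fullmatch(r"([^()]+)(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, repeat = match.groups()
        count = int(repeat) if repeat else 1
        length += count
        # A single uppercase letter is a fixed residue; 'x' and classes are not.
        if len(core) == 1 and core.isupper():
            exact += count
    return 1 - exact / length

# PS00577 in bracketed PROSITE syntax: 15 positions, 3 of them exact (V, G, T).
print(pattern_complexity("[DENY]-x(2)-[KRI]-[STA]-x(2)-V-G-x-[DN]-x-[FW]-T-[KR]"))
```

Under this reading, a pattern made only of fixed residues has complexity 0, and a pattern with no fixed residues has complexity 1.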

21
  • iGibbs: An Improved Gibbs Motif Sampler for
    Proteins

22
Motif Discovery Problem: a search problem
  • Input: N sequences
  • Output: a set of conserved regions

23
Motif search in unaligned sequences: an optimization
problem
24
RELATED RESEARCH
  • Existing motif discovery algorithms
  • Stochastic
  • Gibbs
  • MEME
  • Combinatorial
  • Pratt
  • Teiresias

25
Motivation for iGibbs, an improved Gibbs motif
sampler
  • As shown in previous slides, the search space for
    motif detection is determined by the input
    sequence set.
  • What if only a subset of the input sequences
    contains a motif, or what if more than one motif
    occurs, in different subsets?
  • We have shown that a sequence clustering approach
    improved the performance of MEME.
  • This time, we will improve the performance of
    Gibbs using a double clustering approach.

26
Why Gibbs?
  • Why did we choose to improve the Gibbs motif
    sampler?
  • Gibbs is one of the faster motif samplers, yet
    retains good performance.
  • A Gibbs motif sampler predicts different motifs
    on different executions, since it is a random
    sampling algorithm (we discuss this further on
    the next slide).
  • Of course, the effect of the input sequence set
    is not considered in the design of Gibbs.

27
Gibbs motif sampler review
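
As a reminder of the general technique, a minimal site-sampler sketch is given below. It assumes exactly one motif occurrence per sequence, a fixed motif width, and a uniform background distribution. The function names, the pseudocount, the iteration count, and the example sequences are illustrative assumptions; this is not the Gibbs implementation evaluated in the talk.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(sites, pseudocount=1.0):
    """Per-position residue probabilities estimated from the current sites."""
    total = len(sites) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for k in range(len(sites[0])):
        counts = Counter(site[k] for site in sites)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile

def gibbs_sampler(seqs, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: a simplification
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Build the profile from every sequence except the held-out one.
            sites = [s[p:p + width]
                     for idx, (s, p) in enumerate(zip(seqs, starts))
                     if idx != held_out]
            profile = build_profile(sites)
            # Weight each candidate start by its likelihood ratio to background.
            weights = []
            for p in range(len(seq) - width + 1):
                ratio = 1.0
                for k in range(width):
                    ratio *= profile[k][seq[p + k]] / background
                weights.append(ratio)
            # Sample (rather than maximize) the new start position.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical sequences that share the subsequence KRHG.
seqs = ["ACDKRHGKL", "MKRHGAAAC", "LLLKRHGWY", "PKRHGPPPP"]
print(gibbs_sampler(seqs, width=4, iterations=50, seed=1))
```

Because new positions are sampled rather than chosen greedily, runs with different seeds can return different motifs, which is the behavior noted on the previous slide.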
28
Why clustering approach?
29
RESEARCH PROBLEM OF INTEREST
[Figure: Family 1 with Motif 1 (w1) and Family 2 with Motif 2 (w2)]
  • Separate non-homologous sequences using clustering
  • Look at subsequences!
30
Three search frameworks for motif discovery
  • One by clustering subsequences.
  • with Hardik Sheth
  • iGibbs: clustering whole sequences into
    clusters of sequences, and then clustering the
    motifs predicted by the Gibbs motif sampler from
    each cluster.
  • with Zhiping Wang and Mehmet Dalkilic
  • Iterative subsequence graph splitting approach
    using a min-cut algorithm (a genetic programming
    implementation)
  • with Yong-Hyuk Kim, Byung-Ro Moon, Seunghee
    Bae.

31
DESIGN OBJECTIVE
  • Algorithm should
  • Find motifs from an unknown input set, i.e., work
    without knowing the expected number of motifs
  • Run fast

32
Problem with sequence clustering algorithms
  • A lot!
  • BAG tends to generate clusters that are too
    specific or fragmented: a true family is split
    into multiple clusters.
  • Thus we used a double clustering approach in
    iGibbs.

33
Overview
34
Basic idea: again, the effect of the input
sequence set
  • The input sequences are divided in two different
    ways:
  • Horizontally, by clustering whole sequences, and
  • Vertically, by selecting subsequences
  • Then, each partition is input to the motif
    discovery algorithm, Gibbs in this case.

35
Subsequence selection by patterns
We look for patterns common to all input
sequences. However, it is unlikely that a
single pattern is common to all sequences. Thus we
use multiple patterns that are similar to each
other.
36
patRefine: a procedure to select patterns


37
iGibbs: a double clustering approach
  • Group input sequences into clusters, Ci
  • Predict motifs from each cluster Ci using Gibbs
  • Scan the input for occurrences of the motifs
    predicted.
  • Group input sequences into clusters, Dj, based
    on the motif occurrences.
  • Predict the final motifs from Dj
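
The five steps above can be sketched as a small driver function with pluggable components. Everything concrete here is a stand-in: `shared_kmer` replaces the Gibbs sampler with a toy rule, the fragmented first-pass clustering is hard-coded, and forming one group Dj per predicted motif is an assumption about how the regrouping works. The sketch only shows how a second grouping by motif occurrence can reunite a family that the first clustering split.

```python
def double_clustering(seqs, cluster, find_motif, occurs):
    """Run the two-pass procedure with caller-supplied components."""
    # Steps 1-2: cluster whole sequences, then predict one motif per cluster.
    candidates = [m for m in (find_motif(c) for c in cluster(seqs)) if m is not None]
    # Steps 3-4: scan every input sequence and regroup by motif occurrence.
    groups = [[s for s in seqs if occurs(m, s)] for m in candidates]
    # Step 5: predict the final motifs from the regrouped sequences.
    return [find_motif(g) for g in groups if g]

def shared_kmer(group, k=3):
    """Toy stand-in for Gibbs: smallest k-mer shared by all sequences, if any."""
    if len(group) < 2:
        return None
    kmers = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in group]
    common = set.intersection(*kmers)
    return min(common) if common else None

# Hypothetical input: two families, with the first one split by the clustering.
seqs = ["AAKRHAA", "CCKRHCC", "DDKRHDD", "GGWYFGG", "TTWYFTT"]
fragmented = lambda s: [s[:2], s[2:3], s[3:]]

print(double_clustering(seqs, fragmented, shared_kmer, lambda m, s: m in s))
# ['KRH', 'WYF']
```

In this toy run the third sequence is isolated by the first clustering, but it is pulled back into its family in the second pass because the motif predicted from the first two sequences also occurs in it.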

38
Experiment with PROSITE sequences
  • We collected sequence sets with the same PROSITE
    patterns.
  • There are three data sets:
  • Single-family-set
  • Sequence sets with a single PROSITE pattern
  • Two-family-set
  • Sequence sets with two PROSITE patterns
  • More extensive two-family data: 1025 data sets
  • We ran Gibbs, new Gibbs, MEME, iGibbs without
    patRefine, and iGibbs with patRefine.

39
Single-family-set
PPV from 10 runs. G: Gibbs; NG: Gibbs (new
version); IUN: iGibbs unguided (no patRefine);
I: iGibbs with patRefine; M: MEME
40
Two-family-set
PPV from 10 runs. G: Gibbs; NG: Gibbs (new
version); IUN: iGibbs unguided (no patRefine);
I: iGibbs with patRefine; M: MEME
41
Runtime
42
More extensive two-family data: 1025 data sets
43
When does the clustering approach work?
44
Availability
  • ARCS: An Aggregated Related Column Scoring Scheme
    for Aligned Sequences
  • Bioinformatics, vol. 22, pp. 2326-2332, Oct. 2006
  • iGibbs: A Motif Discovery Framework for Gibbs
    Sampling Algorithm. Proteins: Structure,
    Function, and Bioinformatics, in press. Try
    iGibbs online!
  • More information on our motif projects at
  • http://bio.informatics.indiana.edu/projects/MOTIF/

45
FUTURE WORK
  • Extending the algorithm to DNA sequences
  • Parameter Optimization
  • More rigorous statistical study
  • More general graphical model: in general, a
    sequence has several different motifs, not
    just one.

46
ACKNOWLEDGEMENTS
  • JeongHyeon Choi, Mehmet Dalkilic, Hardik Sheth,
    Zhiping Wang (Indiana)
  • Jiong Yang, Meng Hu (Case Western)
  • Presentation materials by Mehmet Dalkilic and
    Hardik Sheth
  • Bioinformatics Research Group
  • Supported by:
  • NSF CAREER DBI-0237901
  • INGEN (Indiana Genomics Initiatives)