A New Algorithm for the Identification of Transcription Binding Sites - PowerPoint PPT Presentation


Provided by: bioin5


Transcript and Presenter's Notes

Title: A New Algorithm for the Identification of Transcription Binding Sites

A New Algorithm for the Identification of
Transcription Binding Sites
Catherine Putonti, Tong-Bin Li, Sergey Chumakov,
B. Montgomery Pettitt, and Yuriy Fofanov
Department of Computer Science, Department of
Chemistry, University of Houston, Houston, TX;
W.M. Keck Center for Computational and
Structural Biology, Houston, TX; Institute for
Molecular Design, University of Houston, Houston,
TX
Results
Using the same parameters as in our initial test,
our algorithm returned many patterns on the set of
~2,500 promoter sequences. We then filtered these
results, removing all poly A/T patterns as well as
those islands whose nearest neighboring island was
more than 30 bases away (consistent with what is
known about bacterial binding sites). This filter
removed 76 of the original patterns discovered. To
analyze the remaining patterns, we first identified
those patterns (subpatterns included) that occurred
more frequently than others. Because of our
pair-wise comparison, each pattern occurs in at
least two sequences; we found that requiring a
pattern (including its subpatterns) to occur 5-8
times further refined our results. To verify this
parameter, we reevaluated the 15 cases used for our
method comparison test and found that their
patterns were not filtered out. The number of
frequently occurring patterns per factor varied
between 10 and 30. From this analysis, we were able
to identify two situations.
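The filtering step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `is_poly_at`, `gap`, and `filter_pattern` are hypothetical, a poly A/T island is taken to mean one made up solely of A and T, the distance between islands is measured from the end of one island to the start of the next, and each island is represented as a (start position, bases) tuple.

```python
from typing import List, Tuple

Island = Tuple[int, str]  # (start position, matched bases); layout is an assumption


def is_poly_at(bases: str) -> bool:
    """True if the island consists only of A and T (assumed meaning of poly A/T)."""
    return set(bases.upper()) <= {"A", "T"}


def gap(a: Island, b: Island) -> int:
    """Number of bases between the end of the earlier island and the start of the later one."""
    first, second = sorted([a, b])
    return second[0] - (first[0] + len(first[1]))


def filter_pattern(islands: List[Island], max_gap: int = 30, min_num: int = 2) -> List[Island]:
    """Drop poly A/T islands, then islands with no neighbor within max_gap bases.

    Returns an empty list if fewer than min_num islands survive.
    """
    kept = [isl for isl in islands if not is_poly_at(isl[1])]
    near = [
        isl
        for isl in kept
        if any(other is not isl and gap(isl, other) <= max_gap for other in kept)
    ]
    return near if len(near) >= min_num else []


# Made-up example: the poly A/T island and the distant island are removed.
example = [(0, "CAGG"), (10, "ACAG"), (20, "TTTAT"), (100, "GGCC")]
print(filter_pattern(example))
```

The 30-base threshold comes from the text; the default `min_num` of 2 mirrors the parameter value reported in the method comparison.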
Abstract
Identification of transcription
binding sites within promoter regions of genomic
DNA is imperative for the understanding of the
regulatory circuits that direct the expression of
genes. Much research has been devoted to finding
such sites, in particular for the bacterium
Escherichia coli. The binding sites of bacterial
genomes typically contain two conserved
sub-regions with a variable distance between
them. The general approach for locating such
sites relies on weight matrices and related
information-scoring functions. The success of
such methods is dependent upon the information
known about either the gene's function or the
genome. Thus, these approaches are limited when
studying lesser-understood organisms. We propose
a new algorithm that does not require a priori
knowledge. Moreover, it permits variability in
both the length of the conserved sub-region and
the area between. This algorithm implements
pair-wise comparisons between every gene promoter
sequence looking for shared patterns. Novel data
structures efficiently store these patterns as
groups of islands where an island is the shared
subsequence. Furthermore, each island within a
particular pattern has a relative position that
can be floated to accommodate the variable
distance between islands as is typical of
bacterial genomes. After poly A/T patterns are
filtered out, analysis can then be performed to
identify those patterns found in more than one
pair-wise comparison that are biologically
significant. Our algorithm was applied to
several of the groups found by other methods. We
were able to successfully identify the patterns
found by our predecessors as well as several
other patterns with some having as many as five
islands. In hopes of identifying new
transcription binding sites, the algorithm was
then run on the set of all known promoters, ~2,500
(hypothetical genes excluded), in the
Escherichia coli genome. As a result we were
able to identify several new transcription
binding site candidates.
Our Method
Pseudocode for our algorithm
Given:
  min_length (minimum length to be considered an island),
  min_num (minimum number of islands to be considered a pattern),
  float_val (distance the islands can float to compensate for additions/deletions),
  complement (whether the complement of the sequence will be considered)
for (i = 0; i < num_sequences; i++)
  for (j = i+1; j < num_sequences; j++)
    for (k = -max_length; k < max_length_of_sequence; k++)
      compare each element between the two sequences, finding all matches
      if (the number of consecutive matches >= min_length) make an island
      else mark as mismatches
      assign islands to a pattern and increment pattern counter, num_patterns
    for (l = 0; l < float_val; l++)
      float islands, generating new sets of islands in addition to those defined by the previous loop
    if (num_islands < min_num) destroy pattern
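As a rough illustration of the comparison loop in the pseudocode above, the following Python sketch slides one sequence against another and records runs of consecutive matches as islands. It is a minimal sketch, not the authors' code: the function names `find_islands` and `compare_pair`, the tuple layout of an island, and the range of offsets tried are assumptions, and the float step and the complement option are left out.

```python
from typing import Dict, List, Tuple

# (start in sequence 1, start in sequence 2, matched bases); layout is an assumption
Island = Tuple[int, int, str]


def find_islands(seq1: str, seq2: str, offset: int, min_length: int) -> List[Island]:
    """Find runs of at least min_length consecutive matches at one alignment offset.

    Position i of seq1 is compared with position i - offset of seq2.
    """
    islands: List[Island] = []
    run_start = None  # start (in seq1) of the current run of matches
    lo = max(0, offset)                      # first overlapping position in seq1
    hi = min(len(seq1), len(seq2) + offset)  # one past the last overlapping position
    for i in range(lo, hi + 1):              # the extra step at i == hi closes an open run
        matched = i < hi and seq1[i] == seq2[i - offset]
        if matched and run_start is None:
            run_start = i
        elif not matched and run_start is not None:
            if i - run_start >= min_length:
                islands.append((run_start, run_start - offset, seq1[run_start:i]))
            run_start = None
    return islands


def compare_pair(seq1: str, seq2: str, min_length: int = 4, min_num: int = 2) -> Dict[int, List[Island]]:
    """Return one pattern (list of islands) per offset that has at least min_num islands."""
    patterns: Dict[int, List[Island]] = {}
    for offset in range(-len(seq2) + 1, len(seq1)):
        islands = find_islands(seq1, seq2, offset, min_length)
        if len(islands) >= min_num:
            patterns[offset] = islands
    return patterns


# Made-up sequences sharing CAGG and ACAG at the same positions.
print(compare_pair("AACAGGTTTTACAGCC", "GGCAGGAAAAACAGTT"))
```

Applying `compare_pair` to every pair (i, j) with j > i reproduces the two outer loops; the default parameter values are the ones reported in the method comparison.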

Previously identified binding site patterns found
in promoter regions where they had not yet been
identified.
Binding site patterns for promoters not yet
determined.

New promoter sequences matching previously
identified sets
While this situation occurred for several sets, we
will discuss just one here. The LexA factor has
been individually analyzed in the literature. In
addition to the set of promoters already associated
with LexA, we found an additional 7. The consensus
sequence commonly reported for LexA is
CTGTNNNNNNNNACAG. In these 7 cases, we found
patterns closely related to this consensus
sequence. Moreover, these patterns can be found in
all of the promoter sequences in the LexA group.
Listed here are the islands that matched the LexA
sequence directly, along with the start position in
the new sequence relative to the transcription
start site.
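To make the consensus notation concrete, here is a small Python sketch that turns an IUPAC consensus such as CTGTNNNNNNNNACAG into a regular expression and reports where it occurs in a sequence. The helper names `consensus_to_regex` and `find_sites` are hypothetical, and the promoter string is invented for the example; it is not one of the sequences from this study.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> str:
    """Translate an IUPAC consensus string into a regular-expression source string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(consensus: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all (possibly overlapping) consensus matches."""
    # A lookahead lets re.finditer report overlapping occurrences as well.
    lookahead = "(?=" + consensus_to_regex(consensus) + ")"
    return [m.start() for m in re.finditer(lookahead, sequence.upper())]


# Invented sequence containing CTGT, eight arbitrary bases, then ACAG.
made_up_promoter = "ttgaCTGTatatacatACAGcc"
print(find_sites("CTGTNNNNNNNNACAG", made_up_promoter))
```

A search of this kind only finds exact matches to the consensus, whereas the patterns reported here are described as closely related to it, so it should be read as an explanation of the notation rather than of the matching procedure.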
The float function
for (i = 0; i < num_patterns; i++)
  for (j = i+1; j < num_patterns; j++)
    if (rp of I1 <= float_val)
      for (k = 0; k < num_islands; k++)
        if (ap-bp of Ik <= float_val)
          merge the islands of the two patterns to make a new pattern
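One possible reading of the float function is that patterns found at nearby alignment offsets are merged, so that islands separated by a small insertion or deletion end up in the same pattern. The Python sketch below follows that reading only. It is an assumption about the intent, not the authors' implementation, and the `Pattern` class, its `offset` field, and the function `float_patterns` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (position in sequence 1, position in sequence 2, matched bases); layout is an assumption
IslandTuple = Tuple[int, int, str]


@dataclass
class Pattern:
    offset: int                  # p1 - p2 for every island in the pattern
    islands: List[IslandTuple]


def float_patterns(patterns: List[Pattern], float_val: int) -> List[Pattern]:
    """Merge every pair of patterns whose offsets differ by at most float_val.

    Returns only the newly created patterns; the input patterns are left unchanged.
    """
    merged: List[Pattern] = []
    for i in range(len(patterns)):
        for j in range(i + 1, len(patterns)):
            a, b = patterns[i], patterns[j]
            if 0 < abs(a.offset - b.offset) <= float_val:
                islands = sorted(set(a.islands) | set(b.islands))
                # The merged pattern keeps the offset of the first pattern as its reference.
                merged.append(Pattern(offset=a.offset, islands=islands))
    return merged


# Made-up example: the first two patterns are 2 bases apart in offset and get merged.
found = [
    Pattern(0, [(2, 2, "CAGG")]),
    Pattern(2, [(12, 10, "ACAG")]),
    Pattern(9, [(20, 11, "GGCC")]),
]
print(float_patterns(found, float_val=3))
```

With a float_val of 3, the value reported as working best in the method comparison, only the first two patterns in the example are close enough to merge.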
Introduction
Transcription is a fundamental step
in gene expression controlled by the interaction
of transcription factors with specific DNA
subsequences. Identifying the sites in which this
binding occurs can further our understanding of
the regulatory circuits managing the expression
of genes as well as the gene itself. Accordingly,
much research has been devoted to identifying
both the binding sites and the DNA-binding
proteins. The methods used fall into two main
categories
The results generated depend on the values chosen
for the parameters. If min_length or min_num is too
small, many patterns will be found, increasing
computation and analysis time. Likewise, if either
value is too large, patterns may be excluded from
the results.
Comparing with Other Methods
Existing methods have identified some consensus
sequences for promoter groups. Before running our
algorithm on all of the E. coli promoters, we
wanted to see whether it could find these same
patterns. Of the transcription factors identified,
we selected 15 that were included in the work of
both references 1 and 3. This test also proved
useful in determining the parameter values most
appropriate for the genome. We found that a
min_length of 4 and a min_num of 2, while
generating many patterns, would capture all
probable binding sites. Furthermore, we found that
a float_val of 3 worked best. Below are the results
of our test for two of the promoter groups tested.

Search using known consensus sequences.
This method requires knowledge of DNA-binding
proteins and their related binding sites. For
these binding sites, a consensus sequence is
determined. This sequence is then used to search
other promoter regions.
Search using position weight matrices. To derive
these matrices, genes are often clustered,
typically by function. Furthermore, such matrices
require an information-scoring function.

Binding site patterns for promoters not yet
determined
Once again, we found several factors fitting this
description. Here we present the factor fba, which
produces fructose-bisphosphate aldolase class II
protein 2. The following are the 7 patterns found
most frequently when comparing fba to the other
promoter sequences. The first number in each line
gives the number of islands in the pattern.
Although these existing methods have been able to
successfully identify several binding sites, they
are limited by the existing knowledge about
either the gene or the DNA-binding proteins known
for the species. It was our desire to circumvent
this limitation by designing an algorithm which
could search for binding sites without a priori
knowledge.
argR Factor
Our Results
Other Methods Results
For example, we can see the 5-island pattern in
the promoter sequence pheS (125 bases from the
transcription start site):
GGAGCgTACAtAAGTAcGTGAgaatttcGAGC.
Conclusion
Our algorithm was successful in finding both
transcription binding sites identified by other
methods and several other patterns shared among a
group of promoter sequences. Such findings require
further investigation to determine whether they
are in fact binding sites or matches due merely to
random chance. Our algorithm, unlike previous
methods, does not rely on a priori knowledge. As a
result, this method is ideal for identifying
probable binding sites, especially for
lesser-understood genomes. The number of sequences
and the length of the sequences together determine
the number of comparisons performed and, in turn,
the computational time needed. Thus, this
algorithm is well suited for promoter sequences.
Although the search is exhaustive, the results are
very informative and promising. For future
research, it would be beneficial to allow
single-base mismatches within an island, which the
present implementation does not support.
We chose to test our method on the Escherichia
coli K-12 genome. Several other groups have used
this same genome, so it is possible to compare the
results of our algorithm with theirs. E. coli
contains at least 240 proteins that are known or
predicted to be DNA-binding proteins. It has also
been established that transcription binding sites
in bacterial genomes are usually long (~30 bases)
and variable, and are found within 300 nucleotides
upstream of the start of translation.
The binding site typically consists of two
conserved subregions each about 6 bases in
length. The distance between the islands varies
from 0 to 30 base pairs. According to NCBI, there
are 4,289 protein coding genes in this genome,
2,579 of which are documented or predicted
transcription units. We tested our algorithm on
this subset of genes downloaded from the NCBI
website 2.
3
TGATTAWNAATCAWNHTNA 1
tyrR Factor
Our Results
Other Methods Results
Data Structure
The data structure used by our algorithm is
optimal for storing binding sites of bacterial
genomes. We search the promoter sequences for
patterns, and a pattern is stored as a collection
of islands. Here we define an island to be a
subsequence found in both sequences. Our algorithm
relies on the Island data type, which entails the
following:
References
1 Li, Hao, Virgil Rhodius, Carol Gross, and Eric D.
Siggia. Identification of the binding sites of
regulatory proteins in bacterial genomes. Proc.
Natl. Acad. Sci. (US) 2002, 99: 11772-7.
2 NCBI website. http://www.ncbi.nlm.nih.gov/
3 Robison, Keith, Abigail Manson McGuire, and
George M. Church. A Comprehensive Library of
DNA-binding Site Matrices for 55 Proteins Applied
to the Complete Escherichia coli K-12 Genome. J.
Mol. Bio. 1998, 284: 241-54.
3
NTGTAAANWNNNNTWTACANNNNN 1
a: Binary array containing the matching bases
(bases are converted to a 2-bit value)
rp: Relative position of the island with respect
to the first island in the pattern (shift position
for the first island)
p1: Position of the island in sequence 1
p2: Position of the island in sequence 2
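The Island fields listed above could be represented in Python roughly as follows. This is a sketch under assumptions rather than the authors' data structure: the particular 2-bit codes for A, C, G, and T, the extra `length` field (needed here so that leading A bases, encoded as 00, are not lost when unpacking), and the helper names `pack_bases` and `unpack_bases` are all illustrative choices.

```python
from dataclasses import dataclass

# Assumed 2-bit encoding; the source only states that bases become 2-bit values.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(bases: str) -> int:
    """Pack a DNA string into an integer, two bits per base, first base most significant."""
    value = 0
    for base in bases.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack_bases(value: int, length: int) -> str:
    """Recover the DNA string of the given length from its packed form."""
    return "".join(
        BITS_TO_BASE[(value >> (2 * (length - 1 - i))) & 0b11] for i in range(length)
    )


@dataclass
class Island:
    a: int       # packed matching bases (2 bits per base)
    length: int  # number of bases in the island (added for this sketch)
    rp: int      # position relative to the first island in the pattern
    p1: int      # position of the island in sequence 1
    p2: int      # position of the island in sequence 2

    @classmethod
    def from_bases(cls, bases: str, rp: int, p1: int, p2: int) -> "Island":
        return cls(pack_bases(bases), len(bases), rp, p1, p2)

    def bases(self) -> str:
        return unpack_bases(self.a, self.length)


# Made-up positions, used only to show the round trip.
island = Island.from_bases("CAGG", rp=0, p1=118, p2=32)
print(island.bases(), bin(island.a))
```

Packing two bits per base means an island of up to 32 bases fits in a 64-bit word, which is one reason such an encoding is attractive when many islands must be stored and compared.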
In all 15 cases tested, we were able to find the
same (or an extremely similar) consensus
sequence. Furthermore, our algorithm found
several other patterns shared between all or part
of the group, as seen in the case for the tyrR
factor. Such patterns could be further related to
the functionality and/or transcription of the
gene and warrant further investigation.
C. Putonti's work was supported by a training
fellowship from the Keck Center for Computational
and Structural Biology of the Gulf Coast
Consortia (NLM Grant No. 5T15LM07093). T-B. Li's
work was supported in part by a training
fellowship from the W.M. Keck Foundation to the
Gulf Coast Consortia through the Keck Center for
Computational and Structural Biology.
