Title: An analysis of upstream sequences of Dictyostelium discoideum using a distributed computer system
Norio Kobayashi1, Mircea Marin2, Takahiro Morio1, Yoshimasa Tanaka1, and Hideko Urushihara1
1 University of Tsukuba, Japan
2 Johann Radon Institute for Computational and Applied Mathematics, Austria
- 4. Development of distributed system
- We would like to run our sequence analyzer on longer probe sequences.
- cis-elements longer than 20 bases have already been identified in biological experiments.
- Huge computational resources are required for length 20.
- A distributed approach works well.
- Each probe sequence can be processed by our program independently.
- System architecture
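To give a feel for why length 20 is expensive and why the work divides cleanly, the following minimal sketch counts the probe sequences of a given length and splits them into contiguous index ranges, one per worker. The function names, the base-4 encoding of probes and the equal-size splitting policy are illustrative assumptions, not the system described here.

```python
from typing import Iterator, List, Tuple

ALPHABET = "ATGC"  # the four characters used to build probe sequences


def probe_count(n: int) -> int:
    """Number of distinct probe sequences of length n."""
    return len(ALPHABET) ** n


def index_to_probe(index: int, n: int) -> str:
    """Decode an integer in [0, 4**n) into a probe, read as a base-4 number."""
    chars = []
    for _ in range(n):
        index, digit = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def split_ranges(n: int, workers: int) -> List[Tuple[int, int]]:
    """Split [0, 4**n) into `workers` half-open ranges of near-equal size."""
    base, extra = divmod(probe_count(n), workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def probes_in(index_range: Tuple[int, int], n: int) -> Iterator[str]:
    """Lazily enumerate the probes belonging to one worker's index range."""
    start, stop = index_range
    return (index_to_probe(i, n) for i in range(start, stop))
```

For length 20, `probe_count(20)` is 4 to the power 20, roughly 1.1 trillion probes. Because a worker only needs its index range, no probe list has to be shipped over the network.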
- 1. Introduction
- An upstream sequence analysis using information science techniques
- Our goal
- Investigation of characteristic elements in upstream sequences as candidates for cis-elements.
- Life cycle stage specific
- Location specific in upstream sequences
- Stage and location specific
- Our approach
- Development of an upstream sequence database
- Implementation of a sequence analyzer program which extracts characteristic elements in upstream sequences
- Development of a distributed environment for collaborative computing with the program
- 3. Upstream sequence analyzer
- Development of a program which extracts statistically characteristic elements as candidates for cis-elements from upstream sequences
- Design
- Consider the set of all possible sequences of length n constructed from the characters A, T, G and C (the probe sequences). For every probe sequence:
- Step 1. Extract the elements of the upstream sequences of the cDNA clones which are similar to the probe sequence and obtain a list of their positions.
- Step 2. Obtain the statistical distribution of the probe sequence on the upstream sequences from the position list. If the distribution satisfies a certain criterion, then the probe sequence is regarded as a candidate cis-element.
- Implementation in Java
- For step 1: local alignment based on dynamic programming
- align(n) performs local alignments between each probe sequence of length n and the upstream sequences
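Step 1 can be illustrated with a small Smith-Waterman style dynamic program. This is a sketch in Python rather than the Java implementation, and the scoring values (match +1, mismatch -1, gap -1), the function name and the choice to report end positions are all assumptions made for illustration. The actual scoring scheme is not given here.

```python
from typing import List


def local_alignment_hits(probe: str, upstream: str, threshold: int,
                         match: int = 1, mismatch: int = -1,
                         gap: int = -1) -> List[int]:
    """Return 0-based end positions in `upstream` at which some local
    alignment with `probe` reaches a score of at least `threshold`."""
    cols = len(upstream)
    prev = [0] * (cols + 1)      # previous row of the DP matrix
    best_at = [0] * (cols + 1)   # best score seen in each column
    for i in range(1, len(probe) + 1):
        curr = [0] * (cols + 1)
        for j in range(1, cols + 1):
            s = match if probe[i - 1] == upstream[j - 1] else mismatch
            curr[j] = max(0,                 # start a new local alignment
                          prev[j - 1] + s,   # align probe[i-1] with upstream[j-1]
                          prev[j] + gap,     # skip a probe character
                          curr[j - 1] + gap) # skip an upstream character
            best_at[j] = max(best_at[j], curr[j])
        prev = curr
    return [j - 1 for j in range(1, cols + 1) if best_at[j] >= threshold]
```

With these assumed scores, a threshold equal to the probe length accepts only exact occurrences, and lowering it admits mismatches and gaps. The resulting position list is the input that Step 2 would summarize.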
Figure 3. An illustration of statistically characterized elements in upstream sequences. Diagram labels: 2,000 bases, 100 bases, transcription initiation site; V stage, A stage, S stage, C stage. Legend: red, V stage specific element; green, location specific; blue, V stage and location specific.
Figure 5 (diagram labels): user client, Web, coordinator, upstream sequence database, LAN 1, LAN 2, LAN 3; messages: request, result, upstream sequences; protocol: SOAP / Jini.
- 2. Upstream sequence database
- Acquisition of upstream sequences
- Step 1. Sequence alignment between the genome sequence and cDNA contig sequences
- The genome sequence of chromosome 2 (The Dictyostelium discoideum genome project)
- 8,402 cDNA contig sequences (The Dictyostelium cDNA project in Japan)
- We have obtained 2,152 upstream sequences
Figure 5. Architecture of the distributed system, which runs sequence analyzers in parallel.
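The coordinator role in this architecture can be sketched on a single machine. In the sketch below a thread pool stands in for the remote sequence analyzers, so none of the SOAP or Jini communication is modelled. The function names, the round-robin chunking and the toy analyzer are assumptions made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Analyzer = Callable[[str, Sequence[str]], int]


def analyze_chunk(probes: List[str], upstream_sequences: Sequence[str],
                  analyzer: Analyzer) -> Dict[str, int]:
    """Work done by one analyzer: evaluate every probe in its chunk."""
    return {p: analyzer(p, upstream_sequences) for p in probes}


def coordinate(probes: List[str], upstream_sequences: Sequence[str],
               analyzer: Analyzer, workers: int = 3) -> Dict[str, int]:
    """Split the probes round-robin, run the chunks in parallel, merge results."""
    chunks = [probes[i::workers] for i in range(workers)]
    merged: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(analyze_chunk, c, upstream_sequences, analyzer)
                   for c in chunks]
        for future in futures:
            merged.update(future.result())
    return merged


def count_exact(probe: str, upstream_sequences: Sequence[str]) -> int:
    """Toy analyzer: number of upstream sequences containing the probe verbatim."""
    return sum(probe in s for s in upstream_sequences)
```

For example, `coordinate(["TTG", "CAA", "GGG"], ["TTGCAA", "ACAAT"], count_exact)` returns one count per probe. The merge step needs no synchronization beyond collecting results, because the probes do not depend on each other.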
Figure 1 (diagram labels): genome sequence, upstream part, contig part, 5' and 3' ends, Clone 1, Clone 2; positions α, β, -1 and 1; lengths 1999 - min(α, -1) and 99 + max(β, 1).
- 5. Results
- The system extracted the candidate elements listed in Table 1 in 436 minutes.
- TTGSSCAA is a known element called the Harwood element (TTGN2,4CAA), which deactivates expression in prestalk cells in Dictyostelium.
- GRGTGTAT partially matches a known element (DGKGKGDN4-7DGKGKGD) which regulates prestalk gene expression.
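The element names above use IUPAC nucleotide codes, for example S for C or G and N for any base. The following sketch expands a degenerate pattern into its concrete sequences and checks them against another pattern. The helper names are illustrative, and reading TTGN2,4CAA as permitting a spacer of two arbitrary bases is an assumption about that notation.

```python
import re
from itertools import product
from typing import List, Pattern

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(pattern: str) -> List[str]:
    """List every concrete sequence described by a degenerate pattern."""
    return ["".join(chars) for chars in product(*(IUPAC[c] for c in pattern))]


def to_regex(pattern: str) -> Pattern[str]:
    """Compile a degenerate pattern into a regular expression."""
    return re.compile("".join("[%s]" % IUPAC[c] for c in pattern))


# Assumed two-base-spacer reading of the Harwood element.
harwood_two_base_spacer = to_regex("TTGNNCAA")
all_match = all(harwood_two_base_spacer.fullmatch(s) for s in expand("TTGSSCAA"))
```

Under that reading, every sequence described by TTGSSCAA is an instance of the Harwood pattern, so `all_match` is true.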
Figure 1. Structure of the upstream sequence for cDNA contigs. It includes 2,000 bases upstream and 100 bases downstream of the transcription initiation sites of all concerned cDNA clones. Diagram labels: Clone 3, 2,000 bases, 100 bases.
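The window described in the Figure 1 caption can be sketched as a small coordinate calculation. The sketch assumes forward-strand, 0-based coordinates, takes each transcription initiation site as the index of the first transcribed base, and clips the window at the ends of the genome sequence. The function name and these conventions are illustrative assumptions.

```python
from typing import Sequence, Tuple


def upstream_window(tis_positions: Sequence[int], genome_length: int,
                    upstream: int = 2000, downstream: int = 100) -> Tuple[int, int]:
    """Return the half-open interval [start, end) covering `upstream` bases
    before the most upstream initiation site and `downstream` bases from
    the most downstream one, clipped to the genome sequence."""
    first, last = min(tis_positions), max(tis_positions)
    start = max(0, first - upstream)
    end = min(genome_length, last + downstream)
    return start, end
```

With a single clone the window is 2,100 bases long. When several clones of one contig start at different positions, the window grows by the distance between the outermost initiation sites.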
Table 1. Extracted elements with probe sequences
of lengths 4-10
- 6. Conclusions
- We have presented an upstream sequence analysis with an approach from information science.
- We have developed an upstream sequence database, an analyzer and a distributed environment.
- The system extracted 6 candidate sequences, and two of them are known elements.
- As a result, we could confirm that our system is effective and practical.
- Future work
- We will run our system for probe sequences of lengths up to 20.
- We would like to examine, with biological experiments, whether the extracted elements really work as cis-elements.
Figure 4. Computation time for probe sequences of lengths 4-7 for score thresholds 0, 2, 4, 6 and 8.
Figure 2. Graphical user interfaces embedded in web pages. (A) Sequence information page consisting of the viewer of transcription initiation sites of the concerned cDNA clones. (B) Viewer for the details of upstream sequences in a list format.