Title: Error Correction and Clustering Gene Expression Data Using Majority Logic Decoding Humberto OrtizZua
Error Correction and Clustering Gene Expression
Data Using Majority Logic Decoding
Humberto Ortiz-Zuazaga, Sandra Peña de Ortiz and Oscar Moreno
Departments of Biology and Computer Science
Rio Piedras Campus, University of Puerto Rico
Abstract
Microarrays allow researchers to
simultaneously measure the expression of
thousands of genes. They give invaluable insight
into the transcriptional state of biological
systems, and can be important in understanding
physiological as well as diseased conditions.
However, analyzing data from many thousands of
genes with only a few replications is very
difficult. We have devised a novel method of
correcting errors in microarray experiments that
also clusters genes into groups and categorizes
their measurements into coarse divisions
suitable for discrete techniques for reverse
engineering. These techniques are based on finite
fields and algebraic coding theory. We test these
new techniques on a data set obtained from
behavioral training experiments on rats, and
identify two novel genes that may be involved in
learning and memory.
Methods
- Results
- We have performed the analysis described above on
the CTA data set. In this data set, there are 127
consistent genes, which we divide into clusters
by grouping together the genes that have the same
set of calls across the 1 to 24 hour timepoints. This
results in 23 clusters.
- Dr. Sandra Peña's lab studies the role of
CREB, a gene known to be required for long-term
memory. We focus on the expression of this gene,
and other genes with the same pattern of
expression. CREB binds to a DNA element called
cAMP-response element (CRE) in the target genes,
and in conjunction with a co-activator initiates
transcription of the target genes.
- We focus on the cluster labeled "000+". The
consensus of the calls for these genes represents
no change over the 1, 3, and 6 hour time points,
followed by upregulation at the 24 hour
timepoint. This cluster consists of genes whose
expression most closely matches the expression
profile of CREB. We investigated the genes in
this cluster in depth, retrieving the gene
information and sequence from the Ensembl Genome
Browser version 32.
- From Ensembl we obtained genomic sequence for
each gene, 1020 base pairs starting 800 base
pairs upstream of the transcription start site.
These sequences were then submitted to TESS to
search for transcription factor binding sites. We
look for the CRE element, a DNA sequence that is
the target site for CREB. Genes that have CRE in
their upstream region are potential targets of
regulation by CREB.
- Two genes in particular caught our interest: Pmch
and Calca. Both genes have CRE elements in their
upstream regions. According to the Rat Genome
Database, Pmch is a cyclic neuropeptide that
induces hippocampal synaptic transmission. It
seems to have an effect on appetite or metabolism
and anxiety, and promotes synaptic transmission
in the hippocampus [3]. Calca is principally a
vasodilator, but seems to have a role in axonal
regeneration or synaptogenesis [4]. Thus these
genes exhibit a pattern of expression consistent
with the expression of Creb1, have CRE elements
upstream of their transcription start site, and
seem to have a role in strengthening or creating
new synapses.
- Discussion
- We have developed a method for error correction
of microarray experiments. The technique produces
a clustering of genes and describes each gene as
unchanged, upregulated, or downregulated, in
accordance with biologists' natural description of
expression levels. We applied these techniques to
a microarray data set derived from a CTA
experiment in rats, looking for genes that may be
important in learning and memory processes. We
found two genes, Pmch and Calca, that share an
expression pattern with CREB, contain CRE in
their upstream regions, and have demonstrated
function related to synaptic plasticity. Pmch and
Calca are strongly implicated as important genes
for the formation of memories. We are now
actively seeking confirmation of these genes'
role in CTA and of their regulation by CREB as a
result of CTA training.
- Acknowledgements
- The authors received partial support from a SCORE
grant (S06GM08102), an INBRE grant (P20RR16470)
and an IDeA Program grant (P20RR15565), from the
National Institutes of Health, and National
Science Foundation grant CNS-0540592.
Fig. 1 Conditioned Taste Aversion Task.
CTA is an
associative aversive conditioning paradigm in
which pairing gastrointestinal malaise (induced
by lithium chloride, LiCl, the unconditioned
stimulus) with prior exposure to a novel taste
(the conditioned stimulus) may create a strong
and long lasting aversion to the novel taste.
CTA lends itself as an excellent model system to
study the dynamics of gene regulation in learning
and memory because it is a single trial
associative learning paradigm, which involves
discrete regions in the brain, including selected
amygdala nuclei. The gene profiling experiment
was replicated five times. Four animals were
used per condition for each replicate. Thus, a
total of twenty rats were used per condition.
Animals were sacrificed by decapitation at 1, 3,
6, and 24 hours after conditioning and amygdala
enriched tissue punches were obtained for RNA
isolation. Hybridization, image capture and
analysis were similar to the procedures described
in [1]. The data set thus obtained (the CTA data
set) is described in [2]. In summary, the data
has two controls, the pre-treatment group and the
one hour saline group, and four time points, 1,
3, 6, and 24 hours after conditioning. Each array
has 1185 genes, and we have 5 biological
replicates of each array.
Error Correction and Clustering
We have developed a method for error
correction and clustering of gene expression
data that proceeds in stages. First we discretize
each replicate separately by comparing each
expression value to the mean of the control
values. We choose an epsilon such that either the
1 h or 24 h timepoints are within epsilon of the
controls. We call a gene + if its expression is
greater than the control + epsilon, - if it is
less than the control - epsilon, and 0 otherwise. The
second stage uses majority logic decoding to
summarize all the repetitions into a single call
for each timepoint. The third stage repeats the
discretization using a single control value for
all repetitions, the averaged control. The fourth
stage tosses out extreme values and repeats the
discretization. Finally, we test if at least two
of the stages agree, and the calls across the 1,
3, 6, and 24 h timepoints have the consecutive
zeros property. We call genes that pass this
test "consistent". These heuristics try to
capture biological knowledge about the behaviour
of the genes.
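The first two stages, and the grouping of genes by their call pattern, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the toy expression values, the epsilon of 0.5, and the rule that ties resolve to 0 are all assumptions made for illustration. The later stages (averaged control, removal of extreme values, and the consistency test) are left out because the text does not specify them in enough detail.

```python
from collections import Counter, defaultdict


def discretize(value, control, epsilon):
    """Call '+', '-' or '0' by comparing a value to control +/- epsilon."""
    if value > control + epsilon:
        return "+"
    if value < control - epsilon:
        return "-"
    return "0"


def majority_call(calls, tie_call="0"):
    """Return the most frequent call among the replicates.

    Ties resolve to tie_call; this tie rule is an assumption, not from the text.
    """
    ranked = Counter(calls).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_call
    return ranked[0][0]


def decode_gene(replicates, controls, epsilon):
    """Discretize each replicate against its own control, then vote per time point."""
    per_replicate = [
        [discretize(v, control, epsilon) for v in values]
        for values, control in zip(replicates, controls)
    ]
    # zip(*...) regroups the calls by time point across replicates.
    return "".join(majority_call(column) for column in zip(*per_replicate))


def cluster_by_pattern(patterns):
    """Group gene names that share the same string of calls."""
    clusters = defaultdict(list)
    for gene, pattern in patterns.items():
        clusters[pattern].append(gene)
    return dict(clusters)


# Made-up data: 5 replicates per gene, values at the 1, 3, 6 and 24 h time points.
toy_controls = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical control mean per replicate
toy_data = {
    "gene_a": [[1.1, 0.9, 1.2, 2.0], [1.0, 1.0, 1.7, 1.9], [0.8, 1.1, 1.0, 2.2],
               [1.2, 1.3, 0.9, 0.9], [1.0, 1.0, 1.1, 1.8]],
    "gene_b": [[1.0, 1.1, 0.9, 1.6], [0.9, 1.0, 1.0, 1.7], [1.1, 0.2, 1.0, 1.8],
               [1.0, 1.0, 1.0, 1.6], [1.0, 0.9, 1.1, 1.4]],
    "gene_c": [[1.0, 0.3, 1.0, 1.0], [1.0, 0.4, 1.1, 1.0], [0.9, 0.2, 1.0, 1.1],
               [1.0, 1.0, 1.0, 1.0], [1.1, 0.4, 0.9, 1.0]],
}

patterns = {
    gene: decode_gene(replicates, toy_controls, epsilon=0.5)
    for gene, replicates in toy_data.items()
}
clusters = cluster_by_pattern(patterns)
print(clusters)
```

On this toy input, the isolated outlier calls in single replicates are outvoted, so gene_a and gene_b both decode to 000+ and fall into the same cluster, while gene_c decodes to 0-00.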