Title: An Introduction to Databases and Data Standards for Metabolomics
1. An Introduction to Databases and Data Standards for Metabolomics
- Dr Helen Jenkins
- Computational Biology Research Group
- haf_at_aber.ac.uk
2. This lecture will cover
- A brief introduction to the -omes and the -omics.
- Description of the projects that we are involved in and our remit within them.
- Discussion of the requirements for metabolomics experiment databases.
- A brief description of our solution.
3. The -omes and -omics (1)
- Genome: The complete set of genes in an organism.
- Genomics: Identification of genes and sequencing their exact order.
- Transcriptome: The set of RNA transcripts produced by a genome at any one time.
- Transcriptomics: Study of which genes are switched on when.
4. The -omes and -omics (2)
- Proteome: The complete set of proteins produced by a genome at any one time.
- Proteomics: Study of the proteome.
- Metabolome: The complete set of metabolites produced by a genome at any one time.
- Metabolomics: Study of the metabolome.
5. Metabolomics
Comprehensive (?) estimation of the low molecular weight compounds (e.g. sugars) in a biological sample.
- Example applications:
  - Basic biology
  - Food quality (human/animal)
  - Clinical trials
- Examples of samples:
  - Plant material
  - Blood/urine from humans and animals
  - Microbial material (yeast/fungi)
- Samples may be taken from:
  - The organisms themselves, or
  - Their environment (footprinting)
6. Our projects (1): Metabolome technology for the profiling of GM and conventionally bred plant materials
- 3-year project (Dec 2001 to Dec 2004).
- Funded by the UK Food Standards Agency (FSA).
- Collaboration between DCS and IBS at Aberystwyth and the Max Planck Institute of Molecular Plant Physiology in Golm, Germany.
- Aim: To build a technology platform for performing metabolomics to assess the differences between GM and conventionally bred versions of plant material.
- Using Arabidopsis thaliana and potato as model plants.
- Our remit: To build a robust database for metabolome and related data.
7. Our projects (2): Hierarchical plant metabolomics (HiMet)
- 3-year project (Apr 2003 to Apr 2006).
- Funded by the Biotechnology and Biological Sciences Research Council (BBSRC).
- Collaboration between DCS and IBS at Aberystwyth, the John Innes Centre in Norwich, Rothamsted Research in Harpenden and CNAP at the University of York.
- Aim: To understand metabolome patterns through analysis of the data produced by high-throughput chemical analyses of metabolism mutants.
- Using mutant lines of Arabidopsis thaliana.
- Our remit: To build a database to store information about plant metabolomics experiments and their results, and to facilitate statistical and data mining analysis of the datasets.
8. Our projects (3): The UK Centre for Plant and Microbial Metabolomics (MeT-RO)
- 4-year project (Dec 2004 to Dec 2008).
- Funded by the BBSRC.
- Collaboration between Rothamsted Research in Harpenden, DCS and IBS at Aberystwyth, and the Department of Chemistry at the University of Manchester.
- Aims:
  - To establish and operate a high-throughput, state-of-the-art global metabolome fingerprinting (as distinct from metabolic profiling) service, accessible on a user-pays basis to all members of the biological sciences community.
  - To carry out research and training in plant and microbial metabolomics and associated bioinformatics.
- Our remit: To build a robust data handling platform for storing and querying the datasets produced by experiments carried out through the centre.
9. Understanding metabolomics data (1): The different types of omic data
- Reference (standard) data
  - Gene sequences
  - Gene expressions
  - Proteins
  - Metabolites
- Experiment data
  - Data from particular experiments, e.g. an experiment to assess changes in the metabolome of a particular organism during times of drought.
10. Understanding metabolomics data (2): Biological signal versus experimental noise
- Metabolomics experiments aim to measure biological signal: organisms differ.
- But the process of performing an experiment may induce noise: unintended side-effects of the experimental process.
- Study of the metabolome provides us with a means of determining the extent of a biological signal, BUT the dynamic nature of the metabolome means that experimentally induced noise may well obscure the signal.
- Proper understanding of experimental context is, therefore, key to:
  - Correct interpretation of experimental results.
  - Selection of datasets that may be meaningfully compared.
  - Experiment reproducibility.
11. Understanding metabolomics data (3): The metabolomics experiment pipeline
(Diagram of the metabolomics experiment pipeline.)
12. Understanding metabolomics data (4): Data from GC-MS
(Diagram: a GC-MS machine and its output.)
13. Understanding metabolomics data (5): Processing the output from GC-MS
(Diagram: the GC-MS output passes through Noise Removal, Peak Identification, Peak Deconvolution and Peak Quantification to give a quantified peak list; Peak Labelling then assigns each peak a chemical identity.)
Quantified peak list (metabolite, area):
- Peak one: xxx
- Peak two: xxx
- Peak three: xxx
- Peak four: xxx
14. Understanding metabolomics data (6): Metabolome signatures
(Diagrams: FT-IR output and DI-ESI output.)
15. Designing a metabolomics database
- From our understanding of the nature of metabolomics data we can define the following list of requirements for a metabolomics database:
  - It should support data from all parts of the experiment pipeline.
  - It should support the range of experimental procedures used by different laboratories.
  - It should support the range of dataset types produced by different chemical analysis technologies.
  - It should facilitate sophisticated querying for datasets and mining over experimental metadata.
- Therefore, our database design must be both flexible and extensible.
16. A component-based structure
- Our solution was to specify a component-based structure for our data model:
  - A set of components, each of which aims to describe a different part of a plant metabolomics experiment.
  - Each component has a set of core, required data items, relevant to all plant metabolomics experiments.
  - Each component may have one or more sub-components that extend the core data for the component to describe particular methodologies and/or technologies.
- An experiment is then described by a complete set of components, each one of which may contain either:
  - The core data for that component, or
  - The core data for that component plus additional data items specified in one of its sub-components.
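As an illustration, the component/sub-component structure above can be sketched in Python. The class names, field names and example values here are hypothetical, chosen only to show the idea; they are not the actual ArMet definitions.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """Core, required data items relevant to all plant metabolomics experiments."""
    name: str
    core: dict  # core data items, required for every experiment

@dataclass
class SubComponent(Component):
    """Extends a component's core data for a particular methodology/technology."""
    extension: dict = field(default_factory=dict)

# An experiment is a complete set of components; each slot holds either the
# core component or one of its sub-components (illustrative values only).
experiment = [
    Component("Genotype", core={"species": "Arabidopsis thaliana"}),
    SubComponent("MetabolomeEstimate",
                 core={"technique": "GC-MS"},
                 extension={"column_type": "DB5-MS"}),  # technology-specific item
]
```

Because a sub-component is a component plus extra items, code that understands only the core items can still process every experiment description.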
17. ArMet (Architecture for Metabolomics) components
(Diagram of the ArMet components, including Genotype, Metabolome Estimate, Output and DataPoint.)
18. Components and sub-components
19. A general Metabolome Estimate component
(Diagram: a Metabolome Estimate connects a Genotype to one or more outputs. An output may be a Targeted output (per Metabolite), and/or a Metabolome output (per Peak), and/or a Fingerprint output, each made up of DataPoints.)
20. Recent Developments
- Protocol repository
- Protocol log
21. This lecture will include
- A word on the importance of data standards for the -omics and where our work fits in.
- Discussion of some of the issues we have faced whilst implementing our databases.
- A brief look at where this field is going in the future.
22. Data standards
- Typically experimentalists will want to be able to do one or many of the following:
  - Transmit datasets to collaborators or other external sites.
  - Store data with adequate curation.
  - Make datasets available for statistical analysis and data mining.
- Data standards support these activities by defining:
  - The structure of datasets.
  - The types of information that they should contain.
23. Data standards for the -omics
- Transcriptomics
  - MIAME (Nat. Genet., 29, 365-371) defines the data that should be recorded for a transcriptome experiment.
  - MAGE (Spellman et al., Genome Biol., 3, 2002) is an associated formal data description.
- Proteomics
  - PEDRo (Nat. Biotechnol., 21, 247-254) is a formal data description for proteomics.
  - PSI-OM (Proteomics, 4, 490-491) is the result of further development of PEDRo by the HUPO PSI.
  - MIAPE is an associated definition of the data that should be recorded for proteomics experiments.
24. Data standards for metabolomics
- Until recently there was no equivalent metabolomics-specific data standard for defining metabolite complements in their experimental context. Now we have:
  - SMRS (http://www.smrsgroup.org/), a draft policy for the standardisation of reporting for metabolic analysis, mainly for pre-clinical drug trials.
  - MIAMET (Trends in Plant Science, 9(9), 418-425), a definition of the data that should be recorded for a metabolomics experiment.
  - ArMet (Nat. Biotechnol., 22, 1601-1606), a formal data description for plant metabolomics.
25. Benefits of data standards
- Data standards such as these, which describe experiment results in their experimental context, enable:
  - Proper interpretation of results.
  - Laboratory interoperability.
  - Meaningful comparison of datasets.
  - Replication of experiments.
  - More options for retrospective analysis of data.
26. Other benefits of data standards
- Data standards, and the formal data descriptions that underlie them:
  - Encourage consideration and development of best practice and Standard Operating Procedures.
  - Standardise the reporting of experiments and the archiving of data associated with publications.
  - Enable the development of databases and verifiable transmission mechanisms to facilitate the storage, collection and dissemination of logically correct datasets.
27. Implementation Issues
- Speed and optimisation of querying.
- Building acceptable user interfaces for data submission.
- Standardising terminology.
28. Speed of querying
- Metabolomics experiments can produce vast quantities of results data. For example:
  - Each experiment within our HiMet project involves 220 samples.
  - There are currently 11 experiments.
  - Each sample is subjected to multiple types of chemical analysis.
  - Each dataset produced from a single chemical analysis contains 2000 datapoints.
- All of this results data is, of course, in addition to the experiment metadata that provides the experimental context for the results.
- Storing all of these datapoints in a single unordered database table can lead to query times of hours for the data from a single experiment or sample.
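A back-of-envelope calculation from the figures above shows the scale, taking the conservative assumption of just one chemical analysis per sample:

```python
# Back-of-envelope data volume for HiMet, using the figures on this slide.
samples_per_experiment = 220
experiments = 11
datapoints_per_dataset = 2000  # one dataset = one chemical analysis of one sample

# Conservative lower bound: one chemical analysis per sample
# (in reality each sample is subjected to multiple analyses).
total_datapoints = samples_per_experiment * experiments * datapoints_per_dataset
print(total_datapoints)  # 4840000
```

So even under this lower bound the results table holds nearly five million rows, before any experiment metadata is counted.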
29. Optimisation (1): Database tuning
- A key aspect of the development of relational databases is normalisation: rules for grouping the fields that we want to store into tables.
- Denormalisation involves reducing the degree of normalisation of the tables in a relational database to improve performance.
  - It should not be done when a table is regularly updated, as normalisation protects us against insert/update/delete anomalies that can reduce the integrity of a database.
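The trade-off can be sketched with SQLite. The table and column names below are hypothetical, not the real schema: the normalised design keeps sample metadata in its own table and joins at query time; the denormalised one repeats it on every datapoint row, avoiding the join but exposing the update anomaly mentioned above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalised: each datapoint references its sample via a foreign key.
cur.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, organism TEXT)")
cur.execute("CREATE TABLE datapoint (sample_id INTEGER REFERENCES sample(id), value REAL)")
cur.execute("INSERT INTO sample VALUES (1, 'Arabidopsis thaliana')")
cur.execute("INSERT INTO datapoint VALUES (1, 0.42)")

# Retrieving organism + value requires a join.
row = cur.execute("""SELECT s.organism, d.value
                     FROM datapoint d JOIN sample s ON s.id = d.sample_id""").fetchone()

# Denormalised: organism repeated per datapoint row; no join at query time,
# but correcting the organism name would now mean updating every row.
cur.execute("CREATE TABLE datapoint_flat (organism TEXT, value REAL)")
cur.execute("INSERT INTO datapoint_flat VALUES ('Arabidopsis thaliana', 0.42)")
flat = cur.execute("SELECT organism, value FROM datapoint_flat").fetchone()
```

Both queries return the same `('Arabidopsis thaliana', 0.42)` pair; the difference is purely in how the work is distributed between write time and read time.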
30. Optimisation (2): Data Organisation
- Hash files
  - Records in a table are written to disk based on a hash function.
- Indexes
  - A structure that is separate from, but associated with, a table; it contains an entry for each value in the indexed column(s) of the table and provides direct pointers to the rows of the table on disk.
- B-trees
  - Records in a table/index are organised into a balanced-tree structure.
- Clustered tables
  - Groups of one or more tables physically stored together because they share common columns and are often used together.
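Indexes are the easiest of these to demonstrate. In the sketch below (hypothetical table and column names), adding an index on the queried column lets the query planner use a direct search instead of a full table scan; SQLite implements its indexes as B-trees, so this touches two of the structures listed above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE datapoint (sample_id INTEGER, mz REAL, intensity REAL)")
cur.executemany("INSERT INTO datapoint VALUES (?, ?, ?)",
                [(i % 220, 100.0 + i, float(i)) for i in range(10_000)])

# Index on the column used in the WHERE clause.
cur.execute("CREATE INDEX idx_sample ON datapoint(sample_id)")

# The query planner now reports an index search rather than a table scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM datapoint WHERE sample_id = 7").fetchall()
print(plan)  # the plan text mentions idx_sample

n = cur.execute("SELECT COUNT(*) FROM datapoint WHERE sample_id = 7").fetchone()[0]
```

On 10,000 rows the difference is invisible; on the millions of datapoint rows described earlier, it is the difference between seconds and hours.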
31. Optimisation (3): Distribution
- Table partitioning
  - Break tables or index files down into smaller, more manageable chunks that may be distributed across disks.
- Parallelising queries
  - Break the query down into a number of functions that are carried out by different processes.
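A minimal sketch of horizontal partitioning, in plain Python rather than a real DBMS: datapoints are routed to chunks by a hash-style function of the experiment id, so a per-experiment query scans one chunk only, and the chunks could sit on separate disks or be scanned by separate processes. All names and values are illustrative.

```python
N_PARTITIONS = 4

def partition_of(experiment_id: int) -> int:
    """Hash-style partitioning function: experiment id -> partition number."""
    return experiment_id % N_PARTITIONS

# Route each (experiment_id, value) datapoint to its partition.
partitions = {p: [] for p in range(N_PARTITIONS)}
for exp_id, value in [(1, 0.1), (2, 0.2), (5, 0.5), (6, 0.6)]:
    partitions[partition_of(exp_id)].append((exp_id, value))

# A query for experiment 5 scans only its own partition, not all of the data.
hits = [v for e, v in partitions[partition_of(5)] if e == 5]
print(hits)  # [0.5]
```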
32. User Interface Issues
- Experimentalists gather the data about their experiments in a number of formats:
  - Lab books
  - Protocol documentation
  - Spreadsheets for material tracking
  - Instrument-specific data files for chemical analysis output
- Initially we produced web-based, form-filling interfaces for metadata submission, linked to upload facilities for directly uploading dataset files, but:
  - The detail required to capture the sources of experimentally induced noise makes this type of form-filling interface overwhelming.
33. Standardising Terminology
- The use of common (standard) terminology to describe the concepts involved in an experiment is crucial to producing standard datasets that can be compared and interpreted correctly.
- Many biological terms have other meanings when used in other situations, e.g. an everyday meaning or a chemical meaning.
- In our solutions so far we have employed the notion of extensible controlled vocabularies:
  - Certain key fields within the database must take their values from a controlled list.
  - The controlled list entries include information on the authority for that term, i.e. the standard definition or meaning.
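The idea of an extensible controlled vocabulary can be sketched as follows. The field names, terms and the "Plant Ontology" authority are illustrative examples, not the actual controlled lists used in the database.

```python
# Each allowed term carries the authority that gives its standard definition.
controlled_vocabulary = {
    "organ": {
        "leaf": "Plant Ontology",
        "root": "Plant Ontology",
    },
}

def validate(field: str, term: str) -> str:
    """Accept a term only if it is in the controlled list; return its authority."""
    try:
        return controlled_vocabulary[field][term]
    except KeyError:
        raise ValueError(f"{term!r} is not a controlled term for {field!r}")

def extend(field: str, term: str, authority: str) -> None:
    """Extensible: new terms may be registered, provided an authority is named."""
    controlled_vocabulary.setdefault(field, {})[term] = authority

extend("organ", "tuber", "Plant Ontology")
print(validate("organ", "tuber"))  # Plant Ontology
```

Validation at submission time is what makes the resulting datasets comparable across laboratories; extension is what keeps the scheme from blocking legitimate new terms.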
34. The Future
- Omic ontologies
- Pan-omic data models
35. Ontologies and controlled vocabularies
- An ontology is a hierarchical formal specification of the concepts within a domain. It specifies the vocabulary that may be used to refer to those concepts and how the concepts may be related.
- Controlled vocabularies are required by MIAME/MAGE, PEDRo/PSI-OM and ArMet.
- The development of ontologies will help to formalise these controlled vocabularies, which are currently modelled as values with authorities.
- Both MGED in the microarray community and the PSI in the proteomics community have recognised this and are working on ontologies to support the description of transcriptomics and proteomics experiments.
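What an ontology adds over a flat controlled vocabulary is structure between terms. A minimal sketch, with an illustrative (hypothetical) is-a hierarchy of analysis techniques:

```python
# Tiny is-a hierarchy: child concept -> parent concept (illustrative only).
is_a = {
    "GC-MS": "mass spectrometry",
    "DI-ESI-MS": "mass spectrometry",
    "mass spectrometry": "chemical analysis",
    "FT-IR": "chemical analysis",
}

def ancestors(concept: str) -> list:
    """Walk up the is-a hierarchy from a concept to the root."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain

print(ancestors("GC-MS"))  # ['mass spectrometry', 'chemical analysis']
```

With such relations in place, a query for "mass spectrometry" datasets can automatically match GC-MS and DI-ESI-MS entries, which a flat list of values with authorities cannot do.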
36. Ontology for experiments (EXPO)
- Work carried out by colleagues at Aberystwyth (Larissa Soldatova and Prof. Ross King) has resulted in an ontology for experiments (EXPO).
- They are now planning to validate the ontology by showing that it may be instantiated for metabolomics experiments.
- Benefits to ArMet:
  - Formalisation of the required controlled vocabularies.
  - An alternative modelling approach that is helping to inform the development of the ArMet data model.
- As EXPO is at a higher conceptual level than any of the omics (and could potentially be instantiated for transcriptomics and proteomics experiments as well), we are also hoping to evaluate its usefulness in developing common models for sample description.
37. A Single Data Model for Sample Description
- Metabolomics analyses will be performed in conjunction with transcriptome and proteome analyses in integrative experiments.
- We need a common data standard for experiment descriptions.
- It must contain the metadata required to fully evaluate the different omic analyses performed on samples from a single experiment.
- Others working in this area have produced:
  - SysBio-OM (Bioinformatics, 20(12), 2004-2015).
  - FGE-OM (Bioinformatics, 20(10), 1583-1590).