Title: An Introduction to Databases and Data Standards for Metabolomics
1. An Introduction to Databases and Data Standards for Metabolomics
- Dr Helen Jenkins
- Computational Biology Research Group
- haf_at_aber.ac.uk
2. This lecture will cover
- A brief introduction to the -omes and the -omics.
- Description of the projects that we are involved in and our remit within them.
- Discussion of the requirements for metabolomics experiment databases.
- A brief description of our solution.
3. The -omes and -omics (1)
- Genome: The complete set of genes in an organism.
- Genomics: Identification of genes and sequencing their exact order.
- Transcriptome: The set of RNA transcripts produced by a genome at any one time.
- Transcriptomics: Study of which genes are switched on when.
4. The -omes and -omics (2)
- Proteome: The complete set of proteins produced by a genome at any one time.
- Proteomics: Study of the proteome.
- Metabolome: The complete set of metabolites produced by a genome at any one time.
- Metabolomics: Study of the metabolome.
5. Metabolomics
Comprehensive (?) estimation of the low molecular weight compounds (e.g. sugars) in a biological sample.
- Example applications:
  - Basic biology
  - Food quality (human/animal)
  - Clinical trials
- Examples of samples:
  - Plant material
  - Blood/urine from humans and animals
  - Microbial material (yeast/fungi)
- Samples may be taken from:
  - The organisms themselves, or
  - Their environment (footprinting)
6. Our projects (1): Metabolome technology for the profiling of GM and conventionally bred plant materials
- 3-year project (Dec 2001 to Dec 2004).
- Funded by the UK Food Standards Agency (FSA).
- Collaboration between DCS and IBS at Aberystwyth and the Max Planck Institute of Molecular Plant Physiology in Golm, Germany.
- Aim: To build a technology platform for performing metabolomics to assess the differences between GM and conventionally bred versions of plant material.
- Using Arabidopsis thaliana and potato as model plants.
- Our remit: To build a robust database for metabolome and related data.
7. Our projects (2): Hierarchical plant metabolomics (HiMet)
- 3-year project (Apr 2003 to Apr 2006).
- Funded by the Biotechnology and Biological Sciences Research Council (BBSRC).
- Collaboration between DCS and IBS at Aberystwyth, the John Innes Centre in Norwich, Rothamsted Research in Harpenden and CNAP at the University of York.
- Aim: To understand metabolome patterns through analysis of the data produced by high-throughput chemical analyses of metabolism mutants.
- Using mutant lines of Arabidopsis thaliana.
- Our remit: To build a database to store information about plant metabolomics experiments and their results, and to facilitate statistical and data mining analysis of the datasets.
8. Our projects (3): The UK Centre for Plant and Microbial Metabolomics (MeT-RO)
- 4-year project (Dec 2004 to Dec 2008).
- Funded by the BBSRC.
- Collaboration between Rothamsted Research in Harpenden, DCS and IBS at Aberystwyth, and the Department of Chemistry at the University of Manchester.
- Aims:
  - To establish and operate a high-throughput, state-of-the-art global metabolome fingerprinting (as distinct from metabolic profiling) service, accessible on a user-pays basis to all members of the biological sciences community.
  - To carry out research and training in plant and microbial metabolomics and associated bioinformatics.
- Our remit: To build a robust data handling platform for storing and querying the datasets produced by experiments carried out through the centre.
9. Understanding metabolomics data (1): The different types of omic data
- Reference (standard) data
  - Gene sequences
  - Gene expressions
  - Proteins
  - Metabolites
- Experiment data
  - Data from particular experiments, e.g. an experiment to assess changes in the metabolome of a particular organism during times of drought.
10. Understanding metabolomics data (2): Biological signal versus experimental noise
- Metabolomics experiments aim to measure biological signal: organisms differ.
- But the process of performing an experiment may induce noise: unintended side-effects of the experimental process.
- Study of the metabolome provides us with a means of determining the extent of a biological signal, BUT the dynamic nature of the metabolome means that experimentally induced noise may well obscure the signal.
- Proper understanding of experimental context is, therefore, key to:
  - Correct interpretation of experimental results.
  - Selection of datasets that may be meaningfully compared.
  - Experiment reproducibility.
11. Understanding metabolomics data (3): The metabolomics experiment pipeline
(Diagram of the metabolomics experiment pipeline.)
12. Understanding metabolomics data (4): Data from GC-MS
(Diagram: a GC-MS machine and its output.)
13. Understanding metabolomics data (5): Processing the output from GC-MS
(Diagram: the GC-MS output passes through Noise Removal, Peak Identification, Peak Deconvolution and Peak Quantification to give a quantified peak list; Peak Labelling then assigns each peak a chemical identity.)
Quantified peak list (metabolite, area):
- Peak one: xxx
- Peak two: xxx
- Peak three: xxx
- Peak four: xxx
14. Understanding metabolomics data (6): Metabolome signatures
(Diagrams: FT-IR output and DI-ESI output.)
15. Designing a metabolomics database
- From our understanding of the nature of metabolomics data we can define the following list of requirements for a metabolomics database:
  - It should support data from all parts of the experiment pipeline.
  - It should support the range of experimental procedures used by different laboratories.
  - It should support the range of dataset types produced by different chemical analysis technologies.
  - It should facilitate sophisticated querying for datasets and mining over experimental metadata.
- Therefore, our database design must be both flexible and extensible.
16. A component-based structure
- Our solution was to specify a component-based structure for our data model:
  - A set of components, each of which aims to describe a different part of a plant metabolomics experiment.
  - Each component has a set of core, required data items, relevant to all plant metabolomics experiments.
  - Each component may have one or more sub-components that extend the core data for the component to describe particular methodologies and/or technologies.
- An experiment is then described by a complete set of components, each one of which may contain either:
  - The core data for that component, or
  - The core data for that component plus additional data items specified in one of its sub-components.
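As an illustration, the component/sub-component structure above can be sketched in Python. The class names, field names and example values here are hypothetical, chosen only to show the idea; they are not the actual ArMet definitions.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """Core, required data items relevant to all plant metabolomics experiments."""
    name: str
    core: dict  # core data items, required for every experiment

@dataclass
class SubComponent(Component):
    """Extends a component's core data for a particular methodology/technology."""
    extension: dict = field(default_factory=dict)

# An experiment is a complete set of components; each slot holds either the
# core component or one of its sub-components (illustrative values only).
experiment = [
    Component("Genotype", core={"species": "Arabidopsis thaliana"}),
    SubComponent("MetabolomeEstimate",
                 core={"technique": "GC-MS"},
                 extension={"column_type": "DB5-MS"}),  # technology-specific item
]
```

Because a sub-component is a component plus extra items, code that understands only the core items can still process every experiment description.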
17. ArMet (Architecture for Metabolomics) components
(Diagram of the ArMet components, including Genotype, Metabolome Estimate, Output and DataPoint.)
18. Components and sub-components
19. A general Metabolome Estimate component
(Diagram: a Metabolome Estimate connects a Genotype to one or more outputs. An output may be a Targeted output (per Metabolite), and/or a Metabolome output (per Peak), and/or a Fingerprint output, each made up of DataPoints.)
20. Recent Developments
- Protocol repository
- Protocol log
21. This lecture will include
- A word on the importance of data standards for the -omics and where our work fits in.
- Discussion of some of the issues we have faced whilst implementing our databases.
- A brief look at where this field is going in the future.
22. Data standards
- Typically experimentalists will want to be able to do one or many of the following:
  - Transmit datasets to collaborators or other external sites.
  - Store data with adequate curation.
  - Make datasets available for statistical analysis and data mining.
- Data standards support these activities by defining:
  - The structure of datasets.
  - The types of information that they should contain.
23. Data standards for the -omics
- Transcriptomics
  - MIAME (Nat. Genet., 29, 365-371) defines the data that should be recorded for a transcriptome experiment.
  - MAGE (Spellman et al., Genome Biol., 3, 2002) is an associated formal data description.
- Proteomics
  - PEDRo (Nat. Biotechnol., 21, 247-254) is a formal data description for proteomics.
  - PSI-OM (Proteomics, 4, 490-491) is the result of further development of PEDRo by the HUPO PSI.
  - MIAPE is an associated definition of the data that should be recorded for proteomics experiments.
24. Data standards for metabolomics
- Until recently there was no equivalent metabolomics-specific data standard for defining metabolite complements in their experimental context. Now we have:
  - SMRS (http://www.smrsgroup.org/), a draft policy for the standardisation of reporting for metabolic analysis, mainly for pre-clinical drug trials.
  - MIAMET (Trends in Plant Science, 9(9), 418-425), a definition of the data that should be recorded for a metabolomics experiment.
  - ArMet (Nat. Biotechnol., 22, 1601-1606), a formal data description for plant metabolomics.
25. Benefits of data standards
- Data standards such as these, which describe experiment results in their experimental context, enable:
  - Proper interpretation of results.
  - Laboratory interoperability.
  - Meaningful comparison of datasets.
  - Replication of experiments.
  - More options for retrospective analysis of data.
26. Other benefits of data standards
- Data standards, and the formal data descriptions that underlie them:
  - Encourage consideration and development of best practice and Standard Operating Procedures.
  - Standardise the reporting of experiments and the archiving of data associated with publications.
  - Enable the development of databases and verifiable transmission mechanisms to facilitate the storage, collection and dissemination of logically correct datasets.
27. Implementation Issues
- Speed and optimisation of querying.
- Building acceptable user interfaces for data submission.
- Standardising terminology.
28. Speed of querying
- Metabolomics experiments can produce vast quantities of results data. For example:
  - Each experiment within our HiMet project involves 220 samples.
  - There are currently 11 experiments.
  - Each sample is subjected to multiple types of chemical analysis.
  - Each dataset produced from a single chemical analysis contains 2000 datapoints.
- All of this results data is, of course, in addition to the experiment metadata that provides the experimental context for the results.
- Storing all of these datapoints in a single unordered database table can lead to query times of hours for the data from a single experiment or sample.
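A back-of-envelope calculation from the figures above shows the scale, taking the conservative assumption of just one chemical analysis per sample:

```python
# Back-of-envelope data volume for HiMet, using the figures on this slide.
samples_per_experiment = 220
experiments = 11
datapoints_per_dataset = 2000  # one dataset = one chemical analysis of one sample

# Conservative lower bound: one chemical analysis per sample
# (in reality each sample is subjected to multiple analyses).
total_datapoints = samples_per_experiment * experiments * datapoints_per_dataset
print(total_datapoints)  # 4840000
```

So even under this lower bound the results table holds nearly five million rows, before any experiment metadata is counted.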
29. Optimisation (1): Database tuning
- A key aspect of the development of relational databases is normalisation: rules for grouping the fields that we want to store into tables.
- Denormalisation involves reducing the degree of normalisation of the tables in a relational database to improve performance.
  - It should not be done when a table is regularly updated, as normalisation protects us against insert/update/delete anomalies that can reduce the integrity of a database.
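The trade-off can be sketched with SQLite. The table and column names below are hypothetical, not the real schema: the normalised design keeps sample metadata in its own table and joins at query time; the denormalised one repeats it on every datapoint row, avoiding the join but exposing the update anomaly mentioned above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalised: each datapoint references its sample via a foreign key.
cur.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, organism TEXT)")
cur.execute("CREATE TABLE datapoint (sample_id INTEGER REFERENCES sample(id), value REAL)")
cur.execute("INSERT INTO sample VALUES (1, 'Arabidopsis thaliana')")
cur.execute("INSERT INTO datapoint VALUES (1, 0.42)")

# Retrieving organism + value requires a join.
row = cur.execute("""SELECT s.organism, d.value
                     FROM datapoint d JOIN sample s ON s.id = d.sample_id""").fetchone()

# Denormalised: organism repeated per datapoint row; no join at query time,
# but correcting the organism name would now mean updating every row.
cur.execute("CREATE TABLE datapoint_flat (organism TEXT, value REAL)")
cur.execute("INSERT INTO datapoint_flat VALUES ('Arabidopsis thaliana', 0.42)")
flat = cur.execute("SELECT organism, value FROM datapoint_flat").fetchone()
```

Both queries return the same `('Arabidopsis thaliana', 0.42)` pair; the difference is purely in how the work is distributed between write time and read time.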
30. Optimisation (2): Data Organisation
- Hash files
  - Records in a table are written to disk based on a hash function.
- Indexes
  - A structure that is separate from, but associated with, a table; it contains an entry for each value in the indexed column(s) of the table and provides direct pointers to the rows of the table on disk.
- B-trees
  - Records in a table/index are organised into a balanced-tree structure.
- Clustered tables
  - Groups of one or more tables physically stored together because they share common columns and are often used together.
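Indexes are the easiest of these to demonstrate. In the sketch below (hypothetical table and column names), adding an index on the queried column lets the query planner use a direct search instead of a full table scan; SQLite implements its indexes as B-trees, so this touches two of the structures listed above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE datapoint (sample_id INTEGER, mz REAL, intensity REAL)")
cur.executemany("INSERT INTO datapoint VALUES (?, ?, ?)",
                [(i % 220, 100.0 + i, float(i)) for i in range(10_000)])

# Index on the column used in the WHERE clause.
cur.execute("CREATE INDEX idx_sample ON datapoint(sample_id)")

# The query planner now reports an index search rather than a table scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM datapoint WHERE sample_id = 7").fetchall()
print(plan)  # the plan text mentions idx_sample

n = cur.execute("SELECT COUNT(*) FROM datapoint WHERE sample_id = 7").fetchone()[0]
```

On 10,000 rows the difference is invisible; on the millions of datapoint rows described earlier, it is the difference between seconds and hours.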
31. Optimisation (3): Distribution
- Table partitioning
  - Break tables or index files down into smaller, more manageable chunks that may be distributed across disks.
- Parallelising queries
  - Break the query down into a number of functions that are carried out by different processes.
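A minimal sketch of horizontal partitioning, in plain Python rather than a real DBMS: datapoints are routed to chunks by a hash-style function of the experiment id, so a per-experiment query scans one chunk only, and the chunks could sit on separate disks or be scanned by separate processes. All names and values are illustrative.

```python
N_PARTITIONS = 4

def partition_of(experiment_id: int) -> int:
    """Hash-style partitioning function: experiment id -> partition number."""
    return experiment_id % N_PARTITIONS

# Route each (experiment_id, value) datapoint to its partition.
partitions = {p: [] for p in range(N_PARTITIONS)}
for exp_id, value in [(1, 0.1), (2, 0.2), (5, 0.5), (6, 0.6)]:
    partitions[partition_of(exp_id)].append((exp_id, value))

# A query for experiment 5 scans only its own partition, not all of the data.
hits = [v for e, v in partitions[partition_of(5)] if e == 5]
print(hits)  # [0.5]
```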
32. User Interface Issues
- Experimentalists gather the data about their experiments in a number of formats:
  - Lab books
  - Protocol documentation
  - Spreadsheets for material tracking
  - Instrument-specific data files for chemical analysis output
- Initially we produced web-based, form-filling interfaces for metadata submission, linked to upload facilities for directly uploading dataset files, but:
  - The detail required to capture the sources of experimentally induced noise makes this type of form-filling interface overwhelming.
33. Standardising Terminology
- The use of common (standard) terminology to describe the concepts involved in an experiment is crucial to producing standard datasets that can be compared and interpreted correctly.
- Many biological terms have other meanings when used in other situations, e.g. an everyday meaning or a chemical meaning.
- In our solutions so far we have employed the notion of extensible controlled vocabularies:
  - Certain key fields within the database must take their values from a controlled list.
  - The controlled list entries include information on the authority for that term, i.e. the standard definition or meaning.
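The idea of an extensible controlled vocabulary can be sketched as follows. The field names, terms and the "Plant Ontology" authority are illustrative examples, not the actual controlled lists used in the database.

```python
# Each allowed term carries the authority that gives its standard definition.
controlled_vocabulary = {
    "organ": {
        "leaf": "Plant Ontology",
        "root": "Plant Ontology",
    },
}

def validate(field: str, term: str) -> str:
    """Accept a term only if it is in the controlled list; return its authority."""
    try:
        return controlled_vocabulary[field][term]
    except KeyError:
        raise ValueError(f"{term!r} is not a controlled term for {field!r}")

def extend(field: str, term: str, authority: str) -> None:
    """Extensible: new terms may be registered, provided an authority is named."""
    controlled_vocabulary.setdefault(field, {})[term] = authority

extend("organ", "tuber", "Plant Ontology")
print(validate("organ", "tuber"))  # Plant Ontology
```

Validation at submission time is what makes the resulting datasets comparable across laboratories; extension is what keeps the scheme from blocking legitimate new terms.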
34. The Future
- Omic ontologies
- Pan-omic data models
35. Ontologies and controlled vocabularies
- An ontology is a hierarchical formal specification of the concepts within a domain. It specifies the vocabulary that may be used to refer to those concepts and how the concepts may be related.
- Controlled vocabularies are required by MIAME/MAGE, PEDRo/PSI-OM and ArMet.
- The development of ontologies will help to formalise these controlled vocabularies, which are currently modelled as values with authorities.
- Both MGED in the microarray community and the PSI in the proteomics community have recognised this and are working on ontologies to support the description of transcriptomics and proteomics experiments.
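What an ontology adds over a flat controlled vocabulary is structure between terms. A minimal sketch, with an illustrative (hypothetical) is-a hierarchy of analysis techniques:

```python
# Tiny is-a hierarchy: child concept -> parent concept (illustrative only).
is_a = {
    "GC-MS": "mass spectrometry",
    "DI-ESI-MS": "mass spectrometry",
    "mass spectrometry": "chemical analysis",
    "FT-IR": "chemical analysis",
}

def ancestors(concept: str) -> list:
    """Walk up the is-a hierarchy from a concept to the root."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain

print(ancestors("GC-MS"))  # ['mass spectrometry', 'chemical analysis']
```

With such relations in place, a query for "mass spectrometry" datasets can automatically match GC-MS and DI-ESI-MS entries, which a flat list of values with authorities cannot do.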
36. Ontology for experiments (EXPO)
- Work carried out by colleagues at Aberystwyth (Larissa Soldatova and Prof. Ross King) has resulted in an ontology for experiments (EXPO).
- They are now planning to validate the ontology by showing that it may be instantiated for metabolomics experiments.
- Benefits to ArMet:
  - Formalisation of the required controlled vocabularies.
  - An alternative modelling approach that is helping to inform the development of the ArMet data model.
- As EXPO is at a higher conceptual level than any of the omics (and could potentially be instantiated for transcriptomics and proteomics experiments as well), we are also hoping to evaluate its usefulness in developing common models for sample description.
37. A Single Data Model for Sample Description
- Metabolomics analyses will be performed in conjunction with transcriptome and proteome analyses in integrative experiments.
- We need a common data standard for experiment descriptions.
- It must contain the metadata required to fully evaluate the different omic analyses performed on samples from a single experiment.
- Others working in this area have produced:
  - SysBio-OM (Bioinformatics, 20(12), 2004-2015).
  - FGE-OM (Bioinformatics, 20(10), 1583-1590).