1
An Introduction to Databases and Data Standards
for Metabolomics
  • Dr Helen Jenkins
  • Computational Biology Research Group
  • haf@aber.ac.uk

2
This lecture will cover
  • A brief introduction to the -omes and the
    -omics.
  • Description of the projects that we are involved
    in and our remit within them.
  • Discussion of the requirements for metabolomics
    experiment databases.
  • A brief description of our solution.

3
The -omes and -omics (1)
  • Genome: The complete set of genes in an organism.
  • Genomics: Identification of genes and sequencing
    of their exact order.
  • Transcriptome: The set of RNA transcripts
    produced by a genome at any one time.
  • Transcriptomics: Study of which genes are
    switched on when.

4
The -omes and -omics (2)
  • Proteome: The complete set of proteins produced
    by a genome at any one time.
  • Proteomics: Study of the proteome.
  • Metabolome: The complete set of metabolites
    produced by a genome at any one time.
  • Metabolomics: Study of the metabolome.

5
Metabolomics
Comprehensive (?) estimation of the low molecular
weight compounds (e.g. sugars) in a biological
sample.
  • Example applications:
    • Basic biology
    • Food quality (human/animal)
    • Clinical trials
  • Examples of samples:
    • Plant material
    • Blood/urine from humans and animals
    • Microbial material (yeast/fungi)
  • Samples may be taken from:
    • The organisms themselves, or
    • Their environment (footprinting)

6
Our projects (1) Metabolome technology for the
profiling of GM and conventionally bred plant
materials
  • 3-year project (Dec 2001 to Dec 2004).
  • Funded by the UK Food Standards Agency (FSA).
  • Collaboration between DCS and IBS at Aberystwyth
    and the Max Planck Institute of Molecular Plant
    Physiology in Golm, Germany.
  • Aim: To build a technology platform for
    performing metabolomics to assess the differences
    between GM and conventionally bred versions of
    plant material.
  • Using Arabidopsis thaliana and potato as model
    plants.
  • Our remit: To build a robust database for
    metabolome and related data.

7
Our projects (2) Hierarchical plant metabolomics
(HiMet)
  • 3-year project (Apr 2003 to Apr 2006).
  • Funded by the Biotechnology and Biological
    Sciences Research Council (BBSRC).
  • Collaboration between DCS and IBS at Aberystwyth,
    the John Innes Centre in Norwich, Rothamsted
    Research in Harpenden and CNAP at York University.
  • Aim: To understand metabolome patterns through
    analysis of the data produced by high-throughput
    chemical analyses of metabolism mutants.
  • Using mutant lines of Arabidopsis thaliana.
  • Our remit: To build a database to store
    information about plant metabolomics experiments
    and their results and facilitate statistical and
    data mining analysis of the datasets.

8
Our projects (3) The UK Centre for Plant and
Microbial Metabolomics (MeT-RO)
  • 4-year project (Dec 2004 to Dec 2008).
  • Funded by the BBSRC.
  • Collaboration between Rothamsted Research in
    Harpenden, DCS and IBS at Aberystwyth and the
    Department of Chemistry at the University of
    Manchester.
  • Aims:
    • To establish and operate a high-throughput,
      state-of-the-art global metabolome
      fingerprinting (as distinct from metabolic
      profiling) service, accessible on a user-pays
      basis to all members of the biological sciences
      community.
    • To carry out research and training in plant and
      microbial metabolomics and associated
      bioinformatics.
  • Our remit: To build a robust data handling
    platform for storing and querying the datasets
    produced by experiments carried out through the
    centre.

9
Understanding metabolomics data (1) The
different types of omic data
  • Reference (standard) data:
    • Gene sequences
    • Gene expressions
    • Proteins
    • Metabolites
  • Experiment data:
    • Data from particular experiments, e.g. an
      experiment to assess the changes in the
      metabolome of a particular organism during times
      of drought.

10
Understanding metabolomics data (2) Biological
signal versus experimental noise
  • Metabolomics experiments aim to measure
    biological signal:
    • Organisms differ.
  • But the process of performing an experiment may
    induce noise:
    • Unintended side-effects of the experimental
      process.
  • Study of the metabolome provides us with a means
    for determining the extent of a biological signal,
    BUT the dynamic nature of the metabolome means
    that experimentally induced noise may well
    obscure the signal.
  • Proper understanding of experimental context is,
    therefore, key to:
    • Correct interpretation of experimental results.
    • Selection of datasets that may be meaningfully
      compared.
    • Experiment reproducibility.

11
Understanding metabolomics data (3) The
metabolomics experiment pipeline
(Diagram: the metabolomics experiment pipeline.)
12
Understanding metabolomics data (4) Data from
GC-MS
(Diagram: a GC-MS machine and its raw output.)
13
Understanding metabolomics data (5) Processing
the output from GC-MS
(Diagram: noise removal, peak identification, peak
deconvolution and peak quantification convert the
raw GC-MS output into a quantified peak list of
peaks and their areas; peak labelling then assigns
each peak a chemical identity.)
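A minimal sketch of this processing chain, in
Python, purely for illustration: the function
names, the noise threshold and the toy
retention-time library are hypothetical,
deconvolution is omitted for brevity, and none of
this is the project's own code.

```python
# Purely illustrative: hypothetical function names modelling the chain
# above (deconvolution omitted for brevity), not the project's own code.

def remove_noise(trace):
    """Drop readings at or below a (hypothetical) noise threshold."""
    return [(t, i) for t, i in trace if i > 100.0]

def identify_peaks(trace):
    """Naive local-maximum peak picking on (time, intensity) pairs."""
    return [cur for prev, cur, nxt in zip(trace, trace[1:], trace[2:])
            if prev[1] < cur[1] > nxt[1]]

def quantify(peaks):
    """One quantified entry per peak: retention time and area
    (approximated here by apex intensity)."""
    return [{"rt": t, "area": i, "identity": None} for t, i in peaks]

def label(peak_list, library):
    """Assign a chemical identity by retention-time lookup."""
    for p in peak_list:
        p["identity"] = library.get(round(p["rt"]), "unknown")
    return peak_list

trace = [(t, 600 - abs(10 - t) * 50) for t in range(21)]  # toy chromatogram
print(label(quantify(identify_peaks(remove_noise(trace))),
            {10: "glucose"}))
# [{'rt': 10, 'area': 600, 'identity': 'glucose'}]
```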
14
Understanding metabolomics data (6) Metabolome
signatures
(Diagram: example metabolome signatures, FT-IR
output and DI-ESI output.)
15
Designing a metabolomics database
  • From our understanding of the nature of
    metabolomics data we can define the following
    list of requirements for a metabolomics database:
    • It should support data from all parts of the
      experiment pipeline.
    • It should support a range of experimental
      procedures as used by different laboratories.
    • It should support the range of dataset types as
      produced by different chemical analysis
      technologies.
    • It should facilitate sophisticated querying for
      datasets and mining over experimental metadata.
  • Therefore, our database design must be both
    flexible and extensible.

16
A component-based structure
  • Our solution was to specify a component-based
    structure for our data model (see the sketch
    after this list):
    • A set of components, each of which aims to
      describe a different part of a plant
      metabolomics experiment.
    • Each component has a set of core, required data
      items, relevant to all plant metabolomics
      experiments.
    • Each component may have one or more
      sub-components that extend the core data for the
      component to describe particular methodologies
      and/or technologies.
  • An experiment is then described by a complete set
    of components, each of which may contain either:
    • The core data for that component, or
    • The core data for that component plus additional
      data items specified in one of its
      sub-components.
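A minimal sqlite3 sketch of the core-plus-
sub-component idea. The table and column names
(sample_core, sample_gcms, derivatisation and so
on) are hypothetical illustrations, not the actual
ArMet schema.

```python
import sqlite3

# Hypothetical table/column names: a minimal sketch of the
# core-plus-sub-component idea, not the actual ArMet schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    -- Core data items required for every plant metabolomics experiment.
    CREATE TABLE sample_core (
        sample_id INTEGER PRIMARY KEY,
        genotype  TEXT NOT NULL,
        harvested TEXT NOT NULL
    );
    -- A sub-component extends the core with method-specific items.
    CREATE TABLE sample_gcms (
        sample_id      INTEGER PRIMARY KEY REFERENCES sample_core,
        derivatisation TEXT NOT NULL,
        column_type    TEXT NOT NULL
    );
""")
con.execute("INSERT INTO sample_core VALUES (1, 'Col-0', '2004-06-01')")
con.execute("INSERT INTO sample_gcms VALUES (1, 'MSTFA', 'DB-5')")

# A core-only query works for every experiment; the join adds the
# GC-MS-specific extension where one exists.
for row in con.execute("""
        SELECT c.genotype, g.column_type
        FROM sample_core c LEFT JOIN sample_gcms g USING (sample_id)"""):
    print(row)  # ('Col-0', 'DB-5')
```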

17
ArMet (Architecture for Metabolomics) components
(Diagram: the ArMet components, including Genotype,
Metabolome Estimate, Output and DataPoint.)
18
Components and sub-components
19
A general Metabolome Estimate component
(Diagram: a general Metabolome Estimate component
links Genotype to Output. The output may be
targeted output (Metabolite), metabolome output
(Peak) and/or fingerprint output (DataPoint).)
20
Recent Developments
  • Protocol repository
  • Protocol log
21
This lecture will include
  • A word on the importance of data standards for
    the omics and where our work fits in.
  • Discussion of some of the issues we have faced
    whilst implementing our databases.
  • A brief look at where this field is going in the
    future.

22
Data standards
  • Typically, experimentalists will want to be able
    to do one or more of the following:
    • Transmit datasets to collaborators or other
      external sites.
    • Store data with adequate curation.
    • Make datasets available for statistical analysis
      and data mining.
  • Data standards support these activities by
    defining:
    • The structure of datasets.
    • The types of information that they should
      contain.

23
Data standards for the -omics
  • Transcriptomics:
    • MIAME (Nat. Genet., 29:365-371) defines the
      data that should be recorded for a transcriptome
      experiment.
    • MAGE (Spellman et al., Genome Biol., 3, 2002)
      is an associated formal data description.
  • Proteomics:
    • PEDRo (Nat. Biotechnol., 21:247-254) is a formal
      data description for proteomics.
    • PSI-OM (Proteomics, 4:490-491) is the result of
      further development of PEDRo by HUPO PSI.
    • MIAPE is an associated definition of the data
      that should be recorded for proteomics
      experiments.

24
Data standards for metabolomics
  • Until recently there was no equivalent
    metabolomics-specific data standard for defining
    metabolite complements in their experimental
    context. Now we have:
    • SMRS (http://www.smrsgroup.org/) is a draft
      policy for standardisation of reporting for
      metabolic analysis, mainly for pre-clinical drug
      trials.
    • MIAMET (Trends in Plant Science, 9(9):418-425)
      is a definition of the data that should be
      recorded for a metabolomics experiment.
    • ArMet (Nat. Biotechnol., 22:1601-1606) is a
      formal data description for plant metabolomics.

25
Benefits of data standards
  • Data standards such as these, which describe
    experiment results in their experimental context,
    enable:
    • Proper interpretation of results.
    • Laboratory interoperability.
    • Meaningful comparison of datasets.
    • Replication of experiments.
    • More options for retrospective analysis of data.

26
Other benefits of data standards
  • Data standards, and the formal data descriptions
    that underlie them:
    • Encourage consideration and development of best
      practice and Standard Operating Procedures.
    • Standardise the reporting of experiments and
      archiving of data associated with publications.
    • Enable the development of databases and
      verifiable transmission mechanisms to facilitate
      the storage, collection and dissemination of
      logically correct datasets.

27
Implementation Issues
  • Speed and optimisation of querying
  • Building acceptable user interfaces for data
    submission
  • Standardising terminology

28
Speed of querying
  • Metabolomics experiments can produce vast
    quantities of results data.
  • For example:
    • Each experiment within our HiMet project
      involves 220 samples.
    • There are currently 11 experiments.
    • Each sample is subjected to multiple types of
      chemical analysis.
    • Each dataset produced from a single chemical
      analysis contains 2000 datapoints.
  • All of this results data is, of course, in
    addition to the experiment metadata that provides
    the experimental context for the results.
  • Storing all of these points in a single unordered
    database table can lead to query times of hours
    for the data from a single experiment or sample.
    (See the back-of-envelope row count below.)
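A back-of-envelope calculation of the resulting row
count; since the slide says only "multiple types"
of chemical analysis, the figure of three analysis
types per sample is an assumption.

```python
# Back-of-envelope row count for a single datapoint table. The number
# of analysis types per sample is not given above; three is an
# illustrative assumption.
experiments = 11
samples_per_experiment = 220
analyses_per_sample = 3      # assumption
datapoints_per_analysis = 2000

rows = (experiments * samples_per_experiment
        * analyses_per_sample * datapoints_per_analysis)
print(f"{rows:,} datapoint rows")  # 14,520,000
```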

29
Optimisation (1) Database tuning
  • A key aspect of the development of relational
    databases is normalisation:
    • Rules for grouping the fields that we want to
      store into tables.
  • Denormalisation involves reducing the degree of
    normalisation of the tables in a relational
    database to improve performance:
    • It should not be done when a table is regularly
      updated, as normalisation protects us against
      insert/update/delete anomalies that can reduce
      the integrity of a database. (A minimal sketch
      of the trade-off follows below.)
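A minimal sqlite3 sketch of the trade-off, with
hypothetical table names: the normalised form keeps
sample metadata in one place, while the
denormalised form copies it into every datapoint
row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Normalised: sample metadata held once, referenced by every datapoint.
con.executescript("""
    CREATE TABLE sample    (sample_id INTEGER PRIMARY KEY, genotype TEXT);
    CREATE TABLE datapoint (sample_id INTEGER REFERENCES sample,
                            mz REAL, intensity REAL);
""")
# Denormalised: genotype copied into every datapoint row. Reads need no
# join, but correcting a genotype now means updating millions of rows,
# hence the warning against denormalising regularly updated tables.
con.execute("""CREATE TABLE datapoint_wide
               (sample_id INTEGER, genotype TEXT, mz REAL, intensity REAL)""")
```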

30
Optimisation (2) Data Organisation
  • Hash files:
    • Records in a table are written to disk based on
      a hash function.
  • Indexes:
    • A structure that is separate from, but
      associated with, a table. It contains an entry
      for each value in the indexed column(s) of the
      table and provides direct pointers to the
      matching rows on disk. (See the sketch after
      this list.)
  • B-trees:
    • Records in a table/index are organised into a
      balanced-tree structure.
  • Clustered tables:
    • Groups of one or more tables physically stored
      together because they share common columns and
      are often used together.
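As a sketch of the effect of an index, the sqlite3
example below (hypothetical table and index names)
shows the query plan switching from a full table
scan to an index search once the index exists;
SQLite stores both tables and indexes as B-trees.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE datapoint "
            "(sample_id INTEGER, mz REAL, intensity REAL)")
con.executemany("INSERT INTO datapoint VALUES (?, ?, ?)",
                ((i % 220, float(i % 2000), 1.0) for i in range(20000)))

query = "SELECT count(*) FROM datapoint WHERE sample_id = 42"
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # full scan

# SQLite keeps indexes in B-trees; with the index in place the planner
# can jump to the matching rows instead of scanning the whole table.
con.execute("CREATE INDEX idx_datapoint_sample ON datapoint (sample_id)")
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # index search
```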

31
Optimisation (3) Distribution
  • Table partitioning:
    • Break tables or index files down into smaller,
      more manageable chunks that may be distributed
      across disks. (A hand-rolled sketch follows
      after this list.)
  • Parallelising queries:
    • Break the query down into a number of functions
      that are carried out by different processes.
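A hand-rolled sketch of table partitioning in
sqlite3, with hypothetical names: rows are routed
into one smaller table per experiment. Many real
DBMSs offer declarative partitioning that does this
routing transparently; the manual version below
only illustrates the idea.

```python
import sqlite3

con = sqlite3.connect(":memory:")

def partition_for(experiment_id: int) -> str:
    """Route each experiment's rows to its own, smaller table. This is
    a hand-rolled sketch; many DBMSs offer declarative partitioning
    that does the routing transparently."""
    name = f"datapoint_exp{experiment_id}"
    con.execute(f"CREATE TABLE IF NOT EXISTS {name} "
                "(sample_id INTEGER, mz REAL, intensity REAL)")
    return name

con.execute(f"INSERT INTO {partition_for(7)} VALUES (1, 100.0, 0.5)")
# A query about experiment 7 now scans one small partition, not the lot.
print(con.execute(f"SELECT count(*) FROM {partition_for(7)}").fetchall())
```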

32
User Interface Issues
  • Experimentalists gather the data about their
    experiments in a number of formats:
    • Lab books
    • Protocol documentation
    • Spreadsheets for material tracking
    • Instrument-specific data files for chemical
      analysis output
  • Initially we produced web-based, form-filling
    interfaces for metadata submission, linked to
    upload facilities for directly uploading dataset
    files, but:
    • The detail required to capture the sources of
      experimentally induced noise makes this type of
      form-filling interface overwhelming.

33
Standardising Terminology
  • Crucial to the production of standard datasets
    that can be compared and interpreted correctly is
    the use of common (standard) terminology to
    describe the concepts involved in an experiment.
  • Many biological terms have other meanings when
    used in other situations, e.g. an everyday
    meaning or a chemical meaning.
  • In our solutions so far we have employed the
    notion of extensible controlled vocabularies:
    • Certain key fields within the database must take
      their values from a controlled list.
    • The controlled list entries include information
      on the authority for that term, i.e. the
      standard definition or meaning. (See the sketch
      after this list.)
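A minimal sqlite3 sketch of an extensible
controlled vocabulary, with hypothetical table
names and an illustrative authority value: key
fields take their values from the controlled list
via a foreign key, so uncontrolled terms are
rejected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enforce the controlled list
con.executescript("""
    -- Extensible controlled vocabulary: new terms may be added, but
    -- each must carry the authority giving its standard meaning.
    CREATE TABLE cv_organ (
        term      TEXT PRIMARY KEY,
        authority TEXT NOT NULL
    );
    -- A key field must take its value from the controlled list.
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,
        organ     TEXT NOT NULL REFERENCES cv_organ(term)
    );
""")
con.execute("INSERT INTO cv_organ VALUES ('leaf', 'Plant Ontology')")
con.execute("INSERT INTO sample VALUES (1, 'leaf')")           # accepted
try:
    con.execute("INSERT INTO sample VALUES (2, 'green bit')")  # rejected
except sqlite3.IntegrityError as err:
    print("uncontrolled term rejected:", err)
```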

34
The Future
  • Omic ontologies
  • Pan-omic data models

35
Ontologies and controlled vocabularies
  • An ontology is a hierarchical formal
    specification of the concepts within a domain. It
    specifies the vocabulary that may be used to
    refer to those concepts and how the concepts may
    be related. (A toy illustration follows after
    this list.)
  • Controlled vocabularies are required by
    MIAME/MAGE, PEDRo/PSI-OM and ArMet.
  • The development of ontologies will help to
    formalise these controlled vocabularies, which
    are currently modelled as values with
    authorities.
  • Both MGED in the microarray community and the PSI
    in the proteomics community have recognised this
    and are working on ontologies to support the
    description of transcriptomics and proteomics
    experiments.
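A toy illustration of the idea, assuming
hypothetical terms: an ontology names concepts and
specifies how they relate (here only via is-a
links).

```python
# A toy illustration only: an ontology names concepts and relates them
# (here just "is-a" links); all terms below are hypothetical examples.
is_a = {
    "GC-MS analysis": "chemical analysis",
    "FT-IR analysis": "chemical analysis",
    "chemical analysis": "experimental procedure",
}

def ancestors(term):
    """Walk up the is-a hierarchy from a concept towards the root."""
    while term in is_a:
        term = is_a[term]
        yield term

print(list(ancestors("GC-MS analysis")))
# ['chemical analysis', 'experimental procedure']
```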

36
Ontology for experiments (EXPO)
  • Work carried out by colleagues at Aberystwyth
    (Larissa Soldatova and Prof. Ross King) has
    resulted in an ontology for experiments (EXPO).
  • They are now planning to validate the ontology by
    showing that it may be instantiated for
    metabolomics experiments.
  • Benefits to ArMet:
    • Formalisation of the required controlled
      vocabularies.
    • An alternative modelling approach that is
      helping to inform the development of the ArMet
      data model.
  • As EXPO is at a higher conceptual level than any
    of the omics (and could potentially be
    instantiated for transcriptomics and proteomics
    experiments as well), we are also hoping to
    evaluate its usefulness in the development of
    common models for sample description.

37
A Single Data Model for Sample Description
  • Metabolomics analyses will be performed in
    conjunction with those on the transcriptome and
    proteome in integrative experiments.
  • We need a common data standard for experiment
    descriptions.
  • It must contain the metadata required to fully
    evaluate the different omic analyses performed
    on samples from a single experiment.
  • Others working in this area have produced:
    • SysBio-OM (Bioinformatics, 20(12):2004-2015).
    • FGE-OM (Bioinformatics, 20(10):1583-1590).