Databases for Microarrays - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

Databases for Microarrays

Description:

Databases for Microarrays Vidhya Jagannathan SIB, Lausanne Overview Microarray data in a nutshell Why databases? What data to represent? What is a database? – PowerPoint PPT presentation

Number of Views:110
Avg rating:3.0/5.0
Slides: 50
Provided by: chEmbnetO7
Learn more at: http://www.ch.embnet.org
Category:

less

Transcript and Presenter's Notes

Title: Databases for Microarrays


1
Databases for Microarrays
  • Vidhya Jagannathan
  • SIB, Lausanne

2
Overview
  • Microarray data in a nutshell
  • Why databases?
  • What data to represent?
  • What is a database?
  • Different data models
  • E-R modelling
  • Microarray Databases
  • Standards being developed

3
Microarray Experiment
4
Microarray Data in a Nutshell
  • Lots of data to be managed before and after the
    experiment.
  • Data to be stored before the experiment .
  • Description of the array and the sample.
  • Direct access to all the cDNA and gene sequences,
    annotations, and physical DNA resources.
  • Data to be stored after the experiment
  • Raw Data - scanned images.
  • Gene Expression Matrix - Relative expression
    levels observed on various sites on the array.
  • Hence we can see that database software capable
    of dealing with larger volumes of numeric and
    image data is required.

5
Why Databases?
  • Tailored to datatype
  • Tailored to the Scientists
  • Intuitive ways to query the data
  • Diagrams, forms, point and click, text etc.
  • Support for efficient answering of queries.
  • Query optimisation, indexes, compact physical
    storage.

6
Data Representation
  • Goal Represent data in an intuitive and
    convenient manner
  • Without unnecessary replication of information
  • Making it easy to write queries to find required
    information
  • Supporting efficient retrieval of required
    information

7
What is a Database?
  • A database is an organised collection of pieces
    of structured electronic information.
  • Example 1 Libraires use a database system to
    keep track of library inventory and loans.
  • Example 2 All airlines use database system to
    manage their flights and reservations.
  • The collection of records kept for a common
    purpose such as these is known as a database.
  • The records of the database normally reside on a
    hard disk and the records are retrieved into
    computer memory only when they are accessed.
  • So the reasons are obvious why we need to discuss
    about a Microarray database.

8
Data Models
  • Describes a container for data and methods to
    store and retrieve data from that container.
  • Abstract math algorithms and concepts.
  • Cannot touch a data model.
  • Very useful

9
Types of Data Models
  • Ad-hoc file formats (not really data models!)
  • Relational data model
  • Object-relational data model
  • Object-oriented data model
  • XML (Extensible Markup Language)

10
Ad-hoc File Formats
  • The various 'ad-hoc' file formats in use for
    microarray data are
  • Flat file formats.
  • Spread sheet formats.
  • Not the least - Even MS-Word documents !!!
  • Very rudimentary method to store data .
  • Sometimes contains redundant information.
  • Extremely inefficient for retrieval of particular
    subsets of the results.

11
Relational Data Model
  • Most prevalent and used in many databases
    developed today.
  • The collection of related information is
    represented as a set of tables.
  • Data value is stored in the intersection of row
    and column
  • Column values are of the same kind. A Simple
    data validation.
  • Rows are unique. So no data redundancy and every
    row is meaningful and can be identified by the
    unique key.
  • Utilises Structured Query Language (SQL) for data
    storage, retrival and manipulation.

12
Example
Table
Row or Record
Field or Column
13
Example
14
Advantages of Relational Model
  • Allows information to be broken up into logical
    units and stored in tables.
  • Allows combining data from different tables in
    different ways to derive useful information.
  • Great for queries involving information from
    multiple original sources.
  • Can easily gather related information.
  • e.g. information about a particular gene from
    multiple datasets/experiments

15
Object Oriented Model
  • Object Oriented Model allows real world data to
    be represented as objects.
  • Objects encapsulate the data and provide methods
    to access or manipulate it.
  • Objects with specific structure and set of
    methods are said to belong to the object class.
  • Allows new classes to be created by extending
    the description of the parent class.
  • Child classes inherit the data and methods of the
    parent class.

16
Example
OODBMS
17
Example - ArrayExpress
18
Database Design Entity-Relationship Concept
Relationship
Entity A
Entity B
Examples
19
Entities
  • are real world objects
  • ex gene
  • contain attributes
  • ex gene_id, sequence
  • are drawn as rectangle boxes that holds the name
    of the entity and attribute in two different
    notations as there is no standard!

Gene
notation 2
notation 1
20
Relationship
  • Relationships provide connections between two or
    more entities
  • ex Which genes were used in which experiment
  • When two entities are involved in a relationship,
    it is known as binary relationship.
  • When three entities are invoved in a
    relationship, it is called as ternary
    relationship.
  • When more than three entities are involved in a
    relationship, it is usually broken in to one or
    more binary or ternary relationships.
  • are drawn as a line linking the involved entities
    as

used_in
Gene
Experiment
21
Example E-R Diagram
Expt-Exptr
Expt-Sample
Multivalued attribute
Notation
Expt-Array
Many-to-one
22
Object relational data model
  • Improved relational model by adding some features
    from object data models.
  • Information is represented as in relational
    models but column values not restricted to one
    mutliple values are allowed.
  • Example (sample table in previous slides)

sampe-id organism celltype drug_id
d1 d2
s001 ecoli c123 ac1 nm ac3 nm
s002 ecoli c123 ac1 nm ac3 nm
23
Queries, queries, queries!!
  • Given a collection of microarray generated gene
    expression data, what kind of questions the users
    wish to pose.
  • Constructing an extensive list of possible
    interesting queries and data mining problems that
    has to be supported by the database will
    facilitate the design process.

24
Queries, queries, queries!!
  • Query to the data
  • Which genes are linked ?
  • Which genes are expressed similarly to my gene
    XYZ?
  • Which genes have a changed the expression in a
    second condition ?
  • Which genes are co-expressed in differing
    conditions ?
  • classification (of tumors, diseased tissues
    etc.) which patterns are characteristic for a
    certain class of samples, which genes are
    involved?

25
More Queries !!!
  • Queries that add a link in additional knowledge
  • functional classification of genes Are changes
    clustered in particular classes?
  • metabolic pathway information Is a certain
    pathway/route in a pathway affected?
  • disease information clinical follow up
    correlation to expression patterns.
  • phenotype information for mutants Are there
    correlations between particular phenotypes and
    expression patterns?

26
More Queries !!!
  • in what region is the interesting gene located in
    the genome?
  • is there synteny in this region with other
    species?
  • is there a known trait that maps to this region?

27
Query Language
  • Language in which user requests information from
    the database.
  • SQL
  • Data definition helps you implement your model
    and data manipulation helps you modify and
    retrive data
  • Advantages
  • Can specify query declaratively and let database
    system figure out best way of finding answers
  • Supports queries of medium complexity
  • Specialized languages
  • SQL language statements are not abstract but very
    close to spoken language.

28
Basic SQL Queries
  • Find the image for experiment number 1345
  • select image from experiment where
    experiment-id 1345
  • Find the experiment-id and image of all
    experiments involving e-coli
  • select experiment-id, image from experiment,
    sample where experiment.sample-id
    sample.sample-id and sample.organism e.coli
  • All combinations of rows from the relations in
    the from clause are considered, and those that
    satisfy the where conditions are output

29
Interfacing
  • SQL queries are carried out on terminal screen
    which is not very useful and user friendly for an
    end user, so applications are created to
    interface more friendly staments with the SQL
    statements
  • A web form is a typical example of interface for
    SQL
  • Applications for data loading.
  • More complex queries (e.g. data mining such as
    classification and clustering) are very
    imporatant part of the Microarray Analyis
    Protocol
  • It is very important to interface the various
    applications we use to analyse the retrieved data
    with database.

30
(No Transcript)
31
Gene Expression Databases Require Integration
  • There are many different types of data presenting
    numerous relationships.
  • There are a number of Databases with lots of
    information.
  • Experiments need to be compared because the
    experiments are very difficult to perform and
    very expensive.
  • Solution Make all the databases talk the same
    language.
  • XML was the choice of data interchange format.

32
Why XML?
  • Why XML ? XML provides the method for defining
    the meaning or semantics of data.
  • Example A XML file of the earlier table we
    defined

ltgene_featuresgt ltgene_idgtGBVN32lt/gene_idgt
ltcontig_idgtNT_010651lt/contig_idgt
ltcontig_startgt2354807lt/contig_startgt
ltcontig_endgt2360778lt/contig_endgt
ltcontig_strandgtComplementlt/contig_strandgt lt/gene_f
eaturesgt
33
Mapping XML to Relational Database
  • The Data Structure in XML is defined in Document
    Type Descrciptor as follows
  • lt!ELEMENT gene_id (PCDATA)gt
  • lt!ELEMENT contig_id (PCDATA)gt
  • lt!ELEMENT contig_start (PCDATA)gt
  • lt!ELEMENT contig_end (PCDATA)gt
  • lt!ELEMENT contig_sequence (PCDATA)gt
  • This kind of DTD also helps us to have control
    over the vocabulary used.
  • SQL
  • create table gene (
  • gene_id varchar(5) primary key,
  • contig_id varchar(10) not null,
  • contig_start integer not null,
  • contig_end integer not null,
  • contig_sequence text not null)
  • So the DTD can be directly mapped into a
    relational database.

34
MAGE-ML As Data Interchage Format
Expression Data
Converter (program)
MAGE-ML
Databases
35
Existing Microarray Databases
  • Several gene expression databases existBoth
    commercial and non-commercial.
  • Most focus on either a particular technolgy or a
    particular organism or both.
  • Commercial databases
  • Rosetta Inpharmatics and Genelogic, the specifics
    of their internal structure is not available for
    internal scrutiny due to their proprietary
    nature.
  • Some non-commercial efforts to design more
    general databases merit particular mention.
  • We will discuss few of the most promising ones
  • ArrayExpress - EBI
  • The Gene expression Omnibus (GEO) - NLM
  • The Standford microarray Database
  • ExpressDB - Harvard
  • Genex - NCGR

36
(No Transcript)
37
ArrayExpress
  • Public repository of microarray based gene
    expression data.
  • Implemented in Oracle at EBI.
  • Contains
  • several curated gene expression datasets
  • possible introduction of an image server to
    archive raw image data associated with the
    experiments.
  • Accepts submissions in MAGE-ML format via a
    web-based data annotation/submission tool called
    MIAMExpress.
  • A demo version of MIAMExpress is available at
    http//industry.ebi.ac.uk/parkinso/subtool/subtyp
    e.html
  • Provides a simple web-based query interface and
    is directly linked to the Expression Profiler
    data analysis tool which allows expression data
    clustering and other types of data exploration
    directly through the web.

38
Gene Express Omnibus
  • The Gene Expression Omnibus ia a gene expression
    database hosted at the National library of
    Medicine
  • It supports four basic data elements
  • Platform ( the physical reagents used to generate
    the data)
  • Sample (information about the mRNA being used)
  • Submitter ( the person and organisation
    submitting the data)
  • Series ( the relationship among the samples).
  • It allows download of entire datasets, it has not
    ability to query the relationships
  • Data are entered as tab delimited ASCII
    records,with a number of columns that depend on
    the kind of array selected.
  • Supports Serial Analysis of Gene Expression
    (SAGE) data.

39
Stanford Microarray Database
  • Contains the largest amount of data.
  • Uses relational database to answer queries.
  • Associated with numerious clustering and analysis
    features.
  • Users can access the data in SMD from the web
    interface of the package.
  • Disadvantage
  • It supports only Cy3/Cy5 glass slide data
  • It is designed to exclusively use an oracle
    database
  • Has been recently released outside without
    anykind of support !!

40
MaxdSQL
  • Minor changes to the ArrayExpress object data
    model allowed it to be instantiated as a
    relational database, and MaxdSQL is the resulting
    implementation.
  • MaxdSQL supports both Spotted and Affymetrix data
    and not SAGE data.
  • MaxdSQL is associated with the maxdView, a java
    suite of analysis and visualisation tools.This
    tool also provides an environment for developing
    tools and intergrating existing software.
  • MaxdLoad is the data-loading application software.

41
(No Transcript)
42
GeneX
  • Open source database and integrated tool set
    released by NCGR http//www.ncgr.org.
  • Open source - provides a basic infrastructure
    upon which others can build.
  • Stores numeric values for a spot measurement
    (primary or raw data), ratio and averaged data
    across array measurements.
  • Includes a web interface to the database that
    allow users to retrieve
  • Entire datasets, subsets
  • Guided queries for processing by a particular
    analysis routine
  • Download data in both tab delimited form and
    GeneXML format ( more descriptions later)

43
(No Transcript)
44
ExpressDB
  • ExpressDB is a relational database containing
    yeast and E.coli RNA expression data.
  • It has been conceived as an example on how to
    manage that kind of data.
  • It allows web-querying or SQL-querying.
  • It is linked to an integrated database for
    functional genomics called Biomolecule
    Interaction Growth and Expression Database
    (BIGED).
  • BIGED is intended to support and integrate RNA
    expression data with other kinds of functional
    genomics data

45
Survey of existing microarray systems
This survey is based on the article published
in BRIEFINGS IN BIOINFORMATICS, Vol 2, No 2, pp
143-158, May 2001 A comparison of
microarray databases
46
The Microarray Gene Expression Database Group
(MGED)
  • History and Future
  • Founded at a meeting in November, 1999 in
    Cambridge, UK.
  • In May 2000 and March 2001 development of
    recommendations for microarray data annotations
    (MAIME, MAML).
  • MGED 2nd meeting
  • establishment of a steering committee consisting
    of representatives of many of the worlds leading
    microarray laboratories and companies
  • MGED 4th meeting in 2002
  • MAIME 1.0 will be published
  • MAML/GEML and object models will be accepted by
    the OMG
  • concrete ontology and data normalization
    recommendations will be published.
  • information can be obtained from
    http//www.mged.org

47
The Microarray Gene Expression Database Group
(MGED)
  • Goals
  • Facilitate the adoption of standards for
    DNA-array experiment annotation and data
    representation.
  • Introduce standard experimental controls and data
    normalization methods.
  • Establish gene expression data repositories.
  • Allow comparision of gene expression data from
    different sources.

48
MGED Working Groups
  • Goals
  • MIAME Experiment description and data
    representation standards - Alvis Brazma
  • MAGE Introduce standard experimental controls
    and data normalization methods - Paul Spellman.
    This group includes the MAGE-OM and MAGE-ML
    development.
  • OWG Microarray data standards, annotations,
    ontologies and databases - Chris Stoeckert
  • NWG Standards for normalization of microarray
    data and cross-platform comparison - Gavin
    Sherlock

49
References
  • URL
  • Tutorial on Information Management for Genome
    Level Bioinformatics, Paton and Goble, at VLDB
    2001 http//www.dia.uniroma3.it/vldbproc/tutEur
    opea
  • European Molecular Biology Network
    http//www.embnet.org/
  • Univ. Manchester site (with relational version of
    Microarray data representation, and links to
    other sites)
  • http//www.bioinf.man.ac.uk
  • Database textbook with absolutely no
    bioinformatics coverage
  • For Microarray Data
  • http//linkage.rockefeller.edu/wli/microarray/
Write a Comment
User Comments (0)
About PowerShow.com