New Architectural Paradigms in Data Integration: Dynamic Distributed Federated Databases

1
New Architectural Paradigms in Data Integration
Dynamic Distributed Federated Databases
Rome, July 10, 2009
G. Scioscia, IBM Bari
P. Leo, IBM Bari
M. Santamaria, ITB-CNR Bari
G. Pappadà, Exhicon Bari
2
Agenda
  • The eResearch Workplace in the MBLab initiative
  • A Dynamic Distributed Federated Database: the
    Gaian DB
  • Data federation in our molecular biodiversity
    activities

3
The Molecular Biodiversity Laboratory
MBLab is a private-public research initiative
devoted to the study and use of Molecular
Biodiversity. It aims to build innovative
bioinformatics systems, applied both to human
health, in order to monitor safety and risks,
and to the agro-industrial sector, to trace
products along food production and supply chains.
The project is co-funded by the Italian Ministry
of Research (MIUR) and has a 7M budget (Fondo FAR -
Legge 297/1999 Art. 12/lab, Project Grant DM19410).
4
The eResearch Workplace in the MBLab initiative
5
A new type of database Architecture
  • Military, industrial, scientific data volumes
    continue to grow in size and complexity
  • Current trend is a move away from large
    monolithic database architectures to more
    distributed database technologies

Within the International Technology Alliance in
Network and Information Science
(http://www.usukita.org/), a U.S.-U.K.
partnership, IBM researchers conceived a new
type of database architecture: a Dynamic
Distributed Federated Database (DDFD) (G. Bent
et al., Proceedings of the 2nd Conference of the
ITA, London, U.K., September 2008)
  • A DDFD combines the principles of large distributed
    databases, database federation, network topology
    and the semantics of data
  • Using biologically inspired principles of network
    growth combined with graph-theoretic methods, a
    DDFD can be developed and maintained
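
The slide does not say which growth principle is meant. As a minimal sketch, assuming a preferential-attachment rule (a well-known growth model in which well-connected vertices attract more new links), a network can be grown one vertex at a time as below. The function names, the two-links-per-vertex default and the seed are illustrative assumptions, not details of the DDFD itself.

```python
import random
from collections import deque


def grow_network(n_vertices, links_per_vertex=2, seed=0):
    """Grow an undirected graph by preferential attachment (assumed rule).

    Returns an adjacency dict {vertex: set_of_neighbours}.
    """
    if n_vertices < 2 or links_per_vertex < 1:
        raise ValueError("need at least two vertices and one link per vertex")
    rng = random.Random(seed)
    adjacency = {0: {1}, 1: {0}}
    # Each vertex appears here once per edge it has, so a uniform draw
    # selects a vertex with probability proportional to its degree.
    endpoints = [0, 1]
    for new in range(2, n_vertices):
        wanted = min(links_per_vertex, new)  # cannot exceed existing vertices
        targets = set()
        while len(targets) < wanted:
            targets.add(rng.choice(endpoints))
        adjacency[new] = set()
        for target in sorted(targets):
            adjacency[new].add(target)
            adjacency[target].add(new)
            endpoints.extend((new, target))
    return adjacency


def max_hops(adjacency, start=0):
    """Breadth-first search: (largest hop count from start, vertices reached)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbour in adjacency[vertex]:
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    return max(distance.values()), len(distance)
```

Because every new vertex links to vertices that already exist, the grown network stays connected, and max_hops gives a rough graph-theoretic measure of how far a query would have to travel from a given vertex.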

The Gaian DB
6
The Gaian Database Architecture
  • The Gaian DB consists of a set of interconnected
    vertices (N1, N2, ...), each of which is a federated
    Relational Database Management System (RDBMS)
  • Federated DB means that the DB engine is able to
    access internal and external sources as if they
    were one logical DB. External sources may include
    other RDBMSs, such as Oracle or SQL Server, or
    any other data source, such as flat files, Excel
    files or REST services
  • Data can be stored in any vertex in the network
    of databases, with the table in which it is stored
    being available to other vertices through a
    logical table mapping
  • At present the Gaian DB has been implemented on
    top of the Derby database engine (an open source
    Java RDBMS)
  • Derby has the advantage of a small software
    footprint (less than 4MB)
  • Derby facilitates the implementation of
    heterogeneous federation
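
The idea of a logical table over heterogeneous sources can be sketched as follows. This is not the Gaian DB or Derby API: LogicalTable, rdbms_source and flat_file_source are hypothetical names, SQLite stands in for an RDBMS vertex, and the sample rows are made up for illustration.

```python
import csv
import io
import sqlite3


def rdbms_source(connection, sql):
    """Return a callable yielding the rows of a SQL query as dicts."""
    def rows():
        cursor = connection.execute(sql)
        names = [column[0] for column in cursor.description]
        for record in cursor:
            yield dict(zip(names, record))
    return rows


def flat_file_source(text, column_map):
    """Return a callable yielding CSV rows renamed to logical column names."""
    def rows():
        for record in csv.DictReader(io.StringIO(text)):
            yield {logical: record[physical]
                   for logical, physical in column_map.items()}
    return rows


class LogicalTable:
    """One logical table backed by several physical sources (hypothetical)."""

    def __init__(self, columns, sources):
        self.columns = columns
        self.sources = sources

    def select(self, predicate=lambda row: True):
        for source in self.sources:
            for row in source():
                projected = {name: row[name] for name in self.columns}
                if predicate(projected):
                    yield projected


# Illustrative data only: one in-memory RDBMS table and one flat file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seqs (acc TEXT, org TEXT)")
db.execute("INSERT INTO seqs VALUES ('ACC001', 'Saccharomyces cerevisiae')")
flat_file = "id,species\nACC002,Neurospora crassa\n"

table = LogicalTable(
    ["accession", "organism"],
    [
        rdbms_source(db, "SELECT acc AS accession, org AS organism FROM seqs"),
        flat_file_source(flat_file, {"accession": "id", "organism": "species"}),
    ],
)
```

Calling list(table.select()) returns rows from both sources under the same two logical column names, which is the effect the logical table mapping is meant to achieve.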

7
The Gaian DB Query Mechanism
  • The Store Locally Query Anywhere (SLQA) mechanism
    provides global access to data from any
    vertex in the database network
  • Data is stored in local database tables at any
    vertex in the network, and it is accessible from
    any other vertex using SQL-like queries and
    distributed, stored-procedure-like processes
  • The query propagates through the network and
    result sets are returned to the querying vertex
  • Each vertex that forwards the query is
    responsible for processing the results obtained
    from vertices to which the query was forwarded,
    leading to a distributed aggregation of results
  • The efficiency of this database concept is in part
    determined by how the vertices of the database
    are connected together. Specific algorithms have
    been developed:
  • To create appropriate connections (edges) between
    the vertices in the network (each vertex is not
    connected to all other vertices)
  • To add new vertices to an existing network
  • To add a vertex without requiring any queries
    to be issued from the new vertex
  • To minimize the number of paths traversed when
    the same query is issued from different
    vertices
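
The propagation and distributed aggregation described above can be sketched as a flood over the network in which each vertex filters its own rows and merges what its neighbours return. This is a simplified illustration, not the Gaian DB implementation: run_query, the shared seen set (standing in for each vertex remembering a query it has already handled) and the recursive call (standing in for network forwarding) are all assumptions.

```python
def run_query(network, tables, origin, predicate):
    """Flood a query from origin and aggregate results along the way.

    network: {vertex: [neighbouring vertices]}
    tables:  {vertex: [rows stored locally at that vertex]}
    """
    seen = set()  # vertices that have already processed this query

    def visit(vertex):
        seen.add(vertex)
        # Local part of the answer: rows stored at this vertex.
        results = [row for row in tables.get(vertex, []) if predicate(row)]
        # Forward to neighbours that have not seen the query yet and merge
        # whatever they return before replying upstream.
        for neighbour in network[vertex]:
            if neighbour not in seen:
                results.extend(visit(neighbour))
        return results

    return visit(origin)
```

With the seen check, a query crossing a cycle in the network visits each vertex once, so no row is returned twice. For large networks an iterative traversal would avoid Python's recursion limit.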

8
Some Gaian DB performance results
  • Tools for gathering distributed statistics on
    each query and the corresponding result set are
    also available in the Gaian DB
  • An explain query allows a user to obtain
    information about the route of a query through
    the network of Gaian nodes
  • Gaian DB demonstrated its scalability
  • An experimental network consisting of 1000
    vertices has been set up
  • Query time: a query involving all 1000 nodes
    returns in about 1/8 of a second. The query time
    grows logarithmically: as more and more databases
    are added, the increase in query time slows down,
    which gives good scaling
  • Fetch time: the Gaian DB in the experiment was
    able to fetch 1 million rows of data in under 5
    seconds
  • Concurrent queries: queries were injected from up
    to 40 nodes at the same time, and the Gaian DB
    handled them robustly with a modest increase in
    the query time
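
One way to see why query time can grow logarithmically is an idealized model, assumed here and not taken from the experiment: the querying vertex forwards to a fixed number of neighbours, each of which forwards to the same number of vertices that have not yet seen the query. The number of forwarding rounds needed to reach every vertex then grows with the logarithm of the network size.

```python
def forwarding_rounds(n_vertices, fanout):
    """Rounds needed to reach n_vertices when every vertex forwards a query
    to `fanout` not-yet-reached vertices (idealized tree-like propagation)."""
    if n_vertices < 1 or fanout < 2:
        raise ValueError("need n_vertices >= 1 and fanout >= 2")
    reached, frontier, rounds = 1, 1, 0
    while reached < n_vertices:
        frontier *= fanout      # vertices reached for the first time this round
        reached += frontier
        rounds += 1
    return rounds


for size in (10, 100, 1000, 1_000_000):
    print(size, forwarding_rounds(size, fanout=3))
# prints: 10 2, 100 4, 1000 6, 1000000 13 (one pair per line)
```

Going from 1000 to 1,000,000 vertices roughly doubles the number of rounds in this model. Real query times also depend on latency, local processing and the actual topology, none of which the sketch captures.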

9
Fields of applicability of the Gaian DB and our
experiments
  • Gaian DBs have been tested with success in
    different domains
  • In wide sensor networks: each sensor hosts a
    vertex of the Gaian DB
  • In distributed semantic search applications:
    structured information extracted from
    unstructured information is stored as RDF triples
    in various locations
  • OUR EXPERIMENTS, in MBLab, with GAIAN DB in the
    BIOINFORMATIC DOMAIN
  • We set up a Gaian DB with multiple vertices (up to
    6) hosting bioinformatics databases
  • Databases are both local private DBs and remote
    public DBs (e.g. GenBank)
  • One database is a flat file (UNIPROT replica), to
    test heterogeneous data sources
  • For each database a Gaian DB vertex has been
    deployed
  • RESULTS
  • Performance is good, as expected
  • The largest effort in the federation task is
    creating, on different DBs, logical tables with
    consistent mapping of the available data

In other words, it is important that the DBs to be
federated converge (or are mapped) toward a
common logical schema
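
A small check of this kind can make mapping gaps visible before federation. The sketch below assumes each source database declares a mapping from the common logical columns to its own physical columns. The function name, the column names and the source names are hypothetical.

```python
def missing_columns(logical_columns, mappings):
    """Return {source: [logical columns it does not map]} for incomplete sources.

    mappings: {source_name: {logical_column: physical_column}}
    """
    problems = {}
    for source, mapping in mappings.items():
        missing = [name for name in logical_columns if name not in mapping]
        if missing:
            problems[source] = missing
    return problems


# Hypothetical common schema and per-source mappings.
common_schema = ["accession", "organism", "sequence"]
source_mappings = {
    "local_db": {"accession": "acc", "organism": "org", "sequence": "seq"},
    "flat_file": {"accession": "id", "organism": "species"},
}
gaps = missing_columns(common_schema, source_mappings)
# gaps == {"flat_file": ["sequence"]}
```

An empty result means every source covers the whole logical schema. It does not check that the mapped columns carry the same meaning, which is where shared ontologies come in.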
10
A query-based approach to identify the best
mitochondrial barcode candidate, applied to Fungi

Intron distribution and length in the Ascomycota
mtDNA genes
The pervasiveness of introns in the Ascomycota is
highlighted, but specific regions from at least
five alternative genes, namely ND2, ND4, ND6,
LSRNA and SSRNA, are intron-poor and large enough
to be barcode candidates for Ascomycota.
11
Barcode Primers Designer: workflow for database
retrieval, analysis and design of barcode primers
for a definite taxonomic group
  • Multiple alignment of the multifasta files (emma,
    EMBOSS)
  • Retrieve fwd and rev primers from the GenBank
    entry
  • Query GenBank
  • Retrieval
  • Select a taxonomic group
  • Calculate oligonucleotide properties
  • Barcode primers designer
  • Calculate consensus sequence (cons, EMBOSS)
  • Check for amplification of targets other than
    the input template
  • Consensus sequence viewer (e.g. WebLogo)
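
To illustrate the consensus step of this workflow, here is a minimal majority-vote sketch over already aligned sequences. It is a simplification and not the algorithm used by the EMBOSS cons program, which relies on a scoring matrix and a plurality threshold. The function name, the gap character and the "n" tie symbol are assumptions.

```python
from collections import Counter


def majority_consensus(aligned, gap="-", ambiguous="n"):
    """Column-wise majority consensus of equal-length aligned sequences."""
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(sequence) != length for sequence in aligned):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(base.lower() for base in column if base != gap)
        if not counts:                 # column contains only gaps
            consensus.append(gap)
            continue
        (top, votes), *rest = counts.most_common(2)
        if rest and rest[0][1] == votes:
            consensus.append(ambiguous)  # tie between the two top bases
        else:
            consensus.append(top)
    return "".join(consensus)


print(majority_consensus(["atgc", "atgc", "aagc"]))  # prints "atgc"
```

Gaps are ignored when counting, so a column is reported as a gap only when no sequence has a base there.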
12
Barcode Primers Designer: a web-based graphical
interface for users
Download: File1, File2, File3
Forward barcode primer details and Reverse barcode
primer details, each showing:
  • Consensus sequence (atgctagcatgcatgcgactg)
  • View alignment
  • View WebLogo
  • Consensus primer features: Molecular Weight
    10087, Param1 986, Param2 56
13
Conclusions
  • Data federation is a viable solution for integrating
    heterogeneous, dispersed pieces of information
  • Within the federation approach (when it is
    suitable), the Gaian DB architecture is an
    innovative solution that scaled well in the
    tests reported here
  • Setting up a Gaian DB requires a large effort to
    accommodate different schemas
  • This is one more reason to converge toward a
    common schema and shared ontologies in the
    Biodiversity domain (where different attempts
    have already been made)
  • One of the main tasks of the MBLab initiative is
    setting up a wide-converging information schema
    for molecular Biodiversity studies

14
Questions?