New Architectural Paradigms in Data Integration: Dynamic Distributed Federated Databases presentation

Transcript and Presenter's Notes

1
New Architectural Paradigms in Data Integration
Dynamic Distributed Federated Databases
Rome, July 10, 2009
G. Scioscia, IBM Bari; P. Leo, IBM Bari;
M. Santamaria, ITB-CNR Bari; G. Pappadà, Exhicon Bari
Title slide
2
Agenda

The eResearch Workplace in the MBLab initiative
A Dynamic Distributed Federated Database: the
Gaian DB
Data Federation in our molecular Biodiversity
activities

3
The Molecular Biodiversity Laboratory
MBLab is a private-public research initiative
involved in the study and use of Molecular
Biodiversity. It aims to build innovative
bioinformatic systems, applied both to human
health (to monitor safety and risks) and to the
agro-industrial sector (to trace products along
food production and supply chains).
The project is co-funded by the Italian Ministry
of Research (MIUR) and has a 7M budget (Fondo FAR,
Legge 297/1999 Art. 12/lab, Project Grant DM19410)
4
The eResearch Workplace in the MBLab initiative
5
A new type of database Architecture

Military, industrial and scientific data volumes
continue to grow in size and complexity
The current trend is a move away from large
monolithic database architectures toward more
distributed database technologies

Within the International Technology Alliance in
Network and Information Science
(http://www.usukita.org/), a U.S.-U.K.
partnership, IBM researchers conceived a new
type of database architecture: a Dynamic
Distributed Federated Database (DDFD) (G. Bent
et al., Proceedings of the 2nd Conference of the
ITA, London, U.K., September 2008)

A DDFD combines the principles of large distributed
databases, database federation, network topology
and the semantics of data
A DDFD can be developed and maintained using
biologically inspired principles of network
growth combined with graph-theoretic methods

The Gaian DB
6
The Gaian Database Architecture

The Gaian DB consists of a set of interconnected
vertices (N1, N2, ...), each of which is a federated
Relational Database Management System (RDBMS)
Federated DB means that the DB engine is able to
access internal and external sources as if they
were one logical DB. External sources may include
other RDBMSs, such as Oracle or SQL Server, or
any other data source, such as flat files, Excel
files or REST services.
Data can be stored at any vertex in the network
of databases; the table in which it is stored is
made available to other vertices through a
logical table mapping
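
A minimal Python sketch of the logical table mapping idea described above. It is not Gaian DB code: the logical column names, the two source names and their physical column names are all hypothetical, and plain Python lists stand in for an RDBMS result set and a flat file.

```python
import csv
import io

# Hypothetical logical table: every source is exposed with these columns.
LOGICAL_COLUMNS = ("accession", "organism", "sequence")

# Hypothetical per-source mappings: logical column -> physical column name.
MAPPINGS = {
    "local_rdbms": {"accession": "acc_id", "organism": "species", "sequence": "seq"},
    "flat_file": {"accession": "AC", "organism": "OS", "sequence": "SQ"},
}


def to_logical(source_name, physical_rows):
    """Rename each physical row's columns to the shared logical schema."""
    mapping = MAPPINGS[source_name]
    for row in physical_rows:
        yield {col: row[mapping[col]] for col in LOGICAL_COLUMNS}


def query_logical_table(sources, predicate=lambda row: True):
    """Scan all sources as if they were one logical table."""
    results = []
    for source_name, physical_rows in sources.items():
        results.extend(r for r in to_logical(source_name, physical_rows) if predicate(r))
    return results


# Stand-ins for an RDBMS result set and a tab-separated flat file.
rdbms_rows = [{"acc_id": "A1", "species": "Aspergillus niger", "seq": "atgc"}]
flat_text = "AC\tOS\tSQ\nB2\tCandida albicans\tggta\n"
flat_rows = list(csv.DictReader(io.StringIO(flat_text), delimiter="\t"))

rows = query_logical_table({"local_rdbms": rdbms_rows, "flat_file": flat_rows})
```

After the mapping, both rows carry the same three logical columns, so a caller can filter or join them without knowing which source each row came from.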

The Gaian DB is currently implemented on top of
the Derby database engine (an open source Java
RDBMS)
Derby has the advantage of a small software
footprint (less than 4MB)
Derby facilitates the implementation of
heterogeneous federation

7
The Gaian DB Query Mechanism

The Store Locally Query Anywhere (SLQA) mechanism
provides global access to data from any
vertex in the database network
Data is stored in local database tables at any
vertex in the network and is accessible from
any other vertex using Structured Query Language
(SQL)-like queries and distributed processes
similar to stored procedures
The query propagates through the network and
result sets are returned to the querying vertex

Each vertex that forwards the query is
responsible for processing the results obtained
from vertices to which the query was forwarded,
leading to a distributed aggregation of results
The efficiency of this database concept is in part
determined by how the vertices of the database
are connected together. Specific algorithms have
been developed:
To create appropriate connections (edges) between
the vertices in the network (each vertex is not
connected to all other vertices)
To add new vertices to an existing network
To add a vertex in a way that does not require
any queries to be issued from the new vertex
To minimize the number of paths traversed when
the same query is issued from different
vertices
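
The propagation and distributed aggregation described on this slide can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Gaian DB algorithm: the traversal is sequential and depth-first, the network and the local rows are hypothetical, and a shared `seen` set stands in for whatever mechanism the real system uses to avoid answering the same query twice.

```python
def propagate(network, local_rows, vertex, predicate, seen=None):
    """Run a query at `vertex`, forward it to neighbours, and merge the results.

    `seen` records vertices that already handled this query, so a vertex
    reachable by several paths answers only once.
    """
    if seen is None:
        seen = set()
    seen.add(vertex)
    # Rows matched in this vertex's own local tables.
    results = [row for row in local_rows.get(vertex, []) if predicate(row)]
    for neighbour in network[vertex]:
        if neighbour not in seen:
            # The forwarding vertex collects what its neighbours return.
            results.extend(propagate(network, local_rows, neighbour, predicate, seen))
    return results


# Hypothetical four-vertex network in which no vertex is connected to all others.
network = {
    "N1": ["N2", "N3"],
    "N2": ["N1", "N4"],
    "N3": ["N1", "N4"],
    "N4": ["N2", "N3"],
}
local_rows = {"N1": [5, 12], "N2": [20], "N3": [7], "N4": [15, 3]}

hits = propagate(network, local_rows, "N1", lambda value: value > 10)
```

The same set of rows comes back whichever vertex issues the query, which is the property the SLQA mechanism is meant to provide.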

8
Some Gaian DB performance results

The Gaian DB also provides tools for gathering
distributed statistics on each query and its
result set
An explain query allows a user to obtain
information about the route of a query through
the network of Gaian nodes

The Gaian DB has demonstrated its scalability
An experimental network consisting of 1000
vertices has been set up
Query Time: A query involving all 1000 nodes
replies in about 1/8 second. The query time
grows logarithmically: as you add
more and more databases, the increase in query
time slows down, providing excellent scaling
Fetch Time: The Gaian DB in the experiment was
able to fetch 1 million rows of data in under 5
seconds
Concurrent Queries: Queries from up to 40 nodes
have been injected at the same time; the Gaian DB
handled these queries robustly, with only a
modest increase in the query time
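
One way to see why query time can grow logarithmically is to count forwarding hops. The Python sketch below assumes a simple tree model in which each vertex forwards the query to a fixed number of new vertices; both the model and the fan-out of 3 are illustrative assumptions, not measured properties of the Gaian DB network.

```python
def hops_to_reach(n_vertices, fan_out):
    """Smallest depth at which a tree with the given fan-out holds n_vertices."""
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    reached, frontier, depth = 1, 1, 0
    while reached < n_vertices:
        frontier *= fan_out   # vertices first reached at the next depth
        reached += frontier
        depth += 1
    return depth


# Hypothetical fan-out of 3 for network sizes 10, 100 and 1000.
depths = {n: hops_to_reach(n, fan_out=3) for n in (10, 100, 1000)}
```

Under these assumptions each tenfold increase in network size adds only two hops (2, 4 and 6 hops for 10, 100 and 1000 vertices).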

9
Fields of applicability of the Gaian DB and our
experiments

Gaian DBs have been tested successfully in
different domains:
In wide sensor networks, where each sensor hosts
a vertex of the Gaian DB
In distributed semantic search applications,
where structured information extracted from
unstructured information is stored as RDF triples
in various locations

OUR EXPERIMENTS, in MBLab, with GAIAN DB in the
BIOINFORMATIC DOMAIN
We set up a Gaian DB with multiple vertices (up to
6) over Bioinformatics databases
The databases are both local private DBs and
remote public DBs (e.g. GenBank)
One database is a flat file (UNIPROT replica), to
test heterogeneous data sources
For each database a Gaian DB vertex has been
deployed

RESULTS
Performance is good, as expected
The largest effort in the federation task is
creating, on the different DBs, logical tables
with a consistent mapping of the available data

In other words, it is important that the DBs to
be federated converge (or are mapped) toward a
common logical schema
10
A query-based approach to identify the best
mitochondrial barcode candidate, applied to Fungi

Intron distribution and length in the Ascomycota
mtDNA genes
The pervasiveness of introns in the Ascomycota is
highlighted, but specific regions from at least
five alternative genes (ND2, ND4, ND6, LSRNA and
SSRNA) are intron-poor and large enough to be
barcode candidates for Ascomycota.
11
Barcode Primers Designer: workflow for database
retrieval, analysis and design of barcode primers
for a definite taxonomic group
Multiple alignment of the multifasta files (emma,
EMBOSS)
Retrieve fwd and rev primers from the GenBank
entry
Query GenBank
Retrieval
Select a taxonomic group
Calculate oligonucleotide properties
Barcode primers designer
Calculate consensus sequence (cons, EMBOSS)
Check for amplification of targets other than
the input template
Consensus sequence viewer (e.g. WebLogo)
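
Two of the workflow steps, calculating a consensus sequence and calculating oligonucleotide properties, can be illustrated with a small Python sketch. It is a simplified stand-in, not the EMBOSS cons program: the consensus is a plain per-column majority vote over already aligned sequences, GC content is the only property computed, and the three input sequences are made up.

```python
from collections import Counter


def consensus(aligned):
    """Most frequent character in each alignment column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.lower()
    return (seq.count("g") + seq.count("c")) / len(seq)


# Hypothetical aligned sequences; the third differs at one position.
aligned = ["atgcta", "atgcta", "atccta"]
cons_seq = consensus(aligned)
cons_gc = gc_content(cons_seq)
```

A gap character in the alignment is counted like any other symbol here; a real tool would treat gaps and ambiguous positions more carefully.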
12
Barcode Primers Designer: a web-based graphical
interface for users
[Screenshot of the results page: download links
for the output files and, for each of the forward
and reverse barcode primers, the consensus
sequence, links to view the alignment and the
WebLogo, and the consensus primer features
(molecular weight and other parameters)]
13
Conclusions

Data federation is a viable solution for
integrating heterogeneous, dispersed pieces of
information
Within the federation approach (when it is
suitable), the Gaian DB architecture is an
innovative solution
Setting up a Gaian DB requires a large effort to
accommodate different schemas
This is one more reason to converge toward a
common schema and shared ontologies in the
Biodiversity domain (where different attempts
have already been made)
One of the main tasks of the MBLab initiative is
setting up a widely shared information schema
for molecular Biodiversity studies

14
Questions?
