Title: New Architectural Paradigms in Data Integration: Dynamic Distributed Federated Databases
1. New Architectural Paradigms in Data Integration
Dynamic Distributed Federated Databases
Rome, July 10, 2009
G. Scioscia, IBM Bari; P. Leo, IBM Bari; M. Santamaria, ITB-CNR Bari; G. Pappadà, Exhicon Bari
2. Agenda
- The eResearch Workplace in the MBLab initiative
- A Dynamic Distributed Federated Database: the Gaian DB
- Data Federation in our molecular Biodiversity activities
3. The Molecular Biodiversity Laboratory
MBLab is a private-public research initiative devoted to the study and use of Molecular Biodiversity. It aims to build innovative bioinformatic systems applied both to human health, to monitor safety and risks, and to the agro-industrial sector, to trace products along food production and supply chains.
The project is co-funded by the Italian Ministry of Research (MIUR) and has a 7M budget (Fondo FAR, Legge 297/1999, Art. 12/lab, Project Grant DM19410).
4. The eResearch Workplace in the MBLab initiative
5. A new type of database Architecture
- Military, industrial, and scientific data volumes continue to grow in size and complexity
- The current trend is a move away from large monolithic database architectures to more distributed database technologies
Within the International Technology Alliance in Network and Information Science (http://www.usukita.org/), a U.S.-U.K. partnership, IBM researchers conceived a new type of database architecture: a Dynamic Distributed Federated Database (DDFD) (G. Bent, et al., Proceedings of the 2nd Conference of the ITA, London, U.K., September 2008)
- DDFD combines the principles of large distributed databases, database federation, network topology, and semantics of data
- Using biologically inspired principles of network growth combined with graph theoretic methods, a DDFD can be developed and maintained
The Gaian DB
6. The Gaian Database Architecture
- The Gaian DB consists of a set of interconnected vertices (N1, N2, ...), each of which is a federated Relational Database Management System (RDBMS)
- Federated DB means that the DB engine is able to access internal and external sources as if they were one logical DB. External sources may include other RDBMSs, such as Oracle, SQL Server, etc., or any other data source such as flat files, Excel files, REST services, etc.
- Data can be stored in any vertex in the network of databases, with the table in which it is stored being available to other vertices through a logical table mapping
- At present the Gaian DB has been implemented on top of the Derby database engine (an open source Java RDBMS)
  - It has the advantage of having a small SW footprint (less than 4MB)
  - Derby facilitates the implementation of heterogeneous federation
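The logical table idea on this slide can be illustrated with a minimal Python sketch. It is not Gaian DB or Derby code: the `LogicalTable` class, the `sequences` table, and the column names are hypothetical, an in-memory SQLite database stands in for an RDBMS source, and a CSV string stands in for a flat file.

```python
import csv
import io
import sqlite3

# Hypothetical sources: an in-memory SQLite DB stands in for an RDBMS,
# and a CSV string stands in for a flat file. Neither is Gaian DB code.
def make_rdbms_source():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequences (accession TEXT, organism TEXT)")
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?)",
        [("ACC001", "Organism A"), ("ACC002", "Organism B")],
    )
    return conn

FLAT_FILE = "accession,organism\nACC003,Organism C\n"

class LogicalTable:
    """Presents several physical sources as one logical table."""

    def __init__(self, columns):
        self.columns = columns
        self.readers = []  # each reader is a callable returning row tuples

    def add_rdbms(self, conn, table):
        cols = ", ".join(self.columns)
        self.readers.append(
            lambda: conn.execute(f"SELECT {cols} FROM {table}").fetchall()
        )

    def add_flat_file(self, text):
        def read():
            rows = csv.DictReader(io.StringIO(text))
            return [tuple(r[c] for c in self.columns) for r in rows]
        self.readers.append(read)

    def select_all(self):
        # Concatenate the rows of every source, as one logical table would.
        rows = []
        for reader in self.readers:
            rows.extend(reader())
        return rows

logical = LogicalTable(["accession", "organism"])
logical.add_rdbms(make_rdbms_source(), "sequences")
logical.add_flat_file(FLAT_FILE)
all_rows = logical.select_all()
```

The caller only sees `select_all()`; whether a row came from the database or from the flat file is hidden behind the mapping, which is the property the slide attributes to a federated engine.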
7. The Gaian DB Query Mechanism
- The Store Locally Query Anywhere (SLQA) mechanism provides for global access to data from any vertex in the database network
- Data is stored in local database tables at any vertex in the network, and it is accessible from any other vertex using Structured Query Language (SQL)-like queries and distributed stored procedure-like processes
- The query propagates through the network and result sets are returned to the querying vertex
- Each vertex that forwards the query is responsible for processing the results obtained from vertices to which the query was forwarded, leading to a distributed aggregation of results
- The efficiency of this database concept is in part determined by how the vertices of the database are connected together. Specific algorithms have been developed:
  - To create appropriate connections (edges) between the vertices in the network (each vertex is not connected to all other vertices)
  - To add new vertices to an existing network
  - To add a vertex without requiring any queries to be issued from the new vertex
  - To minimize the number of paths traversed when the same query is issued from different vertices
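The propagation and distributed aggregation described on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Gaian DB implementation: the four-vertex network, the `LOCAL_ROWS` data, and the `run_query` function are hypothetical, forwarding is sequential rather than parallel, and a shared `visited` set stands in for whatever duplicate-query detection a real network would use.

```python
# Hypothetical network: vertex name -> neighbours (undirected edges).
EDGES = {
    "N1": ["N2", "N3"],
    "N2": ["N1", "N4"],
    "N3": ["N1", "N4"],
    "N4": ["N2", "N3"],
}

# Hypothetical local tables: each vertex stores only its own rows.
LOCAL_ROWS = {
    "N1": [("N1", 10)],
    "N2": [("N2", 20), ("N2", 5)],
    "N3": [],
    "N4": [("N4", 7)],
}

def run_query(vertex, predicate, visited=None):
    """Evaluate the query at `vertex`, forward it to unvisited neighbours,
    and merge their result sets with the local one."""
    if visited is None:
        visited = set()
    visited.add(vertex)
    # Local evaluation: filter this vertex's own rows.
    results = [row for row in LOCAL_ROWS[vertex] if predicate(row)]
    # Forwarding: each forwarding vertex aggregates what comes back.
    for neighbour in EDGES[vertex]:
        if neighbour not in visited:
            results.extend(run_query(neighbour, predicate, visited))
    return results

rows = run_query("N1", lambda row: row[1] >= 7)
```

Each vertex returns a merged result set to the vertex that forwarded the query to it, so the querying vertex (`N1` here) never needs a direct connection to every other vertex.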
8. Some Gaian DB performances
- In the Gaian DB, tools for gathering distributed statistics on each query and corresponding result set are also available
- An explain query allows a user to obtain information about the route of a query through the network of Gaian nodes
- Gaian DB demonstrated its scalability
  - An experimental network consisting of 1000 vertices has been set up
  - Query Time: A query involving all 1000 nodes replies in about 1/8 second. The query time grows logarithmically; in other words, as you add more and more databases, the increase in query time slows down, providing excellent scaling
  - Fetch Time: The Gaian DB in the experiment was able to fetch 1 million rows of data in under 5 seconds
  - Concurrent Queries: Queries from up to 40 nodes at the same time have been injected; the Gaian DB showed that it could handle these queries robustly with a modest increase in the query time
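One way to build intuition for slow growth with network size is to count how many hops a query needs to reach every vertex. The sketch below is only an illustration: it assumes each new vertex links to two existing vertices chosen uniformly at random, which is not the connection algorithm of the Gaian DB, and it measures hop counts rather than query time. The function names and parameters are hypothetical.

```python
import random
from collections import deque

def grow_network(n, links_per_vertex=2, seed=0):
    """Grow a network one vertex at a time; each new vertex connects to
    a few randomly chosen existing vertices (an assumed rule)."""
    rng = random.Random(seed)
    edges = {0: set()}
    for new in range(1, n):
        k = min(links_per_vertex, new)
        targets = rng.sample(range(new), k)
        edges[new] = set(targets)
        for t in targets:
            edges[t].add(new)
    return edges

def max_hops(edges, start=0):
    """Breadth-first search: hops needed for a query to reach every vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in edges[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

# Hop counts for networks of increasing size, from vertex 0.
hops_by_size = {n: max_hops(grow_network(n)) for n in (10, 100, 1000)}
```

Printing `hops_by_size` shows how the hop count changes as the network grows a hundredfold under this assumed growth rule; the measured figures on the slide come from the authors' experiment, not from this model.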
9. Fields of applicability of the Gaian DB and our experiments
- Gaian DBs have been tested with success in different domains
  - In wide sensor networks: each sensor hosts a vertex of the Gaian DB
  - In distributed semantic search applications: structured information extracted from unstructured information is stored as RDF triples in various locations
- OUR EXPERIMENTS, in MBLab, with GAIAN DB in the BIOINFORMATIC DOMAIN
  - We set up a Gaian DB with multiple vertices (up to 6) with Bioinformatics databases
  - Databases are both local private DBs and remote public DBs (e.g. GenBank)
  - One database is a flat file (UNIPROT replica) to test heterogeneous data sources
  - For each database a Gaian DB vertex has been deployed
- RESULTS
  - Performance is good, in line with expectations
  - The largest effort in the federation task is creating, on different DBs, logical tables with a consistent mapping of the available data
In other words, it is important that the DBs to be federated converge (or are mapped) toward a common logical schema
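The mapping effort described in the results can be pictured as a per-source renaming of columns onto one shared schema. The following Python sketch is hypothetical throughout: the source names, column names, and the `to_logical` helper are invented for illustration and do not reflect the schemas used in the MBLab experiments.

```python
# Hypothetical column mappings: source column name -> logical column name.
MAPPINGS = {
    "local_db": {"acc": "accession", "species": "organism", "len": "length"},
    "public_db": {
        "accession_id": "accession",
        "organism_name": "organism",
        "seq_length": "length",
    },
}

# The common logical schema every source must be mapped onto.
LOGICAL_COLUMNS = ("accession", "organism", "length")

def to_logical(source, row):
    """Rename a source row's columns to the common logical schema."""
    mapping = MAPPINGS[source]
    renamed = {mapping[col]: value for col, value in row.items() if col in mapping}
    missing = [c for c in LOGICAL_COLUMNS if c not in renamed]
    if missing:
        raise ValueError(f"{source} row lacks logical columns: {missing}")
    return tuple(renamed[c] for c in LOGICAL_COLUMNS)

federated = [
    to_logical("local_db", {"acc": "ACC001", "species": "Organism A", "len": 650}),
    to_logical(
        "public_db",
        {
            "accession_id": "ACC002",
            "organism_name": "Organism B",
            "seq_length": 702,
            "extra": "ignored",  # unmapped columns are dropped
        },
    ),
]
```

Writing and maintaining one such mapping per source is where the effort goes; a source that cannot supply every logical column fails loudly here instead of silently producing inconsistent rows.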
10. A query-based approach to identify the best mitochondrial barcode candidate. Applied to Fungi
Intron distribution and length in the Ascomycota mtDNA genes
The pervasiveness of introns in the Ascomycota is highlighted, but specific regions from at least five alternative genes, namely ND2, ND4, ND6, LSRNA and SSRNA, are intron-poor and large enough to be barcode candidates for Ascomycota.
11. Barcode Primers Designer: workflow for database retrieval, analysis and design of barcode primers for a specific taxonomic group
Multiple alignment of the multifasta files (emma, EMBOSS)
Retrieve fwd and rev primers from the GenBank entry
Query GenBank
Retrieval
Select a taxonomic group
Calculate oligonucleotide properties
Barcode primers designer
Calculate consensus sequence (cons, EMBOSS)
Check for amplification of targets other than the input template
Consensus sequence viewer (e.g. WebLogo)
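The consensus step of the workflow can be illustrated with a simple column-wise majority vote. This sketch does not reproduce the algorithm of the EMBOSS `cons` program named on the slide; the `majority_consensus` function, its tie rule, and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

def majority_consensus(aligned, gap="-", unknown="n"):
    """Column-wise majority vote over equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*aligned):
        # Gaps do not vote; a column made only of gaps stays a gap.
        counts = Counter(base for base in column if base != gap)
        if not counts:
            consensus.append(gap)
            continue
        (top, top_n), *rest = counts.most_common(2)
        # A tie between the two most common bases is reported as unknown.
        tied = bool(rest) and rest[0][1] == top_n
        consensus.append(unknown if tied else top)
    return "".join(consensus)

# Toy alignment, invented for this example.
toy_alignment = ["atgc-a", "atgcta", "acgcta", "atg-tg"]
consensus = majority_consensus(toy_alignment)
```

A production tool would typically weight sequences and use a substitution matrix; the majority vote is only the simplest instance of deriving one sequence from an alignment.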
12. Barcode Primers Designer: A web-based graphical interface for users
[Interface mock-up: Download (File1, File2, File3, File3); two panels, "Forward barcode primer details" and "Reverse barcode primer details", each showing Consensus sequence (atgctagcatgcatgcgactg), View alignment, View WebLogo, and Consensus primer features (Molecular Weight 10087, Param1 986, Param2 56)]
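As an illustration of the kind of primer features such an interface could report, the sketch below computes length, GC fraction, and a melting temperature estimate with the Wallace rule, Tm = 2(A+T) + 4(G+C) °C, which is a rough approximation intended for short oligonucleotides. The `oligo_properties` function is hypothetical, and the choice of properties is an assumption; the slide does not state which formulas the tool uses.

```python
def oligo_properties(seq):
    """Length, GC fraction, and Wallace-rule Tm for a short DNA oligo."""
    seq = seq.lower()
    if not seq or set(seq) - set("acgt"):
        raise ValueError("sequence must be non-empty and contain only a, c, g, t")
    at = seq.count("a") + seq.count("t")
    gc = seq.count("g") + seq.count("c")
    return {
        "length": len(seq),
        "gc_fraction": gc / len(seq),
        # Wallace rule: Tm = 2*(A+T) + 4*(G+C) degrees Celsius,
        # a rough estimate for short oligos.
        "tm_wallace_celsius": 2 * at + 4 * gc,
    }

# The example sequence shown in the interface mock-up.
props = oligo_properties("atgctagcatgcatgcgactg")
```

For the 21-base example sequence this gives 10 A/T and 11 G/C bases, hence a Wallace estimate of 64 °C.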
13. Conclusions
- Data federation is a viable solution to integrate heterogeneous, dispersed pieces of information
- Within the federation approach (when it is suitable), the Gaian DB architecture is a great, innovative solution
- Setting up a Gaian DB requires a large effort to accommodate different schemas
- This is one more reason to converge toward a common schema and shared ontologies in the Biodiversity domain (where different attempts have already been made)
- One of the main tasks of the MBLab initiative is setting up a wide-converging information schema for molecular Biodiversity studies
14. Questions?