Title: Approaches to the Integration of Distributed and Heterogeneous Data Resources
- Ahmet Sayar
- Indiana University
- Computer Science Department
2Motivation
- Integrating data from multiple data sources
- Distributed querying of and transactions on data.
- Definition and adoption of data, metadata and their storage models.
- Accessing the data seamlessly.
- Transparency, support for heterogeneity, extensibility and scalability.
3Outline
- Data Integration Approaches
- Application Specific Solutions
- Application-Integration Framework
- ASIS (Application Specific Information System)
- Database Federation
- Ogsa-DAI (Ogsa-Data Access and Integration)
- Compare ASIS with Ogsa-DAI
- Digital Libraries
- SRB (Storage Resource Broker)
- Sompel's Digital Library Approach
- Compare ASIS with SRB and Sompel's DL
4Application Specific Solutions
- The most common means of data integration
- Expensive in terms of time and skills
- Developing and using them requires deep system knowledge
- Better results for special-purpose applications
- Fragile
- Changes to the underlying sources may easily break the application
- Hard to extend
- A new data source requires new code to be written
6Application-Integration Framework
- It can also be called a component-based framework
- Examples: CORBA, or Filters with common interfaces
- Does not necessarily address data integration issues
- Based on a common data model (such as CML and GML)
- With adaptors, if the source changes the adaptor may have to change, but the application may never see it.
- Adding a new source is easy
- A new adaptor may need to be written.
- The adaptor may already exist online.
- No need for detailed system knowledge
- Ex. ASIS - OGC GIS Application Integration Framework
7ASIS (1)
- Enables inter-service communication through well-defined service interfaces, message formats and capabilities metadata.
- Data model is ASL (Application Specific Language)
- Metadata model is the capability document
- Data and metadata have common predefined schemas
- Components are Filter Services
- Web Services, common service interfaces defined in WSDL
- Information/data services enabling distributed access, querying and transformation through their predictable input/output interfaces.
- Chainable, locatable, and capable of updating their metadata manually or dynamically
8ASIS (2)
- Data and data storage model
- Any data can be integrated into the system after being transformed to ASL.
- Heterogeneity is handled at the end-Filters with adaptors.
- ASL is a community-accepted application specific language
- GML (Geography Markup Language) in GIS applications
- CML (Chemical Markup Language) in Chemistry applications
- Filters' common service interfaces
- getCapabilities, getData, getFeatureInfo.
- Requests to the Filters' interfaces
- getCapabilitiesReq, getDataReq, getFeatureInfoReq
- Expected return types are defined in the Filter's capability metadata
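The three common interfaces listed on this slide could be sketched as follows. This is a minimal illustration in Python, not the ASIS implementation: the class names `Filter` and `InMemoryFaultFilter`, the method signatures, the dictionary-shaped capability and the sample feature are all assumptions made for the example. ASIS itself defines these interfaces in WSDL and exchanges XML.

```python
from abc import ABC, abstractmethod


class Filter(ABC):
    """Hypothetical Python rendering of the three common Filter interfaces."""

    @abstractmethod
    def get_capabilities(self) -> dict:
        """Return capability metadata describing what this Filter serves."""

    @abstractmethod
    def get_data(self, layer: str) -> list:
        """Return the data of one layer in the common data model."""

    @abstractmethod
    def get_feature_info(self, layer: str, feature_id: str) -> dict:
        """Return the attributes of a single feature."""


class InMemoryFaultFilter(Filter):
    """Toy end-Filter; the adaptor to a real source is replaced by a dict."""

    def __init__(self):
        # Assumed sample data standing in for features converted to ASL.
        self._layers = {"Fault": {"f1": {"name": "San Andreas"}}}

    def get_capabilities(self):
        return {"service": "InMemoryFaultFilter", "layers": sorted(self._layers)}

    def get_data(self, layer):
        return [{"id": fid, **attrs} for fid, attrs in self._layers[layer].items()]

    def get_feature_info(self, layer, feature_id):
        return dict(self._layers[layer][feature_id])


fault_filter = InMemoryFaultFilter()
capabilities = fault_filter.get_capabilities()
features = fault_filter.get_data("Fault")
```

A client only needs the capability to learn which layers it may request, which is the role the capability metadata plays in the slides.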
9ASIS (3)
- Metadata and metadata storage model
- Data integration is done through the Filters' capability metadata
- Metadata is stored in the local Filter's file system as a flat file.
- Capability
- Inspired by the OGC WMS capability specification.
- Looks like the Dublin Core format.
- A capability-like structure is also used in Gannon's approach (XPOLA) for Grid services security issues.
- Describes dynamic Web/Grid resources.
- Updated manually or dynamically.
- Consists of descriptor, service and provider metadata
- Inter-service communication is achieved without a third party. Enables chains of Filters.
10ASIS (4): Data Access and Filter Chaining
- Each Filter is capable of acting as both a server and a client
- Capability integration is done through the getCapabilities service interface
- Requests for common service interfaces are created in accordance with a predefined XML schema
(Figure: a chain of Filters F1, F2, F3 and F4; figure labels: State Boundary, Earth, Fault, Fault)
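The chaining described on this slide can be illustrated with a small sketch. It assumes, purely for illustration, that a Filter's capability is a list of layer names and that a cascading Filter aggregates the capabilities of the Filters it is a client of. The class `ChainFilter`, its methods, the placeholder data strings and the wiring between F1 to F4 are hypothetical; only the Filter names and layer labels echo the figure.

```python
class ChainFilter:
    """Toy Filter that serves its own layers and forwards to upstream Filters."""

    def __init__(self, name, layers=None, upstream=()):
        self.name = name
        self.layers = dict(layers or {})  # layer name -> data served locally
        self.upstream = list(upstream)    # Filters this one is a client of

    def get_capabilities(self):
        # Server role: advertise local layers plus everything reachable upstream.
        names = set(self.layers)
        for other in self.upstream:
            names.update(other.get_capabilities())  # client role
        return sorted(names)

    def get_data(self, layer):
        if layer in self.layers:
            return self.layers[layer]
        for other in self.upstream:
            if layer in other.get_capabilities():
                return other.get_data(layer)  # forward the request along the chain
        raise KeyError(layer)


# Assumed topology: F1 cascades F2 and F3, and F2 cascades F4.
f3 = ChainFilter("F3", {"State Boundary": "state-boundary-data"})
f4 = ChainFilter("F4", {"Fault": "fault-data"})
f2 = ChainFilter("F2", upstream=[f4])
f1 = ChainFilter("F1", upstream=[f2, f3])

all_layers = f1.get_capabilities()
fault_data = f1.get_data("Fault")
```

No central catalog is consulted here: F1 learns what is available only by asking its neighbours for their capabilities.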
12Database Federation
- Middleware consisting of a database management system
- Uniform access to a number of heterogeneous data sources
- Provides a query language used to combine, contrast, analyze and manipulate the data
- Data integration is done through database integration.
- Combines data from multiple sources in a single SQL statement (query re-creation).
- Ex. Ogsa-DAI (Open Grid Services Architecture Data Access and Integration)
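As a rough illustration of combining data from multiple sources in a single SQL statement, the sketch below uses SQLite's `ATTACH DATABASE` as a stand-in for federation middleware. It is not how any federated DBMS named in these slides works internally; the two in-memory databases, the table names and the sample rows are invented for the example.

```python
import sqlite3

# Two separate in-memory databases play the role of two data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS seismic")

conn.execute("CREATE TABLE main.states (code TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE seismic.faults (name TEXT, state_code TEXT)")
conn.executemany(
    "INSERT INTO main.states VALUES (?, ?)",
    [("CA", "California"), ("NV", "Nevada")],
)
conn.executemany(
    "INSERT INTO seismic.faults VALUES (?, ?)",
    [("San Andreas", "CA"), ("Hayward", "CA"), ("Genoa", "NV")],
)

# One statement spans both sources; the engine hides where each table lives.
rows = conn.execute(
    """
    SELECT s.name, COUNT(f.name)
    FROM main.states AS s
    JOIN seismic.faults AS f ON f.state_code = s.code
    GROUP BY s.name
    ORDER BY s.name
    """
).fetchall()
conn.close()
```

The client writes one query against a uniform schema, which is the property the slide attributes to database federation.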
13Ogsa-DAI (1)
- Provides a common Java API for accessing and integrating data resources, such as relational and XML databases and files, in a Grid environment
- Specifically designed for the OGSA architecture
- SQL queries on relational resources and XPath statements on XML collections
- Provides data pipelining (similar to Filter chaining) via an XML document called the perform document.
- Allows developers to easily add or extend functionality within Ogsa-DAI via the activity document.
14Ogsa-DAI (2)
- Data and storage model
- Any data stored in XML or relational databases, or in files
- No common data model
- Data is provided through GDS (Grid Data Services)
- Uses Ogsa-DQP (Distributed Query Processor) to coordinate access to multiple data services
- The enactment engine is the core of Ogsa-DAI. It orchestrates the running of the perform document
- Information in the perform document includes
- The list of activities and their XML schemas and implementation classes.
- The list of role mappers and their details
- Information about the data resource
15Ogsa-DAI (3)
- Metadata storage model
- Metadata is kept in the Metadata Catalog Service (MCS)
- MCS enables attribute-based querying
- Metadata is for the datasets; the data can be anything (binary, text ...)
- Data integration is done through an XML-based activity file mixing activities (in SQL queries) and metadata
- Simple data access scenario
- A client first contacts a DAISGR (DAI Service Group Registry) to locate the GDSFs (Grid Data Service Factories).
- It accesses suitable GDSFs directly to find out more about their properties and the data resources they represent.
- It asks a GDSF to instantiate a GDS.
- It accesses the resource by sending the GDS the GDS-Perform document.
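The four steps of the simple data access scenario can be walked through with toy stand-ins. None of the classes below belong to the Ogsa-DAI Java API; `Registry`, `GridDataServiceFactory` and `GridDataService` are hypothetical Python objects, and the "perform document" is reduced to a dictionary with a row predicate rather than the real XML document.

```python
class GridDataService:
    """Toy GDS: executes a 'perform document' against one data resource."""

    def __init__(self, resource):
        self._resource = resource

    def perform(self, perform_document):
        # The real perform document is XML; here it is a dict of equality tests.
        wanted = perform_document["where"]
        return [
            row for row in self._resource
            if all(row.get(key) == value for key, value in wanted.items())
        ]


class GridDataServiceFactory:
    """Toy GDSF: describes a data resource and instantiates GDS objects for it."""

    def __init__(self, description, resource):
        self.description = description
        self._resource = resource

    def create_service(self):
        return GridDataService(self._resource)


class Registry:
    """Toy DAISGR: lets clients locate registered factories."""

    def __init__(self):
        self._factories = []

    def register(self, factory):
        self._factories.append(factory)

    def find(self, keyword):
        return [f for f in self._factories if keyword in f.description]


registry = Registry()
registry.register(GridDataServiceFactory(
    "relational: earthquake catalog",
    [{"id": 1, "magnitude": 5}, {"id": 2, "magnitude": 7}],
))

factory = registry.find("earthquake")[0]           # steps 1 and 2: locate and inspect a GDSF
gds = factory.create_service()                     # step 3: ask the GDSF for a GDS
result = gds.perform({"where": {"magnitude": 7}})  # step 4: send the perform document
```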
16Ogsa-DAI (4)
- Metadata model
- No common metadata schema comparable to the ASIS capability
- Defines metadata for the datasets
- No schema in XML
- Stored in database tables as attributes
- Defines metadata for the database system to enable querying and defining activities
- Schema in XML (mcsActivity.xsd schema file)
- Kept as an XML file in the file system (mcsActivity.xml)
17ASIS vs. Ogsa-DAI
- Ogsa-DAI does not define metadata and data in an XML schema. Metadata is mixed with the database schema. ASIS has predefined data and metadata models.
- Ogsa-DAI uses any data, and has a predefined database schema to enable querying and accessing data.
- ASIS's data integration is on demand and based on capability federation. In contrast, Ogsa-DAI's data integration is coded in XML-structured perform and activity documents.
- Ogsa-DAI has a central metadata approach (MCS); ASIS has a distributed one.
- Both systems are based on Web Services.
- Ogsa-DAI uses GridFTP, and ASIS uses NaradaBrokering, to address performance issues in data transfers.
19Digital Libraries
- Main focus is publishing and discovering digital objects.
- Digital objects: a file, URL, SQL command string or any string of bits.
- Collects data from multiple different data sources.
- It is a little bit different from the other data integration approaches
- Data curation services such as publishing data to and removing data from the data sources.
- Ex. SRB (Storage Resource Broker) and Sompel's Digital Library Approach
20SRB (1)
- A federated client-server system
- Each server manages/brokers a set of resources
- An implementation architecture for
- Data grids
- Digital libraries.
- Storage resources include digital libraries, MSS, UniTree and file systems
- SRB consists of three components
- MCAT services,
- SRB servers to access storage repositories and
- SRB clients
- Mediates access to distributed heterogeneous resources
- Uses MCAT (Metadata Catalog Service) to facilitate brokering and attribute-based querying.
- Integrates data and metadata
21SRB (2)
- Data and storage model
- Uniform storage interface
- Resource-specific drivers map the defined storage interface to each storage system
- Storage resources are registered within SRB as physical resources
- Logical resources (LSR) enable replication.
- LSR: one or more physical resources
- Client API refers to LSR. Collections are created by LSR
- Metadata storage model (MCAT)
- Serves both core metadata and domain-dependent metadata
- Core metadata is a standardized schema like Dublin Core
- Stores metadata about data, collections, users, resources, methods
- Attribute-based access and querying, updating of the metadata catalog
- Implemented as a relational database: Oracle, DB2 or Sybase
- Abstraction and replica information for data
- Global user name space and authentication
- Authorization through ACLs and tickets
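The idea that a logical resource groups one or more physical resources and thereby enables replication can be sketched in a few lines. This is an assumption-laden toy, not SRB code: `LogicalResource`, the resource names and the use of plain dictionaries as physical stores are invented for illustration.

```python
class LogicalResource:
    """Toy logical resource (LSR) backed by one or more physical stores."""

    def __init__(self, name, physical_resources):
        self.name = name
        self.physical = physical_resources  # resource name -> dict acting as a store

    def put(self, path, data):
        # Writing through the logical resource replicates to every physical one.
        for store in self.physical.values():
            store[path] = data
        return sorted(self.physical)

    def get(self, path):
        # Any surviving replica can satisfy the read.
        for store in self.physical.values():
            if path in store:
                return store[path]
        raise FileNotFoundError(path)


disk, tape = {}, {}
lsr = LogicalResource("lsr-demo", {"unix-disk": disk, "tape-archive": tape})
replicas = lsr.put("/home/demo/myFile", b"payload")

del disk["/home/demo/myFile"]             # simulate losing one replica
recovered = lsr.get("/home/demo/myFile")  # still served from the other store
```

The client only ever names the logical resource, which matches the slide's point that the client API refers to the LSR.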
22SRB (3)
- Metadata and Metadata Exchange Model
- MAPS (Metadata Attribute Presentation Structure)
- Independent of the internal representation of the attributes inside the catalog.
- Provides a uniform interface specification that can be used between user applications and the MCAT catalog, and vice versa.
- Structures which form MAPS
- MAPS_Query_Struct,
- MAPS_Result_Struct,
- MAPS_Update_Struct and
- MAPS_Definition_Struct
- Mapping from MAPS to other models and exchange formats. A Dublin Core mapping is under implementation.
23SRB (4)
- Simple data access scenario
- The SRB server spawns an SRB agent to authenticate the user/application by comparing it with information stored in MCAT.
- Find the location in MCAT.
- Check the user request against permissions stored in MCAT.
- The SRB agent contacts the user with the result of the request.
- The SRB agent communicates with the user through a port specific to this client session.
- SRB server chaining scenario (integrated SRBs)
- First 3 steps from the simple data access case.
- The SRB agent contacts a remote SRB agent via the remote SRB server.
- The second SRB agent returns the pointer to the data item to the first SRB agent, which passes it on to the user.
- The SRB client interacts with the data item directly. In the federated SRB scheme, one SRB server acts as a client to another.
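The first three steps of the simple data access scenario (authenticate, locate, check permissions) can be sketched against a toy catalog. The dictionary layout of `MCAT`, the function `srb_agent_get`, the user and the storage location string are hypothetical; the real MCAT is a relational database and a real SRB agent does far more than this.

```python
class AccessDenied(Exception):
    """Raised when authentication or authorization fails."""


# Toy catalog standing in for MCAT: users, data locations and ACLs.
MCAT = {
    "users": {"alice": "secret"},
    "locations": {"myFile": "unix-disk:/vault/myFile"},
    "acl": {"myFile": {"alice": {"read"}}},
}


def srb_agent_get(user, password, name, catalog=MCAT):
    """Return the storage location of `name` if `user` may read it."""
    # Step 1: authenticate the user against the catalog.
    if catalog["users"].get(user) != password:
        raise AccessDenied("authentication failed")
    # Step 2: find the location of the data item in the catalog.
    location = catalog["locations"].get(name)
    if location is None:
        raise KeyError(name)
    # Step 3: check the request against the stored permissions.
    if "read" not in catalog["acl"].get(name, {}).get(user, set()):
        raise AccessDenied("permission denied")
    return location


location = srb_agent_get("alice", "secret", "myFile")
```

In the chaining scenario, the same three steps would run first, and the location handed back would come from a remote agent rather than the local catalog.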
24ASIS vs. SRB
- SRB doesn't define metadata in an XML structure (as ASIS does)
- SRB uses any data but ASIS uses ASL
- SRB keeps the metadata in catalogue services (MCAT). ASIS uses XML-structured capability metadata
- SRB has a central metadata handling approach; ASIS has a distributed metadata handling approach
- ASIS's data integration is based on metadata federation; SRB's data integration is based on SRB server federation.
- Instead of Filters, SRB uses SRB servers and agents for accessing data resources.
25Sompel's DL (1)
- Scholarly communication as a network-based workflow
- Instead of Filters and ASL in ASIS, Sompel defines repositories and digital objects, respectively.
- A repository is a networked system that provides services pertaining to a collection of Digital Objects
- Repositories have common service interfaces.
- Obtain, Harvest and Put.
- Two classes of participants.
- Data providers (DP) and Service providers (SP)
- SPs collect metadata from DPs (via the 3 service interfaces), then normalize and cluster it to deal with duplicates.
- DPs offer some type of search mechanism for their own repositories.
26Sompel's DL (2)
- Data and storage model
- Data is the abstraction of the Digital Objects
- Digital Objects = digital data + key metadata.
- Serialization of Digital Objects = Surrogates
- Surrogates
- Information for the value chains and service information used at repository service interfaces.
- In the XML/RDF format
- Composed of dataStream and/or Entity tag elements.
- A chained object is defined by keymetadataID or providerInfo.
- Different storage types: book repositories, teaching object repositories, dataset repositories, etc.
- Repositories are active nodes. Repositories enable the use and re-use of materials in many contexts.
27Sompel's DL (3)
- Metadata model
- Surrogates are essentially metadata records for objects
- Based on the Dublin Core format with domain-specific extensions.
- Dublin Core has 15 standard elements to define resources.
- For more details see http://dublincore.org
- Chaining for integrating data
- The application/user doesn't need to use a workflow engine or script to create or run the chain (as in ASIS).
- The chain (they call it a value chain) is hidden in the surrogates.
- Surrogates are updated through the common interfaces (put, obtain and harvest) of the resources.
- The chain is defined in the Entity element in the surrogate document with the Lineage sub-element.
- Sample chaining scenario
- A paper might have references to some papers, and these papers might in turn reference other papers.
- The value chain does not stop.
- Papers have different metadata (value added) through the value chain
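The sample chaining scenario, in which the chain is hidden in the surrogates themselves, can be sketched as a traversal that needs no external workflow engine. The representation is an assumption: each surrogate is modelled as a dictionary whose `lineage` entry lists the identifiers of the surrogates it references, which only loosely mirrors the Lineage sub-element mentioned on the slide. The function name and paper identifiers are hypothetical.

```python
def value_chain(surrogates, start_id):
    """Follow lineage references depth-first and return each surrogate id once."""
    chain, seen, stack = [], set(), [start_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a paper reachable by two paths is listed only once
        seen.add(current)
        chain.append(current)
        # Reverse so that references are visited in their listed order.
        stack.extend(reversed(surrogates[current]["lineage"]))
    return chain


# Assumed sample data: paper A cites B and C, and B also cites C.
surrogates = {
    "paper-A": {"lineage": ["paper-B", "paper-C"]},
    "paper-B": {"lineage": ["paper-C"]},
    "paper-C": {"lineage": []},
}

chain = value_chain(surrogates, "paper-A")
```

Because every surrogate carries its own references, the chain can keep growing as new papers are put into a repository, without any central description of the workflow.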
28ASIS vs. Sompel's Approach
- Instead of Filters and ASL in ASIS, Sompel defines repositories and digital objects, respectively
- DPs correspond to end-Filters, and SPs correspond to Filters in ASIS
- ASIS does not have publishing or putting service interfaces
- Obtain corresponds to getData in ASIS
- Harvest corresponds to getCapabilities in ASIS
- Both have distributed metadata approaches for data integration
- ASIS: direct communication between Filters by using the getCapabilities interface
- Sompel's DL: direct communication between repositories and services by using the Harvest interface
- Sompel's DL uses Dublin Core for the representation of the resources; ASIS uses its own schema.
- ASIS uses ASL for the representation of the data
- Sompel's approach doesn't have a common data model.
29Summary
- Application-Integration Framework (ASIS)
- Easy to add new sources
- Using online Filters providing required adaptors
- Peer-to-peer chain of Filters
- No central metadata catalog server: distributed capability exchange and aggregation
- SOA
- Re-usable components (Filters) for different applications in a predefined domain
- Implications of Filter services
- Scalable and fault-tolerant
- Load-balancing and caching
- Dynamically updating capability metadata
30THANKS !
31APPENDIX
32Capability in Grid Services Security
- XPOLA
- The infrastructure is built on a peer-to-peer chain-of-trust model. No central admins
- WS-Security compliant
- Extensible, PKI and SAML based
- Dynamic and reusable (manually or automatically generated)
- Composed of two parts.
- Policy document (SAML, lifetime info, binding info, etc.)
- Provider's signature
- Existing grid security solutions for fine-grained authorization did not address general Web/Grid services in compliance with Web Services security specs.
- With central admins, other approaches don't address dynamic services
33Sample Capabilities File (too simplified) GIS
Domain
<?xml version='1.0' encoding="UTF-8" standalone="no" ?>
<!DOCTYPE WMT_MS_Capabilities SYSTEM "http://toro.ucs.indiana.edu:8086/xml/capabilities.dtd">
<Capabilities version="1.1.1" updateSequence="0">
  <Service>
    <Name>CGL_Mapping</Name>
    <Title>CGL_Mapping WMS</Title>
    <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
        xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
    <ContactInformation>
      ..
    </ContactInformation>
  </Service>
  <Capability>
    <Request>
      <GetCapabilities>
        <Format>WMS_XML</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://w3.org/1999/xlink" xlink:type="simple"
              xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetCapabilities>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
        <DCPType><HTTP><Get>
          <OnlineResource xmlns:xlink="http://w3.org/1999/xlink" xlink:type="simple"
              xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />
        </Get></HTTP></DCPType>
      </GetMap>
    </Request>
    <Layer>
      <Name>CaliforniaFaults</Name>
      <Title>CaliforniaFaults</Title>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>
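A client that receives a capabilities document like the sample above would typically read the advertised layers and formats out of it. The sketch below does this with Python's standard `xml.etree.ElementTree` on a trimmed copy of the sample (declaration, DOCTYPE and OnlineResource elements left out); the function name `summarize_capabilities` and the shape of the returned dictionary are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample capabilities file from this slide.
CAPABILITIES_XML = """<Capabilities version="1.1.1" updateSequence="0">
  <Service><Name>CGL_Mapping</Name><Title>CGL_Mapping WMS</Title></Service>
  <Capability>
    <Request>
      <GetMap>
        <Format>image/GIF</Format>
        <Format>image/PNG</Format>
      </GetMap>
    </Request>
    <Layer>
      <Name>CaliforniaFaults</Name>
      <SRS>EPSG:4326</SRS>
      <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82" />
    </Layer>
  </Capability>
</Capabilities>"""


def summarize_capabilities(xml_text):
    """Extract the service name, GetMap formats and layers from the document."""
    root = ET.fromstring(xml_text)
    formats = [f.text for f in root.findall("./Capability/Request/GetMap/Format")]
    layers = []
    for layer in root.findall("./Capability/Layer"):
        box = layer.find("LatLonBoundingBox")
        layers.append({
            "name": layer.findtext("Name"),
            "srs": layer.findtext("SRS"),
            "bbox": tuple(float(box.get(k)) for k in ("minx", "miny", "maxx", "maxy")),
        })
    return {
        "service": root.findtext("./Service/Name"),
        "map_formats": formats,
        "layers": layers,
    }


summary = summarize_capabilities(CAPABILITIES_XML)
```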
34Dublin Core
- Challenge of resource description and discovery
- Language for making a particular class of statements about resources
- There are 2 namespaces: the Dublin Core element set (dc) and Dublin Core qualifiers (dcq, ex. dcq:iso8601).
- Some of the Dublin Core metadata element set
- Title (dc:title), subject, description, creator, publisher, type, format, source, language, rights
- Using DC in RDF: specifications for DC in RDF (work in progress)
- Resource has (verb) property (dc:creator) X (dc:Ahmet)
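The statement pattern "resource has property X" can be made concrete by serializing a few Dublin Core elements as XML. This is a minimal sketch using Python's standard library and the Dublin Core element set namespace `http://purl.org/dc/elements/1.1/`; the wrapper element `record`, the function name and the chosen field values are assumptions, and the output is plain XML rather than the RDF encoding mentioned on the slide.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize with the conventional dc: prefix


def dublin_core_record(fields):
    """Build a <record> whose children are dc:* elements, one per field."""
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


record = dublin_core_record({
    "title": "Approaches to the Integration of Distributed and Heterogeneous Data Resources",
    "creator": "Ahmet Sayar",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
```

Each child element is one statement about the resource: the element name is the property and the text is its value.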
35Sample Dublin Core
http://www.ils.unc.edu/mrc/jcdl2006/slides/kunze.pdf
36Open Archives Initiative (OAI)
37OAI
- Deals with the e-print server world
- Need to develop services that permit searching across papers housed at multiple repositories
- Repositories also needed capabilities to automatically identify and copy papers that had been deposited in them.
- Definition of an interface to permit e-print servers to expose the metadata for the papers that they hold.
- Service providers with similar metadata standards need to harvest this metadata
- Service providers act as a federation of repositories by indexing documents, so that multiple collections can be searched as though they form a single collection
38OAI-PMH
- For the variety of communities engaged in publishing content on the Web
- Any networked server can employ the protocol to enable service providers to collect its metadata
- HTTP-based request-response transactions
- Service Providers
- Harvest metadata from Data Providers using the OAI protocol and use the returned metadata as a basis for building value-added services.
- Data Providers (repositories)
- Adopt the OAI technical framework as a means of exposing metadata about their content.
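Since OAI-PMH is an HTTP-based request-response protocol, a harvest request is just a URL with a `verb` and its arguments. The sketch below builds such a URL without contacting any server. The `verb`, `ListRecords`, `metadataPrefix` and `oai_dc` names come from the OAI-PMH protocol; the helper function `oai_request_url` and the base URL `http://repository.example.org/oai` are made up for the example.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Compose an OAI-PMH request URL from a verb and its optional arguments."""
    query = {"verb": verb}
    # Skip arguments the caller left unset.
    query.update({key: value for key, value in arguments.items() if value is not None})
    return f"{base_url}?{urlencode(query)}"


# A service provider asking a data provider for unqualified Dublin Core records.
url = oai_request_url(
    "http://repository.example.org/oai",  # hypothetical data provider endpoint
    "ListRecords",
    metadataPrefix="oai_dc",
)
```

A service provider would issue a GET on this URL and parse the XML response, then use the harvested records to build its value-added services.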
39Comments on OAI
- OAI-PMH is ultimately only as useful as the metadata it transports.
- The tendency of implementers to almost exclusively apply the lowest common denominator of unqualified Dublin Core makes it difficult to implement more advanced search interface features.
- Content providers should prefer more expressive metadata schemas like MARC or qualified DC and find ways to augment human-generated descriptive metadata.
40Sompel's Digital Library Approach
41Sompel's Approach: Hierarchy steps
http://msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf
42Sompel's DL: Data Model
msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf
43Ogsa-DAI
44Ogsa-DAI Figure
http://www.globus.org/grid_software/data/dai.php
45Perform Document
http://www.ogsadai.org.uk/documentation/ogsadai-wsi-2.2/doc/interaction/Perform.html
46MCS
- MCS presents a design of a Metadata Catalog Service that provides a mechanism for storing and accessing descriptive metadata attributes
- Requirements: store domain-independent attributes and user-defined attributes, query with a set of attributes, query with a logical name, authentication, authorization and auditing
- Allows users to discover data sets based on the value of descriptive attributes, rather than requiring them to know specific names or physical locations of data items
47MCAT vs. MCS
- MCAT can be used only with SRB
- MCS can be used only in the OGSA architecture
- MCAT stores both physical and logical addresses
- MCS stores logical metadata attributes and handles that can be resolved by data location or data access services.
- Both can be extended to serve application-specific metadata, but they don't have a generalized way of doing that.
48SRB
49SRB
50CLIENT
- Example interaction with SRB using Scommands
- Sinit
- Start interaction with SRB
- Spwd
- Display current position within SRB repository
- Smeta -i -I UDSMD0=author -I UDSMD1=bob myfile
- Add metadata describing the author of the file
- Smeta -i -I UDSMD0=author -I UDSMD1=arthur
- Search for files with author metadata set as arthur
- Sget myFile
- Copy myFile from SRB to local storage
- Sreplicate -S anotherResource myFile
- Create a replica of myFile on anotherResource
- Srm myFile
- Remove myFile (and all replicas) from SRB
- Sexit
- End interaction with SRB