Title: Grid Based Solutions for Distributed Data Management
1. Grid Based Solutions for Distributed Data Management
- Reagan W. Moore
- San Diego Supercomputer Center
- http://www.npaci.edu/DICE
- moore_at_sdsc.edu
2. Topics
- Managing data residing in multiple storage systems
- Building collections of distributed data
- Supporting digital library services
- Federating collections
- Preserving collections
3. Storage Resource Broker
- Generic data management infrastructure that is used to support
  - Data grids for data sharing
  - Digital libraries for data publication
  - Persistent archives for data preservation
- Manages distributed data on national and international scales
  - California Digital Library
  - NSF National Science Digital Library
  - Worldwide Universities Network data grid
5. Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date, ...)
  - Access constraints
Naming conventions provided by storage systems
6. Storage Resource Broker Data Grid
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date, ...)
  - Access constraints
- Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
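The relationship between the two columns on this slide can be sketched as a lookup from a logical file name to the repository-specific names of each physical copy. This is a minimal in-memory illustration, not the SRB API: the class names `PhysicalReplica` and `LogicalCatalog`, the hosts, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalReplica:
    """One physical copy, described with the storage repository's own names."""
    storage_location: str   # host of the storage system (hypothetical)
    user_name: str          # account used on that storage system
    file_name: str          # path under the repository's naming convention


@dataclass
class LogicalCatalog:
    """Maps location-independent logical names to physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: PhysicalReplica) -> None:
        # One logical name may point to several physical copies.
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list:
        # Raises KeyError if the logical name was never registered.
        return list(self.entries[logical_name])


catalog = LogicalCatalog()
catalog.register(
    "/prdla/maps/pacific.tif",
    PhysicalReplica("archive.example.org", "srbadmin", "/hpss/u12/f0093.dat"),
)
catalog.register(
    "/prdla/maps/pacific.tif",
    PhysicalReplica("fs.example.edu.au", "grid", "/export/data/pacific.tif"),
)
replicas = catalog.resolve("/prdla/maps/pacific.tif")
```

The user application only ever sees the logical name; which replica is read, and under which account, is decided behind the catalog.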
7. Discovery
- Data grids associate metadata with each digital entity (file, SQL command, URL) that is registered
  - Administrative metadata (location of file, owner, access controls, size, audit trail)
  - Descriptive metadata (Dublin Core, annotations)
  - Curator-defined metadata (can define collection-level metadata, and metadata unique to a digital entity)
- Metadata query mechanisms include
  - Web browsers, DSpace, OAI-PMH, WSDL, Perl, Python, Windows browser, Java class library, Unix shell commands, C library calls
9. Search Capabilities
- Browse within collection hierarchy
- Search by attribute name and operations on attribute values across all types of metadata
  - Dublin Core attributes
  - Administrative attributes
  - Curator-defined attributes
- SRB manages access controls on metadata attributes and on digital entities
  - Metadata not displayed for digital entities that have restricted access
  - Metadata not displayed for attributes that have restricted access
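The two display rules above can be sketched as a filter applied before any attribute query runs. This is a hedged illustration only: the `Entity` structure, the attribute names, and the users are hypothetical, and the real SRB catalog is a relational database rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    logical_name: str
    metadata: dict                                        # attribute name -> value
    readers: set = field(default_factory=set)             # users allowed to see the entity
    restricted_attrs: dict = field(default_factory=dict)  # attribute -> users allowed to see it


def visible_metadata(entity, user):
    """Return the attributes `user` may see, or None if the entity is restricted."""
    if user not in entity.readers:
        return None
    return {
        name: value
        for name, value in entity.metadata.items()
        if name not in entity.restricted_attrs
        or user in entity.restricted_attrs[name]
    }


def search(entities, user, attribute, predicate):
    """Names of entities whose visible `attribute` value satisfies `predicate`."""
    hits = []
    for entity in entities:
        visible = visible_metadata(entity, user)
        if visible is not None and attribute in visible and predicate(visible[attribute]):
            hits.append(entity.logical_name)
    return hits


entities = [
    Entity("/coll/a.txt",
           {"dc:title": "Harbor survey", "size": 1200, "curator:note": "draft"},
           readers={"alice", "bob"},
           restricted_attrs={"curator:note": {"alice"}}),
    Entity("/coll/b.txt",
           {"dc:title": "Harbor map", "size": 50},
           readers={"alice"}),
]

alice_hits = search(entities, "alice", "dc:title", lambda v: v.startswith("Harbor"))
bob_hits = search(entities, "bob", "dc:title", lambda v: v.startswith("Harbor"))
bob_notes = search(entities, "bob", "curator:note", lambda v: v == "draft")
```

Here the same title query returns two records for one user and one for the other, and a query on a restricted attribute returns nothing for a user who cannot see that attribute.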
10. Access Mechanisms
- Files: clicking on the record downloads the file
- URLs: clicking redirects to the web page
- SQL commands: clicking causes the SQL command (with input parameters) to be issued to the database, and the result is returned as HTML or XML
- Additional operations that support
  - Replication / Caching / Staging / Pre-fetch (partial read) / Bulk unload / Parallel I/O streams / Remote procedures for filtering and subsetting
- Asynchronous interfaces
  - DSpace mechanisms, Storage Resource Manager
11. Timeliness
- Data grids self-consistently manage all registered digital entities
  - All operations on digital entities automatically update the administrative metadata
  - Synchronization flags kept for replicas
  - Write locks kept for files aggregated into containers
- Federated digital libraries are synchronized under curator control
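One way to picture the synchronization flags kept for replicas is a per-copy "stale" marker that is set whenever another copy is written and cleared when the copies are brought back in step. The sketch below assumes that reading of the slide; the class names and the `dirty` field are hypothetical, not SRB's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    location: str
    content: bytes
    dirty: bool = False   # True when this copy is out of date


@dataclass
class ReplicatedFile:
    replicas: list = field(default_factory=list)

    def write(self, location: str, content: bytes) -> None:
        """Update one replica and flag every other copy as stale."""
        if all(r.location != location for r in self.replicas):
            raise KeyError(location)
        for replica in self.replicas:
            if replica.location == location:
                replica.content = content
                replica.dirty = False
            else:
                replica.dirty = True

    def synchronize(self) -> int:
        """Copy current content onto stale replicas; return how many were refreshed."""
        current = next(r for r in self.replicas if not r.dirty)
        refreshed = 0
        for replica in self.replicas:
            if replica.dirty:
                replica.content = current.content
                replica.dirty = False
                refreshed += 1
        return refreshed


shared = ReplicatedFile([Replica("sdsc", b"v1"), Replica("taiwan", b"v1")])
shared.write("sdsc", b"v2")
stale_before = [r.location for r in shared.replicas if r.dirty]
refreshed = shared.synchronize()
```

Keeping the flag in the catalog rather than in the file means a reader can be steered away from a stale copy before the synchronization has run.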
12. Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection B
Data Collection A
- Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
Access controls and consistency constraints on cross registration of digital entities
13. Consistency Constraints
- Master-slave data grids
  - The entries in the slave data grid are registered under control of the master data grid
- Peer-to-peer data grids
  - Curators register selected material into another data grid. Access controls are kept by the original data grid.
- Central repository
  - Remote data grids push material, user names, metadata into a central repository
- Deep archive
  - Digital entities and metadata are replicated into a data grid under curator control, but no other users are allowed access
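The peer-to-peer case, where material is cross-registered into another data grid but access controls stay with the original grid, can be sketched as a delegated permission check. This is an illustrative model under that reading of the slide; the `DataGrid` class and its methods are hypothetical and do not correspond to SRB zone commands.

```python
class DataGrid:
    def __init__(self, name):
        self.name = name
        self.acl = {}      # logical name -> users allowed to read (entities owned here)
        self.remote = {}   # logical name -> DataGrid that owns the entity

    def add(self, logical_name, readers):
        self.acl[logical_name] = set(readers)

    def cross_register(self, logical_name, other):
        """Curator action: make a locally owned entity visible in another grid."""
        if logical_name not in self.acl:
            raise KeyError(logical_name)
        other.remote[logical_name] = self

    def can_read(self, user, logical_name):
        if logical_name in self.acl:
            return user in self.acl[logical_name]
        owner = self.remote.get(logical_name)
        # Access controls are kept by the original grid, so the check is delegated.
        return owner is not None and owner.can_read(user, logical_name)


grid_a = DataGrid("A")
grid_b = DataGrid("B")
grid_a.add("/a/report.pdf", {"alice"})
grid_a.cross_register("/a/report.pdf", grid_b)

alice_ok = grid_b.can_read("alice", "/a/report.pdf")
bob_ok = grid_b.can_read("bob", "/a/report.pdf")
unknown_ok = grid_b.can_read("alice", "/a/other.pdf")
```

The other three arrangements on the slide differ mainly in who is allowed to call the registration step and in which direction material flows.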
14. Software Costs
- Storage Resource Broker clients are open source - distributed for free
- Storage Resource Broker server source code is distributed to academic institutions for free
- Commercial companies should talk to the University of California, San Diego Technology Transfer Office for server source code
- SRB data grid uses commercially available systems for storing
  - Metadata - Oracle, DB2, Sybase, Informix, PostgreSQL, MySQL
  - Files - Unix file systems, Linux, Mac OS X, Windows, binary large objects in databases, object ring buffers, HPSS, UniTree, ADSM, DMF, archival storage systems
- If you use Postgres or MySQL for your database, the cost is zero. However, large collections (millions of files) should use a commercial database
15. Hardware Costs
- SRB software can be installed on laptops (Windows, Linux, Mac OS X), servers (Sun, Linux, Irix, AIX, HP), and supercomputers (clusters)
  - Installation on a Mac laptop takes 15 minutes, including a Postgres database, metadata catalog, server, and clients
- Grid Bricks - commodity-based disk systems
  - Provide 2.5 GHz CPU, 1 Gbyte of memory, Gig-E network connection, 5 terabytes of disk, RAID controller, Linux operating system
  - Effective cost is $2000 per terabyte
  - Modular system that can be expanded by adding grid bricks. The SRB data grid manages global name spaces.
- If you use your own storage system, the cost is zero
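The grid brick figures allow a rough sizing estimate. The sketch below assumes the full 5 terabytes per brick is usable, that the 2000-per-terabyte figure is in US dollars, and that bricks are bought whole; none of these assumptions is stated on the slide.

```python
import math

BRICK_CAPACITY_TB = 5    # from the slide: 5 terabytes of disk per brick
COST_PER_TB = 2000       # from the slide; currency assumed to be US dollars


def bricks_needed(collection_tb: float) -> int:
    """Smallest whole number of bricks that holds the collection."""
    return math.ceil(collection_tb / BRICK_CAPACITY_TB)


def hardware_cost(collection_tb: float) -> int:
    """Cost of the whole bricks purchased, not just the capacity used."""
    return bricks_needed(collection_tb) * BRICK_CAPACITY_TB * COST_PER_TB


bricks_for_12_tb = bricks_needed(12)
cost_for_12_tb = hardware_cost(12)
cost_for_small_collection = hardware_cost(0.8)
```

Under these assumptions a 12-terabyte collection needs three bricks, and even a sub-terabyte collection carries the cost of one full brick.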
16. Processing and Administrative Costs
- SRB data grid supports digital entities
  - Any type of file can be stored
  - Files can be registered from an existing storage system, preserving both the organization and names
- Administration costs
  - Data grid administrator - manage the data grid servers, track problems with access to storage systems, install additional servers, register users
  - Database administrator - manage the database in which the metadata is stored, perform backups, track software upgrades
  - Security, network, and storage system administrators - standard administrative support for storage systems and networks
17. Summary
- SRB provides collection management of data distributed across multiple storage systems
  - Support technology evolution - migration to new storage systems and new databases
  - Support federation - controlled sharing and publication of data between data grids
  - Support preservation - tracking of audit trails, checksums for validating integrity
  - Support all sizes of collections - thousands to hundreds of millions of records
18. Data Grid Federation - zoneSRB
(Architecture diagram; layers as listed on the slide)
- Application
- HTTP, DSpace, OpenDAP; OAI, WSDL, WSRF; DLL / Python, Perl; Linux I/O; Java, NT Browsers
- Federation Management
- Consistency Metadata Management / Authorization, Authentication, Audit
- Logical Name Space; Latency Management; Data Transport; Metadata Transport
- Storage Repository Virtualization; Catalog Abstraction
- Databases: DB2, Oracle, Sybase, Postgres, MySQL, Informix
- ORB
19. For More Information
- Reagan W. Moore
- San Diego Supercomputer Center
- moore_at_sdsc.edu
- http://www.npaci.edu/DICE
- http://www.npaci.edu/DICE/SRB
- http://www.npaci.edu/dice/srb/mySRB/mySRB.html
20. PRDLA Collection at SDSC (2003)
- Collection size
  - 800 Gbytes
  - 14 million files
- Server capacity
  - Windows NT with 2 Tbytes disk
  - AIT2 tape library for backup, 1 Tbyte of tape
  - 3 web servers
- Access rate
  - Average 1 million web page accesses per month
  - Does not count Siku server
21. Data Grid Opportunities
- Provide uniform interface to data collections that reside at member sites
  - Provides way to extend PRDLA published holdings by incorporating new material
- Replicate collections between sites
  - Provides way to protect against natural disasters
- Integrate file access with archive access
  - Provides way to preserve collections
22. Data Grids
- Software systems that manage distributed data
- Organize distributed data into a logical collection
- Provide global naming conventions
- Location independent identifiers
- Support curation processes
- Access controls for adding files
- Browsing and discovery services
23. Accessing Data at Multiple Sites
Each site has its own naming convention for files
User Application
A data grid provides a uniform way to name and access the files across the sites
Archive at SDSC
File System in Australia
File System in Taiwan
24. Building Distributed Collection
- Logical name space
- Location independent identifier
- Persistent identifier
- Collection owned data
- Access controls
- Audit trails
- Checksums
- Descriptive metadata
- Inter-realm authentication
- Single sign-on system
User Application
Data Grid - Common naming convention and set of attributes for describing digital entities
Archive at SDSC
File System in Australia
File System in Taiwan
25. Collection Metadata Catalog
Logical file name space
(associate metadata attributes with the logical file name)
- Physical location of the file
- Name of the file on the storage system
- Size of the file
- Owner of the file
- Access controls on the file
(associate digital library attributes with the logical file name)
- Descriptive metadata about the file
- Dublin Core provenance information about the file
- Annotations on the file
26. Storage Systems Provide
- File name - naming convention for files
- Storage location - IP address of the storage system
- User name - persons who have access to the storage system
- File context (creation date, ...) - state information about each file
- Access constraints - controls on access
Each storage repository uses a different set of naming conventions
27. Managing Distributed Data (Replace naming conventions used by a storage repository with naming conventions managed by the data grid)
- Storage Repository
  - File name
  - Storage location
  - User name
  - File context (creation date, ...)
  - Access constraints
- Data Grid
  - Logical file name space
  - Logical resource name space
  - Logical user name space
  - Logical metadata context
  - Control/consistency constraints
28. Accessing Multiple Types of Storage Systems
User Application
Archive at SDSC
Database in Australia
File System in Taiwan
29. Standard Data Access Operations
- Remote operations
- Unix file system
- Latency management
- Procedures
- Transformations
- Third party transfer
- Filtering
- Queries
User Application
Common set of operations for interacting with every type of storage repository
Archive at SDSC
Database in Australia
File System in Taiwan
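A common set of operations over different repository types is often expressed as one abstract interface with a driver per storage system. The sketch below illustrates that pattern with in-memory stand-ins; `StorageDriver`, the two driver classes, and `copy_between` are hypothetical names, and real SRB drivers are implemented in the server, not in Python.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Operations every repository type must offer to the data grid."""

    @abstractmethod
    def read(self, name: str) -> bytes: ...

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...


class FileSystemDriver(StorageDriver):
    def __init__(self):
        self._files = {}   # stands in for a real file system

    def read(self, name):
        return self._files[name]

    def write(self, name, data):
        self._files[name] = data


class DatabaseDriver(StorageDriver):
    def __init__(self):
        self._rows = []    # stands in for a table of (name, blob) rows

    def read(self, name):
        # The most recently written row for a name wins.
        for row_name, blob in reversed(self._rows):
            if row_name == name:
                return blob
        raise KeyError(name)

    def write(self, name, data):
        self._rows.append((name, data))


def copy_between(source: StorageDriver, target: StorageDriver, name: str) -> None:
    """Copy an entity without the caller knowing either repository's type."""
    target.write(name, source.read(name))


file_system = FileSystemDriver()
database = DatabaseDriver()
file_system.write("survey.dat", b"payload")
copy_between(file_system, database, "survey.dat")
copied = database.read("survey.dat")
```

The user application calls the same two operations whether the bytes end up in an archive, a database, or a file system.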
30. Data Grid Applications
- Data grid for managing distributed data
  - Latency management for bulk analyses of collections
  - Infrastructure independent name spaces for describing data, resources, users, and state information
- Digital library for managing data context
  - Curation services for managing collections
  - Descriptive metadata
- Persistent archive to manage technology evolution
  - Interoperability mechanisms between heterogeneous storage systems and user access mechanisms
31. Provide uniform interface to data collections that reside at member sites
- Install a Storage Resource Broker application-level server on each storage system that holds data
- Register the data into the PRDLA data grid
  - Establishes a logical file name for each file
- Create a collection hierarchy to support browsing and discovery
- Register PRDLA metadata for each file
  - The SRB data grid manages the metadata and automatically updates information on the location of the file
- Provide web-based access to the collections
  - Other access mechanisms support bulk load operations
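The registration step can be sketched as a walk over an existing directory tree that assigns each file a logical name under a collection while keeping the original organization and names. This is only an illustration of the idea: `register_tree` and the collection path `/prdla/site1` are hypothetical, and actual registration is done with SRB's own tools.

```python
import os
import tempfile


def register_tree(root: str, collection: str) -> dict:
    """Map each file under `root` to a logical name under `collection`."""
    catalog = {}
    for directory, _subdirs, files in os.walk(root):
        for file_name in files:
            physical = os.path.join(directory, file_name)
            relative = os.path.relpath(physical, root)
            # Logical names always use "/" regardless of the local path separator.
            logical = collection.rstrip("/") + "/" + relative.replace(os.sep, "/")
            catalog[logical] = physical
    return catalog


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "maps"))
    with open(os.path.join(root, "maps", "pacific.txt"), "w") as handle:
        handle.write("sample")
    catalog = register_tree(root, "/prdla/site1")
    logical_names = sorted(catalog)
```

The directory layout under the site's root becomes the collection hierarchy that users browse, so no files need to be moved or renamed.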
32. Replicate collections between sites
- Use data grid commands to replicate a collection onto a remote storage system
  - Information about the replicated files is kept in the metadata catalog
- Provides way to support load balancing
  - Sites access data that is closer to them
- Provides a way to protect against a local natural disaster
  - Files can be retrieved from the remote site
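Both benefits, reading from the closer copy and falling back to a remote copy after a local failure, come down to choosing among the replicas recorded in the catalog. A minimal sketch follows, assuming a hypothetical distance table and a set of currently reachable sites; the slide does not describe how SRB actually selects a replica.

```python
def choose_replica(replicas, site_distance, available):
    """Pick the nearest replica whose site is currently reachable.

    replicas: site names holding a copy
    site_distance: site name -> distance from the requesting site (any unit)
    available: set of site names that are reachable
    """
    candidates = [site for site in replicas if site in available]
    if not candidates:
        raise LookupError("no reachable replica")
    return min(candidates, key=lambda site: site_distance[site])


replica_sites = ["sdsc", "taiwan"]
distance_from_taipei = {"sdsc": 11000, "taiwan": 50}   # illustrative values

normal_choice = choose_replica(replica_sites, distance_from_taipei, {"sdsc", "taiwan"})
after_local_outage = choose_replica(replica_sites, distance_from_taipei, {"sdsc"})
```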
33. Integrate file access with archive access
- Can also replicate metadata catalog between sites
  - Provides way to manage long-term preservation, a deep archive
- Data grid provides synchronization mechanisms to update the metadata catalog
  - Can control execution of the synchronization mechanisms
- Data grid provides file validation mechanisms to verify file integrity (checksums)
  - Can verify a local copy against the checksums stored in the metadata catalog
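Verifying a local copy against a stored checksum can be sketched with Python's standard `hashlib`. The slides do not say which checksum algorithm the data grid uses; SHA-256 is chosen here purely as an assumption, and the function names are illustrative.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex digest of the file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()   # algorithm is an assumption, not from the slides
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, catalog_checksum: str) -> bool:
    """True when the local copy still matches the checksum held in the catalog."""
    return file_checksum(path) == catalog_checksum


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "record.dat")
    with open(path, "wb") as handle:
        handle.write(b"archived content")
    stored = file_checksum(path)       # value recorded at registration time
    intact = verify(path, stored)
    with open(path, "ab") as handle:
        handle.write(b"!")             # simulate corruption of the local copy
    corrupted = verify(path, stored)
```

A copy that fails the check can then be replaced from another replica whose checksum still matches.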
34. PRDLA Data Grid
- Propose the formation of a data grid linking PRDLA sites
  - Support data sharing
  - Support integration of digital libraries
  - Support preservation environments
- Storage Resource Broker data grid is in production use in international projects
35. Data Grid Installations
- Australia - University of Queensland, APAC
- Japan - KEK (Tsukuba)
- Korea - KISTI, Korea Institute of Science and Technology Information
- Singapore - National University of Singapore
- Taiwan - National Taiwan University
- University of California - California Digital Library, UCSD