Grid Based Solutions for Distributed Data Management

San Diego Supercomputer Center. National Partnership for Advanced Computational Infrastructure ... University of California, San Diego Technology Transfer ...


1
Grid Based Solutions for Distributed Data
Management
  • Reagan W. Moore
  • San Diego Supercomputer Center
  • http://www.npaci.edu/DICE
  • moore@sdsc.edu

2
Topics
  • Managing data residing in multiple storage
    systems
  • Building collections of distributed data
  • Supporting digital library services
  • Federating collections
  • Preserving collections

3
Storage Resource Broker
  • Generic data management infrastructure that is
    used to support
      • Data grids for data sharing
      • Digital libraries for data publication
      • Persistent archives for data preservation
  • Manages distributed data on national and
    international scales
      • California Digital Library
      • NSF National Science Digital Library
      • Worldwide Universities Network data grid

4
(No Transcript)
5
Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
  • Storage Repository
      • Storage location
      • User name
      • File name
      • File context (creation date, ...)
      • Access constraints

Naming conventions provided by storage systems
6
Storage Resource Broker Data Grid
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection
  • Storage Repository
      • Storage location
      • User name
      • File name
      • File context (creation date, ...)
      • Access constraints
  • Data Grid
      • Logical resource name space
      • Logical user name space
      • Logical file name space
      • Logical context (metadata)
      • Control/consistency constraints
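
The mapping from storage-specific names to logical name spaces can be sketched in a few lines of Python. This is only an illustration of the idea, not the SRB catalog schema: the Catalog and Replica classes, the resource names, the host names, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical)
    physical_path: str  # name used by the storage system itself


@dataclass
class Catalog:
    # logical resource name -> storage location (host); values are made up
    resources: dict = field(default_factory=dict)
    # logical file name -> list of replicas
    files: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        """Record that a physical file is a copy of the given logical file."""
        if resource not in self.resources:
            raise KeyError(f"unknown logical resource: {resource}")
        self.files.setdefault(logical_name, []).append(
            Replica(resource, physical_path))

    def resolve(self, logical_name):
        """Return (host, physical path) pairs for every copy of a logical file."""
        return [(self.resources[r.resource], r.physical_path)
                for r in self.files[logical_name]]


catalog = Catalog(resources={"sdsc-archive": "archive.example.org",
                             "tw-disk": "fs-tw.example.org"})
catalog.register("/prdla/maps/map001.tif", "sdsc-archive", "/hpss/u/a1/0001.dat")
catalog.register("/prdla/maps/map001.tif", "tw-disk", "/export/data/map001.tif")
locations = catalog.resolve("/prdla/maps/map001.tif")
print(locations)
```

The user application only ever sees the logical file name; the storage location and the storage system's own file name stay inside the catalog.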

7
Discovery
  • Data grids associate metadata with each digital
    entity (file, SQL command, URL) that is
    registered
      • Administrative metadata (location of file, owner,
        access controls, size, audit trail)
      • Descriptive metadata (Dublin Core, annotations)
      • Curator-defined metadata (can define
        collection-level metadata, and metadata unique
        to a digital entity)
  • Metadata query mechanisms include
      • Web browsers, DSpace, OAI-PMH, WSDL, Perl,
        Python, Windows browser, Java class library, Unix
        shell commands, C library calls

8
(No Transcript)
9
Search Capabilities
  • Browse within collection hierarchy
  • Search by attribute name and operations on
    attribute values across all types of metadata
      • Dublin Core attributes
      • Administrative attributes
      • Curator-defined attributes
  • SRB manages access controls on metadata
    attributes and on digital entities
      • Metadata is not displayed for digital entities
        that have restricted access
      • Metadata is not displayed for attributes that
        have restricted access
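
A minimal sketch of attribute search with both kinds of access control, assuming a toy in-memory catalog. The entity names, user names, attribute names, and the search function are hypothetical and do not reflect the SRB query interface.

```python
import operator

# Comparison operations allowed on attribute values
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

# logical name -> users allowed to read the entity, plus its metadata
ENTITIES = {
    "/demo/a.txt": {"readers": {"alice", "bob"},
                    "meta": {"creator": "Moore", "size": 120}},
    "/demo/b.txt": {"readers": {"alice"},
                    "meta": {"creator": "Moore", "size": 900}},
}
# attribute -> users allowed to see it (attributes not listed are public)
RESTRICTED_ATTRS = {"size": {"alice"}}


def visible(user, attr):
    """True if the user may see this metadata attribute."""
    return attr not in RESTRICTED_ATTRS or user in RESTRICTED_ATTRS[attr]


def search(user, attr, op, value):
    """Return {logical name: visible metadata} for entities matching attr op value."""
    if not visible(user, attr):
        return {}  # a restricted attribute cannot be used to probe values
    hits = {}
    for name, entity in ENTITIES.items():
        if user not in entity["readers"]:
            continue  # restricted entity: none of its metadata is displayed
        if attr in entity["meta"] and OPS[op](entity["meta"][attr], value):
            hits[name] = {k: v for k, v in entity["meta"].items()
                          if visible(user, k)}
    return hits


print(search("alice", "creator", "=", "Moore"))
print(search("bob", "creator", "=", "Moore"))
```

In this sketch a user without access to an entity gets no hit for it at all, and a user without access to an attribute neither sees it in results nor can search on it.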

10
Access Mechanisms
  • Files: clicking on the record downloads the file
  • URLs: clicking redirects to the web page
  • SQL commands: clicking causes the SQL command
    (with input parameters) to be issued to the
    database, and the result is returned as HTML or
    XML
  • Additional operations that support
      • Replication / Caching / Staging / Pre-fetch
        (partial read) / Bulk unload / Parallel I/O
        streams / Remote procedures for filtering and
        subsetting
  • Asynchronous interfaces
      • DSpace mechanisms, Storage Resource Manager

11
Timeliness
  • Data grids self-consistently manage all
    registered digital entities
  • All operations on digital entities automatically
    update the administrative metadata
  • Synchronization flags kept for replicas
  • Write locks kept for files aggregated into
    containers
  • Federated digital libraries are synchronized
    under curator control
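
One way to picture synchronization flags and automatic metadata updates is the sketch below. It is an assumption about how such bookkeeping could look, not SRB's implementation: the Entry and ReplicaState classes and the resource names are hypothetical, and the actual copying of bytes is left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ReplicaState:
    resource: str
    is_current: bool = True  # synchronization flag


@dataclass
class Entry:
    size: int = 0
    modified: float = 0.0
    audit: list = field(default_factory=list)
    replicas: list = field(default_factory=list)


def write(entry, resource, data, user):
    """Record a write to one replica and update administrative metadata."""
    for rep in entry.replicas:
        # Only the replica that was written is current; the others are stale.
        rep.is_current = (rep.resource == resource)
    entry.size = len(data)
    entry.modified = time.time()
    entry.audit.append((user, "write", resource))


def synchronize(entry, user):
    """Mark stale replicas as current again (the byte copy is omitted here)."""
    for rep in entry.replicas:
        if not rep.is_current:
            rep.is_current = True
            entry.audit.append((user, "sync", rep.resource))


entry = Entry(replicas=[ReplicaState("disk-a"), ReplicaState("tape-b")])
write(entry, "disk-a", b"hello", "curator")
stale = [r.resource for r in entry.replicas if not r.is_current]
synchronize(entry, "curator")
print(stale, entry.audit)
```

Because every operation goes through functions like these, size, modification time, and audit trail never have to be updated by hand.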

12
Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection A
  • Data Grid
      • Logical resource name space
      • Logical user name space
      • Logical file name space
      • Logical context (metadata)
      • Control/consistency constraints
Data Collection B
  • Data Grid
      • Logical resource name space
      • Logical user name space
      • Logical file name space
      • Logical context (metadata)
      • Control/consistency constraints

Access controls and consistency constraints on
cross registration of digital entities
13
Consistency Constraints
  • Master-slave data grids
      • The entries in the slave data grid are registered
        under control of the master data grid
  • Peer-to-peer data grids
      • Curators register selected material into another
        data grid. Access controls are kept by the
        original data grid.
  • Central repository
      • Remote data grids push material, user names, and
        metadata into a central repository
  • Deep archive
      • Digital entities and metadata are replicated into
        a data grid under curator control, but no other
        users are allowed access

14
Software Costs
  • Storage Resource Broker clients are open source -
    distributed for free
  • Storage Resource Broker server source code is
    distributed to academic institutions for free
  • Commercial companies should talk to the
    University of California, San Diego Technology
    Transfer Office for server source code
  • SRB data grid uses commercially available systems
    for storing
      • Metadata - Oracle, DB2, Sybase, Informix,
        PostgreSQL, MySQL
      • Files - Unix file systems, Linux, Mac OS X,
        Windows, binary large objects in databases,
        object ring buffers, HPSS, UniTree, ADSM, DMF,
        archival storage systems
  • If you use Postgres or MySQL for your database,
    the cost is zero. However, large collections
    (millions of files) should use a commercial
    database

15
Hardware Costs
  • SRB software can be installed on laptops
    (Windows, Linux, Mac OS X), servers (Sun, Linux,
    Irix, AIX, HP), and supercomputers (clusters)
  • Installation on a Mac laptop takes 15 minutes,
    including a Postgres database, metadata catalog,
    server, and clients
  • Grid Bricks - commodity-based disk systems
      • Provide 2.5 GHz CPU, 1 Gbyte of memory, Gig-E
        network connection, 5 terabytes of disk, RAID
        controller, Linux operating system
      • Effective cost is 2000 per terabyte
      • Modular system that can be expanded by adding
        grid bricks. The SRB data grid manages global
        name spaces.
  • If you use your own storage system, the cost is
    zero
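
The grid brick figures support a quick sizing estimate. The slide gives 5 terabytes of disk per brick and an effective cost of 2000 per terabyte (the transcript does not state the currency). The 12-terabyte collection below is a hypothetical example, and charging for whole bricks is an assumption.

```python
import math

BRICK_CAPACITY_TB = 5   # from the slide
COST_PER_TB = 2000      # from the slide; currency not stated in the transcript


def bricks_needed(collection_tb):
    """Number of whole grid bricks needed to hold a collection of this size."""
    return math.ceil(collection_tb / BRICK_CAPACITY_TB)


def hardware_cost(collection_tb):
    """Cost of those bricks, charging for full bricks rather than used capacity."""
    return bricks_needed(collection_tb) * BRICK_CAPACITY_TB * COST_PER_TB


cost_per_brick = BRICK_CAPACITY_TB * COST_PER_TB
print(cost_per_brick)       # 10000
print(bricks_needed(12))    # 3
print(hardware_cost(12))    # 30000
```

Under these assumptions one brick costs 10000, and a 12-terabyte collection needs three bricks.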

16
Processing and Administrative Costs
  • SRB data grid supports digital entities
      • Any type of file can be stored
      • Files can be registered from an existing storage
        system, preserving both the organization and
        names
  • Administration costs
      • Data grid administrator - manages the data grid
        servers, tracks problems with access to storage
        systems, installs additional servers,
        registers users
      • Database administrator - manages the database in
        which the metadata is stored, performs backups,
        tracks software upgrades
      • Security, network, and storage system
        administrators - standard administrative support
        for storage systems and networks

17
Summary
  • SRB provides collection management of data
    distributed across multiple storage systems
  • Support technology evolution - migration to new
    storage systems and new databases
  • Support federation - controlled sharing and
    publication of data between data grids
  • Support preservation - tracking of audit trails,
    checksums for validating integrity
  • Support all sizes of collections - thousands to
    hundreds of millions of records

18
Data Grid Federation - zoneSRB
Application
HTTP, DSpace, OpenDAP
OAI, WSDL, WSRF
DLL / Python, Perl
Linux I/O
Java, NT Browsers
Federation Management
Consistency Metadata Management /
Authorization, Authentication, Audit
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Repository Virtualization
Catalog Abstraction
Databases: DB2, Oracle, Sybase, Postgres,
MySQL, Informix
ORB
19
For More Information
  • Reagan W. Moore
  • San Diego Supercomputer Center
  • moore@sdsc.edu
  • http://www.npaci.edu/DICE
  • http://www.npaci.edu/DICE/SRB
  • http://www.npaci.edu/dice/srb/mySRB/mySRB.html

20
PRDLA Collection at SDSC (2003)
  • Collection size
      • 800 Gbytes
      • 14 million files
  • Server capacity
      • Windows NT with 2 Tbytes disk
      • AIT2 tape library for backup, 1 Tbyte of tape
      • 3 web servers
  • Access rate
      • Average 1 million web page accesses per month
      • Does not count Siku server
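
Two derived figures help put these numbers in context: the average file size and the average daily access rate. The sketch assumes 1 Gbyte means 10**9 bytes and a 30-day month, neither of which is stated on the slide.

```python
COLLECTION_BYTES = 800 * 10**9   # 800 Gbytes, taking 1 Gbyte = 10**9 bytes (assumption)
FILE_COUNT = 14 * 10**6          # 14 million files
ACCESSES_PER_MONTH = 10**6       # 1 million web page accesses per month
DAYS_PER_MONTH = 30              # rounding assumption

avg_file_kb = COLLECTION_BYTES / FILE_COUNT / 1000
accesses_per_day = ACCESSES_PER_MONTH / DAYS_PER_MONTH

print(round(avg_file_kb, 1))     # 57.1
print(round(accesses_per_day))   # 33333
```

The collection therefore consists of many small files, about 57 kilobytes each on average, served at roughly 33,000 page accesses per day.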

21
Data Grid Opportunities
  • Provide uniform interface to data collections
    that reside at member sites
      • Provides a way to extend PRDLA published
        holdings by incorporating new material
  • Replicate collections between sites
      • Provides a way to protect against natural
        disasters
  • Integrate file access with archive access
      • Provides a way to preserve collections

22
Data Grids
  • Software systems that manage distributed data
  • Organize distributed data into a logical
    collection
  • Provide global naming conventions
  • Location independent identifiers
  • Support curation processes
  • Access controls for adding files
  • Browsing and discovery services

23
Accessing Data at Multiple Sites
Each site has its own naming convention
for files
User Application
A data grid provides a uniform way to name and
access the files across the sites
Archive at SDSC
File System in Australia
File System in Taiwan
24
Building Distributed Collection
  • Logical name space
  • Location independent identifier
  • Persistent identifier
  • Collection owned data
  • Access controls
  • Audit trails
  • Checksums
  • Descriptive metadata
  • Inter-realm authentication
  • Single sign-on system
User Application
Data Grid: common naming convention and set of
attributes for describing digital entities
Archive at SDSC
File System in Australia
File System in Taiwan
25
Collection Metadata Catalog
Logical file name space
  • Associate metadata attributes with the logical
    file name
      • Physical location of the file
      • Name of the file on the storage system
      • Size of the file
      • Owner of the file
      • Access controls on the file
  • Associate digital library attributes with the
    logical file name
      • Descriptive metadata about the file
      • Dublin Core provenance information about
        the file
      • Annotations on the file
26
Storage Systems Provide
  • File name - naming convention for files
  • Storage location - IP address of the storage
    system
  • User name - persons who have access to the
    storage system
  • File context (creation date, ...) - state
    information about each file
  • Access constraints - controls on access

Each storage repository uses a different set of
naming conventions
27
Managing Distributed Data (Replace naming
conventions used by a storage repository with
naming conventions managed by the data grid)
  • Storage Repository
      • File name
      • Storage location
      • User name
      • File context (creation date, ...)
      • Access constraints
  • Data Grid
      • Logical file name space
      • Logical resource name space
      • Logical user name space
      • Logical metadata context
      • Control/consistency constraints

28
Accessing Multiple Types of Storage Systems
User Application
Archive at SDSC
Database in Australia
File System in Taiwan
29
Standard Data Access Operations
  • Remote operations
  • Unix file system
  • Latency management
  • Procedures
  • Transformations
  • Third party transfer
  • Filtering
  • Queries
User Application
Common set of operations for interacting with
every type of storage repository
Archive at SDSC
Database in Australia
File System in Taiwan
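
A common set of operations across storage types is often expressed as an abstract interface with one driver per back end. The sketch below is an illustration under that assumption. StorageDriver, FileSystemDriver, InMemoryDriver, and copy are hypothetical names, not SRB classes, and the in-memory driver merely stands in for a database or archive.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical common interface; one subclass per type of storage repository."""

    @abstractmethod
    def read(self, name: str) -> bytes: ...

    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...


class FileSystemDriver(StorageDriver):
    """Stores each object as a file under a root directory."""

    def __init__(self, root):
        self.root = root

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as fh:
            return fh.read()

    def write(self, name, data):
        with open(os.path.join(self.root, name), "wb") as fh:
            fh.write(data)


class InMemoryDriver(StorageDriver):
    """Stands in for a database or archive back end in this sketch."""

    def __init__(self):
        self.blobs = {}

    def read(self, name):
        return self.blobs[name]

    def write(self, name, data):
        self.blobs[name] = data


def copy(src: StorageDriver, dst: StorageDriver, name: str) -> None:
    """The caller uses the same two operations whatever the back end is."""
    dst.write(name, src.read(name))


with tempfile.TemporaryDirectory() as tmp:
    fs, mem = FileSystemDriver(tmp), InMemoryDriver()
    fs.write("report.txt", b"grid data")
    copy(fs, mem, "report.txt")
    copied = mem.read("report.txt")
print(copied)
```

Adding a new type of storage system then means writing one more driver, while user applications stay unchanged.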
30
Data Grid Applications
  • Data grid for managing distributed data
      • Latency management for bulk analyses of
        collections
      • Infrastructure independent name spaces for
        describing data, resources, users, and state
        information
  • Digital library for managing data context
      • Curation services for managing collections
      • Descriptive metadata
  • Persistent archive to manage technology evolution
      • Interoperability mechanisms between heterogeneous
        storage systems and user access mechanisms

31
Provide uniform interface to data collections
that reside at member sites
  • Install a Storage Resource Broker
    application-level server on each storage system
    that holds data
  • Register the data into the PRDLA data grid
      • Establishes a logical file name for each file
  • Create a collection hierarchy to support browsing
    and discovery
  • Register PRDLA metadata for each file
      • The SRB data grid manages the metadata and
        automatically updates information on the
        location of the file
  • Provide web-based access to the collections
      • Other access mechanisms support bulk load
        operations

32
Replicate collections between sites
  • Use data grid commands to replicate a collection
    onto a remote storage system
      • Information about the replicated files is kept in
        the metadata catalog
  • Provides a way to support load balancing
      • Sites access data that is closer to them
  • Provides a way to protect against a local natural
    disaster
      • Files can be retrieved from the remote site
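
Replica selection for load balancing and disaster recovery can be sketched as choosing the closest available copy. The site names, the distance table, and the choose_replica function are hypothetical, and a real data grid would take this information from its metadata catalog.

```python
# logical name -> list of (site, available) replicas; all names are hypothetical
REPLICAS = {
    "/prdla/photo.jpg": [("sdsc", True), ("taiwan", True)],
}
# assumed network "distance" from each client site to each storage site
DISTANCE = {
    ("australia", "sdsc"): 3, ("australia", "taiwan"): 1,
    ("ucsd", "sdsc"): 0, ("ucsd", "taiwan"): 4,
}


def choose_replica(logical_name, client_site):
    """Pick the closest available replica, falling back to more distant ones."""
    candidates = [site for site, up in REPLICAS[logical_name] if up]
    if not candidates:
        raise LookupError(f"no available replica for {logical_name}")
    return min(candidates, key=lambda site: DISTANCE[(client_site, site)])


nearest = choose_replica("/prdla/photo.jpg", "ucsd")
# Simulate a local disaster at one site: the remote copy is used instead.
REPLICAS["/prdla/photo.jpg"][0] = ("sdsc", False)
fallback = choose_replica("/prdla/photo.jpg", "ucsd")
print(nearest, fallback)
```

The same lookup covers both benefits: each site normally reads the nearby copy, and when that copy is unavailable the request falls through to the remote one.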

33
Integrate file access with archive access
  • Can also replicate metadata catalog between sites
      • Provides a way to manage long-term preservation,
        a deep archive
  • Data grid provides synchronization mechanisms to
    update the metadata catalog
      • Can control execution of the synchronization
        mechanisms
  • Data grid provides file validation mechanisms to
    verify file integrity (checksums)
      • Can verify a local copy against the checksums
        stored in the metadata catalog
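
Verifying a local copy against a stored checksum can be sketched with Python's standard hashlib module. The slides do not say which checksum algorithm SRB uses, so SHA-256 here is an assumption, and the recorded value stands in for what a metadata catalog would hold.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, catalog_checksum):
    """True if the local copy matches the checksum recorded in the catalog."""
    return file_checksum(path) == catalog_checksum


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "record.dat")
    with open(path, "wb") as fh:
        fh.write(b"preserved bytes")
    recorded = file_checksum(path)   # value a catalog would store at registration
    ok_before = verify(path, recorded)
    with open(path, "ab") as fh:
        fh.write(b"!")               # simulate corruption of the local copy
    ok_after = verify(path, recorded)
print(ok_before, ok_after)
```

A failed check would signal that the local copy should be replaced from another replica.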

34
PRDLA Data Grid
  • Propose the formation of a data grid linking
    PRDLA sites
  • Support data sharing
  • Support integration of digital libraries
  • Support preservation environments
  • Storage Resource Broker data grid is in
    production use in international projects

35
Data Grid Installations
  • Australia - University of Queensland, APAC
  • Japan - KEK (Tsukuba)
  • Korea - KISTI, Korea Institute of Science and
    Technology Information
  • Singapore - National University of Singapore
  • Taiwan - National Taiwan University
  • University of California - California Digital
    Library, UCSD