Title: An Introduction to Data Grid Management Systems (DGMS)
1An Introduction to Data Grid Management Systems
(DGMS)
- Arun Jagatheesan
- San Diego Supercomputer Center
- University of California, San Diego
Tutorial at the 13th International Conference on
Management of Data (COMAD 2006) IIT-Delhi,
India December 14 - 16, 2006
2Dynamic Calibration of Content
- Academic researchers and students
- Project managers, analysts, office of the CTO
- Software architects, developers
- Savvy users ("I just want to use this hands-on")
There are around 146 slides. We don't plan to go
through each of them in 2 hours. We need to make
sure all of us get the most out of this tutorial
by the time we leave this room.
3Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
4We will also participate in this tutorial on DGMS
5Tutorial Outline
- Introduction to Data Grids
- Who are we? What we do? What problem we faced?
- Data Grid Design Philosophies
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
6SDSC The Pride of NSF
- SDSC: San Diego Supercomputer Center. Located at and run by the University of California, San Diego
- Leader in high-end computing with a focus on data management (academic data center for the US)
- Founded in 1985 by the National Science Foundation (NSF)
- 170 million
- 400 researchers, high-performance hardware and software for the academic greater good in the US
7Our Customer Science Research
8Some SDSC Production Machines (2005-6)
TeraGrid Linux Cluster Intel IA-64 4.4 TFlops
Intimidata IBM Blue Gene 2.8/5.7 TFlops
DataStar IBM Power4 10.4 TFlops
Storage Area Network Disk
Sun F15K Disk Server
Archive Systems
600 500 TB
6 PB
9- Volume Visualization of the Orion Nebula
- The San Diego Supercomputer Center and the
American Museum of Natural History Hayden
Planetarium
10Visualizations
11TeraGrid
12TeraGrid Resource Partners
14Large data volume Large distribution
So, you just got tons of data. So what is the
problem?
Well, it's more than super data. Data is
distributed. The requirements were more
challenging, and it got interesting.
15NIH BIRN Data Grid
16BIRN Inter-organizational Data
17BBSRC Agrenet Archive Service Architecture
[Diagram: BBSRC SRB archive process data path. Labels: Central Cache Site, RAL, Site WAN, Firewall, JANET WAN, ads0sb01.cc.rl.ac.uk, Central SRB Server, SRB-ADS Server, ADS Tape Resource, ADS SRB Disk Cache Resource, Central cache Vault, Tape Traffic, Sreplcont. The numbers 1 to 4 mark the steps below.]
Step 1
- Archive Submission Interface
- Data ingestion of collection hierarchy into SRB
- Uses the Jargon Java API interface (equivalent of Sput -b)
- Ingested to /bbsrc/institute/scratch/project/year/user/dateandtime
- At end of ingestion, data logically moved using Smv to /bbsrc/institute/local-archive/project/year/user/dateandtime
Step 2
- Scheduled transfer to Central SRB Server (driven from Central SRB Server)
- Smkcont command used to create container on central SRB Server
- Data moved from Site SRB to container on central SRB Server using Sphymove
- Upon data transfer completion, archived data is logically moved with Smv to /bbsrc/institute/remote-archive/project/year/user/dateandtime
Step 3
- Scheduled transfer to ADS resource
- Implemented via cron job using the Sreplcont command, which is driven by the central SRB Server
- Entire container replicated using the Sreplcont command
- Logical structure preserved as /bbsrc/institute/remote-archive/project/year/user/dateandtime
Step 4
- Synchronization of container to tape resource and removal of original container from Central SRB Server
- Ssyncont -d -a command used, allowing for a family of containers
18KEK federated zones
Sites: KEK (JP), IHEP (CN), Krakow (PL), SDSC (US), KNU (KR), ASCC (TW), ANU (AU)
Destination (from KEK)   RTT (msec)   Nominal BW (Mbps)
IHEP                     502          10
Krakow                   327          100
ANU                      292          622
ASCC                     33           100
KNU                      23.6         100
The latency was measured in December 2004.
19LSST in Media
130 Petabytes of image data
20Problem Statement - Pattern
- Large scale Collaboration
- Large scale data distribution
- Large scale data organization and discovery of
unstructured data resources
21Problem solved / Requirements 1
- Large scale Collaboration
- Multiple autonomous organizations (and/or)
- Multiple user communities sharing large data storage
- Collaborative logical namespace
- Avoid multiple mount points as they restrict scalability of the collaboration
- Coordinated data sharing at any granular level (data, metadata, annotations, ...)
22Problem solved / Requirements 2
- Large scale data distribution
- Number of files (and/or)
- Number of distributed storage resources
- Multi-site Data Distribution
- Concept of replica, copy, dirty-copy
- Multi-site replicas reduce access times
- Replicas have the same logical name everywhere in the enterprise (big plus for users)
- Replicas controlled by user, admin, system-enabled (automated or policy based)
- Reduce WAN latency (chattiness)
23Problem solved / Requirements 3
- Large scale data organization and discovery
- Unstructured data or files
- Real-time data and more
- Data Classification and Discovery
- Tag data with any arbitrary metadata schema
- Data is organized based on user-defined attributes
- Multiple teams can have different metadata attributes
- Query, discover and access data without knowing path or protocol to be used
24Our Solution DGMS
- Large-scale logical file system
- File System
- Database System
- Grid Computing
- Data Grid Management System (DGMS)
- Core Concepts
- Logical shared collections
- Logical shared resources
- Collaborative communities
25Is there a simple explanation of what is Grid
Computing?
26The Data Grid Vision
Data Grid Collaborative logical namespace of
data and storage
Multiple domains, multiple resource
types, Autonomous control
27It's the data.
- Resources shared (coordinated sharing)
- Storage resources
- Labor support
- Bandwidth
- But most importantly
- Data, Metadata, data organization or namespace
- Having people on the same page
- Coordinated sharing of knowledge/information
became a vital link for each organization or
project
28It's not just academia
- Rebels and misfits of (existing) technology
- Big time academic projects, Large enterprises
- Remember databases and DBMS
- Enterprises need it too
- Fortune 500, Forbes Global 2000
- Distributed global teams requiring collaborations
- Distribution of unstructured data and storage
- Unstructured data and distribution are seen in multiple vertical domains: chip design, automobile or aircraft engineering, patient records, ...
29(Again) Our Solution DGMS
- Large-scale logical file system
- File System
- Database System
- Grid Computing
- Data Grid Management System (DGMS)
- Core Concepts
- Logical shared collections
- Logical shared resources
- Collaborative communities
Basic unit of data management in a DGMS
30Tutorial Outline
- Introduction to Data Grids
- Who are we? What we do? What problem we faced?
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
31Using a Data Grid in Abstract
Data Grid
/home/arun.sdsc/exp1
/home/arun.sdsc/exp1/text1.txt
/home/arun.sdsc/exp1/text2.txt
/home/arun.sdsc/exp1/text3.txt
data storage (100)
32Real World Physical Heterogeneities
- Multiple autonomous administrative domains
- Distributed data (replicas) in different domains
- Heterogeneous storage resources and systems
- Distributed users and authentication mechanisms
33Is it possible?
- A global collaborative namespace to organize unstructured data
- Discover data without even knowing the name or location or protocol to be used?
- Scale the solution for millions of files and petabytes of data
34The answer is out there
- This is actually a namespace problem (apart from the physics of moving data faster)
- Database basics
- Isolate logical schema (the namespace presented to the user) from physical organization (distribution) of data
- The users will see the same data namespace irrespective of where the data is physically located or how it is stored.
- The data can be freely moved or replicated around without affecting the user's view
35Logical data namespace
- A logical data namespace is a mapping from the real physical data namespace to a logical namespace
- The logical namespace hides the physical organization (physical distribution) of data.
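The mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SRB code: the class name `LogicalNamespace`, its methods, and the physical location strings are all hypothetical. The point it shows is that migrating the bytes changes the catalog entry but not the names the user sees.

```python
class LogicalNamespace:
    """Toy catalog mapping logical paths to physical replica locations."""

    def __init__(self):
        # logical path -> list of physical locations (hypothetical strings)
        self._catalog = {}

    def register(self, logical_path, physical_location):
        """Record that a logical name has a copy at a physical location."""
        self._catalog.setdefault(logical_path, []).append(physical_location)

    def migrate(self, logical_path, old_location, new_location):
        """Move one physical copy; the logical name is untouched."""
        replicas = self._catalog[logical_path]
        replicas[replicas.index(old_location)] = new_location

    def list_names(self):
        """What the user browses: logical names only."""
        return sorted(self._catalog)

    def locate(self, logical_path):
        """What the system resolves internally: physical locations."""
        return list(self._catalog[logical_path])


ns = LogicalNamespace()
ns.register("/home/arun.sdsc/exp1/text1.txt", "sdsc-disk:/vault/0001")
ns.register("/home/arun.sdsc/exp1/text2.txt", "sdsc-disk:/vault/0002")

names_before = ns.list_names()
ns.migrate("/home/arun.sdsc/exp1/text1.txt",
           "sdsc-disk:/vault/0001", "delhi-tape:/archive/77")
names_after = ns.list_names()  # same listing, although the data moved
```

The user-facing listing is identical before and after the migration; only `locate` reveals the change.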
36Mapping physical data to logical view
Hierarchical view, independent of network, disk,
sector, track, fragments
Rule Storage Abstraction Hide storage
resources
37Mapping physical data to logical view
Relational view, independent of network, disk,
sector, track, fragments where each field
(cell) is stored
38Mapping distributed data storage to logical view
39Mapping distributed data storage to logical view
25 Universities or Research Hospitals, Multiple
heterogeneous storage resources
40Is logical data namespace sufficient?
- Logical namespace hides all the heterogeneities. All data seems to be on a single resource.
- Users are shielded from the complexities of distributed data management
- However, it restricts users from doing distributed data management operations
- How do we handle replication or migration of data between storage resources? (There is no concept of resource or storage space in the logical data namespace)
41Possible solution (1)
- Within the logical data namespace, we need a way to indicate the presence of distributed resources (from different organizations)
- Why not add an attribute to every data item to specify where the data is physically present?
- Users still see the same logical data namespace. But in addition, they can also see the distribution of data
42Logical namespace
43Logical namespace location
Location
Add location info??
//121121/a/file1.txt
//sandiego/a/file1.txt
//India123/a/file1.txt
44Possible Solution (2)
- But providing direct physical information in the logical data namespace might not be a good solution. All advantages due to infrastructure-independence will be lost.
- A system administrator's nightmare. Imagine the Internet without www and only physical IP addresses.
- Also, users need a way to specify resources (example: which resource to replicate data to)
45Logical namespace location
Logical Location
Add logical location??
Default resource
San diego sdsc-disk
India-delhi-tape
46Possible Solution (3)
- Houston, we have a solution!
- We will create a separate logical resource namespace of all data resources in the data grid
- The well-known logical data namespace can be joined with this logical resource namespace to create another logical view: the data grid namespace.
47The solution (or a part of the solution)
Logical Resource Namespace (Each logical
resource is a combination of one or more
physical resources)
Traditional Logical Namespace
Data Grid Namespace
48Data Grid Namespace
Logical Resources
Multiple Replicas
Users from different organizations
49A hierarchical logical resource space
Any Possible Resource (ANY)
  Asia Resources (2 of N)
    Asia-disk-1
    Asia-disk-2
    Asia-disk-3
  Europe Resources (ANY)
    Europe-Unix
    Europe-Linux-Cluster
  US Resources (ANY)
    US-disk
  Client Resources (ANY)
    client-cifs
  Critical Replicated Data (ALL)
    Asia-disk-1
    Europe-Unix
    US-disk
This sounds like storage and not DBMS to me.
Well DGMS involves file systems (storage), DBMS
and grid concepts
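One way to read the ANY / ALL / "2 of N" labels on this slide is as a placement policy: when data is written to a logical resource, the policy decides which physical resources receive a copy. The Python sketch below works under that assumption. The class `LogicalResource`, the policy names, and the `resolve` method are hypothetical and are not taken from SRB; how the groups nest is also only partly recoverable from the slide, so the critical group is kept separate here.

```python
import random


class LogicalResource:
    """A named group of physical resources (strings) or nested groups."""

    POLICIES = ("ANY", "ALL", "K_OF_N")

    def __init__(self, name, members, policy, k=None):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        if policy == "K_OF_N" and (k is None or not 0 < k <= len(members)):
            raise ValueError("K_OF_N needs 0 < k <= number of members")
        self.name = name
        self.members = list(members)
        self.policy = policy
        self.k = k

    def resolve(self, rng=random):
        """Return the physical resources a write to this group lands on."""
        if self.policy == "ALL":
            chosen = self.members
        elif self.policy == "ANY":
            chosen = [rng.choice(self.members)]
        else:  # K_OF_N: pick k distinct members
            chosen = rng.sample(self.members, self.k)
        targets = []
        for member in chosen:
            if isinstance(member, LogicalResource):
                resolved = member.resolve(rng)
            else:
                resolved = [member]
            for target in resolved:
                if target not in targets:  # drop duplicates, keep order
                    targets.append(target)
        return targets


asia = LogicalResource("Asia Resources",
                       ["Asia-disk-1", "Asia-disk-2", "Asia-disk-3"],
                       "K_OF_N", k=2)
europe = LogicalResource("Europe Resources",
                         ["Europe-Unix", "Europe-Linux-Cluster"], "ANY")
us = LogicalResource("US Resources", ["US-disk"], "ANY")
critical = LogicalResource("Critical Replicated Data",
                           ["Asia-disk-1", "Europe-Unix", "US-disk"], "ALL")
anywhere = LogicalResource("Any Possible Resource", [asia, europe, us], "ANY")
```

Writing to `critical` always yields all three physical resources, while writing to `asia` yields two of its three disks.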
50More with Data Grid Namespace
Logical Resources
Multiple Replicas
User defined meta data (schema) for data discovery
51So the data grid is a simple GUI? I can click on
it. Why a tutorial?
Bart, that GUI was just used to explain the
concepts. It can be used without a GUI too. I
wonder what the generic concepts are and how they
can be used.
52More with Data Grid Namespace
Logical Name
Physical details
Logical resource
Metadata
Unstructured becomes structured?
53Hierarchical view by relational schema
Logical Name   Physical details   Logical resource   Metadata
/usr/a/1.txt   //vault/1q.txt     Fast-disk          Color=blue, Galaxy=r2d2
                                  Wan-tape           Color=red, Galaxy=Yoda
Magic Recipe
Adding the bells and whistles
54Logical collections by relational schema
Logical Name   Physical details   Logical resource   Metadata
/usr/a/1.txt   //vault/1q.txt     Fast-disk          Color=blue, Galaxy=r2d2
                                  Wan-tape           Color=red, Galaxy=Yoda
                                  SATA               Color=red, Galaxy=Yoda
                                  SATA               Music=ARR, Artist=1
                                  SATA               Music=RD, Artist=Lata
55Logical collections by relational schema
Logical Name   Physical details   Logical resource   Metadata
/usr/a/1.txt   //vault/1q.txt     Fast-disk          Color=blue, Galaxy=r2d2
                                  Wan-tape           Color=red, Galaxy=Yoda
                                  SATA               Color=red, Galaxy=Yoda
                                  SATA               Music=ARR, Artist=1
                                  SATA               Music=RD, Artist=LataJi
StarCollection(L_Name, PhyDetails, LogRes, Color, Galaxy)
MusicFileCollection(L_Name, PhyDetails, LogRes, Music, Artist)
56Collection as a relation
- StarCollection(L_Name, PhyDetails, LogRes, Color, Galaxy)
- MusicFileCollection(L_Name, PhyDetails, LogRes, Music, Artist)
- Employee(Emp, Emp_Name, Dept, ...)
57Collection as a relation
Data (files) are not part of the relationship, only metadata:
- StarCollection(L_Name, PhyDetails, LogRes, Color, Galaxy)
- MusicFileCollection(L_Name, PhyDetails, LogRes, Music, Artist)
Data forms the relationship:
- Employee(Emp, Emp_Name, Dept, ...)
58Every thing is logical !
Hierarchical Logical namespace or logical view
powered by a relational or collection-oriented
schema
59Logical view to relational
- Logical View (User)
- ls .jpg ( dir .jpg)
- ls .jpg -q (galaxyyoda)
- Inside DB (Internal)
- Select file-name
- Select file-name where
- yoda and
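To make the user-view versus inside-DB contrast concrete, here is a small sketch using Python's built-in `sqlite3` module. It assumes the StarCollection relation from the earlier slides is stored as one table; the table name, the sample rows, and the `ls` helper are hypothetical and stand in for whatever the real metadata catalog does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE star_collection ("
    "l_name TEXT, phy_details TEXT, log_res TEXT, color TEXT, galaxy TEXT)"
)
# Hypothetical sample rows in the spirit of the slides.
conn.executemany(
    "INSERT INTO star_collection VALUES (?, ?, ?, ?, ?)",
    [
        ("/usr/a/1.jpg", "//vault/1q.jpg", "Fast-disk", "blue", "r2d2"),
        ("/usr/a/2.jpg", "//tape/77.jpg", "Wan-tape", "red", "Yoda"),
        ("/usr/a/notes.txt", "//vault/9z.txt", "SATA", "red", "Yoda"),
    ],
)

QUERYABLE = {"color", "galaxy"}  # metadata attributes users may filter on


def ls(pattern, **metadata):
    """Logical 'ls': a name pattern plus optional metadata conditions."""
    sql = "SELECT l_name FROM star_collection WHERE l_name GLOB ?"
    params = [pattern]
    for attr, value in metadata.items():
        if attr not in QUERYABLE:
            raise ValueError(f"unknown metadata attribute: {attr}")
        sql += f" AND {attr} = ?"
        params.append(value)
    sql += " ORDER BY l_name"
    return [row[0] for row in conn.execute(sql, params)]


all_jpgs = ls("*.jpg")                  # plain listing by name
yoda_jpgs = ls("*.jpg", galaxy="Yoda")  # listing narrowed by metadata
```

The user never types SQL or a physical path; both calls return logical names only.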
60Theory discussed so far
- Data grid Namespace with logical resources
- Well-known magic recipe to distribute data
resources - Resources part of the logical namespace
- Collections as a relationship to organize data
- Relationship of data based on meta-data
- Elaborating on top of these concepts
- Collections as a data management entity
- But before doing that a warning from the speaker
61Warning What they might say
DGMS is just another application running on top
of database (this solves a problem for some
high-end users). Advice Just stay away from
this guy (Arun) and dont fall for DGMS.
DB-Guru
62Collections for data management
- Data grid collection: similar to common usage of the word "collection"
- The basic unit of data management in DGMS
- Mathematically a triple (L, m, P)
- L is the set of logical identifier strings (logical namespace)
- P is the set of physical locations of data (physical details)
- m: L → P is a non-injective and surjective function
- In traditional file systems
- m: L → P is an injective function (1-to-1 mapping, no replicas)
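The contrast between a data grid collection and a traditional file system can be checked mechanically. The sketch below makes one modeling assumption that goes beyond the slide: because a replicated logical name corresponds to several physical locations, the mapping m is represented as a set of (logical name, physical location) pairs rather than as a strict function. All names and locations are made up for illustration.

```python
def locations_by_name(m):
    """Group the pairs in m by logical name."""
    grouped = {}
    for logical_name, physical_location in m:
        grouped.setdefault(logical_name, set()).add(physical_location)
    return grouped


def has_replicas(m):
    """True if some logical name has more than one physical location."""
    return any(len(locs) > 1 for locs in locations_by_name(m).values())


def covers(m, P):
    """True if every physical location in P is reachable from some name."""
    return {p for _, p in m} == set(P)


# Traditional file system: one location per name.
fs_P = {"disk0:/blk/17", "disk0:/blk/42"}
fs_m = {("/a/1.txt", "disk0:/blk/17"), ("/a/2.txt", "disk0:/blk/42")}

# Data grid collection: /a/1.txt has two replicas.
grid_P = {"sdsc-disk:/v/1", "delhi-tape:/v/1", "sdsc-disk:/v/2"}
grid_m = {
    ("/a/1.txt", "sdsc-disk:/v/1"),
    ("/a/1.txt", "delhi-tape:/v/1"),
    ("/a/2.txt", "sdsc-disk:/v/2"),
}
```

Both mappings cover their physical locations; only the data grid one has replicas.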
63Just for fun comparison
- DBMS
- Relation (Table)
- Relationship on data. Physical data involved
- Well defined schema on physical bytes
- Tablespace
- Very well understood concepts
- Triggers, vendors
- DGMS
- Collection (directory-like)
- Relationship on meta-data or operations
- No schema on bytes or content
- Vaultspace
- The users say it works who wants theory?
- Work in progress
64Tutorial Outline
- Introduction to Data Grids
- Who are we? What we do? What problem we faced?
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
65Break? or Continue?
66Data Grid and DGMS
- Data Grid: A logical collaborative namespace that enables coordinated sharing of distributed storage resources and data based on local and global policies across administrative domains.
- DGMS: The middleware (software) infrastructure that enables the creation and management of data grids
67Data Grid Collection
- Very similar to a directory (based on set theory)
- Files or digital entities from different storage or data management systems are represented with logical names
- A user-defined or flexible metadata schema describes the data (relationship to collection)
- Query and discover data based on attributes
- Browse the collection as a hierarchical namespace
- Collections can have some predefined storage resources
- Other higher level concepts include views, triggers or rules
68Tutorial Outline
- Introduction to Data Grids
- Who are we? What we do? What problem we faced?
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
69Heterogeneous Resources
70Transparencies/Virtualizations (bits, data, information, ..)
Storage Resource Transparency: Have a driver for
each new storage resource type that can perform
the same pre-defined set of functions like
read/write bytes (on archive resource or disk
or database ..)
Storage Resource Transparency
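The driver idea can be sketched as a small interface with one implementation per storage type. This is a toy illustration: the class names, the two-operation interface, and the in-memory "disk" and "archive" back ends are assumptions made for the example and are not the actual SRB driver interface.

```python
import zlib
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """The same pre-defined operations for every storage resource type."""

    @abstractmethod
    def write(self, local_id: str, data: bytes) -> None:
        """Store bytes under the resource's own local identifier."""

    @abstractmethod
    def read(self, local_id: str) -> bytes:
        """Return the bytes stored under a local identifier."""


class DiskDriver(StorageDriver):
    """Stand-in for a file system: stores bytes as they are."""

    def __init__(self):
        self._blocks = {}

    def write(self, local_id, data):
        self._blocks[local_id] = bytes(data)

    def read(self, local_id):
        return self._blocks[local_id]


class ArchiveDriver(StorageDriver):
    """Stand-in for an archive: compresses on the way in."""

    def __init__(self):
        self._tape = {}

    def write(self, local_id, data):
        self._tape[local_id] = zlib.compress(data)

    def read(self, local_id):
        return zlib.decompress(self._tape[local_id])


def copy_object(src, src_id, dst, dst_id):
    """Replicate between any two resources using only the shared interface."""
    dst.write(dst_id, src.read(src_id))


disk = DiskDriver()
archive = ArchiveDriver()
disk.write("/usr/homes/arun/1.txt", b"hello data grid")
copy_object(disk, "/usr/homes/arun/1.txt", archive, "vol7/obj0001")
```

`copy_object` never needs to know which kind of resource sits on either side, which is the point of the transparency layer.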
71Data Identifiers for different systems
/usr/homes/arun/1.txt
C:\docs\arun\2.txt
Select name from docs where user = 'arun' and
type = 'text'
72Transparencies/Virtualizations (bits, data, information, ..)
Data Identifier Transparency: Each underlying
data storage system has its own convention of
local naming. The driver for each resource type
understands the local naming and function calls
like read/write bytes (on archive resource or
disk or database ..)
Data Identifier Transparency
Storage Resource Transparency
73Distributed storage resources - latency
74Transparencies/Virtualizations (bits, data, information, ..)
Network speed cannot be changed by the
software. In DGMS, reduced WAN messages and bulk
operations are used. DGMS has to make the best
use of bandwidth and has to create automated
replicas on demand. Another follow-up hardware
technology is WAFS.
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
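A back-of-the-envelope model shows why fewer WAN messages matter. The sketch below uses the KEK to IHEP figures from the federated zones slide (502 ms round trip, 10 Mbps nominal bandwidth). The cost model itself is a deliberate simplification and an assumption: one round trip per request, no protocol overhead, and the full nominal bandwidth available.

```python
def transfer_time_s(n_files, file_bytes, rtt_ms, bandwidth_mbps, bulk=False):
    """Estimated seconds to move n_files over a WAN link.

    Per-file mode pays one round trip per file; bulk mode pays one
    round trip for the whole batch. Payload time is the same in both.
    """
    requests = 1 if bulk else n_files
    latency_s = requests * rtt_ms / 1000.0
    payload_s = n_files * file_bytes * 8 / (bandwidth_mbps * 1e6)
    return latency_s + payload_s


# 1000 files of 10,000 bytes each from KEK to IHEP.
per_file = transfer_time_s(1000, 10_000, rtt_ms=502.0, bandwidth_mbps=10.0)
bulk = transfer_time_s(1000, 10_000, rtt_ms=502.0, bandwidth_mbps=10.0,
                       bulk=True)
```

Under these assumptions the per-file transfer takes about 510 seconds, of which 502 seconds is pure round-trip waiting, while the bulk transfer takes about 8.5 seconds.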
75Replicas
76Transparencies/Virtualizations (bits, data, information, ..)
Replica selection based on replica administrative
location, client location, replica storage
resource type, network latency, etc.
Data Replica Transparency
image_0.jpgimage_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
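Replica selection along these lines can be sketched as a cost function over the candidate replicas. Everything specific here is an assumption made for illustration: the `Replica` fields, the fixed tape penalty, and the tie-break that prefers the client's own administrative domain. The round-trip times are borrowed from the KEK latency table earlier in the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    domain: str          # administrative domain that owns the copy
    resource_type: str   # "disk" or "tape"
    rtt_ms: float        # measured round-trip time from the client


# Assumed extra delay per resource type (tape must be staged first).
TYPE_PENALTY_MS = {"disk": 0.0, "tape": 5000.0}


def select_replica(replicas, client_domain):
    """Pick the cheapest replica; prefer the client's own domain on ties."""

    def cost(replica):
        delay = replica.rtt_ms + TYPE_PENALTY_MS[replica.resource_type]
        foreign = 0 if replica.domain == client_domain else 1
        return (delay, foreign)

    return min(replicas, key=cost)


candidates = [
    Replica("KNU", "kr", "tape", 23.6),
    Replica("ASCC", "tw", "disk", 33.0),
    Replica("IHEP", "cn", "disk", 502.0),
]
best = select_replica(candidates, client_domain="jp")
```

Here the nearest replica is on tape, so the slightly more distant disk copy at ASCC wins.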
77Transparencies/Virtualizations (bits, data, information, ..)
Inter-organizational Information Storage
Management
Semantic data Organization (with behavior)
Virtual Data Transparency
Data Replica Transparency
image_0.jpgimage_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
78Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- Where and how this solution can be used
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
79What type of data?
- Unstructured data sets (File-like)
- Images
- Movies
- E-Mail
- Why file-like? Each digital entity can have its own metadata and can be part of multiple collections with many replicas. (This is not the case in traditional file systems.)
- Also for ..
- Data Streams
- Semi-structured data
80Why they use Data Grids?
- Inter/Intra Organizational Sharing
- Inter/Intra Organizational Data Storage Utility
- Data Storage Resource Plug-n-play provisioning
- Data Preservation (Technology Migration)
- Information Lifecycle Management (ILM)
- Collaborative data lifecycle management
81Inter/intra Organizational Sharing (1)
Research Lab 1 You can use our storage for this
project
Research Lab 2 We have relevant data for this
project
GRP
82Inter/intra Organizational Sharing (2)
- Example
- LSST Project: LSST, NCSA, SDSC - 100 PB data
- Sharing of resources
- Autonomous domains: either the same (intra) or different (inter) organizations
- Shared resources
- Data, storage, IT staff
- Logical namespace for collaboration
- Shared data and physical resources available in the logical namespace for usage
83Inter/intra Organizational Utility (1)
Data Center
West Coast Offices
GRP
East Coast Office
84Inter/intra Organizational Utility (2)
That was so easy in slide show ?
Data Center
West Coast Offices
GRP
East Coast Office
85Inter/intra Organizational Utility (3)
- Example NIH BIRN Project
- Connecting the research hospitals/universities in the US
- Create a logical data storage utility
- Virtualization of enterprise resources (data and storage)
- Plug-n-play of resources on data grid
- Collaboration by sharing both data and storage
86Data Preservation (Technology Migration)
- Example
- National Archives Prototype (IDEA Award 2006)
- Library of Congress
- Facilitate Technology Migration
- Flexible data architecture for technology evolution
- Hardware changes, software changes
- The application and users are not aware of any change
- Significant savings by avoiding cost of migration
- Reduction of downtime, operational cost of migration
- Create a replicated resource of all or selected data
87Data Grid technologies are used for many
data-intensive environments including Distributed
Digital Libraries, Persistent Archives and
Dataflow Systems. The core concept is the same
for all these requirements.
88Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- Analogy to database concepts
- Where and how this solution can be used
- Solutions and Products
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
89Similar Solutions and Products
- HPSS (used IBM DB2): but only for archival
- Oracle Collaboration Suite: only single site, very limited
- Sybase Avaki: treats everything as object
- Globus Data Grid
- Nirvana SRB
- AFS: only global namespace
90Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- Analogy to database concepts
- Where and how this solution can be used
- Solutions and Products
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
91Storage Resource Broker
- Distributed data management technology
- Developed at San Diego Supercomputer Center (Univ. of California, San Diego)
- 1996 - DARPA Massive Data Analysis
- 1998 - DARPA/USPTO Distributed Object Computation Test bed
- 2001 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC
- Applications
- Data grids - data sharing
- Digital libraries - data publication
- Persistent archives - data preservation
- Used in national and international projects in
support of Astronomy, Bio-Informatics, Biology,
Earth Systems Science, Ecology, Education,
Geology, Government records, High Energy Physics,
Seismology
92Data brokered by SDSC SRB (2004)
358 TB
324 TB
682 TB
93SRB DGMS - Numbers tell the story
830 TB, 126 million files and counting. Globally
must be around 2 Petabytes
94GGF - SRB Data Grid Federation Status
95Three Tier Architecture
- Client Implementations (Any interface/API)
- Your preferred access mechanism
- Servers (SRB Server)
- Federated to support direct interactions between servers
- Metadata catalog (MCAT)
- Separation of metadata management from data storage
- State persistence using a well-tuned database
- Storage Resources (SRB drivers)
- Your storage device with our SRB driver
- Just some 16 basic POSIX-like operations
96Peer-2-peer (like) SRB servers
Client can connect to any distributed SRB server
MES: MCAT-Enabled SRB Server
SRB Zone: All servers including the MES
MCAT
The role of the MES and the lack of a leader election
protocol mean the servers are not fully P2P
SRB Storage Driver
SRB Storage Driver
SRB Storage Driver
97Peer-2-peer SRB Zones
MCAT
SRB Storage Driver
SRB Storage Driver
SRB Storage Driver
98SDSC Storage Resource Broker Meta-data Catalog
Application
Linux I/O
OAI WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP
Consistency Management /
Authorization-Authentication
SRB Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase, SQLServer
Storage Drivers
HRM
99Data Grid Logical Namespace in SRB
Storage Resource Transparency: SRB supports
resources including archival systems (HPSS, ADS),
file systems (NFS, CIFS), FTP/GridFTP servers,
relational databases, and data stream systems
(Antelope). An SRB resource driver that supports
basic functions is part of the SRB Resource Server.
Storage Resource Transparency
100Data Grid Logical Namespace in SRB
Network speed cannot be changed by the software.
In SRB DGMS, reduced WAN messages and bulk
operations are used. Bulk operations include
register, load, unload, delete to trash,
metadata, access control and more.
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
101Data Grid Logical Namespace in SRB
Replica selection in SRB DGMS is based on some
heuristics we observed in our users'
environments. It takes into account replica
administrative location, client location, replica
storage resource type, ...
Data Replica Transparency
image_0.jpgimage_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
102Data Grid Logical Namespace in SRB
Data Organization
Data Replica Transparency
image_0.jpgimage_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
103SRB APIs and Interfaces
- C library calls
- Provide access to all SRB functions
- Shell commands
- Provide access to all SRB functions (mature)
- mySRB web browser
- Provides hierarchical collection view
- inQ Windows browser
- Provides Windows style directory view
- Jargon Java API
- Similar to java.io.File API (very well used/tested)
- Matrix WSDL/SOAP Interface
- Aggregate SRB requests into a SOAP request. Has a Java API
- Python, Perl, C, OAI, Windows DLL, Mac DLL, Linux I/O redirection, GridFTP
104Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- Analogy to database concepts
- Where and how this solution can be used
- Solutions and Products
- SDSC Storage Resource Broker
- iRODS DGMS
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
105Success brings more opportunity
106A Rule Oriented Data Management System
- The intelligence in SRB was hard-coded
- Extensions/modifications require extreme care
- Can we make SRB more flexible?
- Easy to customize at a finer level
- Example: Can we add additional post-processing on ingestion?
- Example: Can we use workflows for server-side data management?
- Example: Can we provide queued and batch processing?
- Solution: Use a rule-based architecture to provide flexibility with server-side workflow services
107iRODS Architecture (internal)
Resources
108iRODS Architecture (high level)
109Join iRODS
- Community research bed
- Research on unstructured data management
- Most of the world's data is in files or has no schema
- Very young field with promising breakthroughs
- Workflow, data classification, rule engineering
- Community development
- Develop new plug-ins, client side tools
- Promising to be data management's operating system
- Interest from Fortune 500, Academia, ...
110Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- How we solved our problem?
- Theory to Implementation
- Analogy to database concepts
- Where and how this solution can be used
- Solutions and Products
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
111InQ
- inQ is a browser query tool for SRB.
- Supports the file and directory functionality of Windows Explorer.
- Adds support for user-defined metadata and nested queries.
112SRB InQ
Show it for real - It's real, it works!
113Introduction
- JARGON is a pure Java API for interacting with a data grid.
- The API currently handles file I/O for local and SRB file systems, as well as querying and modifying SRB metadata.
114The Transparent Grid
- The API's structure, factory methods and programming model unify diverse file systems into a single simple interface.
- File handling exactly matches Sun's java.io.File API.
- Factory methods even hide the type of file system, e.g. whether it is local or remote.
- The API has been developed to allow for the easy inclusion of other grid file systems.
115Class Structure
- java.lang.Object
  - edu.sdsc.grid.io.General
    - edu.sdsc.grid.io.Local
    - edu.sdsc.grid.io.Remote
      - edu.sdsc.grid.io.srb.SRB
      - ...(other filesystems)
116Connecting to a File System
- SRBAccount account = new SRBAccount( userInfo );
- SRBFileSystem srbFileSystem = new SRBFileSystem( account );
- GeneralFile file = FileFactory.newFile( fileSystem, "Test.txt" );
- GeneralFile file = FileFactory.newFile( uri );
117Scommands
- Command line access to the SRB
- Download then compile from http://www.sdsc.edu/srb/tarfiles/main.html (make sure you get the non-encrypted client only, as well as the correct version that matches the server version)
- Log in to a machine with Scommand binaries
- via ssh to a *nix machine
- Win32 binaries from command window
118Other authentication methods
- AUTH_SCHEME
- 'ENCRYPT1' - random message encrypted with your password between clients and servers.
- 'GSI_AUTH' - Use the Globus GSI authentication scheme.
- 'GSI_DELEGATE' - Use the GSI Delegation (proxy) certificate for authentication. The advantage is that this certificate can be passed from server to server whereby the user's identity continues to be maintained across servers and across zones. This scheme solves the cross zone authentication issues.
- 'GSI_SECURE_COMM' - Use the GSI authentication scheme and use the GSI I/O library for all socket communication between client and server.
119Extra env vars for GSI auth
- SRB (.MdasEnv file or env vars)
- AUTH_SCHEME GSI_AUTH
- SERVER_DN
- /C=US/O=NPACI/OU=SDSC/UID=srb/CN=Storage Resource Broker/Email=srb_at_sdsc.edu
- GLOBUS
- X509_USER_PROXY="/home/du0/du0.proxy"
- GLOBUS_LOCATION="/usr/local/apps/nmi-2.1"
- GLOBUS_INSTALL_PATH="/usr/local/apps/nmi-2.1"
- LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/apps/nmi-2.1/lib"
- PATH="$PATH:/usr/local/apps/nmi-2.1/bin"
120Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
121Commercial or Enterprise Use of DGMS
- Yes, they will be used. "When?" is the billion (or multi-million) dollar question.
- History of many storage or enterprise software products
- Relational databases, WWW: only a few rebels and misfits saw the problem or need first. Later everyone starts to use the solution.
- Problem statement: Unstructured data in enterprises is growing at a fast rate. Some day enterprises will have to exchange this with each other (like they do structured data now). How will multiple enterprises work together?
122Excellent! At last I understood some part of this
tutorial. Smithers, I see a business
opportunity for us. We build this DGMS using
Indian developers here and We will be the next
big thing in data management.
123Open Grid Forum (OGF)
- Global Forum for Information Exchange and Collaboration
- Promote and support the development and deployment of Grid Technologies
- Creation and documentation of best practices, technical specifications (standards), user experiences, ...
- Modeled after Internet Standards Process (IETF, RFC 2026)
- http://www.ogf.org
- Grid File System Working Group (GFS-WG)
124India and Grids (Back in 2002)
125India and Grid now (2006)
- Garuda Grid
- Ambitious effort by CDAC and multiple Indian universities
- 45 research and academic institutions
- 17 cities across India
- Storage Resource Broker (SRB) for data management
- There is no place like 127.0.0.1 Grid
- Infrastructure for research and development of sciences
126Research Issues
- Data Grid Workflow and Data Grid ILM, Triggers
- Automated replica creation and data migration based on distributed access patterns
- Knowledge Grids
- Self-organizing data grid communities
127Current Datagrid Collections
Digital entities
Meta-data
128Active Data grid collection
- We already know a data grid collection
- Relationship of related digital entities (files) that have a common metadata schema (galaxies, songs, ...)
- What if
- We are able to describe common behavior or workflows associated with each digital entity in a collection?
- Example: after inserting a file regarding a galaxy, calculate some attribute or fire a trigger
- What we could be doing
129Current Datagrid Collections
Logical Collection gives location and naming
transparency
Meta-data
SDSC
130Active Datagrid Collections
Now add behavior or services to this logical
collection
Collection state and services
Horizontal Services
Meta-data
SDSC
131Active Datagrid Collections
Collection specific services Model View
Controllers
Logical view of data operations
Collection state and services
Horizontal Services
Meta-data
SDSC
132Active Datagrid Collections
133Data Grid ILM
- ILM: Information Lifecycle Management (sales jargon)
- Dynamic re-orientation of data placement and data retention policies (rules)
- Based on business value of data and storage cost
- HSM: Hierarchical Storage Management, based on data freshness. ILM goes one step further
- Applying this concept on a Data Grid is very tricky, as different autonomous domains have different business rules
134Data Grid Triggers
- Similar to triggers in databases
- Based on ECA concepts
- Event
- Condition
- Action
- Example
- Event: Insert new file in collection (/ourProject/data)
- Condition: (color = blue and galaxy = Andromeda)
- Action: Run (selectiveDataReplicator.dgl)
135Data Grid Language
- Requirement
- Data Grid ILM process
- The long-running process that has to be run is described in DGL
- Data Grid Triggers
- Action part of the ECA (Event-Condition-Action) logic
- Data Gridflows
- Step-by-step execution of a long-running process on the Data Grid
- Analogous to SQL in relational databases
- Long-running procedures stored and executed in the Data Grid itself
- Captures the infrastructure execution logic
136DGL Request
Annotations about the Data Grid Request
Can be either a Flow or a Status Query
137DGL Requests (2 types)
- Data Grid Flow
- An XML structure that describes the execution logic, associated procedural rules and DGL variables. Can be a synchronous or asynchronous flow
- Status Query
- An XML structure used to query the execution status of any gridflow or sub-flow at any granular level. Status queries can be made for both synchronous and asynchronous flows
138Flow
Scoped Variables that can control the flow
Logic used by the sub-members
Sub-members that are the real execution statements
139Flow Logic (How a flow executes)
140DGL-Response
Responses can be synchronous or asynchronous
141Tutorial Outline
- Introduction to Data Grids
- Data Grid Design Philosophies
- SDSC Storage Resource Broker
- Hands on Experience / Demo
- Ongoing Activities or Suggested Research
- Open QA and Discussion
142Pop Quiz
- What is a collection?
- Autonomous Administrative Domain?
- DataSet Vs File ??
- Why not use database to store all files?
- Extra Credit NFSv4 Vs DGMS ??
- Who want to join iRODS?
- What is going on in India for Grids?
143We are SDSC SRB
Arun is here! - Shameless Self promotion ?
Some have moved on with their career
145It's time for your questions. There are no
dumb questions, only dumb answers. So, please
contribute to the discussion.
146An Introduction to Data Grid Management Systems
(DGMS)
- Arun Jagatheesan
- San Diego Supercomputer Center
- University of California, San Diego
Tutorial at the 13th International Conference on
Management of Data (COMAD 2006) IIT-Delhi,
India December 14 - 16, 2006 arun_at_sdsc.edu