Title: An Introduction to Data Grid Management Systems (DGMS)


1
An Introduction to Data Grid Management Systems
(DGMS)
  • Arun Jagatheesan
  • San Diego Supercomputer Center
  • University of California, San Diego

Tutorial at the 13th International Conference on
Management of Data (COMAD 2006) IIT-Delhi,
India December 14 - 16, 2006
2
Dynamic Calibration of Content
  • Academic researchers and Students ()
  • Project Managers, Analysts, Office of the CTO ()
  • Software architects, developers ()
  • Savvy users, I just want to use this hands-on ()

There are around 146 slides. We don't plan to go
through each of them in 2 hours. We need to make
sure all of us get the most out of this tutorial
by the time we leave this room.
3
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

4
We will also participate in this tutorial on DGMS
5
Tutorial Outline
  • Introduction to Data Grids
  • Who are we? What do we do? What problem did we face?
  • Data Grid Design Philosophies
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

6
SDSC: The Pride of NSF
  • SDSC: San Diego Supercomputer Center. Located at
    and run by the University of California, San Diego
  • Leader in high-end computing with a focus on data
    management (academic data center for the US)
  • Founded in 1985 by the National Science Foundation
    (NSF) - 170 million
  • 400 researchers, high-performance hardware and
    software for the academic greater good in the US

7
Our Customer: Science Research
8
Some SDSC Production Machines (2005-6)
  • TeraGrid Linux Cluster: Intel IA-64, 4.4 TFlops
  • Intimidata: IBM Blue Gene, 2.8/5.7 TFlops
  • DataStar: IBM Power4, 10.4 TFlops
  • Storage Area Network Disk
  • Sun F15K Disk Server
  • Archive Systems
600 500 TB
6 PB
9
  • Volume Visualization of the Orion Nebula
  • The San Diego Supercomputer Center and the
    American Museum of Natural History Hayden
    Planetarium

10
Visualizations
11
TeraGrid
12
TeraGrid Resource Partners
13
(No Transcript)
14
Large data volume, large distribution
So, you just got tons of data. So what is the
problem?
Well, it's more than super data. The data is
distributed. The requirements were more
challenging, and it got interesting.
15
NIH BIRN Data Grid
16
BIRN Inter-organizational Data
17
BBSRC Agrenet Archive Service Architecture
BBSRC SRB Archive Process: Data Path
Diagram labels: Central Cache Site, RAL, Site WAN, Firewall, JANET WAN,
ads0sb01.cc.rl.ac.uk, Central SRB Server, Tape Traffic, SRB-ADS Server,
ADS Tape Resource, Sreplcont, ADS SRB Disk Cache Resource, disk,
Central Cache Vault. The numbers 1 to 4 in the diagram correspond to
the steps below.

Step 1
  • Archive Submission Interface
  • Data ingestion of collection hierarchy into SRB
  • Uses the Jargon Java API (equivalent of Sput -b)
  • Ingested to
    /bbsrc/institute/scratch/project/year/user/dateandtime
  • At the end of ingestion, data is logically moved using Smv to
    /bbsrc/institute/local-archive/project/year/user/dateandtime

Step 2
  • Scheduled transfer to Central SRB Server (driven
    from Central SRB Server)
  • Smkcont command used to create a container on the
    central SRB Server
  • Data moved from Site SRB to the container on the
    central SRB Server using Sphymove
  • Upon data transfer completion, archived data is
    logically moved with Smv to
    /bbsrc/institute/remote-archive/project/year/user/dateandtime

Step 3
  • Scheduled transfer to ADS resource
  • Implemented via a CRON job using the Sreplcont command,
    which is driven by the central SRB Server
  • Entire container replicated using the Sreplcont
    command
  • Logical structure preserved as
    /bbsrc/institute/remote-archive/project/year/user/dateandtime

Step 4
  • Synchronization of container to tape resource and
    removal of original container from Central SRB
    Server
  • Ssyncont -d -a command used, allowing for a
    family of containers

18
KEK federated zones
Zones: IHEP (CN), Krakow (PL), SDSC (US), KEK (JP), KNU (KR), ASCC (TW), ANU (AU)

Destination (from KEK)   RTT (msec)   Nominal BW (Mbps)
IHEP                     502          10
Krakow                   327          100
ANU                      292          622
ASCC                     33           100
KNU                      23.6         100

The latency was measured in December 2004.
19
LSST in Media
130 Petabytes of image data
20
Problem Statement - Pattern
  • Large scale Collaboration
  • Large scale data distribution
  • Large scale data organization and discovery of
    unstructured data resources

21
Problem solved / Requirements 1
  • Large scale Collaboration
  • Multiple autonomous organizations (and/or)
  • Multiple user communities sharing large data
    storage
  • Collaborative logical namespace
  • Avoid multiple mount points as they restrict
    scalability of the collaboration
  • Coordinated data sharing at any level of granularity
    (data, metadata, annotations, ...)

22
Problem solved / Requirements 2
  • Large scale data distribution
  • Number of files (and/or)
  • Number of distributed storage resources
  • Multi-site Data Distribution
  • Concept of replica, copy, dirty-copy
  • Multi-site replicas reduce access times
  • Replicas have the same logical name everywhere in
    the enterprise (big plus for users)
  • Replicas controlled by user, admin,
    system-enabled (automated or policy based)
  • Reduce WAN latency (chattiness)

23
Problem solved / Requirements 3
  • Large scale data organization and discovery
  • Unstructured data or files
  • Real-time data and more
  • Data Classification and Discovery
  • Tag data with any arbitrary metadata schema
  • Data is organized based on user-defined
    attributes
  • Multiple teams can have different metadata
    attributes
  • Query, discover and access data without knowing
    the path or protocol to be used

24
Our Solution: DGMS
  • Large-scale logical file system
  • File System
  • Database System
  • Grid Computing
  • Data Grid Management System (DGMS)
  • Core Concepts
  • Logical shared collections
  • Logical shared resources
  • Collaborative communities

25
Is there a simple explanation of what is Grid
Computing?
26
The Data Grid Vision
Data Grid: collaborative logical namespace of
data and storage
Multiple domains, multiple resource
types, autonomous control
27
It's the data.
  • Resources shared (coordinated sharing)
  • Storage resources
  • Labor support
  • Bandwidth
  • But most importantly
  • Data, Metadata, data organization or namespace
  • Having people on the same page
  • Coordinated sharing of knowledge/information
    became a vital link for each organization or
    project

28
It's not just academia alone
  • Rebels and misfits of (existing) technology
  • Big time academic projects, Large enterprises
  • Remember databases and DBMS
  • Enterprises need it too
  • Fortune 500, Forbes Global 2000
  • Distributed global teams requiring collaborations
  • Distribution of unstructured data and storage
  • Unstructured data and distribution are seen in
    multiple vertical domains: chip design,
    automobile or aircraft engineering, patient
    records, ...

29
(Again) Our Solution: DGMS
  • Large-scale logical file system
  • File System
  • Database System
  • Grid Computing
  • Data Grid Management System (DGMS)
  • Core Concepts
  • Logical shared collections
  • Logical shared resources
  • Collaborative communities

Basic unit of data management in a DGMS
30
Tutorial Outline
  • Introduction to Data Grids
  • Who are we? What do we do? What problem did we face?
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

31
Using a Data Grid in Abstract
Data Grid
/home/arun.sdsc/exp1
/home/arun.sdsc/exp1/text1.txt
/home/arun.sdsc/exp1/text2.txt
/home/arun.sdsc/exp1/text3.txt
data storage (100)
32
Real World Physical Heterogeneities
  • Multiple autonomous administrative domains
  • Distributed data (replicas) in different domains
  • Heterogeneous storage resources and systems
  • Distributed users and authentication mechanisms

33
Is it possible?
  • A global collaborative namespace to organize
    unstructured data
  • Discover data without even knowing the name or
    location or protocol to be used?
  • Scale the solution for millions of files and
    Petabytes of data

34
The answer is out there
  • This is actually a namespace problem (apart from
    the physics of moving data faster)
  • Database basics
  • Isolate logical schema (the namespace presented
    to the user) from physical organization
    (distribution) of data
  • The users will see the same data namespace
    irrespective of where the data is physically
    located or how it is stored.
  • The data can be freely moved or replicated around
    without affecting the user's view

35
Logical data namespace
  • A logical data namespace is a mapping from real
    physical data namespace to a logical namespace
  • The logical namespace hides the physical
    organization (physical distribution) of data.
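
A minimal sketch of this idea, assuming the namespace is nothing more than a lookup table. This is not SRB code: the class, its method names and the two physical locations are hypothetical. The point it illustrates is that moving the bytes only updates the table, so the listing a user sees does not change.

```python
class LogicalNamespace:
    """Toy catalog that maps logical names to physical locations."""

    def __init__(self):
        self._catalog = {}  # logical name -> physical location

    def register(self, logical_name, physical_location):
        self._catalog[logical_name] = physical_location

    def migrate(self, logical_name, new_physical_location):
        # Only the mapping changes; the name users see stays the same.
        if logical_name not in self._catalog:
            raise KeyError(logical_name)
        self._catalog[logical_name] = new_physical_location

    def resolve(self, logical_name):
        return self._catalog[logical_name]

    def listing(self, prefix):
        return sorted(name for name in self._catalog if name.startswith(prefix))


ns = LogicalNamespace()
# Both physical locations below are made up for illustration.
ns.register("/home/arun.sdsc/exp1/text1.txt", "unix://host-a/vault/0001")
view_before = ns.listing("/home/arun.sdsc/exp1/")
ns.migrate("/home/arun.sdsc/exp1/text1.txt", "tape://host-b/archive/77")
view_after = ns.listing("/home/arun.sdsc/exp1/")
```

The two listings are identical even though the file now lives on a different resource.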

36
Mapping physical data to logical view
Hierarchical view, independent of network, disk,
sector, track, fragments
Rule: Storage Abstraction. Hide storage
resources
37
Mapping physical data to logical view
Relational view, independent of network, disk,
sector, track, fragments where each field
(cell) is stored
38
Mapping distributed data storage to logical view
39
Mapping distributed data storage to logical view
25 Universities or Research Hospitals, Multiple
heterogeneous storage resources
40
Is logical data namespace sufficient?
  • The logical namespace hides all the heterogeneities.
    All data seems to be on a single resource.
  • The complexities of distributed data management
    are hidden from users
  • However, this also prevents users from performing
    distributed data management operations
  • How do we handle replication or migration of data
    between storage resources? (There is no concept
    of resource or storage space in a logical data
    namespace)

41
Possible solution (1)
  • Within the logical data namespace, we need a way
    to indicate the presence of distributed resources
    (from different organizations)
  • Why not add an attribute to every data item to
    specify where the data is physically present?
  • Users still see the same logical data namespace.
    But in addition, can also see the distribution of
    data

42
Logical namespace
43
Logical namespace location
Location
Add location info??
//121121/a/file1.txt
//sandiego/a/file1.txt
//India123/a/file1.txt
44
Possible Solution (2)
  • But, providing direct physical information in
    logical data namespace might not be a good
    solution. All advantages due to
    infrastructure-independence will be lost.
  • A system administrator's nightmare. Imagine the
    Internet without www and only physical IP
    addresses.
  • Also, users need a way to specify resources
    (for example, which resource to replicate data to)

45
Logical namespace location
Logical Location
Add logical location??
Default resource
San diego sdsc-disk
India-delhi-tape
46
Possible Solution (3)
  • Houston, we have a solution!
  • We will create a separate logical resource
    namespace of all data resources in the data grid
  • The well-known logical data namespace can be
    joined with this logical resource namespace to
    create another logical view: the data grid
    namespace.

47
The solution (or a part of the solution)
Logical Resource Namespace (Each logical
resource is a combination of one or more
physical resources)

Traditional Logical Namespace

Data Grid Namespace
48
Data Grid Namespace
Logical Resources
Multiple Replicas
Users from different organizations
49
A hierarchical logical resource space
Any Possible Resource (ANY)
    Asia Resources (2 of N)
        Asia-disk-1
        Asia-disk-2
        Asia-disk-3
    Europe Resources (ANY)
        Europe-Unix
        Europe-Linux-Cluster
    US Resources (ANY)
        US-disk
    Client Resources (ANY)
        client-cifs
    Critical Replicated Data (ALL)
        Asia-disk-1
        Europe-Unix
        US-disk
This sounds like storage and not DBMS to me.
Well, DGMS involves file systems (storage), DBMS
and grid concepts.
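
One way to read the (ANY), (ALL) and (2 of N) labels is as placement policies: a write to a logical resource lands on one member, on every member, or on k of its members. The slide does not define these semantics, so the sketch below is an assumption, and the class, field and function names are hypothetical. Only the resource names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalResource:
    name: str
    policy: str = "ANY"  # "ANY", "ALL" or "K_OF_N" (assumed semantics)
    k: int = 1           # only used by K_OF_N
    members: list = field(default_factory=list)  # LogicalResource or str


def choose_targets(resource):
    """Return the physical resources a new file would be written to."""
    if isinstance(resource, str):  # a physical resource is a leaf
        return [resource]
    if resource.policy == "ALL":
        wanted = resource.members
    elif resource.policy == "K_OF_N":
        wanted = resource.members[: resource.k]
    else:  # ANY: naively take the first member
        wanted = resource.members[:1]
    targets = []
    for member in wanted:
        for physical in choose_targets(member):
            if physical not in targets:
                targets.append(physical)
    return targets


asia = LogicalResource(
    "Asia Resources", "K_OF_N", 2,
    ["Asia-disk-1", "Asia-disk-2", "Asia-disk-3"],
)
critical = LogicalResource(
    "Critical Replicated Data", "ALL",
    members=["Asia-disk-1", "Europe-Unix", "US-disk"],
)
root = LogicalResource("Any Possible Resource", "ANY", members=[asia, critical])

asia_targets = choose_targets(asia)
critical_targets = choose_targets(critical)
root_targets = choose_targets(root)
```

A real system would pick the ANY member by load or locality rather than by list order; taking the first member just keeps the sketch deterministic.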
50
More with Data Grid Namespace
Logical Resources
Multiple Replicas
User defined meta data (schema) for data discovery
51
So the data grid is a simple GUI? I can click on
it, so why a tutorial?
Bart, that GUI was just used to explain the
concepts. It can be used without a GUI too. I
wonder what the generic concepts are and how they
can be used.
52
More with Data Grid Namespace
Logical Name
Physical details
Logical resource
Metadata
Unstructured becomes structured?
53
Hierarchical view by relational schema

Logical Name   Physical details   Logical resource   Metadata
/usr/a/1.txt   //vault/1q.txt     Fast-disk          Color=blue, Galaxy=r2d2
                                  Wan-tape           Color=red, Galaxy=Yoda
Magic Recipe
Adding the bells and whistles
54
Logical collections by relational schema

Logical Name   Physical details   Logical resource   Metadata
/usr/a/1.txt   //vault/1q.txt     Fast-disk          Color=blue, Galaxy=r2d2
                                  Wan-tape           Color=red, Galaxy=Yoda
                                  SATA               Color=red, Galaxy=Yoda
                                  SATA               Music=ARR, Artist=1
                                  SATA               Music=RD, Artist=Lata
55
Logical collections by relational schema

Logical Name   Physical details   Logical resource   Metadata
/usr/a/1.txt   //vault/1q.txt     Fast-disk          Color=blue, Galaxy=r2d2
                                  Wan-tape           Color=red, Galaxy=Yoda
                                  SATA               Color=red, Galaxy=Yoda
                                  SATA               Music=ARR, Artist=1
                                  SATA               Music=RD, Artist=LataJi
StarCollection(L_Name, PhyDetails, LogRes, Color,
Galaxy)
MusicFileCollection(L_Name, PhyDetails, LogRes,
Music, Artist)
56
Collection as a relation
  • StarCollection(L_Name, PhyDetails, LogRes, Color,
    Galaxy)
  • MusicFileCollection(L_Name,PhyDetails, LogRes,
    Music, Artist)
  • Employee (Emp, Emp_Name, Dept, )

57
Collection as a relation
Data (files) are not part of the relationship,
only metadata.
  • StarCollection(L_Name, PhyDetails, LogRes, Color,
    Galaxy)
  • MusicFileCollection(L_Name,PhyDetails, LogRes,
    Music, Artist)
  • Employee (Emp, Emp_Name, Dept, )

Data forms the relationship
58
Everything is logical!
Hierarchical Logical namespace or logical view
powered by a relational or collection-oriented
schema
59
Logical view to relational
  • Logical View (User)
  • ls *.jpg (dir *.jpg)
  • ls *.jpg -q (galaxy=yoda)
  • Inside DB (Internal)
  • Select file-name
  • Select file-name where
  • yoda and
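
A hedged sketch of how a user-level listing with a metadata filter could turn into a relational query. It uses an in-memory SQLite table shaped like the StarCollection relation from the earlier slides. The rows, the `ls` helper and its keyword arguments are made up for illustration, and this is not how SRB's catalog is actually laid out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StarCollection ("
    "L_Name TEXT, PhyDetails TEXT, LogRes TEXT, Color TEXT, Galaxy TEXT)"
)
# Illustrative rows only; they are not taken from a real catalog.
conn.executemany(
    "INSERT INTO StarCollection VALUES (?, ?, ?, ?, ?)",
    [
        ("/usr/a/1.jpg", "//vault/1q.jpg", "Fast-disk", "blue", "r2d2"),
        ("/usr/a/2.jpg", "//vault/2q.jpg", "Wan-tape", "red", "yoda"),
        ("/usr/a/notes.txt", "//vault/3q.txt", "SATA", "red", "yoda"),
    ],
)

QUERYABLE = {"Color", "Galaxy", "LogRes"}


def ls(pattern, **metadata):
    """Emulate 'ls <pattern> -q key=value' as a SELECT on the collection."""
    sql = "SELECT L_Name FROM StarCollection WHERE L_Name GLOB ?"
    params = [pattern]
    for column, value in metadata.items():
        if column not in QUERYABLE:  # never splice unknown names into SQL
            raise ValueError(f"unknown attribute: {column}")
        sql += f" AND {column} = ?"
        params.append(value)
    sql += " ORDER BY L_Name"
    return [row[0] for row in conn.execute(sql, params)]


all_jpg = ls("*.jpg")
yoda_jpg = ls("*.jpg", Galaxy="yoda")
```

The user never sees the physical details or the logical resource columns. They only supply a name pattern and attribute values.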

60
Theory discussed so far
  • Data grid Namespace with logical resources
  • Well-known magic recipe to distribute data
    resources
  • Resources part of the logical namespace
  • Collections as a relationship to organize data
  • Relationship of data based on meta-data
  • Elaborating on top of these concepts
  • Collections as a data management entity
  • But before doing that a warning from the speaker

61
Warning: What they might say
"DGMS is just another application running on top
of a database (this solves a problem for some
high-end users). Advice: Just stay away from
this guy (Arun) and don't fall for DGMS."
- DB-Guru
62
Collections for data management
  • Data grid collection similar to common usage of
    the word collection
  • The basic unit of data management in DGMS
  • Mathematically a triple (L, m, P)
  • L is the set of logical identifier strings (logical
    namespace)
  • P is the set of physical locations of data (physical
    details)
  • m: L → P is a non-injective and surjective function
  • In traditional file systems
  • m: L → P is an injective function (1-to-1 mapping,
    no replicas)
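
A small sketch of the difference between the two cases. The slide writes m as a function. To let one logical name hold several replicas, this sketch instead assumes a set-valued mapping (logical name to a set of physical locations), which is a modeling choice made here and not something the slide states. The helper names and the sample paths are hypothetical.

```python
def covers_all(mapping, physical_locations):
    """True if every physical location is referenced by some logical name."""
    reached = set()
    for locations in mapping.values():
        reached |= set(locations)
    return reached == set(physical_locations)


def is_one_to_one(mapping):
    """True if each logical name has exactly one location and none is shared."""
    seen = set()
    for locations in mapping.values():
        if len(locations) != 1:
            return False
        (location,) = locations
        if location in seen:
            return False
        seen.add(location)
    return True


# Data grid collection: the first logical name has two replicas.
grid = {
    "/usr/a/1.txt": {"//vault/1q.txt", "//tape/1q.txt"},
    "/usr/a/2.txt": {"//vault/2q.txt"},
}
grid_locations = {"//vault/1q.txt", "//tape/1q.txt", "//vault/2q.txt"}

# Traditional file system: one name, one location, no replicas.
traditional = {
    "/usr/a/1.txt": {"//vault/1q.txt"},
    "/usr/a/2.txt": {"//vault/2q.txt"},
}
```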

63
Just for fun comparison
  • DBMS
  • Relation (Table)
  • Relationship on data. Physical data involved
  • Well defined schema on physical bytes
  • Tablespace
  • Very well understood concepts
  • Triggers, vendors
  • DGMS
  • Collection (directory-like)
  • Relationship on meta-data or operations
  • No schema on bytes or content
  • Vaultspace
  • The users say it works; who wants theory?
  • Work in progress

64
Tutorial Outline
  • Introduction to Data Grids
  • Who are we? What do we do? What problem did we face?
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

65
Break? or Continue?
66
Data Grid and DGMS
  • Data Grid: A logical collaborative namespace
    that enables coordinated sharing of distributed
    storage resources and data based on local and
    global policies across administrative domains.
  • DGMS: The middleware (software) infrastructure
    that enables the creation and management of
    data grids

67
Data Grid Collection
  • Very similar to directory (based on set theory)
  • Files or digital entities from different storage
    or data management systems are represented with
    logical names
  • A user-defined or flexible metadata schema
    describes the data (relationship to collection)
  • Query and discover data based on attributes
  • Browse the collection as a hierarchical
    namespace
  • Collections can have some predefined storage
    resources
  • Other higher level concepts include views,
    triggers or rules

68
Tutorial Outline
  • Introduction to Data Grids
  • Who are we? What do we do? What problem did we face?
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

69
Heterogeneous Resources
70
Transparencies/Virtualizations (bits, data,
information, ...)
Storage Resource Transparency: Have a driver for
each new storage resource type that can perform
the same predefined set of functions, such as
read/write bytes (on an archive resource, disk,
database, ...)
Storage Resource Transparency
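
As a sketch of the driver idea, assuming a two-operation interface (the real set of driver functions is larger and is not listed on this slide), every storage type implements the same calls so that higher layers never branch on the resource type. The class and function names are hypothetical.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common interface every storage resource type must implement."""

    @abstractmethod
    def write(self, local_id, data):
        """Store bytes under a resource-local identifier."""

    @abstractmethod
    def read(self, local_id):
        """Return the bytes stored under a resource-local identifier."""


class MemoryDriver(StorageDriver):
    """Keeps objects in a dict; stands in for any non-file resource."""

    def __init__(self):
        self._blobs = {}

    def write(self, local_id, data):
        self._blobs[local_id] = bytes(data)

    def read(self, local_id):
        return self._blobs[local_id]


class DirectoryDriver(StorageDriver):
    """Stores each object as a file under one root directory."""

    def __init__(self, root):
        self._root = root

    def write(self, local_id, data):
        with open(os.path.join(self._root, local_id), "wb") as handle:
            handle.write(data)

    def read(self, local_id):
        with open(os.path.join(self._root, local_id), "rb") as handle:
            return handle.read()


def copy_object(src, src_id, dst, dst_id):
    # Works for any pair of drivers because both honor the same interface.
    dst.write(dst_id, src.read(src_id))


memory = MemoryDriver()
disk = DirectoryDriver(tempfile.mkdtemp())
memory.write("obj-1", b"hello grid")
copy_object(memory, "obj-1", disk, "obj-1.bin")
copied = disk.read("obj-1.bin")
```

Adding a new storage type then means writing one more driver class, with no change to `copy_object` or its callers.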
71
Data Identifiers for different systems
/usr/homes/arun/1.txt
C:\docs\arun\2.txt
Select name from docs where user = 'arun' and
type = 'text'
72
Transparencies/Virtualizations (bits, data,
information, ...)
Data Identifier Transparency: Each underlying
data storage system has its own local naming
convention. The driver for each resource type
understands the local naming and function calls
such as read/write bytes (on an archive resource,
disk, database, ...)
Data Identifier Transparency
Storage Resource Transparency
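
To make the naming point concrete, here is a sketch in which each resource type supplies its own function for building a local identifier. The three conventions mirror the examples on the "Data Identifiers for different systems" slide, but the function names, the dispatch table and the root directories are assumptions made for illustration.

```python
import ntpath
import posixpath


def unix_identifier(user, filename):
    return posixpath.join("/usr/homes", user, filename)


def windows_identifier(user, filename):
    return ntpath.join("C:\\docs", user, filename)


def database_identifier(user, doc_type):
    # A database "name" is a parameterized query plus its arguments.
    return ("SELECT name FROM docs WHERE user = ? AND type = ?", (user, doc_type))


LOCAL_NAMING = {
    "unix": unix_identifier,
    "windows": windows_identifier,
    "database": database_identifier,
}


def local_identifier(resource_type, *args):
    """Dispatch to the naming convention of the given resource type."""
    return LOCAL_NAMING[resource_type](*args)


unix_name = local_identifier("unix", "arun", "1.txt")
windows_name = local_identifier("windows", "arun", "2.txt")
query, query_args = local_identifier("database", "arun", "text")
```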
73
Distributed storage resources - latency
74
Transparencies/Virtualizations (bits, data,
information, ...)
Network speed cannot be changed by the
software. In DGMS, reduced WAN messages and bulk
operations are used. DGMS has to make the best
use of bandwidth and has to create automated
replicas on demand. Another follow-up hardware
technology is WAFS.
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
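
A back-of-the-envelope sketch of why fewer WAN messages matter. The model is an assumption: each request costs a fixed number of round trips, and payload time is simply size divided by bandwidth. Every number in it (file count, data size, 300 ms round-trip time, 100 Mbps link) is illustrative, not a measurement.

```python
def transfer_seconds(num_files, total_megabits, rtt_ms, bandwidth_mbps,
                     round_trips_per_request, files_per_request):
    """Estimate transfer time as request latency plus streaming time."""
    requests = -(-num_files // files_per_request)  # ceiling division
    latency = requests * round_trips_per_request * rtt_ms / 1000.0
    streaming = total_megabits / bandwidth_mbps
    return latency + streaming


# 10,000 small files totalling 8,000 megabits over a 300 ms, 100 Mbps link.
per_file = transfer_seconds(10_000, 8_000, 300, 100,
                            round_trips_per_request=2, files_per_request=1)
bulk = transfer_seconds(10_000, 8_000, 300, 100,
                        round_trips_per_request=2, files_per_request=1_000)
```

Under these assumptions the per-file transfer is dominated by latency (about 6,080 s in total) while the bulk transfer is dominated by bandwidth (about 86 s), even though the same bytes cross the same link.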
75
Replicas
76
Transparencies/Virtualizations (bits, data,
information, ...)
Replica selection is based on replica administrative
location, client location, replica storage
resource type, network latency, etc.
Data Replica Transparency
image_0.jpg ... image_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
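
A sketch of replica selection as a cost function over the factors listed on this slide. The weights, the penalty table, the field names and the sample replicas are all invented for illustration. The slide names the inputs but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    location: str       # physical location of this copy
    domain: str         # administrative domain that owns it
    resource_type: str  # for example "disk" or "tape"
    rtt_ms: float       # network latency from the client


# Assumed penalties in milliseconds; tape is slow to first byte.
TYPE_PENALTY_MS = {"disk": 0.0, "tape": 500.0}
FOREIGN_DOMAIN_PENALTY_MS = 100.0


def cost(replica, client_domain):
    penalty = 0.0 if replica.domain == client_domain else FOREIGN_DOMAIN_PENALTY_MS
    return replica.rtt_ms + penalty + TYPE_PENALTY_MS.get(replica.resource_type, 250.0)


def select_replica(replicas, client_domain):
    """Pick the replica with the lowest estimated access cost."""
    return min(replicas, key=lambda replica: cost(replica, client_domain))


replicas = [
    Replica("sdsc-disk", "sdsc", "disk", 150.0),  # cost 250
    Replica("kek-tape", "kek", "tape", 5.0),      # cost 505
    Replica("knu-disk", "knu", "disk", 24.0),     # cost 124
]
chosen = select_replica(replicas, client_domain="kek")
```

With these weights the nearby disk copy in another domain beats the local tape copy, which is the kind of trade-off a selection heuristic has to encode.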
77
Transparencies/Virtualizations (bits, data,
information, ...)
Inter-organizational Information Storage
Management
Semantic data organization (with behavior)
Virtual Data Transparency
Data Replica Transparency
image_0.jpg ... image_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
78
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • Where and how this solution can be used
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

79
What type of data?
  • Unstructured data sets (File-like)
  • Images
  • Movies
  • E-Mail
  • Why file-like? Each digital entity can have
    its own metadata and can be part of multiple
    collections with many replicas. (This is not the
    case in traditional file systems.)
  • Also for..
  • Data Streams
  • Semi-structured data

80
Why they use Data Grids?
  • Inter/Intra Organizational Sharing
  • Inter/Intra Organizational Data Storage Utility
  • Data Storage Resource Plug-n-play provisioning
  • Data Preservation (Technology Migration)
  • Information Lifecycle Management (ILM)
  • Collaborative data lifecycle management

81
Inter/intra Organizational Sharing (1)
Research Lab 1: You can use our storage for this
project
Research Lab 2: We have relevant data for this
project
GRP
82
Inter/intra Organizational Sharing (2)
  • Example
  • LSST Project: LSST, NCSA, SDSC - 100 PB data
  • Sharing of resources
  • Autonomous domains - either same (Intra) or
    different (Inter) organizations
  • Shared resources
  • Data, storage, IT staff
  • Logical namespace for Collaboration
  • Shared data and physical resources available in
    the logical namespace for usage

83
Inter/intra Organizational Utility (1)
Data Center
West Coast Offices
GRP
East Coast Office
84
Inter/intra Organizational Utility (2)
That was so easy in slide show ?
Data Center
West Coast Offices
GRP
East Coast Office
85
Inter/intra Organizational Utility (3)
  • Example: NIH BIRN Project
  • Connecting the research hospitals/universities in
    the US
  • Create a logical data storage utility
  • Virtualization of enterprise resources (data and
    storage)
  • Plug-n-play of resources on data grid
  • Collaboration by sharing both data and storage

86
Data Preservation (Technology Migration)
  • Example
  • National Archives Prototype (IDEA Award 2006)
  • Library of Congress
  • Facilitate Technology Migration
  • Flexible data architecture for technology
    evolution
  • Hardware changes, Software changes
  • The applications and users are not aware of any change
  • Significant saving by avoiding cost of migration
  • Reduction of downtime, operational cost of
    migration
  • Create a replicated resource of all or selected
    data

87
Data Grid technologies are used for many
data-intensive environments including Distributed
Digital Libraries, Persistent Archives and
Dataflow Systems. The core concept is the same
for all these requirements.
88
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • Analogy to database concepts
  • Where and how this solution can be used
  • Solutions and Products
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

89
Similar Solutions and Products
  • HPSS (used IBM DB2): but only for archival
  • Oracle Collaboration Suite: only single site,
    very limited
  • Sybase Avaki: treats everything as an object
  • Globus Data Grid
  • Nirvana SRB
  • AFS: only a global namespace

90
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • Analogy to database concepts
  • Where and how this solution can be used
  • Solutions and Products
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

91
Storage Resource Broker
  • Distributed data management technology
  • Developed at San Diego Supercomputer Center
    (Univ. of California, San Diego)
  • 1996 - DARPA Massive Data Analysis
  • 1998 - DARPA/USPTO Distributed Object Computation
    Test bed
  • 2001 to present - NSF, NASA, NARA, DOE, DOD, NIH,
    NLM, NHPRC
  • Applications
  • Data grids - data sharing
  • Digital libraries - data publication
  • Persistent archives - data preservation
  • Used in national and international projects in
    support of Astronomy, Bio-Informatics, Biology,
    Earth Systems Science, Ecology, Education,
    Geology, Government records, High Energy Physics,
    Seismology

92
Data brokered by SDSC SRB (2004)
358 TB
324 TB
682 TB
93
SRB DGMS - Numbers tell the story
830 TB, 126 million files and counting. Globally
must be around 2 Petabytes
94
GGF - SRB Data Grid Federation Status
95
Three Tier Architecture
  • Client Implementations (Any interface/API)
  • Your preferred access mechanism
  • Servers (SRB Server)
  • Federated to support direct interactions between
    servers
  • Metadata catalog (MCAT)
  • Separation of metadata management from data
    storage
  • State persistence using a well-tuned database
  • Storage Resources (SRB drivers)
  • Your storage device with our SRB driver
  • Just some basic 16 POSIX-like operations

96
Peer-2-peer (like) SRB servers
Client can connect to any distributed SRB server
MES: MCAT-Enabled SRB Server
SRB Zone: all servers, including the MES
MCAT
Because of the role of the MES and the lack of a
leader-election protocol, the servers are not fully P2P
SRB Storage Driver
SRB Storage Driver
SRB Storage Driver
97
Peer-2-peer SRB Zones
MCAT
SRB Storage Driver
SRB Storage Driver
SRB Storage Driver
98
SDSC Storage Resource Broker Meta-data Catalog
Application
Linux I/O
OAI WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP
Consistency Management /
Authorization-Authentication
SRB Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase, SQLServer
Storage Drivers
HRM
99
Data Grid Logical Namespace in SRB
Storage Resource Transparency: SRB supports
resources including archival systems (HPSS, ADS),
file systems (NFS, CIFS), FTP/GridFTP servers,
relational databases, and data stream systems
(Antelope). An SRB resource driver that supports
basic functions is part of the SRB Resource Server.
Storage Resource Transparency
100
Data Grid Logical Namespace in SRB
Network Speed can not be changed by the software.
In SRB DGMS reduced WAN messages and bulk
operations are used. Bulk operations include
register, load, unload, delete to trash,
metadata, access control and more.
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
101
Data Grid Logical Namespace in SRB
Replica selection in SRB DGMS is based on some
heuristics we observed in our users'
environments. It takes into account replica
administrative location, client location, replica
storage resource type, ...
Data Replica Transparency
image_0.jpg ... image_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
102
Data Grid Logical Namespace in SRB
Data Organization
Data Replica Transparency
image_0.jpg ... image_100.jpg
Storage Location Transparency
Data Identifier Transparency
Storage Resource Transparency
103
SRB APIs and Interfaces
  • C library calls
  • Provide access to all SRB functions
  • Shell commands
  • Provide access to all SRB functions (mature)
  • mySRB web browser
  • Provides hierarchical collection view
  • inQ Windows browser
  • Provides Windows style directory view
  • Jargon Java API
  • Similar to java.io.File API (very well
    used/tested)
  • Matrix WSDL/SOAP Interface
  • Aggregate SRB requests into a SOAP request. Has a
    Java API
  • Python, Perl, C, OAI, Windows DLL, Mac DLL,
    Linux I/O redirection, GridFTP

104
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • Analogy to database concepts
  • Where and how this solution can be used
  • Solutions and Products
  • SDSC Storage Resource Broker
  • iRODS DGMS
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

105
Success brings more opportunity
106
A Rule Oriented Data Management System
  • The intelligence in SRB was hard-coded
  • Extensions/modifications require extreme care
  • Can we make SRB more flexible?
  • Easy to customize at a finer level
  • Example: Can we add additional post-processing on
    ingestion?
  • Example: Can we use workflows for server-side
    data management?
  • Example: Can we provide queued and batch
    processing?
  • Solution: Use a rule-based architecture to provide
    flexibility with server-side workflow services
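
A toy sketch of the rule-based idea, assuming rules are (condition, action) pairs attached to named server events. This is not the iRODS rule language or its API: the `RuleEngine` and `Server` classes, the `post_ingest` event name and the metadata keys are all hypothetical. It shows post-processing on ingestion being added without touching the ingest code itself.

```python
import hashlib
from collections import defaultdict


class RuleEngine:
    """Holds (condition, action) pairs keyed by event name."""

    def __init__(self):
        self._rules = defaultdict(list)

    def on(self, event, condition, action):
        self._rules[event].append((condition, action))

    def fire(self, event, context):
        for condition, action in self._rules[event]:
            if condition(context):
                action(context)


class Server:
    """Ingest logic is fixed; behavior is customized through rules."""

    def __init__(self, rules):
        self.rules = rules
        self.catalog = {}

    def ingest(self, logical_name, data):
        entry = {"name": logical_name, "data": data, "metadata": {}}
        self.catalog[logical_name] = entry
        self.rules.fire("post_ingest", entry)
        return entry


def add_checksum(entry):
    entry["metadata"]["sha256"] = hashlib.sha256(entry["data"]).hexdigest()


def tag_as_image(entry):
    entry["metadata"]["kind"] = "image"


rules = RuleEngine()
rules.on("post_ingest", lambda entry: True, add_checksum)
rules.on("post_ingest", lambda entry: entry["name"].endswith(".jpg"), tag_as_image)

server = Server(rules)
picture = server.ingest("/usr/a/1.jpg", b"abc")
notes = server.ingest("/usr/a/notes.txt", b"abc")
```

Queued or batch processing would fit the same shape by having an action enqueue work instead of doing it inline.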

107
iRODS Architecture (internal)
Resources
108
iRODS Architecture (high level)
109
Join iRODS
  • Community research bed
  • Research on unstructured data management
  • Most of the world's data is in files or has no schema
  • Very young field with promising breakthroughs
  • Workflow, data classification, rule engineering
  • Community development
  • Develop new plug-ins, client side tools
  • Promises to be the operating system of data
    management
  • Interest from Fortune 500, academia, ...
110
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • How did we solve our problem?
  • Theory to Implementation
  • Analogy to database concepts
  • Where and how this solution can be used
  • Solutions and Products
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

111
InQ
  • inQ is a browser query tool for SRB.
  • Supports the file and directory functionality of
    Windows Explorer.
  • Adds support for user-defined metadata and nested
    queries.

112
SRB InQ
Show it for real - It's real, it works!
113
Introduction
  • JARGON is a pure Java API for interacting with a
    data grid.
  • The API currently handles file I/O for local and
    SRB file systems, as well as querying and modifying
    SRB metadata.

114
The Transparent Grid
  • The API's structure, factory methods and
    programming model unify diverse file systems
    into a single, simple interface.
  • File handling exactly matches Sun's java.io.File
    API.
  • Factory methods even hide the type of file
    system, e.g. if it is local or remote.
  • The API has been developed to allow for the easy
    inclusion of other grid file systems.

115
Class Structure
  • java.lang.Object
      - edu.sdsc.grid.io.General*
          - edu.sdsc.grid.io.Local*
          - edu.sdsc.grid.io.Remote*
              - edu.sdsc.grid.io.srb.SRB*
              - ...(other filesystems)

116
Connecting to a File System
  • SRBAccount account = new SRBAccount( userInfo );
  • SRBFileSystem srbFileSystem =
    new SRBFileSystem( account );
  • GeneralFile file = FileFactory.newFile(
    fileSystem, "Test.txt" );
  • GeneralFile file =
    FileFactory.newFile( uri );

117
Scommands
  • Command line access to the SRB
  • Download, then compile, from
    http://www.sdsc.edu/srb/tarfiles/main.html (make
    sure you get the non-encrypted client only, as well
    as the version that matches the server version)
  • Login to a machine with Scommand binaries
  • via ssh to a *nix machine
  • Win32 binaries from command window

118
Other authentication methods
  • AUTH_SCHEME
  • 'ENCRYPT1' - A random message encrypted with
    your password is exchanged between clients and servers.
  • 'GSI_AUTH' - Use the Globus GSI
    authentication scheme.
  • 'GSI_DELEGATE' - Use the GSI Delegation
    (proxy) certificate for authentication. The
    advantage is that this certificate can be passed
    from server to server whereby the user's identity
    continues to be maintained across servers and
    across zones. This scheme solves the cross zone
    authentication issues.
  • 'GSI_SECURE_COMM' - Use the GSI
    authentication scheme and use the GSI I/O library
    for all socket communication between client and
    server.

119
Extra env vars for GSI auth
  • SRB (.MdasEnv file or env vars)
  • AUTH_SCHEME GSI_AUTH
  • SERVER_DN
  • /CUS/ONPACI/OUSDSC/UIDsrb/CNStorage
    Resource Broker/Emailsrb_at_sdsc.edu
  • GLOBUS
  • X509_USER_PROXY="/home/du0/du0.proxy"
  • GLOBUS_LOCATION="/usr/local/apps/nmi-2.1"
  • GLOBUS_INSTALL_PATH="/usr/local/apps/nmi-2.1"
  • LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/apps/nmi-2.1/lib"
  • PATH="$PATH:/usr/local/apps/nmi-2.1/bin"
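
Before trying a GSI login it can help to check that these settings are present. The Python sketch below is a hypothetical helper, not part of SRB or Globus. The variable names come from this slide; whether a given installation reads the SRB ones from the environment or from .MdasEnv is an assumption here.

```python
import os

# Names taken from the slide. Treating all four as environment variables
# is an assumption of this sketch.
REQUIRED = ("AUTH_SCHEME", "SERVER_DN", "X509_USER_PROXY", "GLOBUS_LOCATION")


def missing_gsi_settings(env=None):
    """Return the required names that are unset or empty, in slide order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_gsi_settings()` with no argument inspects the current process environment; passing a dict makes the check easy to try without touching real settings.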

120
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

121
Commercial or Enterprise Use of DGMS
  • Yes, they will be used. "When?" is the billion
    (or multi-million) dollar question.
  • History of many storage and enterprise software
    products
  • Relational databases, WWW: only a few rebels
    and misfits saw the problem or need first. Later,
    everyone started to use the solution.
  • Problem statement: Unstructured data in
    enterprises is growing at a fast rate. Some day
    enterprises will have to exchange it with each
    other (as they do structured data now). How
    will multiple enterprises work together?

122
Excellent! At last I understood some part of this
tutorial. Smithers, I see a business
opportunity for us. We build this DGMS using
Indian developers here, and we will be the next
big thing in data management.
123
Open Grid Forum (OGF)
  • Global Forum for Information Exchange and
    Collaboration
  • Promote and support the development and
    deployment of Grid Technologies
  • Creation and documentation of best practices,
    technical specifications (standards), and user
    experiences
  • Modeled after Internet Standards Process (IETF,
    RFC 2026)
  • http://www.ogf.org
  • Grid File System Working Group (GFS-WG)

124
India and Grids (Back in 2002)
125
India and Grid now (2006)
  • Garuda Grid
  • Ambitious effort by CDAC and multiple Indian
    Universities
  • 45 Research & Academic Institutions
  • 17 cities across India
  • Storage Resource Broker (SRB) for data management
  • There is no place like 127.0.0.1 Grid
  • Infrastructure for research and development of
    sciences

126
Research Issues
  • Data Grid Workflow and Data Grid ILM, Triggers
  • Automated replica creation and data migration
    based on distributed access patterns
  • Knowledge Grids
  • Self-organizing data grid communities

127
Current Datagrid Collections
Digital entities
Meta-data
128
Active Data grid collection
  • We already know what a data grid collection is
  • A grouping of related digital entities (files)
    that share a common metadata schema (galaxies,
    songs, ...)
  • What if
  • We were able to describe common behavior or
    workflows associated with each digital entity in
    a collection?
  • Example: after inserting a file about a
    galaxy, calculate some attribute or fire a trigger
  • What we could be doing

129
Current Datagrid Collections
Logical Collection gives location and naming
transparency
Meta-data
SDSC
130
Active Datagrid Collections
Now add behavior or services to this logical
collection
Collection state and services
Horizontal Services
Meta-data
SDSC
131
Active Datagrid Collections
Collection specific services Model View
Controllers
Logical view of data operations
Collection state and services
Horizontal Services
Meta-data
SDSC
132
Active Datagrid Collections
133
Data Grid ILM
  • ILM: Information Lifecycle Management (sales
    jargon)
  • Dynamic re-orientation of data placement and data
    retention policies (rules)
  • Based on the business value of data and storage
    cost
  • HSM: Hierarchical Storage Management, based on
    data freshness. ILM goes one step further
  • Applying this concept to a Data Grid is very
    tricky, as different autonomous domains have
    different business rules
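
The difference between HSM and ILM placement can be made concrete with a small Python sketch. Everything in it is hypothetical: the tier names, the costs, the value thresholds and the 30-day freshness cut-off are invented for illustration and do not come from SRB or any ILM product.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    cost_per_gb: float  # hypothetical monthly cost
    min_value: int      # lowest business value that justifies this tier


# Hypothetical tiers, most expensive first.
TIERS = [
    Tier("disk", 0.10, 7),
    Tier("nearline", 0.03, 4),
    Tier("tape", 0.01, 0),
]


def ilm_place(business_value, tiers=TIERS):
    """ILM-style rule: the most expensive tier the data's value justifies."""
    for tier in tiers:
        if business_value >= tier.min_value:
            return tier.name
    raise ValueError("no tier accepts this value")


def hsm_place(days_since_access):
    """HSM-style rule, for contrast: only data freshness is considered."""
    return "disk" if days_since_access <= 30 else "tape"
```

In a data grid each autonomous domain would supply its own tier table, which is one way to picture why a single grid-wide policy is hard to state.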

134
Data Grid Triggers
  • Similar to triggers in databases
  • Based on ECA concepts
  • Event
  • Condition
  • Action
  • Example
  • Event: Insert new file in collection
    (/ourProject/data)
  • Condition: (color = blue, galaxy =
    Andromeda)
  • Action: Run ( selectiveDataReplicator.dgl )
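
The Event-Condition-Action example on this slide can be sketched as a small Python dispatcher. This is an illustration only, not SRB or DGL code: the `Trigger` class and `fire` function are hypothetical, the action merely returns a string instead of running a DGL document, and joining the two metadata tests with "and" is an assumption.

```python
class Trigger:
    def __init__(self, event, collection, condition, action):
        self.event = event            # e.g. "insert"
        self.collection = collection  # collection path the trigger watches
        self.condition = condition    # callable: metadata dict -> bool
        self.action = action          # callable: file path -> result


def fire(triggers, event, path, metadata):
    """Run the action of every trigger whose event, collection and
    condition all match, and collect the results."""
    results = []
    for t in triggers:
        in_collection = path.startswith(t.collection.rstrip("/") + "/")
        if t.event == event and in_collection and t.condition(metadata):
            results.append(t.action(path))
    return results


# The slide's example expressed with the hypothetical classes above.
replicate = Trigger(
    "insert",
    "/ourProject/data",
    lambda m: m.get("color") == "blue" and m.get("galaxy") == "Andromeda",
    lambda path: "run selectiveDataReplicator.dgl on " + path,
)
```

Keeping the condition and action as plain callables keeps the dispatcher itself independent of any particular metadata schema.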

135
Data Grid Language
  • Requirement
  • Data Grid ILM process
  • The long-running process that has to be run is
    described in DGL
  • Data Grid Triggers
  • Action part of the ECA (Event-Condition-Action)
    logic
  • Data Gridflows
  • Step-by-step execution of a long-running process
    on the Data Grid
  • Analogous to SQL in relational databases
  • Long-running procedures stored and executed in
    the Data Grid itself
  • Captures the Infrastructure Execution Logic

136
DGL Request
Annotations about the Data Grid Request
Can be either a Flow or a Status Query
137
DGL Requests (2 types)
  • Data Grid Flow
  • An XML structure that describes the execution
    logic, associated procedural rules and DGL
    variables. A flow can be synchronous or
    asynchronous
  • Status Query
  • An XML structure used to query the execution
    status of any gridflow or sub-flow at any level
    of granularity. Status queries can be made for
    both synchronous and asynchronous flows
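
The two request types can be pictured as two XML bodies inside a common request envelope. The Python sketch below builds such documents with the standard library. The element and attribute names (`request`, `flow`, `step`, `statusQuery`, `mode`, `requestId`, `subFlow`) are invented for this illustration and should not be read as the DGL schema.

```python
import xml.etree.ElementTree as ET


def flow_request(flow_name, steps, asynchronous=False):
    # Element and attribute names are invented for this sketch.
    req = ET.Element("request", {"mode": "async" if asynchronous else "sync"})
    flow = ET.SubElement(req, "flow", {"name": flow_name})
    for step in steps:
        ET.SubElement(flow, "step").text = step
    return req


def status_query(request_id, sub_flow=None):
    # Ask about a whole request, or narrow the query to one sub-flow.
    req = ET.Element("request")
    query = ET.SubElement(req, "statusQuery", {"requestId": request_id})
    if sub_flow is not None:
        query.set("subFlow", sub_flow)
    return req


def request_type(req):
    # In this sketch a request carries exactly one of the two bodies.
    return "flow" if req.find("flow") is not None else "statusQuery"
```

`ET.tostring(flow_request("replicate", ["copy", "verify"]))` serializes the request to bytes if the XML text itself is needed.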

138
Flow
Scoped Variables that can control the flow
Logic used by the sub-members
Sub-members that are the real execution statements
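
The three parts named on this slide (scoped variables, logic, sub-members) can be sketched as a small recursive structure in Python. This is a hypothetical model rather than the DGL implementation: the class names are invented, and a simple `repeat` count stands in for whatever control logic a real flow declares.

```python
class Step:
    """A leaf sub-member: the actual execution statement."""

    def __init__(self, name, run):
        self.name = name
        self.run = run  # callable that receives the visible variables

    def execute(self, scope, log):
        log.append((self.name, self.run(scope)))


class Flow:
    """A flow owns scoped variables, some logic, and sub-members
    (steps or nested flows)."""

    def __init__(self, name, variables=None, members=(), repeat=1):
        self.name = name
        self.variables = dict(variables or {})
        self.members = list(members)
        self.repeat = repeat  # stand-in for the flow's logic

    def execute(self, scope=None, log=None):
        log = [] if log is None else log
        # Inner variables shadow outer ones; the outer scope is unchanged.
        visible = {**(scope or {}), **self.variables}
        for _ in range(self.repeat):
            for member in self.members:
                member.execute(visible, log)
        return log
```

Because `Flow` and `Step` share the same `execute` signature, a flow can contain other flows to any depth, and each level sees its own variables on top of those it inherits.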
139
Flow Logic (How a flow executes)
140
DGL-Response
Responses can be synchronous or asynchronous
141
Tutorial Outline
  • Introduction to Data Grids
  • Data Grid Design Philosophies
  • SDSC Storage Resource Broker
  • Hands on Experience / Demo
  • Ongoing Activities or Suggested Research
  • Open Q&A and Discussion

142
Pop Quiz
  • What is a collection?
  • Autonomous Administrative Domain?
  • DataSet vs. File?
  • Why not use a database to store all files?
  • Extra credit: NFSv4 vs. DGMS?
  • Who wants to join iRODS?
  • What is going on in India for Grids?

143
We are SDSC SRB
Arun is here! (Shameless self-promotion)
Some have moved on with their career
144
(No Transcript)
145
It's time for your questions. There are no
dumb questions, only dumb answers, so please
contribute to the discussion.
146
An Introduction to Data Grid Management Systems
(DGMS)
  • Arun Jagatheesan
  • San Diego Supercomputer Center
  • University of California, San Diego

Tutorial at the 13th International Conference on
Management of Data (COMAD 2006) IIT-Delhi,
India December 14 - 16, 2006 arun_at_sdsc.edu