
Data Grid Components
  • Adam Belloum
  • Computer Architecture Parallel Systems group
  • University of Amsterdam

The Problem (application pull)
  • A new class of applications is emerging in
    different domains, involving huge collections
    of data that are geographically distributed
    and owned by different collaborating organizations
  • Examples of such applications:
  • Large Hadron Collider at CERN (2005)
  • Climate Modeling

Requirements of this emerging class of applications
  • Efficient data transfer service
  • Efficient data access service
  • Reliability and security
  • Possibility to create and manage multiple copies
    of the data

Distributed Application
Requested Data Management System?
Services specific to the Data Grid
Low-level services (shared with other Grid components)
Services shared with other Grid components
the Data Grid is then …
  • The Data Grid is the infrastructure that provides
    the services required for manipulating
    geographically distributed, large collections of
    measured and computed data
  • security services
  • replica services
  • data transfer services
  • etc.
  • Design Principles
  • Mechanism neutrality
  • Compatibility with Grid infrastructure
  • Uniformity of information infrastructure

The Data Grid Architecture
High Level Components
Core Services
Data Grid Specific services
Generic Grid services
Replica services for Data Grid
  • Possibility to create multiple copies of the data
  • Efficient and reliable management of the replicas
  • Efficient replication strategy → Replica Management Service
  • Location of the replicas → Replica Location Service
  • Coherence of the replicas → Replica Consistency Service
  • Selection of the replica → Replica Selection Service
  • Secure replication mechanisms

Transfer services for Data Grid
  • Fast mechanisms for large data transfer
  • Reliable transfer mechanisms
  • Secure transfer mechanisms
  • GridFTP

Security services for Data Grid
  • Authentication
  • Who can access or view the data?
  • Authorization
  • Who is authorized to effectively use the data?
  • Accounting
  • The users may be charged for using the data

Replica Management for Data Grid
  • Introduction to the Replica Management
  • Replication Strategies
  • Simple strategies
  • Dynamic model driven strategies
  • Replica catalogue
  • Replicas location
  • Data derivation

Replica Management Problem
  • When a request for a large file is issued, a
    considerable amount of bandwidth is required to
    serve it. The bandwidth available at that time
    therefore has a direct impact on the latency of
    access to the requested file.
  • Replicate files close to their potential users
  • (in other domains this is called caching)

What is the replica manager?
  • It is a Grid service responsible for creating
    complete and partial copies of datasets
    (mainly collections of files)
  • Grid data model
  • Datasets are stored in files grouped into collections
  • A replica
  • is a subset of a collection that is stored on a
    particular physical storage system

The role of the replica manager service
  • Its purpose is to map a logical file name to a
    physical name for the file on a specific storage
    system
  • Note: it does not use any semantic information
    contained in the logical file names

Services relevant to the Replicas Manager
Particle Physics applications, climate modeling
application, etc.
Replica Mgmt service
Replica Selection Service
Metadata Services
Distributed Catalog Service
Information Services
Storage Mgmt protocols
Catalog Mgmt protocols
Network Mgmt protocols
Compute Mgmt protocols
Communications, service discovery (DNS),
authentication, delegation, …
Storage Systems
Compute Systems
Replica catalog
Metadata Catalog
Framework of the replica manager service
  • Separation of replication and metadata
  • Only the information needed for mapping logical
    file names to physical locations is stored
  • Replication semantics
  • The replicas are not guaranteed to be coherent,
  • and the information on the original copy is not saved
  • Replica management consistency
  • The replica manager is able to recover and
    return to a consistent state after a failure

Replicas Management Targets
  • Replica management should answer the following questions:
  • Which files should be replicated?
  • static files, large files, …
  • When should a replica be created?
  • for frequently accessed files, …
  • Where should the replicas be located?
  • Close to users, fast storage systems, …

Replication Strategy
How should I replicate the data?
Simple Dynamic Replication Strategies
  • Best Client
  • Files are replicated at the node where they are
    most requested
  • Cascading replication
  • Replicas are created each time a request
    threshold is reached, starting from the original
    node (root) and following the hierarchy of nodes
  • Plain caching
  • Files are stored locally at the client side
  • Fast Spread
  • Files are stored on each node of the path to the client
  • Caching plus cascading replication
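As a toy illustration, the threshold rule behind the "best client" strategy above can be sketched as follows (the class and attribute names are ours, not from the slides):

```python
from collections import Counter

class BestClientReplicator:
    """'Best client' sketch: when the request count for a file from one
    node crosses a threshold, a replica is placed at that node."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.requests = Counter()   # (file, node) -> request count
        self.replicas = set()       # (file, node) pairs holding a copy

    def record_request(self, f, node):
        self.requests[(f, node)] += 1
        if self.requests[(f, node)] >= self.threshold:
            # this node is the file's "best client": replicate there
            self.replicas.add((f, node))

r = BestClientReplicator(threshold=2)
r.record_request("events.dat", "amsterdam")
r.record_request("events.dat", "amsterdam")
print(("events.dat", "amsterdam") in r.replicas)  # True
```

Cascading replication would apply the same threshold test at each level of the node hierarchy instead of only at the requesting node.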

Dynamic Model-Driven Replication
  • The decisions of whether to replicate a file and
    where to locate the replicas are made according to
    a performance model that compares the costs and
    benefits of creating replicas of a particular
    file at certain locations
  • Single-system stability
  • Transfer time between nodes
  • Storage cost
  • Accuracy of the replica location mechanism
  • Etc.

Dynamic Model-Driven Replication
  • The model-driven approach tries to answer three
    critical questions:
  • What is the optimal number of replicas for a
    given file?
  • Which is the best location for the replicas?
  • When does a file need to be replicated?

number of replicas for a file
  • Is defined given a certain required availability
  • Proposed model
  • RLacc × (1 − (1 − p)^r) > Avail
  • Where
  • p is the probability that a node is up
  • RLacc is the accuracy of the location mechanism
  • Avail is the needed availability
  • r is the number of replicas
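The inequality can be solved numerically by searching for the smallest r that satisfies it; a minimal sketch (the function name and search bound are our assumptions):

```python
def min_replicas(p, rl_acc, avail, max_r=1000):
    """Smallest replica count r with RLacc * (1 - (1-p)^r) > Avail.

    p      -- probability that a single node is up
    rl_acc -- accuracy of the replica location mechanism (0..1)
    avail  -- required availability (0..1)
    """
    # Even with infinitely many replicas the expression tends to rl_acc,
    # so a target at or above rl_acc is unreachable.
    if rl_acc <= avail:
        return None
    for r in range(1, max_r + 1):
        if rl_acc * (1 - (1 - p) ** r) > avail:
            return r
    return None

# 90%-reliable nodes, a 95%-accurate locator, a 90% availability target:
print(min_replicas(p=0.9, rl_acc=0.95, avail=0.9))  # 2
```

Note how the location-mechanism accuracy caps the achievable availability regardless of how many replicas exist.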

Best location for the replicas
  • A query to the Discovery service returns a number
    of nodes (candidates for replication) which
  • don't contain a copy of the file,
  • have available storage,
  • and have a reasonable response time
  • The best candidates should maximize the
    difference between
  • the replication benefit (as high as possible) and
  • the replication costs (as low as possible)

Best location for the replicas
  • Proposed Model
  • Cost: S(F, N2) + trans(F, N1, N2)
  • Where
  • N1 is the node that currently contains the file
  • N2 is the candidate node for a new replica
  • S(F, N) is the storage cost for a file F at node N
  • trans(F, a, b) is the transfer cost between locations
    a and b
  • The benefit of creating a replica is
  • trans(F, N1, User) − trans(F, N2, User)
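The cost/benefit comparison can be sketched as follows (the function names and the toy cost functions below are our assumptions, not part of the model itself):

```python
def replication_gain(storage_cost, trans, f, n1, n2, user):
    """Net gain of placing a replica of file f at candidate node n2.

    cost    = S(f, n2) + trans(f, n1, n2)          -- store it and ship it there
    benefit = trans(f, n1, user) - trans(f, n2, user)  -- cheaper future access
    """
    cost = storage_cost(f, n2) + trans(f, n1, n2)
    benefit = trans(f, n1, user) - trans(f, n2, user)
    return benefit - cost

def best_candidate(candidates, storage_cost, trans, f, n1, user):
    """Pick the candidate node maximizing benefit minus cost."""
    return max(candidates,
               key=lambda n2: replication_gain(storage_cost, trans, f, n1, n2, user))
```

With toy costs (storage = 1 everywhere, transfer = 5 between distinct nodes, 0 otherwise), placing the replica at the user's own node wins, as expected.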

Replicas management recovery
  • At least two functions are required to
    restart the replica manager after a failure:
  • restart
  • rollback

Drawback of decentralized approach
  • Because every node takes its replication
    decisions independently, multiple nodes can end
    up replicating the same file at the same time.
  • …..

Replica Catalog
How do I keep track of the replicas?
Create a catalog
The Replica catalog
  • The replica catalog is a key component of the
    replica management service: it provides the
    mapping between logical and physical entities.
    The replica catalog registers three types of
    entries:
  • Logical collections represent a number of
    logical file names
  • Locations map a logical collection to a
    particular physical instance of that collection
  • Logical files represent a unique logical file name

The Replica catalog
[Diagram: a logical collection entry lists filenames; a location entry
maps them to a protocol, hostname, and path; individual logical file
entries hang off the collection]
Operation allowed on the Replica catalog
  • Publish (File_publish)
  • Copies a file from a storage system not
    registered in the replica catalogue to a
    registered storage system and updates the replica
    catalogue.
  • Copy (File_copy)
  • Copies a file from a registered storage system to
    another registered storage system and updates the
    replica catalogue. It creates the replicas.
  • Delete (File_delete)
  • Deletes a filename from a replica catalogue
    location entry and optionally removes the file
    from the registered storage system.
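A toy in-memory version of these three catalogue operations might look like this (the class and method names are illustrative; a real replica catalogue is a distributed service, not a Python dict):

```python
class ReplicaCatalog:
    """Toy replica catalogue: maps physical locations to the logical
    file names (LFNs) they hold, mirroring publish/copy/delete."""
    def __init__(self):
        self.locations = {}   # location -> set of LFNs stored there

    def publish(self, lfn, location):
        """File_publish: register the file's first catalogued replica."""
        self.locations.setdefault(location, set()).add(lfn)

    def copy(self, lfn, src, dst):
        """File_copy: replicate between registered stores, update catalogue."""
        if lfn not in self.locations.get(src, set()):
            raise KeyError(f"{lfn} not registered at {src}")
        self.locations.setdefault(dst, set()).add(lfn)

    def delete(self, lfn, location):
        """File_delete: drop the catalogue entry (physical removal optional)."""
        self.locations.get(location, set()).discard(lfn)

    def lookup(self, lfn):
        """All registered locations holding a replica of this LFN."""
        return [loc for loc, files in self.locations.items() if lfn in files]

cat = ReplicaCatalog()
cat.publish("lfn://run42/events.dat", "siteA")
cat.copy("lfn://run42/events.dat", "siteA", "siteB")
print(sorted(cat.lookup("lfn://run42/events.dat")))  # ['siteA', 'siteB']
```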

I want to apply an astronomical analysis program
to millions of objects. If the program has
already been run and the results stored, I'll
save weeks of computation.
I want to find those results that I computed last
month, and the details of how I generated them.
Data Derivation System
I've detected a calibration error in an
instrument and want to know which derived data
to re-compute.
A virtual Data System for Representing, Querying,
and Automating Data derivation
  • The analysis of data obtained from scientific
    instruments is a significant community activity.
    As a result of this activity, communities
    construct, in a collaborative fashion,
    collections of derived data, with relationships
    between data objects corresponding to the
    computational procedures used to derive one from
    another
  • Recording and discovering these relationships can
    be important

[Diagram: data objects are produced or consumed by derivations; a
derivation is created by the execution of a transformation]
A virtual Data System for Representing, Querying,
and Automating Data derivation
  • More generally, we want to be able to track how
    data products are derived with sufficient
    precision that we can
  • create and/or re-create data products from this
    information
  • explain definitively how data products are derived
  • This can be done by implementing a new class of
    virtual data management operations that
  • re-materialize data products that were deleted,
  • generate data products that were defined but
    never created,
  • regenerate data when data dependencies or
    transformation programs change
  • create replicas of data products at remote
    locations when creation is more efficient than
    data transfer.

Basic components of the virtual Data System
  • A transformation is an executable program,
    associated with information that might be used to
    characterize, locate, and invoke it
  • transformation arguments are formal parameters
  • A derivation represents an execution of a
    transformation. Associated with a derivation are
  • the name of the associated transformation,
  • the names of the data objects to which the
    transformation is applied.
  • The arguments to a derivation are actual parameters
  • A data object is a named entity that may be
    consumed or produced by a derivation

Virtual Data System Architecture
  • Information about transformations and derivations
    can be
  • declared explicitly by the user,
  • extracted automatically from a job control script,
  • produced by higher-level job creation interfaces
    such as portals,
  • and/or created by monitoring job execution
    facilities and file accesses.
  • Information can be recorded in the virtual data
    system at various times and for various purposes.
  • Transformation entries generated before
    invocation can be used to locate transformations
    and guide execution.
  • Derivation entries generated before jobs are
    executed can provide information needed to
    generate a file.
  • Entries generated after a job is executed record
    how to regenerate a file.
  • Example Chimera

Chimera Virtual Data System
Virtual Data Applications
Task Graphs (compute and data movement tasks,
with dependencies)
Virtual Data Language (definition and query)
VDL Interpreter (manipulates derivations)
Data Grid Resources (distributed execution and
data management)
Virtual Data Catalog (implements the
Chimera Virtual Data Schema)
Chimera Virtual Data Schema
  • A transformation may have more than one
    derivation, each supplying different values for the
    formal parameters
  • A derivation may be applicable to more than one
    transformation. Thus, versioning allows for a
    range of valid transformations to apply,
    increasing the degrees of freedom for schedulers
    to choose the most applicable one.
  • An ACTUALARG relates to a derivation. Its value
    captures either the LFN or the value of a
    non-file parameter.
  • A FORMALARG may contain an optional default
    value, captured in a similar fashion by the same
    VALUE class.
  • The VALUE class is an abstract base class for
    either single value (SCALAR) or a list of similar
    values (LIST), which are collapsed union-fashion
    into a single table.
  • The relationships between a transformation and
    its formal parameters, on the one hand, and a
    dependent derivation and its actual parameters,
    on the other, are not independent of each other.
  • Each instantiation of an actual parameter maps
    to exactly one formal parameter describing the
    entry. The binding is created using the argument
    name, not its position in the argument list.
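The name-based binding of actual to formal parameters described above can be sketched as follows (the helper name and default handling are our assumptions about the schema's behaviour):

```python
def bind_arguments(formal_args, actual_args, defaults=None):
    """Bind a derivation's actual parameters to a transformation's
    formal parameters by NAME (never by position), falling back to a
    formal parameter's declared default value when no actual is given."""
    defaults = defaults or {}
    bound = {}
    for name in formal_args:
        if name in actual_args:
            bound[name] = actual_args[name]        # actual value wins
        elif name in defaults:
            bound[name] = defaults[name]           # optional FORMALARG default
        else:
            raise ValueError(f"no value or default for formal parameter {name!r}")
    return bound

print(bind_arguments(["input_lfn", "cutoff"],
                     {"input_lfn": "lfn://run42/raw"},
                     defaults={"cutoff": 1.0}))
# {'input_lfn': 'lfn://run42/raw', 'cutoff': 1.0}
```

Because binding is by name, reordering the actual arguments of a derivation has no effect, which is what gives schedulers freedom to match derivations against a range of transformation versions.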

Replica Location Service
Where did I put the replicas?!
You need a replica location service
Replica Location Service
Application-Oriented Data Services
Data Management Services
Reliable Replication Services
Metadata Service
Replica Location Service
File Transfer Service
Role of the replica location service
  • The main task of the replica location service
    is to find a specified number of Physical File
    Names (PFNs) given a Logical File Name (LFN)

Distributed, Adaptive Replica Location Service
Replica Location Node: query based on LFN,
query forwarding, digest distribution,
overlay network, soft state
Storage sites: register/delete ⟨LFN, PFN⟩ pairs
Distributed, Adaptive Replica Location Service
  • Overlay network
  • separates addressing at layer 3 and layer 2,
    default routing paths, and shortcuts between edge
    nodes
  • Compressed probabilistic representation
  • reduces the search space of potential matches
  • Soft-state mechanisms
  • decouple node states: a state producer sends its
    state over a lossy communication channel, and
    the receiver removes it after a timeout
  • Functional requirements
  • Autonomy of the Replica Location Nodes
  • Best-effort consistency: the Replica Location
    Nodes might have an incomplete or outdated view
    of the system.
  • Adaptiveness

Replica location using small-world Models
  • It turns out that data sharing for grid-based
    applications follows a small-world pattern
  • A small average path length
  • A large clustering coefficient
  • examples of applications having the
    characteristics of a small world can be found in
    multiple domains, including physics, biomedical
    research, mathematics, etc.

[Figure: graphs showing the relationships among components in a
digital circuit (top) and a TV digital circuit (bottom)]
Characteristics of data-sharing in a small-world
  • Group locality
  • People tend to work in groups even when they are
    geographically distributed
  • Time locality
  • A user may request a file a number of times within
    a short period of time
  • Question: how can we exploit the small-world
    property to boost the performance of data sharing?

Location of files in small-world networks
  • Combines information dissemination techniques
    with request-forwarding search mechanisms
  • location information is propagated aggressively
    within the cluster
  • it is likely (high probability) that all nodes
    will learn about each other within the cluster
    (in a small world of C clusters, each composed of
    G nodes, the search space is reduced from C·G to
    C), e.g., using a gossip protocol
  • To reduce the storage space needed to keep all
    this information, a Bloom filter is most often used
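A minimal Bloom filter, illustrating the compact probabilistic membership test mentioned above (the bit-array size and hashing scheme here are arbitrary choices; real deployments tune both to the expected number of entries):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a compact, probabilistic set.
    Membership tests may yield false positives but never false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

digest = BloomFilter()
digest.add("lfn://experiment/run42/events.dat")
print("lfn://experiment/run42/events.dat" in digest)  # True
```

Gossiping such digests instead of full file lists is what keeps the per-node storage small while still letting every node in the cluster answer "might this neighbour have the file?".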

Creating a small-world Network
  • Two approaches
  • Identify the characteristics of an existing network
  • the users' access patterns
  • the exploration protocol (ping-pong effect)
  • Start from a theoretical model that creates a
    small-world graph
  • starts from a highly clustered graph
  • and randomly adds or rewires edges to connect
    different clusters
  • The hands-off approach
  • The centralized approach
  • The agent-based approach

Giggle: a Scalable Location Service
Replica Location Index (RLI): answers queries about
multiple replica sites; holds sets of LFNs with
pointers to LRCs. Types: RLI indexing over the full
namespace, or RLI indexing over a subset of it.
Replica sites: Local Replica Catalog (LRC);
queries, local integrity,
security, state propagation
Requirements for a Replica Location Service
  • Read-only data and versioning
  • Size
  • Performance
  • Security
  • Consistency
  • Reliability

Giggle: A Replica Location Service
Requirements of the Giggle Framework (Replica
Location Index, RLI)
  • Secure remote access
  • The RLI must support authentication, integrity,
    and confidentiality, and implement local access
    control
  • State propagation
  • The RLI must accept periodic updates from the LRCs
  • Queries
  • The RLI should support queries for replicas by LFN
  • Soft state
  • A timeout is defined to remove any entry in the
    RLI which has not been updated by its LRC
  • Failure recovery
  • The RLI should not contain persistent replica state
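The soft-state rule, where an index entry silently expires unless its LRC refreshes it, can be sketched as follows (the class shape and timeout handling are our assumptions; a real RLI also batches and compresses updates):

```python
import time

class ReplicaLocationIndex:
    """Soft-state index sketch: each LRC update carries a timestamp;
    entries not refreshed within `timeout` seconds are silently expired,
    so the index never needs persistent state to recover from failure."""
    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.entries = {}   # lfn -> (lrc_address, last_update_time)

    def update(self, lfn, lrc, now=None):
        """Periodic state propagation from an LRC."""
        self.entries[lfn] = (lrc, now if now is not None else time.time())

    def lookup(self, lfn, now=None):
        """Return the LRC to ask about this LFN, or None if unknown/stale."""
        now = now if now is not None else time.time()
        entry = self.entries.get(lfn)
        if entry is None or now - entry[1] > self.timeout:
            self.entries.pop(lfn, None)   # expired: drop, LRC will re-send
            return None
        return entry[0]
```

After a crash, the RLI simply restarts empty and is repopulated by the next round of LRC updates, which is exactly why persistent replica state is not required.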

Replica Selection Service
You need a replica selection service
The problem of replica selection
  • An application that requires access to replicated
    data begins by querying a specific metadata
    repository. Once the logical file name has been
    identified and a number of replicas exist, the
    application, for performance reasons, requires
    access to the most appropriate replica (according
    to specific characteristics).
  • This task is achieved by the replica selection service.

The role of replica selection
  • Replica selection is the process of choosing
    a replica from among those spread across the Grid,
    based on characteristics specified by the
    application:
  • Access speed
  • Geographical location
  • Access costs
  • Etc.

A Data selection scenario
[Diagram, numbered flow: (1) attributes of the desired data go to the
Metadata Service; (2)/(3) logical file names are returned and passed to
the Replica Management Service; (4)/(5) the locations of one or more
replicas come back; (6) source and destination candidates go to the
Information Service; (7) performance measurements and predictions are
returned; (8) the location of the selected replica is delivered by the
Replica Selection Service]
How does the replica selection achieve its goals?
(1) Locate the replicas
(2) Get the capability and usage policy of
all the replicas
(3) Search for the replicas that match
the application's characteristics
Core Services
The replica selection
  • Two services are necessary to the replica
    selection service
  • Replica Management Service (high-level service)
  • provides the information on all existing replicas
  • Resource Management (core service)
  • provides the information on the characteristics
    of the underlying resources.

Metacomputing Directory Service: information
collection, publication, and access service
for grid resources
Storage resource Grid Resource Information
Server (GRIS): collects and publishes
system configuration metadata; security;
state propagation; dynamically generated data
Grid Index Information Service (GIIS):
registers GRISs; supports broad user queries
Replica catalog
Storage broker: search, match, access
You need a storage broker service
Matching Problem
  • The matching process depends on
  • the physical characteristics of the resources and
    the load of the CPUs, networks, and storage
    devices that are part of the end-to-end path
    linking possible sources and sinks
  • The matching process depends on factors which are
    very dynamic (they may change dramatically over time)
  • a predictor is needed to estimate future usage

Intelligent Matching process
  • Have the replica location service expose
    performance data about previous data transfers,
    which can be used to predict future behaviour
    between sites
  • Prediction of end-to-end system performance
  • create a model of each system component involved
    in the end-to-end data transfer (CPU, cache hits,
    disk access, network, …)
  • use observations of past applications from the
    entire system.

Collecting the observations
  • Tools: NWS, NetLogger, Web100, iperf, NetPerf
  • Experience has shown a substantial difference in
    performance between a small network probe (64 KB)
    and the actual data transfer (GridFTP)
  • From logs of past applications
  • The sporadic nature of large data transfers means
    that often no data is available about current
    conditions

Prediction data-transfer using regressive models
  • Preparing the data streams
  • Correlation
  • Regression Techniques
  • Matching
  • Filling-in techniques
  • Discard unaccounted data
  • Last Value Filling
  • Average filling
  • Temporal filters
  • Regression Models
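The last-value and average filling techniques listed above can be sketched as follows (the probe series is made up for illustration; discarding unaccounted data would simply drop the `None` gaps instead of filling them):

```python
def last_value_fill(series):
    """Fill gaps (None) with the most recently observed value."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out

def average_fill(series):
    """Fill gaps (None) with the mean of all observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed) if observed else None
    return [v if v is not None else mean for v in series]

# Hypothetical throughput probes in MB/s, with gaps between sporadic transfers:
probes = [12.0, None, 14.0, None, None, 10.0]
print(last_value_fill(probes))   # [12.0, 12.0, 14.0, 14.0, 14.0, 10.0]
print(average_fill(probes))      # [12.0, 12.0, 14.0, 12.0, 12.0, 10.0]
```

Aligning two such filled streams on a common time base is what makes the correlation and regression steps above possible despite the sporadic measurements.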

Data transfer services
Security services