1
Decentralized Data Management Framework for
Distributed Environments
  • Houda Lamehamedi
  • Computer Science Department
  • Rensselaer Polytechnic Institute

2
Data in Large Scale Computing
  • In an increasing number of scientific
    disciplines, large data collections are emerging
    as important community resources
  • data produced and collected at experiment sites
    e.g. high energy physics, climate modeling
  • processed data and analysis results
  • The geographical distribution of compute and storage resources results in complex and stringent performance demands

3
Data Management Requirements
  • Scientific collaborations in distributed
    environments generate queries involving access to
    large data sets
  • Efficient execution of these queries requires
  • careful management of large data caches,
  • gigabit-speed data transfer over wide-area networks,
  • creation, management, and strategic placement of
    replicas

4
Data Replication
  • The Globus Toolkit is a standard set of services supporting resource-sharing applications
  • Data management services offered:
  • GridFTP offers secure, efficient data transfer in Grid and distributed environments
  • Replica Catalog allows users to register files
  • Replica Location Service allows users to locate replicas
  • The system only provides users with tools to statically replicate data files
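To make that static workflow concrete, here is a minimal sketch (not part of the original slides) of what user-initiated replication might look like. The transfer is delegated to the real globus-url-copy GridFTP client; the catalog is just an in-memory stand-in for the Replica Catalog / Replica Location Service, and all hostnames and paths are made up.

import subprocess

# Illustrative stand-in for the Replica Catalog: logical file name -> physical locations.
replica_catalog = {}

def static_replicate(lfn, source_url, dest_url):
    """User-initiated (static) replication: copy the file with GridFTP,
    then record the new physical location under the logical file name."""
    # globus-url-copy is the GridFTP client shipped with the Globus Toolkit.
    subprocess.run(["globus-url-copy", source_url, dest_url], check=True)
    replica_catalog.setdefault(lfn, set()).update({source_url, dest_url})

# Hypothetical usage: replicate one data file to a second site.
static_replicate(
    "lfn://physics/run42.dat",
    "gsiftp://node80.cs.rpi.edu/data/run42.dat",
    "gsiftp://node2.cs.rpi.edu/cache/run42.dat",
)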

5
Data Management Issues
  • Existing data management frameworks, such as Data Grids, demand extensive administrative oversight and management overhead
  • Missing support for dynamic and intermittent participation in the Data Grid hinders scalable growth of collaborative research
  • Limited support for replication: data is statically replicated under user guidelines

6
Our Approach
  • To address these issues, we introduce a decentralized, performance-driven, adaptive replica management middleware that
  • Uses an overlay network to organize participating nodes
  • Dynamically adapts replica placement to changing user and network needs and behavior
  • Dynamically evaluates data access costs vs. performance gains before creating a new replica

7
Major Components
  • A theoretical model of data transfer cost and
    access performance
  • Parameterized by the changing computing
    environment
  • Data monitoring tools that feed current values of
    resource consumption to the cost function
  • Dynamic replica management services
  • Offer transparent replication using the cost
    function
  • Manage replica placement and discovery
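As a rough illustration of how these components might fit together (the function names, parameters, and formulas below are assumptions for the sketch, not the presentation's actual model): the monitoring tools supply current bandwidth, latency, and access counts; the cost model compares the one-time transfer cost against the expected savings from serving future accesses locally; and a replica is created only when the projected gain exceeds the cost.

def transfer_cost(file_size_mb, bandwidth_mbps):
    """Estimated one-time cost (seconds) of moving the file to this node."""
    return (file_size_mb * 8.0) / bandwidth_mbps

def expected_gain(remote_latency_s, local_latency_s, predicted_accesses):
    """Estimated time saved by serving predicted future accesses locally."""
    return (remote_latency_s - local_latency_s) * predicted_accesses

def should_replicate(stats):
    """Replicate only when the projected savings outweigh the transfer cost.
    `stats` is assumed to be supplied by the resource-monitoring service."""
    cost = transfer_cost(stats["file_size_mb"], stats["bandwidth_mbps"])
    gain = expected_gain(stats["remote_latency_s"],
                         stats["local_latency_s"],
                         stats["predicted_accesses"])
    return gain > cost and stats["free_space_mb"] >= stats["file_size_mb"]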

8
Middleware Architecture
  • Replica Management Layer: supports the management and transfer of data between Grid nodes and the creation of new replicas; uses input from the lower layers to track users' access patterns and monitor data popularity
  • Resource Access Layer: provides access to available resources and monitors their usage and availability; includes a Replica Catalog to support transparent access to data at each Grid node for local and remote users
  • Communication Layer: consists of the data transfer and authentication protocols used to ensure security, verify users' identities, and maintain data integrity; provides support for the overlay network structure
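The class skeleton below is only a sketch of this layering (the class and method names are invented for illustration); each layer builds on the one beneath it.

class CommunicationLayer:
    """Data transfer, authentication, and overlay connectivity."""
    def send(self, node, message): ...
    def transfer(self, source, destination, file_id): ...

class ResourceAccessLayer:
    """Local resource monitoring plus the per-node Replica Catalog."""
    def __init__(self, comm):
        self.comm = comm
        self.catalog = {}      # logical file name -> known replica locations
    def monitor(self): ...     # report storage, bandwidth, and access frequency

class ReplicaManagementLayer:
    """Tracks access patterns and decides when to create new replicas."""
    def __init__(self, access):
        self.access = access
    def handle_request(self, file_id): ...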
9
Framework
  • Services offered by the middleware
  • Resource Monitoring service: monitors resource availability and access frequency
  • Replica Creation service: creates replicas based on cost evaluation
  • Replica Location service: manages the local Replica Catalog
  • Resource Allocation service: allocates space for newly created replicas
  • Routing and Connectivity service: routes outgoing messages
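One way to picture how these services might cooperate on a single data access request (purely a sketch; the service objects and method names are assumptions):

def handle_access_request(file_id, services):
    """Hypothetical flow of one access request through the middleware services."""
    services.monitoring.record_access(file_id)               # update access frequency

    locations = services.location.lookup(file_id)             # consult the local Replica Catalog
    if not locations:
        locations = services.routing.forward_lookup(file_id)  # ask overlay neighbors

    if services.creation.worth_replicating(file_id):          # cost vs. gain evaluation
        if services.allocation.reserve(file_id):              # space for the new replica
            services.creation.replicate(file_id, locations[0])

    return locations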

10
Catalog Management
(Diagram: the Local Catalog at each Grid node maps file keys, e.g. file ID1, file ID2, ..., to replica locations in local storage or on remote nodes such as Node80.cs.rpi.edu. File registration and replica creation add catalog entries, file and replica deletion remove them, and the local catalog is replicated to a Remote Catalog.)
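A minimal sketch of the catalog operations the diagram implies (the dictionary layout and function names are assumptions):

# Local Replica Catalog: file key -> set of physical locations (local paths or remote node URLs).
local_catalog = {
    "fileID1": {"/data/file1", "node80.cs.rpi.edu/data/file1"},
}

def register_file(catalog, file_key, location):
    """File registration / replica creation: add a location under the file's key."""
    catalog.setdefault(file_key, set()).add(location)

def delete_replica(catalog, file_key, location):
    """Replica deletion: drop one location; remove the key once no replicas remain."""
    locations = catalog.get(file_key, set())
    locations.discard(location)
    if not locations:
        catalog.pop(file_key, None)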
11
Data Search and Replica Location
(Diagram: a data access request initiates a lookup in the local catalog/database; response processing then returns the file locations, e.g. a local path plus replicas on Node2.cs.rpi.edu and Node80.cs.rpi.edu.)
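Once the locations come back, response processing has to pick one replica to fetch from. The sketch below simply chooses the location with the lowest measured latency; the latency values are assumed to come from the monitoring service and are made up here.

def choose_replica(locations, latency):
    """Pick the replica location that currently responds fastest."""
    return min(locations, key=lambda loc: latency[loc])

# Hypothetical responses collected for one file ID:
locations = ["node2.cs.rpi.edu/data/file", "node80.cs.rpi.edu/data/file"]
latency = {"node2.cs.rpi.edu/data/file": 0.12, "node80.cs.rpi.edu/data/file": 0.04}
print(choose_replica(locations, latency))    # -> node80.cs.rpi.edu/data/file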
12
Data Model Construction
  • We use a combination of spanning tree and ring
    topologies
  • Grid Node Insertion
  • When joining the grid, a node is added through an
    existing grid node by attaching to it as a child
    node or a sibling
  • Node Removal
  • When a node leaves the tree, it sends a
    notification message to its parent, siblings, and
    children
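A toy version of this join/leave protocol under the stated tree-plus-ring topology (the data structures and method names are assumptions made for the sketch):

class GridNode:
    """Overlay node in the combined spanning-tree / sibling-ring topology."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.siblings = []          # ring of nodes sharing the same parent

    def join_as_child(self, contact):
        """Join the grid through an existing node, attaching as its child."""
        self.parent = contact
        self.siblings = list(contact.children)
        for s in contact.children:
            s.siblings.append(self)
        contact.children.append(self)

    def join_as_sibling(self, contact):
        """Join the grid by entering an existing node's sibling ring."""
        self.parent = contact.parent
        self.siblings = [contact, *contact.siblings]
        for s in self.siblings:
            s.siblings.append(self)
        if self.parent is not None:
            self.parent.children.append(self)

    def leave(self):
        """Notify parent, siblings, and children before departing."""
        for peer in (self.parent, *self.siblings, *self.children):
            if peer is not None:
                peer.on_departure(self)

    def on_departure(self, node):
        """Drop a departed neighbor from the local view."""
        if self.parent is node:
            self.parent = None
        if node in self.children:
            self.children.remove(node)
        if node in self.siblings:
            self.siblings.remove(node)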

13
Node Addition / Data Model Construction
After a node joins, it starts developing a list of preferred neighbors.
(Diagram: request flow and data flow between the joined nodes.)
14
Middleware Deployment
  • We used the two most popular and commonly used hierarchical distribution models
  • Bottom-up: multiple collection sites
  • Top-down: a single collection site
  • Experiments were conducted on a cluster of 40 Linux machines and a cluster of 20 FreeBSD workstations

15
Access Patterns
  • Data access requests are based on patterns commonly observed in scientific and data-sharing environments
  • Files are of similar sizes within an application
  • Access spikes are generated by new interesting files
  • Users' social organization and interests guide the overlay construction
  • Interest-based adaptive clustering of users
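A small sketch of how such a workload might be generated for the experiments (the Zipf-like popularity distribution and the spike weight are assumptions for illustration, not parameters stated in the slides):

import random

def access_stream(files, n_requests, new_file_at=None, spike_weight=5.0):
    """Yield a synthetic access pattern: skewed (roughly Zipf-like) popularity
    over existing files, with a spike when a new interesting file appears."""
    weights = [1.0 / rank for rank in range(1, len(files) + 1)]
    for i in range(n_requests):
        if new_file_at is not None and i == new_file_at:
            files = files + ["new_interesting_file"]   # the new file draws a burst of requests
            weights = weights + [spike_weight]
        yield random.choices(files, weights=weights, k=1)[0]

requests = list(access_stream([f"file{i}" for i in range(10)], 1000, new_file_at=500))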

16
Top Down Model
17
Bottom Up Model
18
Top Down Experiment Results
19
Bottom Up Experiment Results
20
Access Performance Evaluation
21
Conclusions
  • Cost-guided dynamic replication improves data access performance by up to 30% and a minimum of 10% compared to static, user-initiated replication
  • The combination of parameter selection for cost evaluation and resource availability plays a key role in influencing the performance of the system
  • Lower storage availability might lead to race conditions where popular data compete for storage space
  • The results also show that popular data files benefit the most from dynamic replication