A Distributed Architecture for Multidimensional Indexing and Data Retrieval in Grid Environments

1
A Distributed Architecture for Multi-dimensional
Indexing and Data Retrieval in Grid Environments
  • Athanasia Asiki, Katerina Doka, Ioannis
    Konstantinou, Antonis Zissimos and Nectarios
    Koziris
  • National Technical University of Athens
  • School of Electrical and Computer Engineering
  • Computing Systems Laboratory
  • e-mail: {nasia, katerina, ikons, azisi,
    nkoziris}@cslab.ece.ntua.gr

2
Abstract
  • A service-oriented architecture of a generic
    middleware platform, which provides the required
    services for efficient content storage, search
    and retrieval in a distributed environment.
  • Algorithms from Peer-to-Peer computing are
    introduced in a grid environment to ensure
    scalability, fault tolerance and data
    availability despite node arrivals and
    departures
  • The system consists of heterogeneous resources
    belonging to different Virtual Organizations

3
Outline
  • Introduction
  • Challenges and requirements
  • Overall architecture
  • A multidimensional indexing scheme
  • The Distributed Replica Location Service
  • GridTorrent protocol

4
A brief introduction
  • Grid computing
  • Remotely located, disjoint and diverse processing
    and data storage facilities are integrated under
    a common software architecture (middleware)
  • Resources are connected to a shared network and
    provide the necessary software-level services to
    be remotely used and administered
  • Heterogeneity of resources
  • Rules and policies define the sharing of
    resources
  • P2P computing
  • Oriented towards the sharing of large amounts of
    data
  • Files are stored in a dynamic set of peers, which
    may join or leave the network
  • Absence of centralized structures

5
Our approach
  • Main Idea
  • Exploitation of Peer-to-Peer techniques to
    build a service-oriented infrastructure for
    data management and search
  • Features of the proposed system
  • A powerful metadata search mechanism supporting
    both point and range queries
  • A data transfer mechanism for efficient storage
    and retrieval of data in distributed and
    heterogeneous resources
  • A distributed Replica Location Service to keep
    track of file locations
  • Motivation
  • The design of the proposed architecture has
    been largely motivated by the requirements posed
    by the Gredia research project
    (http://www.gredia.eu/?Page=home)

6
Challenges and requirements
  • The data needs to be partitioned among the nodes
    according to a strategy ensuring
  • load balance
  • effective query processing
  • Absence of centralized structures to provide an
    overall view of the system
  • Efficient query routing is required so that
  • A small number of nodes process the query
  • The number of exchanged messages remains
    relatively small
  • Data locality shall be preserved, namely related
    data should be kept on the same node if feasible
  • New advanced features should not affect the
    maintenance cost of the overlay
  • The large size of data along with limitations in
    storage capacity and network bandwidth should be
    considered and a more flexible structure should
    be adopted
  • ⇒ Metadata and data will be stored in different
    overlays

7
Architecture (1)
  • Three different overlays will be implemented
  • Metadata overlay
  • DRLS overlay
  • Storage overlay
  • The Metadata overlay and the DRLS overlay are
    implemented with the required extensions to the
    Kademlia DHT
  • The Storage overlay comprises a distributed
    repository
  • Kademlia DHT
  • PING, STORE, FIND_NODE and FIND_VALUE RPCs
  • A hash function is used to assign keys to values
    stored in the DHT
  • Distance between points in the key space is
    defined by an XOR metric
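
The XOR metric can be illustrated with a minimal Python sketch. The 160-bit identifier length and the use of SHA-1 follow the original Kademlia design and are assumptions here, since the slides do not state them; `key_for`, `xor_distance` and `closest_nodes` are hypothetical helper names, not part of the presented system.

```python
import hashlib

ID_BITS = 160  # assumption: identifier length of the original Kademlia design


def key_for(value: bytes) -> int:
    """Hash a value to an integer key (SHA-1 is an assumption, not from the slides)."""
    return int.from_bytes(hashlib.sha1(value).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: the bitwise XOR of two identifiers, read as an integer."""
    return a ^ b


def closest_nodes(node_ids, key, k=3):
    """Return the k node identifiers closest to `key` under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key))[:k]


if __name__ == "__main__":
    nodes = [key_for(f"node-{i}".encode()) for i in range(8)]
    key = key_for(b"some metadata file")
    for n in closest_nodes(nodes, key):
        print(f"{n:040x}")
```

A STORE or FIND_VALUE request for a key would be directed at the nodes returned by such a closest-node selection.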

8
Architecture (2)
  • A data file is described by a predefined set of
    attributes included in its metadata file
  • Upload a file
  • The file is assigned a unique identifier
  • The unique identifier is included in the
    metadata file
  • The data file is uploaded to the Storage
    Overlay by the GridTorrent mechanism
  • The metadata file is inserted in the Metadata
    Overlay
  • The physical location(s) where the file is
    stored is (are) inserted in the DRLS overlay
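
The upload steps can be sketched in Python as follows. Each overlay is replaced by an in-memory dictionary and the GridTorrent transfer by a plain assignment; the `upload` function, the dictionary names, the example attributes and the use of a random UUID as unique identifier are all illustrative assumptions.

```python
import uuid

# Hypothetical stand-ins for the three overlays (plain dictionaries).
storage_overlay = {}   # physical location -> file contents
metadata_overlay = {}  # unique identifier -> metadata (attributes)
drls_overlay = {}      # unique identifier -> list of physical locations


def upload(data: bytes, attributes: dict, location: str) -> str:
    """Walk through the upload steps for one data file."""
    # 1. The file is assigned a unique identifier (a random UUID is an assumption).
    file_id = uuid.uuid4().hex
    # 2. The identifier is included in the metadata file.
    metadata = dict(attributes, file_id=file_id)
    # 3. The data file goes to the Storage overlay (GridTorrent in the real system).
    storage_overlay[location] = data
    # 4. The metadata file is inserted in the Metadata overlay.
    metadata_overlay[file_id] = metadata
    # 5. The physical location is registered in the DRLS overlay.
    drls_overlay.setdefault(file_id, []).append(location)
    return file_id


if __name__ == "__main__":
    fid = upload(b"payload", {"author": "alice", "year": 2008}, "site-a:/data/f1")
    print(fid, metadata_overlay[fid], drls_overlay[fid])
```

In this sketch the metadata is keyed by the file identifier for brevity; in the presented system the key of a metadata file is derived from its indexed attributes.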

9
Architecture (3)
  • Search procedure
  • The search mechanism is applied in the Metadata
    overlay
  • A user can search the metadata files according to
    his / her criteria and select the data file(s) of
    interest
  • The physical location(s) of the data file
    replicas are returned by the DRLS overlay
  • The GridTorrent protocol downloads a file
    exploiting sharing properties in order to boost
    aggregate performance

10
Space Filling Curves
  • Continuously map a compact interval onto a
    d-dimensional space
  • Partitioning of the d-dimensional space into
    2^(kd) cells, which in turn are mapped through
    the Space Filling Curve to 2^(kd) points of a
    single dimension
  • Recursive nature (generation of the curve)
  • Preservation of locality: points that are close
    in the 1-dimensional space are mapped to points
    that are close together in the d-dimensional
    space
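
As a concrete illustration, the sketch below maps the 2^(kd) cells of a d-dimensional grid to one-dimensional keys. The slides do not name the curve used; the Z-order (Morton) curve is chosen here only because it is short to write down, and it preserves locality less tightly than, for example, a Hilbert curve. `morton_encode` and `morton_decode` are hypothetical names.

```python
def morton_encode(coords, k):
    """Interleave the k low bits of each of the d coordinates into one key.

    Every coordinate must lie in [0, 2**k); the key lies in [0, 2**(k*d)).
    """
    d = len(coords)
    for c in coords:
        if not 0 <= c < (1 << k):
            raise ValueError(f"coordinate {c} does not fit in {k} bits")
    key = 0
    for bit in range(k):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * d + dim)
    return key


def morton_decode(key, d, k):
    """Inverse of morton_encode: recover the d coordinates of a key."""
    coords = [0] * d
    for bit in range(k):
        for dim in range(d):
            coords[dim] |= ((key >> (bit * d + dim)) & 1) << bit
    return tuple(coords)


if __name__ == "__main__":
    # d = 2 dimensions, k = 2 bits per dimension: 2^(kd) = 16 cells.
    for y in range(4):
        print([morton_encode((x, y), 2) for x in range(4)])
```

With d = 2 and k = 2 the sixteen cells receive the sixteen distinct keys 0 to 15, and each 2 x 2 quadrant occupies a contiguous run of four keys.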

11
Multidimensional indexing
  • The set of d attributes chosen to be indexed form
    a d-dimensional space
  • Each combination of attribute values is
    mapped to a point of the d-dimensional space
  • The key of a metadata file is produced by the
    Space Filling curve
  • Query processing
  • The clusters (contiguous segments) of the Space
    Filling curve that answer the query are determined
  • Lookups are issued for the resulting clusters
  • Load balance
  • Virtual server: an entity that owns an interval
    of the identifier space
  • Each physical node contains multiple virtual
    servers
  • When a physical node is overloaded with respect
    to available storage space or bandwidth, it may
    move one or more of its virtual servers to
    another, underloaded physical node
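
The query-processing step can be sketched as follows: a range query over the indexed attributes is translated into clusters, that is, maximal runs of consecutive curve keys, each of which can then be looked up. The sketch assumes a Z-order (Morton) curve, which the slides do not specify, and enumerates every cell of the query box by brute force, which is only practical for small examples. `morton_encode` and `query_clusters` are hypothetical names.

```python
from itertools import product


def morton_encode(coords, k):
    """Interleave the k low bits of each coordinate (Z-order curve, an assumption)."""
    d = len(coords)
    key = 0
    for bit in range(k):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * d + dim)
    return key


def query_clusters(ranges, k):
    """Translate a range query into clusters of consecutive curve keys.

    `ranges` holds one inclusive (low, high) pair per dimension.
    Returns a list of inclusive (first_key, last_key) intervals.
    """
    keys = sorted(
        morton_encode(cell, k)
        for cell in product(*(range(lo, hi + 1) for lo, hi in ranges))
    )
    clusters = []
    for key in keys:
        if clusters and key == clusters[-1][1] + 1:
            clusters[-1][1] = key        # extend the current cluster
        else:
            clusters.append([key, key])  # start a new cluster
    return [tuple(c) for c in clusters]


if __name__ == "__main__":
    # Two attributes, 2 bits each: first attribute in [0, 3], second equal to 0.
    print(query_clusters([(0, 3), (0, 0)], k=2))
```

A point query is the special case in which every range has equal bounds and a single one-key cluster results.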

12
DRLS (1)
  • A Distributed Replica Location Service
  • A DHT whose inherent key-value pairs map the
    unique identifier of a file to its Physical
    File Names (PFNs)
  • Problems
  • DHTs store read-only files, while mutable
    data cannot be handled
  • The replication strategy results in key-value
    pairs being stored on nodes that are close to the
    ID of the key and cached around the network
  • The exact location(s) of a key-value pair at a
    given moment cannot be returned and the update of
    all replicas is not ensured
  • PFN mappings for a given Logical File Name (LFN)
    change frequently

13
DRLS (2)
  • Solution
  • Every lookup queries all nodes responsible
    for a specific key-value pair
  • The lookup procedure does not stop at the first
    returned value; peers that do not reply with a
    value are considered not up to date
  • When all available results are returned, the
    querying node compares them using a predefined
    version vector (indicating the latest update of
    the value)
  • Updates are propagated to the nodes found to be
    responsible for storage but not yet up to date
    with the latest value
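
The lookup-and-repair procedure can be sketched as below. For brevity the version vector is reduced to a single integer counter, which is a simplification and not the scheme of the slides; the `Node` class and the `lookup` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node responsible for some keys; stores key -> (version, pfns)."""
    name: str
    store: dict = field(default_factory=dict)

    def find_value(self, key):
        return self.store.get(key)  # None if this node has no value

    def store_value(self, key, version, pfns):
        current = self.store.get(key)
        if current is None or current[0] < version:
            self.store[key] = (version, list(pfns))


def lookup(responsible_nodes, key):
    """Query every responsible node, pick the newest value, repair stale nodes."""
    # Do not stop at the first answer: collect the reply of every node.
    replies = [(node, node.find_value(key)) for node in responsible_nodes]
    answered = [reply for _, reply in replies if reply is not None]
    if not answered:
        return None
    # Compare the results by version (integer counter, a simplification).
    latest = max(answered, key=lambda reply: reply[0])
    # Propagate the update to nodes that are missing or behind.
    for node, reply in replies:
        if reply is None or reply[0] < latest[0]:
            node.store_value(key, *latest)
    return latest[1]


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    a.store_value("file-42", 2, ["gtp://site-a/f", "gtp://site-b/f"])
    b.store_value("file-42", 1, ["gtp://site-a/f"])
    print(lookup([a, b, c], "file-42"))
    print(b.store["file-42"], c.store["file-42"])
```

After one lookup all three nodes in the example hold version 2 of the mapping.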

14
The data transfer mechanism of the Storage overlay
  • Implementation of BitTorrent designed to
    interface and integrate with GridFTP
  • A GridTorrent client is able to request file
    fragments from other GridTorrent clients holding
    the file or connect to GridFTP servers
  • Effectiveness and scalability, even under extreme
    load and flash crowd conditions
  • Optimized replica selection based on the
    rarest-first policy
  • BitTorrent
  • A peer-to-peer protocol that allows clients to
    download files from multiple sources while
    uploading them to other users at the same time
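
A rarest-first choice can be sketched as follows: among the pieces the client still lacks, it picks the one held by the fewest peers. The function name `rarest_first` and its data layout are assumptions, and ties are broken by the lowest piece index to keep the sketch deterministic, whereas BitTorrent clients usually break ties at random.

```python
from collections import Counter


def rarest_first(have, peer_pieces):
    """Pick the missing piece that the fewest peers hold.

    `have` is the set of piece indices already downloaded;
    `peer_pieces` maps a peer name to the set of piece indices it holds.
    Returns (piece_index, peers_holding_it), or None if nothing is left.
    """
    availability = Counter()
    for pieces in peer_pieces.values():
        availability.update(pieces)
    candidates = [p for p in availability if p not in have]
    if not candidates:
        return None
    # Fewest holders first; the lowest index breaks ties (an assumption).
    piece = min(candidates, key=lambda p: (availability[p], p))
    holders = sorted(peer for peer, pieces in peer_pieces.items() if piece in pieces)
    return piece, holders


if __name__ == "__main__":
    peers = {"A": {0, 1, 2}, "B": {1, 2}, "C": {2, 3}}
    print(rarest_first({0}, peers))
```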

15
GridTorrent's basic components

16
GridTorrent
  • Queries the DRLS periodically
  • The DRLS returns to the GridTorrent peer
  • PFN of replicas identified by the GridTorrent URL
  • gtp://site.fully.qualified.domain.name/path/to/file
  • File size, hashes of pieces, size of each piece
  • GridTorrent client
  • The two involved peers initiate communication by
    exchanging the BitTorrent bitfield message,
    informing each other of the pieces they possess
  • A request message for blocks is issued
  • GridFTP server
  • The client issues a GridFTP partial get message
    for the data within the specific block it intends
    to download
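
The two download paths can be illustrated with a short sketch. The bitfield layout follows the standard BitTorrent wire format (4-byte big-endian length prefix, message id 5, most significant bit of the first byte for piece 0). The byte-range helper only shows which offset and length a partial get would have to cover for one block; the function names and the piece and block sizes in the example are assumptions, and no actual GridFTP call is made.

```python
import struct

BITFIELD_ID = 5  # message id of "bitfield" in the BitTorrent wire protocol


def encode_bitfield(pieces, num_pieces):
    """Build a bitfield message advertising the given piece indices."""
    payload = bytearray((num_pieces + 7) // 8)
    for i in pieces:
        payload[i // 8] |= 0x80 >> (i % 8)  # high bit of byte 0 is piece 0
    return struct.pack(">IB", 1 + len(payload), BITFIELD_ID) + bytes(payload)


def decode_bitfield(message, num_pieces):
    """Return the set of piece indices a peer advertises."""
    length, msg_id = struct.unpack(">IB", message[:5])
    payload = message[5:]
    if (msg_id != BITFIELD_ID or length != 1 + len(payload)
            or len(payload) != (num_pieces + 7) // 8):
        raise ValueError("not a well-formed bitfield message")
    return {i for i in range(num_pieces) if payload[i // 8] & (0x80 >> (i % 8))}


def block_byte_range(piece_index, piece_size, block_offset, block_size, file_size):
    """Offset and length within the file that a partial get for one block covers."""
    start = piece_index * piece_size + block_offset
    length = min(block_size, file_size - start)  # the last block may be shorter
    if length <= 0:
        raise ValueError("block lies outside the file")
    return start, length


if __name__ == "__main__":
    msg = encode_bitfield({0, 3, 9}, num_pieces=10)
    print(msg.hex(), sorted(decode_bitfield(msg, 10)))
    print(block_byte_range(2, 1024, 512, 256, file_size=3000))
```

The same block request can therefore be served either by a GridTorrent peer that advertised the piece in its bitfield or by a GridFTP server through a partial transfer of the computed byte range.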

17
Conclusions
  • The paper proposes a service-oriented
    architecture for efficient search and retrieval
    of annotated content
  • Different overlays are implemented and
    Peer-to-Peer techniques are introduced in the
    Grid environment
  • The design of the platform ensures scalability
    and fault-tolerance
  • An extensible architecture is presented favoring
    the integration with other systems and the
    development of Grid applications

18
  • Thank you for your attention