1
Data Grid Technologies
  • Sathish Vadhiyar
  • Sources/Credits: technical papers listed in the
    references

2
Replica Strategies
3
Problem Motivation
  • Replication to deal with faults and provide
    scheduling flexibility.
  • Given a file that is partitioned into blocks that
    are replicated throughout a wide-area file
    system, how can a client retrieve the file with
    the best performance?
  • Various algorithms

4
Basic Downloading Algorithm
  • The client opens a thread to each server
    containing the file
  • A block size is chosen
  • Each thread selects a different block to download
    and all threads start downloading
  • A thread then chooses a new block that is
    currently not being downloaded by any other
    thread
  • Adaptive: servers with higher bandwidth to the
    client end up serving more blocks
  • Selection of block size - tricky
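The basic algorithm can be illustrated with a small event-driven simulation rather than real network threads. This is only a sketch: the function name, the server names, and the per-block times are hypothetical, and every block is assumed to take a fixed time on a given server.

```python
import heapq


def simulate_basic_download(num_blocks, seconds_per_block):
    """Simulate one download thread per server pulling blocks from a shared pool.

    seconds_per_block maps a (hypothetical) server name to the time that
    server needs to deliver one block. Returns a tuple of
    (blocks served per server, time at which the last block arrives).
    """
    served = {server: 0 for server in seconds_per_block}
    claimed = 0
    finish_time = 0.0
    # Each heap entry is (time at which the thread becomes free, server name).
    free_at = [(0.0, server) for server in sorted(seconds_per_block)]
    heapq.heapify(free_at)
    while claimed < num_blocks:
        now, server = heapq.heappop(free_at)
        # The free thread claims the next block nobody else is downloading.
        claimed += 1
        served[server] += 1
        done = now + seconds_per_block[server]
        finish_time = max(finish_time, done)
        heapq.heappush(free_at, (done, server))
    return served, finish_time
```

With one server three times faster than the other, the faster server ends up delivering most of the blocks without any explicit bandwidth measurement, which is the adaptive behavior described above.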

5
Aggressive Redundancy
  • To provide fault tolerance and to improve
    download time
  • A redundancy factor, R
  • The client downloads a block simultaneously from
    R servers
  • Only one copy is kept: whichever returns first
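A minimal sketch of racing R downloads of the same block, using Python threads. The fetch function only sleeps to imitate network delay; the names `fetch_block`, `download_redundantly`, and the `delays` table are hypothetical and not taken from the papers.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fetch_block(server, block_id, delays):
    """Stand-in for a network fetch: sleep for the server's delay."""
    time.sleep(delays[server])
    return server, f"block-{block_id}"


def download_redundantly(block_id, servers, delays, redundancy):
    """Request the block from the first `redundancy` servers, keep the first reply."""
    targets = servers[:redundancy]
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = [pool.submit(fetch_block, s, block_id, delays) for s in targets]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        # Slower requests are simply ignored here; leaving the `with` block
        # still waits for them to finish, which a real client would avoid.
        return next(iter(done)).result()
```

The cost of the approach is visible in the sketch: every block consumes bandwidth on R servers even though only one reply is used.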

6
Progress-Driven Redundancy
  • Retry a download only when it is progressing
    slowly
  • Progress number - P, redundancy factor R
  • Each block assigned a download number initialized
    to 0
  • When a thread attempts to download a block, it
    increments the block's download number

7
Progress-Driven Redundancy (Continued)
  • For selecting a new block to download:
  • If there is a block B whose download number < R,
    and if there are P blocks after B whose downloads
    have completed, then select B
  • Else select the next block whose download number
    is zero
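The selection rule can be sketched as follows. The list-based bookkeeping and the function name are assumptions made for illustration; in particular, the sketch assumes that a block whose download has already completed is never selected again.

```python
def claim_block(download_number, completed, P, R):
    """Pick the next block for a free thread, or return None if nothing is left.

    download_number[i] counts how many times block i has been attempted;
    completed[i] is True once block i has fully arrived.
    """
    n = len(download_number)
    choice = None
    # Rule 1: retry a lagging block B that has fewer than R attempts and
    # at least P completed blocks after it.
    for b in range(n):
        if completed[b] or download_number[b] >= R:
            continue
        finished_after = sum(1 for later in range(b + 1, n) if completed[later])
        if finished_after >= P:
            choice = b
            break
    # Rule 2: otherwise take the next block that nobody has attempted yet.
    if choice is None:
        for b in range(n):
            if download_number[b] == 0:
                choice = b
                break
    if choice is not None:
        download_number[choice] += 1  # the thread records its attempt
    return choice
```

Larger values of P make the client more patient before retrying, while R caps how many servers are ever asked for the same block.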

8
Fastest1
  • Another approach
  • For downloading a block, choose the server that has
    the minimum value of time × (l + 1)
  • time: predicted time to download a block when
    there is no contention, obtained from NWS (Network
    Weather Service) measurements before the download
    is initiated
  • l: number of threads currently downloading from
    the server
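A sketch of the server choice, reading the score as the predicted contention-free time multiplied by (l + 1). That reading of the formula is an assumption, as are the function and parameter names.

```python
def pick_server_fastest1(predicted_time, active_threads):
    """Return the server with the smallest load-adjusted download time.

    predicted_time maps server -> forecast seconds per block with no contention.
    active_threads maps server -> number of threads already downloading from it.
    """
    def score(server):
        # Each thread already on the server is assumed to slow a new
        # download down proportionally.
        return predicted_time[server] * (active_threads.get(server, 0) + 1)

    return min(sorted(predicted_time), key=score)
```

For example, a server forecast at 1.0 s per block that already has one active thread scores 2.0, so an idle server forecast at 1.5 s is preferred.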

9
Results
10
Multiple clients
  • Situation arises when parallel data for
    computation on parallel clients have to be
    selected from available replica server locations
  • More challenging: a download decision by one
    client can affect download performance on other
    clients, so this impact needs to be predicted
  • Periodic network monitoring has to be augmented
    by measurements taken from the current downloads

11
Collective Download algorithm
  • Download phase: each client connects to a server
    only once, even if some of the data it fetches
    belongs to other clients
  • Redistribution phase: the clients then
    redistribute the data among themselves
  • Widely followed in parallel I/O
  • Especially useful when clients and servers are on
    either side of a WAN: multiple WAN latencies are
    avoided at the cost of a less expensive
    redistribution phase

12
Replica Placement Strategies
  • Replica placement questions:
    • When should replicas be created?
    • Which files should be replicated?
    • Where should replicas be placed?
  • The model assumes that data is produced in tier-1
    (root) and there are storage spaces at various
    tiers (levels of hierarchy)
  • Clients that request data form the leaves of the
    hierarchy

13
Placement strategies
  • Best client
    • Each storage node maintains a history of the
      number of requests for the files it contains
    • If the number of requests for a file exceeds the
      threshold, the node creates a replica of the file
      at the client node that has generated the most
      requests for that file (the best client)
    • The request history for the file is then cleared
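A minimal sketch of the bookkeeping a storage node might keep for the best-client strategy. The class and method names are hypothetical, and a single threshold shared by all files is an assumption.

```python
from collections import Counter, defaultdict


class StorageNode:
    """Tracks per-file request history and reports when to replicate."""

    def __init__(self, threshold):
        self.threshold = threshold
        # file name -> Counter mapping client -> number of requests
        self.history = defaultdict(Counter)

    def record_request(self, file_name, client):
        """Record one request; return the best client if a replica is due."""
        counts = self.history[file_name]
        counts[client] += 1
        if sum(counts.values()) > self.threshold:
            best_client, _ = counts.most_common(1)[0]
            del self.history[file_name]  # request history is cleared
            return best_client
        return None
```

The caller is expected to create the replica at the returned client; the sketch only decides when and where.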

14
Strategies
  • Cascading replication
    • Analogy to a three-tiered fountain
    • Once a threshold for a file is exceeded at the
      root, a replica is created at the next level on
      the path to the best client, and so on down the
      hierarchy
    • Geographical locality is exploited
  • Plain caching: done at the client
  • Caching plus cascading replication
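One step of cascading replication can be sketched as below. The representation of the hierarchy as a root-to-client path and the function name are assumptions; the threshold check itself is left to the caller.

```python
def cascade_step(path, holders):
    """Return the node that should receive the next replica, or None.

    path lists the nodes from the root down to the best client.
    holders is the set of nodes that already hold a replica of the file.
    """
    deepest = max(
        (i for i, node in enumerate(path) if node in holders), default=None
    )
    if deepest is None or deepest + 1 >= len(path):
        return None  # nobody on the path holds the file, or it reached the end
    return path[deepest + 1]
```

Calling the function each time a threshold is exceeded moves the file one tier closer to the clients that request it most.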

15
Strategies
  • Fast Spread
    • A replica of the file is stored at each node
      along its path to the client
  • Replica selection: closest replica
  • Replica replacement: the least popular file with
    the oldest age is replaced. Popularity logs are
    cleared periodically
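A sketch of fast-spread placement and of the replacement rule. The names are hypothetical, and the tie-break (among equally unpopular files, evict the oldest) is one reading of "least popular file with oldest age".

```python
def fast_spread_targets(path, holders):
    """Nodes on the file's path that should store a new replica.

    path lists the nodes the file passes through on its way to the client;
    holders is the set of nodes that already have a copy.
    """
    return [node for node in path if node not in holders]


def choose_victim(stored):
    """Pick the file to evict when a node runs out of space.

    stored maps file name -> (request_count, age); the least requested
    file is chosen, and the oldest one wins a tie.
    """
    return min(stored, key=lambda name: (stored[name][0], -stored[name][1]))
```

Because every node on the path receives a copy, storage fills quickly, which is why a replacement rule is needed at all.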

16
Findings
  • Best-client performs worst for random access
    patterns and shows improvement for access
    patterns with a bit of geographical locality
  • Fast spread works much better than cascading for
    random data access
  • Bandwidth savings are greater with fast spread
    than with cascading
  • Fast spread has high storage requirements

17
Computation and Data
18
GriPhyN
  • Focuses on virtual data grid technologies
  • Allows exploitation of computation procedures and
    results as community resources
  • A request for data can either retrieve the data
    or execute the computation procedures that
    produce it

19
Challenges
  • Representing transformations in virtual data
    catalog
  • Tracing derived data
  • Mapping computations onto effective flow graphs
  • Rebuilding dependent objects when code or data
    changes
  • Automated generation and scheduling of
    computations required to instantiate data products

20
Chimera
  • Virtual data system that supports capture and
    reuse of data generated by computations
  • Consists of a virtual data catalog (VDC) and a
    virtual data language (VDL) interpreter
  • The VDC tracks how data is derived
  • Transformation: abstract definition of how a
    program is to be invoked, what parameters and
    input files it needs, etc.
  • Derivation: invocation of a transformation with a
    specific set of inputs and files
  • Executions of all transformations are recorded in
    the Chimera database
  • VDL query functions allow searching the VDC for
    derivations or transformations, by application,
    transformation, input, or output name
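The distinction between a transformation and a derivation can be sketched with plain Python data classes. This is not Chimera's VDL or its catalog schema; every class, field, and method name below is hypothetical, and the derivation graph is assumed to be acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    name: str          # logical name of the program
    executable: str    # how the program is invoked
    parameters: tuple  # names of the parameters it expects


@dataclass(frozen=True)
class Derivation:
    transformation: str  # name of the transformation being invoked
    arguments: tuple     # (parameter, value) pairs
    inputs: tuple        # input file names
    outputs: tuple       # output file names


class VirtualDataCatalog:
    def __init__(self):
        self.transformations = {}
        self.derivations = []

    def add_transformation(self, transformation):
        self.transformations[transformation.name] = transformation

    def add_derivation(self, derivation):
        if derivation.transformation not in self.transformations:
            raise KeyError(f"unknown transformation: {derivation.transformation}")
        self.derivations.append(derivation)

    def find_by_output(self, file_name):
        """Query the catalog by output name."""
        return [d for d in self.derivations if file_name in d.outputs]

    def lineage(self, file_name):
        """Derivations needed to produce file_name, earliest steps first."""
        steps = []
        for derivation in self.find_by_output(file_name):
            for input_name in derivation.inputs:
                for step in self.lineage(input_name):
                    if step not in steps:
                        steps.append(step)
            if derivation not in steps:
                steps.append(derivation)
        return steps
```

A lineage query of this kind is what lets a request either fetch an existing file or re-run the recorded steps that produce it.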

21
Chimera architecture
22
Transformation and Derivation Example
23
Chimera-Pegasus Architecture
24
Work flows
25
Decoupling computation and data movement
26
Architecture
  • External Scheduler (ES): decides which remote
    site to send the job to
  • Local Scheduler (LS): follows its own policies
  • Data Scheduler (DS): replicates popular data sets
    to remote sites following some algorithm

27
Algorithms
  • 4 different ES algorithms
    • JobRandom
    • JobLeastLoaded
    • JobDataPresent
    • JobLocal
  • 3 different DS algorithms
    • DataDoNothing
    • DataRandom
    • DataLeastLoaded
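The slide only names the external-scheduler algorithms, so the sketch below guesses their behavior from the names: a random site, the site with the fewest queued jobs, the least loaded site that already holds the input data, and the site where the job originated. The site dictionary layout and the fallback when no site holds the data are assumptions.

```python
import random


def job_random(sites, rng):
    """Send the job to a randomly chosen site."""
    return rng.choice(sorted(sites))


def job_least_loaded(sites):
    """Send the job to the site with the fewest queued jobs."""
    return min(sorted(sites), key=lambda s: sites[s]["queued_jobs"])


def job_data_present(sites, dataset):
    """Send the job to the least loaded site that already holds the dataset."""
    holders = [s for s in sorted(sites) if dataset in sites[s]["datasets"]]
    candidates = holders or sorted(sites)  # fall back to all sites (assumption)
    return min(candidates, key=lambda s: sites[s]["queued_jobs"])


def job_local(origin_site):
    """Run the job where it was submitted."""
    return origin_site
```

Only the data-aware policy avoids moving the input file, which is the point of pairing it with a separate data scheduler.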

28
Simulation
  • Discrete Event Simulator was used
  • Resource capacities were modeled
  • Dataset sizes: uniformly distributed between 500
    MB and 2 GB
  • Initially only one replica per data set
  • Users mapped evenly across sites
  • Each job requires a single input file and runs
    for 300 × D seconds, where D is the input size
    in GB
  • Network contention modeled based on the number of
    simultaneous data transfers
  • Input file requests generated randomly according
    to a geometric distribution based on the
    popularity of files
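A sketch of how such a workload could be generated. The function name, the geometric parameter `p`, the seed, and the truncation of the geometric distribution at the last dataset are assumptions; sizes use 500 MB = 0.5 GB.

```python
import random


def make_workload(num_datasets, num_jobs, p, seed):
    """Return (dataset sizes in GB, list of (dataset index, runtime seconds)).

    Dataset 0 is the most popular; each further rank is chosen with
    geometrically decreasing probability.
    """
    rng = random.Random(seed)
    sizes_gb = [rng.uniform(0.5, 2.0) for _ in range(num_datasets)]
    jobs = []
    for _ in range(num_jobs):
        rank = 0
        # Geometric draw, truncated so the rank stays a valid index.
        while rng.random() > p and rank < num_datasets - 1:
            rank += 1
        runtime = 300.0 * sizes_gb[rank]  # 300 x D seconds, D in GB
        jobs.append((rank, runtime))
    return sizes_gb, jobs
```

Under these assumptions a job runs between 150 and 600 seconds, depending only on the size of its input file.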

29
Popularity distribution
30
Results
31
Sources / References / Credits
  • Algorithms for High Performance, Wide-Area
    Distributed File Downloads. J.S. Plank, S.
    Atchley, Y. Ding and M. Beck. Parallel Processing
    Letters, vol. 13, no. 2, pp. 207-224, June 2003.
  • Downloading Replicated Wide-Area Files: A
    Framework and Empirical Evaluation. R.L. Collins
    and J.S. Plank. NCA 2004.
  • Identifying Dynamic Replication Strategies for a
    High-Performance Data Grid. K. Ranganathan and I.
    Foster. Grid 2002.

32
Sources / References / Credits
  • Grid-Based Galaxy Morphology Analysis for the
    National Virtual Observatory. Ewa Deelman,
    Raymond Plante, Carl Kesselman, Gurmeet Singh,
    Mei-Hui Su, Gretchen Greene, Robert Hanisch,
    Niall Gaffney, Antonio Volpicelli, James Annis,
    Vijay Sekhri, Fermi Tamas Budavari, Maria
    Nieto-Santisteban, William O'Mullane, David
    Bohlender, Tom McGlynn, Arnold Rots, Olga
    Pevunova, Supercomputing 2003.
  • Applying Chimera Virtual Data Concepts to Cluster
    Finding in the Sloan Sky Survey. James Annis,
    Yong Zhao, Jens Voeckler, Michael Wilde, Steve
    Kent, Ian Foster. SC 2002.

33
Sources / References / Credits
  • Kavitha Ranganathan and Ian Foster, Decoupling
    Computation and Data Scheduling in Distributed
    Data Intensive Applications, Proceedings of the
    11th International Symposium for High Performance
    Distributed Computing (HPDC-11), Edinburgh, July
    2002.

34
Replica Creation and Elimination Policy
  • More replicas improve load balance but put
    pressure on storage capacity
  • Replica creation:
    • On demand
    • By replica managers in the background
  • Replica management decisions for on-demand
    replication:
    • Replication decision: should a remote file be
      replicated at the local site in response to a
      file request?
    • Replica selection
    • Replica replacement

35
Replica Optimization Strategies
  • LRU
    • Replication decision: always replicate
    • Replica selection: based on closest location
    • Replica replacement: LRU
  • Binomial
    • Replication decision: based on a file value
      calculated from a binomial prediction of file
      popularity
    • Replica selection: using auction and bidding
    • Replica replacement: replace the local replica
      with the lowest file value
  • Zipf
    • Same as binomial, but a Zipf distribution is used
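The LRU strategy is the simplest of the three and can be sketched with an ordered dictionary. The class name and the capacity measured in number of files (rather than bytes) are assumptions; the binomial and Zipf strategies are not sketched because the slide does not give the file-value formula.

```python
from collections import OrderedDict


class LruReplicaStore:
    """Local site storage that always replicates and evicts by LRU."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # least recently used file comes first

    def request(self, file_name):
        """Serve a request; return 'local' on a hit, 'remote' on a miss."""
        if file_name in self.files:
            self.files.move_to_end(file_name)  # mark as most recently used
            return "local"
        # Miss: the file is fetched remotely and always replicated locally.
        if len(self.files) >= self.capacity:
            self.files.popitem(last=False)  # evict the least recently used
        self.files[file_name] = True
        return "remote"
```

Because the decision is always "replicate", a burst of one-off requests can push out files that are popular in the long run, which is the weakness the value-based strategies aim to address.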

36
Scheduling optimizations
  • Random: assign the job to a random host
  • Shortest queue: assign the job to the host whose
    queue is smallest
  • Access cost: assign the job to the host whose
    access cost to the files required by the job is
    smallest
  • Queue access cost: assign the job to the host
    where the sum of the access costs for this job
    and all the jobs in its queue is smallest
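The four policies differ only in the score they minimize, as the sketch below shows. The `hosts` layout (host name mapped to its list of queued jobs) and the `cost(host, job)` callback are hypothetical; how access cost is estimated is not specified on the slide.

```python
import random


def pick_random(hosts, rng):
    """Random: any host."""
    return rng.choice(sorted(hosts))


def pick_shortest_queue(hosts):
    """Shortest queue: host with the fewest queued jobs."""
    return min(sorted(hosts), key=lambda h: len(hosts[h]))


def pick_access_cost(hosts, job, cost):
    """Access cost: host with the cheapest access to this job's files."""
    return min(sorted(hosts), key=lambda h: cost(h, job))


def pick_queue_access_cost(hosts, job, cost):
    """Queue access cost: cheapest total for this job plus the queued jobs."""
    def total(host):
        return cost(host, job) + sum(cost(host, queued) for queued in hosts[host])

    return min(sorted(hosts), key=total)
```

The last policy can override the third: a host with cheap access for the new job may still lose if its queue is full of jobs with expensive file access.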

37
Results: Impact of Network Performance