1
Data Grids: Data Intensive Computing
2
Simplistically
  • Data Grids
  • Large number of users
  • Large volume of data
  • Large computational tasks involved
  • Connecting resources through a network

3
Grid
  • Term Grid borrowed from electrical grid
  • Users obtain computing power through the Internet by using the Grid, just like electrical power from any wall socket

4
Data Grid
  • By connecting to a Grid, users can get
  • needed computing power
  • storage space and data
  • specialized equipment
  • Each user has a single login account to access all resources
  • Resources are owned by diverse organizations, forming a Virtual Organization

5
Data Grids
  • Data
  • Measured in terabytes and petabytes
  • Also geographically distributed
  • Researchers
  • Access and analyze data
  • Analyses are sophisticated and computationally expensive
  • Geographically distributed
  • Queries
  • Require management of caches and data transfer over WANs
  • Schedule data transfer and computation
  • Use performance estimates to select replicas

6
Data Grids
  • Domains as diverse as
  • Global climate change
  • High energy physics
  • Computational genomics
  • Biomedical applications

7
Data Grids
  • A data grid differs from
  • Cluster computing: a grid is more than homogeneous sites connected by a LAN (a grid can be multiple clusters)
  • Distributed systems: a grid is more than distributing the load of a program across two or more processes
  • Parallel computing: a grid is more than a single task on multiple machines
  • A data grid is
  • heterogeneous, geographically distributed, independent sites
  • Gridware manages the resources of a Grid

8
Methods of Grid Computing
  • Distributed Supercomputing
  • Tackle problems that cannot be solved on a single
    system
  • High-Throughput Computing
  • goal of putting unused processor cycles to work on loosely coupled, independent tasks (e.g., SETI, the Search for Extraterrestrial Intelligence)
  • On-Demand Computing
  • short-term requirements for resources that are not locally accessible; real-time demands

9
Methods of Grid Computing
  • Data-Intensive Computing
  • Synthesize new information from data that is
    maintained in geographically distributed
    repositories, databases, etc.
  • Collaborative Computing
  • enabling and enhancing human-to-human
    interactions

10
An Illustrative Example
  • A NASA research scientist
  • collected microbiological samples in the tidewaters around Wallops Island, Virginia
  • Needs
  • the high-performance microscope at the National Center for Microscopy and Imaging Research (NCMIR), University of California, San Diego

11
Example (continued)
  • Samples were sent to San Diego; the scientist used NPACI's Telescience Grid and NASA's Information Power Grid (IPG) to view and control the output of the microscope from her desk on Wallops Island
  • Viewed the samples and moved the platform holding them, making adjustments to the microscope

12
Example (continued)
  • The microscope produced a huge dataset of images
  • This dataset was stored using a storage resource broker on NASA's IPG
  • The scientist was able to run algorithms on this dataset while watching the results in real time

13
Grid - Lower level services
  • Other basic services
  • Authorization/authentication
  • Resource reservation for predictable transfers
  • Performance measurements, estimation techniques
  • Instrument services that enable end-to-end
    instrumentation of storage transfers

14
Grid - Higher level services
  • Replica manager
  • Creates/deletes copies of file instances
  • Typically byte-for-byte copies
  • Replicas are created for better performance/availability
  • A logical file exists in a metadata repository with a globally unique name
  • Related logical files are grouped into replica catalog collections (hierarchies too)
  • A file not in a catalog is in a local cache
  • Replica policy is separate from the replica manager
  • Can keep local copies separate

15
Topics to follow
  • Discuss data Grid research at UA
  • Discuss Green computing
  • Discuss Celadon cluster at UA

16
An On-Line Replication Strategy to Increase
Availability in Data Grids
  • Ming Lei, PhD
  • Department of Computer Science
  • University of Alabama
  • Now at Oracle Corporation
  • Atlanta, GA

17
Introduction
  • How to improve file access time and data
    availability?
  • Replicate the Data!
  • Copies of files at different sites
  • Deciding where and when is the problem
  • Dynamic behavior of Grid users
  • Large volumes of datasets
  • Hundreds of clients across the globe submit requests

18
Introduction
  • Early work in data replication focused on
    decreasing access latency and network bandwidth
  • As bandwidth and computing capacity become
    cheaper, data access latency can drop
  • How to improve availability and reliability
    becomes the focus
  • Unavailability of a file can cause a job to hang
  • The potential delay to a job can be unbounded
  • Any node failure or data outage can cause potential file unavailability

19
Related Replica Work
  • Economical model: replica decisions based on an auction protocol [Carman, Zini, et al.]; e.g., replicate if used in the future; assumes unlimited storage
  • HotZone: places replicas so that client-to-replica latency is minimized [Szymaniak et al.]
  • Replica strategies: centralized and distributed replication [Tang et al.]; consider limited storage but only LRU replacement
  • Multi-tiered Grid: Simple Bottom Up and Aggregate Bottom Up [Tang et al.]
  • Replicate fragments of files, with a block mapping procedure for direct user access [Chang and Chen]

20
Motivation
  • Want to complete a job with correct data
  • File access failure can lead to an incorrect result or a job crash
  • Improve overall system availability
  • Propose to measure the system level data
    availability
  • Assume limited file storage

21
Data Grid Architecture
Computing Element (CE); Storage Element (SE); Replica manager containing a replica optimizer
22
File Availability
  • File availability
  • Associated with each SE (storage element) is a file availability (the probability a file there will be available)
  • It doesn't help to increase copies at the same SE; they all fail together
  • One copy per SE
  • All copies have the same availability at the same SE

23
Measures of System Availability
  • System File Missing Rate (SFMR)
  • SFMR = (number of files potentially unavailable) / (number of all files requested by all jobs)
  • System Bytes Missing Rate (SBMR)
  • SBMR = (number of bytes potentially unavailable) / (total number of bytes requested by all jobs)
  • The two metrics are the same when all file sizes are the same

24
System model
Given a set of jobs J = (j_1, j_2, ..., j_N), the availability P_j of file f_j is

    P_j = 1 - \prod_{i=1}^{k} (1 - P_{SE_i})

where P_{SE_i} is the file availability in the i-th SE and k denotes the number of copies of the file f_j (one copy per SE, with SE failures treated as independent).
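To make the model concrete, here is a minimal sketch (in Python, not from the original slides) of this availability computation, assuming independent SE failures as the formula above implies:

    from math import prod

    def file_availability(se_avail):
        """P_j = 1 - prod(1 - P_SEi): the probability that at least one
        of the file's replicas (one per SE) is available."""
        return 1 - prod(1 - p for p in se_avail)

    # Two replicas on SEs that are each 99% available:
    # file_availability([0.99, 0.99]) == 1 - 0.01 * 0.01 == 0.9999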
25
System model
System File Missing Rate (SFMR):

    SFMR = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} (1 - P_j)

where n denotes the total number of jobs, each of which has m file accesses, and P_j is the availability of the j-th file accessed by a job.

System Bytes Missing Rate (SBMR):

    SBMR = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} S_j (1 - P_j)}{\sum_{i=1}^{n} \sum_{j=1}^{m} S_j}

where S_j denotes the size of file f_j.
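A small sketch of how the two metrics could be computed over a request trace; the (availability, size) pairs are hypothetical stand-ins for the model's P_j and S_j:

    def sfmr(requests):
        """SFMR: expected fraction of requested files that are unavailable.
        requests is a list of (p_avail, size_bytes) pairs, one per access."""
        return sum(1 - p for p, _ in requests) / len(requests)

    def sbmr(requests):
        """SBMR: expected fraction of requested bytes that are unavailable."""
        return sum((1 - p) * s for p, s in requests) / sum(s for _, s in requests)

    # With uniform file sizes the two metrics coincide, as noted above:
    # sfmr([(0.99, 100), (0.9, 100)]) == sbmr([(0.99, 100), (0.9, 100)]) == 0.055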
26
Problem Generalization

Given a sequence of file requests O = (r_1, r_2, ..., r_N), with P_i and S_i the availability and size of the file named by request r_i:

    SFMR = \frac{1}{N} \sum_{i=1}^{N} (1 - P_i)

    SBMR = \frac{\sum_{i=1}^{N} S_i (1 - P_i)}{\sum_{i=1}^{N} S_i}

The best system data availability results from minimizing the above equations subject to

    \sum_{i} C_i \cdot S_i \le S

where C_i denotes the number of copies of f_i and S is the total storage available.
27
Problem Generalization
Transform the minimization problem into a maximization problem:

    SFMR = 1 - \frac{1}{N} \sum_{i=1}^{N} P_i

    SBMR = 1 - \frac{\sum_{i=1}^{N} S_i P_i}{T_{bytes}}

N is the total number of request operations in the given set O, and T_{bytes} denotes the total bytes that will be accessed for all of O. To minimize SFMR and SBMR, we therefore need to maximize

    \sum_{i=1}^{N} P_i    and    \sum_{i=1}^{N} S_i P_i
28
On-line Optimal Replication Problem
With each file f_i associate a value V_i (its future accesses). Assume the newly requested file is t. Choose a file set d = {f_1, f_2, ..., f_k} from the file set F ∪ {t} to achieve the maximum of

    \sum_{f_i \in d} P_i V_i    and    \sum_{f_i \in d} S_i P_i V_i

If t is in d, then we need to replicate the file. The above optimization problem is a classic Knapsack problem: aggregate each file's replica storage costs together as the weight of the item f_i.
29
On-line Optimal Replication Problem
Solving this Knapsack problem at each replacement instant is known to be NP-hard. We can convert our optimization problem to an approximate fractional knapsack problem (as has been done elsewhere, in work from Berkeley). Assume that the storage capacity is sufficiently large and holds a significantly large number of files, so the amount of space left after storing the maximum number of files is negligible compared to the total storage space.
30
Minimum Data Missing Rate Strategy (MinDmr)
  • Propose the MinDmr replica optimizer
  • In our greedy algorithm, we introduce the file weight (a sketch follows below) as
  • W = (Pj × Vj) / (Cj × Sj)
  • Vj - file value based on future accesses
  • Pj - file fj's availability
  • Cj - the number of copies of fj
  • Sj - the size of fj
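A one-line sketch of this weight, assuming (from the slide's formula) that the numerator and denominator are products:

    def mindmr_weight(p_avail, value, copies, size):
        """W = (Pj * Vj) / (Cj * Sj): low-weight files are the cheapest
        to evict; high-weight files are the most worth keeping."""
        return (p_avail * value) / (copies * size)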

31
MinDmr Strategy
  • Value Vi
  • Must make long-term performance decisions
  • Each file access operation ri, at instant T, is associated with an important variable Vi
  • Vi is set to the number of times the file will be accessed in the future
  • Assign a future value to a file via a prediction function

32
Prediction Functions
  • Prediction via four kinds of prediction functions (hypothetical sketches of the last two follow below)
  • Bio Prediction: a binomial distribution is used to predict a value based on file access history
  • Zipf Prediction: a Zipf distribution is used to predict a value based on file access history
  • Queue Prediction: the current job queue is used to predict a value for the file
  • No Prediction: no prediction for the file is made; the value is always 1
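As a concrete illustration, here is a hypothetical sketch of the two simplest predictors (the binomial and Zipf variants depend on an access-history model the slides do not spell out); the job_queue structure is an assumption:

    def value_no_prediction(file_id, job_queue):
        """No Prediction: every file's future value is the constant 1."""
        return 1

    def value_queue_prediction(file_id, job_queue):
        """Queue Prediction: estimate a file's future value by counting how
        often jobs still waiting in the queue request it (assumed structure:
        job_queue is a list of per-job file-request lists)."""
        return sum(job_files.count(file_id) for job_files in job_queue)

    # value_queue_prediction("f1", [["f1", "f2"], ["f1"]]) -> 2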

33
MinDmr Strategy
  • For each file request:
  • If enough space, replicate the file
  • Else:
  • Sort stored files by weight W
  • Replace file(s) if the value gained by replicating > the value lost by replacing (a sketch of this decision follows below)
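Putting the pieces together, a minimal sketch of this replacement decision; the data structures and names are assumptions, not the original OptorSim implementation:

    from collections import namedtuple

    File = namedtuple("File", "id size weight")  # weight = W from slide 30

    def mindmr_replicate(new_file, stored, capacity):
        """Decide whether to replicate new_file onto an SE with finite storage.
        stored: list of File already held. Returns the files to evict
        ([] if none are needed), or None if we should not replicate."""
        free = capacity - sum(f.size for f in stored)
        if free >= new_file.size:
            return []  # enough space: replicate without evicting anything
        victims, loss = [], 0.0
        # consider the cheapest (lowest-weight) stored files for eviction first
        for f in sorted(stored, key=lambda f: f.weight):
            victims.append(f)
            loss += f.weight
            free += f.size
            if free >= new_file.size:
                break
        # replicate only if the value gained exceeds the value lost
        if free >= new_file.size and new_file.weight > loss:
            return victims
        return None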

35
Existing Eco Model Comparison
  • Compare to the Economical Model in OptorSim
  • Eco
  • File replicated if it maximizes the profit of the SE (e.g., what is earned over time based on the predicted number of file requests)
  • Eco prediction functions
  • EcoBio
  • EcoZipf

36
Existing Eco Model Comparison
  • MinDmr differs from Eco
  • Both are greedy
  • MinDmr uses two values: gain/loss, and the weight W for sorting existing files for replacement
  • Eco uses the same value to determine a file's value and its replacement
  • MinDmr includes availability, number of copies, and size
  • Incidence of replication differs: Eco replicates the same file many more times

37
OptorSim
  • Evaluate the performance of our replication and replacement strategy using OptorSim
  • OptorSim was developed by the EU DataGrid Project to test dynamic replica schemes

38
Grid topology
39
Compare to
  • Will compare 8 replica schemes (optimizers)
  • BioMD (Bio MinDmr)
  • ZipfMD (Zipf MinDmr)
  • MDNoPred (MinDmr, no prediction)
  • MDQuePred (MinDmr, queue prediction)
  • EcoBio
  • EcoZipf
  • LRU (least recently used)
  • LFU (least frequently used)

40
What to vary
  • Comparisons made for
  • Varying access patterns
  • Total job time
  • Varying schedulers
  • Queue length
  • SE availability
  • Different file sizes (all files the same size)
  • Different sized files (sizes vary per file)

41
Access Patterns
  • Consider 4 access patterns (OptorSim)
  • Random
  • Random Walk Gaussian
  • Sequential
  • Random Walk Zipf

42
Job Schedulers
  • Consider 4 types of job schedulers (OptorSim)
  • Random
  • Shortest Queue
  • Access Cost: the site where the file has the lowest access cost
  • Queue Access Cost: the site where the sum of the access cost for the job and the access costs for all jobs in the queue is smallest (a sketch follows below)
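A hypothetical sketch of the last scheduler; the site objects and the access_cost callback are assumptions, not the OptorSim API:

    def schedule_queue_access_cost(job, sites, access_cost):
        """Queue Access Cost: pick the site minimizing the access cost of the
        new job plus the access costs of all jobs already queued there
        (assumed structure: each site has a .queue list of pending jobs)."""
        return min(sites, key=lambda s: access_cost(job, s) +
                   sum(access_cost(q, s) for q in s.queue))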

43
Performance Results
44
Workload and system parameter values
File availability at each SE is 99%
45
SFMR with varying replica optimizers
46
Results
  • MinDmr (MD) schemes perform better than both Eco schemes
  • EcoBio worst, EcoZipf second worst
  • SFMR for Eco is up to 200 times greater than for MinDmr
  • LFU slightly better than LRU
  • ZipfMD worse than LRU, LFU
  • This is consistent in most of the results
  • ZipfMD uses the Zipf prediction function in OptorSim, which is not accurate

47
Total job time with sequential access
48
Results
  • Total job time is smallest for the MinDmr schemes
  • BioMD shortest, EcoBio the longest
  • LRU has a higher SFMR, but a lower total job time
  • Note that we only used sequential access here

49
SFMR with varying job schedulers
50
Results
  • Shortest Queue and Access Cost give similar SFMR values for all replica schemes
  • Random is the worst, Queue Access Cost the best
  • Note that LRU is dropped from this comparison

51
SFMR with varying job queue length
52
Results
  • Effect of queue length on SFMR
  • Consider only MDQuePred
  • The shorter the job queue, the higher the SFMR
  • However, if the queue is too long, SFMR can increase slightly
  • Valuable files are replicated and stay in storage too long

53
Total Job Time with varying job queue length
54
Results
  • As the length of the queue increases, total running time decreases
  • The decrease is greater for longer queues
  • Trade-off: total job time versus SFMR

55
SFMR Ratio of MDQuePred with varying SE availability
56
Results
  • Vary availability at 90%, 99%, 99.9% and 99.99%
  • Compare to the availability of MDQuePred (the smallest)
  • All schemes benefit from higher availability
  • MinDmr strategies are always smaller

57
SFMR with sequential access when varying file size
58
Results
  • Change the size of all files among 200, 300, 400, 500, and 600 MB (all files still the same size)
  • The larger the file size, the higher the SFMR
  • All MinDmr schemes are better, except for ZipfMD

59
SFMR and SBMR with different file sizes
60
Results
  • Each file is a different size
  • Sizes range from 500 MB to 1 GB
  • All replica schemes except LFU have a higher SBMR than SFMR
  • Schemes store small files in replica space, displacing larger ones
  • LFU (and LRU) not affected
  • MinDmr schemes (except ZipfMD) are better

61
Difference in SBMR and SFMR with different file sizes
62
Results
  • Display SBMR - SFMR
  • Largest gap for EcoBio, smallest for BioMD
  • The gap is larger for EcoBio, EcoZipf and ZipfMD
  • The difference between SFMR and SBMR is small for LFU, BioMD, MDNoPred and MDQuePred

63
Conclusions
  • MinDmr is better than the others in terms of the
    new data availability metrics regardless of
  • File sizes
  • System load
  • Queue length
  • Prediction function
  • Job schedulers
  • File access patterns

64
Future Work
  • Differentiate SFMR and SBMR when file sizes are not uniform
  • Study the algorithm's preferential treatment of smaller files
  • File bundle situations
  • Quality of service issues