Title: Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets
Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets
- Swarup Kumar Sahoo
- Gagan Agrawal
Roadmap
- Motivation
- Introduction
- System Overview
- XQuery, Low and High Level Schema, and HDF5 Storage
- Compiler Analysis and Algorithm
- Experiments
- Summary and Future Work
Motivation
- Emergence of grid-based data repositories
- Can enable sharing of data
- Emergence of applications that process large datasets
- Complicated by complex and specialized storage formats
- Need for easily portable applications
- Compatibility with web/grid services
Data Virtualization
- Defined by the Global Grid Forum's DAIS working group:
- A Data Virtualization describes an abstract view of data.
- A Data Service implements the mechanism to access and process the dataset through the Data Virtualization.
Introduction: Automatic Data Virtualization
- Goal: enable automatic creation of efficient data services
- Support a high-level or abstract view of data
- Data is stored in a low-level format
- Application development assumes the high-level or virtual view
- Application execution runs on the actual low-level layout
Overview of Our Automatic Data Virtualization Work
- Previous work on XML-based virtualization
- Techniques for XQuery compilation (Li and Agrawal, ICS 2003, DBPL 2003)
- Supporting XML-based high-level abstractions on flat-file datasets (LCPC 2003, XIME-P 2004)
- Relational table/SQL-based implementation
- Supporting SQL Select and Where (HPDC 2004)
- Supporting SQL-3 aggregations (LCPC 2004)
XML-based Virtualization
(diagram: a single XML virtual view layered over NetCDF, HDF5, TEXT, and RDBMS storage formats)
Challenges and Contributions
- Challenges
- The compiler must generate efficient data processing code
- It uses information about the low-level layout and the mapping between the virtual and low-level layouts
- Compilation challenge: translating high-level code to low-level code while ensuring high locality in processing large datasets
- Contributions of this paper
- An improved data-centric transformation algorithm
- An implementation specific to HDF5 as the low-level format
System Overview
(system diagram: the compiler takes the XQuery source code, the high-level XML schema, the mapping schema, and the low-level XML schema as inputs, and produces generated code that runs against the HDF5 library on the processor and disk)
XQuery and HDF5
- High-level declarative languages ease application development
- XQuery is a high-level language for processing XML datasets
- Derived from database, declarative, and functional languages
- HDF5
- Hierarchical Data Format
- Widely used in scientific communities
- A case study with a format that has optimized access libraries
Use of XML Schemas
- High-level schema
- XML is used to provide a virtual view of the dataset
- Low-level schema
- Reflects the actual physical layout in HDF5
- Mapping schema
- Describes the mapping between each element of the high-level schema and the low-level schema
Oil Reservoir Simulation
- Supports cost-effective oil production
- Simulations on a 3-D grid
- 17 variables and cell locations in the 3-D grid at each time step
- Computation of bypassed regions
- An expression determines whether a cell is bypassed for a time step
- Within a spatial region and a range of time steps
- Find the grid cells that are bypassed for every time step in the range
High-Level Schema

  <xs:element name="data" maxOccurs="unbounded">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="x" type="xs:integer"/>
        <xs:element name="y" type="xs:integer"/>
        <xs:element name="z" type="xs:integer"/>
        <xs:element name="time" type="xs:integer"/>
        <xs:element name="velocity" type="xs:float"/>
        <xs:element name="mom" type="xs:float"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
High-Level XQuery Code for Oil Reservoir Management

  unordered(
    for $i in ($x1 to $x2)
    for $j in ($y1 to $y2)
    for $k in ($z1 to $z2)
    let $p := document("OilRes.xml")/data
    where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
      and ($p/time gt $tmin) and ($p/time lt $tmax)
    return
      <info>
        <coord> {$i, $j, $k} </coord>
        <summary> {analyze($p)} </summary>
      </info>
  )
Low-Level Schema

  <file name="info">
    <sequence>
      <group name="data">
        <attribute name="time">
          <datatype> integer </datatype>
          <dataspace> <rank> 1 </rank> <dimension> 1 </dimension> </dataspace>
        </attribute>
        <dataset name="velocity">
          <datatype> float </datatype>
          <dataspace> <rank> 1 </rank> <dimension> x </dimension> </dataspace>
        </dataset>
        ..............
      </group>
    </sequence>
  </file>
Mapping Schema

  //high/data/velocity -> //low/info/data/velocity
  //high/data/time     -> //low/info/data/time
  //high/data/mom      -> //low/info/data/mom
  //high/data/x        -> index(//low/info/data/velocity, 1)
Compiler Analysis
- Problem with direct translation
- Each let expression involves a complete scan over the dataset
- So the final code would need several passes over the data
- Solution
- Apply data-centric transformations so each portion of the HDF5 dataset is read only once
Naïve Strategy
(diagram: multiple scans over the dataset produce the output)
Data-Centric Strategy
(diagram: a single scan over the datasets produces the output)
- Requires just one scan
Data-Centric Transformation
- Overall idea of the data-centric transformation
- Iterate over each data element in the actual storage
- Find the iterations of the original loop in which it is accessed
- Execute the computation corresponding to those iterations
- Previous work
- Pingali et al.: blocking
- Ferreira and Agrawal: data-parallel Java on disk-resident datasets
- Li and Agrawal: XQuery, inverting getData functions
- Our contribution
- Use the low-level and mapping schemas
- Extend the idea to cases where multiple datasets need to be accessed
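As a hypothetical illustration (not the paper's generated code), the loop inversion above can be sketched in plain Python; the storage layout, the analyze stand-in, and all names are made up for this sketch:

```python
# Data-centric idea: instead of looping over the iteration space and scanning
# storage for each iteration, loop over the stored elements once and replay
# the iterations that touch each element.

storage = [(x, 2.0 * x) for x in range(10)]  # (coordinate, velocity) records

# Naive: for each iteration i, scan the whole storage to find its element.
def naive(lo, hi):
    out = {}
    for i in range(lo, hi):
        for coord, vel in storage:       # full scan per iteration
            if coord == i:
                out[i] = vel * vel       # stand-in for analyze()
    return out

# Data-centric: one pass over storage; invert the access function to find
# which iteration (if any) uses each element, then do that iteration's work.
def data_centric(lo, hi):
    out = {}
    for coord, vel in storage:           # single scan
        i = coord                        # inverse mapping: element -> iteration
        if lo <= i < hi:
            out[i] = vel * vel
    return out
```

Both versions produce the same output, but the data-centric version touches each stored element exactly once.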
Data-Centric Transformation
- Mapping function T
- Iteration space → high-level data
- Mapping function C
- High-level data → low-level data
- Mapping function M = C ∘ T
- Iteration space → low-level data
- Our goal is to compute M⁻¹.
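A minimal sketch of these mappings, with made-up concrete functions (the symbols T, C, M follow the slide; the 8-byte contiguous layout is an assumption for illustration only):

```python
def T(i):
    # iteration space -> high-level data element
    return ("data", i)

def C(elem):
    # high-level element -> low-level location (file offset),
    # assuming 8-byte records stored contiguously
    _, i = elem
    return 8 * i

def M(i):
    # M = C o T : iteration -> low-level location
    return C(T(i))

def M_inverse(offset):
    # the compiler's goal: low-level location -> iteration instance
    return offset // 8
```

With this layout, M_inverse(M(i)) == i for every iteration i, which is exactly the property the generated code relies on when it walks the storage.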
Data-Centric Transformation
- Choose one of the n datasets to be accessed as the base dataset S1
- Apply M1⁻¹ to compute the set of iterations
- The expression Mi ∘ M1⁻¹ gives the portion of dataset Si that needs to be accessed along with S1
- The choice of base dataset can impact data locality
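A hypothetical two-dataset sketch of Mi ∘ M1⁻¹ (the record sizes 8 and 4 are invented for illustration; S1 is the base dataset):

```python
def M1(i):
    return 8 * i      # iteration -> offset in base dataset S1

def M2(i):
    return 4 * i      # iteration -> offset in dataset S2

def M1_inverse(off1):
    return off1 // 8  # offset in S1 -> iteration instance

def s2_offset_for(off1):
    # M2 o M1^{-1}: for an element of S1, the matching location in S2
    return M2(M1_inverse(off1))
```

Reading a chunk of S1 covering offsets [80, 160) thus pulls in the S2 range starting at s2_offset_for(80) = 40, which is how the compiler co-schedules accesses to the non-base datasets.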
Choice of Base Dataset
- Min-IO-Volume Strategy
- Minimize repeated access to any dataset
- Min-Seek-Time Strategy
- Minimize any discontinuity in access
Template for Generated Code

  Generated_Query {
    Create an abstract iteration space using the source code.
    Allocate and initialize an array of output elements corresponding to the iteration space.
    For k = 1, ..., NO_OF_CHUNKS {
      Read the kth chunk of dataset S1 using HDF5 functions and the structural tree.
      Foreach of the other datasets S2, ..., Sn
        access the required chunk of the dataset.
      Foreach data element in the chunks of data {
        compute the iteration instance.
        apply the reduction computation and update the output.
      }
    }
  }
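The template above can be sketched as plain Python, assuming a single dataset for brevity; read_chunk() stands in for the HDF5 chunk reads, and the per-cell minimum reduction is invented for illustration:

```python
CHUNK = 4
# (cell id, value) records; 32 records over 8 cells
data = [(i % 8, float(i)) for i in range(32)]

def read_chunk(k):
    # stand-in for reading the kth chunk via the HDF5 library
    return data[k * CHUNK:(k + 1) * CHUNK]

def generated_query():
    out = {}                                   # output array, one slot per cell
    n_chunks = (len(data) + CHUNK - 1) // CHUNK
    for k in range(n_chunks):                  # For k = 1, ..., NO_OF_CHUNKS
        for cell, value in read_chunk(k):      # each element in the chunk
            i = cell                           # compute the iteration instance
            out[i] = min(out.get(i, value), value)  # reduction + update
    return out
```

Note the single pass: every record is read exactly once, and the reduction is applied as elements arrive, matching the data-centric strategy.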
Experiments
- 200×200×200 grid with 10 time steps (1.28 GB)
- 50×50×50 storage chunk size

Experiments (continued)
- 50×50×50 grid with 200 time steps (400 MB)
- 25×25×25 storage chunk size
Key Observations
- Overall minimum execution time: the Min-IO-Volume strategy when the read chunk size matches the storage chunk size
- Execution time is very sensitive to the read chunk size in the Min-IO-Volume strategy
- Execution time is not sensitive to the read chunk size in the Min-Seek-Time strategy, due to buffering of storage chunks
Summary
- Compiler techniques
- Support high-level abstractions on complex low-level data formats
- Enable use of the same source code across a variety of data formats
- Perform data-centric transformations automatically
- Experimental results show that a minor change in strategy can affect performance significantly
- Future Work
- Cost models to guide strategy and chunk-size selection
- Compare performance with manual implementations
- Parallelize data processing
- Extend applicability of the algorithm to a more general class of queries