Title: Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets
Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets
- Swarup Kumar Sahoo
- Gagan Agrawal
Roadmap
- Motivation
- Introduction
- System Overview
- XQuery, Low and High Level Schema, and HDF5 Storage
- Compiler Analysis and Algorithm
- Experiments
- Summary and Future Work
Motivation
- Emergence of grid-based data repositories
- Can enable sharing of data
- Emergence of applications that process large datasets
- Complicated by complex and specialized storage formats
- Need for easily portable applications
- Compatibility with web/grid services
Data Virtualization
- Defined by the Global Grid Forum's DAIS working group:
- A Data Virtualization describes an abstract view of data.
- A Data Service implements the mechanism to access and process the dataset through the Data Virtualization.
Introduction: Automatic Data Virtualization
- Goal: enable automatic creation of efficient data services
- Support a high-level or abstract view of data
- Data is stored in a low-level format
- Application development assumes the high-level or virtual view
- Application execution runs on the actual low-level layout
Overview of Our Automatic Data Virtualization Work
- Previous work on XML-based virtualization
- Techniques for XQuery compilation (Li and Agrawal, ICS 2003, DBPL 2003)
- Supporting XML-based high-level abstractions on flat-file datasets (LCPC 2003, XIME-P 2004)
- Relational table/SQL-based implementation
- Supporting SQL Select and Where (HPDC 2004)
- Supporting SQL-3 aggregations (LCPC 2004)
XML-based Virtualization
(diagram: a single XML virtual view layered over NetCDF, HDF5, TEXT, and RDBMS storage formats)
Challenges and Contributions
- Challenges
- The compiler must generate efficient data processing code
- It uses information about the low-level layout and the mapping between the virtual and low-level layouts
- Compilation challenge: translating high-level code to low-level code while ensuring high locality in processing large datasets
- Contributions of this paper
- An improved data-centric transformation algorithm
- An implementation specific to HDF5 as the low-level format
System Overview
(system diagram: the compiler takes the XQuery source code, the high-level XML schema, the mapping schema, and the low-level XML schema as inputs, and produces generated code that runs against the HDF5 library on the processor and disk)
XQuery and HDF5
- High-level declarative languages ease application development
- XQuery is a high-level language for processing XML datasets
- Derived from database, declarative, and functional languages
- HDF5
- Hierarchical Data Format
- Widely used in scientific communities
- A case study with a format that has optimized access libraries
Use of XML Schemas
- High-level schema
- XML is used to provide a virtual view of the dataset
- Low-level schema
- Reflects the actual physical layout in HDF5
- Mapping schema
- Describes the mapping between each element of the high-level schema and the low-level schema
Oil Reservoir Simulation
- Supports cost-effective oil production
- Simulations on a 3-D grid
- 17 variables and cell locations in the 3-D grid at each time step
- Computation of bypassed regions
- An expression determines whether a cell is bypassed for a time step
- Within a spatial region and a range of time steps
- Find the grid cells that are bypassed for every time step in the range
High-Level Schema

  <xs:element name="data" maxOccurs="unbounded">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="x" type="xs:integer"/>
        <xs:element name="y" type="xs:integer"/>
        <xs:element name="z" type="xs:integer"/>
        <xs:element name="time" type="xs:integer"/>
        <xs:element name="velocity" type="xs:float"/>
        <xs:element name="mom" type="xs:float"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
High-Level XQuery Code for Oil Reservoir Management

  unordered(
    for $i in ($x1 to $x2)
    for $j in ($y1 to $y2)
    for $k in ($z1 to $z2)
    let $p := document("OilRes.xml")/data
    where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
      and ($p/time gt $tmin) and ($p/time lt $tmax)
    return
      <info>
        <coord> {$i, $j, $k} </coord>
        <summary> {analyze($p)} </summary>
      </info>
  )
Low-Level Schema

  <file name="info">
    <sequence>
      <group name="data">
        <attribute name="time">
          <datatype> integer </datatype>
          <dataspace> <rank> 1 </rank> <dimension> 1 </dimension> </dataspace>
        </attribute>
        <dataset name="velocity">
          <datatype> float </datatype>
          <dataspace> <rank> 1 </rank> <dimension> x </dimension> </dataspace>
        </dataset>
        ..............
      </group>
    </sequence>
  </file>
Mapping Schema

  //high/data/velocity -> //low/info/data/velocity
  //high/data/time     -> //low/info/data/time
  //high/data/mom      -> //low/info/data/mom
  //high/data/x        -> index(//low/info/data/velocity, 1)
Compiler Analysis
- Problem with direct translation
- Each let expression involves a complete scan over the dataset
- So the final code would need several passes over the data
- Solution
- Apply data-centric transformations so each portion of the HDF5 dataset is read only once
Naïve Strategy
(diagram: multiple scans over the dataset produce the output)
Data-Centric Strategy
(diagram: a single scan over the datasets produces the output)
- Requires just one scan
Data-Centric Transformation
- Overall idea of the data-centric transformation
- Iterate over each data element in the actual storage
- Find the iterations of the original loop in which it is accessed
- Execute the computation corresponding to those iterations
- Previous work
- Pingali et al.: blocking
- Ferreira and Agrawal: data-parallel Java on disk-resident datasets
- Li and Agrawal: XQuery, inverting getData functions
- Our contribution
- Use the low-level and mapping schemas
- Extend the idea to cases where multiple datasets need to be accessed
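As a hypothetical illustration (not the paper's generated code), the loop inversion above can be sketched in plain Python; the storage layout, the analyze stand-in, and all names are made up for this sketch:

```python
# Data-centric idea: instead of looping over the iteration space and scanning
# storage for each iteration, loop over the stored elements once and replay
# the iterations that touch each element.

storage = [(x, 2.0 * x) for x in range(10)]  # (coordinate, velocity) records

# Naive: for each iteration i, scan the whole storage to find its element.
def naive(lo, hi):
    out = {}
    for i in range(lo, hi):
        for coord, vel in storage:       # full scan per iteration
            if coord == i:
                out[i] = vel * vel       # stand-in for analyze()
    return out

# Data-centric: one pass over storage; invert the access function to find
# which iteration (if any) uses each element, then do that iteration's work.
def data_centric(lo, hi):
    out = {}
    for coord, vel in storage:           # single scan
        i = coord                        # inverse mapping: element -> iteration
        if lo <= i < hi:
            out[i] = vel * vel
    return out
```

Both versions produce the same output, but the data-centric version touches each stored element exactly once.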
Data-Centric Transformation
- Mapping function T
- Iteration space → high-level data
- Mapping function C
- High-level data → low-level data
- Mapping function M = C ∘ T
- Iteration space → low-level data
- Our goal is to compute M⁻¹.
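A minimal sketch of these mappings, with made-up concrete functions (the symbols T, C, M follow the slide; the 8-byte contiguous layout is an assumption for illustration only):

```python
def T(i):
    # iteration space -> high-level data element
    return ("data", i)

def C(elem):
    # high-level element -> low-level location (file offset),
    # assuming 8-byte records stored contiguously
    _, i = elem
    return 8 * i

def M(i):
    # M = C o T : iteration -> low-level location
    return C(T(i))

def M_inverse(offset):
    # the compiler's goal: low-level location -> iteration instance
    return offset // 8
```

With this layout, M_inverse(M(i)) == i for every iteration i, which is exactly the property the generated code relies on when it walks the storage.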
Data-Centric Transformation
- Choose one of the n datasets to be accessed as the base dataset S1
- Apply M1⁻¹ to compute the set of iterations
- The expression Mi ∘ M1⁻¹ gives the portion of dataset Si that needs to be accessed along with S1
- The choice of base dataset can impact data locality
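A hypothetical two-dataset sketch of Mi ∘ M1⁻¹ (the record sizes 8 and 4 are invented for illustration; S1 is the base dataset):

```python
def M1(i):
    return 8 * i      # iteration -> offset in base dataset S1

def M2(i):
    return 4 * i      # iteration -> offset in dataset S2

def M1_inverse(off1):
    return off1 // 8  # offset in S1 -> iteration instance

def s2_offset_for(off1):
    # M2 o M1^{-1}: for an element of S1, the matching location in S2
    return M2(M1_inverse(off1))
```

Reading a chunk of S1 covering offsets [80, 160) thus pulls in the S2 range starting at s2_offset_for(80) = 40, which is how the compiler co-schedules accesses to the non-base datasets.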
Choice of Base Dataset
- Min-IO-Volume Strategy
- Minimize repeated access to any dataset
- Min-Seek-Time Strategy
- Minimize any discontinuity in access
Template for Generated Code

  Generated_Query {
    Create an abstract iteration space using the source code.
    Allocate and initialize an array of output elements corresponding to the iteration space.
    For k = 1, ..., NO_OF_CHUNKS {
      Read the kth chunk of dataset S1 using HDF5 functions and the structural tree.
      Foreach of the other datasets S2, ..., Sn
        access the required chunk of the dataset.
      Foreach data element in the chunks of data {
        compute the iteration instance.
        apply the reduction computation and update the output.
      }
    }
  }
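The template above can be sketched as plain Python, assuming a single dataset for brevity; read_chunk() stands in for the HDF5 chunk reads, and the per-cell minimum reduction is invented for illustration:

```python
CHUNK = 4
# (cell id, value) records; 32 records over 8 cells
data = [(i % 8, float(i)) for i in range(32)]

def read_chunk(k):
    # stand-in for reading the kth chunk via the HDF5 library
    return data[k * CHUNK:(k + 1) * CHUNK]

def generated_query():
    out = {}                                   # output array, one slot per cell
    n_chunks = (len(data) + CHUNK - 1) // CHUNK
    for k in range(n_chunks):                  # For k = 1, ..., NO_OF_CHUNKS
        for cell, value in read_chunk(k):      # each element in the chunk
            i = cell                           # compute the iteration instance
            out[i] = min(out.get(i, value), value)  # reduction + update
    return out
```

Note the single pass: every record is read exactly once, and the reduction is applied as elements arrive, matching the data-centric strategy.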
Experiments
- 200×200×200 grid with 10 time steps (1.28 GB)
- 50×50×50 storage chunk size

Experiments (continued)
- 50×50×50 grid with 200 time steps (400 MB)
- 25×25×25 storage chunk size
Key Observations
- Overall minimum execution time: the Min-IO-Volume strategy when the read chunk size matches the storage chunk size
- Execution time is very sensitive to the read chunk size in the Min-IO-Volume strategy
- Execution time is not sensitive to the read chunk size in the Min-Seek-Time strategy, due to buffering of storage chunks
Summary
- Compiler techniques
- Support high-level abstractions on complex low-level data formats
- Enable use of the same source code across a variety of data formats
- Perform data-centric transformations automatically
- Experimental results show that a minor change in strategy can affect performance significantly
- Future Work
- Cost models to guide strategy and chunk-size selection
- Compare performance with manual implementations
- Parallelize data processing
- Extend applicability of the algorithm to a more general class of queries