Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets

1
Automatic Data Virtualization - Supporting XML
based abstractions on HDF5 Datasets
  • Swarup Kumar Sahoo
  • Gagan Agrawal

2
Roadmap
  • Motivation
  • Introduction
  • System Overview
  • XQuery, Low and High Level schema and HDF5
    storage
  • Compiler Analysis and Algorithm
  • Experiment
  • Summary and Future Work

3
Motivation
  • Emergence of grid-based data repositories
  • Can enable sharing of data
  • Emergence of applications that process large
    datasets
  • Complicated by complex and specialized storage
    formats
  • Need for easily portable applications
  • Compatibility with web/grid services

4
Data Virtualization
  • An abstract view of data
  • [Figure: dataset, Data Virtualization, Data Service]
  • Definitions from the Global Grid Forum's DAIS
    working group:
  • A Data Virtualization describes an abstract view
    of data.
  • A Data Service implements the mechanism to
    access and process data through the Data
    Virtualization.

5
Introduction: Automatic Data Virtualization
  • Goal: enable automatic creation of efficient
    data services
  • Support a high-level or abstract view of data
  • Data is stored in a low-level format
  • Application development assumes a high-level or
    virtual view
  • Application execution works on the actual
    low-level layout

6
Overview of Our Automatic Data Virtualization Work
  • Previous work on XML Based virtualization
  • Techniques for XQuery Compilation (Li and
    Agrawal, ICS 2003, DBPL 2003)
  • Supporting XML Based high-level abstraction on
    flat-file datasets (LCPC 2003, XIME-P 2004)
  • Relational Table/SQL Based Implementation
  • Supporting SQL Select and Where (HPDC 2004)
  • Supporting SQL-3 Aggregations (LCPC 2004)

7
XML-based Virtualization
[Figure: XML-based virtualization over NetCDF, HDF5, TEXT, and RDBMS data sources]

8
Challenges and Contributions
  • Challenges
  • The compiler must generate efficient data
    processing code
  • It uses information about the low-level layout
    and the mapping between the virtual and low-level
    layouts
  • Challenge in compilation: translating from high
    level to low level while ensuring high locality
    in processing of large datasets
  • Contributions of this paper
  • An improved data-centric transformation
    algorithm
  • An implementation specific to HDF5 as the
    low-level format
9
System Overview
[Figure: system architecture. Components: High-level XML Schema, Mapping Schema, XQuery Source Code, Low-level XML Schema, Compiler, Generated Code, HDF5 Library, Processor and Disk]

10
XQuery and HDF5
  • High-level declarative languages ease application
    development
  • XQuery is a high-level language for processing
    XML datasets
  • Derived from database, declarative, and
    functional languages!
  • HDF5
  • Hierarchical Data Format
  • Widely used in scientific communities
  • A case study with a format which has optimized
    access libraries

11
Use of XML Schemas
  • High-level schema
  • XML is used to provide a virtual view of the
    dataset
  • Low-level schema
  • reflects actual physical layout in HDF5
  • Mapping schema
  • describes mapping between each element of
    high-level schema and low-level schema

12
Oil Reservoir Simulation
  • Support cost-effective Oil Production
  • Simulations on a 3-D grid
  • 17 variables and cell locations in 3-D grid at
    each time step
  • Computation of bypassed regions
  • Expression to determine if a cell is bypassed for
    a time-step
  • Within a spatial region and range of time steps
  • Grid cells that are bypassed for every time-step
    in the range
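
As a rough illustration of the bypassed-region computation described above, the Python sketch below keeps the cells that pass a bypass test at every time step in a range. The record layout, the name `always_bypassed`, and the velocity-threshold predicate used in the example are assumptions made for illustration; the slides do not give the actual bypass expression.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Cell = Tuple[int, int, int]
Record = Dict[str, float]


def always_bypassed(
    records: Iterable[Record],
    is_bypassed: Callable[[Record], bool],
    time_steps: range,
) -> Set[Cell]:
    """Return cells that pass `is_bypassed` in every record whose time is in `time_steps`."""
    seen: Set[Cell] = set()
    failed: Set[Cell] = set()
    for rec in records:
        if rec["time"] not in time_steps:
            continue  # outside the requested range of time steps
        cell = (rec["x"], rec["y"], rec["z"])
        seen.add(cell)
        if not is_bypassed(rec):
            failed.add(cell)  # one non-bypassed time step disqualifies the cell
    return seen - failed


# Hypothetical data: two cells, with a made-up velocity threshold as the predicate.
records = [
    {"x": 0, "y": 0, "z": 0, "time": 1, "velocity": 0.1},
    {"x": 0, "y": 0, "z": 0, "time": 2, "velocity": 0.2},
    {"x": 1, "y": 0, "z": 0, "time": 1, "velocity": 0.1},
    {"x": 1, "y": 0, "z": 0, "time": 2, "velocity": 5.0},
]
result = always_bypassed(records, lambda r: r["velocity"] < 1.0, range(1, 3))
```

Note that the whole result is produced in a single pass over the records, which is the access pattern the later slides aim for.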

[Figure: oil reservoir management]

13
High-Level Schema
  <xs:element name="data" maxOccurs="unbounded">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="x" type="xs:integer"/>
        <xs:element name="y" type="xs:integer"/>
        <xs:element name="z" type="xs:integer"/>
        <xs:element name="time" type="xs:integer"/>
        <xs:element name="velocity" type="xs:float"/>
        <xs:element name="mom" type="xs:float"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

14
High-Level XQuery Code for Oil Reservoir Management
  unordered(
    for $i in ($x1 to $x2)
    for $j in ($y1 to $y2)
    for $k in ($z1 to $z2)
    let $p := document("OilRes.xml")/data
    where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
      and ($p/time > $tmin) and ($p/time < $tmax)
    return
      <info>
        <coord> {$i, $j, $k} </coord>
        <summary> {analyze($p)} </summary>
      </info>
  )

15
Low-Level Schema
  <file name="info">
    <sequence>
      <group name="data">
        <attribute name="time">
          <datatype> integer </datatype>
          <dataspace> <rank> 1 </rank> <dimension> 1 </dimension> </dataspace>
        </attribute>
        <dataset name="velocity">
          <datatype> float </datatype>
          <dataspace> <rank> 1 </rank> <dimension> x </dimension> </dataspace>
        </dataset>
        ..............
      </group>
    </sequence>
  </file>

16
Mapping Schema
  • //high/data/velocity → //low/info/data/velocity
  • //high/data/time → //low/info/data/time
  • //high/data/mom → //low/info/data/mom
    index(//low/info/data/velocity, 1)
  • //high/data/x → //low/coord/x
    index(//low/info/data/velocity, 1)
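
One way to picture the mapping schema is as a lookup table from high-level paths to low-level paths. The Python sketch below is only an illustration: the `Target` type, the `MAPPING` table, and `resolve` are hypothetical names, and reading `index(path, 1)` as "aligned with dimension 1 of that dataset" is an assumption, since the slides do not define it.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Target(NamedTuple):
    low_path: str
    # (dataset path, dimension) taken from the index(...) annotation, if any.
    index: Optional[Tuple[str, int]]


VELOCITY = "//low/info/data/velocity"

# Entries copied from the mapping schema on the slide.
MAPPING: Dict[str, Target] = {
    "//high/data/velocity": Target(VELOCITY, None),
    "//high/data/time": Target("//low/info/data/time", None),
    "//high/data/mom": Target("//low/info/data/mom", (VELOCITY, 1)),
    "//high/data/x": Target("//low/coord/x", (VELOCITY, 1)),
}


def resolve(high_path: str) -> Target:
    """Look up where a high-level element lives in the low-level layout."""
    try:
        return MAPPING[high_path]
    except KeyError:
        raise KeyError(f"no mapping for {high_path}") from None


target = resolve("//high/data/x")
```

A compiler front end could consult such a table each time the query touches a high-level element such as `x` or `velocity`.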

17
Compiler Analysis
  • Problem with direct translation
  • Each let expression involves a complete scan over
    the dataset
  • So the final code will need several passes over
    the data
  • Solution
  • Apply data-centric transformations to read each
    portion of an HDF5 dataset only once
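
The difference in the number of passes can be seen in a small Python sketch. It assumes a toy dataset of `(x, value)` pairs and a per-`x` sum as the computation; both are made up for illustration and are not the generated code itself.

```python
from typing import Dict, List, Tuple

Data = List[Tuple[int, float]]


def naive(data: Data, xs: range) -> Tuple[Dict[int, float], int]:
    """Direct translation: one full scan of the data per loop iteration."""
    reads = 0
    out: Dict[int, float] = {}
    for i in xs:
        total = 0.0
        for x, value in data:  # complete scan for every i
            reads += 1
            if x == i:
                total += value
        out[i] = total
    return out, reads


def data_centric(data: Data, xs: range) -> Tuple[Dict[int, float], int]:
    """Single scan: each element updates the iteration it belongs to."""
    out = {i: 0.0 for i in xs}
    reads = 0
    for x, value in data:
        reads += 1
        if x in out:  # find the iteration that accesses this element
            out[x] += value
    return out, reads


data = [(0, 1.0), (1, 2.0), (0, 3.0), (2, 4.0)]
naive_out, naive_reads = naive(data, range(3))
dc_out, dc_reads = data_centric(data, range(3))
```

Both functions produce the same sums, but the naive version reads every element once per iteration while the data-centric version reads each element exactly once.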

18
Naïve Strategy
[Figure: dataset and output under the naive strategy]

19
Data Centric Strategy
[Figure: datasets and output under the data-centric strategy]
  • Requires just one scan

20
Data Centric Transformation
  • Overall idea in data-centric transformation
  • Iterate over each data element in actual storage
  • Find the iterations of the original loop in which
    it is accessed
  • Execute the computation corresponding to those
    iterations
  • Previous work
  • Pingali et al.: blocking
  • Ferreira and Agrawal: data-parallel Java on
    disk-resident datasets
  • Li and Agrawal: XQuery, inverting getData
    functions
  • Our contribution
  • Use the low-level and mapping schemas
  • Extend the idea to cases where multiple datasets
    need to be accessed

21
Data Centric Transformation
  • Mapping function $T : \text{iteration space} \to \text{high-level data}$
  • Mapping function $C : \text{high-level data} \to \text{low-level data}$
  • Mapping function $M = C \circ T : \text{iteration space} \to \text{low-level data}$
  • Our goal is to compute $M^{-1}$.
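
A minimal Python sketch of these mappings follows, assuming a small 3-D grid stored as one row-major array. The grid sizes, the identity-like `T`, and the row-major `C` are assumptions chosen so that the composed mapping is easy to invert; real layouts come from the low-level and mapping schemas.

```python
from typing import Dict, Tuple

NX, NY, NZ = 4, 3, 2  # hypothetical grid dimensions

Iteration = Tuple[int, int, int]


def T(it: Iteration) -> Dict[str, int]:
    """Iteration space -> high-level data element (here, its coordinates)."""
    i, j, k = it
    return {"x": i, "y": j, "z": k}


def C(elem: Dict[str, int]) -> int:
    """High-level element -> low-level storage offset (row-major, assumed)."""
    return (elem["x"] * NY + elem["y"]) * NZ + elem["z"]


def M(it: Iteration) -> int:
    """Composed mapping: iteration space -> low-level offset."""
    return C(T(it))


def M_inv(offset: int) -> Iteration:
    """Inverse mapping: low-level offset -> the iteration that accesses it."""
    xy, z = divmod(offset, NZ)
    x, y = divmod(xy, NY)
    return (x, y, z)


offset = M((1, 2, 1))
iteration = M_inv(offset)
```

With `M_inv` available, the generated code can walk the storage in its natural order and recover, for each element, the loop iteration it contributes to.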

22
Data Centric Transformation
  • Choose one dataset as the base dataset $S_1$ from the
    $n$ datasets to be accessed
  • Apply $M_1^{-1}$ to compute the set of iterations.
  • The expression $M_i \circ M_1^{-1}$ gives the portion of
    dataset $S_i$ that needs to be accessed along with
    $S_1$
  • The choice of base dataset might impact data
    locality.
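
The following Python sketch illustrates composing the mapping of a second dataset with the inverse mapping of the base dataset. It assumes a base dataset S1 laid out as `[time][cell]` and a second dataset S2 with one value per cell; the layouts and the names `m1`, `m1_inv`, `m2`, and `required_s2` are hypothetical.

```python
from typing import Iterable, List, Tuple

NCELLS = 6  # hypothetical number of grid cells

Iteration = Tuple[int, int]  # (time step, cell)


def m1(it: Iteration) -> int:
    """Iteration -> offset in base dataset S1, laid out as [time][cell]."""
    t, c = it
    return t * NCELLS + c


def m1_inv(offset: int) -> Iteration:
    """Offset in S1 -> the iteration that accesses it."""
    return divmod(offset, NCELLS)


def m2(it: Iteration) -> int:
    """Iteration -> offset in S2, which stores one value per cell."""
    _t, c = it
    return c


def required_s2(s1_offsets: Iterable[int]) -> List[int]:
    """Offsets of S2 needed alongside a chunk of S1: m2 applied after m1_inv."""
    return sorted({m2(m1_inv(o)) for o in s1_offsets})


needed = required_s2(range(4, 9))  # a chunk of S1 spanning two time steps
```

Here a chunk of S1 that straddles two time steps needs the S2 entries for cells 4 and 5 (from the first step) and cells 0 to 2 (from the second).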

23
Choice of Base Dataset
  • Min-IO-Volume Strategy
  • Minimize repeated access to any dataset
  • Min-Seek-Time Strategy
  • Minimize any discontinuity in access

24
Template for Generated Code
  • Generated_Query
  • Create an abstract iteration space using the
    source code.
  • Allocate and initialize an array of output
    elements corresponding to the iteration space.
  • For k = 1, ..., NO_OF_CHUNKS
      • Read the kth chunk of dataset S1 using HDF5
        functions and the structural tree.
      • Foreach of the other datasets S2, ..., Sn
          • Access the required chunk of the dataset.
      • Foreach data element in the chunks of data
          • Compute the iteration instance.
          • Apply the reduction computation and update
            the output.
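
The template can be sketched in Python with in-memory lists standing in for HDF5 datasets. Everything specific here is an assumption for illustration: the `[time][cell]` layout of S1, the per-cell layout of S2, the weighted-sum reduction, and the function name `generated_query`; list slicing takes the place of the HDF5 read calls.

```python
from typing import List, Sequence


def generated_query(
    s1: Sequence[float],  # base dataset, laid out as [time][cell] (assumed)
    s2: Sequence[float],  # second dataset, one value per cell (assumed)
    n_cells: int,
    chunk_size: int,
) -> List[float]:
    # Abstract iteration space: one output element per cell.
    output = [0.0] * n_cells
    n_chunks = (len(s1) + chunk_size - 1) // chunk_size
    for k in range(n_chunks):
        # Read the k-th chunk of S1 (a slice stands in for an HDF5 read).
        start = k * chunk_size
        chunk = s1[start:start + chunk_size]
        # Access only the part of S2 that this chunk needs.
        cells = {(start + i) % n_cells for i in range(len(chunk))}
        s2_part = {c: s2[c] for c in cells}
        for i, value in enumerate(chunk):
            # Compute the iteration instance from the storage offset.
            _t, c = divmod(start + i, n_cells)
            # Reduction (hypothetical): weighted sum per cell.
            output[c] += value * s2_part[c]
    return output


s1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # 2 time steps x 3 cells
s2 = [1, 0, 2]
result = generated_query(s1, s2, n_cells=3, chunk_size=4)
```

The result does not depend on `chunk_size`; the chunk size only changes how much of each dataset is touched per read, which is what the experiments vary.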

25
Experiment
  • 200 × 200 × 200 grid with 10 time steps (1.28 GB)
  • 50 × 50 × 50 storage chunk size
26
Experiment
  • 50 × 50 × 50 grid with 200 time steps (400 MB)
  • 25 × 25 × 25 storage chunk size
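
As a quick sanity check on the two experiment configurations (a 200-cubed grid with 10 time steps, 1.28 GB, and 50-cubed storage chunks; a 50-cubed grid with 200 time steps, 400 MB, and 25-cubed storage chunks), the Python sketch below works out the number of storage chunks per time step and the implied bytes per cell per time step. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the function names are made up for this example.

```python
def storage_chunks_per_time_step(grid_edge: int, chunk_edge: int) -> int:
    """Number of cubic storage chunks needed to tile a cubic grid."""
    per_axis = -(-grid_edge // chunk_edge)  # ceiling division
    return per_axis ** 3


def bytes_per_cell_per_step(total_bytes: int, grid_edge: int, time_steps: int) -> float:
    """Dataset size divided by the number of (cell, time step) pairs."""
    return total_bytes / (grid_edge ** 3 * time_steps)


chunks_large = storage_chunks_per_time_step(200, 50)
chunks_small = storage_chunks_per_time_step(50, 25)
density_large = bytes_per_cell_per_step(1_280_000_000, 200, 10)
density_small = bytes_per_cell_per_step(400_000_000, 50, 200)
```

Under these assumptions both datasets come out to the same number of bytes per cell per time step, so the two sizes quoted on the slides are consistent with each other.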
27
Key Observations
  • Overall minimum execution time: the
    Min-IO-Volume strategy when the read chunk size
    matches the storage chunk size
  • Execution time is very sensitive to read chunk
    size in the Min-IO-Volume strategy
  • Execution time is not sensitive to read chunk
    size in the Min-Seek-Time strategy, due to
    buffering of storage chunks

28
Summary
  • Compiler techniques
  • Support high-level abstractions on complex
    low-level data formats
  • Enable use of the same source code across a
    variety of data formats
  • Perform data-centric transformations
    automatically
  • Experimental results show that a minor change in
    strategy can affect performance significantly
  • Future work
  • Cost models to guide strategy and chunk size
    selection
  • Compare performance with manual implementations
  • Parallelize data processing
  • Extend the applicability of the algorithm to a
    more general class of queries