Tools and Techniques for the Data Grid - PowerPoint PPT Presentation

About This Presentation
Slides: 49
Provided by: Lyd666
Transcript and Presenter's Notes

Title: Tools and Techniques for the Data Grid


1
Tools and Techniques for the Data Grid
  • Gagan Agrawal

2
Grids and Data Grids
  • Grid Computing
  • Large scale problem solving using resources over
    the internet
  • Distributed computing, but across multiple
    administrative domains
  • Data Grid
  • Grid with focus on sharing and processing large
    scale datasets

3
Scientific Data Analysis on Grid-based Data
Repositories
  • Scientific data repositories
  • Large volume
  • Gigabyte, Terabyte, Petabyte
  • Distributed datasets
  • Generated/collected by scientific simulations or
    instruments
  • Data could be streaming in nature
  • Scientific data analysis

[Figure labels: Data Specification, Data Organization, Data Extraction, Data Movement, Data Analysis, Data Visualization]

4
Opportunities
  • Scientific simulations and data collection
    instruments generating large scale data
  • Grid standards enabling sharing of data
  • Rapidly increasing wide-area bandwidths

5
Existing Efforts
  • Data grids are recognized as an important component
    of grid/distributed computing
  • Major topics
  • Efficient/Secure Data Movement
  • Replica Selection
  • Metadata catalogs / Metadata services
  • Setting up workflows

6
Open Issues
  • Accessing / Retrieving / Processing data from
    scientific repositories
  • Need to deal with low-level formats
  • Integrating tools and services having/requiring
    data with different formats
  • Support for processing streaming data in a
    distributed environment
  • Efficient distributed data-intensive applications
  • Developing scalable data analysis applications

7
Ongoing Projects
  • Automatic Data Virtualization
  • On the fly information integration in a
    distributed environment
  • Middleware for Processing Streaming Data
  • Supporting Coarse-grained pipelined parallelism
  • Compiling XQuery on Scientific and Streaming Data
  • Middleware and Algorithms for Scalable Data
    Mining

8
Outline
  • Automatic Data Virtualization
  • Relational/SQL
  • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Cluster and Grid-based data mining middleware

9
Automatic Data Virtualization Motivation
  • Emergence of grid-based data repositories
  • Can enable sharing of data in an unprecedented
    way
  • Access mechanisms for remote repositories
  • Complex low-level formats make accessing and
    processing of data difficult
  • Main desired functionality
  • Ability to select, down-load, and process a
    subset of data

10
Data Virtualization
  • Definitions by the Global Grid Forum's DAIS working group
  • A Data Virtualization describes an abstract view
    of data.
  • A Data Service implements the mechanism to
    access and process data through the Data
    Virtualization.

[Figure labels: dataset, Data Virtualization (an abstract view of data), Data Service]

11
Our Approach Automatic Data Virtualization
  • Automatically create data services
  • A new application of compiler technology
  • A meta-data descriptor describes the layout of
    data on a repository
  • An abstract view is exposed to the users
  • Two implementations
  • Relational /SQL-based
  • XML/XQuery based
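
The idea on this slide can be illustrated with a minimal sketch. It is not the generated code from the talk: the dict-based descriptor, the packed little-endian record format, and the function names (`record_struct`, `scan`, `select`, `build_demo_dataset`) are all assumptions made for illustration. Only the attribute names are borrowed from the IPARS example later in the deck. A descriptor states how records are laid out in flat files, and a small "data service" exposes them as rows that can be filtered.

```python
import os
import struct
import tempfile

# Hypothetical descriptor. The attribute names come from the IPARS example;
# the dict layout and the struct type codes are assumptions of this sketch.
DESCRIPTOR = {
    "schema": [("REL", "h"), ("TIME", "i"), ("X", "f"), ("SOIL", "f")],
    "files": [],  # one flat file per storage node, filled in by the caller
}


def record_struct(descriptor):
    """Build a struct.Struct for one fixed-size, unpadded record."""
    fmt = "<" + "".join(code for _, code in descriptor["schema"])
    return struct.Struct(fmt)


def scan(descriptor):
    """Yield every record of every file as a dict (the abstract view)."""
    rec = record_struct(descriptor)
    names = [name for name, _ in descriptor["schema"]]
    for path in descriptor["files"]:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(rec.size)
                if len(chunk) < rec.size:
                    break
                yield dict(zip(names, rec.unpack(chunk)))


def select(descriptor, predicate):
    """Return the rows for which predicate(row) is true."""
    return [row for row in scan(descriptor) if predicate(row)]


def build_demo_dataset(directory):
    """Write two small binary files standing in for two storage nodes."""
    rec = record_struct(DESCRIPTOR)
    paths = []
    for node in range(2):
        path = os.path.join(directory, f"node{node}.bin")
        with open(path, "wb") as f:
            for time in range(3):
                f.write(rec.pack(node, time, float(time), 0.25 * time))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        DESCRIPTOR["files"] = build_demo_dataset(tmp)
        for row in select(DESCRIPTOR, lambda r: r["TIME"] >= 1 and r["SOIL"] < 0.5):
            print(row)
```

The user only writes the predicate against attribute names; everything about byte offsets and file placement stays inside the descriptor and the functions driven by it.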

12
Relational/SQL Implementation
[Figure labels: Meta-data Descriptor, User Defined Aggregate, Select Query Input, Aggregation Service]

13
Design a Meta-data Description Language
  • Requirements
  • Specify the relationship of a dataset to the
    virtual dataset schema
  • Describe the dataset physical layout within a
    file
  • Describe the dataset distribution on nodes of one
    or more clusters
  • Specify the subsetting index attributes
  • Easy to use for data repository administrators
    and also convenient for our code generation

14
An Example
Component I: Dataset Schema Description
    IPARS             // Dataset schema name
    REL   short int   // Data type definition
    TIME  int
    X     float
    Y     float
    Z     float
    SOIL  float
    SGAS  float
  • Oil Reservoir Management
  • The dataset comprises several simulations on the
    same grid
  • For each realization and each grid point, a number
    of attributes are stored.
  • The dataset is stored on a 4 node cluster.

Component II: Dataset Storage Description
    IparsData                  // Dataset name
    DatasetDescription IPARS   // Dataset schema for IparsData
    DIR0  osu0/ipars
    DIR1  osu1/ipars
    DIR2  osu2/ipars
    DIR3  osu3/ipars
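
As a rough illustration of how a tool might consume the two components above, the following sketch reads them into plain Python structures. The concrete syntax is an assumption: the punctuation of the original descriptor did not survive in this transcript, so the sketch uses one whitespace-separated "name value" pair per line. The function names `parse_schema` and `parse_storage` are hypothetical.

```python
# Assumed, simplified descriptor syntax: first line is the name, then one
# "NAME value" pair per line. The real language may use other punctuation.
SCHEMA_TEXT = """\
IPARS
REL short int
TIME int
X float
Y float
Z float
SOIL float
SGAS float
"""

STORAGE_TEXT = """\
IparsData
DatasetDescription IPARS
DIR0 osu0/ipars
DIR1 osu1/ipars
DIR2 osu2/ipars
DIR3 osu3/ipars
"""


def parse_schema(text):
    """Return the schema name and its (attribute, type) pairs in order."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = []
    for ln in lines[1:]:
        attr, type_name = ln.split(None, 1)
        fields.append((attr, type_name))
    return {"name": lines[0], "fields": fields}


def parse_storage(text):
    """Return the dataset name, its schema name, and its directories."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    storage = {"name": lines[0], "schema": None, "dirs": []}
    for ln in lines[1:]:
        key, value = ln.split(None, 1)
        if key == "DatasetDescription":
            storage["schema"] = value
        elif key.startswith("DIR"):
            storage["dirs"].append(value)
    return storage


if __name__ == "__main__":
    print(parse_schema(SCHEMA_TEXT))
    print(parse_storage(STORAGE_TEXT))
```

Keeping the schema and the storage description separate means the same schema could, in principle, be reused for a dataset stored on a different set of nodes.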
15
Evaluate the Scalability of Our Tool
  • Scale the number of nodes hosting the Oil
    reservoir management dataset
  • Extract a subset of interest of size 1.3 GB
  • The execution times scale almost linearly.
  • The performance difference varies between 5% and 34%,
    with an average difference of 16%.

16
Comparison with an existing database (PostgreSQL)
No.  Description
1    SELECT * FROM TITAN
2    SELECT * FROM TITAN WHERE X>0 AND X<10000 AND Y>0 AND Y<10000 AND Z>0 AND Z<100
3    SELECT * FROM TITAN WHERE DISTANCE(X,Y,Z) < 1000
4    SELECT * FROM TITAN WHERE S1 < 0.01
5    SELECT * FROM TITAN WHERE S1 < 0.5
  • The dataset is 6 GB of data for satellite data processing.
  • The total storage required after loading the data
    into PostgreSQL is 18 GB.
  • Indexes were created in PostgreSQL on both the
    spatial coordinates and S1.
  • No special performance tuning was applied for the
    experiment.

17
Outline
  • Automatic Data Virtualization
  • Relational/SQL
  • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Coarse-grained pipelined parallelism

18
XML/XQuery Implementation
[Figure labels: NetCDF, HDF5, TEXT, RMDB]

19
Programming/Query Language
  • High-level declarative languages ease application
    development
  • Popularity of Matlab for scientific computations
  • New challenges in compiling them for efficient
    execution
  • XQuery is a high-level language for processing
    XML datasets
  • Derived from database, declarative, and
    functional languages!
  • XPath (a subset of XQuery) embedded in an
    imperative language is another option

20
Approach / Contributions
  • Use of XML Schemas to provide high-level
    abstractions on complex datasets
  • Using XQuery with these Schemas to specify
    processing
  • Issues in Translation
  • High-level to low-level code
  • Data-centric transformations for locality in
    low-level codes
  • Issues specific to XQuery
  • Recognizing recursive reductions
  • Type inferencing and translation

21
System Architecture
[Figure labels: External Schema, XML Mapping Service, logical XML schema, physical XML schema, Compiler, XQuery Sources, C/C++]

22
Outline
  • Automatic Data Virtualization
  • Relational/SQL
  • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Cluster and Grid-based data mining middleware

23
Overall Goal
  • Tools for data integration driven by
  • Data explosion
  • Data size and number of data sources
  • New analysis tools
  • Autonomous resources
  • Heterogeneous data representation and various
    interfaces
  • Frequent Updates
  • Common Situations
  • Flat-file datasets
  • Ad-hoc sharing of data

24
Current Approaches
  • Manually written wrappers
  • Problems
  • O(N^2) wrappers needed, O(N) for a single update
  • Mediator-based integration systems
  • Problems
  • Need a common intermediate format
  • Unnecessary data transformation
  • Integration using web/grid services
  • Needs all tools to be web-services (all data in
    XML?)

25
Our Approach
  • Automatically generate wrappers
  • Stand-alone programs
  • For integrated DBs, (grid) workflow systems
  • Transform data in files of arbitrary formats
  • No domain- or format-specific heuristics
  • Layout information provided by users
  • Help biologists write layout descriptors using
    data mining techniques
  • Particularly attractive for
  • flat-file datasets
  • ad hoc data sharing
  • data grid environments

26
Our Approach Advantages
  • Advantages
  • No DB or query support required
  • One descriptor per resource needed
  • No unnecessary transformation
  • New resources can be integrated on-the-fly
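
To make the comparison with manually written wrappers concrete, here is a small back-of-the-envelope sketch. It assumes that hand-written wrappers are directional, one per ordered (source, target) pair of formats; that counting convention and the function names are assumptions of this example, not something the slides state.

```python
def pairwise_wrappers(n):
    """Hand-written wrappers: one per ordered (source, target) pair."""
    return n * (n - 1)


def wrappers_touched_by_update(n):
    """Wrappers that read or write the single format that changed."""
    return 2 * (n - 1)


def descriptors(n):
    """Descriptor-based approach: one layout descriptor per resource."""
    return n


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), descriptors(n))
```

Under this convention the wrapper count grows quadratically while the descriptor count grows linearly, and a format change means rewriting one descriptor rather than every wrapper that touches the format.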

27
Our Approach Challenges
  • Description language
  • Format and logical view of data in flat files
  • Easy to interpret and write
  • Wrapper generation and Execution
  • Correspondence between data items
  • Separating wrapper analysis and execution
  • Interactive tools for writing layout descriptors
  • What data mining techniques to use ?

28
Wrapper Generation System Overview
[Figure labels: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]

29
Layout Description Language
  • Goal
  • To describe data in arbitrary flat file format
  • Easy to interpret and write
  • Components
  • Schema description
  • Layout description
  • Example: FASTA

30
Layout Description Language
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3
  • Component I: Schema Description
  • FASTA //Schema Name
  • ID string //Data type definitions
  • DESCRIPTION string
  • SEQ string

31
Layout Description Language
  • Key observations on data layout
  • Strings of variable length
  • Delimiters widely used
  • Data fields divided into variables
  • Repetitive structures
  • Key tokens
  • constant string
  • LINESIZE
  • optional
  • <repeating>

32
Layout Description Language
  • Component II: Layout Description
  • LOOP ENTRY 1EOF1
  • > ID DESCRIPTION
  • < \n SEQ >
  • \n EOF
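
For comparison, the reader that such a layout description stands for might look like the following hand-written sketch. This is not output of the wrapper generator described in the talk; the `FastaEntry` type and `read_fasta` function are hypothetical names, and the field names follow the FASTA schema from the earlier slide (ID, DESCRIPTION, SEQ). Each entry starts with a ">" header line, followed by a repeating group of sequence lines.

```python
from typing import Iterator, NamedTuple


class FastaEntry(NamedTuple):
    ID: str
    DESCRIPTION: str
    SEQ: str


def read_fasta(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per '>' header, joining its sequence lines."""
    entry_id, description, seq_lines = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if entry_id is not None:
                yield FastaEntry(entry_id, description, "".join(seq_lines))
            header = line[1:].split(None, 1)
            entry_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            seq_lines = []
        elif entry_id is not None:
            seq_lines.append(line)
    # End of input closes the last entry.
    if entry_id is not None:
        yield FastaEntry(entry_id, description, "".join(seq_lines))


SAMPLE = """\
>seq1 comment1
ASTPGHTIIYEAVCLHNDRTTIP
>seq2 comment2
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
"""

if __name__ == "__main__":
    for entry in read_fasta(SAMPLE):
        print(entry.ID, entry.DESCRIPTION, len(entry.SEQ))
```

Writing this by hand for every format is exactly the effort a declarative layout description is meant to avoid: the delimiters, the repeating sequence lines, and the end-of-file case are all stated once in the descriptor instead of in code.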

33
Outline
  • Automatic Data Virtualization
  • Relational/SQL
  • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Coarse-grained pipelined parallelism

34
Streaming Data Model
  • Continuous data arrival and processing
  • Emerging model for data processing
  • Sources that produce data continuously: sensors,
    long running simulations
  • WAN bandwidths growing faster than disk
    bandwidths
  • Active topic in many computer science
    communities
  • Databases
  • Data Mining
  • Networking

35
Summary/Limitations of Current Work
  • Focus on
  • centralized processing of stream from a single
    source (databases, data mining)
  • communication only (networking)
  • Many applications involve
  • distributed processing of streams
  • streams from multiple sources

36
Motivating Application
Network Fault Management System

[Figure label: Switch Network]
37
Motivating Application (2)

Computer Vision Based Surveillance
38
Features of Distributed Streaming Processing
Applications
  • Data sources could be distributed
  • Over a WAN
  • Continuous data arrival
  • Enormous volume
  • Probably can't communicate it all to one site
  • Results from analysis may be desired at
    multiple sites
  • Real-time constraints
  • A real-time, high-throughput, distributed
    processing problem

39
Need for a Grid-Based Stream Processing
Middleware
  • Application developers interested in data
    stream processing
  • Would like to have abstracted away
  • Grid standards and interfaces
  • Adaptation function
  • Would like to focus on algorithms only
  • GATES is a middleware for
  • Grid-based
  • Self-adapting
  • Data Stream Processing

40
Adaptation for Real-time Processing
  • Analysis on streaming data is approximate
  • Accuracy and execution rate trade-off can be
    captured by certain parameters (Adaptation
    parameters)
  • Sampling Rate
  • Size of summary structure
  • Application developers can expose these
    parameters and a range of values

41
API for Adaptation
    public class Sampling-Stage implements StreamProcessing {
        void init()
        void work(buffer in, buffer out) {
            while (true) {
                Image img = get-from-buffer-in-GATES(in);
                Image img-sample = Sampling(img, sampling-ratio);
                put-to-buffer-in-GATES(img-sample, out);
            }
        }
    }

  • Calls into GATES used for adaptation
        GATES.Information-About-Adjustment-Parameter(min, max, 1)
        sampling-ratio = GATES.getSuggestedParameter()
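
The adaptation pattern on this slide can be sketched in a few lines of Python. This is not the GATES API: `StubMiddleware`, its methods, the backlog-based adjustment rule, and the use of integer percentages are all assumptions invented for this illustration. The stage declares a range for its sampling parameter once, then asks the middleware for the current suggested value each time it processes a batch.

```python
import random


class StubMiddleware:
    """Hypothetical stand-in for the middleware; not the GATES API."""

    def __init__(self):
        self.low = self.high = self.step = self.value = None

    def declare_parameter(self, low, high, step):
        """Register the allowed range (in percent) of the parameter."""
        self.low, self.high, self.step = low, high, step
        self.value = high  # start at the most accurate setting

    def report_backlog(self, backlog, limit):
        """Assumed rule: sample less when behind, more when keeping up."""
        if backlog > limit:
            self.value = max(self.low, self.value - self.step)
        else:
            self.value = min(self.high, self.value + self.step)

    def suggested_parameter(self):
        return self.value


class SamplingStage:
    """A stage that keeps roughly `ratio` percent of its input items."""

    def __init__(self, middleware, seed=0):
        self.middleware = middleware
        self.rng = random.Random(seed)

    def init(self):
        self.middleware.declare_parameter(low=10, high=100, step=10)

    def work(self, items):
        ratio = self.middleware.suggested_parameter() / 100.0
        return [x for x in items if self.rng.random() < ratio]


if __name__ == "__main__":
    mw = StubMiddleware()
    stage = SamplingStage(mw)
    stage.init()
    for backlog in (0, 500, 500, 500, 0):
        mw.report_backlog(backlog, limit=100)
        kept = stage.work(range(1000))
        print(mw.suggested_parameter(), len(kept))
```

The point of the design is the split of responsibilities: the application exposes the parameter and its range, and the decision of which value to use at any moment is left to the middleware.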
42
Outline
  • Automatic Data Virtualization
  • Relational/SQL
  • XML/XQuery based
  • Information Integration
  • Middleware for Streaming Data
  • Cluster and Grid-based data mining middleware

43
Scalable Mining Problem
  • Our understanding of what algorithms and
    parameters will give desired insights is often
    limited
  • The time required for creating scalable
    implementations of different algorithms and
    running them with different parameters on large
    datasets slows down the data mining process

44
Mining in a Grid Environment
  • A data mining application in a grid environment
  • Needs to exploit different forms of
    available parallelism
  • Needs to deal with different data layouts and
    formats
  • Needs to adapt to resource availability

45
FREERIDE Overview
  • Framework for Rapid Implementation of Datamining
    Engines
  • Demonstrated for a variety of standard mining
    algorithms
  • Targets distributed memory parallelism, shared
    memory parallelism, and their combination
  • Can be used as basis for scalable grid-based
    data mining implementations
  • Published in SDM 01, SDM 02, SDM 03, Sigmetrics
    02, Europar 02, IPDPS 03, IEEE TKDE (to appear)

46
FREERIDE-G
  • Data processing may not be feasible where the
    data resides
  • Need to identify resources for data processing
  • Need to abstract data retrieval, movement and
    parallel processing

47
Group Members
  • Ph.D. students
  • Liang Chen
  • Leo Glimcher
  • Kaushik Sinha
  • Li Weng
  • Xuan Zhang
  • Qian Zhu
  • Recently Graduated
  • Ruoming Jin (Kent State)
  • Wei Du (Yahoo)
  • Xiaogang Li (Wi 06, AskJeeves)

48
Getting Involved
  • Talk to me
  • Most recent papers are available online
  • Sign in for my 888