Title: Data-Intensive Computing: From Multi-Cores and GPGPUs to Cloud Computing and Deep Web
1. Data-Intensive Computing: From Multi-Cores and GPGPUs to Cloud Computing and Deep Web
2. Data-Intensive Computing
- Simply put: scalable analysis of large datasets
- How is it different from / related to:
  - Databases: emphasis on processing of static datasets
  - Data mining: the community is focused more on algorithms, not on scalable implementations
  - High performance / parallel computing: more focus on compute-intensive tasks, not on I/O or large datasets
  - Datacenters: use of large resources for hosting data, less on their use for processing
3. Why Now?
- The amount of data is increasing rapidly
- Cheap storage
- Better connectivity; it is easy to move large datasets over the web/grids
- Science is shifting from compute-X to X-informatics
- Business intelligence and analysis
- Google's Map-Reduce has created excitement
4. Architectural Context
- Processor architecture has gone through a major change
  - No more scaling with clock speeds
  - Parallelism (multi-core / many-core) is the trend
  - Accelerators like GPGPUs have become effective
- More challenges for scaling any class of applications
5. Grid/Cloud/Utility Computing
- Cloud computing is a major new trend in industry
  - Data and computation in a "cloud" of resources
  - Pay-for-use model (like a utility)
- Has roots in many developments over the last decade
  - Service-oriented computing, Software as a Service (SaaS)
  - Grid computing: use of wide-area resources
6. My Research Group
- Data-intensive computing on emerging architectures
- Data-intensive computing in the Cloud model
- Data integration and query processing over deep web data
- Querying low-level datasets through automatic workflow composition
- Adaptive computation with time as a constraint
7. Personnel
- Current students
  - 6 PhD students
  - 2 MS thesis students
  - Talking to several first-year students
- Past students
  - 7 PhDs completed between 2005 and 2008
8. Outline
- FREERIDE: data-intensive computing on clusters of multi-cores
- A system for exploiting GPGPUs for data-intensive computing
- FREERIDE-G: data-intensive computing in Cloud environments
- Quick overview of three other projects
9. FREERIDE: Motivation
- Availability of very large datasets and the need to analyze them (data-intensive applications)
- Adoption of multi-cores and the inevitability of parallel programming
- Need to abstract away the difficulties of parallel programming
10. FREERIDE
- A middleware for parallelizing data-intensive applications
- Motivated by the difficulties of implementing and performance-tuning data mining applications
- Based on the observation that data mining, OLAP, and other scientific applications share a similar generalized reduction structure
11. Generalized Reduction Structure
12. SMP Techniques
- Full replication (f-r): the obvious technique
- Locking-based techniques (contrasted with replication in the sketch below):
  - Full locking (f-l)
  - Optimized full locking (o-f-l)
  - Fixed locking (fi-l)
  - Cache-sensitive locking (a hybrid of o-f-l and fi-l)
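As a rough illustration of the trade-off (a minimal sketch with hypothetical names and sizes, not FREERIDE's implementation), full replication gives each thread a private copy of the reduction object and merges the copies once at the end, while full locking guards each element of a single shared copy with its own lock:

#include <mutex>
#include <thread>
#include <vector>

constexpr int kElems = 1024;    // size of the reduction object (assumed)
constexpr int kThreads = 8;

// Full replication (f-r): one private copy per thread, no synchronization
// during the local reduction, one global merge at the end.
std::vector<long> full_replication(const std::vector<int>& data) {
    std::vector<std::vector<long>> copies(kThreads, std::vector<long>(kElems, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&copies, &data, t] {
            for (size_t i = t; i < data.size(); i += kThreads)
                copies[t][data[i] % kElems] += data[i];   // assumes non-negative values
        });
    for (auto& w : workers) w.join();
    std::vector<long> merged(kElems, 0);
    for (const auto& c : copies)
        for (int e = 0; e < kElems; ++e) merged[e] += c[e];
    return merged;
}

// Full locking (f-l): one shared copy, one lock per element.
std::vector<long> full_locking(const std::vector<int>& data) {
    std::vector<long> shared(kElems, 0);
    std::vector<std::mutex> locks(kElems);
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&, t] {
            for (size_t i = t; i < data.size(); i += kThreads) {
                int e = data[i] % kElems;
                std::lock_guard<std::mutex> g(locks[e]);  // per-element lock
                shared[e] += data[i];
            }
        });
    for (auto& w : workers) w.join();
    return shared;
}

The remaining variants (o-f-l, fi-l, cache-sensitive) differ in where the locks live in memory relative to the elements they guard, which is what the layout figure on the next slide compares.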
13. Memory Layout of the SMP Techniques
14. Experimental Setup
- Intel Xeon E5345 CPUs
- 2 quad-core CPUs per machine
- Each core runs at 2.33 GHz
- 6 GB main memory
- Cluster nodes connected by InfiniBand
15. Experimental Results: K-means (CMP)
16. K-means (cluster)
17. Apriori (CMP)
18. Apriori (cluster)
19. E-M (CMP)
20. E-M (cluster)
21. Summary of Results
- Full replication and cache-sensitive locking can each outperform the other, depending on the nature of the application
- Cache-sensitive locking shows high overhead when there is little computation between updates to the reduction object
- MPI processes compete well with the best of the other two techniques on smaller numbers of cores, but suffer communication overheads on larger numbers of cores
22. Background: GPU Computing
- Multi-core architectures are becoming more popular in high-performance computing
- GPUs are inexpensive and fast
- CUDA is a high-level language that supports programming on the GPU
23. Architecture of the GeForce 8800 GPU (one multiprocessor)
24. Challenges of Data-Intensive Computing on GPUs
- SIMD, shared-memory programming
- 3 steps involved in the main loop (sketched in the kernel below):
  - Data read
  - Computing update
  - Writing update
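A minimal CUDA kernel sketch of those three steps (hypothetical names and sizes, not the middleware's generated code; it assumes a device that supports floating-point atomics): each thread block keeps its own copy of the reduction object in shared memory and merges it into the global copy at the end.

#define OBJ_SIZE 64   // size of the reduction object (assumed)

__global__ void reduction_kernel(const float* data, int n, float* global_obj) {
    __shared__ float obj[OBJ_SIZE];               // per-block reduction object
    for (int i = threadIdx.x; i < OBJ_SIZE; i += blockDim.x)
        obj[i] = 0.0f;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
        float d = data[i];                        // (1) data read
        int slot = abs((int)d) % OBJ_SIZE;        // (2) computing update
        atomicAdd(&obj[slot], d);                 // (3) writing update
    }
    __syncthreads();

    // Merge this block's copy into the global reduction object.
    for (int i = threadIdx.x; i < OBJ_SIZE; i += blockDim.x)
        atomicAdd(&global_obj[i], obj[i]);
}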
25. Complications of CUDA Programming
- The user must have thorough knowledge of the GPU architecture and the CUDA programming model
- Must specify the grid configuration
- Has to deal with memory allocation and copying
- Needs to know which data should be copied into shared memory, and how much shared memory to use
(The host-side sketch below shows the boilerplate these bullets refer to.)
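A minimal host-side sketch of that boilerplate (illustrative names; it launches the reduction_kernel from the previous sketch): grid configuration, device allocation, host-to-device copies, kernel invocation, and copying the result back.

#include <cuda_runtime.h>

void run_reduction(const float* host_data, int n) {
    float *d_data, *d_obj;
    cudaMalloc(&d_data, n * sizeof(float));            // device allocation
    cudaMalloc(&d_obj, OBJ_SIZE * sizeof(float));
    cudaMemcpy(d_data, host_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                // host-to-device copy
    cudaMemset(d_obj, 0, OBJ_SIZE * sizeof(float));

    dim3 block(256);                                   // grid configuration
    dim3 grid((n + block.x - 1) / block.x);
    reduction_kernel<<<grid, block>>>(d_data, n, d_obj);

    float result[OBJ_SIZE];
    cudaMemcpy(result, d_obj, sizeof(result),
               cudaMemcpyDeviceToHost);                // copy result back
    cudaFree(d_data);
    cudaFree(d_obj);
}

Generating exactly this kind of code automatically is the job of the middleware described in the next slides.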
26. Architecture of the Middleware
- User input
- Code analyzer
  - Analysis of variables (variable type and size)
  - Analysis of reduction functions (sequential code from the user)
- Code generator (generates the CUDA code and the C code that invokes the kernel function)
27. Architecture of the Middleware
[Diagram: the user supplies variable information, sequential reduction functions, and optional functions. The Code Analyzer (in LLVM) includes a Variable Analyzer and derives variable access patterns and combination operations. The Code Generator emits the host program, the kernel functions, and the grid configuration and kernel invocation, which are compiled into the executable.]
28. User Input
29. Analysis of the Sequential Code
- Get the access features of each variable
- Figure out which data must be replicated
- Get the operator for the global combination
- Calculate how much shared memory to use and which data to copy into it
(A hypothetical example of what the analyzer extracts follows below.)
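To make this concrete, here is a hypothetical user-supplied sequential reduction function (k-means style), with comments marking what the analysis above might infer from it; this is an illustration, not the middleware's actual input format:

void kmeans_update(const float* point,   // read-only, streamed: not replicated
                   const float* centers, // read-only: candidate for shared memory
                   float* sums,          // updated with '+': replicated per thread;
                   int* counts,          //   '+' is the global combination operator
                   int k, int dim) {
    int best = 0;
    float best_dist = 1e30f;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = point[d] - centers[c * dim + d];
            dist += diff * diff;
        }
        if (dist < best_dist) { best_dist = dist; best = c; }
    }
    for (int d = 0; d < dim; ++d)
        sums[best * dim + d] += point[d];   // reduction update (operator: +)
    counts[best] += 1;                      // reduction update (operator: +)
    // Shared-memory sizing: the analyzer totals the bytes of the variables it
    // decides to place in shared memory (e.g., k*dim floats for 'centers').
}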
30. Experimental Results: Speedup of k-means
31. Speedup of EM
32. Emergence of Cloud and Utility Computing
- Groups generating data
  - Use remote resources for storing data
  - Already popular with SDSC/SRB
- Scientists interested in deriving results from the data
  - Use distinct, but also remote, resources for processing
- The Remote Data Analysis paradigm
  - Data, computation, and user at different locations
  - Each unaware of the location of the others
33. Remote Data Analysis
- Advantages
  - Flexible use of resources
  - Does not overload the data repository
  - No unnecessary data movement
  - Avoids caching; data is processed once
- Challenge: tedious details
  - Data retrieval and caching
  - Use of parallel configurations
  - Use of heterogeneous resources
  - Performance issues
- Can a grid middleware ease application development for remote data analysis and yet provide high performance?
34. Our Work
- FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)
- Enables development of flexible and scalable remote data-processing applications
[Diagram: middleware user, data repository cluster, compute cluster]
35. Challenges
- Support the use of parallel configurations
  - For hosting data and for processing data
- Transparent data movement
- Integration with Grid/Web standards
- Resource selection
  - Computing resources
  - Data replicas
- Scheduling and load balancing
- Data wrapping issues
36. FREERIDE(-G) Processing Structure
- KEY observation: most data mining algorithms follow a canonical loop
- Middleware API:
  - Subset of the data to be processed
  - Reduction object
  - Local and global reduction operations
  - Iterator
- Derived from the precursor system FREERIDE

While (not done) {
    forall (data instances d) {
        (i, val) = process(d);
        R(i) = R(i) op val;
    }
}
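A rough C++ rendering of that loop against the API (hypothetical names, not FREERIDE-G's actual interface): a local reduction over one node's subset of the data, followed by a global reduction that combines the per-node reduction objects with op.

#include <map>
#include <utility>
#include <vector>

using ReductionObject = std::map<int, double>;

// process(d): maps a data instance to a (key, value) update.
std::pair<int, double> process(double d) { return { (int)d % 10, d }; }

// Local reduction: the forall loop of the canonical structure.
ReductionObject local_reduction(const std::vector<double>& subset) {
    ReductionObject R;
    for (double d : subset) {
        auto [i, val] = process(d);
        R[i] += val;               // R(i) = R(i) op val, with op = '+'
    }
    return R;
}

// Global reduction: combine the reduction objects from all nodes.
ReductionObject global_reduction(const std::vector<ReductionObject>& locals) {
    ReductionObject R;
    for (const auto& local : locals)
        for (const auto& [i, val] : local) R[i] += val;
    return R;
}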
37. FREERIDE-G Evolution
- FREERIDE
  - Data stored locally
- FREERIDE-G-ADR
  - ADR responsible for remote data retrieval
- FREERIDE-G-SRB
  - SRB responsible for remote data retrieval
- FREERIDE-G grid service
  - A grid service featuring:
    - Load balancing
    - Data integration
38. Evolution
[Diagram: FREERIDE evolved into FREERIDE-G-ADR and FREERIDE-G-SRB, and then into FREERIDE-G-GT, layered over application data, ADR, SRB, and Globus.]
39. FREERIDE-G System Architecture
40. Compute Node
- There are more compute nodes than data hosts
- Each node:
  - Registers I/O (from the index)
  - Connects to a data host
  - While (chunks to process):
    - Dispatch I/O request(s)
    - Poll pending I/O
    - Process retrieved chunks
(A sketch of this loop follows below.)
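A minimal sketch of the compute-node loop (all names are hypothetical stand-ins, not FREERIDE-G's actual API): keep a few I/O requests in flight so that data retrieval overlaps with processing.

#include <deque>
#include <vector>

struct Chunk { int id; std::vector<char> bytes; };

// Stubs standing in for the real data-host interface.
void dispatch_io_request(int chunk_id) { /* send request to the data host */ }
std::vector<Chunk> poll_pending_io(std::vector<int>& pending) {
    std::vector<Chunk> done;
    for (int id : pending) done.push_back({id, {}});  // stub: complete at once
    pending.clear();
    return done;
}
void process_chunk(const Chunk& c) { /* local reduction over the chunk */ }

void compute_node_loop(std::deque<int> chunks_to_fetch) {
    std::vector<int> pending;                 // requests in flight
    const size_t kMaxInFlight = 4;
    while (!chunks_to_fetch.empty() || !pending.empty()) {
        // Dispatch I/O request(s) for chunks we still need.
        while (!chunks_to_fetch.empty() && pending.size() < kMaxInFlight) {
            int id = chunks_to_fetch.front();
            chunks_to_fetch.pop_front();
            dispatch_io_request(id);
            pending.push_back(id);
        }
        // Poll pending I/O; process whatever has arrived.
        for (const Chunk& c : poll_pending_io(pending))
            process_chunk(c);                 // process retrieved chunks
    }
}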
41. FREERIDE-G in Action
[Diagram: a compute node performs I/O registration and establishes a connection to an SRB Agent on the data host; while there are more chunks to process, I/O requests are dispatched and pending I/O is polled through the SRB Master and MCAT, and the retrieved data chunks are analyzed on the compute node.]
42. Implementation Challenges
- Interaction with the code repository
  - Simplified Wrapper and Interface Generator (SWIG)
  - XML descriptors of API functions
  - Each API function wrapped in its own class
- Integration with MPICH-G2
  - Supports MPI
  - Deployed through Globus components (GRAM)
  - Hides potential heterogeneity in service startup and management
43. Experimental Setup
- Organizational grid:
  - Data hosted on an Opteron 250 cluster
  - Processed on an Opteron 254 cluster
  - Clusters connected by two 10-Gigabit optical-fiber links
- Goals:
  - Demonstrate the parallel scalability of the applications
  - Evaluate the overhead of the MPICH-G2 and Globus Toolkit deployment mechanisms
44. Deployment Overhead Evaluation
- There is clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment:
  - K-means clustering with a 6.4 GB dataset: 18-20%
  - Vortex detection with a 14.8 GB dataset: 17-20%
45. Deep Web Data Integration
- The emergence of the deep web
  - The deep web is huge
  - Different from the surface web
- Challenges for integration:
  - Not accessible through search engines
  - Inter-dependences among deep web sources
46. Motivating Example
Given a gene, ERCC6, we want to know the amino acid occurring at the corresponding position in the orthologous gene of non-human mammals.
[Diagram: the query flows from dbSNP (amino-acid positions for the nonsynonymous SNP) through Entrez Gene (encoded protein) and a sequence database (protein sequence, encoded orthologous protein) to an alignment database.]
47. Observations
- There are inter-dependences between sources
- Answering such queries is time-consuming if done manually
- Querying must follow an intelligent order
- The user query contains implicit sub-goals
48. Contributions
- Formulate the query planning problem for deep web databases with dependences
- Propose a dynamic query planner
- Develop cost models and an approximate planning algorithm
- Integrate the algorithm with a deep web mining tool
(A toy ordering sketch follows below.)
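As a toy sketch of the ordering problem (not the proposed planner, which uses cost models and approximation): a simple topological sort over a hypothetical dependency graph queries each source only after the sources it depends on, mirroring the motivating example.

#include <iostream>
#include <map>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

void visit(const std::string& s, const Graph& deps,
           std::map<std::string, bool>& done, std::vector<std::string>& order) {
    if (done[s]) return;
    done[s] = true;
    if (auto it = deps.find(s); it != deps.end())
        for (const auto& d : it->second) visit(d, deps, done, order);
    order.push_back(s);                     // all prerequisites queried first
}

int main() {
    // Hypothetical dependencies modeled on the motivating example: the
    // alignment database needs a protein sequence, which needs the encoded
    // protein from Entrez Gene, which needs the SNP position from dbSNP.
    Graph deps = {
        {"AlignmentDB", {"SequenceDB"}},
        {"SequenceDB",  {"EntrezGene"}},
        {"EntrezGene",  {"dbSNP"}},
    };
    std::map<std::string, bool> done;
    std::vector<std::string> order;
    visit("AlignmentDB", deps, done, order);
    for (const auto& s : order) std::cout << s << "\n";  // dbSNP queried first
}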
49. HASTE Middleware Design Goals
- Enable time-critical event handling to achieve the maximum benefit while satisfying the time constraint
- Be compatible with Grid and Web services
- Enable easy deployment and management with minimum human intervention
- Be usable in a heterogeneous distributed environment
(ICAC 2008)
50. HASTE Middleware Design (ICAC 2008)
51. Workflow Composition System
52. Summary
- Several projects cutting across parallel computing, distributed computing, and databases/data mining
- A number of opportunities for MS thesis, MS project, and PhD students
- Relevant courses:
  - CSE 621/721
  - CSE 762
  - CSE 671/674