Grid Performance Issues in the Design of a Grid-Enabled Query Processor

1
Grid Performance Issues in the Design of a
Grid-Enabled Query Processor

APART workshop
Larnaca, March 27-28, 2003
Tasos Gounaris

2
Presentation Outline

Project Overview and Presentation of the Polar
Grid-enabled Distributed Query Processor (DQP)
Grid Issues

3
Motivation

Massive growth in database size
Emergence of the Grid
Limitations of existing distributed and federated
DB solutions
How can database technologies be deployed
over the Grid to achieve high-performance
parallel query execution over Grid resources?

4
select p.proteinId, Blast(p.sequence)
from protein p,
proteinTerm t
where t.termId = S92 and
p.proteinId = t.proteinId

5
Mutual Benefit

The Grid needs DQP
Declarative, high-level resource integration with
implicit parallelism.
DQP-based solutions should in principle run
faster than those manually coded.

DQP needs the Grid
Requires systematic access to remote data and
computational resources.
Dynamic resource discovery and allocation

6
Challenges

Grid is a volatile environment
Mapping queries onto resources
Inadequate/inaccurate information about data at
compile time
Inadequate/inaccurate information about Grid
resources at compile time

7
Objectives

Develop a query processor that can
Monitor the query plan and the execution
environment
Assess the monitored information
React, if necessary
Speed-up queries through parallelism
Develop scheduling algorithms for the Grid

8
The Polar Query Processor
9
The Polar Compiler
10
A Service-Based Model

High level of cohesion, low levels of coupling
Built upon OGSA and OGSA-DAI
No need for custom wrappers (to interface the
query engine to data sources) as SB-DQP operates
on generic GDSs
SB-DQP is layered on top of a more flexible
execution environment so it can provide
partitioned parallelism
The interaction semantics of the constituent
components of SB-DQP are governed by standard,
uniform protocols and mechanisms (XML, WSDL,
SOAP, OGSA)

11
Presentation Outline

Project Overview and Presentation of the Polar
Grid-enabled Distributed Query Processor (DQP)
Grid Issues

12
Potential Goals of a GDQP

Final response time for a query
Initial results
Most accurate results in least time
Minimum resource consumption
Minimum economic cost

13
Constructing a virtual database

Identify relevant databases
Identify relevant tools
Construct a common database schema
Choose other candidate computational nodes

14
Logical Optimisation

Plan is expressed using a logical algebra.
Heuristic-based application of equivalence laws.
Multiple equivalent plans generated.

reduce
  op_call (Blast)
    join (proteinId)
      reduce
        scan (protein)
      reduce
        scan termID=S92 (proteinTerm)
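
As an illustration of applying an equivalence law, the following minimal sketch pushes a selection below a join when its predicate refers to one input only. The Op class, the operator names and the push_select function are assumptions made for this example, not the Polar logical algebra.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Op:
    kind: str                       # "scan", "select", "join", ...
    arg: str = ""                   # relation name, predicate or join attribute
    inputs: Tuple["Op", ...] = ()   # child operators
    uses: frozenset = frozenset()   # relations a select predicate refers to


def relations(op):
    """Return the names of the base relations underneath an operator."""
    if op.kind == "scan":
        return frozenset([op.arg])
    found = frozenset()
    for child in op.inputs:
        found |= relations(child)
    return found


def push_select(op):
    """Rewrite select(join(l, r)) into join(select(l), r) or join(l, select(r))
    when the predicate refers to one side of the join only."""
    op = Op(op.kind, op.arg, tuple(push_select(c) for c in op.inputs), op.uses)
    if op.kind == "select" and op.inputs[0].kind == "join":
        join = op.inputs[0]
        left, right = join.inputs
        if op.uses <= relations(left):
            pushed = push_select(Op("select", op.arg, (left,), op.uses))
            return Op("join", join.arg, (pushed, right))
        if op.uses <= relations(right):
            pushed = push_select(Op("select", op.arg, (right,), op.uses))
            return Op("join", join.arg, (left, pushed))
    return op


def render(op, depth=0):
    """Return the plan as indented lines, root first."""
    lines = ["  " * depth + (op.kind + " " + op.arg).strip()]
    for child in op.inputs:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical starting plan: the selection sits above the join.
plan = Op("select", "termId = S92", (
    Op("join", "proteinId", (Op("scan", "protein"), Op("scan", "proteinTerm"))),
), frozenset(["proteinTerm"]))

print("\n".join(render(push_select(plan))))
```

Both plans return the same tuples; the rewritten one filters proteinTerm before the join, so the join receives fewer tuples.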
15
Physical Optimisation

Plan is expressed using a physical algebra.
Logical operators replaced with physical
operators.
Cost-based ranking of plans.

reduce
  op_call (Blast)
    hash_join (proteinId)
      reduce
        table_scan (protein)
      reduce
        table_scan termID=S92 (proteinTerm)
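
Cost-based ranking can be sketched as follows: each logical operator has several candidate physical operators, each with an estimated cost, and the candidates are sorted by that estimate. The two cost formulas below are deliberately simplistic assumptions for illustration and are not the formulas used by Polar.

```python
def nested_loop_join_cost(left_rows, right_rows):
    # Toy formula: every left tuple is compared with every right tuple.
    return left_rows * right_rows


def hash_join_cost(left_rows, right_rows):
    # Toy formula: one pass to build the hash table, one pass to probe it.
    return left_rows + right_rows


# Hypothetical mapping from a logical operator to its physical alternatives.
PHYSICAL_CHOICES = {
    "join": {
        "nested_loop_join": nested_loop_join_cost,
        "hash_join": hash_join_cost,
    },
}


def rank_join_plans(left_rows, right_rows):
    """Return (physical operator, estimated cost) pairs, cheapest first."""
    costs = {
        name: formula(left_rows, right_rows)
        for name, formula in PHYSICAL_CHOICES["join"].items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


for name, cost in rank_join_plans(1000, 50):
    print(name, cost)
```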
16
Partitioning

Plan is expressed in a parallel algebra.
Parallel algebra = physical algebra + exchange.
Exchange operators are placed where data movement
is required.

reduce
  op_call (Blast)
    exchange
      hash_join (proteinId)
        exchange
          reduce
            table_scan (protein)
        exchange
          reduce
            table_scan termID=S92 (proteinTerm)
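
One common way for an exchange to move data is hash partitioning on the join attribute, so that matching tuples from both inputs reach the same consumer. The sketch below assumes that policy; the function names and the use of CRC32 as the hash are choices made for this example, not details taken from Polar.

```python
import zlib


def partition_of(key, n_consumers):
    """Map a key to a consumer index with a hash that is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % n_consumers


def exchange(tuples, key_field, n_consumers):
    """Route each tuple to the consumer chosen by hashing its key field."""
    buckets = [[] for _ in range(n_consumers)]
    for t in tuples:
        buckets[partition_of(t[key_field], n_consumers)].append(t)
    return buckets


# Hypothetical sample data for the two join inputs.
proteins = [{"proteinId": i, "sequence": "SEQ%d" % i} for i in range(20)]
terms = [{"proteinId": i, "termId": "S92"} for i in range(0, 20, 2)]

protein_parts = exchange(proteins, "proteinId", 3)
term_parts = exchange(terms, "proteinId", 3)
print([len(p) for p in protein_parts], [len(p) for p in term_parts])
```

Because both inputs are partitioned with the same function, each hash_join instance can join its own pair of buckets without seeing the others.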
17
The Exchange Operator
From Mehul A. Shah, Joseph M. Hellerstein, Sirish
Chandrasekaran and Michael J. Franklin, "Flux: An
Adaptive Partitioning Operator for Continuous
Query Systems", to appear in ICDE, March 2003
18
Scheduling

Partitions are allocated to Grid nodes.
Expressed by decorating parallel algebra
expression.
Heuristic algorithm considers memory use, network
costs.
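
A heuristic of this kind could look like the greedy sketch below: partitions are placed largest first on a node with enough free memory, preferring the node that receives the partition's input fastest. The field names, the ordering rule and the tie-breaking are assumptions for illustration, not the Polar scheduling algorithm.

```python
def schedule(partitions, nodes):
    """Greedily allocate plan partitions to nodes.

    partitions: list of dicts with "name", "memory_mb" and "input_mb".
    nodes: dict mapping node name to {"free_memory_mb", "mb_per_sec"}.
    """
    free = {name: info["free_memory_mb"] for name, info in nodes.items()}
    allocation = {}
    # Place the most memory-hungry partitions first.
    for part in sorted(partitions, key=lambda p: p["memory_mb"], reverse=True):
        candidates = [n for n in nodes if free[n] >= part["memory_mb"]]
        if not candidates:
            raise ValueError("no node has enough memory for " + part["name"])
        # Network cost: time to ship the partition's input to the node.
        best = min(
            candidates,
            key=lambda n: (part["input_mb"] / nodes[n]["mb_per_sec"], n),
        )
        allocation[part["name"]] = best
        free[best] -= part["memory_mb"]
    return allocation


# Hypothetical nodes and partitions.
nodes = {
    "A": {"free_memory_mb": 512, "mb_per_sec": 1.0},
    "B": {"free_memory_mb": 256, "mb_per_sec": 2.0},
}
partitions = [
    {"name": "hash_join", "memory_mb": 300, "input_mb": 10},
    {"name": "op_call", "memory_mb": 200, "input_mb": 5},
]
print(schedule(partitions, nodes))
```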

19
Parallelism
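
[Figure: the parallel plan (reduce, op_call (Blast), exchange, hash_join (proteinId), exchange, exchange, reduce, reduce, table_scan (protein), table_scan termID=S92 (proteinTerm)) annotated with Grid node numbers (2,3; 3,6; 4,5; 6). Labels in the figure: "Partitioned parallelism" and "Pipeline parallelism".]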
20
Perform the resource allocation (1)

We have
Formulas to estimate the time cost of each
physical database operator for a specific system
configuration
We need
The number and the size of the tuples each
operator receives and produces
Detailed information about the system
configuration
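
To show how tuple counts, tuple sizes and the system configuration combine into a time estimate, here is a minimal sketch with two very simple formulas. The NodeConfig fields, the cycles-per-tuple parameter and the formulas themselves are assumptions for illustration; the actual Polar cost formulas are not given in the slides.

```python
from dataclasses import dataclass


@dataclass
class NodeConfig:
    cpu_mhz: float           # nominal CPU speed
    cpu_load_percent: float  # share of the CPU already in use
    net_mb_per_sec: float    # connection speed to the consumer node


def cpu_seconds(n_tuples, cycles_per_tuple, node):
    """Estimated CPU time for an operator that processes n_tuples."""
    available_hz = node.cpu_mhz * 1e6 * (1 - node.cpu_load_percent / 100)
    return n_tuples * cycles_per_tuple / available_hz


def transfer_seconds(n_tuples, tuple_bytes, node):
    """Estimated time to ship an operator's output over the network."""
    return n_tuples * tuple_bytes / (node.net_mb_per_sec * 1e6)


# Hypothetical node: 1400 MHz, 10% loaded, 1.0 MB/s connection.
node = NodeConfig(cpu_mhz=1400, cpu_load_percent=10, net_mb_per_sec=1.0)
print(cpu_seconds(1_000_000, 1260, node))
print(transfer_seconds(1_000_000, 100, node))
```

Both estimates depend directly on the number and size of the tuples, which is why inaccurate cardinality information at compile time leads to poor allocations.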

21
Perform the resource allocation (2)

Choosing the computational nodes
How many?
For which part(s) of the query?
Which?
This is an NP-hard problem
Scalability
Production of good plans in most cases

22
A minimum set of resource metadata

Available CPU
Available memory
I/O speed
Network speed for the connections to the other
participating machines
Databases
Applications

23
Some initial results
24
Adaptivity
25
Monitoring the physical operators

What can be measured?
What can be derived from the measurements?
What is useful?

26
What to monitor?

New machines
Updated resource metadata for existing machines
Operator Cost
Client input (tradeoff between update rate and
accuracy, priorities)
Data arrival rates
Workload
Statistical information like sizes of
intermediate results, number and frequencies of
attribute values, availability of indices,
selectivity of operators
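
As a small example of turning a monitored statistic into a decision, the sketch below derives the observed selectivity of an operator from two counters and flags a large deviation from the compile-time estimate. The function names and the 50% tolerance are assumptions chosen for illustration.

```python
def observed_selectivity(tuples_in, tuples_out):
    """Fraction of input tuples that the operator has let through so far."""
    return tuples_out / tuples_in if tuples_in else 0.0


def should_react(estimated, observed, tolerance=0.5):
    """True when the relative error of the estimate exceeds the tolerance."""
    if estimated == 0:
        return observed > 0
    return abs(observed - estimated) / estimated > tolerance


# Hypothetical counters collected while the operator runs.
selectivity = observed_selectivity(tuples_in=1000, tuples_out=50)
print(selectivity, should_react(estimated=0.01, observed=selectivity))
```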

27
Summary

Performance Definition
Construction of resources pool
Communication mechanism
Resource scheduling
Adaptivity
monitoring

28
The Polar Team

Manchester
Norman Paton
Alvaro Fernandes
Rizos Sakellariou
Nedim Alpdemir
Anastasios Gounaris

Newcastle
Paul Watson
Arijit Mukherjee
Jim Smith

http://www.ncl.ac.uk/polarstar/index.htm
29
GDS interactions
30
Interactions of DQP components
31
<GDQDataSourceList>
  <importedDataSource>
    <GDSFactoryHandle>http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactoryP2R1</GDSFactoryHandle>
    <GDSFactoryScript>
      <GDSFSHeader>
        <GDSSScriptName>create1</GDSSScriptName>
        <GDSSVersion>
          <GDSSConfig>factoryconfig</GDSSConfig>
          <GDSSScriptEnvironment>environment</GDSSScriptEnvironment>
        </GDSSVersion>
        <GDSSOriginator>Originator</GDSSOriginator>
      </GDSFSHeader>
      <GDSFSBody>
        <GDSFSCreateGDSWithNamedConfig>getschemaconfig</GDSFSCreateGDSWithNamedConfig>
      </GDSFSBody>
    </GDSFactoryScript>
  </importedDataSource>
  <importedService>
    <serviceWSDLURL>http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL</serviceWSDLURL>
  </importedService>
</GDQDataSourceList>
32
Importing Resource Metadata
33
<GridNodeInfo hostsDataSource="1" hostsService="0" hasEvaluatorFactory="1">
  <nodeID>mach1.cs.man.ac.uk</nodeID>
  <CPUSpeedMHz>1400</CPUSpeedMHz>
  <CPULoadPercentage>10</CPULoadPercentage>
  <connectionSpeedMBperSec>1.0</connectionSpeedMBperSec>
  <hostedDataSource GDSFactoryHandle="http://rpc53.cs.man.ac.uk/ogsa/services/GDSFactory" GDSInstanceHandle="http://rpc53.cs.man.ac.uk/ogsa/services/GDSFactory/inst1"/>
  <evalutorFactory>http://rpc53.cs.man.ac.uk/ogsa/services/EvaluatorFactory</evalutorFactory>
</GridNodeInfo>
<GridNodeInfo hostsDataSource="0" hostsService="1" hasEvaluatorFactory="0">
  <nodeID>mach1.ebi.co.uk</nodeID>
  <CPUSpeedMHz>1000</CPUSpeedMHz>
  <CPULoadPercentage>95</CPULoadPercentage>
  <connectionSpeedMBperSec>2.0</connectionSpeedMBperSec>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:emblfetch</hostedService>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:spoofblast</hostedService>
</GridNodeInfo>
<GridNodeInfo hostsDataSource="1" hostsService="0" hasEvaluatorFactory="1">
  <nodeID>mach1.cs.man.ac.uk</nodeID>
  <CPUSpeedMHz>1400</CPUSpeedMHz>
  <CPULoadPercentage>10</CPULoadPercentage>
  <connectionSpeedMBperSec>1.0</connectionSpeedMBperSec>
  <hostedDataSource GDSFactoryHandle="http://mach1.cs.man.ac.uk/ogsa/services/GDSFactory" GDSInstanceHandle="http://mach1.cs.man.ac.uk/ogsa/services/GDSFactory/inst1"/>
  <evalutorFactory>http://rpc53.cs.man.ac.uk/ogsa/services/EvaluatorFactory</evalutorFactory>
</GridNodeInfo>
<GridNodeInfo hostsDataSource="0" hostsService="1" hasEvaluatorFactory="0">
  <nodeID>mach1.ebi.co.uk</nodeID>
  <CPUSpeedMHz>1000</CPUSpeedMHz>
  <CPULoadPercentage>95</CPULoadPercentage>
  <connectionSpeedMBperSec>2.0</connectionSpeedMBperSec>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:emblfetch</hostedService>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:spoofblast</hostedService>
</GridNodeInfo>
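
Resource metadata in this form can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened sample that reuses the element and attribute names from the slide; the parse_node function and the keys of the returned dictionary are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample using the element names shown on the slide.
SAMPLE = """<GridNodeInfo hostsDataSource="1" hostsService="0" hasEvaluatorFactory="1">
  <nodeID>mach1.cs.man.ac.uk</nodeID>
  <CPUSpeedMHz>1400</CPUSpeedMHz>
  <CPULoadPercentage>10</CPULoadPercentage>
  <connectionSpeedMBperSec>1.0</connectionSpeedMBperSec>
</GridNodeInfo>"""


def parse_node(xml_text):
    """Extract the scheduling-relevant fields of one GridNodeInfo element."""
    root = ET.fromstring(xml_text)
    return {
        "node_id": root.findtext("nodeID").strip(),
        "hosts_data_source": root.get("hostsDataSource") == "1",
        "hosts_service": root.get("hostsService") == "1",
        "has_evaluator_factory": root.get("hasEvaluatorFactory") == "1",
        "cpu_mhz": float(root.findtext("CPUSpeedMHz")),
        "cpu_load_percent": float(root.findtext("CPULoadPercentage")),
        "net_mb_per_sec": float(root.findtext("connectionSpeedMBperSec")),
    }


print(parse_node(SAMPLE))
```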
34
An example of a Query Document
<GridDataServiceScript>
  <Header>
    <ScriptName>Example 1</ScriptName>
    <Version>
      <Config>config</Config>
      <ScriptEnvironment>environment</ScriptEnvironment>
    </Version>
    <Originator>GSH of Originator</Originator>
  </Header>
  <Body>
    <Statement name="xyz" dataResource="MyDataResource">
      select p.proteinId, blast(p.sequence)
      from p in protein, t in proteinTerm
      where t.termId='8372' and p.proteinId=t.proteinId
    </Statement>
    <Delivery name="delivery">
      <Mechanism type="bulk"/>
      <Mode type="full"/>
      <From>xyz</From>
      <To>response</To>
    </Delivery>
    <Execute name="execute">xyz</Execute>
  </Body>
</GridDataServiceScript>
35
Evaluation

All algebraic operators are implemented using
the iterator model.
The iterator model supports three standard
operations
open()
next()
close()

Remote data sources and computational resources
are accessed through iterator-based wrappers.
The iterator model supports partitioned and
pipelined parallelism.
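
The open()/next()/close() protocol can be sketched as follows. Each operator pulls tuples from its child one at a time, which is what makes pipelining possible. The TableScan and Select classes, the run helper and the use of None as the end-of-stream marker are assumptions for this example, not the Polar implementation.

```python
class TableScan:
    """Leaf operator: iterates over an in-memory list of tuples."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None  # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Select:
    """Passes on only the tuples of its child that satisfy a predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan through the open/next/close protocol and collect its output."""
    plan.open()
    results = []
    row = plan.next()
    while row is not None:
        results.append(row)
        row = plan.next()
    plan.close()
    return results


# Hypothetical sample data.
terms = [{"proteinId": 1, "termId": "S92"}, {"proteinId": 2, "termId": "S10"}]
print(run(Select(TableScan(terms), lambda t: t["termId"] == "S92")))
```

A wrapper around a remote data source or service only has to offer the same three methods to be used anywhere in a plan.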

40
Some General Remarks (1)

There are three types of monitored information:
counters, timings, and lengths of character
strings.
The overhead of these three types does not
depend on the type of the query operator. The
cost of a counter and of computing a timing is
constant for a given system, whereas the cost of
measuring the size of a string depends on its
size.

41
Some General Remarks (2)

The cost of a counter is negligible for all the
operators examined. However, this is not true for
timings and string lengths.
The cost of predicting the output cardinality is
lower than 1% for all operators examined except
project, for which it is 6.78%.
The cost of predicting the final response time is
lower than 10% for all operators except project,
even if the time cost of each tuple is measured
separately. If the time cost is measured at a
frequency lower than 10% (i.e., one in ten tuples
is timed), the cost becomes lower than 1% for
these operators.
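
Timing only a sample of the tuples, as described above, can be sketched like this: every tuple is counted, but only one in every timing_every tuples is timed, and the per-tuple mean is scaled up to estimate the total. The monitor function, its parameters and the returned keys are assumptions for illustration.

```python
import time


def monitor(process, tuples, timing_every=10):
    """Count every tuple, but time only one in every `timing_every` tuples."""
    count = 0
    timed = 0
    timed_seconds = 0.0
    for t in tuples:
        count += 1
        if count % timing_every == 0:
            start = time.perf_counter()
            process(t)
            timed_seconds += time.perf_counter() - start
            timed += 1
        else:
            process(t)
    mean = timed_seconds / timed if timed else 0.0
    return {
        "tuples": count,
        "timed": timed,
        # Extrapolate from the timed sample to all tuples seen so far.
        "estimated_total_seconds": mean * count,
    }


stats = monitor(lambda t: t * 2, range(100), timing_every=10)
print(stats["tuples"], stats["timed"])
```

Setting timing_every to 1 times every tuple, which gives the most precise estimate at the highest monitoring cost.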