Grid Performance Issues in the Design of a Grid-Enabled Query Processor

1
Grid Performance Issues in the Design of a
Grid-Enabled Query Processor
  • APART workshop
  • Larnaca, March 27-28, 2003
  • Tasos Gounaris

2
Presentation Outline
  • Project Overview and Presentation of the Polar
    Grid enabled Distributed Query Processor (DQP)
  • Grid Issues

3
Motivation
  • Massive growth in database size
  • Emergence of the Grid
  • Limitations of existing distributed and federated
    DB solutions
  • How can database technologies be deployed
    over the Grid to achieve high-performance
    parallel query execution over Grid resources?

4
    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = S92 and
          p.proteinId = t.proteinId

5
Mutual Benefit
  • The Grid needs DQP
      • Declarative, high-level resource integration with
        implicit parallelism.
      • DQP-based solutions should in principle run
        faster than those manually coded.
  • DQP needs the Grid
      • Requires systematic access to remote data and
        computational resources.
      • Dynamic resource discovery and allocation

6
Challenges
  • Grid is a volatile environment
  • Mapping queries onto resources
  • Inadequate/inaccurate information about data at
    compile time
  • Inadequate/inaccurate information about Grid
    resources at compile time

7
Objectives
  • Develop a query processor that can:
      • Monitor the query plan and the execution
        environment
      • Assess the monitored information
      • React, if necessary
  • Speed up queries through parallelism
  • Develop scheduling algorithms for the Grid

8
The Polar Query Processor
9
The Polar Compiler
10
A Service-Based Model
  • High level of cohesion, low level of coupling
  • Built upon OGSA and OGSA-DAI
  • No need for custom wrappers (to interface the
    query engine to data sources) as SB-DQP operates
    on generic GDSs
  • SB-DQP is layered on top of a more flexible
    execution environment so it can provide
    partitioned parallelism
  • The interactions among the constituent
    components of SB-DQP are governed by standard,
    uniform protocols and mechanisms (XML, WSDL,
    SOAP, OGSA)

11
Presentation Outline
  • Project Overview and Presentation of the Polar
    Grid enabled Distributed Query Processor (DQP)
  • Grid Issues

12
Potential Goals of a GDQP
  • Final response time for a query
  • Initial results
  • Most accurate results in least time
  • Minimum resource consumption
  • Minimum economic cost

13
Constructing a virtual database
  • Identify relevant databases
  • Identify relevant tools
  • Construct a common database schema
  • Choose other candidate computational nodes

14
Logical Optimisation
  • Plan is expressed using a logical algebra.
  • Heuristic-based application of equivalence laws.
  • Multiple equivalent plans generated.

reduce
  op_call (Blast)
    join (proteinId)
      reduce
        scan (protein)
      reduce
        scan termID=S92 (proteinTerm)
15
Physical Optimisation
  • Plan is expressed using a physical algebra.
  • Logical operators replaced with physical
    operators.
  • Cost-based ranking of plans.

reduce
  op_call (Blast)
    hash_join (proteinId)
      reduce
        table_scan (protein)
      reduce
        table_scan termID=S92 (proteinTerm)
16
Partitioning
  • Plan is expressed in a parallel algebra.
  • Parallel algebra = physical algebra + exchange.
  • Exchange operators are placed where data movement
    is required.

reduce
  op_call (Blast)
    exchange
      hash_join (proteinId)
        exchange
          reduce
            table_scan (protein)
        exchange
          reduce
            table_scan termID=S92 (proteinTerm)
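The placement rule above can be sketched in a few lines. This is a minimal illustration, not the Polar implementation: the `Op` class, the `site` label and the site names "A" to "D" are all hypothetical, and the rule used is simply "insert an exchange wherever a child runs at a different site from its parent".

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Op:
    name: str                      # e.g. "hash_join (proteinId)"
    site: str                      # hypothetical label for where the operator runs
    children: List["Op"] = field(default_factory=list)

def insert_exchanges(op: Op) -> Op:
    """Return a copy of the plan with an exchange above every child
    whose site differs from its parent's site."""
    new_children = []
    for child in op.children:
        child = insert_exchanges(child)
        if child.site != op.site:
            # Data must move between sites, so wrap the child in an exchange.
            child = Op("exchange", child.site, [child])
        new_children.append(child)
    return Op(op.name, op.site, new_children)

def names(op: Op) -> List[str]:
    """Pre-order list of operator names, for inspection."""
    out = [op.name]
    for c in op.children:
        out.extend(names(c))
    return out

# The physical plan of the running example, with made-up site labels.
plan = Op("reduce", "A", [
    Op("op_call (Blast)", "A", [
        Op("hash_join (proteinId)", "B", [
            Op("reduce", "C", [Op("table_scan (protein)", "C")]),
            Op("reduce", "D", [Op("table_scan (proteinTerm)", "D")]),
        ])
    ])
])
parallel_plan = insert_exchanges(plan)
```

With these site labels, `names(parallel_plan)` contains three exchange operators: one above the join and one above each of its inputs.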
17
The Exchange Operator
From: Mehul A. Shah, Joseph M. Hellerstein, Sirish
Chandrasekaran and Michael J. Franklin, "Flux: An
Adaptive Partitioning Operator for Continuous
Query Systems", to appear in ICDE, March 2003
18
Scheduling
  • Partitions are allocated to Grid nodes.
  • Expressed by decorating parallel algebra
    expression.
  • Heuristic algorithm considers memory use, network
    costs.

19
Parallelism
[Figure: the parallel plan from the Partitioning slide
(table_scan, reduce, exchange, hash_join, op_call (Blast),
reduce), annotated with the Grid nodes allocated to its
partitions (labels 2,3; 3,6; 4; 5; 4,5; 6) and with the
regions of partitioned parallelism and pipeline
parallelism marked.]
20
Perform the resource allocation (1)
  • We have:
      • Formulas to estimate the time cost of each
        physical database operator for a specific system
        configuration
  • We need:
      • The number and the size of the tuples each
        operator receives and produces
      • Detailed information about the system
        configuration
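To make the inputs concrete, here is a minimal sketch of what such a cost formula might look like. The formula itself, the `NodeConfig` fields and the `cycles_per_tuple` and `selectivity` parameters are assumptions made for illustration; the slides do not give the actual formulas used by Polar.

```python
from dataclasses import dataclass

@dataclass
class NodeConfig:
    cpu_mhz: float          # nominal CPU speed
    cpu_load: float         # fraction of the CPU already in use, 0.0 to 1.0
    net_mb_per_sec: float   # connection speed to the consuming node

def estimate_seconds(tuples_in: int, tuple_bytes: int,
                     cycles_per_tuple: float, selectivity: float,
                     node: NodeConfig) -> float:
    """Hypothetical cost formula: CPU time to process the input
    plus network time to ship the output."""
    available_hz = node.cpu_mhz * 1e6 * (1.0 - node.cpu_load)
    cpu_time = tuples_in * cycles_per_tuple / available_hz
    bytes_out = tuples_in * selectivity * tuple_bytes
    net_time = bytes_out / (node.net_mb_per_sec * 1e6)
    return cpu_time + net_time

node = NodeConfig(cpu_mhz=1400, cpu_load=0.10, net_mb_per_sec=1.0)
cost = estimate_seconds(tuples_in=1_000_000, tuple_bytes=100,
                        cycles_per_tuple=1260, selectivity=0.5, node=node)
```

The point of the sketch is the dependency: without the tuple counts and sizes, and without the node's CPU and network figures, the formula cannot be evaluated.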

21
Perform the resource allocation (2)
  • Choosing the computational nodes
      • How many?
      • For which part(s) of the query?
      • Which?
  • This is an NP-hard problem
  • Scalability
  • Production of good plans in most cases

22
A minimum set of resource metadata
  • Available CPU
  • Available memory
  • I/O speed
  • Network speed for the connections to the other
    participating machines
  • Databases
  • Applications

23
Some initial results
24
Adaptivity
25
Monitoring the physical operators
  • What can be measured?
  • What can be derived from the measurements?
  • What is useful?

26
What to monitor?
  • New machines
  • Updated resource metadata for existing machines
  • Operator Cost
  • Client input (tradeoff between update rate and
    accuracy, priorities)
  • Data arrival rates
  • Workload
  • Statistical information like sizes of
    intermediate results, number and frequencies of
    attribute values, availability of indices,
    selectivity of operators

27
Summary
  • Performance Definition
  • Construction of resources pool
  • Communication mechanism
  • Resource scheduling
  • Adaptivity
  • monitoring

28
The Polar Team
  • Manchester
      • Norman Paton
      • Alvaro Fernandes
      • Rizos Sakellariou
      • Nedim Alpdemir
      • Anastasios Gounaris
  • Newcastle
      • Paul Watson
      • Arijit Mukherjee
      • Jim Smith

http://www.ncl.ac.uk/polarstar/index.htm
29
GDS interactions
30
Interactions of DQP components
31
<GDQDataSourceList>
  <importedDataSource>
    <GDSFactoryHandle>http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactoryP2R1</GDSFactoryHandle>
    <GDSFactoryScript>
      <GDSFSHeader>
        <GDSSScriptName>create1</GDSSScriptName>
        <GDSSVersion>
          <GDSSConfig>factoryconfig</GDSSConfig>
          <GDSSScriptEnvironment>environment</GDSSScriptEnvironment>
        </GDSSVersion>
        <GDSSOriginator>Originator</GDSSOriginator>
      </GDSFSHeader>
      <GDSFSBody>
        <GDSFSCreateGDSWithNamedConfig>getschemaconfig</GDSFSCreateGDSWithNamedConfig>
      </GDSFSBody>
    </GDSFactoryScript>
  </importedDataSource>
  <importedService>
    <serviceWSDLURL>http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL</serviceWSDLURL>
  </importedService>
</GDQDataSourceList>
32
Importing Resource Metadata
33
<GridNodeInfo hostsDataSource="1" hostsService="0" hasEvaluatorFactory="1">
  <nodeID>mach1.cs.man.ac.uk</nodeID>
  <CPUSpeedMHz>1400</CPUSpeedMHz>
  <CPULoadPercentage>10</CPULoadPercentage>
  <connectionSpeedMBperSec>1.0</connectionSpeedMBperSec>
  <hostedDataSource
      GDSFactoryHandle="http://rpc53.cs.man.ac.uk/ogsa/services/GDSFactory"
      GDSInstanceHandle="http://rpc53.cs.man.ac.uk/ogsa/services/GDSFactory/inst1"/>
  <evalutorFactory>http://rpc53.cs.man.ac.uk/ogsa/services/EvaluatorFactory</evalutorFactory>
</GridNodeInfo>
<GridNodeInfo hostsDataSource="0" hostsService="1" hasEvaluatorFactory="0">
  <nodeID>mach1.ebi.co.uk</nodeID>
  <CPUSpeedMHz>1000</CPUSpeedMHz>
  <CPULoadPercentage>95</CPULoadPercentage>
  <connectionSpeedMBperSec>2.0</connectionSpeedMBperSec>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:emblfetch</hostedService>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:spoofblast</hostedService>
</GridNodeInfo>
<GridNodeInfo hostsDataSource="1" hostsService="0" hasEvaluatorFactory="1">
  <nodeID>mach1.cs.man.ac.uk</nodeID>
  <CPUSpeedMHz>1400</CPUSpeedMHz>
  <CPULoadPercentage>10</CPULoadPercentage>
  <connectionSpeedMBperSec>1.0</connectionSpeedMBperSec>
  <hostedDataSource
      GDSFactoryHandle="http://mach1.cs.man.ac.uk/ogsa/services/GDSFactory"
      GDSInstanceHandle="http://mach1.cs.man.ac.uk/ogsa/services/GDSFactory/inst1"/>
  <evalutorFactory>http://rpc53.cs.man.ac.uk/ogsa/services/EvaluatorFactory</evalutorFactory>
</GridNodeInfo>
<GridNodeInfo hostsDataSource="0" hostsService="1" hasEvaluatorFactory="0">
  <nodeID>mach1.ebi.co.uk</nodeID>
  <CPUSpeedMHz>1000</CPUSpeedMHz>
  <CPULoadPercentage>95</CPULoadPercentage>
  <connectionSpeedMBperSec>2.0</connectionSpeedMBperSec>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:emblfetch</hostedService>
  <hostedService>http://www.ebi.ac.uk/collab/mygrid/service0/axis/servlet/AxisServlet/urn:spoofblast</hostedService>
</GridNodeInfo>
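As an illustration of how a scheduler might consume this metadata, the sketch below reads GridNodeInfo elements with Python's standard `xml.etree.ElementTree`. The `<nodes>` wrapper element is an assumption added only so that the fragment parses as a single document, and the derived `available_mhz` figure is a hypothetical convenience, not something the slides define.

```python
import xml.etree.ElementTree as ET

# <nodes> is a made-up wrapper so the two elements form one XML document.
doc = """
<nodes>
  <GridNodeInfo hostsDataSource="1" hostsService="0" hasEvaluatorFactory="1">
    <nodeID>mach1.cs.man.ac.uk</nodeID>
    <CPUSpeedMHz>1400</CPUSpeedMHz>
    <CPULoadPercentage>10</CPULoadPercentage>
    <connectionSpeedMBperSec>1.0</connectionSpeedMBperSec>
  </GridNodeInfo>
  <GridNodeInfo hostsDataSource="0" hostsService="1" hasEvaluatorFactory="0">
    <nodeID>mach1.ebi.co.uk</nodeID>
    <CPUSpeedMHz>1000</CPUSpeedMHz>
    <CPULoadPercentage>95</CPULoadPercentage>
    <connectionSpeedMBperSec>2.0</connectionSpeedMBperSec>
  </GridNodeInfo>
</nodes>
"""

def load_nodes(xml_text):
    """Turn each GridNodeInfo element into a plain dictionary."""
    nodes = []
    for el in ET.fromstring(xml_text).iter("GridNodeInfo"):
        speed = float(el.findtext("CPUSpeedMHz"))
        load = float(el.findtext("CPULoadPercentage"))
        nodes.append({
            "id": el.findtext("nodeID").strip(),
            # Hypothetical derived figure: CPU speed scaled by the idle share.
            "available_mhz": speed * (1 - load / 100),
            "net_mb_per_sec": float(el.findtext("connectionSpeedMBperSec")),
            "can_evaluate": el.get("hasEvaluatorFactory") == "1",
        })
    return nodes

nodes = load_nodes(doc)
# Only nodes with an evaluator factory can host query plan partitions.
evaluators = [n["id"] for n in nodes if n["can_evaluate"]]
```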
34
An example of a Query Document
<GridDataServiceScript>
  <Header>
    <ScriptName>Example 1</ScriptName>
    <Version>
      <Config>config</Config>
      <ScriptEnvironment>environment</ScriptEnvironment>
    </Version>
    <Originator>GSH of Originator</Originator>
  </Header>
  <Body>
    <Statement name="xyz" dataResource="MyDataResource">
      select p.proteinId, blast(p.sequence)
      from p in protein, t in proteinTerm
      where t.termId='8372' and p.proteinId=t.proteinId
    </Statement>
    <Delivery name="delivery">
      <Mechanism type="bulk"/>
      <Mode type="full"/>
      <From>xyz</From>
      <To>response</To>
    </Delivery>
    <Execute name="execute">xyz</Execute>
  </Body>
</GridDataServiceScript>
35
Evaluation
  • All algebraic operators are implemented using
    the iterator model.
  • The iterator model supports three standard
    operations:
      • open()
      • next()
      • close()
  • Remote data sources and computational resources
    are accessed through iterator-based wrappers.
  • The iterator model supports partitioned and
    pipelined parallelism.
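The open()/next()/close() protocol can be illustrated with a small sketch. The classes below are hypothetical, in-memory lists stand in for remote data sources, and returning `None` to signal end of stream is an assumption of this example rather than a detail taken from Polar.

```python
class TableScan:
    """Leaf operator: iterates over an in-memory list standing in for a table."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None            # None signals end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        self.pos = None

class HashJoin:
    """Builds a hash table on the left input at open(), probes with the right."""
    def __init__(self, left, right, key):
        self.left, self.right, self.key = left, right, key
    def open(self):
        self.left.open()
        self.table = {}
        while (row := self.left.next()) is not None:
            self.table.setdefault(row[self.key], []).append(row)
        self.left.close()
        self.right.open()
        self.pending = []
    def next(self):
        # Pull probe tuples until one produces matches or the input ends.
        while not self.pending:
            probe = self.right.next()
            if probe is None:
                return None
            matches = self.table.get(probe[self.key], [])
            self.pending = [{**m, **probe} for m in matches]
        return self.pending.pop(0)
    def close(self):
        self.right.close()
        self.table = None

def run(op):
    """Drive any operator tree through the three-call protocol."""
    op.open()
    out = []
    while (row := op.next()) is not None:
        out.append(row)
    op.close()
    return out

protein = TableScan([{"proteinId": 1, "sequence": "MKV"},
                     {"proteinId": 2, "sequence": "GAT"}])
protein_term = TableScan([{"proteinId": 2, "termId": "S92"}])
result = run(HashJoin(protein, protein_term, "proteinId"))
```

Because every operator exposes the same three calls, a parent never needs to know whether its child is a local scan, a join, or a wrapper around a remote source, which is what makes pipelining between operators straightforward.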

40
Some General Remarks (1)
  • There are three types of monitored information:
    counters, timings, and lengths of character
    strings.
  • The overhead of these three types does not
    depend on the type of the query operator. The
    cost of updating a counter or computing a timing
    is constant for a given system, whereas the cost
    of measuring the length of a string depends on
    its length.

41
Some General Remarks (2)
  • The cost of a counter is negligible for all the
    operators examined. However, this is not true for
    timings and string lengths.
  • The cost of predicting the output cardinality is
    lower than 1% for all operators examined except
    project, for which it is 6.78%.
  • The cost of predicting the final response time is
    lower than 10% for all operators except project,
    even if the time cost of each tuple is measured
    separately. If the time cost is sampled at a
    frequency of 10% or lower (i.e., at most one in
    ten tuples is timed), the cost becomes lower than
    1% for these operators.
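The sampling idea in the last remark can be sketched as follows. The `Monitor` class and its method names are hypothetical; the sketch only shows the mechanism of counting every tuple while timing one in every ten, and makes no claim about the overhead figures above.

```python
import time

class Monitor:
    """Counts every tuple but times only one in every `sample_every` tuples."""
    def __init__(self, sample_every=10):
        self.sample_every = sample_every
        self.tuples = 0          # counter: cheap, updated for every tuple
        self.timed = 0           # number of tuples that were actually timed
        self.timed_seconds = 0.0

    def process(self, func, row):
        self.tuples += 1
        if self.tuples % self.sample_every == 0:
            # Sampled tuple: pay for two clock reads around the call.
            start = time.perf_counter()
            result = func(row)
            self.timed_seconds += time.perf_counter() - start
            self.timed += 1
            return result
        return func(row)

    def estimated_total_seconds(self, expected_tuples):
        """Extrapolate the mean sampled per-tuple time to the whole input."""
        if self.timed == 0:
            return None
        return self.timed_seconds / self.timed * expected_tuples

monitor = Monitor(sample_every=10)
outputs = [monitor.process(lambda r: r * 2, r) for r in range(100)]
```

After the loop, `monitor.tuples` is 100 and `monitor.timed` is 10, so the clock is read for only a tenth of the tuples while the counter stays exact.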

42
  • Monitoring Overhead
  • Scheduling

43
Aims & Objectives
  • To answer the questions of how many and which
    computational nodes are employed, and how
  • To be scalable
  • To be clearly separated from the cost model

44
Basic concepts
  • Start from a valid query plan
  • Identify the bottlenecks
  • Increase the partitioned parallelism for these
    operators
  • Exit when there is no improvement, or the
    improvement falls below a threshold
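The loop described above can be sketched as a greedy procedure. Everything concrete here is an assumption made for illustration: the cost model (an operator's work divided by the summed speed of its machines, with the plan costed as its slowest operator), the "fastest unused machine" heuristic, and the work and speed figures. The sketch also ignores contention between operators that share a machine.

```python
def op_cost(op_work, machines, speed):
    """Hypothetical operator cost: work divided by the summed machine speed."""
    return op_work / sum(speed[m] for m in machines)

def plan_cost(work, alloc, speed):
    """Hypothetical plan cost: the response time of the slowest operator."""
    return max(op_cost(work[op], alloc[op], speed) for op in work)

def schedule(work, alloc, speed, threshold=0.05):
    alloc = {op: list(ms) for op, ms in alloc.items()}   # start from a valid plan
    cost = plan_cost(work, alloc, speed)
    while True:
        # Identify the bottleneck: the most expensive operator.
        bottleneck = max(work, key=lambda op: op_cost(work[op], alloc[op], speed))
        spare = [m for m in speed if m not in alloc[bottleneck]]
        if not spare:
            return alloc, cost
        # Heuristic for the next machine: the fastest one the operator lacks.
        candidate = max(spare, key=lambda m: speed[m])
        trial = dict(alloc)
        trial[bottleneck] = alloc[bottleneck] + [candidate]
        new_cost = plan_cost(work, trial, speed)
        # Exit when the relative improvement is at or below the threshold.
        if cost - new_cost <= threshold * cost:
            return alloc, cost
        alloc, cost = trial, new_cost

work = {"table_scan": 100.0, "hash_join": 200.0, "op_call": 1200.0}
speed = {"n1": 1.0, "n2": 1.0, "n3": 2.0, "n4": 1.0}
initial = {"table_scan": ["n1"], "hash_join": ["n2"], "op_call": ["n3"]}
final_alloc, final_cost = schedule(work, initial, speed)
```

In this made-up instance the expensive `op_call` is repeatedly identified as the bottleneck and gains machines until none are left to add. The cost model is only consulted through `op_cost` and `plan_cost`, so it can be swapped without touching the loop.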

45
Some characteristics (1)
  • Initial work focuses on response time but this is
    not an inherent limitation
  • Complexity: O(operators in the query plan ×
    candidate machines)
  • Requires a cost model that assigns a cost to:
      • The query plan
      • Each operator of the query plan

46
Some characteristics (2)
  • Heuristics are used for:
      • Performing the initial processor allocation
      • Identifying the next machine