Distributed Query Processing - PowerPoint PPT Presentation

About This Presentation

Title:

Distributed Query Processing

Description:

Ship tuples in block-wise fashion (batch); smooth burstiness. Query Execution Techniques for DD ... optimizers do not scale for 10,000s of sites ... – PowerPoint PPT presentation

Number of Views:1091

Avg rating:3.0/5.0

Slides: 32

Provided by: donal210

Learn more at: http://web.cs.wpi.edu

Category:

more less

Transcript and Presenter's Notes

Title: Distributed Query Processing

1
Distributed Query Processing

Based on The state of the art in distributed
query processing Donald Kossman (ACM Computing
Surveys, 2000)

2
Motivation

Cost and scalability network of off-shelf
machines
Integration of different software vendors (with
own DBMS)
Integration of legacy systems
Applications inherently distributed, such as
workflow or collaborative-design
State-of-the-art distributed information
technologies (e-businesses)

3
Part 1 Basics

Query Processing Basics
centralized query processing
distributed query processing

4
Problem Statement

Input Query such as Biological objects in
study A referenced in a literature in journal Y.
Output Answer
Objectives
response time, throughput, first answers, little
IO, ...
Centralized vs. Distributed Query Processing
same basic problem
but, more and different parameters, such(data
sites or available machine power) and objectives

5
Steps in Query Processing

Input Declarative Query
SQL, XQuery, ...
Step 1 Translate Query into Algebra
Tree of operators (query plan generation)
Step 2 Optimize Query
Tree of operators (logical) - also select
partitions of table
Tree of operators (physical) also site
annotations
(Compilation)
Step 3 Execution
Interpretation Query result generation

6
Algebra
A.d
SELECT A.d FROM A, B WHERE A.a B.b
AND A.c 35
A.a B.b, A.c 35
X
A
B

relational algebra for SQL very well understood
algebra for XQuery mostly understood

7
Query Optimization
A.d
A.d
A.a B.b, A.c 35
hashjoin
X
B.b
A
B
index A.c
B

logical, e.g., push down cheap predicates
enumerate alternative plans, apply cost model
use search heuristics to find cheapest plan

8
Basic Query Optimization

Classical Dynamic Programming algorithm
Performs join order optimization
Input Join query on n relations
Output Best join order

9
The Dynamic Prog. Algorithm
for i 1 to n do optPlan(Ri)
accessPlans(Ri) prunePlans(optPlan(Ri)) for
i 2 to n do for all S ? R1, R2 Rn
such that S i do optPlan(S) ?
for all O ? S do optPlan(S) optPlan(S) ?
joinPlans(optPlan(O), optPlan(S
O)) prunePlans(optPlan(S)) return
optPlan(R1, R2, Rn)
10
Query Execution
John
A.d
(John, 35, CS)
hashjoin
(CS) (AS)
(John, 35, CS) (Mary, 35, EE)
B.b
(Edinburgh, CS,5.0) (Edinburgh, AS, 6.0)
index A.c
B

library of operators (hash join, merge join, ...)
exploit indexes and clustering in database
pipelining (iterator model)

11
Summary Centralized Queries

Basic SQL (SPJG, nesting) well understood
Very good extensibility
spatial joins, time series, UDF, xquery, etc.
Current problems
Better statistics cost model for optimization
Physical database design expensive complex
Some Trends
interactiveness during execution
approximate answers, top-k
self-tuning capabilities (adaptive robust etc.)

12
Distributed Query Processing Basics

Idea
Extension of centralized query processing.
(System R et al. in 80s)
What is different?
extend physical algebra sendreceive operators
other metrics optimize for response time
resource vectors, network interconnect matrix
caching and replication
less predictability in cost model (adaptive
algos)
heterogeneity in data formats and data models

13
Issues in Distributed Databases

Plan enumeration
The time and space complexity of traditional
dynamic programming algorithm is very large
Iterative Dynamic Programming (heuristic for
large queries)
Cost Models
Classic Cost Model
Response Time Model
Economic Models

14
Distributed Query Plan
A.d
Forms Of Parallelism?
hashjoin
receive
receive
send
send
B.b
index A.c
B
15
Cost Resource Utilization
Total Cost Sum of Cost of Ops Cost 40
1
8
1
6
1
6
2
5
10
16
Another Metric Response Time
Total Cost 40 first tuple 25 last tuple 33
25, 33
Pipelined parallelism
24, 32
0, 7
0, 24
Independent parallelism
0, 6
0, 18
0, 12
first tuple 0 last tuple 10
0, 5
0, 10
17
Query Execution Techniques for Distributed
Databases

Row Blocking
Multi-cast optimization
Multi-threaded execution
Joins with horizontal partitioning
Semi joins
Top n queries

18
Query Execution Techniques for DD

Row Blocking
SEND and RECEIVE operators in query plan to model
communication
Implemented by TCP/IP, UDP, etc.
Ship tuples in block-wise fashion (batch) smooth
burstiness

19
Query Execution Techniques for DD

Multi-cast Optimization
Location of sending/receiving may affect
communication costs forwarding versus
multi-casting
Multi-threaded execution
Several threads for operators at the same site
(intra-query parallelism)
May be useful to enable concurrent reads for
diverse machines (while continuing query
processing)
Must consider if resources warrant concurrent
operator execution (say two sorts each needing
all memory)

20
Query Execution Techniques for DD

Joins with Data (horizontal) partitioning
Hash-based partitioning to conduct joins on
independent partitions
Semi Joins
Reduce communication costs Send only join keys
instead of complete tuples to the site to extract
relevant join partners
Double-pipelined hash joins
Non-blocking join operators to deliver first
results quickly fully exploit pipelined
parallelism, and reduce overall response time
Top n queries
Isloate top n tuples quickly and only perform
other expensive operations (like sort, join, etc)
on those few (use stop operators)

21
Adaptive Algorithms

Deal with unpredictable events at run time
delays in arrival of data, burstiness of network
autonomity of nodes, changes in policies
Example double pipelined hash joins
build hash table for both input streams
read inputs in separate threads
good for bursty arrival of data
Re-optimization at run time (LEO, etc.)
monitor execution of query
adjust estimates of cost model
re-optimize if delta is too large

22
Special Techniques for Client-Server Architectures

Shipping techniques
Query shipping
Data shipping
Hybrid shipping
Query Optimization
Site Selection
Where to optimize
Two Phase Optimization

23
Special Techniques for Federated Database Systems

Wrapper architecture
Query optimization
Query capabilities
Cost estimation
Calibration Approach
Wrapper Cost Model
Parameter Binding

24
Heterogeneity

Use Wrappers to hide heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
connections, cursors, data, ...
Wrappers map queries into local dialect
Wrappers participate in query planning!!!
define the subset of queries that can be handled
give cost information, statistics
capability-based rewriting

25
Summary

Theory well understood
extend traditional (centralized) query processing
add many more details
heterogenity needs manual work and wrappers
Problems in Practice
cost model, statistics
architectures are not fit for adaptivity,
heterogeneity
optimizers do not scale for 10,000s of sites
autonomy of sites systems not built for
asynchronous communication

26
Middleware

Two kinds of middleware
data warehouses
virtual integration
Data Warehouses
good query response times
good materializes results of data cleaning
bad high resource requirements in middleware
bad staleness of data
Virtual Integration
the opposite
caching possible to improve response times

27
Virtual Integration
Query
Middleware (query decomposition, result
composition)
wrapper
wrapper
sub query
sub query
DB1
DB2
28
IBM Data Joiner
SQL Query
Data Joiner
wrapper
wrapper
sub query
sub query
SQL DB1
SQL DB2
29
Adding XML
Query
XML Publishing
Middleware (SQL)
wrapper
wrapper
sub query
sub query
DB1
DB2
30
XML Data Integration
XML Query
Middleware (XML)
XML query
XML query
wrapper
wrapper
DB1
DB2
31
XML Data Integration

Example BEA Liquid Data
Advantage
Availability of XML wrappers for all major
databases
Problems
XML - SQL mapping is very difficult
XML is not always the right language (e.g.,
decision support style queries)

32
Web Services

Idea Encapsulate Data Source
provide WSDL interface to access data
works very well if query pattern is known
Problem Exploit Capability of Source
WSDL limits capabilities of data source
good optimization requires white box
example access by id, access by name, full
scanshould all combinations be listed in WSDL?
Solution WSDL for Query Planning

33
Summary