Distributed Query Processing - PowerPoint PPT Presentation

About This Presentation
Title:

Distributed Query Processing

Description:

Ship tuples in block-wise fashion (batch); smooth burstiness. Query Execution Techniques for DD ... optimizers do not scale for 10,000s of sites ... – PowerPoint PPT presentation

Number of Views:1091
Avg rating:3.0/5.0
Slides: 32
Provided by: donal210
Learn more at: http://web.cs.wpi.edu
Category:

less

Transcript and Presenter's Notes

Title: Distributed Query Processing


1
Distributed Query Processing
  • Based on The state of the art in distributed
    query processing Donald Kossman (ACM Computing
    Surveys, 2000)

2
Motivation
  • Cost and scalability network of off-shelf
    machines
  • Integration of different software vendors (with
    own DBMS)
  • Integration of legacy systems
  • Applications inherently distributed, such as
    workflow or collaborative-design
  • State-of-the-art distributed information
    technologies (e-businesses)

3
Part 1 Basics
  • Query Processing Basics
  • centralized query processing
  • distributed query processing

4
Problem Statement
  • Input Query such as Biological objects in
    study A referenced in a literature in journal Y.
  • Output Answer
  • Objectives
  • response time, throughput, first answers, little
    IO, ...
  • Centralized vs. Distributed Query Processing
  • same basic problem
  • but, more and different parameters, such(data
    sites or available machine power) and objectives

5
Steps in Query Processing
  • Input Declarative Query
  • SQL, XQuery, ...
  • Step 1 Translate Query into Algebra
  • Tree of operators (query plan generation)
  • Step 2 Optimize Query
  • Tree of operators (logical) - also select
    partitions of table
  • Tree of operators (physical) also site
    annotations
  • (Compilation)
  • Step 3 Execution
  • Interpretation Query result generation

6
Algebra
A.d
SELECT A.d FROM A, B WHERE A.a B.b
AND A.c 35
A.a B.b, A.c 35
X
A
B
  • relational algebra for SQL very well understood
  • algebra for XQuery mostly understood

7
Query Optimization
A.d
A.d
A.a B.b, A.c 35
hashjoin
X
B.b
A
B
index A.c
B
  • logical, e.g., push down cheap predicates
  • enumerate alternative plans, apply cost model
  • use search heuristics to find cheapest plan

8
Basic Query Optimization
  • Classical Dynamic Programming algorithm
  • Performs join order optimization
  • Input Join query on n relations
  • Output Best join order

9
The Dynamic Prog. Algorithm
for i 1 to n do optPlan(Ri)
accessPlans(Ri) prunePlans(optPlan(Ri)) for
i 2 to n do for all S ? R1, R2 Rn
such that S i do optPlan(S) ?
for all O ? S do optPlan(S) optPlan(S) ?
joinPlans(optPlan(O), optPlan(S
O)) prunePlans(optPlan(S)) return
optPlan(R1, R2, Rn)
10
Query Execution
John
A.d
(John, 35, CS)
hashjoin
(CS) (AS)
(John, 35, CS) (Mary, 35, EE)
B.b
(Edinburgh, CS,5.0) (Edinburgh, AS, 6.0)
index A.c
B
  • library of operators (hash join, merge join, ...)
  • exploit indexes and clustering in database
  • pipelining (iterator model)

11
Summary Centralized Queries
  • Basic SQL (SPJG, nesting) well understood
  • Very good extensibility
  • spatial joins, time series, UDF, xquery, etc.
  • Current problems
  • Better statistics cost model for optimization
  • Physical database design expensive complex
  • Some Trends
  • interactiveness during execution
  • approximate answers, top-k
  • self-tuning capabilities (adaptive robust etc.)

12
Distributed Query Processing Basics
  • Idea
  • Extension of centralized query processing.
    (System R et al. in 80s)
  • What is different?
  • extend physical algebra sendreceive operators
  • other metrics optimize for response time
  • resource vectors, network interconnect matrix
  • caching and replication
  • less predictability in cost model (adaptive
    algos)
  • heterogeneity in data formats and data models

13
Issues in Distributed Databases
  • Plan enumeration
  • The time and space complexity of traditional
    dynamic programming algorithm is very large
  • Iterative Dynamic Programming (heuristic for
    large queries)
  • Cost Models
  • Classic Cost Model
  • Response Time Model
  • Economic Models

14
Distributed Query Plan
A.d
Forms Of Parallelism?
hashjoin
receive
receive
send
send
B.b
index A.c
B
15
Cost Resource Utilization
Total Cost Sum of Cost of Ops Cost 40
1
8
1
6
1
6
2
5
10
16
Another Metric Response Time
Total Cost 40 first tuple 25 last tuple 33
25, 33
Pipelined parallelism
24, 32
0, 7
0, 24
Independent parallelism
0, 6
0, 18
0, 12
first tuple 0 last tuple 10
0, 5
0, 10
17
Query Execution Techniques for Distributed
Databases
  • Row Blocking
  • Multi-cast optimization
  • Multi-threaded execution
  • Joins with horizontal partitioning
  • Semi joins
  • Top n queries

18
Query Execution Techniques for DD
  • Row Blocking
  • SEND and RECEIVE operators in query plan to model
    communication
  • Implemented by TCP/IP, UDP, etc.
  • Ship tuples in block-wise fashion (batch) smooth
    burstiness

19
Query Execution Techniques for DD
  • Multi-cast Optimization
  • Location of sending/receiving may affect
    communication costs forwarding versus
    multi-casting
  • Multi-threaded execution
  • Several threads for operators at the same site
    (intra-query parallelism)
  • May be useful to enable concurrent reads for
    diverse machines (while continuing query
    processing)
  • Must consider if resources warrant concurrent
    operator execution (say two sorts each needing
    all memory)

20
Query Execution Techniques for DD
  • Joins with Data (horizontal) partitioning
  • Hash-based partitioning to conduct joins on
    independent partitions
  • Semi Joins
  • Reduce communication costs Send only join keys
    instead of complete tuples to the site to extract
    relevant join partners
  • Double-pipelined hash joins
  • Non-blocking join operators to deliver first
    results quickly fully exploit pipelined
    parallelism, and reduce overall response time
  • Top n queries
  • Isloate top n tuples quickly and only perform
    other expensive operations (like sort, join, etc)
    on those few (use stop operators)

21
Adaptive Algorithms
  • Deal with unpredictable events at run time
  • delays in arrival of data, burstiness of network
  • autonomity of nodes, changes in policies
  • Example double pipelined hash joins
  • build hash table for both input streams
  • read inputs in separate threads
  • good for bursty arrival of data
  • Re-optimization at run time (LEO, etc.)
  • monitor execution of query
  • adjust estimates of cost model
  • re-optimize if delta is too large

22
Special Techniques for Client-Server Architectures
  • Shipping techniques
  • Query shipping
  • Data shipping
  • Hybrid shipping
  • Query Optimization
  • Site Selection
  • Where to optimize
  • Two Phase Optimization

23
Special Techniques for Federated Database Systems
  • Wrapper architecture
  • Query optimization
  • Query capabilities
  • Cost estimation
  • Calibration Approach
  • Wrapper Cost Model
  • Parameter Binding

24
Heterogeneity
  • Use Wrappers to hide heterogeneity
  • Wrappers take care of data format, packaging
  • Wrappers map from local to global schema
  • Wrappers carry out caching
  • connections, cursors, data, ...
  • Wrappers map queries into local dialect
  • Wrappers participate in query planning!!!
  • define the subset of queries that can be handled
  • give cost information, statistics
  • capability-based rewriting

25
Summary
  • Theory well understood
  • extend traditional (centralized) query processing
  • add many more details
  • heterogenity needs manual work and wrappers
  • Problems in Practice
  • cost model, statistics
  • architectures are not fit for adaptivity,
    heterogeneity
  • optimizers do not scale for 10,000s of sites
  • autonomy of sites systems not built for
    asynchronous communication

26
Middleware
  • Two kinds of middleware
  • data warehouses
  • virtual integration
  • Data Warehouses
  • good query response times
  • good materializes results of data cleaning
  • bad high resource requirements in middleware
  • bad staleness of data
  • Virtual Integration
  • the opposite
  • caching possible to improve response times

27
Virtual Integration
Query
Middleware (query decomposition, result
composition)
wrapper
wrapper
sub query
sub query
DB1
DB2
28
IBM Data Joiner
SQL Query
Data Joiner
wrapper
wrapper
sub query
sub query
SQL DB1
SQL DB2
29
Adding XML
Query
XML Publishing
Middleware (SQL)
wrapper
wrapper
sub query
sub query
DB1
DB2
30
XML Data Integration
XML Query
Middleware (XML)
XML query
XML query
wrapper
wrapper
DB1
DB2
31
XML Data Integration
  • Example BEA Liquid Data
  • Advantage
  • Availability of XML wrappers for all major
    databases
  • Problems
  • XML - SQL mapping is very difficult
  • XML is not always the right language (e.g.,
    decision support style queries)

32
Web Services
  • Idea Encapsulate Data Source
  • provide WSDL interface to access data
  • works very well if query pattern is known
  • Problem Exploit Capability of Source
  • WSDL limits capabilities of data source
  • good optimization requires white box
  • example access by id, access by name, full
    scanshould all combinations be listed in WSDL?
  • Solution WSDL for Query Planning

33
Summary
  • Middleware looks like a homogenous centralized
    database
  • location transparency
  • data model transparency
  • Middleware provides global schema
  • data sources map local schemas to global schema
  • Various kinds of middleware (SQL, XML)
  • Stacks of middleware possible
  • Data cleaning requires special attention
Write a Comment
User Comments (0)
About PowerShow.com