Query Processor - PowerPoint PPT Presentation

1 / 73
About This Presentation
Title:

Query Processor

Description:

The relational algebra is used to express the output of the query. ... And resp = manager'' EMP(eno, ename, title) ASG(eno, jno, resp, dur) ... – PowerPoint PPT presentation

Number of Views:197
Avg rating:3.0/5.0
Slides: 74
Provided by: gangaI
Category:

less

Transcript and Presenter's Notes

Title: Query Processor


1
Query Processor
  • A query processor is a module in the DBMS that
    performs the tasks to process, to optimize, and
    to generate execution strategy for a high-level
    query
  • For a DDBMS, the QP also does data localization
    for the query based on the fragmentation scheme
    and generates the execution strategy that
    incorporates the communication operations
    involved in processing the query

2
Query Optimizer
  • Queries expressed in SQL can have multiple
    equivalent relational algebra query expressions
  • The distributed query optimizer must select the
    ordering of relational algebra operations, sites
    to process data, and possibly the way data should
    be transferred. This makes distributed query
    processing significantly more difficult

3
Complexity of Relational Algebra Operations
  • The relational algebra is used to express the
    output of the query. The complexity of relational
    algebra operations play a role in defining some
    of the principles of query optimization. All
    complexity measures are based on the cardinality
    of the relation
  • Operations Complexity
  • Select, Project (w/o duplicate elimination) O(n)
  • Project (with duplicate elimination), Group O(n
    logn)
  • Join, Semi-join, Division, Set Operators O(n
    logn)
  • Cartesian Product O(n2 )

4
Characteristics of Query Processors
  • Languages
  • Input language can be relational algebra or
    calculus output language is relational algebra
    (annotated with communication primitives). The
    query processor must efficiently map input
    language to output language
  • Types of Optimization
  • The output language specification represents the
    execution strategy. There can be many such
    strategies, the best one can be selected through
    exhaustive search, or by applying heuristic
    (minimize size of intermediate relations). For
    distributed databases semijoins can be applied
    to reduce data transfer.

5
When to Optimize
  • Static done before executing the query (at
    compilation time), cost of optimization amortized
    over multiple executions, mostly based on
    exhaustive search. Since sizes of intermediate
    relations need to be estimated, it can result in
    sub-optimal strategies.
  • Dynamic done at run time every time the query
    is executed, can make use of exact sizes of
    intermediate relations, expensive, based on
    heuristics
  • Hybrid mixes static and dynamic approaches the
    approach is mainly static, but dynamic query
    optimization may take place when high difference
    between predicted and actual sizes are detected

6
Characteristics of Query Processors
  • Statistics
  • fragment cardinality and size
  • size and number of distinct values for each
    attribute. detailed histograms of attribute
    values for better selectivity estimation.
  • Decision Sites
  • one site or several sites participate in
    selection of strategy
  • Exploitation of network topology
  • wide area network communication cost
  • local area network parallel execution

7
Characteristics of Query Processors
  • Exploitation of replicated fragments
  • larger number of possible strategies
  • Use of Semijoins
  • reduce size of data transfer
  • increase of messages and local processing
  • good for fast or slow networks?

8
Layers of Query Processing
Calculus Query on Distributed Relations
QUERY DECOMPOSITION
CONTROL SITE
Algebra Query on Distributed Relations
DATA LOCALIZATION
Fragment Query
GLOBAL OPTIMIZATION
Optimized Fragment Query With Communication
Operations
LOCAL SITE
LOCAL OPTIMIZATION
Optimized Local Queries
9
Query Decomposition
  • Normalization
  • The calculus query is written in a normalized
    form (CNF or DNF) for subsequent manipulation
  • Analysis
  • The query is analyzed for semantic correctness
  • Simplification
  • Redundant predicates are eliminated to obtain
    simplified queries
  • Restructuring
  • The calculus query is translated to optimal
    algebraic query representation

10
Query Decomposition Normalization
  • Lexical and syntactic analysis
  • check validity
  • check for attributes and relations
  • type checking on the qualification
  • There are two possible forms of representing the
    predicates in query qualification Conjunctive
    Normal Form (CNF) or Disjunctive Normal Form
    (DNF)
  • CNF (p11 ? p12 ?... ? p1n) ? ... ? (pm1 ? pm2
    ?... ? pmn)
  • DNF (p11 ? p12 ?... ? p1n) ? ... ? (pm1 ? pm2
    ?... ? pmn)
  • OR's mapped into union
  • AND's mapped into join or selection

11
Query Decomposition Analysis
  • Queries are rejected because
  • the attributes or relations are not defined in
    the global schema or
  • operations used in qualifiers are semantically
    incorrect
  • For only those queries that do not use
    disjunction or negation semantic correctness can
    be determined by using query graph
  • One node of the query graph represents result
    sites, others operand relations, edge between
    nodes operand nodes represent joins, and edge
    between operand node and result node represents
    project

12
Query Graph and Join Graph
SELECT Ename, Resp FROM E, G, J WHERE E. ENo
G. ENO AND G.JNO J.JNO AND JNAME CAD''
AND DUR gt 36 AND Title Prog''
E. ENo G. ENO
G.JNO J.JNO
Ename
Resp
JNAME CAD''
Title Prog''
E. ENo G. ENO
G.JNO J.JNO
DUR gt 36
13
Disconnected Query Graph
  • Semantically incorrect conjunctive multivariable
    query without negation have query graphs which
    are not connected

SELECT Ename, Resp FROM E, G, J WHERE E. ENo
G. ENO AND JNAME CAD'' AND DUR gt 36 AND
Title Prog''
Ename
Resp
Title Prog''
E. ENo G. ENO
DUR gt 36
JNAME CAD''
14
Simplification Eliminating Redundancy
  • Elimination of redundant predicates using well
    known idempotency rules
  • p ? p p p ? p p p ? true true
  • p ? false p p ? true p p ? false
    false
  • p1 ? (p1 ? p 2 ) p1
  • p1 ? (p1 ? p 2 ) p1
  • Such redundant predicates arise when user query
    is enriched with several predicates to
    incorporate view relation correspondence, and
    ensure semantic integrity and security

15
Eliminating Redundancy-- An Example
SELECT TITLE FROM E WHERE (NOT (TITLE
Programmer'') AND (TITLE Programmer'' OR
TITLE Elec.Engr'') AND NOT (TITLE
Elec.Engr'')) OR ENAME J.Doe''
SELECT TITLE FROM E WHERE ENAME J.Doe''
16
Eliminating Redundancy-- An Example
p1 ltTITLE Programmer''gt p2 ltTITLE
Elec. Engr''gt p3 ltENAME J.Doe''gt
Let the query qualification is ( p1 ? (p1 ?
p2) ? p2) ? p3
The disjunctive normal form of the query is (
p1 ? p1 ? p2) ? ( p1 ? p2 ? p2) ? p3
(false ? p2) ? ( p1 ? false) Ú p3 false ?
false ? p3 p3
17
Query Decomposition Rewriting
  • Rewriting calculus query in relational algebra
  • straightforward transformation from relational
    calculus to relational algebra, and
  • restructuring relational algebra expression to
    improve performance

18
Rewriting -- Transformation Rules (I)
  • Commutativity of binary operations
  • R ? S ? S ? R
  • R ? S ? S ? R
  • Associativity of binary operations
  • (R ? S) ? T ? R ? ( S ? T )
  • Idempotence of unary operations grouping of
    projections and selections
  • ?A ( ? A (R )) ? ? A (R ) for A?A? A
  • ?p1(A1) ( ? p2(A2) (R )) ? ? p1(A1) ?p2(A2) (R )

19
Rewriting -- Transformation Rules (II)
  • Commuting selection with projection
  • ?A1, , An ( ? p (Ap) (R )) ? ?A1, , An ( ? p
    (Ap) ( ?A1, , An, Ap(R )))
  • Commuting selection with binary operations
  • ? p (Ai)(R ? S) ? (? p (Ai)(R)) ? S
  • ? p (Ai)(R S) ? (? p (Ai)(R)) S
  • ? p (Ai)(R ? S) ? ? p (Ai)(R) ? ? p (Ai)(S)
  • Commuting projection with binary operations
  • ?C(R ? S) ? ?A(R) ? ?B (S) C A ? B
  • ?C(R S) ? ?C(R) ?C (S)
  • ?C (R ? S) ? ?C (R) ? ?C (S)

20
An SQL Query and Its Query Tree
?ENAME
?(ENAMEltgtJ.DOE )?(JNAMECAD/CAM )? (Dur12 ?
Dur24)
SELECT Ename FROM J, G, E WHERE G.EnoE.ENo
AND G.JNo J.JNo AND ENAME ltgt J.Doe'
AND JName CAD' AND (Dur12 or
Dur24)
PROJ
ASG
EMP
21
Query Decomposition Rewriting
?ENAME
? JNO, ENAME
? JNO
?JNAMECAD/CAM
? JNO, ENO
? ENO, ENAME
?Dur12 ? Dur24
?ENAMEltgtJ.DOE
PROJ
ASG
EMP
22
Data Localization
  • Input Algebraic query on distributed relations
  • Determine which fragments are involved
  • Localization program
  • substitute for each global query its
    materialization program
  • optimize

23
Data Localization-- An Example
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
?ENAME
?Dur12 ? Dur24
?ENAMEltgtJ.DOE
ASG is fragmented into ASG1 ?ENO? E3
(ASG) ASG2 ?ENO gtE3 (ASG)
?JNAMECAD/CAM
PROJ
?
?
ASG1
ASG1
ASG2
EMP1
EMP1
EMP1
24
Reduction with Selection
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
SELECT FROM EMP WHERE ENOE5
Given Relation R, FRR1, R2, , Rn where Rj
?pj(R) ?pj(Rj) ? if ?x ? R ?(pi(x)?pj(x))
25
Reduction with join
SELECT FROM EMP, ASG WHERE EMP.ENOASG.ENO
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
ASG is fragmented into ASG1 ?ENO? E3
(ASG) ASG2 ?ENO gtE3 (ASG)
26
Reduction with Join (I)
(R1 ? R2) S ? (R1 S) ? (R2
S)
27
Reduction with Join (II)
?
Given Ri ?pi(R) and Rj ?pj(R) Ri Rj ? if
?x ? Ri , ?y? Rj ?(pi(x)?pj(y))
Reduction with join 1. Distribute join over
union 2. Eliminate unnecessary work
28
Reduction for VF
  • Find useless intermediate relations
  • Relation R defined over attributes A A1, A2,
    , An vertically fragmented as Ri ?A (R) where
    A? A
  • ?K,D (Ri) is useless if the set of projection
    attributes D is not in A

EMP1 ?ENO,ENAME (EMP) EMP2 ?ENO,TITLE (EMP)
?ENAME
SELECT ENAME FROM EMP
EMP1
29
Reduction for DHF
Distribute joins over union Apply the join
reduction for horizontal fragmentation
SELECT FROM EMP, ASG WHERE ASG.ENO
EMP.ENO AND EMP.TITLE Mech. Eng.
30
Reduction for DHF (II)
Joins over union
?
31
Reduction for HF
  • Remove empty relations generated by contradicting
    selection on horizontal fragments
  • Remove useless relations generated by projections
    on vertical fragments
  • Distribute joins over unions in order to isolate
    and remove useless joins

32
Reduction for HF --An Example
EMP1 ?ENO?E4 (?ENO,ENAME (EMP)) EMP2
?ENOgtE4 (?ENO,ENAME (EMP)) EMP3 ?ENO,TITLE
(EMP) QUERY SELECT ENAME FROM EMP WHERE ENO
E5
33
Why Optimization An Example Query
Database
EMP(eno, ename, title) ASG(eno, jno, resp, dur)
RA tree
Query
Find the name of the employees who are managing a
project?
SQL
Select ename From EMP e, ASG g Where e.Eno g.
Eno And resp manager
34
Example - Strategies
Plan B
Site 5
Fragment Schema
?respmanager
EMP1 ?ENO lt 100(EMP) at site 1 EMP2 ?ENO gt
100(EMP) at site 2 ASG1 ?ENO lt 100(ASG) at
site 3 ASG2 ?ENO gt 100(ASG) at site 4
?
?
ASG1
EMP1
ASG2
EMP2
Query site Site 5
Plan A
ASG1
ASG2
35
Example DB Statistics Costs
  • Database Statistics
  • EMP has 400 tuples,
  • ASG has 1000 tuples,
  • there are 20 managers in G
  • the data is uniformly distributed among sites.
  • ASG and EMP are locally clustered on attributes
    RESP and ENO, respectively
  • Costs
  • tuple access tacc 1 unit,
  • tuple transfer ttrans 10 units,

36
Costs for Example Plan
  • The cost of Plan A
  • Produce ASG 20 ? tacc 20 (processing
    locally)
  • Transfer ASG 20 ttrans 200 (transfer to
    EMP site)
  • Produce EMP (1010) tacc 2 40 (join at
    the EMP site)
  • Transfer EMP 20 ttrans 200 (send to Site
    5)
  • Total cost 460
  • The cost of Plan B
  • Transfer EMP 400 ttrans 4,000 (send EMP
    to Site 5)
  • Transfer ASG 1000 ttrans 10,000 (send ASG
    to Site 5)
  • Produce ASG 1000 tacc 1,000 (selection
    at Site 5)
  • Join EMP and ASG 400 20 tacc 8,000
    (join at Site 5)
  • Total cost 23,000

37
Query Optimization
  • Problems in query optimization
  • Determining the physical copies of the fragments
    upon which to execute the fragment query
    expressions (also known as materialization)
  • Selecting the order of execution of operations
  • Selecting the method for executing each operation
  • The above problems are not independent, for
    instance, the choice of the best materialization
    for a query depends on the order in which
    operations are executed. But they are treated as
    independent. Further,
  • We bypass (1) by taking materialization for
    granted
  • We bypass (3) by clustering all operations at the
    same site as a local database system dependent
    problem

38
Query Optimization - Objectives
  • The selection of alternative query execution
    strategies is made based on predetermined
    objectives
  • Two main objectives
  • minimize the total processing time (total cost)
  • network and computers at nodes do not get loaded.
  • Response time cannot be guaranteed
  • minimize the response time
  • allocation must facilitate parallel execution of
    the query
  • but throughput may decrease and cost can be
    higher than total cost
  • Total processing time (cost) is the sum of all
    the time (cost) incurred in executing the query
    (CPU, I/O, data transfer)
  • Response time is the elapsed time from the
    initiation till the completion of the query

39
Optimization Algorithms The Issues
  • Cost model
  • cost components
  • weights for each components
  • costs for primitive operations
  • Search space
  • The set of equivalent algebra expressions (query
    trees)
  • Search strategies
  • How do we move inside the search space
  • Exhaustive search, heuristics,

40
Cost Models
  • The cost measures are I/O and CPU for
    centralized DBMSs and I/O, CPU and data transfer
    costs for DDBMS
  • Total cost CPU cost I/O cost communication
    cost
  • CPU cost Ccpu insts
  • I/O cost C i/o i/os
  • Communication Cost Cmsgmsgs Ctrbytes
  • Ccpu, C i/o, Ctr and Cmsg are all assumed to be
    constants.
  • Response time sum (sequential operations)
  • Ccpus_insts
  • Ci/os_i/os
  • Cmsgs_msg ctrs_bytes
  • S_x stands for maximum number of sequential xs
    that need to be executed to process the query

41
Intermediate Result Size
  • The size of the intermediate relations produced
    during the execution facilitates the selection of
    the execution strategy
  • This is useful in selecting an execution strategy
    that reduces data transfer
  • The sizes of intermediate relations need to be
    estimated based on cardinalities of relations and
    lengths of attributes
  • RA1, A2,..., An fragmented as R1,R2,, Rn the
    statistical data collected typically are
  • len(Ai), length of attribute Ai in bytes
  • min(Ai) and max(Ai) for ordered domains
  • card(dom(Ai)) unique values in domAi
  • Number of tuples in each fragment card(Rj)

42
Intermediate Size Estimation
  • Join selectivity factor
  • SFj(r,s) card(r s) / card(r) card(s)
  • Selecton selectivity factor
  • SFS(F) card(?f(r)) / card(r)
  • size(r) card(r) len(r)
  • Cardinality of intermediate relations
  • SFS(A value) 1/card(dom(A))
  • SFS(A gt value) max(A) - value/max(A)-min(A)
  • SFS(A lt value) value - min(A)/max(A)-min(A)
  • Sfs(p(Ai)?p(Aj)) sfs(p(Ai)) sfs(p(Aj))
  • Sfs(p(Ai)? p(Aj)) sfs(p(Ai)) sfs(p(Aj)) -
    sfs(p(Ai)) sfs(p(Aj))
  • SFS(A ? values) SFS(A value) card(values)

43
Intermediate Size Estimation (II)
  • Projection
  • card(?a(r)) card(r)
  • Cartesian product
  • card(r X S) card(r) card(s)
  • Join
  • card(R AB S) card(s)
  • if A is key in R, B is foreign key in S
  • card(R AB S) SFJ(R,S) card(r)
    card(s)
  • Union
  • Upper bound card(r) card(s)
  • Lower bound maxcard(r), card(s)

44
Cost of Processing Primitive Operations
  • Selection
  • Projection
  • Union
  • Join
  • nested-loops
  • sort-merge
  • hash-based
  • For distributed join, semi-join is proposed to
    perform joins

45
Semi-join
Amount of data transferred R S
  • join is replaced with a project followed by
    semi-join and then join
  • the project and join operations are done at one
    site, and semi-join at another site
  • amount of data transferred R S

46
Semi-join versus Join
  • using sem-ijoin increases local processing costs
    because a relation must be scanned twice (join,
    project)
  • For joining intermediate relations produced
    during sem-ijoin one cannot exploit indices on
    the base relations
  • Sem-ijoin may not be good when communication
    costs are low

47
Search Space
SELECT ENAME, RESP FROM EMP, ASG, PROJ WHERE
EMP.ENOASG.ENO AND ASG.PNOPROJ.PNO
  • Search space is characterized by alternative
    execution plans
  • Most optimizers focus on join trees
  • For N relations, there are O(N!) equivalent join
    trees

?
ASG
EMP
PROJ
48
Restricting Search Space
deep tree
  • O(N!) is large
  • Considering join methods, the search space is
    even bigger
  • Restrict by means of heuristics
  • Ignore cartisian product
  • Restrict the shape of the join tree
  • Only consider deep trees
  • .

R4
R1
R3
R2
bushy tree
R4
R3
R1
R2
Left-deep tree
R1
R2
R3
R4
49
Search Strategy
  • How to move in the search space to find the
    optimal plan
  • Deterministic
  • Start from base relations and build plans by
    adding relations at each step
  • Dynamic programming breadth-first
  • Greedy depth-first
  • Randomized
  • Search for the optimal one around a particular
    starting point
  • simulated annealing
  • iterative improvement

50
Search Strategies -- Example
Deterministic
Randomized
51
Distributed Query Optimization Algorithms
  • System R and R
  • Hill Climbing and SDD-1

52
System R (Centralized) Algorithm
  • Simple (one relation) queries are executed
    according to the best access path.
  • Execute joins
  • Determine the possible ordering of joins
  • Determine the cost of each ordering
  • Choose the join ordering with the minimal cost
  • For joins, two join methods are considered
  • Nested loops
  • Merge join

53
System R Algorithm -- Example
  • Names of employees working on the CAD/CAM
    project
  • Assume
  • EMP has an index on ENO,
  • ASG has an index on PNO,
  • PROJ has an index on PNO and an index on PNAME

54
System R Algorithm -- Example
  • Choose the best access paths to each relation
  • EMP sequential scan (no selection on EMP)
  • ASG sequential scan (no selection on ASG)
  • PROJ index on PNAME (there is a selection on
    PROJ based on PNAME)
  • Determine the best join ordering
  • EMP ASG PROJ
  • ASG PROJ EMP
  • PROJ ASG EMP
  • ASG EMP PROJ
  • EMP ? PROJ ASG
  • PROJ ? EMP ASG
  • Select the best ordering based on the join costs
    evaluated according to the two methods

55
System R Example (cont'd)
PROJ
EMP
ASG
ASG EMP
PROJ ASG
EMP PROJ
  • Best total join order is one of

56
System R Algorithm
  • (PROJ ASG) EMP has a useful index on
    the select attribute and direct access to the
    join attributes of ASG and EMP.
  • Final plan
  • select PROJ using index on PNAME
  • then join with ASG using index on PNO
  • then join with EMP using index on ENO

57
System R Distributed Query Optimization
  • Total-cost minimization. Cost function includes
    local processing as well as transmission.
  • Algorithm
  • For each relation in query tree find the best
    access path
  • For the join of n relations find the optimal join
    order strategy
  • each local site optimizes the local query
    processing

58
Data Transfer Strategies
  • Ship-whole. entire relation is shipped and stored
    as temporary relation, merge join algorithm is
    used, done in pipeline mode
  • Fetch-as-needed. this method is equivalent to
    semijoin of the inner relation with the outer
    relation tuple

59
Join Strategy 1
  • External relation R with internal relation S, let
    LC be local processing cost, CC be data transfer
    cost, let average number of tuples of S that
    match one tuple of R be s
  • Strategy 1. Ship the entire outer relation to the
    site of internal relation
  • TC LC(get R)
  • CC(size(R))
  • LC(get s tuples from S)card(R)

60
Join Strategy 2
  • Ship the entire inner relation to the site of the
    outer relation
  • TC LC(get S)
  • CC(size(S))
  • LC(store S)
  • LC(get R)
  • LC(get s tuples from S)card(R)

61
Join Strategy 3
  • Fetch tuples of the inner relation for each tuple
    of the outer relation
  • TC LC(get R)
  • CC(len(A)) card(R)
  • LC(get s tuples from S) card(R)
  • CC(slen(S))card(R)

62
Join Strategy 4
  • Move both relations to 3rd site and join there
  • TC LC(get R)
  • LC(get S)
  • CC(size(S))
  • LC(store S)
  • CC(size(R))
  • LC(get s tuples from S)card(R)
  • Conceptually, the algorithm does an exhaustive
    search among all alternatives and selects one
    that minimizes total cost

63
Hill Climbing Algorithm - Algorithm
  • Inputs
  • query graph, locations of relations, and relation
    statistics
  • Initial solution
  • the least costly among all when the relations
    are sent to a candidate result site denoted by
    ES0, and the site as chosen site
  • Splits ES0 into
  • ES1 ship one relation of join to the site of
    other relation
  • ES2 these two relations are joined locally and
    the result is transmitted to the chosen site
  • If cost(ES1) cost(ES2) LC gt cost (ES0) select
    ES0,
  • else select ES1 and ES2.
  • The process can be recursively applied to ES1 and
    ES2 till no more benefit occurs

64
Hill Climbing Algorithm - Example
?SAL
TITLE
PAY
EMP
?PNAMECAD/CAM
ASG
Ignore the local processing cost Length of tuples
is 1 for all relation
PROJ
65
HCA - Example
Site1 EMP(8)
ES1
?
Site2 PAY(4)
ES2
Solution 1 Cost
TITLE
?
Site4 ASG(10)
Site3 PROJ(1)
?
ES3
Site1 EMP(8)
ES1
Site2 PAY(4)
ESo is the BEST
ES2
Solution 2 Cost
Site4 ASG(10)
ES3
Site3 PROJ(1)
66
Hill Climbing Algorithm - Comments
  • Greedy algorithm determines an initial feasible
    solution and iteratively tries to improve it.
  • If there are local minimas, it may not find the
    global minima
  • If the optimal solution has a high initial cost,
    it wont be found since it wont be chosen as the
    initial feasible solution.

Site1 EMP(8)
Site2 PAY(4)
Site4 ASG(10)
Site3 PROJ(1)
COST
67
SDD-1 Algorithm
  • SDD-1 algorithm generalized the hill-climbing
    algorithm to determine ordering of beneficial
    semijoins and uses statistics on the database,
    called database profiles.
  • Cost of semijoin
  • Cost (R SJA S) CMSG CTRsize(?A(S))
  • Benefit is the cost of transferring irrelevant
    tuple
  • Benefit(R SJA S) (1-SFSJ(S.A)) size(R) CTR
  • A semijoin is beneficial if cost lt benefit.

68
SDD-1 The Algorithm
  • initialization phase generates all beneficial
    semijoins, and an execution strategy that
    includes only local processing
  • most beneficial semijoin is selected statistics
    are modified and new beneficial semijoins are
    selected
  • the above step is done until no more beneficial
    joins are left
  • assembly site selection to perform local
    operations
  • postoptimization removes unnecessary semijoins

69
SDD1 - Example
SELECT FROM EMP, ASG, PROJ WHERE EMP.ENO
ASG.ENO AND ASG.PNO PROJ.PNO
Site 2 ASG
ENO
PNO
Site 1 EMP
Site 3 PROJ
70
SDD1 - First Iteration
  • SJ1 ASG SJ EMP
  • benefit (1-0.3)3000 2100
  • cost 120
  • SJ2 ASG SJ PROJ
  • benefit (1-0.4)3000 1800
  • cost 200
  • SJ3 EMP SJ ASG
  • benefit (1-0.8)1500 300
  • cost 400
  • SJ4 PROJ SJ ASG
  • benefit 0
  • cost 400
  • SJ1 is selected
  • ASG size is reduced to 30000.3900
  • ASG ASG SJ EMP
  • Semijoin selectivity factor is reduced it is
    approximated by SFSJ(G.ENO) 0.80.3 0.24

71
SDD-1 - Second Third Iterations
  • Second iteration
  • SJ2 ASG SJ PROJ benefit(1-0.4)900540
  • cost200
  • SJ3 EMP SJ ASG benefit(1-0.24)15001140
  • cost400
  • SJ3 is selected
  • EMP EMP SJ ASG size(EMP) 15000.24 360
  • Third Iteration
  • SJ2 ASG SJ PROJ
  • benefit(1-0.4)900540
  • cost200
  • it is selected
  • reduces size of G further to 9000.4360

72
Local Optimization
  • Each site optimizes the plan to be executed at
    the site
  • A centralized query optimization problem

73
SDD-1 - Assembly Site Selection
  • After reduction
  • EMP is at site 1 with size 360
  • ASG is at site 2 with size 360
  • PROJ is at site 3 with size 2000
  • Site 3 is chosen as assembly site
  • no semijoins reduced in post optimization.

Site1 EMP
Site2 ASG
Site3 PROJ
(ASG SJ EMP) SJ PROJ ? site 3 (EMP SJ ASG) ? site
3 join at site 3
Write a Comment
User Comments (0)
About PowerShow.com