Title: System%20R%20Based%20Query%20Execution%20Optimization%20for%20Internet%20Information%20Gathering
1System R Based Query Execution Optimization for
Internet Information Gathering
MS Thesis Defense
Senthil Gnanaprakasam
MS Committee Dr Subbarao Kambhampati Dr Chitta
Baral Dr Susan D Urban
http//tsangpo.eas.asu.edu
2Organization
Internet Information Gathering Internet
databases Join ordering issues Current methods
and algorithms Internet System R
Algorithm Implementation and conclusion
3Internet Information Gathering
- Internet Information Gathering
4Need to order pizzas!
5Need to order pizzas!
6Need to order pizzas!
7Need to order pizzas!
8Need to order pizzas!
9Need to order pizzas!
Information Gatherer
Other Sources
10Information Integration
Uniform query interface Uses mediated
schema/virtual relations Source descriptions
statistics
Global Data Model
Query rewriting
Wrapper
Query plan optimization
Wrapper
Query execution engine
Wrapper
11Query Rewriting
- Sound Does all data returned satisfy given
query? - Complete Does the query return all possible
sound results?
12Query Plan Optimization
Duplicate sources
- Subsumption of sources
- Quality of data
- Cost of accessing sources
- Ordering for optimized cost
Pay-per-use, Network costs
13Internet Databases
14Types of internet databases
1
Form interfaced database
2
Text database
Intranet databases
3
15Source Statistics
- Access and transfer time vary widely with
- Type of source local, intranet, Internet
- Time of the day
- Number and speed of servers
- Reliability of connection
16Binding Constraints
- What attributes can be bound?
- Books isbn, title, author, publisher, price,
pagesf - YellowPages (lastNamef, firstName, zip, phoneb)
- YellowPages (lastNameb, firstName, zip, phonef)
17 18Importance of join ordering
P ? Q
Q ? P
a 100t
100a 10t
a 10t
10a 100t
Cost(P?Q) a 100t 100a 10t
Cost(Q?P) a 10t 10a 100t
Internet source access time
90a
Traditional Hard disk seek time
19Internet Information Gathering
University (Intranet)
Administration
Student
Library (Machine)
Information Gatherer
Books
Lost
Borrow
20Schema
- Books (isbnb, title, author, publisher, price,
pages) - Student (idb, firstName, lastName)
- Borrow (studentId, isbn, dateIssued)
- Lost (isbn)
- SELECT
- FROM Student, Books, Lost, Borrow
- WHERE
- Borrow.isbnLost.isbn AND Books.isbnLost.isbn
AND Borrow.idStudent.id
21Current Methods Algos
- Current Methods Algorithms
22Bound Is Easier
- Binding values to attributes produces lesser
number of results - Valid heuristic in absence of source statistics
- Example
- UniversityStudent(Name, Age, Sex, Dept) 50,000
- UniversityStudent(Name, Age, Sex, CS)
4,000 - UniversityStudent(Name, Age, M, CS)
2,000
23Greedy Algorithm
- Based on Bound-Is-Easier heuristic
- Gives importance to access costs keeping the
Internet scenario - Maintains a list of feasible binding patterns
- Views sources/binding patterns as either High
Traffic Binding Pattern HTBP or Low Traffic - Attempts are made at each iteration to access the
most general feasible binding pattern if not in
HTBP - Sentinel checks ensure the algorithm proceeds to
completion
24Join Ordering Strategies
Bound is easier
UniversityStudent(Name, Age, Sex,
Dept) Student(Name, Id)
Greedy Algorithm
Student(Name, Id, Dept) UniversityStudent(Age,
Sex, Dept)
System R
Static query optimization algorithm Exhaustive
search Dynamic programming approach Retains
candidate trees with smallest cost and prunes
others
25Shortcomings of System R
Binding restrictions not taken care of
?
Bushy
?
?
?
R4
?
R3
R4
?
R1
R2
R3
Left linear
R2
R1
Bushy trees not considered
26ISR Algo
- Internet System R Algorithm
27Internet System R
- Update bindings obtained from previous level of
subplans - Search all types of trees (left linear, bushy
right linear) - Use full set of statistics to estimate sizes
- Preserves graceful degradation property
- Trade off Planning vs. execution time
28Algorithm
- INPUTS
- S 1..m Array of all subgoals expanded w.r.t
binding patterns - Associated data structure along with above
which will help calculate costs -
- Initialize NODE with
- PP nil Bindings f Cost0.
-
- IF S has a corresponding BestPlan
- return the corresponding join order
- ENDIF
-
- REPEAT
- FOR i 1 TO number of feasible leaf nodes
- FOR j 1 TO QCi DO
- LET LeftSubGoal jth element in
QCi - LET RightSubGoal S - LeftSubGoal
- Recursively call this algorithm with
LeftSubGoal and RightSubgoal - CurPlan Make a new plan by joining
the above resultant plans - IF it has a lower cost than current
BestPlan THEN
Perform feasibility check
Bushy trees
Pruning
29Example
- ISR (Student, Books, Borrow, Lost)
- ISR (Student, Books, Borrow) ? ISR (Lost)
- ISR (Student, Books, Lost) ? ISR (Borrow)
- ISR (Student, Lost, Borrow) ? ISR (Books)
- ISR (Books, Lost, Borrow) ? ISR (Student)
- ISR (Lost) ? ISR (Student, Books, Borrow)
- ISR (Borrow) ? ISR (Student, Books, Lost)
- ISR (Books) ? ISR (Student, Lost, Borrow)
- ISR (Student) ? ISR (Books, Lost, Borrow)
- ISR (Student, Books) ? ISR (Lost Borrow)
- ISR (Student, Lost) ? ISR (Books, Borrow)
- ISR (Student, Borrow) ? ISR (Books, Lost)
- ISR (Lost, Borrow) ? ISR (Student, Books)
- ISR (Books, Borrow) ? ISR (Student, Lost)
- ISR (Books, Lost) ? ISR (Student, Borrow)
30Feasiblity Check
- Books (isbnb, title, author, publisher, price,
pages) - Student (idb, firstName, lastName)
- Borrow (studentId, isbn, dateIssued)
- Lost (isbn)
- ISR (Student, Books) ? ISR (Lost, Borrow)
31Pruning
- Plan Cost
- (Student ? Borrow) ? (Books ? Lost) 2100
- (Books ? Lost) ? (Student ? Borrow) 2800
Pruned
32Impl Conclusion
- Implementation Conclusion
33Implementation
- Java 2 on Sun Solaris
- Simulated sources with variable statistics
- Measured time independent data
34Experiments
- How important is access time?
- How big is the search space? How much is the
overhead? - Measure tradeoff between planning and execution
time - What if all statistics are not available?
35Access Transfer Times
Measured for an intranet Internet source on a
T1 line intranet Access time 92ms Transfer
time 25ms/kb Internet Access time
4.8s Transfer time 25ms/kb
Access time is a large enough cost worth
spending time to optimize
36ISR vs. SR
- Empirical evaluation of search space
- Trade off between planning time and execution time
Fast processors/slower network ? Lower total cost
It is worth exploring a larger search space
More optimal solution ? Lower execution cost
Larger search space? Higher planning cost
37Effect of Binding Patterns
- Search space increases with number of sources
- Search space is more constrained and small as
number of binding restrictions increase
Size of search space is a non-trivial function
of the given parameters
38Graceful Degradation
Is searching a larger space a good idea when
statistics are not fully available?
Yes Preserves graceful degradation property of
traditional System R
39Related Work
- Florescu, Levy et al
- Do not consider access costs being important
- Bottom up approach- build partial plans and check
which ones lead to complete plans - Consider planning time important and generate a
best first method to produce before the algorithm
runs to completion
40Related Work
- Kabra and DeWitt
- Generate a seemingly optimal plan
- Difficult to gather statistics
- Run time collection of statistics and
modification of plan
41Related Work
- Urhan, Franklin et al
- Access costs are most important
- Concentrate on initial delays
- Change plan if delays exceed a limit
42Contributions
- Analysis of importance of considering access
costs - Developed Internet System R Algorithm
- Included binding constraints over traditional
System R - Increased search space to include bushy trees
- Empirical evaluation of total cost compared to
planning and execution costs - Examined preservation of graceful degradation
with a larger search space and partial statistics
43Papers Published
- LKG99 Eric Lambrecht, Subbarao Kambhampati and
Senthil Gnanaprakasam Optimizing Recursive
Information Gathering Plans. In Proceedings of
the IJCAI-99. - KG99 Subbarao Kambhampati and Senthil
Gnanaprakasam. Optimizing source-call ordering in
information gathering plans. Proceedings of the
IJCAI-99 Workshop on Intelligent Information
Integration.
44Thank You
45Where Do I Go From Here?