System%20R%20Based%20Query%20Execution%20Optimization%20for%20Internet%20Information%20Gathering - PowerPoint PPT Presentation

About This Presentation
Title:

System%20R%20Based%20Query%20Execution%20Optimization%20for%20Internet%20Information%20Gathering

Description:

ISR (Lost, Borrow) ISR (Student, Books) ISR (Books, Borrow) ISR (Student, Lost) ISR (Books, Lost) ISR (Student, Borrow) http://tsangpo.eas.asu.edu ... – PowerPoint PPT presentation

Number of Views:48
Avg rating:3.0/5.0
Slides: 45
Provided by: rakaposh
Category:

less

Transcript and Presenter's Notes

Title: System%20R%20Based%20Query%20Execution%20Optimization%20for%20Internet%20Information%20Gathering


1
System R Based Query Execution Optimization for
Internet Information Gathering
MS Thesis Defense
Senthil Gnanaprakasam
MS Committee Dr Subbarao Kambhampati Dr Chitta
Baral Dr Susan D Urban
http//tsangpo.eas.asu.edu
2
Organization
Internet Information Gathering Internet
databases Join ordering issues Current methods
and algorithms Internet System R
Algorithm Implementation and conclusion
3
Internet Information Gathering
  • Internet Information Gathering

4
Need to order pizzas!
5
Need to order pizzas!
6
Need to order pizzas!
7
Need to order pizzas!
8
Need to order pizzas!
9
Need to order pizzas!
Information Gatherer
Other Sources
10
Information Integration
Uniform query interface Uses mediated
schema/virtual relations Source descriptions
statistics
Global Data Model
Query rewriting
Wrapper
Query plan optimization
Wrapper
Query execution engine
Wrapper
11
Query Rewriting
  • Sound Does all data returned satisfy given
    query?
  • Complete Does the query return all possible
    sound results?

12
Query Plan Optimization
Duplicate sources
  • Subsumption of sources
  • Quality of data
  • Cost of accessing sources
  • Ordering for optimized cost

Pay-per-use, Network costs
13
Internet Databases
  • Internet Databases

14
Types of internet databases
1
Form interfaced database
2
Text database
Intranet databases
3
15
Source Statistics
  • Access and transfer time vary widely with
  • Type of source local, intranet, Internet
  • Time of the day
  • Number and speed of servers
  • Reliability of connection

16
Binding Constraints
  • What attributes can be bound?
  • Books isbn, title, author, publisher, price,
    pagesf
  • YellowPages (lastNamef, firstName, zip, phoneb)
  • YellowPages (lastNameb, firstName, zip, phonef)

17
  • Join Ordering Issues

18
Importance of join ordering
P ? Q
Q ? P
a 100t
100a 10t
a 10t
10a 100t
Cost(P?Q) a 100t 100a 10t
Cost(Q?P) a 10t 10a 100t
Internet source access time
90a
Traditional Hard disk seek time
19
Internet Information Gathering
University (Intranet)
Administration
Student
Library (Machine)
Information Gatherer
Books
Lost
Borrow
20
Schema
  • Books (isbnb, title, author, publisher, price,
    pages)
  • Student (idb, firstName, lastName)
  • Borrow (studentId, isbn, dateIssued)
  • Lost (isbn)
  • SELECT
  • FROM Student, Books, Lost, Borrow
  • WHERE
  • Borrow.isbnLost.isbn AND Books.isbnLost.isbn
    AND Borrow.idStudent.id

21
Current Methods Algos
  • Current Methods Algorithms

22
Bound Is Easier
  • Binding values to attributes produces lesser
    number of results
  • Valid heuristic in absence of source statistics
  • Example
  • UniversityStudent(Name, Age, Sex, Dept) 50,000
  • UniversityStudent(Name, Age, Sex, CS)
    4,000
  • UniversityStudent(Name, Age, M, CS)
    2,000

23
Greedy Algorithm
  • Based on Bound-Is-Easier heuristic
  • Gives importance to access costs keeping the
    Internet scenario
  • Maintains a list of feasible binding patterns
  • Views sources/binding patterns as either High
    Traffic Binding Pattern HTBP or Low Traffic
  • Attempts are made at each iteration to access the
    most general feasible binding pattern if not in
    HTBP
  • Sentinel checks ensure the algorithm proceeds to
    completion

24
Join Ordering Strategies
Bound is easier
UniversityStudent(Name, Age, Sex,
Dept) Student(Name, Id)
Greedy Algorithm
Student(Name, Id, Dept) UniversityStudent(Age,
Sex, Dept)
System R
Static query optimization algorithm Exhaustive
search Dynamic programming approach Retains
candidate trees with smallest cost and prunes
others
25
Shortcomings of System R
Binding restrictions not taken care of
?
Bushy
?
?
?
R4
?
R3
R4
?
R1
R2
R3
Left linear
R2
R1
Bushy trees not considered
26
ISR Algo
  • Internet System R Algorithm

27
Internet System R
  • Update bindings obtained from previous level of
    subplans
  • Search all types of trees (left linear, bushy
    right linear)
  • Use full set of statistics to estimate sizes
  • Preserves graceful degradation property
  • Trade off Planning vs. execution time

28
Algorithm
  • INPUTS
  • S 1..m Array of all subgoals expanded w.r.t
    binding patterns
  • Associated data structure along with above
    which will help calculate costs
  •  
  • Initialize NODE with
  • PP nil Bindings f Cost0.
  • IF S has a corresponding BestPlan
  • return the corresponding join order
  • ENDIF
  • REPEAT
  • FOR i 1 TO number of feasible leaf nodes
  • FOR j 1 TO QCi DO
  • LET LeftSubGoal jth element in
    QCi
  • LET RightSubGoal S - LeftSubGoal
  • Recursively call this algorithm with
    LeftSubGoal and RightSubgoal
  • CurPlan Make a new plan by joining
    the above resultant plans
  • IF it has a lower cost than current
    BestPlan THEN

Perform feasibility check
Bushy trees
Pruning
29
Example
  • ISR (Student, Books, Borrow, Lost)
  • ISR (Student, Books, Borrow) ? ISR (Lost)
  • ISR (Student, Books, Lost) ? ISR (Borrow)
  • ISR (Student, Lost, Borrow) ? ISR (Books)
  • ISR (Books, Lost, Borrow) ? ISR (Student)
  • ISR (Lost) ? ISR (Student, Books, Borrow)
  • ISR (Borrow) ? ISR (Student, Books, Lost)
  • ISR (Books) ? ISR (Student, Lost, Borrow)
  • ISR (Student) ? ISR (Books, Lost, Borrow)
  • ISR (Student, Books) ? ISR (Lost Borrow)
  • ISR (Student, Lost) ? ISR (Books, Borrow)
  • ISR (Student, Borrow) ? ISR (Books, Lost)
  • ISR (Lost, Borrow) ? ISR (Student, Books)
  • ISR (Books, Borrow) ? ISR (Student, Lost)
  • ISR (Books, Lost) ? ISR (Student, Borrow)

30
Feasiblity Check
  • Books (isbnb, title, author, publisher, price,
    pages)
  • Student (idb, firstName, lastName)
  • Borrow (studentId, isbn, dateIssued)
  • Lost (isbn)
  • ISR (Student, Books) ? ISR (Lost, Borrow)

31
Pruning
  • Plan Cost
  • (Student ? Borrow) ? (Books ? Lost) 2100
  • (Books ? Lost) ? (Student ? Borrow) 2800

Pruned
32
Impl Conclusion
  • Implementation Conclusion

33
Implementation
  • Java 2 on Sun Solaris
  • Simulated sources with variable statistics
  • Measured time independent data

34
Experiments
  • How important is access time?
  • How big is the search space? How much is the
    overhead?
  • Measure tradeoff between planning and execution
    time
  • What if all statistics are not available?

35
Access Transfer Times
Measured for an intranet Internet source on a
T1 line intranet Access time 92ms Transfer
time 25ms/kb Internet Access time
4.8s Transfer time 25ms/kb
Access time is a large enough cost worth
spending time to optimize
36
ISR vs. SR
  • Empirical evaluation of search space
  • Trade off between planning time and execution time

Fast processors/slower network ? Lower total cost
It is worth exploring a larger search space
More optimal solution ? Lower execution cost
Larger search space? Higher planning cost
37
Effect of Binding Patterns
  • Search space increases with number of sources
  • Search space is more constrained and small as
    number of binding restrictions increase

Size of search space is a non-trivial function
of the given parameters
38
Graceful Degradation
Is searching a larger space a good idea when
statistics are not fully available?
Yes Preserves graceful degradation property of
traditional System R
39
Related Work
  • Florescu, Levy et al
  • Do not consider access costs being important
  • Bottom up approach- build partial plans and check
    which ones lead to complete plans
  • Consider planning time important and generate a
    best first method to produce before the algorithm
    runs to completion

40
Related Work
  • Kabra and DeWitt
  • Generate a seemingly optimal plan
  • Difficult to gather statistics
  • Run time collection of statistics and
    modification of plan

41
Related Work
  • Urhan, Franklin et al
  • Access costs are most important
  • Concentrate on initial delays
  • Change plan if delays exceed a limit

42
Contributions
  • Analysis of importance of considering access
    costs
  • Developed Internet System R Algorithm
  • Included binding constraints over traditional
    System R
  • Increased search space to include bushy trees
  • Empirical evaluation of total cost compared to
    planning and execution costs
  • Examined preservation of graceful degradation
    with a larger search space and partial statistics

43
Papers Published
  • LKG99 Eric Lambrecht, Subbarao Kambhampati and
    Senthil Gnanaprakasam Optimizing Recursive
    Information Gathering Plans. In Proceedings of
    the IJCAI-99.
  • KG99 Subbarao Kambhampati and Senthil
    Gnanaprakasam. Optimizing source-call ordering in
    information gathering plans. Proceedings of the
    IJCAI-99 Workshop on Intelligent Information
    Integration.

44
Thank You
  • Questions?

45
Where Do I Go From Here?
Write a Comment
User Comments (0)
About PowerShow.com