Probabilistic Ranking of Database Query Results - PowerPoint PPT Presentation (VLDB 2004)

Transcript and Presenter's Notes
1
Probabilistic Ranking of Database Query Results
  • Surajit Chaudhuri, Microsoft Research
  • Gautam Das, Microsoft Research
  • Vagelis Hristidis, Florida International
    University
  • Gerhard Weikum, MPI Informatik

2
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

3
Motivation
  • SQL Returns Unordered Sets of Results
  • This Overwhelms Users of Information Discovery
    Applications
  • How Can Ranking be Introduced, Given that ALL
    Results Satisfy the Query?

4
Example Realtor Database
  • House Attributes Price, City, Bedrooms,
    Bathrooms, SchoolDistrict, Waterfront, BoatDock,
    Year
  • Query City Seattle AND Waterfront TRUE
  • Too Many Results!
  • Intuitively, Houses with lower Price, more
    Bedrooms, or BoatDock are generally preferable
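The many-answers behavior can be reproduced with a toy table; the schema mirrors the slide's realtor example, but the rows are invented for illustration:

```python
import sqlite3

# Invented Houses table mirroring the slide's realtor example.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Houses
    (Price INT, City TEXT, Bedrooms INT, Waterfront INT, BoatDock INT)""")
conn.executemany("INSERT INTO Houses VALUES (?, ?, ?, ?, ?)", [
    (900000, "Seattle", 3, 1, 1),
    (450000, "Seattle", 4, 1, 0),
    (700000, "Seattle", 2, 1, 1),
])

# Every returned row satisfies the query equally well; SQL itself
# imposes no ranking on the result set.
rows = conn.execute(
    "SELECT Price, Bedrooms FROM Houses"
    " WHERE City = 'Seattle' AND Waterfront = 1"
).fetchall()
print(len(rows))  # 3
```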

5
Rank According to Unspecified Attributes
  • Score of a Result Tuple t depends on:
  • Global Score: Global Importance of Unspecified
    Attribute Values [CIDR 2003]
  • E.g., Newer Houses are generally preferred
  • Conditional Score: Correlations between Specified
    and Unspecified Attribute Values
  • E.g., Waterfront ⇒ BoatDock; Many Bedrooms ⇒
    Good School District

6
Key Problems
  • Given a Query Q, How to Combine the Global and
    Conditional Scores into a Ranking Function?
    Use Probabilistic Information Retrieval (PIR).
  • How to Calculate the Global and Conditional
    Scores? Use Query Workload and Data.

7
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

8
Architecture
9
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

10
PIR Review
  • Bayes Rule: p(R | t) = p(t | R) p(R) / p(t)
  • Product Rule: p(a, b | c) = p(a | c) p(b | a, c)
  • Document (Tuple) t, Query Q; R = Relevant
    Documents; R̄ = D − R = Irrelevant Documents
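The Bayes-rule manipulation behind PIR scoring can be checked numerically; the probabilities below are invented, only the algebra is from the slide:

```python
# Invented numbers; only the Bayes-rule algebra is from the slide.
p_t_given_R    = 0.30  # p(t | R):  tuple likelihood among relevant docs
p_t_given_notR = 0.05  # p(t | R-bar): likelihood among irrelevant docs
p_R, p_notR = 0.2, 0.8

# Bayes rule: p(R | t) = p(t | R) p(R) / p(t); the p(t) terms cancel
# in the odds p(R | t) / p(R-bar | t).
odds = (p_t_given_R * p_R) / (p_t_given_notR * p_notR)

# p(R) / p(R-bar) is identical for every tuple, so ranking by the odds
# is equivalent to ranking by the likelihood ratio p(t | R) / p(t | R-bar).
ratio = p_t_given_R / p_t_given_notR
print(odds, ratio)
```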

11
Ranking Function: Adapt PIR
  • Query Q: X1 = x1 AND … AND Xs = xs; X = {X1, …, Xs}
  • Result-Tuple t(X, Y), where X = Specified
    Attributes, Y = Unspecified Attributes
  • Score(t) ∝ p(R | t) / p(R̄ | t), with R ⊆ D;
    every tuple in R satisfies the conditions X
12
Ranking Function: Limited Conditional
Independence
  • Given a query Q and a tuple t, the X (and Y)
    values are assumed to be independent within
    themselves, though dependencies between the X
    and Y values are allowed
  • Score(t) ∝ ∏y [ p(y | R) / p(y | R̄) ] ·
    ∏y ∏x [ p(x | y, R) / p(x | y, R̄) ]
  • Factors conditioned on R: Use Workload
  • Factors conditioned on R̄: Use Data
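Assuming the score factors into a global part ∏y p(y|W)/p(y|D) and a conditional part ∏y ∏x p(x|y,W)/p(x|y,D) (with W approximating R and D approximating R̄, up to query-dependent constants), a minimal sketch with the probability estimates supplied as plain dicts:

```python
from math import prod

def score(y_values, x_values, pW, pD, pxyW, pxyD):
    """Factored PIR score of one result tuple (sketch).

    pW[y], pD[y]      -- estimates of p(y | W) and p(y | D)
    pxyW[(x, y)] etc. -- estimates of p(x | y, W) and p(x | y, D)
    """
    global_part = prod(pW[y] / pD[y] for y in y_values)
    conditional_part = prod(
        pxyW[(x, y)] / pxyD[(x, y)]
        for y in y_values for x in x_values
    )
    return global_part * conditional_part

# Invented single-attribute example: unspecified BoatDock,
# specified Waterfront.
pW = {"BoatDock=TRUE": 0.4}
pD = {"BoatDock=TRUE": 0.2}
pxyW = {("Waterfront=TRUE", "BoatDock=TRUE"): 0.6}
pxyD = {("Waterfront=TRUE", "BoatDock=TRUE"): 0.3}
s = score(["BoatDock=TRUE"], ["Waterfront=TRUE"], pW, pD, pxyW, pxyD)
print(s)  # (0.4/0.2) * (0.6/0.3) = 4.0
```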
13
Atomic Probabilities: Estimation Using Workload W
  • If Many Queries Specify a Set X of Conditions,
    there is Preference Correlation between the
    Attributes in X.
  • Global: E.g., If Many Queries ask for Waterfront,
    then p(Waterfront = TRUE | W) is high.
  • Conditional: E.g., If Many Queries ask for
    4-Bedroom Houses in Good School Districts, then
    p(Bedrooms = 4 | SchoolDistrict = good, W) and
    p(SchoolDistrict = good | Bedrooms = 4, W) are high.

Using Limited Conditional Independence
  • Probabilities p(x | y, W) and p(x | y, D) are
    Calculated Using Standard Association Rule Mining
    Techniques on W and D, giving the Conditional
    Part; p(y | W) and p(y | D) give the Global Part
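A much-simplified sketch of the workload-based estimation: here each workload query is just the set of conditions it specifies and the probabilities are relative frequencies, whereas the paper uses association-rule mining; all data is invented.

```python
# Each workload query is modeled as the set of attribute=value
# conditions it specifies (a deliberate simplification).
workload = [
    {"Waterfront=TRUE"},
    {"Waterfront=TRUE", "BoatDock=TRUE"},
    {"Bedrooms=4", "SchoolDistrict=good"},
    {"Waterfront=TRUE", "BoatDock=TRUE"},
]

def p_global(v, W):
    """p(v | W): fraction of workload queries that mention value v."""
    return sum(v in q for q in W) / len(W)

def p_cond(x, y, W):
    """p(x | y, W): among queries mentioning y, the fraction also
    mentioning x (assumes y occurs at least once in W)."""
    with_y = [q for q in W if y in q]
    return sum(x in q for q in with_y) / len(with_y)

print(p_global("Waterfront=TRUE", workload))                 # 0.75
print(p_cond("BoatDock=TRUE", "Waterfront=TRUE", workload))  # 0.666...
```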
14
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

15
Performance
16
Scan Algorithm
  • Preprocessing - Atomic Probabilities Module:
  • Computes and Indexes the Quantities p(y | W),
    p(y | D), p(x | y, W), and p(x | y, D) for All
    Distinct Values x and y
  • Execution:
  • Select Tuples that Satisfy the Query
  • Scan and Compute the Score for Each Result-Tuple
  • Return the Top-K Tuples
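The execution phase above can be sketched as follows; the `score` callable stands in for lookups of the precomputed atomic probabilities, and the data is invented:

```python
import heapq

def scan_top_k(result_tuples, score, k):
    """Scan-algorithm execution phase (sketch): score every tuple that
    satisfies the query and return the k best."""
    return heapq.nlargest(k, result_tuples, key=score)

# Invented data: rank the selected houses, treating cheaper as better.
houses = [{"id": 1, "price": 900}, {"id": 2, "price": 450},
          {"id": 3, "price": 700}]
top2 = scan_top_k(houses, score=lambda t: -t["price"], k=2)
print([t["id"] for t in top2])  # [2, 3]
```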

17
List Merge Algorithm
  • Preprocessing:
  • For Each Distinct Value x of the Database,
    Calculate and Store the Conditional (Cx) and
    Global (Gx) Lists as follows:
  • For Each Tuple t Containing x, Calculate its
    conditional and global scores and add them to Cx
    and Gx respectively
  • Sort Cx, Gx by decreasing score
  • Execution: Query Q: X1 = x1 AND … AND Xs = xs
  • Execute the Threshold Algorithm [Fagin et al.
    2001] on the lists Cx1, …, Cxs, and Gxb, where
    Gxb is the shortest list among Gx1, …, Gxs
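A minimal sketch of the Threshold Algorithm over descending score lists. Sum aggregation is used here for simplicity (the actual Cx/Gx score combination differs), and the lists are invented; random access is simulated with dict lookups:

```python
import heapq

def threshold_algorithm(lists, k):
    """Threshold Algorithm sketch over sorted score lists.

    Each list is [(tuple_id, partial_score), ...] sorted by decreasing
    partial score; the full score is the SUM over lists. Tuples absent
    from a list contribute 0 (a simplifying assumption).
    """
    lookup = [dict(lst) for lst in lists]
    seen, top = {}, []
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0.0  # best possible score of any unseen tuple
        for lst in lists:
            if depth >= len(lst):
                continue
            tid, s = lst[depth]
            threshold += s
            if tid not in seen:
                # random access into every list for the full score
                seen[tid] = sum(d.get(tid, 0.0) for d in lookup)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            break  # no unseen tuple can still enter the top-k
    return top

lists = [[("a", 0.75), ("b", 0.5)], [("b", 0.75), ("a", 0.25)]]
print(threshold_algorithm(lists, 1))  # [('b', 1.25)]
```

The early-stopping test is the key idea: once the k-th best full score seen so far is at least the sum of the current list frontiers, no deeper tuple can overtake it.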

18
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

19
Quality Experiments
  • Compare our Conditional Ranking Method with the
    Global Method [CIDR 2003]
  • Surveyed 14 MSR employees
  • Datasets:
  • MSR HomeAdvisor Seattle
    (http://houseandhome.msn.com/)
  • Internet Movie Database (http://www.imdb.com)
  • Each User Behaved According to Various Profiles,
    E.g.:
  • singles, middle-class family, rich retirees
  • teenage males, people interested in comedies of
    the 80s
  • First Collect Workloads, Then Compare the Results
    of the 2 Methods for a Set of Queries

20
Quality Experiments Average Precision
  • For 5 queries, ask users to Mark 10 out of a Set
    of 30 likely results containing the Top-10
    results of both the Conditional and Global
    Methods plus a few randomly selected tuples.
  • Since exactly 10 tuples are marked relevant and
    Top-10 lists are compared, Precision = Recall
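The metric itself is one line; the function name and data below are invented for illustration:

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results the user marked relevant.
    With 10 relevant tuples and k = 10, this equals recall."""
    return sum(r in relevant for r in ranked[:k]) / k

print(precision_at_k(list("abcdefghij"), set("acegi")))  # 0.5
```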

21
Quality Experiments - Fraction of Users
Preferring Each Algorithm
  • Seattle Homes and Movies Datasets
  • 5 new queries
  • Top-5 Result-lists

22
Performance Experiments
  • Microsoft SQL Server 2000 RDBMS
  • P4 2.8-GHz PC, 1 GB RAM
  • C, Connected to the RDBMS through DAO
  • Datasets:
  • Compared Algorithms:
  • LM: List Merge
  • Scan

23
Performance Experiments - Precomputation
  • Time and Space Consumed by Index Module

24
Performance Experiments - Execution
Varying the Number of Tuples Satisfying the
Selection Conditions
  • US Homes Database
  • 2-Attribute Queries

25
Performance Experiments - Execution
  • US Homes Database

26
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

27
Related Work
  • [CIDR 2003]:
  • Uses the Workload, Focuses on the Empty-Answer
    Problem.
  • Drawback: Global Ranking Regardless of Query.
    E.g., proximity to a tram line is desirable for
    cheap houses but undesirable for expensive ones.
  • Collaborative Filtering:
  • Requires Training Data of Queries and their
    Ranked Results
  • Relevance-Feedback Techniques for Learning
    Similarity in Multimedia and Relational Databases

28
Roadmap
  • Problem Definition
  • Architecture
  • Probabilistic Information Retrieval
  • Performance
  • Experiments
  • Related Work
  • Conclusion

29
Conclusions & Future Work
  • Conclusions:
  • A Completely Automated Approach for the
    Many-Answers Problem which Leverages Data and
    Workload Statistics and Correlations
  • Based on PIR
  • Future Work:
  • The Empty-Answer Problem
  • Handling Plain Text Attributes

30
Questions?
31
Performance Experiments - Execution
  • LMM: List Merge where the lists for one of the
    two specified attributes are missing, halving the
    space requirement
  • Seattle Homes Database