1
Probabilistic Ranking of Database Query Results
  • Surajit Chaudhuri, Microsoft Research
  • Gautam Das, Microsoft Research
  • Vagelis Hristidis, Florida International
    University
  • Gerhard Weikum, MPI Informatik

Presented by Weimin He, CSE@UTA
2
Outline
  • Motivation
  • Problem Definition
  • System Architecture
  • Construction of Ranking Function
  • Implementation
  • Experiments
  • Conclusion and Open Problems

3
Motivating example
  • Realtor DB
  • Table D(TID, Price, City, Bedrooms,
    Bathrooms, LivingArea, SchoolDistrict, View,
    Pool, Garage, BoatDock)

SQL query: SELECT * FROM D WHERE City = 'Seattle'
AND View = 'Waterfront'
4
Motivation
  • The many-answers problem: a query may return
    far more tuples than the user can inspect
  • Two alternative solutions
  • Query reformulation
  • Automatic ranking
  • Apply the probabilistic model from IR to DB
    tuple ranking

5
Problem Definition
  • Given a database table D with n tuples t1, ..., tn
    over a set of m categorical attributes
    A = {A1, ..., Am}
  • and a query Q: SELECT * FROM D WHERE
    X1 = x1 AND ... AND Xs = xs
  • where each Xi is an attribute from A and xi is a
    value in its domain
  • The set of attributes X = {X1, ..., Xs} is known
    as the set of attributes specified by the query,
    while the set Y = A − X is known as the set of
    unspecified attributes
  • Let S be the answer set of Q
  • How to rank the tuples in S and return the top-k
    tuples to the user?
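
A minimal Python sketch of this setup (illustrative names, not from the paper), partitioning the attributes into specified X and unspecified Y and computing the answer set S:

```python
# Minimal sketch of the problem setup (illustrative names, not from the paper).
# A tuple is a dict mapping attribute name -> categorical value.

def answer_set(D, query):
    """Return S, the tuples of D matching every specified attribute value."""
    return [t for t in D if all(t[a] == v for a, v in query.items())]

def split_attributes(A, query):
    """Partition the attribute set A into specified X and unspecified Y."""
    X = set(query)          # attributes fixed in the WHERE clause
    Y = set(A) - X          # everything else
    return X, Y

# The realtor query from the motivating example
D = [
    {"City": "Seattle", "View": "Waterfront", "SchoolDistrict": "excellent", "BoatDock": "yes"},
    {"City": "Seattle", "View": "Waterfront", "SchoolDistrict": "average", "BoatDock": "no"},
    {"City": "Denton", "View": "None", "SchoolDistrict": "excellent", "BoatDock": "no"},
]
query = {"City": "Seattle", "View": "Waterfront"}
S = answer_set(D, query)
X, Y = split_attributes(["City", "View", "SchoolDistrict", "BoatDock"], query)
print(len(S), sorted(X), sorted(Y))  # 2 ['City', 'View'] ['BoatDock', 'SchoolDistrict']
```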

6
System Architecture
7
Intuition for Ranking Function
  • SELECT * FROM D WHERE City = 'Seattle' AND
    View = 'Waterfront'
  • Score of a result tuple t depends on
  • Global score: global importance of unspecified
    attribute values
  • E.g., homes in good school districts are
    globally desirable
  • Conditional score: correlations between specified
    and unspecified attribute values
  • E.g., Waterfront ⇒ BoatDock

8
Probabilistic Model in IR
  • Bayes' rule
  • Product rule
  • Document t, query Q; R = relevant document set;
    D − R = irrelevant document set
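
In the PIR model these combine as follows (a reconstruction of the slide's formulas in the paper's notation):

```latex
% PIR ranks a document by its odds of relevance; Bayes' rule rewrites the
% odds in terms of likelihoods, and the product rule factors joint terms.
\mathrm{Score}(t) \;\propto\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  \;=\; \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})},
\qquad
p(a, b \mid c) \;=\; p(a \mid b, c)\, p(b \mid c)
```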

9
Adaptation of PIR to DB
  • Tuple t is considered as a document
  • Partition t into t(X) and t(Y)
  • t(X) and t(Y) are written as X and Y
  • Derive from initial scoring function until final
    ranking function is obtained

10
Preliminary Derivation
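
Reconstructed from the paper: the tuple t is split into its specified part X and unspecified part Y, and since R is typically much smaller than D, the irrelevant set D − R is approximated by D itself:

```latex
% Split t into specified (X) and unspecified (Y) values; approximate
% the irrelevant set \bar{R} = D - R by the whole table D.
\mathrm{Score}(t) \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
  \;=\; \frac{p(X, Y \mid R)}{p(X, Y \mid \bar{R})}
  \;\approx\; \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
```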
11
Limited Independence Assumptions
  • Given a query Q and a tuple t, the X (and Y)
    values within themselves are assumed to be
    independent, though dependencies between the X
    and Y values are allowed

12
Continuing Derivation
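
A reconstruction of this step, consistent with the final ranking function two slides ahead: applying the limited independence assumptions factors the score into per-value terms,

```latex
% Within-set independence factors the X and Y values; dependencies across
% the sets survive as the conditional terms p(x | y, .).
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \;\cdot\;
  \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, R)}{p(x \mid y, D)}
```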
13
Workload-based Estimation of p(· | R)
Assume a collection of past queries exists in the
system. The workload W is represented as a set of
tuples. Given a query Q with specified attribute
set X, approximate R by all query tuples in W that
also request X. All properties of the relevant
tuple set R can then be obtained by examining only
the subset of the workload containing queries that
also request X.
14
Final Ranking Function
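
From the paper, after substituting the workload-based estimates of the previous slide, the final ranking function is:

```latex
% Global part: importance of each unspecified value y.
% Conditional part: correlation of y with each specified value x.
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{\hat{p}(y \mid W)}{\hat{p}(y \mid D)} \;\cdot\;
  \prod_{x \in X} \prod_{y \in Y} \frac{\hat{p}(x \mid y, W)}{\hat{p}(x \mid y, D)}
```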
15
Pre-computing Atomic Probabilities in Ranking
Function
  • p(y | W): relative frequency of y in W
  • p(y | D): relative frequency of y in D
  • p(x | y, W): (# of tuples in W that contain both
    x and y) / (# of tuples in W that contain y)
  • p(x | y, D): (# of tuples in D that contain both
    x and y) / (# of tuples in D that contain y)
16
Example for Computing Atomic Probabilities
  • SELECT * FROM D WHERE City = 'Seattle' AND
    View = 'Waterfront'
  • Y = {SchoolDistrict, BoatDock, ...}
  • |D| = 10,000; |W| = 1,000
  • # tuples with SchoolDistrict = excellent: 100
  • # tuples with View = waterfront: 5
  • p(excellent | W) = 100/1,000 = 0.1
  • p(excellent | D) = 100/10,000 = 0.01
  • p(waterfront = yes | W) = 5/1,000 = 0.005
  • p(waterfront = yes | D) = 5/10,000 = 0.0005
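
A small Python sketch of these relative-frequency estimates, with the counts hardcoded from this example (the slide reuses the same counts over W and D for illustration; variable names are ours, not the paper's):

```python
# Relative-frequency estimates for the atomic probabilities above.
size_D = 10_000        # tuples in the database table D
size_W = 1_000         # query tuples in the workload W

count_excellent = 100  # tuples with SchoolDistrict = excellent
count_waterfront = 5   # tuples with View = waterfront

p_excellent_W = count_excellent / size_W    # 0.1
p_excellent_D = count_excellent / size_D    # 0.01
p_waterfront_W = count_waterfront / size_W  # 0.005
p_waterfront_D = count_waterfront / size_D  # 0.0005

print(p_excellent_W, p_excellent_D, p_waterfront_W, p_waterfront_D)
```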

17
Indexing Atomic Probabilities
Four probability tables, two per schema:
  • For p(y | W): (AttName, AttVal, Prob), with a
    B+-tree index on (AttName, AttVal)
  • For p(y | D): same schema and index
  • For p(x | y, W): (AttNameLeft, AttValLeft,
    AttNameRight, AttValRight, Prob), with a B+-tree
    index on (AttNameLeft, AttValLeft, AttNameRight,
    AttValRight)
  • For p(x | y, D): same schema and index
18
Scan Algorithm
  • Preprocessing: Atomic Probabilities Module
  • Computes and indexes the quantities p(y | W),
    p(y | D), p(x | y, W), and p(x | y, D) for all
    distinct values x and y
  • Execution
  • Select the tuples that satisfy the query
  • Scan and compute the score for each result tuple
  • Return the top-k tuples
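
A hedged Python sketch of the execution phase, assuming the atomic probabilities have been precomputed into dictionaries (a simplification of the paper's indexed tables):

```python
# Sketch of the Scan algorithm's execution phase. p_W / p_D map an
# (attribute, value) pair y to p(y|W) / p(y|D); p_cond_W / p_cond_D map an
# (x, y) pair to p(x|y,W) / p(x|y,D). Dicts stand in for the indexed tables.
import heapq

def score(t, X, Y, p_W, p_D, p_cond_W, p_cond_D):
    s = 1.0
    for a_y in Y:
        y = (a_y, t[a_y])
        s *= p_W[y] / p_D[y]                          # global part
        for a_x in X:
            x = (a_x, t[a_x])
            s *= p_cond_W[(x, y)] / p_cond_D[(x, y)]  # conditional part
    return s

def scan_top_k(S, X, Y, k, p_W, p_D, p_cond_W, p_cond_D):
    """Score every tuple in the answer set S and return the top-k."""
    return heapq.nlargest(
        k, S, key=lambda t: score(t, X, Y, p_W, p_D, p_cond_W, p_cond_D))
```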

19
Beyond Scan Algorithm
  • The Scan algorithm is inefficient when the
    answer set contains many tuples
  • The other extreme, pre-computing the top-k
    tuples for all possible queries, is still
    infeasible in practice
  • Trade-off solution
  • Pre-compute ranked lists of tuples for all
    possible atomic queries
  • At query time, merge the ranked lists to get the
    top-k tuples

20
Two Kinds of Ranked Lists
  • CondList Cx
  • (AttName, AttVal, TID, CondScore)
  • B+-tree index on (AttName, AttVal, CondScore)
  • GlobList Gx
  • (AttName, AttVal, TID, GlobScore)
  • B+-tree index on (AttName, AttVal, GlobScore)

21
Index Module
22
List Merge Algorithm
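
A minimal sketch of a threshold-style merge over the per-attribute-value ranked lists, in the spirit of Fagin's Threshold Algorithm; the details beyond the slide title are assumptions, since the deck only names the algorithm:

```python
# TA-style merge sketch (assumed details). Each ranked list is sorted by
# score descending, and a tuple's total score is the product of its
# per-list scores, matching the multiplicative ranking function.
import heapq
from itertools import cycle

def list_merge_top_k(lists, get_score, k):
    """lists: list of [(score, tid), ...] sorted descending by score.
    get_score(i, tid): random access to tuple tid's score in list i."""
    last_seen = [lst[0][0] for lst in lists]    # frontier of sorted access
    positions = [0] * len(lists)
    seen, top = set(), []                       # top: min-heap of (score, tid)
    for i in cycle(range(len(lists))):
        if positions[i] >= len(lists[i]):
            break                               # a list is exhausted; stop
        s, tid = lists[i][positions[i]]
        positions[i] += 1
        last_seen[i] = s
        if tid not in seen:
            seen.add(tid)
            total = 1.0
            for j in range(len(lists)):
                total *= get_score(j, tid)      # random access to other lists
            heapq.heappush(top, (total, tid))
            if len(top) > k:
                heapq.heappop(top)
        threshold = 1.0
        for s_i in last_seen:
            threshold *= s_i                    # best possible unseen score
        if len(top) == k and top[0][0] >= threshold:
            break                               # top-k can no longer change
    return sorted(top, reverse=True)
```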
23
Experimental Setup
  • Datasets
  • MSR HomeAdvisor Seattle
    (http://houseandhome.msn.com/)
  • Internet Movie Database (http://www.imdb.com)
  • Software and hardware
  • Microsoft SQL Server 2000 RDBMS
  • P4 2.8-GHz PC, 1 GB RAM
  • Implementation in C, connected to the RDBMS
    through DAO

24
Quality Experiments
  • Conducted on the Seattle Homes and Movies tables
  • Collect a workload from users
  • Compare the conditional ranking method of this
    paper with the global method [CIDR03]

25
Quality Experiment: Average Precision
  • For each query Qi, generate a set Hi of 30
    tuples likely to contain a good mix of relevant
    and irrelevant tuples
  • Let each user mark 10 tuples in Hi as most
    relevant to Qi
  • Measure how closely the 10 tuples marked by the
    user match the 10 tuples returned by each
    algorithm

26
Quality Experiment: Fraction of Users Preferring
Each Algorithm
  • 5 new queries
  • Users were given the top-5 results

27
Performance Experiments
  • Datasets
  • Compare two algorithms
  • Scan algorithm
  • List Merge algorithm

28
Performance Experiments: Pre-computation Time
29
Performance Experiments: Execution Time
30
Performance Experiments: Execution Time
31
Performance Experiments: Execution Time
32
Conclusion and Open Problems
  • Automatic ranking for the many-answers problem
  • Adaptation of PIR to DB
  • Multiple-table queries
  • Non-categorical attributes