1
Probabilistic Ranking of Database Query Results
  • Surajit Chaudhuri, Microsoft Research
  • Gautam Das, Microsoft Research
  • Vagelis Hristidis, Florida International
    University
  • Gerhard Weikum, MPI Informatik
  • 30th VLDB Conference, Toronto, Canada, 2004
  • Presented by Abhishek Jamloki, CSE at UB

2
Motivating example
  • Realtor DB
  • Table D(TID, Price , City, Bedrooms,
    Bathrooms, LivingArea, SchoolDistrict, View,
    Pool, Garage, BoatDock)
  • SQL query
  • SELECT * FROM D
  • WHERE City = 'Seattle' AND View = 'Waterfront'

3
Formal Definition
  • Consider a database table D with n tuples t1, …,
    tn over a set of m categorical attributes
    A = {A1, …, Am}
  • a query Q: SELECT * FROM D
  • WHERE
  • X1 = x1 AND … AND Xs = xs
  • where each Xi is an attribute from A and xi is a
    value in its domain
  • specified attributes X = {X1, …, Xs}
  • unspecified attributes Y = A − X
  • Let S be the answer set of Q
  • How to rank tuples in S and return top-k tuples
    to the user?

4
Many Answers Problem
  • IR Treatment
  • Query Reformulation
  • Automatic Ranking
  • Correlations are ignored in the high-dimensional
    spaces typical of IR
  • An automated ranking function is proposed, based on
  • a global score of unspecified attributes
  • a conditional score (strength of correlation
    between specified and unspecified attributes)
  • Automatic estimation using workload and data
    analysis

5
Architecture of Ranking System
6
Probabilistic Model in IR
  • Bayes' Rule
  • Product Rule

Document t, query Q; R = relevant document set;
R̄ = D − R = irrelevant document set
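Written out explicitly, the two rules give the standard probabilistic-IR ranking; a reconstruction, since the slide's equation images did not survive the transcript:

```latex
% Rank by the odds of relevance; Bayes' rule plus the product rule turn
% this into a likelihood ratio (p(R) and p(\bar{R}) are constant across
% documents and drop out under proportionality):
\mathrm{Score}(t) \;\propto\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  \;=\; \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
```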
7
Adaptation of PIR Models for Structured Data
  • Each tuple t is treated as a document
  • Partition t into two parts
  • t(X) contains specified attributes
  • t(Y) contains unspecified attributes
  • Replace t with X and Y
  • Replace R̄ (the irrelevant set D − R) with D, since
    R is small relative to D

8
Derivation
9
Limited Independence Assumptions
  • Comprehensive dependency models have unacceptable
    preprocessing and query processing costs
  • Choose a middle ground.
  • Given a query Q and a tuple t, the X (and Y)
    values within themselves are assumed to be
    independent, though dependencies between the X
    and Y values are allowed

10
Derivation Continued
11
Workload-Based Estimation of p(y | R)
  • Workload W a collection of ranking queries that
    have been executed on our system in the past.
  • Represented as a set of tuples, where each
    tuple represents a query and is a vector
    containing the corresponding values of the
    specified attributes.
  • We approximate R as all query tuples in W that
    also specify X (this approximation is novel to
    this paper)
  • Properties of the set of relevant tuples R can be
    obtained by examining only the subset of the
    workload containing queries that also specify X
  • Substitute p(y | R) with p(y | X, W)

12
Derivation Continued
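Combining the pieces above (R̄ approximated by D, R estimated from the workload W, and the limited independence assumptions), the equation this slide builds up to has the form below; a reconstruction in the notation of the surrounding slides:

```latex
% Global part over unspecified values y, times conditional part capturing
% correlations between specified values x and unspecified values y:
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \;\times\;
  \prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```

The four atomic probabilities appearing here are exactly the quantities that are precomputed and indexed next.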
13
Computing the Atomic Probabilities
  • p(y | W): relative frequency of each distinct
    value y in the workload
  • p(y | D): relative frequency of each distinct
    value y in the database (similar to the IDF
    concept in IR)
  • p(x | y, W): confidence of the pair-wise
    association rule y ⇒ x in the workload, that is,
    #(tuples in W that contain both x and y) /
    #(tuples in W that contain y)
  • p(x | y, D): #(tuples in D that contain both x
    and y) / #(tuples in D that contain y)
  • Stored as auxiliary tables in the intermediate
    knowledge representation layer
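These frequency-based estimates can be sketched in a few lines; a toy illustration assuming tuples are represented as attribute→value dicts (this representation is hypothetical, not the paper's auxiliary-table implementation):

```python
from collections import Counter
from itertools import combinations

def atomic_probabilities(rows):
    """p(v | rows): relative frequency of each (attribute, value) pair,
    where rows is a list of dicts mapping attribute name to value."""
    n = len(rows)
    counts = Counter((a, v) for row in rows for a, v in row.items())
    return {av: c / n for av, c in counts.items()}

def pairwise_confidence(rows):
    """Confidence of the pair-wise association rule y => x:
    p(x | y, rows) = #(rows with both x and y) / #(rows with y)."""
    single = Counter((a, v) for row in rows for a, v in row.items())
    pair = Counter()
    for row in rows:
        # count each unordered pair once in both rule directions
        for (a1, v1), (a2, v2) in combinations(sorted(row.items()), 2):
            pair[((a1, v1), (a2, v2))] += 1
            pair[((a2, v2), (a1, v1))] += 1
    return {(x, y): c / single[y] for (x, y), c in pair.items()}
```

The same two functions serve for both W and D, since both are just collections of tuples.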

14
Implementation
  • p(y | W): AttName, AttVal, Prob
  • B-tree index on (AttName, AttVal)
  • p(y | D): AttName, AttVal, Prob
  • B-tree index on (AttName, AttVal)
  • p(x | y, W): AttNameLeft, AttValLeft,
    AttNameRight, AttValRight, Prob
  • B-tree index on (AttNameLeft, AttValLeft,
    AttNameRight, AttValRight)
  • p(x | y, D): AttNameLeft, AttValLeft,
    AttNameRight, AttValRight, Prob
  • B-tree index on (AttNameLeft, AttValLeft,
    AttNameRight, AttValRight)

15
Naïve Scan Algorithm
  • Preprocessing - Atomic Probabilities Module
  • Computes and indexes the quantities p(y | W),
    p(y | D), p(x | y, W), and p(x | y, D) for all
    distinct values x and y
  • Execution
  • Select Tuples that Satisfy the Query
  • Scan and Compute Score for Each Result-Tuple
  • Return Top-K Tuples
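The execution phase above can be sketched as follows, assuming the atomic probabilities have already been loaded into lookup tables (the dictionary arguments are hypothetical stand-ins for the indexed auxiliary tables):

```python
import heapq
from math import prod

def score(y_values, x_values, pW, pD, pxyW, pxyD):
    """Score(t) ∝ Π_y p(y|W)/p(y|D) · Π_x Π_y p(x|y,W)/p(x|y,D)."""
    global_part = prod(pW[y] / pD[y] for y in y_values)
    cond_part = prod(pxyW[(x, y)] / pxyD[(x, y)]
                     for x in x_values for y in y_values)
    return global_part * cond_part

def scan_top_k(answer_set, x_values, k, pW, pD, pxyW, pxyD):
    """Scan: answer_set is a list of (tid, y_values) pairs that already
    satisfy the query; score each and keep the top-k with a heap."""
    return heapq.nlargest(
        k, answer_set,
        key=lambda t: score(t[1], x_values, pW, pD, pxyW, pxyD))
```

Every result tuple is scored, so the cost is linear in |S|; this is the baseline the list-merge algorithms later improve on.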

16
Adapting the TA
  • Trade off between pre-processing and query
    processing
  • Pre-compute ranked lists of the tuples for all
    possible atomic queries. Then at query time,
    given an actual query that specifies a set of
    values X, we merge the ranked lists
    corresponding to each x in X to compute the final
    Top-K tuples.
  • We should be able to perform merging without
    scanning the entire ranked lists
  • Threshold algorithm can be used for this purpose
  • A feasible adaptation of TA must keep the
    number of sorted streams small
  • In a naïve adaptation, the number of sorted
    streams would depend on the number of attributes
    in the database

17
Pre-Compute Data Structures
  • At query time we do a TA-like merging of several
    ranked lists (i.e. sorted streams)
  • The required number of sorted streams depends
    only on the number of specified attribute values
    in the query and not on the total number of
    attributes in the database
  • Such a merge operation is only made possible due
    to the specific functional form of our ranking
    function resulting from our limited independence
    assumptions

18
The Computed Lists
  • Index Module takes as inputs the association
    rules and the database, and for every distinct
    value x, creates two lists Cx and Gx, each
    containing the tuple-ids of all data tuples that
    contain x, ordered in specific ways.
  • Conditional list Cx consists of pairs of the
    form <TID, CondScore>, ordered by descending
    CondScore
  • TID: tuple-id of a tuple t that contains x
  • Global list Gx consists of pairs of the form
    <TID, GlobScore>, ordered by descending
    GlobScore, where TID is the tuple-id of a tuple t
    that contains x

19
Retrieving values
  • At query time we retrieve and multiply the scores
    of t in the lists Cx1, …, Cxs and in one of
    Gx1, …, Gxs. This requires only s + 1
    multiplications and results in a score that is
    proportional to the actual score. Two kinds of
    efficient access operations are needed:
  • First, given a value x, it should be possible to
    perform a GetNextTID operation on lists Cx and Gx
    in constant time: tuple-ids in the lists should
    be efficiently retrievable one-by-one in order of
    decreasing score. This corresponds to the sorted
    stream access of TA.
  • Second, it should be possible to perform random
    access on the lists, that is, given a TID, the
    corresponding score (CondScore or GlobScore)
    should be retrievable in constant time.
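With these two operations, the lists support a TA-style merge; a simplified sketch, assuming each stream yields (TID, score) pairs in descending score order and `random_access[i][tid]` stands in for the indexed score lookup (all names here are illustrative):

```python
import heapq

def ta_merge(streams, random_access, k):
    """TA-style merge: round-robin sorted access over the lists, random
    access to complete each newly seen tuple's full score (a product),
    and early termination once the k-th best score reaches the threshold
    (the product of the last scores seen on each stream)."""
    iters = [iter(s) for s in streams]
    last = [None] * len(streams)  # last score seen on each stream
    seen = set()
    top = []                      # min-heap of (full_score, tid)
    exhausted = False
    while not exhausted:
        exhausted = True
        for i, it in enumerate(iters):
            entry = next(it, None)
            if entry is None:
                continue
            exhausted = False
            tid, s = entry
            last[i] = s
            if tid not in seen:
                seen.add(tid)
                full = 1.0
                for ra in random_access:  # complete score by random access
                    full *= ra[tid]
                heapq.heappush(top, (full, tid))
                if len(top) > k:
                    heapq.heappop(top)
        if any(s is None for s in last):
            continue              # some stream has not produced a score yet
        threshold = 1.0
        for s in last:
            threshold *= s
        if len(top) == k and top[0][0] >= threshold:
            break                 # no unseen tuple can enter the top-k
    return sorted(top, reverse=True)
```

Because the full score is a product of per-list scores, the product of the last scores seen is a valid upper bound on any unseen tuple, which is what makes early termination sound.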

20
The Conditional and Global Lists
  • These lists are stored as database tables
  • CondList Cx: AttName, AttVal, TID, CondScore
  • B-tree index on (AttName, AttVal, CondScore)
  • GlobList Gx: AttName, AttVal, TID, GlobScore
  • B-tree index on (AttName, AttVal, GlobScore)

21
Index Module
22
Query Processing Component
23
Limited Available Space
  • Space consumed by the lists is O(mn) bytes (m is
    the number of attributes and n the number of
    tuples of the database table)
  • We can store only a subset of the lists at
    preprocessing time, at the expense of an increase
    in the query processing time.
  • Determining which lists to retain or omit at
    preprocessing time is done by analyzing the
    workload
  • Store the conditional lists Cx and the
    corresponding global lists Gx only for those
    attribute values x that occur most frequently in
    the workload
  • Probe the intermediate knowledge representation
    layer at query time to compute the missing
    information

24
Experimental Setup
  • The following datasets were used:
  • MSR HomeAdvisor Seattle
    (http://houseandhome.msn.com/)
  • Internet Movie Database (http://www.imdb.com)
  • Software and hardware:
  • Microsoft SQL Server 2000 RDBMS
  • P4 2.8-GHz PC, 1 GB RAM
  • C#, connected to the RDBMS through DAO

25
Quality Experiments
  • Evaluated using two ranking methods
  • 1) Conditional
  • 2) Global
  • Several hundred workload queries were collected
    for both datasets, and the ranking algorithms
    were trained on this workload

26
Quality Experiment-Average Precision
  • For each query Qi , generate a set Hi of 30
    tuples likely to contain a good mix of relevant
    and irrelevant tuples
  • Let each user mark 10 tuples in Hi as most
    relevant to Qi
  • Measure how closely the 10 tuples marked by the
    user match the 10 tuples returned by each
    algorithm

27
Quality Experiment Fraction of Users
Preferring Each Algorithm
  • Users were given the Top-5 results of the two
    ranking methods for 5 queries (different from the
    previous survey), and were asked to choose which
    rankings they preferred

28
Performance Experiments
  • Compared performance of the various
    implementations of the Conditional algorithm:
    List Merge, its space-saving variant, and Scan
  • Datasets used:

29
Performance Experiments Pre-Computation Time
30
Performance Experiments Execution Time
31
Performance Experiments Execution Time
32
Performance Experiments Execution Time
33
Conclusion
  • A completely automated approach to the
    Many-Answers Problem that leverages data and
    workload statistics and correlations
  • Probabilistic IR models were adapted for
    structured data.
  • Experiments demonstrate efficiency as well as
    quality of the ranking system

34
Open Questions
  • Many relational databases contain text columns in
    addition to numeric and categorical columns. Can
    correlations between text and non-text data be
    leveraged in a meaningful way for ranking?
  • Comprehensive quality benchmarks for database
    ranking need to be established