Ranking of Database Query Results - PowerPoint PPT Presentation

About This Presentation
Title:

Ranking of Database Query Results

Description:

Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor Introduction As the name suggests Ranking is the process of ordering a set of ... – PowerPoint PPT presentation

Number of Views:52
Avg rating:3.0/5.0
Slides: 27
Provided by: Nites4
Learn more at: https://ranger.uta.edu
Category:

less

Transcript and Presenter's Notes

Title: Ranking of Database Query Results


1
Ranking of Database Query Results
Nitesh Maan, Arujn Saraswat, Nishant Kapoor
2
Introduction
  • As the name suggests Ranking is the process of
    ordering a set of values (or data items) based on
    some parameter that is of high relevance to the
    user of ranking process.
  • Ranking and returning the most relevant results
    of users query is a popular paradigm in
    information retrieval.

3
Ranking and databases
  • Not much work has been done in ranking of results
    of query in database systems
  • We have all seen example of ranking of results in
    the internet. The most common example is the
    internet search engines (like Google). A set of
    WebPages (satisfying the users search criteria)
    are returned, with most relevant results
    featuring at the top of the list.

4
  • In contrast to the WWW, databases support only a
    Boolean query model. For example a selection
    query on a SQL database schema returns all tuples
    that satisfy the conditions specified in the
    query. Depending on the conditions specified in
    the query, two situations may arise

5
  • Empty Answers when the query is too selective,
    the answer may be empty.
  • Many Answers when the query is not too
    selective, too many tuples may be there in the
    answer.
  • We next consider these two scenarios in detail
    and look at various mechanism to produce ranked
    results in these circumstances.

6
The Empty Answers Problem
  • Empty answers problem is the consequence of a
    very selective query in database system.
  • In this case it would be desirable to return a
    ranked list of approximately matching tuples
    without burdening the user to specify any
    additional conditions. In other words, an
    automated approach for ranking and returning
    approximately matching tuples.

7
Automated Ranking Functions
  • Automated ranking of query results is the process
    of taking a user query and mapping it to a Top-K
    query with a ranking function that depends on
    conditions specified in the user query.
  • A ranking function should be able to work well
    even for large databases and have minimum side
    effects on query processing

8
Automated Ranking functions for the Empty
Answers Problem
  • IDF Similarity
  • QF Similarity
  • QFIDF Similarity

9
IDF Similarity
  • IDF (inverse document frequency) is an adaptation
    of popular IR technique based on the philosophy
    that frequently occurring words convey less
    information about users needs than rarely
    occurring words, and thus should be weighted less.

10
IDF Similarity, formal definition
  • For every value t in the domain of attribute
    A, IDF(t) can be defined as log(n/F(t)),
  • where n number of tuples in the database
  • F(t) frequency of tuples in database where
    A t
  • The similarity between a tuple T and a query
    Q is defined as
  • i.e., similarity between a tuple T and a query
    Q is simply the sum of corresponding similarity
    coefficients over all attributes in T

11
QF Similarity leveraging workloads
  • There may be instances where relevance of a
    attribute value may be due to factors other than
    the frequency of its occurrence.
  • QF similarity is based on this very philosophy.
    According to QF Similarity, the importance of
    attribute values is directly related to the
    frequency of their occurrence in query strings in
    workload.

12
QFIDF Similarity
  • QF is purely workload based, i.e., it does not
    use data at all. This may be a disadvantage in
    situations wherein we have insufficient or
    unreliable workloads.
  • QFIDF Similarity is a remedy in such situations.
    It combines QF and IDF weights. This way even if
    a value is never referenced in the workload, it
    gets a small non-zero QF.

13
Breaking ties.
  • In case of many answers problem, the recently
    discussed ranking functions might fail to
    perform.
  • This is because many tuples may tie for the same
    similarity score. Such a scenario could arise for
    empty answer problem also.
  • To break this tie, requires looking beyond the
    attributes specified in the query, i.e., missing
    attributes.

14
Many Answers Problem
  • We know by now, that many answers problem in
    database systems is the consequence of not too
    selective queries.
  • Such a query on a database system produces a
    large number of tuples that satisfy the condition
    specified in the query.
  • Let us see how ranking of results in such a
    scenario is accomplished.

15
Basic Approach
  • Any ranking function for many answers problem has
    to look beyond the attributes specified in the
    query, since all or a large number of tuples
    satisfy the specified conditions.
  • To determine precisely the unspecified attributes
    is a challenging task. We show adaptation of
    Probabilistic Information Retrieval (PIR) ranking
    methods.

16
Ranking function for Many Answers Problem
  • Ranking function for many answers problem is
    developed by adaptation of PIR models that best
    model data dependencies and correlations.
  • The ranking function of a tuple depends on two
    factors (a) a global score, and (b) a
    conditional score.
  • These scores can be computed through workload as
    well as data analysis.

17
Ranking Function Adaptation of PIR Models for
Structured Data
  • The basic philosophy of PIR models is that given
    a document collection, D, the set of relevant
    documents, R, and the set of irrelevant
    documents, ( D R), any document t in
    D can be ranked by finding out score(t). The
    score(t) is the probability of t belonging to
    the relevant set, R

18
Problem in adapting this approach
  • The problem in computing the score(t) using PIR
    model for the databases is that the relevant set,
    R is unknown at query time.
  • This approach is well suited to IR domain as R
    is usually determined through user feedback.
  • User feedback based estimation of R might be
    attempted in databases also but we propose an
    automated approach.

19
The ranking formula
  • This is the final ranking formula that we will
    use in computing scores of tuples in order to
    rank them.
  • The ranking formula is composed of two large
    factors
  • Global part of the score measures the global
    importance of unspecified attributes
  • Conditional part of the score measures the
    dependencies between the specified and
    unspecified attributes

20
Architecture of Ranking System
Detailed Architecture of the Ranking System
21
Implementation
  • Pre-Processing the pre-processing component is
    composed of Atomic Probabilities Module and
    Index Module.
  • Atomic Probabilities Module is responsible
    for computation of several atomic probabilities
    necessary for the computation of Ranking
    Function, Score(t).
  • Index Module is responsible for
    pre-computation of ranked lists necessary for
    improving the efficiency of query processing
    module.

22
Implementation
  • Intermediate Layer the atomic probabilities,
    lists computed by the Index Module are all stored
    as database tables in the intermediate layer. All
    the tables in the intermediate layer are indexed
    on appropriate attributes for fast access during
    the later stages.
  • Primary purpose of the intermediate layer is to
    avoid computing the score from scratch each time
    a query is received, by storing pre-computed
    results of all atomic computations

23
The Index Module
  • Index module pre-computes the ranked lists of
    tuples for each possible atomic query.
  • Purpose is to take the run-time load off the
    query processing component.
  • To assist the query processing component in
    returning the Top-K tuples, ranked lists of the
    tuples for all possible atomic queries are
    pre-computed
  • Taking as input, the association rules and the
    database, Conditional List and Global List are
    created for each distinct value x in the
    database

24
Query Processing Component
  • List merge algorithm is the key player in the
    query processing component.
  • Its function is to take the user query, compute
    scores for all the tuples that satisfy the
    condition specified in the query, rank the tuples
    in a sorted order of the scores and then return
    the Top-K tuples.

25
Space Requirements
  • To build the conditional and the global lists,
    space consumed is O(mn) bytes (where m is the
    number of attributes and n is the number of
    tuples of the database table)
  • There may be applications where space is an
    expensive resource.
  • In such cases, only a subset of the lists may be
    stored at pre-processing times, but this will at
    the expense of an increase in query processing
    time.

26
Whats Next..
  • The ranking function so presented works on single
    table databases and does not allow presence of
    NULL values.
  • A very interesting but nevertheless challenging
    extension to this work would be to develop
    ranking functions that work on multi-table
    databases and allow NULLs as well as non-text
    data in database columns.
Write a Comment
User Comments (0)
About PowerShow.com