Database Searching and Information Retrieval - PowerPoint PPT Presentation

Slides: 21
Provided by: rangerUt4
Learn more at: http://ranger.uta.edu
Transcript and Presenter's Notes

1
Database Searching and Information Retrieval
  • Presented by
  • Tushar Kumar.J
  • Ritesh Bagga

2
Background
  • Motivation
  • The main motivation behind choosing this topic was our
    interest in expanding our knowledge of databases, and
    the support it will provide to our research work.
  • Focus
  • Our focus is on the various algorithms employed to
    retrieve the top few results from a database. This has
    recently been one of the most exciting fields in
    database research.

3
Introduction to Problem
  • Most often we query a single database.
  • At times we need to query multiple databases with
    heterogeneous data.
  • It is difficult for a user to write a single SQL query
    that works on all the databases.
  • Solution: develop a middleware system that works on top
    of these subsystems.
  • This middleware divides the query into subqueries and
    runs them on each individual subsystem.

4
Introduction to Problem
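User query: (Color = Red) AND (Shape = Circle)
Middleware system (we will study algorithms which run on this middleware)
[Figure: the middleware splits the query into the subqueries Color = Red and Shape = Circle; the subsystems return graded lists (Redness, Circle) over objects R1 to R4, and the aggregation function (MIN) combines the grades into the result.]
Grades shown in the figure: R3 (0.70), R3 (1.00), R1 (1.00), R2 (0.50), R2 (0.00), R4 (0.40), R1 (0.10), R4 (0.00)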
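
As a rough illustration of the aggregation step, the sketch below combines per-subsystem grades with MIN and ranks the objects. The object names and grades are hypothetical values chosen for this example (they are not the ones in the figure), and the function names `aggregate_min` and `rank` are assumptions.

```python
def aggregate_min(grades):
    """Overall grade of one object under the MIN aggregation function."""
    return min(grades)


def rank(subsystem_grades):
    """Rank objects by overall grade, best first.

    subsystem_grades maps each subsystem name to {object: grade}.
    Assumes every object has a grade in every subsystem.
    """
    objects = next(iter(subsystem_grades.values())).keys()
    overall = {
        obj: aggregate_min([g[obj] for g in subsystem_grades.values()])
        for obj in objects
    }
    # Sort by grade descending, then by name so ties are deterministic.
    return sorted(overall.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical grades for the query (Color = Red) AND (Shape = Circle).
grades = {
    "redness": {"A": 0.9, "B": 0.4, "C": 0.7},
    "circle": {"A": 0.2, "B": 0.8, "C": 0.6},
}
result = rank(grades)
```

With MIN, an object scores well only if it scores well in every subsystem, which matches the AND in the query.
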
5
Framework of this presentation
  • Basic algorithms
  • Comparative study of basic algorithms
  • Modifications of TA algorithm
  • Advanced algorithms
  • Related work
  • How web-search engines rank the web pages?
  • Conclusion

6
Basic algorithms: Fagin's Algorithm
  • The most basic and original algorithm for solving
    the problem was developed by Ron Fagin and is called
    the FA algorithm.
  • The FA algorithm consists of the following steps:
  • Sorted access in parallel to each of the m lists.
  • Random access, for every new object R seen, to every
    other list to find the i-th field x_i of R.
  • Use the aggregation function t(R) = t(x_1, x_2, ..., x_m)
    for every object to calculate its overall grade, and
    store it in set Y.
  • Define set H containing the objects seen in all the
    lists.
  • Stopping point: set H has at least k objects.
  • Sort set Y and output the top k values.
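
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `fagin_topk`, the in-memory dictionaries standing in for random access, and the sample lists are assumptions, and every object is assumed to have a grade in every list.

```python
import heapq


def fagin_topk(lists, k, t=min):
    """Top-k by Fagin's Algorithm (FA), following the slide's steps.

    lists: m lists of (object, grade) pairs, each sorted by grade descending.
    t: monotone aggregation function over a list of m grades.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen_in = {}                            # object -> indexes of lists it was seen in
    Y = {}                                  # object -> overall grade
    for depth in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                continue
            obj, _ = lst[depth]             # sorted access
            seen_in.setdefault(obj, set()).add(i)
            if obj not in Y:                # random access to the other lists
                Y[obj] = t([lookup[j][obj] for j in range(m)])
        H = [o for o, s in seen_in.items() if len(s) == m]
        if len(H) >= k:                     # stopping point
            break
    return heapq.nlargest(k, Y.items(), key=lambda kv: kv[1])


# Hypothetical sample data: two lists over objects a..e.
list1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.1)]
list2 = [("b", 0.95), ("c", 0.85), ("a", 0.5), ("e", 0.4), ("d", 0.3)]
fa_result = fagin_topk([list1, list2], k=2)
```

Note that Y holds a grade for every object seen, so its size grows with the depth reached under sorted access.
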

7
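Basic algorithms: Fagin's Algorithm
[Figure: FA example on four sorted lists over objects R1 to R10. Each row below holds the four entries read at one sorted-access depth, one from each list.]
Depth 1: R8(0.95), R3(0.95), R5(1.00), R10(1.00)
Depth 2: R7(0.95), R10(0.80), R2(0.90), R3(0.95)
Depth 3: R4(0.70), R8(0.90), R5(0.85), R7(0.85)
Depth 4: R2(0.85), R8(0.65), R3(0.80), R8(0.80)
Depth 5: R4(0.80), R5(0.75), R7(0.60), R7(0.75)
Depth 6: R3(0.70), R2(0.55), R9(0.70), R2(0.75)
Depth 7: R4(0.65), R1(0.65), R9(0.50), R6(0.60)
Depth 8: R1(0.60), R5(0.45), R9(0.55), R1(0.50)
Depth 9: R4(0.40), R6(0.45), R10(0.55), R6(0.40)
Depth 10: R1(0.30), R6(0.50), R9(0.30), R10(0.30)
Objects seen (overall grade): R2 (3.05), R3 (3.40), R4 (2.55), R5 (3.05), R7 (3.15), R8 (3.30), R9 (2.05), R10 (2.65)
Objects seen in all 4 lists: R8, R7, R2
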
8
Basic algorithms: Threshold Algorithm
  • Similar to FA, with a slight modification.
  • The TA algorithm consists of the following steps:
  • Sorted access in parallel to each of the m lists.
  • Random access, for every new object R seen, to every
    other list to find the i-th field x_i of R.
  • Use the aggregation function t(R) = t(x_1, x_2, ..., x_m)
    for every object to calculate its overall grade, and
    store it in set Y only if it belongs to the top k
    objects seen so far.
  • Calculate the threshold value T after every sorted
    access, by applying the aggregation function to the
    last grade seen in each list.
  • Stopping point: as soon as at least k objects have been
    seen whose grade is at least equal to T.
  • Return set Y, which holds the top k values.
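
A minimal Python sketch of these steps follows. The function name `threshold_topk`, the dictionaries standing in for random access, and the sample lists are assumptions made for illustration; every object is assumed to be graded in every list, and the lists are assumed to have equal length.

```python
import heapq


def threshold_topk(lists, k, t=min):
    """Top-k by the Threshold Algorithm (TA).

    lists: m lists of (object, grade) pairs, each sorted by grade descending.
    t: monotone aggregation function over a list of m grades.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    top = []                                # set Y: min-heap of (grade, object), size <= k
    seen = set()
    for depth in range(min(len(lst) for lst in lists)):
        last = []                           # last grade seen in each list
        for lst in lists:
            obj, grade = lst[depth]         # sorted access
            last.append(grade)
            if obj not in seen:
                seen.add(obj)
                overall = t([lookup[j][obj] for j in range(m)])
                if len(top) < k:
                    heapq.heappush(top, (overall, obj))
                elif overall > top[0][0]:
                    heapq.heapreplace(top, (overall, obj))
        T = t(last)                         # threshold value
        if len(top) == k and top[0][0] >= T:
            break                           # k objects with grade >= T
    return sorted(((o, g) for g, o in top), key=lambda kv: -kv[1])


# Hypothetical sample data: two lists over objects a..e.
list1 = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.1)]
list2 = [("b", 0.95), ("c", 0.85), ("a", 0.5), ("e", 0.4), ("d", 0.3)]
ta_result = threshold_topk([list1, list2], k=2)
```

Only the current top k objects are kept in the heap, so the buffer stays at size k regardless of how many objects are seen.
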

9
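Basic algorithms: Threshold Algorithm
[Figure: TA example on four sorted lists over objects R1 to R10. Each row below holds the four entries read at one sorted-access depth, one from each list, with the threshold computed at that depth.]
Depth 1: R8(0.95), R3(0.95), R5(1.00), R10(1.00); threshold 3.90/4
Depth 2: R7(0.95), R10(0.80), R2(0.90), R3(0.95); threshold 3.60/4
Depth 3: R4(0.70), R8(0.90), R5(0.85), R7(0.85); threshold 3.30/4
Depth 4: R2(0.85), R8(0.65), R3(0.80), R8(0.80); threshold 3.10/4
Depth 5: R4(0.80), R5(0.75), R7(0.60), R7(0.75)
Depth 6: R3(0.70), R2(0.55), R9(0.70), R2(0.65)
Depth 7: R4(0.65), R1(0.65), R9(0.50), R6(0.60)
Depth 8: R1(0.60), R5(0.45), R9(0.55), R1(0.50)
Depth 9: R4(0.40), R6(0.45), R10(0.55), R6(0.40)
Depth 10: R1(0.30), R6(0.50), R9(0.30), R10(0.30)
Top 3 objects: R3 (3.40/4), R8 (3.30/4), R7 (3.15/4)
Other objects graded: R5 (3.05/4), R2 (2.95/4), R10 (2.65/4)
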
10
Basic algorithms: Comparison between TA and FA
  • FA is optimal only in some cases, whereas TA is instance
    optimal: for monotone aggregation functions it is optimal,
    up to a constant factor, on every database.
  • TA uses less buffer space; FA requires a buffer that
    grows with the database size.
  • TA may perform m-1 random accesses for every object it
    sees, including objects not in the top-k set, whereas FA
    performs these random accesses only once for every
    object newly seen under sorted access.

11
Modifications of TA Algorithms
  • Approximation: an algorithm that finds the top k
    elements within a degree of approximation x. It stops
    earlier than TA.
  • Restricting sorted access: used when sorted access to
    some lists is not allowed, e.g. finding the best
    restaurant.
  • Restricting random access:
  • NRA (No Random Access) was developed for cases where no
    random access is allowed, e.g. a text retrieval system.
  • CA (Combined Algorithm) was developed for situations
    where random accesses are allowed but are very costly,
    e.g. random disk access. It is a combination of TA and
    NRA.
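
For the approximation variant, one way to picture the earlier stop is as a relaxed version of TA's stopping test. The sketch below assumes the degree of approximation is a factor `theta >= 1` (the slides call it x), in the style of the approximation described by Fagin, Lotem and Naor; the function name and arguments are hypothetical.

```python
def approx_should_stop(topk_grades, k, threshold, theta=1.0):
    """Stopping test for an approximate TA variant.

    topk_grades: overall grades of the best objects found so far.
    threshold: the current TA threshold value T.
    theta = 1 gives the exact TA rule; theta > 1 accepts grades that are
    within a factor theta of T, so the algorithm may stop earlier.
    """
    if theta < 1:
        raise ValueError("theta must be at least 1")
    return len(topk_grades) >= k and min(topk_grades) >= threshold / theta


exact_stop = approx_should_stop([0.7, 0.8], k=2, threshold=0.8, theta=1.0)
approx_stop = approx_should_stop([0.7, 0.8], k=2, threshold=0.8, theta=1.2)
```

In this example the exact rule keeps going because 0.7 is below the threshold 0.8, while the relaxed rule stops because 0.7 is at least 0.8 / 1.2.
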

12
Advanced algorithms
  • Suppose we already have several ranked lists of
    objects; the problem here is to aggregate these lists
    to form a single ranked list.
  • The problem can be solved using a median-finding
    algorithm.
  • The steps involved in the median-finding algorithm
    are:
  • - Find the rank of each object in each of the ranked
    lists.
  • - Find the median of the ranks obtained from these
    lists for each object.
  • - Sort the objects by their median ranks.
  • - Retrieve the results from this sorted list.
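
The four steps can be sketched as follows. The function name `median_rank_aggregate` and the sample lists are assumptions for illustration, and every object is assumed to appear in every ranked list.

```python
from statistics import median


def median_rank_aggregate(ranked_lists):
    """Aggregate ranked lists by sorting objects on their median rank."""
    # Step 1: rank (1-based position) of each object in each list.
    ranks = {}
    for lst in ranked_lists:
        for position, obj in enumerate(lst, start=1):
            ranks.setdefault(obj, []).append(position)
    # Step 2: median of the ranks of each object.
    med = {obj: median(r) for obj, r in ranks.items()}
    # Steps 3 and 4: sort by median rank (ties broken by name) and return.
    return sorted(med, key=lambda obj: (med[obj], obj))


# Hypothetical rankings of four objects by three sources.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
aggregated = median_rank_aggregate(rankings)
```

Here the ranks of "a" are 1, 2 and 1, so its median rank is 1 and it comes first in the aggregated list.
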

13
Advanced algorithms
  • A limitation of the median-finding algorithm is its
    large number of random accesses, which is overcome by
    the MEDRANK algorithm.
  • The MEDRANK algorithm accesses the ranked lists one
    element of every list at a time, until some element has
    been seen in more than half of the lists.
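
A small sketch of this idea is given below. The function name `medrank` and the sample lists are assumptions; the parameter `k`, which keeps reading until k such elements have been found, is an extension added for this example (the slide describes the case k = 1).

```python
def medrank(ranked_lists, k=1):
    """Return the first k objects seen in more than half of the lists."""
    m = len(ranked_lists)
    counts = {}      # object -> number of lists it has been seen in so far
    winners = []
    for depth in range(max(len(lst) for lst in ranked_lists)):
        for lst in ranked_lists:        # one element of every list at a time
            if depth >= len(lst):
                continue
            obj = lst[depth]
            counts[obj] = counts.get(obj, 0) + 1
            if counts[obj] * 2 > m and obj not in winners:
                winners.append(obj)
                if len(winners) == k:
                    return winners
    return winners


# Hypothetical rankings of four objects by three sources.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
best = medrank(rankings)
top_two = medrank(rankings, k=2)
```

Only sorted access is used: the lists are read from the top and no object is ever looked up by name in another list.
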

14
Related work
  • In 1996, Chaudhuri and Gravano presented an algorithm
    that was built on Fagin's original FA algorithm.
  • In 1997 and 1998, Carey and Kossmann presented
    techniques to optimize top-k queries.
  • In 1999, Nepal and Ramakrishna presented variations on
    Fagin's TA algorithm for processing queries over
    multimedia databases.
  • In 2000, Güntzer made a remarkable contribution to
    Fagin's TA algorithm by reducing the number of random
    accesses.
  • In 2002, Chang and Hwang presented an algorithm called
    MPro to optimize the execution of expensive predicates.

15
How web-search engines rank the web pages (1)
  • Web search engines rank web pages based on various
    factors.
  • Frequency of occurrence and location of the search
    terms are the primary factors.
  • Two of the most important web search engines are
    Google and AltaVista.

16
How web-search engines rank the web pages (2)
  • AltaVista
  • - Maintains a huge phrase dictionary.
  • - The basic intuition behind its ranking of web pages
    is as follows:
  • - It first displays all the pages containing the
    phrase.
  • - Then it displays all the pages in which the words are
    close to each other.
  • - These are followed by the pages containing all the
    terms, and then the pages containing any of the terms.
  • - Another important factor is the popularity of the
    search being performed.

17
How web-search engines rank the web pages (3)
  • Google
  • - Uses a very different technology, called PageRank.
  • PageRank technology
  • - Measures the importance of a web page by solving an
    equation.
  • - Interprets a link as a vote.
  • - Assesses a page's importance by the number of votes
    it receives.
  • - Important pages receive a higher rank and appear at
    the top of the search results.
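
The slides do not state the equation. A commonly cited form is PR(p) = (1 - d)/N + d * sum of PR(q)/L(q) over the pages q that link to p, where N is the number of pages, L(q) is the number of links leaving q and d is a damping factor. The sketch below solves it by repeated substitution; the function name, the value d = 0.85, the iteration count and the tiny link graph are all assumptions for illustration, not a description of Google's actual system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for the PageRank equation.

    links maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Each link passes on an equal share of the page's rank (a "vote").
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank


# Hypothetical web of three pages: a and b both link to c, c links to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Page c receives two votes and ends up ranked highest, a receives one vote from the important page c, and b receives none.
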

18
Conclusion
  • The literature studied shows that much work has been
    done on the problem of retrieving the top-k results
    from a database.
  • We came across many algorithms that are difficult to
    understand.
  • Research in this field is still very active.
  • The focus is now on devising more sophisticated
    algorithms for aggregating ranked lists.

19
References
  • [1] Ronald Fagin. Combining Fuzzy Information from
    Multiple Systems. Received July 4, 1996; revised
    June 22, 1998.
  • [2] Ronald Fagin. Combining Fuzzy Information: an
    Overview. ACM SIGMOD Record 31, 2 (June 2002),
    pp. 109-118.
  • [3] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal
    Aggregation Algorithms for Middleware. J. Computer and
    System Sciences 66 (2003), pp. 614-656. Extended
    abstract in Proc. 2001 ACM Symposium on Principles of
    Database Systems (PODS '01), pp. 102-113.
  • [4] Ronald Fagin, Ravi Kumar, and D. Sivakumar.
    Efficient Similarity Search and Classification via Rank
    Aggregation. Proc. 2003 ACM SIGMOD Conference
    (SIGMOD '03), pp. 301-312.
  • [5] Ronald Fagin, Ravi Kumar, Mohammad Mahdian,
    D. Sivakumar, and Erik Vee. Comparing and Aggregating
    Rankings with Ties. Proc. 2004 ACM Symposium on
    Principles of Database Systems (PODS '04), pp. 47-58.
  • [6] Ronald Fagin, Ravi Kumar, and D. Sivakumar.
    Comparing Top k Lists. SIAM J. Discrete Mathematics
    17, 1 (2003), pp. 134-160. Extended abstract in 2003
    ACM-SIAM Symposium on Discrete Algorithms (SODA '03),
    pp. 28-36.
  • [7] A. Marian, N. Bruno, and L. Gravano. Evaluating
    Top-k Queries over Web-Accessible Databases. Accepted
    for publication in ACM Transactions on Database
    Systems, 2003.
  • [8] Martin P. Courtois and Michael W. Berry. Results
    Ranking in Web Search Engines. Online, May 1999.

20
  • Thank you!
  • Any Questions?