IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS VIA RDBMS - PowerPoint PPT Presentation

1 / 26
About This Presentation
Title:

IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS VIA RDBMS

Description:

Schema : specifies name of relation, plus name and type of each column. ... Note that we can store the text part of a document using BLOB or CLOG ... – PowerPoint PPT presentation

Number of Views:36
Avg rating:3.0/5.0
Slides: 27
Provided by: peopleSab
Category:

less

Transcript and Presenter's Notes

Title: IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS VIA RDBMS


1
IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS
VIA RDBMS
2
Relational Database Definitions
  • Relational database a set of relations
  • Relation made up of 2 parts
  • Instance a table, with rows and columns. Rows
    cardinality, fields degree / arity.
  • Schema specifies name of relation, plus name
    and type of each column.
  • E.G. Students(sid string, name string, login
    string, age integer, gpa
    real).
  • Can think of a relation as a set of rows or
    tuples (i.e., all rows are distinct).

3
Example Instance of Students Relation
Cardinality 3, degree 5, all rows distinct
4
Relational Query Languages
  • A major strength of the relational model
    supports simple, powerful querying of data.
  • Queries can be written intuitively, and the DBMS
    is responsible for efficient evaluation.

5
The SQL Query Language
  • Developed by IBM (system R) in the 1970s
  • Need for a standard since it is used by many
    vendors
  • Standards
  • SQL-86
  • SQL-89 (minor revision)
  • SQL-92 (major revision, current standard)
  • SQL-99 (major extensions)

6
The SQL Query Language
  • To find all 18 year old students, we can write

SELECT FROM Students S WHERE S.age18
  • To find just names and logins, replace the first
    line

SELECT S.name, S.login
7
Querying Multiple Relations
SELECT S.name, E.cid FROM Students S, Enrolled
E WHERE S.sidE.sid AND E.gradeA
8
Creating Relations in SQL
  • Creates the Students relation. Observe that the
    type (domain) of each field is
    specified, and enforced by the DBMS whenever
    tuples are added or modified.
  • As another example, the Enrolled table holds
    information about courses that students take.

CREATE TABLE Students (sid CHAR(20), name
CHAR(20), login CHAR(10), age INTEGER,
gpa REAL)
CREATE TABLE Enrolled (sid CHAR(20), cid
CHAR(20), grade CHAR(2))
9
Combining Separate Systems
  • Use an IR and RDBMS systems which are
    independent.
  • Divide the query into two
  • Structured part for the RDBMS
  • Unstructured (text) part for the IR
  • Combine the results from IR and RDBMS
  • Good for letting each vendor develop its own
    system
  • Bad for data integrity, recovery, portability,
    and performance

10
User Defined Operators
  • Allow users to modify SQL by adding their own
    functions
  • Some vendors used this approach (such as IBM DB2
    text extender)
  • Lynch and Stonebreaker defined user defined
    operators to implement information retrieval in
    1988

//Retrieves documents that contain term1, term2,
term3 SELECT Doc_Id FROM Doc WHERE
SEARCH-TERM(Text, Term1, Term 2, Term3)
//Retrieves documents that contain term1, term2,
term3 // within a window of 5 terms SELECT
Doc_Id FROM Doc WHERE PROXIMITY(Text,5, Term1,
Term 2, Term3)
11
Non-First Normal Form Approaches
  • Capture the many-to-many relationships into sets
    via nested relations
  • Hard to implement ad-hoc queries
  • No standard yet

12
Using RDBMS for IR
  • Benefits
  • Recovery
  • Performance
  • Data migration
  • Concurrency Control
  • Access control mechanism
  • Logical and physical data independence

13
Using RDBMS for IR
  • Example A bibliography that includes both
    structured and unstructured information
  • DIRECTORY (name, institution) affiliation of
    the author
  • AUTHOR(name,DocId) authorship information
  • INDEX (name, DocId) terms that are used to index
    a document

14
Using RDBMS for IR
  • Preprocessing
  • SGML can be used as a starting point which is a
    standard for defining parts of documents

ltDOCgt ltDOCNOgt WSJ834234234 lt/DOCNOgt ltHLgt How to
make students suffer in IR Course lt/HLgt ltDDgt
03/23/87lt/DDgt ltDATELINEgt Sabanci, Turkey
lt/DATELINEgt ltTEXTgt Crawler HW, Inverted Index,
Querying lt/TEXTgt lt/DOCgt
15
Using RDBMS for IR
  • Preprocessing
  • SGML can be used as a starting point which is a
    standard for defining parts of documents
  • Use a parser together with a hash function to
    identify terms
  • Use STOP_TERM table for referencing stop words
  • Produce three output tables
  • INDEX (DocId, Term, TermFrequency) Models the
    inverted index
  • DOC (DocId, DocName, PubDate, DateLine)
    Document metadata
  • TERM (Term, Idf) stored the weights of each
    term

//Construct TERM table, N is the total number of
documents INSERT INTO TERM SELECT
Term,log(N/Count()) FROM INDEX GROUP BY Term
16
Using RDBMS for IR
An offset can be added together with the term to
be able to answer proximity queries. For example
Vice President should occur together in the
same document for relevant documents
etc. INDEX_PROX (DocId, Term, OffSet)
//Construct TERM table, N is the total number of
documents INSERT INTO INDEX SELECT DocId, Term,
COUNT() FROM INDEX_PROX GROUP BY DocId, Term
17
Using RDBMS for IR
  • Query can be modeled as a relation as well when
    it is a long document
  • QUERY(Term,TermFreq)
  • Ex Find all news documents written on
    03/03/2005 about Sabanci University
  • Data will be extracted from the structured fields
  • Terms will be extracted using the inverted index

SELECT d.DocId FROM DOC d, INDEX i WHERE i.Term
IN (Sabanci, University) AND d.PubDate
03/03/2005 AND d.DocId i.DocId
18
Using RDBMS for IR
  • Boolean Queries Consists of terms with boolean
    operators (AND, OR, and NOT)
  • For a single inputTerm retrieve the document
    texts that contain that term

SELECT d.Text FROM DOC d, WHERE d.DocId IN
(SELECT DISTINCT (i.DocId) FROM
INDEX i WHERE i.Term
inputTerm) Note that we can store the text part
of a document using BLOB or CLOG ( Binary or
Character Large Object)
19
Using RDBMS for IR
  • Boolean Queries that contain OR

SELECT DISTINCT (i.DocId) FROM INDEX i WHERE
i.Term inputTerm1 OR i.Term
inputTerm2 OR .. i.Term
inputTermn OR
20
Using RDBMS for IR
  • Boolean Queries that contain AND

SELECT DISTINCT (i.DocId) FROM INDEX i WHERE
i.Term inputTerm1 AND i.Term
inputTerm2 AND .. i.Term
inputTermn AND ??
21
Using RDBMS for IR
  • Boolean Queries that contain AND (Previous Answer
    Was Wrong)

SELECT DISTINCT (i.DocId) FROM INDEX i1, INDEX
i2, INDEX i3, . INDEX in WHERE i1.Term
inputTerm1 AND i2.Term inputTerm2
AND .. in.Term
inputTermn AND i1.DocID i2.DocId
AND i2.DocID i3.DocId AND
in-1 in.DocID OR YOU CAN USE
INTERSECTION
22
Using RDBMS for IR
  • Boolean Queries that contain AND
  • Commercial DBMSs are not able to process more
    than a fixed number of joins.
  • Solution

SELECT i.DocId FROM INDEX i, Query q WHERE
i.Term q.term GROUP BY i.DocId HAVING
COUNT(i.Term) (SELECT COUNT() FROM
QUERY) Works only when the INDEX contains only
one occurrence of a given term Together with its
frequency. No Proximity is recorded.
23
Using RDBMS for IR
  • Boolean Queries that contain AND
  • Commercial DBMSs are not able to process more
    than a fixed number of joins.
  • Solution for terms appearing more than once in
    the INDEX

SELECT i.DocId FROM INDEX i, Query q WHERE
i.Term q.term GROUP BY i.DocId HAVING
COUNT(DISTINCT(i.Term)) (SELECT COUNT() FROM
QUERY) This is slower since DISTINC requires a
sort for duplicate elimination.
24
Using RDBMS for IR
  • Boolean Queries that contain AND
  • Commercial DBMSs are not able to process more
    than a fixed number of joins.
  • Implementation of TAND (Threshold AND) is also
    simple

SELECT i.DocId FROM INDEX i, Query q WHERE
i.Term q.term GROUP BY i.DocId HAVING
COUNT(DISTINCT(i.Term)) gt k
25
Using RDBMS for IR
  • Proximity Queries for terms within a specific
    window width

SELECT a.DocId FROM INDEX_PROX a, INDEX_PROX b
WHERE a.Term IN (SELECT q.Term FROM QUERY q)
AND b.Term IN (SELECT q.Term FROM
QUERY q) AND a.DocId b.DocId AND
(a.offset b.offset) BETWEEN 0 AND
(width-1) GROUP BY a.DocId, b.DocId, a.Term,
a.offset HAVING COUNT(DISTINCT(b.Term)) SELECT
(COUNT() FROM QUERY)
26
Using RDBMS for IR
  • Calculating Relevance

SELECT i.DocId, SUM(q.tft.idft.tft.idf) FROM
QUERY q, INDEX i, TERM t WHERE q.Term t.term
AND i.Term t.Term GROUP BY i.DocId ORDER BY 2
DESC
Write a Comment
User Comments (0)
About PowerShow.com