1
CSA3080 Adaptive Hypertext Systems I
Lecture 5: Information Retrieval I
  • Dr. Christopher Staff
  • Department of Computer Science & AI
  • University of Malta

2
Aims and Objectives
  • Aims and objectives of IR
  • Boolean, Extended Boolean, Statistical Models

3
Aims and Objectives
  • You should end up knowing the major differences
    between the simple matching algorithms
  • And what each algorithm considers to be a
    relevant document
  • Bear in mind that we will use IR in AHS to find
    information relevant to our user so that we can
    present it/lead the user to it

4
Aims and Objectives of IR
  • To facilitate the identification and retrieval of
    documents that contain information relevant to an
    information need expressed by a user
  • We are particularly interested in the retrieval
    of information from unstructured data

5
Boolean Information Retrieval
  • Developed in the 1950s
  • A document is represented by a collection of
    terms that occur in the document (index)
  • The set of unique terms occurring in the collection
    is called the vocabulary
  • A document is represented by a bit sequence with
    a 1 representing a term that is present, and 0
    otherwise
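
The representation above can be sketched in a few lines of Python. This is only an illustration: the two documents, their text, and the whitespace tokenisation are assumptions, not part of the lecture.

```python
# Toy collection; the documents and their text are made up for illustration.
docs = {
    "d1": "boolean retrieval uses terms",
    "d2": "vector space retrieval ranks documents",
}

# The vocabulary is the sorted set of unique terms in the whole collection.
vocabulary = sorted({term for text in docs.values() for term in text.split()})

def to_bits(text):
    """Return one bit per vocabulary term: 1 if the term occurs, else 0."""
    present = set(text.split())
    return [1 if term in present else 0 for term in vocabulary]

# Each document becomes a bit sequence aligned with the vocabulary order.
bit_vectors = {doc_id: to_bits(text) for doc_id, text in docs.items()}
```

Every bit sequence has the same length as the vocabulary, so position i always refers to the same term in every document.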

6
Boolean Information Retrieval
  • How is the query expressed?
  • User thinks of terms that describe an information
    need
  • Formalises query as a boolean expression
  • (Term27 OR Term46) NOT (Term30 AND Term16)

7
Boolean Information Retrieval
  • How does the matching algorithm work?
  • Each term in the vocabulary has a set (or
    postings list) of documents that contain the term
  • For each term in the query, the postings lists
    are retrieved
  • Set operations (union for OR, intersection for
    AND, difference for NOT) combine the postings lists
  • All documents in the results set are returned
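
The matching steps above can be sketched with Python sets, using the example query from the previous slide. The four documents and their term assignments are hypothetical, and the helper name `p` is an assumption made for brevity.

```python
from collections import defaultdict

# Hypothetical collection: each document is given as the set of terms it contains.
docs = {
    1: {"Term27", "Term30"},
    2: {"Term46", "Term30", "Term16"},
    3: {"Term27", "Term46"},
    4: {"Term16"},
}

# Build the postings: term -> set of ids of the documents containing it.
postings = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        postings[term].add(doc_id)

def p(term):
    """Postings set for a term (empty if the term is not in the vocabulary)."""
    return postings.get(term, set())

# (Term27 OR Term46) NOT (Term30 AND Term16)
# OR -> union, AND -> intersection, NOT -> set difference.
result = (p("Term27") | p("Term46")) - (p("Term30") & p("Term16"))
```

Here `result` holds documents 1 and 3. Document 2 matches the OR part but is removed because it contains both Term30 and Term16. Note that the result is an unordered set: nothing in it says which document is the better match.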

8
Boolean Information Retrieval
9
Questions Arising
  • Is this really information retrieval?
  • Just because a document contains term x, does it
    mean that the document is about term x?
  • What about concepts?
  • What makes it possible for us to know that a fish
    cake is not a dessert? Or that "she is the apple of
    my eye" does not make her a piece of fruit?

10
Questions Arising
  • Can we rank the results of a boolean query?
  • All we are doing is checking the presence and
    absence of terms
  • On what grounds would we rank?
  • And doesn't it look suspiciously like RDBMS/SQL?

11
Does Boolean IR work?
  • BIR works, and works well, when the vocabulary is
    reasonably small
  • when there is no ambiguity in the meaning of
    terms
  • when the presence of a term in a document is
    significant
  • when the absence of a term from a document
    means that the document cannot be about that term

12
Does Boolean IR work?
  • Boolean IR is typically applied to a document
    surrogate
  • And is used with tremendous success in RDBMS
  • Most general purpose IR systems in use on the
    Internet are derived from BIR with some
    extensions

13
Vector Space Model of IR
  • Briefly
  • Documents (query) represented by vector of term
    weights
  • Term weight describes relative importance of term
    to document (query)
  • Similarity of document to query measured
  • The more similar the document to the query, the
    more relevant it is
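
A minimal sketch of this idea in Python follows. The slide does not say which weighting or similarity measure is used, so raw term frequency and cosine similarity are assumptions here, as are the two toy documents.

```python
import math
from collections import Counter

def weights(text):
    """Term weights as raw term frequencies (one simple, assumed weighting)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = {
    "d1": "information retrieval finds relevant information",
    "d2": "fish cake recipe",
}
query = weights("information retrieval")

# Rank document ids by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(weights(docs[d]), query), reverse=True)
```

Unlike the Boolean result set, `ranking` is ordered, so the output can be cut off after the top few documents.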

14
Vector Space Model of IR
  • VSM gives improved results over Boolean
  • Can rank documents
  • Can control output (limit the no. of documents
    returned)
  • But not as easy to construct query
  • Query does not contain any structure
  • Can't express synonymy, etc.

15
Extended Boolean Retrieval Model
  • Developed to address ranking problem in BIR,
    using VSM-like approach, while retaining Boolean
    query structures
  • E-BIR not as strict as BIR (fuzzy matches
    supported, as in VSM)
  • Term features can include frequency, location, ...
  • Reference
  • G. Salton, E. Fox, and H. Wu. (1983). Extended
    Boolean information retrieval. Communications of
    the ACM, 26(11):1022-1036.

16
Extended Boolean Retrieval Model
  • Matching is still based on presence or absence of
    terms, but now results can be ranked
  • Terms in docs and query are weighted according to
    term features
  • With structured documents (e.g., HTML), term
    features can also include structural information
    (title, heading, style, ...)
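
The slides do not give a scoring formula. One common formulation, associated with the Salton, Fox, and Wu paper cited earlier, is the p-norm model. The sketch below assumes unweighted query terms and document term weights already scaled to [0, 1]; the function names and the example weights are hypothetical.

```python
def or_score(weights, p=2.0):
    """p-norm OR: high when any term weight is high."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)

def and_score(weights, p=2.0):
    """p-norm AND: high only when all term weights are high."""
    m = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)

# Hypothetical per-document weights in [0, 1] for the two query terms.
doc_weights = {"d1": [1.0, 0.0], "d2": [0.6, 0.6], "d3": [0.0, 0.0]}

# Rank documents for the query "term1 AND term2".
ranked_for_and = sorted(doc_weights, key=lambda d: and_score(doc_weights[d]), reverse=True)
```

Strict Boolean AND would reject d1 outright because one term is missing. Here d1 receives a partial score and is ranked below d2 but above d3, which is the "not as strict" behaviour described above.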

17
Extended Boolean Retrieval Model
  • With location information possible to find terms
    NEAR each other
  • "computer NEAR science" is not the same as
    "computer AND science": AND only requires both
    terms to occur somewhere in the document
  • ADJ (adjacent) refines the proximity measure
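
The difference between AND, NEAR, and ADJ can be sketched with a positional index. The window of 5 words for NEAR and the ordered reading of ADJ are assumptions (systems differ on both), and the two sample texts are made up.

```python
def positions(text):
    """Map each term to the list of word positions where it occurs."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

def near(index, a, b, window=5):
    """True if a and b occur within `window` words of each other, in either order."""
    return any(abs(i - j) <= window
               for i in index.get(a, []) for j in index.get(b, []))

def adj(index, a, b):
    """True if b occurs immediately after a."""
    return any(j - i == 1 for i in index.get(a, []) for j in index.get(b, []))

doc = positions("the science of building a computer is not computer science")
far = positions("computer games were banned long before anyone at the school taught science")
```

Both texts satisfy computer AND science, but only `doc` satisfies NEAR and ADJ; in `far` the two terms are eleven words apart.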

18
Questions Arising
  • Ranked results are an improvement
  • NEAR is also useful to improve the quality of
    results
  • as is ADJ
  • Are we any closer to information retrieval?

19
Phrase Matching
  • Concepts may be evidenced in text as
    complex/compound identifiers
  • New York, Computer Science, information
    retrieval, database management systems, ...
  • Brings us closer to information retrieval, but
    still only identifies documents that contain
    phrases
  • Reference
  • W. B. Croft, H. R. Turtle, and D. D. Lewis.
    (1991). The use of phrases and structured
    queries in information retrieval. ACM SIGIR,
    32-45.

20
Phrase Matching
  • Extended/Boolean can express phrases using AND
    together with proximity operator
  • VSM cannot, unless the phrase has been indexed!
  • When is a sequence of words a phrase?
  • Croft et al. use a probabilistic inference net
    model
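
One naive way to let a vector model "see" phrases is to index them as terms in their own right. The sketch below treats every adjacent word pair as a candidate phrase, which is an assumption made for illustration and not the method of Croft et al.; the function name and sample text are hypothetical.

```python
from collections import Counter

def terms_with_bigrams(text):
    """Count single words plus every adjacent word pair as candidate phrase terms."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(words + bigrams)

vector = terms_with_bigrams("new york has a new computer science department")
```

The resulting vector contains "new york" and "computer science" as terms, but also pairs such as "has a", which shows why deciding when a word sequence is really a phrase is the hard part.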

21
Conclusion
  • The Boolean and Extended Boolean Models give us a
    simple mechanism for representing documents
  • If we can represent a user's interest by the
    presence or absence of terms, then the user model
    could be used as a query to locate interesting
    documents
  • Phrase matching allows us to recognise complex
    nouns, which is useful only if the phrase is
    pervasive