Intro to Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Intro to Information Retrieval

Description:

Intro to Information Retrieval By the end of the lecture you should be able to: explain the differences between database and information retrieval technologies – PowerPoint PPT presentation

Slides: 23
Provided by: RobertS212

1
Intro to Information Retrieval
  • By the end of the lecture you should be able to:
  • explain the differences between database and
    information retrieval technologies
  • describe the basic maths underlying set-theoretic
    and vector models of classical IR.

2
Reminder: efficiency is vital
  • Reminder: Google finds documents which match your
    keywords. This must be done EFFICIENTLY: it can't
    just go through each document from start to end
    for each keyword.
  • So, the cache stores a copy of the document, and also a
    cut-down version of the document for searching:
    just a bag of words, a sorted list (or
    array/vector/...) of words appearing in the
    document (with links back to the full document).
  • Try to match keywords against this list; if
    found, then return the full document.
  • Even cleverer: dictionary and inverted file.
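
The bag-of-words idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `bag_of_words`, `contains`, `cache` and `search` are hypothetical, the three sample texts are borrowed from the example documents used on the later slides, and words are split on whitespace with no stemming.

```python
from bisect import bisect_left

def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in text."""
    return sorted(set(text.lower().split()))

def contains(bag, keyword):
    """Binary-search the sorted bag for keyword (case-insensitive)."""
    key = keyword.lower()
    i = bisect_left(bag, key)
    return i < len(bag) and bag[i] == key

# Hypothetical cache: document id -> (full text, cut-down bag of words).
documents = {
    "d1": "Recipe for jam pudding",
    "d2": "DoT report on traffic lanes",
    "d3": "Radio item on traffic jam in Pudding Lane",
}
cache = {doc_id: (text, bag_of_words(text)) for doc_id, text in documents.items()}

def search(keywords):
    """Return the full text of every document whose bag holds all keywords."""
    return [text for text, bag in cache.values()
            if all(contains(bag, k) for k in keywords)]
```

For instance, `search(["jam"])` returns the jam pudding recipe and the radio item, while `search(["traffic", "jam"])` returns only the radio item. Each lookup is a binary search in a short sorted list rather than a scan of the full text.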

3
Inverted file structure
[Figure: inverted file structure, in three parts.
Dictionary: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), ...
Inverted or postings file: 1 2 1 2 3 2 2 3 4 ...
Data file: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, ...
Other labels: 1 3 6 7 9 ...]
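
A minimal sketch of a dictionary plus postings file, assuming the bracketed number next to each dictionary term is the count of postings for that term (an interpretation of the figure, not something the slide states). The function names and the sample texts are illustrative only.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Build (dictionary, postings) from a list of document texts.

    dictionary maps term -> (offset into postings, number of postings);
    postings is one flat list of 1-based document numbers.
    """
    term_docs = defaultdict(set)
    for doc_no, text in enumerate(docs, start=1):
        for word in text.lower().split():
            term_docs[word].add(doc_no)

    dictionary, postings = {}, []
    for term in sorted(term_docs):
        doc_nos = sorted(term_docs[term])
        dictionary[term] = (len(postings), len(doc_nos))
        postings.extend(doc_nos)
    return dictionary, postings

def lookup(term, dictionary, postings):
    """Return the document numbers that contain term (empty list if none)."""
    entry = dictionary.get(term.lower())
    if entry is None:
        return []
    offset, count = entry
    return postings[offset:offset + count]

# Hypothetical data file: the position in the list is the document number.
data_file = [
    "Recipe for jam pudding",
    "DoT report on traffic lanes",
    "Radio item on traffic jam in Pudding Lane",
]
dictionary, postings = build_inverted_file(data_file)
```

With this layout a keyword costs one dictionary lookup and one slice of the postings list, so `lookup("jam", dictionary, postings)` gives `[1, 3]` without touching the documents themselves.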
4
IR vs DBMS
5
informal introduction
  • IR was developed for bibliographic systems. We
    shall refer to documents, but the technique
    extends beyond items of text.
  • central to IR is representation of a document by
    a set of descriptors or index terms (words
    in the document).
  • searching for a document is carried out (mainly)
    in the space of index terms.
  • we need a language for formulating queries, and a
    method for matching queries with document
    descriptors.

6
architecture
[Figure: architecture diagram with the labels user, query, query matching, hits, feedback, learning component, and object base (objects and their descriptions)]
7
basic notation
Given a list of m documents, D, and a list of n
index terms, T, we define $w_{i,j}$ to be a weight
associated with the ith keyword and the jth
document. For the jth document, we define an
index term vector, $d_j$:
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
For example:
$D = \{d_1, d_2, d_3\}$
$T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
$d_1 = (1, 1, 0, 0, 0)$: Recipe for jam pudding
$d_2 = (0, 0, 1, 1, 0)$: DoT report on traffic lanes
$d_3 = (1, 1, 1, 1, 0)$: Radio item on traffic jam in Pudding Lane
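
The binary index term vectors in this example can be reproduced with a short sketch. It assumes whitespace tokenisation, lower-casing, and a deliberately crude plural rule (strip trailing "s") so that "lanes" matches the index term "lane"; that rule and the function name `term_vector` are assumptions for illustration, not part of the lecture.

```python
# Index terms T, in the order used on the slide.
T = ("pudding", "jam", "traffic", "lane", "treacle")

documents = {
    "d1": "Recipe for jam pudding",
    "d2": "DoT report on traffic lanes",
    "d3": "Radio item on traffic jam in Pudding Lane",
}

def term_vector(text, terms=T):
    """Return the binary index term vector (w_1j, ..., w_nj) for one document."""
    # Crude normalisation (an assumption): lower-case, strip trailing "s".
    words = {word.lower().rstrip("s") for word in text.split()}
    return tuple(1 if term in words else 0 for term in terms)

vectors = {doc_id: term_vector(text) for doc_id, text in documents.items()}
```

Here `vectors["d3"]` is `(1, 1, 1, 1, 0)`: the radio item mentions every index term except treacle.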
8
set theoretic, Boolean model
  • Queries are Boolean expressions formed using
    keywords, eg
  • $(\text{Jam} \lor \text{Treacle}) \land \text{Pudding} \land \lnot\text{Lane} \land \lnot\text{Traffic}$
  • Query is re-expressed in disjunctive normal form
    (DNF)

cf. $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
eg $q_{DNF} = (1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$
To match a document with a query:
$\mathrm{sim}(d, q_{DNF}) = 1$ if $d$ is equal to a component of $q_{DNF}$, and $0$ otherwise
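
A small sketch of the DNF step and the exact-match rule. It assumes the example query reads (jam OR treacle) AND pudding AND NOT lane AND NOT traffic, which is the reading consistent with the three DNF components listed on the slide. The DNF is produced by brute-force truth-table enumeration, which is only practical for a handful of index terms; the names `query`, `to_dnf` and `sim` are illustrative.

```python
from itertools import product

# Index terms T, in the order used on the slide.
T = ("pudding", "jam", "traffic", "lane", "treacle")

def query(pudding, jam, traffic, lane, treacle):
    """The example query, written as a Python Boolean expression."""
    return (jam or treacle) and pudding and not lane and not traffic

def to_dnf(predicate, n_terms=len(T)):
    """Return the set of 0/1 vectors (DNF components) that satisfy predicate."""
    return {bits for bits in product((0, 1), repeat=n_terms)
            if predicate(*(bool(b) for b in bits))}

def sim(d, q_dnf):
    """Boolean-model match: 1 if d equals a component of the DNF, else 0."""
    return 1 if tuple(d) in q_dnf else 0

q_dnf = to_dnf(query)
d1, d2, d3 = (1, 1, 0, 0, 0), (0, 0, 1, 1, 0), (1, 1, 1, 1, 0)
matches = [name for name, d in (("d1", d1), ("d2", d2), ("d3", d3)) if sim(d, q_dnf)]
```

Only `d1` matches. `d3` contains jam and pudding but also lane and traffic, so it fails the exact-match test, and the model gives no way to say it is a near miss.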
9
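$q_{DNF} = (1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$
$T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
[Figure: diagram labelled treacle, pudding, jam, traffic, lane]
$d_1 = (1, 1, 0, 0, 0)$, $d_2 = (0, 0, 1, 1, 0)$, $d_3 = (1, 1, 1, 1, 0)$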
10
collecting results
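$T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
Query: $(\text{Jam} \lor \text{Treacle}) \land \text{Pudding} \land \lnot\text{Lane} \land \lnot\text{Traffic}$
[Figure: diagram labelled treacle, pudding, jam, traffic, lane, with the query expression marked on it]
Answer: $d_1 = (1, 1, 0, 0, 0)$, Jam pud recipe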
11
Statistical vector model
  • weights, $0 \le w_{i,j} \le 1$, no longer binary-valued
  • query also represented by a vector
  • $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$
  • eg $q = (1.0, 0.6, 0.0, 0.0, 0.8)$

cf. $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
to match the jth document with a query: $\mathrm{sim}(d_j, q)$
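
One common choice for sim, and the one the cosine coefficient slides and worked numbers that follow are consistent with, is the cosine of the angle between the document vector and the query vector. Written in the notation above (the symbol $\theta_j$ is introduced here only for illustration):

```latex
\[
\mathrm{sim}(d_j, q) = \cos\theta_j
  = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}
  = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}
         {\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}
\]
```

The numerator rewards terms that carry weight in both the document and the query; the denominator normalises for vector length, so a long document is not favoured simply for containing more terms.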

12
Cosine coefficient
$\cos(\theta)$
[Figure: two vectors in the plane of term axes T1 and T2, separated by angle $\theta$]
13
Cosine coefficient
$\cos(0) = 1$
[Figure: term axes T1 and T2, with angle $\theta = 0$]
14
Cosine coefficient
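$\cos(90^\circ) = 0$
[Figure: term axes T1 and T2. Document vector D1 has components $w_{1,1}$ and $w_{2,1} = 0$; query vector Q has components $w_{1,q} = 0$ and $w_{2,q}$; angle $\theta = 90^\circ$]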
15
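$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
$d_1 = (0.8, 0.8, 0.0, 0.0, 0.2)$, Jam pud recipe
$d_1 \cdot q = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$
$\lVert d_1 \rVert^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$
$\lVert q \rVert^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_1, q) = 1.44 / \sqrt{1.32 \times 2.0} \approx 0.89$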
16
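$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
$d_2 = (0.0, 0.0, 0.9, 0.8, 0.0)$, DoT Report
$d_2 \cdot q = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$
$\lVert d_2 \rVert^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$
$\lVert q \rVert^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_2, q) = 0.0 / \sqrt{1.45 \times 2.0} = 0$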
17
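$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
$d_3 = (0.6, 0.9, 1.0, 0.6, 0.0)$, Radio Traffic Report
$d_3 \cdot q = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$
$\lVert d_3 \rVert^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$
$\lVert q \rVert^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_3, q) = 1.14 / \sqrt{2.53 \times 2.0} \approx 0.51$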
18
collecting results
cf. $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
Rank, document vector, document, (sim)

1. $d_1 = (0.8, 0.8, 0.0, 0.0, 0.2)$, Jam pud recipe (0.89)
2. $d_3 = (0.6, 0.9, 1.0, 0.6, 0.0)$, Radio Traffic Report (0.51)
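
The ranking can be checked with a short sketch, assuming sim is the cosine coefficient named on the earlier slides. The function name `cosine` and the dictionary `docs` are illustrative; the weights are the ones given in the worked examples.

```python
import math

def cosine(d, q):
    """Cosine coefficient between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.2),  # Jam pud recipe
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT report
    "d3": (0.6, 0.9, 1.0, 0.6, 0.0),  # Radio traffic report
}

# Highest similarity first.
ranking = sorted(((cosine(d, q), name) for name, d in docs.items()), reverse=True)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

This prints d1 at 0.89, d3 at 0.51 and d2 at 0.00. Unlike the Boolean model, d3 is now returned as a partial match, and d2, which shares no weighted term with the query, scores zero and so is absent from the ranked list on the slide.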
19
Discussion: set theoretic model
  • Boolean model is simple and queries have precise
    semantics, but it is an exact-match model and
    does not rank results
  • Boolean model is popular with bibliographic systems
    and is available on some search engines
  • Users find Boolean queries hard to formulate
  • There have been attempts to use the set theoretic model
    as the basis for a partial-match system: the fuzzy set
    model and the extended Boolean model.

20
Discussion: vector model
  • Vector model is simple and fast, and results show
    that it leads to good retrieval performance.
  • Partial matching leads to ranked output
  • Popular model with search engines
  • Underlying assumption of term independence (not
    realistic! Phrases, collocations, grammar)
  • Generalised vector space model relaxes the
    assumption that index terms are pairwise
    orthogonal (but is more complicated).

21
questions raised
  • Where do the index terms come from? (ALL the
    words in the source documents?)
  • What determines the weights?
  • How well can we expect these systems to work for
    practical applications?
  • How can we improve them?
  • How do we integrate IR into more traditional DB
    management?

22
Questions to think about
  • Why is traditional database unsuited to retrieval
    of unstructured information?
  • How would you re-express a Boolean query, eg (A
    or B or (C and not D)), in disjunctive normal
    form?
  • For the matching coefficient, sim(., .), show that
    $0 \le \mathrm{sim}(\cdot, \cdot) \le 1$, and that $\mathrm{sim}(a, a) = 1$.
  • Compare and contrast the vector and set
    theoretic models in terms of power of
    representation of documents and queries.