
1
CSM06 Information Retrieval
  • Lecture 1b IR Basics
  • Dr Andrew Salway, a.salway@surrey.ac.uk

2
Requirements for IR systems
  • When developing or evaluating an IR system, the
    first considerations are:
  • Who are the users of information retrieval
    systems? General public or specialist
    researchers?
  • What kinds of information do they want to
    retrieve? Text, image, audio or video? General
    or specialist information?

3
Information Access Process
  • Most uses of an information retrieval system can
    be characterised by this generic process:
  • 1. Start with an information need
  • 2. Select a system / collections to search
  • 3. Formulate a query
  • 4. Send the query
  • 5. Receive results (i.e. information items)
  • 6. Scan, evaluate and interpret the results
  • 7. Reformulate the query and go to (4), OR stop
  • From Baeza-Yates and Ribeiro-Neto (1999), p. 263
  • NB: when doing IR on the web a user can browse
    away from the results returned in step 5; this
    may change the process. (A minimal sketch of the
    loop follows below.)
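  • A minimal sketch of this loop in Python (illustrative
    only; the naive substring search and all names here are
    assumptions standing in for a real search engine):

    def search(collection, query):
        # Steps 4-5: send the query, receive matching items.
        return [doc for doc in collection if query.lower() in doc.lower()]

    docs = ["Urban population growth in Victorian England",
            "The industrial revolution: an overview",
            "Rural life before industrialisation"]

    query = "industrial revolution"      # step 3: formulate a query
    print(search(docs, query))           # steps 4-6: scan and evaluate
    query = "Victorian England"          # step 7: reformulate, go to (4)
    print(search(docs, query))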

4
Information Need → Query
  • Verbal queries:
  • Single-word queries: a list of words
  • Context queries: phrase (" ") and proximity (NEAR)
  • Boolean queries: use AND, OR, BUT
  • Natural language: from a sentence to a whole text
  • (Illustrative examples of some of these query
    types are sketched below.)
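  • A small illustration of Boolean query evaluation over a
    toy inverted index (document IDs per term); this code is
    an assumption for exposition, not from the lecture:

    # Toy inverted index: term -> set of document IDs
    index = {
        "industrial": {1, 2},
        "revolution": {1, 3},
        "urban":      {2, 3},
    }

    # "industrial AND revolution"
    print(index["industrial"] & index["revolution"])   # {1}
    # "industrial OR urban"
    print(index["industrial"] | index["urban"])        # {1, 2, 3}
    # "revolution BUT urban" (BUT = AND NOT in classic IR notation)
    print(index["revolution"] - index["urban"])        # {1}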

5
Information Need → Query
  • EXERCISE
  • Tina is a user of an information retrieval
    system who is researching how the industrial
    revolution affected the urban population in
    Victorian England.
  • How could her information need be expressed with
    the different types of query described above?
  • What are the advantages / disadvantages of each
    query type?

6
Ad-hoc Retrieval Problem
  • The ad-hoc retrieval problem is commonly faced by
    IR systems, especially web search engines. It
    takes the form: "return information items on
    topic t",
  • where t is a string of one or more terms
    characterising a user's information need.
  • For large collections this needs to happen
    automatically.
  • Note there is not a fixed list of topics!
  • So, the IR system should return documents
    relevant to the query.
  • Ideally it will rank the documents in order of
    relevance, so the user sees the most relevant
    first (a toy ranking sketch follows below).
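  • A toy ranking sketch in Python, scoring documents by
    query-term overlap (a crude stand-in for real relevance
    ranking; all names here are illustrative assumptions):

    def rank(collection, topic):
        terms = set(topic.lower().split())
        scored = []
        for doc_id, text in collection.items():
            # Score = number of query terms the document contains.
            score = len(terms & set(text.lower().split()))
            if score > 0:                  # only items on topic t
                scored.append((score, doc_id))
        # Most relevant first, as the user ideally sees them.
        return [doc_id for score, doc_id in sorted(scored, reverse=True)]

    docs = {1: "the industrial revolution in England",
            2: "urban population of Victorian England",
            3: "a cookbook of rural recipes"}
    print(rank(docs, "industrial revolution England"))   # [1, 2]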

7
Generic Architecture of a Text IR System
  • Based on Baeza-Yates and Ribeiro-Neto (1999),
    Modern Information Retrieval, Figure 1.3, p.10.

8
[Figure: generic architecture of a text IR system, with
components User Interface, Text Operations, Indexing,
Query Operations, Searching, Index, Ranking and Text
Database]
9
IR compared with data retrieval and knowledge
retrieval
  • Data Retrieval (e.g. an SQL query to a
    well-structured database): if the data is stored,
    you get exactly what you want

10
IR compared with data retrieval and knowledge
retrieval
  • Information Retrieval: returns information items
    from an unstructured source; the user must still
    interpret them

11
IR compared with data retrieval and knowledge
retrieval
  • Knowledge Retrieval (see current Information
    Extraction technology): answers specific
    questions by analysing an unstructured
    information source, e.g. a user could ask "What
    is the capital of France?" and the system would
    answer "Paris" by reading a book about France

12
How Good is an IR System?
  • We need ways to measure how good an IR system
    is, i.e. evaluation metrics
  • Systems should return relevant information items
    (texts, images, etc.); systems may rank the items
    in order of relevance
  • Two ways to measure the performance of an IR
    system:
  • Precision: how many of the retrieved items are
    relevant?
  • Recall: how many of the items that should have
    been retrieved were retrieved?
  • These should be objective measures.
  • Both require humans to make decisions about which
    documents are relevant for a given query

13
Calculating Precision and Recall
  • R = number of documents in the collection
    relevant to topic t
  • A(t) = number of documents returned by the system
    in response to query t
  • C = number of correct (relevant) documents
    returned, i.e. the intersection of R and A(t)
  • PRECISION = (C / A(t)) × 100
  • RECALL = (C / R) × 100
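  • The two formulas translate directly into a small Python
    function (same symbols as above; a sketch, not any
    standard library API):

    def precision_recall(C, A, R):
        # C = correct (relevant) documents returned
        # A = total documents returned; R = relevant in collection
        precision = (C / A) * 100
        recall = (C / R) * 100
        return precision, recall

    print(precision_recall(C=30, A=60, R=50))   # (50.0, 60.0)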

14
EXERCISE
  • Amanda and Alex each need to choose an
    information retrieval system. Amanda works for
    an intelligence agency, so getting all possible
    information about a topic is important for the
    users of her system. Alex works for a newspaper,
    so getting some relevant information quickly is
    more important for the journalists using his
    system.
  • See below for statistics for two information
    retrieval systems (Search4Facts and InfoULike)
    when they were used to retrieve documents from
    the same document collection in response to the
    same query. There were 100,000 documents in the
    collection, of which 50 were relevant to the
    given query. Which system would you advise
    Amanda to choose, and which would you advise Alex
    to choose? Your decisions should be based on the
    evaluation metrics of precision and recall.
  • Search4Facts
  • Number of Relevant Documents Returned: 12
  • Total Number of Documents Returned: 15
  • InfoULike
  • Number of Relevant Documents Returned: 48
  • Total Number of Documents Returned: 295
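  • As a check on the arithmetic (the numbers below follow
    from the formulas on the previous slide; drawing the
    conclusions is left to the exercise):

    R = 50                               # relevant docs in the collection

    # Search4Facts: 12 relevant out of 15 returned
    print(12 / 15 * 100, 12 / R * 100)   # precision 80.0, recall 24.0

    # InfoULike: 48 relevant out of 295 returned
    print(48 / 295 * 100, 48 / R * 100)  # precision ~16.3, recall 96.0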

15
Precision and Recall refinements
  • May plot graphs of P against R for single queries
    (see Belew 2000, Table 4.2 and Figs. 4.10 and
    4.11)
  • These graphs are unstable for single queries, so
    one may need to combine P/R curves for multiple
    queries (a sketch of the underlying data follows
    below)
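  • A sketch of the data behind such a P/R curve: precision
    measured at each rank where a relevant document appears
    (the ranked list here is invented for illustration):

    ranked = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = relevant item at that rank
    total_relevant = 4

    found = 0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            found += 1
            print(f"recall={found/total_relevant:.2f}  "
                  f"precision={found/rank:.2f}")
    # recall=0.25 precision=1.00 ... recall=1.00 precision=0.57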

16
Reference Collections: TREC
  • A reference collection comprises a set of
    documents and a set of queries for which all
    relevant documents have been identified; size is
    important!
  • TREC = Text REtrieval Conference
  • TREC Collection: 6GB of text (millions of
    documents, mainly news-related)!
  • See http://trec.nist.gov/

17
Reference Collections: Cystic Fibrosis Collection
  • The Cystic Fibrosis collection comprises 1239
    documents from the National Library of Medicine's
    MEDLINE database; 100 information requests with
    relevant documents; four relevance scores (0-2)
    from four experts
  • Available for download:
    http://www.dcc.ufmg.br/irbook/cfc.html

18
Cystic Fibrosis Collection example document
  • PN 74001
  • RN 00001
  • AN 75051687
  • AU Hoiby-N. Jorgensen-B-A. Lykkegaard-E.
    Weeke-B.
  • TI Pseudomonas aeruginosa infection in cystic
    fibrosis.
  • SO Acta-Paediatr-Scand. 1974 Nov. 63(6). P 843-8.
  • MJ CYSTIC-FIBROSIS co. PSEUDOMONAS-AERUGINOSA
    im.
  • MN ADOLESCENCE. BLOOD-PROTEINS me. CHILD.
    CHILD
  • AB The significance of Pseudomonas aeruginosa
    infection in the respiratory tract of 9 cystic
    fibrosis patients has been studied. The
    results indicate no protective value of the many
    precipitins on the tissue of the respiratory
    tract.
  • RF 001 BELFRAGE S ACTA MED SCAND SUPPL 173 5
    1963
  • 002 COOMBS RRA IN GELL PGH 317 1964
  • CT 1 HOIBY N SCAND J RESPIR DIS 56 38 1975
  • 2 HOIBY N ACTA PATH MICROBIOL SCAND (C) 83
    459 1975

19
CF Collection: example query and details of
relevant documents
  • QN 00001
  • QU What are the effects of calcium on the
    physical properties of mucus from CF patients?
  • NR 00034
  • RD 139 1222 151 2211 166 0001 311 0001 370
    1010 392 0001 439 0001 440 0011 441 2122
    454 0100 461 1121 502 0002 503 1000 505 0001
  • 139 = document number
  • 1222 = expert 1 scored it relevance 1; experts
    2-4 scored it relevance 2
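  • A sketch of reading the RD field, following the
    interpretation above (pairs of document number plus four
    expert scores; the parsing code itself is an assumption):

    rd = "139 1222 151 2211 166 0001 311 0001"
    tokens = rd.split()
    for doc, scores in zip(tokens[0::2], tokens[1::2]):
        # One relevance score (0-2) per expert, in order.
        print(f"doc {doc}: expert scores {[int(s) for s in scores]}")
    # doc 139: expert scores [1, 2, 2, 2] ...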

20
Further Reading
  • See Belew (2000), pages 119-128
  • See also Belew CD for reference corpus (and lots
    more!)

21
Basic Concepts of IR: recap
  • After this lecture, you should be able to explain
    and discuss:
  • Information access process; ad-hoc retrieval
  • User information need → query; IR vs. data
    retrieval / knowledge retrieval; retrieval vs.
    browsing
  • Relevance; ranking
  • Evaluation metrics: Precision and Recall

22
Set Reading
  • To prepare for next week's lecture, you should
    look at:
  • Weiss et al. (2005) handout, especially sections
    1.4, 2.3, 2.4 and 2.5
  • Belew, R. K. (2000), pages 50-58

23
Further Reading
  • For more about the IR basics in today's lecture,
    see the introductions in:
  • Belew, R. K. (2000); Baeza-Yates and
    Ribeiro-Neto (1999), pages 1-9; or Kowalski and
    Maybury (2000)

24
Further Reading
  • To keep up to date with web search engine
    developments, see:
  • www.searchenginewatch.com
  • I will put links to some online articles about
    recent developments in web search technologies on
    the module web page