CSE 535 Information Retrieval


1
CSE 535 Information Retrieval
  • Chapter 1 Introduction to IR

2
Motivation
  • IR: representation, storage, organization of, and
    access to unstructured data
  • Focus is on the user information need
  • User information need
  • When did the Buffalo Bills last win the Super
    Bowl?
  • Find all docs containing information on cricket
    players who are (i) temperamental, (ii) popular
    in their countries, and (iii) play in
    international test series.
  • Emphasis is on the retrieval of information (not
    data)

3
Motivation
  • Data retrieval
  • which docs contain a set of keywords?
  • Well defined semantics
  • a single erroneous object implies failure!
  • Information retrieval
  • information about a subject or topic
  • deals with unstructured text
  • semantics is frequently loose
  • small errors are tolerated
  • IR system
  • interpret contents of information items
  • generate a ranking which reflects relevance
  • notion of relevance is most important

4
Basic Concepts
  • The User Task
  • Retrieval
  • information or data
  • purposeful
  • needle in a haystack problem
  • Browsing
  • glancing around
  • Formula 1 racing; cars, Le Mans, France, tourism
  • Filtering (push rather than pull)

5
Query
  • Which plays of Shakespeare contain the words
    Brutus AND Caesar but NOT Calpurnia?
  • Could grep all of Shakespeare's plays for Brutus
    and Caesar, then strip out lines containing
    Calpurnia?
  • Slow (for large corpora)
  • NOT Calpurnia is non-trivial
  • Other operations (e.g., find the phrase "Romans
    and countrymen") not feasible

6
Term-document incidence
[Figure: term-document incidence matrix; an entry is 1 if the play
contains the word, 0 otherwise]
7
Incidence vectors
  • So we have a 0/1 vector for each term.
  • To answer the query: take the vectors for Brutus,
    Caesar and Calpurnia (complemented), then bitwise
    AND them.
  • 110100 AND 110111 AND 101111 = 100100.
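
A minimal sketch of this Boolean retrieval in Python: the first and
fourth plays match the answers on the next slide; the remaining play
names are the usual ones for this example and are assumptions here.

# Incidence vectors over six plays (order assumed).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, then AND.
mask = (1 << len(plays)) - 1               # keep only the six play bits
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))               # -> 100100

# Bit i (counting from the left) stands for play i.
matches = [plays[i] for i in range(len(plays))
           if result & (1 << (len(plays) - 1 - i))]
print(matches)                             # -> ['Antony and Cleopatra', 'Hamlet']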

8
Answers to query
  • Antony and Cleopatra, Act III, Scene ii
    Agrippa [aside to Domitius Enobarbus]: Why, Enobarbus,
    When Antony found Julius Caesar dead,
    He cried almost to roaring; and he wept
    When at Philippi he found Brutus slain.
  • Hamlet, Act III, Scene ii
    Lord Polonius: I did enact Julius Caesar: I was
    killed i' the Capitol; Brutus killed me.

9
Bigger document collections
  • Consider N = 1 million documents, each with about
    1K terms.
  • Avg 6 bytes/term, incl. spaces/punctuation
  • That is about 6 GB of data in the documents.
  • Say there are M = 500K distinct terms among these.

10
Can't build the matrix
  • 500K x 1M matrix has half a trillion 0s and 1s.
  • But it has no more than one billion 1s.
  • matrix is extremely sparse: >99% zeros
  • What's a better representation?
  • We only record the 1 positions.
  • Inverted Index

Why?
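
A quick back-of-envelope check of these figures (and of the "Why?"
above), as a sketch using only the numbers stated on the last two
slides:

# Collection size: 1M docs x ~1K terms/doc x ~6 bytes/term.
n_docs, terms_per_doc, bytes_per_term = 1_000_000, 1_000, 6
print(n_docs * terms_per_doc * bytes_per_term)    # 6,000,000,000 bytes ~ 6 GB

# Incidence matrix: 500K terms x 1M docs cells, but each doc contributes
# at most ~1K distinct terms, so at most ~1 billion cells can be 1.
n_terms = 500_000
cells = n_terms * n_docs                          # 500,000,000,000
ones  = n_docs * terms_per_doc                    # <= 1,000,000,000
print(f"{100 * (1 - ones / cells):.1f}% zeros")   # ~99.8% zeros
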
11
Ad-Hoc Retrieval
  • Most standard IR task
  • System to provide documents from the collection
    that are relevant to an arbitrary user
    information need
  • Information need: the topic that the user wants to
    know about
  • Query: the user's abstraction of the information need
  • Relevance: a document is relevant if the user
    perceives it as valuable w.r.t. his information need

12
Issues to be Addressed by IR
  • How to improve quality of retrieval
  • Precision: what fraction of the returned results
    are relevant to the information need? (see the
    sketch after this list)
  • Recall: what fraction of the relevant documents in
    the collection are returned by the system?
  • Understanding user information need
  • Faster indexes and smaller query response times
  • Better understanding of user behaviour
  • interactive retrieval
  • visualization techniques
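
A minimal sketch of these two measures in Python; the docID sets are
made-up placeholders.

def precision(retrieved, relevant):
    """Fraction of the returned results that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were returned."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {1, 2, 3, 4}      # docIDs the system returned (illustrative)
relevant  = {2, 4, 5, 6, 7}   # docIDs judged relevant (illustrative)
print(precision(retrieved, relevant))   # 2/4 = 0.5
print(recall(retrieved, relevant))      # 2/5 = 0.4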

13
Inverted index
  • For each term T store a list of all documents
    that contain T.
  • Do we use an array or a list for this?

[Figure: postings lists for the terms Brutus, Calpurnia and Caesar,
each a sorted list of docIDs (docIDs such as 13 and 16 appear)]

What happens if the word Caesar is added to document 14?
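
A minimal sketch of such an index in Python, as a dict mapping each
term to a sorted list of docIDs; the particular lists are
illustrative, and the update shows what adding Caesar to document 14
involves.

import bisect

# Term -> sorted list of docIDs containing the term (illustrative data).
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, keeping them sorted by docID."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:   # avoid duplicates
        postings.insert(pos, doc_id)    # O(n) element shift in an array/list

# Caesar is added to document 14: 14 must be spliced into the middle.
add_posting(index, "Caesar", 14)
print(index["Caesar"])   # [2, 3, 5, 8, 13, 14, 21, 34]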
14
Inverted index
  • Linked lists generally preferred to arrays
  • Dynamic space allocation
  • Insertion of terms into documents easy
  • Space overhead of pointers

[Figure: three example postings lists, each sorted by docID:
2 4 8 16 32 64 128;  2 3 5 8 13 21 34;  1 13 16]
Sorted by docID (more later on why).
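
Keeping postings sorted by docID is what makes AND queries cheap: two
sorted lists can be intersected in a single linear merge. A minimal
sketch, using two of the illustrative lists above:

def intersect(p1, p2):
    """Merge-intersect two docID-sorted postings lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]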
15
Inverted index construction
[Figure: construction pipeline; documents to be indexed, e.g.
"Friends, Romans, countrymen."]
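
A minimal sketch of one common way to build the index: tokenize each
document into (term, docID) pairs, sort them, and group them into
postings lists. Both documents here are illustrative; only the phrase
"Friends, Romans, countrymen" comes from the slide.

from collections import defaultdict

docs = {
    1: "Friends, Romans, countrymen, lend me your ears",
    2: "So let it be with Caesar",
}

# 1. Tokenize: lowercase and strip commas (a deliberately crude tokenizer).
pairs = []
for doc_id, text in docs.items():
    for token in text.lower().replace(",", " ").split():
        pairs.append((token, doc_id))

# 2. Sort by term, then docID; 3. group into postings lists.
index = defaultdict(list)
for term, doc_id in sorted(pairs):
    if not index[term] or index[term][-1] != doc_id:   # drop duplicate docIDs
        index[term].append(doc_id)

print(index["caesar"])    # [2]
print(sorted(index)[:5])  # ['be', 'caesar', 'countrymen', 'ears', 'friends']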
16
Basic Concepts
  • Logical view of the documents
  • documents represented by a set of index terms or
    keywords
  • Document representation viewed as a continuum: the
    logical view of the docs might shift

[Figure: from full text to a set of index terms: docs -> structure
recognition -> accents, spacing, etc. -> stopwords -> noun groups ->
stemming -> automatic or manual indexing]
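
A minimal sketch of a couple of these text operations in Python; the
stopword list and the suffix-stripping rule are toy placeholders, not
a real stopword list or stemmer.

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}   # toy list

def crude_stem(token):
    """Toy suffix stripping; real systems use e.g. a Porter-style stemmer."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

print(index_terms("The players of international test series"))
# -> ['play', 'international', 'test', 'serie']
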
17
The Retrieval Process
[Figure: the retrieval process. The user's need, entered through the
user interface, is transformed by text operations into a logical
view; query operations turn it into a query, which the searching
module runs against the index (an inverted file built by the indexer
from the document collection under given indexing criteria and
preferences). Retrieved docs are passed to ranking, ranked docs are
returned to the user, and user feedback flows back into query
operations. The two halves of the figure are labeled Indexing and
Retrieval.]
18
Applications of IR
  • Specialized Domains
  • biomedical, legal, patents, intelligence
  • Summarization
  • Cross-lingual Retrieval, Information Access
  • Question-Answering Systems
  • Ask Jeeves
  • Web/Text Mining
  • data mining on unstructured text
  • Multimedia IR
  • images, document images, speech, music
  • Web applications
  • shopbots
  • personal assistant agents

19
IR Techniques
  • Machine learning
  • clustering, SVM, latent semantic indexing, etc.
  • improving relevance feedback, query processing
    etc.
  • Natural Language Processing, Computational
    Linguistics
  • better indexing, query processing
  • incorporating domain knowledge, e.g., synonym
    dictionaries
  • use of NLP in IR: benefits yet to be shown for
    large-scale IR
  • Information Extraction
  • Highly focused Natural language processing (NLP)
  • named entity tagging, relationship/event
    detection
  • Text indexing and compression
  • User interfaces and visualization
  • AI
  • advanced QA systems, inference, etc.