Information Retrieval and Web Search - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Information Retrieval and Web Search
  • Heng Ji
  • hengji@cs.qc.cuny.edu
  • Sept 16, 2008

Acknowledgement: some slides from Jimmy Lin,
Victor Lavrenko
2
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search
  • Project Topic Discussion and Finalization

3
What is Information Retrieval?
  • Most people equate IR with web-search
  • highly visible, commercially successful endeavors
  • leverage 3 decades of academic research
  • IR: finding any kind of relevant information
  • web pages, news events, answers, images, ...
  • relevance is a key notion

4
IR System
5
What types of information?
  • Text (Documents and portions thereof)
  • XML and structured documents
  • Images
  • Audio (sound effects, songs, etc.)
  • Video
  • Source code
  • Applications/Web services

6
Interesting Examples
  • Google image search
  • Google video search
  • NYU Prof. Sekine's N-gram search
  • http://linserv1.cims.nyu.edu:23232/ngram/
  • INDRI Demo Show
  • http://www.lemurproject.org/indri/

http://images.google.com/
http://video.google.com/
7
What about databases?
  • What are examples of databases?
  • Banks storing account information
  • Retailers storing inventories
  • Universities storing student grades
  • What exactly is a (relational) database?
  • Think of them as a collection of tables
  • They model some aspect of the world

8
A (Simple) Database Example
Student Table
Department Table
Course Table
Enrollment Table
9
Database Queries
  • What would you want to know from a database?
  • What classes is John Arrow enrolled in?
  • Who has the highest grade in LBSC 690?
  • Who's in the history department?
  • Of all the non-CLIS students taking LBSC 690 who
    have a last name shorter than six characters and
    were born on a Monday, who has the longest email
    address?

10
Comparing IR to databases
11
The IR Black Box
[Diagram: a Query and a collection of Documents go into the black box, and Hits come out]
12
Inside The IR Black Box
Documents
Query
Representation Function
Representation Function
Query Representation
Document Representation
Index
Comparison Function
Hits
13
Building the IR Black Box
  • Different models of information retrieval
  • Boolean model
  • Vector space model
  • Language models
  • Representing the meaning of documents
  • How do we capture the meaning of documents?
  • Is meaning just the sum of all terms?
  • Indexing
  • How do we actually store all those words?
  • How do we access indexed terms quickly?

14
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

15
The Central Problem in IR
[Diagram: the information seeker's concepts are expressed as query terms, while the authors' concepts are expressed as document terms]
Do these represent the same concepts?
16
Relevance
  • Relevance is a subjective judgment and may
    include
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her
    intended use of the information (information
    need).

17
IR Ranking
  • Early IR focused on set-based retrieval
  • Boolean queries, set of conditions to be
    satisfied
  • document either matches the query or not
  • like classifying the collection into relevant /
    non-relevant sets
  • still used by professional searchers
  • advanced search in many systems
  • Modern IR ranked retrieval
  • free-form query expresses user's information need
  • rank documents by decreasing likelihood of
    relevance
  • many studies prove it is superior

18
A heuristic formula for IR
  • Rank docs by similarity to the query
  • suppose the query is "cryogenic labs"
  • Similarity = number of query words in the doc
  • favors documents with both "labs" and "cryogenic"
  • mathematically: see the sketch after this list
  • Logical variations (set-based)
  • Boolean AND (require all words)
  • Boolean OR (any of the words)
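A minimal way to write this similarity (the slide's own formula image is not in the transcript, so this is an illustrative reconstruction):

    Sim(D, Q) = \sum_{q \in Q} [q \in D]

i.e. the number of query words that appear in the document. Boolean AND requires this count to equal |Q|; Boolean OR only requires it to be at least 1.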

19
Term Frequency (TF)
  • Observation
  • key words tend to be repeated in a document
  • Modify our similarity measure
  • give more weight if word occurs multiple times
  • Problem
  • biased towards long documents
  • spurious occurrences
  • normalize by length
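In the same illustrative notation, the TF-weighted, length-normalized score reads:

    Sim(D, Q) = \sum_{q \in Q} tf(q, D) / |D|

where tf(q, D) is the number of times q occurs in D and |D| is the document length in words (a sketch of the idea, not necessarily the exact formula on the slide).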

20
Inverse Document Frequency (IDF)
  • Observation
  • rare words carry more meaning: cryogenic, apollo
  • frequent words are linguistic glue: of, the,
    said, went
  • Modify our similarity measure
  • give more weight to rare words but don't be
    too aggressive (why?)
  • C = total number of documents
  • df(q) = total number of documents that contain q
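A standard form consistent with the definitions above (the slide's formula image is not in the transcript):

    idf(q) = \log( C / df(q) )

Each matched query word then contributes tf(q, D) * idf(q) rather than tf(q, D) alone; the logarithm is what keeps the rare-word boost from being too aggressive.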

21
TF normalization
  • Observation
  • D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
  • which document is more relevant?
  • which one is ranked higher? (df(labs) >
    df(cryogenic))
  • Correction
  • first occurrence more important than a repeat
    (why?)
  • squash the linearity of TF
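Two common ways to squash the linearity of TF, shown only as illustrations of the correction the slide describes:

    tf'(q, D) = 1 + \log tf(q, D)            (for tf(q, D) > 0)
    tf'(q, D) = tf(q, D) / ( tf(q, D) + k )  (k a constant)

Either way, the first occurrence of a word counts far more than its repeats, so D1 = {cryogenic, labs} can outrank D2 = {cryogenic, cryogenic}.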

22
State-of-the-art Formula
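The formula image for this slide is not in the transcript. A widely used BM25-style weighting that combines the TF, IDF, and length-normalization ideas from the previous slides, given here only as an illustration:

    Sim(D, Q) = \sum_{q \in Q} [ tf(q, D) / ( tf(q, D) + k ((1 - b) + b |D| / avgdl) ) ] \cdot \log( C / df(q) )

with typical constants k ≈ 1.2 and b ≈ 0.75, and avgdl the average document length in the collection.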
23
Vector-space approach to IR
[Diagram: documents plotted as vectors in a term space with axes cat, pig, and dog; e.g. one document containing "cat cat", another "cat pig", another "pig cat"]
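In the vector-space model each document and the query become term-weight vectors, and documents are ranked by the angle between their vector and the query vector. A minimal sketch, assuming whitespace tokenization and raw term counts (the helper names are illustrative, not from the slides):

    from collections import Counter
    from math import sqrt

    def to_vector(text):
        # term -> raw count
        return Counter(text.lower().split())

    def cosine(d, q):
        # cosine of the angle between two sparse count vectors
        dot = sum(d[t] * q[t] for t in q)
        norm_d = sqrt(sum(v * v for v in d.values()))
        norm_q = sqrt(sum(v * v for v in q.values()))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    doc1, doc2, query = to_vector("cat cat"), to_vector("cat pig"), to_vector("cat")
    print(cosine(doc1, query))   # 1.0 (points in the same direction as the query)
    print(cosine(doc2, query))   # about 0.71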
24
Language-modeling Approach
  • query is a random sample from a "perfect
    document"
  • words are sampled independently of each other
  • rank documents by the probability of generating
    query

[Diagram: a document D, a query, and per-word sampling probabilities such as 4/9, 2/9, 4/9, 3/9]
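A minimal sketch of this query-likelihood idea: score each document by the probability of independently drawing each query word from the document's word distribution. The add-one smoothing and vocabulary size below are assumptions made for illustration, not taken from the slide:

    from collections import Counter

    def query_likelihood(query, doc, vocab_size=10000):
        counts = Counter(doc.lower().split())
        doc_len = sum(counts.values())
        prob = 1.0
        for w in query.lower().split():
            # add-one (Laplace) smoothed unigram probability of w given the document
            prob *= (counts[w] + 1) / (doc_len + vocab_size)
        return prob

    # rank documents by decreasing probability of generating the query
    docs = ["cryogenic labs in the news", "the mayor visited the labs"]
    ranked = sorted(docs, key=lambda d: query_likelihood("cryogenic labs", d), reverse=True)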
25
PageRank in Google
26
PageRank in Google (Cont)
[Diagram: example link structure among pages I1, I2, A, and B]
  • Assign a numeric value to each page
  • The more a page is referred to by important
    pages, the more this page is important
  • d = damping factor (0.85); see the formula sketch below
  • Many other criteria, e.g. proximity of query
    words
  • "information retrieval" is better than
    "information ... retrieval"
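The slide's formula image is not in the transcript; the standard PageRank update it refers to is:

    PR(A) = (1 - d) + d * ( PR(I1)/C(I1) + PR(I2)/C(I2) + ... )

where I1, I2, ... are the pages that link to A, C(I) is the number of outgoing links on page I, and d is the damping factor (0.85). The values are computed iteratively until they converge.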

27
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

28
Keyword Search
  • Simplest notion of relevance is that the query
    string appears verbatim in the document.
  • Slightly less strict notion is that the words in
    the query appear frequently in the document, in
    any order (bag of words).

29
Problems with Keywords
  • May not retrieve relevant documents that include
    synonymous terms.
  • restaurant vs. café
  • PRC vs. China
  • May retrieve irrelevant documents that include
    ambiguous terms.
  • "bat" (baseball vs. mammal)
  • "Apple" (company vs. fruit)
  • "bit" (unit of data vs. act of eating)

30
Query Expansion
  • http://www.lemurproject.org/lemur/IndriQueryLanguage.php
  • Most errors caused by vocabulary mismatch
  • query: "cars", document: "automobiles"
  • solution: automatically add highly-related words
  • Thesaurus / WordNet lookup (see the sketch below)
  • add semantically-related words (synonyms)
  • cannot take context into account
  • "rail car" vs. "race car" vs. "car and cdr"
  • Statistical Expansion
  • add statistically-related words (co-occurrence)
  • very successful
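A minimal sketch of the thesaurus/WordNet lookup above, assuming NLTK and its WordNet data are installed (the library choice is an assumption; the slide does not prescribe one). Because synonyms are added for every sense of a word, it also shows why context problems such as "rail car" vs. "car and cdr" arise:

    from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

    def expand_query(query):
        expanded = set(query.lower().split())
        for word in list(expanded):
            for synset in wn.synsets(word):
                for lemma in synset.lemmas():
                    # add synonyms regardless of the intended sense (no context used)
                    expanded.add(lemma.name().replace("_", " ").lower())
        return expanded

    print(expand_query("car"))   # includes automobile, auto, railcar, cable car, ...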

31
IR Query Examples
  • http://nlp.cs.qc.cuny.edu/ir.zip
  • Query
  • <parameters><query>#combine( #weight( 0.063356
    #1(explosion) 0.187417 #1(blast) 0.411817
    #1(wounded) 0.101370 #1(injured) 0.161191
    #1(death) 0.074849 #1(deaths)) #weight( 0.311760
    #1(Davao City international airport) 0.311760
    #1(Tuesday) 0.103044 #1(DAVAO) 0.195505
    #1(Philippines) 0.019817 #1(DXDC) 0.058113
    #1(Davao Medical Center)))</query></parameters>

32
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

33
Document indexing
  • Goal: find the important meanings and create an
    internal representation
  • Factors to consider
  • Accuracy to represent meanings (semantics)
  • Exhaustiveness (cover all the contents)
  • Facility for computer to manipulate
  • What is the best representation of contents?
  • Char. string (char trigrams): not precise enough
  • Word: good coverage, not precise
  • Phrase: poor coverage, more precise
  • Concept: poor coverage, precise

[Diagram: the representations String, Word, Phrase, Concept ordered along an axis running from high coverage (recall) to high accuracy (precision)]
34
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
35
  • Multiple term entries in a single document are
    merged.
  • Frequency information is added.
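A toy sketch of these indexer steps: tokenize each document, collect (term, document ID) pairs, merge multiple entries of a term within a document, and attach the frequency. Real indexers add token normalization, on-disk sorting, and compression; the helper name below is illustrative:

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: {doc_id: text}; returns {term: [(doc_id, term_frequency), ...]}
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                # merge multiple entries of the same term in one document
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
        # postings sorted by document ID, with frequencies attached
        return {term: sorted(postings.items()) for term, postings in index.items()}

    docs = {1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
            2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious"}
    index = build_inverted_index(docs)
    print(index["caesar"])   # [(1, 1), (2, 2)]
    print(index["brutus"])   # [(1, 1), (2, 1)]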

36
Stopwords / Stoplist
  • function words do not bear useful information for
    IR
  • of, in, about, with, I, although, ...
  • A stoplist contains stopwords that are not to be
    used as index terms
  • Prepositions
  • Articles
  • Pronouns
  • Some adverbs and adjectives
  • Some frequent words (e.g. document)
  • The removal of stopwords usually improves IR
    effectiveness
  • A few standard stoplists are commonly used.
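A minimal sketch of stoplist filtering (the tiny stoplist below is only an illustration; real systems use standard lists of a few hundred words):

    STOPLIST = {"of", "in", "about", "with", "i", "although", "the", "a", "was"}

    def remove_stopwords(tokens):
        # keep only content-bearing tokens as index terms
        return [t for t in tokens if t.lower() not in STOPLIST]

    print(remove_stopwords("I was killed in the Capitol".split()))   # ['killed', 'Capitol']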

37
Stemming
  • Reason
  • Different word forms may bear similar meaning
    (e.g. search, searching); create a standard
    representation for them
  • Stemming
  • Removing some endings of words
  • computer
  • compute
  • computes
  • computing
  • computed
  • computation

→ comput
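A minimal stemming sketch using the Porter stemmer from NLTK (the library choice is an assumption; any suffix-stripping stemmer behaves similarly):

    from nltk.stem import PorterStemmer   # requires nltk to be installed

    stemmer = PorterStemmer()
    words = ["computer", "compute", "computes", "computing", "computed", "computation"]
    # every form reduces to the same stem, "comput"
    print({w: stemmer.stem(w) for w in words})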
38
Lemmatization
  • transform to standard form according to syntactic
    category.
  • E.g. verb + ing → verb
  • noun + s → noun
  • Need POS tagging
  • More accurate than stemming, but needs more
    resources
  • crucial to choose stemming/lemmatization rules
  • noise vs. recognition rate
  • compromise between precision and recall
  • light/no stemming: -recall, +precision
  • severe stemming: +recall, -precision

39
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

40
IR on the Web
  • No stable document collection (spider, crawler)
  • Invalid documents, duplication, etc.
  • Huge number of documents (partial collection)
  • Multimedia documents
  • Great variation of document quality
  • Multilingual problem

41
Web Search
  • Application of IR to HTML documents on the World
    Wide Web.
  • Differences
  • Must assemble document corpus by spidering the
    web.
  • Can exploit the structural layout information in
    HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.

42
Web Search System
[Diagram: a web search system with an IR system at its core]
43
  • Technical Backup

44
Some formulas for Sim
  • Dot product
  • Cosine
  • Dice
  • Jaccard

[Diagram: document vector D and query vector Q in a two-dimensional term space with axes t1 and t2]
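The formula images for this slide are not in the transcript; the standard forms of these four measures, for a document vector D and a query vector Q with term weights d_i and q_i, are:

    Dot product:  Sim(D, Q) = \sum_i d_i q_i
    Cosine:       Sim(D, Q) = \sum_i d_i q_i / ( sqrt(\sum_i d_i^2) * sqrt(\sum_i q_i^2) )
    Dice:         Sim(D, Q) = 2 \sum_i d_i q_i / ( \sum_i d_i^2 + \sum_i q_i^2 )
    Jaccard:      Sim(D, Q) = \sum_i d_i q_i / ( \sum_i d_i^2 + \sum_i q_i^2 - \sum_i d_i q_i )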