1
Towards Semantics for IR
  • Eugene Agichtein
  • Emory University

Acknowledgements: many of the slides in this talk are adapted from others, including Chris Manning, ChengXiang Zhai, James Allan, Ray Mooney, and Jimmy Lin.
2
Who is this guy?
Sept 2006 onward: Assistant Professor in the Math & CS department at Emory. 2004 to 2006: Postdoc in the Text Mining, Search, and Navigation group at Microsoft Research, Redmond. 2004: Ph.D. in Computer Science from Columbia University; dissertation on extracting structured relations from large unstructured text databases. 1998: B.S. in Engineering from The Cooper Union. Research interests: accessing, discovering, and managing information in unstructured (text) data, with current emphasis on developing robust and scalable text mining techniques for the biology and health domains.
3
Outline
  • Text Information Retrieval: 10-minute overview
  • Problems with lexical retrieval
  • Synonymy, Polysemy, Ambiguity
  • A partial solution: synonym lookup
  • Towards concept retrieval
  • LSI
  • Language Models for IR
  • PLSI
  • Towards real semantic search
  • Entities, Relations, Facts, Events in Text (my research area)

4
Information Retrieval From Text
IR System
5
Was that the whole story in IR?
Source Selection
Query Formulation
Search
Selection
Examination
Delivery
6
Supporting the Search Process
[Diagram: the search process (source selection → query formulation → search → selection → examination → delivery), supported by an indexing pipeline (acquisition → collection → indexing → index) that matches the query against the index to produce a ranked list of documents.]
7
Example Query
  • Which plays of Shakespeare contain the words
    Brutus AND Caesar but NOT Calpurnia?
  • One could grep all of Shakespeare's plays for
    Brutus and Caesar, then strip out lines
    containing Calpurnia
  • Slow (for large corpora)
  • NOT Calpurnia requires egrep
  • But other operations (e.g., find the word Romans
    near countrymen, or the top-K scenes most about ...)
    are not feasible

8
Term-document incidence
1 if play contains word, 0 otherwise
Brutus AND Caesar but NOT Calpurnia
9
Incidence vectors
  • So we have a 0/1 vector for each term.
  • Boolean model:
  • To answer the query, take the vectors for Brutus,
    Caesar, and Calpurnia (complemented), then bitwise AND them
  • 110100 AND 110111 AND 101111 = 100100
  • Vector-space model:
  • Compute query-document similarity as the dot
    product/cosine between the query and document vectors
  • Rank by similarity (see the sketch below)
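A minimal sketch of both models in Python, reusing the incidence vectors from the slide (Brutus 110100, Caesar 110111, Calpurnia 010000); the array layout and the toy query are illustrative assumptions:

```python
import numpy as np

# Term-document incidence vectors from the slide (1 if the play contains the word).
incidence = {
    "brutus":    np.array([1, 1, 0, 1, 0, 0], dtype=bool),   # 110100
    "caesar":    np.array([1, 1, 0, 1, 1, 1], dtype=bool),   # 110111
    "calpurnia": np.array([0, 1, 0, 0, 0, 0], dtype=bool),   # 010000
}

# Boolean model: Brutus AND Caesar AND NOT Calpurnia via bitwise operations.
answer = incidence["brutus"] & incidence["caesar"] & ~incidence["calpurnia"]
print(answer.astype(int))  # [1 0 0 1 0 0] -> 100100, as on the slide

# Vector-space model: rank documents by cosine similarity to a toy query.
terms = ["brutus", "caesar", "calpurnia"]
doc_term = np.array([incidence[t] for t in terms], dtype=float).T  # docs x terms
query = np.array([1.0, 1.0, 0.0])                                  # "brutus caesar"
sims = (doc_term @ query) / (np.linalg.norm(doc_term, axis=1) * np.linalg.norm(query) + 1e-12)
print(np.argsort(-sims))   # document indices ranked by similarity
```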

10
Answers to query
  • Antony and Cleopatra, Act III, Scene ii
  • Agrippa [Aside to DOMITIUS ENOBARBUS]: Why,
    Enobarbus,
  • When Antony found
    Julius Caesar dead,
  • He cried almost to
    roaring; and he wept
  • When at Philippi he
    found Brutus slain.
  • Hamlet, Act III, Scene ii
  • Lord Polonius: I did enact Julius Caesar: I was
    killed i' the
  • Capitol; Brutus killed me.

11
Modern Search Engines in 1 Minute
  • Crawl time:
  • Inverted list: terms → doc IDs
  • Content chunks (doc copies)
  • Query time:
  • Look up query terms in the inverted list → filter set
  • Get content chunks for the doc IDs
  • Rank documents using hundreds of features (e.g.,
    term weights, web topology, proximity, position)
  • Retrieve the top-K documents for the query (K ≪ filter set); a toy sketch follows

[Diagram: inverted index entries, e.g., angina → doc 5, treatment → doc 4, pointing into content chunks]
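A toy sketch of the crawl-time/query-time split, assuming an in-memory dict as the inverted list; the documents and IDs (including angina → 5, treatment → 4 from the diagram) are illustrative:

```python
from collections import defaultdict

# Crawl time: build the inverted list (term -> doc IDs) and keep content chunks.
docs = {
    5: "angina treatment options and guidelines",
    4: "new treatment for chronic chest pain",
}
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# Query time: look up query terms, intersect postings to get the filter set,
# fetch content chunks, and (not shown) rank with many features.
query = ["angina", "treatment"]
postings = [inverted[t] for t in query if t in inverted]
filter_set = set.intersection(*postings) if postings else set()
print(filter_set, {doc_id: docs[doc_id] for doc_id in filter_set})
```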
12
Outline
  • Text Information Retrieval: 10-minute overview
  • Problems with lexical retrieval
  • Synonymy, Polysemy, Ambiguity
  • A partial solution: synonym lookup
  • Towards concept retrieval
  • LSI
  • Language Models for IR
  • PLSI
  • Towards real semantic search
  • Entities, Relations, Facts, Events

13
The Central Problem in IR
[Diagram: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms. Do these represent the same concepts?]
14
Noisy-Channel Model of IR
[Diagram: documents d1, d2, ..., dn in the document collection; the query arises from one of them.]
The user has an information need, thinks of a relevant document, and writes down some queries.
The task of information retrieval: given the query, figure out which document it came from.
15
How is this a noisy-channel?
  • No one seriously claims that this is actually
    what's going on
  • But this view is mathematically convenient!

[Diagram: in the classic noisy channel, a source sends a message through a noisy channel to a destination. By analogy, the information need (source) is encoded by the query formulation process (the channel) and arrives as query terms (destination).]
16
Problems with term-based retrieval
  • Synonymy
  • Power law vs. Zipf distribution
  • Polysemy
  • Saturn
  • Ambiguity
  • What do frogs eat?

17
Polysemy and Context
  • Document similarity at the single-word level:
    polysemy and context

18
Ambiguity
  • Different documents with the same keywords may
    have different meanings

What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
What do frogs eat? (keywords: frogs, eat)
  • Adult frogs eat mainly insects and other small
    animals, including earthworms, minnows, and
    spiders.
  • Alligators eat many kinds of small animals that
    live in or near the water, including fish,
    snakes, frogs, turtles, small mammals, and birds.
  • Some bats catch fish with their claws, and a few
    species eat lizards, rodents, small birds, tree
    frogs, and other bats.
Only the first passage answers the question, even though all three match the keywords.
19
Indexing Word Synsets/Senses
  • How does indexing word senses solve the
    synonym/polysemy problem?
  • Okay, so where do we get the word senses?
  • WordNet: a lexical database for standard English
  • Automatically find clusters of words that
    describe the same concepts

{dog, canine, doggy, puppy, etc.} → concept 112986
"I deposited my check in the bank." bank → concept 76529
"I saw the sailboat from the bank." bank → concept 53107
http://wordnet.princeton.edu/
20
Example Contextual Word Similarity
Use mutual information between a word and its context words (formula below)
Dagan et al., Computer Speech and Language, 1995
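The figure for this slide is not in the transcript; as a reminder, the (pointwise) mutual information between a word x and a context word y, estimated from co-occurrence counts, is commonly written as:

```latex
I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
```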
21
Word Sense Disambiguation
  • Given a word in context, automatically determine
    its sense (concept)
  • This is the Word Sense Disambiguation (WSD)
    problem
  • Context is the key:
  • For each ambiguous word, note the surrounding
    words
  • Learn a classifier from a collection of
    examples
  • Use the classifier to determine the senses of
    words in the documents

bank {river, sailboat, water, etc.} → side of a river
bank {check, money, account, etc.} → financial institution
22
Example Unsupervised WSD
  • Hypothesis: the same sense of a word will have
    similar neighboring words
  • Disambiguation algorithm (a sketch follows this list):
  • Identify context vectors corresponding to all
    occurrences of a particular word
  • Partition them into regions of high density
  • Assign a sense to each such region
  • Sit on a chair
  • Take a seat on this chair
  • The chair of the Math Department
  • The chair of the meeting
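A minimal sketch of the clustering step, assuming bag-of-words context vectors and scikit-learn's KMeans as the density-partitioning step; the sentences and the number of clusters are illustrative assumptions, not the algorithm from the slide's source:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Occurrences of the ambiguous word "chair", each with its surrounding words.
contexts = [
    "sit on a chair in the kitchen",
    "take a seat on this chair",
    "the chair of the math department",
    "the chair of the meeting spoke first",
]

# Step 1: context vectors for all occurrences of the word.
vectors = CountVectorizer(stop_words="english").fit_transform(contexts)

# Step 2: partition them into regions (here simply k-means with k = 2 senses).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Step 3: treat each cluster as one sense of "chair".
for sentence, sense in zip(contexts, labels):
    print(f"sense {sense}: {sentence}")
```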

23
Does it help retrieval?
  • Not really
  • Examples of limited success:

Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of the ACL 2000 Workshop on Recent Advances in NLP and IR.
And others.
24
Why Disambiguation Can Hurt
  • Bag-of-words techniques already disambiguate:
  • Context for each term is established in the query
  • Heuristics (e.g., always choosing the most frequent
    sense) work better
  • WSD is hard!
  • Many words are highly polysemous, e.g., interest
  • Granularity of senses is often domain/application
    specific
  • Queries are short: not enough context for
    accurate WSD
  • WSD tries to improve precision:
  • But incorrect sense assignments hurt recall
  • Slight gains in precision do not offset large
    drops in recall

25
Outline
  • Text Information Retrieval: 10-minute overview
  • Problems with lexical retrieval
  • Synonymy, Polysemy, Ambiguity
  • A partial solution: word synsets, WSD
  • Towards concept retrieval
  • LSI
  • Language Models for IR
  • PLSI
  • Towards real semantic search
  • Entities, Relations, Facts, Events

26
Latent Semantic Analysis
  • Perform a low-rank approximation of the document-term
    matrix (typical rank 100 to 300); see the sketch after this list
  • General idea:
  • Map documents (and terms) to a low-dimensional
    representation.
  • Design the mapping so that the low-dimensional
    space reflects semantic associations (the latent
    semantic space).
  • Compute document similarity based on the inner
    product in this latent semantic space
  • Goals:
  • Similar terms map to similar locations in the low-dimensional space
  • Noise reduction by dimension reduction
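A minimal sketch of the low-rank approximation with numpy; the toy term-document matrix and the rank k = 2 are illustrative (real collections use rank 100 to 300 as noted above):

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
A = np.array([
    [2, 0, 1, 0],   # "ship"
    [1, 0, 0, 0],   # "boat"
    [0, 2, 0, 1],   # "car"
    [0, 1, 0, 2],   # "truck"
], dtype=float)

# Low-rank approximation via truncated SVD: keep only k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T        # documents in latent space

# Document similarity as cosine (inner product of normalized vectors) in that space.
normed = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
print(np.round(normed @ normed.T, 2))
```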

27
Latent Semantic Analysis
  • Latent semantic space illustrating example

courtesy of Susan Dumais
28
(No Transcript)
29
Simplistic picture
Topic 1
Topic 2
Topic 3
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
Some (old) empirical evidence
  • Precision at or above median TREC precision
  • Top scorer on almost 20 TREC 1,2,3 topics (c.f.
    1990)
  • Slightly better on average than original vector
    space
  • Effect of dimensionality

36
(No Transcript)
37
Problems with term-based retrieval
  • Synonymy
  • Power law vs. Zipf distribution
  • Polysemy
  • Saturn
  • Ambiguity
  • What do frogs eat?

38
Outline
  • Text Information Retrieval: 5-minute overview
  • Problems with lexical retrieval
  • Synonymy, Polysemy, Ambiguity
  • A partial solution: synonym lookup
  • Towards concept retrieval
  • LSI
  • Language Models for IR
  • PLSI
  • Towards real semantic search
  • Entities, Relations, Facts, Events

39
IR based on Language Model (LM)
[Diagram: documents d1, d2, ..., dn in the document collection; the query is generated from a document's model.]
  • A common search heuristic is to use words that
    you expect to find in matching documents as your
    query. (Why, I saw Sergey Brin advocating that
    strategy on late-night TV one night in my hotel
    room, so it must be good!)
  • The LM approach directly exploits that idea!
40
Formal Language (Model)
  • Traditional generative model: generates strings
  • Finite state machines or regular grammars, etc.
  • Example:

I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish
...
(but not: wish I wish)
[Diagram: two-state automaton looping between "I" and "wish"]
41
Stochastic Language Models
  • Models the probability of generating strings in the
    language (commonly all strings over an alphabet Σ)

Model M:
P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01, P(said) = 0.03, P(likes) = 0.02

s = "the man likes the woman"
P(s|M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
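A tiny sketch of that computation, using the model probabilities from the slide:

```python
import math

# Unigram model M from the slide.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_probability(s, model):
    """P(s|M) under a unigram model: product of the per-word probabilities."""
    return math.prod(model.get(word, 0.0) for word in s.split())

print(string_probability("the man likes the woman", M))  # ~8e-08
```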
42
Stochastic Language Models
  • Model probability of generating any string

Model M1: P(the) = 0.2, P(class) = 0.0001, P(sayst) = 0.03, P(pleaseth) = 0.02, P(yon) = 0.1, P(maiden) = 0.01, P(woman) = 0.0001
Model M2: P(the) = 0.2, P(class) = 0.01, P(sayst) = 0.0001, P(pleaseth) = 0.0001, P(yon) = 0.0001, P(maiden) = 0.0005, P(woman) = 0.01
P(s|M2) < P(s|M1)
43
Stochastic Language Models
  • A statistical model for generating text
  • Probability distribution over strings in a given
    language

M
44
Unigram and higher-order models
  • Unigram Language Models
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
  • Grammar-based models (PCFGs), etc.
  • Probably not the first thing to try in IR

Easy. Effective! (See the factorizations below.)
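For concreteness, the factorizations these two model families assume (standard definitions, not reproduced on the slide):

```latex
\text{Unigram:}\quad P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)
\qquad
\text{Bigram:}\quad P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)
```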
45
Using Language Models in IR
  • Treat each document as the basis for a model
    (e.g., unigram sufficient statistics)
  • Rank document d based on P(d|q)
  • P(d|q) = P(q|d) × P(d) / P(q)
  • P(q) is the same for all documents, so ignore it
  • P(d), the prior, is often treated as the same for
    all d
  • But we could use criteria like authority, length,
    genre
  • P(q|d) is the probability of q given d's model
  • Very general formal approach

46
The fundamental problem of LMs
  • Usually we don't know the model M
  • But we have a sample of text representative of that
    model
  • Estimate a language model from that sample
  • Then compute the observation probability

M
47
Language Models for IR
  • Language Modeling Approaches
  • Attempt to model query generation process
  • Documents are ranked by the probability that a
    query would be observed as a random sample from
    the respective document model
  • Multinomial approach

48
Retrieval based on probabilistic LM
  • Treat the generation of queries as a random
    process.
  • Approach
  • Infer a language model for each document.
  • Estimate the probability of generating the query
    according to each of these models.
  • Rank the documents according to these
    probabilities.
  • Usually a unigram estimate of words is used
  • Some work on bigrams, paralleling van Rijsbergen

49
Retrieval based on probabilistic LM
  • Intuition
  • Users
  • Have a reasonable idea of terms that are likely
    to occur in documents of interest.
  • They will choose query terms that distinguish
    these documents from others in the collection.
  • Collection statistics
  • Are integral parts of the language model.
  • Are not used heuristically as in many other
    approaches.
  • In theory. In practice, there's usually some
    wiggle room for empirically set parameters

50
(No Transcript)
51
(No Transcript)
52
Query generation probability
  • Ranking formula
  • The probability of producing the query given the
    language model of document d, using MLE, is (shown below)

Unigram assumption: given a particular language
model, the query terms occur independently
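The formula itself is an image in the transcript; the standard maximum-likelihood query-likelihood form it refers to is (with tf_{t,d} the count of term t in d and |d| the document length):

```latex
\hat{P}(q \mid M_d) \;=\; \prod_{t \in q} \hat{P}_{mle}(t \mid M_d) \;=\; \prod_{t \in q} \frac{tf_{t,d}}{|d|}
```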
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
Smoothing (continued)
  • There's a wide space of approaches to smoothing
    probability distributions to deal with this
    problem, such as adding 1, ½, or a small constant to counts,
    Dirichlet priors, discounting, and interpolation
    [Chen and Goodman, 1998]
  • Another simple idea that works well in practice
    is to use a mixture of the document
    multinomial and the collection multinomial
    distribution

57
Smoothing Mixture model
  • P(w|d) = λ Pmle(w|Md) + (1 − λ) Pmle(w|Mc)
  • Mixes the probability from the document with the
    general collection frequency of the word
  • Correctly setting λ is very important:
  • A high value of λ makes the search
    conjunctive-like: suitable for short queries
  • A low value is more suitable for long queries
  • Can tune λ to optimize performance
  • Perhaps make it dependent on document size (cf.
    Dirichlet prior or Witten-Bell smoothing)

58
Basic mixture model summary
  • General formulation of the LM for IR
  • The user has a document in mind, and generates
    the query from this document.
  • The equation represents the probability that the
    document that the user had in mind was in fact
    this one.

[Formula (image): a mixture of the general (collection) language model and the individual-document model]
59
Example
  • Document collection (2 documents):
  • d1: Xerox reports a profit but revenue is down
  • d2: Lucent narrows quarter loss but revenue
    decreases further
  • Model: MLE unigram from documents; λ = ½
  • Query: revenue down
  • P(Q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2]
    = 1/8 × 3/32 = 3/256
  • P(Q|d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2]
    = 1/8 × 1/32 = 1/256
  • A component of the model is missing: what is it, and
    why? (a sketch of the computation follows)
  • Ranking: d1 > d2
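A small sketch that reproduces these numbers with the λ = ½ mixture from the previous slide; the code and helper names are illustrative, not the presenter's:

```python
from fractions import Fraction

docs = {
    "d1": "Xerox reports a profit but revenue is down".lower().split(),
    "d2": "Lucent narrows quarter loss but revenue decreases further".lower().split(),
}
collection = [w for words in docs.values() for w in words]
lam = Fraction(1, 2)

def p_mixture(word, doc_words):
    """lambda * P_mle(w|Md) + (1 - lambda) * P_mle(w|Mc)."""
    p_doc = Fraction(doc_words.count(word), len(doc_words))
    p_col = Fraction(collection.count(word), len(collection))
    return lam * p_doc + (1 - lam) * p_col

def query_likelihood(query, doc_words):
    prob = Fraction(1)
    for word in query.lower().split():
        prob *= p_mixture(word, doc_words)
    return prob

for name, words in docs.items():
    print(name, query_likelihood("revenue down", words))
# d1 3/256, d2 1/256 -> d1 ranks above d2, as on the slide
```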

60
Language Models for IR Tasks
  • Cross-lingual IR
  • Distributed IR
  • Structured doc retrieval
  • Personalization
  • Modelling redundancy
  • Predicting query difficulty
  • Predicting information extraction accuracy
  • PLSI

61
Standard Probabilistic IR
[Diagram: the information need becomes a query, which is matched against documents d1, d2, ..., dn in the document collection.]
62
IR based on Language Model (LM)
[Diagram: documents d1, d2, ..., dn in the document collection; the query is generated from a document's model.]
  • A common search heuristic is to use words that
    you expect to find in matching documents as your
    query. (Why, I saw Sergey Brin advocating that
    strategy on late-night TV one night in my hotel
    room, so it must be good!)
  • The LM approach directly exploits that idea!
63
Collection-Topic-Document Model
[Diagram: the query is generated from a combination of models over the document collection d1, d2, ..., dn (collection, topic, and document levels).]
64
Collection-Topic-Document model
  • 3-level model:
  • Whole-collection model
  • Specific-topic model (relevant-documents model)
  • Individual-document model
  • Relevance hypothesis:
  • A request (query, topic) is generated from a
    specific-topic model
  • Iff a document is relevant to the topic, the same
    model will apply to the document
  • It will replace part of the individual-document
    model in explaining the document
  • The probability of relevance of a document:
  • The probability that this model explains part of
    the document
  • The probability that the collection + topic + document
    combination explains the document better than the
    collection + document combination

65
Outline
  • Text Information Retrieval: 5-minute overview
  • Problems with lexical retrieval
  • Synonymy, Polysemy, Ambiguity
  • A partial solution: synonym lookup
  • Towards concept retrieval
  • LSI
  • Language Models for IR
  • PLSI
  • Towards real semantic search
  • Entities, Relations, Facts, Events

66
Probabilistic LSI
  • Uses the LSI idea, but grounded in probability theory
  • Comes from the statistical Aspect Language Model
  • Generates a co-occurrence model based on a
    non-observed (latent) class
  • This is a mixture model:
  • Models a distribution through a mixture (weighted
    sum) of other distributions
  • Independence assumptions:
  • Observed pairs (doc, word) are generated randomly
  • Conditional independence: conditioned on the latent
    class, words are generated independently of the
    document

67
Aspect Model
K chosen in advance (how many topics are in the collection?)
  • Generation process:
  • Choose a doc d with probability P(d); there are N d's
  • Choose a latent class z with (generated) probability
    P(z|d); there are K z's, with K fixed in advance
  • Generate a word w with (generated) probability P(w|z)
  • This creates the pair (d, w), without direct concern
    for z
  • Joining the probabilities (remember: P(z|d) means the
    probability of z, given d) gives the joint shown below
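The formula itself is an image in the transcript; the standard aspect-model joint it refers to is:

```latex
P(d, w) \;=\; P(d)\,P(w \mid d), \qquad\text{where}\quad P(w \mid d) \;=\; \sum_{z} P(w \mid z)\,P(z \mid d)
```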
68
Aspect Model (2)
  • Applying Bayes' theorem (rewritten form below)
  • This is conceptually different from LSI
  • Word distribution P(w|d) is based on a combination of
    specific classes/factors/aspects, P(w|z)
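The rewritten form (also an image in the transcript) is, in the usual symmetric parameterization obtained via Bayes' theorem:

```latex
P(d, w) \;=\; \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)
```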

69
Detour EM Algorithm
  • Tune parameters of distributions with
    missing/hidden data
  • Topics are the hidden classes
  • Extremely useful, general technique

70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
(No Transcript)
74
(No Transcript)
75
(No Transcript)
76
Expectation Maximization
  • Sketch of an EM algorithm for pLSI (update equations below)
  • E-step: calculate posterior probabilities of z based
    on current parameter estimates
  • M-step: update parameter estimates based on the
    calculated probabilities
  • Problem: overfitting
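The update equations are on the untranscribed slides; for reference, the standard pLSI EM updates (with n(d, w) the co-occurrence count) are:

```latex
\text{E-step:}\quad P(z \mid d, w) \;=\; \frac{P(z)\,P(d \mid z)\,P(w \mid z)}{\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')}
\qquad
\text{M-step:}\quad
P(w \mid z) \propto \sum_{d} n(d, w)\,P(z \mid d, w), \quad
P(d \mid z) \propto \sum_{w} n(d, w)\,P(z \mid d, w), \quad
P(z) \propto \sum_{d,\,w} n(d, w)\,P(z \mid d, w)
```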

77
Similarities LSI and PLSI
  • Using intermediate, latent, non-observed data for
    classification (hence the "L")
  • Can compose a joint probability matrix similar to the LSI SVD:
  • U → U_hat = [P(d_i | z_k)]
  • V → V_hat = [P(w_j | z_k)]
  • S → S_hat = diag(P(z_k))_k
  • JP = U_hat · S_hat · V_hat^T
  • JP is similar to the SVD of the term-doc matrix N
  • Values are calculated probabilistically
78
Differences LSI and PLSI
  • Basis
  • LSI: term frequencies (usually); performs
    dimension reduction via projection or zeroing out
    weaker components
  • PLSI: statistical; generates a model of the
    probabilistic relation between W, D, and Z, and refines
    it until an effective model is produced

79
Experiment 128-factor decomposition
80
Experiments
81
pLSI Improves on LSI
  • Consistently better accuracy curves than LSI
  • TEM SVD, computationally
  • Better from a modeling sense
  • Uses likelihood of sampling and aims for
    maximization
  • SVD uses L2-norm or other implicit Gaussian
    noise assumption
  • More intuitive
  • Polysemy is recognizable
  • By viewing P(w|z)
  • Similar handling of synonymy

82
LSA, pLSA in Practice?
  • Only rumors (about Teoma using it)
  • Both LSA and pLSA are VERY expensive
  • LSA:
  • Running times of one day on 10K docs
  • pLSA:
  • M. Federico (ICASSP 2002, hermes.itc.it) used a
    corpus of 1.2 million newspaper articles with
    a vocabulary of 200K words, approximating pLSA
    using Non-negative Matrix Factorization (NMF)
  • 612 hours of CPU time (7 processors, 2.5
    hours/iteration, 35 iterations)
  • Do we need (P)LSI for web search?

83
Did we solve our problem?
84
Ambiguity
  • Different documents with the same keywords may
    have different meanings

What is the largest volcano in the Solar System? (keywords: largest, volcano, solar, system)
What do frogs eat? (keywords: frogs, eat)
  • Adult frogs eat mainly insects and other small
    animals, including earthworms, minnows, and
    spiders.
  • Alligators eat many kinds of small animals that
    live in or near the water, including fish,
    snakes, frogs, turtles, small mammals, and birds.
  • Some bats catch fish with their claws, and a few
    species eat lizards, rodents, small birds, tree
    frogs, and other bats.
Only the first passage answers the question, even though all three match the keywords.
85
What we need
  • Detect and exploit semantic relations between
    entities
  • A whole other lecture!