Transcript and Presenter's Notes

Title: Natural Language Processing for Information Retrieval


1
Natural Language Processing for Information Retrieval
  • Douglas W. Oard
  • College of Information Studies

2
Roadmap
  • IR overview
  • NLP for monolingual IR
  • NLP for cross-language IR

3
What do We Mean by Information?
  • Information is data in context
  • Databases contain data and produce information
  • IR systems contain and provide information
  • How is it different from Knowledge?
  • Knowledge is a basis for making decisions
  • Many knowledge bases contain decision rules

4
What Do We Mean by Retrieval?
  • Find something that you are looking for
  • 3 general categories
  • Known item search
  • Find the class home page
  • Answer seeking
  • Is Lexington or Louisville the capital of
    Kentucky?
  • Directed exploration
  • Who makes videoconferencing systems?

5
Retrieval System Model
(Diagram: the user moves through Query Formulation, Detection, Selection,
Examination, and Delivery; Docs pass through Indexing to build the Index
that Detection searches.)
6
Query Formulation
(Diagram: the Query Formulation stage of the system model, in which the
user's information need is expressed as a query and passed to Detection.)
7
Detection
  • Searches the index
  • Not the web!
  • Looks for words
  • Desired
  • Required ()
  • Together (...)
  • Ranks the results
  • Goal is best first

8
Selection
About 7381 documents match your query. 1. MAHEC
Videoconference Systems. Major Category. Product
Type. Product. Network System. Multipoint
Conference Server (MCS) PictureTel Prism - 8
port. . - size 5K - 6-Jun-97 - English - 2.
VIDEOCONFERENCING PRODUCTS. Aethra offers a
complete product line of multimedia and
videocommunications products to meet all the
applications needs of... - size 4K - 1-Jul-97 -
English -
9
Relevance Feedback
  • Query refinement based on search results

10
Examination
Aethra offers a complete product line of
multimedia and videocommunications products to
meet all the applications needs of users. The
standard product line is augmented by a bespoke
service to solve customer specific functional
requirements. Standard Videoconferencing Product
Line Vega 384 and Vega 128, the improved Aethra
Set-top systems, can be connected to any TV
monitor for high quality videoconferencing up to
384 Kbps. A compact and lightweight device,
VEGA is very easy to use and can be quickly
installed in any office and work environment.
Voyager is the first videoconference briefcase
designed for journalists, reporters, and people
on-the-go. It combines high quality
video-communication (up to 384 Kbps) with the
necessary reliability in a small and light
briefcase.
11
Delivery
  • Bookmark a page for later use
  • Email as a URL or as HTML
  • Cut and paste into a presentation
  • Print a hardcopy for later review

12
Human-Machine Synergy
  • Machines are good at
  • Doing simple things accurately and quickly
  • Scaling to larger collections in sublinear time
  • People are better at
  • Accurately recognizing what they are looking for
  • Evaluating intangibles such as quality
  • Humans and machines are pretty bad at
  • Mapping concepts into search terms

13
Detection Component Model
(Diagram: an information need is expressed as a query by query
formulation; a representation function (query processing) produces the
query representation, and a representation function (document processing)
produces the document representation; a comparison function computes a
retrieval status value, while human judgment of the document against the
information need determines its utility.)
14
Controlled Vocabulary Retrieval
  • A straightforward concept retrieval approach
  • Works equally well for non-text materials
  • Assign a unique descriptor to each concept
  • Can be done by hand for collections of limited
    scope
  • Assign some descriptors to each document
  • Practical for valuable collections of limited
    size
  • Use Boolean retrieval based on descriptors

15
Controlled Vocabulary Example
Document 1
The quick brown fox jumped over the lazy dog's
back.
Descriptors: Canine, Fox

Document 2
Now is the time for all good men to come to the
aid of their party.
Descriptors: Political action, Volunteerism

  Descriptor         Doc 1   Doc 2
  Canine               1       0
  Fox                  1       0
  Political action     0       1
  Volunteerism         0       1

  • Canine AND Fox
  • Doc 1
  • Canine AND Political action
  • Empty
  • Canine OR Political action
  • Doc 1, Doc 2
16
Challenges
  • Thesaurus design is expensive
  • Shifting concepts generate continuing expense
  • Manual indexing is even more expensive
  • And consistent indexing is very expensive
  • User needs are often difficult to anticipate
  • Challenge for thesaurus designers and indexers
  • End users find thesauri hard to use
  • Codesign problem with query formulation

17
Bag of Words Representation
  • Simple strategy for representing documents
  • Count how many times each term occurs
  • A term is any lexical item that you choose
  • A fixed-length sequence of characters (an
    n-gram)
  • A word (delimited by white space or
    punctuation)
  • Some standard root form of each word (e.g., a
    stem)
  • A phrase (e.g., phrases listed in a dictionary)
  • Counts can be recorded in any consistent order

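A minimal Python sketch, not part of the original slides, of the counting step described above; the tokenizer and the tiny stopword list are illustrative assumptions, not the ones used in the deck.

    from collections import Counter
    import re

    STOPWORDS = {"for", "is", "of", "the", "to"}   # illustrative stopword list

    def bag_of_words(text):
        # Tokenize on letters only, lowercase, and drop stopwords
        terms = re.findall(r"[a-z]+", text.lower())
        return Counter(t for t in terms if t not in STOPWORDS)

    doc2 = "Now is the time for all good men to come to the aid of their party."
    print(bag_of_words(doc2))
    # Counter({'now': 1, 'time': 1, 'all': 1, 'good': 1, 'men': 1, 'come': 1,
    #          'aid': 1, 'their': 1, 'party': 1})
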
18
Bag of Words Example
Document 1
The quick brown fox jumped over the lazy dog's
back.

Document 2
Now is the time for all good men to come to the
aid of their party.

Stopword list: for, is, of, the, to

  Indexed Term   Document 1   Document 2
  aid                 0            1
  all                 0            1
  back                1            0
  brown               1            0
  come                0            1
  dog                 1            0
  fox                 1            0
  good                0            1
  jump                1            0
  lazy                1            0
  men                 0            1
  now                 0            1
  over                1            0
  party               0            1
  quick               1            0
  their               0            1
  time                0            1
19
Why Boolean Retrieval Works
  • Boolean operators approximate natural language
  • Find documents about a good party that is not
    over
  • AND can discover relationships between concepts
  • good party
  • OR can discover alternate terminology
  • excellent party
  • NOT can discover alternate meanings
  • Democratic party

20
Proximity Operators
  • More precise versions of AND
  • NEAR n allows at most n-1 intervening terms
  • WITH requires terms to be adjacent and in order
  • Easy to implement, but less efficient
  • Store a list of positions for each word in each
    doc
  • Stopwords become very important!
  • Perform normal Boolean computations
  • Treat WITH and NEAR like AND with an extra
    constraint

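A minimal sketch, not from the original slides, of how NEAR and WITH can be evaluated over stored position lists; the function names and the toy document are illustrative.

    def positions(doc_tokens):
        # Positional index for one document: term -> sorted list of positions
        idx = {}
        for i, term in enumerate(doc_tokens):
            idx.setdefault(term, []).append(i)
        return idx

    def near(idx, a, b, n):
        # NEAR n: some occurrence of a and b with at most n-1 intervening
        # terms, in either order
        return any(abs(i - j) <= n for i in idx.get(a, []) for j in idx.get(b, []))

    def with_op(idx, a, b):
        # WITH: b immediately follows a (adjacent and in order)
        return any(j - i == 1 for i in idx.get(a, []) for j in idx.get(b, []))

    idx = positions("the quick brown fox jumped over the lazy dog".split())
    print(near(idx, "quick", "fox", 2), with_op(idx, "lazy", "dog"))   # True True
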
21
Ranked Retrieval Paradigm
  • Exact match retrieval often gives useless sets
  • No documents at all, or way too many documents
  • Query reformulation is one solution
  • Manually add or delete query terms
  • Best-first ranking can be superior
  • Select every document within reason
  • Put them in order, with the best ones first
  • Display them one screen at a time

22
Similarity-Based Queries
  • Treat the query as if it were a document
  • Create a query bag-of-words
  • Find the similarity of each document
  • Using the coordination measure, for example
  • Rank order the documents by similarity
  • Most similar to the query first
  • Surprisingly, this works pretty well!
  • Especially for very short queries

23
Document Similarity
  • How similar are two documents?
  • In particular, how similar is their bag of words?

  1 Nuclear fallout contaminated Siberia.
  2 Information retrieval is interesting.
  3 Information retrieval is complicated.

  Term           Doc 1   Doc 2   Doc 3
  complicated      0       0       1
  contaminated     1       0       0
  fallout          1       0       0
  information      0       1       1
  interesting      0       1       0
  nuclear          1       0       0
  retrieval        0       1       1
  siberia          1       0       0
24
Coordination Measure Example
  Term           Doc 1   Doc 2   Doc 3
  complicated      0       0       1
  contaminated     1       0       0
  fallout          1       0       0
  information      0       1       1
  interesting      0       1       0
  nuclear          1       0       0
  retrieval        0       1       1
  siberia          1       0       0

  Query: complicated retrieval; Result: 3, 2
  Query: interesting nuclear fallout; Result: 1, 2
  Query: information retrieval; Result: 2, 3
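A minimal sketch, not from the original slides, of the coordination measure: score each document by how many distinct query terms it contains, then rank best-first.

    def coordination(query_terms, doc_bag):
        # Number of distinct query terms that occur in the document
        return sum(1 for t in set(query_terms) if doc_bag.get(t, 0) > 0)

    docs = {
        1: {"nuclear": 1, "fallout": 1, "contaminated": 1, "siberia": 1},
        2: {"information": 1, "retrieval": 1, "interesting": 1},
        3: {"information": 1, "retrieval": 1, "complicated": 1},
    }
    query = ["complicated", "retrieval"]
    print(sorted(docs, key=lambda d: coordination(query, docs[d]), reverse=True))
    # [3, 2, 1]: doc 3 matches both query terms, doc 2 matches one, doc 1 none
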
25
Incorporating Term Frequency
  • High term frequency is evidence of meaning
  • And high IDF is evidence of term importance
  • Recompute the bag-of-words
  • Compute TF × IDF for every element (see the sketch below)

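A minimal sketch, not from the original slides, of recomputing the bag-of-words with TF × IDF weights; it uses the base-10 logarithm that the IDF values in the following example imply, and the three-document collection is illustrative.

    import math

    docs = {
        1: {"nuclear": 1, "fallout": 1, "contaminated": 1, "siberia": 1},
        2: {"information": 1, "retrieval": 1, "interesting": 1},
        3: {"information": 1, "retrieval": 1, "complicated": 1},
    }

    def idf(term):
        # log10(N / document frequency): rare terms get high weight,
        # terms that occur in every document get a weight of zero
        df = sum(1 for bag in docs.values() if term in bag)
        return math.log10(len(docs) / df)

    weighted = {d: {t: tf * idf(t) for t, tf in bag.items()}
                for d, bag in docs.items()}
    print(round(weighted[3]["complicated"], 3))   # 0.477 = 1 * log10(3/1)
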
26
TF × IDF Example
  (Table: raw term frequencies and TF × IDF weights for eight terms in a
  four-document collection. IDF values: complicated 0.301, contaminated
  0.125, fallout 0.125, information 0.000, interesting 0.602, nuclear
  0.301, retrieval 0.125, siberia 0.602.)

  Unweighted query contaminated retrieval: Result 2, 3, 1, 4
  Weighted query contaminated(3) retrieval(1): Result 1, 3, 2, 4
  IDF-weighted query contaminated retrieval: Result 2, 3, 1, 4
27
Document Length Normalization
  • Long documents have an unfair advantage
  • They use a lot of terms
  • So they get more matches than short documents
  • And they use the same words repeatedly
  • So they have much higher term frequencies

28
Cosine Normalization Example
  (Table: the TF × IDF weights from the previous example, each divided by
  its document's vector length. Document lengths for docs 1-4: 1.70, 0.97,
  2.67, 0.87.)

  Unweighted query contaminated retrieval:
  Result 2, 4, 1, 3 (compare to 2, 3, 1, 4 without length normalization)
29
Summary So Far
  • Find documents most similar to the query
  • Optionally, obtain query term weights
  • Given by the user, or computed from IDF
  • Compute document term weights
  • Some combination of TF and IDF
  • Normalize the document vectors
  • Cosine is one way to do this
  • Compute inner product of query and doc vectors
  • Multiply corresponding elements and then add

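A minimal sketch, not from the original slides, that puts the steps summarized above together: cosine-normalize each document's term-weight vector, take the inner product with the query vector, and sort best-first. The example weights are illustrative.

    import math

    def cosine_rank(query, docs):
        # query: {term: weight}; docs: doc id -> {term: tf*idf weight}
        scores = {}
        for d, vec in docs.items():
            length = math.sqrt(sum(w * w for w in vec.values())) or 1.0
            scores[d] = sum(q * vec.get(t, 0.0) / length
                            for t, q in query.items())
        return sorted(scores, key=scores.get, reverse=True)

    docs = {   # illustrative document term weights
        1: {"nuclear": 0.90, "fallout": 0.63, "siberia": 1.20},
        2: {"information": 0.00, "retrieval": 0.13, "interesting": 0.60},
        3: {"information": 0.00, "retrieval": 0.50, "complicated": 0.60},
    }
    print(cosine_rank({"contaminated": 1.0, "retrieval": 1.0}, docs))   # [3, 2, 1]
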
30
Passage Retrieval
  • Another approach to long-document problem
  • Break it up into coherent units
  • Recognizing topic boundaries is hard
  • But overlapping 300 word passages work fine
  • Document rank is best passage rank
  • And passage information can help guide browsing

31
Advantages of Ranked Retrieval
  • Closer to the way people think
  • Some documents are better than others
  • Enriches browsing behavior
  • Decide how far down the list to go as you read it
  • Allows more flexible queries
  • Long and short queries can produce useful results

32
Ranked Retrieval Challenges
  • Best first is easy to say but hard to do!
  • Probabilistic retrieval tries to approximate it
  • How can the user understand the ranking?
  • It is hard to use a tool that you don't
    understand
  • Efficiency may become a concern
  • More complex computations take more time

33
Evaluation Criteria
  • Effectiveness
  • Set, ranked list, user-machine system
  • Efficiency
  • Retrieval time, indexing time, index size
  • Usability
  • Learnability, novice use, expert use

34
What is Relevance?
  • Relevance relates a topic and a document
  • Duplicates are equally relevant by definition
  • Constant over time and across users
  • Pertinence relates a task and a document
  • Accounts for quality, complexity, language,
  • Utility relates a user and a document
  • Accounts for prior knowledge
  • We seek utility, but relevance is what we get!

35
IR Effectiveness Evaluation
  • System-centered strategy
  • Given documents, queries, and relevance judgments
  • Try several variations on the retrieval system
  • Measure which ranks more good docs near the top
  • User-centered strategy
  • Given several users, and at least 2 retrieval
    systems
  • Have each user try the same task on both systems
  • Measure which system works the best

36
Measures of Effectiveness
  • Good measures
  • Capture some aspect of what the user wants
  • Have predictive value for other situations
  • Different queries, different document collection
  • Are easily replicated by other researchers
  • Can be expressed as a single number
  • Allows two systems to be easily compared

37
IR Test Collections
  • Documents
  • Representative quantity
  • Representative sources and topics
  • Topics
  • Used to form queries
  • Relevance judgments
  • For each document, with respect to each topic
  • This is the expensive part!

38
Some Assumptions
  • Unchanging, known queries
  • The same queries are used by each system
  • Binary relevance
  • Every document is either relevant or it is not
  • Unchanging, known relevance
  • The relevance of each doc to each query is known
  • But only used for evaluation, not retrieval!
  • Focus on effectiveness

39
The Contingency Table
                  Retrieved              Not Retrieved
  Relevant        Relevant Retrieved     Relevant Rejected
  Not relevant    Irrelevant Retrieved   Irrelevant Rejected
40
The Precision-Recall Curve
  (Figure: a ranked list of 10 documents, 4 relevant (marked R) and 6 not
  relevant, together with the precision-recall curve it produces.)
41
  (Figure: single-number measures marked on the precision-recall curve:
  precision at recall = 0.1, average precision, the breakeven point, and
  precision at 10 documents.)
42
Single-Number MOE Weaknesses
  • Precision at 10 documents
  • Pays no attention to recall
  • Precision at constant recall
  • A specific recall fraction is rarely the user's
    goal
  • Breakeven point
  • Nobody ever searches at the breakeven point
  • Average precision
  • Users typically operate near an extreme of the
    curve
  • So the average is not very informative

43
Why Choose Average Precision?
  • It is easy to trade between recall and precision
  • Adding related query terms improves recall
  • But naive query expansion techniques kill
    precision
  • Limiting matches by part-of-speech helps
    precision
  • But it almost always hurts recall
  • Comparisons should give some weight to both
  • Average precision is a principled way to do this
  • Rewards improvements in either factor

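A minimal sketch, not from the original slides, of how average precision rewards both factors: precision is measured at the rank of each relevant document and averaged over all relevant documents.

    def average_precision(ranked_relevance, total_relevant=None):
        # ranked_relevance: 0/1 judgments in ranked order (1 = relevant);
        # pass total_relevant when some relevant documents were not retrieved
        hits, precision_sum = 0, 0.0
        for rank, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                precision_sum += hits / rank
        denominator = total_relevant if total_relevant else hits
        return precision_sum / denominator if denominator else 0.0

    print(average_precision([1, 0, 1, 0, 0, 1]))   # (1/1 + 2/3 + 3/6) / 3 ≈ 0.72
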
44
How Much is Enough?
  • The maximum average precision is 1.0
  • But inter-rater reliability is 0.8 or less
  • So 0.8 is a practical upper bound at every point
  • Precision near 0.8 is sometimes seen at low recall
  • Two goals
  • Achieve a meaningful amount of improvement
  • This is a judgment call, and depends on the
    application
  • Achieve that improvement reliably across queries
  • This can be verified using statistical tests

45
Obtaining Relevance Judgments
  • Exhaustive assessment can be too expensive
  • TREC has 50 queries for >1 million docs each year
  • Random sampling won't work either
  • If relevant docs are rare, none may be found!
  • IR systems can help focus the sample
  • Each system finds some relevant documents
  • Different systems find different relevant
    documents
  • Together, enough systems will find most of them

46
Pooled Assessment Methodology
  • Each system submits top 1000 documents
  • Top 100 documents for each are judged
  • All are placed in a single pool
  • Duplicates are eliminated
  • Placed in an arbitrary order to avoid bias
  • Evaluated by the person that wrote the query
  • Assume unevaluated documents not relevant
  • Overlap evaluation shows diminishing returns
  • Compute average precision over all 1000 docs

47
Lessons From TREC
  • Incomplete judgments are useful
  • If sample is unbiased with respect to systems
    tested
  • Different relevance judgments change absolute
    score
  • But rarely change comparative advantages when
    averaged
  • Evaluation technology is predictive
  • Results transfer to operational settings

Adapted from a presentation by Ellen Voorhees at
the University of Maryland, March 29, 1999
48
What Query Averaging Hides
49
Roadmap
  • IR overview
  • NLP for monolingual IR
  • NLP for cross-language IR

50
Machine Assisted Indexing
  • Automatically suggest controlled vocabulary
  • Better consistency with lower cost
  • Typically use a rule-based expert system
  • Design thesaurus by hand in the usual way
  • Design an expert system to process text
  • String matching, proximity operators,
  • Write rules for each thesaurus/collection/language
  • Try it out and fine tune the rules by hand

51
Text Categorization
  • Fully automatic controlled vocabulary
  • Typically based on machine learning
  • Assign descriptors manually for a training set
  • Design a learning algorithm to find and use patterns
  • Bayesian classifier, neural network, genetic
    algorithm, ...
  • Present new documents
  • System assigns descriptors like those in training
    set

52
Representing Electronic Texts
  • A character set specifies semantic units
  • Characters are the smallest units of meaning
  • Abstract entities, separate from their
    representation
  • A font specifies the printed representation
  • What each character will look like on the page
  • Different characters might be depicted
    identically
  • An encoding is the electronic representation
  • What each character will look like in a file
  • One character may have several representations
  • An input method is a keyboard representation

53
Language/Character Set Identification
  • Can be specified using metadata
  • Included in HTTP and HTML
  • Can be determined using word-scale features
  • Which dictionary gets the most hits?
  • Can be determined using subword features
  • Letter n-grams, for example

54
Units of Meaning
  • Topical retrieval is a search for concepts
  • But what we index are character strings
  • What strings best represent concepts?
  • In English, words are a good starting point
  • But stems and well chosen phrases can be better
  • In German, compounds may need to be split
  • Otherwise queries using constituent words would
    fail
  • In Chinese, word boundaries are not marked
  • This segmentation problem is similar to that of
    speech

55
Stemming
  • Suffix removal allows word variants to match
  • In English, word roots often precede modifiers
  • Roots often convey topicality better
  • Boolean systems often allow manual truncation
  • limit? -> limit, limits, limited, limitation, ...
  • Stemming does automatic truncation
  • Much cheaper than complete morphology
  • On average, just as effective for retrieval

56
Porter Stemmer
  • Nine step process, 1 to 21 rules per step
  • Within each step, only the first valid rule fires
  • Rules rewrite suffixes. Example:

/* each rule: id, old suffix, new suffix, old offset, new offset,
   minimum root size, condition function */
static RuleList step1a_rules[] =
  {
    {101, "sses", "ss",    3,  1, -1, NULL},
    {102, "ies",  "i",     2,  0, -1, NULL},
    {103, "ss",   "ss",    1,  1, -1, NULL},
    {104, "s",    LAMBDA,  0, -1, -1, NULL},
    {000, NULL,   NULL,    0,  0,  0, NULL},
  };
57
Phrase Formation
  • Two types of phrases
  • Compositional composition of word meanings
  • Noncompositional idiomatic expressions
  • e.g., kick the bucket or buy the farm
  • Three sources of evidence
  • Dictionary lookup
  • Parsing
  • Cooccurrence

58
Semantic Phrases
  • Same idea as longest substring match
  • But look for word (not character) sequences
  • Compile a term list that includes phrases
  • Technical terminology can be very helpful
  • Index any phrase that occurs in the list
  • Most effective in a limited domain
  • Otherwise hard to capture most useful phrases

59
Syntactic Phrases
  • Automatically construct sentence diagrams
  • Fairly good parsers are becoming available
  • Index the noun phrases
  • Assumes that queries will focus on objects

(Parse tree for "The quick brown fox jumped over the lazy dog's back": a
sentence whose subject noun phrase is "the quick brown fox" and whose verb
"jumped" takes the prepositional phrase "over the lazy dog's back", which
itself contains a noun phrase.)
60
Statistical Phrases
  • Compute observed occurrence probability
  • For each single word and each word n-gram
  • "buy" 10 times in 1000 words yields 0.01
  • "the" 100 times in 1000 words yields 0.10
  • "farm" 5 times in 1000 words yields 0.005
  • "buy the farm" 4 times in 1000 words yields 0.004
  • Compute n-gram probability if truly independent
  • 0.01 × 0.10 × 0.005 = 0.000005
  • Compare with observed probability
  • Keep phrases that occur more often than expected

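A minimal sketch, not from the original slides, of keeping n-grams whose observed probability is well above what independence would predict; the threshold ratio is an illustrative assumption.

    from collections import Counter

    def statistical_phrases(tokens, n=3, ratio=2.0):
        # Keep n-grams observed at least `ratio` times more often than the
        # product of their word probabilities would predict
        total = len(tokens)
        unigram = Counter(tokens)
        ngrams = Counter(tuple(tokens[i:i + n]) for i in range(total - n + 1))
        phrases = []
        for gram, count in ngrams.items():
            observed = count / total
            expected = 1.0
            for word in gram:
                expected *= unigram[word] / total
            if observed > ratio * expected:
                phrases.append(" ".join(gram))
        return phrases

    # From the slide: P(buy)=0.01, P(the)=0.10, P(farm)=0.005, so the expected
    # independent probability of "buy the farm" is 0.000005, far below the
    # observed 0.004, and the phrase is kept.
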
61
Phrase Indexing Lessons
  • Poorly chosen phrases hurt effectiveness
  • And some techniques can be slow (e.g., parsing)
  • Better to index phrases and words
  • Want to find constituents of compositional
    phrases
  • Better weighting schemes benefit less
  • Negligible improvement in some TREC-6 systems
  • Very helpful for cross-language retrieval
  • Noncompositional translation, less ambiguity

62
Longest Substring Segmentation
  • A greedy segmentation algorithm
  • Based solely on lexical information
  • Start with a list of every possible term
  • Dictionaries are a handy source for term lists
  • For each unsegmented string
  • Remove the longest single substring in the list
  • Repeat until no substrings are found in the list
  • Can be extended to explore alternatives

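A minimal sketch, not from the original slides, of greedy longest-substring segmentation: find the longest lexicon entry anywhere in the string, split on it, and recurse on what remains.

    def longest_substring_segment(text, lexicon):
        # Greedy, lexical-only segmentation; unmatched fragments are kept as-is
        matches = [term for term in lexicon if term in text]
        if not matches:
            return [text] if text else []
        best = max(matches, key=len)
        before, _, after = text.partition(best)
        return (longest_substring_segment(before, lexicon)
                + [best]
                + longest_substring_segment(after, lexicon))

    lexicon = {"ach", "hin", "hing", "sei", "ton", "was", "wasch"}
    print(longest_substring_segment("washington", lexicon))   # ['was', 'hing', 'ton']
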
63
Longest Substring Example
  • Possible German compound term
  • washington
  • List of German words
  • ach, hin, hing, sei, ton, was, wasch
  • Longest substring segmentation
  • was-hing-ton
  • Word n-grams might recognize this as bad
  • Roughly translates to "What tone is attached?"

64
Model-Based Segmentation
  • Choose a model (or set of models)
  • Possible segmentation points, for example
  • Assemble evidence
  • Lexicons, corpora, algorithms, user knowledge
  • Choose a preference criterion
  • Longest substring, for example
  • Choose a search strategy
  • Greedy, exhaustive, dynamic programming

65
Segmentation Models
  • Unique segmentation
  • Decide whether to put a boundary at each point
  • Plausible segmentation
  • Produce all plausible substrings
  • Plausible interpretation
  • Produce all plausible implied substrings
  • Contractions, alternate transliterations, etc.

66
Sources of Evidence
  • Lexicons
  • Dictionaries, term lists, name lists, gazetteers
  • Corpus statistics
  • Segmented or unsegmented
  • Within-document, within-collection, general use
  • Algorithms
  • Transliteration rules, name cues, syntax analysis
  • User knowledge
  • Forced join, forced split

67
Preference Criterion
  • A basis for optimization
  • Single-valued, automatically computed
  • Integrates multiple desiderata
  • Longest substrings
  • Unknown word identification
  • Out of vocabulary, transliteration
  • Consistent usage
  • User knowledge

68
Search Strategy
  • Greedy
  • Choose one search order, decide as you go
  • Fast, but suboptimal
  • Exhaustive
  • Check all alternatives, choose the best
  • Generally too slow to be practical
  • Dynamic programming
  • Use limited lookahead to generate several options
  • Allows you to balance speed and accuracy

69
Synonymy and Homonymy
  • Word matching suffers from two problems
  • Synonymy many words with similar meanings
  • Homonymy one word has dissimilar meanings
  • Disambiguation seeks to resolve homonymy
  • By indexing word senses rather than words
  • Synonymy can be addressed by
  • Thesaurus-based query expansion
  • Latent semantic indexing

70
Word Sense Disambiguation
  • Context provides clues to word meaning
  • The doctor removed the appendix.
  • For each occurrence, note surrounding words
  • Typically +/- 5 non-stopwords
  • Group similar contexts into clusters
  • Based on overlaps in the words that they contain
  • Separate clusters represent different senses

71
Disambiguation Example
  • Consider four example sentences
  • The doctor removed the appendix
  • The appendix was incomprehensible
  • The doctor examined the appendix
  • The appendix was removed
  • What clusters can you find?
  • Can you find enough word senses this way?
  • Might you find too many word senses?

72
Why Disambiguation Hurts
  • Bag-of-words techniques already disambiguate
  • When more words are present, documents rank
    higher
  • So a context for each term is established in the
    query
  • Same reason that passages are better than
    documents
  • Formal disambiguation tries to improve precision
  • But incorrect sense assignments would hurt recall
  • Average precision balances recall and precision
  • But the possible precision gains are small
  • And present techniques substantially hurt recall

73
Latent Semantic Indexing
  • Term vectors can reveal term dependence
  • Look at the matrix as a bag of documents
  • Compute term similarities using cosine measure
  • Reduce the number of dimensions
  • Assign similar terms to a single composite
  • Map the composite term to a single dimension

74
LSI Transformation
  (Table: a binary term-by-document matrix for terms t1-t19 and documents
  d1-d4, shown next to each term's vector in the reduced LSI space k1-k4;
  terms that occur in the same documents, such as t1, t2, t5, t6, and t11,
  are mapped to identical reduced vectors, e.g. (0.11, 0.39, 0.07).)
75
Computing Similarity
  • First choose k
  • Never greater than the number of docs or terms
  • Add the weighted vectors for each term
  • Multiply each vector by term weight
  • Sum each element separately
  • Do the same for query or second document
  • Compute inner product
  • Multiply corresponding elements and add

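A minimal sketch, not from the original slides, of this similarity computation: reduce a toy term-by-document matrix with a truncated SVD, fold a query (or document) in as the weighted sum of its terms' reduced vectors, and take the inner product. The matrix values and the choice of k are illustrative.

    import numpy as np

    A = np.array([[1, 0, 0, 1],        # toy term-by-document matrix
                  [0, 1, 1, 0],        # (5 terms x 4 documents)
                  [0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1]], dtype=float)

    k = 2                               # number of latent dimensions to keep
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    T = U[:, :k]                        # one reduced vector per term

    def fold_in(term_weights):
        # Sum the reduced term vectors, each multiplied by its term weight
        return term_weights @ T

    q = fold_in(np.array([0, 1, 1, 0, 0], dtype=float))   # query on terms 2 and 3
    d = fold_in(A[:, 1])                                   # second document
    print(float(q @ d))                                    # inner product in LSI space
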
76
LSI Example
  (Table: the reduced vectors of the terms occurring in documents d2 and d3
  are multiplied by their term weights and summed to give one composite
  vector per document, (2.72, -0.55, -1.10) for d2 and (2.18, -0.85, 1.26)
  for d3; their inner product is 6.40 using dimensions k1 and k2, and 5.92
  using k1 alone.)
77
Benefits of LSI
  • Removing dimensions can improve things
  • Assigns similar vectors to similar terms
  • Queries and new documents easily added
  • Folding in as weighted sum of term vectors
  • Gets the same cosines with shorter vectors

78
Weaknesses of LSI
  • Words with several meanings confound LSI
  • Places them at the midpoint of the right
    positions
  • LSI vectors are dense
  • Sparse vectors (TF × IDF) have several advantages
  • The required computations are expensive
  • But T matrix and doc vectors are done in advance
  • Query vector and cosine at query time
  • The cosine may not be the best measure
  • Pivoted normalization can probably help

79
Roadmap
  • IR overview
  • NLP for monolingual IR
  • NLP for cross-language IR

80
Cross-Language IR
Allow anyone to find information that is
expressed in any language
81
Widely Spoken Languages
Source: http://www.g11n.com/faq.html
82
Web Content
Source: Network Wizards Jan. 99 Internet Domain Survey
83
(Diagram: supporting access and use of information across languages: an
English query drives cross-language search for selection, cross-language
browsing for examination, and machine translation to produce an English
document for use.)
84
Multilingual Retrieval Architecture
(Diagram: a Chinese query is posed against a mixed collection; language
identification separates the documents, Chinese term selection feeds
monolingual Chinese retrieval (ranked list: 1 0.72, 2 0.48), and English
term selection feeds cross-language retrieval (ranked list: 3 0.91,
4 0.57, 5 0.36).)
85
Document-Language Retrieval
(Diagram: Chinese query terms are mapped by query vector translation into
English document terms, and monolingual English retrieval produces the
ranked list 3 0.91, 4 0.57, 5 0.36.)
86
Query-Language Retrieval
(Diagram: English document terms are mapped by document vector translation
into Chinese, and monolingual Chinese retrieval of the Chinese query terms
produces the ranked list 3 0.91, 4 0.57, 5 0.36.)
87
Query vs. Document Translation
  • Query translation
  • Very efficient for short queries
  • Not as big an advantage for relevance feedback
  • Hard to resolve ambiguous query terms
  • Document translation
  • May be needed by the selection interface
  • And supports adaptive filtering well
  • Slow, but only need to do it once per document
  • Poor scale-up to large numbers of languages

88
Interlingual Retrieval
(Diagram: Chinese query terms undergo query vector translation and English
document terms undergo document vector translation into a shared
interlingua, where interlingual retrieval produces the ranked list
3 0.91, 4 0.57, 5 0.36.)
89
Cognate-Based Retrieval
(Diagram: Chinese query terms are matched directly against English
document terms by cognate matching, producing the ranked list 3 0.91,
4 0.57, 5 0.36.)
90
Three Key Challenges
(Example: three problems in translating a Chinese query. Some terms have
several possible translations ("oil" vs. "petroleum"; "probe" vs. "survey"
vs. "take samples"): which translation? Some terms, such as "goeringii"
and "cymbidium", have no translation in the lexicon at all. And wrong
segmentation of the Chinese text can yield an unrelated translation such
as "restrain".)
91
Cross-Language Text Retrieval
Translation can be applied to the query or to the documents, as full text
translation or as term vector translation. Approaches to the translation
knowledge form a taxonomy:

  Cross-Language Text Retrieval
    Controlled Vocabulary
    Free Text
      Knowledge-based
        Ontology-based / Thesaurus-based
        Dictionary-based
      Corpus-based
        Term-aligned
        Sentence-aligned
        Document-aligned
          Parallel
          Comparable
        Unaligned
92
Translation Knowledge
  • A lexicon
  • e.g., extract term list from a bilingual
    dictionary
  • Corpora
  • Parallel or comparable, linked or unlinked
  • Algorithmic
  • e.g., transliteration rules, cognate matching
  • The user

93
Types of Lexicons
  • Ontology
  • Representation of concepts and relationships
  • Thesaurus
  • Ontology specialized for retrieval
  • Bilingual lexicon
  • Ontology specialized for machine translation
  • Bilingual dictionary
  • Ontology specialized for human translation

94
Multilingual Thesauri
  • Adapt the knowledge structure
  • Cultural differences influence indexing choices
  • Use language-independent descriptors
  • Matched to a unique term in each language
  • Three construction techniques
  • Build it from scratch
  • Translate an existing thesaurus
  • Merge monolingual thesauri

95
Machine Readable Dictionaries
  • Based on printed bilingual dictionaries
  • Becoming widely available
  • Used to produce bilingual term lists
  • Cross-language term mappings are accessible
  • Sometimes listed in order of most common usage
  • Some knowledge structure is also present
  • Hard to extract and represent automatically
  • The challenge is to pick the right translation

96
Measuring Lexicon Coverage
  • Lexicon size
  • Vocabulary coverage of the collection
  • Term instance coverage of the collection
  • Term weight coverage of the collection
  • Term weight coverage on representative queries
  • Retrieval performance on a test collection

97
Accommodating Lexical Gaps
  • Corpora
  • Parallel (translation equivalent)
  • Comparable (topically related)
  • Algorithmic
  • Transliteration rules, cognate matching
  • The user
  • Choice based on reverse translations

98
(Image: the same text in Hieroglyphic, Demotic, and Greek, an early
parallel corpus.)
99
Types of Bilingual Corpora
  • Parallel corpora translation-equivalent pairs
  • Document pairs
  • Sentence pairs
  • Term pairs
  • Comparable corpora
  • Content-equivalent document pairs
  • Unaligned corpora
  • Content from the same domain

100
Learning From Document Pairs
  • Count how often each term occurs in each pair
  • Treat each pair as a single document

(Table: counts of English terms E1-E5 and Spanish terms S1-S4 in five
aligned document pairs, each pair treated as a single document; terms that
are used in similar ways, such as E1 and E3, and translation pairs, such
as E1 and S1, show similar count patterns across the pairs.)
101
Similarity-Based Dictionaries
  • Automatically developed from aligned documents
  • Terms E1 and E3 are used in similar ways
  • Terms E1 S1 (or E3 S4) are even more similar
  • For each term, find most similar in other
    language
  • Retain only the top few (5 or so)
  • Performs as well as dictionary-based techniques
  • Evaluated on a comparable corpus of news stories
  • Stories were automatically linked based on date
    and subject

102
Generalized Vector Space Model
  • Term space of each language is different
  • But the document space for a corpus is the same
  • Describe new documents based on the corpus
  • Vector of cosine similarity to each corpus
    document
  • Easily generated from a vector of term weights
  • Multiply by the term-document matrix
  • Compute cosine similarity in document space
  • Excellent results when the domain is the same

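A minimal sketch, not from the original slides, of the generalized vector space idea: each language keeps its own term-by-document matrix over the same parallel corpus, new texts are mapped into the shared document space, and similarity is computed there. The matrices and vectors are illustrative.

    import numpy as np

    def to_doc_space(term_vector, term_doc_matrix):
        # One similarity score per corpus document, normalized to unit length
        v = term_doc_matrix.T @ term_vector
        norm = np.linalg.norm(v)
        return v / norm if norm else v

    # Illustrative term-by-document counts over the same four document pairs
    E = np.array([[2, 0, 1, 0], [0, 3, 0, 1], [1, 1, 0, 2]], dtype=float)  # English
    S = np.array([[2, 0, 1, 0], [0, 2, 0, 1]], dtype=float)                # Spanish

    query_en = np.array([1, 0, 1], dtype=float)   # English query term weights
    doc_es = np.array([0, 1], dtype=float)        # Spanish document term weights
    q = to_doc_space(query_en, E)
    d = to_doc_space(doc_es, S)
    print(round(float(q @ d), 3))                 # cosine similarity in document space
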
103
Cross-Language LSI
  • Designed to reduce term choice effects
  • Works just as well across languages
  • Cross-language is just a type of term choice
    variation
  • Train using aligned document pairs
  • Concatenate both languages in each document
  • Map queries and documents into LSI space

104
Cooccurrence-Based Translation
  • Align terms using cooccurrence statistics
  • How often does a term pair occur in sentence pairs?
  • Weighted by relative position in the sentences
  • Retain term pairs that occur unusually often
  • Useful for query translation
  • Excellent results when the domain is the same
  • Also practical for document translation
  • Term usage reinforces good translations

105
Exploiting Unaligned Corpora
  • Documents about the same set of subjects
  • No known relationship between document pairs
  • Easily available in many applications
  • Two approaches
  • Use a dictionary for rough translation
  • But refine it using the unaligned bilingual
    corpus
  • Use a dictionary to find alignments in the corpus
  • Then extract translation knowledge from the
    alignments

106
Feedback with Unaligned Corpora
  • Pseudo-relevance feedback is fully automatic
  • Augment the query with top ranked documents
  • Improves recall
  • Recenters queries based on the corpus
  • Short queries get the most dramatic improvement
  • Two opportunities
  • Query language: improve the query
  • Document language: suppress translation error

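A minimal sketch, not from the original slides, of pseudo-relevance feedback: pool the terms of the top-ranked documents and add the most frequent non-query terms to the query. The cutoffs and the toy documents are illustrative.

    from collections import Counter

    def pseudo_relevance_feedback(query_terms, ranked_docs, doc_bags, k=5, expand=10):
        # Fully automatic expansion from the top k retrieved documents
        pooled = Counter()
        for d in ranked_docs[:k]:
            pooled.update(doc_bags[d])
        new_terms = [t for t, _ in pooled.most_common()
                     if t not in query_terms][:expand]
        return list(query_terms) + new_terms

    doc_bags = {1: Counter({"bank": 3, "zurich": 2}), 2: Counter({"bank": 2, "loan": 1})}
    print(pseudo_relevance_feedback(["swiss", "bank"], [1, 2], doc_bags, k=2, expand=2))
    # ['swiss', 'bank', 'zurich', 'loan']
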
107
Context Linking
  • Automatically align portions of documents
  • For each query term
  • Find translation pairs in corpus using dictionary
  • Select a context of nearby terms
  • e.g., +/- 5 words in each language
  • Choose translations from most similar contexts
  • Based on cooccurrence with other translation pairs

108
Choosing the Right Translation(s)
  • Apply known constraints
  • Part-of-speech, syntactic dependency,
  • Maximize the retrieval status value
  • Exploits context provided by other query terms
  • Use explicit statistical information
  • Translation probability, n-gram frequencies,
  • Perform pseudorelevance feedback

109
Unconstrained Query Translation
  • Replace each word with every translation
  • Typically 5-10 translations per word
  • About 50% of monolingual effectiveness
  • Ambiguity is a serious problem
  • Example: fly (English)
  • 8 word senses (e.g., to fly a flag)
  • 13 Spanish translations (enarbolar, ondear, ...)
  • 38 English retranslations (hoist, brandish, lift)

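A minimal sketch, not from the original slides, of unconstrained dictionary-based query translation: every known translation of every query word is kept (or only the first few, when the term list is ordered by predominance). The tiny lexicon is illustrative.

    def translate_query(query_terms, bilingual_lexicon, n_best=None):
        # Replace each source-language word with its listed translations;
        # unknown words are passed through untranslated
        translated = []
        for term in query_terms:
            options = bilingual_lexicon.get(term, [term])
            translated.extend(options if n_best is None else options[:n_best])
        return translated

    lexicon = {"bank": ["bankgebaeude", "bankverbindung", "damm", "ufer"],
               "swiss": ["schweizer"]}   # illustrative entries
    print(translate_query(["swiss", "bank"], lexicon))
    print(translate_query(["swiss", "bank"], lexicon, n_best=1))
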
110
Exploiting Part-of-Speech Tags
  • Constrain translations by part of speech
  • Noun, verb, adjective,
  • Effective taggers are available
  • Works well when queries are full sentences
  • Short queries provide little basis for tagging
  • Constrained matching can hurt monolingual IR
  • Nouns in queries often match verbs in documents

111
Phrase Indexing
  • Improves retrieval effectiveness two ways
  • Phrases are less ambiguous than single words
  • Idiomatic phrases translate as a single concept
  • Three ways to identify phrases
  • Semantic (e.g., appears in a dictionary)
  • Syntactic (e.g., parse as a noun phrase)
  • Cooccurrence (words found together often)
  • Semantic phrase results are impressive

112
Query Formulation
  • Interactive word sense disambiguation
  • Show users the translated query
  • Retranslate it for monolingual users
  • Provide an easy way of adjusting it
  • But don't require that users adjust or approve it

113
Query Interface Example
Search
Swiss bank
Query in English
114
User-Assisted Query Translation
Search
Swiss bank
Query in English
Click on a box to remove a possible translation
bank
Bankgebäude ( ), bankverbindung (bank account, correspondent),
bank (bench, settle), damm (causeway, dam, embankment),
ufer (shore, strand, waterside), wall (parapet, rampart)
Continue
115
Selection
  • Goal Provide information to support decisions
  • May not require very good translations
  • e.g., Word-by-word title translation
  • People can read past some ambiguity
  • May help to display a few alternative translations

116
Language-Specific Selection
Search
Swiss bank
Query in English
English
1 (0.72) Swiss Bankers Criticized, AP / June 14, 1997
2 (0.48) Bank Director Resigns, AP / July 24, 1997

German: (Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.91) U.S. Senator Warpathing, NZZ / June 14, 1997
2 (0.57) Bankensecret Law Change, SDA / August 22, 1997
3 (0.36) Banks Pressure Existent, NZZ / May 3, 1997
117
Merged-Multilingual Selection
Search
Swiss bank
Query in English
German Query
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.91) U.S. Senator Warpathing, NZZ, June 14, 1997
2 (0.57) Bankensecret Law Change, SDA, August 22, 1997
3 (0.52) Swiss Bankers Criticized, AP, June 14, 1997
4 (0.36) Banks Pressure Existent, NZZ, May 3, 1997
5 (0.28) Bank Director Resigns, AP, July 24, 1997
118
Ways of Merging Ranked Lists
  • Estimate probabilities using a test collection
  • Compute probability mass from judgments
  • Normalize scores using paired documents
  • Fit a score adjustment function to known pairs
  • Adopt a rank-based strategy
  • Interleave documents from each collection

119
Examination
  • Goal: support decisions and vocabulary discovery
  • Document translation supports examination well
  • Translated documents are already available
  • Query translation is more challenging
  • Begin translating top documents in ranked list
  • Use cache and focused translation to minimize
    delays
  • Backstop with word-by-word translation

120
Delivery
  • Use may require high-quality translation
  • Machine translation quality is often rough
  • Route to best translator based on
  • Required speed
  • Required quality (language and technical skills)
  • Cost

121
Summary
  • Controlled vocabulary
  • Mature, efficient, easily explained
  • Dictionary-based
  • Simple, broad coverage
  • Comparable and parallel corpora
  • Effective in the same domain
  • Unaligned corpora
  • Experimental

122
CLIR Opportunities
Segmentation Phrase Indexing
Lexical Coverage
123
Open Questions (1)
  • User-assisted query disambiguation
  • Limited to the most troublesome terms?
  • Enrich dictionaries with unlinked corpora
  • Tailored title translation techniques
  • Rapid translation and/or summarization
  • Can we use queries to focus translation effort?
  • Automated global translation brokering
  • Balance capacity, capability, and user needs

124
Open Questions (2)
  • Pronunciation-based cognate matching
  • Combined evidence corpora and lexicons
  • Segmentation-sensitive term weights
  • Recorded speech, scanned page images

125
Open Questions (3)
  • Cross-language question answering
  • Multilingual multidocument summarization
  • Multilingual textual data mining

126
For More Information
  • Cross-Language Information Retrieval
  • http://www.clis.umd.edu/dlrg/clir/
  • TREC
  • http://trec.nist.gov
  • Topic Detection and Tracking
  • http://www.nist.gov/speech/tdt3/tdt3.htm

127
An Experiment
  • Unsegmented Chinese queries, English documents
  • Dictionary-based query vector translation
  • Extracted bilingual term list from Optilex
    dictionary
  • 177,063 term pairs, in weak predominance order
  • Machine Translation (MT) query translation
  • Systran Chinese-English MT system

128
Experiment Design
(Diagram: Chinese queries pass through a segmenter to produce segmented
queries, then through dictionary-based query translation (DQT) using a
bilingual term list; the translated queries are run by InQuery against the
English documents to produce ranked lists.)
129
Alternative Components
  • Chinese query segmentation
  • LDC segmenter
  • NMSU segmenter
  • Exhaustive segmentation (EXH)
  • Dictionary-based query translation (DQT)
  • DQT-FT: choose the first translation for each
    word
  • DQT-ET: choose all translations for each word

130
Evaluation Resources
  • Document collection (from TREC disk 4)
  • 210,158 English Financial Times articles,
    1991-1994
  • Queries
  • TREC-7 English ad hoc topics 351-400, title and
    desc fields
  • Translated into Chinese by a native speaker.
  • Relevance judgments
  • Developed for NIST with pooled assessment
    methodology
  • Inquery 3.1p1 IR system
  • kstem stemmer, standard English stopword list

131
(No Transcript)
132
(No Transcript)
133
First Translation
134
NMSU vs. LDC Segmenters
NMSU Better
LDC Better
135
How Segmentation Errors Hurt
  • Segmentation errors produced single characters
  • Because most single characters are words
  • 1-character words often have several meanings
  • Some common meanings, some quite rare
  • Translation produced some rare English words
  • Ranked retrieval gives high weight to rare words
  • Very poor retrieval (25% of monolingual)
  • Segmentation, translation and term weights
    interact

136
Computing IDF Weights
  • Collection frequency is evidence of selectivity
  • Naturally bound to the query, not the documents
  • Best to compute IDF in the query language
  • If a representative collection is available

137
Lessons From TDT-3: Improving the Signal-to-Noise
Ratio
  • Translation coverage
  • Enrich the LDC term list using large dictionaries
  • Translation selection
  • Statistical evidence from comparable corpora
  • Enriching indexing vocabulary
  • Add related terms from comparable corpora

138
Post-Translation Document Expansion
(Diagram: each Mandarin document, a broadcast news (BN) ASR transcript or
newswire text (NWT) story, is segmented with the NMSU segmenter, given a
word-to-word translation into English, and used as a query vector against
a comparable English corpus with PRISE; selective terms from the top 5
retrieved English documents are added to the document before it is
indexed.)
139
Mandarin Newswire Text
140
Mandarin Broadcast News
141
Why Document Expansion Works
  • Story-length objects provide useful context
  • Ranked retrieval finds signal amid the noise
  • Selective terms discriminate among documents
  • Enrich index with high IDF terms from top
    documents
  • Similar strategies work well in other
    applications
  • TREC-7 SDR (Singhal et al., 1998)
  • CLIR query translation (Ballesteros and Croft, 1997)

142
n-best Translation
  • We generally used 1-best translation
  • Highest unigram frequency in comparable corpus
  • Tried 2-best two highest-ranked translations
  • Duplicating unique translations where necessary
  • Should reduce miss rate
  • But at what cost in false alarms?

143
Mandarin Newswire Text
144
Mandarin Broadcast News
145
Comparison With Systran
  • Used baseline translations provided by LDC
  • Untranslated words not used
  • No document expansion
  • Systran produces 1-best translations
  • Natural comparison is with our 2-best run

146
Mandarin Newswire Text
147
Mandarin Broadcast News
148
Selection Interface Evaluation
  • Can model selection as manual classification
  • Consensus categories, learned by example
  • Control: assign categories in native language
  • Experiment: assign categories to translations
  • Classification consistency measure

149
Classification Consistency
(Figure: classification consistency, measured as average distance from the
English-language decisions, for several conditions, including a Bayesian
classifier, documents glossed from Japanese, English documents, and the
control.)