Title: WIDIT in TREC2008 Blog Track: Leveraging multiple sources of opinion evidence
1 WIDIT in TREC-2008 Blog Track: Leveraging multiple sources of opinion evidence
- Kiduk Yang
- WIDIT Laboratory
- School of Library and Information Science
- Indiana University
2 Blog Track Challenge
- Targeted Opinion Detection
- Subjective language is context-dependent
- Both objective and subjective documents are composed of a mixture of subjective and objective language
- Must associate opinion to the target
- Blogosphere Characteristics
- Highly personalized → non-standard use of language
- Interactive → opinion may span a fraction of the posting
- Blogware → embedded noise
- Spam
3 Research Questions: Opinion Detection
- What are the sources of opinion evidence?
- Opinion Terminology
- Words often used in expressing an opinion
- e.g., "Skype sucks", "Skype rocks", "Skype is cool"
- Opinion Collocations
- Collocations that mark an opinion
- e.g., "I think tomato is a fruit", "Tomato is a vegetable to me"
- Opinion Morphology
- Word morphing to emphasize an opinion
- e.g., "Vista is soooo buggy", "Vista is metacool"
- How can they be leveraged?
- Opinion classification via Supervised Learning
- Document scoring using Opinion Lexicons
- How can they be combined to detect opinionated blogs?
- Weighted sum optimized via Dynamic Tuning
4 WIDIT Blog System Architecture
[Figure: system architecture diagram. Labeled inputs: Blogs, Topics, Wilson's Lexicons, Netlingo Terms, Blog Data, IMDb Data. Labeled modules: Noise Reduction, Document Indexing, Query Indexing, Retrieval, Fusion, Topic Reranking, Opinion Reranking, Polarity Detection, Dynamic Tuning. Labeled intermediate data: Blogs w/o Noise, Inverted Index, Opinion Lexicons, Short Query, Long Query, Expanded Query. Labeled results: Initial Results, Fusion Result, OnTopic Results, Opinion Results, Polarity Result.]
5 WIDIT Approach: Opinion Lexicons
- Lexicon-based Opinion Detection
- Construct Opinion Lexicons from multiple sources of opinion evidence
- Opinion Terminology, Opinion Collocations, Opinion Morphology
- Score documents using Opinion Lexicons
- Opinion Terminology
- Wilson's Lexicons
- A subset of Wilson's subjectivity terms
- 4747 strong and 2190 weak subjective terms with polarity
- 240 emphasis terms, 88 negation n-grams
- High Frequency (HF) Lexicon
- For each of the IMDb movie and 2006 blog training data sets
- Extract high frequency terms from positive training data (e.g., movie review)
- Exclude terms that occur in negative training data (e.g., movie plot summary)
- Select a set of opinion terms
- Combine the IMDb and blog term sets
- Assign polarity and strength to each term
- Expand with synonyms and antonyms from WordNet
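The first two HF Lexicon steps (extract frequent terms from positive data, exclude terms seen in negative data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the WIDIT implementation: the whitespace tokenizer, the `min_count` cutoff, and the name `high_frequency_candidates` are hypothetical, and the manual selection, polarity assignment, and WordNet expansion steps are left out.

```python
from collections import Counter


def high_frequency_candidates(positive_docs, negative_docs, min_count=2):
    """Return terms that are frequent in positive data and absent from negative data."""
    def tokenize(text):
        return text.lower().split()  # assumed tokenizer, for illustration only

    pos_counts = Counter(t for doc in positive_docs for t in tokenize(doc))
    neg_vocab = {t for doc in negative_docs for t in tokenize(doc)}
    # Keep terms above the (assumed) frequency cutoff that never occur in negative data.
    return sorted(t for t, c in pos_counts.items()
                  if c >= min_count and t not in neg_vocab)


# Toy data: reviews stand in for positive data, plot summaries for negative data.
reviews = ["this movie is awful",
           "awful acting and awful plot",
           "the plot is brilliant and brilliant"]
plots = ["the movie follows a detective",
         "a plot about acting and this is it"]
candidates = high_frequency_candidates(reviews, plots)
print(candidates)
```

Frequent function words such as "is" and "and" drop out because they also occur in the plot summaries, which is the purpose of the exclusion step.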
6 WIDIT Approach: Opinion Lexicons
- Opinion Collocations
- I-You (IU) Lexicon
- For each of the movie review and positive blog training data sets
- Extract n-grams that begin/end with IU anchors (e.g., I, You, my, your, me)
- Select a set of opinion collocations
- Combine the movie and blog term sets
- Assign strength and polarity to each collocation
- Add verb conjugations and noun plurals
- Expand with HF and Wilson terms
- Acronym Lexicon
- Select opinion collocations from Netlingo acronyms
- e.g., afaik (as far as I know), imho (in my humble opinion)
- Assign strength and polarity to each collocation
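The IU n-gram extraction step can be illustrated with a short sketch. The anchor set is taken from the examples on the slide; the whitespace tokenizer, the choice of bigrams and trigrams, and the function name `iu_ngrams` are assumptions made for illustration.

```python
IU_ANCHORS = {"i", "you", "my", "your", "me"}  # anchors listed on the slide


def iu_ngrams(text, n_values=(2, 3)):
    """Collect n-grams whose first or last token is an IU anchor."""
    tokens = text.lower().split()  # assumed tokenizer
    found = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS:
                found.append(" ".join(gram))
    return found


begins = iu_ngrams("I think tomato is a fruit")
ends = iu_ngrams("Tomato is a vegetable to me")
print(begins, ends)
```

The candidates produced this way would still need the selection and strength/polarity assignment steps listed above.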
7 WIDIT Approach: Opinion Lexicons
- Opinion Morphology
- When expressing opinion, people become creative and tend to use uncommon/rare terms (Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
- LF Lexicon and LF Regex
- Compile a set of Low Frequency (LF) terms in the blog collection
- Exclude terms that occur frequently in negative training data
- Construct regular expressions (LF regex) to identify Opinion Morph (OM) terms
- Based on examination of HF terms and LF patterns
- Compound words (e.g., crazygood, ohmygod)
- Repeat-character words (e.g., sooo, fantaaastic)
- Morph-spelled words (e.g., luv, hizzarious)
- Apply regex to LF term set
- Iteratively refine regex based on the examination of regex results
- Exclude regex matches from LF term set
- Select OM terms (LF lexicon) from the remaining set
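As an illustration of the LF regex idea, the sketch below detects two of the three pattern types: repeat-character words and compound words. The specific pattern (three or more identical letters in a row), the tiny vocabulary, and the helper names are assumptions; the actual WIDIT regular expressions are not given in the slides.

```python
import re

# Assumed pattern: the same letter three or more times in a row.
REPEAT_CHAR = re.compile(r"([a-z])\1{2,}")


def is_repeat_char_word(term):
    return REPEAT_CHAR.search(term.lower()) is not None


def is_compound(term, vocabulary):
    """True if the term splits into two known words (e.g., crazy + good)."""
    return any(term[:i] in vocabulary and term[i:] in vocabulary
               for i in range(1, len(term)))


vocab = {"crazy", "good", "so", "fantastic"}  # hypothetical word list
lf_terms = ["sooo", "fantaaastic", "crazygood", "luv", "table"]

regex_matches = [t for t in lf_terms
                 if is_repeat_char_word(t) or is_compound(t, vocab)]
# Terms not caught by the patterns remain candidates for the LF lexicon.
remaining = [t for t in lf_terms if t not in regex_matches]
print(regex_matches, remaining)
```

A morph-spelled word such as "luv" is not caught by either pattern here, so it stays in the remaining set from which OM terms would be selected.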
8 WIDIT Approach: Opinion Reranking
- Opinion Reranking factors
- Opinion Terminology
- Wilson's lexicon, HF lexicon
- Opinion Collocations
- AC (acronym) lexicon, IU lexicon
- Opinion Morphology
- LF lexicon, LF regex
- Opinion Reranking (OR) Method
- Compute OR scores for each document
- Document-length normalized frequency
- Rerank topic-reranked documents using the combined OR score and topic-reranking groups
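A document-length normalized frequency score for one lexicon might look like the sketch below. It assumes the lexicon maps each term to a strength weight and that documents are already tokenized; both are illustrative assumptions, and the function name is hypothetical.

```python
def length_normalized_score(tokens, lexicon):
    """Sum the lexicon weights of matching tokens, divided by document length."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)


# Hypothetical lexicon: term -> strength weight.
lexicon = {"sucks": 1.0, "rocks": 1.0, "cool": 0.5}
or_score = length_normalized_score("skype really sucks".split(), lexicon)
print(or_score)
```

Dividing by length keeps long postings from outscoring short ones merely because they contain more words.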
9 Opinion Reranking
- Adjective-Verb (AV) Module
- Hypothesis
- Opinion blogs have a high density of Opinion Adjectives and Verbs
- Method
- Construct AV lexicons
- Manually compile an AV seed set
- e.g., good, bad, support, against, like, hate
- Expand the seed set with synonyms and antonyms from lexical sources (AV1)
- Expand AV1 with similar AV terms using Distributional Similarity (AV2)
- Compute AV scores
- AV1 score = document-length normalized frequency
- AV1 terms near query title string in document
- AV2 score = AV2 density in document
- AV2 term frequency / total adjective and verb frequency
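The AV2 density can be computed as in the sketch below. It assumes the document has already been part-of-speech tagged into `(word, tag)` pairs using coarse tags "ADJ" and "VERB"; the tagger, the tag names, and the function name are assumptions and are not specified in the slides.

```python
def av2_density(tagged_tokens, av2_lexicon):
    """AV2 term frequency divided by the total adjective and verb frequency."""
    av_words = [w.lower() for w, tag in tagged_tokens if tag in ("ADJ", "VERB")]
    if not av_words:
        return 0.0
    return sum(1 for w in av_words if w in av2_lexicon) / len(av_words)


# Hypothetical pre-tagged sentence and AV2 lexicon.
tagged = [("I", "PRON"), ("hate", "VERB"), ("this", "DET"),
          ("slow", "ADJ"), ("phone", "NOUN"), ("bought", "VERB")]
density = av2_density(tagged, {"hate", "terrible"})
print(density)
```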
10 Opinion Reranking
- AV expansion by Distributional Similarity
- Objective
- Find a cluster of similar words given a seed set of Opinion AV
- Hypothesis
- Similar words have similar distributional (co-occurrence) patterns.
- Learning Subjective Language (Wiebe et al., 2004)
- Method
- Split the training data into a training set and a validation set
- Find terms that co-occur with seed set terms in the training set
- Refine the expanded term set E(n)
- Classify the validation set with E(1)..E(n)
- Select E(k), which has the highest classification performance
- Manually filter E(k) to create the final Opinion AV lexicon
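The select-E(k) loop above can be sketched in a simplified form. Everything specific here is an assumption made for illustration: co-occurrence is counted at the document level, E(k) is taken to be the seed set plus the top k co-occurring terms, and a document is classified as opinionated if it contains any lexicon term. The slides do not specify how the expansion or the classifier was actually implemented.

```python
from collections import Counter


def cooccurring_terms(train_docs, seeds):
    """Rank non-seed terms by the number of training documents shared with a seed."""
    counts = Counter()
    for doc in train_docs:
        tokens = set(doc.lower().split())
        if tokens & seeds:
            counts.update(tokens - seeds)
    return [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def accuracy(lexicon, labeled_docs):
    """Assumed classifier: opinionated if the document contains any lexicon term."""
    hits = sum(bool(set(d.lower().split()) & lexicon) == label
               for d, label in labeled_docs)
    return hits / len(labeled_docs)


def select_expansion(train_docs, validation, seeds, n=3):
    ranked = cooccurring_terms(train_docs, seeds)
    expansions = [seeds | set(ranked[:k]) for k in range(1, n + 1)]  # E(1)..E(n)
    return max(expansions, key=lambda e: accuracy(e, validation))


train = ["hate awful phone", "like lovely phone", "phone camera battery"]
validation = [("awful battery", True), ("lovely camera", True),
              ("camera battery", False)]
best = select_expansion(train, validation, {"hate", "like"})
print(sorted(best))
```

In this toy run the selected set still contains "phone", a topical rather than opinion term, which is the kind of entry the final manual filtering step would remove.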
11 WIDIT Approach: Polarity Detection
- For each opinion-reranked document,
- Compute positive and negative polarity scores
- Combine polarity scores using D-tuned formula
- fsc(p), fsc(n)
- Apply polarity detection heuristic
- Positive polarity if
- most of opinion factors are positive,
- fsc(p) - fsc(n) > threshold
- fsc(p) >> fsc(n)
- Negative polarity if
- most of opinion factors are negative,
- fsc(n) - fsc(p) > threshold
- fsc(n) >> fsc(p)
- Mixed polarity otherwise
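One possible reading of the polarity heuristic is sketched below. The slides do not say whether the three conditions are combined with AND or OR, what the threshold is, or how ">>" is quantified; this sketch assumes all three conditions must hold, uses an arbitrary `threshold`, and interprets ">>" as exceeding the other score by a hypothetical `ratio`.

```python
def detect_polarity(factor_signs, fsc_p, fsc_n, threshold=0.1, ratio=2.0):
    """factor_signs: +1 or -1 per opinion factor; fsc_p, fsc_n: fused polarity scores.

    Assumption: all three conditions on the slide must hold together.
    """
    half = len(factor_signs) / 2
    n_pos = sum(1 for s in factor_signs if s > 0)
    n_neg = sum(1 for s in factor_signs if s < 0)
    if n_pos > half and fsc_p - fsc_n > threshold and fsc_p > ratio * fsc_n:
        return "positive"
    if n_neg > half and fsc_n - fsc_p > threshold and fsc_n > ratio * fsc_p:
        return "negative"
    return "mixed"


print(detect_polarity([1, 1, 1, -1], 0.8, 0.2))
print(detect_polarity([-1, -1, 1], 0.1, 0.5))
print(detect_polarity([1, -1], 0.5, 0.45))
```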
12 WIDIT Approach: Dynamic Tuning
- Reranking formula
- $RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
- $w_i$: weight of reranking factor $i$
- $NS_i$: normalized score of factor $i$
- $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
- $\alpha$: weight of original score
- $\beta$: weight of overall reranking score
- How to determine $\alpha$, $\beta$, $w_i$?
- Too many parameters for exhaustive combinations
- Linear combination may not suffice
- Dynamic Tuning
- Real-time display of parameter tuning effect on performance
- To guide the user towards local optimum
- By harnessing both human intelligence (pattern recognition) w/ computational power of machine
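The reranking formula can be written out as a short sketch. It assumes min-max normalization is applied across the documents in one result list and that each reranking factor supplies one raw score per document; the function names and the example weights are hypothetical, and the interactive tuning itself is not modeled.

```python
def min_max(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rerank_scores(orig_scores, factor_scores, weights, alpha, beta):
    """RS = alpha * NS_orig + beta * sum_i(w_i * NS_i), computed per document."""
    ns_orig = min_max(orig_scores)
    ns_factors = [min_max(col) for col in factor_scores]  # one list per factor
    return [
        alpha * ns_orig[d]
        + beta * sum(w * ns[d] for w, ns in zip(weights, ns_factors))
        for d in range(len(orig_scores))
    ]


# Three documents, two reranking factors, illustrative parameter values.
rs = rerank_scores(orig_scores=[10, 20, 30],
                   factor_scores=[[0, 5, 10], [4, 2, 0]],
                   weights=[0.5, 0.5], alpha=0.7, beta=0.3)
print(rs)
```

Dynamic Tuning then amounts to a person adjusting `alpha`, `beta`, and `weights` while watching the effect on an evaluation measure, rather than searching the full parameter space.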
13 WIDIT Approach: Dynamic Tuning
14 WIDIT Approach: Dynamic Tuning
15 WIDIT Approach: Fusion
- Weighted Sum Fusion Formula
- $FS = \sum_i (w_i \cdot NS_i)$
- $w_i$: weight of system $i$ (relative contribution of each system)
- $NS_i$: normalized score of a document by system $i$
- $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
- Fusion Type
- Normalized sum (Min-Max) fusion: $w_i = 1$
- MAP fusion: $w_i$ = MAP of training runs
- D-tuned fusion
- Fusion Combinations
- By Query Length
- Short, Long, Long w/ nouns
- By Term Weight
- Okapi, SMART
- Fusion Levels
- Baseline results
- Topic-reranked results
- Opinion-reranked results
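A minimal sketch of weighted sum fusion over two runs follows. Each run is assumed to be a mapping from document ID to raw score, and a document missing from a run is assumed to contribute zero for that run; the run names, scores, and MAP values are made up for illustration.

```python
def normalize(run):
    """Min-max normalize a {doc_id: score} run."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in run.items()}


def weighted_sum_fusion(runs, weights):
    """FS(d) = sum_i w_i * NS_i(d); returns (doc_id, score) pairs, best first."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, ns in normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: -kv[1])


okapi = {"d1": 12.0, "d2": 8.0, "d3": 4.0}   # hypothetical run scores
smart = {"d2": 0.9, "d3": 0.5, "d4": 0.1}

normalized_sum = weighted_sum_fusion([okapi, smart], [1.0, 1.0])  # w_i = 1
map_fusion = weighted_sum_fusion([okapi, smart], [0.3, 0.2])      # w_i = assumed MAP
print([d for d, _ in normalized_sum], [d for d, _ in map_fusion])
```

The two fusion types differ only in the weights passed in; D-tuned fusion would supply weights chosen through Dynamic Tuning instead.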
16 Result At a Glance
17 Result At a Glance
- Polarity Ranking (positive)
18 Result At a Glance
- Polarity Ranking (negative)
19 Reranking Effect
20 Reranking and Dynamic Tuning Effect
21 Opinion Reranking Factors
22 Relative Short Query Performance of WIDIT Opinion Finding System
Improvements over Baseline for Short Query
Opinion Detection Performances by TREC-2007 participants
Good OnTopic retrieval → good opinion retrieval
- but not necessarily due to opinion reranking
23 Concluding Remarks
- Noise Reduction
- Positive effect on retrieval performance
- Reranking and Dynamic Tuning
- Most influential components of the system
- Fusion
- Combining multiple complementary sources is effective for opinion detection
- Future Study
- Method fusion: Lexicon + Machine Learning + NLP
- Automatic construction of lexicons
- Query Expansion optimization
- Failure Analysis
24 Questions?
- Wilson's lexicon
- http://www.cs.pitt.edu/mpqa/opinionfinderrelease/
- Movie Review Data
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- Movie Plot Summaries
- http://www.imdb.com/Sections/Plots/
- Netlingo Terms
- http://www.netlingo.com/emailsh.cfm
- WIDIT Lexicons
- http://elvis.slis.indiana.edu/lexlist.htm