1
Combining Lexicon-based Methods to Detect
Opinionated Blogs

Kiduk Yang, Ning Yu, Hui Zhang
WIDIT Laboratory
School of Library and Information Science
Indiana University

2
Outline

Research Questions
WIDIT Approach
Results

3
Research Questions

Does noise in blog data affect IR performance?
Compare performance of runs with noise and
without noise
What are the characteristics of blog noise?
How can blog noise be identified?
How can opinionated blogs about a target be
retrieved?
Optimize the retrieval of blogs about a target
(i.e., OnTopic)
Boost the rankings of blogs with evidence of
opinion
What counts as evidence of opinion?
How can it be leveraged?
How can the sources of evidence be combined to
detect opinionated blogs?

4
Research Questions: Noise Reduction

What are the characteristics of blog noise?
Non-English (NE) blogs
Large proportion of NE tokens
High frequency of NE stopwords & low frequency of
English stopwords
Non-blog content
Non-post/comment text generated by blogware
e.g., sidebar/navigation, header/footer,
advertisement, etc.
Spam postings
How can blog noise be identified?
NE blogs
Language tags in feeds
NE tokens in permalinks
Non-blog content
Mark-up tags in permalinks

5
Research Questions: Opinion Detection

What counts as evidence of opinion?
Opinion Terminology
Words often used in expressing an opinion
e.g., "Skype sucks", "Skype rocks", "Skype is
cool"
Opinion Collocations
Collocations that mark an opinion
e.g., "I think tomato is a fruit", "Tomato is a
vegetable to me"
Opinion Morphology
Word morphing to emphasize an opinion
e.g., "Vista is soooo buggy", "Vista is metacool"
How can it be leveraged?
Opinion classification via Supervised Learning
Document scoring using Opinion Lexicons
How can the sources of evidence be combined to
detect opinionated blogs?
Weighted sum optimized via Dynamic Tuning

6
WIDIT Approach
Blog collection

Noise Reduction
Non-English blog elimination
Non-blog content exclusion
On-topic Retrieval
Initial Retrieval
On-Topic Reranking
Opinion Detection
Opinion Reranking
Polarity Classification

[Diagram labels: Subset where target appears; On-topic]

7
WIDIT Blog System Architecture
[Diagram components: Blog Data, IMDb Data, Wilson's Lexicons, Netlingo Terms, Noise Reduction, Blogs w/o Noise, Document Indexing, Inverted Index, Topics, Query Indexing, Short Query, Long Query, Expanded Query, Retrieval, Initial Results, Topic Reranking, OnTopic Results, Opinion Lexicons, Opinion Reranking, Opinion Results, Dynamic Tuning, Fusion, Fusion Result, Polarity Detection, Polarity Result]

8
WIDIT Approach: Noise Reduction

Non-English (NE) blog Identification
Language tags in Feeds
Not always present
Extract all unique tags from the feed data
Identify tags that indicate non-English language
Flag permalinks in feeds with non-English
language tag
NE tokens in Permalinks
NE blogs w/ English tokens & English blogs w/ NE
tokens
For each permalink,
Identify tokens consisting of non-ASCII
characters (i.e., NE tokens)
Compute the NE Content Rate: NECR = NE_tokens /
tokens
Compute the frequency & proportion of English
stopwords: Est(f), Est(p)
Flag if large NE content or some NE content with
few E-stopwords
NECR > 0.5 with Est(f) < 10
NECR < 0.5 with Est(f) < 10 and Est(p) < 0.02
NECR < 0.5 with dlen > 1000, Est(f) < 20 and
Est(p) < 0.01
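
The flagging rules above can be sketched roughly as follows. This is a minimal illustration, not the WIDIT implementation: the stopword set is a tiny placeholder, tokens are split on whitespace, a token counts as NE if it contains any non-ASCII character, dlen is taken to be the token count, and "some NE content" is read as NECR > 0. All of these are assumptions; only the thresholds come from the slide.

```python
# Placeholder stopword set (an assumption; the slides do not give the list used).
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def is_ne_token(token):
    """Treat a token as non-English (NE) if it has any non-ASCII character."""
    return not token.isascii()


def ne_stats(text):
    """Return (NECR, Est(f), Est(p), dlen) for one permalink's text."""
    tokens = text.split()
    dlen = len(tokens)  # assumption: document length measured in tokens
    if dlen == 0:
        return 0.0, 0, 0.0, 0
    ne_count = sum(1 for t in tokens if is_ne_token(t))
    est_f = sum(1 for t in tokens if t.lower() in ENGLISH_STOPWORDS)
    return ne_count / dlen, est_f, est_f / dlen, dlen


def flag_non_english(text):
    """Apply the three threshold rules from the slide."""
    necr, est_f, est_p, dlen = ne_stats(text)
    if necr > 0.5 and est_f < 10:
        return True
    # Assumption: "some NE content" means at least one NE token.
    if 0 < necr < 0.5 and est_f < 10 and est_p < 0.02:
        return True
    if 0 < necr < 0.5 and dlen > 1000 and est_f < 20 and est_p < 0.01:
        return True
    return False
```

For example, `flag_non_english("это тест на русском языке")` is flagged by the first rule, while a plain ASCII English sentence is not flagged by any rule.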

9
WIDIT Approach: Noise Reduction

Non-English (NE) blog Identification
Language tags in Feeds (FD)
16,121 permalinks flagged (1,473 not flagged by
PM)
NE tokens in Permalinks (PM)
334,219 permalinks flagged (11.6%)
NE blog Validation
OnTopic (rel > 0) NE blogs in 2006 qrels file
24 (PM only)
Suspected qrels error
NE blogs in 2006 qrels file
59 (FD only) 2043 (PM only) 203 (both) 2304
All but 3 manually validated as Non-English
blogs
-- BLOG06-20051230-022-0008930772: short blog with no content
-- BLOG06-20060118-000-0008692678: short blog
-- BLOG06-20051208-020-0022945948: Uhhhh.... (long) It sucks!

10
Suspected Non-English blogs in qrels.blog06
11
WIDIT Approach: Noise Reduction

Non-blog content Exclusion
1. Extract all unique tag patterns from the
permalink data
2. Compile a list of content noise tags with high
frequency
3. Construct regular expressions (regex) to
identify content noise tags
4. Apply regex to the unique tag set
5. Modify regex based on the examination of regex
results
6. Repeat steps 4 & 5 until
regex correctly identifies content noise tags
For each permalink,
Extract blog segment using the content regex
Extract <post|comment> text
If no <post|comment> tags, extract <content> text
If no <content> tag, extract <body> text
Exclude noise text using the noise regex
e.g., <(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>
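
The per-permalink extraction could look something like the sketch below. It is illustrative only: the tag names, the keyword list, and the function names are assumptions pieced together from the slide's garbled example, and a single regex cannot handle nested div or span blocks, so real blogware markup would need a more careful parser.

```python
import re

# Assumed noise pattern: a div/span whose attributes mention a noise keyword.
NOISE_BLOCK = re.compile(
    r"<(div|span)\b[^>]*(?:footer|profile|side|nav|advertise|sponsor)[^>]*>.*?</\1>",
    re.IGNORECASE | re.DOTALL,
)

# Fallback order from the slide: post/comment, then content, then body.
FALLBACK_TAGS = (("post", "comment"), ("content",), ("body",))


def extract_segment(html):
    """Return the text inside the first tag family that is present."""
    for names in FALLBACK_TAGS:
        pattern = re.compile(
            r"<(%s)\b[^>]*>(.*?)</\1>" % "|".join(names),
            re.IGNORECASE | re.DOTALL,
        )
        parts = [m.group(2) for m in pattern.finditer(html)]
        if parts:
            return "\n".join(parts)
    return html  # nothing matched: keep the whole page


def clean_permalink(html):
    """Extract the blog segment, drop noise blocks, strip remaining tags."""
    segment = NOISE_BLOCK.sub(" ", extract_segment(html))
    text = re.sub(r"<[^>]+>", " ", segment)
    return " ".join(text.split())
```

Given a page whose body holds a sidebar div, an entry div, and a footer div, `clean_permalink` keeps only the entry text.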

12
WIDIT Approach: Noise Reduction

Noise Reduction (NR) Statistics
Blog length reduction
Over 50% of blogs with length difference
Average length reduction: 551 bytes
74.3 million (7.4%) tokens excluded by NR
21,283 (0.6%) unique tokens excluded by NR

13
WIDIT Blog System Architecture
14
WIDIT Approach: Topic Reranking

Topic Reranking factors
Exact Match of query title text in document
Near Match of query title/description text in
document
All of the query terms occur in sequence near
each other
Noun Phrase Match
Non-Rel Match
Non-relevant phrases/nouns from the topic
narrative occur in document
Topic Reranking (TR) Method
Compute TR scores for each document
Document-length normalized frequency
Categorize initial retrieval into reranking
groups
g1: exact match (query title to document title &
body)
g2: exact match (multi-term query title to
document title only)
g3: exact match (query title to document body only)
g4: other
Rerank documents within groups using combined TR
score
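
A rough sketch of the group-then-rerank step follows. It assumes that an exact match is a case-insensitive substring match, that groups are ranked g1 before g2 before g3 before g4, and that each document already carries a combined TR score; the field names (`title`, `body`, `tr_score`) are hypothetical.

```python
def rerank_group(query_title, doc_title, doc_body):
    """Assign a document to reranking group 1-4 (assumed substring matching)."""
    q = query_title.lower()
    in_title = q in doc_title.lower()
    in_body = q in doc_body.lower()
    if in_title and in_body:
        return 1  # g1: match in both title and body
    if in_title and len(q.split()) > 1:
        return 2  # g2: multi-term query matched in title only
    if in_body and not in_title:
        return 3  # g3: match in body only
    return 4      # g4: everything else


def topic_rerank(query_title, docs):
    """Order by group first, then by descending TR score within each group."""
    return sorted(
        docs,
        key=lambda d: (rerank_group(query_title, d["title"], d["body"]),
                       -d["tr_score"]),
    )
```

With this ordering a low-scoring g1 document still ranks above a high-scoring g4 document, which is the point of grouping before reranking.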

15
WIDIT Blog System Architecture
16
WIDIT Approach: Opinion Lexicons

Lexicon-based Opinion Detection
Construct Opinion Lexicons from multiple sources
of opinion evidence
Opinion Terminology, Opinion Collocations,
Opinion Morphology
Score documents using Opinion Lexicons
Opinion Terminology
Wilson's Lexicons
A subset of Wilson's subjectivity terms
4747 strong & 2190 weak subjective terms with
polarity
240 emphasis terms, 88 negation n-grams
High Frequency (HF) Lexicon
For each of IMDb movie & 2006 blog training data
Extract high frequency terms from positive
training data (e.g., movie review)
Exclude terms that occur in negative training
data (e.g., movie plot summary)
Select a set of opinion terms
Combine the IMDb & blog term sets
Assign polarity & strength to each term
Expand with synonyms & antonyms from WordNet
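
The first two HF lexicon steps (extract frequent terms from positive data, exclude terms seen in negative data) might be sketched as below. The tokenizer and the `min_freq` cutoff are assumptions; the slides do not state the frequency threshold, and the later selection, polarity assignment, and WordNet expansion steps are not covered here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption, not the WIDIT tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def hf_candidates(positive_docs, negative_docs, min_freq=2):
    """High-frequency terms of the positive set that never occur in the negative set."""
    pos_counts = Counter(t for doc in positive_docs for t in tokenize(doc))
    neg_vocab = {t for doc in negative_docs for t in tokenize(doc)}
    return sorted(
        term for term, count in pos_counts.items()
        if count >= min_freq and term not in neg_vocab
    )
```

The output is only a candidate list; per the slide, opinion terms are then selected from it before polarity and strength are assigned.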

17
WIDIT Approach: Opinion Lexicons

Opinion Collocations
I-You (IU) Lexicon
For each of movie review & positive blog training
data
Extract n-grams that begin/end with IU anchors
(e.g., I, You, my, your, me)
Select a set of opinion collocations
Combine the movie & blog term sets
Assign strength & polarity to each collocation
Add verb conjugations & noun plurals
Expand with HF & Wilson terms
Acronym Lexicon
Select opinion collocations from netlingo
acronyms
e.g., afaik (as far as I know), imho (in my
humble opinion)
Assign strength & polarity to each collocation
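
Extracting n-grams that begin or end with an IU anchor can be illustrated with a short sketch. The anchor set comes from the slide's examples; the n-gram sizes (2 and 3) and the tokenizer are assumptions.

```python
import re
from collections import Counter

IU_ANCHORS = {"i", "you", "my", "your", "me"}


def iu_ngrams(text, n_values=(2, 3)):
    """Count n-grams whose first or last token is an IU anchor."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS:
                counts[" ".join(gram)] += 1
    return counts
```

For the slide's example "I think tomato is a fruit" this yields "i think" and "i think tomato" as candidate collocations, from which opinion collocations would then be selected.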

18
WIDIT Approach: Opinion Lexicons

Opinion Morphology
When expressing opinion, people become creative
and tend to use uncommon/rare terms
(Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
LF Lexicon & LF Regex
Compile a set of Low Frequency (LF) terms in the
blog collection
Exclude terms that occur frequently in negative
training data
Construct regular expressions (LF regex) to
identify Opinion Morph (OM) terms
Based on examination of HF terms & LF patterns
Compound words (e.g., crazygood, ohmygod)
Repeat-character words (e.g., sooo, fantaaastic)
Morph-spelled words (e.g., luv, hizzarious)
Apply regex to LF term set
Iteratively refine regex based on the examination
of regex results
Exclude regex matches from LF term set
Select OM terms (LF lexicon) from the remaining
set
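
Two of the pattern types above lend themselves to a small sketch: repeat-character words via a backreference regex, and two-part compound words via a vocabulary lookup. The patterns and function names are illustrative assumptions, not the actual LF regex; three-part compounds such as "ohmygod" and morph-spelled words such as "luv" would need additional rules.

```python
import re

# A letter repeated three or more times in a row (e.g., "sooo", "fantaaastic").
REPEAT_CHAR = re.compile(r"([a-z])\1{2,}")


def is_repeat_char_word(term):
    """True if the term contains a run of 3+ identical letters."""
    return bool(REPEAT_CHAR.search(term.lower()))


def collapse_repeats(term):
    """Collapse each run of 3+ identical letters to a single letter."""
    return REPEAT_CHAR.sub(r"\1", term.lower())


def is_compound(term, vocab):
    """True if the term splits into two known words (vocab is supplied by the caller)."""
    t = term.lower()
    return any(t[:i] in vocab and t[i:] in vocab for i in range(2, len(t) - 1))
```

Collapsing repeats maps "sooo" to "so" and "fantaaastic" to "fantastic", which gives a way to look the base form up in an existing lexicon.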

19
WIDIT Approach: Opinion Reranking

Opinion Reranking factors
Opinion Terminology
Wilson's lexicon, HF lexicon
Opinion Collocations
AC lexicon, IU lexicon
Opinion Morphology
LF lexicon, LF regex
Opinion Reranking (OR) Method
Compute OR scores for each document
Document-length normalized frequency
Rerank topic-reranked documents using
combined OR score & topic-reranking groups
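
A document-length normalized lexicon score could be computed along these lines. This is a sketch under assumptions: each lexicon is a dict mapping a term to its strength, every occurrence contributes that strength, and the document length is the token count. The slides do not spell out the exact scoring formula.

```python
def lexicon_score(tokens, lexicon):
    """Sum of lexicon strengths over all tokens, divided by document length."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)


def or_scores(tokens, lexicons):
    """One normalized score per opinion lexicon (e.g., Wilson, HF, IU, AC, LF)."""
    return {name: lexicon_score(tokens, lex) for name, lex in lexicons.items()}
```

Each per-lexicon score would then serve as one reranking factor in the weighted combination.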

20
WIDIT Blog System Architecture
21
WIDIT Approach: Dynamic Tuning

Reranking formula
$RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
$w_i$ = weight of reranking factor $i$
$NS_i$ = normalized score of factor $i$
= $(S_i - S_{min}) / (S_{max} - S_{min})$
$\alpha$ = weight of original score
$\beta$ = weight of overall reranking score
How to determine $\alpha$, $\beta$, $w_i$?
Too many parameters for exhaustive combinations
Linear combination may not suffice
Dynamic Tuning
Real-time display of parameter tuning effect on
performance
To guide the user towards a local optimum
By combining human intelligence (pattern
recognition) w/ the computational power of the machine
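
As a worked illustration of the reranking formula (a weighted sum of the min-max normalized original score and the min-max normalized factor scores), the sketch below computes reranking scores for a small result list. The function names and the handling of a constant score list (all zeros) are assumptions.

```python
def min_max(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # assumption: constant lists normalize to 0
    return [(s - lo) / (hi - lo) for s in scores]


def rerank_scores(orig, factors, weights, alpha, beta):
    """alpha * NS_orig + beta * sum_i(w_i * NS_i), one value per document.

    orig:    original retrieval scores, one per document
    factors: list of score lists, one list per reranking factor
    weights: one weight per reranking factor
    """
    ns_orig = min_max(orig)
    ns_factors = [min_max(f) for f in factors]
    return [
        alpha * ns_orig[d]
        + beta * sum(w * nf[d] for w, nf in zip(weights, ns_factors))
        for d in range(len(orig))
    ]
```

With `orig=[1.0, 2.0, 3.0]`, one factor `[10.0, 0.0, 5.0]`, weight 1.0, and alpha = beta = 0.5, the scores are 0.5, 0.25, and 0.75. Dynamic tuning then amounts to adjusting alpha, beta, and the weights while watching the effect on performance.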

22
WIDIT Approach: Dynamic Tuning

Opinion Reranking

23
WIDIT Approach: Dynamic Tuning

Polarity Detection

24
WIDIT Approach: Fusion

Weighted Sum Fusion Formula
$FS = \sum_i (w_i \cdot NS_i)$
$w_i$ = weight of system $i$ (relative contribution of each system)
$NS_i$ = normalized score of a document by system $i$
= $(S_i - S_{min}) / (S_{max} - S_{min})$
Fusion Type
Baseline (Min-Max) fusion: $w_i = 1$
MAP fusion: $w_i$ = MAP of training runs
Fusion Combinations
By Query Length
Short, Long, Long w/ nouns
By Term Weight
Okapi, SMART
Fusion Levels
Baseline results
Topic-reranked results
Opinion-reranked results
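
Weighted-sum fusion across systems can be sketched as follows. Each run is represented as a dict from document ID to score, and a document missing from a run contributes nothing for that system; both choices are assumptions. Omitting the weights gives the baseline (all weights 1); passing each system's MAP on training runs gives MAP fusion.

```python
def min_max_norm(run):
    """Min-max normalize one system's scores (dict: doc id -> score)."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    return {d: ((s - lo) / span if span else 0.0) for d, s in run.items()}


def fuse(runs, weights=None):
    """Weighted sum of normalized scores across systems, best document first."""
    if weights is None:
        weights = [1.0] * len(runs)  # baseline fusion: every weight is 1
    fused = {}
    for w, run in zip(weights, runs):
        for doc, ns in min_max_norm(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

For instance, fusing `{"a": 3.0, "b": 1.0, "c": 2.0}` and `{"a": 0.2, "b": 0.9}` with weights 0.3 and 0.2 ranks the documents a, b, c.
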
25
WIDIT Approach: Polarity Detection

For each opinion-reranked document,
Compute positive & negative polarity scores
Combine polarity scores using D-tuned formula:
fsc(p), fsc(n)
Apply polarity detection heuristic
Positive polarity if
most of opinion factors are positive,
fsc(p) - fsc(n) > threshold
fsc(p) >> fsc(n)
Negative polarity if
most of opinion factors are negative,
fsc(n) - fsc(p) > threshold
fsc(n) >> fsc(p)
Mixed polarity otherwise
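
One possible reading of the heuristic is sketched below. The slide does not say whether the listed conditions must all hold or whether any one suffices, nor how ">>" is quantified; this sketch assumes all three must hold and treats ">>" as "at least `ratio` times larger". The `threshold` and `ratio` defaults are hypothetical values, not the tuned ones.

```python
def polarity(factor_pos, factor_neg, fsc_p, fsc_n, threshold=0.1, ratio=2.0):
    """Classify a document as 'positive', 'negative', or 'mixed'.

    factor_pos / factor_neg: per-factor positive and negative scores
    fsc_p / fsc_n: combined positive and negative polarity scores
    """
    n = len(factor_pos)
    pos_wins = sum(p > q for p, q in zip(factor_pos, factor_neg))
    neg_wins = sum(q > p for p, q in zip(factor_pos, factor_neg))
    # Assumption: all three conditions from the slide are required together.
    if pos_wins > n / 2 and fsc_p - fsc_n > threshold and fsc_p >= ratio * fsc_n:
        return "positive"
    if neg_wins > n / 2 and fsc_n - fsc_p > threshold and fsc_n >= ratio * fsc_p:
        return "negative"
    return "mixed"
```

Requiring all conditions keeps the two classes mutually exclusive, so anything that is not clearly one-sided falls through to mixed.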

26
Result Overview

Independent Variables
Noise Reduction
Query Length
Topic Reranking
Opinion Reranking
Dynamic Tuning
Fusion
Topic Difficulty
Failure analysis

27
Results Summary

Noise Reduction
Adverse effect on retrieval performance
Many relevant documents had contents excluded by
the WIDIT Noise Reduction module
Query Length
The longer the query, the better the performance
Topic Reranking
4% improvement (Qshort), 10% improvement (Qlong)
over initial result
Opinion Reranking
15% improvement (Qshort), 10% improvement (Qlong)
over TopicRR
Dynamic Tuning
4% improvement (Qshort), 9% improvement (Qlong)
over no tuning
Fusion
20% improvement (Qshort) over best baseline
non-fusion
Topic Difficulty
Improvement by Opinion reranking not related to
topic difficulty

28
Concluding Remarks

Noise Reduction
Good idea, but faulty implementation
Effect on retrieval is not yet clear
Post-retrieval Reranking, Dynamic Tuning, and
Fusion all improve retrieval performance
Compound effect is even more beneficial
Opinion Modules
Need better training data

29
Result At a Glance

Topic MAP
Opinion MAP

31
Query Length Effect
37
Topic Difficulty
38
Failure Analysis