Combining Lexicon-based Methods to Detect Opinionated Blogs

1
Combining Lexicon-based Methods to Detect
Opinionated Blogs
  • Kiduk Yang, Ning Yu, Hui Zhang
  • WIDIT Laboratory
  • School of Library and Information Science
  • Indiana University

2
Outline
  • Research Questions
  • WIDIT Approach
  • Results

3
Research Questions
  • Does noise in blog data affect IR performance?
  • Compare performance of runs with noise and
    without noise
  • What are the characteristics of blog noise?
  • How can blog noise be identified?
  • How can opinionated blogs about a target be
    retrieved?
  • Optimize the retrieval of blogs about a target
    (i.e. OnTopic)
  • Boost the rankings of blogs with evidence of
    opinion
  • What are the sources of opinion evidence?
  • How can they be leveraged?
  • How can they be combined to detect opinionated
    blogs?

4
Research Questions: Noise Reduction
  • What are the characteristics of blog noise?
  • Non-English (NE) blogs
  • Large proportion of NE tokens
  • High frequency of NE stopwords and low frequency of
    English stopwords
  • Non-blog content
  • Non-post/comment text generated by blogware
  • e.g., sidebar/navigation, header/footer,
    advertisement, etc.
  • Spam postings
  • How can blog noise be identified?
  • NE blogs
  • Language tags in feeds
  • NE tokens in permalinks
  • Non-blog content
  • Mark-up tags in permalinks

5
Research Questions: Opinion Detection
  • What are the sources of opinion evidence?
  • Opinion Terminology
  • Words often used in expressing an opinion
  • e.g., Skype sucks, Skype rocks, Skype is
    cool
  • Opinion Collocations
  • Collocations that mark an opinion
  • e.g., I think tomato is a fruit, Tomato is a
    vegetable to me
  • Opinion Morphology
  • Word morphing to emphasize an opinion
  • e.g., Vista is soooo buggy, Vista is metacool
  • How can they be leveraged?
  • Opinion classification via Supervised Learning
  • Document scoring using Opinion Lexicons
  • How can they be combined to detect opinionated
    blogs?
  • Weighted sum optimized via Dynamic Tuning

6
WIDIT Approach
Blog collection
  • Noise Reduction
  • Non-English blog elimination
  • Non-blog content exclusion
  • On-topic Retrieval
  • Initial Retrieval
  • On-Topic Reranking
  • Opinion Detection
  • Opinion Reranking
  • Polarity Classification

[Diagram labels: Subset where target appears; On-topic]


7
WIDIT Blog System Architecture
[Architecture diagram. Box labels: Wilson's Lexicons, Netlingo Terms, Blog Data, IMDb Data, Blogs w/o Noise, Noise Reduction, Blogs, Opinion Lexicons, Document Indexing, Opinion Reranking, OnTopic Results, Opinion Results, Inverted Index, Dynamic Tuning, Topic Reranking, Retrieval, Initial Results, Fusion, Expanded Query, Long Query, Short Query, Polarity Detection, Fusion Result, Polarity Result, Query Indexing, Topics]
8
WIDIT Approach: Noise Reduction
  • Non-English (NE) blog Identification
  • Language tags in Feeds
  • Not always present
  • Extract all unique tags from the feed data
  • Identify tags that indicate non-English language
  • Flag permalinks in feeds with non-English
    language tag
  • NE tokens in Permalinks
  • NE blogs w/ English tokens and English blogs w/ NE
    tokens
  • For each permalink,
  • Identify tokens consisting of non-ASCII
    characters (i.e., NE tokens)
  • Compute the NE Content Rate: NECR = NE_tokens /
    tokens
  • Compute the frequency and proportion of English
    stopwords: Est(f), Est(p)
  • Flag if large NE content or some NE content with
    few E-stopwords
  • NECR > 0.5 with Est(f) < 10
  • NECR < 0.5 with Est(f) < 10 and Est(p) < 0.02
  • NECR < 0.5 with dlen > 1000, Est(f) < 20 and
    Est(p) < 0.01
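
The flagging rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny placeholder, a token counts as NE if it contains any non-ASCII character, dlen is taken to be the token count, and the second and third rules require at least one NE token (following the "some NE content" wording). None of these details are confirmed by the slides.

```python
# Placeholder stopword list (assumption); a real run would use a full list.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def flag_non_english(text):
    """Return True if a permalink's text looks non-English under the slide's rules."""
    tokens = text.split()
    if not tokens:
        return False
    ne_tokens = sum(1 for t in tokens if not t.isascii())  # assumption: any non-ASCII char
    necr = ne_tokens / len(tokens)                          # NE Content Rate
    est_f = sum(1 for t in tokens if t.lower() in ENGLISH_STOPWORDS)  # stopword frequency
    est_p = est_f / len(tokens)                             # stopword proportion
    dlen = len(tokens)                                      # assumption: length in tokens

    if necr > 0.5 and est_f < 10:
        return True
    if 0 < necr < 0.5 and est_f < 10 and est_p < 0.02:
        return True
    if 0 < necr < 0.5 and dlen > 1000 and est_f < 20 and est_p < 0.01:
        return True
    return False
```

A text with NECR of exactly 0.5 matches none of the rules as transcribed, so it is left unflagged here.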

9
WIDIT Approach: Noise Reduction
  • Non-English (NE) blog Identification
  • Language tags in Feeds (FD)
  • 16,121 permalinks flagged (1,473 not flagged by
    PM)
  • NE tokens in Permalinks (PM)
  • 334,219 permalinks flagged (11.6%)
  • NE blog Validation
  • OnTopic (rel > 0) NE blogs in 2006 qrels file
  • 24 (PM only)
  • Suspected qrels error
  • NE blogs in 2006 qrels file
  • 59 (FD only), 2043 (PM only), 203 (both); 2304 overall
  • All but 3 manually validated as Non-English blogs
  • BLOG06-20051230-022-0008930772: short blog with no content
  • BLOG06-20060118-000-0008692678: short blog
  • BLOG06-20051208-020-0022945948: Uhhhh.... (long) It sucks!

10
Suspected Non-English blogs in qrels.blog06
11
WIDIT Approach: Noise Reduction
  • Non-blog content Exclusion
  1. Extract all unique tag patterns from the
     permalink data
  2. Compile a list of content noise tags with high
     frequency
  3. Construct regular expressions (regex) to
     identify content noise tags
  4. Apply regex to the unique tag set
  5. Modify regex based on the examination of regex
     results
  6. Repeat steps 4 and 5 until the regex correctly
     identifies content noise tags
  • For each permalink,
  • Extract blog segment using the content regex
  • Extract <post|comment> text
  • If no <post|comment> tags, extract <content> text
  • If no <content> tag, extract <body> text
  • Exclude noise text using the noise regex
  • e.g., <(div|span).?(footer|profile|sidenav|advertise|sponsor).?>
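
The per-permalink extraction fallback and noise exclusion can be sketched as follows. The tag names come from the slide, but the regular expressions here are hypothetical stand-ins (the slide's own regex is only given as an example), and the simple non-greedy matching does not handle nested elements of the same tag.

```python
import re

# Hypothetical noise pattern (assumption): a div/span whose attributes mention
# a noise keyword, removed together with its content.
NOISE_RE = re.compile(
    r"<(div|span)\b[^>]*(footer|profile|sidebar|nav|advertise|sponsor)[^>]*>.*?</\1>",
    re.IGNORECASE | re.DOTALL,
)

def extract_segment(html):
    """Try post/comment tags first, then content, then body."""
    for tags in (("post", "comment"), ("content",), ("body",)):
        parts = []
        for tag in tags:
            parts += re.findall(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, re.I | re.S)
        if parts:
            return "\n".join(parts)
    return html  # nothing matched: fall back to the whole page

def clean_permalink(html):
    """Extract the blog segment, drop noise blocks, and strip remaining tags."""
    segment = NOISE_RE.sub(" ", extract_segment(html))
    text = re.sub(r"<[^>]+>", " ", segment)
    return " ".join(text.split())
```

For example, `clean_permalink('<body><div class="nav">menu</div>Hello</body>')` keeps only the body text outside the navigation block.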

12
WIDIT Approach: Noise Reduction
  • Noise Reduction (NR) Statistics
  • Blog length reduction
  • Over 50% of blogs with length difference
  • Average length reduction: 551 bytes
  • 74.3 million (7.4%) tokens excluded by NR
  • 21,283 (0.6%) unique tokens excluded by NR

13
WIDIT Blog System Architecture
[Architecture diagram, repeated from slide 7]
14
WIDIT Approach: Topic Reranking
  • Topic Reranking factors
  • Exact Match of query title text in document
  • Near Match of query title/description text in
    document
  • All of the query terms occur in sequence near
    each other
  • Noun Phrase Match
  • Non-Rel Match
  • Non-relevant phrases/nouns from the topic
    narrative occur in document
  • Topic Reranking (TR) Method
  • Compute TR scores for each document
  • Document-length normalized frequency
  • Categorize initial retrieval into reranking
    groups
  • g1: exact match (query title to doc. title and
    body)
  • g2: exact match (multi-term query title to
    document title only)
  • g3: exact match (query title to doc. body only)
  • g4: other
  • Rerank documents within groups using combined TR
    score
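
A minimal sketch of the group-based reranking described above, assuming exact match means a case-insensitive substring match and that each document is a dict with hypothetical keys `title`, `body`, and `tr_score` (the combined TR score). How the TR score itself is combined is not shown.

```python
def rerank_group(query_title, doc_title, doc_body):
    """Return the reranking group (1 to 4) for one document."""
    q = query_title.lower()
    in_title = q in doc_title.lower()
    in_body = q in doc_body.lower()
    if in_title and in_body:
        return 1                      # g1: exact match in title and body
    if in_title and len(q.split()) > 1:
        return 2                      # g2: multi-term query, title only
    if in_body and not in_title:
        return 3                      # g3: body only
    return 4                          # g4: everything else

def topic_rerank(query_title, docs):
    """Order by group first, then by descending TR score within each group."""
    return sorted(
        docs,
        key=lambda d: (rerank_group(query_title, d["title"], d["body"]), -d["tr_score"]),
    )
```

Sorting on the pair (group, negated score) keeps every g1 document above every g2 document regardless of score, which is the point of reranking within groups.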

15
WIDIT Blog System Architecture
[Architecture diagram, repeated from slide 7]
16
WIDIT Approach: Opinion Lexicons
  • Lexicon-based Opinion Detection
  • Construct Opinion Lexicons from multiple sources
    of opinion evidence
  • Opinion Terminology, Opinion Collocations,
    Opinion Morphology
  • Score documents using Opinion Lexicons
  • Opinion Terminology
  • Wilson's Lexicons
  • A subset of Wilson's subjectivity terms
  • 4747 strong and 2190 weak subjective terms with
    polarity
  • 240 emphasis terms, 88 negation n-grams
  • High Frequency (HF) Lexicon
  • For each of the IMDb movie and 2006 blog training
    data sets
  • Extract high frequency terms from positive
    training data (e.g., movie review)
  • Exclude terms that occur in negative training
    data (e.g., movie plot summary)
  • Select a set of opinion terms
  • Combine the IMDb and blog term sets
  • Assign polarity and strength to each term
  • Expand with synonyms and antonyms from WordNet
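
The first two HF Lexicon steps (extract frequent terms from positive data, exclude terms seen in negative data) can be sketched as below. Whitespace tokenization and the `min_freq` cutoff are assumptions; the manual selection, polarity assignment, and WordNet expansion steps are not modeled.

```python
from collections import Counter

def hf_candidates(positive_docs, negative_docs, min_freq=2):
    """Frequent terms from positive data that never occur in negative data."""
    pos_counts = Counter(t for doc in positive_docs for t in doc.lower().split())
    neg_terms = {t for doc in negative_docs for t in doc.lower().split()}
    return [
        term
        for term, count in pos_counts.most_common()
        if count >= min_freq and term not in neg_terms
    ]
```

The output is only a candidate list; on the slide, a set of opinion terms is then selected from it.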

17
WIDIT Approach: Opinion Lexicons
  • Opinion Collocations
  • I-You (IU) Lexicon
  • For each of movie review and positive blog training
    data
  • Extract n-grams that begin/end with IU anchors
    (e.g., I, You, my, your, me)
  • Select a set of opinion collocations
  • Combine the movie and blog term sets
  • Assign strength and polarity to each collocation
  • Add verb conjugations and noun plurals
  • Expand with HF and Wilson terms
  • Acronym Lexicon
  • Select opinion collocations from netlingo
    acronyms
  • e.g., afaik (as far as I know), imho (in my
    humble opinion)
  • Assign strength and polarity to each collocation
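
Extracting n-grams that begin or end with an IU anchor might look like the sketch below. The anchor set is taken from the slide's examples; the tokenizer and the fixed n-gram length are assumptions made for illustration.

```python
import re

IU_ANCHORS = {"i", "you", "my", "your", "me"}  # examples from the slide

def iu_ngrams(text, n=3):
    """Return n-grams whose first or last token is an IU anchor."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return [" ".join(g) for g in grams if g[0] in IU_ANCHORS or g[-1] in IU_ANCHORS]
```

Applied to the collocation examples from the Research Questions slide, this picks up "i think tomato" and "vegetable to me" as candidates.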

18
WIDIT Approach: Opinion Lexicons
  • Opinion Morphology
  • When expressing opinion, people become creative
    and tend to use uncommon/rare terms
    (Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
  • LF Lexicon and LF Regex
  • Compile a set of Low Frequency (LF) terms in the
    blog collection
  • Exclude terms that occur frequently in negative
    training data
  • Construct regular expressions (LF regex) to
    identify Opinion Morph (OM) terms
  • Based on examination of HF terms and LF patterns
  • Compound words (e.g., crazygood, ohmygod)
  • Repeat-character words (e.g., sooo, fantaaastic)
  • Morph-spelled words (e.g., luv, hizzarious)
  • Apply regex to LF term set
  • Iteratively refine regex based on the examination
    of regex results
  • Exclude regex matches from LF term set
  • Select OM terms (LF lexicon) from the remaining
    set
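
Two of the LF regex pattern types can be illustrated with a short sketch. The patterns are hypothetical, not the ones used in WIDIT: repeat-character words are detected as three or more consecutive identical letters, and compound words as a split into two entries of a caller-supplied vocabulary.

```python
import re

# Assumption: three or more identical letters in a row marks a repeat-character word.
REPEAT_RE = re.compile(r"([a-z])\1{2,}")

def is_repeat_char(term):
    """True for terms such as 'sooo' or 'fantaaastic'."""
    return bool(REPEAT_RE.search(term.lower()))

def is_compound(term, vocab):
    """True if the term splits into two vocabulary words of two or more letters each."""
    t = term.lower()
    return any(t[:i] in vocab and t[i:] in vocab for i in range(2, len(t) - 1))
```

A two-way split will not catch three-part compounds such as "ohmygod", and morph-spelled words such as "luv" need a separate treatment that this sketch does not attempt.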

19
WIDIT Approach: Opinion Reranking
  • Opinion Reranking factors
  • Opinion Terminology
  • Wilson's lexicon, HF lexicon
  • Opinion Collocations
  • AC lexicon, IU lexicon
  • Opinion Morphology
  • LF lexicon, LF regex
  • Opinion Reranking (OR) Method
  • Compute OR scores for each document
  • Document-length normalized frequency
  • Rerank topic-reranked documents using combined OR
    score and topic-reranking groups
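
One plausible reading of "document-length normalized frequency" is the summed strength of lexicon hits divided by the document length in tokens. The sketch below assumes that reading, a simple tokenizer, and a hypothetical lexicon that maps each term to a strength value.

```python
import re

def or_score(text, lexicon):
    """Sum of lexicon strengths for matching tokens, divided by token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)
```

One such score would be computed per lexicon (Wilson's, HF, AC, IU, LF) and then combined into the OR score.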

20
WIDIT Blog System Architecture
[Architecture diagram, repeated from slide 7]
21
WIDIT Approach: Dynamic Tuning
  • Reranking formula
  • $RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
  • $w_i$: weight of reranking factor $i$
  • $NS_i$: normalized score of factor $i$
  • $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
  • $\alpha$: weight of original score
  • $\beta$: weight of overall reranking score
  • How to determine $\alpha$, $\beta$, $w_i$?
  • Too many parameters for exhaustive combinations
  • Linear combination may not suffice
  • Dynamic Tuning
  • Real-time display of parameter tuning effect on
    performance
  • To guide the user towards local optimum
  • By combining human intelligence (pattern
    recognition) with the computational power of the machine
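
The reranking formula can be written out directly: a weighted sum of the min-max normalized original score and the min-max normalized reranking factors. The function and parameter names below are illustrative, and the weights would come from the tuning process rather than from this code.

```python
def min_max(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def rerank_scores(orig, factors, weights, alpha, beta):
    """RS = alpha * NS_orig + beta * sum_i(w_i * NS_i), one value per document.

    orig: original retrieval scores, one per document
    factors: list of score lists, one list per reranking factor
    weights: one weight per reranking factor
    """
    ns_orig = min_max(orig)
    ns_factors = [min_max(f) for f in factors]
    return [
        alpha * ns_orig[d]
        + beta * sum(w * nf[d] for w, nf in zip(weights, ns_factors))
        for d in range(len(orig))
    ]
```

A factor whose scores are all equal carries no ranking information, so it is normalized to zero here instead of dividing by zero.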

22
WIDIT Approach: Dynamic Tuning
  • Opinion Reranking

23
WIDIT Approach: Dynamic Tuning
  • Polarity Detection

24
WIDIT Approach: Fusion
  • Weighted Sum Fusion Formula
  • $FS = \sum_i (w_i \cdot NS_i)$
  • $w_i$: weight of system $i$ (relative
    contribution of each system)
  • $NS_i$: normalized score of a document by
    system $i$, $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
  • Fusion Type
  • Baseline (Min-Max) fusion: $w_i = 1$
  • MAP fusion: $w_i$ = MAP of training runs
  • Fusion Combinations
  • By Query Length
  • Short, Long, Long w/ nouns
  • By Term Weight
  • Okapi, SMART
  • Fusion Levels
  • Baseline results
  • Topic-reranked results
  • Opinion-reranked results
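
A small sketch of weighted-sum fusion over several runs, assuming each run is a dict from document ID to raw score and that a document missing from a run contributes nothing for that run. With no weights given it reduces to the baseline case of equal weights; passing each run's training MAP as its weight gives the MAP variant.

```python
def fuse(runs, weights=None):
    """Weighted sum of min-max normalized scores across runs, best first."""
    if weights is None:
        weights = [1.0] * len(runs)          # baseline fusion: w_i = 1
    fused = {}
    for run, w in zip(runs, weights):
        lo, hi = min(run.values()), max(run.values())
        for doc, score in run.items():
            ns = (score - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Documents retrieved by several systems accumulate score from each, so agreement between runs is rewarded.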
25
WIDIT Approach: Polarity Detection
  • For each opinion-reranked document,
  • Compute positive and negative polarity scores
  • Combine polarity scores using D-tuned formula
  • fsc(p), fsc(n)
  • Apply polarity detection heuristic
  • Positive polarity if
  • most of opinion factors are positive,
  • fsc(p) - fsc(n) > threshold
  • fsc(p) >> fsc(n)
  • Negative polarity if
  • most of opinion factors are negative,
  • fsc(n) - fsc(p) > threshold
  • fsc(n) >> fsc(p)
  • Mixed polarity otherwise
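
The heuristic can be sketched as below, with several assumptions the slide does not settle: all three conditions must hold together, "most" means a strict majority of factors, and ">>" is modeled as exceeding the other score by a fixed ratio. The default `threshold` and `ratio` values are placeholders.

```python
def polarity(factor_scores, fsc_p, fsc_n, threshold=0.1, ratio=2.0):
    """Classify a document as 'positive', 'negative', or 'mixed'.

    factor_scores: list of (positive, negative) score pairs, one per opinion factor
    fsc_p, fsc_n: combined positive and negative polarity scores
    """
    pos_votes = sum(1 for p, n in factor_scores if p > n)
    neg_votes = sum(1 for p, n in factor_scores if n > p)
    half = len(factor_scores) / 2

    if pos_votes > half and fsc_p - fsc_n > threshold and fsc_p > ratio * fsc_n:
        return "positive"
    if neg_votes > half and fsc_n - fsc_p > threshold and fsc_n > ratio * fsc_p:
        return "negative"
    return "mixed"
```

Requiring all three conditions makes the positive and negative labels conservative, with everything else falling into the mixed class.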

26
Result Overview
  • Independent Variables
  • Noise Reduction
  • Query Length
  • Topic Reranking
  • Opinion Reranking
  • Dynamic Tuning
  • Fusion
  • Topic Difficulty
  • Failure analysis

27
Results Summary
  • Noise Reduction
  • Adverse effect on retrieval performance
  • Many relevant documents had contents excluded by
    the WIDIT Noise Reduction module
  • Query Length
  • The longer the query, the better the performance
  • Topic Reranking
  • 4% improvement (Qshort), 10% improvement (Qlong)
    over initial result
  • Opinion Reranking
  • 15% improvement (Qshort), 10% improvement (Qlong)
    over TopicRR
  • Dynamic Tuning
  • 4% improvement (Qshort), 9% improvement (Qlong)
    over no tuning
  • Fusion
  • 20% improvement (Qshort) over best baseline
    non-fusion
  • Topic Difficulty
  • Improvement by Opinion reranking not related to
    topic difficulty

28
Concluding Remarks
  • Noise Reduction
  • Good idea, but faulty implementation
  • Effect on retrieval is not yet clear
  • Post-retrieval Reranking, Dynamic Tuning, and
    Fusion all improve retrieval performance
  • Compound effect is even more beneficial
  • Opinion Modules
  • Need better training data

29
Result At a Glance
  • Topic MAP
  • Opinion MAP

31
Query Length Effect
37
Topic Difficulty
38
Failure Analysis
  • Possible reasons for failure (General)
  • Sense Ambiguity
  • 877 Sonic
  • game? Team(sonics)? Software? Toothbrush?
  • Usage Ambiguity
  • 881 Fox News Report
  • (non-rel) used more as a news source than the
    target of discussion
  • Narrow Search
  • 887 World Trade Organization
  • time frame, reaction to the meeting but not WTO
    in general
  • 900 MacDonald
  • regarding the food only
  • 890 Olympics
  • overall appeal and impression of the Winter and
    Summer Olympics.

39
Failure Analysis
  • Possible reasons for failure (WIDIT-specific):
    OnTopic
  • Noise Effect
  • 898 Business Intelligence Resources
  • pages having sidebar with link to business
    intelligence resources
  • Document Length Normalization
  • 874 Coretta Scott King
  • 866 Whole Foods
  • (reldoc) long article with small portion of
    relevant information
  • Exact Match Failure
  • 869 Mohammad Cartoon
  • Non-rel docs with exact Topic title
  • Stopword Failure
  • 866 whole foods
  • stopword list contains whole

40
Failure Analysis
  • Possible reasons for failure (WIDIT-specific):
    Opinion
  • Retrieved documents contain opinion but not on
    the target. (20)
  • Document on topic but opinions are on non-topic
    portion 1(898) 2(879)
  • Opinion about the original post (e.g., good
    stuff)
  • Inconsistent Assessment? (20)
  • 891 Intel 1(1) 2(3) 3(0)
  • 879 Hybrid cars 1(0) 2(3)
  • 899 cholesterol 1(4) 2(1)
  • 882 seahawks 1(1)
  • Others
  • Few relevant documents
  • 898 Business Intelligence Resources has 10
    relevant documents.
  • IU module failed when there are lots of comments
    following a post.
  • 899 1

41
Questions?
  • Wilson's lexicon
  • http://www.cs.pitt.edu/mpqa/opinionfinderrelease/
  • Movie Review Data
  • http://www.cs.cornell.edu/people/pabo/movie-review-data/
  • Movie Plot Summaries
  • http://www.imdb.com/Sections/Plots/
  • Netlingo Terms
  • http://www.netlingo.com/emailsh.cfm
  • WIDIT Lexicons
  • http://elvis.slis.indiana.edu/lexlist.htm