Combining Lexicon-based Methods to Detect Opinionated Blogs


1
Combining Lexicon-based Methods to Detect
Opinionated Blogs
  • Kiduk Yang, Ning Yu, Hui Zhang
  • WIDIT Laboratory
  • School of Library and Information Science
  • Indiana University

2
Research Questions: Noise Reduction
  • Does noise in blog data affect IR performance?
  • What are the characteristics of blog noise?
  • Non-English (NE) blogs
  • Large proportion of NE tokens
  • High-frequency NE stopwords & low-frequency
    English stopwords
  • Non-blog content
  • Non-post/comment text generated by blogware
  • e.g., sidebar/navigation, header/footer,
    advertisement, etc.
  • Spam postings
  • How can blog noise be identified?
  • NE blogs
  • Language tags in feeds
  • NE tokens in permalinks
  • Non-blog content
  • Mark-up tags in permalinks

3
Research Questions: Opinion Detection
  • How to retrieve blogs about something that are
    opinionated?
  • What is the evidence of opinion?
  • Opinion Terminology
  • Words often used in expressing an opinion
  • e.g., "Skype sucks", "Skype rocks", "Skype is
    cool"
  • Opinion Collocations
  • Collocations that mark an opinion
  • e.g., "I think tomato is a fruit", "Tomato is a
    vegetable to me"
  • Opinion Morphology
  • Word morphing to emphasize an opinion
  • e.g., "Vista is soooo buggy", "Vista is metacool"
  • How can they be leveraged?
  • Opinion classification via Supervised Learning
  • Document scoring using Opinion Lexicons
  • How can they be combined to detect opinionated
    blogs?
  • Weighted sum optimized via Dynamic Tuning

4
WIDIT Blog System Architecture
[Figure: system architecture diagram. Components shown: Blog Data, IMDb Data, Wilson's Lexicons, Netlingo Terms, Noise Reduction, Blogs w/o Noise, Document Indexing, Inverted Index, Topics, Query Indexing (Short Query, Long Query, Expanded Query), Retrieval, Initial Results, Fusion, Fusion Result, Topic Reranking, OnTopic Results, Opinion Lexicons, Opinion Reranking, Opinion Results, Polarity Detection, Polarity Result, Dynamic Tuning]

5
WIDIT Approach: Noise Reduction
  • Non-English (NE) blog Identification
  • Language tags in Feeds
  • Not always present
  • Extract all unique tags from the feed data
  • Identify tags that indicate non-English language
  • Flag permalinks in feeds with non-English
    language tag
  • NE tokens in Permalinks
  • NE blogs w/ English tokens & English blogs w/ NE
    tokens
  • For each permalink,
  • Identify tokens consisting of non-ASCII
    characters (i.e., NE tokens)
  • Compute the NE Content Rate: NECR = NE_tokens /
    tokens
  • Compute the frequency & proportion of English
    stopwords: Est(f), Est(p)
  • Flag if large NE content or some NE content with
    few E-stopwords
  • NECR > 0.5 with Est(f) < 10
  • NECR < 0.5 with Est(f) < 10 and Est(p) < 0.02
  • NECR < 0.5 with dlen > 1000, Est(f) < 20 and
    Est(p) < 0.01
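
The flagging rule on this slide can be sketched in Python as follows. This is a minimal illustration, not the WIDIT code: the stopword list, whitespace tokenization, measuring dlen in tokens, and treating "some NE content" as NECR > 0 are all assumptions made for the example.

```python
# Hypothetical sketch of the NE-blog flagging heuristic.
# Assumptions (not from the slides): the tiny stopword list, whitespace
# tokenization, dlen counted in tokens, and "some NE content" read as NECR > 0.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def flag_non_english(text):
    """Return True if the permalink text looks like a non-English blog."""
    tokens = text.lower().split()
    if not tokens:
        return False
    dlen = len(tokens)
    # NE token: contains at least one non-ASCII character (an assumption).
    ne_tokens = sum(1 for tok in tokens if not tok.isascii())
    necr = ne_tokens / dlen                      # NE Content Rate
    est_f = sum(1 for tok in tokens if tok in ENGLISH_STOPWORDS)  # frequency
    est_p = est_f / dlen                         # proportion

    if necr > 0.5 and est_f < 10:
        return True
    if 0 < necr < 0.5 and est_f < 10 and est_p < 0.02:
        return True
    if 0 < necr < 0.5 and dlen > 1000 and est_f < 20 and est_p < 0.01:
        return True
    return False
```

A text made up entirely of non-ASCII tokens is flagged by the first rule, while ordinary English text with no NE tokens is never flagged.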

6
WIDIT Approach: Noise Reduction
  • Non-blog content Exclusion
  • Extract all unique tag patterns from the
    permalink data
  • Compile a list of content noise tags with high
    frequency
  • Construct regular expressions (regex) to
    identify content noise tags
  • Apply regex to the unique tag set
  • Modify regex based on the examination of regex
    results
  • Repeat steps 4 & 5 (apply and modify the regex)
    until the regex correctly identifies content noise tags
  • For each permalink,
  • Extract blog segment using the content regex
  • Extract <post|comment> text
  • If no <post|comment> tags, extract <content> text
  • If no <content> tag, extract <body> text
  • Exclude noise text using the noise regex
  • e.g., <(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>
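
A rough Python sketch of the per-permalink extraction and exclusion steps is given here. The tag names, the keyword alternation, and the assumption that noise elements are not nested are illustrative guesses reconstructed from the garbled slide text, not the actual WIDIT regexes.

```python
import re

# Hypothetical noise pattern: a <div> or <span> whose attributes mention one
# of the noise keywords, removed together with its (non-nested) contents.
NOISE_BLOCK = re.compile(
    r"<(div|span)\b[^>]*(footer|profile|side|nav|advertise|sponsor)[^>]*>.*?</\1>",
    re.IGNORECASE | re.DOTALL,
)


def extract_segment(html):
    """Fall back from post/comment text to content text to body text."""
    for tags in (("post", "comment"), ("content",), ("body",)):
        pattern = r"<(%s)\b[^>]*>(.*?)</\1>" % "|".join(tags)
        parts = [m.group(2) for m in re.finditer(pattern, html, re.I | re.S)]
        if parts:
            return "\n".join(parts)
    return html  # no known container tag: keep the whole document


def strip_noise(segment):
    """Remove blogware-generated blocks such as footers and sidebars."""
    return NOISE_BLOCK.sub("", segment)
```

For real blog HTML an HTML parser would handle nesting more reliably than regular expressions; regexes are used here only because the slide describes a regex-based approach.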

7
WIDIT Approach: Noise Reduction
  • Noise Reduction (NR) Statistics
  • 335,691 (11.7%) NE permalinks excluded
  • Blog length reduction
  • Over 50% of permalinks with length difference
  • Average length reduction: 551 bytes
  • 74.3 million (7.4%) tokens excluded by NR
  • 21,283 (0.6%) unique tokens excluded by NR

8
WIDIT Approach: Topic Reranking
  • Topic Reranking factors
  • Exact Match of query title text in document
  • Near Match of query title/description text in
    document
  • All of the query terms occur in sequence near
    each other
  • Noun Phrase Match
  • Non-Rel Match
  • Non-relevant phrases/nouns from the topic
    narrative occur in document
  • Topic Reranking (TR) Method
  • Compute TR scores for each document
  • Document-length normalized frequency
  • Categorize initial retrieval into reranking
    groups
  • g1: exact match (query title to document title &
    body)
  • g2: exact match (multi-term query title to
    document title only)
  • g3: exact match (query title to document body only)
  • g4: other
  • Rerank documents within groups using combined TR
    score

9
WIDIT Approach: Opinion Lexicons
  • Lexicon-based Opinion Detection
  • Construct Opinion Lexicons from multiple sources
    of opinion evidence
  • Opinion Terminology, Opinion Collocations,
    Opinion Morphology
  • Score documents using Opinion Lexicons
  • Opinion Terminology
  • Wilson's Lexicons
  • A subset of Wilson's subjectivity terms
  • 4,747 strong & 2,190 weak subjective terms with
    polarity
  • 240 emphasis terms
  • High Frequency (HF) Lexicon
  • For each of IMDb movie & 2006 blog training data
  • Extract high frequency terms from positive
    training data (e.g., movie review)
  • Exclude terms that occur in negative training
    data (e.g., movie plot summary)
  • Select a set of opinion terms
  • Combine the IMDb & blog term sets
  • Assign polarity & strength to each term
  • Expand with synonyms & antonyms from WordNet
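
The first two HF Lexicon steps (extract frequent terms from positive data, drop those seen in negative data) might look like the following Python sketch. The whitespace tokenization and the `min_freq` cut-off are assumptions; the later steps (selecting opinion terms, assigning polarity and strength, WordNet expansion) are not automated here.

```python
from collections import Counter


def hf_candidates(positive_docs, negative_docs, min_freq=2):
    """Return candidate HF terms, most frequent first.

    positive_docs: opinionated texts (e.g., movie reviews)
    negative_docs: non-opinionated texts (e.g., plot summaries)
    min_freq: hypothetical frequency cut-off, not taken from the slides
    """
    pos_counts = Counter(tok for doc in positive_docs for tok in doc.lower().split())
    neg_vocab = {tok for doc in negative_docs for tok in doc.lower().split()}
    candidates = [
        term for term, count in pos_counts.items()
        if count >= min_freq and term not in neg_vocab
    ]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(candidates, key=lambda t: (-pos_counts[t], t))
```

The output is only a candidate list; the slide indicates that the opinion terms are then selected from it.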

10
WIDIT Approach: Opinion Lexicons
  • Opinion Collocations
  • I-You (IU) Lexicon
  • For each of movie review & positive blog training
    data
  • Extract n-grams that begin/end with IU anchors
    (e.g., I, You, my, your, me)
  • Select a set of opinion collocations
  • Combine the movie & blog term sets
  • Assign strength & polarity to each collocation
  • Add verb conjugations & noun plurals
  • Expand with HF & Wilson terms
  • Acronym Lexicon
  • Select opinion collocations from netlingo
    acronyms
  • e.g., afaik (as far as I know), imho (in my
    humble opinion)
  • Assign strength & polarity to each collocation
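
As a rough illustration of the IU n-gram extraction step, the Python sketch below collects n-grams that begin or end with an IU anchor. The anchor set uses only the examples given on the slide, and the whitespace tokenization and default n-gram length are assumptions.

```python
from collections import Counter

# Anchors listed as examples on the slide; the full WIDIT list may differ.
IU_ANCHORS = {"i", "you", "my", "your", "me"}


def iu_ngrams(text, n=3):
    """Count n-grams that begin or end with an IU anchor."""
    tokens = text.lower().split()
    grams = (tokens[i:i + n] for i in range(len(tokens) - n + 1))
    return Counter(
        " ".join(gram) for gram in grams
        if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS
    )
```

Running it over training data yields frequency counts from which opinion collocations such as "i think" patterns could be selected.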

11
WIDIT Approach: Opinion Lexicons
  • Opinion Morphology
  • When expressing opinion, people become creative
    and tend to use uncommon/rare terms
    (Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
  • LF Lexicon & LF Regex
  • Compile a set of Low Frequency (LF) terms in the
    blog collection
  • Exclude terms that occur frequently in negative
    training data
  • Construct regular expressions (LF regex) to
    identify Opinion Morph (OM) terms
  • Based on examination of HF terms & LF patterns
  • Compound words (e.g., crazygood, ohmygod)
  • Repeat-character words (e.g., sooo, fantaaastic)
  • Morph-spelled words (e.g., luv, hizzarious)
  • Apply regex to LF term set
  • Iteratively refine regex based on the examination
    of regex results
  • Exclude regex matches from LF term set
  • Select OM terms (LF lexicon) from the remaining
    set
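
Two of the Opinion Morph patterns, repeat-character words and compound words, can be approximated in Python as below. These patterns are illustrative guesses, not the LF regexes used by WIDIT; the compound check assumes a caller-supplied vocabulary, and morph-spelled words (e.g., luv) are not covered.

```python
import re

# A letter repeated three or more times in a row, e.g. "sooo", "fantaaastic".
REPEAT_CHAR = re.compile(r"([a-z])\1{2,}")


def is_repeat_char_word(term):
    return REPEAT_CHAR.search(term.lower()) is not None


def collapse_repeats(term):
    """Reduce each run of 3+ identical letters to a single letter."""
    return REPEAT_CHAR.sub(r"\1", term.lower())


def is_compound(term, vocabulary, min_parts=2):
    """True if the term splits entirely into at least min_parts known words."""
    def split(rest, parts):
        if not rest:
            return parts >= min_parts
        return any(
            rest[:i] in vocabulary and split(rest[i:], parts + 1)
            for i in range(1, len(rest) + 1)
        )
    return split(term.lower(), 0)
```

Collapsing repeats maps a morphed word back toward a base form that can be looked up in an existing lexicon, which is one possible use of such a match.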

12
WIDIT Approach: Opinion Reranking
  • Opinion Reranking factors
  • Opinion Terminology
  • Wilson's lexicon, HF lexicon
  • Opinion Collocations
  • AC lexicon, IU lexicon
  • Opinion Morphology
  • LF lexicon, LF regex
  • Opinion Reranking (OR) Method
  • Compute OR scores for each document
  • Document-length normalized frequency
  • Rerank topic-reranked documents using the
    combined OR score & topic-reranking groups
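
A minimal Python sketch of the OR method follows, assuming each lexicon maps a term to a strength weight and each document already carries its topic-reranking group. The field names and the per-lexicon weights are hypothetical, and reranking within groups is one interpretation of the slide.

```python
def normalized_frequency(tokens, lexicon):
    """Document-length normalized sum of lexicon weights for matched tokens."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)


def opinion_rerank(docs, lexicons, weights):
    """Sort by topic-reranking group, then by descending combined OR score.

    docs: list of dicts with 'group' (int) and 'tokens' (list of str)
    lexicons: list of dicts mapping term -> strength
    weights: one weight per lexicon (hypothetical values)
    """
    def or_score(doc):
        return sum(
            w * normalized_frequency(doc["tokens"], lex)
            for lex, w in zip(lexicons, weights)
        )
    return sorted(docs, key=lambda d: (d["group"], -or_score(d)))
```

Collocation lexicons would need phrase matching rather than single-token lookup; that is omitted to keep the sketch short.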

13
WIDIT Approach: Polarity Detection
  • For each opinion-reranked document,
  • Compute positive & negative polarity scores
  • Presence of valence shifters near opinion terms
    reverses polarity
  • e.g., not, never, no, without, hardly, barely,
    scarcely
  • Combine polarity scores using D-tuned formula
  • fsc(p), fsc(n)
  • Apply polarity detection heuristic
  • Positive polarity if
  • most of opinion factors are positive,
  • fsc(p) - fsc(n) > threshold
  • fsc(p) / fsc(n) > threshold2
  • Negative polarity if
  • most of opinion factors are negative,
  • fsc(n) - fsc(p) > threshold
  • fsc(n) / fsc(p) > threshold2
  • Mixed polarity otherwise
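
The polarity steps can be sketched in Python as below. Several choices are assumptions, since the slide does not specify them: "near" is taken to mean within a small window before the opinion term, the three conditions of each rule are combined with AND, and the thresholds are left as caller-supplied parameters.

```python
VALENCE_SHIFTERS = {"not", "never", "no", "without", "hardly", "barely", "scarcely"}


def polarity_scores(tokens, lexicon, window=2):
    """Return length-normalized (positive, negative) counts.

    lexicon maps a term to +1 or -1; a valence shifter within `window`
    tokens before the term flips its polarity (window size is an assumption).
    """
    pos = neg = 0
    for i, tok in enumerate(tokens):
        polarity = lexicon.get(tok)
        if polarity is None:
            continue
        if any(t in VALENCE_SHIFTERS for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        if polarity > 0:
            pos += 1
        else:
            neg += 1
    n = len(tokens) or 1
    return pos / n, neg / n


def classify_polarity(fsc_p, fsc_n, factor_signs, diff_threshold, ratio_threshold):
    """Apply the heuristic, assuming all three conditions must hold together."""
    def dominates(a, b):
        ratio = a / b if b > 0 else float("inf")
        return (a - b) > diff_threshold and ratio > ratio_threshold

    half = len(factor_signs) / 2
    if sum(1 for s in factor_signs if s > 0) > half and dominates(fsc_p, fsc_n):
        return "positive"
    if sum(1 for s in factor_signs if s < 0) > half and dominates(fsc_n, fsc_p):
        return "negative"
    return "mixed"
```

Here `factor_signs` stands for the sign of each opinion factor's polarity evidence, a hypothetical input introduced for this example.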

14
WIDIT Approach: Dynamic Tuning
  • Reranking formula
  • $RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
  • $w_i$: weight of reranking factor $i$
  • $NS_i$: normalized score of factor $i$
  • $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
  • $\alpha$: weight of original score
  • $\beta$: weight of overall reranking score
  • How to determine $\alpha$, $\beta$, $w_i$?
  • Too many parameters for exhaustive combinations
  • Linear combination may not suffice
  • Dynamic Tuning
  • Real-time display of parameter tuning effect on
    performance
  • To guide the user towards local optimum
  • By combining human intelligence (pattern
    recognition) w/ the computational power of the machine
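
The reranking formula, a weighted sum of the min-max normalized original score and the min-max normalized factor scores, can be written as a small Python function. Returning 0.0 when all scores are equal is an assumption made to avoid division by zero; the parameter values themselves would come from tuning.

```python
def min_max(scores):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # degenerate case: an assumption
    return [(s - lo) / (hi - lo) for s in scores]


def rerank_scores(orig, factors, weights, alpha, beta):
    """Compute RS = alpha * NS_orig + beta * sum_i(w_i * NS_i) per document.

    orig: original retrieval scores, one per document
    factors: list of score lists, one list per reranking factor
    weights: one weight w_i per factor
    """
    ns_orig = min_max(orig)
    ns_factors = [min_max(f) for f in factors]
    return [
        alpha * ns_orig[d]
        + beta * sum(w * ns[d] for w, ns in zip(weights, ns_factors))
        for d in range(len(orig))
    ]
```

For example, `rerank_scores([10, 20, 30], [[3, 1, 2]], [1.0], 0.5, 0.5)` gives `[0.5, 0.25, 0.75]`, so the first document overtakes the second once the factor score is mixed in.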

15
WIDIT Approach: Dynamic Tuning
  • Opinion Reranking

16
WIDIT Approach: Dynamic Tuning
  • Polarity Detection

17
Results Overview
  • Noise Reduction
  • Positive effect on retrieval performance
  • Topic Reranking
  • 16% improvement (Qshort), 9% improvement (Qlong)
    over initial result
  • Opinion Reranking
  • 15% improvement (Qshort), 11% improvement (Qlong)
    over TopicRR

18
Relative Performance: Opinion Reranking Effects
(short query)
Good OnTopic retrieval → good opinion retrieval
- but not necessarily due to opinion reranking
19
Relative Performance: Polarity Detection
Steeper slope → worse relative performance
20
Result Overview: 2006 vs. 2007
[Figure: two panels, Topic MAP and Opinion MAP]
21
Noise Reduction Effect
22
Reranking Effect
23
Opinion Reranking Factors
24
Concluding Remarks
  • Noise Reduction
  • Positive effect on retrieval performance
  • Reranking
  • Most influential component of the system
  • Next year
  • Improve baseline performance
  • Spam filtering?
  • Incorporate Machine Learning into the fusion pool

25
Questions?
  • Wilson's lexicon
  • http://www.cs.pitt.edu/mpqa/opinionfinderrelease/
  • Movie Review Data
  • http://www.cs.cornell.edu/people/pabo/movie-review-data/
  • Movie Plot Summaries
  • http://www.imdb.com/Sections/Plots/
  • Netlingo Terms
  • http://www.netlingo.com/emailsh.cfm
  • WIDIT Lexicons
  • http://elvis.slis.indiana.edu/lexlist.htm

26
Result At a Glance
27
Reranking Effect: Dynamic Tuning
r1: topic reranking; r2: opinion reranking
s0R: reranking w/o Dynamic Tuning; s0R1:
reranking w/ Dynamic Tuning
  • Opinion reranking
  • sacrifices topical performance for the sake
    of opinion detection
  • Dynamic Tuning
  • improves reranking performance across the
    board

28
Query Length Effect
29
Term Weight Effect
30
Result At a Glance
31
Result At a Glance
36
WIDIT Approach: Fusion
  • Weighted Sum Fusion Formula
  • $FS = \sum_i (w_i \cdot NS_i)$
  • Fusion Type
  • Baseline (Min-Max) fusion: $w_i = 1$
  • MAP fusion: $w_i$ = MAP of training runs
  • Fusion Combinations
  • By Query Length
  • Short, Long, Long w/ nouns
  • By Term Weight
  • Okapi, SMART
  • Fusion Levels
  • Baseline results
  • Topic-reranked results
  • Opinion-reranked results

$w_i$: weight of system $i$ (relative contribution of each system)
$NS_i$: normalized score of a document by system $i$, $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
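
The weighted sum fusion can be sketched in Python as follows, assuming each system's run is a non-empty mapping from document ID to score. Unit weights correspond to the baseline (Min-Max) fusion; passing each run's training MAP as its weight corresponds to MAP fusion. The function name and data layout are hypothetical.

```python
def fuse(runs, weights=None):
    """Weighted sum fusion of min-max normalized scores from several runs.

    runs: list of non-empty dicts mapping doc_id -> score, one dict per system
    weights: one weight per run; defaults to 1.0 each (baseline fusion)
    Returns (doc_id, fused_score) pairs, best first.
    """
    if weights is None:
        weights = [1.0] * len(runs)
    fused = {}
    for run, w in zip(runs, weights):
        lo, hi = min(run.values()), max(run.values())
        span = hi - lo
        for doc_id, score in run.items():
            ns = (score - lo) / span if span else 0.0  # min-max normalization
            fused[doc_id] = fused.get(doc_id, 0.0) + w * ns
    # Sort by descending fused score, then by doc_id for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A document missing from a run simply contributes nothing for that system, which is one common convention and an assumption here.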
37
WIDIT Approach: Noise Reduction
  • Non-English (NE) blog Identification
  • Language tags in Feeds (FD)
  • 16,121 permalinks flagged (1,473 not flagged by
    PM)
  • NE tokens in Permalinks (PM)
  • 334,219 permalinks flagged (11.6%)
  • NE blog Validation
  • OnTopic (rel > 0) NE blogs in 2006 qrels file
  • 24 (PM only)
  • Suspected qrels error
  • NE blogs in 2006 qrels file
  • 59 (FD only) 2043 (PM only) 203 (both) 2304
  • All but 3 manually validated as Non-English
    blogs
  • BLOG06-20051230-022-0008930772: short blog with
    no content
  • BLOG06-20060118-000-0008692678: short blog
  • BLOG06-20051208-020-0022945948: "Uhhhh....
    (long) It sucks!"

38
Suspected Non-English blogs in qrels.blog06