1
Ranking for Sentiment
  • DCU at TREC 2008: The Blog Track
  • Adam Bermingham, abermingham@computing.dcu.ie

2
DCU Team Sentiment!
CLARITY: Centre for Sensor Web Technologies
Centre for Digital Video Processing
National Centre for Language Technology
Prof. Alan Smeaton
Dr. Jennifer Foster
Adam Bermingham
Dr. Deirdre Hogan
3
Sentiment Analysis I
  • Who is favourite to win the match, Ireland or New
    Zealand?
  • What is the sentiment towards Barack Obama / the
    new iPod / Lehman Brothers Holdings Inc?
  • How opinionated is the discussion around Mary
    Harney, Enda Kenny, Zig Zag?

4
Sentiment Analysis II
  • Identification of subjectivity and polarity of
    opinion in textual information
  • Crossover of Information Retrieval, NLP, Text
    Mining
  • The Challenges
  • Document classification, scoring
  • Opinion extraction
  • Opinion summarization, visualization
  • Real world correlation

5
SA Document Scoring
  • Permits reranking, fusion with other document
    information
  • E.g. relevance, authority, PageRank, etc.
  • Machine Learning approaches
  • Bag-of-words variants
  • Lexicon approaches
  • Dictionaries for sentiment, polarity and
    subjectivity.
  • Alternative text features
  • Out-of-vocabulary words, punctuation, etc.

6
The Blog Track
  • Run at TREC since 2006
  • Three tasks (2008)
  • Find relevant blog posts
  • Find opinionated blog posts
  • Find positive / negative blog posts
  • Results: 1,000 ranked documents per topic
    (query), per task
  • 50 topics per year
  • Primary evaluation metric: MAP (Mean Average
    Precision), sketched below
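A minimal sketch of the metric referenced above: per-topic Average Precision and the mean over all topics (MAP). The function names and the topic-to-ranking dictionary layout are illustrative, not the TREC evaluation tooling.

```python
# Illustrative sketch of Average Precision and MAP (not trec_eval).

def average_precision(ranking, relevant):
    """ranking: doc ids in ranked order; relevant: set of relevant doc ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank      # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, qrels):
    """rankings: topic -> ranked doc ids; qrels: topic -> set of relevant doc ids."""
    aps = [average_precision(rankings[t], qrels[t]) for t in qrels]
    return sum(aps) / len(aps) if aps else 0.0
```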

7
Topic Example
  • <num> Number: 1049 </num>
  • <title> YouTube </title>
  • <desc> Description: Find views about the YouTube
    video-sharing website. </desc>
  • <narr> Narrative: The YouTube video-sharing
    website provides internet users with a relatively
    new way to share videos. Documents which express
    views about how well it succeeds in meeting the
    needs of users are relevant. </narr>

8
Assessments
  • Relevance judgements: pooling, human assessors
  • QRELs: not relevant / relevant, non-opinionated /
    relevant, positively opinionated / relevant,
    negatively opinionated / relevant, mixed
    opinionated / not judged
  • 32,021 QRELs from 2006 and 2007 available

9
Corpus Blog06
  • >3 million blog posts
  • Crawled over a few weeks in 2006
  • Permalink HTML
  • Also available: homepage HTML, RSS
  • Real-world
  • Includes spam blogs (splogs), multilingual
    blogs, inappropriate content

10
Blogs
  • "Weblog" coined 1997: a website containing
    regular timestamped posts in chronological order
  • Universal McCann (March 2008): 184 million
    worldwide have started a blog (26.4 million US);
    346 million worldwide read blogs (60.3 million
    US); 77% of active Internet users read blogs

11
Blog: an example
  • Blog
  • Date
  • Post
  • Links
  • Tags
  • Comments

12
Approach
  • Get relevant documents
  • Assess results for sentiment using three feature
    sets
  • Re-rank relevant results using late fusion of
    feature sets

13
Approach: feature sets
  • Lexicon Features
  • Aggregate sentiment scores for a document's
    constituent words in a sentiment lexicon.
  • Surface Features
  • Textual features which do not require parsing or
    syntactic understanding of the sentence
    structure.
  • Syntactic Features
  • Textual features derived from parsing and
    part-of-speech tagging documents.

14
System Architecture
15
Retrieval
  • Terrier
  • University of Glasgow
  • Open source / Java
  • Retrieval: Okapi BM25 (sketched below)
  • Query Expansion: Bo1 (Bose-Einstein) Divergence
    From Randomness

16
Preprocessing
  • Parse HTML
  • HTMLParser tool
  • Divide into text sections according to breaking
    HTML elements
  • Noise Removal (sketched below)
  • Discard sections with:
  • A high anchor-text to non-anchor-text ratio
    (e.g. ads, blogrolls)
  • A high non-alphabetic to alphabetic character
    ratio (e.g. dates, code, gobbledegook)
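A minimal sketch of the noise-removal heuristics above, assuming each section arrives as its anchor text plus its full text after HTML parsing. The thresholds are illustrative, not the values used in the actual system.

```python
# Illustrative section filter; thresholds are assumptions, not tuned values.

def is_noisy_section(anchor_text, full_text,
                     max_anchor_ratio=0.5, max_nonalpha_ratio=0.5):
    if not full_text:
        return True
    anchor_ratio = len(anchor_text) / len(full_text)           # ads, blogrolls
    nonalpha = sum(1 for c in full_text if not c.isalpha() and not c.isspace())
    nonalpha_ratio = nonalpha / len(full_text)                  # dates, code, noise
    return anchor_ratio > max_anchor_ratio or nonalpha_ratio > max_nonalpha_ratio
```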

17
Machine Learning
  • WEKA: Waikato Environment for Knowledge Analysis
  • Java; a good entry point; performance issues (?)
  • Three-way binary logistic regression
    classification
  • Scores are obtained from the class-probability
    distributions for classified documents (sketched
    below)
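The slides describe WEKA; the scikit-learn sketch below only illustrates the idea of turning a classifier's class-probability distribution into document scores. The toy data, the 0/1/2 label meanings (objective / positive / negative), and the score definitions are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature vectors and labels; 0 = objective, 1 = positive, 2 = negative (assumed).
X_train = np.array([[0.1, 0.0], [0.9, 0.1], [0.8, 0.9]])
y_train = np.array([0, 1, 2])
X_test = np.array([[0.7, 0.2]])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)            # one probability distribution per document
opinion_score = 1.0 - proba[:, 0]            # e.g. opinionatedness = 1 - P(objective)
polarity_score = proba[:, 1] - proba[:, 2]   # e.g. positive minus negative probability
```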

18
Syntactic Features
  • Parsed using the Charniak and Johnson re-ranking
    parser, run at ICHEC (Irish Centre for High-End
    Computing)
  • The 50 most discriminative part-of-speech
    unigrams, bigrams and trigrams (sketched below)
  • Penn Treebank phrasal types
  • Normalised counts of types
  • Normalised counts of types as root of tree
  • Normalised counts of parse tree structures
    likely to reflect subjectivity

Thanks to Joachim Wagner!
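A minimal sketch of normalised part-of-speech n-gram counts, one of the syntactic feature groups above. It uses NLTK's tagger purely for illustration; the actual features came from Charniak-Johnson parses, and selecting the 50 most discriminative n-grams is left to a downstream feature-selection step.

```python
from collections import Counter
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages

def pos_ngram_features(text, n_values=(1, 2, 3)):
    """Normalised counts of POS unigrams, bigrams and trigrams for one document."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    features = {}
    for n in n_values:
        ngrams = list(zip(*(tags[i:] for i in range(n))))
        total = len(ngrams) or 1
        for gram, count in Counter(ngrams).items():
            features["POS_" + "_".join(gram)] = count / total  # normalised count
    return features
```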
19
Surface Features
  • Normalized word counts for a manually created
    lexicon of obscenities and emotive and polarised
    words
  • Non-word characters and character sequences such
    as punctuation and emoticons.
  • Regex patterns to detect unusual word and
    punctuation structures, e.g. arrrrgh, ?!?!, ...., b
  • Document measurements (see the sketch below)
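A minimal sketch of the surface features above: counts of regex matches for exaggerated spellings, punctuation runs and emoticons, plus simple document measurements. The specific patterns and feature names are assumptions for illustration; the real lexicon was manually curated.

```python
import re

# Illustrative patterns only.
SURFACE_PATTERNS = {
    "repeated_letters": re.compile(r"(\w)\1{2,}"),    # e.g. arrrrgh
    "mixed_punct":      re.compile(r"[?!]{2,}"),      # e.g. ?!?!
    "long_ellipsis":    re.compile(r"\.{3,}"),        # e.g. ....
    "emoticon":         re.compile(r"[:;]-?[()DP]"),  # e.g. :-) ;P
}

def surface_features(text):
    words = text.split()
    feats = {name: len(pat.findall(text)) for name, pat in SURFACE_PATTERNS.items()}
    feats["n_words"] = len(words)                     # document measurements
    feats["avg_word_len"] = sum(map(len, words)) / len(words) if words else 0.0
    return feats
```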

20
Lexicon Features
  • SentiWordNet: positivity and negativity scores
    for each synset in WordNet
  • Scoring: weighted sum of mean positivity and
    negativity scores per document (sketched below)
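A minimal sketch of the lexicon scoring above. It assumes a lexicon dict mapping a word to a (positivity, negativity) pair; building that mapping from SentiWordNet (for example by averaging over a word's synsets) is omitted, and the weights are illustrative.

```python
def lexicon_score(words, lexicon, w_pos=1.0, w_neg=1.0):
    """Weighted sum of mean positivity and mean negativity over a document's words."""
    scored = [lexicon[w] for w in words if w in lexicon]
    if not scored:
        return 0.0
    mean_pos = sum(p for p, _ in scored) / len(scored)
    mean_neg = sum(n for _, n in scored) / len(scored)
    return w_pos * mean_pos + w_neg * mean_neg
```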

21
Weighting
  • Weighted CombSUM (sketched below)
  • Weights learned from MAP on the 2006 and 2007
    topics
  • Rather than cross validation
  • Scores from 3 classifiers fused before merging
    with relevance score
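A minimal sketch of weighted CombSUM late fusion: each score list is min-max normalised and combined with a weight; the three sentiment classifiers are fused first and the result is then merged with the relevance score. The weights and the normalisation choice here are illustrative, not the tuned values.

```python
def weighted_comb_sum(score_lists, weights):
    """score_lists: list of {doc_id: score} dicts; weights: matching list of floats."""
    fused = {}
    for scores, w in zip(score_lists, weights):
        lo, hi = min(scores.values()), max(scores.values())
        for doc_id, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0  # min-max normalisation
            fused[doc_id] = fused.get(doc_id, 0.0) + w * norm
    return fused

# Example with illustrative weights: fuse the three sentiment classifiers first,
# then merge with the retrieval (relevance) scores.
# sentiment = weighted_comb_sum([lexicon_scores, surface_scores, syntactic_scores],
#                               [0.3, 0.2, 0.5])
# final = weighted_comb_sum([relevance_scores, sentiment], [0.7, 0.3])
```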

22
Weighting
(Learned weights shown for opinion finding and polarised opinion finding)
23
Results
(Results shown for the opinion baseline, opinion finding, and polarised opinion finding)
24
Results per topic
25
Preliminary Conclusions
  • Syntactic features appear to subsume surface
    features
  • Observed during training of the fusion weights
  • Significant gains can be had through an
    efficient, uniform baseline
  • Subjectivity is important in polarity detection
  • Bigger difference in writing style between
    objective and subjective texts than between
    negative and positive texts.

26
Future Work
  • Further work on parse trees for sentiment
    classification
  • Movie review classification (Wolfgang Seeker)
  • Sub-document relevance and sentiment modelling
  • Unstructured text
  • Logical levels: Sentence? Phrase? Paragraph?
    Passage? N.O.T.A.?

27
Thanks!
  • TREC Blog Track Wiki
  • http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/
  • Opinion Mining and Sentiment Analysis Survey
  • http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html
  • TREC Blog Track 2007 overview
  • http://trec.nist.gov/pubs/trec16/papers/BLOG.OVERVIEW08.pdf
  • Tools
  • Weka: http://www.cs.waikato.ac.nz/ml/weka/
  • Terrier: http://ir.dcs.gla.ac.uk/terrier/
  • HTMLParser: http://htmlparser.sourceforge.net/
  • SentiWordNet: http://sentiwordnet.isti.cnr.it/