Experimental Study on Sentiment Classification of
Chinese Review using Machine Learning Techniques

1
Experimental Study on Sentiment Classification of
Chinese Review using Machine Learning Techniques
  • Jun Li and Maosong Sun
  • Department of Computer Science and Technology
  • Tsinghua University, Beijing, China
  • IEEE NLP-KE 2007

2
Outline
  • Introduction
  • Corpus
  • Features
  • Performance Comparison
  • Analysis and Conclusion

3
Introduction
  • Why do we perform this task?
  • Much prior attention has centered on feature-based
    sentiment extraction
  • Sentence-level analysis is useful, but it involves
    complex processing and is usually format-dependent
    (Liu et al., WWW 2005)
  • Sentiment classification using machine learning
    techniques
  • based on the overall sentiment of a text
  • easily transfers to new domains given a training set
  • Applications
  • Split reviews into positive and negative sets
  • Monitor bloggers' mood trends
  • Filter subjective web pages

4
Corpus
  • From www.ctrip.com
  • Average length 69.6 words with standard deviation 89.0
  • 90% of the reviews are shorter than 155 words
  • including some English words

5
Review rating distribution and score threshold
  • Ratings of 4.5 and up are considered positive; 2.0
    and below are considered negative (a labeling rule
    sketched below).
  • 12,000 reviews as training set, 4,000 reviews as
    test set
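
A minimal sketch of that labeling rule; discarding mid-range
ratings is our reading of the two thresholds, and the function
name is illustrative, not from the slides:

  def label_review(rating: float):
      # Thresholds from the slide: >= 4.5 positive, <= 2.0 negative
      if rating >= 4.5:
          return "pos"
      if rating <= 2.0:
          return "neg"
      return None  # ratings between 2.0 and 4.5 are left unlabeled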

6
Features - text representation
  • Text representation schemes
  • Word-Based Unigram (WBU), widely used
  • Word-Based Bigram (WBB)
  • Chinese Character-Based Bigram (CBB)
  • Chinese Character-Based Trigram (CBT)

Scheme   Unique features   Total features   Avg-Len/Std
WBU      21,074            830,826          69.3/88.9
WBB      251,289           818,832          68.3/88.9
CBB      128,049           1,053,860        88.0/112.8
CBT      340,501           918,841          76.8/99.8
Table 1. Statistics of the training set under the four
text representation schemes
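
To make the four schemes concrete, a minimal sketch; it assumes
reviews are already word-segmented for WBU/WBB (segmentation
itself is out of scope here), and the function names are ours:

  def word_unigrams(words):            # WBU
      return list(words)

  def word_bigrams(words):             # WBB
      return [w1 + "_" + w2 for w1, w2 in zip(words, words[1:])]

  def char_ngrams(text, n):            # CBB (n=2), CBT (n=3)
      return [text[i:i + n] for i in range(len(text) - n + 1)]

  words = ["酒店", "很", "干净"]        # segmented "the hotel is very clean"
  print(word_bigrams(words))           # ['酒店_很', '很_干净']
  print(char_ngrams("酒店很干净", 2))   # ['酒店', '店很', '很干', '干净']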
7
Features - representation in a graph model
  • Feature representation (n ≥ 2) in a graph model

(Figure: a document D represented as a graph whose nodes are
the tokens x1 ... xk, linked by n-gram features f1 ... fk-1)
8
Features - weight
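
The slide's weighting formulas are not reproduced in this
transcript. Below is a minimal sketch of two of the weightings
compared later in the deck, boolean and tf·idf; reading
"tfidf-c" as tf·idf with cosine (L2) normalization is our
assumption:

  import math
  from collections import Counter

  def bool_weight(doc_feats):
      # Boolean weighting: 1 if a feature occurs in the document
      return {f: 1.0 for f in set(doc_feats)}

  def tfidf_c_weight(doc_feats, df, n_docs):
      # df maps each feature to the number of training documents
      # containing it; tf*idf is then cosine-normalized
      tf = Counter(doc_feats)
      w = {f: tf[f] * math.log(n_docs / df[f]) for f in tf}
      norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
      return {f: v / norm for f, v in w.items()}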
9
Performance Comparison - methods
  • Support Vector Machines (SVM)
  • Naïve Bayes (NB)
  • Maximum Entropy (ME)
  • Artificial Neural Network (ANN)
  • two-layer feed-forward
  • Baseline: Naive Counting
  • Predicts by comparing the numbers of positive and
    negative sentiment words (sketched below)
  • Heavily depends on the sentiment dictionary
  • micro-averaged F1 = 0.7931, macro-averaged F1 =
    0.7573
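
A sketch of the naive-counting baseline. POS_WORDS and
NEG_WORDS stand in for the sentiment dictionary the method
depends on; the entries and the tie-breaking rule are
illustrative assumptions:

  POS_WORDS = {"干净", "方便", "舒服"}   # clean, convenient, comfortable
  NEG_WORDS = {"脏", "差", "吵"}         # dirty, bad, noisy

  def naive_counting(words):
      pos = sum(w in POS_WORDS for w in words)
      neg = sum(w in NEG_WORDS for w in words)
      return "pos" if pos >= neg else "neg"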

10
Performance Comparison - WBU
SVM, NB, ME, ANN using WBU as features with
different feature weights
11
Performance Comparison - WBU
Four methods using WBU as features
12
Performance Comparison - WBB
Four methods using WBB as features
13
Performance Comparison - CBB and CBT
Four methods using CBB as features
Four methods using CBT as features
14
Performance Comparison
15
Analysis
  • On average, NB outperforms all the other
    classifiers when using WBB and CBT
  • N-gram-based features relax the conditional
    independence assumption of the Naive Bayes model
  • They capture more integral semantic content
  • People tend to use combinations of words to
    express positive and negative sentiment (see the
    sketch below).
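
To make the n-gram-plus-NB pipeline concrete, a minimal sketch
with scikit-learn (not the tooling used in the paper):
character bigrams (CBB) with boolean weights feeding a
multinomial Naive Bayes classifier; the two reviews are
illustrative:

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB

  docs = ["酒店很干净", "房间太脏了"]   # "hotel is very clean" / "room is too dirty"
  labels = ["pos", "neg"]

  # Character bigrams (CBB) with boolean weighting
  vec = CountVectorizer(analyzer="char", ngram_range=(2, 2), binary=True)
  X = vec.fit_transform(docs)

  clf = MultinomialNB().fit(X, labels)
  print(clf.predict(vec.transform(["房间很干净"])))  # -> ['pos']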

16
Conclusion
  • (1) On average, NB outperforms all the other
    classifiers when using WBB or CBT as the text
    representation scheme with boolean weighting, under
    different feature dimensionalities reduced by
    chi-max, and it is more stable than the others.
  • (2) Compared with WBU, WBB and CBB carry stronger
    meaning as semantic units for classifiers.
  • (3) Most of the time, tfidf-c is much better for
    SVM and ME.
  • (4) Considering that SVM achieves the best
    performance under all conditions and is the most
    popular method, we recommend using WBB or CBB to
    represent text with tfidf-c as feature weighting to
    obtain better performance relative to WBU.

17
Thank you!
  • Q & A

Dataset and software are available at
http://nlp.csai.tsinghua.edu.cn/lj/