Title: Movie Review Mining: a comparison between Supervised and Unsupervised Classification Approaches
- Pimwadee Chaovalit, Lina Zhou
- Department of Information Systems
- University of Maryland, Baltimore County
0. Contents
- 1. Introduction
- 2. Background
- 2.1. Opinion mining
- 2.2. Movie review mining
- 2.3. Machine learning vs. Semantic orientation
- 3. Methodology
- 3.1. Machine learning approach
- 3.2. Semantic Orientation (SO) approach
- 3.3. Test data
- 3.4. Rating decisions
- 3.5. Evaluation techniques
- 4. Experiment result and analysis
- 4.1. Supervised machine learning approach
- 4.2. Unsupervised learning approach
- 5. Discussion
- 5.1. Summary
- 5.2. Limitations
- 5.3. Future work
- 5.4. Conclusion
1. Introduction
- The main objective of opinion mining
- Classify a large number of opinions into a bipolar orientation (positive or negative) using web-mining techniques
- Help consumers in making their purchasing decisions
- Apply in constructing information presentation
- Two types of techniques
- Machine learning
- Semantic orientation
- Which approach is better for opinion mining?
- Still an open question!
2. Background 2.1. Opinion mining
- Mining reviews of various products by classifying them into positive or negative opinions
- Summarize and give users statistical information
- Analyze product reputations
- Opinion mining process
- Product review data are crawled from the web
- Data preparation
- Objectivity classification
- Subjectivity analysis
2. Background 2.2. Movie review mining
- Special challenges!
- Domain specific
- Word semantics in a particular review could contradict the overall semantic direction
- e.g., "unpredictable camera" vs. "unpredictable plot"
- So, we need to
- Train the machine learning classifier with a movie review dataset
- Adapt the semantic orientation approach to the movie review domain
2. Background 2.3. Machine learning vs. semantic orientation
- Supervised training approach
- Collecting training dataset from certain websites
- Selecting features
- Training a classifier on the corpus
- Evaluating on the test dataset based on chosen criteria
- The above processes are usually repeated in multiple iterations to produce a better model
2. Background 2.3. Machine learning vs. semantic orientation
- Semantic orientation
- Dimensions
- Direction
- Indicating whether a word has positive or negative meaning
- Intensity
- Designating how strong the word is
- Conjunctive words
- Help in understanding the tone of the sentence
- Improve the training of a supervised learning algorithm
- Prior work: Turney's study
- POS tagger (adj, adv), SO-PMI technique
3. Methodology 3.1. Machine learning approach
- 3.1.1. Corpus
- Good corpus resources
- Good review quality
- Available metadata
- Easy spidering
- Reasonably large number of reviews and products
- Select a ready-to-use clean dataset
- 1,400 text files in total (700 as pos, 700 as neg)
- Transforming a 4-star or 5-star rating system
- Manually examining to ensure quality
3. Methodology 3.1. Machine learning approach
- 3.1.2. N-gram classifiers
- Select n-gram models
- Tool: a shareware, Rubryx version 2.0
- Classification algorithms based on n-gram features
- Classification models
- Stop-word lists
- Domain-specific dictionaries
- Capture a large number of features at the beginning
- Over-filtering data -> removes important information
- Try multiple sets of features to select the best one
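To make the n-gram feature step concrete, here is a minimal sketch of extracting n-gram counts after stop-word filtering. It does not reproduce Rubryx's internals, which the slides do not describe; the stop-word list, the tokenizer, and the function names are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; not the list used in the study.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "was", "to", "in", "it"}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngram_features(text, n_values=(1, 2, 3), stop_words=STOP_WORDS):
    """Count word n-grams for each n in n_values after dropping stop words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    features = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

features = ngram_features("The plot was unpredictable and the acting was great")
for ngram, count in sorted(features.items()):
    print(ngram, count)
```

Because stop words are removed before the n-grams are formed, words that were not adjacent in the review can end up in the same n-gram. This is one way aggressive filtering can distort or remove information, as the slide warns.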
3. Methodology 3.2. Semantic Orientation approach
- Using POS tags -> extract two-word phrases
- Phrase patterns from Turney's study
- Adjective or adverb provides subjectivity
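The phrase extraction step can be sketched as follows. The slides do not list the patterns, so the five tag patterns below are taken from Turney's published study (Penn Treebank tags) and should be read as an assumption about what was used here. The input is assumed to be already tagged, and the sample sentence and function names are hypothetical.

```python
ADVERBS = {"RB", "RBR", "RBS"}
NOUNS = {"NN", "NNS"}
VERBS = {"VB", "VBD", "VBN", "VBG"}

def matches_pattern(first, second, third):
    """Check two consecutive tags (plus the following tag) against the patterns."""
    if first == "JJ" and second in NOUNS:
        return True  # adjective + noun, third word unrestricted
    if first in ADVERBS and second in VERBS:
        return True  # adverb + verb, third word unrestricted
    if second == "JJ" and third not in NOUNS:
        # adverb/adjective/noun + adjective, not followed by a noun
        return first in ADVERBS or first == "JJ" or first in NOUNS
    return False

def extract_phrases(tagged):
    """Return two-word phrases from a list of (word, tag) pairs."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if matches_pattern(t1, t2, t3):
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("an", "DT"), ("unpredictable", "JJ"), ("plot", "NN"), ("and", "CC"),
          ("very", "RB"), ("good", "JJ"), ("acting", "NN")]
phrases = extract_phrases(tagged)
print(phrases)
```

In the sample, "very good" is skipped because it is followed by a noun, while "good acting" is kept under the adjective + noun pattern.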
3. Methodology 3.2. Semantic Orientation approach
- Similarity between phrase and "excellent"/"poor"
- Using the average of the SO values of all extracted phrases -> compare with a threshold
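A minimal sketch of the SO computation, assuming the SO-PMI formula from Turney's study: SO(phrase) = log2( hits(phrase NEAR "excellent") * hits("poor") / ( hits(phrase NEAR "poor") * hits("excellent") ) ), with a small constant added to the co-occurrence counts to avoid division by zero. The hit counts below are made up for illustration, and the function names are hypothetical; no search engine is queried.

```python
import math

def so_pmi(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor,
           smoothing=0.01):
    """Semantic orientation of one phrase from search hit counts."""
    return math.log2(
        ((hits_near_excellent + smoothing) * hits_poor)
        / ((hits_near_poor + smoothing) * hits_excellent)
    )

def classify_review(phrase_so_values, threshold=0.0):
    """Average the SO values of a review's phrases and compare with a threshold."""
    if not phrase_so_values:
        raise ValueError("review has no extracted phrases")
    average = sum(phrase_so_values) / len(phrase_so_values)
    label = "positive" if average > threshold else "negative"
    return label, average

# Made-up hit counts for two phrases of one review.
so_values = [
    so_pmi(hits_near_excellent=40, hits_near_poor=10,
           hits_excellent=1000, hits_poor=1000),
    so_pmi(hits_near_excellent=5, hits_near_poor=20,
           hits_excellent=1000, hits_poor=1000),
]
print(classify_review(so_values))
print(classify_review([-0.3, -0.4], threshold=0.0))
print(classify_review([-0.3, -0.4], threshold=-0.57))
```

The last two calls show why the threshold matters: the same mildly negative average is labelled negative at a threshold of 0 but positive at -0.57, the adjusted value reported in the experiment slides.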
3. Methodology 3.3. Test data
- Collect 384 reviews -> MJ rating 378 dataset
3. Methodology 3.4. Rating decisions
- The data are positively skewed!
- Applying the rating decision of Pang et al.
- 5-star rating system
- 4 stars and up: positive
- 2 stars and below: negative
- 4-star rating system
- 3 stars and up: positive
- 1 star and below: negative
- Ignore neutral ratings; group into 2 categories
- A, B: positive -> 285 opinions
- D, F: negative -> 47 opinions
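The rating rules above can be written down as a small mapping. This is a sketch under stated assumptions: the function names are hypothetical, any rating not covered by the rules (for example a letter grade of C) is treated as neutral and returned as `None`, and grade modifiers such as "+" or "-" are ignored.

```python
def label_from_stars(stars, scale):
    """Map a star rating on a 4- or 5-star scale to a label, or None if neutral."""
    if scale == 5:
        if stars >= 4:
            return "positive"
        if stars <= 2:
            return "negative"
    elif scale == 4:
        if stars >= 3:
            return "positive"
        if stars <= 1:
            return "negative"
    else:
        raise ValueError("only 4-star and 5-star scales are covered")
    return None  # neutral rating, ignored

LETTER_LABELS = {"A": "positive", "B": "positive", "D": "negative", "F": "negative"}

def label_from_letter(grade):
    """Map a letter grade to a label; anything else is treated as neutral."""
    return LETTER_LABELS.get(grade.strip().upper()[:1])

print(label_from_stars(4, scale=5), label_from_stars(3, scale=5))
print(label_from_letter("B"), label_from_letter("C"), label_from_letter("F"))
```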
3. Methodology 3.5. Evaluation techniques
- Accuracy
- Ratio of the number of reviews classified correctly to the total number of reviews being classified
- Recall
- Ratio of the number of reviews correctly classified into a category to the total number of reviews belonging to that category
- Precision
- Ratio of the number of reviews correctly classified into a category to the total number of reviews classified into that category
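The three measures can be computed directly from paired lists of actual and predicted labels. The sketch below uses a hypothetical function name and a toy five-review example; it is not the evaluation code used in the study.

```python
def evaluate(actual, predicted, category):
    """Return (accuracy, recall, precision), the last two for one category."""
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    correct_in_category = sum(a == category and p == category for a, p in pairs)
    belonging = sum(a == category for a in actual)      # reviews truly in the category
    assigned = sum(p == category for p in predicted)    # reviews classified into it
    recall = correct_in_category / belonging if belonging else 0.0
    precision = correct_in_category / assigned if assigned else 0.0
    return accuracy, recall, precision

actual = ["pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg"]
print(evaluate(actual, predicted, "pos"))
print(evaluate(actual, predicted, "neg"))
```

Accuracy is a single number for the whole test set, while recall and precision are reported per category, which is why the result slides give separate positive and negative recall values.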
4. Experiment result and analyses 4.1. Supervised machine learning approach
- Removing stop words from n-gram features
- Training size of 5 per category
4. Experiment result and analyses 4.1. Supervised machine learning approach
- The possible explanation for the poor result
- The nature of the movie review domain
- Regularly about 600 words per review
- Reviewers used a wide variety of words in their reviews
- 5 documents randomly selected from the corpus
- Not representative
- In the next trials
- Large size of training data
- Carefully selected, representative training files
- A set of dictionaries for the movie review domain
4. Experiment result and analyses 4.1. Supervised machine learning approach
- Large training dataset
- Accuracy 66.27%, Recall 70.88% (pos) / 38.30% (neg)
- 3-fold cross validation
- Average accuracy 85.54%, Recall 6.25%, 0%, 0% (neg)
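A short arithmetic check helps in reading these figures. Assuming the large-training-set result was measured on the 285 positive and 47 negative test reviews from the rating-decisions slide (an assumption; the slide does not say so explicitly), the reported per-class recalls reproduce the reported accuracy. The function name is hypothetical.

```python
def accuracy_from_recall(class_sizes, recalls):
    """Overall accuracy implied by per-class recall and class sizes."""
    correct = sum(round(class_sizes[c] * recalls[c]) for c in class_sizes)
    return correct / sum(class_sizes.values())

sizes = {"pos": 285, "neg": 47}

# Reported recalls for the large training dataset.
acc = accuracy_from_recall(sizes, {"pos": 0.7088, "neg": 0.3830})
print(round(100 * acc, 2))

# Accuracy of labelling every review positive on the same skewed data.
all_positive = sizes["pos"] / sum(sizes.values())
print(round(100 * all_positive, 2))
```

Under the same assumption, labelling every review positive would already score about 85.84% on data this skewed. An average accuracy of 85.54% combined with near-zero negative recall is therefore consistent with a classifier that puts almost everything in the positive class, so the accuracy figure alone should be read with care.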
4. Experiment result and analyses 4.2. Unsupervised learning approach
- Mapping between Brill's POS tagger and the tag set of Minipar -> comparable
- Utilize a different POS tagger from Turney's
- Automatically extract phrases from the tagged text
- Generate Google search queries
- Manually clean up misrecognized ones
- Changing threshold from 0 to -0.57
4. Experiment result and analyses 4.2. Unsupervised learning approach
- Using the baseline from Turney's study
- Phrases are categorized with negative bias
4. Experiment result and analyses 4.2. Unsupervised learning approach
- Set up own baseline using 6 phrases
4. Experiment result and analyses 4.2. Unsupervised learning approach
- Accuracy 77%, Recall 77.91% (pos) / 71.43% (neg)
5. Discussion 5.1. Summary
- Improve semantic orientation approach
- Adapting the threshold to the movie review domain
- Automated the time-consuming process of collecting data from the Web
- Compare with Pang et al.
- Pang's best accuracy: 77.4% to 82.9%
- This study: 85.54% (3-fold cross validation), 66.27% (test dataset)
- Compare with Turney
- Turney's accuracy: 65.83% in 120 movie reviews
- This study: 77% in 100 movie reviews (new baseline)
5. Discussion 5.2. Limitations
- Depends largely on the preprocessing steps
- Machine learning approach
- Careful feature selection
- Apply POS tagger to facilitate better feature selection
- Semantic orientation approach
- Arbitrary parameters -> change the results
- Deal with mixed factual information in the reviews and sarcastic style of review writing
5. Discussion 5.3. Future work
- <Machine learning approach>
- Use TF-IDF weighting
- Reduce the number of features
- Solve the sparse features problem
- Employ a specific lexicon or dictionary for the movie review domain
- Limit the words in classification
- Support removing factual information
- Apply POS tagger to limit features
- Develop representative training documents
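As an illustration of the TF-IDF weighting proposed above, here is a minimal sketch using one common variant: term frequency normalized by document length, multiplied by log(N / document frequency). The slides do not specify a variant, so this choice, the toy documents, and the function name are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["great", "plot", "great", "acting"],
    ["poor", "plot"],
    ["great", "camera"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in every document receive a weight of zero, so weighting of this kind can also serve as a filter when reducing the number of features.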
5. Discussion 5.3. Future work
- <Semantic orientation approach>
- Select words other than "excellent"/"poor"
- Better represent polarities for movie review mining
- Produce more realistic results
- Revisit certain patterns of two-word phrases
- Represent the tone of reviews
- Employ effective preprocessing steps
- Such as subjectivity analysis
- Improve the quality of the training corpus
- Help feature selection for classification
5. Discussion 5.4. Conclusion
- Movie review mining is a challenging sentiment classification problem
- Classification of personal opinions
- Diverse opinions from product reviews
- Sparsity of words in movie reviews
- Hard to use bag-of-words features
- Mixed with factual background information