Movie Review Mining: a comparison between Supervised and Unsupervised Classification Approaches

About This Presentation
Slides: 28
Provided by: HYAN
Transcript and Presenter's Notes

1
Movie Review Mining: a comparison between
Supervised and Unsupervised Classification
Approaches
  • Pimwadee Chaovalit, Lina Zhou
  • Department of Information Systems
  • University of Maryland, Baltimore County

2
0. Contents
  • 1. Introduction
  • 2. Background
  • 2.1. Opinion mining
  • 2.2. Movie review mining
  • 2.3. Machine learning vs. Semantic orientation
  • 3. Methodology
  • 3.1. Machine learning approach
  • 3.2. Semantic Orientation (SO) approach
  • 3.3. Test data
  • 3.4. Rating decisions
  • 3.5. Evaluation techniques
  • 4. Experiment result and analysis
  • 4.1. Supervised machine learning approach
  • 4.2. Unsupervised learning approach
  • 5. Discussion
  • 5.1. Summary
  • 5.2. Limitations
  • 5.3. Future work
  • 5.4. Conclusion

3
1. Introduction
  • The main objective of opinion mining
  • Use web-mining techniques to classify a large
    number of opinions by bipolar orientation
    (positive or negative)
  • Help consumers in making their purchasing
    decisions
  • Apply in constructing information presentation
  • Two types of techniques
  • Machine learning
  • Semantic orientation
  • Which approach is better for opinion mining?
  • Still an open question!

4
2. Background 2.1. Opinion mining
  • Mining reviews of various products
    by classifying them into positive or negative
    opinions.
  • Summarize reviews and give users statistical
    information
  • Analyze product reputations
  • Opinion mining process
  • Product review data are crawled from the
    web
  • Data preparation
  • Objectivity classification
  • Subjectivity analysis

5
2. Background 2.2. Movie review mining
  • Special challenges!
  • Domain specific
  • Word semantics in a particular review could
    contradict the overall semantic direction
  • e.g., "unpredictable camera" vs. "unpredictable
    plot"
  • So, we need to
  • Train the machine learning classifier with
    dataset
  • Adapt the semantic orientation approach
    to movie review domain

6
2. Background 2.3. Machine learning vs. semantic
orientation
  • Supervised training approach
  • Collecting training dataset from certain
    websites
  • Selecting features
  • Training a classifier on the corpus
  • Evaluating on the test dataset based on chosen
    criteria
  • The above processes are usually repeated in
    multiple iterations
  • to produce a better model

7
2. Background 2.3. Machine learning vs. semantic
orientation
  • Semantic orientation
  • Dimensions
  • Direction
  • Indicating whether a word has positive or
    negative meaning
  • Intensity
  • Designating how strong the word is
  • Conjunctive words
  • Help in understanding the tone of the sentence
  • Improving the training of a supervised learning
    algorithm
  • Prior work: Turney's study
  • POS tagger (adj, adv) and the SO-PMI technique

8
3. Methodology 3.1. Machine learning approach
  • 3.1.1. Corpus
  • Good corpus resources
  • Good review quality
  • Available metadata
  • Easy spidering
  • Reasonably large number of reviews and products
  • Select a ready-to-use clean dataset
  • 1,400 text files in total (700 as pos, 700 as
    neg)
  • Transforming a 4-star or 5-star rating-system
  • Manually examining to ensure quality

9
3. Methodology 3.1. Machine learning approach
  • 3.1.2. N-gram classifiers
  • Select n-gram models
  • Tool: the shareware Rubryx, version 2.0
  • Classification algorithms based on N-gram
    features
  • Classification models
  • Stop-word lists
  • Domain-specific dictionaries
  • Capture a large number of features
    at the beginning
  • Over-filter data -> remove important information
  • Try multiple sets of features to select best one
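A minimal sketch of n-gram feature extraction with optional stop-word filtering. This is not Rubryx's algorithm, which the slides do not describe; the tokenizer, the tiny stop-word list, and the name `ngram_features` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption,
# not the list used in the study).
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "and", "to", "in"}


def ngram_features(text, n=2, stop_words=None):
    """Count word n-grams in `text`, dropping stop words first if a list is given."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


review = "The plot was unpredictable and the acting was superb"
unigrams = ngram_features(review, n=1, stop_words=STOP_WORDS)
bigrams = ngram_features(review, n=2, stop_words=STOP_WORDS)
print(sorted(bigrams))
```

In this sketch, removing stop words before forming bigrams joins words that were not adjacent in the review (here "unpredictable acting"), which is one way over-filtering can distort the features.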

10
3. Methodology 3.2. Semantic Orientation approach
  • Using POS tags -> extract two-word phrases
  • Phrase patterns from Turney's study
  • Adjective or adverb provides subjectivity
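The phrase-extraction step can be sketched as follows, using the five two-word tag patterns from Turney's 2002 study over Penn Treebank tags. The input here is hand-tagged for illustration; the tagger itself, the function name `extract_phrases`, and the example sentence are assumptions, and the study's own tag mapping is not reproduced.

```python
# Turney's two-word patterns: (tags of word 1, tags of word 2,
# tags that word 3 must NOT have; an empty set means "anything").
ADJ = {"JJ"}
NOUN = {"NN", "NNS"}
ADV = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}
PATTERNS = [
    (ADJ, NOUN, set()),
    (ADV, ADJ, NOUN),
    (ADJ, ADJ, NOUN),
    (NOUN, ADJ, NOUN),
    (ADV, VERB, set()),
]


def extract_phrases(tagged):
    """Return two-word phrases from (word, tag) pairs that match a pattern."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # Tag of the following word, or None at the end of the text.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, third_forbidden in PATTERNS:
            if t1 in first and t2 in second and t3 not in third_forbidden:
                phrases.append(f"{w1} {w2}")
                break
    return phrases


# Hand-tagged example (an assumption; a real run would use a POS tagger).
tagged = [("an", "DT"), ("unpredictable", "JJ"), ("plot", "NN"),
          ("and", "CC"), ("really", "RB"), ("good", "JJ"), ("acting", "NN")]
print(extract_phrases(tagged))  # ['unpredictable plot', 'good acting']
```

"really good" is skipped because the adverb-adjective pattern requires that the next word not be a noun; the adjective-noun pair "good acting" is picked up instead.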

11
3. Methodology 3.2. Semantic Orientation approach
  • Similarity between phrase and "excellent"/"poor"
  • Using average of the SO values of all extracted
    phrases -> Compare with a threshold
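A sketch of the scoring step, assuming the Turney-style SO-PMI formula: the log-ratio of a phrase's co-occurrence hit counts with "excellent" versus "poor", with a small constant added to avoid division by zero. The hit counts below are made up, and the function names are hypothetical; the -0.57 value is the adjusted threshold reported later in the slides.

```python
import math


def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smooth=0.01):
    """SO-PMI of one phrase from search-engine hit counts."""
    return math.log2(((near_excellent + smooth) * hits_poor) /
                     ((near_poor + smooth) * hits_excellent))


def classify_review(phrase_so_values, threshold=0.0):
    """Label a review by the average SO of its extracted phrases."""
    average = sum(phrase_so_values) / len(phrase_so_values)
    return "positive" if average > threshold else "negative"


# Made-up hit counts, for illustration only.
HITS_EXCELLENT, HITS_POOR = 1_000_000, 1_000_000
so_values = [
    semantic_orientation(400, 100, HITS_EXCELLENT, HITS_POOR),  # leans positive
    semantic_orientation(50, 300, HITS_EXCELLENT, HITS_POOR),   # leans negative
]
print(classify_review(so_values, threshold=0.0))    # negative
print(classify_review(so_values, threshold=-0.57))  # positive
```

The same review flips from negative to positive when the threshold moves below zero, which is why the choice of threshold matters for a domain where phrases tend to score negatively.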

12
3. Methodology 3.3. Test data
  • Collect 384 reviews -> MJ rating 378 dataset

13
3. Methodology 3.4. Rating decisions
  • The data are positively skewed!
  • Applying the rating decision of Pang et al.
  • 5-star rating system
  • 4 stars and up: positive
  • 2 stars and below: negative
  • 4-star rating system
  • 3 stars and up: positive
  • 1 star and below: negative
  • Ignore neutral ratings; group into 2 categories
  • A, B: positive -> 285 opinions
  • D, F: negative -> 47 opinions
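The rating rules above can be written as two small mapping functions. This is a sketch: the function names are hypothetical, and treating any grade other than A, B, D, or F (for example C) as neutral, and reading "B+" as "B", are assumptions.

```python
def label_from_stars(stars, scale):
    """Map a star rating to 'positive', 'negative', or None (neutral, ignored)."""
    if scale == 5:
        positive_min, negative_max = 4, 2
    elif scale == 4:
        positive_min, negative_max = 3, 1
    else:
        raise ValueError("only 4-star and 5-star systems are covered")
    if stars >= positive_min:
        return "positive"
    if stars <= negative_max:
        return "negative"
    return None


def label_from_grade(grade):
    """Map a letter grade to a label: A/B positive, D/F negative, else ignored."""
    letter = grade.strip().upper()[:1]  # assumption: 'B+' counts as 'B'
    if letter in ("A", "B"):
        return "positive"
    if letter in ("D", "F"):
        return "negative"
    return None


print(label_from_stars(3, scale=5))  # None (neutral)
print(label_from_grade("B+"))        # positive
```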

14
3. Methodology 3.5. Evaluation techniques
  • Accuracy
  • Ratio of the number of reviews classified
    correctly to the total number of reviews
    classified
  • Recall
  • Ratio of the number of reviews correctly
    classified into a category to the total number of
    reviews belonging to that category
  • Precision
  • Ratio of the number of reviews correctly
    classified into a category to the total number of
    reviews classified into that category
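The three measures can be computed directly from gold and predicted labels. A minimal sketch with toy labels; the function name `evaluate` and the example data are illustrative assumptions.

```python
def evaluate(gold, predicted, category):
    """Return (accuracy, recall, precision); recall and precision are for `category`."""
    pairs = list(zip(gold, predicted))
    correct = sum(g == p for g, p in pairs)
    true_pos = sum(g == p == category for g, p in pairs)
    in_category = sum(g == category for g in gold)             # belong to category
    labeled_category = sum(p == category for p in predicted)   # classified into it
    accuracy = correct / len(pairs)
    recall = true_pos / in_category if in_category else 0.0
    precision = true_pos / labeled_category if labeled_category else 0.0
    return accuracy, recall, precision


gold = ["pos", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "neg"]
accuracy, recall, precision = evaluate(gold, predicted, "pos")
print(accuracy, recall, precision)
```

Here three of four reviews are labeled correctly, two of the three positive reviews are recovered, and every review labeled positive really is positive.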

15
4. Experiment result and analyses 4.1.
Supervised machine learning approach
  • Removing stop words from n-gram features
  • Training size of 5 per category

16
4. Experiment result and analyses 4.1.
Supervised machine learning approach
  • Possible explanations for the poor result
  • The nature of the movie review domain
  • Typically about 600 words per review
  • Reviewers used a wide variety of words in their
    reviews
  • 5 documents randomly selected from the corpus
  • May not be representative
  • In the next trials
  • Larger training data
  • Carefully selected, representative training
    files
  • A dictionary for the movie review domain

17
4. Experiment result and analyses 4.1.
Supervised machine learning approach
  • Large training dataset
  • Accuracy 66.27%, Recall 70.88% (pos) / 38.30% (neg)
  • 3-fold cross validation
  • Average accuracy 85.54%, Recall 6.25%, 0%,
    0% (neg)
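The test-set figures can be cross-checked against the class sizes on the rating-decisions slide. This sketch assumes the test set is the 285 positive and 47 negative reviews listed there, which the slides do not state explicitly; overall accuracy is then the recall of each class weighted by its size.

```python
n_pos, n_neg = 285, 47                 # class sizes from the rating-decisions slide
recall_pos, recall_neg = 0.7088, 0.3830

# Correctly classified reviews are recall times class size, summed over classes.
correct = recall_pos * n_pos + recall_neg * n_neg
accuracy = correct / (n_pos + n_neg)

# Accuracy of a classifier that labels every review positive.
majority_baseline = n_pos / (n_pos + n_neg)

print(f"accuracy = {accuracy:.2%}, all-positive baseline = {majority_baseline:.2%}")
```

Under that assumption the reported 66.27% is reproduced. The all-positive baseline comes out near 85.84%, so a cross-validation accuracy of 85.54% with negative recall at or near zero would be consistent with a classifier that labels almost everything positive, if the folds share this class balance.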

18
4. Experiment result and analyses 4.2.
Unsupervised learning approach
  • Mapping between Brill's POS tagger and the tag
    set of Minipar -> Comparable
  • Use a different POS tagger from Turney's
  • Automatically extract phrases from the tagged
    text
  • Generate Google search queries
  • Manually clean up misrecognized ones
  • Changing threshold from 0 to -0.57

19
4. Experiment result and analyses 4.2.
Unsupervised learning approach
  • Using the baseline from Turney's study
  • Phrases are categorized with negative bias

20
4. Experiment result and analyses 4.2.
Unsupervised learning approach
  • Set up our own baseline using 6 phrases

21
4. Experiment result and analyses 4.2.
Unsupervised learning approach
  • Accuracy 77%, Recall 77.91% (pos) / 71.43% (neg)

22
5. Discussion 5.1. Summary
  • Improve semantic orientation approach
  • Adapting the threshold to movie review domain
  • Automated the time-consuming process of
    collection data from the Web
  • Compare with Pang et al.
  • Pang's best accuracy: 77.4% to 82.9%
  • This study: 85.54% (3-fold cross validation),
    66.27% (test dataset)
  • Compare with Turney
  • Turney's accuracy: 65.83% on 120 movie reviews
  • This study: 77% on 100 movie reviews (new
    baseline)

23
5. Discussion 5.1. Summary

24
5. Discussion 5.2. Limitations
  • Depends largely on the preprocessing steps
  • Machine learning approach
  • Careful feature selection
  • Apply POS tagger to facilitate better feature
    selection
  • Semantic orientation approach
  • Arbitrary parameter -> change the results
  • Deal with mixed factual information in the
    reviews and sarcastic style of review writing

25
5. Discussion 5.3. Future work
  • <Machine learning approach>
  • Use TFIDF weighting
  • Reduce the number of features
  • Solve sparse features problem
  • Employ specific lexicon or dictionary for movie
    review domain
  • Limit the words in classification
  • Support removing factual information
  • Apply POS tagger to limit features
  • Develop representative training documents
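A minimal sketch of the TF-IDF weighting proposed above, using term frequency times log(N / document frequency). The exact variant is an assumption, since the slides name only "TFIDF", and the toy documents and the function name `tfidf` are illustrative.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per tokenized document (tf * log(N / df))."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["movie", "great", "plot"],
    ["movie", "dull", "plot"],
    ["movie", "great", "cast"],
]
weights = tfidf(docs)
print(weights[1])
```

A word that occurs in every review ("movie") gets weight 0, while a word unique to one review ("dull") gets the highest weight, so low-weight terms could be dropped to shrink the feature set.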

26
5. Discussion 5.3. Future work
  • <Semantic orientation approach>
  • Select words other than excellent/poor
  • Better represent polarities for movie review
    mining
  • Produce more realistic results
  • Revisit certain patterns of two-word phrases
  • Represent the tone of reviews
  • Employ effective preprocessing steps
  • Such as subjectivity analysis
  • Improve the quality of the training corpus
  • Help feature selection for classification

27
5. Discussion 5.4. Conclusion
  • Movie review mining is a challenging sentiment
    classification problem
  • Classification of personal opinions
  • Diverse opinions from product reviews
  • Sparsity of words in movie reviews
  • Hard to use bag-of-word features
  • Mixed with factual background information