1
Applicability of N-Grams to Data Classification
  • A review of 3 NLP-related papers
  • Presented by Andrei Missine
  • (CS 825, Fall 2003)

2
What are N-Grams?
  • Sequences of words or tokens from a corpus.
  • Used to predict the probability of a word W being
    the next word given the 0 to (n - 1) words before
    it (a minimal sketch follows this slide).
  • Common N-grams: unigrams, bigrams, trigrams and
    four-grams.
  • One of the simpler statistical models used in NLP.
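A minimal Python sketch of the idea (my illustration, not from any of the
reviewed papers): extract n-grams from a token sequence and use their
counts for a maximum-likelihood estimate of the next-word probability.

    from collections import Counter

    def ngrams(tokens, n):
        """Return all n-grams (as tuples) in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the cat sat on the mat".split()
    bigrams = Counter(ngrams(tokens, 2))
    unigrams = Counter(ngrams(tokens, 1))

    # Maximum-likelihood estimate of P("cat" | "the"):
    # count("the cat") / count("the")
    print(bigrams[("the", "cat")] / unigrams[("the",)])
    # 0.5, since "the" is followed once by "cat" and once by "mat"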

3
N-Grams and Authorship Attribution
  • Authorship Attribution is the process of
    determining who the author of a given text is.
  • An approach suggested by the authors of this
    paper(1) is to parse a known document written by
    an author A1 on the byte level and to extract
    n-grams.
  • The most frequent n-grams are then saved as the
    author profile for this author (A1).
  • This process is repeated for all other authors
    (A2 ... An). We now have a collection of author
    profiles.
  • Given a new text, it is compared against the
    existing profiles, and the profile with the
    smallest dissimilarity is chosen as the most
    likely author (sketched after this slide).

(1) N-Gram-based Author Profiles for Authorship
Attribution
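A sketch of the profile-and-dissimilarity idea in Python. The relative
distance follows the Common N-Grams (CNG) measure from the paper, but the
profile size, n, and the normalization here are illustrative choices, not
the authors' exact settings.

    from collections import Counter

    def profile(text: bytes, n=3, size=1000):
        """Normalized frequencies of the most frequent byte n-grams."""
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(grams.values())
        return {g: c / total for g, c in grams.most_common(size)}

    def dissimilarity(p1, p2):
        # CNG relative distance: sum of ((f1 - f2) / ((f1 + f2) / 2))^2
        # over the union of the two profiles' n-grams
        return sum(((p1.get(g, 0.0) - p2.get(g, 0.0)) /
                    ((p1.get(g, 0.0) + p2.get(g, 0.0)) / 2)) ** 2
                   for g in set(p1) | set(p2))

    # Attribute an unknown text to the least dissimilar author profile
    profiles = {"A1": profile(b"...text known to be by A1..."),
                "A2": profile(b"...text known to be by A2...")}
    unknown = profile(b"...text of unknown authorship...")
    print(min(profiles, key=lambda a: dissimilarity(profiles[a], unknown)))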
4
N-Grams on Byte Level?
  • Instead of treating text as a collection of
    words, just look at the bytes.
  • No modifications to the algorithm are required
    when switching between languages.
  • The good side: the experiment achieved 100%(2)
    accuracy on English and 97%(2) accuracy on Greek
    data. This is much better than any of the
    previously attempted methods.
  • The bad side: this approach did worse on Chinese
    data, reaching 89%(2) accuracy (the previously
    achieved accuracy is 94%).
  • A likely reason for this is that many Asian
    languages are stored in two-byte character
    encodings (e.g. 2-byte Unicode), so some n-grams
    might include only half of a character
    (illustrated below).

(2) Best achieved accuracy
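A tiny illustration of the half-character problem (my example; GB2312
stands in for whichever two-byte encoding the Chinese data actually used):

    text = "中文"                   # two Chinese characters
    data = text.encode("gb2312")    # a 2-byte-per-character encoding: 4 bytes
    bigrams = [data[i:i + 2] for i in range(len(data) - 1)]
    # bigrams[0] and bigrams[2] are whole characters, but bigrams[1] glues
    # the second byte of one character to the first byte of the next --
    # exactly the half-character n-gram described above.
    print(bigrams)  # [b'\xd6\xd0', b'\xd0\xce', b'\xce\xc4']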
5
N-Grams and Sentiment Classification
  • In this particular paper(3) the authors discuss
    how N-Grams and machine learning can be applied
    to classifying movie reviews as positive or
    negative.
  • The main reasons why movie reviews were chosen
    are their wide availability, the ease of
    programmatically determining whether a review is
    positive or negative (e.g. by the number of
    stars) and, finally, the availability of many
    different reviewers.
  • Some preliminary results: the chance of guessing
    the classification is 50%. When two graduate
    computer science students were asked to provide
    lists of positive and negative words, the results
    were 58% and 64% accurate. Finally, when a
    statistical method was applied to derive such a
    list, the accuracy was 69%.

(3) Thumbs up? Sentiment Classification using
Machine Learning Techniques
6
N-Grams and Sentiment Classification (continued)
  • So how well did machine learning do?
  • Naïve Bayesian classification has its best
    performance of 81.5% when unigrams and Parts of
    Speech(4) are used.
  • Maximum Entropy classification has a slightly
    lower best performance of 81.0%, when the top
    2633 unigrams are chosen.
  • Support Vector Machines have the best overall
    performance of the three, the highest being
    82.9%, achieved when 16165 unigrams were used.
  • Notes:
  • The data was acquired from a corpus collected
    from IMDb.
  • Interestingly, the presence of the n-grams
    appears to matter more than their frequency in
    this application (see the sketch below).

(4) As mentioned by the authors, a crude form of
word sense disambiguation
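A modern sketch of this setup (scikit-learn did not exist in 2003; the toy
reviews and every parameter here are stand-ins, not the paper's data or
configuration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy stand-ins for the IMDb reviews; 1 = positive, 0 = negative
    reviews = ["a great and moving film", "boring, predictable and bad",
               "great acting and a great script", "a bad and tedious movie"]
    labels = [1, 0, 1, 0]

    # binary=True keeps only unigram *presence*, mirroring the observation
    # above that presence matters more than frequency for this task
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(reviews)

    nb = MultinomialNB().fit(X, labels)   # Naive Bayes
    svm = LinearSVC().fit(X, labels)      # SVM, best performer in the paper

    test = vectorizer.transform(["a great movie"])
    print(nb.predict(test), svm.predict(test))  # both should say positive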
7
N-Grams and Sentiment Classification (continued)
- Problems
  • Why is machine learning not doing so well on some
    articles?
  • Sometimes considering just the N-grams is not
    enough; one needs to look at the broader context
    in which they are used.
  • One of the examples provided by the authors is
    "thwarted expectations", where the reviewer goes
    on describing how great the movie should have
    been, then finishes with a quick comment on how
    bad it turned out. In this case there is much
    more positive text than negative, and the review
    might wrongfully get a positive rating.
  • The converse of the above is also true: an
    article might wrongfully get a negative rating
    on a positive review such as "It was sick,
    disgusting and disturbing. It was great!"(5)

(5) Same idea as the Spice Girls review in the
paper
8
Affect Sensing on the Sentence Level
  • The last approach(6) I examined is based on
    affect sensing: applying well-known, common-sense
    facts to a sentence and thus detecting its
    overall mood.
  • The source of common-sense information used was
    Open Mind Common Sense (OMCS), which has 500,000
    sentences in its corpus.
  • Some simple linguistic models were used in
    conjunction with a smoothing model responsible
    for determining how mood carries over from one
    sentence to the next (a toy sketch follows this
    slide).
  • These were combined to produce an email client
    which would attempt to react emotionally (via a
    simple drawing of a face) to the user's text.
  • The approach used by the authors is different
    from N-grams.

(6) A Model of Textual Affect Sensing using
Real-World Knowledge
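The paper's actual smoothing machinery is richer; the toy sketch below is
only my illustration of the one idea named on this slide, mood carrying
over between sentences, using an assumed exponential decay factor.

    DECAY = 0.5  # assumed carry-over factor, not a value from the paper

    def smooth_moods(sentence_scores):
        """Blend each sentence's raw affect score with the running mood."""
        mood, smoothed = 0.0, []
        for score in sentence_scores:
            mood = DECAY * mood + (1 - DECAY) * score
            smoothed.append(round(mood, 3))
        return smoothed

    # Raw per-sentence scores on a -1 (negative) .. +1 (positive) scale,
    # e.g. from common-sense lookups; the neutral middle sentence still
    # inherits some residual positive mood from the first.
    print(smooth_moods([0.8, 0.0, -0.6]))  # [0.4, 0.2, -0.2]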
9
Affect Sensing versus N-Grams
  • Can be used to provide the user with a friendlier
    and more natural interface.
  • The structure proposed by the authors can handle
    negations and slightly trickier linguistic
    structures than most simple n-gram based
    approaches.
  • Can use common sense to infer more information
    than n-grams.
  • This comes at the price of much more complicated
    algorithms and a dependency on language-specific
    sources such as OMCS.
  • Affect sensing is very young and has not been
    evaluated thoroughly, whereas n-grams have been
    around for some time and are well studied.
  • Final note: neither can handle sarcasm. "Yeah,
    right."

10
References
  • "N-gram-based Author Profiles for Authorship
    Attribution" by Vlado Keselj, Fuchun Peng, Nick
    Cercone and Calvin Thomas. In Proceedings of the
    Conference Pacific Association for Computational
    Linguistics, PACLING'03, Dalhousie University,
    Halifax, Nova Scotia, Canada, August 2003.
    http://www.cs.dal.ca/~vlado/papers/pacling03-keselj-etc.html
  • "Thumbs up? Sentiment Classification using
    Machine Learning Techniques" (2002) by Bo Pang,
    Lillian Lee and Shivakumar Vaithyanathan. In
    Proceedings of the 2002 Conference on Empirical
    Methods in Natural Language Processing (EMNLP).
    http://citeseer.nj.nec.com/pang02thumbs.html
  • "A Model of Textual Affect Sensing using
    Real-World Knowledge" by Hugo Liu, Henry
    Lieberman and Ted Selker. International
    Conference on Intelligent User Interfaces (IUI
    2003), Miami, Florida.
    http://citeseer.nj.nec.com/liu03model.html
  • "Foundations of Statistical Natural Language
    Processing" by Christopher D. Manning and
    Hinrich Schütze.