Political Party, Gender, and Age Classification Based on Political Blogs

Slides: 15
Provided by: nlpSta
Learn more at: https://nlp.stanford.edu


1
Political Party, Gender, and Age Classification
Based on Political Blogs
  • Michelle Hewlett and Elizabeth Lingg

2
Introduction
  • Can individuals be classified by their writing
    style?
  • Do people under 25 use different punctuation than
    those over 25?
  • Do they use different words and phrases?
  • Can you infer someone's political ideology
    by analyzing their writing with probabilistic
    methods?

3
Classifier
  • Hold-Out Cross-Validation
  • 80% of Data in Training Set
  • 20% of Data in Test Set
  • Classify Bloggers using a Feature Vector
  • Features generated from training data
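
The 80/20 hold-out split described above could be sketched roughly as follows. The function name `hold_out_split`, the shuffling step, and the fixed seed are assumptions for illustration; the slides do not say how the split was drawn.

```python
import random

def hold_out_split(bloggers, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and split them into (train, test).

    Assumption: a random split with a fixed seed; the slides only
    state the 80%/20% proportions.
    """
    shuffled = list(bloggers)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage: ten blogger IDs give 8 training and 2 test items.
train, test = hold_out_split(range(10))
```

Features would then be generated from `train` only, so that the test bloggers stay unseen until evaluation.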

4
Features
  • Most frequent unigrams, bigrams, trigrams
  • Examples: "Bush", "troops in Iraq", "McCain"
  • Sentence length, Word length
  • Punctuation
  • Pronoun usage
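
As a rough illustration of the feature types listed above, the sketch below counts n-grams and computes a few style statistics. The tokenizer, the pronoun list, and all function names are assumptions made for the example, not the presenters' implementation.

```python
import re
from collections import Counter

# Hypothetical pronoun list; the slides do not say which pronouns were used.
PRONOUNS = {"i", "me", "my", "myself", "we", "you", "he", "she", "they"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_frequent_ngrams(texts, n, top_k):
    """Count n-grams over all texts and return the top_k most common."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(top_k)

def style_features(text):
    """Sentence length, word length, punctuation, and pronoun usage."""
    tokens = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "avg_word_length": sum(map(len, tokens)) / max(len(tokens), 1),
        "punctuation_count": len(re.findall(r"[.,;:!?]", text)),
        "pronoun_count": sum(t in PRONOUNS for t in tokens),
    }
```

Calling `most_frequent_ngrams(posts, 3, 100)` on a list of post strings would give candidate trigram features such as "troops in iraq".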

5
Features
  • Compute feature probabilities based on frequency
    in the training data
  • If women use the word "myself" three times as
    often as men do, then
    P(female | "myself") = 75%
  • Pick features which are not 50/50 male/female or
    50/50 Republican/Democrat
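
The "myself" example above works out to 3 / (3 + 1) = 0.75 if usage rates are compared directly. A minimal sketch of that calculation and of the "not 50/50" filter follows; it assumes equal class priors and per-word usage rates, and the `margin` cutoff is a hypothetical parameter that does not come from the slides.

```python
def feature_probability(count_a, total_a, count_b, total_b):
    """Estimate P(class A | feature) from relative usage rates.

    Assumption: both classes are equally likely a priori, so the
    probability is rate_a / (rate_a + rate_b).
    """
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    if rate_a + rate_b == 0:
        return 0.5  # feature never seen: uninformative
    return rate_a / (rate_a + rate_b)

def is_informative(p, margin=0.1):
    """Keep features whose probability is not close to 50/50."""
    return abs(p - 0.5) >= margin

# Women use "myself" 30 times per 10,000 words, men 10 times per 10,000:
# the estimate comes out at approximately 0.75.
p_female_myself = feature_probability(30, 10_000, 10, 10_000)
```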

6
Classification
  • Using the feature vector, bloggers with a low
    probability of being a Republican were
    classified as Democrat
  • Writers with a high probability of being a
    Republican were classified as Republican
  • Writers with a moderate probability were left
    unclassified (Unknown)
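
The three-way decision above might look like the sketch below. The slides do not say how per-feature probabilities were combined or where the cutoffs were set, so the averaging step and the `low`/`high` thresholds here are assumptions chosen only for illustration.

```python
def classify(text_features, p_republican, low=0.4, high=0.6):
    """Label a writer as Republican, Democrat, or Unknown.

    text_features: features observed in the writer's posts.
    p_republican:  mapping of feature -> P(Republican | feature).
    Assumption: the writer's score is the mean of the probabilities
    of the known features; low and high are hypothetical cutoffs.
    """
    probs = [p_republican[f] for f in text_features if f in p_republican]
    if not probs:
        return "Unknown"
    score = sum(probs) / len(probs)
    if score >= high:
        return "Republican"
    if score <= low:
        return "Democrat"
    return "Unknown"
```

Widening the gap between `low` and `high` would trade coverage for precision, since more writers fall into the Unknown band.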

7
Classifier Results
  [Slide content not preserved]

8
Classifier Results
  [Slide content not preserved]

9
Classifier Results
  [Slide content not preserved]

10
Clustering
  • K-means clustering algorithm used with entire
    data set
  • Used sum of absolute differences instead of
    Euclidean distance because our differences were
    so small
  • Initialized centroids to a reasonable guess
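
A minimal sketch of k-means with the sum of absolute differences (L1 distance) in place of Euclidean distance is given below. It keeps the usual mean update for centroids, takes the initial centroids from the caller (the "reasonable guess"), and uses a fixed iteration count; these details are assumptions, since the slides do not spell them out.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_l1(points, initial_centroids, iterations=20):
    """Cluster points using L1 distance for the assignment step.

    Assumption: centroids are updated as coordinate-wise means,
    as in standard k-means; only the distance function changes.
    """
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid under L1 distance.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: l1_distance(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical two-cluster example with hand-picked starting centroids.
points = [(0.10, 0.20), (0.15, 0.25), (0.80, 0.90), (0.85, 0.95)]
labels, centers = kmeans_l1(points, [(0.0, 0.0), (1.0, 1.0)])
```

With L1 distance, a coordinate-wise median update (k-medians) is the variant that minimizes the within-cluster distance; the mean update is kept here only because the slides name k-means.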

11
Clustering Results
  • Democrat: Cluster 1, Cluster 2
  • Republican: Cluster 1, Cluster 2
  • Unknown: Cluster 1, Cluster 2

12
Clustering Results
  • Male: Cluster 1, Cluster 2
  • Female: Cluster 1, Cluster 2
  • Unknown: Cluster 1, Cluster 2

13
Conclusion
  • It is possible to identify the characteristics of
    a writer based on writing style, words and
    phrases!
  • Political Party gave the best results, followed
    by Gender, then Age

14
Future Work
  • Generalize results with a larger data set and
    greater number of features
  • Generalize results in a different domain
  • Possibly implement linear regression, logistic
    regression, or an SVM