Using Machine Learning Techniques in Stylometry Ramyaa, Congzhou He, Dr. Khaled Rasheed – PowerPoint PPT presentation

Provided by: ramy7

Learn more at: http://cobweb.cs.uga.edu

1
Using Machine Learning Techniques in Stylometry

Ramyaa, Congzhou He, Dr. Khaled Rasheed

2
Introduction

Stylometry
Major problems facing stylometry
Decision trees
Artificial Neural Networks

3
Stylometry

The measure of style
Fundamental assumption: there is an unconscious
aspect to an author's style that cannot be
consciously manipulated but which possesses
quantifiable and distinctive features.
Major applications today: clinical tools in
disease detection, forensic tools in court
trials, text categorization, and author attribution.

4
Major problems facing stylometry

No consensus as to which characteristic features
to use
Which indicators to use: word length, sentence
length, tests of position, the distribution of
once-occurring words (hapax legomena), the
frequencies of marker words, letter sequences,
syllable length, or syntactic measures?

5
Major problems facing stylometry

No consensus as to what methodology or
techniques to apply in standard research
Which techniques to use: statistical
methods or automated pattern recognition
methods?
Statistical methods: e.g. Bayesian analysis,
cluster analysis, and the widely used
Principal Component Analysis (PCA).
Automated pattern recognition methods: e.g.
Artificial Neural Networks (ANN), Genetic
Programming (GP).

6
Significant Features of Our Paper

Recognizing the works of five authors
Use of unconventional indicators such as
punctuation marks as well as standard indicators
such as function words
Only 21 indicators, which shows that, contrary to
common belief, not many features are required for
high-performance classification

7
Data Extraction

78 samples from five popular 19th-century English authors:
Jane Austen
Pride and Prejudice Chapters 1-5
Mansfield Park Chapters 1-5
Emma Chapters 1-5
Sense and Sensibility Chapters 1-5

Charles Dickens
David Copperfield Chapters 1-5
Great Expectations Chapters 1-5
Hard Times Chapters 1-6
A Tale of Two Cities Chapters 1-6

William Thackeray
Vanity Fair Chapters 1-6
Men's Wives Chapters 1-6

Emily Bronte
Wuthering Heights Chapters 1-12

Charlotte Bronte
Jane Eyre Chapters 1-12

9
21 attributes as input

type-token ratio
mean word length
mean sentence length
standard deviation of sentence length
mean paragraph length
chapter length
number of commas per thousand tokens
number of semicolons per thousand tokens
number of quotation marks per thousand tokens
number of exclamation marks per thousand tokens
number of hyphens per thousand tokens
occurrences of "and" per thousand tokens
occurrences of "but" per thousand tokens
occurrences of "however" per thousand tokens
occurrences of "if" per thousand tokens
occurrences of "that" per thousand tokens
occurrences of "more" per thousand tokens
occurrences of "must" per thousand tokens
occurrences of "might" per thousand tokens
occurrences of "this" per thousand tokens
occurrences of "very" per thousand tokens

11
Decision Tree Learning

See5 package by Quinlan, a successor to C4.5 and
the ID3 algorithm
Features of decision trees: results are easy to
understand; focus on individual attributes
Fuzzy thresholds used for continuous values
Winnowing and boosting each give the best
result: 82.4% accuracy, significantly above
random guessing (20%).
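
To illustrate how a tree learner in the ID3 family picks a threshold on a continuous attribute, here is a small sketch of an information-gain split search. It is not See5's implementation (See5 is a commercial package, and its fuzzy thresholds, winnowing, and boosting are not reproduced here); the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold`
    with the highest information gain, or (None, 0.0) if no split helps."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Weighted entropy remaining after the split.
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


if __name__ == "__main__":
    # Hypothetical semicolons-per-thousand values for two authors.
    semicolons = [2.0, 3.0, 4.0, 9.0, 10.0, 11.0]
    authors = ["jane", "jane", "jane", "charles", "charles", "charles"]
    print(best_threshold(semicolons, authors))
```

A full tree learner would repeat this search over all 21 attributes at every node and recurse on the resulting subsets.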

12
Result from winnowing

Evaluation on test data (17 cases):
Decision Tree
----------------
Size   Errors
5      3 (17.6%)   <<

(a) (b) (c) (d) (e)   <-classified as
---- ---- ---- ---- ----
4 1    (a): class jane
5 1    (b): class charles
2      (c): class william
1 1    (d): class emily
2      (e): class charlotte

13
Results from boosting

Evaluation on test data (17 cases):
boost  3 (17.6%)   <<

(a) (b) (c) (d) (e)   <-classified as
---- ---- ---- ---- ----
4 1    (a): class jane
5 1    (b): class charles
2      (c): class william
1 1    (d): class emily
2      (e): class charlotte
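
The reported figures can be checked with a few lines of arithmetic. The sketch below uses only numbers stated on the slides (17 test cases, 3 errors, 5 authors); the function names are hypothetical.

```python
def accuracy(n_cases, n_errors):
    """Fraction of test cases classified correctly."""
    return (n_cases - n_errors) / n_cases


def random_baseline(n_classes):
    """Expected accuracy of uniform random guessing among n_classes."""
    return 1 / n_classes


if __name__ == "__main__":
    print(f"decision tree accuracy: {accuracy(17, 3):.1%}")   # 82.4%
    print(f"error rate:             {3 / 17:.1%}")            # 17.6%
    print(f"random guess:           {random_baseline(5):.1%}")  # 20.0%
```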

14
Artificial Neural Network (ANN) Learning

practical and powerful method of pattern
recognition
can invent new features that are not explicit in
the input
all attributes taken into consideration
inductive rules not accessible to humans

Many architectures were tried, including Kohonen
SOMs, probabilistic nets, and nets based on
statistical models
Backpropagation feed-forward nets gave the best
results
The best network had 21 inputs and 10 outputs
The best architecture had 15 hidden nodes in the
first hidden layer and 11 in the second
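
For readers unfamiliar with feed-forward nets, the following sketch shows a forward pass through a network with the layer sizes given above (21 inputs, hidden layers of 15 and 11 nodes, 10 outputs). The sigmoid activation, the random weight initialization, and all function names are assumptions; the slides do not specify them, and no training (backpropagation) is performed here.

```python
import math
import random

LAYER_SIZES = [21, 15, 11, 10]  # inputs, hidden 1, hidden 2, outputs


def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, inputs):
    """Propagate one input vector through every layer in turn."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


if __name__ == "__main__":
    net = init_network(LAYER_SIZES)
    feature_vector = [0.5] * 21  # placeholder for 21 normalized attributes
    outputs = forward(net, feature_vector)
    print(len(outputs), [round(o, 3) for o in outputs])
```

Every hidden node combines all 21 inputs, which is the sense in which all attributes are taken into consideration and new features can emerge that are not explicit in the input.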

16
Predictor analysis

17
Results from ANN

(a) (b) (c) (d) (e)   <-classified as
---- ---- ---- ---- ----
2      (a): class jane
2      (b): class charles
2      (c): class william
2 4    (d): class emily
5      (e): class charlotte

18
Misclassifications

No. 4, Pride and Prejudice Chapter 3, is
misclassified as written by Charlotte Bronte.
Nos. 67 and 71, A Tale of Two Cities Chapter 1 and
Chapter 5, are misclassified as written by William
Thackeray.
All other samples are correctly classified
(88.2% accuracy on the validation set).

19
Conclusion

Very good results were obtained in both
experiments
Artificial Intelligence provides stylometry with
excellent classifiers that require fewer input
variables than traditional statistics
Future research:
Genetic algorithms / genetic programming (GA/GP)
A general classifier applicable to all authors
Different sets of features

20
Thank you
