Stylometry Project - PowerPoint PPT Presentation

1
Stylometry Project
  • May 4, 2007

Pace's Research Day
2
TEAM MEMBERS
  • Rob Goodman, Programmer
      • Currently working at KPMG
      • Completing MS in Computer Science in December 2008
  • Matt Hahn, Quality Assurance
      • Currently working at Affiliated Computer Services, Inc.
      • Completing MS in Information Technologies in May 2007
  • Madhuri Marella, Programmer
      • Completing MS in Computer Science in May 2007
  • Chris Ojar, Team Leader
      • Currently working at Pace's Evening Support Office in
        Pleasantville
      • Completing MS in Internet Technologies in May 2007

3
WHAT IS STYLOMETRY?
  • The study of the unique linguistic styles and writing
    behaviors of individuals in order to determine authorship
  • Used to attribute authorship to anonymous or
    disputed documents, and it has legal as well as
    academic and literary applications
  • Uses statistical analysis, pattern recognition,
    and artificial intelligence techniques. For
    features, stylometry typically analyzes the text
    by using word frequencies and identifying
    patterns in common parts of speech

4
THE PROGRAM
  • A pattern recognition system to identify the
    author of arbitrary email using stylometry
    features
  • Phase 1: Data Collection
      • Raw data from Keystroke Biometric Project
      • Plain text emails
  • Phase 2: Feature Extraction
      • Measurements of punctuation, content format, and
        keystrokes when applicable
      • Normalize features to 0-1 range
  • Phase 3: Classification
      • k-Nearest-Neighbor using Euclidean distance
      • Defaulted to 10
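
The normalization step in Phase 2 could look like the minimal Python sketch below. It assumes per-feature min-max scaling across all samples, which is one common way to reach a 0-1 range; the slides do not say which method the team used, and the name `normalize_features` is hypothetical.

```python
def normalize_features(samples):
    """Min-max scale each feature column of `samples` to the 0-1 range.

    `samples` is a list of equal-length lists of numbers, one list per email.
    Assumption: a column whose values are all identical is mapped to 0.0.
    """
    if not samples:
        return []
    # Transpose rows into columns so each feature can be scaled on its own.
    columns = list(zip(*samples))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalized = []
    for row in samples:
        scaled = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            scaled.append((value - lo) / span if span else 0.0)
        normalized.append(scaled)
    return normalized
```

Scaling every feature to the same range keeps large counts, such as the number of white spaces, from dominating the Euclidean distance used in Phase 3.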

5
RAW DATA EXAMPLES
File Name: Goodman-email.txt
Dear Ms. Sanderson
I enjoyed our conversation on February
18th at the Family and Child Development seminar
on teaching young children and appreciated your
personal input about helping children attend
school for the first time.  This letter is to
follow-up about the Fourth Grade Teacher position
as discussed at the seminar.  I will be
completing my Bachelor of Science Degree in
Family and Child Development with a concentration
in Early Childhood Education at Pace in May of
2007, and will be available for employment at
that time
6
DIRTY DATA EXAMPLE
<Shift> I'm on my second take and <Shift> I'm
still writing about the same book <Shift>
<Shift> " <Shift> A <Shift> Million <Shift>
Little <Shift> Pieces. <Backspace>
<Backspace> <Shift> " <Shift> I'm not sure if
<Shift> I am supposed to be typing the same this
<Backspace> ng <Shift> I typed on submit
<Backspace> ssion <Shift> 1 as <Shift> I am on
sb <Backspace> ubmission <Shift> 2, but since
<Shift> my sister is skiing in <Shift> Vermont,
<Shift> I'll just continued <Backspace> .
<Shift> In any event, as a <Backspace> soon as
<Shift> I found out the book was not true,
<Shift> I couldn't pick it up for a few days.
<Shift> Then, it got the best of me. <Shift>
It is tu <Backspace> <Backspace> a fact that
<Shift> James <Shift> Frey is a great ri
<Backspace> <Backspace> writer. <Shift> He
holds your interest and attention a <Backspace>
so <Shift> I go <Backspace> t b <Backspace>
past the fact the <Backspace> <Backspace> at he
lied, and continued on. <Shift> I have to say
<Shift> I endj <Backspace> <Backspace> joyed the
book a lot better as a non-fiction book than
<Shift> I did as a fiction novel.
7
CLEAN DATA EXAMPLE
I'm on my second take and I'm still writing about
the same book "A Million Little Pieces." I'm
not sure if I am supposed to be typing the same
thing I typed on submission 1 as I am on
submission 2, but since my sister is skiing in
Vermont, I'll just continue. In any event, as
soon as I found out the book was not true, I
couldn't pick it up for a few days. Then, it got
the best of me. It is a fact that James Frey is
a great writer. He holds your interest and
attention so I got past the fact that he lied,
and continued on. I have to say I enjoyed the
book a lot better as a non-fiction book than I
did as a fiction novel.
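
One way to get from a dirty keystroke log to clean text is sketched below in Python. It is an illustration under stated assumptions, not the team's actual code: special keys are assumed to be logged as angle-bracket tokens such as `<Shift>` and `<Backspace>` with no padding spaces around them, each Backspace is assumed to delete exactly one preceding character, and the name `clean_keystroke_log` is hypothetical.

```python
import re
from collections import Counter

# Assumed token format: a key name made of letters/digits inside angle brackets.
KEY_TOKEN = re.compile(r"<([A-Za-z0-9]+)>")


def clean_keystroke_log(raw):
    """Return (clean_text, key_counts) for a log containing <Key> tokens."""
    out = []            # characters of the reconstructed text
    counts = Counter()  # how often each special key was pressed
    pos = 0
    for match in KEY_TOKEN.finditer(raw):
        # Copy the ordinary characters typed before this key token.
        out.extend(raw[pos:match.start()])
        key = match.group(1)
        counts[key] += 1
        # A Backspace removes the most recently typed character, if any.
        if key == "Backspace" and out:
            out.pop()
        pos = match.end()
    out.extend(raw[pos:])
    return "".join(out), counts


text, keys = clean_keystroke_log(
    "<Shift>I typed the same this<Backspace>ng on sb<Backspace>ubmission <Shift>2"
)
```

The key counts returned alongside the text correspond to keystroke features such as "Number of Backspace keys", which are only available for samples that come from the keystroke log rather than from plain text emails.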
9
LIST OF 62 FEATURES MEASURED
  • Number of Accents
  • Number of Left curly braces
  • Number of Right curly braces
  • Number of Vertical lines
  • Number of Tildes
  • Number of Windows keys
  • Number of Up keys
  • Number of Left Shift keys
  • Number of Right Shift keys
  • Number of Page Down keys
  • Number of Insert keys
  • Number of Home keys
  • Number of End keys
  • Number of Down keys
  • Number of Ctrl keys
  • Number of Context menu keys
  • Number of Caps Lock keys
  • Number of Alt keys
  • Number of F12 keys
  • Number of Right keys
  • Number of Backspace keys
  • Number of Enter keys
  • Number of Delete keys
  • Number of Tab keys
  • Number of words
  • Number of sentences
  • Average words per sentence
  • Number of paragraphs
  • Average words per paragraph
  • Average word length
  • Number of sentences beginning with upper case
  • Number of sentences beginning with lower case
  • Number of White spaces
  • Number of exclamation points
  • Number of Number signs
  • Number of Dollar signs
  • Number of percent signs
  • Number of Ampersands
  • Number of Single quotes
  • Number of Left parentheses
  • Number of Right parentheses
  • Number of Asterisks
  • Number of Plus signs
  • Number of Commas
  • Number of Dashes
  • Number of Periods
  • Number of Forward slashes
  • Number of Colons
  • Number of Semi-colons
  • Number of Less than signs
  • Number of Equal signs
  • Number of Greater than signs
  • Number of Question marks
  • Number of multiple question marks
  • Number of multiple exclamation marks
  • Number of ellipsis
  • Number of At signs
  • Number of Left square brackets
  • Number of Back slashes
  • Number of Right square brackets
  • Number of Caret signs
  • Number of Underscores
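
A handful of these features can be computed from cleaned text as in the Python sketch below. It covers only a small subset of the 62 features, and the tokenization rules are assumptions made for illustration: words are whitespace-separated tokens (so attached punctuation counts toward word length), and a sentence ends at ".", "!" or "?" followed by whitespace. The function name and dictionary keys are hypothetical.

```python
import re


def extract_features(text):
    """Compute a small, illustrative subset of the stylometry features."""
    # Assumption: a word is any whitespace-separated token.
    words = text.split()
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "avg_words_per_sentence":
            len(words) / len(sentences) if sentences else 0.0,
        "avg_word_length":
            sum(len(w) for w in words) / len(words) if words else 0.0,
        "white_spaces": sum(1 for ch in text if ch.isspace()),
        "commas": text.count(","),
        "periods": text.count("."),
        "question_marks": text.count("?"),
        "exclamation_points": text.count("!"),
        "sentences_upper": sum(1 for s in sentences if s[0].isupper()),
        "sentences_lower": sum(1 for s in sentences if s[0].islower()),
    }


features = extract_features("Then, it got the best of me. it is a fact!")
```

Each sample's dictionary would then be flattened into a fixed-order vector before normalization and classification.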
11
k-NEAREST NEIGHBOR USING EUCLIDEAN DISTANCE
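
As a rough illustration of the technique named on this slide, the Python sketch below classifies a query feature vector by majority vote among its k nearest training samples under Euclidean distance. The default of k = 10 assumes that "Defaulted to 10" on the program slide refers to k; the function names, the data layout, and the tie-breaking behavior are assumptions and not taken from the project.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k=10):
    """Predict the author of `query` by majority vote of its k nearest samples.

    `training` is a list of (feature_vector, author) pairs whose vectors are
    already normalized to the 0-1 range. If fewer than k samples exist, all
    of them vote. Ties go to the author encountered first in distance order.
    """
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(author for _, author in nearest)
    return votes.most_common(1)[0][0]


training = [
    ([0.0, 0.0], "A"),
    ([0.1, 0.1], "A"),
    ([0.9, 0.9], "B"),
    ([1.0, 1.0], "B"),
    ([0.8, 1.0], "B"),
]
prediction = knn_classify(training, [0.05, 0.0], k=3)
```

With k = 3 the query above has two "A" neighbors and one "B" neighbor, so the vote goes to "A".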
12
CLASSIFICATION PHASE
13
DESIGN MODEL
14
ANALYSIS MODEL
15
PROJECT HOME PAGE
http://utopia.csis.pace.edu/cs615/2006-2007/team2/
16
QUESTIONS
Contact cojar@pace.edu or ctappert@pace.edu for
more information, or visit
http://utopia.csis.pace.edu/cs615/2006-2007/team2