Author Attribution: Alex Genkin, Paul Kantor, David Lewis, David Madigan, and Fred Roberts


1
Author Attribution
Alex Genkin, Paul Kantor,
David Lewis, David Madigan, and Fred Roberts
Presentation for ISI Panel, 11-June-2004
2
Statistical Analysis of Text

Statistical text analysis has a long history in
literary analysis and in solving disputed
authorship problems
First (?) is Thomas C. Mendenhall in 1887

3
Mendenhall

Mendenhall was Professor of Physics at Ohio State
and at the University of Tokyo, Superintendent of the
U.S. Coast and Geodetic Survey, and later,
President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska
4
χ² = 127.2, df = 12
5

Hamilton versus Madison
Mosteller and Wallace used naïve Bayes with Poisson
and negative binomial models
Out-of-sample predictive performance
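
A minimal sketch of the Poisson variant of this idea: each word count in the disputed text is scored under each author's usage rate, and the log-likelihood ratios are summed across words (the naive independence assumption). The rates, the disputed counts, and the function names below are hypothetical values chosen for illustration, not figures from the original study.

```python
import math

def poisson_log_pmf(count, mean):
    """Log probability of `count` events under a Poisson with the given mean."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def log_odds(counts, length, rates_a, rates_b):
    """Summed log-likelihood ratio, author A versus author B.

    counts: word -> occurrences in the disputed text
    length: text length in thousands of words
    rates_a, rates_b: word -> assumed rate per thousand words
    """
    total = 0.0
    for word, count in counts.items():
        total += poisson_log_pmf(count, rates_a[word] * length)
        total -= poisson_log_pmf(count, rates_b[word] * length)
    return total

# Hypothetical rates per 1,000 words (illustrative only).
rates_hamilton = {"upon": 3.0, "whilst": 0.1}
rates_madison = {"upon": 0.2, "whilst": 0.5}
disputed = {"upon": 0, "whilst": 1}

score = log_odds(disputed, 2.0, rates_hamilton, rates_madison)
print("favours Hamilton" if score > 0 else "favours Madison")
```

A negative binomial version would replace `poisson_log_pmf` with a distribution that allows extra variance in the counts.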

6
DIMACS Project Focus
Identification of authors in large collections of
documents

traditional disputed authorship: choose between k
known authors
clustering of putative authors (e.g., internet
handles termin8r, heyr, KaMaKaZie)
document pair analysis: were two documents
written by the same author?
odd-man-out: were these documents written by one
of this set of authors or by someone else?

7
Representation

Long tradition in stylometry that seeks a small
number of textual characteristics that
distinguish the texts of authors from one another
(Burrows, Holmes, Binongo, Hoover, Mosteller &
Wallace, McMenamin, Tweedie, etc.)
Typically use function words (a, with, as,
were, all, would, etc.) followed by PCA and cluster
analysis
Function words aim to be topic-independent
Hoover (2003) shows that using all high-frequency
words does a better job than function words alone
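
As an illustration of the representation described above, the sketch below turns a text into relative frequencies of a few function words. The vocabulary is just the handful of examples named on this slide, and the tokenizer and function name are assumptions made for the example; a real study would use a much longer list.

```python
import re
from collections import Counter

# The example function words from this slide; real lists are far longer.
FUNCTION_WORDS = ["a", "with", "as", "were", "all", "would"]

def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]

profile = function_word_profile("All were as one, as all would be with a friend.")
print(dict(zip(FUNCTION_WORDS, profile)))
```

Vectors like `profile` are what PCA or cluster analysis would then operate on, one vector per text.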

8
What is a Function Word?

a, about, above, according, accordingly, actual,
actually, after, afterward, afterwards, again,
against, ago, ah, ain't, all, almost, along,
already, also, although, always, am, among, an,
and, another, any, anybody, anyone, anything,
anywhere, are, aren't, around, art, as, aside,
at, away, ay

9
Initial RCV-1 Experiments

Reuters RCV-1: 109,433 articles, 2,400 authors
Top 60 authors
1 versus 59
All words: average F1 92.5
480 function words: average F1 43.1
MW 70 filler words: average F1 31.1
Author pair experiments do better with function
words
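
For reference, a sketch of how an average F1 over authors can be computed. It assumes one F1 per author, with that author as the positive class against all the others, followed by an unweighted mean; the slides do not say which averaging was used, and the labels below are toy data.

```python
def f1_for_author(true_labels, predicted_labels, author):
    """F1 for one author treated as the positive class (1 versus the rest)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == author and p == author for t, p in pairs)
    fp = sum(t != author and p == author for t, p in pairs)
    fn = sum(t == author and p != author for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-author F1 scores."""
    authors = sorted(set(true_labels))
    scores = [f1_for_author(true_labels, predicted_labels, a) for a in authors]
    return sum(scores) / len(scores)

# Toy labels for two hypothetical authors.
truth = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
print(round(average_f1(truth, predicted), 3))
```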

10
Idiosyncratic Usage

Idiosyncratic usage (misspellings, repeated
neologisms, etc.) is less formalized in the
literature but apparently useful. For example,
Foster's unmasking of Klein as the author of
Primary Colors:
Klein and Anonymous loved unusual adjectives
ending in -y and -inous: cartoony, chunky,
crackly, dorky, snarly, ..., slimetudinous,
vertiginous, ...
Both Klein and Anonymous added letters to their
interjections: ahh, aww, naww.
Both Klein and Anonymous loved to coin words
beginning in hyper-, mega-, post-, quasi-, and
semi-, more than all others put together
Klein and Anonymous use "riffle" to mean "rifle"
or "rustle", a usage for which the OED provides no
instance in the past thousand years
Google?
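
Two of the habits listed above can be counted with simple patterns. This is only a sketch: the regular expressions are assumptions written for this example (they catch hyphenated coinages and a few stretched interjections, nothing more), and they are not Foster's method.

```python
import re

# Hyphenated coinages with the prefixes named on the slide.
PREFIXED = re.compile(r"\b(?:hyper|mega|post|quasi|semi)-\w+", re.IGNORECASE)
# Interjections with added letters, such as "ahh", "aww", "naww".
STRETCHED = re.compile(r"\bn?a+(?:h{2,}|w{2,})\b", re.IGNORECASE)

def idiosyncrasy_counts(text):
    """Count occurrences of each (assumed) idiosyncratic pattern in a text."""
    return {
        "prefixed_coinages": len(PREFIXED.findall(text)),
        "stretched_interjections": len(STRETCHED.findall(text)),
    }

sample = "Aww, a quasi-cartoony, mega-dorky grin. Ah, naww."
print(idiosyncrasy_counts(sample))
```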

11
Koppel and Schler (2003)
12
Document Pairs

Goal: classify a pair of documents as same
author or different author
Training data comprise pairs of documents with
known authors
Representation:
Comparisons of original features (e.g., do both
documents use "a.m." instead of "am"?)
Agreement for single or multiple author
classifiers
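
One way to read the "comparisons of original features" representation is sketched below: each document is reduced to a few binary style markers, and the pair is described by whether the two documents agree on each marker. The two markers, the patterns, and the feature names are hypothetical; the resulting dictionary would be the input to a same-author/different-author classifier.

```python
import re

def marker_features(text):
    """Binary style markers for one document (illustrative choices)."""
    return {
        "uses_a.m.": bool(re.search(r"\d\s*a\.m\.", text)),
        "uses_am": bool(re.search(r"\d\s*am\b", text)),
    }

def pair_features(doc_a, doc_b):
    """Describe a document pair by agreement on each marker."""
    fa, fb = marker_features(doc_a), marker_features(doc_b)
    features = {}
    for name in fa:
        features["both_" + name] = fa[name] and fb[name]
        features["differ_" + name] = fa[name] != fb[name]
    return features

same_style = pair_features("Meet at 9 a.m. sharp.", "Call me at 10 a.m. tomorrow.")
mixed_style = pair_features("Meet at 9 a.m. sharp.", "Call me at 10am tomorrow.")
print(same_style)
print(mixed_style)
```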

13
Clustering Putative Authors

Analogous to the paired-document classifier,
build a paired-putative-author classifier (e.g.,
centroids)
Constrained clustering of original documents
Craft a feature representation for a pair of
putative authors and cluster these (e.g.,
centroids)
Build a classifier for each putative author and
cluster the resulting classifiers
For messages, we need to incorporate recipient
information
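
A minimal sketch of the centroid idea: average the feature vectors of each handle's documents, then group handles whose centroids are close. The cosine measure, the threshold, the single-link grouping, and the two-dimensional vectors are all assumptions made for illustration; the handles are the ones mentioned earlier in the talk.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_handles(docs_by_handle, threshold=0.95):
    """Single-link grouping of handles whose centroids are similar."""
    centroids = {h: centroid(vs) for h, vs in docs_by_handle.items()}
    parent = {h: h for h in centroids}

    def find(h):
        while parent[h] != h:
            h = parent[h]
        return h

    for a, b in combinations(centroids, 2):
        if cosine(centroids[a], centroids[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for h in centroids:
        groups.setdefault(find(h), []).append(h)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical feature vectors for documents posted under each handle.
docs = {
    "termin8r": [[1.0, 0.0], [0.9, 0.1]],
    "heyr": [[0.95, 0.05]],
    "KaMaKaZie": [[0.0, 1.0]],
}
groups = cluster_handles(docs)
print(groups)
```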

14
Odd-Man Out

Training data contains documents by authors
a_1, ..., a_{n-1}
Test data contains documents by some subset of
authors a_1, ..., a_n
Bayesian hierarchical model incorporates prior
knowledge that model parameters for different
authors differ from each other
Initial success on small-scale simulated examples
Generalizations for more than one new author

15
Conclusion

Large-scale newsgroup dataset
Early days for the Science of Author Attribution
Applications in security
Unabomber could have been identified from his
published writing (Foster, 2000)
Were these intercepted messages all written by
the same person?
Did a known terrorist write this document?

16
Machine Learning

Newer work uses larger numbers of features along
with machine learning/supervised learning
statistical methods (de Vel, Corney, Koppel,
Argamon, Ramyaa, etc.)

Corney in particular has had success with SVMs
and several thousand features for e-mail author
attribution
We are currently expanding his feature set and
also looking at co-locations/interactions, word
n-grams, and character strings, body parts, food,
etc.
Developing algorithms for idiosyncratic feature
detection (e.g., using spelling and grammar
checkers)
We are beginning author classification
experiments with Enron and Reuters data
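
Character strings are among the cheapest of these features to extract. The sketch below counts overlapping character n-grams after lowercasing and collapsing whitespace; both normalization steps and the default n = 3 are assumptions for the example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams in normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

grams = char_ngrams("The  theme")
print(grams.most_common(3))
```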

18
Disputed Authorship

1-of-K classification problem
Extend MMS's Bayesian Binary Regression software
to the polychotomous case
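
The prediction step of a polychotomous (1-of-K) logistic model can be sketched as a softmax over per-author linear scores. This is not the Bayesian Binary Regression software: the priors and the fitting procedure are omitted, and the weights and author names below are hypothetical.

```python
import math

def author_probabilities(features, weights):
    """Softmax over per-author linear scores.

    features: list of feature values for one document
    weights: author -> list of coefficients (same length as features)
    """
    scores = {
        author: sum(w * x for w, x in zip(coefs, features))
        for author, coefs in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {author: math.exp(s - top) for author, s in scores.items()}
    total = sum(exps.values())
    return {author: e / total for author, e in exps.items()}

# Hypothetical coefficients for three candidate authors.
weights = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 0.0]}
probs = author_probabilities([2.0, 0.0], weights)
print(max(probs, key=probs.get))
```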

19
Generative Models

Work with the Smyth et al. KDD project on
generative models for words and documents

[Figure: graphical model; labels: author(s), topic-author dist., topic(s), word dist., word(s), documents]
20
Generative Models for Genre

Genre: personal e-mail, technical writing,
announcements, etc.
Trans-genre prediction
Data spidered from technical newsgroups and from
there to technical publications

[Figure: graphical model; labels: author, genre, feature(s)]
25
Datasets

The Enron data
200,871 e-mails
71 authors

Reuters RCV-1
109,433 articles
2,400 authors

Newsgroup postings?

26
Summary

Identify effective features for author
identification. The current state calls for
systematic evaluation and further development.
Evaluate these features in the context of a
traditional author identification problem
Develop tools for segmenting texts before feature
extraction (e.g., salutation, close, address
block, quoted text)
Examine clustering algorithms for putative
authors
Investigate the feasibility of probabilistic
generative models for genre-specific author
identification
Investigate algorithms for predicting whether or
not two documents were written by the same author
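
As a rough illustration of the segmentation item in this summary, the sketch below labels each line of a message as salutation, quoted text, close, or body using simple line-level rules. The keyword lists and the rules are assumptions made for the example; a real segmenter would need to handle address blocks, signatures, and many more conventions.

```python
import re

SALUTATION = re.compile(r"^(hi|hello|dear)\b", re.IGNORECASE)
CLOSE = re.compile(r"^(regards|best|cheers|thanks|sincerely)\b", re.IGNORECASE)

def segment_message(text):
    """Assign each non-empty line to one segment type (heuristic rules)."""
    segments = {"salutation": [], "quoted": [], "close": [], "body": []}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(">"):
            segments["quoted"].append(stripped)
        elif SALUTATION.match(stripped):
            segments["salutation"].append(stripped)
        elif CLOSE.match(stripped):
            segments["close"].append(stripped)
        else:
            segments["body"].append(stripped)
    return segments

message = "Hi Fred,\n> old text\nThe draft is ready.\nRegards,\nDavid"
segments = segment_message(message)
print(segments["body"])
```

Note that the signature line ends up in the body here, which is the kind of case a fuller tool would have to treat separately.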