Author Attribution: Alex Genkin, Paul Kantor, David Lewis, David Madigan, and Fred Roberts

1
Author Attribution
Alex Genkin, Paul Kantor,
David Lewis, David Madigan, and Fred Roberts
Presentation for ISI Panel, 11-June-2004
2
Statistical Analysis of Text
  • Statistical text analysis has a long history in
    literary analysis and in solving disputed
    authorship problems
  • The first (?) was Thomas C. Mendenhall in 1887

3
Mendenhall
  • Mendenhall was Professor of Physics at Ohio State
    and at the University of Tokyo, Superintendent of
    the U.S. Coast and Geodetic Survey, and later
    President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska
4
χ² = 127.2, df = 12
5
  • Hamilton versus Madison
  • Used Naïve Bayes with Poisson and Negative
    Binomial models
  • Out-of-sample predictive performance
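
As a rough illustration of the Poisson variant mentioned above, the sketch below scores a disputed text under a per-author Poisson model of marker-word counts and picks the author with the higher log-likelihood. It assumes equal priors, whitespace tokenization, and a hypothetical smoothing constant; the function names and toy texts are invented for illustration and are not taken from the original study. The Negative Binomial variant is not covered.

```python
import math
from collections import Counter

def fit_rates(docs, vocab, smoothing=0.5):
    """Estimate a per-token occurrence rate for each marker word (one author)."""
    counts = Counter()
    total_tokens = 0
    for doc in docs:
        tokens = doc.lower().split()
        total_tokens += len(tokens)
        counts.update(t for t in tokens if t in vocab)
    # Additive smoothing (an assumption) keeps every rate strictly positive.
    return {w: (counts[w] + smoothing) / (total_tokens + smoothing) for w in vocab}

def poisson_log_likelihood(doc, rates):
    """Sum of independent Poisson log-probabilities of the marker-word counts."""
    tokens = doc.lower().split()
    n = max(len(tokens), 1)
    counts = Counter(tokens)
    ll = 0.0
    for word, rate in rates.items():
        lam = rate * n          # expected count in a text of this length
        k = counts[word]
        ll += k * math.log(lam) - lam - math.lgamma(k + 1)
    return ll

def attribute(doc, author_rates):
    """Return the author whose model gives the highest log-likelihood."""
    return max(author_rates, key=lambda a: poisson_log_likelihood(doc, author_rates[a]))

# Toy training texts (hypothetical, for illustration only).
vocab = ["upon", "whilst"]
author_rates = {
    "author_a": fit_rates(["upon the whole upon the matter"], vocab),
    "author_b": fit_rates(["whilst the whole whilst the matter"], vocab),
}
guess = attribute("upon reflection the case rests upon this", author_rates)
```

Out-of-sample performance would be measured by fitting the rates on some texts of known authorship and calling `attribute` only on texts held out from fitting.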

6
DIMACS Project Focus
Identification of authors in large collections of
documents
  • traditional disputed authorship (choose between k
    known authors)
  • clustering of putative authors (e.g., internet
    handles termin8r, heyr, KaMaKaZie)
  • document pair analysis: were two documents
    written by the same author?
  • odd-man-out: were these documents written by one
    of this set of authors or by someone else?

7
Representation
  • Long tradition in stylometry that seeks a small
    number of textual characteristics that
    distinguish the texts of authors from one another
    (Burrows, Holmes, Binongo, Hoover, Mosteller &
    Wallace, McMenamin, Tweedie, etc.)
  • Typically use function words (a, with, as,
    were, all, would, etc.) followed by PCA or cluster
    analysis
  • Function words aim to be topic-independent
  • Hoover (2003) shows that using all high-frequency
    words does a better job than function words alone
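
A minimal sketch of the function-word representation described above: each text becomes a vector of function-word rates per 1,000 tokens, which could then be passed to PCA or a clustering routine. The word list is just the handful of examples named on the slide, and the tokenizer and function name are assumptions.

```python
import re
from collections import Counter

# The example function words listed on the slide; a real study would use a longer list.
FUNCTION_WORDS = ["a", "with", "as", "were", "all", "would"]

def function_word_profile(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each function word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [1000.0 * counts[w] / total for w in words]

profile = function_word_profile("All would be well as a rule")
```

Normalizing by text length keeps long and short documents comparable, which matters before any distance-based analysis.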

8
What is a Function Word?
  • ago
  • ah
  • ain't
  • all
  • almost
  • along
  • already
  • also
  • although
  • always
  • am
  • among
  • an
  • and
  • another
  • any
  • anybody
  • anyone
  • anything
  • anywhere
  • are
  • aren't
  • around
  • art
  • as
  • aside
  • at
  • away
  • ay
  • a
  • about
  • above
  • according
  • accordingly
  • actual
  • actually
  • after
  • afterward
  • afterwards
  • again
  • against

9
Initial RCV-1 Experiments
  • Reuters RCV-1, 109,433 articles, 2,400 authors
  • Top 60 authors
  • 1 versus 59
  • all words: average F1 92.5
  • 480 function words: average F1 43.1
  • MW 70 filler words: average F1 31.1
  • Author pair experiments do better with function
    words
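
For reference, one way an "average F1" over one-versus-rest author classifiers could be computed is sketched below. It assumes the average is an unweighted mean of per-author F1 scores; the slides do not say which averaging was used, and the labels are toy values.

```python
def f1_for_author(true_labels, predicted_labels, author):
    """F1 for one author treated as the positive class (one versus rest)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == author and p == author)
    fp = sum(1 for t, p in pairs if t != author and p == author)
    fn = sum(1 for t, p in pairs if t == author and p != author)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(true_labels, predicted_labels):
    """Unweighted mean of per-author F1 scores (an assumed averaging scheme)."""
    authors = sorted(set(true_labels))
    return sum(f1_for_author(true_labels, predicted_labels, a) for a in authors) / len(authors)

# Toy labels for two hypothetical authors.
true_labels = ["x", "x", "y", "y"]
predicted_labels = ["x", "y", "y", "y"]
score = average_f1(true_labels, predicted_labels)
```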

10
Idiosyncratic Usage
  • Idiosyncratic usage (misspellings, repeated
    neologisms, etc.) is less formalized in the
    literature but apparently useful. For example,
    Foster's unmasking of Klein as the author of
    Primary Colors
  • Klein and Anonymous loved unusual adjectives
    ending in -y and -inous: cartoony, chunky,
    crackly, dorky, snarly, ..., slimetudinous,
    vertiginous, ...
  • Both Klein and Anonymous added letters to their
    interjections: ahh, aww, naww
  • Both Klein and Anonymous loved to coin words
    beginning in hyper-, mega-, post-, quasi-, and
    semi-, more than all others put together
  • Klein and Anonymous use "riffle" to mean "rifle"
    or "rustle", a usage for which the OED provides no
    instance in the past thousand years
  • Google?
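
One of the markers above, coinages built on a small set of prefixes, lends itself to a simple automated check. The sketch below counts tokens that start with one of the listed prefixes and are absent from a reference vocabulary. The `known_words` set stands in for a dictionary or spelling checker and is an assumption, as are the function name and the toy sentence.

```python
import re
from collections import Counter

PREFIXES = ("hyper", "mega", "post", "quasi", "semi")  # prefixes named on the slide

def coined_prefix_words(text, known_words):
    """Count tokens that begin with a listed prefix and are not in the reference vocabulary."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    hits = Counter()
    for tok in tokens:
        if tok in known_words:
            continue  # an ordinary dictionary word such as "postal" is not a coinage
        for prefix in PREFIXES:
            remainder = tok[len(prefix):].lstrip("-")
            if tok.startswith(prefix) and remainder:
                hits[prefix] += 1
                break
    return hits

# Hypothetical reference vocabulary and sentence, for illustration only.
known_words = {"a", "at", "best", "memo", "postal"}
hits = coined_prefix_words(
    "A mega-dorky, quasi-official postal memo, semicrackly at best.", known_words
)
```

Comparing such counts per 1,000 tokens across candidate authors would turn the observation into a feature.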

11
Koppel and Schler (2003)
12
Document Pairs
  • Goal: classify a pair of documents as "same
    author" or "different author"
  • Training data comprise pairs of documents with
    known authors
  • Representation:
  • Comparisons of original features (e.g., do both
    documents use "a.m." instead of "am"?)
  • Agreement for single- or multiple-author
    classifiers
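
A hedged sketch of what a pair representation built from comparisons of original features might look like: for each feature word, the absolute difference in relative frequency between the two documents, plus an indicator of whether both documents agree on using or not using the word. The feature choice and names are assumptions; the resulting vectors would feed any binary same-author or different-author classifier.

```python
import re
from collections import Counter

def relative_freqs(text, words):
    """Relative frequency of each feature word in one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in words]

def pair_features(doc_a, doc_b, words):
    """Feature vector for a document pair: frequency gaps, then usage agreement flags."""
    fa = relative_freqs(doc_a, words)
    fb = relative_freqs(doc_b, words)
    gaps = [abs(x - y) for x, y in zip(fa, fb)]
    agree = [1 if (x > 0) == (y > 0) else 0 for x, y in zip(fa, fb)]
    return gaps + agree

features = pair_features("all would be well", "all were lost", ["all", "would", "were"])
```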

13
Clustering Putative Authors
  • Analogous to the paired-document classifier,
    build a paired-putative-author classifier (e.g.,
    centroids)
  • Constrained clustering of original documents
  • Craft a feature representation for a pair of
    putative authors and cluster these (e.g.,
    centroids)
  • Build a classifier for each putative author and
    cluster the resulting classifiers
  • For messages, we need to incorporate recipient
    information
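
The centroid idea can be sketched as follows: average the document vectors of each putative author (handle), then link handles whose centroids are close in cosine similarity. The threshold, the single-link merging, and the toy two-dimensional vectors are all assumptions made for illustration; recipient information is not modeled here.

```python
import math

def centroid(vectors):
    """Component-wise mean of a handle's document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def cluster_handles(docs_by_handle, threshold=0.9):
    """Single-link grouping of handles whose centroids exceed the similarity threshold."""
    handles = sorted(docs_by_handle)
    centroids = {h: centroid(docs_by_handle[h]) for h in handles}
    parent = {h: h for h in handles}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path compression
            h = parent[h]
        return h

    for i, a in enumerate(handles):
        for b in handles[i + 1:]:
            if cosine(centroids[a], centroids[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for h in handles:
        clusters.setdefault(find(h), []).append(h)
    return sorted(clusters.values())

# Toy feature vectors (made up) attached to the handles named earlier in the talk.
docs_by_handle = {
    "termin8r": [[1.0, 0.0], [0.9, 0.1]],
    "heyr": [[0.95, 0.05]],
    "KaMaKaZie": [[0.0, 1.0]],
}
clusters = cluster_handles(docs_by_handle)
```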

14
Odd-Man Out
  • Training data contains documents by authors
    a_1, ..., a_{n-1}
  • Test data contains documents by some subset of
    authors a_1, ..., a_n
  • Bayesian hierarchical model incorporates prior
    knowledge that model parameters for different
    authors differ from each other
  • Initial success on small-scale simulated examples
  • Generalizations for more than one new author

15
Conclusion
  • Large-scale newsgroup dataset
  • Early days for the Science of Author Attribution
  • Applications in security
  • Unabomber could have been identified from his
    published writing (Foster, 2000)
  • Were these intercepted messages all written by
    the same person?
  • Did a known terrorist write this document?

16
Machine Learning
  • Newer work uses larger numbers of features along
    with machine learning/supervised learning
    statistical methods (de Vel, Corney, Koppel,
    Argamon, Ramyaa, etc.)

17
  • Corney in particular has had success with SVMs
    and several thousand features for e-mail author
    attribution
  • We are currently expanding his feature set and
    also looking at collocations/interactions, word
    n-grams, and character strings, body parts, food,
    etc.
  • Developing algorithms for idiosyncratic feature
    detection (e.g., using spelling and grammar
    checkers)
  • We are beginning author classification
    experiments with Enron and Reuters data

18
Disputed Authorship
  • 1-of-K classification problem
  • Extend MMS's Bayesian Binary Regression software
    to the polychotomous case
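
For orientation, the standard polychotomous (multinomial) logistic form for a 1-of-K problem is written out below, assuming the binary model being extended is logistic. Here x is a document's feature vector, and each author k has a coefficient vector β_k; the slides do not specify the prior placed on the coefficients, so none is shown.

```latex
P(y = k \mid x) = \frac{\exp(\beta_k^{\top} x)}{\sum_{j=1}^{K} \exp(\beta_j^{\top} x)},
\qquad k = 1, \dots, K
```

With K = 2 and β_2 fixed at zero, this reduces to the binary logistic model P(y = 1 | x) = 1 / (1 + exp(-β_1^T x)).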

19
Generative Models
  • Work with the Smyth et al. KDD project on
    generative models for words and documents

[Figure: graphical model diagram. Labels: word dist., topic-author dist., topics, topic, word, author, authors, words, documents]
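
A generative process consistent with the diagram labels (author, topic, word) can be sketched as below: for each word position, pick one of the document's authors, draw a topic from that author's topic distribution, then draw a word from that topic's word distribution. This is an assumed reading of the figure, with uniform author choice and toy distributions; parameter estimation is not shown.

```python
import random

def generate_document(authors, author_topic, topic_word, length, rng):
    """Sample words via author -> topic -> word, one draw of each per position."""
    words = []
    for _ in range(length):
        author = rng.choice(authors)  # uniform over the document's authors (assumption)
        topics, topic_weights = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_weights)[0]
        vocab, word_weights = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

# Toy distributions (hypothetical values).
author_topic = {
    "a1": {"markets": 0.8, "sports": 0.2},
    "a2": {"markets": 0.1, "sports": 0.9},
}
topic_word = {
    "markets": {"shares": 0.6, "bank": 0.4},
    "sports": {"match": 0.7, "goal": 0.3},
}
rng = random.Random(0)  # fixed seed so repeated runs give the same sample
doc = generate_document(["a1", "a2"], author_topic, topic_word, 20, rng)
```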
20
Generative Models for Genre
  • Genre: personal e-mail, technical writing,
    announcements, etc.
  • Trans-genre prediction
  • Data spidered from technical newsgroups and from
    there to technical publications

[Figure: graphical model diagram. Labels: author, genre, feature, features]
25
Datasets
  • The Enron data: 200,871 e-mails, 71 authors
  • Reuters RCV-1: 109,433 articles, 2,400 authors
  • Newsgroup postings?

26
Summary
  • Identify effective features for author
    identification. The current state calls for
    systematic evaluation and further development.
  • Evaluate these features in the context of a
    traditional author identification problem
  • Develop tools for segmenting texts before feature
    extraction (e.g., salutation, close, address
    block, quoted text)
  • Examine clustering algorithms for putative
    authors
  • Investigate the feasibility of probabilistic
    generative models for genre-specific author
    identification
  • Investigate algorithms for predicting whether or
    not two documents were written by the same author