Title: Author-Topic Models for Large Text Corpora
1Author-Topic Models for Large Text Corpora
- Padhraic Smyth, Department of Computer Science
- University of California, Irvine
- In collaboration with Mark Steyvers (UCI)
- Michal Rosen-Zvi (UCI)
- Tom Griffiths (Stanford)
2Outline
- Problem motivation
- Modeling large sets of documents
- Probabilistic approaches
- topic models → author-topic models
- Results
- Author-topic results from CiteSeer, NIPS, Enron data
- Applications of the model
- (Demo of author-topic query tool)
- Future directions
3Data Sets of Interest
- Data set of documents
- Large collections of documents (10k, 100k, etc.)
- Know authors of the documents
- Know years/dates of the documents
- (will typically assume a bag-of-words representation)
4Examples of Data Sets
- CiteSeer
- 160k abstracts, 80k authors, 1986-2002
- NIPS papers
- 2k papers, 1k authors, 1987-1999
- Reuters
- 20k newspaper articles, 114 authors
5Pennsylvania Gazette
1728-1800, 80,000 articles, 25 million words (www.accessible.com)
6Enron email data
500,000 emails, 5,000 authors, 1999-2002
8Problems of Interest
- What topics do these documents span?
- Which documents are about a particular topic?
- How have topics changed over time?
- What does author X write about?
- Who is likely to write about topic Y?
- Who wrote this specific document?
- and so on..
9A topic is represented as a (multinomial)
distribution over words
10Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
11Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
P(probabilistic | topic 1) = 0.25, P(learning | topic 1) = 0.50, P(Bayesian | topic 1) = 0.25, P(other words | topic 1) = 0.00
P(information | topic 2) = 0.5, P(retrieval | topic 2) = 0.5, P(other words | topic 2) = 0.0
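As a toy illustration of the cluster model above, a document's likelihood under a cluster is just the product of its word probabilities. The sketch below uses the slide's probabilities; reading the interleaved word lists as two document columns is an assumption of the sketch:

```python
# Toy sketch of the one-cluster-per-document model (word lists assumed as above).
topic1 = {"probabilistic": 0.25, "learning": 0.50, "bayesian": 0.25}
topic2 = {"information": 0.5, "retrieval": 0.5}

def doc_likelihood(words, topic):
    """P(words | topic): product of per-word probabilities (bag of words)."""
    p = 1.0
    for w in words:
        p *= topic.get(w, 0.0)  # words outside the topic get probability 0
    return p

doc1 = ["probabilistic", "learning", "learning", "bayesian"]
p1 = doc_likelihood(doc1, topic1)  # 0.25 * 0.5 * 0.5 * 0.25 = 0.015625
p2 = doc_likelihood(doc1, topic2)  # 0.0: no word overlap with topic 2
```

Note how a single word outside a cluster's support drives the whole document's likelihood to zero, which is why cluster models are brittle for mixed-topic documents.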
12Graphical Model
z
Cluster Variable
w
Word
n words
13Graphical Model
z
Cluster Variable
w
Word
n words
D documents
14Graphical Model
Cluster Weights
a
z
Cluster Variable
f
Cluster-Word distributions
w
Word
n words
D documents
15Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
16Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
17Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
18History of topic models
- Latent class models in statistics (late 60s)
- Hofmann (1999)
- Original application to documents
- Blei, Ng, and Jordan (2001, 2003)
- Variational methods
- Griffiths and Steyvers (2003, 2004)
- Gibbs sampling approach (very efficient)
19Word/Document countsfor 16 Artificial Documents
documents
Can we recover the original topics and topic
mixtures from this data?
20Example of Gibbs Sampling
- Assign word tokens randomly to topics (topic 1 or topic 2)
21After 1 iteration
- Apply sampling equation to each word token
22After 4 iterations
23After 32 iterations
24Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
25Author-Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
26Author-Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
27Approach
- The author-topic model
- a probabilistic model linking authors and topics
- authors → topics → words
- learned from data
- completely unsupervised, no labels
- generative model
- Different questions or queries can be answered by appropriate probability calculus
- E.g., p(author | words in document)
- E.g., p(topic | author)
28Graphical Model
x
Author
z
Topic
29Graphical Model
x
Author
z
Topic
w
Word
30Graphical Model
x
Author
z
Topic
w
Word
n
31Graphical Model
a
x
Author
z
Topic
w
Word
n
D
32Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
33Generative Process
- Let's assume authors A1 and A2 collaborate and produce a paper
- A1 has multinomial topic distribution θ1
- A2 has multinomial topic distribution θ2
- For each word in the paper
- Sample an author x (uniformly) from A1, A2
- Sample a topic z from θx
- Sample a word w from the multinomial topic distribution φz
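A minimal sketch of this generative process in Python (the toy θ, φ tables and the four-word vocabulary are illustrative assumptions, not values from the model):

```python
import random

def generate_document(authors, theta, phi, n_words, rng=random):
    """Author-topic generative process for a single document.

    authors : author ids on the paper (e.g., [A1, A2])
    theta   : theta[a][t] = P(topic t | author a)
    phi     : phi[t][w]   = P(word w | topic t)
    """
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)                                     # author ~ uniform
        z = rng.choices(range(len(theta[x])), weights=theta[x])[0]  # topic ~ theta_x
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]      # word  ~ phi_z
        words.append(w)
    return words

# Toy run: two authors, two topics, a four-word vocabulary.
theta = [[0.9, 0.1], [0.1, 0.9]]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
doc = generate_document([0, 1], theta, phi, n_words=20)
```

Each token independently picks one of the paper's authors, so a collaboration naturally produces a blend of the co-authors' topic mixtures.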
34Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
35Learning
- Observed
- W observed words, A sets of known authors
- Unknown
- x, z: hidden variables
- θ, φ: unknown parameters
- Interested in
- p(x, z | W, A)
- p(θ, φ | W, A)
- But exact inference is not tractable
36Step 1 Gibbs sampling of x and z
a
x
Author
q
Marginalize over unknown parameters
z
Topic
f
w
Word
n
D
37Step 2 MAP estimates of θ and φ
a
x
Author
Condition on particular samples of x and z
q
z
Topic
f
w
Word
n
D
38Step 2 MAP estimates of θ and φ
a
x
Author
q
Point estimates of unknown parameters
z
Topic
f
w
Word
n
D
39More Details on Learning
- Gibbs sampling for x and z
- Typically run 2000 Gibbs iterations
- 1 iteration = full pass through all documents
- Estimating θ and φ
- x and z sample → point estimates
- non-informative Dirichlet priors for θ and φ
- Computational Efficiency
- Learning is linear in the number of word tokens
- Predictions on new documents
- can average over θ and φ (from different samples, different runs)
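One way to read the point-estimate step: given a single sample of x and z, the parameters are estimated from the count matrices with Dirichlet smoothing. A sketch under that reading (the count-matrix layout and symmetric priors alpha, beta are assumptions of the sketch):

```python
def point_estimates(CWT, CAT, alpha, beta):
    """Point estimates of theta and phi from one Gibbs sample.

    CWT[w][t] : count of word w assigned to topic t
    CAT[a][t] : count of topic t assigned to author a
    Smoothed with symmetric Dirichlet priors alpha (topics) and beta (words).
    """
    V, T = len(CWT), len(CWT[0])
    topic_totals = [sum(CWT[w][t] for w in range(V)) for t in range(T)]
    phi = [[(CWT[w][t] + beta) / (topic_totals[t] + V * beta) for w in range(V)]
           for t in range(T)]
    theta = [[(row[t] + alpha) / (sum(row) + T * alpha) for t in range(T)]
             for row in CAT]
    return theta, phi

# Toy counts: word 0 always in topic 0, topic 0 always with author 0.
theta, phi = point_estimates([[4, 0], [0, 4]], [[4, 0], [0, 4]], alpha=0.5, beta=0.5)
```

Averaging such estimates over samples from different runs, as the slide notes, smooths out the variability of any single sample.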
40Gibbs Sampling
- Need the full conditional distributions for the hidden variables
- Probability of assigning the current word token i (with word w_i = m) to topic j and author k, given all other assignments:
- P(z_i = j, x_i = k | w_i = m, z_-i, x_-i, A) ∝ (C^WT_mj + β) / (Σ_m' C^WT_m'j + Vβ) × (C^AT_kj + α) / (Σ_j' C^AT_kj' + Tα)
- where C^WT_mj = number of times word m is assigned to topic j, and C^AT_kj = number of times topic j is assigned to author k (both counts excluding the current token)
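A sketch of one collapsed-Gibbs update of this form, resampling a single token's (topic, author) pair from the product of the two count ratios (the count-matrix layout and symmetric priors alpha, beta are assumptions of the sketch):

```python
import random

def sample_token(m, doc_authors, CWT, CAT, alpha, beta, rng=random):
    """Resample (topic, author) for one word token m in a collapsed Gibbs sweep.

    CWT[w][t] : count of word w assigned to topic t (current token excluded)
    CAT[a][t] : count of topic t assigned to author a (current token excluded)
    """
    V, T = len(CWT), len(CWT[0])
    topic_totals = [sum(CWT[w][t] for w in range(V)) for t in range(T)]
    pairs, weights = [], []
    for a in doc_authors:
        author_total = sum(CAT[a])
        for t in range(T):
            p_word = (CWT[m][t] + beta) / (topic_totals[t] + V * beta)
            p_topic = (CAT[a][t] + alpha) / (author_total + T * alpha)
            pairs.append((t, a))
            weights.append(p_word * p_topic)
    return rng.choices(pairs, weights=weights)[0]

# Counts strongly tie word 0 to topic 0 and topic 0 to author 0.
t, a = sample_token(0, [0, 1], [[9, 0], [0, 9]], [[9, 0], [0, 9]],
                    alpha=0.1, beta=0.1, rng=random.Random(0))
```

In a real sampler the current token is first decremented from both count matrices, the pair is sampled as above, and the counts are incremented at the new assignment.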
41Experiments on Real Data
- Corpora
- CiteSeer 160K abstracts, 85K authors
- NIPS 1.7K papers, 2K authors
- Enron 115K emails, 5K authors (sender)
- Pubmed 27K abstracts, 50K authors
- Removed stop words; no stemming
- Ignore word order, just use word counts
- Processing time
- NIPS: 2000 Gibbs iterations ≈ 8 hours
- CiteSeer: 2000 Gibbs iterations ≈ 4 days
42Four example topics from CiteSeer (T = 300)
43More CiteSeer Topics
44Some topics relate to generic word usage
45What can the Model be used for?
- We can analyze our document set through the topic lens
- Applications
- Queries
- Who writes on this topic?
- e.g., finding experts or reviewers in a particular area
- What topics does this person do research on?
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more ...
46Some likely topics per author (CiteSeer)
- Author: Andrew McCallum, U Mass
- Topic 1: classification, training, generalization, decision, data, ...
- Topic 2: learning, machine, examples, reinforcement, inductive, ...
- Topic 3: retrieval, text, document, information, content, ...
- Author: Hector Garcia-Molina, Stanford
- Topic 1: query, index, data, join, processing, aggregate, ...
- Topic 2: transaction, concurrency, copy, permission, distributed, ...
- Topic 3: source, separation, paper, heterogeneous, merging, ...
- Author: Paul Cohen, USC/ISI
- Topic 1: agent, multi, coordination, autonomous, intelligent, ...
- Topic 2: planning, action, goal, world, execution, situation, ...
- Topic 3: human, interaction, people, cognitive, social, natural, ...
47Temporal patterns in topics hot and cold topics
- We have CiteSeer papers from 1986-2002
- For each year, calculate the fraction of words assigned to each topic
- → a time series for topics
- Hot topics become more prevalent
- Cold topics become less prevalent
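The per-year topic fractions described above can be computed directly from the sampled token-topic assignments. A small sketch (the (year, topic)-pair data layout is an assumption):

```python
from collections import defaultdict

def topic_trends(assignments):
    """Fraction of word tokens assigned to each topic, per year.

    assignments : iterable of (year, topic) pairs, one per word token.
    Returns {year: {topic: fraction}} -- the time series behind hot/cold topics.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for year, topic in assignments:
        counts[year][topic] += 1
        totals[year] += 1
    return {y: {t: c / totals[y] for t, c in by_topic.items()}
            for y, by_topic in counts.items()}

# Four tokens across two years: topic 1 grows from half to all of the tokens.
trends = topic_trends([(1990, 0), (1990, 1), (1991, 1), (1991, 1)])
```

A topic whose fraction rises across years is "hot"; one whose fraction falls is "cold".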
55Four example topics from NIPS (T = 100)
56NIPS support vector topic
57NIPS neural network topic
58Pennsylvania Gazette Data
(courtesy of David Newman, UC Irvine)
59Enron email data
500,000 emails, 5,000 authors, 1999-2002
60Enron email topics
61Non-work Topics
62Topical Topics
63Enron email California Energy Crisis
Message-ID: <21993848.1075843452041.JavaMail.evans@thyme>
Date: Fri, 27 Apr 2001 09:25:00 -0700 (PDT)
Subject: California Update 4/27/01
... FERC price cap decision reflects Bush political and economic objectives. Politically, Bush is determined to let the crisis blame fall on Davis; from an economic perspective, he is unwilling to create disincentives for new power generation. The FERC decision is a holding move
by the Bush administration that looks like
action, but is not. Rather, it allows the
situation in California to continue to develop
virtually unabated. The political strategy
appears to allow the situation to deteriorate to
the point where Davis cannot escape shouldering
the blame. Once they are politically inoculated,
the Administration can begin to look at regional
solutions. Moreover, the Administration has
already made explicit (and will certainly restate
in the forthcoming Cheney commission report) its
opposition to stronger price caps ..
64Enron email US Senate Bill
- Message-ID: <23926374.1075846156491.JavaMail.evans@thyme>
- Date: Thu, 15 Jun 2000 08:59:00 -0700 (PDT)
- From
- To
- Subject: Senate Commerce Committee Pipeline Safety Markup
- The Senate Commerce Committee held a markup today where Senator John McCain's (R-AZ) pipeline safety legislation, S. 2438, was approved. The overall outcome was not unexpected -- the final legislation contained several provisions that went a little bit further than Enron and INGAA would have liked, ...
- 2) McCain amendment to Section 13(b) (on operator assistance investigations) -- Approved by voice vote. ...
- 3) Sen. John Kerry (D-MA) Amendment on Enforcement -- Approved by voice vote. Another confusing vote, in which many members did not understand the changes being made, but agreed to it on the condition that clarifications be made before Senate floor action. Late last night, Enron led a group ...
65Enron email political donations
- 10/16/2000 04:41 PM
-
- Subject Ashcroft Senate Campaign Request
- We have received a request from the Ashcroft Senate campaign for $10,000 in soft money. This is the race where Governor Carnahan is the challenger. Enron PAC has contributed $10,000 and Enron has also contributed $15,000 soft money in this campaign to Senator Ashcroft. Ken Lay has been personally interested in the Ashcroft campaign. Our polling information is that Ashcroft is currently leading 43 to 38 with an undecided of 19 percent.
-
- Message-ID: <2546687.1075846182883.JavaMail.evans@thyme>
- Date: Mon, 16 Oct 2000 14:13:00 -0700 (PDT)
- From
- To
- Subject Re Ashcroft Senate Campaign Request
67PubMed-Query Topics
68PubMed-Query Topics
69PubMed-Query Author Model
- P. M. Lindeque, South Africa
- TOPICS
- Topic 1: water, natural, foci, environmental, source (prob = 0.33)
- Topic 2: anthracis, anthrax, bacillus, spores, cereus (prob = 0.13)
- Topic 3: species, sp, isolated, populations, tested (prob = 0.06)
- Topic 4: epidemic, occurred, outbreak, persons (prob = 0.06)
- Topic 5: positive, samples, negative, tested (prob = 0.05)
- PAPERS
- Vaccine-induced protections against anthrax in cheetah
- Airborne movement of anthrax spores from carcass sites in the Etosha National Park
- Ecology and epidemiology of anthrax in the Etosha National Park
- Serology and anthrax in humans, livestock, and wildlife
70PubMed-Query Topics by Country
71PubMed-Query Topics by Country
72: 3 of 300 example topics (TASA)
73Word sense disambiguation (numbers/colors = topic assignments)
74Finding unusual papers for an author
Perplexity = exp(entropy(words | model)) -- a measure of surprise for the model on the data. We can calculate the perplexity of unseen documents, conditioned on the model for a particular author.
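Concretely, perplexity is the exponential of the average negative log-probability per word token. A minimal sketch:

```python
import math

def perplexity(word_log_probs):
    """exp of the average negative log-probability per word token."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# Sanity check: if every word has probability 1/8 under the author's model,
# the perplexity is exactly 8, regardless of document length.
pp = perplexity([math.log(1 / 8)] * 20)
```

A paper with unusually high perplexity under an author's model is a candidate "unusual paper" for that author.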
75Papers and Perplexities M_Jordan
76Papers and Perplexities M_Jordan
77Papers and Perplexities M_Jordan
78Papers and Perplexities T_Mitchell
79Papers and Perplexities T_Mitchell
80Papers and Perplexities T_Mitchell
81Author prediction with CiteSeer
- Task: predict the (single) author of new CiteSeer abstracts
- Results
- For 33% of documents, the author is guessed correctly
- Median rank of true author: 26 (out of 85,000)
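Author prediction of this kind can be sketched by scoring each author's topic mixture against the new document's words and ranking (the toy θ, φ values below are illustrative assumptions, not CiteSeer estimates):

```python
import math

def rank_authors(words, theta, phi):
    """Rank authors by log P(words | author), treating the new document
    as if written by that single author."""
    scores = {}
    for a, topic_dist in enumerate(theta):
        logp = 0.0
        for w in words:
            p_w = sum(p_t * phi[t][w] for t, p_t in enumerate(topic_dist))
            logp += math.log(p_w) if p_w > 0 else float("-inf")
        scores[a] = logp
    return sorted(scores, key=scores.get, reverse=True)

theta = [[0.9, 0.1], [0.1, 0.9]]   # two authors' topic mixtures
phi = [[0.7, 0.3], [0.2, 0.8]]     # two topics over a two-word vocabulary
ranking = rank_authors([0, 0, 1], theta, phi)  # document leans toward topic 0
```

The reported median rank of 26 out of 85,000 comes from exactly this kind of ranking over the full author set.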
82Who wrote what?
- Test of model
- 1) artificially combine abstracts from different authors
- 2) check whether word assignments go to the correct original author
- A method1 is described which like the kernel1
trick1 in support1 vector1 machines1 SVMs1 lets
us generalize distance1 based2 algorithms to
operate in feature1 spaces usually nonlinearly
related to the input1 space This is done by
identifying a class of kernels1 which can be
represented as norm1 based2 distances1 in Hilbert
spaces It turns1 out that common kernel1
algorithms such as SVMs1 and kernel1 PCA1 are
actually really distance1 based2 algorithms and
can be run2 with that class of kernels1 too As
well as providing1 a useful new insight1 into how
these algorithms work the present2 work can form
the basis1 for conceiving new algorithms
- This paper presents2 a comprehensive approach for
model2 based2 diagnosis2 which includes proposals
for characterizing and computing2 preferred2
diagnoses2 assuming that the system2 description2
is augmented with a system2 structure2 a
directed2 graph2 explicating the interconnections
between system2 components2 Specifically we first
introduce the notion of a consequence2 which is a
syntactically2 unconstrained propositional2
sentence2 that characterizes all consistency2
based2 diagnoses2 and show2 that standard2
characterizations of diagnoses2 such as minimal
conflicts1 correspond to syntactic2 variations1
on a consequence2 Second we propose a new
syntactic2 variation on the consequence2 known as
negation2 normal form NNF and discuss its merits
compared to standard variations Third we
introduce a basic algorithm2 for computing
consequences in NNF given a structured system2
description We show that if the system2
structure2 does not contain cycles2 then there is
always a linear size2 consequence2 in NNF which
can be computed in linear time2 For arbitrary1
system2 structures2 we show a precise connection
between the complexity2 of computing2
consequences and the topology of the underlying
system2 structure2 Finally we present2 an
algorithm2 that enumerates2 the preferred2
diagnoses2 characterized by a consequence2 The
algorithm2 is shown1 to take linear time2 in the
size2 of the consequence2 if the preference
criterion1 satisfies some general conditions
Written by (1) Scholkopf_B
Written by (2) Darwiche_A
83The Author-Topic Browser
Querying on author Pazzani_M
Querying on topic relevant to author
Querying on document written by author
84Stability of Topics
- Content of topics is arbitrary across runs of the model (e.g., topic 1 is not the same across runs)
- However,
- Majority of topics are stable over processing time
- Majority of topics can be aligned across runs
- Topics appear to represent genuine structure in the data
85Comparing NIPS topics from the same Markov chain
BEST KL = 0.54, WORST KL = 4.78
(KL distance matrix between re-ordered topics at t2 = 2000 and topics at t1 = 1000)
86Comparing NIPS topics from two different Markov
chains
BEST KL = 1.03
(KL distance matrix between re-ordered topics from chain 2 and topics from chain 1)
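Aligning topics across runs, as in the KL comparisons above, can be sketched as a nearest-topic match on the topic-word distributions (greedy matching here is a simplification; the slides do not specify the exact alignment procedure):

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete word distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def align_topics(run1, run2):
    """For each topic in run1, index of the closest run2 topic by KL distance."""
    return [min(range(len(run2)), key=lambda j: kl(p, run2[j])) for p in run1]

run1 = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
run2 = [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]   # same two topics, permuted
mapping = align_topics(run1, run2)           # recovers the permutation
```

Small best-match KL values across runs are what justify the claim that most topics are stable and alignable.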
87Gibbs Sampler Stability (NIPS data)
88New Applications/ Future Work
- Reviewer Recommendation
- Find reviewers for this set of grant proposals who are active in relevant topics and have no conflicts of interest
- Change Detection/Monitoring
- Which authors are on the leading edge of new topics?
- Characterize the topic trajectory of this author over time
- Author Identification
- Who wrote this document? Incorporation of stylistic information (stylometry)
- Additions to the model
- Modeling citations
- Modeling topic persistence in a document
- ...
89Summary
- Topic models are a versatile probabilistic model for text data
- Author-topic models are a very useful generalization
- Equivalent to the topic model with 1 different author per document
- Learning has linear time complexity
- Gibbs sampling is practical on very large data sets
- Experimental results
- On multiple large complex data sets, the resulting topic-word and author-topic models are quite interpretable
- Results appear stable relative to sampling
- Numerous possible applications
- Current model is quite simple; many extensions possible
90Further Information
- www.datalab.uci.edu
- Steyvers et al., ACM SIGKDD 2004
- Rosen-Zvi et al., UAI 2004
- www.datalab.uci.edu/author-topic
- JAVA demo of online browser
- additional tables and results
91BACKUP SLIDES
92Author-Topics Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
93Topics Model Topics, no Authors
q
Document-Topic Distributions
x
Author
z
Topic
f
Topic-Word distributions
w
Word
n
D
94Author Model Authors, no Topics
a
x
Author
Author-Word Distributions
f
w
Word
n
D
95Comparison Results
- Train models on part of a new document and predict the remaining words
- Without having seen any words from the new document, author-topic information helps in predicting words from that document
- The topic model is more flexible in adapting to the new document after observing a number of words
96Latent Semantic Analysis (Landauer & Dumais, 1997)
word/document counts → SVD → high-dimensional space
(example words from the figure: STREAM, RIVER, BANK, MONEY)
Words with similar co-occurrence patterns across documents end up with similar vector representations
97- Topics
- Probabilistic
- Fully generative
- Topic dimensions are often interpretable
- Modular language of Bayes nets / graphical models
- LSA
- Geometric
- Partially generative
- Dimensions are not interpretable
- Little flexibility to expand model (e.g., syntax)
98Modeling syntax and semantics (Steyvers, Griffiths, Blei, and Tenenbaum)
long-range, document-specific dependencies: semantics (probabilistic topics)
short-range dependencies, constant across all documents
q
z
z
z
w
w
w
x
x
x
syntax: 3rd-order HMM