Title: Author-Topic Models for Large Text Corpora
1Author-Topic Models for Large Text Corpora
- Padhraic Smyth, Department of Computer Science
- University of California, Irvine
- In collaboration with Mark Steyvers (UCI)
- Michal Rosen-Zvi (UCI)
- Tom Griffiths (Stanford)
2Outline
- Problem motivation
- Modeling large sets of documents
- Probabilistic approaches
- topic models → author-topic models
- Results
- Author-topic results from CiteSeer, NIPS, Enron data
- Applications of the model
- (Demo of author-topic query tool)
- Future directions
3Data Sets of Interest
- Data set of documents
- Large collections of documents (10k, 100k, etc.)
- Know authors of the documents
- Know years/dates of the documents
- (will typically assume a bag-of-words representation)
4Examples of Data Sets
- CiteSeer
- 160k abstracts, 80k authors, 1986-2002
- NIPS papers
- 2k papers, 1k authors, 1987-1999
- Reuters
- 20k newspaper articles, 114 authors
5Pennsylvania Gazette
1728-1800, 80,000 articles, 25 million words (www.accessible.com)
6Enron email data
500,000 emails, 5,000 authors, 1999-2002
8Problems of Interest
- What topics do these documents span?
- Which documents are about a particular topic?
- How have topics changed over time?
- What does author X write about?
- Who is likely to write about topic Y?
- Who wrote this specific document?
- and so on..
9A topic is represented as a (multinomial)
distribution over words
10Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
11Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
P(probabilistic | topic 1) = 0.25, P(learning | topic 1) = 0.50, P(Bayesian | topic 1) = 0.25, P(other words | topic 1) = 0.00
P(information | topic 2) = 0.5, P(retrieval | topic 2) = 0.5, P(other words | topic 2) = 0.0
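As a toy illustration of the cluster model above, a document's likelihood under a cluster is just the product of its word probabilities. The sketch below uses the slide's probabilities; reading the interleaved word lists as two document columns is an assumption of the sketch:

```python
# Toy sketch of the one-cluster-per-document model (word lists assumed as above).
topic1 = {"probabilistic": 0.25, "learning": 0.50, "bayesian": 0.25}
topic2 = {"information": 0.5, "retrieval": 0.5}

def doc_likelihood(words, topic):
    """P(words | topic): product of per-word probabilities (bag of words)."""
    p = 1.0
    for w in words:
        p *= topic.get(w, 0.0)  # words outside the topic get probability 0
    return p

doc1 = ["probabilistic", "learning", "learning", "bayesian"]
p1 = doc_likelihood(doc1, topic1)  # 0.25 * 0.5 * 0.5 * 0.25 = 0.015625
p2 = doc_likelihood(doc1, topic2)  # 0.0: no word overlap with topic 2
```

Note how a single word outside a cluster's support drives the whole document's likelihood to zero, which is why cluster models are brittle for mixed-topic documents.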
12Graphical Model
z
Cluster Variable
w
Word
n words
13Graphical Model
z
Cluster Variable
w
Word
n words
D documents
14Graphical Model
Cluster Weights
a
z
Cluster Variable
f
Cluster-Word distributions
w
Word
n words
D documents
15Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
16Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
17Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
18History of topic models
- Latent class models in statistics (late 60s)
- Hofmann (1999)
- Original application to documents
- Blei, Ng, and Jordan (2001, 2003)
- Variational methods
- Griffiths and Steyvers (2003, 2004)
- Gibbs sampling approach (very efficient)
19Word/Document countsfor 16 Artificial Documents
documents
Can we recover the original topics and topic
mixtures from this data?
20Example of Gibbs Sampling
- Assign word tokens randomly to topics (topic 1 or topic 2)
21After 1 iteration
- Apply sampling equation to each word token
22After 4 iterations
23After 32 iterations
24Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
25Author-Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
26Author-Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
27Approach
- The author-topic model
- a probabilistic model linking authors and topics
- authors → topics → words
- learned from data
- completely unsupervised, no labels
- generative model
- Different questions or queries can be answered by appropriate probability calculus
- E.g., p(author | words in document)
- E.g., p(topic | author)
28Graphical Model
x
Author
z
Topic
29Graphical Model
x
Author
z
Topic
w
Word
30Graphical Model
x
Author
z
Topic
w
Word
n
31Graphical Model
a
x
Author
z
Topic
w
Word
n
D
32Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
33Generative Process
- Let's assume authors A1 and A2 collaborate and produce a paper
- A1 has multinomial topic distribution θ1
- A2 has multinomial topic distribution θ2
- For each word in the paper
- Sample an author x (uniformly) from A1, A2
- Sample a topic z from θx
- Sample a word w from the multinomial topic distribution φz
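A minimal sketch of this generative process in Python (the toy θ, φ tables and the four-word vocabulary are illustrative assumptions, not values from the model):

```python
import random

def generate_document(authors, theta, phi, n_words, rng=random):
    """Author-topic generative process for a single document.

    authors : author ids on the paper (e.g., [A1, A2])
    theta   : theta[a][t] = P(topic t | author a)
    phi     : phi[t][w]   = P(word w | topic t)
    """
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)                                     # author ~ uniform
        z = rng.choices(range(len(theta[x])), weights=theta[x])[0]  # topic ~ theta_x
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]      # word  ~ phi_z
        words.append(w)
    return words

# Toy run: two authors, two topics, a four-word vocabulary.
theta = [[0.9, 0.1], [0.1, 0.9]]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
doc = generate_document([0, 1], theta, phi, n_words=20)
```

Each token independently picks one of the paper's authors, so a collaboration naturally produces a blend of the co-authors' topic mixtures.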
34Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
35Learning
- Observed
- W observed words, A sets of known authors
- Unknown
- x, z: hidden variables
- θ, φ: unknown parameters
- Interested in
- p(x, z | W, A)
- p(θ, φ | W, A)
- But exact inference is not tractable
36Step 1 Gibbs sampling of x and z
a
x
Author
q
Marginalize over unknown parameters
z
Topic
f
w
Word
n
D
37Step 2 MAP estimates of θ and φ
a
x
Author
Condition on particular samples of x and z
q
z
Topic
f
w
Word
n
D
38Step 2 MAP estimates of θ and φ
a
x
Author
q
Point estimates of unknown parameters
z
Topic
f
w
Word
n
D
39More Details on Learning
- Gibbs sampling for x and z
- Typically run 2000 Gibbs iterations
- 1 iteration = full pass through all documents
- Estimating θ and φ
- x and z sample → point estimates
- non-informative Dirichlet priors for θ and φ
- Computational Efficiency
- Learning is linear in the number of word tokens
- Predictions on new documents
- can average over θ and φ (from different samples, different runs)
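One way to read the point-estimate step: given a single sample of x and z, the parameters are estimated from the count matrices with Dirichlet smoothing. A sketch under that reading (the count-matrix layout and symmetric priors alpha, beta are assumptions of the sketch):

```python
def point_estimates(CWT, CAT, alpha, beta):
    """Point estimates of theta and phi from one Gibbs sample.

    CWT[w][t] : count of word w assigned to topic t
    CAT[a][t] : count of topic t assigned to author a
    Smoothed with symmetric Dirichlet priors alpha (topics) and beta (words).
    """
    V, T = len(CWT), len(CWT[0])
    topic_totals = [sum(CWT[w][t] for w in range(V)) for t in range(T)]
    phi = [[(CWT[w][t] + beta) / (topic_totals[t] + V * beta) for w in range(V)]
           for t in range(T)]
    theta = [[(row[t] + alpha) / (sum(row) + T * alpha) for t in range(T)]
             for row in CAT]
    return theta, phi

# Toy counts: word 0 always in topic 0, topic 0 always with author 0.
theta, phi = point_estimates([[4, 0], [0, 4]], [[4, 0], [0, 4]], alpha=0.5, beta=0.5)
```

Averaging such estimates over samples from different runs, as the slide notes, smooths out the variability of any single sample.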
40Gibbs Sampling
- Need the full conditional distributions for the hidden variables
- Probability of assigning the current word token i (with word w_i = m) to topic j and author k, given all other assignments:
- P(z_i = j, x_i = k | w_i = m, z_-i, x_-i, A) ∝ (C^WT_mj + β) / (Σ_m' C^WT_m'j + Vβ) × (C^AT_kj + α) / (Σ_j' C^AT_kj' + Tα)
- where C^WT_mj = number of times word m is assigned to topic j, and C^AT_kj = number of times topic j is assigned to author k (both counts excluding the current token)
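A sketch of one collapsed-Gibbs update of this form, resampling a single token's (topic, author) pair from the product of the two count ratios (the count-matrix layout and symmetric priors alpha, beta are assumptions of the sketch):

```python
import random

def sample_token(m, doc_authors, CWT, CAT, alpha, beta, rng=random):
    """Resample (topic, author) for one word token m in a collapsed Gibbs sweep.

    CWT[w][t] : count of word w assigned to topic t (current token excluded)
    CAT[a][t] : count of topic t assigned to author a (current token excluded)
    """
    V, T = len(CWT), len(CWT[0])
    topic_totals = [sum(CWT[w][t] for w in range(V)) for t in range(T)]
    pairs, weights = [], []
    for a in doc_authors:
        author_total = sum(CAT[a])
        for t in range(T):
            p_word = (CWT[m][t] + beta) / (topic_totals[t] + V * beta)
            p_topic = (CAT[a][t] + alpha) / (author_total + T * alpha)
            pairs.append((t, a))
            weights.append(p_word * p_topic)
    return rng.choices(pairs, weights=weights)[0]

# Counts strongly tie word 0 to topic 0 and topic 0 to author 0.
t, a = sample_token(0, [0, 1], [[9, 0], [0, 9]], [[9, 0], [0, 9]],
                    alpha=0.1, beta=0.1, rng=random.Random(0))
```

In a real sampler the current token is first decremented from both count matrices, the pair is sampled as above, and the counts are incremented at the new assignment.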
41Experiments on Real Data
- Corpora
- CiteSeer 160K abstracts, 85K authors
- NIPS 1.7K papers, 2K authors
- Enron 115K emails, 5K authors (sender)
- Pubmed 27K abstracts, 50K authors
- Removed stop words; no stemming
- Ignore word order, just use word counts
- Processing time
- NIPS: 2000 Gibbs iterations ≈ 8 hours
- CiteSeer: 2000 Gibbs iterations ≈ 4 days
42Four example topics from CiteSeer (T = 300)
43More CiteSeer Topics
44Some topics relate to generic word usage
45What can the Model be used for?
- We can analyze our document set through the topic lens
- Applications
- Queries
- Who writes on this topic?
- e.g., finding experts or reviewers in a particular area
- What topics does this person do research on?
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more ...
46Some likely topics per author (CiteSeer)
- Author: Andrew McCallum, U Mass
- Topic 1: classification, training, generalization, decision, data, ...
- Topic 2: learning, machine, examples, reinforcement, inductive, ...
- Topic 3: retrieval, text, document, information, content, ...
- Author: Hector Garcia-Molina, Stanford
- Topic 1: query, index, data, join, processing, aggregate, ...
- Topic 2: transaction, concurrency, copy, permission, distributed, ...
- Topic 3: source, separation, paper, heterogeneous, merging, ...
- Author: Paul Cohen, USC/ISI
- Topic 1: agent, multi, coordination, autonomous, intelligent, ...
- Topic 2: planning, action, goal, world, execution, situation, ...
- Topic 3: human, interaction, people, cognitive, social, natural, ...
47Temporal patterns in topics hot and cold topics
- We have CiteSeer papers from 1986-2002
- For each year, calculate the fraction of words assigned to each topic
- → a time series for topics
- Hot topics become more prevalent
- Cold topics become less prevalent
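The per-year topic fractions described above can be computed directly from the sampled token-topic assignments. A small sketch (the (year, topic)-pair data layout is an assumption):

```python
from collections import defaultdict

def topic_trends(assignments):
    """Fraction of word tokens assigned to each topic, per year.

    assignments : iterable of (year, topic) pairs, one per word token.
    Returns {year: {topic: fraction}} -- the time series behind hot/cold topics.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for year, topic in assignments:
        counts[year][topic] += 1
        totals[year] += 1
    return {y: {t: c / totals[y] for t, c in by_topic.items()}
            for y, by_topic in counts.items()}

# Four tokens across two years: topic 1 grows from half to all of the tokens.
trends = topic_trends([(1990, 0), (1990, 1), (1991, 1), (1991, 1)])
```

A topic whose fraction rises across years is "hot"; one whose fraction falls is "cold".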
55Four example topics from NIPS (T = 100)
56NIPS support vector topic
57NIPS neural network topic
58Pennsylvania Gazette Data
(courtesy of David Newman, UC Irvine)
59Enron email data
500,000 emails, 5,000 authors, 1999-2002
60Enron email topics
61Non-work Topics
62Topical Topics
63Enron email California Energy Crisis
Message-ID: <21993848.1075843452041.JavaMail.evans@thyme>
Date: Fri, 27 Apr 2001 09:25:00 -0700 (PDT)
Subject: California Update 4/27/01
... FERC price cap decision reflects Bush political and economic objectives. Politically, Bush is determined to let the crisis blame fall on Davis; from an economic perspective, he is unwilling to create disincentives for new power generation. The FERC decision is a holding move
by the Bush administration that looks like
action, but is not. Rather, it allows the
situation in California to continue to develop
virtually unabated. The political strategy
appears to allow the situation to deteriorate to
the point where Davis cannot escape shouldering
the blame. Once they are politically inoculated,
the Administration can begin to look at regional
solutions. Moreover, the Administration has
already made explicit (and will certainly restate
in the forthcoming Cheney commission report) its
opposition to stronger price caps ..
64Enron email US Senate Bill
- Message-ID: <23926374.1075846156491.JavaMail.evans@thyme>
- Date: Thu, 15 Jun 2000 08:59:00 -0700 (PDT)
- From
- To
- Subject: Senate Commerce Committee Pipeline Safety Markup
- The Senate Commerce Committee held a markup today where Senator John McCain's (R-AZ) pipeline safety legislation, S. 2438, was approved. The overall outcome was not unexpected -- the final legislation contained several provisions that went a little bit further than Enron and INGAA would have liked, ...
- 2) McCain amendment to Section 13(b) (on operator assistance investigations) -- Approved by voice vote. ...
- 3) Sen. John Kerry (D-MA) Amendment on Enforcement -- Approved by voice vote. Another confusing vote, in which many members did not understand the changes being made, but agreed to it on the condition that clarifications be made before Senate floor action. Late last night, Enron led a group ...
65Enron email political donations
- 10/16/2000 04:41 PM
-
- Subject Ashcroft Senate Campaign Request
- We have received a request from the Ashcroft Senate campaign for $10,000 in soft money. This is the race where Governor Carnahan is the challenger. Enron PAC has contributed $10,000 and Enron has also contributed $15,000 soft money in this campaign to Senator Ashcroft. Ken Lay has been personally interested in the Ashcroft campaign. Our polling information is that Ashcroft is currently leading 43 to 38 with an undecided of 19 percent.
-
- Message-ID: <2546687.1075846182883.JavaMail.evans@thyme>
- Date: Mon, 16 Oct 2000 14:13:00 -0700 (PDT)
- From
- To
- Subject Re Ashcroft Senate Campaign Request
67PubMed-Query Topics
68PubMed-Query Topics
69PubMed-Query Author Model
- P. M. Lindeque, South Africa
- TOPICS
- Topic 1: water, natural, foci, environmental, source (prob = 0.33)
- Topic 2: anthracis, anthrax, bacillus, spores, cereus (prob = 0.13)
- Topic 3: species, sp, isolated, populations, tested (prob = 0.06)
- Topic 4: epidemic, occurred, outbreak, persons (prob = 0.06)
- Topic 5: positive, samples, negative, tested (prob = 0.05)
- PAPERS
- Vaccine-induced protections against anthrax in cheetah
- Airborne movement of anthrax spores from carcass sites in the Etosha National Park
- Ecology and epidemiology of anthrax in the Etosha National Park
- Serology and anthrax in humans, livestock, and wildlife
70PubMed-Query Topics by Country
71PubMed-Query Topics by Country
72: 3 of 300 example topics (TASA)
73Word sense disambiguation (numbers/colors = topic assignments)
74Finding unusual papers for an author
Perplexity = exp(entropy(words | model)) -- a measure of surprise for the model on the data. We can calculate the perplexity of unseen documents, conditioned on the model for a particular author.
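Concretely, perplexity is the exponential of the average negative log-probability per word token. A minimal sketch:

```python
import math

def perplexity(word_log_probs):
    """exp of the average negative log-probability per word token."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# Sanity check: if every word has probability 1/8 under the author's model,
# the perplexity is exactly 8, regardless of document length.
pp = perplexity([math.log(1 / 8)] * 20)
```

A paper with unusually high perplexity under an author's model is a candidate "unusual paper" for that author.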
75Papers and Perplexities M_Jordan
76Papers and Perplexities M_Jordan
77Papers and Perplexities M_Jordan
78Papers and Perplexities T_Mitchell
79Papers and Perplexities T_Mitchell
80Papers and Perplexities T_Mitchell
81Author prediction with CiteSeer
- Task: predict the (single) author of new CiteSeer abstracts
- Results
- For 33% of documents, the author is guessed correctly
- Median rank of true author: 26 (out of 85,000)
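Author prediction of this kind can be sketched by scoring each author's topic mixture against the new document's words and ranking (the toy θ, φ values below are illustrative assumptions, not CiteSeer estimates):

```python
import math

def rank_authors(words, theta, phi):
    """Rank authors by log P(words | author), treating the new document
    as if written by that single author."""
    scores = {}
    for a, topic_dist in enumerate(theta):
        logp = 0.0
        for w in words:
            p_w = sum(p_t * phi[t][w] for t, p_t in enumerate(topic_dist))
            logp += math.log(p_w) if p_w > 0 else float("-inf")
        scores[a] = logp
    return sorted(scores, key=scores.get, reverse=True)

theta = [[0.9, 0.1], [0.1, 0.9]]   # two authors' topic mixtures
phi = [[0.7, 0.3], [0.2, 0.8]]     # two topics over a two-word vocabulary
ranking = rank_authors([0, 0, 1], theta, phi)  # document leans toward topic 0
```

The reported median rank of 26 out of 85,000 comes from exactly this kind of ranking over the full author set.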
82Who wrote what?
- Test of model
- 1) artificially combine abstracts from different authors
- 2) check whether word assignments go to the correct original author
- A method1 is described which like the kernel1
trick1 in support1 vector1 machines1 SVMs1 lets
us generalize distance1 based2 algorithms to
operate in feature1 spaces usually nonlinearly
related to the input1 space This is done by
identifying a class of kernels1 which can be
represented as norm1 based2 distances1 in Hilbert
spaces It turns1 out that common kernel1
algorithms such as SVMs1 and kernel1 PCA1 are
actually really distance1 based2 algorithms and
can be run2 with that class of kernels1 too As
well as providing1 a useful new insight1 into how
these algorithms work the present2 work can form
the basis1 for conceiving new algorithms
- This paper presents2 a comprehensive approach for
model2 based2 diagnosis2 which includes proposals
for characterizing and computing2 preferred2
diagnoses2 assuming that the system2 description2
is augmented with a system2 structure2 a
directed2 graph2 explicating the interconnections
between system2 components2 Specifically we first
introduce the notion of a consequence2 which is a
syntactically2 unconstrained propositional2
sentence2 that characterizes all consistency2
based2 diagnoses2 and show2 that standard2
characterizations of diagnoses2 such as minimal
conflicts1 correspond to syntactic2 variations1
on a consequence2 Second we propose a new
syntactic2 variation on the consequence2 known as
negation2 normal form NNF and discuss its merits
compared to standard variations Third we
introduce a basic algorithm2 for computing
consequences in NNF given a structured system2
description We show that if the system2
structure2 does not contain cycles2 then there is
always a linear size2 consequence2 in NNF which
can be computed in linear time2 For arbitrary1
system2 structures2 we show a precise connection
between the complexity2 of computing2
consequences and the topology of the underlying
system2 structure2 Finally we present2 an
algorithm2 that enumerates2 the preferred2
diagnoses2 characterized by a consequence2 The
algorithm2 is shown1 to take linear time2 in the
size2 of the consequence2 if the preference
criterion1 satisfies some general conditions
Written by (1) Scholkopf_B
Written by (2) Darwiche_A
83The Author-Topic Browser
Querying on author Pazzani_M
Querying on topic relevant to author
Querying on document written by author
84Stability of Topics
- Content of topics is arbitrary across runs of the model (e.g., topic 1 is not the same across runs)
- However,
- Majority of topics are stable over processing time
- Majority of topics can be aligned across runs
- Topics appear to represent genuine structure in the data
85Comparing NIPS topics from the same Markov chain
BEST KL = 0.54, WORST KL = 4.78
(KL distance matrix between re-ordered topics at t2 = 2000 and topics at t1 = 1000)
86Comparing NIPS topics from two different Markov
chains
BEST KL = 1.03
(KL distance matrix between re-ordered topics from chain 2 and topics from chain 1)
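Aligning topics across runs, as in the KL comparisons above, can be sketched as a nearest-topic match on the topic-word distributions (greedy matching here is a simplification; the slides do not specify the exact alignment procedure):

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete word distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def align_topics(run1, run2):
    """For each topic in run1, index of the closest run2 topic by KL distance."""
    return [min(range(len(run2)), key=lambda j: kl(p, run2[j])) for p in run1]

run1 = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
run2 = [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]   # same two topics, permuted
mapping = align_topics(run1, run2)           # recovers the permutation
```

Small best-match KL values across runs are what justify the claim that most topics are stable and alignable.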
87Gibbs Sampler Stability (NIPS data)
88New Applications/ Future Work
- Reviewer Recommendation
- Find reviewers for this set of grant proposals who are active in relevant topics and have no conflicts of interest
- Change Detection/Monitoring
- Which authors are on the leading edge of new topics?
- Characterize the topic trajectory of this author over time
- Author Identification
- Who wrote this document? Incorporation of stylistic information (stylometry)
- Additions to the model
- Modeling citations
- Modeling topic persistence in a document
- ...
89Summary
- Topic models are a versatile probabilistic model for text data
- Author-topic models are a very useful generalization
- Equivalent to the topic model with 1 different author per document
- Learning has linear time complexity
- Gibbs sampling is practical on very large data sets
- Experimental results
- On multiple large complex data sets, the resulting topic-word and author-topic models are quite interpretable
- Results appear stable relative to sampling
- Numerous possible applications
- Current model is quite simple; many extensions possible
90Further Information
- www.datalab.uci.edu
- Steyvers et al., ACM SIGKDD 2004
- Rosen-Zvi et al., UAI 2004
- www.datalab.uci.edu/author-topic
- JAVA demo of online browser
- additional tables and results
91BACKUP SLIDES
92Author-Topics Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
93Topics Model Topics, no Authors
q
Document-Topic Distributions
x
Author
z
Topic
f
Topic-Word distributions
w
Word
n
D
94Author Model Authors, no Topics
a
x
Author
Author-Word Distributions
f
w
Word
n
D
95Comparison Results
- Train models on part of a new document and predict the remaining words
- Without having seen any words from the new document, author-topic information helps in predicting words from that document
- The topic model is more flexible in adapting to the new document after observing a number of words
96Latent Semantic Analysis (Landauer & Dumais, 1997)
word/document counts → SVD → high-dimensional space
(example words from the figure: STREAM, RIVER, BANK, MONEY)
Words with similar co-occurrence patterns across documents end up with similar vector representations
97- Topics
- Probabilistic
- Fully generative
- Topic dimensions are often interpretable
- Modular language of Bayes nets / graphical models
- LSA
- Geometric
- Partially generative
- Dimensions are not interpretable
- Little flexibility to expand model (e.g., syntax)
98Modeling syntax and semantics (Steyvers, Griffiths, Blei, and Tenenbaum)
long-range, document-specific dependencies: semantics (probabilistic topics)
short-range dependencies, constant across all documents
q
z
z
z
w
w
w
x
x
x
syntax: 3rd-order HMM