Author-Topic Models for Large Text Corpora - PowerPoint PPT Presentation

1
Author-Topic Models for Large Text Corpora
  • Padhraic SmythDepartment of Computer Science
  • University of California, Irvine
  • In collaboration with Mark Steyvers (UCI)
  • Michal Rosen-Zvi (UCI)
  • Tom Griffiths (Stanford)

2
Outline
  • Problem motivation
  • Modeling large sets of documents
  • Probabilistic approaches
  • topic models -> author-topic models
  • Results
  • Author-topic results from CiteSeer, NIPS, Enron
    data
  • Applications of the model
  • (Demo of author-topic query tool)
  • Future directions

3
Data Sets of Interest
  • Data sets of documents
  • Large collections of documents: 10k, 100k, etc.
  • Authors of the documents are known
  • Years/dates of the documents are known
  • (will typically assume a bag-of-words
    representation)

4
Examples of Data Sets
  • CiteSeer: 160k abstracts, 80k authors, 1986-2002
  • NIPS papers: 2k papers, 1k authors, 1987-1999
  • Reuters: 20k newspaper articles, 114 authors

5
Pennsylvania Gazette
1728-1800, 80,000 articles, 25 million words
www.accessible.com
6
Enron email data
500,000 emails, 5,000 authors, 1999-2002
7
(No Transcript)
8
Problems of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • Who is likely to write about topic Y?
  • Who wrote this specific document?
  • and so on..

9
A topic is represented as a (multinomial)
distribution over words
10
Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
11
Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
P(probabilistic | topic) = 0.25
P(learning | topic) = 0.50
P(Bayesian | topic) = 0.25
P(other words | topic) = 0.00

P(information | topic) = 0.5
P(retrieval | topic) = 0.5
P(other words | topic) = 0.0
12
Graphical Model
z
Cluster Variable
w
Word
n words
13
Graphical Model
z
Cluster Variable
w
Word
n words
D documents
14
Graphical Model
Cluster Weights
a
z
Cluster Variable
f
Cluster-Word distributions
w
Word
n words
D documents
15
Cluster Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
16
Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
17
Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
18
History of topic models
  • Latent class models in statistics (late 60s)
  • Hoffman (1999)
  • Original application to documents
  • Blei, Ng, and Jordan (2001, 2003)
  • Variational methods
  • Griffiths and Steyvers (2003, 2004)
  • Gibbs sampling approach (very efficient)

19
Word/Document counts for 16 Artificial Documents
Can we recover the original topics and topic
mixtures from this data?
20
Example of Gibbs Sampling
  • Assign word tokens randomly to topics
  • (topic 1 or topic 2)

21
After 1 iteration
  • Apply sampling equation to each word token

22
After 4 iterations
23
After 32 iterations
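The random-initialize-then-resample loop illustrated on slides 20-23 can be sketched as a collapsed Gibbs sampler for the plain topic model. This is an illustrative toy implementation, not the authors' code; the function name, count-array layout, and hyperparameters `alpha`/`beta` are assumptions:

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for a simple topic model (toy sketch).
    docs: list of documents, each a list of word ids."""
    rng = np.random.default_rng(seed)
    n_dz = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_zw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_z = np.zeros(n_topics)                 # tokens per topic
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # assign tokens randomly to topics
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_dz[d, t] += 1; n_zw[t, w] += 1; n_z[t] += 1
    for _ in range(n_iter):                  # resampling sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                  # remove this token from the counts
                n_dz[d, t] -= 1; n_zw[t, w] -= 1; n_z[t] -= 1
                # full conditional: (doc-topic term) x (topic-word term)
                p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) / (n_z + beta * vocab_size)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                  # reassign and restore counts
                n_dz[d, t] += 1; n_zw[t, w] += 1; n_z[t] += 1
    return n_zw, n_dz
```

After enough sweeps, normalizing the rows of `n_zw` gives point estimates of the topic-word distributions.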
24
Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
25
Author-Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
26
Author-Topic Models
DOCUMENT 1
DOCUMENT 2
Probabilistic
Information
Learning
Retrieval
Learning
Information
Bayesian
Retrieval
DOCUMENT 3
Probabilistic
Learning
Information
Retrieval
27
Approach
  • The author-topic model
  • a probabilistic model linking authors and topics
  • authors -> topics -> words
  • learned from data
  • completely unsupervised, no labels
  • generative model
  • Different questions or queries can be answered by
    appropriate probability calculus
  • E.g., p(author | words in document)
  • E.g., p(topic | author)
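Once the model is learned, such queries reduce to matrix products plus Bayes' rule over the estimated author-topic and topic-word matrices. A toy sketch; the two-author, two-topic numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical point estimates: theta[a, t] = p(topic t | author a),
# phi[t, w] = p(word w | topic t); each row sums to 1.
theta = np.array([[0.8, 0.2],      # author 0
                  [0.1, 0.9]])     # author 1
phi = np.array([[0.7, 0.2, 0.1],   # topic 0
                [0.1, 0.2, 0.7]])  # topic 1

def p_words_given_author(words, a):
    # bag of words: p(w_1..w_n | a) = prod_i sum_t p(w_i | t) p(t | a)
    return float(np.prod(theta[a] @ phi[:, words]))

def p_author_given_words(words):
    # Bayes' rule with a uniform prior over authors
    lik = np.array([p_words_given_author(words, a) for a in range(theta.shape[0])])
    return lik / lik.sum()
```

For example, `p_author_given_words([0, 0])` favors author 0, whose dominant topic puts most of its mass on word 0.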

28
Graphical Model
x
Author
z
Topic
29
Graphical Model
x
Author
z
Topic
w
Word
30
Graphical Model
x
Author
z
Topic
w
Word
n
31
Graphical Model
a
x
Author
z
Topic
w
Word
n
D
32
Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
33
Generative Process
  • Let's assume authors A1 and A2 collaborate and
    produce a paper
  • A1 has multinomial topic distribution θ1
  • A2 has multinomial topic distribution θ2
  • For each word in the paper:
  • Sample an author x (uniformly) from {A1, A2}
  • Sample a topic z from θx
  • Sample a word w from the multinomial topic
    distribution φz
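This per-word process is straightforward to simulate. A sketch; `generate_document` and its arguments are illustrative, with θ and φ standing in for learned (or chosen) distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_document(authors, theta, phi, n_words):
    """Author-topic generative process for one multi-author paper (sketch).
    authors: ids of the co-authors; theta[a]: topic dist of author a;
    phi[t]: word dist of topic t."""
    doc = []
    for _ in range(n_words):
        x = rng.choice(authors)                 # sample an author uniformly
        z = rng.choice(len(phi), p=theta[x])    # sample a topic from theta_x
        w = rng.choice(phi.shape[1], p=phi[z])  # sample a word from phi_z
        doc.append(int(w))
    return doc
```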

34
Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
35
Learning
  • Observed
  • W: observed words; A: sets of known authors
  • Unknown
  • x, z: hidden variables
  • θ, φ: unknown parameters
  • Interested in
  • p(x, z | W, A)
  • p(θ, φ | W, A)
  • But exact inference is not tractable

36
Step 1: Gibbs sampling of x and z
a
x
Author
q
Marginalize over unknown parameters
z
Topic
f
w
Word
n
D
37
Step 2: MAP estimates of θ and φ
a
x
Author
Condition on particular samples of x and z
q
z
Topic
f
w
Word
n
D
38
Step 2: MAP estimates of θ and φ
a
x
Author
q
Point estimates of unknown parameters
z
Topic
f
w
Word
n
D
39
More Details on Learning
  • Gibbs sampling for x and z
  • Typically run 2000 Gibbs iterations
  • 1 iteration = one full pass through all documents
  • Estimating θ and φ
  • x and z samples -> point estimates
  • non-informative Dirichlet priors for θ and φ
  • Computational efficiency
  • Learning is linear in the number of word tokens
  • Predictions on new documents
  • can average over θ and φ (from different
    samples, different runs)

40
Gibbs Sampling
  • Need full conditional distributions for variables
  • The probability of assigning the current word
    token i to topic j and author k, given everything
    else, is proportional to (number of times word w
    is assigned to topic j) x (number of times topic
    j is assigned to author k), with Dirichlet
    smoothing on each term

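Written out, the unnormalized conditional is the product of a smoothed word-topic term and a smoothed topic-author term, with the author restricted to the document's author set. A vectorized sketch (the count-matrix names `n_wt`, `n_ta` and the smoothing parameters are assumptions):

```python
import numpy as np

def author_topic_conditional(w, doc_authors, n_wt, n_ta, alpha=0.1, beta=0.01):
    """Full conditional p(z_i = j, x_i = k | everything else) for one token (sketch).
    n_wt[w, j]: times word w is assigned to topic j (current token excluded);
    n_ta[j, k]: times topic j is assigned to author k (current token excluded);
    doc_authors: author ids of the current document."""
    V, T = n_wt.shape
    # word-topic term: (n_wt[w, j] + beta) / (sum_w' n_wt[w', j] + V * beta)
    word_term = (n_wt[w] + beta) / (n_wt.sum(axis=0) + V * beta)   # shape (T,)
    # topic-author term: (n_ta[j, k] + alpha) / (sum_j' n_ta[j', k] + T * alpha)
    cols = n_ta[:, doc_authors]                                    # shape (T, |A_d|)
    author_term = (cols + alpha) / (cols.sum(axis=0) + T * alpha)
    p = word_term[:, None] * author_term
    return p / p.sum()  # normalized over all (topic, author) pairs
```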
41
Experiments on Real Data
  • Corpora
  • CiteSeer: 160K abstracts, 85K authors
  • NIPS: 1.7K papers, 2K authors
  • Enron: 115K emails, 5K authors (senders)
  • PubMed: 27K abstracts, 50K authors
  • Removed stop words; no stemming
  • Ignore word order, just use word counts
  • Processing time
  • NIPS: 2000 Gibbs iterations ≈ 8 hours
  • CiteSeer: 2000 Gibbs iterations ≈ 4 days

42
Four example topics from CiteSeer (T=300)
43
More CiteSeer Topics
44
Some topics relate to generic word usage
45
What can the Model be used for?
  • We can analyze our document set through the
    topic lens
  • Applications
  • Queries
  • Who writes on this topic?
  • e.g., finding experts or reviewers in a
    particular area
  • What topics does this person do research on?
  • Discovering trends over time
  • Detecting unusual papers and authors
  • Interactive browsing of a digital library via
    topics
  • Parsing documents (and parts of documents) by
    topic
  • and more..

46
Some likely topics per author (CiteSeer)
  • Author: Andrew McCallum, U Mass
  • - Topic 1: classification, training,
    generalization, decision, data, ...
  • - Topic 2: learning, machine, examples,
    reinforcement, inductive, ...
  • - Topic 3: retrieval, text, document,
    information, content, ...
  • Author: Hector Garcia-Molina, Stanford
  • - Topic 1: query, index, data, join, processing,
    aggregate, ...
  • - Topic 2: transaction, concurrency, copy,
    permission, distributed, ...
  • - Topic 3: source, separation, paper,
    heterogeneous, merging, ...
  • Author: Paul Cohen, USC/ISI
  • - Topic 1: agent, multi, coordination,
    autonomous, intelligent, ...
  • - Topic 2: planning, action, goal, world,
    execution, situation, ...
  • - Topic 3: human, interaction, people,
    cognitive, social, natural, ...

47
Temporal patterns in topics hot and cold topics
  • We have CiteSeer papers from 1986-2002
  • For each year, calculate the fraction of words
    assigned to each topic
  • -> a time-series for each topic
  • Hot topics become more prevalent
  • Cold topics become less prevalent
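Given the sampled token-topic assignments and each token's document year, the time-series is a per-year normalized count. A sketch (the array names are assumptions about how the samples are stored):

```python
import numpy as np

def topic_trends(years, z, n_topics):
    """Per-year fraction of word tokens assigned to each topic (sketch).
    years[i]: publication year of token i; z[i]: sampled topic of token i."""
    years, z = np.asarray(years), np.asarray(z)
    uniq = np.unique(years)
    trend = np.zeros((len(uniq), n_topics))
    for r, y in enumerate(uniq):
        counts = np.bincount(z[years == y], minlength=n_topics)
        trend[r] = counts / counts.sum()  # fraction of that year's tokens
    return uniq, trend
```

Rows that rise over time mark hot topics; rows that fall mark cold ones.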

48-54
(Figure slides: hot and cold topic time-series; no transcript)
55
Four example topics from NIPS (T=100)
56
NIPS support vector topic
57
NIPS neural network topic
58
Pennsylvania Gazette Data
(courtesy of David Newman, UC Irvine)
59
Enron email data
500,000 emails, 5,000 authors, 1999-2002
60
Enron email topics
61
Non-work Topics
62
Topical Topics
63
Enron email: California Energy Crisis
Message-ID: <21993848.1075843452041.JavaMail.evans@thyme>
Date: Fri, 27 Apr 2001 09:25:00 -0700 (PDT)
Subject: California Update 4/27/01
. FERC price cap decision reflects Bush
political and economic objectives. Politically,
Bush is determined to let the crisis blame fall
on Davis from an economic perspective, he is
unwilling to create disincentives for new power
generation The FERC decision is a holding move
by the Bush administration that looks like
action, but is not. Rather, it allows the
situation in California to continue to develop
virtually unabated. The political strategy
appears to allow the situation to deteriorate to
the point where Davis cannot escape shouldering
the blame. Once they are politically inoculated,
the Administration can begin to look at regional
solutions. Moreover, the Administration has
already made explicit (and will certainly restate
in the forthcoming Cheney commission report) its
opposition to stronger price caps ..
64
Enron email: US Senate Bill
  • Message-ID: <23926374.1075846156491.JavaMail.evans@thyme>
  • Date: Thu, 15 Jun 2000 08:59:00 -0700 (PDT)
  • From
  • To
  • Subject Senate Commerce Committee Pipeline
    Safety Markup
  • The Senate Commerce Committee held a markup today
    where Senator John McCain's
  • (R-AZ) pipeline safety legislation, S. 2438, was
    approved. The overall
  • outcome was not unexpected -- the final
    legislation contained several
  • provisions that went a little bit further than
    Enron and INGAA would have
  • liked,
  • 2) McCain amendment to Section 13 (b) (on
    operator assistance investigations)
  • -- Approved by voice vote. .
  • 3) Sen. John Kerry (D-MA) Amendment on
    Enforcement -- Approved by voice
  • vote. Another confusing vote, in which many
    members did not understand the
  • changes being made, but agreed to it on the
    condition that clarifications be
  • made before Senate floor action. Late last
    night, Enron led a group

65
Enron email: political donations
  • 10/16/2000 04:41 PM
  • Subject Ashcroft Senate Campaign Request
  • We have received a request from the Ashcroft
    Senate campaign for $10,000 in soft money. This
    is the race where Governor Carnahan is the
    challenger. Enron PAC has contributed $10,000 and
    Enron has also contributed $15,000 soft money in
    this campaign to Senator Ashcroft. Ken Lay has
    been personally interested in the Ashcroft
    campaign. Our polling information is that
    Ashcroft is currently leading 43 to 38, with an
    undecided of 19 percent.
  • Message-ID: <2546687.1075846182883.JavaMail.evans@thyme>
  • Date: Mon, 16 Oct 2000 14:13:00 -0700 (PDT)
  • From
  • To
  • Subject Re Ashcroft Senate Campaign Request

66
(No Transcript)
67
PubMed-Query Topics
68
PubMed-Query Topics
69
PubMed-Query Author Model
  • P. M. Lindeque, South Africa
  • TOPICS
  • Topic 1: water, natural, foci, environmental,
    source (prob = 0.33)
  • Topic 2: anthracis, anthrax, bacillus, spores,
    cereus (prob = 0.13)
  • Topic 3: species, sp, isolated, populations,
    tested (prob = 0.06)
  • Topic 4: epidemic, occurred, outbreak,
    persons (prob = 0.06)
  • Topic 5: positive, samples, negative,
    tested (prob = 0.05)
  • PAPERS
  • Vaccine-induced protections against anthrax in
    cheetah
  • Airborne movement of anthrax spores from carcass
    sites in the Etosha National Park
  • Ecology and epidemiology of anthrax in the Etosha
    National Park
  • Serology and anthrax in humans, livestock, and
    wildlife

70
PubMed-Query Topics by Country
71
PubMed-Query Topics by Country
72
3 of 300 example topics (TASA)
73
Word sense disambiguation (numbers/colors = topic assignments)
74
Finding unusual papers for an author
Perplexity = exp(entropy(words | model))
= a measure of surprise for the model on the data
We can calculate the perplexity of unseen
documents, conditioned on the model for a
particular author
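For a single author, this perplexity can be sketched as the exponentiated average negative log-likelihood of the document's tokens under the author's topic mixture (the function name and the use of a single point estimate are assumptions):

```python
import numpy as np

def author_perplexity(words, theta_a, phi):
    """Perplexity of a document under one author's model (sketch).
    theta_a[t] = p(topic t | author); phi[t, w] = p(word w | topic t)."""
    p_w = theta_a @ phi[:, words]    # p(word_i | author) for each token
    entropy = -np.mean(np.log(p_w))  # average surprise per token (nats)
    return float(np.exp(entropy))
```

A uniform model over V words gives perplexity exactly V; papers that are unusual for an author score much higher.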
75
Papers and Perplexities M_Jordan
76
Papers and Perplexities M_Jordan
77
Papers and Perplexities M_Jordan
78
Papers and Perplexities T_Mitchell
79
Papers and Perplexities T_Mitchell
80
Papers and Perplexities T_Mitchell
81
Author prediction with CiteSeer
  • Task predict (single) author of new CiteSeer
    abstracts
  • Results
  • For 33% of documents, the author was guessed
    correctly
  • Median rank of true author 26 (out of 85,000)

82
Who wrote what?
  • Test of model
  • 1) artificially combine abstracts from different
    authors
  • 2) check whether assignment is to correct
    original author
  • A method1 is described which like the kernel1
    trick1 in support1 vector1 machines1 SVMs1 lets
    us generalize distance1 based2 algorithms to
    operate in feature1 spaces usually nonlinearly
    related to the input1 space This is done by
    identifying a class of kernels1 which can be
    represented as norm1 based2 distances1 in Hilbert
    spaces It turns1 out that common kernel1
    algorithms such as SVMs1 and kernel1 PCA1 are
    actually really distance1 based2 algorithms and
    can be run2 with that class of kernels1 too As
    well as providing1 a useful new insight1 into how
    these algorithms work the present2 work can form
    the basis1 for conceiving new algorithms
  • This paper presents2 a comprehensive approach for
    model2 based2 diagnosis2 which includes proposals
    for characterizing and computing2 preferred2
    diagnoses2 assuming that the system2 description2
    is augmented with a system2 structure2 a
    directed2 graph2 explicating the interconnections
    between system2 components2 Specifically we first
    introduce the notion of a consequence2 which is a
    syntactically2 unconstrained propositional2
    sentence2 that characterizes all consistency2
    based2 diagnoses2 and show2 that standard2
    characterizations of diagnoses2 such as minimal
    conflicts1 correspond to syntactic2 variations1
    on a consequence2 Second we propose a new
    syntactic2 variation on the consequence2 known as
    negation2 normal form NNF and discuss its merits
    compared to standard variations Third we
    introduce a basic algorithm2 for computing
    consequences in NNF given a structured system2
    description We show that if the system2
    structure2 does not contain cycles2 then there is
    always a linear size2 consequence2 in NNF which
    can be computed in linear time2 For arbitrary1
    system2 structures2 we show a precise connection
    between the complexity2 of computing2
    consequences and the topology of the underlying
    system2 structure2 Finally we present2 an
    algorithm2 that enumerates2 the preferred2
    diagnoses2 characterized by a consequence2 The
    algorithm2 is shown1 to take linear time2 in the
    size2 of the consequence2 if the preference
    criterion1 satisfies some general conditions

Written by (1) Scholkopf_B
Written by (2) Darwiche_A
83
The Author-Topic Browser
Querying on author Pazzani_M
Querying on topic relevant to author
Querying on document written by author
84
Stability of Topics
  • Content of topics is arbitrary across runs of
    model(e.g., topic 1 is not the same across
    runs)
  • However,
  • Majority of topics are stable over processing
    time
  • Majority of topics can be aligned across runs
  • Topics appear to represent genuine structure in
    data

85
Comparing NIPS topics from the same Markov chain
[Figure: KL distance matrix between topics at t1 = 1000 and re-ordered topics at t2 = 2000; best KL = 0.54, worst KL = 4.78]
86
Comparing NIPS topics from two different Markov
chains
[Figure: KL distance matrix between topics from chain 1 and re-ordered topics from chain 2; best KL = 1.03]
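The topic re-ordering behind these comparisons can be sketched as a greedy matching on symmetric KL distance between topic-word distributions (the greedy strategy and function name are assumptions; other matching schemes are possible):

```python
import numpy as np

def align_topics(phi1, phi2, eps=1e-12):
    """Greedily pair topics from two runs by symmetric KL distance (sketch).
    phi1, phi2: (T, V) topic-word distribution matrices."""
    T = phi1.shape[0]
    p, q = phi1 + eps, phi2 + eps  # avoid log(0)
    # pairwise symmetric KL: D(p_i || q_j) + D(q_j || p_i)
    kl = np.array([[float(np.sum(pi * np.log(pi / qj)) + np.sum(qj * np.log(qj / pi)))
                    for qj in q] for pi in p])
    match, used = [], set()
    for i in np.argsort(kl.min(axis=1)):  # best-matched topics claim partners first
        j = min((c for c in range(T) if c not in used), key=lambda c: kl[i, c])
        match.append((int(i), j))
        used.add(j)
    return sorted(match), kl
```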
87
Gibbs Sampler Stability (NIPS data)
88
New Applications/ Future Work
  • Reviewer Recommendation
  • Find reviewers for this set of grant proposals
    who are active in relevant topics and have no
    conflicts of interest
  • Change Detection/Monitoring
  • Which authors are on the leading edge of new
    topics?
  • Characterize the topic trajectory of this
    author over time
  • Author Identification
  • Who wrote this document? Incorporation of
    stylistic information (stylometry)
  • Additions to the model
  • Modeling citations
  • Modeling topic persistence in a document
  • ..

89
Summary
  • Topic models are a versatile probabilistic model
    for text data
  • Author-topic models are a very useful
    generalization
  • Equivalent to the topic model when each document
    has its own single, unique author
  • Learning has linear time complexity
  • Gibbs sampling is practical on very large data
    sets
  • Experimental results
  • On multiple large complex data sets, the
    resulting topic-word and author-topic models are
    quite interpretable
  • Results appear stable relative to sampling
  • Numerous possible applications.
  • Current model is quite simple; many extensions
    possible

90
Further Information
  • www.datalab.uci.edu
  • Steyvers et al, ACM SIGKDD 2004
  • Rosen-Zvi et al, UAI 2004
  • www.datalab.uci.edu/author-topic
  • JAVA demo of online browser
  • additional tables and results

91
BACKUP SLIDES
92
Author-Topics Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
93
Topics Model: Topics, no Authors
q
Document-Topic Distributions
x
Author
z
Topic
f
Topic-Word distributions
w
Word
n
D
94
Author Model: Authors, no Topics
a
a
Author
Author-Word Distributions
f
w
Word
n
D
95
Comparison Results
  • Train models on part of a new document and
    predict remaining words
  • Without having seen any words from new document,
    author-topic information helps in predicting
    words from that document
  • Topics model is more flexible in adapting to new
    document after observing a number of words

96
Latent Semantic Analysis (Landauer & Dumais, 1997)
word/document counts -> SVD -> high-dimensional space
[Figure: STREAM, RIVER, BANK, MONEY plotted as vectors in the reduced space]
Words with similar co-occurrence patterns across
documents end up with similar vector
representations
97
  • Topics
  • Probabilistic
  • Fully generative
  • Topic dimensions are often interpretable
  • Modular language of Bayes nets / graphical models
  • LSA
  • Geometric
  • Partially generative
  • Dimensions are not interpretable
  • Little flexibility to expand model (e.g., syntax)

98
Modeling syntax and semantics(Steyvers,
Griffiths, Blei, and Tenenbaum)
semantics: probabilistic topics (long-range,
document-specific dependencies)
syntax: 3rd-order HMM (short-range dependencies,
constant across all documents)
[Graphical model: document-topic distribution θ
generates topic variables z and words w; HMM
states x capture syntax]