Topic Models for Social Network Analysis and Bibliometrics



1
Topic Models for Social Network Analysis and
Bibliometrics
  • Andrew McCallum
  • Computer Science Department
  • University of Massachusetts Amherst

Joint work with Xuerui Wang, Natasha
Mohanty, Andres Corrada, Chris Pal, Wei Li, David
Mimno and Gideon Mann.
2
Goal
Mine actionable knowledge from unstructured text.
3
From Text to Actionable Knowledge
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Database
Document collection
Actionable knowledge
Prediction Outlier detection Decision support
4
Joint Inference
Uncertainty Info
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Database
Document collection
Actionable knowledge
Emerging Patterns
Prediction Outlier detection Decision support
5
Unified Model
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Probabilistic Model
Document collection
Actionable knowledge
Prediction Outlier detection Decision support
6
(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to
maximize conditional probability of output
sequence given input sequence
Finite state model / Graphical model
[Figure: linear-chain CRF. Output sequence (FSM states) y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... with labels OTHER PERSON OTHER ORG TITLE; input sequence (observations) x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, ... reading "said Jones a Microsoft VP".]
7
1. Jointly labeling cascaded sequences: Factorial
CRFs
[Sutton, Khashayar, McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
Joint prediction of part-of-speech and
noun-phrase in newswire, matching accuracy with
only 50% of the training data.
Inference: Tree reparameterization BP
[Wainwright et al, 2002]
8
2. Jointly labeling distant mentions: Skip-chain
CRFs
[Sutton, McCallum, SRL 2004]

Senator Joe Green said today ... Green ran for ...
14% reduction in error on most repeated field in
email seminar announcements.
Inference: Tree reparameterization BP
[Wainwright et al, 2002]
9
3. Joint co-reference among all pairs: Affinity
Matrix CRF
Entity resolution / Object correspondence
[Figure: fully connected graph over the mentions ". . . Mr Powell . . .", ". . . Powell . . .", and ". . . she . . ."; each edge carries a Y/N co-reference decision, and the edge weights shown are 45, -99, and 11.]
25% reduction in error on co-reference of proper
nouns in newswire.
Inference: Correlational clustering graph
partitioning
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
[Bansal, Blum, Chawla, 2002]
10
4. Joint segmentation and co-reference
Extraction from and matching of research paper
citations.
[Figure: graphical model with nodes labeled o, s, c, y, and p, annotated with World Knowledge, Co-reference decisions, Database field values, Citation attributes, and Segmentation. Example citations:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.]
35% reduction in co-reference error by using
segmentation uncertainty.
6-14% reduction in segmentation error by using
co-reference.
Inference: Variant of Iterated Conditional Modes
[Wellner, McCallum, Peng, Hay, UAI 2004]
see also [Marthi, Milch, Russell, 2003]
[Besag, 1986]
11
Context
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Database
Document collection
Joint inference among detailed steps
Actionable knowledge
Leveraging Text in Social Network Analysis
Prediction Outlier detection Decision support
12
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
13
Social Network in an Email Dataset
14
Clustering words into topics with Latent
Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process (with example)
For each document:
  Sample a distribution over topics, θ (example: 70% Iraq war, 30% US election)
  For each word in doc:
    Sample a topic, z (example: Iraq war)
    Sample a word from the topic, w (example: bombing)
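The generative process above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the two toy topics, the four-word vocabulary, the Dirichlet parameter `alpha`, and the function name `generate_document` are all assumptions chosen to mirror the Iraq war / US election example.

```python
import numpy as np

# Hypothetical toy vocabulary and topics (assumed for illustration only).
vocab = ["bombing", "troops", "ballot", "campaign"]
topic_word = np.array([
    [0.5, 0.5, 0.0, 0.0],   # topic 0: "Iraq war"
    [0.0, 0.0, 0.5, 0.5],   # topic 1: "US election"
])

def generate_document(alpha, topic_word, n_words, rng):
    """Sample one document following the LDA generative story."""
    theta = rng.dirichlet(alpha)              # per-document distribution over topics
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)   # sample a topic for this word
        w = rng.choice(topic_word.shape[1], p=topic_word[z])  # sample a word from it
        topics.append(int(z))
        words.append(int(w))
    return theta, topics, words

rng = np.random.default_rng(0)
theta, topics, words = generate_document(np.array([1.0, 1.0]), topic_word, 10, rng)
print([vocab[w] for w in words])
```

Because each toy topic puts zero mass on the other topic's words, every sampled word can be traced back to the topic that produced it.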
15
Example topics induced from a large collection of
text
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
[Tenenbaum et al]
17
From LDA to Author-Recipient-Topic (ART)
[McCallum et al 2005]
18
Inference and Estimation
  • Gibbs Sampling
  • Easy to implement
  • Reasonably fast
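To give a sense of why Gibbs sampling is easy to implement, here is a minimal collapsed Gibbs sampler. It is a sketch for plain LDA, not for ART (ART additionally conditions on author and recipient), and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are assumptions made for illustration.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # tokens per (document, topic)
    topic_word = np.zeros((n_topics, vocab_size))  # tokens per (topic, word)
    topic_total = np.zeros(n_topics)               # tokens per topic
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = rng.integers(n_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # Conditional distribution over topics given all other assignments.
                weights = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                    / (topic_total + vocab_size * beta)
                z = rng.choice(n_topics, p=weights / weights.sum())
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word

# Toy corpus of word ids (hypothetical data).
docs = [[0, 1, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

The whole sampler is a loop that decrements counts, draws a topic, and increments counts again, which is the sense in which the multinomial parameters have been integrated out.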

19
Enron Email Corpus
  • 250k email messages
  • 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com
20
Topics, and prominent senders /
receivers discovered by ART
Topic names, by hand
21
Topics, and prominent senders /
receivers discovered by ART
Beck: Chief Operations Officer
Dasovich: Government Relations Executive
Shapiro: Vice President of Regulatory Affairs
Steffes: Vice President of Government Affairs
22
Comparing Role Discovery
Traditional SNA
Author-Topic
ART
connection strength (A,B)
distribution over recipients
distribution over authored topics
distribution over authored topics
23
Comparing Role Discovery: Tracy Geaconne vs. Dan
McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary. McCarty: Vice President.
24
Comparing Role Discovery: Lynn Blair vs. Kimberly
Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics. Watson:
Pipeline facilities planning.
25
McCallum Email Corpus 2004
  • January - October 2004
  • 23k email messages
  • 825 people

From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate
26
Four most prominent topics in discussions with
____?
28
Two most prominent topics in discussions with
____?
30
Role-Author-Recipient-Topic Models
31
Results with RART: People in Role 3 in
Academic Email
  • olc: lead Linux sysadmin
  • gauthier: sysadmin for CIIR group
  • irsystem: mailing list for CIIR sysadmins
  • system: mailing list for dept. sysadmins
  • allan: Prof., chair of computing committee
  • valerie: second Linux sysadmin
  • tech: mailing list for dept. hardware
  • steve: head of dept. I.T. support

32
Roles for allan (James Allan)
  • Role 3: I.T. support
  • Role 2: Natural Language researcher

Roles for pereira (Fernando Pereira)
  • Role 2: Natural Language researcher
  • Role 4: SRI CALO project participant
  • Role 6: Grant proposal writer
  • Role 10: Grant proposal coordinator
  • Role 8: Guests at McCallum's house

33
ART: Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
34
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
35
Groups and Topics
  • Input
  • Observed relations between people
  • Attributes on those relations (text, or
    categorical)
  • Output
  • Attributes clustered into topics
  • Groups of people---varying depending on topic

36
Discovering Groups from Observed Set of Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Admiration relations among six high school
students.
37
Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
[Figure: three views of the 6x6 adjacency matrix. (1) Rows and columns in roster order A B C D E F. (2) The same order annotated with group labels G1 G2 G1 G2 G3 G3. (3) Rows and columns reordered to A C B D E F, so that the group labels read G1 G1 G2 G2 G3 G3.]
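The adjacency-matrix views of the admiration relations can be reproduced with a short sketch. The relation list is taken from the slide; the helper name `adjacency` and the use of NumPy are assumptions for illustration.

```python
import numpy as np

# Acad(X, Y) relations from the slide, written as (source, target) pairs.
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

def adjacency(order, relations):
    """Build a 0/1 adjacency matrix with rows and columns in the given order."""
    index = {name: i for i, name in enumerate(order)}
    matrix = np.zeros((len(order), len(order)), dtype=int)
    for source, target in relations:
        matrix[index[source], index[target]] = 1
    return matrix

roster_view = adjacency(["A", "B", "C", "D", "E", "F"], acad)
grouped_view = adjacency(["A", "C", "B", "D", "E", "F"], acad)  # G1 G1 G2 G2 G3 G3
print(grouped_view)
```

In the reordered view the ones fall into 2x2 blocks (G1 admires G2, G2 admires G3, G3 admires G1), which is the block structure a group model looks for.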
38
Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
[Figure: graphical model annotated with Beta, Dirichlet, Multinomial, and Binomial distributions. S: number of entities. G: number of groups.]
Enhanced with arbitrary number of groups in
[Kemp, Griffiths, Tenenbaum 2004]
39
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Social Admiration: Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E) Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)
[Figure: two 6x6 adjacency matrices. One is ordered A C B D E F with group labels G1 G1 G2 G2 G3 G3; the other is ordered A C E B D F with group labels G1 G1 G1 G2 G2 G2.]
40
The Group-Topic Model: Discovering Groups and
Topics Simultaneously
[Wang, Mohanty, McCallum 2006]
[Figure: graphical model annotated with Beta, Uniform, Dirichlet (twice), Multinomial (twice), and Binomial distributions.]
41
Inference and Estimation
  • Gibbs Sampling
  • Many r.v.s can be integrated out
  • Easy to implement
  • Reasonably fast

We assume the relationship is symmetric.
42
Dataset 1: U.S. Senate
  • 16 years of voting records in the US Senate (1989-2005)
  • a Senator may respond Yea or Nay to a resolution
  • 3423 resolutions with text attributes (index
    terms)
  • 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. MI (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms
Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay
43
Topics Discovered (U.S. Senate)

Mixture of Unigrams
Education    Energy     Military Misc.  Economic
education    energy     government      federal
school       power      military        labor
aid          water      foreign         insurance
children     nuclear    tax             aid
drug         gas        congress        tax
students     petrol     aid             business
elementary   research   law             employee
prevention   pollution  policy          care

Group-Topic Model
Education + Domestic  Foreign       Economic   Social Security + Medicare
education             foreign       labor      social
school                trade         insurance  security
federal               chemicals     tax        insurance
aid                   tariff        congress   medical
government            congress      income     care
tax                   drugs         minimum    medicare
energy                communicable  wage       disability
research              diseases      business   assistance
44
Groups Discovered (US Senate)
Groups from topic Education + Domestic
45
Senators Who Change Coalition the Most, Dependent
on Topic
e.g. Senator Shelby (D-AL) votes
  • with the Republicans on Economic
  • with the Democrats on Education + Domestic
  • with a small group of maverick Republicans on Social Security + Medicaid
46
Dataset 2: The UN General Assembly
  • Voting records of the UN General Assembly (1990-2003)
  • A country may choose to vote Yes, No or Abstain
  • 931 resolutions with text attributes (titles)
  • 192 countries in total
  • Also experiments later with resolutions from
    1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions.
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
47
Topics Discovered (UN)

Mixture of Unigrams
Everything Nuclear  Human Rights  Security in Middle East
nuclear             rights        occupied
weapons             human         israel
use                 palestine     syria
implementation      situation     security
countries           israel        calls

Group-Topic Model
Nuclear Non-proliferation  Nuclear Arms Race  Human Rights
nuclear                    nuclear            rights
states                     arms               human
united                     prevention         palestine
weapons                    race               occupied
nations                    space              israel
48
Groups Discovered (UN)
The country list for each group is ordered by
2005 GDP (PPP), and only 5 countries are
shown for groups that have more than 5 members.
49
Do We Get Better Groups with the GT Model?
Baseline Model
  1. Cluster bills into topics using mixture of
    unigrams
  2. Apply group model on topic-specific subsets of
    bills.
GT Model
  1. Jointly cluster topics and groups at the same time
    using the GT model.

Datasets  Avg. AI for Baseline  Avg. AI for GT  p-value
Senate    0.8198                0.8294          < .01
UN        0.8548                0.8664          < .01

Agreement Index (AI) measures group cohesion.
Higher is better.
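The slides do not give the formula for the Agreement Index. The sketch below uses one common definition for yes/no/abstain votes, AI = (max(y, n, a) - 0.5 * ((y + n + a) - max(y, n, a))) / (y + n + a); treating this as the index used in the table above is an assumption, as is the function name `agreement_index`.

```python
def agreement_index(yes, no, abstain):
    """Agreement index of one group on one vote (assumed definition).

    Returns 1.0 when the whole group votes the same way and 0.0 when the
    group is split evenly across the three options.
    """
    total = yes + no + abstain
    if total == 0:
        raise ValueError("group cast no votes")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total

print(agreement_index(10, 0, 0))
print(agreement_index(8, 2, 0))
print(agreement_index(5, 5, 5))
```

Averaging this quantity over groups and resolutions gives a single cohesion number per model, in the spirit of the "Avg. AI" columns.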
50
Groups and Topics, Trends over Time (UN)
51
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
52
Latent Dirichlet Allocation
[Blei, Ng, Jordan, 2003]
[Figure: LDA plate diagram with variables α, θ, z, w, β, φ and plates N, n, T.]
53
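Correlated Topic Model
[Blei, Lafferty, 2005]
[Figure: CTM plate diagram. A logistic normal, parameterized by a square matrix of pairwise correlations, generates the per-document topic proportions; topic variable z, word w, and plates N, n, T also appear.]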
54
Pachinko Machine
55
Pachinko Allocation Model
Thanks to Michael Jordan for suggesting the name
[Li, McCallum, 2005]
Model structure, not the graphical model
[Figure: DAG with interior nodes 11; 21, 22; 31, 32, 33; 41 to 45, above leaves word1 to word8.]
Given: a directed acyclic graph (DAG); at each interior node, a Dirichlet over its children; and words at the leaves.
For each document: sample a multinomial from each Dirichlet.
For each word in this document: starting from the root, sample a child from successive nodes, down to a leaf. Generate the word at the leaf.
Like a Polya tree, but DAG shaped, with arbitrary
number of children.
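The word-by-word walk down the DAG can be sketched as follows. The toy DAG, the node names, the symmetric Dirichlet parameters, and the function name `sample_pam_document` are all assumptions for illustration; they are not the structure used in the experiments.

```python
import numpy as np

def sample_pam_document(children, alphas, root, n_words, rng):
    """Sample one document from a toy pachinko allocation DAG."""
    # One multinomial per interior node, drawn once per document.
    thetas = {node: rng.dirichlet(alphas[node]) for node in children}
    words = []
    for _ in range(n_words):
        node = root
        # Walk from the root until reaching a leaf (a word with no children).
        while node in children:
            kids = children[node]
            node = kids[rng.choice(len(kids), p=thetas[node])]
        words.append(node)
    return words

# Hypothetical DAG: "sub2" is shared by both super-topics.
children = {
    "root": ["super1", "super2"],
    "super1": ["sub1", "sub2"],
    "super2": ["sub2", "sub3"],
    "sub1": ["word1", "word2"],
    "sub2": ["word3", "word4"],
    "sub3": ["word5", "word6"],
}
alphas = {node: np.ones(len(kids)) for node, kids in children.items()}
document = sample_pam_document(children, alphas, "root", 20, np.random.default_rng(0))
print(document)
```

Because a node may have several parents, two super-topics can route probability to the same sub-topic, which is how the DAG expresses correlations among topics.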
56
Pachinko Allocation Model
[Li, McCallum, 2005]
Model structure, not the graphical model
  • DAG may have arbitrary structure
  • arbitrary depth
  • any number of children per node
  • sparse connectivity
  • edges may skip layers
57
Pachinko Allocation Model
[Li, McCallum, 2005]
Model structure, not the graphical model
Annotations on the layers of the DAG, from top to bottom:
  • Distributions over distributions over topics...
  • Distributions over topics; mixtures, representing topic correlations
  • Distributions over words (like LDA topics)
Some interior nodes could contain one
multinomial, used for all documents (i.e. a very
peaked Dirichlet).
58
Pachinko Allocation Model
[Li, McCallum, 2005]
Model structure, not the graphical model
Estimate all these Dirichlets from data.
Estimate model structure from data
(number of nodes, and connectivity).
59
Pachinko Allocation Special Cases
Latent Dirichlet Allocation
[Figure: a single root node (32) over nodes 41 to 45, above leaves word1 to word8.]
60
Pachinko Allocation Special Cases
Hierarchical Latent Dirichlet Allocation (HLDA)
Very low variance Dirichlet at root
Each leaf of the HLDA topic hier. has a distr.
over nodes on path to the root.
[Figure: DAG with nodes 11; 21 to 24; 31 to 34; 41, 42; 51, above leaves word1 to word8, with a label "The HLDA hier."]
61
Pachinko Allocation on a Topic Hierarchy
Combining best of HLDA and Pachinko Allocation
[Figure: DAG with nodes 00; 11, 12; 21 to 24; 31 to 34; 41, 42; 51, above leaves word1 to word8. Labels: "The PAM DAG ... representing correlations among topic leaves" and "The HLDA hier."]
62
Pachinko Allocation Model
... with two layers, no skipping layers,
fully-connected from one layer to the next.
[Figure: root node 11; super-topics 21, 22, 23; sub-topics 31 to 35, with fixed multinomials over leaves word1 to word8.]
Another special case would select only one
super-topic per document.
63
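Graphical Models
[Figure: plate diagrams for LDA and for PAM (with fixed multinomials for topics). Both show plates N, n, T and variables α, θ, w, β, φ; LDA has a single topic variable z per word, while PAM has a path z1, z2, ..., zm.]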
64
Pachinko Allocation Model
  • Likelihood
  • Estimate z's by Gibbs sampling
  • Estimate α's by moment matching.

65
Preliminary Experimental Results
  • Topic Coherence
  • Likelihood on held-out data
  • Document classification

66
NIPS Dataset
NIPS Conference Papers, Volumes 0-12, spanning
1987-1999. Prepared by Sam Roweis.
  • 1740 papers
  • 13649 Words
  • 2,301,375 tokens

67
Topic Coherence Comparison
Annotations: "models, estimation, stopwords"; "estimation, some junk"
LDA 100: estimation likelihood maximum noisy estimates mixture scene surface normalization generated measurements surfaces estimating estimated iterative combined figure divisive sequence ideal
LDA 20: models model parameters distribution bayesian probability estimation data gaussian methods likelihood em mixture show approach paper density framework approximation markov
Example super-topic:
  33 input hidden units function number
  27 estimation bayesian parameters data methods
  24 distribution gaussian markov likelihood mixture
  11 exact kalman full conditional deterministic
  1 smoothing predictive regularizers intermediate slope
68
Topic Coherence Comparison
Annotations: "images, motion, eyes"; "motion, some junk"; "motion"; "eyes"; "images"
LDA 100: motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
PAM 100: motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual
LDA 20: visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
PAM 100: eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning
PAM 100: image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts tor
69
Topic Coherence Comparison
Annotations: "neural networks, much less junk"; "neural networks, some junk"; "neural networks, some junk"
PAM 100: input hidden units function number functions networks output linear layer single results weight inputs basis parameters standard network patterns study
LDA 100: network layer multi trained high perceptron layers give type nonlinearity perceptrons module modified matched performed provided designed samples study mode
LDA 20: architecture network input output structure paper level task work sequences sequence multiple problem shows connectionist networks context perform scale learn
70
Blind Topic Evaluation
  • Randomly select 25 similar pairs of topics
    generated from PAM and LDA
  • 5 people
  • Each asked: "select the topic in each pair that
    you find most semantically coherent."

Prefer PAM
Topic counts  LDA  PAM
5 votes       0    5
≥ 4 votes     3    8
≥ 3 votes     9    16
71
Example Topic Pairs with Human Evaluation
72
Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics' Dirichlet
parameters
73
Likelihood on Held Out Data
  • Likelihood comparison
  • NIPS abstracts
  • Train the model with 75% of the data
  • Calculate likelihood on the remaining 25% of the data
  • Calculate likelihood by
  • Sampling many, many documents from the model
  • Estimating a simple mixture of multinomials from
    these
  • Calculate the likelihood of data under this
    simple mixture.

74
Likelihood Comparison
  • Varying number of topics

75
Document Classification
Comp5 from 20 Newsgroups corpus. Train on 25%,
test on 75%. Like Naive Bayes, but use LDA/PAM
per-class instead of multinomial.
[Figure: bar chart of Test Accuracy (%), annotated "2.5% increase".]
76
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
77
Want to Model Trends over Time
  • Is prevalence of topic growing or waning?
  • Pattern appears only briefly
  • Capture its statistics in focused way
  • Don't confuse it with patterns elsewhere in time
  • How do roles, groups, influence shift over time?

78
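Topics over Time (TOT)
[Figure: TOT plate diagram. A Dirichlet generates a per-document multinomial over topics. Each of the Nd tokens in each of the D documents has a topic index z, a word w, and a timestamp t. Each of the T topics has a multinomial over words (with a Dirichlet prior) and a Beta over time (with a uniform prior).]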
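The diagram's generative story, in which each token gets a topic, then a word from that topic's multinomial and a timestamp from that topic's Beta distribution, can be sketched as below. The two toy topics, their Beta parameters, and the names `sample_tot_document`, `phi`, and `psi` are assumptions for illustration; timestamps are taken to be normalized to the interval [0, 1].

```python
import numpy as np

def sample_tot_document(alpha, phi, psi, n_tokens, rng):
    """Sample (topic, word, timestamp) triples for one document.

    alpha: Dirichlet parameters over T topics.
    phi:   T x V matrix, each row a multinomial over words.
    psi:   T x 2 matrix, each row the (a, b) parameters of a Beta over time.
    """
    theta = rng.dirichlet(alpha)                    # per-document topic proportions
    tokens = []
    for _ in range(n_tokens):
        z = rng.choice(len(alpha), p=theta)         # topic index
        w = rng.choice(phi.shape[1], p=phi[z])      # word from the topic
        t = rng.beta(psi[z, 0], psi[z, 1])          # timestamp from the topic
        tokens.append((int(z), int(w), float(t)))
    return tokens

# Hypothetical parameters: topic 0 is concentrated early in time, topic 1 late.
phi = np.array([[0.7, 0.3, 0.0], [0.0, 0.2, 0.8]])
psi = np.array([[2.0, 8.0], [8.0, 2.0]])
tokens = sample_tot_document(np.array([0.5, 0.5]), phi, psi, 15, np.random.default_rng(1))
print(tokens[:3])
```

Giving each topic its own distribution over time is what lets a topic be localized to a period, rather than being spread evenly across the whole collection.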
79
State of the Union Address
208 Addresses delivered between January 8, 1790
and January 29, 2002.
  • To increase the number of documents, we split the
    addresses into paragraphs and treated them as
    documents. One-line paragraphs were excluded.
    Stopword removal was applied.
  • 17156 documents
  • 21534 words
  • 669,425 tokens

Our scheme of taxation, by means of which this
needless surplus is taken from the people and put
into the public Treasury, consists of a tariff
or duty levied upon importations from abroad and
internal-revenue taxes levied upon the
consumption of tobacco and spirituous and malt
liquors. It must be conceded that none of the
things subjected to internal-revenue
taxation are, strictly speaking, necessaries.
There appears to be no just complaint of this
taxation by the consumers of these articles, and
there seems to be nothing so well able to bear
the burden without hardship to any portion of the
people.
1910
80
Comparing TOT with LDA
81
Sample Topic: Cold War
world nations united states peace free economic military soviet international security strength defense freedom europe force peoples efforts aggression today
82
Comparing TOT against LDA
83
TOT on 17 years of NIPS proceedings
84
TOT on 17 years of NIPS proceedings
TOT
LDA
85
TOT versus LDA on my email
86
TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union
address.
L1 distance between predicted year and actual
year.
87
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
88
Topics Modeling Phrases
  • Topics based only on unigrams often difficult to
    interpret
  • Topic discovery itself is confused because
    important meaning / distinctions carried by
    phrases.
  • Significant opportunity to provide improved
    language models to ASR, MT, IR, etc.

89
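Topical N-gram Model
[Figure: TNG plate diagram. Each of the D documents has a chain of topics z1, z2, z3, z4, ..., variables y1, y2, y3, y4, ..., and words w1, w2, w3, w4, ...; parameter plates of sizes T and W.]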
90
LDA Topic
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
91
Sample Topical N-gram topics
Sample LDA topics
92
Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)

policy action states actions function reward control agent q-learning optimal goal learning space step environment system problem steps sutton policies

learning optimal reinforcement state problems policy dynamic action programming actions function markov methods decision rl continuous spaces step policies planning

reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning_rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
93
Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)

motion response direction cells stimulus figure contrast velocity model responses stimuli moving cell intensity population image center tuning complex directions

motion visual field position figure direction fields eye location retina receptive velocity vision moving system flow edge center light local

receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli
94
Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)

speech word training system recognition hmm speaker performance phoneme acoustic words context systems frame trained sequence phonetic speakers mlp hybrid

word system recognition hmm speech training performance phoneme words context systems frame trained speaker sequence speakers mlp frames segmentation models

speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
95
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
96
Social Networks in Research Literature
  • Better understand structure of our own research
    area.
  • Structure helps us learn a new field.
  • Aid collaboration
  • Map how ideas travel through social networks of
    researchers.
  • Aids for hiring and finding reviewers!

97
Traditional Bibliometrics
  • Analyses a small amount of data (e.g. 19 articles
    from a single issue of a journal)
  • Uses journal as a proxy for research
    topic (but there is no journal for information
    extraction)
  • Uses impact measures almost exclusively based on
    simple citation counts.

How can we use topic models to create new,
interesting impact measures?
98
Our Data
  • Over 1 million research papers, gathered as part
    of Rexa.info portal.
  • Cross linked references / citations.

99
Finding Topics with TNG
Traditional unigram LDA run on 1 million titles /
abstracts (200 topics) ... select 300k papers
on ML, NLP, robotics, vision ... Find 200 TNG
topics among those papers.
100
Topical Bibliometric Impact Measures
  • Topical Citation Counts
  • Topical Impact Factors
  • Topical Longevity
  • Topical Diversity
  • Topical Precedence
  • Topical Transfer

101
Topical Diversity
Entropy of the topic distribution among papers
that cite this paper (this topic).
Low Diversity
High Diversity
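As a small worked example of the definition above, topical diversity can be computed as the Shannon entropy of the topics of the citing papers. The function name `topical_diversity`, the choice of base-2 logarithms, and the toy topic labels are assumptions for illustration.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in bits) of the topic distribution among citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical citing-paper topics for two cited papers.
narrow = ["speech"] * 4                          # cited only within one topic
broad = ["speech", "vision", "ir", "robotics"]   # cited evenly across four topics
print(topical_diversity(narrow), topical_diversity(broad))
```

A paper cited from a single topic scores zero, and the score grows as its citations spread more evenly over more topics.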
102
Topical Diversity
Can also be measured on particular papers...
103
Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
  • Information Retrieval
  • On Relevance, Probabilistic Indexing and
    Information Retrieval, Kuhns and Maron (1960)
  • Expected Search Length A Single Measure of
    Retrieval Effectiveness Based on the Weak
    Ordering Action of Retrieval Systems, Cooper
    (1968)
  • Relevance feedback in information retrieval,
    Rocchio (1971)
  • Relevance feedback and the optimization of
    retrieval effectiveness, Salton (1971)
  • New experiments in relevance feedback, Ide
    (1971)
  • Automatic Indexing of a Sound Database Using
    Self-organizing Neural Nets, Feiten and Gunzel
    (1982)
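The precedence query can be sketched as a filter and a sort. The record layout (`topic`, `year`, `citations` keys), the function name `topical_precedence`, and the sample records are hypothetical; they are not the Rexa data.

```python
def topical_precedence(papers, topic, n, k=5):
    """Return the k earliest papers in a topic with more than n citations."""
    eligible = [p for p in papers
                if p["topic"] == topic and p["citations"] > n]
    return sorted(eligible, key=lambda p: p["year"])[:k]

# Hypothetical records, for illustration only.
papers = [
    {"title": "paper-a", "topic": "ir", "year": 1971, "citations": 40},
    {"title": "paper-b", "topic": "ir", "year": 1960, "citations": 12},
    {"title": "paper-c", "topic": "ir", "year": 1955, "citations": 2},
    {"title": "paper-d", "topic": "speech", "year": 1953, "citations": 30},
]
earliest = topical_precedence(papers, "ir", n=10)
print([p["title"] for p in earliest])
```

The citation threshold n keeps early but obscure papers out of the list, so the result favors early work that the topic later built on.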

104
Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
  • Speech Recognition
  • Some experiments on the recognition of speech,
    with one and two ears, E. Colin Cherry (1953)
  • Spectrographic study of vowel reduction, B.
    Lindblom (1963)
  • Automatic Lipreading to enhance speech
    recognition, Eric D. Petajan (1965)
  • Effectiveness of linear prediction
    characteristics of the speech wave for..., B.
    Atal (1974)
  • Automatic Recognition of Speakers from Their
    Voices, B. Atal (1976)

105
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic      Cits  Paper Title
Web Pages        31    Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision  14    On being Undigital with digital cameras: extending the dynamic...
Video            12    Lessons learned from the creation and deployment of a terabyte digital video
Graphs           12    Trawling the Web for Emerging Cyber-Communities
Web Pages        11    WebBase: a repository of Web pages
106
Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
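Counting citations from one topic to another amounts to building a topic-by-topic matrix. The sketch below assumes each paper carries one topic label and that citations arrive as (citing, cited) pairs; the names `topical_transfer` and `paper_topic` are hypothetical, and skipping within-topic citations is an assumption made for illustration.

```python
from collections import Counter

def topical_transfer(citations, paper_topic):
    """Count cross-topic citations, keyed by (producer topic, consumer topic)."""
    transfer = Counter()
    for citing, cited in citations:
        consumer = paper_topic.get(citing)   # topic that uses the idea
        producer = paper_topic.get(cited)    # topic that supplied the idea
        if consumer is None or producer is None or consumer == producer:
            continue                         # unlabeled or within-topic citation
        transfer[(producer, consumer)] += 1
    return transfer

# Placeholder paper ids and topics (not real data).
paper_topic = {"p1": "digital libraries", "p2": "web pages", "p3": "web pages",
               "p4": "vision", "p5": "digital libraries"}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p5", "p1")]

transfer = topical_transfer(citations, paper_topic)
```

Row sums of this matrix indicate producer topics and column sums indicate consumer topics, which is one way to read the "producers and consumers" map.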
107
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
108
Want a topic model with the advantages of CRFs
  • Use arbitrary, overlapping features of the input.
  • Undirected graphical model, so we don't have to
    think about avoiding cycles.
  • Integrate naturally with our other CRF
    components.
  • Train discriminatively
  • Natural semi-supervised training

What does this mean? Topic models are
unsupervised!
109
Multi-Conditional Mixtures: Latent Variable
Models fit by Multi-way Conditional Probability
McCallum, Wang, Pal, 2005; McCallum, Pal,
Wang, 2006
  • For clustering structured data, à la Latent
    Dirichlet Allocation and its successors
  • But an undirected model, like the Harmonium
    (Welling, Rosen-Zvi, Hinton, 2005)
  • But trained by a multi-conditional objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B), e.g. A, B, C are
    different modalities
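For optimization it is usually more convenient to work with the logarithm of the multi-conditional objective O = P(A|B,C) P(B|A,C) P(C|A,B), which turns the product into a sum of conditional log-likelihoods. The parameter symbol θ below is notation introduced here for illustration, not taken from the slides.

```latex
\log O(\theta)
  = \log P(A \mid B, C;\, \theta)
  + \log P(B \mid A, C;\, \theta)
  + \log P(C \mid A, B;\, \theta)
```

Each term asks the model to predict one modality from the other two, so no single modality is treated only as input or only as output.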

110
Objective Functions for Parameter Estimation
Traditional
New, multi-conditional
111
Multi-Conditional Learning (Regularization)
McCallum, Pal, Wang, 2006
112
Predictive Random Fields: mixture of Gaussians on
synthetic data
McCallum, Wang, Pal, 2005
Data, classify by color
Generatively trained
Multi-Conditional
Conditionally-trained (Jebara 1998)
113
Multi-Conditional Mixtures vs. Harmonium on
document retrieval task
McCallum, Wang, Pal, 2005
Multi-Conditional, multi-way conditionally trained
Conditionally-trained, to predict class labels
Harmonium, joint, with class labels and words
Harmonium, joint with words, no labels
114
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
115
Summary
116
Assigning topics to documents
  • Build a 200-topic n-gram topic model on 300k
    documents
  • Remove stopword or methodological topics (e.g.
    efficient, fast, speed)
  • For each document d, if more than 10% of d's
    tokens are assigned to topic t, and that
    comprises more than two tokens, assign d to t
  • Each topic is now an intellectual domain that
    includes some number of documents. We can
    substitute topic for journal in most traditional
    bibliometric indicators. We can also now define
    several new indicators.
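The assignment rule above can be sketched as a small function. This is a minimal illustration, assuming the topic model has already produced one topic id per token; the function name, the `excluded` parameter for stopword or methodological topics, and the choice to compute the 10% fraction over all of the document's tokens are assumptions.

```python
from collections import Counter

def assign_topics(token_topics, excluded=frozenset(),
                  min_frac=0.10, min_tokens=2):
    """Topics covering more than min_frac of a document's tokens
    and more than min_tokens tokens in absolute terms."""
    total = len(token_topics)
    counts = Counter(t for t in token_topics if t not in excluded)
    return sorted(t for t, c in counts.items()
                  if c > min_frac * total and c > min_tokens)

# Hypothetical per-token topic assignments for one 20-token document.
doc = ["ir"] * 6 + ["speech"] * 2 + ["stopword"] * 12

assigned = assign_topics(doc, excluded={"stopword"})
```

Here "ir" passes both tests, "speech" fails because two tokens is not more than two, and "stopword" is dropped as an excluded topic. A document can end up in several topics or in none.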

117
Impact Factor
Journal Impact Factor: Citations from articles
published in 2004 to articles in Cell published
in 2002-3, divided by the number of articles
published in Cell in 2002-3.
2004 Impact factors from JCR:
Nature 32.182
Cell 28.389
JMLR 5.952
Machine Learning 3.258
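The impact factor definition above translates directly into code, and the same computation works when the grouping is a topic instead of a journal. The sketch below assumes citations are (citing, cited) pairs and that publication year and group are known per paper; all names and the toy records are hypothetical.

```python
def impact_factor(citations, pub_year, group, target, year):
    """Citations made in `year` to items of `target` published in the two
    preceding years, divided by the number of such items."""
    window = {year - 1, year - 2}
    items = {p for p, g in group.items()
             if g == target and pub_year[p] in window}
    if not items:
        return 0.0
    cites = sum(1 for citing, cited in citations
                if cited in items and pub_year.get(citing) == year)
    return cites / len(items)

# Placeholder papers: `group` may hold journal names or topic labels.
pub_year = {"a": 2002, "b": 2003, "c": 2004, "d": 2004, "e": 2003}
group = {"a": "X", "b": "X", "c": "Y", "d": "X", "e": "Y"}
citations = [("c", "a"), ("d", "a"), ("d", "b"), ("e", "a")]

factor = impact_factor(citations, pub_year, group, target="X", year=2004)
```

In this toy example three citations from 2004 reach two items of group X published in 2002-3, giving 1.5; the citation from the 2003 paper is outside the citing year and is ignored.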
118
Topic Impact Factor
119
Broad Impact: Diffusion
Journal Diffusion: number of journals citing Cell
divided by the total number of citations to Cell,
over a given time period, times 100.
Problem: relatively brittle at low citation counts. If a
topic/journal is cited twice by two different
topics/journals, it will have high diffusion.
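The brittleness described above is easy to see numerically. This is a small sketch, assuming the input is simply the list of citing journals (or topics), one entry per citation; the function name and the toy inputs are hypothetical.

```python
def diffusion(citing_sources):
    """Distinct citing journals/topics per 100 citations."""
    if not citing_sources:
        return 0.0
    return 100.0 * len(set(citing_sources)) / len(citing_sources)

# Two citations from two different sources: maximal diffusion.
rarely_cited = diffusion(["J1", "J2"])

# 200 citations spread over ten sources: much lower diffusion.
widely_cited = diffusion([f"J{i % 10}" for i in range(200)])
```

The rarely cited item scores 100 while the item cited 200 times from ten sources scores 5, even though the second arguably has the broader impact.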
120
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of
citing topics.
Better at capturing the broad end of the impact
spectrum. The high-diffusion topics are
identical to the least frequently cited topics.
121
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of
citing topics.
Topic diversity can also be measured for papers.
122
Longevity: Cited Half Life
  • Two views
  • Given a paper, what is the median age of
    citations to that paper?
  • What is the median age of citations from current
    literature?
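Both views above reduce to a median over citation ages. The sketch below assumes only publication years are available, so ages are measured in whole years; the function names and the example years are hypothetical.

```python
from statistics import median

def cited_half_life(paper_year, citing_years):
    """View 1: median age of the citations received by one paper."""
    return median(y - paper_year for y in citing_years)

def citing_half_life(current_year, cited_years):
    """View 2: median age of the works cited by current literature."""
    return median(current_year - y for y in cited_years)

# Placeholder years (not real data).
received = cited_half_life(1990, [1991, 1992, 1995, 2000, 2004])
referenced = citing_half_life(2004, [2003, 2001, 1998, 1990])
```

Note that `statistics.median` raises `StatisticsError` on an empty sequence, so papers with no citations need separate handling.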

123
History: Topical Precedence
  • Within a topic, what are the earliest papers that
    received more than n citations?
  • Information Retrieval (138)
  • On Relevance, Probabilistic Indexing and
    Information Retrieval, Kuhns and Maron (1960)
  • Expected Search Length: A Single Measure of
    Retrieval Effectiveness Based on the Weak
    Ordering Action of Retrieval Systems, Cooper
    (1968)
  • Relevance feedback in information retrieval,
    Rocchio (1971)
  • Relevance feedback and the optimization of
    retrieval effectiveness, Salton (1971)
  • New experiments in relevance feedback, Ide (1971)
  • Automatic Indexing of a Sound Database Using
    Self-organizing Neural Nets, Feiten and Gunzel
    (1982)
Title: Topic Models for Social Network Analysis and Bibliometrics


1
Topic Models forSocial Network Analysis and
Bibliometrics
  • Andrew McCallum
  • Computer Science Department
  • University of Massachusetts Amherst

Joint work with ?Xuerui Wang, Natasha
Mohanty, Andres Corrada, Chris Pal, Wei Li, David
Mimno and Gideon Mann.
2
Goal
Mine actionable knowledgefrom unstructured text.
3
From Text to Actionable Knowledge
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Database
Documentcollection
Actionableknowledge
Prediction Outlier detection Decision support
4
Joint Inference
Uncertainty Info
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Database
Documentcollection
Actionableknowledge
Emerging Patterns
Prediction Outlier detection Decision support
5
Unified Model
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Probabilistic Model
Documentcollection
Actionableknowledge
Prediction Outlier detection Decision support
6
(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to
maximize conditional probability of output
sequence given input sequence
Finite state model
Graphical model
OTHER PERSON OTHER ORG TITLE
output seq
y
y
y
y
y
t2
t3
t
-
1
t
t1
FSM states
. . .
observations
x
x
x
x
x
t
t
t
t
1
-
2
3
t
1
input seq
said Jones a Microsoft VP
7
1. Jointly labeling cascaded sequencesFactorial
CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
Joint prediction of part-of-speech and
noun-phrase in newswire, matching accuracy with
only 50 of the training data.
Inference Tree reparameterization BP
Wainwright et al, 2002
8
2. Jointly labeling distant mentionsSkip-chain
CRFs
Sutton, McCallum, SRL 2004

Senator Joe Green said today .
Green ran for
14 reduction in error on most repeated field in
email seminar announcements.
Inference Tree reparameterization BP
Wainwright et al, 2002
9
3. Joint co-reference among all pairsAffinity
Matrix CRF
Entity resolutionObject correspondence
. . . Mr Powell . . .
45
. . . Powell . . .
Y/N
Y/N
-99
Y/N
25 reduction in error on co-reference of proper
nouns in newswire.
11
. . . she . . .
Inference Correlational clustering graph
partitioning
McCallum, Wellner, IJCAI WS 2003, NIPS 2004
Bansal, Blum, Chawla, 2002
10
4. Joint segmentation and co-reference
Extraction from and matching of research paper
citations.
o
s
World Knowledge
Laurel, B. Interface Agents Metaphors with
Character, in The Art of Human-Computer
Interface Design, B. Laurel (ed), Addison-Wesley,
1990.
c
Co-reference decisions
y
y
p
Brenda Laurel. Interface Agents Metaphors with
Character, in Laurel, The Art of Human-Computer
Interface Design, 355-366, 1990.
Databasefield values
c
y
c
Citation attributes
s
s
Segmentation
o
o
35 reduction in co-reference error by using
segmentation uncertainty.
6-14 reduction in segmentation error by using
co-reference.
Inference Variant of Iterated Conditional Modes
Wellner, McCallum, Peng, Hay, UAI 2004
see also Marthi, Milch, Russell, 2003
Besag, 1986
11
Context
Spider
Filter
Data Mining
IE
Segment Classify Associate Cluster
Discover patterns - entity types - links /
relations - events
Database
Documentcollection
Joint inference among detailed steps
Actionableknowledge
Leveraging Text in Social Network Analysis
Prediction Outlier detection Decision support
12
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
13
Social Network in an Email Dataset
14
Clustering words into topics withLatent
Dirichlet Allocation
Blei, Ng, Jordan 2003
GenerativeProcess
Example
For each document
70 Iraq war 30 US election
Sample a distributionover topics, ?
For each word in doc
Iraq war
Sample a topic, z
Sample a wordfrom the topic, w
bombing
15
Example topicsinduced from a large collection of
text
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTU
NITIES WORKING TRAINING SKILLS CAREERS POSITIONS F
IND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY
EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK
RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BI
OLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIEN
TIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIEL
D PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNI
S TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POL
ES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORC
E MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR REA
D TOLD SETTING TALES PLOT TELLING SHORT FICTION AC
TION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT
THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNES
S STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED
SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PER
SON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECT
IONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK
TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL
DIVE DOLPHIN UNDERWATER
Tennenbaum et al
16
Example topicsinduced from a large collection of
text
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTU
NITIES WORKING TRAINING SKILLS CAREERS POSITIONS F
IND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY
EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK
RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BI
OLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIEN
TIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIEL
D PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNI
S TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POL
ES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORC
E MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR REA
D TOLD SETTING TALES PLOT TELLING SHORT FICTION AC
TION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT
THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNES
S STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED
SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PER
SON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECT
IONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK
TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL
DIVE DOLPHIN UNDERWATER
Tennenbaum et al
17
From LDA to Author-Recipient-Topic
McCallum et al 2005
(ART)
18
Inference and Estimation
  • Gibbs Sampling
  • Easy to implement
  • Reasonably fast

r
19
Enron Email Corpus
  • 250k email messages
  • 23k people

Date Wed, 11 Apr 2001 065600 -0700 (PDT) From
debra.perlingiere_at_enron.com To
steve.hooser_at_enron.com Subject
Enron/TransAltaContract dated Jan 1, 2001 Please
see below. Katalin Kiss of TransAlta has
requested an electronic copy of our final draft?
Are you OK with this? If so, the only version I
have is the original draft without
revisions. DP Debra Perlingiere Enron North
America Corp. Legal Department 1400 Smith Street,
EB 3885 Houston, Texas 77002 dperlin_at_enron.com
20
Topics, and prominent senders /
receiversdiscovered by ART
Topic names, by hand
21
Topics, and prominent senders /
receiversdiscovered by ART
Beck Chief Operations Officer
Dasovich Government Relations
Executive Shapiro Vice President of
Regulatory Affairs Steffes Vice President of
Government Affairs
22
Comparing Role Discovery
Traditional SNA
Author-Topic
ART
connection strength (A,B)
distribution over recipients
distribution over authored topics
distribution over authored topics
23
Comparing Role Discovery Tracy Geaconne ? Dan
McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne Secretary McCarty Vice President
24
Comparing Role Discovery Lynn Blair ? Kimberly
Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair Gas pipeline logistics Watson
Pipeline facilities planning
25
McCallum Email Corpus 2004
  • January - October 2004
  • 23k email messages
  • 825 people

From kate_at_cs.umass.edu Subject NIPS and
.... Date June 14, 2004 22741 PM EDT To
mccallum_at_cs.umass.edu There is pertinent stuff
on the first yellow folder that is completed
either travel or other things, so please sign
that first folder anyway. Then, here is the
reminder of the things I'm still waiting
for NIPS registration receipt. CALO
registration receipt. Thanks, Kate
26
Four most prominent topicsin discussions with
____?
27
(No Transcript)
28
Two most prominent topicsin discussions with
____?
29
(No Transcript)
30
Role-Author-Recipient-Topic Models
31
Results with RARTPeople in Role 3 in
Academic Email
  • olc lead Linux sysadmin
  • gauthier sysadmin for CIIR group
  • irsystem mailing list CIIR sysadmins
  • system mailing list for dept. sysadmins
  • allan Prof., chair of computing committee
  • valerie second Linux sysadmin
  • tech mailing list for dept. hardware
  • steve head of dept. I.T. support

32
Roles for allan (James Allan)
  • Role 3 I.T. support
  • Role 2 Natural Language researcher

Roles for pereira (Fernando Pereira)
  • Role 2 Natural Language researcher
  • Role 4 SRI CALO project participant
  • Role 6 Grant proposal writer
  • Role 10 Grant proposal coordinator
  • Role 8 Guests at McCallums house

33
ART Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
34
Outline
Social Network Analysis with Topic Models
a
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
35
Groups and Topics
  • Input
  • Observed relations between people
  • Attributes on those relations (text, or
    categorical)
  • Output
  • Attributes clustered into topics
  • Groups of people---varying depending on topic

36
Discovering Groups from Observed Set of Relations
Student Roster Adams BennettCarterDavis Edward
s Frederking
Academic Admiration Acad(A, B) Acad(C,
B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D,
E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F,
A) Acad(E, C) Acad(F, C)
Admiration relations among six high school
students.
37
Adjacency Matrix Representing Relations
Student Roster Adams BennettCarterDavis Edward
s Frederking
Academic Admiration Acad(A, B) Acad(C,
B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D,
E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F,
A) Acad(E, C) Acad(F, C)
A B C D E F
G1 G2 G1 G2 G3 G3
G1
G2
G1
G2
G3
G3
A C B D E F
G1 G1 G2 G2 G3 G3
G1
G1
G2
G2
G3
G3
A B C D E F
A
B
C
D
E
F
A
B
C
D
E
F
A
C
B
D
E
F
38
Group Model Partitioning Entities into Groups
Stochastic Blockstructures for Relations Nowicki,
Snijders 2001
Beta
Dirichlet
Multinomial
S number of entities G number of groups
Binomial
Enhanced with arbitrary number of groups in
Kemp, Griffiths, Tenenbaum 2004
39
Two Relations with Different Attributes
Student Roster Adams BennettCarterDavis Edward
s Frederking
Academic Admiration Acad(A, B) Acad(C,
B) Acad(A, D) Acad(C, D) Acad(B, E) Acad(D,
E) Acad(B, F) Acad(D, F) Acad(E, A) Acad(F,
A) Acad(E, C) Acad(F, C)
Social Admiration Soci(A, B) Soci(A, D) Soci(A,
F) Soci(B, A) Soci(B, C) Soci(B, E) Soci(C, B)
Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C)
Soci(D, E) Soci(E, B) Soci(E, D) Soci(E,
F) Soci(F, A) Soci(F, C) Soci(F, E)
A C B D E F
G1 G1 G2 G2 G3 G3
G1
G1
G2
G2
G3
G3
A C E B D F
G1 G1 G1 G2 G2 G2
G1
G1
G1
G2
G2
G2
A
C
E
B
D
F
A
C
B
D
E
F
40
The Group-Topic Model Discovering Groups and
Topics Simultaneously
Wang, Mohanty, McCallum 2006
Beta
Uniform
Dirichlet
Multinomial
Dirichlet
Binomial
Multinomial
41
Inference and Estimation
  • Gibbs Sampling
  • Many r.v.s can be integrated out
  • Easy to implement
  • Reasonably fast

We assume the relationship is symmetric.
42
Dataset 1U.S. Senate
  • 16 years of voting records in the US Senate (1989
    2005)
  • a Senator may respond Yea or Nay to a resolution
  • 3423 resolutions with text attributes (index
    terms)
  • 191 Senators in total across 16 years

S.543 Title An Act to reform Federal deposit
insurance, protect the deposit insurance funds,
recapitalize the Bank Insurance Fund, improve
supervision and regulation of insured depository
institutions, and for other purposes. Sponsor
Sen Riegle, Donald W., Jr. MI (introduced
3/5/1991) Cosponsors (2) Latest Major Action
12/19/1991 Became Public Law No 102-242. Index
terms Banks and banking Accounting
Administrative fees Cost control Credit Deposit
insurance Depressed areas and other 110 terms
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen
(D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea
Bradley (D-NJ), Nay Conrad (D-ND), Nay
43
Topics Discovered (U.S. Senate)
Education Energy Military Misc. Economic
education energy government federal
school power military labor
aid water foreign insurance
children nuclear tax aid
drug gas congress tax
students petrol aid business
elementary research law employee
prevention pollution policy care
Mixture of Unigrams
Education Domestic Foreign Economic Social Security Medicare
education foreign labor social
school trade insurance security
federal chemicals tax insurance
aid tariff congress medical
government congress income care
tax drugs minimum medicare
energy communicable wage disability
research diseases business assistance
Group-Topic Model
44
Groups Discovered (US Senate)
Groups from topic Education Domestic
45
Senators Who Change Coalition the most Dependent
on Topic
e.g. Senator Shelby (D-AL) votes with the
Republicans on Economic with the Democrats on
Education Domestic with a small group of
maverick Republicans on Social Security Medicaid
46
Dataset 2The UN General Assembly
  • Voting records of the UN General Assembly (1990 -
    2003)
  • A country may choose to vote Yes, No or Abstain
  • 931 resolutions with text attributes (titles)
  • 192 countries in total
  • Also experiments later with resolutions from
    1960-2003

Vote on Permanent Sovereignty of Palestinian
People, 87th plenary meeting The draft
resolution on permanent sovereignty of the
Palestinian people in the occupied Palestinian
territory, including Jerusalem, and of the Arab
population in the occupied Syrian Golan over
their natural resources (document A/54/591) was
adopted by a recorded vote of 145 in favour to 3
against with 6 abstentions In favour
Afghanistan, Argentina, Belgium, Brazil, Canada,
China, France, Germany, India, Japan, Mexico,
Netherlands, New Zealand, Pakistan, Panama,
Russian Federation, South Africa, Spain, Turkey,
and other 126 countries. Against Israel,
Marshall Islands, United States. Abstain
Australia, Cameroon, Georgia, Kazakhstan,
Uzbekistan, Zambia.
47
Topics Discovered (UN)
Everything Nuclear Human Rights Security in Middle East
Everything Nuclear Security in Middle East
nuclear rights occupied
weapons human israel
use palestine syria
implementation situation security
countries israel calls
Mixture of Unigrams
Nuclear Non-proliferation Nuclear Arms Race Human Rights
nuclear nuclear rights
states arms human
united prevention palestine
weapons race occupied
nations space israel
Group-TopicModel
48
GroupsDiscovered(UN)
The countries list for each group are ordered by
their 2005 GDP (PPP) and only 5 countries are
shown in groups that have more than 5 members.
49
Do We Get Better Groups with the GT Model?
Baseline Model GT Model
  1. Cluster bills into topics using mixture of
    unigrams
  2. Apply group model on topic-specific subsets of
    bills.
  1. Jointly cluster topic and groups at the same time
    using the GT model.

Datasets Avg. AI for Baseline Avg. AI for GT p-value
Senate 0.8198 0.8294 lt.01
UN 0.8548 0.8664 lt.01
Agreement Index (AI) measures group cohesion.
Higher, better.
50
Groups and Topics, Trends over Time (UN)
51
Outline
Social Network Analysis with Topic Models
a
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

a
Multi-Conditional Mixtures
52
Latent Dirichlet Allocation
Blei, Ng, Jordan, 2003
a
N
?
n
z
ß
T
w
f
53
Correlated Topic Model
Blei, Lafferty, 2005
?
?
N
logistic normal
?
n
z
ß
T
w
f
Square matrix of pairwise correlations.
54
Pachinko Machine
55
Pachinko Allocation Model
Thanks to Michael Jordan for suggesting the name
Li, McCallum, 2005
?11
Given directed acyclic graph (DAG) at each
interior node a Dirichlet over its children
and words at leaves
Model structure, not the graphical model
?22
?21
For each document Sample a multinomial from
each Dirichlet
?31
?33
?32
For each word in this document Starting from
the root, sample a child from successive
nodes, down to a leaf. Generate the word at the
leaf
?41
?42
?43
?44
?45
word1
word2
word3
word4
word5
word6
word7
word8
Like a Polya tree, but DAG shaped, with arbitrary
number of children.
56
Pachinko Allocation Model
Li, McCallum, 2005
?11
  • DAG may have arbitrary structure
  • arbitrary depth
  • any number of children per node
  • sparse connectivity
  • edges may skip layers

Model structure, not the graphical model
?22
?21
?31
?33
?32
?41
?42
?43
?44
?45
word1
word2
word3
word4
word5
word6
word7
word8
57
Pachinko Allocation Model
Li, McCallum, 2005
?11
Model structure, not the graphical model
?22
?21
Distributions over distributions over topics...
Distributions over topicsmixtures, representing
topic correlations
?31
?33
?32
?41
?42
?43
?44
?45
Distributions over words (like LDA topics)
word1
word2
word3
word4
word5
word6
word7
word8
Some interior nodes could contain one
multinomial, used for all documents. (i.e. a very
peaked Dirichlet)
58
Pachinko Allocation Model
Li, McCallum, 2005
?11
Estimate all these Dirichlets from
data. Estimate model structure from data.
(number of nodes, and connectivity)
Model structure, not the graphical model
?22
?21
?31
?33
?32
?41
?42
?43
?44
?45
word1
word2
word3
word4
word5
word6
word7
word8
59
Pachinko Allocation Special Cases
Latent Dirichlet Allocation
?32
?41
?42
?43
?44
?45
word1
word2
word3
word4
word5
word6
word7
word8
60
Pachinko Allocation Special Cases
Hierarchical Latent Dirichlet Allocation (HLDA)
Very low variance Dirichlet at root
?11
Each leaf of the HLDA topic hier. has a distr.
over nodes on path to the root.
?22
?23
?24
?21
?32
?33
?31
?34
TheHLDAhier.
?41
?42
?51
word1
word2
word3
word4
word5
word6
word7
word8
61
Pachinko Allocation on a Topic Hierarchy
Combining best of HLDA and Pachinko Allocation
?00
ThePAMDAG.
?11
?12
...representingcorrelations amongtopic leaves.
?22
?23
?24
?21
?32
?33
?31
?34
TheHLDAhier.
?41
?42
?51
word1
word2
word3
word4
word5
word6
word7
word8
62
Pachinko Allocation Model
... with two layers, no skipping
layers,fully-connected from one layer to the
next.
?11
?21
?23
?22
super-topics
sub-topics
?31
?32
?33
?34
?35
fixed multinomials
word1
word2
word3
word4
word5
word6
word7
word8
Another special case would select only one
super-topic per document.
63
Graphical Models
PAM (with fixed multinomials for topics)
LDA
q
a
a
N
N
q
?
?
n
n

z1
z2
zm
z
ß
ß
T
T
w
f
w
f
64
Pachinko Allocation Model
  • Likelihood
  • Estimate zs by Gibbs sampling
  • Estimate ?s by moment matching.

65
Preliminary Experimental Results
  • Topic Coherence
  • Likelihood on held-out data
  • Document classification

66
NIPS Dataset
NIPS Conference PapersVolumes 0-12 Spanning
1987 1999. Prepared by Sam Roweis.
  • 1740 papers
  • 13649 Words
  • 2,301,375 tokens

67
Topic Coherence Comparison
models, estimation, stopwords
estimation, some junk
LDA 100 estimation likelihood maximum noisy estima
tes mixture scene surface normalization generated
measurements surfaces estimating estimated iterati
ve combined figure divisive sequence ideal
LDA 20 models model parameters distribution bayes
ian probability estimation data gaussian methods l
ikelihood em mixture show approach paper density f
ramework approximation markov
Example super-topic 33 input hidden units
function number 27 estimation bayesian parameters
data methods 24 distribution gaussian markov
likelihood mixture 11 exact kalman full
conditional deterministic 1 smoothing
predictive regularizers intermediate slope
68
Topic Coherence Comparison
images, motion eyes
motion, some junk
motion
eyes
images
LDA 100 motion detection field optical flow sensit
ive moving functional detect contrast light dimens
ional intensity computer mt measures occlusion tem
poral edge real
PAM 100 motion video surface surfaces figure scene
camera noisy sequence activation generated analy
tical pixels measurements assigne advance lated sh
own closed perceptual
LDA 20 visual model motion field object image ima
ges objects fields receptive eye position spatial
direction target vision multiple figure orientatio
n location
PAM 100 eye head vor vestibulo oculomotor vestibul
ar vary reflex vi pan rapid semicircular canals re
sponds streams cholinergic rotation topographicall
y detectors ning
PAM 100 image digit faces pixel surface interpolat
ion scene people viewing neighboring sensors patch
es manifold dataset magnitude transparency rich dy
namical amounts tor
69
Topic Coherence Comparison
neural networks, much less junk
neural networks, some junk
neural networks, some junk
PAM 100 input hidden units function number functio
ns networks output linear layer single results wei
ght inputs basis parameters standard network patte
rns study
LDA 100 network layer multi trained high perceptro
n layers give type nonlinearity perceptrons module
modified matched performed provided designed samp
les study mode
LDA 20 architecture network input output structure
paper level task work sequences sequence multiple
problem shows connectionist networks context perf
orm scale learn
70
Blind Topic Evaluation
  • Randomly select 25 similar pairs of topics
    generated from PAM and LDA
  • 5 people
  • Each asked select the topic in each pair that
    you find most semantically coherent.

Prefer PAM
Topic counts
LDA PAM
5 votes 0 5
gt 4 votes 3 8
gt 3 votes 9 16
71
Example Topic Pairswith Human Evaluation
72
Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics Dirichlet
parameters
73
Likelihood on Held Out Data
  • Likelihood comparison
  • NIPS abstracts
  • Train the model with 75 data
  • Calculate likelihood on 25 data
  • Calculate likelihood by
  • Sampling many, many documents from the model
  • Estimating a simple mixture of multinomials from
    these
  • Calculate the likelihood of data under this
    simple mixture.

74
Likelihood Comparison
  • Varying number of topics

75
Document Classification
Comp5 from 20 Newsgroups corpus. Train on 25,
test on 75Like Naive Bayes, but use LDA/PAM
per-class instead of multinomial.
2.5 increase
Test Accuracy ()
76
Outline
Social Network Analysis with Topic Models
a
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
77
Want to Model Trends over Time
  • Is prevalence of topic growing or waning?
  • Pattern appears only briefly
  • Capture its statistics in focused way
  • Dont confuse it with patterns elsewhere in time
  • How do roles, groups, influence shift over time?

78
Topics over Time (TOT)
?
Dirichlet
?
multinomialover topics
Uniformprior
Dirichlet prior
topicindex
z
?
?
timestamp
word
w
t
?
?
T
T
Nd
Betaover time
Multinomialover words
D
79
State of the Union Address
208 Addresses delivered between January 8, 1790
and January 29, 2002.
  • To increase the number of documents, we split the
    addresses into paragraphs and treated them as
    documents. One-line paragraphs were excluded.
    Stopping was applied.
  • 17156 documents
  • 21534 words
  • 669,425 tokens

Our scheme of taxation, by means of which this
needless surplus is taken from the people and put
into the public Treasury, consists of a tariff
or duty levied upon importations from abroad and
internal-revenue taxes levied upon the
consumption of tobacco and spirituous and malt
liquors. It must be conceded that none of the
things subjected to internal-revenue
taxation are, strictly speaking, necessaries.
There appears to be no just complaint of this
taxation by the consumers of these articles, and
there seems to be nothing so well able to bear
the burden without hardship to any portion of the
people.
1910
80
Comparing TOT with LDA
81
Sample Topic Cold War
world nations united states peace free economic mi
litary soviet international security strength defe
nse freedom europe force peoples efforts aggressio
n today
82
ComparingTOTagainst LDA
83
TOT on 17 years of NIPS proceedings
84
TOT on 17 years of NIPS proceedings
TOT
LDA
85
TOT versusLDAon my email
86
TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union
address.
L1 distance between predicted year and actual
year.
87
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
88
Topics Modeling Phrases
  • Topics based only on unigrams often difficult to
    interpret
  • Topic discovery itself is confused because
    important meaning / distinctions carried by
    phrases.
  • Significant opportunity to provide improved
    language models to ASR, MT, IR, etc.

89
Topical N-gram Model
[Figure: Topical N-gram graphical model in plate
notation. Per-token variables z1, z2, z3, z4, ...;
y1, y2, y3, y4, ...; w1, w2, w3, w4, ...
Plates: D, W, T.]
90
LDA Topic
LDA: algorithms, algorithm, genetic, problems,
efficient
Topical N-grams: genetic algorithms, genetic
algorithm, evolutionary computation, evolutionary
algorithms, fitness function
91
Sample Topical N-gram topics
Sample LDA topics
92
Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
policy action states actions function reward
control agent q-learning optimal goal learning
space step environment system problem steps sutton
policies

learning optimal reinforcement state problems
policy dynamic action programming actions function
markov methods decision rl continuous spaces step
policies planning

reinforcement learning, optimal policy, dynamic
programming, optimal control, function
approximator, prioritized sweeping, finite-state
controller, learning system, reinforcement
learning_rl, function approximators, markov
decision problems, markov decision processes, local
search, state-action pair, markov decision
process, belief states, stochastic policy, action
selection, upright position, reinforcement learning
methods
93
Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
motion response direction cells stimulus figure
contrast velocity model responses stimuli moving
cell intensity population image center tuning
complex directions

motion visual field position figure direction
fields eye location retina receptive velocity
vision moving system flow edge center light local

receptive field, spatial frequency, temporal
frequency, visual motion, motion energy, tuning
curves, horizontal cells, motion detection,
preferred direction, visual processing, area mt,
visual cortex, light intensity, directional
selectivity, high contrast, motion detectors,
spatial phase, moving stimuli, decision strategy,
visual stimuli
94
Topic Comparison
LDA
Topical N-grams (2)
Topical N-grams (1)
speech word training system recognition hmm
speaker performance phoneme acoustic words context
systems frame trained sequence phonetic speakers
mlp hybrid

word system recognition hmm speech training
performance phoneme words context systems frame
trained speaker sequence speakers mlp frames
segmentation models

speech recognition, training data, neural
network, error rates, neural net, hidden markov
model, feature vectors, continuous speech, training
procedure, continuous speech recognition, gamma
filter, hidden control, speech production, neural
nets, input representation, output layers, training
algorithm, test set, speech frames, speaker
dependent
95
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
96
Social Networks in Research Literature
  • Better understand structure of our own research
    area.
  • Structure helps us learn a new field.
  • Aid collaboration
  • Map how ideas travel through social networks of
    researchers.
  • Aids for hiring and finding reviewers!

97
Traditional Bibliometrics
  • Analyses a small amount of data (e.g. 19 articles
    from a single issue of a journal)
  • Uses journal as a proxy for research
    topic (but there is no journal for information
    extraction)
  • Uses impact measures almost exclusively based on
    simple citation counts.

How can we use topic models to create new,
interesting impact measures?
98
Our Data
  • Over 1 million research papers, gathered as part
    of Rexa.info portal.
  • Cross linked references / citations.

99
Finding Topics with TNG
Traditional unigram LDA run on 1 million titles /
abstracts (200 topics) ... select 300k papers
on ML, NLP, robotics, vision ... Find 200 TNG
topics among those papers.
100
Topical Bibliometric Impact Measures
  • Topical Citation Counts
  • Topical Impact Factors
  • Topical Longevity
  • Topical Diversity
  • Topical Precedence
  • Topical Transfer

101
Topical Diversity
Entropy of the topic distribution among papers
that cite this paper (this topic).
Low Diversity
High Diversity
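A minimal sketch of this entropy measure, assuming each citing paper has already been assigned a single topic label. The function name, the toy labels, and the choice of base-2 logarithm are assumptions; the slide does not specify them.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Shannon entropy (in bits) of the topic labels of citing papers.

    `citing_topics` holds one topic label per citing paper.
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

# Toy examples: citations concentrated in one topic vs. spread evenly.
low = topical_diversity(["speech"] * 9 + ["vision"])
high = topical_diversity(["speech", "vision", "ir", "robotics"] * 3)
```

Citations spread evenly over four topics give the maximum entropy for four labels (2 bits), while citations dominated by one topic give a value near zero.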
102
Topical Diversity
Can also be measured on particular papers...
103
Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
  • Information Retrieval
  • On Relevance, Probabilistic Indexing and
    Information Retrieval, Kuhns and Maron (1960)
  • Expected Search Length: A Single Measure of
    Retrieval Effectiveness Based on the Weak
    Ordering Action of Retrieval Systems, Cooper
    (1968)
  • Relevance feedback in information retrieval,
    Rocchio (1971)
  • Relevance feedback and the optimization of
    retrieval effectiveness, Salton (1971)
  • New experiments in relevance feedback, Ide
    (1971)
  • Automatic Indexing of a Sound Database Using
    Self-organizing Neural Nets, Feiten and Gunzel
    (1982)
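The precedence query can be sketched as a filter followed by a sort. The record layout (`topic`, `year`, `citations` keys), the function name, and the toy records below are hypothetical; they only illustrate the question "within a topic, which are the earliest papers with more than n citations?".

```python
def topical_precedence(papers, topic, n, k=5):
    """Return the k earliest papers in `topic` with more than n citations."""
    eligible = [p for p in papers
                if p["topic"] == topic and p["citations"] > n]
    return sorted(eligible, key=lambda p: p["year"])[:k]

# Toy records with made-up titles and counts.
papers = [
    {"title": "paper A", "topic": "ir", "year": 1960, "citations": 12},
    {"title": "paper B", "topic": "ir", "year": 1955, "citations": 2},
    {"title": "paper C", "topic": "ir", "year": 1971, "citations": 40},
    {"title": "paper D", "topic": "speech", "year": 1953, "citations": 30},
]
earliest = topical_precedence(papers, "ir", n=5, k=2)
```

The citation threshold matters: "paper B" is the oldest record in the topic but is dropped because it does not exceed n citations.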

104
Topical Precedence
Early-ness
Within a topic, what are the earliest papers
that received more than n citations?
  • Speech Recognition
  • Some experiments on the recognition of speech,
    with one and two ears, E. Colin Cherry (1953)
  • Spectrographic study of vowel reduction, B.
    Lindblom (1963)
  • Automatic Lipreading to enhance speech
    recognition, Eric D. Petajan (1965)
  • Effectiveness of linear prediction
    characteristics of the speech wave for..., B.
    Atal (1974)
  • Automatic Recognition of Speakers from Their
    Voices, B. Atal (1976)

105
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic      Cits  Paper Title
Web Pages        31    Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision  14    On being Undigital with digital cameras extending the dynamic...
Video            12    Lessons learned from the creation and deployment of a terabyte digital video
Graphs           12    Trawling the Web for Emerging Cyber-Communities
Web Pages        11    WebBase a repository of Web pages
106
Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
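One way to tabulate topic-to-topic citation counts, as a sketch: the function name, the `(citing, cited)` pair layout, and the decision to skip within-topic citations are assumptions for illustration.

```python
from collections import defaultdict

def topical_transfer(citations, paper_topic):
    """Count citations from each citing topic to each cited topic.

    `citations` is a list of (citing_id, cited_id) pairs and
    `paper_topic` maps paper ids to a topic label. Citations within
    the same topic, or involving papers with no topic, are skipped.
    """
    counts = defaultdict(int)
    for citing, cited in citations:
        src = paper_topic.get(citing)
        dst = paper_topic.get(cited)
        if src is None or dst is None or src == dst:
            continue
        counts[(src, dst)] += 1
    return dict(counts)

# Toy data with made-up paper ids.
paper_topic = {"p1": "digital libraries", "p2": "web pages",
               "p3": "web pages", "p4": "video"}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p3", "p2")]
transfer = topical_transfer(citations, paper_topic)
```

Reading the result by cited topic shows which topics "produce" ideas, and reading it by citing topic shows which topics "consume" them.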
107
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
108
Want a topic model with the advantages of CRFs
  • Use arbitrary, overlapping features of the input.
  • Undirected graphical model, so we don't have to
    think about avoiding cycles.
  • Integrate naturally with our other CRF
    components.
  • Train discriminatively
  • Natural semi-supervised training

What does this mean? Topic models are
unsupervised!
109
Multi-Conditional Mixtures: Latent Variable
Models fit by Multi-way Conditional Probability
McCallum, Wang, Pal, 2005; McCallum, Pal,
Wang, 2006
  • For clustering structured data, à la Latent
    Dirichlet Allocation and its successors
  • But an undirected model, like the Harmonium
    (Welling, Rosen-Zvi, Hinton, 2005)
  • But trained by a multi-conditional objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B), e.g. A, B, C are
    different modalities
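To make the multi-conditional objective concrete, the sketch below evaluates the log of a product of three conditionals, log P(A|B,C) + log P(B|A,C) + log P(C|A,B), on a toy joint table. It assumes the objective is exactly this product, and it uses a fully observed discrete table with no latent variables, which is a deliberate simplification of the mixture models discussed in the talk. All names are hypothetical.

```python
import math
from itertools import product

def conditional(joint, index, observed):
    """P(variable `index` takes its observed value | the other two)."""
    numerator = joint[observed]
    denominator = sum(
        p for assignment, p in joint.items()
        if all(assignment[i] == observed[i] for i in range(3) if i != index)
    )
    return numerator / denominator

def multi_conditional_log_objective(joint, data):
    """Sum over data of log P(A|B,C) + log P(B|A,C) + log P(C|A,B)."""
    total = 0.0
    for observed in data:
        for index in range(3):
            total += math.log(conditional(joint, index, observed))
    return total

# Toy joint: uniform over three binary variables (A, B, C).
joint = {abc: 1.0 / 8.0 for abc in product((0, 1), repeat=3)}
value = multi_conditional_log_objective(joint, [(0, 1, 1)])
```

Under the uniform table every conditional equals 0.5, so the objective for one observation is 3 log 0.5; a model that predicts each modality well from the other two would score higher.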

110
Objective Functions for Parameter Estimation
Traditional
New, multi-conditional
111
Multi-Conditional Learning (Regularization)
McCallum, Pal, Wang, 2006
112
Predictive Random Fields: mixture of Gaussians on
synthetic data
McCallum, Wang, Pal, 2005
Data, classify by color
Generatively trained
Multi-Conditional
Conditionally-trained Jebara 1998
113
Multi-Conditional Mixtures vs. Harmonium on
document retrieval task
McCallum, Wang, Pal, 2005
Multi-Conditional, multi-way conditionally trained
Conditionally-trained, to predict class labels
Harmonium, joint, with class labels and words
Harmonium, joint with words, no labels
114
Outline
Social Network Analysis with Topic Models
  • Role Discovery (Author-Recipient-Topic Model,
    ART)
  • Group Discovery (Group-Topic Model, GT)
  • Enhanced Topic Models
  • Correlations among Topics (Pachinko Allocation,
    PAM)
  • Time Localized Topics (Topics-over-Time Model,
    TOT)
  • Markov Dependencies in Topics (Topical N-Grams
    Model, TNG)
  • Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
115
Summary
116
Assigning topics to documents
  • Build a 200 topic n-gram topic model on 300k
    documents
  • Remove stopword or methodological topics (e.g.
    efficient, fast, speed)
  • For each document d, if more than 10% of d's
    tokens are assigned to topic t, and that
    comprises more than two tokens, assign d to t
  • Each topic is now an intellectual domain that
    includes some number of documents. We can
    substitute topic for journal in most traditional
    bibliometric indicators. We can also now define
    several new indicators.
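The assignment rule above can be sketched as follows, taking the fraction threshold to be 10 percent of a document's tokens (an assumption about the intended unit). The function name and parameter names are hypothetical, and the input is assumed to be one sampled topic label per token.

```python
from collections import Counter

def assign_topics(token_topics, min_fraction=0.10, min_tokens=2):
    """Return the topics a document is assigned to.

    A topic qualifies when it accounts for more than `min_fraction`
    of the document's tokens and for more than `min_tokens` tokens.
    """
    counts = Counter(token_topics)
    total = len(token_topics)
    return sorted(
        topic for topic, count in counts.items()
        if count > min_fraction * total and count > min_tokens
    )

# Toy document: 6 tokens from "ir", 3 from "speech", 1 from "vision".
doc = ["ir"] * 6 + ["speech"] * 3 + ["vision"]
assigned = assign_topics(doc)
```

A document can land in several topics at once, and the absolute token floor keeps very short documents from being assigned on the strength of one or two tokens.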

117
Impact Factor
Journal Impact Factor: citations from articles
published in 2004 to articles in Cell published
in 2002-3, divided by the number of articles
published in Cell in 2002-3.
2004 impact factors from JCR:
Nature 32.182
Cell 28.389
JMLR 5.952
Machine Learning 3.258
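The same ratio can be computed for a topic by substituting topic membership for journal membership. The sketch below follows the two-year window described above; the function name, the data layout, and the toy records are assumptions for illustration.

```python
def impact_factor(citations, group_of, year_of, group, census_year):
    """Citations made in `census_year` to the group's articles from the
    two preceding years, divided by the number of those articles.

    `group_of` maps paper id to a journal or topic label, `year_of`
    maps paper id to publication year, and `citations` is a list of
    (citing_id, cited_id) pairs.
    """
    window = {census_year - 2, census_year - 1}
    recent = {p for p, g in group_of.items()
              if g == group and year_of[p] in window}
    if not recent:
        return 0.0
    cites = sum(1 for citing, cited in citations
                if year_of.get(citing) == census_year and cited in recent)
    return cites / len(recent)

# Toy data with made-up paper ids.
group_of = {"p1": "speech", "p2": "speech", "p3": "vision",
            "p4": "vision", "p5": "speech"}
year_of = {"p1": 2002, "p2": 2003, "p3": 2004, "p4": 2004, "p5": 2004}
citations = [("p3", "p1"), ("p4", "p1"), ("p5", "p2"), ("p5", "p3")]
speech_if = impact_factor(citations, group_of, year_of, "speech", 2004)
```

Here the "speech" group has two articles in the window and receives three citations from 2004 papers, giving a toy impact factor of 1.5.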
118
Topic Impact Factor
119
Broad Impact: Diffusion
Journal Diffusion: number of journals citing Cell
divided by the total number of citations to Cell,
over a given time period, times 100.
Problem: relatively brittle at low citation counts. If a
topic/journal is cited twice by two different
topics/journals, it will have high diffusion.
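A small sketch of the diffusion ratio makes the brittleness easy to see. The function name and the input layout (one citing journal or topic label per citation) are assumptions.

```python
def diffusion(citing_sources):
    """Distinct citing journals/topics per citation, times 100.

    `citing_sources` has one entry per citation received, naming the
    journal or topic the citation came from.
    """
    if not citing_sources:
        return 0.0
    return 100.0 * len(set(citing_sources)) / len(citing_sources)

# Two citations from two different sources: maximal diffusion.
rarely_cited = diffusion(["journal_a", "journal_b"])
# One hundred citations from the same two sources: low diffusion.
heavily_cited = diffusion(["journal_a"] * 50 + ["journal_b"] * 50)
```

Both toy cases are cited by the same two sources, yet the rarely cited one scores 100 and the heavily cited one scores 2, which is why the measure favours items with very few citations.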
120
Broad Impact: Diversity
Topic Diversity: entropy of the distribution of
citing topics. Better at capturing broad end of
impact spectrum. The high diffusion topics are
identical to the least frequently cited topics.
121
Broad Impact: Diversity
Topic Diversity: entropy of the distribution of
citing topics. Topic diversity can also be
measured for papers.
122
Longevity: Cited Half Life
  • Two views
  • Given a paper, what is the median age of
    citations to that paper?
  • What is the median age of citations from current
    literature?
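Both views reduce to a median over citation ages. In the sketch below, "age" is taken to be the difference between the citing year and the cited year, which is an assumed reading of the slide; the function names and toy years are hypothetical.

```python
from statistics import median

def cited_half_life(paper_year, citing_years):
    """Median age of the citations received by one paper."""
    ages = [year - paper_year for year in citing_years]
    return median(ages) if ages else None

def citing_half_life(current_year, cited_years):
    """Median age of the references made by current literature."""
    ages = [current_year - year for year in cited_years]
    return median(ages) if ages else None

# View 1: a 1990 paper cited in 1991, 1993 and 1999.
view_one = cited_half_life(1990, [1991, 1993, 1999])
# View 2: 2004 papers citing work from 2003, 2000, 1990 and 1984.
view_two = citing_half_life(2004, [2003, 2000, 1990, 1984])
```

The first view describes how long a given paper keeps attracting citations; the second describes how far back the current literature of a topic reaches.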

123
History: Topical Precedence
  • Within a topic, what are the earliest papers that
    received more than n citations?
  • Information Retrieval (138)
  • On Relevance, Probabilistic Indexing and
    Information Retrieval, Kuhns and Maron (1960)
  • Expected Search Length: A Single Measure of
    Retrieval Effectiveness Based on the Weak
    Ordering Action of Retrieval Systems, Cooper
    (1968)
  • Relevance feedback in information retrieval,
    Rocchio (1971)
  • Relevance feedback and the optimization of
    retrieval effectiveness, Salton (1971)
  • New experiments in relevance feedback, Ide (1971)
  • Automatic Indexing of a Sound Database Using
    Self-organizing Neural Nets, Feiten and Gunzel
    (1982)