Topic modeling - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Topic modeling


1
Topic modeling
Mark Steyvers
Department of Cognitive Sciences, University of California, Irvine
2
Some topics we can discuss
  • Introduction to LDA, the basic topic model
  • Preliminary work on therapy transcripts
  • Extensions to LDA
  • Conditional topic models (for predicting
    behavioral codes)
  • Various topic models for word order
  • Topic models incorporating parse trees
  • Topic models for dialogue
  • Topic models incorporating speech information

3
The most basic topic model: LDA (Latent Dirichlet Allocation)
4
Automatic and unsupervised extraction of semantic
themes from large text collections.
  • Pennsylvania Gazette (1728-1800): 80,000 articles
  • Enron: 250,000 emails
  • NYT: 330,000 articles
  • NSF/NIH: 100,000 grants
  • AOL queries: 20,000,000 queries, 650,000 users
  • Medline: 16 million articles
5
Model Input
  • Matrix of counts: number of times words occur in documents
  • Note:
  • word order is lost (bag-of-words approach)
  • some function words are deleted (the, a, in)

[Figure: words × documents count matrix]
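As a concrete illustration, here is a minimal sketch of building such a count matrix with scikit-learn's CountVectorizer (the toy documents and the choice of library are assumptions, not part of the slides):

```python
# Minimal sketch (assumed library and toy documents): build the words-by-documents
# count matrix that the topic model takes as input. Stop words such as "the",
# "a", "in" are removed, and word order is discarded (bag of words).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the bank approved the loan and transferred the money",
    "the river bank was flooded by the stream",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)      # documents x words sparse count matrix

print(vectorizer.get_feature_names_out())    # vocabulary after stop-word removal
print(counts.toarray())                      # word counts per document
```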
6
Basic Assumptions
  • Each topic is a distribution over words
  • Each document is a mixture of topics
  • Each word in a document originates from a single
    topic

7
Document = mixture of topics
Example topics:
  • auto car parts cars used ford honda truck toyota
  • party store wedding birthday jewelry ideas cards cake gifts
  • hannah montana zac efron disney high school musical miley cyrus hilary duff
  • webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
[Figure: two example documents shown as mixtures of these topics; the mixture proportions shown are 20, 80, and 100]
8
Generative Process
  • For each document, choose a mixture of topics: θ ~ Dirichlet(α)
  • Sample a topic z ∈ {1..T} from the mixture: z ~ Multinomial(θ)
  • Sample a word from the topic: w ~ Multinomial(φ^(z)), where φ ~ Dirichlet(β)

[Plate diagram: D documents, Nd words per document, T topics]
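A minimal numpy sketch of this generative process, assuming a made-up vocabulary, number of topics, and hyperparameter values (none of which come from the slides):

```python
# Sketch of the LDA generative story: per-topic word distributions phi,
# per-document topic mixtures theta, and one topic assignment z per word.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "bank", "loan", "river", "stream"]
T, alpha, beta = 2, 0.1, 0.01                      # assumed topics and hyperparameters

phi = rng.dirichlet([beta] * len(vocab), size=T)   # phi_t ~ Dirichlet(beta): topic-word dists

def generate_document(n_words=10):
    theta = rng.dirichlet([alpha] * T)             # theta ~ Dirichlet(alpha): topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)                 # z ~ Multinomial(theta)
        w = rng.choice(vocab, p=phi[z])            # w ~ Multinomial(phi^(z))
        words.append(w)
    return words

print(generate_document())
```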
9
Prior Distributions
  • Dirichlet priors encourage sparsity on topic
    mixtures and topics

[Figure: samples of θ ~ Dirichlet(α) on the simplex over Topics 1-3 and of φ ~ Dirichlet(β) on the simplex over Words 1-3; darker colors indicate lower probability]
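A small numpy illustration (my own, not from the slides) of why small Dirichlet hyperparameters encourage sparsity:

```python
# With alpha << 1 most probability mass lands on one or two components (sparse),
# with alpha >> 1 the mass is spread roughly evenly across components.
import numpy as np

rng = np.random.default_rng(1)
print(rng.dirichlet([0.1, 0.1, 0.1]))   # e.g. [0.98, 0.01, 0.01] -- sparse
print(rng.dirichlet([10, 10, 10]))      # e.g. [0.34, 0.31, 0.35] -- near-uniform
```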
10
Toy Example
Document 1 (topic weight 1.0 on topic 1): MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ...
Document 2 (topic weights .6 and .4): RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ...
Document 3 (topic weight 1.0 on topic 2): RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2 ...
[Figure columns: Topics | Topic Weights | Documents and topic assignments; subscripts show the topic each word token was generated from]
11
Statistical Inference
Now the topics, topic weights, and topic assignments are unknown (shown as ?):
Document 1 (topic weights ?): MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ...
Document 2 (topic weights ?): RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ...
Document 3 (topic weights ?): RIVER? BANK? STREAM? BANK? RIVER? BANK? ...
[Figure columns: Topics | Topic Weights | Documents and topic assignments]
12
Statistical Inference
  • Three sets of latent variables:
  • document-topic distributions θ
  • topic-word distributions φ
  • topic assignments z
  • Estimate the posterior distribution over topic assignments
  • P( z | w )
  • we collapse over the topic mixtures and word mixtures
  • we can later infer θ and φ
  • Use approximate methods: Markov chain Monte Carlo (MCMC) with Gibbs sampling

13
Toy Example: Artificial Dataset
Two topics, 16 documents
[Figure: word-document count matrix for the 16 documents]
Can we recover the original topics and topic
mixtures from this data?
14
Initialization: assign word tokens randomly to topics
[Figure: word tokens colored by assignment to topic 1 vs. topic 2]
15
Gibbs Sampling
The sampling equation combines two counts: the count of topic t assigned to doc d and the count of word w assigned to topic t, giving the probability that word i is assigned to topic t.
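The equation itself appears only as an image in the slides; the annotations above presumably refer to the standard collapsed Gibbs update for LDA (Griffiths & Steyvers, 2004), reproduced here for reference:

```latex
P(z_i = t \mid z_{-i}, \mathbf{w}) \;\propto\;
\frac{C^{WT}_{w_i t} + \beta}{\sum_{w} C^{WT}_{w t} + W\beta}
\cdot
\frac{C^{DT}_{d_i t} + \alpha}{\sum_{t'} C^{DT}_{d_i t'} + T\alpha}
```

Here C^{WT}_{wt} is the count of word w assigned to topic t, C^{DT}_{dt} is the count of topic t assigned to document d (both excluding the current token i), W is the vocabulary size, and T is the number of topics.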
16
After 1 iteration
  • Apply sampling equation to each word token

[Figure: word tokens colored by assignment to topic 1 vs. topic 2]
17
After 4 iterations
[Figure: word tokens colored by assignment to topic 1 vs. topic 2]
18
After 8 iterations
[Figure: word tokens colored by assignment to topic 1 vs. topic 2]
19
After 32 iterations
[Figure: word tokens colored by assignment to topic 1 vs. topic 2]
20
Summary of Algorithm
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
  • topic assignments to each word, P( zi )
  • likely words in each topic, P( w | z )
  • likely topics in each document (the "gist"), P( z | d )
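A minimal sketch of this input/output summary using an off-the-shelf implementation (scikit-learn's LatentDirichletAllocation, which uses variational inference rather than the Gibbs sampler described above; the corpus and settings are made up):

```python
# INPUT: word-document counts; OUTPUT: P(z|d) per document and P(w|z) per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bank loan money money bank",
    "river stream bank water river",
    "loan money bank river stream",
]

counts = CountVectorizer().fit_transform(docs)        # word-document counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

doc_topics = lda.transform(counts)                    # P(z | d): likely topics per document
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # P(w | z)
print(doc_topics.round(2))
print(topic_words.round(2))
```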
21
Example topics from TASA an educational corpus
  • 37K docs, 26K word vocabulary
  • 300 topics, e.g.:

22
Three documents with the word "play" (numbers/colors indicate topic assignments)
23
LSA
C = U D Vᵀ
  • C: words × documents co-occurrence matrix
  • U: words × dims; D: dims × dims; Vᵀ: dims × documents

Topic model
C ≈ Φ Θ
  • C: normalized words × documents co-occurrence matrix
  • Φ: words × topics (mixture components)
  • Θ: topics × documents (mixture weights)
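A small sketch (assumed library and toy corpus, not from the slides) contrasting the two factorizations on the same count matrix: LSA's SVD produces arbitrary-sign "dimensions", while the topic model produces nonnegative topic-word weights interpretable as probabilities.

```python
# LSA: C = U D V^T via truncated SVD on raw counts (TF-IDF weighting omitted
# for brevity). Topic model: C ~ Phi Theta, with nonnegative factors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["bank loan money", "river stream bank", "money bank loan river"]
C = CountVectorizer().fit_transform(docs)

lsa = TruncatedSVD(n_components=2).fit(C)                                # LSA dimensions
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)   # topics

print(lsa.components_)      # can contain negative values: hard to interpret
print(lda.components_)      # nonnegative topic-word weights
```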
24
Documents as Topic Mixtures: a Geometric Interpretation
[Figure: the simplex of word probabilities defined by P(word1) + P(word2) + P(word3) = 1; topic 1 and topic 2 are points on the simplex, and an observed document lies between them]
25
Some Preliminary Work on Therapy Transcripts
26
Defining documents
  • A document can be defined in multiple ways:
  • all words within a therapy session
  • all words from a particular speaker within a session
  • Clearly, we need to extend the topic model to dialogue.

28
Positive/Negative Topic Usage by Group
29
Positive/Negative Topic Usage by Changes in Satisfaction
This graph shows that couples whose satisfaction decreases over the course of therapy use relatively more negative language, while those who leave therapy with increased satisfaction exhibit more positive language.
30
Topics used by Satisfied/Unsatisfied Couples
Topic 38: talk divorce problem house along separate separation talking agree example
Dissatisfied couples talk relatively more often about separation and divorce.
31
Affect Dynamics
  • Analyze the short-term dynamics of affect usage
  • Do unhappy couples follow up negative language with negative language more often than happy couples? In other words, are unhappy couples involved in a negative feedback loop?
  • Calculated (see the sketch after this list):
  • P( z2 = + | z1 = + )
  • P( z2 = + | z1 = − )
  • P( z2 = − | z1 = + )
  • P( z2 = − | z1 = − )
  • E.g., P( z2 = − | z1 = + ) is the probability that after a positive word the next non-neutral word will be a negative word
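A minimal sketch of this computation on a hypothetical sequence of non-neutral word labels (the data and the simple bigram estimator are illustrative, not the actual transcript analysis):

```python
# Estimate P(z2 | z1) for positive (+) and negative (-) words from bigram counts
# over the sequence of non-neutral word labels in a session.
from collections import Counter

labels = ["+", "+", "-", "+", "-", "-", "+", "-", "-", "-"]   # made-up session

pairs = Counter(zip(labels, labels[1:]))                      # counts of (z1, z2) transitions
for z1 in ("+", "-"):
    total = sum(pairs[(z1, z2)] for z2 in ("+", "-"))
    for z2 in ("+", "-"):
        print(f"P(z2={z2} | z1={z1}) = {pairs[(z1, z2)] / total:.2f}")
```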

32
Markov Chain Illustration
[Figure: Markov chains over positive (+) and negative (−) affect states, showing base rates and transition probabilities for Normal Controls and for couples with Positive Change, Little Change, and Negative Change in satisfaction]
33
Modeling Extensions
34
Extensions
  • Multi-label Document Classification
  • conditional topic models
  • Topic models and word order
  • n-grams/collocations
  • hidden Markov models
  • Some potential model developments
  • topic models incorporating parse trees
  • topic models for dialogue
  • topic models incorporating speech information

35
Conditional Topic Models
Assume there is a topic associated with each label/behavioral code. The model is only allowed to assign words to topics whose labels are associated with the document. This model can then learn the distribution of words associated with each label/behavioral code.
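A small sketch (not the authors' implementation) of the constraint this model imposes: the topic mixture is defined only over topics whose labels are attached to the document, so words can only be assigned to those topics. The label names are taken from the next slide; everything else is assumed.

```python
# Restrict a document's topic mixture to the topics of its attached labels.
import numpy as np

rng = np.random.default_rng(4)
all_labels = ["Vulnerability", "Hard Expression", "Neutral"]
doc_labels = {"Vulnerability", "Neutral"}             # labels/behavioral codes of this document

allowed = [i for i, lab in enumerate(all_labels) if lab in doc_labels]
theta = np.zeros(len(all_labels))
theta[allowed] = rng.dirichlet([1.0] * len(allowed))  # mixture over allowed topics only

z = rng.choice(len(all_labels), p=theta)              # topic assignment restricted to doc labels
print(all_labels[z], theta.round(2))
```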
36
[Figure, analogous to the earlier toy example: topics associated with behavioral codes, topic weights, and documents with topic assignments.
  Document 1: Vulnerability = yes, Hard Expression = no
  Document 2: Vulnerability = no, Hard Expression = yes
  Document 3: Vulnerability = yes, Hard Expression = yes]
37
Preliminary Results
38
Topic Models for short-range sequential
dependencies
39
Hidden Markov Topics Model
  • Syntactic dependencies → short-range dependencies
  • Semantic dependencies → long-range dependencies

[Graphical model: the topic mixture θ and topic assignments z1..z4 generate words w1..w4 when the model is in the semantic state (words drawn from the topic model); the hidden syntactic states s1..s4 otherwise generate words from an HMM]
(Griffiths, Steyvers, Blei, Tenenbaum, 2004)
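A rough numpy sketch of this generative story under assumed parameters (not the authors' code): a hidden state sequence follows an HMM; the designated semantic state emits a word from the document's topic mixture, while the other states emit from their own "syntactic" word distributions.

```python
# State 0 is the semantic state (emits from the topic model); states 1-2 are
# syntactic states with fixed function-word classes. All parameters are made up.
import numpy as np

rng = np.random.default_rng(2)
syntax_words = [["the", "a"], ["is", "was"]]         # emissions for syntactic states 1 and 2
topics = {0: ["bank", "loan", "money"], 1: ["river", "stream", "water"]}
trans = np.array([[.1, .6, .3],                      # HMM transition matrix (rows sum to 1)
                  [.7, .1, .2],
                  [.6, .2, .2]])

theta = rng.dirichlet([0.5, 0.5])                    # per-document topic mixture
state, sentence = 0, []
for _ in range(8):
    if state == 0:                                   # semantic state: use the topic model
        z = rng.choice(2, p=theta)
        sentence.append(rng.choice(topics[z]))
    else:                                            # syntactic state: HMM emission
        sentence.append(rng.choice(syntax_words[state - 1]))
    state = rng.choice(3, p=trans[state])
print(" ".join(sentence))
```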
40
NIPS Semantics
IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS PIXEL VISUAL
KERNEL SUPPORT VECTOR SVM KERNELS SPACE FUNCTION MACHINES SET
NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS OUTPUTS
EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
MEMBRANE SYNAPTIC CELL CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL
NIPS Syntax
IN WITH FOR ON FROM AT USING INTO OVER WITHIN
I X T N - C F P
IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
41
Random sentence generation
LANGUAGE
S RESEARCHERS GIVE THE SPEECH
S THE SOUND FEEL NO LISTENERS
S WHICH WAS TO BE MEANING
S HER VOCABULARIES STOPPED WORDS
S HE EXPRESSLY WANTED THAT BETTER VOWEL
42
Collocation Topic Model
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
43
Potential Model Developments
44
Using parse trees / POS taggers?
[Figure: parse trees (S → NP VP) for the sentences "You complete me" and "I complete you"]
45
Modeling Dialogue
46
Topic Segmentation Model
  • Purver, Kording, Griffiths, & Tenenbaum (2006). Unsupervised topic modeling for multi-party spoken discourse. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
  • Automatically segments multi-party discourse into topically coherent segments
  • Outperforms standard HMMs
  • The model does not incorporate speaker information or speaker turns
  • the goal is simply to segment a long stream of words into segments

47
At each utterance there is a probability of changing theta, the topic mixture. If no change is indicated, words are drawn from the same mixture of topics. If there is a change, the topic mixture is resampled from the Dirichlet prior.
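A minimal sketch of this segmentation mechanism with illustrative parameters (the switch probability, number of topics, and α are assumptions, not values from the paper):

```python
# At each utterance, switch with small probability; on a switch, resample the
# topic mixture theta from its Dirichlet prior, otherwise keep the current one.
import numpy as np

rng = np.random.default_rng(3)
T, alpha, p_switch = 3, 0.5, 0.2

theta = rng.dirichlet([alpha] * T)
for utterance in range(10):
    if rng.random() < p_switch:                 # topic shift: a new segment begins
        theta = rng.dirichlet([alpha] * T)
    z = rng.choice(T, p=theta, size=5)          # topic assignments for this utterance's words
    print(utterance, theta.round(2), z)
```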
48
Latent Dialogue Structure model (Ding et al., NIPS workshop, 2009)
  • Designed for modeling sequences of messages on discussion forums
  • Models the relationships between messages within documents: a message might relate to any previous message within a dialogue
  • It does not incorporate speaker-specific variables

49
Some details
50
Learning User Intentions in Spoken Dialogue Systems (Chinaei et al., ICAART, 2009)
  • Applies the HTMM model (Gruber et al., 2007) to dialogue
  • Assumes that within each talk-turn, words are drawn from the same topic z (not a mixture!). At the start of a new talk-turn, there is some probability (psi) of sampling a new topic z from the mixture theta.

51
Other ideas
  • Can we enhance topic models with non-verbal speech information?
  • Each topic is a distribution over words as well as voicing information (f0, timing, etc.)

[Plate diagram: D documents, Nd words per document, T topics, with an added non-verbal feature variable]
52
Other Extensions
53
Learning Topic Hierarchies (example: Psych Review abstracts)
[Topic hierarchy: each line shows the most probable words of one topic node]
THE OF AND TO IN A IS
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONING STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES