

1
ICS 278: Data Mining. Lecture 14: Document Clustering and Topic Extraction.
Note: many of the slides on topic models were adapted from the presentation by Griffiths and Steyvers at the National Academy of Sciences Symposium on Mapping Knowledge Domains, Beckman Center, UC Irvine, May 2003.
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

3
Document Clustering
  • Set of documents D in term-vector form
  • no class labels this time
  • want to group the documents into K groups or into a taxonomy
  • each cluster hypothetically corresponds to a topic
  • Methods: any of the well-known clustering methods
  • k-means (e.g., spherical k-means, which normalizes document vectors so that Euclidean distance matches cosine similarity; see the sketch below)
  • hierarchical clustering
  • probabilistic model-based clustering methods, e.g., mixtures of multinomials
  • single-topic versus multiple-topic models
  • extensions to author-topic models
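To make the k-means option above concrete, here is a minimal sketch of spherical k-means on TF-IDF vectors using scikit-learn; the toy corpus, K = 2, and all names are illustrative, not from the lecture.

```python
# Spherical k-means sketch: normalizing TF-IDF vectors to unit length
# makes Euclidean k-means equivalent to clustering by cosine similarity.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = [
    "the cat sat on the mat",               # toy corpus (illustrative)
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worry about the markets",
]

X = TfidfVectorizer().fit_transform(docs)   # documents in term-vector form
X = normalize(X)                            # unit length => "spherical" k-means
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                           # hypothetical topic label per document
```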

4
Mixture Model Clustering

5
Mixture Model Clustering

6
Mixture Model Clustering

Conditional-independence (naive Bayes) model within each component (often quite useful to first order); the mixture is written out below.
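For reference (the slide's equations did not survive transcription), the standard mixture-of-multinomials form this refers to: a document with term counts x = (x_1, ..., x_W) has likelihood

$$p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \, p(\mathbf{x} \mid c = k) \;\propto\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{W} \theta_{kj}^{\,x_j},$$

where the π_k are the component weights, θ_kj = p(term j | component k), and the product over terms is exactly the conditional-independence assumption.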
7
Mixtures of Documents
[Figure: a binary term-document matrix (rows = documents, columns = terms, 1 where a term occurs). The document rows split into two blocks, Component 1 and Component 2, each concentrated on its own subset of terms.]
8
[Figure: the same binary term-document matrix with the component structure unmarked; this is what the clustering algorithm actually observes.]
9
Treat as Missing
[Figure: the term-document matrix augmented with a hidden cluster label (C1 or C2) on each document row; the labels are treated as missing data.]
10
Treat as Missing
[Figure: the same matrix; each document row now carries estimated membership probabilities P(C1 | x) and P(C2 | x) in place of its hidden label.]
E-step: estimate component membership probabilities given the current parameter estimates.
11
Treat as Missing
[Figure: same as the previous slide; the membership probabilities now act as fractional weights on each document.]
M-step: use the fractionally weighted data to get new estimates of the parameters. (A compact EM sketch follows.)
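A minimal numpy sketch of the EM loop the last three slides animate, for a mixture of multinomials; the function name, smoothing constant, and iteration count are illustrative choices, not from the lecture.

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, seed=0, eps=1e-9):
    """X: (D, W) array of term counts. Returns weights, term probs, memberships."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)                    # component weights
    theta = rng.dirichlet(np.ones(W), size=K)   # p(term | component), shape (K, W)
    for _ in range(n_iter):
        # E-step: log p(x_d, k) up to the multinomial coefficient,
        # then normalize rows to get membership probabilities P(k | x_d).
        log_r = X @ np.log(theta + eps).T + np.log(pi + eps)   # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)              # stabilize exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the fractionally weighted data.
        pi = r.mean(axis=0)
        theta = r.T @ X + eps                   # expected term counts per component
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r
```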
12
A Document Cluster
Most likely terms in Component 5 (weight 0.08):

  TERM     p(t|k)
  write    0.571
  drive    0.465
  problem  0.369
  mail     0.364
  articl   0.332
  hard     0.323
  work     0.319
  system   0.303
  good     0.296
  time     0.273

Highest-lift terms in Component 5 (weight 0.08; lift = p(t|k) / p(t)):

  TERM     LIFT  p(t|k)  p(t)
  scsi     7.7   0.13    0.02
  drive    5.7   0.47    0.08
  hard     4.9   0.32    0.07
  card     4.2   0.23    0.06
  format   4.0   0.12    0.03
  softwar  3.8   0.21    0.05
  memori   3.6   0.14    0.04
  install  3.6   0.14    0.04
  disk     3.5   0.12    0.03
  engin    3.3   0.21    0.06
13
Another Document Cluster
Most likely terms in Component 1 (weight 0.11):

  TERM       p(t|k)
  articl     0.684
  good       0.368
  dai        0.363
  fact       0.322
  god        0.320
  claim      0.294
  apr        0.279
  fbi        0.256
  christian  0.256
  group      0.239

Highest-lift terms in Component 1 (weight 0.11):

  TERM       LIFT  p(t|k)  p(t)
  fbi        8.3   0.26    0.03
  jesu       5.5   0.16    0.03
  fire       5.2   0.20    0.04
  christian  4.9   0.26    0.05
  evid       4.8   0.24    0.05
  god        4.6   0.32    0.07
  gun        4.2   0.17    0.04
  faith      4.2   0.12    0.03
  kill       3.8   0.22    0.06
  bibl       3.7   0.11    0.03
14
A topic is represented as a (multinomial)
distribution over words
Example topic 1              Example topic 2
SPEECH          .0691        WORDS          .0671
RECOGNITION     .0412        WORD           .0557
SPEAKER         .0288        USER           .0230
PHONEME         .0224        DOCUMENTS      .0205
CLASSIFICATION  .0154        TEXT           .0195
SPEAKERS        .0140        RETRIEVAL      .0152
FRAME           .0135        INFORMATION    .0144
PHONETIC        .0119        DOCUMENT       .0144
PERFORMANCE     .0111        LARGE          .0102
ACOUSTIC        .0099        COLLECTION     .0098
BASED           .0098        KNOWLEDGE      .0087
PHONEMES        .0091        MACHINE        .0080
UTTERANCES      .0091        RELEVANT       .0077
SET             .0089        SEMANTIC       .0076
LETTER          .0088        SIMILARITY     .0071


15
The basic model.
[Graphical model: a single class variable C generating the observed words X1, X2, ..., Xd (the naive Bayes / single-topic mixture).]
16
A better model.
[Graphical model: multiple latent variables A, B, C jointly generating the observed words X1, X2, ..., Xd.]
17
A better model.
[Same graphical model: latent variables A, B, C over the observed words X1, X2, ..., Xd.]
History: latent class models in statistics; Hofmann applied them to documents (SIGIR '99); recent extensions, e.g., Blei, Ng, and Jordan (JMLR, 2003); variously known as factor/aspect/latent class models.
18
A better model.
[Same graphical model as the previous slide.]
Inference can be intractable due to undirected loops!
19
A better model for documents.
  • Multi-topic model
  • a document is generated from multiple components
  • multiple components can be active at once
  • each component is a multinomial distribution over words
  • parameter estimation is tricky
  • very useful: parses documents into high-level semantic components

20
History of multi-topic models
  • Latent class models in statistics
  • Hofmann (1999): original application to documents
  • Blei, Ng, and Jordan (2001, 2003): variational methods
  • Griffiths and Steyvers (2003): Gibbs sampling approach (very efficient)

21
Topic 1                Topic 2                  Topic 3                    Topic 4
GROUP      0.057185    DYNAMIC     0.152141     DISTRIBUTED    0.192926    RESEARCH    0.066798
MULTICAST  0.051620    STRUCTURE   0.137964     COMPUTING      0.044376    SUPPORTED   0.043233
INTERNET   0.049499    STRUCTURES  0.088040     SYSTEMS        0.038601    PART        0.035590
PROTOCOL   0.041615    STATIC      0.043452     SYSTEM         0.031797    GRANT       0.034476
RELIABLE   0.020877    PAPER       0.032706     HETEROGENEOUS  0.030996    SCIENCE     0.023250
GROUPS     0.019552    DYNAMICALLY 0.023940     ENVIRONMENT    0.023163    FOUNDATION  0.022653
PROTOCOLS  0.019088    PRESENT     0.015328     PAPER          0.017960    FL          0.021220
IP         0.014980    META        0.015175     SUPPORT        0.016587    WORK        0.021061
TRANSPORT  0.012529    CALLED      0.011669     ARCHITECTURE   0.016416    NATIONAL    0.019947
DRAFT      0.009945    RECURSIVE   0.010145     ENVIRONMENTS   0.013271    NSF         0.018116
Content components
Boilerplate components
22
Topic 5                  Topic 6                    Topic 7                Topic 8
DIMENSIONAL 0.038901     RULES          0.090569    ORDER     0.192759     GRAPH     0.095687
POINTS      0.037263     CLASSIFICATION 0.062699    TERMS     0.048688     PATH      0.061784
SURFACE     0.031438     RULE           0.062174    PARTIAL   0.044907     GRAPHS    0.061217
GEOMETRIC   0.025006     ACCURACY       0.028926    HIGHER    0.041284     PATHS     0.030151
SURFACES    0.020152     ATTRIBUTES     0.023090    REDUCTION 0.035061     EDGE      0.028590
MESH        0.016875     INDUCTION      0.021909    PAPER     0.028602     NUMBER    0.022775
PLANE       0.013902     CLASSIFIER     0.019418    TERM      0.018204     CONNECTED 0.016817
POINT       0.013780     SET            0.018303    ORDERING  0.017652     DIRECTED  0.014405
GEOMETRY    0.013780     ATTRIBUTE      0.016204    SHOW      0.017022     NODES     0.013625
PLANAR      0.012385     CLASSIFIERS    0.015417    MAGNITUDE 0.015526     VERTICES  0.013554

Topic 9                    Topic 10                Topic 11               Topic 12
INFORMATION   0.281237     SYSTEM     0.143873     PAPER      0.077870    LANGUAGE    0.158786
TEXT          0.048675     FILE       0.054076     CONDITIONS 0.041187    PROGRAMMING 0.097186
RETRIEVAL     0.044046     OPERATING  0.053963     CONCEPT    0.036268    LANGUAGES   0.082410
SOURCES       0.029548     STORAGE    0.039072     CONCEPTS   0.033457    FUNCTIONAL  0.032815
DOCUMENT      0.029000     DISK       0.029957     DISCUSSED  0.027414    SEMANTICS   0.027003
DOCUMENTS     0.026503     SYSTEMS    0.029221     DEFINITION 0.024673    SEMANTIC    0.024341
RELEVANT      0.018523     KERNEL     0.028655     ISSUES     0.024603    NATURAL     0.016410
CONTENT       0.016574     ACCESS     0.018293     PROPERTIES 0.021511    CONSTRUCTS  0.014129
AUTOMATICALLY 0.009326     MANAGEMENT 0.017218     IMPORTANT  0.021370    GRAMMAR     0.013640
DIGITAL       0.008777     UNIX       0.016878     EXAMPLES   0.019754    LISP        0.010326
23
Topic 13                  Topic 14                 Topic 15                   Topic 16
MODEL        0.429185     PAPER       0.050411     TYPE           0.088650    KNOWLEDGE   0.212603
MODELS       0.201810     APPROACHES  0.045245     SPECIFICATION  0.051469    SYSTEM      0.090852
MODELING     0.066311     PROPOSED    0.043132     TYPES          0.046571    SYSTEMS     0.051978
QUALITATIVE  0.018417     CHANGE      0.040393     FORMAL         0.036892    BASE        0.042277
COMPLEX      0.009272     BELIEF      0.025835     VERIFICATION   0.029987    EXPERT      0.020172
QUANTITATIVE 0.005662     ALTERNATIVE 0.022470     SPECIFICATIONS 0.024439    ACQUISITION 0.017816
CAPTURE      0.005301     APPROACH    0.020905     CHECKING       0.024439    DOMAIN      0.016638
MODELED      0.005301     ORIGINAL    0.019026     SYSTEM         0.023259    INTELLIGENT 0.015737
ACCURATELY   0.004639     SHOW        0.017852     PROPERTIES     0.018242    BASES       0.015390
REALISTIC    0.004278     PROPOSE     0.016991     ABSTRACT       0.016826    BASED       0.014004
Style components
24
A generative model for documents
  • Each document is a mixture of topics
  • each word is chosen from a single topic
  • each word w_i is drawn from P(w | z_i = j), a multinomial with parameters φ(j)
  • each topic z_i is drawn from P(z), a multinomial with parameters θ(d) for document d (see the mixture written out below)

(Blei, Ng, Jordan, 2003)
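In symbols (reconstructed here in the standard notation, since the slide's equations were images): the probability of word i in document d mixes over the T topics,

$$P(w_i) \;=\; \sum_{j=1}^{T} P(w_i \mid z_i = j)\,P(z_i = j) \;=\; \sum_{j=1}^{T} \phi^{(j)}_{w_i}\,\theta^{(d)}_{j},$$

where φ(j) is topic j's multinomial over words and θ(d) is document d's multinomial over topics.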
25
A generative model for documents
[Graphical model: for each word, the per-document topic mixture θ (rendered as q in the transcript) generates a topic z, which generates the word w.]
  • Called Latent Dirichlet Allocation (LDA)
  • Introduced by Blei, Ng, and Jordan (2003),
    reinterpretation of PLSI (Hofmann, 2001)

26
(Dumais, Landauer)
[Figure: P(w); the rest of this slide did not survive transcription.]
27
A generative model for documents
P(w | z = 1) = φ(1) (topic 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS 0.0
P(w | z = 2) = φ(2) (topic 2): HEART, LOVE, SOUL, TEARS, JOY 0.0; SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
28
Choose mixture weights θ = (P(z = 1), P(z = 2)) for each document, then generate a bag of words. Sampled documents for five settings of θ (document boundaries reconstructed approximately from the slide layout):
  θ = (0, 1): MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE
  θ = (0.25, 0.75): MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
  θ = (0.5, 0.5): MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
  θ = (0.75, 0.25): WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
  θ = (1, 0): TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
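The toy experiment above is easy to reproduce. The two topic distributions below are from the slides; the rest of this sketch (seed, document length) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = np.array([[.2, .2, .2, .2, .2, 0, 0, 0, 0, 0],    # topic 1
                [0, 0, 0, 0, 0, .2, .2, .2, .2, .2]])   # topic 2

def generate_doc(theta, n_words=10):
    """theta = (P(z = 1), P(z = 2)), the document's mixture weights."""
    z = rng.choice(2, size=n_words, p=theta)             # a topic per word
    return [vocab[rng.choice(10, p=phi[t])] for t in z]  # a word from that topic

for theta in ([0.0, 1.0], [0.5, 0.5], [1.0, 0.0]):
    print(theta, " ".join(generate_doc(np.array(theta))))
```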
29
Bayesian inference
  • The sum in the denominator runs over T^n possible topic assignments
  • so the full posterior is tractable only up to a normalizing constant

30
Bayesian sampling
  • Sample from a Markov chain constructed to converge to the target distribution of interest
  • known in general as Markov chain Monte Carlo (MCMC)
  • a simple version is known as Gibbs sampling
  • say we are interested in estimating p(x, y | D)
  • we can approximate this by sampling from p(x | y, D) and p(y | x, D) in an iterative fashion
  • useful when the conditionals are known, but the joint distribution is not easy to work with
  • converges to the true distribution under fairly broad assumptions
  • lets us compute approximate statistics from intractable distributions (a toy sketch follows)
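A toy illustration of the scheme just described (not from the slides): Gibbs sampling for a bivariate normal with correlation rho, where both conditionals are simple univariate normals.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                         # correlation of the target bivariate normal
x = y = 0.0
samples = []
for t in range(5000):
    # For a standard bivariate normal, p(x | y) = N(rho * y, 1 - rho^2),
    # and symmetrically for p(y | x): sample each in turn.
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if t >= 500:                  # discard burn-in
        samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # should be close to rho
```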

31
Gibbs sampling
  • Need the full conditional distributions for the variables we sample
  • since we only sample z, we need P(z_i = j | z_-i, w), which depends on two count tables:
  • the number of times word w is assigned to topic j
  • the number of times topic j is used in document d
  (the resulting update is reconstructed below)
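The update the slide refers to, reconstructed here: it is the standard collapsed-Gibbs conditional of Griffiths and Steyvers, with Dirichlet hyperparameters α and β, vocabulary size W, and T topics:

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}\;\cdot\;\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},$$

where $n^{(w_i)}_{-i,j}$ counts how often word $w_i$ is assigned to topic $j$ and $n^{(d_i)}_{-i,j}$ how often topic $j$ is used in document $d_i$, both excluding the current token $i$.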
32-40
Gibbs sampling
[Animation frames: the topic assignments z being resampled over iterations 1, 2, ..., 1000.]
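A compact numpy sketch of the sampler those frames animate, implementing the conditional above; the corpus encoding, hyperparameters, and iteration count are illustrative.

```python
import numpy as np

def lda_gibbs(docs, W, T, alpha=0.1, beta=0.01, n_iter=1000, seed=0):
    """docs: list of lists of word ids in [0, W). Returns final topic assignments."""
    rng = np.random.default_rng(seed)
    nwt = np.zeros((W, T))               # word-topic counts
    ndt = np.zeros((len(docs), T))       # document-topic counts
    nt = np.zeros(T)                     # total tokens per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial topics
    for d, doc in enumerate(docs):       # seed the count tables
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d, z[d][i]] += 1; nt[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]              # remove token i from the counts
                nwt[w, j] -= 1; ndt[d, j] -= 1; nt[j] -= 1
                # Full conditional: word-topic term times document-topic term
                # (the document denominator is constant in j, so it cancels).
                p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j              # reassign and restore the counts
                nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1
    return z
```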
41
A visual example: bars
  • sample each pixel from a mixture of topics
  • correspondence: pixel = word, image = document
42
(No Transcript)
43
(No Transcript)
44
Interpretable decomposition
  • SVD gives a basis for the data, but not an
    interpretable one
  • The true basis is not orthogonal, so rotation
    does no good

45
Bayesian model selection
  • How many topics T do we need?
  • A Bayesian would consider the posterior P(T | w) ∝ P(w | T) P(T)
  • P(w | T) involves summing over all possible assignments z
  • but it can be approximated by sampling (one standard estimator is shown below)
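One standard estimator (the one used by Griffiths and Steyvers in their PNAS work) approximates the marginal likelihood by the harmonic mean of the likelihoods of S sampled assignments:

$$P(\mathbf{w} \mid T) \;\approx\; \left(\frac{1}{S}\sum_{s=1}^{S} \frac{1}{P(\mathbf{w} \mid \mathbf{z}^{(s)}, T)}\right)^{-1}.$$

This is cheap because P(w | z, T) factorizes given the assignments, though the harmonic-mean estimator is known to have high variance.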
46-48
Bayesian model selection
[Animation frames: comparing P(w | T) on the corpus w for models with T = 10 and T = 100 topics.]
49
Back to the bars data set
50
PNAS corpus preprocessing
  • Used all D = 28,154 abstracts from 1991-2001
  • used any word occurring in at least five abstracts and not on the stop list (W = 20,551)
  • segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus
  • also used the PNAS class designations for 2001

51
Running the algorithm
  • Memory requirements linear in T(W + D); runtime proportional to nT
  • T = 50, 100, 200, 300, 400, 500, 600, (1000)
  • ran 8 chains for each T, with a burn-in of 1000 iterations and 10 samples per chain at a lag of 100
  • all runs completed in under 30 hours on the Blue Horizon supercomputer at San Diego

52
A selection of topics
  • STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
  • NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
  • TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
  • MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
  • HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
  • FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
53
A selection of topics
  • STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
  • MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
  • MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
  • CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
  • ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
  • PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
  • MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
54
A selection of topics
(Same topics as the previous slide.)
55
How many topics?
56
(No Transcript)
57
Scientific syntax and semantics
Factorization of language based on statistical dependency patterns:
  • long-range, document-specific dependencies → semantics: probabilistic topics
  • short-range dependencies, constant across all documents → syntax: a probabilistic regular grammar
[Graphical model: the per-document topic mixture θ generates a topic z for each word w; a hidden chain of syntactic classes x decides, position by position, whether each word comes from its topic or from a syntactic class.]
58
[Figure: a small probabilistic regular grammar over three word classes.
  Class x = 1 (semantic): choose topic z = 1 (probability 0.4: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2) or z = 2 (probability 0.6: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2).
  Class x = 2 (function words): OF 0.6, FOR 0.3, BETWEEN 0.1.
  Class x = 3 (determiners): THE 0.6, A 0.3, MANY 0.1.
  Transition probabilities among the classes: 0.8, 0.7, 0.9, 0.3, 0.2, 0.1 (arrow placement not recoverable from the transcript).]
59
[Same figure as the previous slide; generated text so far: THE]
60
[Same figure; generated text so far: THE LOVE]
61
[Same figure; generated text so far: THE LOVE OF]
62
[Same figure; generated text so far: THE LOVE OF RESEARCH]
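A sketch of the generative walk those frames show. The class and topic distributions are from the figure; the transition matrix below is an illustrative stand-in, since the figure's arrows were not recoverable from the transcript.

```python
import numpy as np

rng = np.random.default_rng(1)
# Class 1 is the semantic slot: pick a topic z, then a word from that topic.
topics = {1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
          2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]}
theta = [0.4, 0.6]                       # P(z = 1), P(z = 2), from the figure
classes = {2: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1]),
           3: (["THE", "A", "MANY"], [0.6, 0.3, 0.1])}
A = {1: [0.1, 0.7, 0.2],                 # illustrative transitions over classes
     2: [0.8, 0.0, 0.2],                 # (rows sum to 1; the actual arrow
     3: [0.9, 0.1, 0.0]}                 # weights in the figure were ambiguous)

x, out = 3, []                           # start in the determiner class
for _ in range(4):
    if x == 1:                           # semantic class: topic, then word
        zz = rng.choice([1, 2], p=theta)
        out.append(rng.choice(topics[zz]))
    else:                                # syntactic class: word directly
        words, probs = classes[x]
        out.append(rng.choice(words, p=probs))
    x = rng.choice([1, 2, 3], p=A[x])    # HMM transition to the next class
print(" ".join(out))                     # a four-word sample in the style of
                                         # "THE LOVE OF RESEARCH"
```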
63
Semantic topics
64
Syntactic classes
[Table: the most probable words in seven syntactic classes (5, 8, 14, 25, 26, 30, 33), reconstructed from the flattened layout; the lowest rows of the original table were ambiguous and are elided.]
  Class 5: IN, FOR, ON, BETWEEN, DURING, AMONG, FROM, UNDER, WITHIN, THROUGHOUT, THROUGH, TOWARD, INTO, AT, ...
  Class 8: ARE, WERE, WAS, IS, WHEN, REMAIN, REMAINS, REMAINED, PREVIOUSLY, BECOME, BECAME, BEING, BUT, GIVE, ...
  Class 14: THE, THIS, ITS, THEIR, AN, EACH, ONE, ANY, INCREASED, EXOGENOUS, OUR, RECOMBINANT, ENDOGENOUS, TOTAL, ...
  Class 25: SUGGEST, INDICATE, SUGGESTING, SUGGESTS, SHOWED, REVEALED, SHOW, DEMONSTRATE, INDICATING, PROVIDE, SUPPORT, INDICATES, PROVIDES, INDICATED, ...
  Class 26: LEVELS, NUMBER, LEVEL, RATE, TIME, CONCENTRATIONS, VARIETY, RANGE, CONCENTRATION, DOSE, FAMILY, SET, FREQUENCY, SERIES, ...
  Class 30: RESULTS, ANALYSIS, DATA, STUDIES, STUDY, FINDINGS, EXPERIMENTS, OBSERVATIONS, HYPOTHESIS, ANALYSES, ASSAYS, POSSIBILITY, MICROSCOPY, PAPER, ...
  Class 33: BEEN, MAY, CAN, COULD, WELL, DID, DOES, DO, MIGHT, SHOULD, WILL, WOULD, MUST, CANNOT, ...
65
(PNAS, 1991, vol. 88, 4874-4876) A23
generalized49 fundamental11 theorem20 of4
natural46 selection46 is32 derived17 for5
populations46 incorporating22 both39 genetic46
and37 cultural46 transmission46. The14
phenotype15 is32 determined17 by42 an23
arbitrary49 number26 of4 multiallelic52 loci40
with22 two39-factor148 epistasis46 and37 an23
arbitrary49 linkage11 map20, as43 well33 as43
by42 cultural46 transmission46 from22 the14
parents46. Generations46 are8 discrete49 but37
partially19 overlapping24, and37 mating46 may33
be44 nonrandom17 at9 either39 the14 genotypic46
or37 the14 phenotypic46 level46 (or37 both39).
I12 show34 that47 cultural46 transmission46 has18
several39 important49 implications6 for5 the14
evolution46 of4 population46 fitness46, most36
notably4 that47 there41 is32 a23 time26 lag7 in22
the14 response28 to31 selection46 such9 that47
the14 future137 evolution46 depends29 on21 the14
past24 selection46 history46 of4 the14
population46.
(Superscripts are the class/topic assignments; in the original slide, gray level indicated semanticity, i.e., the probability of generating the word from the LDA topics rather than the HMM syntax classes.)
66
(PNAS, 1996, vol. 93, 14628-14631) The14
''shape7'' of4 a23 female115 mating115
preference125 is32 the14 relationship7 between4
a23 male115 trait15 and37 the14 probability7 of4
acceptance21 as43 a23 mating115 partner20, The14
shape7 of4 preferences115 is32 important49 in5
many39 models6 of4 sexual115 selection46, mate115
recognition125, communication9, and37
speciation46, yet50 it41 has18 rarely19 been33
measured17 precisely19, Here12 I9 examine34
preference7 shape7 for5 male115 calling115
song125 in22 a23 bushcricket13 (katydid48).
Preferences115 change46 dramatically19 between22
races46 of4 a23 species15, from22 strongly19
directional11 to31 broadly19 stabilizing45 (but50
with21 a23 net49 directional46 effect46),
Preference115 shape46 generally19 matches10 the14
distribution16 of4 the14 male115 trait15, This41
is32 compatible29 with21 a23 coevolutionary46
model20 of4 signal9-preference115 evolution46,
although50 it41 does33 not37 rule20 out17 an23
alternative11 model20, sensory125
exploitation150. Preference46 shapes40 are8
shown35 to31 be44 genetic11 in5 origin7.
67
(Same tagged abstract as the previous slide.)
68
End of the presentation on topic models; we now switch to the author-topic model.
69
Recent Results on Author-Topic Models
70
Authors
Words
Can we model authors, given documents? (more
generally, build statistical profiles of
entities given sparse observed data)
71
Authors
Hidden Topics
Words
Model: author-topic distributions and topic-word distributions; parameters are learned via Bayesian learning. (A generative sketch follows.)
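A sketch of the generative process this defines, following the author-topic model of Rosen-Zvi, Griffiths, Steyvers, and Smyth: each word is attributed to one of the document's authors, chosen uniformly. All sizes and distributions below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
A, T, W = 3, 4, 100                          # authors, topics, vocab (illustrative)
theta = rng.dirichlet(np.ones(T), size=A)    # author-topic distributions
phi = rng.dirichlet(np.ones(W), size=T)      # topic-word distributions

def generate_doc(author_ids, n_words=20):
    """For each word: pick an author uniformly, a topic from that author's
    distribution, then a word from that topic."""
    words = []
    for _ in range(n_words):
        a = rng.choice(author_ids)           # author responsible for this word
        zz = rng.choice(T, p=theta[a])       # topic from the author's mixture
        words.append(rng.choice(W, p=phi[zz]))
    return words

print(generate_doc([0, 2]))                  # word ids for a doc by authors 0 and 2
```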
72-77
Authors
Hidden Topics
Words
[Animation frames: the author-topic graphical model shown step by step.]