Title: Concept and Theme Discovery through Probabilistic Models and Clustering
1Concept and Theme Discovery through Probabilistic
Models and Clustering
- Qiaozhu Mei
- Oct. 12, 2005
2Concepts and Themes
- Language units in biology literature mining
- Terms
- Phrases
- Entities
- Concepts (tight groups of terms/entities representing semantics, e.g. gene synonyms)
- Themes (loose groups of terms representing topics/subtopics)
3Theme Discovery
- What we've got now
- A generative model to extract k themes from a collection
- Each theme is a language model, represented by the top-probability words in that theme language model
- KL divergence to measure the distance/similarity between themes
- Retrieve the themes most similar to a term group
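A minimal sketch of the KL-divergence comparison mentioned on this slide, assuming each theme is stored as a word-to-probability dictionary. The function names, the smoothing constant `epsilon`, and the toy probabilities are illustrative assumptions, not taken from the actual system.

```python
import math

def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q) for two word distributions given as dicts word -> probability.

    Both distributions are smoothed with `epsilon` over the joint vocabulary
    and renormalized, so words missing from one model never produce log(0).
    """
    vocab = set(p) | set(q)
    p_s = {w: p.get(w, 0.0) + epsilon for w in vocab}
    q_s = {w: q.get(w, 0.0) + epsilon for w in vocab}
    p_z = sum(p_s.values())
    q_z = sum(q_s.values())
    return sum(
        (p_s[w] / p_z) * math.log((p_s[w] / p_z) / (q_s[w] / q_z))
        for w in vocab
    )

def most_similar_themes(query, themes, top_n=1):
    """Rank theme names by ascending KL(query || theme)."""
    ranked = sorted(themes, key=lambda name: kl_divergence(query, themes[name]))
    return ranked[:top_n]

# Toy theme language models (made-up probabilities, for illustration only).
themes = {
    "mites": {"varroa": 0.5, "mite": 0.3, "brood": 0.2},
    "learning": {"learning": 0.4, "memory": 0.4, "olfactory": 0.2},
}
# A term group turned into a uniform distribution over its terms.
query = {"varroa": 0.5, "mite": 0.5}
best = most_similar_themes(query, themes)
print(best)
```

KL divergence is asymmetric, so the direction (query against theme, or the reverse) is a design choice; the sketch uses KL(query || theme).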
4Theme Discovery (cont.)
- What we've got now (cont.)
- Use an HMM to segment the whole collection with the extracted themes
- Use MMR to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned)
- Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
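A minimal sketch of greedy MMR phrase selection as described on this slide, assuming a relevance score per phrase (for example an n-gram probability) and an edit-distance-based similarity. The normalization of edit distance into [0, 1], the weight `lam`, and the toy scores are assumptions for illustration, not the tuned setup.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Map edit distance to a similarity in [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def mmr_select(relevance, k, lam=0.5):
    """Greedy MMR: pick k phrases, trading relevance against redundancy.

    relevance: dict phrase -> relevance score.
    lam: weight on relevance; (1 - lam) weights the redundancy penalty.
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(phrase):
            # Redundancy is the similarity to the closest already-selected phrase.
            redundancy = max((similarity(phrase, s) for s in selected), default=0.0)
            return lam * relevance[phrase] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Made-up relevance scores, for illustration only.
scores = {"juvenile hormone": 0.9, "juvenile hormones": 0.85, "vibration signal": 0.6}
picked = mmr_select(scores, k=2)
print(picked)
```

With these toy scores the near-duplicate "juvenile hormones" is skipped in favour of "vibration signal", which is the behaviour MMR is meant to produce.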
5Some justifications
- Fly collection
- Cluster 0: circadian
- Cluster 1: adh, evolution
- Cluster 2: a mixture of two topics, apoptosis and promoters
- Cluster 6: brain development
- Cluster 8: cell division
- Cluster 12: drosophila immunity
- Cluster 13: nervous systems
- Cluster 14: hedgehog, segment polarity gene
- Cluster 16: histone, Polycomb
- Cluster 17: visual system
6Theme Discovery (cont.)
- Problems
- How to select k? (How many themes do we believe there are in the collection? The bee collection should have a smaller k than the fly collection.)
- Can we find themes in a hierarchical manner?
- This could solve the former problem; however, when to cut off?
- How to represent a theme?
- Top words: sometimes difficult to tell the semantics
- Phrases?
- Sentences?
- Other possible approaches to extract themes? (LDA, clustering methods)
7Hierarchical Theme Discovery
- A straightforward approach (top-down splitting)
- Discover k themes from the initial collection
- Segment the collection by the k themes
- For each theme, build a sub-collection from the segments of the previous step
- For each sub-collection, extract k themes
- Repeat these steps iteratively
- Problem: when to stop splitting?
[Figure: tree diagram. Collection splits into Theme1, Theme2, Theme3; Theme2 splits into Theme2.1, Theme2.2, Theme2.3]
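The top-down splitting loop on this slide can be sketched as a recursion. In the sketch below, `toy_extract_themes` and `toy_segment` are hypothetical stand-ins (frequent words as themes, majority-word assignment as segmentation) for the generative model and the HMM segmentation, and `max_depth` / `min_docs` are assumed stopping rules, since the slide leaves the stopping criterion open.

```python
from collections import Counter

def toy_extract_themes(docs, k):
    """Stand-in for the generative model: the k most frequent words act as themes."""
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(k)]

def toy_segment(docs, themes):
    """Stand-in for HMM segmentation: assign each doc to the theme word it uses most."""
    groups = {theme: [] for theme in themes}
    for doc in docs:
        best = max(themes, key=lambda theme: doc.count(theme))
        if doc.count(best) > 0:
            # Drop the theme word itself so the next level can find finer topics.
            groups[best].append([w for w in doc if w != best])
    return groups

def split_themes(docs, k, max_depth, min_docs=2, depth=0):
    """Top-down splitting; stops at max_depth or when a sub-collection is too small."""
    if depth >= max_depth or len(docs) < min_docs:
        return {}
    themes = toy_extract_themes(docs, k)
    if not themes:
        return {}
    return {
        theme: split_themes(sub_docs, k, max_depth, min_docs, depth + 1)
        for theme, sub_docs in toy_segment(docs, themes).items()
    }

# Tiny made-up collection: each document is a list of words.
docs = [
    ["mite", "mite", "varroa"],
    ["mite", "mite", "brood"],
    ["queen", "queen", "pheromone"],
    ["queen", "queen", "egg"],
]
tree = split_themes(docs, k=2, max_depth=2)
print(tree)
```

The nested dictionary mirrors the tree in the figure: each key is a theme and its value holds that theme's sub-themes.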
8Hierarchical Theme Discovery (results)
A bee collection with 929 documents
Level 1: 5 themes
Level 2: 3 sub-themes for each higher-level theme
9Hierarchical Theme Discovery (results)
- african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
- queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
- pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
- learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
- varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality
10Hierarchical Theme Discovery (results)
Parent theme: african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
Sub-themes:
- venom reward patients naja kda proteins wasp protein diptera pla2 vespula primates hominidae chordata vertebrata mug sting sperm dose quality
- african european population populations patterns pattern genetic discrimination mitochondrial studies information are contrast green two bees have derived africa subspecies
- larvae microorganisms gram bacteria 0 colonies royal queen jelly eubacteria non workers queens production 2 nest italian 5 fraction nestmates
11Hierarchical Theme Discovery (results)
Parent theme: queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
Sub-themes:
- food foragers dance transfer enzyme biosynthesis receivers contrast nectar flight source flow water information rates ddt rj caucasian visual green
- queen worker workers colonies pollen vibration eggs foraging development brood signal queens bees anarchistic behavioral iridaceae larvae egg pheromone may
- mammals vertebrates venom nonhuman l ml models model chordates beeswax mug omega embryo mammalia vertebrata has chordata nurse coloured vg
12Hierarchical Theme Discovery (results)
Parent theme: pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
Sub-themes:
- ecology is species environmental sciences flowering floral terrestrial pollinator visiting reproduction plants c cashew self animalia food insects faba size
- seed per crop sunflower number cruciferae fruit hybrid agriculture seeds quality cultivar weight helianthus oilseed compositae annuus yield pollination set
- pollen eep honeybees mating bumblebees sp hive bacteria scent mimosa brazil undertakers chromatography marks recently gram eubacteria caraway microorganisms propolis
13Hierarchical Theme Discovery (results)
Parent theme: learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
Sub-themes:
- bees sucrose conditioning response learning extension proboscis pollen foragers performance between thresholds honeybees solution discrimination strain rate foraging concentration low
- dopamine levels development age binding pupal brain octopamine division adult colonies labor glass treated colony ryr pigmentation chromosomes arolium da
- imidacloprid current memory mushroom neurons 1 expressed 4 cells antennal mb bodies currents nervous brain mv kinase receptors term protein
14Hierarchical Theme Discovery (results)
Parent theme: varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality
Sub-themes:
- mite varroa mites brood jacobsoni acarina colonies parasite for worker control a drone formic population acid host 0 cells treatment
- viruses larvae microorganisms virus bacteria animal paenibacillus infection molecular pathogen eubacteria gram forming endospore positives p apv entomopathogen
- pollen bees foragers their or ta heat at hygienic foraging protein activity behaviour increased response blood flight strips metabolic removal
15Phrase Representations
Theme (top words): queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
Phrases: biochemistry and molecular biophysics endocrine system chemical coordination and homeostasis molecular genetics biochemistry and molecular biophysics sense organs sensory reception animals arthropods chordates insects invertebrates mammals system chemical coordination and homeostasis vertebrata chordata animalia honey bee behavior terrestrial ecology mammalia vertebrata chordata animalia juvenile hormone queen rodentia mammalia vertebrata chordata animalia worker laid eggs vibration signal genetics biochemistry and molecular biophysics dufour's gland mammals nonhuman mammals workers egg laying queen mandibular gland pheromone nonhuman vertebrates iridaceae ixia arthropoda invertebrata animalia muridae aves vertebrata chordata animalia mug ml
16Hierarchical Theme Discovery (cont.)
- A bottom-up agglomerative approach
- Find many micro-themes
- Group similar micro-themes into larger ones
- Borrow a strategy from data mining
- BIRCH: incrementally form many micro-clusters, organized in a tree structure
- Macro-clustering based on the micro-clusters
- Problem: again, when to stop?
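A minimal sketch of the bottom-up idea: start from micro-themes and repeatedly merge the most similar pair. This is plain agglomerative merging with cosine similarity over word counts, not BIRCH's tree of micro-clusters; the `min_similarity` threshold is an assumed stopping rule and the counts are made up.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def merge_micro_themes(micro_themes, min_similarity=0.3):
    """Merge the most similar pair until no pair reaches min_similarity."""
    clusters = [Counter(m) for m in micro_themes]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cosine(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < min_similarity:
            break  # assumed stopping rule: remaining themes are too dissimilar
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

# Made-up micro-themes as word counts, for illustration only.
micro = [
    {"varroa": 3, "mite": 2},
    {"mite": 3, "brood": 1},
    {"learning": 2, "memory": 2},
    {"memory": 3, "olfactory": 1},
]
macro = merge_micro_themes(micro)
print([sorted(c) for c in macro])
```

The threshold plays the role of the "when to stop?" question on the slide: a lower value yields fewer, broader themes.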
17Hierarchical Theme Discovery (cont.)
- Model-based approach
- Hofmann, IJCAI '99.
- Assume we know the collection is generated from a hierarchical structure, and use a generative model to learn the themes (e.g. make use of GO hierarchies)
- Problem: in most cases we don't know the hierarchies.
18Other Research Problems
- Represent a theme
- Using top words: where to cut?
- Using phrases: have to tune the MMR (many possible strategies and parameters to tune)
- Using sentences? Like summarization
- Themes are interesting, but how to make use of the themes?
- How to evaluate themes?
19Concept Extraction
- What we have now
- N-gram algorithm (actually 2-gram): iteratively group the pair of terms that are most likely to be replaceable, considering a context of one term before/after each
- Time complexity O(N^3), space complexity currently O(N^2). The Beespace server can handle < 9000 terms now (2.4 GB memory). (Performance not evaluated because the data size it can accept is small.)
- Problem: based on mutual information, which prefers 2-grams with low frequency. Doesn't make use of farther context.
- Will removing stop words help or hurt the performance?
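To illustrate the low-frequency bias noted on this slide, here is a small sketch that computes pointwise mutual information (PMI) for adjacent word pairs. It is not the grouping algorithm itself, only the scoring ingredient, and the token stream is made up.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI for each adjacent pair: log( p(x, y) / (p(x) * p(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }

# Made-up token stream: one frequent pair and one pair seen only once.
tokens = ("honey bee " * 10 + "dufour gland").split()
pmi = bigram_pmi(tokens)
print(pmi[("honey", "bee")], pmi[("dufour", "gland")])
```

The pair seen once scores higher than the pair seen ten times, because both of its words are rare. This is the behaviour that makes sparse data a problem for a purely MI-based criterion.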
20Some findings
- A small dataset (200 abstracts containing gene synonyms)
- Only 600 iterations (600 merges)
- Most of the merged pairs are reasonable, but not really useful
- E.g. head-to-head, tail-to-tail
- E.g. within-locus, between-locus
- FBgn0000017: Dsrc, Dabl
- FBgn0000078: amylase-null, AMY-null
- Problem: the document set is too small and the n-grams too sparse to find useful concepts.
21Concept Extraction (cont.)
- Other possible strategies
- Lin et al., KDD '02: use a feature vector to represent each term, where the weights are the mutual information between the term and each context feature. This is more flexible than n-grams. (If only 2-grams are considered as context features, this will be similar to what we have.)
- Use a committee to represent a cluster, which ensures the clusters are tight and robust.
- Problem: not sure how to select features
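A minimal sketch of the feature-vector idea: each term is represented by its context features, weighted by pointwise mutual information, and terms are compared by cosine similarity. The choice of left/right neighbours as features, the clipping to positive PMI, and the toy sentences are assumptions; the committee step is not shown.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Build term -> {feature: PMI weight} using left/right neighbours as features."""
    pair_counts = Counter()
    term_counts = Counter()
    feat_counts = Counter()
    for sent in sentences:
        for i, term in enumerate(sent):
            feats = []
            if i > 0:
                feats.append(("L", sent[i - 1]))
            if i + 1 < len(sent):
                feats.append(("R", sent[i + 1]))
            for f in feats:
                pair_counts[(term, f)] += 1
                term_counts[term] += 1
                feat_counts[f] += 1
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (term, f), c in pair_counts.items():
        pmi = math.log(c * total / (term_counts[term] * feat_counts[f]))
        if pmi > 0:  # assumption: keep only positively associated features
            vectors[term][f] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Made-up sentences, for illustration only.
sentences = [
    "the Dsrc gene is expressed".split(),
    "the Dabl gene is expressed".split(),
    "the mite infests brood".split(),
]
vecs = context_vectors(sentences)
print(cosine(vecs["Dsrc"], vecs["Dabl"]), cosine(vecs["Dsrc"], vecs["mite"]))
```

Terms that occur in the same contexts end up with similar vectors, so richer features (wider windows, syntactic relations) can be added without changing the comparison step.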
22Summary
- Theme extraction
- Generally performs well, if we can find a good k.
- Hierarchical clustering can solve this problem, but we still need to find a reasonable stopping criterion.
- Representation is an interesting problem; MMR phrase extraction should be tuned further
- Difficult to evaluate other than by expert justification
- Concept extraction
- The n-gram approach has space constraints; we haven't really tested the performance. Generally, the performance should be better on large data sets
- Other clustering algorithms can be explored.