Topic Extraction from Biology Literature: Prior, Labeling, and Switching

1
Topic Extraction from Biology Literature: Prior,
Labeling, and Switching
  • Qiaozhu Mei

2
A Sample Topic
Word distribution (language model):
filaments     0.0410238
muscle        0.0327107
actin         0.0287701
z             0.0221623
filament      0.0169888
myosin        0.0153909
thick         0.00968766
thin          0.00926895
sections      0.00924286
er            0.00890264
band          0.00802833
muscles       0.00789018
antibodies    0.00736094
myofibrils    0.00688588
flight        0.00670859
images        0.00649626
Meaningful labels: actin filaments, flight muscle, flight muscles
Example documents
  • actin filaments in honeybee-flight muscle move
    collectively
  • arrangement of filaments and cross-links in the
    bee flight muscle z disk by image analysis of
    oblique sections
  • identification of a connecting filament protein
    in insect fibrillar flight muscle
  • the invertebrate myosin filament subfilament
    arrangement of the solid filaments of insect
    flight muscles
  • structure of thick filaments from insect flight
    muscle

3
Topic/Theme Extraction
  • A theme/topic is represented with a multinomial
    distribution over words
  • Unigram language models
  • Easier to interpret
  • Easy to add a prior
  • Easy to use for retrieval
  • Assumptions
  • K themes in a collection
  • A document covers multiple themes
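
The representation on this slide can be illustrated with a small Python sketch. It is not the presenter's code: the two toy themes, their word probabilities, the coverage weights, and the function name `doc_word_prob` are all assumptions invented for the example. It only shows how a document that covers several themes assigns probability to a word.

```python
# Toy illustration: each theme is a multinomial distribution over words
# (a unigram language model), and a document mixes several themes.
# All names and numbers here are invented for the example.
themes = {
    "muscle":   {"filaments": 0.5, "muscle": 0.3, "actin": 0.2},
    "foraging": {"foraging": 0.6, "nectar": 0.3, "muscle": 0.1},
}


def doc_word_prob(word, coverage, themes):
    """p(word | doc) = sum over themes of coverage[theme] * p(word | theme)."""
    return sum(
        weight * themes[name].get(word, 0.0)
        for name, weight in coverage.items()
    )


# A hypothetical document that covers both themes (70% / 30%).
coverage = {"muscle": 0.7, "foraging": 0.3}
p_muscle = doc_word_prob("muscle", coverage, themes)
```

Because the coverage weights and each theme's word probabilities sum to one, the document's word probabilities also sum to one.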

4
Topic Extraction vs. Clustering
  • Topic Extraction
  • Effective to reveal the latent topics, and find
    most relevant documents to a topic
  • Better interpretation, worse accuracy
  • Effective to add priors (control the topics)
  • Clustering algorithms
  • Effective at assigning documents to
    non-overlapping clusters
  • Better accuracy, worse interpretation
  • Hard to control

5
Topic Extraction (Results)
corpora           0.0438967
allata            0.0315774
hormone           0.0249687
juvenile          0.0184049
insulin           0.0174549
embryos           0.0165997
neurosecretory    0.0127734
embryo            0.0124167
biosynthesis      0.0118067
cardiaca          0.00969471
sexta             0.0088941
medium            0.00865245
iran              0.00703376
mannose           0.00668768
volume            0.00661038
synapse           0.00652483
injected          0.00636151
Related documents:
44 biosis199598006316
44 biosis200000292072
44 biosis199293065558
44 biosis199799595920
44 biosis199395062782
Example document: stimulatory effect of octopamine on juvenile
hormone biosynthesis in honey bees (apis mellifera): physiological
and immunocytochemical evidence
  • May want a more general topic
  • How to tell the algorithm to find a more general
    topic, like behavioral maturation?

6
Topic Extraction (Results cont.)
pollen         0.467911
foraging       0.0373205
foragers       0.0365857
collected      0.0318249
grains         0.0314324
loads          0.025104
collection     0.0208903
nectar         0.0185726
sources        0.0113751
collecting     0.00999529
types          0.00978636
pellets        0.00942175
germination    0.00733012
load           0.00646375
stored         0.00599516
amount         0.00481306
trips          0.00478013
Related documents:
13 biosis200200039990
13 biosis199900297835
13 biosis200100318017
13 biosis199497516580
13 biosis200000045397
Example document: the response of the stingless bee melipona
beecheii to experimental pollen stress, worker loss and different
levels of information input
  • Biased towards Pollen
  • Not precisely covering foraging
  • How to tell the algorithm to focus on
    foraging?

7
Topic Extraction (Full Results)
  • 100 topics from biosis-bee:
    http://sifaka.cs.uiuc.edu/qmei2/data/beespace/bee-100-basic.html
  • 5 themes for the query "food" in biosis-bee (500 documents):
    http://sifaka.cs.uiuc.edu/qmei2/data/beespace/bee-food-5-basic.html

8
Incorporating Topic Priors
  • With either topic extraction or clustering
  • Cannot guarantee that the extracted themes are
    the expected ones
  • User exploration usually has a preference
  • E.g., want one topic/cluster to be about foraging
    behavior
  • Use a prior to guide the theme extraction
  • Prior as a simple language model
  • E.g., forage 0.2, foraging 0.3, food 0.05, etc.

9
Incorporating Topic Priors
[Figure: the original EM update compared with the EM update with a prior]
Prior language model interpreted as pseudo counts
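
The formulas on this slide did not survive in the transcript. As a hedged illustration of what "prior language model interpreted as pseudo counts" usually means, the sketch below shows one M-step for a single theme. The function name `m_step`, the strength parameter `mu`, and the toy counts are assumptions and are not taken from the slides.

```python
def m_step(expected_counts, prior=None, mu=0.0):
    """Re-estimate one theme's word distribution.

    expected_counts: word -> expected count assigned to the theme in the E-step.
    prior: word -> probability (a simple language model), or None.
    mu: assumed prior strength; the prior adds mu * prior[w] pseudo counts.
    With mu == 0 this is the ordinary maximum-likelihood M-step.
    Assumes the total (counts plus pseudo counts) is greater than zero.
    """
    prior = prior or {}
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu * sum(prior.values())
    return {
        w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
        for w in vocab
    }


# Invented expected counts for a theme that leans towards "pollen".
counts = {"pollen": 8.0, "foraging": 2.0}
prior = {"foraging": 0.5, "forage": 0.5}

original = m_step(counts)                   # original EM update
with_prior = m_step(counts, prior, mu=10)   # update pulled towards the prior
```

In this toy case the prior raises the weight of "foraging" and introduces "forage", while "pollen" loses weight. A larger `mu` pulls the theme harder towards the prior.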
10
Incorporating Topic Priors (results)
foraging       0.0498044
food           0.0472535
foragers       0.0310718
dance          0.0266078
source         0.0254369
nectar         0.0162739
distance       0.0141869
forage         0.0141503
information    0.0129047
dances         0.012684
hive           0.0124987
landmarks      0.0119087
dancing        0.0109375
waggle         0.0101672
feeder         0.0101266
rate           0.0085641
sources        0.00825884
recruitment    0.00813717
forager        0.00796914
Prior: forage 0.1, foraging 0.1, food 0.1, source 0.1
11
Incorporating Topic Priors (results cont.)
age           0.0672687
division      0.0551497
labor         0.052136
colony        0.038305
foraging      0.0357817
foragers      0.0236658
workers       0.0191248
task          0.0190672
behavioral    0.0189017
behavior      0.0168805
older         0.0143466
tasks         0.013823
old           0.011839
individual    0.0114329
ages          0.0102134
young         0.00985875
genotypic     0.00963096
social        0.00883439
Prior: labor 0.2, division 0.2
12
Incorporating Topic Priors (results cont.)
gene            0.0648303
expression      0.0486273
sequence        0.0407999
sequences       0.0311126
brain           0.0233977
drosophila      0.020891
cdna            0.0186153
predict         0.0166939
expressed       0.0166521
amino           0.0126359
dna             0.010655
genome          0.0101629
conserved       0.0098135
bp              0.00908649
nucleotide      0.00906794
phylogenetic    0.00887771
encoding        0.00866418
melanogaster    0.00798409
Prior: brain 0.1, predict 0.1, gene 0.1, expression 0.1
13
Incorporating Topic Priors (results cont.)
behavioral     0.110674
age            0.0789419
maturation     0.057956
task           0.0318285
division       0.0312101
labor          0.0293371
workers        0.0222682
colony         0.0199028
social         0.0188699
behavior       0.0171008
performance    0.0117176
foragers       0.0110682
genotypic      0.0106029
differences    0.0103761
polyethism     0.00904816
older          0.00808171
plasticity     0.00804363
changes        0.00794045
Prior: behavioral 0.2, maturation 0.2
14
Incorporating Topic Priors (Full results)
  • 30 topics from biosis-bee (first 7 topics w/ prior):
    http://sifaka.cs.uiuc.edu/qmei2/data/beespace/bee-30-prior.html
  • 30 topics from biosis-bee (first 2 topics w/ prior):
    http://sifaka.cs.uiuc.edu/qmei2/data/beespace/bee-30-prior3.html

15
Labeling a Topic
  • Themes (Topic models) can be hard to interpret.
  • Giving meaningful labels to a topic is hard

16
What is a Good Label?
  • Suggesting the theme (relevance)
  • Understandable phrases?
  • High coverage inside topic
  • A theme is often a mixture of concepts
  • Discriminative across topics
  • A theme is usually in the context of k topics

17
Our Method
  • Guarantee understandability with a pre-processing
    step
  • Use phrases as candidate topic labels
  • Other possible choices: entities
  • Satisfy relevance, coverage, and discriminability
    with a probabilistic framework

Good labels: understandable, relevant, high coverage,
discriminative
18
Labeling a Topic: Candidate Labels
  • Phrase generation
  • Statistically significant 2-grams
  • Hypothesis testing
  • T-test used; candidates ranked by t-score
  • Other choices?
  • Entities?
  • Behavior ontology?
  • GO terms are hard to use because they are not
    real phrases from the literature.
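
The phrase-generation step above can be sketched as follows. This is a minimal illustration of the textbook t-test for collocations, not the presenter's implementation: the function name `bigram_t_scores`, the toy token sequence, and the approximation of the sample variance by the observed bigram rate are assumptions. A real pipeline would probably also filter stopwords and rare bigrams.

```python
import math
from collections import Counter


def bigram_t_scores(tokens):
    """Return {(w1, w2): t-score} for every adjacent word pair in tokens."""
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        observed = count / n_bigrams
        # Expected rate if w1 and w2 occurred independently of each other.
        expected = (unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)
        # For a small rate p the Bernoulli variance p * (1 - p) is about p.
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed / n_bigrams)
    return scores


# Invented toy text; candidate labels are the bigrams ranked by t-score.
tokens = ("flight muscle of the bee flight muscle and the thick "
          "filaments of the flight muscle").split()
scores = bigram_t_scores(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
```

On this toy text the bigram "flight muscle" gets the highest t-score, because it occurs far more often than its two words would co-occur by chance.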

19
Labeling a Topic: Semantic Relevance
  • Zero-order: use phrases that cover the top
    words well
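
The slide does not give a formula for zero-order relevance. One plausible reading, shown here only as a sketch, scores a phrase by how much more likely its words are under the topic than under a background model. The function name `zero_order_score`, the `floor` smoothing value, and the background probabilities are assumptions; the topic probabilities are copied from the juvenile hormone topic shown later in this presentation.

```python
import math


def zero_order_score(phrase, topic, background, floor=1e-6):
    """Sum over the phrase's words of log(p(w | topic) / p(w | background)).

    Words missing from either model get the small probability `floor`.
    """
    return sum(
        math.log(topic.get(w, floor) / background.get(w, floor))
        for w in phrase.split()
    )


# Topic word probabilities taken from the juvenile hormone topic.
topic = {"hormone": 0.0536175, "juvenile": 0.0466941,
         "larval": 0.0276814, "instar": 0.0149492}
# Background probabilities are invented for the example.
background = {"hormone": 0.002, "juvenile": 0.001,
              "larval": 0.002, "instar": 0.001}

score_jh = zero_order_score("juvenile hormone", topic, background)
score_li = zero_order_score("larval instar", topic, background)
```

With these assumed background values, "juvenile hormone" scores higher than "larval instar" because its words carry more of the topic's probability mass.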

20
Labeling a Topic: Semantic Relevance (cont.)
  • First-order: use phrases with similar context
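
The slide does not say how "similar context" is measured. A common choice, used here purely as an assumption, is to compare the topic's word distribution with the distribution of words that occur around the phrase, using the negative KL divergence as the score. The function name `neg_kl`, the `floor` smoothing value, and all distributions below are invented for the example.

```python
import math


def neg_kl(topic, context, floor=1e-6):
    """Return -KL(topic || context); values closer to 0 mean a closer match.

    Words missing from the context distribution get the probability `floor`.
    """
    return -sum(
        p * math.log(p / context.get(w, floor))
        for w, p in topic.items()
        if p > 0
    )


# Invented toy topic and two invented context distributions, one per phrase.
topic = {"foraging": 0.5, "nectar": 0.3, "dance": 0.2}
context_food_source = {"foraging": 0.4, "nectar": 0.4, "dance": 0.2}
context_gene_expression = {"foraging": 0.1, "gene": 0.8, "nectar": 0.1}

score_good = neg_kl(topic, context_food_source)
score_bad = neg_kl(topic, context_gene_expression)
```

The phrase whose surrounding words resemble the topic gets a score near zero, while the unrelated phrase is penalised heavily for the topic words its context never contains.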

21
Labeling a Topic (results)
female           0.0892427
females          0.0856834
male             0.0854142
males            0.0812643
sex              0.0577668
reproductive     0.0214618
ratio            0.0142873
alleles          0.0133912
diploid          0.0125172
offspring        0.0120271
sexes            0.0116374
investment       0.0115359
mating           0.00902159
number           0.00823397
success          0.00785498
sexual           0.00751456
determination    0.00663546
size             0.00633002
Labels:
sex ratio            2.49468    32
male female          2.29508    51
sex determination    2.16534    21
female flowers       1.83686    23
sex alleles          1.79415    16
multiple mating      1.72684    19
22
Labeling a Topic (results cont.)
hormone         0.0536175
jh              0.0518038
juvenile        0.0466941
development     0.0387031
larval          0.0276814
hemolymph       0.0216493
pupal           0.0189934
stage           0.0188286
glands          0.0173832
larvae          0.0169996
adult           0.0154695
instar          0.0149492
haemolymph      0.0140053
vitellogenin    0.0131076
caste           0.0124822
protein         0.0116558
glucose         0.0112673
corpora         0.0105111
Labels:
juvenile hormone    2.44992    117
hormone jh          1.58432    49
larval instar       1.53676    20
worker larvae       1.52398    51
corpora allata      1.50391    34
23
Labeling a Topic (results)
foraging       0.0498044
food           0.0472535
foragers       0.0310718
dance          0.0266078
source         0.0254369
nectar         0.0162739
distance       0.0141869
forage         0.0141503
information    0.0129047
dances         0.012684
hive           0.0124987
landmarks      0.0119087
dancing        0.0109375
waggle         0.0101672
feeder         0.0101266
rate           0.0085641
recruitment    0.00813717
forager        0.00796914
Labels:
food source        -6.72378    107
nectar foraging    -7.11784    28
nectar foragers    -7.58965    47
nectar source      -7.78975    16
food sources       -7.8487     72
waggle dance       -8.21514    31
Prior:
0 forage      0.1
0 foraging    0.1
0 food        0.1
0 source      0.1
24
Labeling a Topic (full results)
  • 100 topics from biosis-bee (w/ labels):
    http://sifaka.cs.uiuc.edu/qmei2/data/beespace/bee-100-basic-l.html
  • 100 topics from biosis-fly-genetics (w/ labels):
    http://sifaka.cs.uiuc.edu/qmei2/data/beespace/fly-100-l.html

25
Context Switching
  • Utilize topic extraction for concept switching
    (two possible ways)
  • Label the same topic model with phrases in
    another context
  • Use the topic model from context A as prior to
    extract topics from context B
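
The second option above needs a prior built from a topic that was extracted in context A. The slides do not say how this is done. One simple possibility, sketched here as an assumption, keeps only the topic's top words and renormalises them into a prior language model. The function name `topic_to_prior` and the cutoff `top_n` are hypothetical; the probabilities are the top words of the bee foraging topic shown later in this presentation.

```python
def topic_to_prior(topic, top_n=5):
    """Keep the top_n most probable words and renormalise them to sum to 1."""
    top = sorted(topic.items(), key=lambda item: item[1], reverse=True)[:top_n]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}


# Top words of the foraging topic extracted from the bee context.
bee_topic = {"foraging": 0.142473, "foragers": 0.0582921, "forage": 0.0557498,
             "food": 0.0393453, "nectar": 0.03217, "colony": 0.019416}

# Prior that could seed topic extraction in another context, e.g. fly.
prior_for_other_context = topic_to_prior(bee_topic, top_n=3)
```

The resulting prior can then be passed to topic extraction on the context B collection, in the same way as a hand-written prior.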

26
foraging       0.142473
foragers       0.0582921
forage         0.0557498
food           0.0393453
nectar         0.03217
colony         0.019416
source         0.0153349
hive           0.0151726
dance          0.013336
forager        0.0127668
information    0.0117961
feeder         0.010944
rate           0.0104752
recruitment    0.00870751
individual     0.0086414
reward         0.00810706
flower         0.00800705
dancing        0.00794827
behavior       0.00789228
Labels with bee context:
foraging trip          2.31174    21
nectar foragers        2.23428    47
tremble dance          2.21407    10
returning foragers     2.18954    16
food sources           2.14453    72
food source            2.13647    107
foraging strategy      2.101      14
individual foraging    2.08334    16
waggle dance           2.07836    31
Labels with fly context:
foraging behavior        2.45263    27
age related              2.29676    20
drosophila larvae        2.15361    67
feeding rate             1.99218    17
apis mellifera           1.9847     23
diptera drosophilidae    1.9        25
27
foraging       0.290076
nectar         0.114508
food           0.106655
forage         0.0734919
colony         0.0660329
pollen         0.0427706
flower         0.0400582
sucrose        0.0334728
source         0.0319787
behavior       0.0283774
individual     0.028029
rate           0.0242806
recruitment    0.0200597
time           0.0197362
reward         0.0196271
task           0.0182461
sitter         0.00604067
rover          0.00582791
rovers         0.00306051

foraging       0.142473
foragers       0.0582921
forage         0.0557498
food           0.0393453
nectar         0.03217
colony         0.019416
source         0.0153349
hive           0.0151726
dance          0.013336
forager        0.0127668
information    0.0117961
feeder         0.010944
rate           0.0104752
recruitment    0.00870751
individual     0.0086414
reward         0.00810706
flower         0.00800705
dancing        0.00794827
behavior       0.00789228
28
Speed of topic extraction
Documents    Themes    Running time
500          5         8.3 s
500          10        10.6 s
1000         5         17.6 s
10k          30        350 s
16k          150       4000 s
29
Questions?
  • Thanks!