Title: Beespace Component: Filtering and Normalization for Biology Literature
1 Beespace Component: Filtering and Normalization for Biology Literature
2 Concept Processing Component for Beespace: A Big Picture
[Diagram; labels recovered from the slide]
- Pre-processed Text Collection
- Query terms -> Retrieval -> Relevant documents
- Relevant documents -> Filtering Module -> A list of representative terms or phrases
- Entities/phrases of interest -> Normalization and Clustering Module -> Similarity groups of terms and phrases (concepts)
3 Concept Processing Component for Beespace: Input and Output
- Input: texts (indices) with entities and phrases tagged
  - Filtering: a group of relevant documents for a query
  - Normalization: a list of terms, entities, or phrases of interest to be normalized
- Output
  - Filtering: a list of highly representative terms/phrases
  - Normalization:
    - hierarchical structure of concepts (compacted, loose)
    - concept dictionary
    - texts tagged with concepts
4 Filtering
5 Term Filtering: Heuristics
- We want a list of representative terms/phrases that is short enough to enable interactive selection and navigation.
- We want terms with higher frequency in the given documents (high Term Frequency); however,
- terms that are too frequent in the whole collection are considered harmful, e.g. "the", "is", "cell", "bee" (so we also want low Document Frequency).
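6 Term Filtering: TF-IDF
- Adding IDF to the frequency count
  - $\text{Weight} = tf \cdot \log\frac{N + 1}{df}$
- TF-IDF formula in the Okapi method
  - Weight = IDF part × TF part (the formula itself is not preserved in this transcript)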
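The simple weighting above can be sketched in a few lines of Python. This is only an illustration of tf × log((N + 1) / df), not the Beespace implementation: the function name `tfidf_rank`, the token-list document format, and the toy collection are all assumptions, and the Okapi variant is not covered.

```python
import math
from collections import Counter


def tfidf_rank(selected_docs, collection, top_k=10):
    """Rank terms of the selected documents by tf * log((N + 1) / df).

    selected_docs: retrieved documents, each a list of tokens (assumed format)
    collection:    every document in the collection, same format
    """
    n_docs = len(collection)

    # Document frequency: number of collection documents containing the term.
    df = Counter()
    for doc in collection:
        df.update(set(doc))

    # Term frequency: occurrences summed over the selected documents only.
    tf = Counter()
    for doc in selected_docs:
        tf.update(doc)

    weights = {
        term: tf[term] * math.log((n_docs + 1) / df[term])
        for term in tf
        if df[term] > 0
    }
    # Highest weight first; ties broken alphabetically for stable output.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy data, invented for illustration.
collection = [
    ["the", "bee", "forage", "pollen"],
    ["the", "bee", "queen"],
    ["the", "bee", "dance"],
    ["the", "pollen", "forage", "forage"],
]
selected = [collection[0], collection[3]]
for term, weight in tfidf_rank(selected, collection):
    print(f"{term}\t{weight:.3f}")
```

In this toy run "forage" and "pollen" rise to the top, while "the" and "bee", which occur in nearly every document, sink to the bottom even though they are frequent in the selected documents.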
7 Term Filtering (cont.)
- Results 1
  - Collection: honeybee.biosis 1980
  - Query: pollen-foraging
  - Select top 2 documents
- Results 2
  - Collection: GENIA (on human blood cell transcription factors), with noun phrases of entities tagged
  - Query: il-2
8 Normalization
9 From Term to Concept: Normalization and Theme Clustering
- Normalization: tight concepts
  - Group terms/entities/phrases by similarity so that one can represent the others
  - Forage: forager, forage-bee, foraging, foragers, pollen-foraging
- Theme clustering: looser concepts
  - Group terms/entities/phrases representing the same subtopic (semantically related)
  - forage, pollen, food, detect, feeding, dance, ...
- In a hierarchical manner.
10 Normalization
- Morphological approach? (stemming)
  - Normalize English words with morphological variations, e.g.
  - forag: forage/foraging/forager/foragers
- Concerns
  - Too cruel? one -> on, day -> dai, apis -> api, useful -> us
  - Handling biological entities? (some stemmers do nothing when they detect "-")
  - Not sufficient to normalize phrases
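To make the concerns above concrete, here is a deliberately crude suffix stripper. It is not the Porter or Krovetz algorithm; the suffix list, the `min_stem` threshold, and the rule of leaving hyphenated tokens untouched are assumptions chosen only to reproduce the kind of behaviour listed on the slide.

```python
from collections import defaultdict

# Illustrative suffix list, tried in this order; real stemmers use richer rules.
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")


def crude_stem(word, min_stem=3):
    """Strip one suffix, keeping at least min_stem characters of stem."""
    if "-" in word:
        # Mimics stemmers that do nothing when they detect a hyphen.
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Build a stem -> surface forms mapping (a tiny concept dictionary)."""
    groups = defaultdict(list)
    for word in words:
        groups[crude_stem(word)].append(word)
    return dict(groups)


words = ["forage", "foraging", "forager", "foragers", "apis", "pollen-foraging"]
for stem, forms in group_by_stem(words).items():
    print(stem, "<-", forms)
```

The four forage variants collapse to "forag" as intended, but "apis" is over-stemmed to "api", and "pollen-foraging" is left alone and so never joins the forage group.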
11 Normalization: Stemmers
- Porter Stemmer
  - does not stem words beginning with an uppercase letter
- Krovetz' Stemmer
  - Less aggressive than Porter
- Sample results
  - Honeybee
  - Genia
12 Normalization (cont.)
- Semantic and contextual approach
  - Group the terms that are considered replaceable with each other in a context, e.g.
    - the pollen-foraging activity of A. mellifera
    - the nectar-foraging activity of A. cerana
  - Generally handled with clustering approaches based on statistical information in a large corpus
  - Usually in the form of hierarchical clusters
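One simple way to quantify "replaceable in a context" is to compare the immediate left and right neighbours of each term. The sketch below is an assumed formulation (neighbour-count vectors compared with cosine similarity), not the method used in Beespace; `context_vectors` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Count the left and right neighbour of every token occurrence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sentences = [
    "the pollen-foraging activity of a mellifera".split(),
    "the nectar-foraging activity of a cerana".split(),
]
vectors = context_vectors(sentences)
print(cosine(vectors["pollen-foraging"], vectors["nectar-foraging"]))
print(cosine(vectors["pollen-foraging"], vectors["activity"]))
```

On the two example phrases, "pollen-foraging" and "nectar-foraging" share both neighbours and score close to 1, while "pollen-foraging" and "activity" share none and score 0. A large corpus is needed before such scores become reliable.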
13 Normalization: A clustering approach
- An N-gram clustering method
  - Ideally, if we consider each term in its N-gram context, the replaceable relation would be global and reliable.
- Concerns: efficiency
  - Computing complexity is high!
    - For 2-grams, $O(NV^2)$ even after optimization (initially $O(V^5)$)
  - Space complexity is high!
    - $O(V^3)$
- Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and grouping the two terms whose merge brings the smallest loss to this average MI)
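The 2-gram compromise can be illustrated with a brute-force sketch: compute the average mutual information (AMI) of adjacent class pairs, try every pair of classes, and merge the pair that leaves the AMI highest, i.e. loses the least. The function names and the toy token stream are assumptions, and this version recomputes the AMI from scratch for each candidate merge, so it has none of the optimizations mentioned above and is only usable on tiny vocabularies.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, cls):
    """AMI of adjacent class pairs; cls maps each word to its class id."""
    pair = Counter((cls[a], cls[b]) for a, b in bigrams)
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )


def merge_once(bigrams, cls):
    """Merge the two classes whose union loses the least AMI."""
    best_score, best_cls = None, None
    for c1, c2 in combinations(sorted(set(cls.values())), 2):
        trial = {w: (c1 if c == c2 else c) for w, c in cls.items()}
        score = average_mutual_information(bigrams, trial)
        if best_score is None or score > best_score:
            best_score, best_cls = score, trial
    return best_cls


def cluster(tokens, n_classes):
    """Start with one class per word and merge greedily down to n_classes."""
    bigrams = list(zip(tokens, tokens[1:]))
    cls = {w: w for w in set(tokens)}
    while len(set(cls.values())) > n_classes:
        cls = merge_once(bigrams, cls)
    return cls


# Toy token stream, invented for illustration.
tokens = ("the forager collects pollen the foragers collect nectar "
          "the forager collects nectar the foragers collect pollen").split()
for word, class_id in sorted(cluster(tokens, 3).items()):
    print(word, "->", class_id)
```

Recording the order of merges, rather than stopping at a fixed number of classes, would give the hierarchical structure mentioned earlier.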
14 Normalization: A clustering approach (cont.)
- Toy example on honeybee
  - Vocabulary size: 9100 words
  - Collection size: 5505 abstracts (honeybee.biosis1980)
  - Terms to be clustered: 18
- Genia collection, 2000 abstracts
  - 200 noun phrases (entities) to be clustered
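15
[Clustering result for the 18 honeybee terms; the figure is not preserved, terms listed in displayed order]
nursing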
nurseries
nursery
nectar-foraging
pollen-foraging
foraging-related
preforaging
non-foraging
forager
forage
foraging
foragers
queen
worker
queens
workers
bee
honeybee
16 Sample clusters on Genia
human_and_mouse_gene mouse_il-2r_alpha_gene
i_kappa_b_alpha nf_kappa_b
transcription_factors transcription_factor
saos_2_cells saos-2 human_osteosarcoma_
epstein-barr_virus_ interleukin-2
interleukin-2_ epstein-barr_virus
b_cells jurkat_t_cells hela_cells thp-1
hl60_cells k562_cells thp-1_cells
phorbol_myristate_acetate phorbol_12-myristate_13-acetate
2_gene_expression 2_gene
u937_cells monocytic_cells jurkat_cells
human_t_cells ipr_cd4-8-_t_cells
j_delta_k_cells lymphoid_cells
activated_t_cells hematopoietic_cells
17 Normalization: Clustering Methods
- Other possible clustering approaches
  - Cluster terms based on features such as
    - Co-occurring terms
      - Tends to ignore position information
    - Correlation of nouns and verbs
    - Dependency-based word similarity
    - Proximity-based word similarity
  - Depend on highly accurate parsing results, which may not be easy to get for biology literature.
18 Theme Clustering
- Looser clusters
  - Usually in the form of partitioning clusters
  - K-Means, Latent Semantic Indexing, Probabilistic LSI
  - Compute loose clusters of terms, or clusters represented by term distributions
- Example: cluster 10
- Sometimes helpful for finding normalizations (e.g., when clusters are large, when no stemming was done)
- Comparative Text Mining for concept switching
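As an illustration of partitioning terms into loose themes, the sketch below runs a minimal K-Means over per-document term counts. Representing each term by its counts across documents is an assumption made for this example, the numbers are invented, and the initialization (first k points) is chosen only to keep the run deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Minimal K-Means; returns the cluster index assigned to each point."""
    centroids = [list(p) for p in points[:k]]  # deterministic initialization
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical counts of each term in four documents.
terms = ["forage", "pollen", "queen", "egg"]
vectors = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]

themes = {}
for term, label in zip(terms, kmeans(vectors, k=2)):
    themes.setdefault(label, []).append(term)
print(themes)
```

Terms that tend to appear in the same documents end up in the same partition, which is a looser relation than the replaceability used for normalization.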
19 Future Plan
- Customize the stemmers
- Try more morphological approaches
  - e.g. pollen-foraging, nectar-foraging
- Examine more clustering methods
- How to use theme clustering to help normalization
- Find a way to divide the hierarchical clustering structure into concepts