Beespace Component: Filtering and Normalization for Biology Literature


1
Beespace Component: Filtering and Normalization for Biology Literature

Qiaozhu Mei
03.16.2005

2
Concept Processing Component for Beespace: A Big Picture

[Diagram labels: Pre-processed Text Collection; Query terms; Retrieval; Relevant documents; Filtering Module; A list of representative terms or phrases; Entities and phrases of interest; Normalization and Clustering Module; Similarity groups of terms and phrases (concepts)]
3
Concept Processing Component for Beespace: Input and Output

Input: texts (indices) with entities and phrases tagged.
Filtering: a group of relevant documents for a query
Normalization: a list of terms, entities, or phrases of interest to be normalized
Output
Filtering: a list of highly representative terms and phrases
Normalization:
hierarchical structure of concepts (compact, loose)
concept dictionary
texts tagged with concepts

4
Filtering
5
Term Filtering: Heuristics

We want a list of representative terms and phrases that is short enough to enable interactive selection and navigation.
We want terms with higher frequency in the given documents (high Term Frequency). However,
terms that are too frequent in the whole collection are considered harmful, e.g. "the", "is", "cell", "bee", ... (so we also want low Document Frequency).

6
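Term Filtering: TF-IDF

Adding IDF to the frequency count
Weight = tf × log((N + 1) / df)
where N is the number of documents in the collection and df is the number of documents containing the term
TF-IDF formula in the Okapi method
Weight = IDF × TF part
[The full Okapi formula was an image on the slide and is not preserved in this transcript.]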
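
As a rough illustration of the simple weight tf × log((N + 1) / df) on this slide, the sketch below ranks the terms of a small set of retrieved documents. The function name `rank_terms`, the whitespace tokenization, and the toy collection are assumptions made for this example and are not taken from the Beespace component. The Okapi variant is not reproduced because its formula is not preserved in the transcript.

```python
import math
from collections import Counter

def rank_terms(relevant_docs, collection, top_k=5):
    """Rank terms in relevant_docs by tf * log((N + 1) / df)."""
    n_docs = len(collection)
    # df: number of collection documents that contain each term.
    df = Counter()
    for doc in collection:
        df.update(set(doc.split()))
    # tf: total occurrences of each term in the relevant documents.
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc.split())
    weights = {t: tf[t] * math.log((n_docs + 1) / df[t]) for t in tf if df[t] > 0}
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy collection (made up for illustration only).
collection = [
    "the bee is a pollen forager",
    "the pollen forager bee dance",
    "the bee colony is large",
    "the queen bee is in the hive",
]
relevant = collection[:2]  # pretend these were retrieved for a query
ranked = rank_terms(relevant, collection)
names = [term for term, _ in ranked]
```

In this toy run, "forager" and "pollen" rise to the top, while "the" and "bee", which occur in every document of the collection, fall out of the top five.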
7
Term Filtering (cont.)

Results 1
Collection: honeybee.biosis1980
Query: pollen-foraging
Select top 2 documents
Results 2
Collection: GENIA (on human blood cell transcription factors), with noun phrases of entities tagged
Query: il-2

8
Normalization
9
From Term to Concept: Normalization and Theme Clustering

Normalization: tight concepts
Group terms/entities/phrases by similarity so that one can represent the others
Forage: forager, forage-bee, foraging, foragers, pollen-foraging
Theme clustering: looser concepts
Group terms/entities/phrases representing the same subtopic (semantically related)
forage, pollen, food, detect, feeding, dance, ...
In a hierarchical manner.

10
Normalization

Morphological approach? (stemming)
Normalize morphological variations of English words, e.g.
forag: forage/foraging/forager/foragers
Concerns
Too cruel? one -> on, day -> dai, apis -> api, useful -> us
Handling biological entities? (some stemmers do nothing when they detect "-")
Not sufficient to normalize phrases
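
To make the concerns above concrete, here is a deliberately crude suffix stripper. It is not the Porter or Krovetz algorithm; the suffix list, the minimum stem length, and the rule that hyphenated terms are left untouched are all assumptions chosen for this sketch.

```python
# Illustrative suffix list, longest first. This is NOT the Porter rule set.
SUFFIXES = ("ers", "ing", "er", "es", "e", "s")

def crude_stem(term):
    """Strip at most one suffix; leave hyphenated terms untouched (assumed rule)."""
    if "-" in term:
        return term
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def group_by_stem(terms):
    """Group terms that share the same crude stem."""
    groups = {}
    for term in terms:
        groups.setdefault(crude_stem(term), []).append(term)
    return groups

groups = group_by_stem(
    ["forage", "foraging", "forager", "foragers", "pollen-foraging", "apis"]
)
```

Here the four variants of "forage" collapse to `forag`, the hyphenated phrase `pollen-foraging` is not normalized at all, and the genus name `apis` is cut down to `api`, which mirrors the over-stemming and phrase-handling concerns listed on the slide.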

11
Normalization: Stemmers

Porter Stemmer
Does not stem words beginning with an uppercase letter
Krovetz Stemmer
Less aggressive than Porter
Sample results
Honeybee
Genia

12
Normalization (cont.)

Semantic and Contextual Approach
Group the terms that are considered replaceable with each other in a context, e.g.
the pollen-foraging activity of A. mellifera
the nectar-foraging activity of A. cerana
Generally handled with clustering approaches based on statistical information from a large corpus
Usually in the form of hierarchical clusters

13
Normalization: A Clustering Approach

An N-gram clustering method
Ideally, if we consider each term in its N-gram context, the replaceable relation would be global and reliable.
Concern: efficiency
Computational complexity is high!
For 2-grams, NV^2 even after optimization (initially V^5)
Space complexity is high!
V^3
Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and grouping the two terms whose merge brings the smallest loss in this average MI)
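
The 2-gram compromise can be sketched as a brute-force greedy merge: compute the average mutual information of adjacent cluster pairs, then repeatedly merge the two clusters whose merge keeps that value highest (that is, loses the least). The function names `average_mi` and `greedy_merge` and the toy token sequence are assumptions for this example. The sketch recomputes the average MI for every candidate pair, so it does not include the optimizations the slide refers to and is only usable on tiny inputs.

```python
import math
from collections import Counter
from itertools import combinations

def average_mi(tokens, cluster_of):
    """Average mutual information of adjacent cluster pairs (class 2-grams)."""
    classes = [cluster_of[t] for t in tokens]
    pairs = Counter(zip(classes, classes[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n   # count of a as the first element of a pair
        right[b] += n  # count of b as the second element of a pair
    # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
    return sum(
        (n / total) * math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    )

def greedy_merge(tokens, n_clusters):
    """Start with one cluster per word; merge until n_clusters remain."""
    cluster_of = {t: t for t in set(tokens)}
    while len(set(cluster_of.values())) > n_clusters:
        best = None
        for a, b in combinations(sorted(set(cluster_of.values())), 2):
            # Try relabelling every member of cluster b as cluster a.
            trial = {t: (a if c == b else c) for t, c in cluster_of.items()}
            score = average_mi(tokens, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        cluster_of = best[1]
    return cluster_of

# Toy token sequence (made up for illustration only).
tokens = (
    "the pollen foraging activity the nectar foraging activity "
    "the pollen foraging bee the nectar foraging bee"
).split()
clusters = greedy_merge(tokens, 4)
```

In this toy sequence "pollen" and "nectar" occur in identical contexts, as do "activity" and "bee", so merging each pair loses no mutual information and those are the two merges the greedy procedure makes.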

14
Normalization: A Clustering Approach (cont.)

Toy example on honeybee
Vocabulary size: 9100 words
Collection size: 5505 abstracts (honeybee.biosis1980)
Terms to be clustered: 18
Genia collection: 2000 abstracts
200 noun phrases (entities) to be clustered

15
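[Figure residue: the 18 honeybee terms that were clustered, listed in slide order]
nursing, nurseries, nursery, nectar-foraging, pollen-foraging, foraging-related, preforaging, non-foraging, forager, forage, foraging, foragers, queen, worker, queens, workers, bee, honeybee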
16
Sample clusters on Genia
human_and_mouse_gene mouse_il-2r_alpha_gene
i_kappa_b_alpha nf_kappa_b
transcription_factors transcription_factor
saos_2_cells saos-2 human_osteosarcoma_
epstein-barr_virus_ interleukin-2
interleukin-2_ epstein-barr_virus
b_cells jurkat_t_cells hela_cells thp-1
hl60_cells k562_cells thp-1_cells
phorbol_myristate_acetate phorbol_12-myristate_13-acetate
2_gene_expression 2_gene
u937_cells monocytic_cells jurkat_cells
human_t_cells ipr_cd4-8-_t_cells
j_delta_k_cells lymphoid_cells
activated_t_cells hematopoietic_cells
17
Normalization: Clustering Methods

Other Possible Clustering Approaches
Cluster terms based on features such as
Co-occurring terms
Tends to ignore position information
Correlation of nouns and verbs
Dependency-based word similarity
Proximity-based word similarity
Depends on highly accurate parsing results, which may not be easy to obtain for biology literature.
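
As one possible reading of the "co-occurring terms" feature, the sketch below builds a bag-of-neighbours vector for each term and compares terms by cosine similarity. The window size, the function names, and the example sentences are assumptions for illustration. Because the vectors are unordered counts, the sketch also shows why this feature ignores position information.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Count words within `window` positions of each term, ignoring order."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

# Toy sentences (made up for illustration only).
sentences = [
    "the pollen-foraging activity of workers",
    "the nectar-foraging activity of workers",
    "the queen lays eggs in the hive",
]
vecs = context_vectors(sentences)
sim_foraging = cosine(vecs["pollen-foraging"], vecs["nectar-foraging"])
sim_other = cosine(vecs["pollen-foraging"], vecs["queen"])
```

The two foraging terms share all of their neighbours and score much higher than the pair "pollen-foraging" and "queen", which share only "the".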

18
Theme Clustering

Looser clusters
Usually in the form of partitioning clusters
K-Means, Latent Semantic Indexing, Probabilistic LSI
Compute loose clusters of terms, or clusters represented by term distributions
Example cluster 10
Sometimes helpful for finding normalizations (e.g., when clusters are large, or when no stemming was done)
Comparative Text Mining for concept switching
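
Of the partitioning methods named above, K-Means is the simplest to sketch. The version below is a plain k-means over small term vectors with caller-supplied starting centroids; the two-dimensional vectors are made-up counts (imagine occurrences in foraging-related versus reproduction-related abstracts), and none of the names or numbers come from the Beespace data.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids (no random restarts)."""
    for _ in range(iterations):
        # Assignment step: put each term in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for name, vec in points.items():
            dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(name)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                dims = zip(*(points[m] for m in members))
                new_centroids.append([sum(d) / len(members) for d in dims])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        centroids = new_centroids
    return clusters

# Hypothetical term vectors: [count in topic A abstracts, count in topic B abstracts].
points = {
    "forage": [5, 0],
    "pollen": [4, 1],
    "dance": [3, 0],
    "queen": [0, 5],
    "egg": [1, 4],
}
themes = kmeans(points, centroids=[[5, 0], [0, 5]])
```

With these inputs the terms split into a foraging theme and a reproduction theme. Unlike the tight normalization groups, the members of a theme are related by topic rather than being interchangeable.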

19
Future Plan