Beespace Component: Filtering and Normalization for Biology Literature

1
Beespace Component: Filtering and
Normalization for Biology Literature
  • Qiaozhu Mei
  • 03.16.2005

2
Concept Processing Component for Beespace: A Big
Picture
[Diagram; labels recovered from the slide:]
  • Pre-processed Text Collection
  • Query terms
  • Retrieval
  • Relevant documents
  • Filtering Module
  • A list of representative terms or phrases
  • Entities/phrases of interest
  • Normalization and Clustering Module
  • Similarity groups of terms and phrases (concepts)
3
Concept Processing Component for Beespace: Input
and Output
  • Input: texts (indices) with entities and phrases
    tagged.
  • Filtering: a group of relevant documents for a
    query
  • Normalization: a list of terms, entities, or
    phrases of interest to be normalized
  • Output
  • Filtering: a list of highly representative
    terms/phrases
  • Normalization:
  • hierarchical structure of concepts (compacted,
    loose)
  • Concept dictionary
  • texts tagged with concepts

4
Filtering
5
Term Filtering: Heuristics
  • We want to find a list of representative
    terms/phrases short enough to enable interactive
    selection and navigation.
  • We want terms with higher frequency in the given
    documents (high Term Frequency); however,
  • Terms too frequent in the whole collection are
    considered harmful, e.g. the, is, cell, bee
    (so we also want low Document Frequency)

6
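Term Filtering: TF-IDF
  • Adding IDF to the frequency count
  • $\text{Weight} = tf \cdot \log\left(\frac{N + 1}{df}\right)$
  • TF-IDF formula in the Okapi method
  • Weight = (IDF part) × (TF part)
    [The formula itself was an image on the slide and
    is not preserved in this transcript.]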
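
The weighting on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy documents, and the document-frequency table are all hypothetical, and because the slide's Okapi formula did not survive in the transcript, `okapi_tf` uses the commonly cited BM25 form of the TF part with conventional defaults (`k1 = 1.2`, `b = 0.75`), which may differ from the variant the author used.

```python
import math
from collections import Counter


def tfidf_weights(relevant_docs, collection_df, n_docs):
    """Score each term as tf * log((N + 1) / df).

    relevant_docs: list of token lists (the retrieved documents)
    collection_df: term -> number of collection documents containing it
    n_docs:        N, the total number of documents in the collection
    """
    tf = Counter(tok for doc in relevant_docs for tok in doc)
    return {
        term: count * math.log((n_docs + 1) / collection_df[term])
        for term, count in tf.items()
        if collection_df.get(term, 0) > 0  # skip terms with unknown df
    }


def okapi_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25-style TF part: grows with tf but saturates below k1 + 1.

    Assumption: this is the textbook BM25 form, not necessarily the slide's.
    """
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)


# Hypothetical toy data: two retrieved documents from a 5-document collection.
docs = [["the", "pollen", "forager", "the"], ["the", "forager", "bee"]]
df = {"the": 5, "bee": 5, "pollen": 1, "forager": 2}
weights = tfidf_weights(docs, df, n_docs=5)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

In this toy data the collection-wide words "the" and "bee" are pushed down by their high document frequency, which is the effect the heuristics slide asks for.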
7
Term Filtering (cont.)
  • Results 1
  • Collection: honeybee.biosis 1980
  • Query: pollen-foraging
  • Select top 2 documents
  • Results 2
  • Collection: GENIA (on human blood cell
    transcription factors), with noun phrases of
    entities tagged
  • Query: il-2

8
Normalization
9
From Term to Concept: Normalization and Theme
Clustering
  • Normalization: tight concepts
  • Group terms/entities/phrases by similarity so
    that one can represent the others
  • Forage: forager, forage-bee, foraging, foragers,
    pollen-foraging
  • Theme clustering: looser concepts
  • Group terms/entities/phrases representing the
    same subtopic (semantically related)
  • forage, pollen, food, detect, feeding, dance, ...
  • In a hierarchical manner.

10
Normalization
  • Morphological approach? (stemming)
  • Normalize English words with morphological
    variations, e.g.
  • forag: forage/foraging/forager/foragers
  • Concerns
  • Too cruel? one -> on, day -> dai, apis -> api,
    useful -> us
  • Handling biological entities? (some stemmers do
    nothing when they detect a hyphen)
  • Not sufficient to normalize phrases
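
To make the concerns above concrete, here is a deliberately naive suffix stripper. It is a toy written for this illustration, not the Porter or Krovetz algorithm: the suffix list, the minimum stem length, and the `skip_hyphenated` switch are all assumptions. It conflates the forage variants as desired, but it also shows the over-stemming problem and the "do nothing on hyphen" behavior mentioned on the slide.

```python
# Toy rule set, longest suffix first (an assumption for this sketch).
SUFFIXES = ("ers", "ing", "er", "e", "s")


def toy_stem(word, skip_hyphenated=True):
    """Strip the first matching suffix, keeping at least two characters."""
    if skip_hyphenated and "-" in word:
        return word  # leave hyphenated entities such as "il-2" untouched
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for w in ["forage", "foraging", "forager", "foragers", "apis", "one", "pollen-foraging"]:
    print(w, "->", toy_stem(w))
```

The four forage variants collapse to "forag", but "apis" and "one" are also truncated, and "pollen-foraging" is returned unchanged, so it is never grouped with "foraging".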

11
Normalization: Stemmers
  • Porter Stemmer
  • does not stem words beginning with an uppercase
    letter
  • Krovetz Stemmer
  • Less aggressive than Porter
  • Sample results
  • Honeybee
  • Genia

12
Normalization (cont.)
  • Semantic and Contextual Approach
  • Group the terms that are considered
    replaceable with each other in a context, e.g.
  • the pollen-foraging activity of A. mellifera
  • the nectar-foraging activity of A. cerana
  • Generally handled with clustering approaches
    based on statistical information in a large
    corpus
  • Usually in the form of hierarchical clusters
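
The notion of "replaceable in a context" can be illustrated with a small sketch. The representation chosen here, a set of (left neighbor, right neighbor) pairs per term compared by Jaccard overlap, is an assumption made for the example and not the method used in the talk; the helper names and the `<s>`/`</s>` padding tokens are hypothetical.

```python
from collections import defaultdict


def context_sets(sentences):
    """Map each token to the set of (left, right) neighbours it occurs between."""
    contexts = defaultdict(set)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]  # sentence-boundary markers
        for i in range(1, len(padded) - 1):
            contexts[padded[i]].add((padded[i - 1], padded[i + 1]))
    return contexts


def context_overlap(a, b, contexts):
    """Jaccard overlap between the context sets of two terms."""
    ca, cb = contexts[a], contexts[b]
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


sentences = [
    "the pollen-foraging activity of a mellifera".split(),
    "the nectar-foraging activity of a cerana".split(),
]
ctx = context_sets(sentences)
print(context_overlap("pollen-foraging", "nectar-foraging", ctx))
print(context_overlap("pollen-foraging", "activity", ctx))
```

With only two sentences the evidence is anecdotal; the slide's point is that reliable replaceability needs statistics from a large corpus.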

13
Normalization: A clustering approach
  • An N-gram clustering method
  • Ideally, if we consider each term in its N-gram
    context, the replaceable relation would be global
    and reliable.
  • Concern: efficiency
  • Computing complexity is high!
  • For 2-grams, $N V^2$ even after optimization
    (initially $V^5$)
  • Space complexity is high!
  • $V^3$
  • Compromise: use 2-grams (equivalent to computing
    the average mutual information of 2-grams and
    grouping the two terms whose merge brings the
    smallest loss in this average MI)
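
The 2-gram compromise described above, merging the pair of terms whose merge loses the least average mutual information, can be sketched as follows. This is a brute-force illustration only: it recomputes the average MI from scratch for every candidate pair and has none of the optimizations the slide alludes to. The function names, the toy sentences, and the choice to restrict merges to a short list of terms of interest are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def bigrams_of(sentences):
    """Adjacent token pairs within each sentence."""
    return [(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)]


def average_mi(bigrams, cluster_of):
    """Average mutual information between adjacent cluster labels."""
    pair = Counter((cluster_of[a], cluster_of[b]) for a, b in bigrams)
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    # sum over pairs of p(c1, c2) * log(p(c1, c2) / (p(c1) * p(c2)))
    return sum(
        (n / total) * math.log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )


def best_merge(bigrams, cluster_of, candidates):
    """Return (loss, a, b) for the candidate pair whose merge loses least MI."""
    base = average_mi(bigrams, cluster_of)
    best = None
    for a, b in combinations(sorted(candidates), 2):
        merged = {w: (a if c == b else c) for w, c in cluster_of.items()}
        loss = base - average_mi(bigrams, merged)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best


# Hypothetical toy corpus; every word starts in its own cluster.
sentences = [s.split() for s in (
    "the forager collected pollen",
    "the foragers collected pollen",
    "the queen laid eggs",
    "the worker fed larvae",
)]
bigrams = bigrams_of(sentences)
cluster_of = {w: w for s in sentences for w in s}
loss, a, b = best_merge(bigrams, cluster_of, ["forager", "foragers", "queen", "worker"])
print(a, b, round(loss, 6))
```

Repeating `best_merge` and applying each chosen merge would yield a hierarchy of clusters; in the toy corpus "forager" and "foragers" share identical left and right neighbors, so merging them costs essentially no mutual information.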

14
Normalization: A clustering approach (cont.)
  • Toy example on honeybee
  • Vocabulary size: 9100 words
  • Collection size: 5505 abstracts
    (honeybee.biosis1980)
  • Terms to be clustered: 18
  • Genia collection: 2000 abstracts
  • 200 noun phrases (entities) to be clustered

15
nursing
nurseries
nursery
nectar-foraging
pollen-foraging
foraging-related
preforaging
non-foraging
forager
forage
foraging
foragers
queen
worker
queens
workers
bee
honeybee
16
Sample clusters on Genia
human_and_mouse_gene mouse_il-2r_alpha_gene
i_kappa_b_alpha nf_kappa_b
transcription_factors transcription_factor
saos_2_cells saos-2 human_osteosarcoma_
epstein-barr_virus_ interleukin-2
interleukin-2_ epstein-barr_virus
b_cells jurkat_t_cells hela_cells thp-1
hl60_cells k562_cells thp-1_cells
phorbol_myristate_acetate phorbol_12-myristate_13-acetate
2_gene_expression 2_gene
u937_cells monocytic_cells jurkat_cells
human_t_cells ipr_cd4-8-_t_cells
j_delta_k_cells lymphoid_cells
activated_t_cells hematopoietic_cells
17
Normalization: Clustering Methods
  • Other Possible Clustering Approaches
  • Cluster terms based on features such as
  • Co-occurring terms
  • Tends to ignore position information
  • Correlation of Nouns and Verbs
  • Dependency-based Word Similarity
  • Proximity-based Word Similarity
  • Depends on highly accurate parsing results, which
    may not be easy to obtain for biology literature.
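
As an illustration of the first feature type listed above, co-occurring terms, the sketch below builds a document-level co-occurrence vector for each term and compares vectors by cosine similarity. Word order inside a document plays no role, which is the "ignores position information" limitation noted on the slide. The function names and toy documents are hypothetical, and the talk does not say which similarity measure was considered.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(docs):
    """For each term, count the other terms found in the same document."""
    vectors = defaultdict(Counter)
    for doc in docs:
        vocab = set(doc)  # position and repetition inside the document are ignored
        for term in vocab:
            for other in vocab:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


docs = [
    ["forager", "pollen", "dance"],
    ["foragers", "pollen", "dance"],
    ["queen", "eggs"],
]
vecs = cooccurrence_vectors(docs)
print(cosine(vecs["forager"], vecs["foragers"]))
print(cosine(vecs["forager"], vecs["queen"]))
```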

18
Theme Clustering
  • Looser Clusters
  • Usually in the form of partitioning clusters
  • K-Means, Latent Semantic Indexing,
  • Probabilistic LSI
  • Compute loose clusters of terms, or clusters
    represented by term distributions
  • Example cluster 10
  • Sometimes helpful for finding normalizations
    (e.g., when clusters are large, when no stemming
    was done)
  • Comparative Text Mining for concept switching
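
For the partitioning style of clustering named on this slide, a minimal K-Means sketch is given below. It is a plain textbook version with a deterministic initialization (the first k points) chosen only to keep the example reproducible; the term vectors are invented, and nothing here reflects the author's actual setup or the LSI/PLSI variants.

```python
def kmeans(points, k, iterations=10):
    """Plain K-Means on equal-length numeric tuples.

    Assumption: centroids start at the first k points (deterministic).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical term vectors: counts of each term in two groups of documents.
terms = ["forage", "queen", "pollen", "egg"]
vectors = [(4, 0), (0, 4), (3, 1), (1, 3)]
assignment, centroids = kmeans(vectors, k=2)
for term, cluster in zip(terms, assignment):
    print(term, cluster)
```

Unlike the hierarchical merges used for normalization, this produces a flat partition, which matches the looser, theme-level grouping described above.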

19
Future Plan
  • Customize the stemmers
  • Try more morphological approaches.
  • e.g. pollen-foraging, nectar-foraging
  • Examine more clustering methods
  • How to use theme clustering to help normalization
  • Find a way to divide the hierarchical clustering
    structure into concepts

20
  • Thanks!