Concept Extraction from Biological Corpora

Provided by: dcsK7

1
Concept Extraction from Biological Corpora

A Text Mining approach scalable in Parallel
Architectures

Supervised by Christos Makris, University of
Patras, Department of Computer Science and
Engineering
2
Motivation

The vast literature of biomedical papers.
New discoveries in biomedical sciences.
Need for Text Mining tools to help the researcher
gather the scattered knowledge.

3
Aim of Text Mining Tools

Text mining tools for biological corpora
can have several targets, such as:
Extracting gene relations
Extracting evolution paths
Discovering biomolecule interactions
We aim at a general approach capable of
extracting groups of correlated terms from a
collection of documents.

4
Common Text Mining Practices

Some of the popular text mining techniques are:
Rule-based
Depending on knowledge databases
Applying statistical measures
Applying Natural Language Processing

5
Our Approach
6
Phase 1: Text Retrieval

Boolean retrieval of biological papers.
The set of retrieved documents is the outcome of
a boolean search in the database.
The full text of each document is stored in a
separate file.
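
A minimal sketch of a Boolean AND search over a small in-memory collection. The slides do not say how the database search was implemented; the inverted index, the single-word query terms and the toy documents below are assumptions made only for illustration.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in set(tokenize(text)):
            index.setdefault(token, set()).add(doc_id)
    return index

def boolean_and(index, query_terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

# Toy collection (hypothetical documents, not the real corpus).
docs = {
    "d1": "Transcription factors regulate signaling cascades.",
    "d2": "Signaling cascades in yeast.",
    "d3": "Transcription of ribosomal genes.",
}
index = build_inverted_index(docs)
hits = boolean_and(index, ["transcription", "signaling"])
```

Only the documents in `hits` would go on to the linguistic processing phase.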

7
Phase 2: Linguistic Processing

Input: Full text of biological papers
Output: Term-by-Document matrix
Stemming is applied (Porter stemmer).
A stoplist is used to cut common words.
The TF-IDF metric is used to weight terms.
Only terms which occur in a significant number of
different documents are kept, so less significant
terms are discarded.
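
The steps of this phase can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `crude_stem` is a hypothetical stand-in for the Porter stemmer, the stoplist is a tiny sample, the TF-IDF variant (raw count times log inverse document frequency) is one common choice, and `min_df` plays the role of the "significant number of different documents" cutoff.

```python
import math
import re
from collections import Counter

STOPLIST = {"the", "of", "and", "a", "in", "is", "are", "to"}  # illustrative sample only

def crude_stem(token):
    """Hypothetical stand-in for the Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stoplist words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPLIST]

def term_by_document(docs, min_df=2):
    """Return (terms, matrix); matrix[i][j] is the TF-IDF weight of term i in doc j."""
    counts = [Counter(preprocess(d)) for d in docs]
    df = Counter(term for c in counts for term in c)       # document frequency
    terms = sorted(t for t, n in df.items() if n >= min_df)  # discard rare terms
    n_docs = len(docs)
    matrix = [[c[t] * math.log(n_docs / df[t]) for c in counts] for t in terms]
    return terms, matrix

docs = [
    "The gene activates signaling cascades.",
    "Signaling of the gene is inhibited.",
    "Interleukins and the cascade in arthritis.",
]
terms, matrix = term_by_document(docs, min_df=2)
```

With `min_df=2` only the stems shared by at least two of the toy documents survive, which mirrors the filtering rule on the slide.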

8
Phase 3: Latent Semantic Indexing
The singular value decomposition (SVD)
9
Reasons for using LSI

LSI provides us with the rank-k approximation of
the Term-by-Document matrix.
It reduces the noise in the representation due to synonymy.
LSI reduces the dimensionality. Let the rows be
the document vectors.
Then instead of the original rows we can
represent docs with the rows of the matrix U_k S_k.
Doc vectors have fixed dimensionality k.
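
The reduction can be written out explicitly. The notation below is an assumption layered on the slide: A is the matrix whose rows are the document vectors, S is the diagonal matrix of singular values, and the product the slide writes as USk is read as U_k S_k, i.e. the first k columns of U scaled by the k largest singular values.

```latex
A = U S V^{T}, \qquad
A_k = U_k S_k V_k^{T}, \qquad
d_i = \text{row } i \text{ of } U_k S_k \in \mathbb{R}^{k}

A_k A_k^{T} = U_k S_k V_k^{T} V_k S_k U_k^{T} = (U_k S_k)(U_k S_k)^{T}
```

Since V_k has orthonormal columns, the second line shows that inner products between the k-dimensional vectors d_i equal the inner products between the corresponding rows of the rank-k approximation A_k, which is why the shorter vectors can replace the original rows for clustering.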

10
Phase 4: Clustering

Input: The new document vectors
Output: Document clusters
The intuition behind clustering:
Docs contain the semantic structure.
Clustering will group semantically similar docs
together.
Those groups correspond to the answers of different
queries in the vector space model.
We will later have to look for those queries, i.e.
for their terms.

11
Phase 5: Concept Extraction

For each cluster of documents we compute the
union of indexing terms.
For each term we compute the log-odds formula.
Terms of a cluster whose log-odds exceed a threshold
show a specific preference for that cluster.
These terms formulate the query and express the
core concept of a doc cluster.
Under this assumption the query terms are
expected to be correlated.
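
The slides do not spell out the log-odds formula, so the sketch below assumes one common form: the log of the odds that a term occurs in a document of the cluster, minus the log of the odds that it occurs in a document outside it, with additive smoothing to avoid division by zero. The function names, the smoothing constant and the toy data are all hypothetical.

```python
import math

def log_odds(term, cluster_docs, other_docs, smoothing=0.5):
    """Log-odds of `term` occurring in cluster docs vs. other docs (docs are term sets)."""
    in_cluster = sum(term in d for d in cluster_docs)
    in_others = sum(term in d for d in other_docs)
    p = (in_cluster + smoothing) / (len(cluster_docs) + 2 * smoothing)
    q = (in_others + smoothing) / (len(other_docs) + 2 * smoothing)
    return math.log(p / (1 - p)) - math.log(q / (1 - q))

def cluster_concept(cluster_docs, other_docs, threshold):
    """Terms from the cluster's term union whose log-odds exceed the threshold."""
    vocabulary = set().union(*cluster_docs)
    scores = {t: log_odds(t, cluster_docs, other_docs) for t in vocabulary}
    return sorted(t for t, s in scores.items() if s > threshold)

# Toy data: each document is the set of its (stemmed) indexing terms.
cluster = [{"interleukin", "arthriti", "gene"}, {"interleukin", "inhibit", "gene"}]
others = [{"gene", "signal"}, {"gene", "cascad"}, {"signal", "yeast"}]
concept = cluster_concept(cluster, others, threshold=1.5)
```

In this toy case the term "gene" occurs throughout the collection, so its log-odds stay below the threshold and it is left out of the cluster's concept.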

12
Computational Issues Linguistic Processing

Linguistic processing is a major time bottleneck,
since every single character must be parsed.
To cope with this we propose the following
parallelization scheme.

13
Computational Issues Linguistic Processing
14
Computational Issues Linguistic Processing
15
Computational Issues of LSI

LSI constitutes a major time bottleneck due to
the SVD. To increase capacity:
We tried to reduce the number of indexing terms
(stemming, IDF filtering).
We applied a parallelization scheme for the One-Sided
Jacobi method that functions as follows.

Each cell indicates which pair of columns is
being orthogonalized. We notice that diagonal
pairs can be orthogonalized simultaneously. This
gave a speedup of 2.07 on 4 processors.
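
A sequential sketch of the idea, under assumptions: the slide's own pairing table is not reproduced in the transcript, so a round-robin ordering is swapped in here as one standard way to split all column pairs into rounds of disjoint pairs. The rotations inside one round touch different columns, so they are the ones that could be handed to different processors. The rotation itself is the usual one-sided (Hestenes) Jacobi step.

```python
import math

def round_robin_rounds(n):
    """Yield rounds of disjoint column pairs that together cover every pair i < j."""
    players = list(range(n)) + ([None] if n % 2 else [])
    m = len(players)
    for _ in range(m - 1):
        pairs = []
        for a in range(m // 2):
            i, j = players[a], players[m - 1 - a]
            if i is not None and j is not None:
                pairs.append((min(i, j), max(i, j)))
        yield pairs
        players = [players[0], players[-1]] + players[1:-1]  # rotate all but the first

def one_sided_jacobi(matrix, sweeps=30, tol=1e-12):
    """Singular values of `matrix` (list of rows), found by orthogonalizing its columns."""
    a = [row[:] for row in matrix]
    n_cols = len(a[0])
    for _ in range(sweeps):
        rotated = False
        for pairs in round_robin_rounds(n_cols):
            for p, q in pairs:  # pairs in one round are independent of each other
                alpha = sum(row[p] ** 2 for row in a)
                beta = sum(row[q] ** 2 for row in a)
                gamma = sum(row[p] * row[q] for row in a)
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for row in a:
                    row[p], row[q] = c * row[p] - s * row[q], s * row[p] + c * row[q]
        if not rotated:
            break
    norms = [math.sqrt(sum(row[j] ** 2 for row in a)) for j in range(n_cols)]
    return sorted(norms, reverse=True)

singular_values = one_sided_jacobi([[1.0, 1.0], [0.0, 1.0]])
```

For 4 columns the schedule has 3 rounds of 2 disjoint pairs each, so at most 2 rotations can run at once in this ordering.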
16
Computational Issues Clustering

Clustering is another bottleneck in time and in
space.
Reducing dimensionality with LSI increases both
space and time efficiency.
The k-Windows unsupervised clustering algorithm was
applied.
The algorithm tries iteratively to capture
clusters in a number of d-dimensional rectangles
which are then merged based on some criterion.
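
A minimal sketch of the window movement step only, assuming that a window is an axis-aligned d-dimensional box that is repeatedly recentred on the mean of the points it encloses. The slides give no details of the enlargement and merging criteria, so those are left out. The linear scan in `points_in_window` is a placeholder for the range search that a Range Tree or R-Tree would answer; all names and parameters are hypothetical.

```python
def points_in_window(points, center, half_width):
    """Linear-scan range search: points inside the box around `center`."""
    return [
        p for p in points
        if all(abs(x - c) <= half_width for x, c in zip(p, center))
    ]

def move_window(points, center, half_width, min_shift=1e-6, max_iter=100):
    """Recentre the window on the mean of its points until it stops moving."""
    center = list(center)
    for _ in range(max_iter):
        inside = points_in_window(points, center, half_width)
        if not inside:
            break  # an empty window has nowhere to move
        mean = [sum(p[d] for p in inside) / len(inside) for d in range(len(center))]
        shift = max(abs(m - c) for m, c in zip(mean, center))
        center = mean
        if shift < min_shift:
            break
    return center

# Two well separated toy groups of 2-dimensional points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9)]
final_center = move_window(points, center=(0.5, 0.5), half_width=1.0)
```

The window started near the first group settles on that group's mean and never sees the second group.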

17
Computational Issues Clustering

For the d-dimensional operations we used:
Range Tree
R-Tree
To cope with time we parallelized k-Windows.
To cope with the space requirements of d-dimensional
data structures we chose R-Trees, which show similar
time behavior to Range Trees.

18
Computational Issues Clustering

Parallel k-Windows
Movement and enlargement of windows are independent
procedures, so their parallelization is straightforward.
Merging was parallelized by distributing the
merge operations for a specific window i to the
processors.
Two single merge operations for window i can affect
only different windows, so they can be executed in
parallel.
When the operations for window i are finished we
proceed to the next window and follow the same
parallelization scheme.
This gave a 2.4 speedup on a 4-processor machine.
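
To put the two reported speedups in perspective, they can be converted to parallel efficiency, defined in the usual way as the speedup divided by the number of processors. The symbols below are introduced here and are not from the slides; the figures are the ones reported for the Jacobi method and for k-Windows.

```latex
E = \frac{\text{speedup}}{p}, \qquad
E_{\text{Jacobi}} = \frac{2.07}{4} \approx 0.52, \qquad
E_{k\text{-Windows}} = \frac{2.4}{4} = 0.60
```

So on 4 processors roughly half to three fifths of the ideal linear speedup was obtained.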

19
Results

Input originates from the online journal BioMed
(www.biomedcentral.com)
Boolean query:
transcription factors AND signaling cascades
Final input: 73 docs of total size 3.7 MB

20
Resulted Clusters
21
Remarks about the results

In the previous table we demonstrate the effect
of dimensionality on the quality of the clusters.
We noticed that the results remain the same
despite the increase in the dimensions we keep.
Keeping low dimensionality is vital as high
dimensionality dramatically increases the cost of
clustering.

22
Biological meaning of the results

In the final clusters we distinguish:
Yellow Cluster, with documents 3 and 25
These documents refer to osteoarthritis and
rheumatoid arthritis respectively, describing
procedures for inhibiting the action of
interleukins, which are responsible for the
deterioration seen in those diseases.
