Concept Extraction from Biological Corpora - PowerPoint PPT Presentation

Title: Concept Extraction from Biological Corpora


1
Concept Extraction from Biological Corpora
  • A Text Mining approach that scales on Parallel
    Architectures

Supervised by Christos Makris
University of Patras, Department of Computer Science and
Engineering
2
Motivation
  • The vast literature of biomedical papers.
  • New discoveries in biomedical sciences.
  • Need for Text Mining tools to help the researcher
    gather the scattered knowledge.

3
Aim of Text Mining Tools
  • Text Mining Tools concerning biological corpora
    can have several targets, such as:
  • Extracting gene relations
  • Extracting evolution paths
  • Discovering biomolecules interactions
  • We aim at a general approach capable of
    extracting groups of correlated terms from a
    collection of documents.

4
Common Text Mining Practices
  • Some of the popular text mining techniques are:
  • Rule Based
  • Depending on knowledge databases
  • Applying statistical measures
  • Applying Natural Language Processing

5
Our Approach
6
Phase 1 Text Retrieval
  • Boolean retrieval of biological papers.
  • The set of retrieved documents is the outcome of
    a boolean search in the database.
  • The full text of each document is stored in a
    separate file.

7
Phase 2 Linguistic Processing
  • Input: Full text of biological papers
  • Output: Term-by-Document matrix
  • Stemming is applied (PorterStemmer).
  • A stoplist is used to remove common words.
  • The TF/IDF metric is used to weight terms.
  • Only terms which occur in a significant number of
    different documents are allowed.
  • So less significant terms are discarded.
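
A minimal sketch of this phase, assuming a toy stoplist, a simple regex tokenizer, and the common tf × log(N/df) weighting (the slides do not state the exact TF/IDF variant). Stemming is left out to keep the example dependency-free; the names `STOPLIST`, `tokenize`, `term_document_matrix` and the `min_df` parameter are illustrative, not from the presentation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stoplist; a real one would be much longer.
STOPLIST = {"the", "of", "a", "and", "in", "is", "to", "by"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stoplist words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPLIST]

def term_document_matrix(docs, min_df=2):
    """Return (terms, matrix); matrix[i][j] is the TF-IDF weight of terms[i] in docs[j].

    Terms that occur in fewer than min_df different documents are discarded.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter()                      # document frequency of each term
    for c in counts:
        df.update(c.keys())
    terms = sorted(t for t, n in df.items() if n >= min_df)
    n_docs = len(docs)
    matrix = [[c[t] * math.log(n_docs / df[t]) for c in counts] for t in terms]
    return terms, matrix

docs = [
    "The transcription factor binds the promoter.",
    "Signaling cascade activates the transcription factor.",
    "The cascade is inhibited by the interleukin antagonist.",
]
terms, matrix = term_document_matrix(docs, min_df=2)
print(terms)
```

With `min_df=2`, only the terms shared by at least two of the three toy documents survive, which mirrors the filtering rule on this slide.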

8
Phase 3 Latent Semantic Indexing
The SVD decomposition
9
Reasons for using LSI
  • LSI provides us with the rank-k approximation of
    the Term-by-Document matrix.
  • It reduces noise in the representation caused by
    synonymy.
  • LSI reduces the dimensionality. Let the rows be
    the document vectors.
  • Then instead of the original rows we can
    represent docs with the rows of the matrix U·Sk.
  • Doc vectors have fixed dimensionality k.
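
In standard SVD notation the reduction can be written as follows. The symbols A, m, n, Σ and V are assumptions introduced here; the slides only name U and Sk. A is taken with one row per document, as on this slide, so it is the transpose of a matrix laid out with terms as rows.

```latex
A \in \mathbb{R}^{m \times n}
\quad (m \text{ documents},\; n \text{ terms}),
\qquad
A = U \Sigma V^{T}

A_k = U_k \Sigma_k V_k^{T},
\qquad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),
\quad \sigma_1 \ge \dots \ge \sigma_k

\text{document } i \;\longmapsto\; \text{row } i \text{ of } U_k \Sigma_k = A V_k \;\in\; \mathbb{R}^{k}
```

Here U_k and V_k hold the first k columns of U and V, so every document is described by k coordinates regardless of the vocabulary size n.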

10
Phase 4 Clustering
  • Input: The new document vectors
  • Output: Document clusters
  • The intuition behind clustering:
  • Docs contain the semantic structure.
  • Clustering will group semantically similar docs
    together.
  • Each group can be seen as the answer to a different
    query in the vector space model.
  • We will later have to look for those queries →
    terms.

11
Phase 5 Concept Extraction
  • For each cluster of documents we compute the
    union of indexing terms.
  • For each term we compute the log-odds formula.
  • Terms of a cluster exceeding a threshold show
    specific preference to that cluster.
  • These terms formulate the query and express the
    core concept of a doc cluster.
  • Under this assumption the query terms are
    expected to be correlated.
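
The slides do not give the exact log-odds formula, so the following is only a sketch under one common definition: the log of the odds that a document in the cluster contains the term, divided by the same odds for documents outside the cluster, with additive smoothing to avoid division by zero. The function names, the `smoothing` value and the `threshold` value are all assumptions.

```python
import math

def log_odds(term, cluster_docs, other_docs, smoothing=0.5):
    """Log-odds ratio of `term` occurring inside vs. outside a cluster.

    Each document is represented as a set of its indexing terms.
    """
    a = sum(term in d for d in cluster_docs)   # cluster docs containing the term
    b = len(cluster_docs) - a                  # cluster docs without it
    c = sum(term in d for d in other_docs)     # outside docs containing the term
    d = len(other_docs) - c                    # outside docs without it
    s = smoothing
    return math.log(((a + s) / (b + s)) / ((c + s) / (d + s)))

def cluster_concept(cluster_docs, other_docs, threshold=1.0):
    """Terms from the union of the cluster's terms whose log-odds exceed the threshold."""
    vocabulary = set().union(*cluster_docs)
    return sorted(t for t in vocabulary
                  if log_odds(t, cluster_docs, other_docs) > threshold)

cluster = [{"interleukin", "arthritis", "cell"},
           {"interleukin", "inhibitor", "cell"}]
others = [{"kinase", "cell"},
          {"promoter", "cell"},
          {"kinase", "promoter", "cell"}]
concept = cluster_concept(cluster, others)
print(concept)
```

In this toy data "cell" appears everywhere, so it shows no preference for the cluster and is dropped, while the cluster-specific terms remain.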

12
Computational Issues Linguistic Processing
  • Linguistic processing is a major time bottleneck
    → it has to parse every single character.
  • To cope with this we propose the following
    parallelization scheme.

13
Computational Issues Linguistic Processing
14
Computational Issues Linguistic Processing
15
Computational Issues of LSI
  • LSI constitutes a major time bottleneck due to
    the SVD. To increase capacity:
  • We reduced the number of indexing terms (stemming,
    IDF filtering).
  • We applied a parallelization scheme for the
    One-Sided Jacobi method, which works as follows.

Each cell indicates which pair of columns is
being orthogonalized. We notice that diagonal
pairs can be orthogonalized simultaneously. This
gave a speedup of 2.07 on 4 processors.
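
A sequential sketch of the one-sided (Hestenes) Jacobi method may make the idea concrete. The grouping rule used here, pairs (i, j) with the same (i + j) mod n, is an assumption about what "diagonal pairs" means on the slide; it is chosen because pairs in one such group share no column and could therefore be rotated concurrently. The actual pair ordering used by the authors is not recoverable from the transcript.

```python
import math

def diagonal_pair_groups(n):
    """Group column pairs (i, j), i < j, by (i + j) mod n.

    No column index occurs twice within a group, so the rotations of one
    group are independent of each other.
    """
    groups = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            groups[(i + j) % n].append((i, j))
    return [g for g in groups if g]

def one_sided_jacobi(a, sweeps=30, tol=1e-12):
    """Singular values of `a` (list of rows), found by orthogonalizing its columns."""
    m, n = len(a), len(a[0])
    a = [row[:] for row in a]           # work on a copy
    for _ in range(sweeps):
        rotated = False
        for group in diagonal_pair_groups(n):
            for i, j in group:          # these rotations could run in parallel
                alpha = sum(a[r][i] ** 2 for r in range(m))
                beta = sum(a[r][j] ** 2 for r in range(m))
                gamma = sum(a[r][i] * a[r][j] for r in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue            # columns already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for r in range(m):      # plane rotation of columns i and j
                    x, y = a[r][i], a[r][j]
                    a[r][i] = c * x - s * y
                    a[r][j] = s * x + c * y
        if not rotated:
            break
    norms = (math.sqrt(sum(a[r][k] ** 2 for r in range(m))) for k in range(n))
    return sorted(norms, reverse=True)

groups4 = diagonal_pair_groups(4)
sv2 = one_sided_jacobi([[1.0, 1.0], [0.0, 1.0]])
sv3 = one_sided_jacobi([[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]])
print(groups4, sv2, sv3)
```

Once the columns are mutually orthogonal, their Euclidean norms are the singular values. This simple grouping needs n rounds per sweep; tighter orderings exist, and the sketch makes no claim about which one produced the reported speedup.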
16
Computational Issues Clustering
  • Clustering is another bottleneck in time and in
    space.
  • Reducing dimensionality with LSI increases both
    space and time efficiency.
  • k-Windows unsupervised clustering algorithm was
    applied.
  • The algorithm tries iteratively to capture
    clusters in a number of d-dimensional rectangles
    which are then merged based on some criterion.
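
As a rough illustration of the iterative part, the sketch below shows only a window "movement" step as it is commonly described for k-Windows: a fixed-size window is recentered on the mean of the points it contains until it stops moving. Enlargement and merging are omitted, and a brute-force scan stands in for the range search structure. The function names and the `half_width`, `tol` and `max_iter` parameters are assumptions, not taken from the presentation.

```python
def points_in_window(points, center, half_width):
    """Brute-force orthogonal range search: points inside the d-dimensional window."""
    return [p for p in points
            if all(abs(x - c) <= half_width for x, c in zip(p, center))]

def move_window(points, center, half_width, tol=1e-9, max_iter=100):
    """Recenter the window on the mean of its enclosed points until it is stable."""
    for _ in range(max_iter):
        inside = points_in_window(points, center, half_width)
        if not inside:
            return center               # empty window: nothing to follow
        mean = tuple(sum(p[k] for p in inside) / len(inside)
                     for k in range(len(center)))
        if max(abs(m - c) for m, c in zip(mean, center)) <= tol:
            return mean
        center = mean
    return center

points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
final_center = move_window(points, center=(0.5, 0.5), half_width=0.6)
print(final_center)
```

The window starts off-center, drifts onto the four nearby points and ignores the distant one. In a real implementation the range search is the costly operation, which is why a dedicated d-dimensional data structure matters.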

17
Computational Issues Clustering
  • For the d-dimensional operations we used:
  • Range Tree
  • R-Tree
  • To cope with time we parallelized k-Windows.
  • To cope with the space needed by d-dimensional data
    structures we chose R-Trees → similar time
    behavior to Range Trees.

18
Computational Issues Clustering
  • Parallel k-Windows
  • Movement and enlargement of windows are independent
    procedures → straightforward parallelization.
  • Merging was parallelized by distributing the
    merge operations for a specific window i to the
    processors.
  • Two single merge operations for window i can affect
    only different windows → they can be executed in
    parallel.
  • When the operations for window i are finished we
    proceed to the next window and follow the same
    parallelization scheme.
  • This gave a speedup of 2.4 on a 4-processor machine.
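
For context, the reported speedups can be converted to parallel efficiency using the standard definition E = S / p (speedup divided by processor count). The definition is not stated in the slides; the figures are the ones reported for k-Windows and for the Jacobi scheme.

```latex
E = \frac{S}{p}
\qquad
E_{\text{k-Windows}} = \frac{2.4}{4} = 0.60
\qquad
E_{\text{Jacobi}} = \frac{2.07}{4} \approx 0.52
```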

19
Results
  • Input originates from the online journals of BioMed Central
    (www.biomedcentral.com).
  • Boolean query:
    transcription factors AND signaling cascades
  • Final input: 73 docs of total size 3.7 MB

20
Resulted Clusters
21
Remarks about the results
  • In the previous table we demonstrate the effect
    of dimensionality on the quality of the clusters.
  • We noticed that the results remain the same
    despite the increase in the dimensions we keep.
  • Keeping low dimensionality is vital as high
    dimensionality dramatically increases the cost of
    clustering.

22
Biological meaning of the results
  • In the final clusters we distinguish:
  • Yellow Cluster with documents 3 and 25
  • These documents refer to osteoarthritis and
    rheumatoid arthritis respectively, describing
    procedures for inhibiting the action of
    interleukins, which are responsible for the
    worsening of those diseases.