Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

1
Dynamic Hybrid Clustering of Bioinformatics by
Incorporating Text Mining and Citation Analysis
  • Frizo Janssens, Wolfgang Glänzel, and Bart De Moor

Presented by Cindy Burklow
CS 685: Special Topics in Data Mining
Professor: Dr. Jinze Liu
University of Kentucky
April 17th, 2008
2
Outline
  • Introduction
  • Motivation
  • Related Work
  • Proposed Models
  • Proposed Algorithms
  • Results: Hybrid and Dynamic Clustering
  • Discussion of Pros and Cons
  • Questions
  • References

3
Introduction
  • Bioinformatics
  • Computer Science
  • Information Technology
  • Solves problems in Biomedicine
  • Goal of paper: investigate
  • Cognitive structure
  • Dynamics of bioinformatics core
  • Sub-disciplines
  • ISI Web of Science and MEDLINE
  • Retrieval of core literature in bioinformatics

4
MeSH: Medical Subject Headings
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, pp. 360, 368, KDD '07. ACM,
San Jose, CA, August 2007.
7
Motivation
  • Bioinformatics field
  • Dynamic
  • Evolving discipline
  • Fast growth rate
  • Monitor current trends
  • Predict future direction
  • Decision Making
  • Grants
  • Business Ventures
  • Research Opportunities

8
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, pp. 361, 368, KDD '07. ACM,
San Jose, CA, August 2007.
9
Related Work
  • Web mining
  • Bibliometrics
  • Text mining and citation analysis
  • Mapping of knowledge
  • Charting science and technology fields
  • Textual graph-based approaches
  • Different perceptions of similarity between
    documents or groups of documents

10
Related Work
  • Establishing the Data Set
  • Patra and Mishra: bibliometric study
  • MeSH term based
  • Liberal delineation strategy with maximal recall
  • Broader interpretation of bioinformatics
  • Less restricted search strategy
  • Broader coverage of underlying database
  • 14,563 journal papers

11
Related Work
  • Hybrid Clustering
  • He: unsupervised spectral clustering of web
    pages
  • Wang and Kitsuregawa: contents-link coupled
    clustering algorithm for web pages
  • Dynamic hybrid clustering
  • Mei and Zhai: temporal text mining
  • Kullback-Leibler divergence for coherent themes;
    hidden Markov models
  • Griffiths and Steyvers: Latent Dirichlet
    Allocation with hot topics in PNAS abstracts

12
Models: Data Set - Bibliometric Retrieval Strategy
  • Novel subject delineation strategy
  • Retrieve core literature
  • Combines textual components with bibliometric,
    citation-based techniques
  • Web of Science Edition of Thomson Scientific
  • 7401 bioinformatics-related papers
  • 1981 to 2004
  • Titles, abstracts, author keywords, and MeSH
    terms

13
Models: Text Analysis
  • All text was indexed with Jakarta Lucene Platform
  • Encoded in Vector Space Model using TF-IDF
    weighting scheme
  • Text-based similarities
  • Cosine of angle between the vector
    representations of two papers
  • Stop words excluded during indexing
  • Porter stemmer applied to all remaining terms
    from titles and abstracts
  • Bigrams
  • Candidate list of MeSH descriptors, author
    keywords, and noun phrases
  • Latent Semantic Indexing (LSI) 10 terms
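
The TF-IDF weighting and cosine similarity named on this slide can be sketched in a few lines. This is only an illustration: it uses the textbook tf * log(N / df) weight rather than Lucene's own scoring formula, skips stemming and LSI, and the three tokenized "papers" are made-up toy data. The function names `tfidf_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Textbook tf * log(N / df) weighting (assumption; Lucene's formula differs).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy, already-tokenized "papers" (hypothetical data, not from the study).
papers = [
    ["gene", "expression", "cluster", "microarray"],
    ["gene", "expression", "microarray", "normalization"],
    ["protein", "structure", "prediction"],
]
vecs = tfidf_vectors(papers)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # papers sharing three terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # papers sharing no terms
print(round(sim_01, 3), round(sim_02, 3))
```

A text-based distance for clustering can then be taken as 1 minus the cosine similarity.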

14
Models: Citation Analysis
  • Citation Graphs
  • Link-based algorithms
  • HITS
  • PageRank

Combined to identify representative publications
Image Reference: Google logo from
http://www.google.com
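
The two link-based algorithms listed on this slide can be sketched on a toy citation graph. This is a minimal illustration, not the configuration used in the paper: the damping factor, iteration count, and the four-paper graph are assumptions, and how the two scores are combined into "representative publications" is not specified here.

```python
import math

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict: paper -> list of papers it cites."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling paper (cites nothing): spread its rank evenly.
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

def hits(links, iterations=50):
    """HITS hub and authority scores, L2-normalized after each update."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the papers citing this one.
        auth = {v: 0.0 for v in nodes}
        for v, targets in links.items():
            for t in targets:
                auth[t] += hub[v]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        # Hub: sum of authority scores of the papers this one cites.
        hub = {v: sum(auth[t] for t in links.get(v, [])) for v in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {v: h / norm for v, h in hub.items()}
    return hub, auth

# Hypothetical citation graph: papers A, B and D all cite paper C.
citations = {"A": ["C"], "B": ["C"], "D": ["C"], "C": []}
ranks = pagerank(citations)
hubs, authorities = hits(citations)
top_by_pagerank = max(ranks, key=ranks.get)
top_authority = max(authorities, key=authorities.get)
```

In this toy graph the heavily cited paper C comes out on top under both measures.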
15
Models: Clustering
  • Agglomerative Hierarchical Clustering Algorithm
    with Ward's Method
  • Hard Clustering Algorithm
  • Every publication is assigned to exactly 1
    cluster.

Image Reference: Clustering Analysis -
http://en.wikipedia.org/wiki/Data_clustering
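
Ward's criterion merges, at every step, the two clusters whose union increases the within-cluster sum of squares the least. The following is a naive sketch under simplifying assumptions: it works on raw coordinates (the slides describe clustering from pairwise distances), recomputes centroids at each step instead of using an update formula, and uses made-up 2-D points. `ward_clustering` is a hypothetical name.

```python
def ward_clustering(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(cluster):
        return [sum(points[i][d] for i in cluster) / len(cluster) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters

# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = ward_clustering(points, n_clusters=2)
```

Because the procedure only ever merges whole clusters, the result is a hard clustering: each index appears in exactly one cluster.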
16
Models: Clustering
  • Optimal number of clusters
  • Strategy: combine distance-based and
    stability-based methods

Silhouette curves: mean text- and citation-based
Dendrogram observation
Stability diagram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, pp. 364, 365, KDD '07. ACM,
San Jose, CA, August 2007.
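
A silhouette curve plots the mean silhouette coefficient against the number of clusters; higher values indicate tighter, better-separated clusters. A minimal sketch of the coefficient itself, assuming a precomputed symmetric distance matrix and at least two clusters (the toy 1-D data and the name `mean_silhouette` are illustrative only):

```python
def mean_silhouette(dist, labels):
    """Mean silhouette coefficient from a distance matrix and hard labels.

    Assumes at least two distinct labels.
    """
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in other) / len(other)
                for other_lab, other in members.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 1-D data (hypothetical): two tight groups far apart.
values = [0, 1, 10, 11]
dist = [[abs(x - y) for y in values] for x in values]
good = mean_silhouette(dist, [0, 0, 1, 1])  # natural grouping
bad = mean_silhouette(dist, [0, 1, 0, 1])   # groups mixed up
```

Evaluating this score for each candidate number of clusters, once on text-based and once on citation-based distances, gives the kind of curve the slide refers to.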
17
Proposed Algorithm: Hybrid Clustering
  • Cluster Input Distances
  • Combining text mining and bibliometrics
  • Integrate text and citation information early in
    the mapping process, before applying the
    clustering algorithm
  • Weighted linear combination
  • Fisher's inverse chi-square method

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, pp. 362, 363, KDD '07. ACM,
San Jose, CA, August 2007.
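
Both integration options on this slide reduce two per-pair quantities to one. A hedged sketch: the weight `alpha` and the function names are assumptions, and the step that turns text-based and citation-based distances into p-values is not described on the slide, so the second function simply takes p-values as given. Fisher's method combines k p-values through the statistic -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis.

```python
import math

def weighted_linear_combination(d_text, d_cite, alpha=0.5):
    """Blend a text-based and a citation-based distance; alpha is an assumed weight."""
    return alpha * d_text + (1.0 - alpha) * d_cite

def fisher_combined_p(p_values):
    """Fisher's inverse chi-square method for p-values in (0, 1]."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function with 2k degrees of freedom has a closed
    # form for even df: exp(-x/2) * sum_{j<k} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total

blended = weighted_linear_combination(0.2, 0.6, alpha=0.5)
combined = fisher_combined_p([0.5, 0.5])
```

With a single p-value the method returns that value unchanged, which is a quick sanity check on the implementation.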
18
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 363, KDD '07. ACM, San
Jose, CA, August 2007.
19
Proposed Algorithm: Dynamic Hybrid Clustering
  • Goal: match and track clusters through time
  • Process
  • Separate hybrid clustering for each period
  • Determine optimal number of clusters
  • Dendrogram
  • Silhouette curve
  • Ben-Hur stability plot
  • Construct complete graph
  • All cluster centroids from each period as nodes
  • Edge weights as mutual cosine similarities in LSS
  • Form Cluster Chains
  • Keep edges with weights > threshold T1
  • Allow qualifying clusters to join if > threshold T2
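
The chain-forming steps above can be sketched as follows. This is one plausible reading of the slide, not the authors' exact procedure: chains are taken to be connected components of the graph after dropping edges at or below T1, and a cluster left unchained may then join the chain of its most similar chained cluster if that similarity exceeds T2. The centroid vectors, threshold values, and the names `cosine` and `cluster_chains` are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_chains(centroids, t1, t2):
    """centroids: dict (period, cluster_id) -> vector. Returns sorted chains."""
    nodes = sorted(centroids)
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Complete graph: one weighted edge per pair of cluster centroids.
    sims = {(a, b): cosine(centroids[a], centroids[b])
            for i, a in enumerate(nodes) for b in nodes[i + 1:]}

    # Step 1: keep only edges above T1 and link their endpoints.
    for (a, b), s in sims.items():
        if s > t1:
            parent[find(a)] = find(b)
    chained = {v for e, s in sims.items() if s > t1 for v in e}

    # Step 2 (assumed reading): an unchained cluster joins the chain of its
    # most similar chained cluster when that similarity exceeds T2.
    for v in nodes:
        if v in chained:
            continue
        candidates = [(sims[tuple(sorted((v, u)))], u) for u in chained]
        if candidates:
            best_sim, best_u = max(candidates)
            if best_sim > t2:
                parent[find(v)] = find(best_u)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return sorted(g for g in groups.values() if len(g) > 1)

# Hypothetical centroids for three periods; keys are (period, cluster_id).
centroids = {
    (1, "a"): [1.0, 0.0, 0.0], (2, "a"): [0.9, 0.1, 0.0], (3, "a"): [0.7, 0.7, 0.0],
    (1, "b"): [0.0, 0.0, 1.0], (2, "b"): [0.0, 0.1, 1.0],
}
chains = cluster_chains(centroids, t1=0.95, t2=0.6)
strict = cluster_chains(centroids, t1=0.95, t2=0.9)
```

With the looser T2, the drifting period-3 cluster is attached to chain "a"; with the stricter T2 it stays out.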

20
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 367, KDD '07. ACM, San
Jose, CA, August 2007.
21
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 367, KDD '07. ACM, San
Jose, CA, August 2007.
22
Results: Hybrid Clustering - Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 364, KDD '07. ACM, San
Jose, CA, August 2007.
23
Results: Hybrid Clustering - Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 364, KDD '07. ACM, San
Jose, CA, August 2007.
24
Results: Hybrid Clustering - Stability
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 365, KDD '07. ACM, San
Jose, CA, August 2007.
25
Results: Hybrid Clustering - Dendrogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 365, KDD '07. ACM, San
Jose, CA, August 2007.
26
Results: Hybrid Clustering - Cluster Characterization
27
Results: Dynamic Clustering - Histogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 365, KDD '07. ACM, San
Jose, CA, August 2007.
28
Results: Dynamic Clustering - Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 367, KDD '07. ACM, San
Jose, CA, August 2007.
29
Yearly Publication Output among Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 368, KDD '07. ACM, San
Jose, CA, August 2007.
30
Dynamic Term Network
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De
Moor, Dynamic hybrid clustering of
bioinformatics by incorporating text mining and
citation analysis, p. 368, KDD '07. ACM, San
Jose, CA, August 2007.
31
Pros and Cons
  • Pros
  • Offers fresh perspective on clustering
  • Integrates various techniques
  • Provides insight into bioinformatics
  • Cons
  • Challenge of selecting the optimal number of
    clusters still exists
  • There are many steps required to implement their
    approach

32
Questions
33
References
  • Janssens, F., Glänzel, W., and De Moor, B.
    2007. Dynamic hybrid clustering of
    bioinformatics by incorporating text mining and
    citation analysis. In Proceedings of the 13th ACM
    SIGKDD International Conference on Knowledge
    Discovery and Data Mining (San Jose, California,
    USA, August 12-15, 2007). KDD '07. ACM, New
    York, NY, 360-369. DOI:
    http://doi.acm.org/10.1145/1281192.1281233
  • ISI Web of Science image:
    http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
  • PubMed image: http://www.ncbi.nlm.nih.gov/pubmed/
  • The Apache Jakarta Project:
    http://lucene.apache.org/java/1_4_3/
  • Fisher's method:
    http://en.wikipedia.org/wiki/Fisher%27s_method
  • Han, J. and Kamber, M. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, 2006.
    ISBN 1-55860-901-6.