Title: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
1Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
- Frizo Janssens, Wolfgang Glänzel, and Bart De Moor
Presented by Cindy Burklow
CS 685: Special Topics in Data Mining
Professor: Dr. Jinze Liu
University of Kentucky
April 17th, 2008
2Outline
- Introduction
- Motivation
- Related Work
- Proposed Models
- Proposed Algorithms
- Results: Hybrid and Dynamic Clustering
- Discussion of Pros and Cons
- Questions
- References
3Introduction
- Bioinformatics
  - Computer Science
  - Information Technology
  - Solves problems in Biomedicine
- Goal of paper: investigate
  - Cognitive structure
  - Dynamics of bioinformatics core
  - Sub-disciplines
- ISI Web of Science and MEDLINE
  - Retrieval of core literature in bioinformatics
4MeSH: Medical Subject Headings
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.
7Motivation
- Bioinformatics field
  - Dynamic
  - Evolving discipline
  - Fast growth rate
- Monitor current trends
- Predict future direction
- Decision Making
  - Grants
  - Business Ventures
  - Research Opportunities
8Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
9Related Work
- Web mining
- Bibliometrics
- Text mining and citation analysis
- Mapping of knowledge
- Charting science and technology fields
- Textual and graph-based approaches
- Different perceptions of similarity between documents or groups of documents
10Related Work
- Establishing the Data Set
- Patra and Mishra: bibliometric study
- MeSH term based
- Liberal delineation strategy with maximal recall
- Broader interpretation of bioinformatics
- Less restricted search strategy
- Broader coverage of underlying database
- 14,563 journal papers
11Related Work
- Hybrid clustering
  - He: Unsupervised spectral clustering of web pages
  - Wang and Kitsuregawa: Contents-linked coupled clustering algorithm of web pages
- Dynamic hybrid clustering
  - Mei and Zhai: Temporal text mining
  - Kullback-Leibler divergence for coherent themes; Hidden Markov Models
  - Griffiths and Steyvers: Latent Dirichlet Allocation with hot topics in PNAS abstracts
12Models: Data Set (Bibliometric Retrieval Strategy)
- Novel subject delineation strategy
- Retrieve core literature
- Combines textual components with bibliometric, citation-based techniques
- Web of Science edition of Thomson Scientific
- 7401 bioinformatics-related papers
- 1981 to 2004
- Titles, abstracts, author keywords, and MeSH terms
13Models: Text Analysis
- All text was indexed with the Jakarta Lucene platform
- Encoded in the Vector Space Model using the TF-IDF weighting scheme
- Text-based similarities
  - Cosine of the angle between the vector representations of two papers
- Stop words excluded during indexing
- Porter stemmer
- All remaining terms from titles and abstracts
- Bigrams
  - Candidate list of MeSH descriptors, author keywords, and noun phrases
- Latent Semantic Indexing (LSI) 10 terms
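The TF-IDF encoding and cosine similarity named on this slide can be illustrated with a minimal sketch. It assumes documents are already tokenized and stemmed; the toy token lists, the function names `tfidf_vectors` and `cosine`, and the plain `tf * log(N / df)` weighting are illustrative assumptions, not the exact scheme used by Lucene or by the authors.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical stemmed papers (toy data, not from the study)
docs = [
    ["gene", "express", "microarrai", "cluster"],
    ["gene", "express", "cluster", "analysi"],
    ["protein", "structur", "predict"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # papers sharing several terms
sim_02 = cosine(vecs[0], vecs[2])  # papers sharing no terms
```

Papers that share no terms get similarity 0, and a text-based distance can then be taken as one minus the cosine.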
14Models: Citation Analysis
- Citation Graphs
- Link-based algorithms
- HITS
- PageRank
Representative Publications
Combine
Image reference: Google logo from http://www.google.com
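As a rough illustration of a link-based algorithm on a citation graph, the following sketch runs a basic PageRank power iteration. The damping factor of 0.85, the fixed iteration count, the function name `pagerank`, and the four-paper graph are assumptions for illustration; the slide does not say how the authors configured PageRank or HITS.

```python
def pagerank(cites, damping=0.85, iterations=50):
    """cites maps each paper to the list of papers it cites; returns paper -> score."""
    nodes = set(cites) | {q for targets in cites.values() for q in targets}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # rank held by papers without outgoing citations is spread evenly
        dangling = sum(rank[p] for p in nodes if not cites.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        for p, targets in cites.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical citation graph: A and B cite C, C cites D
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(citations)
```

Papers that are cited, directly or indirectly, by many others receive higher scores, which is one way to surface representative publications within a cluster.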
15Models: Clustering
- Agglomerative hierarchical clustering algorithm with Ward's method
- Hard clustering algorithm
  - Every publication is assigned to exactly 1 cluster.
Image reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
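A minimal sketch of agglomerative clustering with Ward's criterion follows. It assumes points with Euclidean coordinates and a known target number of clusters `k`; the study itself clusters from combined text and citation dissimilarities, so the coordinates, the function name `ward_clustering`, and the toy data are illustrative only.

```python
def ward_clustering(points, k):
    """Merge clusters until k remain; returns a list of sorted point-index lists."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Ward's criterion: increase in within-cluster sum of squares after merging
        na, nb = len(a["members"]), len(b["members"])
        sq = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
        return na * nb / (na + nb) * sq

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return [sorted(c["members"]) for c in clusters]

# Toy 2-D points standing in for publications
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = ward_clustering(points, k=3)
```

Because the result is a partition, every point index appears in exactly one cluster, which matches the hard-clustering property stated on the slide.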
16Models: Clustering
- Optimal number of clusters
- Strategy: combine distance-based and stability-based methods
  - Silhouette curves (mean of text-based and citation-based)
  - Dendrogram observation
  - Stability diagram
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
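The silhouette value used as a distance-based criterion can be sketched as below. The function name `mean_silhouette`, the one-dimensional toy data, and the convention of scoring singleton clusters as 0 are assumptions; the sketch needs at least two clusters. To mimic the slide's "mean of text-based and citation-based" curves, one could compute this once per distance matrix and average the two values for each candidate number of clusters.

```python
def mean_silhouette(dist, labels):
    """dist: square matrix of pairwise distances; labels: cluster id per item."""
    n = len(labels)
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy 1-D positions; distance is the absolute difference
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in xs] for p in xs]
good = mean_silhouette(dist, [0, 0, 1, 1])  # compact, well-separated clusters
bad = mean_silhouette(dist, [0, 1, 0, 1])   # clusters that mix distant items
```

A higher mean silhouette indicates a more compact and better separated partition, so the candidate cluster count with the highest value is a natural choice under this criterion.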
17Proposed Algorithm: Hybrid Clustering
- Cluster input: distances
- Combining text mining and bibliometrics
- Integrate text and citation information early in the mapping process, before applying the clustering algorithm
- Weighted linear combination
- Fisher's inverse chi-square method
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
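The two integration schemes listed on this slide can be sketched generically. The slide does not show how text and citation distances are converted to p-values before Fisher's method is applied, so this sketch assumes the p-values are already given (each in the interval (0, 1]); the function names and the example weight are also assumptions.

```python
import math

def linear_combination(d_text, d_cite, weight=0.5):
    """Weighted linear combination of a text-based and a citation-based distance."""
    return weight * d_text + (1.0 - weight) * d_cite

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k degrees of freedom."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic divided by 2
    # closed-form chi-square survival function for an even number of degrees of freedom
    return math.exp(-half_stat) * sum(half_stat ** i / math.factorial(i) for i in range(k))

combined_distance = linear_combination(0.2, 0.8, weight=0.25)
combined_p = fisher_combined_p([0.05, 0.05])
```

For two p-values the combined value reduces to `p1 * p2 * (1 - ln(p1 * p2))`, which gives a quick way to check the implementation by hand.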
18Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 363, KDD '07. ACM, San Jose, CA, August 2007.
19Proposed Algorithm: Dynamic Hybrid Clustering
- Goal: match and track clusters through time
- Process
  - Separate hybrid clustering for each period
  - Determine optimal number of clusters
    - Dendrogram
    - Silhouette curve
    - Ben-Hur stability plot
  - Construct complete graph
    - All cluster centroids from each period as nodes
    - Edge weights as mutual cosine similarities in LSS
  - Form cluster chains
    - Keep edge weights > threshold T1
    - Allow qualifying clusters to join > threshold T2
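The chain-forming step can be sketched as follows, under stated assumptions. Only the first threshold (T1) is implemented, because the slide does not specify the rule by which additional clusters join at T2. Linking only clusters from different periods, the period labels `P1` to `P3`, the centroid vectors, and the function names are all hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_chains(centroids, t1):
    """centroids maps (period, cluster_id) -> vector; returns chains as sorted key lists."""
    parent = {node: node for node in centroids}

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(centroids, 2):
        # assumption: only clusters from different periods are linked
        if a[0] != b[0] and cosine(centroids[a], centroids[b]) > t1:
            parent[find(a)] = find(b)

    chains = {}
    for node in centroids:
        chains.setdefault(find(node), []).append(node)
    return sorted(sorted(chain) for chain in chains.values())

# Hypothetical centroids for three periods
centroids = {
    ("P1", 1): [1.0, 0.1, 0.0],
    ("P1", 2): [0.0, 1.0, 0.1],
    ("P2", 1): [0.9, 0.2, 0.0],
    ("P2", 2): [0.0, 0.1, 1.0],
    ("P3", 1): [1.0, 0.0, 0.1],
}
chains = cluster_chains(centroids, t1=0.9)
```

Each chain groups the period-specific clusters that remain connected once weak edges are dropped, so a chain spanning several periods represents one topic tracked through time.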
20Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 367, KDD '07. ACM, San Jose, CA, August 2007.
21Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 367, KDD '07. ACM, San Jose, CA, August 2007.
22Results: Hybrid Clustering (Silhouette Curve)
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 364, KDD '07. ACM, San Jose, CA, August 2007.
23Results: Hybrid Clustering (Silhouette Curve)
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 364, KDD '07. ACM, San Jose, CA, August 2007.
24Results: Hybrid Clustering (Stability)
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 365, KDD '07. ACM, San Jose, CA, August 2007.
25Results: Hybrid Clustering (Dendrogram)
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 365, KDD '07. ACM, San Jose, CA, August 2007.
26Results: Hybrid Clustering (Cluster Characterization)
27Results: Dynamic Clustering (Histogram)
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 365, KDD '07. ACM, San Jose, CA, August 2007.
28Results: Dynamic Clustering (Cluster Chains)
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 367, KDD '07. ACM, San Jose, CA, August 2007.
29Yearly Publication Output among Cluster chains
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 368, KDD '07. ACM, San Jose, CA, August 2007.
30Dynamic Term Network
Image reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 368, KDD '07. ACM, San Jose, CA, August 2007.
31Pros and Cons
- Pros
  - Offers fresh perspective on clustering
  - Integrates various techniques
  - Provides insight into bioinformatics
- Cons
  - Challenge of selecting the optimal number of clusters still exists
  - There are many steps required to implement their approach
32Questions
33References
- Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI: http://doi.acm.org/10.1145/1281192.1281233
- ISI Web of Science image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tabWOSproductWOSlast_prodWOSSID3DamC8GFDKmpBLhFOIMsearch_modeGeneralSearch
- PubMed image: http://www.ncbi.nlm.nih.gov/pubmed/
- The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
- Fisher's method: http://en.wikipedia.org/wiki/Fisher%27s_method
- Data Mining - Concepts and Techniques by Han and Kamber, Morgan Kaufmann, 2006. (ISBN 1-55860-901-6)