Title: Delineating the Citation Impact of Scientific Discoveries
1. Delineating the Citation Impact of Scientific Discoveries
- Chaomei Chen (1), Jian Zhang (1), Weizhong Zhu (1), Michael Vogeley (2)
- (1) College of Information Science and Technology, Drexel University
- (2) Department of Physics, Drexel University
 
This work is supported by the National Science 
Foundation under Grant No. 0612129. Thomson ISI 
provides the bibliographic data for the analysis. 
2. As We May Think, by Vannevar Bush
-  There is a growing mountain of research. But 
there is increased evidence that we are being 
bogged down today as specialization extends. The 
investigator is staggered by the findings and 
conclusions of thousands of other workers, 
conclusions which he cannot find time to 
grasp, much less to remember, as they appear. Yet 
specialization becomes increasingly necessary for 
progress, and the effort to bridge between 
disciplines is correspondingly superficial.  
3. An Increasingly Strong Trend in Science (Gray & Szalay 2004)
- Massive scientific data are being collected by one group of scientists and being analyzed by another group of scientists.
- Two notable examples
 - 1. The SDSS project in astrophysics
 - 2. The human genome project in biomedicine
4. Sloan Digital Sky Survey: The most ambitious astronomical survey ever undertaken
There is an increasingly strong trend in science 
that massive scientific data are being collected 
by one group of scientists and being analyzed by 
another group of scientists (Gray & Szalay 2004). 
Two notable examples are the SDSS project in 
astrophysics and the human genome project in 
biomedicine. 
- Sloan Survey Data
 - June 2006, Data Release Five: 8000 square degrees, 1,048,960 spectra.
 - June 2005, Data Release Four: 6670 square degrees, 806,400 spectra.
 - September 2004, Data Release Three: 5282 square degrees, 528,640 spectra.
 - March 2004, Data Release Two: 3324 square degrees, 367,360 spectra.
 - April 2003, Data Release One: 2099 square degrees, 186,240 spectra.
 - June 2001, Early Data Release: 462 square degrees, 52,896 spectra.
- SDSS Literature
 - Total number of articles: 1,478
 - Total citations: 47,282
 - June 18, 2007: H = 95
 - January 30, 2007: H = 89
 
Time Slice   Space   Node     Link
2001-2001     1699    300     7249
2002-2002     2703    519    14808
2003-2003     4294   1036    40133
2004-2004     5580   1218    43398
2005-2005     6692   1685    76009
2006-2006    10279   2815   139300
2007-2007     3136    496    15259
5. Integrating Microscopic and Macroscopic Perspectives
- Connecting text-level patterns (microscopic) and paper-level citation impacts (macroscopic)
- improve our understanding of science in the making
- develop data mining and visual analytics algorithms
7. Figure 3. Prominent keywords assigned by authors 
and burst terms extracted from titles and 
abstracts (2002-2006). 
8. Hc, Ht Split (figure: Class I, Class II)
9. Fast-Growing SDSS Literature
- 1,400 papers
- 40,000 citations
- The total citation number doubled in the past 1.5 years.
- H-index of SDSS literature: from 89 to 95
 
10. As of June 18, 2007, 95 SDSS papers have 95 or 
more citations each (H = 95). The H-index was 89 in January 2007. 
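
The H-index quoted on this slide can be computed directly from a list of per-paper citation counts. The following is a minimal sketch; the function name `h_index` and the sample counts are hypothetical and are not taken from the SDSS data.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the paper at this rank still has enough citations
        else:
            break         # every later paper has even fewer citations
    return h


# Hypothetical citation counts for five papers.
example = [10, 8, 5, 4, 3]
print(h_index(example))
```

Applied to the SDSS literature, the same procedure would return 95 when exactly 95 papers have at least 95 citations each.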
11. Measuring the Citation Impact
- Sc discounts citations accumulated over a long period of time.
 - Sc is adjusted for publication age.
- St measures the recent impact.
 - St gives heavier weights to relatively recent citations than earlier citations.
12.
Year | Title | Cites | Sc | St
2004 | Cosmological parameters from SDSS and WMAP | 404 | 404.00 | 367.00
1995 | The FIRST Survey - Faint Images of the Radio Sky at 20 Centimeters | 455 | 140.00 | 301.64
2003 | Stellar population synthesis at the resolution of 2003 | 371 | 296.80 | 263.47
2001 | Evidence for reionization at z ~ 6: Detection of a Gunn-Peterson trough in a z = 6.28 quasar | 307 | 175.43 | 255.07
2001 | The luminosity function of galaxies in SDSS commissioning data | 250 | 142.86 | 196.73
2003 | A survey of z > 5.7 quasars in the Sloan Digital Sky Survey. II. Discovery of three additional quasars at z > 6 | 195 | 156.00 | 175.80
2001 | A survey of z > 5.8 quasars in the Sloan Digital Sky Survey. I. Discovery of three new quasars and the spatial density of luminous quasars at z ~ 6 | 226 | 129.14 | 174.87
2002 | Evolution of the ionizing background and the epoch of reionization from the spectra of z ~ 6 quasars | 211 | 140.67 | 170.00
2001 | Composite quasar spectra from the Sloan Digital Sky Survey | 221 | 126.29 | 168.21
2004 | The three-dimensional power spectrum of galaxies from the Sloan Digital Sky Survey | 224 | 224.00 | 167.00
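
The Sc values in this table appear consistent with Sc = 4 × Cites / (2007 − Year + 1), that is, the raw citation count divided by the paper's age and scaled by 4. This is an observation from the ten rows, not a formula stated in the slides; the function name `sc_score` and the parameter names `gamma`, `delta`, and `now` are assumptions. A short check:

```python
def sc_score(cites, year, now=2007, gamma=4.0, delta=1.0):
    """Age-adjusted citation score; parameter values are inferred, not stated."""
    age = now - year + 1
    return gamma * cites * age ** (-delta)


# (year, cites, Sc as reported in the table)
rows = [
    (2004, 404, 404.00),
    (1995, 455, 140.00),
    (2003, 371, 296.80),
    (2001, 307, 175.43),
    (2001, 250, 142.86),
    (2003, 195, 156.00),
    (2001, 226, 129.14),
    (2002, 211, 140.67),
    (2001, 221, 126.29),
    (2004, 224, 224.00),
]

for year, cites, reported in rows:
    computed = round(sc_score(cites, year), 2)
    print(year, cites, reported, computed)
```

Under this reading, a 2004 paper keeps its full citation count, while the 1995 paper's 455 citations are discounted to 140.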
13. Hg Indices and Splits
- The 1,293 records
 - H-index: 65, including 3 papers with 65 citations
 - Hc index: 52
 - Ht index: 53
- The H split
 - 67 papers in the highly cited group
 - 1,226 remaining papers in the second group
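
One reading of the H split is that every paper with at least h citations goes into the highly cited group, which is why ties at the threshold can make that group larger than h (67 papers for an H-index of 65). The sketch below assumes that rule; the function names and the sample counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h_split(citations):
    """Partition citation counts at the h-index (assumed rule: cites >= h)."""
    h = h_index(citations)
    high = [c for c in citations if c >= h]
    rest = [c for c in citations if c < h]
    return h, high, rest


# Hypothetical counts: h is 3, but two papers tie at exactly 3 citations.
h, high, rest = h_split([9, 7, 3, 3, 1, 0])
print(h, len(high), len(rest))
```

In the hypothetical sample, the tie at 3 citations puts four papers in the highly cited group even though h is 3.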
 
14. (Figure: Class I, Class II)
15. Significant Noun Phrases
- 22,665 noun phrases identified by a part-of-speech tagging and pattern matching process.
- 290 of them are selected based on their log-likelihood ratios.
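
The slides do not say which log-likelihood statistic was used for term selection. A common choice is the G² statistic on a 2×2 table of term presence versus citation group, sketched below. The function name, the term counts (40 and 20), and the 6.63 cutoff (the chi-square critical value for p < 0.01 with one degree of freedom) are assumptions; only the group sizes 379 and 914 come from the deck.

```python
import math


def log_likelihood_ratio(k1, n1, k2, n2):
    """G^2 for a term found in k1 of n1 papers in one group and k2 of n2 in the other."""
    observed = [[k1, n1 - k1], [k2, n2 - k2]]
    row_totals = [n1, n2]
    col_totals = [k1 + k2, (n1 - k1) + (n2 - k2)]
    total = n1 + n2
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            if o > 0:                                  # 0 * log(0) is taken as 0
                g2 += o * math.log(o / e)
    return 2.0 * g2


# Hypothetical term: present in 40 of 379 highly cited papers and 20 of 914 others.
score = log_likelihood_ratio(40, 379, 20, 914)
print(score, score > 6.63)
```

A term that is equally frequent in both groups scores 0 and would not be selected.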
Total terms: 22,665
                 A(Sc)   G(Sc)   A(St)   G(St)
Pivotal value    11.70   11.06   11.46    8.61
High               379     379     328     401
Low                914     914     965     892
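
Assuming A and G in the table denote arithmetic-mean and geometric-mean pivots (Figure 4 mentions a geometric mean split), the High/Low partition can be sketched as follows. The function names, the `>=` rule at the pivot, the restriction to positive scores, and the sample scores are all assumptions, since the slides do not spell them out.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean; assumes every value is strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def split_by_pivot(scores, pivot):
    """Papers scoring at or above the pivot are 'high', the rest 'low'."""
    high = [s for s in scores if s >= pivot]
    low = [s for s in scores if s < pivot]
    return high, low


# Hypothetical, heavily skewed scores, as citation data tends to be.
scores = [1, 2, 4, 8, 100]
for name, pivot in (("arithmetic", arithmetic_mean(scores)),
                    ("geometric", geometric_mean(scores))):
    high, low = split_by_pivot(scores, pivot)
    print(name, round(pivot, 2), len(high), len(low))
```

Because the geometric mean is never larger than the arithmetic mean, a geometric pivot places at least as many papers in the high group, which matches the St columns of the table (401 versus 328).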
16. Figure 4. An overview of a decision tree 
generated based on 216 terms selected by 
log-likelihood ratio values (p < 0.01) and a 
geometric mean split (74.44% classification 
accuracy). The tree should be read from the root 
downwards. 
17. Figure 5. A part of the tree shown in Figure 4. 
The presence (> 0) or absence (< 0) of a term is 
associated with a citation status group, i.e. the 
highly and timely cited group.  
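
The slides do not state which split criterion the tree learner used. As an illustration of how the presence or absence of a term can separate citation groups, the sketch below computes information gain, a common criterion in decision tree induction, for a binary term feature. The function names, the labels, and the sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(has_term, labels):
    """Entropy reduction from splitting papers on presence/absence of one term."""
    present = [lab for flag, lab in zip(has_term, labels) if flag]
    absent = [lab for flag, lab in zip(has_term, labels) if not flag]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder


# Four hypothetical papers and two hypothetical terms.
labels = ["high", "high", "low", "low"]
term_a = [1, 1, 0, 0]   # appears only in highly cited papers
term_b = [1, 0, 1, 0]   # appears equally in both groups

print(information_gain(term_a, labels))
print(information_gain(term_b, labels))
```

A tree learner using this criterion would place a term like the hypothetical `term_a` near the root, since it separates the two groups, and would ignore `term_b`.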
18. Figure 6. An ADTree derived from the data 
selected with the same selection criteria, with 
70.55% accuracy. 
19. Figure 7. A decision tree with 95.82% 
classification accuracy derived from 721 terms 
and 1,267 records. 
20. Figure 10. The citation history of timeliness 
papers shows that recently published papers move 
up in the rankings.  
21. Future Work
- Unsupervised ontology construction to smooth the feature space
- Incremental classification of incoming new data and scholarly publications
- Self-directed optimization of existing decision trees based on new evidence
- Full-text analysis that can model associative relations between hypotheses and evidence and between facts and opinions