Data Reuse, Sharing, and Production: An article-centric investigation of data citation practices in prominent journals




Data Reuse, Sharing, and Production: An article-centric investigation of data citation practices in prominent journals. Sarah Judson, DataONE, Summer 2010


Data Reuse, Sharing, and Production: An
article-centric investigation of data citation
practices in prominent journals
  • Sarah Judson
  • DataONE
  • Summer 2010

  • Many journals have data citation, or at least
    data sharing policies.
  • Most are recommendations
  • Many will soon be mandatory
  • But are they enacted?
  • Multiple depositories exist for data sharing
  • Allow browsing for available data
  • Provide space for data storage
  • Recommend how data reuse should be properly cited
  • But are they utilized?

  • Report current status of data citation and
    sharing to relevant journals
  • Recommend best practices
  • Increase ability to retrace and reuse data
  • Ease transition to mandatory policies
  • Promote appropriate credit to data author

  • Advent of data sharing/citation policies
  • Continued expression of the need for increased
    data sharing, esp. for meta-analysis and global
    change studies
  • Similar studies in Biomedical journals or
    focused on GenBank, but few in
    Ecological/Evolutionary journals
  • Piwowar and Chapman. Public sharing of research
    datasets: A pilot study of associations. Journal
    of Informetrics, April 2010, 4(2):148-156
  • Noor et al. 2006. Data Sharing: How Much
    Doesn't Get Submitted to GenBank? PLoS Biol.

Research Questions
  • What are current practices for data citation
    within articles?
  • Do authors tend to cite the dataset itself or
    a related paper?
  • How does the author obtain the dataset?
  • How do these practices vary across discipline,
    journal, data type, data source?
  • Are data citation practices influenced more by
    attitude of the discipline towards data sharing
    or journal policy?
  • How have these practices varied across time?
  • Does increased data reuse/sharing correlate with
    changes in journal policy?
  • Does data reuse/sharing simply increase with time
    since the advent of the internet?

Angles of Attack
  • Snapshot approach
  • 1st issue in 2010 for journals of interest
  • To assess current state
  • To evaluate utility of a particular journal for
    more detailed Time Series investigation
  • Time Series approach
  • Random sample of 25 articles per journal per year
  • To investigate trends over time, especially
    considering changes in journal data/citation
    policies

Nitty-Gritty Methods
  • Random sampling
  • Export all articles and accompanying metadata
  • 2005-2010
  • Journal-specific
  • Assign record number to each article
  • Generate random numbers to select 25 articles
  • Data Extraction
  • Recorded on Excel spreadsheet, uploaded weekly to
  • Read Journal Citation/Data Policy in Preparation
    for Extraction
  • Read through articles manually
  • Special attention to the Methods and
    Acknowledgements sections.
  • Identify instances of data reuse and sharing
  • Copy relevant excerpts
  • Code according to established fields
  • Record additional metadata
  • Open access, Discipline, Submission to
    Publication duration, etc.
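
The random-sampling steps above (assign a record number to each exported article, then draw 25 per journal per year) can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the record layout, the placeholder DOIs, the journal labels, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_articles(records, per_group=25, seed=2010):
    """Draw up to `per_group` articles at random for each (journal, year)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    groups = defaultdict(list)
    for record_number, rec in enumerate(records, start=1):
        # Assign a record number to each exported article.
        numbered = {**rec, "record_number": record_number}
        groups[(rec["journal"], rec["year"])].append(numbered)

    sample = {}
    for key, articles in groups.items():
        # Sample without replacement; small groups are taken whole.
        k = min(per_group, len(articles))
        sample[key] = rng.sample(articles, k)
    return sample


# Hypothetical export: 40 articles per journal per year, 2005-2010.
records = [
    {"journal": journal, "year": year, "doi": f"10.0000/{journal}.{year}.{i}"}
    for journal in ("SysBio", "AmNat")
    for year in range(2005, 2011)
    for i in range(40)
]
sample = sample_articles(records)
```

Grouping by journal and year before sampling keeps the draw journal-specific, as in the slide, so a journal with many articles cannot crowd out a smaller one.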

Extracted Fields
  • ISI metadata
  • DOI
  • Author and affiliation
  • Abstract and keywords
  • Journal and ISSN
  • For each instance of Data Reuse, Sharing, or
    Production
  • Depository
  • Type of InText and Bibliographic Citation
  • Author-Year, URL, Accession
  • How dataset acquired
  • Is depository clearly referenced?
  • Was it obtained from a colleague?
  • Is it previous work by one of the authors?
  • Where citation occurs
  • Type of Dataset
  • Gene Sequence, Phylogenetic Tree, Ecological, etc
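
One way to picture the extracted fields is as a small record structure, with one entry per article and one sub-entry per instance of reuse or sharing. The sketch below is only an illustration: the class names, field names, and allowed values are assumptions chosen to mirror the slide, and the DOI and ISSN are placeholders. The actual coding was done in a spreadsheet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataInstance:
    kind: str                  # "reuse", "sharing", or "production"
    depository: Optional[str]  # e.g. "GenBank", "TreeBase", "Dryad"; None if not stated
    citation_types: List[str]  # any of "author-year", "url", "accession"
    acquisition: str           # e.g. "depository", "colleague", "own previous work"
    section: str               # where the citation occurs, e.g. "Methods"
    dataset_type: str          # e.g. "gene sequence", "phylogenetic tree", "ecological"
    excerpt: str = ""          # relevant text copied from the article


@dataclass
class ArticleRecord:
    doi: str
    journal: str
    issn: str
    authors: List[str]
    instances: List[DataInstance] = field(default_factory=list)


# Hypothetical article with one coded instance of data reuse.
record = ArticleRecord(
    doi="10.0000/example",   # placeholder DOI
    journal="Systematic Biology",
    issn="0000-0000",        # placeholder ISSN
    authors=["Example Author"],
)
record.instances.append(
    DataInstance(
        kind="reuse",
        depository="GenBank",
        citation_types=["author-year", "accession"],
        acquisition="depository",
        section="Methods",
        dataset_type="gene sequence",
    )
)
```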

Selected Journals
  • Dryad Top Three
  • Justification
  • 1. Most data currently posted; is it really
    being reused?
  • 2. Known "High Impact" Journals
  • 3. Cover target disciplines and depositories
  • Systematic Biology (Systematics,
  • American Naturalist (Behavior, Natural History,
  • Molecular Ecology (Genetics, Molecular Evolution)
  • Other options: ESA family, Discipline-specific,

  • Only looking at a few journals and disciplines
  • Relying only on the main text
  • Not looking at supplementary material unless
    article extremely unclear
  • Have to assume if it wasn't stated, it wasn't
  • Would have developed automated extraction, text
    coding if time permitted
  • Process more articles
  • Remove bias
  • Standardization

Unresolved Problems (suggestions please!)
  • Data Type Classification
  • Easy: Gene Sequences and Phylogenetic Trees
  • Biology vs. Ecology
  • Subdivisions in Biology, Earth, etc.
  • Bio: Morphology, Behavior
  • Eco: Competition, Community
  • Earth: Soils, GIS
  • Articles according to ISI
  • AmNat
  • High are models
  • Notes and Comments
  • Natural History Miscellany
  • SysBio
  • Points of View
  • Author Recurrence
  • SysBio: only 50 articles per year and multiple
    publications/accreditations to the same people
    (Wiens, Sullivan)
  • AmNat: less pronounced problem (Abrams)

  • Qualitative observations
  • Good citation, bad citation
  • Journal Comparison
  • Time Series
  • Reuse
  • Sharing results not presented
  • Data type
  • Depository

Qualitative Observations
  • Internal (journal) supplementary depositories
    used more as a dump than for reusable data
  • Additional or color figures and tables
  • Statistical outputs
  • InText citations allude to a raw data supplement,
    but it often ends up being raw results
  • Defunct data storage
  • Personal URLs
  • Problem retrieving supplementary data (SysBio
  • More data produced than shared
  • Alignments and Trees often not posted to TreeBase
  • Ecological datasets grossly under-shared

Haphazard citation practices
  • Accessions cited in Text vs. Table
  • Author vs. Accession
  • Only depository referenced
  • Especially with large datasets
  • Some in Methods, Some in Results
  • Majority of reuses cited in Methods
  • Sharing cited roughly 50/50 between Methods and
    Results
  • Crediting self before others
  • Bibliographic citations not given or only for
    same author
  • Give article citation for self, but not
    accession; accession for others, but not their
  • Disparate citation formats within a single paper

Good Citation
  • Previously published sequence data were used for
    V. velella 18S (Collins, 2002, GenBank AF358087),
    P. porpita 18S (Collins, 2002, AF358086),
    Staurocladia wellingtoni 18S (Collins, 2002,
    AF358084), S. wellingtoni 16S (Schuchert, 2005,
    AJ580934), Hydra circumcincta 18S (Medina et al.,
    2001, AF358080), and H. vulgaris 16S
    (Pont-Kingdon et al., 2000, AF100773).
  • Taxon
  • Gene region
  • Author-Year
  • accompanying bibliographic citation
  • GenBank Accession

Bad Citation
  • Incomplete
  • The sequences, which were all produced in our
    previous studies (Aceto et al. 1999; Cozzolino
    et al. 2001) and are available in GenBank
  • Usually missing accession, sometimes author and
  • Sometimes the info is buried in tables or not
    given for large compilations
  • Unclear
  • During annual aerial surveys, observers sketch
    the extent of defoliation from the air on paper
    or digital maps (Ciesla 2000) that are then
    compiled as a series of polygons in a
    geographical information system (GIS) (Liebhold
    et al. 1997).
  • Who is the original data author?
  • Are these theoretical, methodological or data
  • Bibliographic citations occasionally shed light
  • Where is the data stored?

What is a good citation?
  • Data easily retraceable
  • Proper credit given
  • Criteria
  • Depository mentioned in text
  • Accession mentioned in text
  • Author credit given in Bibliography
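
The three criteria above can be turned into a rough automated check. The sketch below is an assumption-laden illustration rather than the coding scheme used in the study: the depository list, the simplified GenBank accession pattern (one letter plus five digits, or two letters plus six digits), the author-year pattern, and the placeholder bibliography entry are all chosen for the example. It is applied to the good and bad excerpts quoted in the earlier slides.

```python
import re

# Assumed list of depositories to look for in the text.
DEPOSITORIES = ("GenBank", "TreeBase", "Dryad")
# Simplified GenBank-style accession: 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")
# In-text author-year citation opening a parenthesis, e.g. "(Collins, 2002".
AUTHOR_YEAR = re.compile(r"\(([A-Z][A-Za-z-]+)(?: et al\.)?,? (\d{4})")


def check_citation(text, bibliography):
    """Score an excerpt against the three 'good citation' criteria."""
    cited = AUTHOR_YEAR.findall(text)
    # Author credit counts only if every in-text author-year pair
    # can be matched to some bibliography entry.
    credited = bool(cited) and all(
        any(name in entry and year in entry for entry in bibliography)
        for name, year in cited
    )
    return {
        "depository_in_text": any(d.lower() in text.lower() for d in DEPOSITORIES),
        "accession_in_text": bool(ACCESSION.search(text)),
        "author_in_bibliography": credited,
    }


good = ("Previously published sequence data were used for "
        "V. velella 18S (Collins, 2002, GenBank AF358087)")
bad = ("The sequences, which were all produced in our previous studies "
       "(Aceto et al. 1999 Cozzolino et al. 2001) and are available in GenBank")

# Placeholder bibliography entry, not a real reference.
good_result = check_citation(good, ["Collins 2002 (placeholder entry)"])
bad_result = check_citation(bad, [])
```

Under these assumptions the good excerpt meets all three criteria, while the bad excerpt names a depository but gives no accession and has no matching bibliography entry. A pattern-based check like this would still miss accessions buried in tables or supplements.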

Citations: Systematic Biology
Citations: American Naturalist
Journal Comparison: Snapshot
Columns: Data Reuse, Data Sharing
  • Systematic Biology: Frequent use of GenBank; Occasional use of TreeBase; Often post to TreeBase, but often unclear about GA vs. PT; Internal: Difficult (no unique accession; generic URL; not accessible pre-2008)
  • American Naturalist: Varied data (biological); Often extracted from literature or used to validate a model; Occasional sharing; Dryad, TreeBase, GenBank, internal
  • Molecular Ecology: Frequent use of GenBank, but steadily drops off after 2009; Some morphological data matrices; Posting to GenBank, but alternatively given in Methods and Results; Level of accessibility varies widely
  • Ecology: Minor datasets; Extensive datasets rarely shared; Ecological Archives (accessible but used for excess figures and results)
Journal Comparison: Reuse and Sharing in 2010
Percent Reuse over time
Depository: Systematic Biology
Depository: American Naturalist
Data Types: Systematic Biology
Data Types: American Naturalist
Back to the big picture
  • Inform journals and depositories about current
    practices vs. policy
  • Best Practices recommendations
  • Continued research on trends in data citation

Suggested Best Practices
  • Accession numbers and Authors of each dataset
    (reused and shared) given in the Methods or
    Supplementary Table referenced in the Methods
  • Authors not charged extra page/online fees
  • Authors allowed to exceed Reference limit to
    credit data
  • Editorial enforcement
  • Checklist
  • Internal Depositories made more accessible
  • Usable formats
  • Unique and Stable URLs

Long Term Best Practices
  • Separate Supplementary Data Section
  • Example: Molecular Ecology
  • SysBio added a separate section but it is defunct
  • AmNat has an Online enhancements header
  • For both internal and external deposits
  • Distinguish from data-dump (extra figures,
  • Accompanying References section
  • Unlimited length
  • DATA cited, in addition to publication
  • Could combine into a new reference type
  • Author. Year. Title. Journal. Pages. Depository.
  • Track on par with publications in ISI, etc
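
The proposed reference type (Author. Year. Title. Journal. Pages. Depository.) could be assembled mechanically from the coded fields. The function below is a minimal sketch under that reading of the slide; the function name and every value in the example are hypothetical placeholders, not a real reference.

```python
def format_data_reference(author, year, title, journal, pages, depository):
    """Join the six fields in the order given on the slide, each ending in a period."""
    parts = [author, str(year), title, journal, pages, depository]
    # Strip any trailing period first so fields such as "Doe, J." are not doubled.
    return " ".join(part.rstrip(".") + "." for part in parts)


# All values below are placeholders for illustration only.
reference = format_data_reference(
    author="Doe, J.",
    year=2010,
    title="Example dataset title",
    journal="Example Journal",
    pages="1-10",
    depository="ExampleDepository",
)
print(reference)
# Doe, J. 2010. Example dataset title. Example Journal. 1-10. ExampleDepository.
```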

Continued Research
  • Snapshot and time series of Molecular Ecology
  • Possibly Ecology if time permits
  • Alternative (suggestions please!): Just snapshots
  • Trends over time
  • Has reuse and sharing increased?
  • Have citation practices improved over time?
  • Is this influenced by journal/depository
    recommendations on citations?
  • Correlation with influential factors
  • Is there more data reuse in articles that are
    also open access or share data?
  • Are certain dataset types or article disciplines
    more inclined to reuse/share data?
  • Data shared vs. data produced
  • Sync with Journal/Depository Metadata (Nic) and
    Search Findings (Valerie)
  • Refine Good citation criteria
  • Journal and depository specific

Additional Exploration
  • Track the cited or shared datasets
  • Look at supplemental data alone
  • Internal (journal repositories)
  • Additional data not cited in text?
  • Data dumps?
  • Ease of access
  • Accuracy of accession numbers
  • Actual data reusability
  • Method/processing metadata
  • File format
  • Software/model reuse and sharing
  • R-packages, GUIs
  • Encouraged by American Naturalist
  • Databases
  • Independent databases vs. depositories
  • utilized out of available
  • Caching/stability options, linking metadata to

Final products
  • Reports to requisite journals/depositories
  • Potential Manuscripts
  • Journal Comparison: Citation Practices
  • TreeBase: Shared vs. Produced
  • Best Practices recommendations
  • Shared dataset!

Thanks for listening!
  • Questions?
  • Suggestions?
  • Unresolved problems
  • Continued Research

  • Determining extracted fields
  • Coding data now vs. later

In light of data/citation policies.
  • Compare performance of SysBio and AmNat against
    their depository and journal policies
    (do they meet the requirements?) or state this
    in future research section
  • OWW, Nic: do editor instructions or other
    sections of policy indicate how data/citation
    policies are enforced?