Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Learn more at: http://jats.nlm.nih.gov
1
Portico: A Case Study in the Migration of
Proprietary Formats to the JATS Archiving Format
  • Sheila Morrissey, John Meyer, Sushil Bhattarai,
    Sachin Kurdikar, Jie Ling, Matthew Stoeffler,
    Umadevi Thanneeru

2
Portico & JSTOR: Committed to Preserving the
Scholarly Record
Ithaka helps the academic community use digital
technologies to preserve the scholarly record and
to advance research and teaching in sustainable
ways.
ITHAKA
Digitization for Preservation & Access
Digital Preservation
Light Archive
Dark Archive

3
Portico Archive
  • Portico's objective is to help libraries make a
    secure and reliable transition from print to a
    reliance on e-content.
  • Maintains archiving agreements with publishers to
    collect and preserve content.
  • Receives content directly from publishers.
  • Preserves:
      • Current journals (born digital)
      • Back file journals (reborn digital)
      • E-books
      • Digitized historical collections

4
An Insurance Policy for e-Content
  • Provide libraries with access to archived content
    when it becomes lost, orphaned or abandoned
    (regardless of a library's past or current
    subscription):
      • Publisher ceases operation
      • Publisher discontinues title
      • Publisher drops back file
  • Provide libraries with post-cancellation access
    if the publisher specifically names Portico
      • About 90% of titles in the Archive are covered by
        Portico post-cancellation access rights.
  • Libraries are asked to pay an annual Archive support
    payment to defray the cost of preservation, much
    like an insurance premium

5
Portico Archive as of July 19, 2010
Category                        Files          % of total
Images                          84,215,731     47.93
Publisher Supplied Text         47,393,731     26.98
Portico Created Archival Text   43,689,083     24.87
Application Specific Files      232,732        0.13
Multi-file Packages             140,333        0.08
Videos                          20,604         0.01
Audio                           570            <0.00
Executable                      6              <0.00
Total                           175,692,826    100
  • 114 publisher participants
  • 11,788 committed journal titles
  • 43,253 committed e-books
  • 13 committed digitized collections
  • >14 million articles ingested
  • 688 library participants (48 outside US)
  • 4 Trigger events
  • 15 Post-cancellation Access Claims

6
Portico Preservation Infrastructure
  • Publisher supplies XML source file (including
    the text and images) and PDF page rendition.
  • Best approach for preserving the intellectual
    content of the article or book.
  • Authenticate: verify that preserved content is
    what it purports to be.
  • Verify format: ensure the file meets the syntactic
    and semantic rules of the format specification.
  • Repair
  • Normalize (XML)
  • Create preservation metadata
  • Assess archival robustness of file format.
  • Migrate files to ensure future usability of
    content.
  • Replicate objects and metadata to protect
    against bit rot and media deterioration
  • Render articles to meet viewing requirements of
    delivery platform.
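
The "authenticate" and "verify format" steps above can be sketched roughly as follows. This is a minimal illustration, not Portico's actual tooling: it assumes a SHA-256 checksum is delivered alongside each file (an assumption made for this example), it checks only XML well-formedness rather than validity against a DTD, and the function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def authenticate(data: bytes, expected_sha256: str) -> bool:
    """Return True if the content matches the checksum delivered with it."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample content; a real workflow would read files from a delivery.
good = b"<article><title>Example</title></article>"
bad = b"<article><title>Example</article>"  # mismatched end tag
digest = hashlib.sha256(good).hexdigest()

print(authenticate(good, digest))
print(is_well_formed(good))
print(is_well_formed(bad))
```

A fuller format check would also validate against the declared DTD, which the standard-library parser used here does not do.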

7
Key Challenges for an Archival DTD
  • In Dec 2001, Inera's E-Journal Archive DTD
    Feasibility Study highlighted these key
    challenges for an archival DTD:
  • Use of generated and boilerplate text, especially
    in:
  • Label text for figure captions
  • Citation text
  • Author name and affiliation
  • Dates
  • Expression of links between author and
    affiliation
  • Reference elements
  • Expression of non-article and other content
  • Abbreviations and definitions

8
Key Challenges for an Archival DTD
  • Keywords
  • Sections, including handling of sections without
    headers
  • Placement of floating objects, such as figures,
    tables, graphs
  • Tables, including cell formatting issues (cells
    with figures, content alignment, etc.)
  • Math
  • Intra-, inter- and extra-article linking
  • Publisher-specific elements
  • When reviewing the minutes of the Working Group
    and the evolution of the DTD, we can confirm that
    these areas have been the main focus of
    discussion.

9
Some Design Constraints
  • #IMPLIED, not #REQUIRED, attributes
  • CDATA instead of controlled list
  • Optional Elements, or relaxed order of elements
  • Surprising location of Elements
  • No Domain Specific Elements

10
Publisher/Domain Specific Elements
  • Custom-Meta
      • Business Data
      • Allowed in journal-meta, article-meta, front-stub
      • Name/Value pair (may contain 38 different
        Elements)
  • Named-Content
      • Semantic Significance
      • Allowed in 112 Elements
      • May contain 59 different Elements

11
Challenges posed by source DTDs
  • Extended Semantics for Named-Content
  • Price in Citation
  • Becomes <named-content content-type="price">
  • <citation reference="1" id="R1" type="serial">
      <author order="1"><name><first>S. P.</first><last>Morgan</last></name></author>
      <journal>
        <sertitle>J. Appl. Phys.</sertitle>
        <URI type="ISSN">0030-3941</URI>
        <price>01.00</price>
        <volume>29</volume>
        <pages><first>1358</first><last>1368</last></pages>
        <pubdate>1958</pubdate>
      </journal>
      <title>General solution of the Luneburg lens problem</title>
    </citation>
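
A minimal sketch of this kind of mapping, assuming Python's standard `xml.etree.ElementTree` and a hypothetical lookup table from source element names to `content-type` values (the table and function name are illustrative, not Portico's transformation code). It renames only the publisher-specific element; the rest of the citation would still need its own mapping to the target DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: source element name -> content-type value.
SOURCE_TO_CONTENT_TYPE = {
    "price": "price",
    "OrgDivision": "division",
    "addinfo": "addinfo",
}


def to_named_content(root: ET.Element) -> ET.Element:
    """Rename mapped elements in place, keeping their text, tail and children."""
    for elem in root.iter():
        content_type = SOURCE_TO_CONTENT_TYPE.get(elem.tag)
        if content_type is not None:
            elem.tag = "named-content"
            elem.set("content-type", content_type)
    return root


source = (
    "<journal><sertitle>J. Appl. Phys.</sertitle>"
    "<price>01.00</price><volume>29</volume></journal>"
)
result = ET.tostring(to_named_content(ET.fromstring(source)), encoding="unicode")
print(result)
```

Because the element is renamed rather than rebuilt, the original name survives only in the `content-type` value, which is the point the slide makes about extended semantics.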

12
Challenges posed by source DTDs
  • More Extended Semantics for Named-Content
  • Affiliation in Footnotes/P
  • Becomes <named-content content-type="aff" id="AFF2">
  • <FOOTNOTE ID="N101" TYPE="AFF"><P ALPHABET="LATIN" TYPE="INDENT">
      <AFF ID="AFF2"><IT>Corresponding author address</IT>
      Nicholas M. J. Hall, Dept. of Atmospheric and Oceanic Sciences,
      McGill University, 805 Sherbrooke St. W., Montreal PQ H3A 2K6,
      Canada.</AFF>
    </P></FOOTNOTE>

13
Challenges posed by source DTDs
  • More Extended Semantics for Named-Content
  • Funding in Acknowledgments/P
  • Becomes <named-content content-type="funding">
  • <ack><sectitle>ACKNOWLEDGMENTS</sectitle><p>Q.W.&#x2019;s
    research is partially supported by AFOSR Grant No.
    <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding>
    and NSF Grants No.
    <funding source="NSF"><contract>DMS-0204243</contract></funding>,
    No. <funding source="NSF"><contract>DMS-0605029</contract></funding>,
    and No. <funding source="NSF"><contract>DMS-0626180</contract></funding>.
    P.Z. is partially supported by the special funds for major State
    Research Projects
    <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding>
    and National Science Foundation of China for Distinguished Young
    Scholars <funding source="NSFC"><contract>10225103</contract></funding>.
    H.Z.&#x2019;s work is supported in part by the Naval Postgraduate
    School Research Initiation Program.</p></ack>

14
Challenges posed by source DTDs
  • More Extended Semantics for Named-Content
  • Organization Division in Affiliation
  • Becomes <named-content content-type="division">
  • <Affiliation ID="Aff12">
      <OrgDivision>Optisches Institut</OrgDivision>
      <OrgName>Technische Universität Berlin</OrgName>
      <OrgAddress>
        <City>Berlin</City>
        <Country>Germany</Country>
      </OrgAddress>
    </Affiliation>

15
Challenges posed by source DTDs
  • More Extended Semantics for Named-Content
  • Generic Element (addinfo)
  • Becomes <named-content content-type="addinfo">
  • <ref-conf id="CIT0045"><ref-conf-text>
      <author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>,
      <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>,
      <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>,
      <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>,
      <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>,
      <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text>
      <presentationtitle>Electronic taxonomy assigning strains to
      bacterial species via the internet</presentationtitle>.
      <collectworktitle>BMC Biology</collectworktitle>
      <publicationfield-text><year>2009</year> <year>7</year></publicationfield-text>
      <firstpage>3</firstpage>.
      <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.
    </ref-conf-text></ref-conf>

16
Challenges posed by source DTDs
  • Target DTD Structural Constraints that force the
    use of Named-Content
  • Table in Table
      • TD contains named-content, which contains a table
      • <td><named-content content-type="table"><table-wrap>
  • Figure in Table
      • TD contains named-content, which contains a fig
      • <td><named-content content-type="figure"><fig>
  • Display-Formula in Title
      • Title contains named-content, which contains a
        display-formula
      • <title><named-content content-type="display-formula"><display-formula>
17
Challenges posed by source DTDs
  • Question/Answer
  • Generic and Structural
  • Is saying <list list-content="question"> enough?
  • <Question-Answer>
      <Q><P><L>1</L>. The major advantage of amniotic membrane
        transplantation in pterygium surgery is</P></Q>
      <A><P><L>A</L>. reduction in surgical time</P></A>
      <A><P><L>B</L>. preservation of conjunctiva</P></A>
      <A><P><L>C</L>. better cosmetic outcomes compared with
        conjunctival autografting</P></A>
      <A><P><L>D</L>. lowest recurrence rate among the surgical
        techniques</P></A>
    </Question-Answer>
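
One possible mapping to a list, sketched under loud assumptions: the nested-list target shape and the `list-content="answer"` value are hypothetical choices for this example, not something the slides prescribe, and the helper handles only plain text after the label (inline markup inside the paragraph would be dropped).

```python
import xml.etree.ElementTree as ET


def item_from(source: ET.Element) -> ET.Element:
    """Build a list-item from a <Q> or <A> whose <P> starts with an <L> label."""
    label = source.find("P").find("L")
    item = ET.Element("list-item")
    ET.SubElement(item, "label").text = label.text
    # Text after the label; drop the ". " separator kept in the source.
    ET.SubElement(item, "p").text = (label.tail or "").lstrip(". ").strip()
    return item


def question_answer_to_list(qa: ET.Element) -> ET.Element:
    """Turn each <Q> into a list-item and nest its <A> siblings beneath it."""
    questions = ET.Element("list", {"list-content": "question"})
    current = None
    for child in qa:
        if child.tag == "Q":
            current = item_from(child)
            questions.append(current)
        elif child.tag == "A" and current is not None:
            answers = current.find("list")
            if answers is None:  # create the nested list on the first answer
                answers = ET.SubElement(current, "list", {"list-content": "answer"})
            answers.append(item_from(child))
    return questions


source = (
    "<Question-Answer><Q><P><L>1</L>. Which option?</P></Q>"
    "<A><P><L>A</L>. first</P></A><A><P><L>B</L>. second</P></A>"
    "</Question-Answer>"
)
converted = question_answer_to_list(ET.fromstring(source))
print(ET.tostring(converted, encoding="unicode"))
```

The sketch shows what the slide's question is getting at: the list keeps the structure, but nothing in the output marks which items are questions and which are answers except attribute values the archive has to invent.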

18
Challenges posed by source DTDs
  • Synonymy
  • Domain and Semantic
  • Is saying <list list-content="synonymy"> enough?
  • Or <named-content content-type="synonymy">
    because of the semantic meaning?
  • <SYNONYMY>
      <HEAD>ECHINOSTELIALES</HEAD>
      <ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
      <ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
      <ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W.
        Keller, MC</P></ITEM>
      <ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
    </SYNONYMY>
  • Synonyms are different scientific names that
    pertain to the same taxon

19
Challenges posed by source DTDs
  • Decision Tree (Taxonomic Key)
  • Domain, Semantic, Structural, and Presentation
  • <KEY>
      <COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the
        width of labrum</DESCR>
        <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831),
        <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as
        half the width of labrum</DESCR> <RESP>2</RESP></COUPLET>
      <COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on
        abdominal segments A6-8</DESCR> <RESP>3</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with
        only two dorsal setae</DESCR> <RESP>4</RESP></COUPLET>
      <COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR>
        <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
        <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>
      <COUPLET><DESCR><NO>4.</NO>Mandibles bidentate</DESCR>
        <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
        <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP></COUPLET>
    </KEY>
  • A tree-like model of decisions and their possible
    outcomes
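
To make the "tree-like model" concrete, here is a rough sketch that reads a key in the KEY/COUPLET/DESCR/NO/RESP shape shown on this slide into a dictionary and walks it. The function names are hypothetical, the sample data is a made-up placeholder key rather than the taxonomic content above, and only plain text after NO is kept as the description.

```python
import xml.etree.ElementTree as ET


def parse_key(key: ET.Element) -> dict:
    """Map couplet number -> list of (description, outcome) leads.

    An outcome is an int (the next couplet to consult) or a str (a taxon).
    """
    couplets = {}
    current = None
    for couplet in key.findall("COUPLET"):
        number_elem = couplet.find("DESCR").find("NO")
        number = (number_elem.text or "").strip(". ")
        if number:  # an empty <NO> is the second lead of the same couplet
            current = int(number)
        description = (number_elem.tail or "").lstrip("- ").strip()
        response = "".join(couplet.find("RESP").itertext()).strip()
        outcome = int(response) if response.isdigit() else response
        couplets.setdefault(current, []).append((description, outcome))
    return couplets


def identify(couplets: dict, choices: list) -> str:
    """Follow one choice (0 or 1) per couplet until a taxon is reached."""
    node = 1
    for choice in choices:
        _, outcome = couplets[node][choice]
        if isinstance(outcome, str):
            return outcome
        node = outcome
    raise ValueError("ran out of choices before reaching a taxon")


# Placeholder key, invented for this example.
sample = (
    "<KEY>"
    "<COUPLET><DESCR><NO>1.</NO>Leaves opposite</DESCR>"
    "<RESP><GENSP>Taxon A</GENSP> Author, 1900</RESP></COUPLET>"
    "<COUPLET><DESCR><NO></NO>--Leaves alternate</DESCR><RESP>2</RESP></COUPLET>"
    "<COUPLET><DESCR><NO>2.</NO>Flowers red</DESCR>"
    "<RESP><GENSP>Taxon B</GENSP></RESP></COUPLET>"
    "<COUPLET><DESCR><NO></NO>--Flowers white</DESCR>"
    "<RESP><GENSP>Taxon C</GENSP></RESP></COUPLET>"
    "</KEY>"
)
couplets = parse_key(ET.fromstring(sample))
print(identify(couplets, [1, 0]))
```

The cross-references between couplets (a RESP of "2" pointing at couplet 2) are exactly the kind of domain structure that a flat list or a generic named-content wrapper does not express.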

20
Concluding Question
  • How to support Publisher/Domain Specific
    constructs in the Archival DTD?
      • Continue use of Named-Content
      • New Miscellaneous Element
      • Support for adding namespaced elements
      • Other

21
Questions/Answers? Thank you
  • John Meyer
  • Director of Data Technologies
  • 100 Campus Drive, Suite 100
  • Princeton, NJ 08540
  • 609 986-2220
  • john.meyer_at_ithaka.org