Title: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format
1 Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format
- Sheila Morrissey, John Meyer, Sushil Bhattarai,
Sachin Kurdikar, Jie Ling, Matthew Stoeffler,
Umadevi Thanneeru
2 Portico JSTOR Committed to Preserving the Scholarly Record
Ithaka helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
ITHAKA
Digitization for Preservation Access
Digital Preservation
Light Archive
Dark Archive
3 Portico Archive
- Portico's objective is to help libraries make a secure and reliable transition from print to a reliance on e-content.
- Maintains archiving agreements with publishers to collect and preserve content.
- Receives content directly from publishers.
- Preserves:
  - Current journals (born digital)
  - Back file journals (reborn digital)
  - E-books
  - Digitized historical collections
4 An Insurance Policy for e-Content
- Provide libraries with access to archived content when it becomes lost, orphaned or abandoned (regardless of libraries' past or current subscription):
  - Publisher ceases operation
  - Publisher discontinues title
  - Publisher drops back file
- Provide libraries with post-cancellation access if the publisher specifically names Portico.
  - About 90% of titles in the Archive are covered by Portico post-cancellation access rights.
- Libraries are asked to pay an annual Archive support payment to defray the cost of preservation, comparable to an insurance premium.
5 Portico Archive as of July 19, 2010
Category                         Files          % of Total
Images                           84,215,731     47.93%
Publisher Supplied Text          47,393,731     26.98%
Portico Created Archival Text    43,689,083     24.87%
Application Specific Files       232,732        0.13%
Multi-file Packages              140,333        0.08%
Videos                           20,604         0.01%
Audio                            570            <0.00%
Executable                       6              <0.00%
Total                            175,692,826    100%
- 114 publisher participants
- 11,788 committed journal titles
- 43,253 committed e-books
- 13 committed digitized collections
- >14 million articles ingested
- 688 library participants (48 outside US)
- 4 Trigger events
- 15 Post-cancellation Access Claims
6 Portico Preservation Infrastructure
- Publisher supplies XML source file (including the text and images) and PDF page rendition.
- Best approach for preserving the intellectual content of the article or book.
- Authenticate: verify that preserved content is what it purports to be.
- Verify format: ensure the file meets the syntactic and semantic rules of the format specification.
- Repair
- Normalize (XML)
- Create preservation metadata
- Assess archival robustness of file format.
- Migrate files to ensure future usability of content.
- Replicate objects and metadata to protect against bit rot and media deterioration.
- Render articles to meet viewing requirements of delivery platform.
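The replication step above is meant to protect against bit rot. A minimal sketch of how divergent copies might be detected, assuming SHA-256 digests and a simple majority rule. The slide does not say how Portico actually does this, and the function and site names are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def find_divergent_replicas(replicas: dict[str, bytes]) -> list[str]:
    """Return the names of replicas whose digest differs from the most common one.

    Assumption of this sketch: the majority digest is treated as the good copy.
    """
    if not replicas:
        return []
    digests = {name: sha256_hex(data) for name, data in replicas.items()}
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, digest in digests.items() if digest != majority_digest)


if __name__ == "__main__":
    copies = {
        "site_a": b"<article>text</article>",
        "site_b": b"<article>text</article>",
        "site_c": b"<article>tex</article>",  # simulated damage
    }
    print(find_divergent_replicas(copies))
```

A replica flagged this way would be a candidate for the Repair step, restored from one of the copies that still agree.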
7 Key Challenges for an Archival DTD
- Dec 2001, Inera's E-Journal Archive DTD Feasibility Study highlighted these key challenges for an archival DTD:
- Use of generated and boilerplate text, especially in:
  - Label text for figure captions
  - Citation text
- Author name and affiliation
- Dates
- Expression of links between author and affiliation
- Reference elements
- Expression of non-article and other content
- Abbreviations and definitions
8 Key Challenges for an Archival DTD
- Keywords
- Sections, including handling of sections without headers
- Placement of floating objects, such as figures, tables, graphs
- Tables, including cell formatting issues (cells with figures, content alignment, etc.)
- Math
- Intra-, inter- and extra-article linking
- Publisher-specific elements
- When reviewing the minutes of the Working Group and the evolution of the DTD, we can confirm that these areas have been the main focus of discussion.
9 Some Design Constraints
- IMPLIED, not REQUIRED attributes
- CDATA instead of controlled list
- Optional Elements, or relaxed order of elements
- Surprising location of Elements
- No Domain Specific Elements
10 Publisher/Domain Specific Elements
- Custom-Meta
  - Business Data
  - Allowed in journal-meta, article-meta, front-stub
  - Name/Value pair (may contain 38 different Elements)
- Named-Content
  - Semantic Significance
  - Allowed in 112 Elements
  - May contain 59 different Elements
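To make the two extension points concrete, here is a small sketch that builds one of each with Python's standard `xml.etree.ElementTree`. The `meta-name`/`meta-value` children follow the name/value pair structure of `custom-meta`. The helper names and the `license-type` datum are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def make_custom_meta(name: str, value: str) -> ET.Element:
    """Build a <custom-meta> name/value pair for business data."""
    custom_meta = ET.Element("custom-meta")
    ET.SubElement(custom_meta, "meta-name").text = name
    ET.SubElement(custom_meta, "meta-value").text = value
    return custom_meta


def make_named_content(content_type: str, text: str) -> ET.Element:
    """Build an inline <named-content> element tagged with a content-type."""
    named = ET.Element("named-content", {"content-type": content_type})
    named.text = text
    return named


if __name__ == "__main__":
    # "license-type" is a hypothetical business datum, not taken from the slides.
    print(ET.tostring(make_custom_meta("license-type", "open"), encoding="unicode"))
    print(ET.tostring(make_named_content("price", "01.00"), encoding="unicode"))
```

The difference in placement matters: `custom-meta` lives in the metadata containers listed above, while `named-content` can appear inline in running text.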
11 Challenges posed by source DTDs
- Extended Semantics for Named-Content
- Price in Citation
- Becomes <named-content content-type="price">
    <citation reference="1" id="R1" type="serial">
      <author order="1"><name><first>S. P.</first><last>Morgan</last></name></author>
      <journal>
        <sertitle>J. Appl. Phys.</sertitle>
        <URI type="ISSN">0030-3941</URI>
        <price>01.00</price>
        <volume>29</volume>
        <pages><first>1358</first><last>1368</last></pages>
        <pubdate>1958</pubdate>
      </journal>
      <title>General solution of the Luneburg lens problem</title>
    </citation>
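The price-in-citation mapping can be sketched as a simple retagging pass. This is only an illustration, not Portico's actual transform: the function name is hypothetical, and dropping the source element's attributes is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET


def retag_as_named_content(root: ET.Element, source_tag: str, content_type: str) -> int:
    """Retag every <source_tag> element under root as <named-content>.

    The semantic label moves into the content-type attribute. Attributes on
    the source element are dropped here (an assumption of this sketch).
    Returns the number of elements retagged.
    """
    matches = list(root.iter(source_tag))  # snapshot before mutating tags
    for elem in matches:
        elem.tag = "named-content"
        elem.attrib.clear()
        elem.set("content-type", content_type)
    return len(matches)


if __name__ == "__main__":
    journal = ET.fromstring(
        "<journal><sertitle>J. Appl. Phys.</sertitle>"
        "<price>01.00</price><volume>29</volume></journal>"
    )
    retag_as_named_content(journal, "price", "price")
    print(ET.tostring(journal, encoding="unicode"))
```

Text and child elements are untouched, so the content survives the migration. Only the element name changes, with the original meaning carried by `content-type`.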
12 Challenges posed by source DTDs
- More Extended Semantics for Named-Content
- Affiliation in Footnotes/P
- Becomes <named-content content-type="aff" id="AFF2">
    <FOOTNOTE ID="N101" TYPE="AFF"><P ALPHABET="LATIN" TYPE="INDENT">
      <AFF ID="AFF2"><IT>Corresponding author address:</IT> Nicholas M. J. Hall,
      Dept. of Atmospheric and Oceanic Sciences, McGill University,
      805 Sherbrooke St. W., Montreal PQ H3A 2K6, Canada.</AFF>
    </P></FOOTNOTE>
13 Challenges posed by source DTDs
- More Extended Semantics for Named-Content
- Funding in Acknowledgments/P
- Becomes <named-content content-type="funding">
    <ack><sectitle>ACKNOWLEDGMENTS</sectitle><p>Q.W.&#x2019;s research is partially supported by
    AFOSR Grant No. <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding> and
    NSF Grants No. <funding source="NSF"><contract>DMS-0204243</contract></funding>,
    No. <funding source="NSF"><contract>DMS-0605029</contract></funding>, and
    No. <funding source="NSF"><contract>DMS-0626180</contract></funding>.
    P.Z. is partially supported by the special funds for major State Research Projects
    <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding> and
    National Science Foundation of China for Distinguished Young Scholars
    <funding source="NSFC"><contract>10225103</contract></funding>.
    H.Z.&#x2019;s work is supported in part by the Naval Postgraduate School Research Initiation
    Program.</p></ack>
14 Challenges posed by source DTDs
- More Extended Semantics for Named-Content
- Organization Division in Affiliation
- Becomes <named-content content-type="division">
    <Affiliation ID="Aff12">
      <OrgDivision>Optisches Institut</OrgDivision>
      <OrgName>Technische Universität Berlin</OrgName>
      <OrgAddress>
        <City>Berlin</City>
        <Country>Germany</Country>
      </OrgAddress>
    </Affiliation>
15 Challenges posed by source DTDs
- More Extended Semantics for Named-Content
- Generic Element (addinfo)
- Becomes <named-content content-type="addinfo">
    <ref-conf id="CIT0045"><ref-conf-text>
      <author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>,
      <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>,
      <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>,
      <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>,
      <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>,
      <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text>
      <presentationtitle>Electronic taxonomy assigning strains to bacterial species via the internet</presentationtitle>.
      <collectworktitle>BMC Biology</collectworktitle>
      <publicationfield-text><year>2009</year> <year>7</year></publicationfield-text>
      <firstpage>3</firstpage>.
      <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.
    </ref-conf-text></ref-conf>
16 Challenges posed by source DTDs
- Target DTD Structural Constraints that force the use of Named-Content
- Table in Table
  - TD contains named-content, which contains a table
  - <td><named-content content-type="table"><table-wrap>
- Figure in Table
  - TD contains named-content, which contains a fig
  - <td><named-content content-type="figure"><fig>
- Display-Formula in Title
  - Title contains named-content, which contains a display-formula
  - <title><named-content content-type="display-formula"><display-formula>
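The structural workaround for table cells can be sketched as a wrapping pass over each `<td>`. This is an illustrative sketch only: the function name and the `WRAP_TYPES` mapping are assumptions, and it covers just the table and figure cases listed on this slide.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from block-level child tag to the content-type label used above.
WRAP_TYPES = {"table-wrap": "table", "fig": "figure"}


def wrap_blocks_in_cells(root: ET.Element) -> int:
    """Wrap <table-wrap> and <fig> children of each <td> in <named-content>.

    Returns the number of wrappers inserted.
    """
    wrapped = 0
    for cell in list(root.iter("td")):  # snapshot before restructuring
        for index, child in enumerate(list(cell)):
            content_type = WRAP_TYPES.get(child.tag)
            if content_type is None:
                continue
            wrapper = ET.Element("named-content", {"content-type": content_type})
            wrapper.tail = child.tail  # keep any text that followed the block
            child.tail = None
            wrapper.append(child)
            cell[index] = wrapper
            wrapped += 1
    return wrapped


if __name__ == "__main__":
    row = ET.fromstring("<tr><td><fig id='F1'/></td><td>plain text</td></tr>")
    wrap_blocks_in_cells(row)
    print(ET.tostring(row, encoding="unicode"))
```

Here `named-content` carries no semantic meaning of its own. It is used purely as a container that the target DTD permits where the block element itself is not allowed.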
17 Challenges posed by source DTDs
- Question/Answer
- Generic and Structural
- Is saying <list list-content="question"> enough?
    <Question-Answer>
      <Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
      <A><P><L>A</L>. reduction in surgical time</P></A>
      <A><P><L>B</L>. preservation of conjunctiva</P></A>
      <A><P><L>C</L>. better cosmetic outcomes compared with conjunctival autografting</P></A>
      <A><P><L>D</L>. lowest recurrence rate among the surgical techniques</P></A>
    </Question-Answer>
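One way to picture the list-based option is the sketch below, which maps the question to a list item and nests the answers in an inner list. The function names and the `list-content="answer"` value are assumptions made for illustration; the slide proposes only `list-content="question"`.

```python
import xml.etree.ElementTree as ET


def _to_list_item(source: ET.Element) -> ET.Element:
    """Turn a <Q> or <A> of the shape <P><L>label</L>. text</P> into a <list-item>."""
    label = source.find("P").find("L")
    item = ET.Element("list-item")
    ET.SubElement(item, "label").text = label.text
    # The sentence follows the label as tail text, prefixed by ". " in the source.
    ET.SubElement(item, "p").text = (label.tail or "").lstrip(". ").strip()
    return item


def question_answer_to_list(qa: ET.Element) -> ET.Element:
    """Map <Question-Answer> to <list list-content="question"> with nested answers."""
    outer = ET.Element("list", {"list-content": "question"})
    question_item = _to_list_item(qa.find("Q"))
    # "answer" is a hypothetical list-content value, not taken from the slide.
    answers = ET.SubElement(question_item, "list", {"list-content": "answer"})
    for answer in qa.findall("A"):
        answers.append(_to_list_item(answer))
    outer.append(question_item)
    return outer


if __name__ == "__main__":
    qa = ET.fromstring(
        "<Question-Answer><Q><P><L>1</L>. The major advantage is</P></Q>"
        "<A><P><L>A</L>. reduction in surgical time</P></A>"
        "<A><P><L>B</L>. preservation of conjunctiva</P></A></Question-Answer>"
    )
    print(ET.tostring(question_answer_to_list(qa), encoding="unicode"))
```

The result keeps labels and text, but the fact that one item is a question and the others are candidate answers now rests entirely on attribute values. That is the concern behind asking whether this is "enough".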
18 Challenges posed by source DTDs
- Synonymy
- Domain and Semantic
- Is saying <list list-content="synonymy"> enough?
- Or <named-content content-type="synonymy"> because of the semantic meaning?
    <SYNONYMY>
      <HEAD>ECHINOSTELIALES</HEAD>
      <ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
      <ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
      <ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W. Keller, MC</P></ITEM>
      <ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
    </SYNONYMY>
- Synonyms are different scientific names that pertain to the same taxon
19 Challenges posed by source DTDs
- Decision Tree (Taxonomic Key)
- Domain, Semantic, Structural, and Presentation
    <KEY>
      <COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR>
        <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR>
        <RESP>2</RESP></COUPLET>
      <COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR>
        <RESP>3</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR>
        <RESP>4</RESP></COUPLET>
      <COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR>
        <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
        <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>
      <COUPLET><DESCR><NO>4.</NO>Mandibles bidentate</DESCR>
        <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP></COUPLET>
      <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>
        <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP></COUPLET>
    </KEY>
- Tree-like model of decisions and their possible outcomes
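To show why the key is a tree and not a flat list, the sketch below groups the `COUPLET` leads by couplet number, so each response is either the number of the next couplet or a taxon. It assumes, as in the sample above, that a lead with an empty `<NO>` belongs to the preceding numbered couplet. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_key(key: ET.Element) -> dict[str, list[tuple[str, str]]]:
    """Group <COUPLET> leads by couplet number.

    Each lead becomes a (description, response) pair, where the response is
    either another couplet number or a taxon name.
    """
    tree: dict[str, list[tuple[str, str]]] = {}
    current = None
    for couplet in key.findall("COUPLET"):
        number_elem = couplet.find("DESCR").find("NO")
        number = (number_elem.text or "").strip().rstrip(".")
        if number:
            current = number
        if current is None:
            continue  # unnumbered lead before any numbered one; skipped in this sketch
        # The description is the text after <NO>; alternate leads start with "--".
        description = (number_elem.tail or "").strip().lstrip("-")
        response = "".join(couplet.find("RESP").itertext()).strip()
        tree.setdefault(current, []).append((description, response))
    return tree


if __name__ == "__main__":
    key = ET.fromstring(
        "<KEY>"
        "<COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR>"
        "<RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>"
        "<COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR>"
        "<RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>"
        "</KEY>"
    )
    for number, leads in parse_key(key).items():
        print(number, leads)
```

The branching structure exists only through the numeric cross-references in `RESP`. A generic list or `named-content` wrapper would preserve the text but not make that navigation explicit.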
20 Concluding Question
- How to support Publisher/Domain Specific constructs in the Archival DTD?
  - Continue use of Named-Content
  - New Miscellaneous Element
  - Support for adding namespaced elements
  - Other
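As a rough picture of the namespaced-elements option, the sketch below adds a domain-specific element in its own namespace to an ordinary paragraph. The namespace URI, the `tx` prefix, and the `genus-species` element name are all hypothetical; nothing here is part of the Archival DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and prefix, used only for illustration.
TAXON_NS = "http://example.org/ns/taxonomy"
ET.register_namespace("tx", TAXON_NS)


def add_namespaced_child(parent: ET.Element, local_name: str, text: str) -> ET.Element:
    """Append a child element in the publisher/domain namespace."""
    child = ET.SubElement(parent, f"{{{TAXON_NS}}}{local_name}")
    child.text = text
    return child


if __name__ == "__main__":
    para = ET.Element("p")
    add_namespaced_child(para, "genus-species", "Echinostelium minutum")
    print(ET.tostring(para, encoding="unicode"))
```

Compared with `named-content`, this keeps the source element's own name and lets a domain schema define it. The cost is that archive tooling must tolerate elements it does not know.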
21 Questions/Answers? Thank you
- John Meyer
- Director of Data Technologies
- 100 Campus Drive, Suite 100
- Princeton, NJ 08540
- 609 986-2220
- john.meyer_at_ithaka.org