JSTOR - PowerPoint PPT Presentation

About This Presentation
Title:

JSTOR

Slides: 23
Provided by: kiffr

Transcript and Presenter's Notes

Title: JSTOR


1
JSTOR OCR - A Case Study
  • Kiffany Francis

2
What is JSTOR?
  • JSTOR is a not-for-profit organization with a
    dual mission to create and maintain a trusted
    archive of important scholarly journals, and to
    provide access to these journals as widely as
    possible.

3
JSTOR
JSTOR stands for "journal storage."
JSTOR is building a digital archive of journal
back runs, some of which date back to the 1600s.
It has converted over 10 million paper journal
pages from over 240 journals representing more
than 170 publishers. The JSTOR archive is
available at more than 1,450 libraries.
4
JSTOR - http://www.jstor.org
Each journal page digitized by JSTOR is processed
by an OCR application. The resulting text files
are used to support full-text searching offered
to JSTOR users.
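
As a rough illustration of how OCR text files can support full-text searching, the sketch below builds a small inverted index that maps each word to the pages containing it. It is a minimal sketch only: the presentation does not describe JSTOR's actual search system, and the names `build_index`, `search`, and the sample page ids are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output for two page images.
pages = {
    "p1": "Scholarly journals are archived.",
    "p2": "The archive stores journal page images.",
}
index = build_index(pages)
print(sorted(search(index, "journal archive")))
```

Because the index stores exact word forms, an OCR error in a word means that page is missed for that search term, which is why OCR accuracy matters for searching.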
5
What is OCR?
Optical Character Recognition
It is the process that converts the text of a
printed page or image into editable, digital
text.
6
What does OCR software do?
  • The software analyzes the layout of text.
  • The order of the paragraphs is determined.
  • Analysis of characters begins.
  • The software compares character groups (words)
    to the dictionary in the OCR application.
  • When a match is found, the software prints the
    word to a text file.

7
What does OCR software do?
  • If a match cannot be found, the software makes
    a reasonable assumption and flags the word with
    low confidence.
  • If a word or character cannot be read at all, a
    default character is inserted as a placeholder.
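
The dictionary step described on these two slides can be sketched as follows. This is a simplified illustration, not how any particular OCR product works: the `RecognizedWord` type, the `match_words` function, and the `~` placeholder character are all hypothetical, and an unreadable word is represented here as `None`.

```python
from dataclasses import dataclass

PLACEHOLDER = "~"  # hypothetical default character for unreadable input


@dataclass
class RecognizedWord:
    text: str
    low_confidence: bool


def match_words(candidates, dictionary):
    """Apply the dictionary check to a list of candidate words.

    Each candidate is a string, or None if it could not be read at all.
    """
    results = []
    for candidate in candidates:
        if candidate is None:
            # Unreadable: insert the placeholder character.
            results.append(RecognizedWord(PLACEHOLDER, True))
        elif candidate.lower() in dictionary:
            # Dictionary match: accept the word as read.
            results.append(RecognizedWord(candidate, False))
        else:
            # No match: keep the best guess but flag it as low confidence.
            results.append(RecognizedWord(candidate, True))
    return results


words = match_words(["The", "joumal", None, "archive"],
                    {"the", "journal", "archive"})
print(" ".join(w.text for w in words))
print([w.text for w in words if w.low_confidence])
```

Here "joumal" (a typical misreading of "journal") is written to the text but flagged, and the unreadable word becomes a placeholder.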

8
Problems with OCR
OCR does not handle certain kinds of text very well:
  • Non-Arabic text
  • Nonmodern type
  • Small print
  • Certain fonts
  • Complex page layouts

9
JSTOR Production Process
The process begins at JSTOR in Ann Arbor,
Michigan.
  • The journal run is examined page by page.
  • Preservation concerns are addressed.
  • Scanning guidelines are created.
  • A production librarian and a serials specialist
    create indexing guidelines.
  • The journal is shipped to a contractor to be
    scanned and described.
10
JSTOR Production Process
At the contractor facility:
  • Physical journals are disbound and separated
    into pages sorted by issue.
  • Each page is scanned in bitonal TIFF format at
    600 dpi resolution.
  • Page images are checked for marks, folds, and
    skewing.
  • A table of contents file is added.
  • If available, abstracts and keywords are added.
  • All digital files created by the contractor
    (page images and table of contents files) are
    written to CD-ROM and shipped back to JSTOR in
    Ann Arbor.
11
JSTOR Production Process
Rich Digital Masters
  • Each page is scanned in bitonal TIFF format at
    600 dpi.
  • This is preferred because:
    1. In 1994, there was some debate about whether
       300 dpi or 600 dpi was better, because of
       storage space. 600 dpi won out.
    2. 600 dpi printers are now standard.
    3. Resolutions higher than 600 dpi are not
       discernibly better for black-and-white
       text-based images.
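
The storage-space side of the 300 dpi versus 600 dpi debate can be made concrete with a little arithmetic. The sketch below computes the uncompressed size of a bitonal (1 bit per pixel) page image; the 8.5 x 11 inch page size is an assumption chosen for illustration, not a figure from JSTOR, and the function name is hypothetical.

```python
def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row of pixels is padded to a whole number of bytes.
    row_bytes = (width_px + 7) // 8
    return row_bytes * height_px


# Assumed page size: 8.5 x 11 inches.
for dpi in (300, 600):
    size = bitonal_page_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB uncompressed")
```

Doubling the resolution quadruples the pixel count, so a 600 dpi page needs about four times the raw storage of a 300 dpi page. Bitonal TIFF files are usually stored compressed, so real files would be smaller than these figures.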

12
JSTOR Production Process
Back at JSTOR - Ann Arbor
  • Files are uploaded from CD-ROM to JSTOR file
    servers.
  • A quality control process verifies image and
    table of contents quality.
  • After the quality check, each page image is
    processed by OCR software to create full text
    for searching.
  • After further quality control, the title is
    announced to JSTOR participants.
13
JSTOR Production Process
The quality of OCR for journals.
  • JSTOR reports a 97% accuracy rate for its
    OCR-created text files.
  • Some journals yield OCR files that are 99.95%
    accurate.
  • This level of accuracy is satisfactory for
    searching but not for presentation.
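
To see why these accuracy rates are acceptable for searching but not for presentation, it helps to turn them into expected errors per page. The sketch below assumes the rates are measured per character and assumes a page of 3,000 characters; both are illustrative assumptions, not figures from JSTOR, and the function name is hypothetical.

```python
def expected_errors(accuracy_percent, chars_per_page):
    """Expected number of misread characters on one page."""
    return chars_per_page * (1 - accuracy_percent / 100)


CHARS_PER_PAGE = 3000  # hypothetical page length

for accuracy in (97.0, 99.95):
    errors = expected_errors(accuracy, CHARS_PER_PAGE)
    print(f"{accuracy}% accurate: about {errors:.1f} errors per page")
```

Under these assumptions, 97% accuracy leaves about 90 misread characters on a page, while 99.95% leaves between one and two. A reader would notice either in displayed text, but a search can still find most words on the page.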

14
Example of JSTOR page.
15
Example of scanned image from JSTOR
16
JSTOR Preservation Issues
A PLAN FOR PRESERVATION
  • Print repositories of JSTOR journals are being
    started at the University of California and
    Harvard University.
  • The database is currently housed on servers
    managed and maintained at Princeton University,
    the University of Michigan, and the University
    of Manchester (UK).
  • Archival cold tapes are also stored at OCLC and
    at the JSTOR offices in New York City.
17
Guidelines: Is OCR right for your project?
1. Select the technology that will enhance your
ability to meet the objectives of the project.
From "An OCR Case Study" by Eileen Gifford Fenton
18
Guidelines: Is OCR right for your project?
2. Scale matters -- a lot.
From "An OCR Case Study" by Eileen Gifford Fenton
19
Guidelines: Is OCR right for your project?
3. There is no right answer.
From "An OCR Case Study" by Eileen Gifford Fenton
20
Guidelines: Is OCR right for your project?
4. Costs will be higher than you expect.
From "An OCR Case Study" by Eileen Gifford Fenton
21
Guidelines: Is OCR right for your project?
5. The answer that is right for today may not be
right in the future.
From "An OCR Case Study" by Eileen Gifford Fenton
22
Sources for Further Investigation
Bibliography
  • Guthrie, Kevin, JSTOR. "Developing a Digital
    Preservation Strategy for JSTOR," an interview.
    http://www.rlg.org/preserv/diginews/diginews4-4.html#feature1
  • JSTOR website. http://www.jstor.org/
  • Kiplinger, John, Director of Production, JSTOR.
    "Print-Repository Effort Under Way at UCLA and
    Harvard."
    http://www.clir.org/PUBS/issues/issues47.html#print
  • Fenton, Eileen Gifford, JSTOR, University of
    Michigan. "An OCR Case Study." In Handbook for
    Digital Projects: A Management Tool for
    Preservation and Access.
    http://www.nedcc.org/digital/vii.htm#3