Economic Growth Center Digital Library a project funded by the Andrew W. Mellon Foundation

Slides: 22

Transcript and Presenter's Notes

1
Economic Growth Center Digital Library: a project
funded by the Andrew W. Mellon Foundation

Ann Green, Steve Citron-Pousty, and Marcia Ford
Social Science Research Services
Yale ITS Academic Media Technology
Sandy Peterson and Julie Linden
Yale Social Science Libraries and Information
Services
Christopher Udry
Professor, Yale Dept of Economics
Director, Economic Growth Center and Dept of
Economics

2
Selecting the EGC

Why Economic Growth Center collection?
extraordinary wealth of statistical material in
printed form
condition of paper is poor: preservation concerns
a range of access problems imposed not only by
non-digital physical formats but also by
inadequate descriptions of the contents of the
publications
Why Mexico?
Faculty connections, library strengths, image
quality and completeness of collection
Publisher: Instituto Nacional de Estadística,
Geografía e Informática (Mexico, INEGI), Anuarios
Estatales

3
Goals of the grant: Improving access to
statistical resources not born digital

Build a prototype archive of statistics not born
digital.
Implement standard digitization practices and
emerging metadata standards for statistical
tables.
Document the costs and processes of creating a
statistical digital library from print.
Build the collection based upon long range
digital life cycle requirements.
Present the prototype digital library to the
scholarly community for evaluation.

4
Research questions

What effect does online access to the statistical
information have on scholarly use of the
materials?
Are common digitization practices and standards
suited to statistically-intensive documents?
What are the long term preservation requirements
of the EGCDL?
What are the costs of producing high quality
statistical tables with OCR and editing?
How scalable is this process, for what kinds of
collections, and for what purposes?

5
Of interest to developers

Used scripts to produce metadata (both Dublin
Core and Data Documentation Initiative XML
records)
Used scripts to assess the quality of the Excel
tables
Used scripts to estimate the number of
characters in the files for cost comparisons with
keying in data
User interface for locating PDFs and Excel files
and keyword searches of the metadata
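
A character-count estimate of the kind mentioned above could look roughly like the following. This is a minimal sketch, not the project's script: the function names, the sample table, and the keying rate are all hypothetical and chosen only for illustration.

```python
import csv
import io


def count_characters(rows):
    """Count non-whitespace characters across all cells of a table."""
    return sum(len("".join(str(cell).split())) for row in rows for cell in row)


def keying_cost(char_count, rate_per_thousand):
    """Cost of keying char_count characters at an assumed rate per 1,000."""
    return char_count / 1000 * rate_per_thousand


# Made-up sample table standing in for one exported Excel sheet.
sample = io.StringIO("Area,Total\nNorth,1200\nSouth,345\n")
rows = list(csv.reader(sample))
chars = count_characters(rows)
print(chars, keying_cost(chars, rate_per_thousand=1.5))
```

Summing such counts over all files gives a figure that can be multiplied by a vendor's keying rate and set against the OCR-and-edit cost.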

6
Digitization process timeline

Prepared and shipped 221 volumes in Fall '03
All materials received back Feb '04
Quality assessment period completed March '04

7
Deliverables received: images and PDFs

300 dpi TIFFs of each page of each volume,
including cover and back of volume
(102,534 TIFF files, 415 gigabytes)
Color TIFFs for color pages, black and white
TIFFs for pages with only black and white
PDF image + text files of each chapter of each
volume (5,607)
Separate files for each subject chapter, front
matter, indices, and back matter

8
Deliverables received: statistical tables in Excel

Excel tables (16,488) of demographic and economic
tables for 1994, 1996, 1998, 2000
Selected OCR'd statistical tables converted to
Excel format
DSI operators reviewed each table twice
Custom tagging done by DSI improved our ability
to extract columns and rows into DDI metadata
records

9
Quality assessment of PDFs: visual

Overall assessment was very good; minor problems
on a very small number of pages
Sampled 5% of the PDFs
subjective review of image
Tilting, cut off text, illegible characters,
bleed through, color evaluation, noise
Vendor did not produce PDFs exactly to our
specification; we had to reformat 216 files

10
Quality assessment of PDFs: Lucene indexing of
text

Good test of Lucene's ability to index PDF
documents
Produced specific page search results
However, OCR quality of 98.5% wasn't giving
adequate search results
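
One way to see why a high OCR rate can still hurt keyword search: if the 98.5% figure is read as per-character accuracy and errors are assumed independent (both are assumptions, not statements from the project), the chance that a whole word is recognized correctly drops with word length. A rough sketch:

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every character of a word is correct,
    assuming independent per-character errors (a simplification)."""
    return char_accuracy ** word_length


for length in (5, 10, 15):
    print(length, round(word_accuracy(0.985, length), 3))
```

Under these assumptions a ten-letter term is indexed correctly only about 86% of the time, so a noticeable share of exact-match queries would miss pages that do contain the term.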

11
Quality assessment of Excel tables: scripts and
visual

Checksums
Wrote script to find tables with Total columns
Wrote Excel macro to compare column sums with
Total values
Excellent numeric transfer of numbers from print
Visual review
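
The Total check can be illustrated in Python (the project used an Excel macro, which is not reproduced here). This sketch assumes one possible layout, a table whose last logical row is labelled "Total"; the function name and sample values are hypothetical.

```python
def find_total_mismatches(header, rows, total_label="Total"):
    """Return the names of columns whose detail rows do not add up
    to the value in the row labelled total_label."""
    total_rows = [r for r in rows if str(r[0]).strip().lower() == total_label.lower()]
    if not total_rows:
        return None  # no Total row, nothing to check
    total_row = total_rows[0]
    detail_rows = [r for r in rows if r is not total_row]
    mismatches = []
    for i in range(1, len(header)):
        if sum(r[i] for r in detail_rows) != total_row[i]:
            mismatches.append(header[i])
    return mismatches


# Made-up table with a deliberate error in the "Women" column.
header = ["Area", "Men", "Women"]
rows = [["North", 10, 12], ["South", 5, 7], ["Total", 15, 20]]
print(find_total_mismatches(header, rows))
```

A mismatch flags either an OCR error in one of the cells or a discrepancy already present in the printed source, so flagged tables still need a visual check.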

12
Automated Dublin Core metadata production for PDF
and Excel files

Defined metadata format for PDF and Excel
documents
Dublin Core subset (title, date, format,
identifier, source, coverage, subject)
Wrote scripts to produce individual metadata
records for each PDF chapter and each Excel file
Matched chapter numbers with chapter titles:
created standard matching tables
Generated subject term for each chapter: created
standard tables of chapter titles to topic list
Generated series title list from library online
catalog
Other text generated from file name and directory
information
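
The record-generation step might be sketched as follows. The file naming pattern (year_state_chNN.ext), the lookup tables, and the series title are hypothetical stand-ins; only the Dublin Core element names and namespace are standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical matching tables: chapter number -> title, title -> subject term.
CHAPTER_TITLES = {"03": "Población"}
SUBJECTS = {"Población": "Population"}
FORMATS = {"pdf": "application/pdf", "xls": "application/vnd.ms-excel"}


def dublin_core_record(filename, series_title):
    """Build a Dublin Core record from an assumed year_state_chNN.ext file name."""
    stem, ext = filename.rsplit(".", 1)
    year, state, chapter = stem.split("_")
    chapter_title = CHAPTER_TITLES[chapter[2:]]  # drop the "ch" prefix
    fields = [
        ("title", f"{chapter_title}, {state.capitalize()}, {year}"),
        ("date", year),
        ("format", FORMATS[ext.lower()]),
        ("identifier", filename),
        ("source", series_title),
        ("coverage", state.capitalize()),
        ("subject", SUBJECTS[chapter_title]),
    ]
    record = ET.Element("record")
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = dublin_core_record("1998_aguascalientes_ch03.pdf", "Example series title")
print(ET.tostring(record, encoding="unicode"))
```

Because every field comes from the file name or a lookup table, the same function can be run over a whole directory tree without manual editing.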

14
User interface: Select, view, download
(ssrs.yale.edu/egcdl)

Features
Select files by year, state, topic, and/or type
of file (PDF or Excel)
Reconstruct the full volume
Based upon Dublin Core metadata
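
Selection and volume reconstruction driven by metadata could work along these lines. This is a sketch only: the flat record layout, field names, and file names are hypothetical, not the EGCDL data model.

```python
# Hypothetical flattened metadata records, one per file.
records = [
    {"year": "1998", "state": "Colima", "topic": "Population", "type": "PDF",
     "chapter": 2, "file": "1998_colima_ch02.pdf"},
    {"year": "1998", "state": "Colima", "topic": "Front matter", "type": "PDF",
     "chapter": 0, "file": "1998_colima_ch00.pdf"},
    {"year": "1998", "state": "Colima", "topic": "Population", "type": "Excel",
     "chapter": 2, "file": "1998_colima_ch02_t01.xls"},
    {"year": "2000", "state": "Colima", "topic": "Population", "type": "PDF",
     "chapter": 2, "file": "2000_colima_ch02.pdf"},
]


def select_files(records, **criteria):
    """Keep records matching every given field (year, state, topic, type)."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def reconstruct_volume(records, year, state):
    """List a volume's PDF chapters in chapter order."""
    chapters = select_files(records, year=year, state=state, type="PDF")
    return [r["file"] for r in sorted(chapters, key=lambda r: r["chapter"])]


print(reconstruct_volume(records, "1998", "Colima"))
```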

15
Automated production of DDI metadata for Excel
files

Scripts pulled content from Excel files and
Dublin Core records
DDI records have text from titles, column and row
headers, footnotes (much more detail than Dublin
Core records)
No manual editing necessary
DDI records are valid
Challenges with marking up hierarchical tables in
DDI

16
Keyword search added to the UI

Lucene index for keyword searching
Includes text from table titles, column and row
labels, and footnotes
Can use accent marks or not in search terms
Lucene returns an appropriate result set
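
Accent-insensitive matching of this kind can be illustrated with Unicode normalization. This is not Lucene's implementation, just a small Python sketch of the general technique; the function names and sample titles are hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Lowercase text and strip combining marks (so 'ó' becomes 'o')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def search(titles, query):
    """Return titles containing the query, ignoring case and accent marks."""
    folded_query = fold_accents(query)
    return [t for t in titles if folded_query in fold_accents(t)]


titles = ["Población total por municipio", "Producción agrícola"]
print(search(titles, "poblacion"))
print(search(titles, "Población"))
```

Folding both the indexed text and the query the same way is what lets users type search terms with or without accent marks. Note that this simple folding also maps "ñ" to "n", which may or may not be desirable for Spanish-language material.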

20
New part of project: Digitizing Nigeria data

Many volumes have poor quality paper and print:
mimeographed copies of typewritten pages,
skewing, bleed through, strikeovers, text cut off
Goals of project
Try to find tipping point for digitizing
collection vs. digitizing selection of tables
Test whether lessons learned from relatively
clean tables apply to lower-quality print
materials
Determine how these materials would fit into
EGCDL UI

21
Repurposing our development work

Common tasks
Repurpose Dublin Core production process for file
level metadata
Digitization process known and workflows in place
Nigerian PDFs can be incorporated into a revised
UI
Differences
Material not conducive to OCR
Topics are not as broad as with the Mexican data
Time periods are not uniform across states or
years
The design of the workflows, metadata production,
and UI allows for flexibility and scale across
countries and differences in content.
