Title: Economic Growth Center Digital Library: a project funded by the Andrew W. Mellon Foundation
1 Economic Growth Center Digital Library: a project funded by the Andrew W. Mellon Foundation
- Ann Green, Steve Citron-Pousty, and Marcia Ford
- Social Science Research Services
- Yale ITS Academic Media Technology
- Sandy Peterson and Julie Linden
- Yale Social Science Libraries and Information Services
- Christopher Udry
- Professor, Yale Dept of Economics
- Director, Economic Growth Center and Dept of Economics
2 Selecting the EGC
- Why the Economic Growth Center collection?
- Extraordinary wealth of statistical material in printed form
- Condition of paper is poor; preservation concerns
- A range of access problems imposed not only by non-digital physical formats but also by inadequate descriptions of the contents of the publications
- Why Mexico?
- Faculty connections, library strengths, image quality, and completeness of collection
- Publisher: Instituto Nacional de Estadística, Geografía e Informática (Mexico, INEGI), Anuarios Estatales
3 Goals of the grant: Improving access to statistical resources not born digital
- Build a prototype archive of statistics not born digital.
- Implement standard digitization practices and emerging metadata standards for statistical tables.
- Document the costs and processes of creating a statistical digital library from print.
- Build the collection based upon long range digital life cycle requirements.
- Present the prototype digital library to the scholarly community for evaluation.
4 Research questions
- What effect does online access to the statistical information have on scholarly use of the materials?
- Are common digitization practices and standards suited to statistically-intensive documents?
- What are the long term preservation requirements of the EGCDL?
- What are the costs of producing high quality statistical tables with OCR and editing?
- How scalable is this process, for what kinds of collections, and for what purposes?
5 Of interest to developers
- Used scripts to produce metadata (both Dublin Core and Data Documentation Initiative XML records)
- Used scripts to assess the quality of the Excel tables
- Used scripts to estimate the number of characters in the files for cost comparisons with keying in data
- User interface for locating PDFs and Excel files and keyword searches of the metadata
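The character-count estimate mentioned above could be sketched roughly as follows. This is only an illustration, not the project's script: the table is held as plain Python lists rather than read from Excel, and the function names and the keying rate are hypothetical placeholders.

```python
# Sketch: estimate how many characters a table contains, as a basis for
# comparing OCR-and-edit costs with manual keying costs.
# The rate used below is a made-up placeholder, not a project figure.

def count_characters(rows):
    """Count non-whitespace characters across all cells of a table."""
    total = 0
    for row in rows:
        for cell in row:
            text = "" if cell is None else str(cell)
            total += sum(1 for ch in text if not ch.isspace())
    return total


def keying_cost(char_count, rate_per_thousand):
    """Cost of keying char_count characters at a rate per 1,000 characters."""
    return char_count / 1000 * rate_per_thousand


# Hypothetical table contents, for illustration only.
table = [
    ["Municipio", "Total", "Hombres", "Mujeres"],
    ["A", 120, 58, 62],
    ["B", 75, 40, 35],
]

chars = count_characters(table)
print(chars, keying_cost(chars, rate_per_thousand=2.0))
```

Summing such counts over all tables gives a total that can be multiplied by a vendor's quoted keying rate.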
6 Digitization process timeline
- Prepared and shipped 221 volumes in Fall 03
- All materials received back Feb 04
- Quality assessment period completed March 04
7 Deliverables received: images and PDFs
- 300 dpi TIFFs of each page of each volume, including cover and back of volume
- (102,534 TIFF files, 415 gigabytes)
- Color TIFFs for color pages, black and white TIFFs for pages with only black and white
- PDF image+text files of each chapter of each volume (5,607)
- Separate files for each subject chapter, front matter, indices, and back matter
8 Deliverables received: statistical tables in Excel
- Excel tables (16,488) of demographic and economic tables for 1994, 1996, 1998, 2000
- Selected OCR'd statistical tables converted to Excel format
- DSI operators reviewed each table twice
- Custom tagging done by DSI improved our ability to extract columns and rows into DDI metadata records
9 Quality assessment of PDFs: visual
- Overall assessment was very good; minor problems on a very small number of pages
- Sampled 5 of the PDFs
- Subjective review of image
- Tilting, cut off text, illegible characters, bleed through, color evaluation, noise
- Vendor did not produce PDFs exactly to our specification; had to reformat 216 files
10 Quality assessment of PDFs: Lucene indexing of text
- Good test of Lucene's ability to index PDF documents
- Produced specific page search results
- However, OCR quality of 98.5% wasn't giving adequate search results
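One way to see why a high OCR score can still hurt keyword search is a back-of-the-envelope calculation. The sketch below assumes the 98.5 figure is a per-character accuracy percentage and that character errors are independent; neither assumption is stated in the slides. Under those assumptions, a ten-character word comes through error-free only about 86% of the time, so a noticeable share of exact-match searches would miss.

```python
# Sketch: probability that a word of a given length is OCR'd with no errors,
# ASSUMING independent per-character errors (an assumption, not a project finding).

def clean_word_probability(char_accuracy, word_length):
    """Chance that every character of a word is recognized correctly."""
    return char_accuracy ** word_length


for length in (5, 10, 15):
    p = clean_word_probability(0.985, length)
    print(f"{length}-character word: {p:.3f}")
```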
11 Quality assessment of Excel tables: scripts and visual
- Checksums
- Wrote script to find tables with Total columns
- Wrote Excel macro to compare column sums with Total values
- Excellent transfer of numbers from print
- Visual review
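The Total check described above was done with a script and an Excel macro; the sketch below restates the idea in Python rather than reproducing that macro. It assumes a simple layout (one header row, a label in the first cell of each row, and a column headed "Total" that should equal the sum of the other numeric cells in the row). That layout, the function names, and the sample data are all hypothetical; the same idea applies along columns when the total is a row.

```python
# Sketch: flag rows whose stated Total disagrees with the sum of their parts.
# Layout assumption (hypothetical): first row is the header, first cell of
# each data row is a text label, and one header cell reads "Total".

def find_total_column(header):
    """Return the index of the first header cell equal to 'Total', else None."""
    for i, cell in enumerate(header):
        if str(cell).strip().lower() == "total":
            return i
    return None


def check_totals(rows, tolerance=0):
    """Return (label, stated_total, computed_sum) for each mismatching row."""
    header, data = rows[0], rows[1:]
    total_idx = find_total_column(header)
    if total_idx is None:
        return []  # table has no Total column, nothing to check
    mismatches = []
    for row in data:
        parts = [
            value
            for i, value in enumerate(row)
            if i != total_idx and isinstance(value, (int, float))
        ]
        computed = sum(parts)
        if abs(computed - row[total_idx]) > tolerance:
            mismatches.append((row[0], row[total_idx], computed))
    return mismatches


# Hypothetical table with one deliberate OCR-style error in row "B".
table = [
    ["Municipio", "Total", "Hombres", "Mujeres"],
    ["A", 120, 58, 62],
    ["B", 75, 40, 36],
]
print(check_totals(table))
```

Rows flagged this way are candidates for visual review against the page image.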
12 Automated Dublin Core metadata production for PDF and Excel files
- Defined metadata format for PDF and Excel documents
- Dublin Core subset (title, date, format, identifier, source, coverage, subject)
- Wrote scripts to produce individual metadata records for each PDF chapter and each Excel file
- Matched chapter numbers with chapter titles; created standard matching tables
- Generated subject term for each chapter; created standard tables of chapter titles to topic list
- Generated series title list from library online catalog
- Other text generated from file name and directory information
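The steps above can be sketched as a small script that derives a Dublin Core record from a file name plus lookup tables. This is a minimal illustration under stated assumptions: the file naming pattern, the lookup table contents, the wrapping `record` element, and the series title passed in are all hypothetical, since the slides do not describe them. Only the seven element names and the Dublin Core namespace come from the standard.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical lookup tables (chapter number -> title, title -> topic).
CHAPTER_TITLES = {3: "Población"}
CHAPTER_SUBJECTS = {"Población": "Population"}
FORMATS = {"pdf": "application/pdf", "xls": "application/vnd.ms-excel"}

# Hypothetical naming pattern: <state>_<year>_ch<chapter>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<state>[a-z]+)_(?P<year>\d{4})_ch(?P<chapter>\d+)\.(?P<ext>pdf|xls)$"
)


def dublin_core_record(file_name, source):
    """Build a record with the seven-element Dublin Core subset."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized file name: {file_name}")
    title = CHAPTER_TITLES[int(match["chapter"])]
    fields = {
        "title": title,
        "date": match["year"],
        "format": FORMATS[match["ext"]],
        "identifier": file_name,
        "source": source,
        "coverage": match["state"].capitalize(),
        "subject": CHAPTER_SUBJECTS[title],
    }
    record = ET.Element("record")  # wrapper element name is a placeholder
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


rec = dublin_core_record("aguascalientes_1994_ch03.pdf", "Placeholder series title")
print(ET.tostring(rec, encoding="unicode"))
```

Keeping the chapter and subject mappings in tables rather than in code means a new state or year needs only new table rows, not a new script.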
14 User interface: Select, view, download (ssrs.yale.edu/egcdl)
- Features
- Select files by year, state, topic, and/or type of file (PDF or Excel)
- Reconstruct the full volume
- Based upon Dublin Core metadata
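A selection feature driven by Dublin Core fields might look something like the sketch below. It assumes each file's metadata is available as a dictionary keyed by Dublin Core element name; the function names, the sample records, and the mapping of year, state, topic, and file type onto `date`, `coverage`, `subject`, and `format` are illustrative assumptions, not a description of the actual interface code.

```python
# Sketch: filter file-level metadata records by any combination of criteria,
# and reassemble a volume from its chapter files. All names are hypothetical.

def select_files(records, year=None, state=None, topic=None, file_type=None):
    """Return the records matching every criterion that was supplied."""
    wanted = {"date": year, "coverage": state, "subject": topic, "format": file_type}
    return [
        r for r in records
        if all(value is None or r.get(key) == value for key, value in wanted.items())
    ]


def reconstruct_volume(records, year, state):
    """All PDF chapters for one state and year, ordered by identifier."""
    chapters = select_files(records, year=year, state=state, file_type="pdf")
    return sorted(chapters, key=lambda r: r["identifier"])


# Hypothetical records for illustration.
records = [
    {"identifier": "ags_1994_ch02.pdf", "date": "1994", "coverage": "Aguascalientes",
     "subject": "Population", "format": "pdf"},
    {"identifier": "ags_1994_ch01.pdf", "date": "1994", "coverage": "Aguascalientes",
     "subject": "Geography", "format": "pdf"},
    {"identifier": "ags_1994_t001.xls", "date": "1994", "coverage": "Aguascalientes",
     "subject": "Population", "format": "xls"},
]

print([r["identifier"] for r in select_files(records, topic="Population")])
print([r["identifier"] for r in reconstruct_volume(records, "1994", "Aguascalientes")])
```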
15 Automated production of DDI metadata for Excel files
- Scripts pulled content from Excel files and Dublin Core records
- DDI records have text from titles, column and row headers, footnotes (much more detail than Dublin Core records)
- No manual editing necessary
- DDI records are valid
- Challenges with marking up hierarchical tables in DDI
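The extraction step that feeds such records can be illustrated with a short sketch. It assumes each table row arrives with a tag saying what kind of row it is, loosely inspired by the custom tagging mentioned earlier; the tag names, the data structure, and the function name are hypothetical. Mapping the extracted text onto actual DDI elements is deliberately left out, since the slides do not say which elements were used.

```python
# Sketch: pull title, column headers, row labels, and footnotes out of a
# tagged table. Tags ('title', 'header', 'data', 'footnote') are hypothetical.

def extract_table_text(tagged_rows):
    """tagged_rows: iterable of (tag, cells) pairs; returns the descriptive text."""
    out = {"title": "", "column_headers": [], "row_labels": [], "footnotes": []}
    for tag, cells in tagged_rows:
        if tag == "title":
            out["title"] = str(cells[0])
        elif tag == "header":
            # Skip the first cell, which sits above the row labels.
            out["column_headers"].extend(
                str(c) for c in cells[1:] if c not in (None, "")
            )
        elif tag == "data":
            out["row_labels"].append(str(cells[0]))
        elif tag == "footnote":
            out["footnotes"].append(str(cells[0]))
    return out


# Hypothetical tagged table for illustration.
tagged = [
    ("title", ["Población total por municipio"]),
    ("header", ["Municipio", "Total", "Hombres", "Mujeres"]),
    ("data", ["A", 120, 58, 62]),
    ("data", ["B", 75, 40, 35]),
    ("footnote", ["Fuente: INEGI"]),
]

print(extract_table_text(tagged))
```

A flat structure like this does not capture nested column or row headers, which is one way to picture the difficulty with hierarchical tables noted above.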
16 Keyword search added to the UI
- Lucene index for keyword searching
- Includes text from table titles, column and row labels, and footnotes
- Search terms can be entered with or without accent marks; Lucene returns the appropriate result set
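Accent-insensitive matching of this kind can be illustrated with a short stand-in. The sketch below is not Lucene and does not describe how the project configured it; it only shows the general idea of folding accents out of both the indexed text and the query before comparing. The function names and sample documents are hypothetical.

```python
import unicodedata


def fold(text):
    """Lowercase text and strip combining accent marks (e.g. 'ó' -> 'o')."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query, documents):
    """Return ids of documents containing every query term, ignoring accents."""
    terms = fold(query).split()
    hits = []
    for doc_id, text in documents.items():
        words = set(fold(text).split())
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits


# Hypothetical indexed text for two tables.
documents = {
    "t1": "Población total por municipio",
    "t2": "Producción agrícola",
}

print(search("poblacion", documents))
print(search("Población", documents))
```

Note that this simple folding also maps "ñ" to "n", which may or may not be desirable for Spanish-language material.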
20 New part of project: Digitizing Nigeria data
- Many volumes have poor quality paper and print
- mimeographed copies of typewritten pages
- skewing, bleed through, strikeovers, text cut off
- Goals of project
- Try to find tipping point for digitizing collection vs. digitizing selection of tables
- Test whether lessons learned from relatively clean tables apply to lower-quality print materials
- Determine how these materials would fit into EGCDL UI
21 Repurposing our development work
- Common tasks
- Repurpose Dublin Core production process for file level metadata
- Digitization process known and workflows in place
- Nigerian PDFs can be incorporated into a revised UI
- Differences
- Material not conducive to OCR
- Topics are not as broad as with the Mexican data
- Time periods are not uniform across states or years
- The design of the workflows, metadata production, and UI allows for flexibility and scale across countries and differences in content.