Economic Growth Center Digital Library: a project funded by the Andrew W. Mellon Foundation

1
Economic Growth Center Digital Library: a project
funded by the Andrew W. Mellon Foundation
  • Ann Green, Steve Citron-Pousty, and Marcia Ford
  • Social Science Research Services
  • Yale ITS Academic Media Technology
  • Sandy Peterson and Julie Linden
  • Yale Social Science Libraries and Information
    Services
  • Christopher Udry
  • Professor, Yale Dept of Economics
  • Director, Economic Growth Center and Dept of
    Economics

2
Selecting the EGC
  • Why Economic Growth Center collection?
  • extraordinary wealth of statistical material in
    printed form
  • condition of paper is poor: preservation concerns
  • a range of access problems imposed not only by
    non-digital physical formats but also by
    inadequate descriptions of the contents of the
    publications
  • Why Mexico?
  • Faculty connections, library strengths, image
    quality and completeness of collection
  • Publisher: Instituto Nacional de Estadística,
    Geografía e Informática (Mexico, INEGI), Anuarios
    Estatales

3
Goals of the grant: Improving access to
statistical resources not born digital
  • Build a prototype archive of statistics not born
    digital.
  • Implement standard digitization practices and
    emerging metadata standards for statistical
    tables.
  • Document the costs and processes of creating a
    statistical digital library from print.
  • Build the collection based upon long range
    digital life cycle requirements.
  • Present the prototype digital library to the
    scholarly community for evaluation.

4
Research questions
  • What effect does online access to the statistical
    information have on scholarly use of the
    materials?
  • Are common digitization practices and standards
    suited to statistically-intensive documents?
  • What are the long term preservation requirements
    of the EGCDL?
  • What are the costs of producing high quality
    statistical tables with OCR and editing?
  • How scalable is this process, for what kinds of
    collections, and for what purposes?

5
Of interest to developers
  • Used scripts to produce metadata (both Dublin
    Core and Data Documentation Initiative XML
    records)
  • Used scripts to assess the quality of the Excel
    tables
  • Used scripts to estimate the number of
    characters in the files for cost comparisons with
    keying in data
  • User interface for locating PDFs and Excel files
    and keyword searches of the metadata
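
The character-count estimate mentioned on this slide could look something like the minimal sketch below. It assumes a table has already been exported to CSV text (the project worked with Excel files; that export step is not shown), and both function names and the rate per 1,000 characters are hypothetical placeholders, not figures from the project.

```python
import csv
import io


def count_data_characters(csv_text: str) -> int:
    """Count non-whitespace characters across all cells of a CSV table."""
    total = 0
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            total += sum(1 for ch in cell if not ch.isspace())
    return total


def keying_cost(char_count: int, rate_per_thousand: float) -> float:
    """Estimate manual keying cost from a hypothetical per-1,000-character rate."""
    return char_count / 1000 * rate_per_thousand


# Tiny made-up table, used only to exercise the functions.
SAMPLE_TABLE = "Municipio,Hombres,Mujeres,Total\nAbasolo,120,130,250\n"

if __name__ == "__main__":
    chars = count_data_characters(SAMPLE_TABLE)
    print(chars, keying_cost(chars, rate_per_thousand=1.5))
```

Summing such counts over all tables gives a figure that can be set against a vendor's keying quote.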

6
Digitization process timeline
  • Prepared and shipped 221 volumes in Fall 03
  • All materials received back Feb 04
  • Quality assessment period completed March 04

7
Deliverables received: images and PDFs
  • 300 dpi TIFFs of each page of each volume,
    including cover and back of volume
  • (102,534 TIFF files, 415 gigabytes)
  • Color TIFFs for color pages, black and white
    TIFFs for pages with only black and white
  • PDF image text files of each chapter of each
    volume (5,607)
  • Separate files for each subject chapter, front
    matter, indices, and back matter

8
Deliverables received: statistical tables in Excel
  • Excel tables (16,488) of demographic and economic
    tables for 1994, 1996, 1998, 2000
  • Selected OCR'd statistical tables converted to
    Excel format
  • DSI operators reviewed each table twice
  • Custom tagging done by DSI improved our ability
    to extract columns and rows into DDI metadata
    records

9
Quality assessment of PDFs: visual
  • Overall assessment was very good; minor problems
    on a very small number of pages
  • Sampled 5% of the PDFs
  • subjective review of image
  • Tilting, cut off text, illegible characters,
    bleed through, color evaluation, noise
  • Vendor did not produce PDFs exactly to our
    specification; we had to reformat 216 files

10
Quality assessment of PDFs: Lucene indexing of
text
  • Good test of Lucene's ability to index PDF
    documents
  • Produced specific page search results
  • However, OCR quality of 98.5% wasn't giving
    adequate search results
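
A rough calculation suggests why a seemingly high OCR figure can still hurt search. The sketch assumes the 98.5 figure is a per-character accuracy and that errors are independent of one another; the slide states neither, so treat both as simplifying assumptions. Under them, the chance that an n-character word is recognized with no errors is 0.985 raised to the power n, which is roughly 89% for an eight-character word.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is recognized correctly,
    assuming independent per-character errors (a simplification)."""
    return char_accuracy ** word_length


if __name__ == "__main__":
    for length in (4, 8, 12):
        print(length, round(word_accuracy(0.985, length), 3))
```

Because a keyword search needs the whole word to match, word-level accuracy is the more relevant figure than character-level accuracy.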

11
Quality assessment of Excel tables: scripts and
visual
  • Checksums
  • Wrote script to find tables with Total columns
  • Wrote Excel macro to compare column sums with
    Total values
  • Excellent transfer of numbers from print
  • Visual review
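
The Total check was done with an Excel macro; the sketch below restates the same idea in Python as an illustration, not as the project's code. It assumes a table has been read into a header list plus rows of values, and that each row's "Total" cell should equal the sum of the other numeric cells in that row. The function name and the sample figures are hypothetical.

```python
def find_total_mismatches(header, rows, total_label="Total", tolerance=0.0):
    """Return 1-based row numbers whose Total cell differs from the sum
    of the other numeric cells in the same row."""
    total_idx = header.index(total_label)
    mismatches = []
    for row_number, row in enumerate(rows, start=1):
        parts = [
            value
            for idx, value in enumerate(row)
            if idx != total_idx and isinstance(value, (int, float))
        ]
        if abs(sum(parts) - row[total_idx]) > tolerance:
            mismatches.append(row_number)
    return mismatches


# Made-up figures: the second row contains a deliberate error.
HEADER = ["Municipio", "Hombres", "Mujeres", "Total"]
ROWS = [
    ["Abasolo", 120, 130, 250],
    ["Acuña", 200, 210, 401],
]

if __name__ == "__main__":
    print(find_total_mismatches(HEADER, ROWS))
```

A mismatch does not say which cell is wrong, only that the row deserves a visual check against the printed page.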

12
Automated Dublin Core metadata production for PDF
and Excel files
  • Defined metadata format for PDF and Excel
    documents
  • Dublin Core subset (title, date, format,
    identifier, source, coverage, subject)
  • Wrote scripts to produce individual metadata
    records for each PDF chapter and each Excel file
  • Matched chapter numbers with chapter titles:
    created standard matching tables
  • Generated subject term for each chapter: created
    standard tables of chapter titles to topic list
  • Generated series title list from library online
    catalog
  • Other text generated from file name and directory
    information
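
A minimal sketch of generating one Dublin Core record from a file name and lookup tables follows. The file-naming pattern, the lookup tables, the `record` wrapper element, and the source string are all hypothetical; the project's actual conventions are not given in these slides. Only the seven element names and the Dublin Core namespace URI are standard.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical lookup tables (chapter number -> title / subject term).
CHAPTER_TITLES = {3: "Población"}
CHAPTER_SUBJECTS = {3: "Population"}
FORMATS = {"pdf": "application/pdf", "xls": "application/vnd.ms-excel"}

# Hypothetical naming scheme: <state>_<year>_ch<NN>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<state>[a-z]+)_(?P<year>\d{4})_ch(?P<chapter>\d{2})\.(?P<ext>pdf|xls)$"
)


def build_record(filename: str, source: str) -> ET.Element:
    """Build a record holding the seven Dublin Core elements named on the slide."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    chapter = int(match["chapter"])
    fields = [
        ("title", CHAPTER_TITLES[chapter]),
        ("date", match["year"]),
        ("format", FORMATS[match["ext"]]),
        ("identifier", filename),
        ("source", source),
        ("coverage", match["state"].capitalize()),
        ("subject", CHAPTER_SUBJECTS[chapter]),
    ]
    record = ET.Element("record")
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


if __name__ == "__main__":
    rec = build_record("aguascalientes_1994_ch03.pdf", "Example series title")
    print(ET.tostring(rec, encoding="unicode"))
```

Keeping chapter titles and subject terms in tables, rather than in code, is what lets the same script run over every chapter and file.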

14
User interface: Select, view, download
ssrs.yale.edu/egcdl
  • Features
  • Select files by year, state, topic, and/or type
    of file (PDF or Excel)
  • Reconstruct the full volume
  • Based upon Dublin Core metadata
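
Selecting files by any combination of year, state, topic, and file type amounts to filtering metadata records on the criteria the user supplied. The sketch below illustrates that idea with an in-memory list; the `FileRecord` fields and the sample identifiers are hypothetical, and the real interface's storage and query mechanism are not described in the slides.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileRecord:
    identifier: str
    year: int
    state: str
    topic: str
    file_type: str  # "PDF" or "Excel"


def select_files(
    records: Iterable[FileRecord],
    year: Optional[int] = None,
    state: Optional[str] = None,
    topic: Optional[str] = None,
    file_type: Optional[str] = None,
) -> List[FileRecord]:
    """Keep records matching every criterion that was supplied (None = any)."""
    wanted = {"year": year, "state": state, "topic": topic, "file_type": file_type}
    return [
        record
        for record in records
        if all(value is None or getattr(record, key) == value
               for key, value in wanted.items())
    ]


# Made-up records for illustration only.
RECORDS = [
    FileRecord("a.pdf", 1994, "Aguascalientes", "Population", "PDF"),
    FileRecord("a.xls", 1994, "Aguascalientes", "Population", "Excel"),
    FileRecord("b.pdf", 1996, "Colima", "Education", "PDF"),
]

if __name__ == "__main__":
    for record in select_files(RECORDS, year=1994, file_type="PDF"):
        print(record.identifier)
```

Under this model, reconstructing a full volume is the same query with only year and state supplied.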

15
Automated production of DDI metadata for Excel
files
  • Scripts pulled content from Excel files and
    Dublin Core records
  • DDI records have text from titles, column and row
    headers, footnotes (much more detail than Dublin
    Core records)
  • No manual editing necessary
  • DDI records are valid
  • Challenges with marking up hierarchical tables in
    DDI
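
One common way to cope with hierarchical column headers before writing them into any metadata record is to flatten them into path-like labels. The sketch below shows only that flattening step and does not reproduce the project's DDI markup. It assumes header rows arrive as lists of strings in which a blank cell to the right of a parent label continues that parent (as with merged cells); the function name and sample headers are hypothetical.

```python
def flatten_headers(header_rows, separator=" / "):
    """Turn multi-row column headers into one label per column."""
    *parent_rows, leaf_row = header_rows

    # Forward-fill parent labels across the blank cells they span.
    filled_parents = []
    for row in parent_rows:
        current = ""
        filled = []
        for cell in row:
            if cell.strip():
                current = cell.strip()
            filled.append(current)
        filled_parents.append(filled)

    # Join the non-empty labels in each column from top to bottom.
    labels = []
    for column in zip(*filled_parents, leaf_row):
        parts = [part.strip() for part in column if part.strip()]
        labels.append(separator.join(parts))
    return labels


# Made-up two-level header for illustration.
HEADER_ROWS = [
    ["", "Población", "", "Viviendas"],
    ["Municipio", "Hombres", "Mujeres", "Total"],
]

if __name__ == "__main__":
    print(flatten_headers(HEADER_ROWS))
```

Flattening loses the explicit tree structure, which is part of why hierarchical tables are awkward to mark up.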

16
Keyword search added to the UI
  • Lucene index for keyword searching
  • Includes text from table titles, column and row
    labels, and footnotes
  • Can use accent marks or not in search terms:
    Lucene returns the appropriate result set
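
The effect of accent-insensitive matching can be illustrated outside Lucene with Unicode normalization. This is a stand-alone sketch of the general technique, not the project's Lucene configuration: both the query and the indexed text are decomposed, combining marks are dropped, and case is folded before comparison. The function names are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks (accents) and fold case for comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query: str, labels):
    """Return labels containing the query, ignoring accents and case."""
    needle = fold_accents(query)
    return [label for label in labels if needle in fold_accents(label)]


if __name__ == "__main__":
    labels = ["Población total", "Viviendas particulares"]
    print(search("poblacion", labels))
    print(search("POBLACIÓN", labels))
```

Note that this folding also maps "ñ" to "n", which may or may not be desirable for Spanish-language search.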

20
New part of project: Digitizing Nigeria data
  • Many volumes have poor quality paper and print
  • mimeographed copies of typewritten pages
  • skewing, bleed through, strikeovers, text cut off
  • Goals of project
  • Try to find tipping point for digitizing
    collection vs. digitizing selection of tables
  • Test whether lessons learned from relatively
    clean tables apply to lower-quality print
    materials
  • Determine how these materials would fit into
    EGCDL UI

21
Repurposing our development work
  • Common tasks
  • Repurpose Dublin Core production process for file
    level metadata
  • Digitization process known and workflows in place
  • Nigerian PDFs can be incorporated into a revised
    UI
  • Differences
  • Material not conducive to OCR
  • Topics are not as broad as with the Mexican data
  • Time periods are not uniform across states or
    years
  • The design of the workflows, metadata production,
    and UI allow for flexibility and scale across
    countries and differences in content.