Economic Growth Center Digital Library: a project
funded by the Andrew W. Mellon Foundation
  • Ann Green, Steve Citron-Pousty, and Marcia Ford
  • Social Science Research Services
  • Yale ITS Academic Media Technology
  • Sandy Peterson and Julie Linden
  • Yale Social Science Libraries and Information
  • Christopher Udry
  • Professor, Yale Dept of Economics
  • Director, Economic Growth Center and Dept of

Goals of the grant: Improving access to
statistical resources not born digital
  • Build a prototype archive of statistics not born digital
  • Implement standard digitization practices and
    emerging metadata standards for statistical
  • Document the costs and processes of creating a
    statistical digital library from print.
  • Build the collection based upon long range
    digital life cycle requirements.
  • Present the prototype digital library to the
    scholarly community for evaluation.

Research questions
  • What effect does online access to the statistical
    information have on scholarly use of the
  • Are common digitization practices and standards
    suited to statistically-intensive documents?
  • What are the long term preservation requirements
    of the EGCDL?
  • What are the costs of producing high quality
    statistical tables with OCR and editing?
  • How scalable is this process, for what kinds of
    collections, and for what purposes?

Selecting the EGC
  • Why Economic Growth Center collection?
  • extraordinary wealth of statistical material in
    printed form
  • condition of paper is poor: preservation concerns
  • a range of access problems imposed not only by
    non-digital physical formats but also by
    inadequate descriptions of the contents of the
  • Why Mexico?
  • Faculty connections, library strengths, image
    quality and completeness of collection
  • Publisher: Instituto Nacional de Estadística,
    Geografía e Informática (Mexico, INEGI) Anuarios

Digitization process
  • Vendor review
  • Part one: request for proposals (13 vendors)
  • Part two: production of samples and extended
    proposals (6 vendors)
  • Part three: final review and budget evaluation
  • Part four: contract negotiation and processing
  • Prepared and shipped 221 volumes in fall 2003
  • All materials received back at Yale in February 2004
  • Quality assessment period completed in March 2004

Deliverables received: images and PDFs
  • 300 dpi TIFFs of each page of each volume,
    including cover and back of volume
  • (103,115 TIFF files, 460 gigabytes)
  • Color TIFFs for color pages, black and white
    TIFFs for pages with only black and white
  • PDF (image + text) files of each chapter of each
    volume (5,607 files)
  • Separate files for each subject chapter, front
    matter, indices, and back matter

Deliverables received: statistical tables in Excel
  • Excel tables (16,488) of demographic and economic
    tables for 1994, 1996, 1998, 2000
  • Selected OCR'd statistical tables converted to
    Excel format
  • DSI operators reviewed each table twice
  • Custom tagging done by DSI improved ability to
    extract columns and rows into online database and
    build XML metadata

Quality assessment of PDFs
  • Overall assessment was very good: minor problems
    on a very small number of pages
  • Sampled 5 of the PDFs
  • subjective review of image
  • Tilting, cut off text, illegible characters,
    bleed through, color evaluation, noise
  • Vendor did not produce PDFs exactly to our
    specification; we had to reformat 216 files

Quality assessment of Excel tables
  • Checksums
  • Wrote script to find tables with Total columns
  • Wrote Excel macro to compare column sums with
    Total values
  • Excellent transfer of numbers from print
  • Visual review
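
The checksum idea above can be sketched in Python rather than as an Excel macro. This is only an illustration, not the project's actual script: it assumes each table has been read into a dictionary of column values plus a dictionary of printed totals, and the function name, the tolerance, and the sample figures are all hypothetical.

```python
from typing import Dict, List


def find_total_mismatches(
    columns: Dict[str, List[float]],
    totals: Dict[str, float],
    tolerance: float = 0.5,
) -> List[str]:
    """Return the names of columns whose values do not sum to the printed total."""
    mismatches = []
    for name, values in columns.items():
        if name not in totals:
            continue  # no printed total to compare against
        if abs(sum(values) - totals[name]) > tolerance:
            mismatches.append(name)
    return mismatches


# Hypothetical table: the first column sums to its total, the second does not,
# as would happen if OCR misread one digit.
columns = {
    "Hombres": [120.0, 80.0, 45.0],
    "Mujeres": [130.0, 70.0, 55.0],
}
totals = {"Hombres": 245.0, "Mujeres": 265.0}
bad_columns = find_total_mismatches(columns, totals)
print(bad_columns)
```

A flagged column only says that something in it differs from print; the visual review step is still needed to locate the cell.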

Automated Dublin Core metadata production for PDF
and Excel files
  • Defined metadata format for PDF and Excel
  • Dublin Core subset (title, date, format,
    identifier, source, coverage, subject)
  • Wrote scripts to produce individual metadata
    records for each PDF chapter and each Excel file
  • Matched chapter numbers with chapter titles:
    created standard matching tables
  • Generated subject term for each chapter: created
    standard tables of chapter titles to topic list
  • Generated series title list from library online
  • Other text generated from file name and directory
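
A minimal sketch of how such a script might assemble one record from a file name and lookup tables follows. The file naming pattern (`state_year_chapter.ext`), the lookup tables, and the `record` wrapper element are assumptions made for illustration; only the seven Dublin Core element names and the Dublin Core namespace come from the standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical matching tables; the real project tables are not shown in the slides.
CHAPTER_TITLES = {"03": "Población", "12": "Industria manufacturera"}
CHAPTER_SUBJECTS = {"Población": "Population", "Industria manufacturera": "Manufacturing"}
SERIES_TITLES = {"aguascalientes": "Anuario estadístico del estado de Aguascalientes"}
FORMATS = {"pdf": "application/pdf", "xls": "application/vnd.ms-excel"}


def build_record(file_name: str) -> ET.Element:
    """Build a Dublin Core record from a name like 'aguascalientes_1996_03.pdf'.

    Assumes the state part of the name contains no underscore.
    """
    stem, extension = file_name.rsplit(".", 1)
    state, year, chapter = stem.split("_")
    title = CHAPTER_TITLES[chapter]
    fields = {
        "title": title,
        "date": year,
        "format": FORMATS[extension],
        "identifier": file_name,
        "source": SERIES_TITLES[state],
        "coverage": state,
        "subject": CHAPTER_SUBJECTS[title],
    }
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = build_record("aguascalientes_1996_03.pdf")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the matching tables as plain data means a correction to a chapter title or topic term is made once and every record can be regenerated.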

First generation interface: Select, view,
  • Features
  • Select files by year, state, topic, and/or type
    of file (PDF or Excel)
  • Reconstruct the full volume
  • Based upon Dublin Core metadata

Next generation interface: NESSTAR
  • Features
  • Search across tables
  • View tables, select columns and rows
  • Create graphs and charts
  • Download and extract
  • Based upon the Data Documentation Initiative
    (DDI) metadata standard for statistical tables
  • Uses statistical software similar to SPSS
  • Nesstar is in use by data archives in Europe,
    World Bank, Health Canada; in review by ICPSR and
    The Roper Center

Metadata production and data publishing: Nesstar
  • Script creates DDI XML file from Dublin Core
    record and CSV file from Excel table
  • Staff publish each pair of DDI XML file and CSV
    file using Nesstar cube builder software
  • Some editing needed at this stage
  • Add measure (what the table is
    measuring: persons, events, pesos, etc.)
  • Add column header name
  • Publish metadata and table data to Nesstar server

Evaluation of Nesstar interface
  • No major advantage over Excel in terms of viewing
    or manipulating data
  • Metadata and table can't be viewed together
  • Can't search within specific fields (e.g. state,
    year, column headers, etc.)
  • De-contextualization of tables (search results
    don't indicate what volumes the tables belong to)
  • Lack of flexibility in customization

Evaluation of Nesstar-produced metadata
  • Labor-intensive to create and edit: no batch
  • Have to interpret table elements (add column
    header and measure labels)
  • Some data lost (textual codes for missing data,
    footnotes within cells)
  • DDI records not valid: some deviations from DDI

Automated DDI production: script-based process
  • Scripts pulled elements from Excel files and
    Dublin Core records
  • No manual editing necessary
  • DDI records are valid
  • Metadata describes table as is without our
    interpretation imposed on it
  • Challenges with marking up hierarchical tables in

Search/browse UI for PDFs and Excel
  • Lucene index for keyword searching
  • Includes text from table titles, column and row
    labels, and footnotes
  • Can use accent marks or not in search terms:
    Lucene returns the appropriate result set
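
The accent-insensitive behavior described above can be illustrated with a small stand-in index in Python. This is not Lucene code, and the slides do not say how Lucene was configured; the sketch only shows the general technique of folding accents the same way at index time and at query time. The document ids and table titles are hypothetical.

```python
import unicodedata
from typing import Dict, List, Set


def fold_accents(text: str) -> str:
    """Lowercase text and drop combining marks, so 'Población' becomes 'poblacion'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each folded token to the ids of the documents that contain it."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in documents.items():
        for token in fold_accents(text).split():
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index: Dict[str, Set[str]], term: str) -> List[str]:
    """Look up a single term, folding it the same way the index was folded."""
    return sorted(index.get(fold_accents(term), set()))


# Hypothetical table titles.
documents = {
    "t1": "Población total por municipio",
    "t2": "Producción agrícola",
}
index = build_index(documents)
print(search(index, "poblacion"))
print(search(index, "Población"))
```

One consequence of this simple folding is that "ñ" is also reduced to "n", which may or may not be desirable for Spanish text.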

Transforming paper to digital resources:
Generations of table presentation
  • Paper
  • Images of paper
  • PDF surrogates for paper publications
  • Excel files for downloading
  • Nesstar online presentation
  • Keyword search on PDF/Excel interface

What we are learning and documenting
  • Costs and processes of digitizing paper and
    building a statistical digital library
  • Scanning requirements (TIFFs, PDF/A, etc.)
  • OCR of Spanish text
  • OCR of numbers into spreadsheets (zoned scanning)
  • Quality assessment
  • Is it less expensive to key in the tables? What
    is the tipping point?

Cost comparison of keying in the data
  • Sent sample tables to two vendors and asked for
    cost estimates for keying
  • Varied widely: different pricing structures (per
    1000 characters vs. per KB)
  • Both require shipping volumes (or photocopies)

Costs of digitizing Nigeria data
  • Many volumes have poor quality paper and print
  • mimeographed copies of typewritten pages
  • skewing, bleed through, strikeovers, text cut off
  • Goals of project
  • Try to find tipping point for digitizing
    collection vs. digitizing selection of tables
  • Test whether lessons learned from relatively
    clean tables apply to lower-quality print
  • Determine how these materials would fit into
  • Cost estimates
  • Considering TIFF and PDF for all volumes,
    flagging good quality volumes for automated
    conversion to Excel

What we are learning and documenting
  • Digital Life Cycle considerations
  • Formats and metadata standards
  • Long term home for digital assets and the Rescue
  • Exactly what assets do we preserve?
  • Questions about versioning
  • XML to Excel as potential preservation format

What we are learning and documenting
  • Metadata production: extensive use of automated
  • Dublin Core for file level metadata
  • DDI for table level metadata
  • User interface development and Spanish character
  • Nesstar implementation for aggregate data
  • Developed own UI for more flexibility, ability to
    include PDFs and Excel tables

What we are learning and documenting
  • Scholarly use of the EGCDL: finding and
    accessing tables
  • Great advantage in locating data content by the
    full text of the tables; value in online access
    to individual tables.
  • Integrate searching and access into existing
  • Are investments in online analysis and
    visualization worth it? Are Excel tables what
    faculty and students want?

What we are learning and documenting
  • Production of more tables
  • Do faculty value a collections-based digitization
    effort or is there more value in on-demand
  • What other countries in the EGC print collection
    are of interest?
  • Cost comparisons of building full volume
    equivalents vs. selected tables.
  • The process can be leveraged to facilitate
    production for other research and learning

Contact information
  • Project web site