Economic Growth Center Digital Library: a project funded by the Andrew W. Mellon Foundation

Slides: 37
Provided by: ssrs
Learn more at: http://ssrs.yale.edu

1
Economic Growth Center Digital Library: a project
funded by the Andrew W. Mellon Foundation
  • Ann Green, Steve Citron-Pousty, and Marcia Ford
  • Social Science Research Services
  • Yale ITS Academic Media Technology
  • Sandy Peterson and Julie Linden
  • Yale Social Science Libraries and Information
    Services
  • Christopher Udry
  • Professor, Yale Dept of Economics
  • Director, Economic Growth Center and Dept of
    Economics

2
Goals of the grant: Improving access to
statistical resources not born digital
  • Build a prototype archive of statistics not born
    digital.
  • Implement standard digitization practices and
    emerging metadata standards for statistical
    tables.
  • Document the costs and processes of creating a
    statistical digital library from print.
  • Build the collection based upon long range
    digital life cycle requirements.
  • Present the prototype digital library to the
    scholarly community for evaluation.

3
Research questions
  • What effect does online access to the statistical
    information have on scholarly use of the
    materials?
  • Are common digitization practices and standards
    suited to statistically-intensive documents?
  • What are the long term preservation requirements
    of the EGCDL?
  • What are the costs of producing high quality
    statistical tables with OCR and editing?
  • How scalable is this process, for what kinds of
    collections, and for what purposes?

4
Selecting the EGC
  • Why Economic Growth Center collection?
  • extraordinary wealth of statistical material in
    printed form
  • condition of paper is poor: preservation concerns
  • a range of access problems imposed not only by
    non-digital physical formats but also by
    inadequate descriptions of the contents of the
    publications
  • Why Mexico?
  • Faculty connections, library strengths, image
    quality and completeness of collection
  • Publisher: Instituto Nacional de Estadística,
    Geografía e Informática (Mexico, INEGI) Anuarios
    Estatales

5
Digitization process
  • Vendor review
  • Part one: request for proposals (13 vendors)
  • Part two: production of samples and extended
    proposals (6 vendors)
  • Part three: final review and budget evaluation
  • Part four: contract negotiation and processing
  • Prepared and shipped 221 volumes in Fall 03
  • All materials received back at Yale Feb 04
  • Quality assessment period complete March 04

6
Deliverables received: images and PDFs
  • 300 dpi TIFFs of each page of each volume,
    including cover and back of volume
  • (103,115 TIFF files, 460 gigabytes)
  • Color TIFFs for color pages, black and white
    TIFFs for pages with only black and white
  • PDF image text files of each chapter of each
    volume (5,607 files)
  • Separate files for each subject chapter, front
    matter, indices, and back matter

7
Deliverables received: statistical tables in Excel
  • Excel tables (16,488) of demographic and economic
    tables for 1994, 1996, 1998, 2000
  • Selected OCR'd statistical tables converted to
    Excel format
  • DSI operators reviewed each table twice
  • Custom tagging done by DSI improved ability to
    extract columns and rows into online database and
    build XML metadata

8
Quality assessment of PDFs
  • Overall assessment was very good: minor problems
    on a very small number of pages
  • Sampled 5 of the PDFs
  • subjective review of image
  • Tilting, cut off text, illegible characters,
    bleed through, color evaluation, noise
  • Vendor did not produce PDFs exactly to our
    specification; we had to reformat 216 files

9
Quality assessment of Excel tables
  • Checksums
  • Wrote script to find tables with Total columns
  • Wrote Excel macro to compare column sums with
    Total values
  • Excellent transfer of numbers from print
  • Visual review
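
The Total-column check above can be sketched in a few lines of Python. This is only an illustration, not the project's script or Excel macro: the function name, the table layout (a header row plus data rows with a "Total" column), and the sample figures are all made up for the example.

```python
def find_total_mismatches(header, rows, total_label="Total", tolerance=0):
    """Return the labels of rows whose numeric parts do not sum to the Total.

    Assumes (hypothetically) that the first cell of each row is a text label
    and that every other cell except the Total column is a numeric component.
    """
    total_idx = header.index(total_label)
    mismatched = []
    for row in rows:
        # Sum every numeric cell other than the Total cell itself.
        parts = [value for i, value in enumerate(row)
                 if i != total_idx and isinstance(value, (int, float))]
        if abs(sum(parts) - row[total_idx]) > tolerance:
            mismatched.append(row[0])
    return mismatched


# Invented sample table; the second row simulates an OCR misread in the Total.
header = ["Municipio", "Hombres", "Mujeres", "Total"]
rows = [
    ["A", 120, 130, 250],
    ["B", 310, 295, 650],
]
print(find_total_mismatches(header, rows))
```

A row flagged this way only says that the printed Total and the transcribed components disagree; the error could be in either, or in the original print, so flagged tables would still need visual review.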

10
Automated Dublin Core metadata production for PDF
and Excel files
  • Defined metadata format for PDF and Excel
    documents
  • Dublin Core subset (title, date, format,
    identifier, source, coverage, subject)
  • Wrote scripts to produce individual metadata
    records for each PDF chapter and each Excel file
  • Matched chapter numbers with chapter titles:
    created standard matching tables
  • Generated subject term for each chapter: created
    standard tables of chapter titles to topic list
  • Generated series title list from library online
    catalog
  • Other text generated from file name and directory
    information
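
As a rough sketch of how such a script might derive a Dublin Core record from a file name plus lookup tables, consider the Python below. The file naming pattern (state_year_cNN), the lookup tables, the topic term, and the wrapping "record" element are assumptions made for illustration; the slides do not describe the project's actual conventions. Only the seven Dublin Core elements and the standard Dublin Core namespace come from the list above.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical lookup tables: chapter number -> title, chapter title -> topic.
CHAPTER_TITLES = {"03": "Población"}
TOPICS = {"Población": "Population"}
FORMATS = {".pdf": "application/pdf", ".xls": "application/vnd.ms-excel"}


def dublin_core_record(path, series_title):
    """Build one record from a file name like 'state_year_cNN.ext' (assumed)."""
    p = PurePosixPath(path)
    state, year, chapter = p.stem.split("_")
    chapter_title = CHAPTER_TITLES[chapter.lstrip("c")]
    fields = [
        ("title", f"{series_title}, {year}: {chapter_title}"),
        ("date", year),
        ("format", FORMATS[p.suffix.lower()]),
        ("identifier", p.name),
        ("source", series_title),
        ("coverage", state),
        ("subject", TOPICS[chapter_title]),
    ]
    record = ET.Element("record")
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Example call with an invented file name and a placeholder series title.
record = dublin_core_record("aguascalientes_1996_c03.pdf", "Anuario Estadístico")
print(ET.tostring(record, encoding="unicode"))
```

Keeping the chapter and topic mappings in tables, as the slide describes, means a correction to one title or subject term propagates to every record on the next run.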

12
First generation interface: Select, view,
download (ssrs.yale.edu/egcdl)
  • Features
  • Select files by year, state, topic, and/or type
    of file (PDF or Excel)
  • Reconstruct the full volume
  • Based upon Dublin Core metadata

14
Next generation interface: NESSTAR
  • Features
  • Search across tables
  • View tables, select columns and rows
  • Create graphs and charts
  • Download and extract
  • Based upon the Data Documentation Initiative
    (DDI) metadata standard for statistical tables
  • Uses statistical software similar to SPSS
  • Nesstar is in use by data archives in Europe,
    World Bank, Health Canada; in review by ICPSR and
    The Roper Center

15
Metadata production and data publishing: Nesstar
process
  • Script creates DDI XML file from Dublin Core
    record and CSV file from Excel table
  • Staff publish each pair of DDI XML file and CSV
    file using Nesstar cube builder software
  • Some editing needed at this stage
  • Add measure (what the table is measuring:
    persons, events, pesos, etc.)
  • Add column header name
  • Publish metadata and table data to Nesstar server

20
Evaluation of Nesstar interface
  • No major advantage over Excel in terms of viewing
    or manipulating data
  • Metadata and table can't be viewed together
  • Can't search within specific fields (e.g. state,
    year, column headers, etc.)
  • De-contextualization of tables (search results
    don't indicate what volumes the tables belong to)
  • Lack of flexibility in customization

22
Evaluation of Nesstar-produced metadata
  • Labor-intensive to create and edit; no batch
    processing
  • Have to interpret table elements (add column
    header and measure labels)
  • Some data lost (textual codes for missing data,
    footnotes within cells)
  • DDI records not valid: some deviations from DDI
    specification

23
Automated DDI production: script-based process
  • Scripts pulled elements from Excel files and
    Dublin Core records
  • No manual editing necessary
  • DDI records are valid
  • Metadata describes table as is without our
    interpretation imposed on it
  • Challenges with marking up hierarchical tables in
    DDI

24
Search/browse UI for PDFs and Excel
  • Lucene index for keyword searching
  • Includes text from table titles, column and row
    labels, and footnotes
  • Can use accent marks or not in search terms;
    Lucene returns the appropriate result set
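
The accent-insensitive matching described above can be illustrated with a small Python sketch. This is not Lucene and not the project's code; it only shows the general idea of folding accents on both the query and the indexed text before comparing. The function names and sample table titles are invented.

```python
import unicodedata


def fold_accents(text):
    """Lowercase the text and drop combining accent marks (ó -> o, ñ -> n)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query, documents):
    """Return documents containing the query, ignoring accents and case."""
    needle = fold_accents(query)
    return [doc for doc in documents if needle in fold_accents(doc)]


# Invented table titles for illustration.
titles = ["Población total por municipio", "Producción agrícola"]
print(search("poblacion", titles))
print(search("POBLACIÓN", titles))
```

Because the same folding is applied to both sides, a query typed with or without accent marks matches the same titles. One side effect of this simple approach is that ñ is folded to n, which may or may not be wanted for Spanish text.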

27
Transforming paper to digital resources:
Generations of table presentation
  • Paper
  • Images of paper
  • PDF surrogates for paper publications
  • Excel files for downloading
  • Nesstar online presentation
  • Keyword search on PDF/Excel interface

28
What we are learning and documenting
  • Costs and processes of digitizing paper and
    building a statistical digital library
  • Scanning requirements (TIFFs, PDF/A, etc.)
  • OCR of Spanish text
  • OCR of numbers into spreadsheets (zoned scanning)
  • Quality assessment
  • Is it less expensive to key in the tables? What
    is the tipping point?

30
Cost comparison of keying in the data
  • Sent sample tables to two vendors and asked for
    cost estimates for keying
  • Varied widely: different pricing structures (per
    1000 characters vs. per Kb)
  • Both require shipping volumes (or photocopies)
    overseas
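
To compare quotes priced per 1000 characters with quotes priced per Kb, both can be applied to the same sample text. The Python below is a minimal sketch under stated assumptions: "Kb" is read as a kilobyte of 1024 bytes, the output is assumed to be UTF-8, and the rates and sample text are placeholders rather than the vendors' actual figures.

```python
def cost_per_1000_chars(text, rate):
    """Cost when the vendor charges `rate` per 1000 keyed characters."""
    return len(text) / 1000 * rate


def cost_per_kb(text, rate, encoding="utf-8"):
    """Cost when the vendor charges `rate` per kilobyte (assumed 1024 bytes).

    The byte count depends on the encoding: in UTF-8 an accented letter
    such as 'ó' takes two bytes, so Spanish text costs slightly more per
    character under this pricing structure.
    """
    return len(text.encode(encoding)) / 1024 * rate


# Placeholder sample and placeholder rates, for illustration only.
sample = "Población total por municipio\n" * 100
print(cost_per_1000_chars(sample, 1.00))
print(cost_per_kb(sample, 1.00))
```

Running both functions over the same sample tables gives a like-for-like figure per table, which is the kind of number needed to look for a tipping point between keying and OCR.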

31
Costs of digitizing Nigeria data
  • Many volumes have poor quality paper and print
  • mimeographed copies of typewritten pages
  • skewing, bleed through, strikeovers, text cut off
  • Goals of project
  • Try to find tipping point for digitizing
    collection vs. digitizing selection of tables
  • Test whether lessons learned from relatively
    clean tables apply to lower-quality print
    materials
  • Determine how these materials would fit into
    EGCDL UI
  • Cost estimates
  • Considering TIFF and PDF for all volumes,
    flagging good quality volumes for automated
    conversion to Excel

32
What we are learning and documenting
  • Digital Life Cycle considerations
  • Formats and metadata standards
  • Long term home for digital assets and the Rescue
    Repository
  • Exactly what assets do we preserve?
  • Questions about versioning
  • XML to Excel as potential preservation format

33
What we are learning and documenting
  • Metadata production: extensive use of automated
    processes
  • Dublin Core for file level metadata
  • DDI for table level metadata
  • User interface development and Spanish character
    challenges
  • Nesstar implementation for aggregate data
  • Developed own UI for more flexibility, ability to
    include PDFs and Excel tables

34
What we are learning and documenting
  • Scholarly use of the EGCDL: finding and
    accessing tables
  • Great advantage in locating data content by the
    full text of the tables; value in online access
    to individual tables.
  • Integrate searching and access into existing
    resources
  • Are investments in online analysis and
    visualization worth it? Are Excel tables what
    faculty and students want?

35
What we are learning and documenting
  • Production of more tables
  • Do faculty value a collections-based digitization
    effort, or is there more value in an on-demand
    service?
  • What other countries in the EGC print collection
    are of interest?
  • Cost comparisons of building full volume
    equivalents vs. selected tables.
  • The process can be leveraged to facilitate
    production for other research and learning
    projects.

36
Contact information
  • ann.green_at_yale.edu
  • julie.linden_at_yale.edu
  • sandra.k.peterson_at_yale.edu
  • Project web site
  • ssrs.yale.edu/egcdl