Title: Building the Digital Library: Setting the Standards, Building the Tool Kit
1Building the Digital Library: Setting the
Standards, Building the Tool Kit
- Roy Tennant
- California Digital Library
2Outline
- Project Planning
- Selecting Material to Digitize
- Basic Imaging Principles
- Capturing Images
- Editing Images
- Conversion to Text
- Best Practices
- Metadata
- Case Studies
- Skills Required of Staff
- Final Thoughts
3Project Planning
- What are the project goals and objectives?
- Which audiences do you wish to serve and how?
- Who will do the work?
- What systems will be required?
- What are the specifications for images and
metadata?
- How much will the project cost?
- Who will own and manage the digital products that
will be produced?
See Handbook for Digital Projects, section by
Steve Chapman
4Start at the End
- What do you want the end user experience to be
like?
- For example, do you want them to be able to:
- View thumbnail images that lead to one or more
larger sizes? How big?
- Search full-text?
- See text as page images or fully converted text?
- Etc.
- Answers to questions like these will dictate how
you will need to digitize, what metadata you must
capture, etc.
5Images w/Descriptive Text
- Benefits
- Relatively easy
- Can provide an enjoyable and instructive overview
of a collection
- Drawbacks
- Does not scale
- Lack of metadata limits uses
- Requires
- HTML
6Images w/Metadata
- Benefits
- Can provide reasonable access to many images
- Can repurpose and combine with other collections
- Drawbacks
- Can be expensive to produce and maintain
- Requires
- Item-level metadata
- Database
7Page Images
- Benefits
- Depicts page accurately
- Retains historical accuracy
- Drawbacks
- Unsearchable
- Cannot repurpose (e.g., other access formats)
- Requires
- Structural metadata
- Page-turning system
8Page Images w/OCR
- Benefits
- Same as w/page images only, but with
- Searchable text
- Drawbacks
- Typically, OCR is left dirty (uncorrected), so
searching is not 100% accurate
- Cannot repurpose
- Requires
- Structural metadata
- Page turning system
- Full-text index
9Full Text Basic
- Benefits
- Searchable
- Lower cost to produce than enriched texts
- Drawbacks
- Difficult to repurpose
- Loses fidelity to original
- Requires
- HTML or database (browse)
- Full-text index (search)
10Full Text Enriched
- Benefits
- Searchable
- Repurposable
- Drawbacks
- Loses fidelity to original (although less than
basic full text)
- Expensive to produce
- Requires
- An XML-serving infrastructure
- HTML or database (browse)
- Full-text index (search)
11Selecting Material to Digitize
- Publishing rights
- Available support/funding opportunity
- Critical mass
- Uniqueness
- Reputation
- Audience and potential use
- Diversity of material type
- Ability to stand on its own and fit in with other
collections
12Types of Materials
Printed text/ Simple line art
Mixed
Halftones
Manuscripts
Continuous Tone
From Anne Kenney, et al., Moving Theory into
Practice
13Benchmarking
- The process whereby you determine your
digitization requirements using the material you
will digitize
14Resolution
The number of pixels in a given area defines the
resolution of an image
One pixel
1
500 x 1,000 pixels
15Dynamic Range (bit-depth)
Sample images: 1 bit, 8 bit grayscale, 8 bit color,
24 bit color (shown as GIF, GIF, and JPEG)
- 1 bit = black or white
- 8 bits = 256 shades
- 16 bits = thousands
- 24 bits = millions
- 36 bits = billions
16RGB Color Space
- Color channels: Red, Green, Blue
- 8 bits per channel = 24 bit color image
- 12 bits per channel = 36 bit color image
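The bit-depth figures on these two slides follow from simple arithmetic: n bits can encode 2^n distinct values, and an RGB pixel carries one value per channel. A minimal sketch (the function names are illustrative, not from the presentation):

```python
def distinct_values(bits: int) -> int:
    """Number of distinct tones or colors that `bits` bits per pixel can encode."""
    return 2 ** bits


def rgb_bits_per_pixel(bits_per_channel: int, channels: int = 3) -> int:
    """Total bits per pixel when every color channel has the same depth."""
    return bits_per_channel * channels


if __name__ == "__main__":
    for bits in (1, 8, 16, 24, 36):
        print(f"{bits:>2} bits -> {distinct_values(bits):,} values")
    # 8 and 12 bits per channel give 24 and 36 bit RGB images.
    print(rgb_bits_per_pixel(8), rgb_bits_per_pixel(12))
```

This is where "256 shades", "thousands", "millions", and "billions" come from.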
17Image Compression
- Lossless: the image is unchanged after
compression (no image data is lost)
- Typical file size: 50% of original
- Example: LZW compression
- Lossy: the image is altered after compression
(image data is lost)
- Example: JPEG
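The difference between lossless and lossy can be shown with a round trip. This is only a sketch under stated assumptions: Python's standard library has no LZW or JPEG codec, so zlib (DEFLATE) stands in for LZW, and discarding low-order bits stands in for JPEG's far more sophisticated lossy step. The pixel data is made up.

```python
import zlib

# Hypothetical 8-bit grayscale data: a 256-step gradient repeated 64 times.
pixels = bytes(range(256)) * 64

# Lossless: compress, decompress, and get exactly the same bytes back.
packed = zlib.compress(pixels)
lossless_ok = zlib.decompress(packed) == pixels

# Lossy (crude stand-in for JPEG): throw away the 4 low-order bits of every
# pixel before compressing. The discarded detail cannot be recovered.
quantized = bytes(p & 0xF0 for p in pixels)
packed_lossy = zlib.compress(quantized)
lossy_ok = zlib.decompress(packed_lossy) == pixels

print("original bytes:", len(pixels))
print("lossless bytes:", len(packed), "exact round trip:", lossless_ok)
print("lossy bytes:   ", len(packed_lossy), "exact round trip:", lossy_ok)
```

The compression ratios printed here reflect this artificial gradient, not real scans.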
18TIFF
- Tagged Image File Format
- Most often used to save master versions of
images (unedited)
- Can be compressed or uncompressed (typically
lossless)
19CompuServe GIF
- Graphics Interchange Format (GIF)
- Maximum 8 bits/pixel = 256 colors (shades)
- Good for
- Text and line art
- Thumbnails
- Not good for
- Full-color pictures
- Anything that requires more than 256 colors
20JPEG
- Joint Photographic Experts Group
- JPEG is actually a compression scheme; the image
file format is JFIF (JPEG File Interchange Format)
- Good for
- Full-color pictures
- Anything that requires more than 256 colors
- Not good for
- Text or line art
21New Image Formats
- Portable Network Graphics (PNG) - from the W3C to
replace the CompuServe GIF format and provide
more capabilities
- JPEG2000 - An upgrade of the JPEG format
- Flashpix - from a consortium of commercial
companies, to provide much higher-resolution
images in a way that allows speedy network
delivery
- MrSID - From LizardTech, good for large format
materials (maps, panoramic photos, etc.)
22Capturing Images
- Rendering Intent
- Technologies
- Digital Cameras
- Flatbed Scanners
- Film Scanners
- Kodak PhotoCD
- Outsourcing
- Standards and Best Practices
23What is Your Rendering Intent?
- The Artifact
- The look and feel
- The experience of interacting with a specific
object
- Possible consequences
- Choices for providing access are limited
- Time and money spent on recreating the artifact
may be better spent on increasing access
- In some cases, preserving the look and feel
actually harms other uses
- The Intellectual Object
- The content and the use of that content is
central
- Possible consequences
- The experience of interacting with a specific
object may be lost
- The look and feel of a specific object may be
lost
- The user may not be aware of the actual physical
state of the original
25Digital Cameras
- Phase One PowerPhase FX: 10,500 x 12,600 pixels,
760MB (48 bit RGB)
- BetterLight Super6K: 6,000 x 8,000 pixels, 136MB
(24 bit RGB), $16,990
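The file sizes quoted for these cameras can be approximated from pixel dimensions and bit depth. A minimal sketch, assuming uncompressed storage and 1 MB = 1024 x 1024 bytes (both assumptions, as are the function names):

```python
def uncompressed_size_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Raw image size: `bits_per_pixel` bits stored for every pixel."""
    return width_px * height_px * bits_per_pixel // 8


def to_megabytes(size_bytes: int) -> float:
    """Convert bytes to megabytes, taking 1 MB = 1024 * 1024 bytes."""
    return size_bytes / (1024 * 1024)


if __name__ == "__main__":
    cameras = {
        "PowerPhase FX (48 bit)": (10_500, 12_600, 48),
        "Super6K (24 bit)": (6_000, 8_000, 24),
    }
    for name, (w, h, bits) in cameras.items():
        size = uncompressed_size_bytes(w, h, bits)
        print(f"{name}: {to_megabytes(size):.0f} MB")
```

The results come out close to the 760MB and 136MB on the slide; small differences depend on how a megabyte is counted and on file overhead.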
26Flatbed Scanners
- Minimum requirements
- 2400 dpi optical resolution
- 42-bit color
- Not for slides or transparencies, best for
8 1/2 x 11 or 8 1/2 x 14 originals
- Sheet feeder (often optional) helpful for
digitizing text
27Film Scanners
- For 35mm slides and negatives; others available
for larger formats
- $400 - $3,000
- Most around 2700-4000 dpi, 30-36 bit color
28Kodak PhotoCD
- Take pictures with a normal camera, but have your
pictures developed onto a PhotoCD
- A proprietary image format (ImagePAC), but very
high resolution (4 different resolutions)
29Outsourcing: Pros and Cons
- Benefits
- No ramp-up costs (both time and money)
- Probably higher quality, at least to begin with
- High volume capability
- Drawbacks
- May be more costly if you have underutilized
staff time
- No internal capability or experience developed
(that is, when the money runs out, so does your
chance to do anything more)
- Rare items may require in-house digitization
30Outsourcing: How
- Write an RFQ (Request for Quote) outlining:
- Type and amount of material being digitized
- Quality requirements
- Volume per unit of time requirements
- For RFQ guidance and samples, see RLG Tools for
Digital Imaging: www.rlg.org/preserv/RLGtools.html
31Digital Image Work Flow
- Original: TIFF or PCD, 10-100MB, stored offline
- Edit: Rotate, Crop, Retouch, Brightness/Contrast
- Derive: Resize, Sharpen
- JPEG: 100K, RGB color space, stored online
- GIF: 10K, indexed color space, stored online
32Editing Images
- Rotating
- Cropping
- Retouching
- Adjusting
- Resizing
- Sharpening
- Saving
33Image Editing Demonstration
34Conversion to Text
- Optical Character Recognition (OCR) software is
required (Abbyy FineReader, Caere OmniPage Pro,
Xerox TextBridge, etc.)
- Quality and typography of originals is key
- When OCR accuracy is below 99.5%, it is less
expensive to have the text re-keyed offshore
- For some applications, uncorrected text is
sufficient
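To see what the 99.5% threshold means in practice, it helps to turn accuracy into errors per page. A minimal sketch; the 2,000 characters per page used in the example is an assumption, and the function names are illustrative:

```python
REKEY_THRESHOLD = 0.995  # the 99.5% accuracy figure from the slide


def character_accuracy(total_chars: int, error_chars: int) -> float:
    """Fraction of characters the OCR engine recognized correctly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 1 - error_chars / total_chars


def expected_errors(total_chars: int, accuracy: float) -> float:
    """Number of wrong characters to expect at a given accuracy."""
    return total_chars * (1 - accuracy)


def cheaper_to_rekey(total_chars: int, error_chars: int,
                     threshold: float = REKEY_THRESHOLD) -> bool:
    """True when measured accuracy falls below the re-keying threshold."""
    return character_accuracy(total_chars, error_chars) < threshold


if __name__ == "__main__":
    page_chars = 2_000  # assumed page length
    print(round(expected_errors(page_chars, REKEY_THRESHOLD)), "errors per page at 99.5%")
    print(cheaper_to_rekey(page_chars, 20))  # 99.0% accurate page
```

At that assumed page length, 99.5% accuracy still leaves roughly ten errors on every page to find and fix.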
35Imaging Best Practices
- General guidelines for archival versions
- Photos, illustrations, maps, etc.
- 300-600dpi
- 24-36 bit color
- B/W Text document
- 300-600dpi
- 8 bit grayscale
- Negatives and Slides
- 3000-4000 pixels in longest dimension
- 24-36 bit color for color; 8 bit grayscale for B/W
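These guidelines mix two units: dpi for reflective originals and total pixels for film. A minimal sketch of the conversion between them; the page size and the 36 mm frame width in the example are assumptions, as are the function names:

```python
from typing import Tuple


def pixels_for_scan(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions produced by scanning an original at a given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def dpi_for_longest_side(longest_side_in: float, target_px: int) -> int:
    """Scanning resolution needed to reach `target_px` on the longest side."""
    return round(target_px / longest_side_in)


if __name__ == "__main__":
    # A letter-size page at the low and high ends of the 300-600 dpi range.
    print(pixels_for_scan(8.5, 11, 300))
    print(pixels_for_scan(8.5, 11, 600))
    # A 35mm frame is about 36 mm (36 / 25.4 inches) on its longest side.
    print(dpi_for_longest_side(36 / 25.4, 4000))
```

Reaching 4,000 pixels across a 35mm frame needs a scanner in the high-2000s dpi range, which a flatbed working at 300-600 dpi cannot deliver.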
36Imaging Best Practices
"The key to image quality is not to capture at
the highest resolution or bit depth possible, but
to match the conversion process to the
informational content of the original, and to
scan at that level--no more, no less."
From Moving Theory Into Practice
37The Importance of Metadata
- First definition: Cataloging by those paid better
than librarians
- Second definition: Structured description of an
object or collection of objects
- No matter what access system you use, having the
right metadata is essential
- The services you want to offer will define the
metadata you must capture
- The storage format is not that important as long
as you lose nothing and you can output it in all
the ways you wish to support
- Capturing it at the correct granularity is key
38Metadata Granularity
- The degree to which you segment or chop up your
metadata
- Gross: <name>John Doe</name>
- Fine: <name><given>John</given> <family>Doe</family></name>
39Metadata Qualification
- <name role="creator">William Randolph Hearst</name>
- <subject scheme="LCSH">Builder -- Castles --
Southern California</subject>
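Qualifiers such as role and scheme let software select exactly the values it needs. A minimal sketch using Python's standard ElementTree; the wrapping <record> element and the second, "contributor" name are hypothetical additions for contrast:

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only the creator name and the LCSH subject come from
# the slide; <record> and the contributor entry are made up for illustration.
RECORD = """
<record>
  <name role="creator">William Randolph Hearst</name>
  <name role="contributor">Example Contributor</name>
  <subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
</record>
"""

root = ET.fromstring(RECORD)

# Select by qualifier rather than by position or guesswork.
creators = [el.text for el in root.findall("name[@role='creator']")]
lcsh_subjects = [el.text for el in root.findall("subject[@scheme='LCSH']")]

# Knowing the scheme tells us how the heading is subdivided.
lcsh_terms = [part.strip() for part in lcsh_subjects[0].split("--")]

print(creators)
print(lcsh_terms)
```

Without the role attribute, a program could not tell which of the two names is the creator.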
40Metadata Machine Parseability
- The ability to pull apart and reconstruct
metadata via software
- For example, this:
<name><first>William</first> <middle>Randolph</middle> <last>Hearst</last></name>
- Can easily become this:
<DC.creator>Hearst, William Randolph</DC.creator>
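The transformation on this slide takes only a few lines of code once the name is finely segmented. A minimal sketch with Python's standard ElementTree; the function name is illustrative and the handling of missing parts is an assumption:

```python
import xml.etree.ElementTree as ET

FINE_NAME = (
    "<name><first>William</first> <middle>Randolph</middle> "
    "<last>Hearst</last></name>"
)


def to_dc_creator(name_xml: str) -> str:
    """Rebuild a finely segmented name as an inverted DC.creator element."""
    name = ET.fromstring(name_xml)
    first = name.findtext("first", default="").strip()
    middle = name.findtext("middle", default="").strip()
    last = name.findtext("last", default="").strip()

    # Join whichever forenames are present, then invert: "Last, First Middle".
    forenames = " ".join(part for part in (first, middle) if part)
    creator = ET.Element("DC.creator")
    creator.text = f"{last}, {forenames}" if forenames else last
    return ET.tostring(creator, encoding="unicode")


if __name__ == "__main__":
    print(to_dc_creator(FINE_NAME))
```

Going the other way, from the single string "Hearst, William Randolph" back to separate first and middle names, would require guessing, which is the argument for capturing metadata at fine granularity.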
41Metadata Types
- descriptive - e.g., title, creator, subject -
used for discovery
- administrative - e.g., resolution, bit depth -
used for managing the collection
- structural - e.g., table of contents page, page
34, etc. - used for navigation
- preservation - metadata useful for preserving the
item
42Item v. Collection Metadata
- Collection-level metadata
- Discovery metadata describes the collection
- Example: Kentuckiana Digital Library; see
www.kyvl.org/kentuckiana/digilibcoll/digilibcoll.shtml
- Item-level metadata
- Discovery metadata describes the item
- Example: MARC or Dublin Core records for each
item; see californiadigitallibrary.org
- Both types may be appropriate
- Doing both often takes very little extra effort
45Key Metadata Standards
- Encoded Archival Description (EAD)
- Used to describe archival collections
- www.loc.gov/ead/
- Expressed in SGML, or increasingly XML
- Benefits
- Allows for the description of large collections
without individual item cataloging
- Is the only relevant standard in the field and
has the support of the key professional
association
- Drawbacks
- Is often used to encapsulate individual digitized
items as the only method of access to those items
46Key Metadata Standards
- Metadata Object Description Schema (MODS)
- Purpose is not completely clear, but it is a
bibliographic record standard similar to MARC
- www.loc.gov/standards/mods/
- Expressed in XML
- Benefits
- May be our best bet for leaving some of the
baggage of MARC/AACR2 behind
- Has the backing and support of the Library of
Congress
- Drawbacks
- Appears to be under the control of the Library of
Congress, which is finding it difficult to think
outside of the MARC box
- Is not yet fully described
47Key Metadata Standards
- Dublin Core (DC)
- Used to provide a lowest common denominator for
resource discovery among collections with more
complex and unique metadata formats
- dublincore.org
- Expressed in various ways: HTML, XML, RDF, etc.
- Benefits
- Provides a useful way to unify resource discovery
for disparate collections
- Is the only standard addressing this need and has
the support of major players in digital libraries
- Drawbacks
- Is not yet fully described
- There is often a mistaken assumption that
adherence to DC for internal metadata needs is
both sufficient and desirable
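The "lowest common denominator" point can be illustrated with a crosswalk from a richer local record to unqualified Dublin Core. This is a sketch only: the local record layout and field names are hypothetical, and just three DC elements (title, creator, subject) are mapped.

```python
# Hypothetical local record, richer than unqualified Dublin Core can express.
local_record = {
    "title": "Example photograph",
    "creator": {"given": "John", "family": "Doe", "role": "photographer"},
    "subject": {"scheme": "LCSH", "terms": ["Castles", "California"]},
    "scan": {"dpi": 600, "bit_depth": 24},
}


def to_dublin_core(record: dict) -> dict:
    """Flatten a local record into simple Dublin Core elements for discovery."""
    creator = record["creator"]
    subject = record["subject"]
    return {
        "title": record["title"],
        # Name parts collapse into one string; the role is dropped.
        "creator": f'{creator["family"]}, {creator["given"]}',
        # Terms collapse into one heading; the scheme is dropped.
        "subject": " -- ".join(subject["terms"]),
        # Administrative data such as scan settings has no place here.
    }


if __name__ == "__main__":
    print(to_dublin_core(local_record))
```

The crosswalk runs in one direction only: the role, the subject scheme, and the scan settings cannot be rebuilt from the DC record, which is why DC works for sharing but falls short as the internal format.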
48Key Metadata Standards
- Metadata Encoding and Transfer Syntax (METS)
- Used to encapsulate digital objects (both simple
and complex)
- www.loc.gov/standards/mets/
- Expressed in XML
- Benefits
- Provides a method for unifying access to one or
more packages of descriptive metadata, all of the
relevant digital files, and structural
information
- Is the only standard addressing this need and has
the support of major players in digital libraries
- Drawbacks
- Is at an early stage of development, although
projects are using it now
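The three things METS ties together (descriptive metadata, files, and structure) can be seen in a skeletal document. A minimal sketch built with Python's standard ElementTree; the title, file URLs, and IDs are hypothetical, the output has not been validated against the METS schema, and a production record would carry much more:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS_NS), ("xlink", XLINK_NS), ("dc", DC_NS)):
    ET.register_namespace(prefix, uri)


def q(namespace: str, tag: str) -> str:
    """Build a namespace-qualified tag or attribute name for ElementTree."""
    return f"{{{namespace}}}{tag}"


def build_mets(title: str, page_urls: list) -> ET.Element:
    mets = ET.Element(q(METS_NS, "mets"))

    # 1. Descriptive metadata: a Dublin Core title wrapped inside the record.
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), ID="DMD1")
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), MDTYPE="DC")
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(DC_NS, "title")).text = title

    # 2. File inventory and 3. structural map, filled in page by page.
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), USE="master")
    struct = ET.SubElement(mets, q(METS_NS, "structMap"), TYPE="physical")
    book = ET.SubElement(struct, q(METS_NS, "div"), TYPE="book", DMDID="DMD1")

    for number, url in enumerate(page_urls, start=1):
        file_id = f"FILE{number}"
        file_el = ET.SubElement(group, q(METS_NS, "file"),
                                ID=file_id, MIMETYPE="image/tiff")
        ET.SubElement(file_el, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): url})
        page = ET.SubElement(book, q(METS_NS, "div"),
                             TYPE="page", ORDER=str(number))
        ET.SubElement(page, q(METS_NS, "fptr"), FILEID=file_id)
    return mets


if __name__ == "__main__":
    record = build_mets("Example book", ["http://example.org/p1.tif",
                                         "http://example.org/p2.tif"])
    print(ET.tostring(record, encoding="unicode"))
```

Each page div in the structural map points at a file ID, which is the kind of link a page-turning system needs.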
49Key Metadata Standards
- Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH)
- Used to provide a method of harvesting metadata
from compliant repositories
- www.openarchives.org
- Is both a protocol and a syntax (expressed in
XML)
- Benefits
- Provides a useful way to unify resource discovery
for disparate collections
- Is the only standard addressing this need and has
the support of major players in digital libraries
- Drawbacks
- Is in an early state, with only one standard
metadata format (DC)
- It is a harvesting protocol, not a searching
protocol, and therefore your ability to get only
those records that interest you is limited
50Databases: Pick Your Poison
- Virtually any database or indexing product will
in most cases work
- Key considerations
- What do you already have?
- Which platform are you on?
- Which product will your IT staff be willing to
support?
- Are there search features you must have?
- How much money do you have to spend?
51Databases: Examples
- Targeted to the market and purpose: e.g.,
ContentDM from OCLC
- General purpose commercial: e.g., Oracle, Sybase,
SQL Server
- General purpose open source: e.g., MySQL, SWISH-E
- Shrink-wrapped consumer: e.g., MS Access
52Case Study: SWISH-E
- Free, open source indexing software for Unix
(including Mac OS X) and Windows
- Is HTML and XML aware (you can limit searches to
specific tags)
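What "tag aware" indexing buys you can be shown with a toy index. This is not SWISH-E and does not use its configuration or query syntax; it is a small Python stand-in that records which tags enclose each word so a search can be limited to, say, the title. The sample documents are made up.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagAwareIndexer(HTMLParser):
    """Collect, for one document, the words found inside each tag."""

    def __init__(self):
        super().__init__()
        self._open_tags = []
        self.words_by_tag = defaultdict(set)

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        for word in data.lower().split():
            word = word.strip(".,;:!?\"'()")
            if not word:
                continue
            # A word belongs to every tag that currently encloses it.
            for tag in self._open_tags:
                self.words_by_tag[tag].add(word)


def build_index(documents: dict) -> dict:
    """Map (tag, word) to the set of document names containing it."""
    index = defaultdict(set)
    for name, html in documents.items():
        parser = TagAwareIndexer()
        parser.feed(html)
        parser.close()
        for tag, words in parser.words_by_tag.items():
            for word in words:
                index[(tag, word)].add(name)
    return index


def search(index: dict, word: str, tag: str = "html") -> list:
    """Find documents with `word` inside `tag` (whole document by default)."""
    return sorted(index.get((tag, word.lower()), set()))


DOCS = {
    "a.html": "<html><head><title>Castles of California</title></head>"
              "<body><p>A survey of ranches.</p></body></html>",
    "b.html": "<html><head><title>Ranch life</title></head>"
              "<body><p>Few castles were built here.</p></body></html>",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "castles"))               # anywhere in the document
    print(search(index, "castles", tag="title"))  # only in the title
```

The same idea applies to XML: if the texts are marked up, a search can be restricted to the elements that matter.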
53http://escholarship.cdlib.org/ucpress/
54File System
(Diagram labels: Encoded in TEI XML; Stored; Search
Index; Full Text)
56File System
(Diagram labels: Encoded in TEI XML; Stored; Search
Index: Full Text; Search Index: Structure; Selected
Fields Extracted; METS Repository; Records Created;
Stored; Project Profile; MODS record; UC Press
record; Library Catalog; UC Press Database)
58File System
(Same diagram as slide 56, adding: User queries)
59File System
(Same diagram as slide 56, adding: Search Results;
User requests book)
60File System
(Same diagram as slide 56, adding: User requests
book segment; Java servlet; METS record in XML;
XSLT)
61File System
(Same diagram as slide 56, adding: Java servlet;
XSLT; Book segment returned)
62File System
(Same diagram as slide 56, adding: User queries)
63File System
(Same diagram as slide 56, adding: Results returned)
64File System
(Same diagram as slide 56, adding: Java servlet;
User wants to see search words in context)
65File System
(Same diagram as slide 56, adding: Java servlet;
XSLT; Book segment returned w/terms highlighted)
66Skills Required of Staff
- Imaging
- OCR
- Markup languages (HTML, XML)
- Cataloging metadata
- Indexing and database technology
- User interface design
- Programming
- Web technology
- Project management
67Final Thoughts
- Be careful what you wish for: Once you start a
digital project, you are committed to it for
life - Peter Hirtle
- Hardware is cheap, people are expensive
- For any given project, there are several ways it
can succeed (there is no one right answer)
- Never forget for whom you are doing this! (it's
the customer, stupid)