Building the Digital Library: Setting the Standards, Building the Tool Kit

1
Building the Digital Library: Setting the
Standards, Building the Tool Kit
  • Roy Tennant
  • California Digital Library

2
Outline
  • Project Planning
  • Selecting Material to Digitize
  • Basic Imaging Principles
  • Capturing Images
  • Editing Images
  • Conversion to Text
  • Best Practices
  • Metadata
  • Case Studies
  • Skills Required of Staff
  • Final Thoughts

3
Project Planning
  • What are the project goals and objectives?
  • Which audiences do you wish to serve and how?
  • Who will do the work?
  • What systems will be required?
  • What are the specifications for images and
    metadata?
  • How much will the project cost?
  • Who will own and manage the digital products that
    will be produced?

See Handbook for Digital Projects, section by
Steve Chapman
4
Start at the End
  • What do you want the end user experience to be
    like?
  • For example, do you want them to be able to:
  • View thumbnail images that lead to one or more
    larger sizes? How big?
  • Search full-text?
  • See text as page images or fully converted text?
  • Etc.
  • Answers to questions like these will dictate how
    you will need to digitize, what metadata you must
    capture, etc.

5
Images w/Descriptive Text
  • Benefits
  • Relatively easy
  • Can provide an enjoyable and instructive overview
    of a collection
  • Drawbacks
  • Does not scale
  • Lack of metadata limits uses
  • Requires
  • HTML

6
Images w/Metadata
  • Benefits
  • Can provide reasonable access to many images
  • Can repurpose and combine with other collections
  • Drawbacks
  • Can be expensive to produce and maintain
  • Requires
  • Item-level metadata
  • Database

7
Page Images
  • Benefits
  • Depicts page accurately
  • Retains historical accuracy
  • Drawbacks
  • Unsearchable
  • Cannot repurpose (e.g., other access formats)
  • Requires
  • Structural metadata
  • Page-turning system

8
Page Images w/OCR
  • Benefits
  • Same as w/page images only, but with
  • Searchable text
  • Drawbacks
  • Typically, OCR is left dirty (uncorrected), so
    searching is not 100% accurate
  • Cannot repurpose
  • Requires
  • Structural metadata
  • Page turning system
  • Full-text index

9
Full Text: Basic
  • Benefits
  • Searchable
  • Lower cost to produce than enriched texts
  • Drawbacks
  • Difficult to repurpose
  • Loses fidelity to original
  • Requires
  • HTML or database (browse)
  • Full-text index (search)

10
Full Text: Enriched
  • Benefits
  • Searchable
  • Repurposable
  • Drawbacks
  • Loses fidelity to original (although less than
    basic full text)
  • Expensive to produce
  • Requires
  • An XML-serving infrastructure
  • HTML or database (browse)
  • Full-text index (search)

11
Selecting Material to Digitize
  • Publishing rights
  • Available support/funding opportunity
  • Critical mass
  • Uniqueness
  • Reputation
  • Audience and potential use
  • Diversity of material type
  • Ability to stand on its own and fit in with other
    collections

12
Types of Materials
  • Printed text / Simple line art
  • Mixed
  • Halftones
  • Manuscripts
  • Continuous Tone
From Anne Kenney et al., Moving Theory into
Practice
13
Benchmarking
  • The process whereby you determine your
    digitization requirements using the material you
    will digitize

14
Resolution
The number of pixels in a given area defines the
resolution of an image
[Figure: a single pixel, and an example image of 500 x 1,000 pixels]
15
Dynamic Range (bit-depth)
  • 1 bit: black or white
  • 8 bits: 256 shades
  • 16 bits: thousands
  • 24 bits: millions
  • 36 bits: billions
[Figure: sample images at 1 bit, 8 bit grayscale, 8 bit color, and 24 bit color, shown as GIF and JPEG examples]
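The shade counts on this slide follow from the fact that n bits can encode 2^n distinct values. A minimal sketch of that arithmetic (the function name `distinct_values` is illustrative and not from the presentation):

```python
def distinct_values(bits: int) -> int:
    """Return how many distinct values a pixel of the given bit depth can hold."""
    if bits < 1:
        raise ValueError("bit depth must be at least 1")
    return 2 ** bits


# Print the exact counts behind "256 shades", "thousands", "millions", "billions".
for bits in (1, 8, 16, 24, 36):
    print(f"{bits:>2} bits: {distinct_values(bits):,} values")
```
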
16
RGB Color Space
  • Color channels: Red, Green, Blue
  • 8 bits per channel = 24 bit color image
  • 12 bits per channel = 36 bit color image
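One way to see why three 8-bit channels make a 24-bit pixel is to pack the channel values into a single integer. The sketch below assumes red occupies the high-order bits; that ordering and the function names are choices made for illustration, not something the slide specifies.

```python
def pack_rgb(red: int, green: int, blue: int, bits_per_channel: int = 8) -> int:
    """Combine three channel values into one pixel value (red in the high bits)."""
    max_value = (1 << bits_per_channel) - 1
    for name, value in (("red", red), ("green", green), ("blue", blue)):
        if not 0 <= value <= max_value:
            raise ValueError(f"{name} must be between 0 and {max_value}")
    return (red << (2 * bits_per_channel)) | (green << bits_per_channel) | blue


def unpack_rgb(pixel: int, bits_per_channel: int = 8) -> tuple[int, int, int]:
    """Split a packed pixel value back into its (red, green, blue) channels."""
    mask = (1 << bits_per_channel) - 1
    return (
        (pixel >> (2 * bits_per_channel)) & mask,
        (pixel >> bits_per_channel) & mask,
        pixel & mask,
    )


orange = pack_rgb(255, 128, 0)
print(hex(orange), unpack_rgb(orange))
```

With `bits_per_channel=12` the same functions produce 36-bit pixel values.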
17
Image Compression
  • Lossless: the image is unchanged after
    compression (no image data is lost)
  • Typical file size: 50% of original
  • Example: LZW compression
  • Lossy: the image is altered after compression
    (image data is lost)
  • Example: JPEG
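The difference between the two families can be shown with a round trip. In this sketch, Python's `zlib` (DEFLATE) stands in for LZW, and discarding low-order bits stands in for JPEG; neither is the algorithm named on the slide, and the synthetic gradient is only a stand-in for real image data.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress; the result is byte-for-byte the original."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(data: bytes, keep_bits: int = 4) -> bytes:
    """Discard the low-order bits of every sample; that detail cannot be recovered."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    return bytes(sample & mask for sample in data)


# A synthetic 8-bit grayscale gradient stands in for real image data.
pixels = bytes(range(256)) * 64

print("lossless identical:", lossless_round_trip(pixels) == pixels)
print("lossy identical:   ", lossy_round_trip(pixels) == pixels)
```
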

18
TIFF
  • Tagged Image File Format
  • Most often used to save master versions of
    images (unedited)
  • Can be compressed or uncompressed (typically
    lossless)

19
CompuServe GIF
  • Graphics Interchange Format (GIF)
  • Maximum 8 bits/pixel: 256 colors (shades)
  • Good for
  • Text and line art
  • Thumbnails
  • Not good for
  • Full-color pictures
  • Anything that requires more than 256 colors

20
JPEG
  • Joint Photographic Experts Group
  • JPEG is actually a compression scheme; the image
    file format is JFIF (JPEG File Interchange Format)
  • Good for
  • Full-color pictures
  • Anything that requires more than 256 colors
  • Not good for
  • Text or line art

21
New Image Formats
  • Portable Network Graphics (PNG) - from the W3C to
    replace the CompuServe GIF format and provide
    more capabilities
  • JPEG2000 - An upgrade of the JPEG format
  • Flashpix - from a consortium of commercial
    companies, to provide much higher-resolution
    images in a way that allows speedy network
    delivery
  • MrSID - From LizardTech, good for large format
    materials (maps, panoramic photos, etc.)

22
Capturing Images
  • Rendering Intent
  • Technologies
  • Digital Cameras
  • Flatbed Scanners
  • Film Scanners
  • Kodak PhotoCD
  • Outsourcing
  • Standards and Best Practices

23
What is Your Rendering Intent?
  • The Artifact
  • The look and feel
  • The experience of interacting with a specific
    object
  • Possible consequences
  • Choices for providing access are limited
  • Time and money spent on recreating the artifact
    may be better spent on increasing access
  • In some cases, preserving the look and feel
    actually harms other uses
  • The Intellectual Object
  • The content and the use of that content is
    central
  • Possible consequences
  • The experience of interacting with a specific
    object may be lost
  • The look and feel of a specific object may be
    lost
  • The user may not be aware of the actual physical
    state of the original

25
Digital Cameras
  • Phase One PowerPhase FX: 10,500 x 12,600 pixels,
    760MB (48 bit RGB)
  • BetterLight Super6K: 6,000 x 8,000 pixels, 136MB
    (24 bit RGB), $16,990
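The file sizes quoted for these cameras can be checked from the pixel dimensions and bit depth. This is a sketch of the arithmetic for uncompressed image data only (no file headers); the function names are illustrative.

```python
def uncompressed_size_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Raw image data size: one value of bits_per_pixel for every pixel."""
    return width_px * height_px * bits_per_pixel // 8


def to_megabytes(size_bytes: int) -> float:
    """Convert bytes to megabytes, counting 1 MB as 1024 * 1024 bytes."""
    return size_bytes / (1024 * 1024)


for label, width, height, depth in (
    ("PowerPhase FX", 10_500, 12_600, 48),
    ("Super6K", 6_000, 8_000, 24),
):
    size = uncompressed_size_bytes(width, height, depth)
    print(f"{label}: {size:,} bytes, about {to_megabytes(size):.0f} MB")
```

The results, roughly 757 MB and 137 MB, are close to the 760MB and 136MB figures on the slide.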
26
Flatbed Scanners
  • Minimum requirements
  • 2400 dpi optical resolution
  • 42-bit color
  • Not for slides or transparencies; best for
    8 1/2 x 11 or 8 1/2 x 14 originals
  • Sheet feeder (often optional) helpful for
    digitizing text

27
Film Scanners
  • For 35mm slides and negatives; others available
    for larger formats
  • $400 - $3,000
  • Most around 2700-4000 dpi, 30-36 bit color

28
Kodak PhotoCD
  • Take pictures with a normal camera, but have your
    pictures developed onto a PhotoCD
  • A proprietary image format (ImagePAC), but very
    high resolution (4 different resolutions)

29
Outsourcing: Pros and Cons
  • Benefits
  • No ramp-up costs (both time and money)
  • Probably higher quality, at least to begin with
  • High volume capability
  • Drawbacks
  • May be more costly if you have underutilized
    staff time
  • No internal capability or experience developed
    (that is, when the money runs out, so does your
    chance to do anything more)
  • Rare items may require in-house digitization

30
Outsourcing: How
  • Write an RFQ (Request for Quote) outlining:
  • Type and amount of material being digitized
  • Quality requirements
  • Volume per unit of time requirements
  • For RFQ guidance and samples, see RLG Tools for
    Digital Imaging
  • www.rlg.org/preserv/RLGtools.html

31
Digital Image Work Flow
[Diagram: Original TIFF or PCD (10-100MB, RGB color space, stored offline) -> Rotate, Crop, Retouch, Brightness/Contrast -> Resize, Sharpen -> JPEG (100K, RGB color space) and GIF (10K, indexed color space), stored online]
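The Resize step in this work flow usually means producing smaller online derivatives from a large master. A minimal sketch of the dimension arithmetic, assuming derivatives are fitted inside a maximum side length while keeping the aspect ratio (the function name and the 800 and 150 pixel targets are illustrative, not from the presentation):

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale dimensions so the longest side equals max_side; never upscale."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough
    return (
        max(1, round(width * max_side / longest)),
        max(1, round(height * max_side / longest)),
    )


master = (6000, 8000)
print("access image:", fit_within(*master, 800))
print("thumbnail:   ", fit_within(*master, 150))
```
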
32
Editing Images
  • Rotating
  • Cropping
  • Retouching
  • Adjusting
  • Resizing
  • Sharpening
  • Saving

33
Image Editing Demonstration
34
Conversion to Text
  • Optical Character Recognition (OCR) software is
    required (Abbyy FineReader, Caere OmniPage Pro,
    Xerox TextBridge, etc.)
  • Quality and typography of originals is key
  • Text that OCRs at less than 99.5% accuracy is
    less expensive to have re-keyed offshore
  • For some applications, uncorrected text is
    sufficient
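To apply the 99.5% rule of thumb you need a way to measure accuracy. The slide does not say how accuracy is measured; the sketch below assumes character accuracy computed as edit distance against a hand-corrected sample page, and all function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Share of characters in the corrected sample that OCR got right."""
    if not truth:
        raise ValueError("the corrected sample must not be empty")
    return max(0.0, 1 - edit_distance(truth, ocr) / len(truth))


def cheaper_to_rekey(truth: str, ocr: str, threshold: float = 0.995) -> bool:
    """True when accuracy falls below the threshold quoted on the slide."""
    return character_accuracy(truth, ocr) < threshold
```

For example, a 400-character sample with one wrong character scores 99.75% and stays with OCR, while three wrong characters score 99.25% and fall below the threshold.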

35
Imaging Best Practices
  • General guidelines for archival versions
  • Photos, illustrations, maps, etc.
  • 300-600dpi
  • 24-36 bit color
  • B/W Text document
  • 300-600dpi
  • 8 bit grayscale
  • Negatives and Slides
  • 3000-4000 pixels in longest dimension
  • 24-36 bit color for color; 8 bit grayscale for B/W
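These dpi guidelines translate into pixel dimensions once the size of the original is known. A small sketch of that conversion, using a letter-size page as an assumed example (the function name is illustrative):

```python
def scan_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions produced by scanning an original at a given dpi."""
    if width_in <= 0 or height_in <= 0 or dpi <= 0:
        raise ValueError("size and dpi must be positive")
    return round(width_in * dpi), round(height_in * dpi)


# An 8 1/2 x 11 inch page at the low and high ends of the 300-600dpi range.
for dpi in (300, 600):
    print(dpi, "dpi:", scan_dimensions(8.5, 11, dpi))
```
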

36
Imaging Best Practices
"The key to image quality is not to capture at
the highest resolution or bit depth possible, but
to match the conversion process to the
informational content of the original, and to
scan at that level--no more, no less."
- Moving Theory Into Practice
37
The Importance of Metadata
  • First definition: Cataloging by those paid better
    than librarians
  • Second definition: Structured description of an
    object or collection of objects
  • No matter what access system you use, having the
    right metadata is essential
  • The services you want to offer will define the
    metadata you must capture
  • The storage format is not that important as long
    as you lose nothing and you can output it in all
    the ways you wish to support
  • Capturing it at the correct granularity is key

38
Metadata Granularity
  • The degree to which you segment or chop up your
    metadata
  • Gross: <name>John Doe</name>
  • Fine: <name><given>John</given><family>Doe</family></name>

39
Metadata Qualification
  • <name role="creator">William Randolph Hearst</name>
  • <subject scheme="LCSH">Builder -- Castles --
    Southern California</subject>
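Qualifiers such as `role` and `scheme` let software select exactly the values it needs. The sketch below uses Python's standard `xml.etree.ElementTree`; the enclosing `<record>` element, the second name, and the `local` subject are hypothetical additions made only so there is something to filter out.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only the creator name and LCSH subject come from the slide.
RECORD = """
<record>
  <name role="creator">William Randolph Hearst</name>
  <name role="photographer">Jane Example</name>
  <subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
  <subject scheme="local">castles</subject>
</record>
"""


def values_where(root: ET.Element, tag: str, attribute: str, value: str) -> list[str]:
    """Return the text of child elements whose qualifier attribute matches."""
    return [el.text for el in root.findall(f"{tag}[@{attribute}='{value}']")]


root = ET.fromstring(RECORD)
print(values_where(root, "name", "role", "creator"))
print(values_where(root, "subject", "scheme", "LCSH"))
```
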

40
Metadata Machine Parseability
  • The ability to pull apart and reconstruct
    metadata via software
  • For example, this:
    <name><first>William</first><middle>Randolph</middle><last>Hearst</last></name>
  • Can easily become this:
    <DC.creator>Hearst, William Randolph</DC.creator>
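The transformation on this slide can be sketched in a few lines with Python's standard `xml.etree.ElementTree`. The function name `to_dc_creator` is illustrative; this shows the idea rather than any particular project's code.

```python
import xml.etree.ElementTree as ET


def to_dc_creator(name_xml: str) -> str:
    """Rebuild a fine-grained <name> element as an inverted DC.creator string."""
    name = ET.fromstring(name_xml)
    # Join whichever given-name parts are present; missing parts are skipped.
    given = " ".join(
        part for part in (name.findtext("first"), name.findtext("middle")) if part
    )
    family = name.findtext("last") or ""
    creator = ET.Element("DC.creator")
    creator.text = f"{family}, {given}" if given else family
    return ET.tostring(creator, encoding="unicode")


source = (
    "<name><first>William</first><middle>Randolph</middle>"
    "<last>Hearst</last></name>"
)
print(to_dc_creator(source))
```

Going the other way, from the single string back to separate name parts, is much harder to do reliably, which is why capturing at fine granularity matters.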
41
Metadata Types
  • descriptive - e.g., title, creator, subject -
    used for discovery
  • administrative - e.g., resolution, bit depth -
    used for managing the collection
  • structural - e.g., table of contents page, page
    34, etc. - used for navigation
  • preservation - metadata useful for preserving the
    item

42
Item v. Collection Metadata
  • Collection-level metadata
  • Discovery metadata describes the collection
  • Example: Kentuckiana Digital Library; see
    www.kyvl.org/kentuckiana/digilibcoll/digilibcoll.shtml
  • Item-level metadata
  • Discovery metadata describes the item
  • Example: MARC or Dublin Core records for each
    item; see californiadigitallibrary.org
  • Both types may be appropriate
  • Doing both often takes very little extra effort

45
Key Metadata Standards
  • Encoded Archival Description (EAD)
  • Used to describe archival collections
  • www.loc.gov/ead/
  • Expressed in SGML, or increasingly XML
  • Benefits
  • Allows for the description of large collections
    without individual item cataloging
  • Is the only relevant standard in the field and
    has the support of the key professional
    association
  • Drawbacks
  • Is often used to encapsulate individual digitized
    items as the only method of access to those items

46
Key Metadata Standards
  • Metadata Object Description Schema (MODS)
  • Purpose is not completely clear, but it is a
    bibliographic record standard similar to MARC
  • www.loc.gov/standards/mods/
  • Expressed in XML
  • Benefits
  • May be our best bet for leaving some of the
    baggage of MARC/AACR2 behind
  • Has the backing and support of the Library of
    Congress
  • Drawbacks
  • Appears to be under the control of the Library of
    Congress, which is finding it difficult to think
    outside of the MARC box
  • Is not yet fully described

47
Key Metadata Standards
  • Dublin Core (DC)
  • Used to provide a lowest common denominator for
    resource discovery among collections with more
    complex and unique metadata formats
  • dublincore.org
  • Expressed in various ways: HTML, XML, RDF, etc.
  • Benefits
  • Provides a useful way to unify resource discovery
    for disparate collections
  • Is the only standard addressing this need and has
    the support of major players in digital libraries
  • Drawbacks
  • Is not yet fully described
  • There is often a mistaken assumption that
    adherence to DC for internal metadata needs is
    both sufficient and desirable
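The "lowest common denominator" role can be illustrated by deriving a Dublin Core record from a richer internal one. In this sketch the internal record layout, the placeholder title, the `<record>` wrapper, and the function name are all assumptions; only the Dublin Core element namespace is standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical internal record: richer than Dublin Core, with a fine-grained
# name and administrative details that simple DC does not carry.
internal_record = {
    "title": "Sample photograph",  # placeholder title, not from the source
    "creator": {"first": "William", "middle": "Randolph", "last": "Hearst"},
    "subject": "Builder -- Castles -- Southern California",
    "scan_dpi": 600,
    "bit_depth": 24,
}


def to_dublin_core(record: dict) -> ET.Element:
    """Export only the discovery fields; internal detail stays behind."""
    root = ET.Element("record")
    name = record["creator"]
    given = " ".join(p for p in (name.get("first"), name.get("middle")) if p)
    values = {
        "title": record["title"],
        "creator": f"{name['last']}, {given}",
        "subject": record["subject"],
    }
    for field, value in values.items():
        ET.SubElement(root, f"{{{DC_NS}}}{field}").text = value
    return root


dc_record = to_dublin_core(internal_record)
print(ET.tostring(dc_record, encoding="unicode"))
```

The export drops the fine-grained name and the scanning details, which is the reason the slide warns against treating DC as sufficient for internal metadata.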

48
Key Metadata Standards
  • Metadata Encoding and Transfer Syntax (METS)
  • Used to encapsulate digital objects (both simple
    and complex)
  • www.loc.gov/standards/mets/
  • Expressed in XML
  • Benefits
  • Provides a method for unifying access to one or
    more packages of descriptive metadata, all of the
    relevant digital files, and structural
    information
  • Is the only standard addressing this need and has
    the support of major players in digital libraries
  • Drawbacks
  • Is at an early stage of development, although
    projects are using it now

49
Key Metadata Standards
  • Open Archives Initiative Protocol for Metadata
    Harvesting (OAI-PMH)
  • Used to provide a method of harvesting metadata
    from compliant repositories
  • www.openarchives.org
  • Is both a protocol and a syntax (expressed in
    XML)
  • Benefits
  • Provides a useful way to unify resource discovery
    for disparate collections
  • Is the only standard addressing this need and has
    the support of major players in digital libraries
  • Drawbacks
  • Is in an early state, with only one standard
    metadata format (DC)
  • It is a harvesting protocol, not a searching
    protocol, and therefore your ability to get only
    those records that interest you is limited

50
Databases: Pick Your Poison
  • Virtually any database or indexing product will
    in most cases work
  • Key considerations
  • What do you already have?
  • Which platform are you on?
  • Which product will your IT staff be willing to
    support?
  • Are there search features you must have?
  • How much money do you have to spend?

51
Databases: Examples
  • Targeted to the market and purpose: e.g.,
    ContentDM from OCLC
  • General purpose commercial: e.g., Oracle, Sybase,
    SQL Server
  • General purpose open source: e.g., MySQL, SWISH-E
  • Shrink-wrapped consumer: e.g., MS Access

52
Case Study: SWISH-E
  • Free, open source indexing software for Unix
    (including Mac OS X) and Windows
  • Is HTML and XML aware (you can limit searches to
    specific tags)
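To show what "limit searches to specific tags" means, here is a toy tag-aware index written in Python. It is not SWISH-E and does not use its interface or configuration; the two sample documents and all names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Map (tag, word) pairs to the documents where that word occurs in that tag."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        for element in ET.fromstring(xml_text).iter():
            # Only the element's own leading text is indexed in this sketch.
            for word in re.findall(r"\w+", (element.text or "").lower()):
                index[(element.tag, word)].add(doc_id)
    return index


def search(index: dict, word: str, tag: str) -> list[str]:
    """Return the documents that contain the word inside the given tag."""
    return sorted(index.get((tag, word.lower()), set()))


# Invented sample documents.
docs = {
    "a.xml": "<doc><title>Castles of California</title><body>A survey of maps.</body></doc>",
    "b.xml": "<doc><title>Historic Maps</title><body>Castles appear on several sheets.</body></doc>",
}
index = build_index(docs)
print(search(index, "castles", "title"))
print(search(index, "castles", "body"))
```

The same word matches different documents depending on which tag the search is limited to.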

53
http://escholarship.cdlib.org/ucpress/
54
[Diagram: File System (texts encoded in TEI XML, stored); Search Index (full text)]
56
[Diagram: File System (texts encoded in TEI XML, stored); Search Index (full text, structure); selected fields extracted; METS Repository (records created, stored); Project Profile; MODS record; UC Press record; Library Catalog; UC Press Database]
58
[Same diagram as slide 56, with one added step: user queries]
59
[Same diagram as slide 56, with added steps: search results; user requests book]
60
[Same diagram as slide 56, with added elements: Java servlet; user requests book segment; METS record in XML; XSLT]
61
[Same diagram as slide 56, with added elements: Java servlet; XSLT; book segment returned]
62
[Same diagram as slide 56, with one added step: user queries]
63
[Same diagram as slide 56, with one added step: results returned]
64
[Same diagram as slide 56, with added elements: Java servlet; user wants to see search words in context]
65
[Same diagram as slide 56, with added elements: Java servlet; XSLT; book segment returned w/ terms highlighted]
66
Skills Required of Staff
  • Imaging
  • OCR
  • Markup languages (HTML, XML)
  • Cataloging metadata
  • Indexing and database technology
  • User interface design
  • Programming
  • Web technology
  • Project management

67
Final Thoughts
  • Be careful what you wish for: "Once you start a
    digital project, you are committed to it for
    life" - Peter Hirtle
  • Hardware is cheap, people are expensive
  • For any given project, there are several ways it
    can succeed (there is no one right answer)
  • Never forget for whom you are doing this! (it's
    the customer, stupid)