Title: UNSD-UNESCAP Regional Workshop on Census Data Processing: Contemporary technologies for data capture, methodology and practice of data editing, documentation and archiving
1Optical Data Capture: Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), Intelligent Recognition (IR)
2Summary
- Concept/Definition
- Forms Design
- Scanners and Software
- Storage
- Accuracy
- OCR/ICR Advantages and Disadvantages
- Intelligent Recognition (IR)
- Commercial Suppliers
3Definition/Concept of OCR
- Gives scanning and imaging systems the ability to turn images of machine-printed characters into machine-readable characters
- Images of the machine-printed characters are extracted from a bitmap of the scanned image
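As a rough illustration of what "extracted from a bitmap" can mean, the sketch below splits a binary bitmap into per-character sub-images wherever a blank column separates two runs of ink. This is only one simple segmentation approach, not necessarily what any commercial OCR engine does; the names `Bitmap`, `segment_columns` and `extract_glyphs` are made up for this example.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # assumed encoding: 1 = ink (dark pixel), 0 = background


def segment_columns(bitmap: Bitmap) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    if not bitmap:
        return []
    width = len(bitmap[0])
    # A column "has ink" if any row has a dark pixel in that column.
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    spans = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                   # a run of inked columns begins
        elif not ink and start is not None:
            spans.append((start, x))    # a blank column ends the run
            start = None
    if start is not None:
        spans.append((start, width))    # run reaches the right edge
    return spans


def extract_glyphs(bitmap: Bitmap) -> List[Bitmap]:
    """Cut the bitmap into one sub-bitmap per inked column range."""
    return [[row[a:b] for row in bitmap] for a, b in segment_columns(bitmap)]


page = [
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
glyphs = extract_glyphs(page)  # two glyph images, one per run of inked columns
```

Each extracted glyph would then be passed to a classifier that assigns it a machine-readable character; that step is not shown here.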
4Definition/Concept of ICR
- Gives scanning and imaging systems the ability to turn images of handwritten characters into machine-readable characters
- Images of the handwritten characters are extracted from a bitmap of the scanned image
5OCR and ICR Differences
- OCR is less accurate than OMR but more accurate than ICR
- ICR will require editing to achieve high data coverage
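One common way to organize the editing step is to accept values the engine is confident about and send the rest to a clerk. The sketch below assumes the recognition engine reports a confidence score between 0 and 1 for each field; the function name `split_for_editing`, the field names and the 0.9 threshold are all hypothetical and would need tuning for a real census.

```python
from typing import Dict, List, Tuple


def split_for_editing(
    results: List[Tuple[str, str, float]], threshold: float = 0.9
) -> Tuple[Dict[str, str], List[str]]:
    """Split (field, value, confidence) results into accepted values
    and a list of field names that need manual editing."""
    accepted: Dict[str, str] = {}
    to_edit: List[str] = []
    for field, value, confidence in results:
        if confidence >= threshold:
            accepted[field] = value   # trusted as recognized
        else:
            to_edit.append(field)     # routed to a data entry clerk
    return accepted, to_edit


# Hypothetical recognition output for one form.
recognized = [("age", "34", 0.98), ("occupation", "farner", 0.61)]
accepted, to_edit = split_for_editing(recognized)
```

Raising the threshold sends more fields to manual editing and so trades clerk workload for accuracy.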
6Forms
- OCR/ICR has less strict form design requirements than OMR
- No timing tracks
- Has registration marks
- ICR requires hand-printed entries in boxes, one alphanumeric character per box
7OCR
- Forms
- OCR/ICR is more flexible since:
- No timing tracks are required
- The image can float on a page
- The use of drop-out color reduces the size of the scanner's output and enhances accuracy
- ICR/OCR technology often uses registration marks on the four corners of a document when recognizing an image
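To illustrate why registration marks let the image "float" on the page, the sketch below builds an affine map from form-template coordinates to scanned-image pixels using three of the four corner marks (three non-collinear points are enough to fix an affine map; the fourth can serve as a consistency check). The function name `make_template_to_scan` and all coordinates are hypothetical, and real systems may use other alignment methods.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def make_template_to_scan(
    top_left: Point,
    top_right: Point,
    bottom_left: Point,
    template_width: float,
    template_height: float,
) -> Callable[[float, float], Point]:
    """Return a function mapping template (u, v) to scan pixel (x, y).

    Template coordinates are measured from the top-left registration mark;
    template_width and template_height are the distances between the marks
    on the designed form.
    """
    # Pixel step per template unit along the form's horizontal edge.
    ax = (top_right[0] - top_left[0]) / template_width
    ay = (top_right[1] - top_left[1]) / template_width
    # Pixel step per template unit along the form's vertical edge.
    bx = (bottom_left[0] - top_left[0]) / template_height
    by = (bottom_left[1] - top_left[1]) / template_height

    def to_scan(u: float, v: float) -> Point:
        return (top_left[0] + ax * u + bx * v, top_left[1] + ay * u + by * v)

    return to_scan


# Marks detected in a scan that is shifted and scaled 2x relative to a
# 100 x 200 template (made-up numbers).
to_scan = make_template_to_scan((10, 20), (210, 20), (10, 420), 100, 200)
box_corner = to_scan(50, 100)  # where a template box corner lands in the scan
```

With this mapping, each character box defined once on the template can be located in every scanned sheet, even if sheets shift, scale or skew slightly in the feeder.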
9OCR/ICR Scanners and Software
- Forms are fed through a scanner, and the recognition engine of the OCR/ICR system then interprets the images, turning images of handwritten or printed characters into ASCII data (machine-readable characters)
- Users can also scan forms without running OCR
- Speeds range from 85 to 160 sheets/min (dependent on the recognition engine)
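The quoted speed range translates directly into scanning time for a given workload. The following back-of-the-envelope sketch assumes continuous operation with no stoppages, which is optimistic; the function name `scanning_hours` and the one-million-sheet workload are hypothetical.

```python
def scanning_hours(num_sheets: int, sheets_per_minute: float, scanners: int = 1) -> float:
    """Hours needed to scan num_sheets, assuming uninterrupted operation."""
    if sheets_per_minute <= 0 or scanners <= 0:
        raise ValueError("speed and scanner count must be positive")
    minutes = num_sheets / (sheets_per_minute * scanners)
    return minutes / 60


# Hypothetical workload of 1,000,000 sheets on a single scanner.
slow = scanning_hours(1_000_000, 85)    # about 196 hours
fast = scanning_hours(1_000_000, 160)   # about 104 hours
```

Planning figures should also allow for paper jams, batch changes and maintenance, none of which are modeled here.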
10OCR/ICR Storage Characteristics
- Storage/Retrieval
- Images are scanned, stored and maintained electronically
- There is no need to store the paper forms as long as you safeguard the electronic files
- With OCR/ICR technologies, images can be scanned, indexed, and written to optical media
11Ideal OCR/ICR Accuracy Thresholds
- Accuracy
- Accuracy achieved by data entry clerks (99.5%) is approximately equal to that of OCR/ICR in perfect tuning (99.5%)
- Up to 99.9% accuracy with editing (like OMR)
- The recognition engine must be tuned, tested and validated very carefully
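To get a feel for what the 99.5 and 99.9 figures mean in practice (read here as per-character percentages), the sketch below computes the expected number of wrong characters in a batch and the chance that a multi-character field is entirely correct. It assumes character errors are independent, which is a simplification; the function names and the sample sizes are hypothetical.

```python
def expected_character_errors(num_chars: int, char_accuracy: float) -> float:
    """Expected count of misrecognized characters."""
    return num_chars * (1 - char_accuracy)


def field_accuracy(char_accuracy: float, field_length: int) -> float:
    """Probability that every character in a field is correct,
    assuming independent errors."""
    return char_accuracy ** field_length


# 1,000,000 captured characters (made-up volume).
errors_tuned = expected_character_errors(1_000_000, 0.995)   # about 5,000
errors_edited = expected_character_errors(1_000_000, 0.999)  # about 1,000

# A 10-character field, e.g. a hand-printed name.
p_tuned = field_accuracy(0.995, 10)    # about 0.951
p_edited = field_accuracy(0.999, 10)   # about 0.990
```

Under these assumptions, roughly one 10-character field in twenty contains at least one error at 99.5 percent character accuracy, which is part of why editing remains important.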
12OCR/ICR Advantages
- Advantages
- Recognition engines used with imaging can capture highly specialized data sets
- OCR/ICR recognizes machine-printed or hand-printed characters
- Scanning and recognition allow efficient management and planning for the rest of the processing workload
- Quick retrieval for editing and reprocessing
13OCR/ICR Disadvantages
- Technology is costly
- May require significant manual intervention
- Additional workload for data collectors
- ICR has severe limitations when it comes to human handwriting
- Characters must be hand-printed or machine-printed as separate characters in boxes
- Ineffective when dealing with cursive characters
14OMR-OCR/ICR Compared
15OCR/ICR Challenges/Issues
- Has corresponding issues with OMR
- Algorithm development (preparation of memory dictionary)
- Processing time considerations due to recognition engine
- Development costs
16Definition/Concept of IR
- State-of-the-art recognition technology
- Gives scanning and imaging systems the ability to turn images of handwritten and cursive characters into machine-readable characters
- Images of the handwritten and cursive characters are extracted from a bitmap of the scanned image
- The ability to capture cursive makes this method unique
17Definition/Concept of IR
- Eight elements make up the trajectories of all cursive letters (figure 1)
Photo: Parascript LLC
18Definition/Concept of IR
- Intelligent Recognition dynamically uses context
- Context is used during the recognition process, improving the accuracy of results
- Context helps to identify letters where the symbol segmentation of an image is ambiguous
Photo: Parascript LLC
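The idea of using context can be illustrated with a much simplified sketch: each character position has several candidate letters with scores, and a lexicon of valid words decides between them. This is not Parascript's method; it assumes a fixed segmentation into positions, and the function name `best_word`, the scores and the lexicon are hypothetical.

```python
from typing import Dict, List, Optional


def best_word(candidates: List[Dict[str, float]], lexicon: List[str]) -> Optional[str]:
    """Pick the lexicon word with the highest product of per-position scores.

    candidates[i] maps each plausible letter at position i to its score.
    Returns None if no lexicon word is compatible with the candidates.
    """
    best, best_score = None, 0.0
    for word in lexicon:
        if len(word) != len(candidates):
            continue
        score = 1.0
        for letter, options in zip(word, candidates):
            score *= options.get(letter, 0.0)  # 0 if the letter was never proposed
        if score > best_score:
            best, best_score = word, score
    return best


# Looking at the shapes alone, the first letter seems more likely "u" than "v".
shapes = [
    {"u": 0.6, "v": 0.4}, {"i": 0.9}, {"l": 0.5, "t": 0.5}, {"l": 0.8, "k": 0.2},
    {"a": 0.7, "o": 0.3}, {"g": 0.6, "q": 0.4}, {"e": 0.9, "c": 0.1},
]
word = best_word(shapes, ["village", "voltage", "vintage"])  # → "village"
```

Here the lexicon overrides the locally preferred "u" because no valid word starts with it, which is the kind of correction context makes possible.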
19Technology Evolution
[Figure: technology evolution from OCR to ICR to Intelligent Recognition, plotted against form types and text styles]
FORM TYPES
- Specially designed for automatic recognition
- Constraining boxes or combs
- Drop-out ink for preprinted text boxes
- No special form design
- No constraining boxes or combs
- Dirty, noisy forms
- Bad quality paper
- Legacy forms
TEXT STYLES
- Machine print
- Constrained handprint
- Unconstrained handprint
- Bad quality machine print
- Condensed strings
- Cursive
Illustration: Conference on Technology Options for 2011 Census
20Major Commercial Suppliers
- Top Image Systems (TIS) (http://www.topimagesystems.com)
- ReadSoft (http://www.readsoft.com)
- Teleform (http://www.intelliscan.com/TeleForm1.htm)
- Scanner Suppliers
- Fujitsu, Canon, Bell & Howell, Kodak
21THANK YOU!