Introduction to Optical Character Recognition (OCR)



Slides: 30
Provided by: cms9



Introduction to Optical Character Recognition
  • Overview of OCR
  • System Requirements
  • Advantages and Disadvantages
  • Operation and Management
  • Questionnaire Design and Preparation
  • OCR Field Operation
  • OCR Country Outlook

OCR (Optical Character Recognition)
  • Functions and Features of OCR/ICR
  • ICR, OCR and OMR Compared
  • Optical Mark Reader (OMR)
  • OCR/ ICR

OCR (Optical Character Recognition)
  • Also referred to as Optical Character Reader.
  • A system that provides full alphanumeric
    recognition of printed or handwritten characters
    at electronic speed by simply scanning the
    form (UNESCAP, Pop-IT project, 1997-2001).
  • Intelligent Character Recognition (ICR) is used
    to describe the process of interpreting image
    data, in particular alphanumeric text.
  • OCR is sometimes also referred to as ICR.

Functions and Features of OCR
  • Forms are scanned through a scanner, and the
    recognition engine of the OCR system then
    interprets the images and turns images of
    handwritten or printed characters into ASCII data
    (machine-readable characters).
  • The technology provides a complete form
    processing and document capture solution.
  • Allows an open, scalable workflow.
  • Includes forms definition, scanning, image
    pre-processing, and recognition capabilities.
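
As a rough illustration of how a recognition engine can turn character images into machine-readable text, here is a minimal sketch based on nearest-template matching. The 3x5 bitmaps, the `TEMPLATES` table and the function names are all hypothetical; commercial OCR/ICR engines use far richer models than this.

```python
# Hypothetical 3x5 bitmaps ("1" = ink) for three digits; illustration only.
TEMPLATES = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "7": ["111", "001", "010", "010", "010"],
}


def distance(a, b):
    """Count the pixels that differ between two equal-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Return the template character closest to the scanned glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def recognize_field(glyphs):
    """Recognize one boxed character per glyph and join them into text."""
    return "".join(recognize(g) for g in glyphs)


# A "7" with one stray pixel in the bottom-right corner.
NOISY_SEVEN = ["111", "001", "010", "010", "011"]
```

Calling `recognize_field([TEMPLATES["1"], NOISY_SEVEN, TEMPLATES["0"]])` reads the three boxes as a three-character string, tolerating the stray pixel because the closest template still wins.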

ICR, OCR and OMR Differences
  • ICR and OCR are recognition engines used with
  • OMR is a data collection technology that does not
    require a recognition engine.
  • OMR cannot recognize hand-printed or
    machine-printed characters.

Optical Mark Reader (OMR)
  • Forms
  • OMR works with a specialized document that
    contains timing tracks along one edge of the form
    to show the scanner where to read for marks which
    look like black boxes on the top or bottom of a
    form.
  • The cut of the form is very precise and the
    bubbles on a form must be located in the same
    location on every form.
  • Storage
  • With OMR, the image of a document is not scanned
    and stored.
  • Accuracy
  • OMR is simpler than OCR.
  • When designed properly, OMR is more accurate than OCR.
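
Because the bubbles sit at the same location on every form, mark reading can be reduced to checking fixed regions. The sketch below assumes a small grid of 0/1 values standing in for the scanner's sensor readings; the bubble coordinates, the bubble size and the 0.5 threshold are illustrative assumptions, not values from any OMR product.

```python
def fill_ratio(image, top, left, size):
    """Fraction of dark cells (1s) inside a square bubble region."""
    cells = [image[r][c]
             for r in range(top, top + size)
             for c in range(left, left + size)]
    return sum(cells) / len(cells)


def read_marks(image, bubbles, size=2, threshold=0.5):
    """Return the labels of bubbles whose fill ratio reaches the threshold."""
    return [label for label, (top, left) in bubbles.items()
            if fill_ratio(image, top, left, size) >= threshold]


# Hypothetical page: the "yes" bubble is filled, "no" has one stray dot.
PAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
BUBBLES = {"yes": (1, 1), "no": (1, 4)}  # label -> (top row, left column)
```

`read_marks(PAGE, BUBBLES)` reports only the "yes" bubble. The fixed coordinates are why the cut of the form and the bubble positions must be so precise.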

OCR/ICR
  • Forms
  • OCR/ICR is more flexible since no timing tracks
    or block-like form IDs are required.
  • The image can float on a page.
  • ICR/OCR technology uses registration marks on the
    four corners of a document in the recognition of
    an image. Respondents place one character per box
    on this form.
  • The use of drop-out color reduces the size of the
    scanner's output and enhances the accuracy.
  • Storage/retrieval
  • If the document needs to be electronically stored
    and maintained, then OCR/ICR is needed.
  • With OCR/ICR technologies, images can be scanned,
    indexed, and written to optical media.

OMR-OCR/ICR Compared

System Requirements
  • Minimum capacity PC Requirements
  • Processor: Pentium 200 MHz; RAM: 32 MB; Disk: 4 GB
  • Form modules are designed to operate in a batch
  • Run under LAN and PC based platforms and take
    full advantage of the graphical user interface
    and 32 bit processing power available with most
    Windows versions.
  • Software
  • OCR with ICR capability software
  • Questionnaire Design Software

System Requirements (cont.)
  • Scanner
  • OCR scanners with minimum capacity
  • Duplex scanning
  • Speed: 60 sheets/min
  • Automatic Document Feeder (ADF). Scanning can
    take a significant amount of time, and the system
    lets the user scan without doing the OCR.
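
To get a feel for why scanning can take a significant amount of time, the rated speed can be turned into a simple workload estimate. This is a back-of-the-envelope sketch: the function names and the example volumes are assumptions, and it ignores feeder reloads, jams and rescans.

```python
def scan_minutes(forms, sheets_per_form, sheets_per_minute=60):
    """Minutes of pure scanning time at the scanner's rated speed."""
    return forms * sheets_per_form / sheets_per_minute


def images_per_hour(sheets_per_minute=60, duplex=True):
    """Images produced per hour; duplex scanning yields two per sheet."""
    sides = 2 if duplex else 1
    return sheets_per_minute * 60 * sides
```

For example, `scan_minutes(12000, 2)` estimates the time for 12,000 two-sheet questionnaires, and `images_per_hour()` gives the number of page images one duplex scanner would hand to the recognition stage each hour.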

Advantages and Disadvantages
  • Advantages of Using Images Rather Than Paper
  • Quicker processing: no moving or storage of
    questionnaires near operators
  • Cost savings and efficiencies from not having
    the paper questionnaires
  • Scanning and recognition allow efficient
    management and planning for the rest of the
    processing workload
  • Reduced long-term storage requirements:
    questionnaires can be destroyed after the
    initial scanning, recognition and repair
  • Quick retrieval for editing and reprocessing
  • Minimizes errors associated with physical
    handling of the questionnaires

Advantages and Disadvantages
  • Disadvantages of Using Images Rather Than Paper
  • Accuracy
  • While OCR technology can be effective in
    converting handwritten or typed characters, it
    is not as accurate as OMR for reading data,
    where users are actually marking
  • Additional workload for data collectors: OCR has
    severe limitations when it comes to human
    handwriting.
  • Characters must be hand-printed as separate
    characters in boxes

Operation and Management
  • OCR Process Stages
  • Document Scanning process
  • Scanning speed is determined by the quality of
    the scanners, the size of the non-drop-out color
    areas, and the quality, cleanliness and weight of
    the paper.
  • Recognizing process
  • The recognizing process interprets the images.
    The right memory (dictionary) and the
    configured threshold determine the accuracy
    of the ICR interpretation.
  • Verifying Process
  • Compares the interpreted value with the actual
    image of the form.
  • Processing can be in geographic order or in
    random order.
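
One way to picture how the threshold links the recognizing and verifying stages: fields read with high confidence are accepted, and the rest are queued for an operator to compare against the form image. The sketch assumes the engine reports a confidence between 0 and 1 per field; the tuple layout and the 0.90 default are hypothetical.

```python
def split_by_confidence(results, threshold=0.90):
    """Split (field, value, confidence) results into accepted and to-verify."""
    accepted, to_verify = [], []
    for field, value, confidence in results:
        target = accepted if confidence >= threshold else to_verify
        target.append((field, value))
    return accepted, to_verify


# Hypothetical engine output for one questionnaire.
RESULTS = [("age", "34", 0.99), ("sex", "M", 0.62), ("district", "07", 0.91)]
```

Raising the threshold sends more fields to manual verification and lowers the risk of accepting a wrong value; lowering it does the opposite.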

Operation and Management (cont.)
  • Image Manipulation
  • Electronic questionnaires can be sent to
    specialist operators then back to the original
    operator if necessary
  • Same questionnaire can be worked on
    simultaneously by two or more persons
  • Electronic questionnaires are readily available
    for post census analysis (easier access to
  • Parts of various questionnaires on screen at once
    for inter record editing
  • Able to view the relevant field book entry on
    screen in conjunction with questionnaires which
    is helpful for coding and editing

Operation and Management (cont.)
  • Coding Assistance
  • The problems are simpler for the operator to
  • Can use images of questions that will not be
    captured (scanned but not recognized) to help the
    coding process, e.g. light pencil.
  • Operator can magnify images to read characters
    not discernible to the naked eye
  • Appropriate software ensures that the data is
    validated as the forms are read.
  • Checks to ensure selections on a form are filled
  • Possible to distinguish between intended marks
    and marks that have been erased.
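
The validation ideas on this slide can be sketched with two darkness levels: one separating a blank box from the faint residue of an erased mark, and one separating that residue from an intended mark. The darkness scale (0 to 1), both cut-off values and the function names are assumptions chosen for illustration.

```python
def classify_mark(darkness, erased_level=0.15, marked_level=0.50):
    """Classify a tick box by how dark it is (0.0 = white, 1.0 = black)."""
    if darkness >= marked_level:
        return "marked"
    if darkness >= erased_level:
        return "erased"
    return "blank"


def validate_single_choice(darkness_by_option):
    """Check that exactly one option of a question is marked."""
    marked = [option for option, darkness in darkness_by_option.items()
              if classify_mark(darkness) == "marked"]
    if len(marked) == 1:
        return marked[0], "ok"
    if not marked:
        return None, "missing"
    return None, "multiple"
```

With this rule, `validate_single_choice({"A": 0.8, "B": 0.2, "C": 0.02})` accepts "A" and treats "B" as an erased mark rather than a second answer.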

Operation and Management (cont.)
  • OMR Scanner Speed
  • Factors
  • Skew: each document is moved from an automatic
    feeder into a scanner, and an angle of skew is
    sometimes introduced.
  • De-skew: analyzes the image bitmap, then
    calculates and returns the angle of skew, up to
    +/-25. Example: de-skew is often expressed as a
    percentage, which is the pixel shift. 10% is a
    20-pixel shift in a line of 200 pixels, or one
    tenth of an inch in an inch-long line.
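
The pixel-shift example can be checked with a few lines of arithmetic. The sketch below also converts the shift into an angle, on the assumption that the shift is measured perpendicular to a line of the given length; the function names are illustrative.

```python
import math


def skew_percent(pixel_shift, line_length):
    """Pixel shift expressed as a percentage of the line length."""
    return 100.0 * pixel_shift / line_length


def skew_angle_degrees(pixel_shift, line_length):
    """Angle whose tangent is the shift divided by the line length."""
    return math.degrees(math.atan2(pixel_shift, line_length))
```

`skew_percent(20, 200)` reproduces the figure in the example, and `skew_angle_degrees(20, 200)` shows that such a shift corresponds to a tilt of a little under 6 degrees.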

Operation and Management (cont.)
  • Landscape Detection and Auto Rotation
  • Landscape detection automatically detects
    landscape images and rotates them 90 degrees.
  • White Page Detection
  • Normally, a double-sided scanner creates two
    images per scanned page.
  • However, if the back or front page is blank,
    there is no need to store this image.
  • White page detection allows the user to avoid
    storing blank pages.
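
A minimal sketch of both checks on a bitonal image (rows of 0/1 pixels) follows. It makes two simplifying assumptions: a page counts as blank when at most 1% of its pixels are dark, and landscape is detected from the page dimensions alone, whereas a real product may inspect the content and choose the rotation direction differently.

```python
def is_blank(page, max_dark_fraction=0.01):
    """True if the share of dark pixels (1s) is at or below the limit."""
    pixels = [p for row in page for p in row]
    return sum(pixels) / len(pixels) <= max_dark_fraction


def rotate_90(page):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*page[::-1])]


def prepare(page):
    """Return None for a blank page, else the page in portrait orientation."""
    if is_blank(page):
        return None
    width, height = len(page[0]), len(page)
    return rotate_90(page) if width > height else page
```

In a duplex workflow, `prepare` would be applied to both images of each sheet, and a `None` result means the image is simply not stored.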

Operation and Management (cont.)
  • Other Factors
  • Automatic Image Registration
  • De-Speckle and Shade Removal
  • Character Enhancer
  • Cost Savings
  • Automatic processes to improve recognition rates
  • Voting techniques, Multiple engines, Learning
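
Voting across multiple engines can be sketched as a per-character majority vote. This assumes every engine returns a string of the same length for the field; the `?` placeholder for unresolved characters and the agreement level are illustrative choices.

```python
from collections import Counter


def vote(readings, min_agreement=2):
    """Combine per-character readings from several engines by majority."""
    result = []
    for chars in zip(*readings):
        char, count = Counter(chars).most_common(1)[0]
        # Keep the character only if enough engines agree on it.
        result.append(char if count >= min_agreement else "?")
    return "".join(result)
```

For example, `vote(["1984", "1934", "1984"])` outvotes the single engine that misread the third digit, while positions with no agreement come back as `?` and can be routed to an operator.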

Questionnaire Design and Preparation
  • Drop Out Color
  • Usually red. The drop-out color is the color
    facility in an OCR system that allows the system
    to pick up only the meaningful information from
    an OCR form.
  • The system doesn't need to know the values,
    including tick boxes, written in the drop-out
    color.
  • The OCR system only needs to see the black parts,
    and compares them to specifications to see which
    parts are filled or written.
  • Characters or Marks
  • To speed up the data capture process and to
    reduce error rates, it is advisable to use marks
    or ticks as much as possible
Questionnaire Design and Preparation (cont.)
  • How to Obtain Good Results of Scanning
  • Select adequate paper quality
  • Reliable printing
  • Appropriate ink, considering drop-out color, for
    the questionnaires
  • Paper heavier than 80 grams per square meter can
    help avoid paper jams and reading through to the
    other side of a page.
  • Form Design Advice
  • Number of items to be included in a form
  • Design the size of boxes for each character answer
  • Define the drop-out color properly
  • Use registration marks
  • Pre-print the codes near the place where the
    boxes for ticks are located
  • Maintain a consistent pattern in which the
    information to be collected will be located.
  • Do not disturb the visibility of the ticks and
    marks with titles, labels or instructions.
  • Avoid putting the "answers" of one field on a
    different page from the question
  • Avoid using open ended
OCR Field Operation
  • Training for Collection and Processing Staff
  • Basic software, scanner operations, including
    installation and troubleshooting.
  • Applications with emphasis on the development of
    custom applications including configuring
    nonstandard forms
  • Pre-marking of forms, use of overprinting to
    customize forms
  • Processing of surveys
  • Creating custom output file formats

OCR Field Operation (cont.)
  • Reasons for OCR Reading Errors
  • Form in bad condition: dirty, folded, crumpled,
    etc.
  • Forms fed into the OCR scanner are not straight
    (at an angle)
  • Incompletely filled forms
  • Reducing OCR Reading Errors
  • Checking the questionnaires for completeness and
    consistency
  • Preparation of own memory (dictionary)
  • Defining permissible margins of OCR reading errors
  • Particular Care in Writing Numbers or Alphabetic
  • One box contains only one character
  • Characters should not extend outside designated
    boxes
  • Unnecessary lines of characters such as points,
    decorative strokes, hooks, etc. are prohibited.
  • Strokes should not be ended with flourishes or
  • All lines should be connected without breaks
  • All lines or dots should be pressed with the same
  • Value Checking Steps: verify that the information
    captured by OMR is the same with the
  • Control for Blank: define what type of control
    must be taken if the information is blank.
  • Control steps should be taken if the information
    image is partial or missing, to assure the
    quality of generated files.
  • Missing Questionnaire: make sure that all
    questionnaires are scanned completely, with no
    missing forms and no duplication as
OCR Country Outlook
  • Countries using optical mark recognition
  • (Greece)
  • Countries using optical character recognition
  • (Croatia- in use for the next census round)
  • (Japan-out-sources entire process and in use for
    the next census round)
  • Countries using both
  • Belgium
  • Countries planning to use OCR
  • Tajikistan
  • (Tonga) looking to introduce and use OCR for our
    next Census

OCR Country Outlook
  • Common device/scanner and software used by NSOs
  • (Croatia) KODAK DS3520 bitonal scanners, IBM IFP
    (intelligent Forms Processing)
  • (Greece) OMR- devices/scanners were axm
    990/995 with FORM/ AXF/ ADELE software
  • (New Zealand) Kodak scanners i830 and i7620 -
    scanning and raw data capture process
    (recognition aspect) were outsourced.- For the
    next census -end scanning and data capture
    process will more than likely be outsourced but
    it really is a variation to a current supplier
  • (Belgium) AGFA (high resolution) scanner

OCR in Use
  • Editing method used for the census
  • (Japan) cold-deck method, hot-deck method, etc.
  • (Croatia) in house developed logical checking
    and automatic and manual correcting
  • (Greece) via PC- editor (officer of N.S.S.G.)
    confirms or rejects a non-accurate value or
    inputs a missing one.
  • (New Zealand) mixture of micro and macro editing
    practices. Individual responses may have range
    or validity edits, inter-field edits and also
    inter-form edits (within a household). Macro
    editing is particularly used during the data
    evaluation process and data may be reprocessed as
    a result of this
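
To make the editing terms more concrete, the sketch below combines a range edit with a very simple sequential hot-deck: a value that fails the edit is replaced by the last valid value seen earlier in the file. This illustrates the general idea only, not the procedure of any of the countries listed; real hot-deck imputation usually picks donors with similar characteristics, and the field name and age limits are assumptions.

```python
def valid_age(value):
    """Range edit: an age must be present and between 0 and 120."""
    return value is not None and 0 <= value <= 120


def hot_deck_impute(records, field, valid):
    """Replace invalid values with the last valid value seen (the donor)."""
    donor = None
    imputed = []
    for record in records:
        value = record.get(field)
        if valid(value):
            donor = value  # this record becomes the current donor
            imputed.append(dict(record))
        elif donor is not None:
            imputed.append({**record, field: donor})
        else:
            imputed.append(dict(record))  # no donor yet: leave unchanged
    return imputed


PEOPLE = [{"age": 34}, {"age": None}, {"age": 250}, {"age": 8}]
```

`hot_deck_impute(PEOPLE, "age", valid_age)` fills the missing age and the out-of-range age from the first record and leaves the valid values untouched.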

OCR Country Outlook
  • Common commercial or free software used in OCR
  • (Croatia) Use ACTR (automated coding by text
    recognition) for coding -software developed by
    Statistics Canada.
  • (Greece) Commercial software, after an open
    bidding, according to the budgetary plan of the
    population census
  • (New Zealand) IBM Intelligent Forms Processing
    (IFP) system through an established user
  • (Belgium) IRIS (Image Recognition Integrated Systems)

OCR Country Outlook
  • Concerns/issues with the use of optical character
    recognition for data capture for the census?
  • (Japan) Speed of data capture and recognition,
    recognition accuracy of Japanese characters, etc.
  • (Greece) OMR -related to the optical recognition
    of numbers, the rapidity of optical recognition
    itself and the electronic storage of the
  • (Tajikistan) Getting equipment and training.
  • (Samoa) Not enough financial support and
    technical human resources.