Document Analysis and Recognition

Slides: 21
Provided by: venugovi

Transcript and Presenter's Notes

1
Document Analysis and Recognition
  • CS 661

2
What is a Document?
  • A written or printed paper that bears the
    original, official, or legal form of something
    and can be used to furnish decisive evidence or
    information.
  • Something, such as a recording or a photograph,
    that can be used to furnish evidence or
    information.
  • A writing that contains information.
  • Computer Science. A piece of work created with an
    application, as by a word processor.
  • Computer Science. A computer file that is not an
    executable file and contains data for use by
    applications

3
Document Image Analysis
  • DIA is the theory and practice of recovering the
    symbol structures of digital images scanned from
    paper or produced by computer
  • DIA is a subfield of Digital Image processing
  • Digital images of natural objects (X-rays,
    fingerprints, faces, scenery, etc.) are NOT part
    of DIA
  • Digital images of symbolic objects (postal
    addresses, printed articles, forms, music sheets,
    engineering drawings, topographic maps) belong to
    DIA
  • Sources: scanners, printers, fax machines, hand!
  • Incidental text: license plates, billboards, and
    subtitles in photos and video
  • WWW ??
  • DIA's grand goal is to take us to the land of
    the paperless office

4
Paperless Office?
  • Traditional transmission and storage of
    information has been by paper documents
  • Documents are increasingly originating on the
    computer
  • Documents printed for reading, dissemination, and
    markup
  • Paper in the office has increased!!
  • Goal: deal with the flow of electronic and paper
    documents in an efficient and integrated manner
  • Implication: unlike computer media, paper
    documents must be readable by both computers and
    people

5
Short Tour of DIA
  • The field started before digital computers could
    represent information that traditionally appeared
    on paper
  • Patents on OCR for telegraphy and on reading
    machines for the blind were filed in the 19th
    century, and working models were demonstrated in
    1916
  • OCR on specially designed fonts used in 1950s
  • First postal address reader installed in 1965
  • OCR systems for reading scanned pages came into
    their own in the 1980s with the advent of
    low-cost microprocessors, bit-mapped displays,
    and scanners
  • Large capacity storage devices have now ignited
    the field with the prospects of Digital Libraries
  • Document imaging today is a billion dollar
    industry but document interpretation is only a
    small part of it

6
Document Image Analysis
  • Textual Processing
      • Optical Character Recognition: text
      • Page Layout Analysis: skew, blocks, paragraphs
  • Graphical Processing
      • Line Processing: lines, curves, corners
      • Region and Symbol Processing: filled regions
7
Current
  • Processors getting faster
  • Storage costs are down
  • Pictures are typically 512 x 512 pixels
  • Speech signals are typically 256 sample points
  • Business letters are typically 2550 x 3300 pixels
    at 300 dpi
  • Engineering drawings are typically 34000 x 44000
    pixels at 1000 dpi
  • Digital libraries need WWW interface
  • Information retrieval and search
  • OCR accuracy on the rise
  • Contextual models improved
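
The pixel counts quoted on this slide follow directly from sheet size times scan resolution. A minimal sketch of that arithmetic is below. The function names are hypothetical, the 34 x 44 in sheet size is inferred from the slide's pixel figures rather than stated on it, and the 8 bits per pixel used for the storage estimate is an assumption.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a sheet scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_px, height_px, bits_per_pixel=8):
    """Uncompressed image size in bytes (8 bits per pixel is an assumption)."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Business letter: 8.5 x 11 in at 300 dpi
letter = pixel_dimensions(8.5, 11, 300)

# Engineering drawing: 34 x 44 in (inferred from the slide) at 1000 dpi
drawing = pixel_dimensions(34, 44, 1000)

print("letter :", letter, raw_size_bytes(*letter), "bytes")
print("drawing:", drawing, raw_size_bytes(*drawing), "bytes")
```

Under these assumptions a single grayscale letter page is about 8.4 MB uncompressed and an engineering drawing about 1.5 GB, which is why falling storage costs matter for the field.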

8
Document page
  • Page: 300 dpi, 8.5 x 11 in, 255 gray x 3 color,
    2,550 x 3,300 pixels
  • Data capture: 10^7 pixels
  • Pixel-level processing: 7,500 character boxes,
    15 x 20 pixels each; 500 line and curve segments,
    20 to 20,000 pixels each; 10 filled regions,
    20 x 20 to 200 x 200 pixels each
  • Feature-level processing: 7,500 x 10 character
    features; 500 x 5 line and curve features;
    10 x 5 region features
  • Text analysis and recognition: 1,500 words,
    10 paragraphs, 1 title, 2 subtitles, etc.
  • Graphics analysis and recognition: 2 line
    diagrams, 1 company logo, etc.
  • Document description
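
The figures on this slide show how much the data shrinks from stage to stage. A rough sketch of that arithmetic follows, using only the slide's illustrative counts. Treating each feature as a single number, and the variable names themselves, are assumptions made for illustration.

```python
# Data capture: one 8.5 x 11 in page at 300 dpi
page_pixels = 2550 * 3300  # 8,415,000, i.e. on the order of 10^7

# Pixel-level processing: 7,500 character boxes of 15 x 20 pixels each
char_box_pixels = 7500 * 15 * 20

# Feature-level processing: (number of objects, features per object)
feature_counts = {
    "character": (7500, 10),
    "line and curve": (500, 5),
    "region": (10, 5),
}
total_features = sum(n * k for n, k in feature_counts.values())

print("page pixels          :", page_pixels)
print("pixels in char boxes :", char_box_pixels)
print("total feature values :", total_features)
print("pixels per feature   : %.1f" % (page_pixels / total_features))
```

With these numbers, feature extraction reduces the page by roughly two orders of magnitude before any recognition takes place.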
9
Document Image Analysis
10
Document Taxonomy
11
Postal Examples
12
Forms
13
Unconstrained Text
14
Graphics Documents
15
Personal DL
16
DAS 02, Princeton, NJ
  • OCR Features and Systems: degradation models,
    script ID, bilingual OCR, Kannada OCR, Tamil OCR,
    machine-printed versus handwritten checks,
    traffic ticket reading
  • Handwriting Recognition: stochastic models,
    holistic methods, Japanese OCR
  • Classifiers and Learning: multi-classifier systems
  • Layout Analysis: skew correction, geometric
    methods, text/graphics separation, logical
    labeling
  • Tables and Forms: detecting tables in HTML
    documents, use of graph grammars, semantics
  • Text Extraction
  • Indexing and Retrieval
  • Document Engineering
  • New Applications: CAPTCHA, tachograph chart
    system, accessing driving directions

17
ICDAR 03, Edinburgh, UK
  • Multiple Classifiers
  • Postal Automation and Check Processing
  • Document Understanding
  • HMM Classifiers
  • Segmentation
  • Character Recognition
  • Graphics Recognition
  • Non-Latin Alphabets: Kanji/Chinese,
    Korean/Hangul, Arabic/Indian
  • Web Documents, Video
  • Word Recognition
  • Image Processing
  • Writer Identification
  • Forms and Tables

18
CS 661 Class Schedule
19
Grading
  • Home Assignments and Quizzes: 4 x 10 = 40 points
  • Schedule is tentative, to preserve the element
    of surprise
  • Based on class participation and paper handouts
  • Midterm project: Demo 10, Report 15
  • Final project: Demo 10, Report 25

20
References
  • Handbook of Character Recognition and Document
    Image Analysis, H. Bunke and P. S. P. Wang
    (editors), World Scientific Press
  • Document Image Analysis, O'Gorman and Kasturi,
    IEEE Computer Society Press
  • International Conference on Document Analysis and
    Recognition proceedings
  • International Workshop on Document Analysis
    Systems proceedings
  • Symposium on Document Image Understanding
    Technology