Title: Document Image Analysis
Document Image Analysis (CSE 717): An Introduction
Document Image Analysis
- DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer.
- DIA is a subfield of digital image processing.
- Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA.
- Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA.
- Sources: scanners, printers, fax machines, hand!
- Incidental text: license plates, billboards, and subtitles in photos and video.
- WWW ??
- DIA's grand goal is to take us to the land of the paperless office.
Document Image Analysis
- Textual Processing
  - Optical Character Recognition: text
  - Page Layout Analysis: skew, blocks, paragraphs
- Graphical Processing
  - Line Processing: lines, curves, corners
  - Region and Symbol Processing: filled regions
Document Image Analysis: Processing Levels for Text and Graphics
- Pixels
  - Text: Preprocessing (representation, noise removal, binarization, skew, script ID, font ID)
  - Graphics: Preprocessing (representation, noise removal, binarization, thinning, vectorization)
- Primitives
  - Text: Glyph Recognition (connected components, strokes, punctuation, words)
  - Graphics: Primitive Recognition (straight lines, curve segments, junctions, nodes, loops, characters)
- Structures
  - Text: Text Recognition (word segmentation, text line reconstruction, table analysis, linguistics)
  - Graphics: Structure Recognition (text fields, legends, labels, dimensions, graphics symbols)
- Documents
  - Text: Page Layout Analysis (text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression)
  - Graphics: Interpretation (component recognition, connectivity analysis, CAD layer separation, database attribute extraction, compression)
- Corpus
  - Text: Information Retrieval (document classification, indexing, search, security, authentication, privacy)
  - Graphics: Database, CAD (validation, search, update)
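The first two levels above (pixels, then primitives) can be illustrated with a minimal sketch. It assumes a grayscale page stored as a list of rows of 0..255 integers with dark ink on a light background, uses Otsu's method as one common choice of binarization, and extracts 8-connected components as candidate glyphs. The function names `otsu_threshold`, `binarize`, and `connected_components` are hypothetical and chosen only for this illustration; the slides do not prescribe a particular algorithm.

```python
from collections import deque


def otsu_threshold(gray):
    """Return a threshold t; pixels with value <= t are treated as ink."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray, t):
    """True marks ink (dark pixel), False marks background."""
    return [[v <= t for v in row] for row in gray]


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    # Tiny synthetic "page": two dark blobs on a light background.
    page = [
        [220, 220, 220, 220, 220, 220],
        [220, 20, 20, 220, 220, 220],
        [220, 20, 220, 220, 20, 220],
        [220, 220, 220, 220, 20, 220],
    ]
    t = otsu_threshold(page)
    print(t, connected_components(binarize(page, t)))
```

In a real system the bounding boxes would feed the later levels (grouping into words and text lines); a global threshold is a simplification that tends to break down on unevenly lit or degraded pages.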
Postal Examples
Forms
Unconstrained Text
Graphics Documents
References
- Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
- Document Image Analysis, L. O'Gorman and R. Kasturi, IEEE Computer Society Press
- International Conference on Document Analysis and Recognition proceedings
- International Workshop on Document Analysis Systems proceedings
- Symposium on Document Image Understanding Technology
- OCR Features and Systems
  - Script ID, Devanagari OCR, Tamil OCR, machine-printed versus handwritten (MP versus HW)
- Handwriting Recognition
  - Postal applications, Arabic documents
- Classifiers and Learning
  - Multi-classifier systems
- Layout Analysis
  - Skew correction, geometric methods, text/graphics separation, logical labeling
- Tables and Forms
  - Detecting tables in HTML documents, use of graph grammars, semantics
- Document Engineering
  - Processing of historical documents (palm leaf manuscripts)
- Camera-Based DIA
  - Locating and reading barcodes
- New Applications
  - CAPTCHA
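As an illustration of the skew correction topic listed under Layout Analysis, here is a minimal sketch of one simple geometric approach: fit a least-squares line through reference points assumed to come from a single text line (for example, the bottom centers of glyph bounding boxes) and take the line's angle as the skew. This is an assumption made for illustration, not the method taught in the course; projection-profile and Hough-transform methods are common alternatives. The names `estimate_skew_degrees` and `deskew_points` are hypothetical.

```python
import math


def estimate_skew_degrees(points):
    """Least-squares line fit through (x, y) points; returns the line angle in degrees."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points are vertically aligned; slope is undefined")
    # atan2(sxy, sxx) equals atan(slope) because sxx > 0 here.
    return math.degrees(math.atan2(sxy, sxx))


def deskew_points(points, angle_degrees):
    """Rotate points by -angle about the origin so the fitted line becomes horizontal."""
    a = math.radians(angle_degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a + y * sin_a, -x * sin_a + y * cos_a) for x, y in points]


if __name__ == "__main__":
    # Synthetic baseline points lying on a line with slope 0.1.
    baseline = [(x, 0.1 * x + 5) for x in range(0, 100, 10)]
    angle = estimate_skew_degrees(baseline)
    print(angle, deskew_points(baseline, angle)[:2])
```

The same rotation would be applied to the whole image in practice; working on a handful of points keeps the sketch short and makes the geometry easy to check by hand.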