Title: Building Multi Script OCR for Brahmi Scripts: Selection of Efficient Features
1Building Multi Script OCR for Brahmi Scripts
Selection of Efficient Features
Center for Research on Bangla Language Processing
(CRBLP), Department of Computer Science and
Engineering, BRAC University, Dhaka, Bangladesh.
2Brahmi Script Analysis
- Features of the graphemes of the characters
- Baseline / Matraa
- Vertical bar
- Curvatures
Figure: Baseline / Matraa, Curvatures, Vertical Bar
3Brahmi Script Analysis
- Vowels have dependent and independent forms.
- A vowel changes shape when followed by a consonant.
- A consonant followed by a consonant creates a new shape.
4Bangla Script
5Other Scripts
Devanagari
Gurmukhi
6Other Scripts
Tibetan
Sharda
7Outline
- Feature extraction for OCR.
- Classification
- Analysis of feature extraction approaches
- Conclusion
8What is Feature Extraction?
- Devijver and Kittler define feature extraction as
the problem of extracting from the raw data the
information which is most relevant for
classification purposes, in the sense of
minimizing the within-class pattern variability
while enhancing the between-class pattern
variability.
- Image features are unique characteristics that
can represent a specific image.
- Meaningful, detectable parts of the image.
- Overcome the vulnerabilities of template
matching.
- Reduce the computation cost.
9Feature Extraction in OCR
- The selection of image features and corresponding
extraction methods is probably the most important
step in achieving high performance for an OCR
system.
- The preprocessing stage aims to make the image
suitable for the different feature extraction
algorithms.
10Feature Extraction in OCR
- Properties of image features
- Robust to transformations
- Robust to noise
- Feature extraction efficiency
- Feature matching efficiency
- Issues in feature extraction
- Invariants: features remain unchanged when a
particular transformation is applied.
- Reconstruction: the image can be reconstructed
from the extracted features.
11Features and Classifiers
- Different feature types may need different types
of classifiers.
- Graph descriptions: structural or syntactic
classifiers.
- Discrete features: decision trees.
- Real-valued features: statistical classifiers.
12Types of Features
- Feature extraction methods are based on three
types of features:
- Statistical
- Projections and profiles
- Crossing and distance
- Zoning
- Structural
- Nodal features
- Stroke analysis
- Global transformation and shape based
- Unitary image transform
- Shape (boundary region based)
13Statistical Features (Projection Histograms)
- Introduced in 1956 in a hardware OCR system.
- Today, this technique is mostly used for:
- Segmenting characters, words, and text lines.
- Detecting whether an input image is rotated.
Vertical Projection
Horizontal Projection
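A minimal sketch of projection histograms, assuming a binarized image stored as a NumPy array in which nonzero pixels are ink. The function names are hypothetical, and the convention that the horizontal projection counts ink per row (and the vertical projection per column) is an assumption; some authors use the opposite naming. The second helper shows how zero-valued gaps in a histogram could be used for segmentation.

```python
import numpy as np

def projection_histograms(binary_image):
    """Return (horizontal, vertical) projection histograms.

    horizontal[i] is the number of ink pixels in row i,
    vertical[j] is the number of ink pixels in column j.
    """
    ink = np.asarray(binary_image) != 0
    horizontal = ink.sum(axis=1)
    vertical = ink.sum(axis=0)
    return horizontal, vertical

def split_on_gaps(histogram):
    """Return (start, end) index pairs of runs where the histogram is nonzero."""
    runs, start = [], None
    for i, value in enumerate(histogram):
        if value > 0 and start is None:
            start = i                      # a new run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the run ended at a blank line
            start = None
    if start is not None:
        runs.append((start, len(histogram)))
    return runs

page = np.array([[1, 1, 0, 1],
                 [0, 0, 0, 0],
                 [0, 1, 0, 1]])
horizontal, vertical = projection_histograms(page)
print(horizontal.tolist())        # [3, 0, 2]
print(split_on_gaps(horizontal))  # [(0, 1), (2, 3)]: two text lines
```

Splitting on zero-valued gaps only works when the units are actually separated by blank rows or columns, which is why this is shown as a segmentation aid rather than a recognition feature.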
14Statistical Features (Profile)
- Count the distance between the bounding box and
the edge of a character image.
- Used to extract the contour of the character
image.
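The profile idea can be sketched as follows, assuming the character has already been cropped to its bounding box and binarized (nonzero = ink). The function names are hypothetical, and reporting the full width for a row that contains no ink is a choice made for this sketch only.

```python
import numpy as np

def left_profile(binary_image):
    """Distance from the left edge of the box to the first ink pixel in each row.

    Rows with no ink get the full image width.
    """
    ink = np.asarray(binary_image) != 0
    width = ink.shape[1]
    first = ink.argmax(axis=1)             # index of the first True in each row
    return np.where(ink.any(axis=1), first, width)

def right_profile(binary_image):
    """Same measurement taken from the right edge (mirror the image first)."""
    return left_profile(np.asarray(binary_image)[:, ::-1])

glyph = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 1, 1]])
print(left_profile(glyph).tolist())   # [1, 0, 4, 2]
print(right_profile(glyph).tolist())  # [1, 3, 4, 0]
```

Top and bottom profiles follow from the same function applied to the transposed image, for example `left_profile(glyph.T)` for the top profile.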
15Statistical Features (Crossing)
- Count the number of transitions from background
to foreground pixels.
Figure: Crossing (V = 2, H = 3)
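A minimal sketch of the crossing count, assuming a binarized image (nonzero = ink) and treating everything outside the image as background, so a scan line that starts on ink counts as one transition. The function names are hypothetical.

```python
import numpy as np

def count_crossings(line):
    """Number of background-to-foreground transitions along one scan line."""
    ink = (np.asarray(line) != 0).astype(int)
    padded = np.concatenate(([0], ink))    # outside the image is background
    return int(np.sum(np.diff(padded) == 1))

def crossing_features(binary_image):
    """Crossing counts for every row (horizontal) and every column (vertical)."""
    img = np.asarray(binary_image)
    horizontal = [count_crossings(row) for row in img]
    vertical = [count_crossings(col) for col in img.T]
    return horizontal, vertical

glyph = np.array([[1, 0, 1],
                  [0, 0, 0],
                  [1, 1, 1]])
print(crossing_features(glyph))  # ([2, 0, 1], [2, 1, 2])
```

In practice the counts are often taken along a few selected scan lines rather than every row and column; which lines to use is left open here.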
16Statistical Features (Distance)
- Count the distance from the upper and lower
boundaries to the first image pixel detected.
Figure: Crossing and distance (U = 6, L = 5, R = 6, B = 7)
17Limitations (Projection, Profile, Crossing, Distance)
- Scale dependent.
- Sensitive to rotation.
- Sensitive to the variability in writing style.
- Important information about the character shape
seems to be lost.
18Statistical Features (Zoning)
- Divide the character image (matrix) into a certain
number of zones (sub-matrices).
- Apply computation on each zone separately.
- The goal of zoning is to obtain local
characteristics instead of global
characteristics.
- Calculation over each zone:
- Percentage of black pixels.
- Weight of each zone.
- Evaluate the extent to which the sub-matrix shape
matches any direction. (Used for an MLP-based
classifier.)
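The simplest of the per-zone calculations above, the percentage of black pixels, can be sketched as follows. This assumes a binarized image (nonzero = ink), a regular grid of non-overlapping zones, and an image at least as large as the grid; the function name and the grid parameters are hypothetical. Zone weights and directional matching are not covered.

```python
import numpy as np

def zone_densities(binary_image, n_rows, n_cols):
    """Percentage of ink pixels in each zone of an n_rows x n_cols grid.

    Zones are visited row by row, so the result has n_rows * n_cols entries.
    """
    ink = (np.asarray(binary_image) != 0).astype(float)
    row_edges = np.linspace(0, ink.shape[0], n_rows + 1).astype(int)
    col_edges = np.linspace(0, ink.shape[1], n_cols + 1).astype(int)
    features = []
    for r0, r1 in zip(row_edges[:-1], row_edges[1:]):
        for c0, c1 in zip(col_edges[:-1], col_edges[1:]):
            features.append(100.0 * ink[r0:r1, c0:c1].mean())
    return np.array(features)

glyph = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 0]])
print(zone_densities(glyph, 2, 2).tolist())  # [100.0, 0.0, 0.0, 25.0]
```

Overlapping zones could be obtained by extending each slice by a row or column on either side; that variation is not shown.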
19Statistical Features (Zoning)
Figure: Zones of a 32 X 24 image (labels: 9 X 7; one-row and two-row overlaps; weight matrix (9 X 7); 60-degree path)
20Statistical Features (Zoning)
- Observations
- Additional features are needed to improve
classifier performance.
- Overlap between zones enhances the reliability
of the features.
21Structural Features (Nodal features)
Figure: Nodes extracted from a character image
22Global Transformation (Unitary Image Transform)
- Reduction in the number of features.
- Preserving most of the information.
- In the transformed space, pixels (coefficients)
are ordered by their variance, and those with the
highest variance are used as features.
- Reconstruction ability.
- Limitations
- Not rotation invariant.
- Input images have to be exactly the same size
(scaling and resampling are necessary if the size
can vary).
23Global Transformation (Unitary Image Transform)
- Several transformation methods
- Karhunen-Loeve (KL): computationally demanding
- Fourier: recommended by Andrew
- Hadamard (or Walsh): recommended by Andrew
- Haar transform
- Cosine: computationally reasonable, better in
terms of image compression
- Sine
- Slant transform
We applied the Discrete Cosine Transform with a
Hidden Markov Model (HMM) as the classifier.
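A rough sketch of DCT-based feature selection, assuming all character images have been resampled to the same size. It builds an orthonormal DCT-II with NumPy only, transforms every image, and keeps the coefficients whose variance across the sample set is highest. The function names and the `n_features` parameter are hypothetical, this is not necessarily how the authors' system selects coefficients, and the HMM classifier is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def dct2(image):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows, cols = image.shape
    return dct_matrix(rows) @ image @ dct_matrix(cols).T

def idct2(coeffs):
    """Inverse 2-D DCT; the basis is orthonormal, so the inverse is the transpose."""
    rows, cols = coeffs.shape
    return dct_matrix(rows).T @ coeffs @ dct_matrix(cols)

def select_by_variance(images, n_features):
    """Keep the n_features coefficients with the highest variance across images.

    Returns the kept coefficient indices (into the flattened transform)
    and one feature vector per image.
    """
    coeffs = np.stack([dct2(img).ravel() for img in images])
    order = np.argsort(coeffs.var(axis=0))[::-1]   # highest variance first
    keep = order[:n_features]
    return keep, coeffs[:, keep]

rng = np.random.default_rng(0)
samples = rng.random((5, 8, 6))            # five stand-in 8 x 6 images
keep, features = select_by_variance(samples, 10)
print(features.shape)                      # (5, 10)
```

Reconstruction from a reduced feature set would zero the discarded coefficients and apply `idct2`, which is one way to judge how much information a given number of features preserves.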
24Global Transformation (Discrete Cosine Transform)
Table: Reconstruction results for different variance differences
Table: Number of features for different variance differences
25Shape Based
- Features are invariant to translation, scale,
rotation, blur and noise.
- Most commonly used image features, where shape
representation is the most important issue.
- Classified into two categories:
- Boundary-based invariants
- Region-based invariants
26Shape (Boundary Based)
- Explore only the contour information
- Two techniques
- Chain code
- Fourier descriptors
- Cannot capture the interior content of the shape.
- Reconstruction ability.
- Limitations
- cannot deal with disjoint shapes
- Decisions
- Not appropriate for us.
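For reference, a chain code can be sketched as follows. This assumes the contour has already been traced into an ordered list of (row, column) points in which consecutive points are 8-connected neighbours; contour tracing itself is not shown. The function name is hypothetical, and the numbering (0 = east, increasing counter-clockwise) is one common Freeman convention.

```python
# Freeman 8-direction codes keyed by (row step, column step).
# Rows grow downward in image coordinates, so "north" is a row step of -1.
DIRECTIONS = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}

def chain_code(contour):
    """Encode an ordered contour as a list of Freeman direction codes."""
    codes = []
    for (r0, c0), (r1, c1) in zip(contour, contour[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTIONS:
            raise ValueError(f"points {(r0, c0)} and {(r1, c1)} are not neighbours")
        codes.append(DIRECTIONS[step])
    return codes

# A closed one-pixel-wide square traced east, south, west, north.
square = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(chain_code(square))  # [0, 6, 4, 2]
```

Because the code describes a single connected contour, a character made of several disjoint parts needs one code per part, which illustrates the limitation listed above.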
27Shape (Region Based)
- All of the pixels of the image are taken into
account to represent the shape.
- Can capture some of the global properties.
- Popular region-based methods:
- Hu's seven moment invariants
- Zernike moments
- Can also be employed to describe disjoint shapes.
- Reconstruction ability.
28Region Based (Hu's moment invariants)
- Seven moments
- Hu's invariants are invariant to:
- Image translation
- Scaling
- Rotation
Table: Hu's seven moments
- Computing higher-order Hu moment invariants is
quite complex.
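A minimal sketch of Hu's seven invariants, using the standard definitions in terms of normalized central moments (eta_pq = mu_pq / mu_00^(1 + (p + q) / 2)). It assumes a grayscale or binary image held in a NumPy array with at least one nonzero pixel; the function name is hypothetical. On a pixel grid, translation and 90-degree rotation invariance hold up to rounding, while scale invariance is only approximate.

```python
import numpy as np

def hu_moments(image):
    """Return Hu's seven moment invariants of a 2-D image."""
    img = np.asarray(image, dtype=float)
    rows, cols = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    x = cols - (cols * img).sum() / m00    # coordinates relative to the centroid
    y = rows - (rows * img).sum() / m00

    def eta(p, q):
        """Normalized central moment of order (p, q)."""
        mu = (x ** p * y ** q * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03            # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03    # recurring differences
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ])

glyph = np.zeros((8, 8))
glyph[1:7, 2] = 1                          # vertical stroke
glyph[1, 2:6] = 1                          # horizontal stroke at the top
shifted = np.pad(glyph, ((3, 0), (0, 5)))  # same shape, different position
print(np.allclose(hu_moments(glyph), hu_moments(shifted)))          # True
print(np.allclose(hu_moments(glyph), hu_moments(np.rot90(glyph))))  # True
```

The seven values often differ by several orders of magnitude, so some rescaling (for example a signed logarithm) is commonly applied before they are passed to a classifier; that step is left out here.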
29Region Based (Zernike moments)
- Allow independent moment invariants to be
constructed easily to an arbitrarily high order.
- Use the concept of orthogonal moments to recover
the image.
- Invariant to:
- Rotation
- Normalized Zernike moments are invariant to:
- Translation
- Scale
- Rotation
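A rough sketch of Zernike moment magnitudes, assuming a square image whose pixel grid is mapped onto the square enclosing the unit disk; pixels outside the disk are ignored and the constant pixel-area factor is omitted. The function names are hypothetical. Only the magnitudes are kept, since these are the rotation-invariant part; the translation and scale normalization mentioned above is not shown.

```python
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho), for n - |m| even and |m| <= n."""
    m = abs(m)
    out = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s)
                 / (factorial(s)
                    * factorial((n + m) // 2 - s)
                    * factorial((n - m) // 2 - s)))
        out += coeff * rho ** (n - 2 * s)
    return out

def zernike_moment(image, n, m):
    """Complex Zernike moment A_nm of a square image (unnormalized sum)."""
    img = np.asarray(image, dtype=float)
    if img.shape[0] != img.shape[1]:
        raise ValueError("this sketch assumes a square image")
    coords = np.linspace(-1.0, 1.0, img.shape[0])
    x, y = np.meshgrid(coords, coords)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0                    # only the unit disk contributes
    basis_conj = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(img[inside] * basis_conj[inside])

def zernike_features(image, max_order):
    """Magnitudes |A_nm| for all n <= max_order, m >= 0 with n - m even."""
    feats = []
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):
            feats.append(abs(zernike_moment(image, n, m)))
    return np.array(feats)

glyph = np.zeros((16, 16))
glyph[4:12, 6:8] = 1                       # vertical stroke
glyph[4:6, 6:11] = 1                       # horizontal stroke at the top
features = zernike_features(glyph, 4)
print(len(features))                       # 9 features up to order 4
print(np.allclose(features, zernike_features(np.rot90(glyph), 4)))  # True
```

The feature count grows roughly quadratically with the maximum order, so the order is a trade-off between descriptive power and feature vector length.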
30Region Based (Zernike moments)
Table: Number of features for different orders of
Zernike moments
Table: Reconstruction results for different orders
of Zernike moments
31Features of the existing open-source OCRs
- OCROPUS: features used by the system currently
include:
- Gradients
- Singular points of the skeleton
- Presence of holes
- Unary-coded geometric information:
- Location relative to the baseline
- Original aspect ratio and skew prior to skew
correction
32Features of the existing open-source OCRs
- Tesseract
- Features used by Tesseract include:
- Segments of the polygonal approximation
- Direction of the outline
- For a test character, features are
three-dimensional:
- x position
- y position
- angle
- For a training character, features are
four-dimensional:
- x position
- y position
- angle
- length
33Features of the existing open-source OCRs
- GOCR
- Features used by GOCR include:
- size
- skew
- presence of serifs
34Conclusion
- Unique features can be extracted from the similar
Brahmi scripts.
- Zonal features are useful as secondary features.
- Nodal features are useful if properly extracted.
- Moments are useful primary features.
- Hu's seven features.
- Zernike features up to order 40.
36Questions
37Thank You