Title: Expected category of contribution: OCR for Indian Languages
1. OPTICAL CHARACTER RECOGNITION SYSTEM FOR INDIAN LANGUAGES
Proposer: CDAC Noida
Name of the company: CDAC Noida
Language/Language pair: Hindi, Punjabi, Marathi, Tamil, Telugu, Malayalam, Bangla, Oriya
Expected category of contribution: OCR for Indian Languages
2. Strength of CDAC Noida
- Technical capabilities: NLP Lab equipped with the necessary software and 50 trained engineers
- Previous collaboration with universities/R&D institutions:
  - ABBYY Software Ltd, Moscow
  - W3C
  - Thapar Institute of Engineering, Patiala
  - University of Hyderabad
  - Jamia-Milia
  - IIT Guwahati
  - CDAC Trivandrum
  - Utkal Univ
  - CDAC Kolkata
  - ISI Kolkata
  - IISc Bangalore
  - CSIO Chandigarh
  - IIT Kanpur
  - IIT Roorkee
  - ELDA, France
  - DRDO, Delhi
  - CSTT, Delhi
  - BITS Pilani
  - Banasthali Vidyapeeth
  - COCOSDA, Japan
  - Kumaon Univ, Nainital
  - MGAHV, Wardha
  - Delhi Press Prakashan
  - Pustak Mahal
  - Kendriya Hindi Sansthan
3. Previous work done in this or similar areas
- A beta version of the product Swarnakriti (integration of Indian-language OCRs and Hindi TTS with a Unicode word processor) has been released and made available in the public domain through the ILDC portal. The product has been appreciated by users.
- Chitraksharika OCR for Devanagari Script.
4. Proposed Approach and Architecture
A component-based approach will be used to develop the OCR system. The major components to be developed for the OCRs are:
1. Page Layout Analysis Engine (Page Segmentation)
2. Visual Component Extraction Engine
3. Visual Component Recognizer Engine (Template Based)
4. Post Processor Engine (Error Correction and Detection Module)
5. Testing Data Annotation Tools
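As an illustration only, the four runtime components listed above could be chained roughly as in the sketch below. It is a toy under stated assumptions, not the proposed implementation: pages are tiny text bitmaps ('#' for ink, '.' for background), lines are split at blank rows, components at blank columns, recognition picks the template with the fewest differing pixels, and post-processing snaps each recognized word to the closest lexicon entry. All function names, the bitmap format, and the lexicon approach are hypothetical. Real Indic scripts need far more elaborate segmentation; in Devanagari, for instance, the headline joins the characters of a word, so blank-column cuts would not separate them.

```python
from difflib import get_close_matches

# Toy bitmap: rows of '#' (ink) and '.' (background), all the same width.
Bitmap = list[str]


def segment_lines(page: Bitmap) -> list[Bitmap]:
    """Page layout analysis (toy): split the page into text lines at blank rows."""
    lines, current = [], []
    for row in page:
        if "#" in row:
            current.append(row)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def extract_components(line: Bitmap) -> list[Bitmap]:
    """Visual component extraction (toy): cut a line at blank columns."""
    width = len(line[0])
    components, start = [], None
    for x in range(width + 1):
        has_ink = x < width and any(row[x] == "#" for row in line)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            components.append([row[start:x] for row in line])
            start = None
    return components


def recognize(component: Bitmap, templates: dict[str, Bitmap]) -> str:
    """Template-based recognition (toy): label of the template with fewest differing pixels."""

    def distance(a: Bitmap, b: Bitmap) -> float:
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")  # templates of another size never match
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    return min(templates, key=lambda label: distance(component, templates[label]))


def post_process(word: str, lexicon: list[str]) -> str:
    """Post-processing (toy): replace a word by its closest lexicon entry, if any."""
    matches = get_close_matches(word, lexicon, n=1, cutoff=0.6)
    return matches[0] if matches else word


def run_ocr(page: Bitmap, templates: dict[str, Bitmap], lexicon: list[str]) -> list[str]:
    """Chain the four stages; each text line is treated as a single word."""
    words = []
    for line in segment_lines(page):
        raw = "".join(recognize(c, templates) for c in extract_components(line))
        words.append(post_process(raw, lexicon))
    return words


# Two hypothetical 3x3 glyph templates with placeholder labels.
TEMPLATES = {"a": ["###", "#.#", "###"], "b": ["#..", "#..", "###"]}
# One text line; the first glyph has one noisy pixel (its centre is filled in).
PAGE = ["###.#..", "###.#..", "###.###"]
print(run_ocr(PAGE, TEMPLATES, ["ab", "ba"]))
```

Because each stage only consumes the output of the previous one, any stage can be replaced or tested in isolation, which is the main benefit a component-based design is expected to bring.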
5. Architecture
The proposed architecture of the system for Phase I and Phase II is as follows.
Phase I
In Phase I we will use the ABBYY International, Moscow Fine Reader Engine 7.1 SDK as the Document/Page Analysis Engine. All remaining components will be developed by CDAC Noida.
Phase II
In Phase II we will use our own Document/Page Analysis Engine.
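One way the change of Document/Page Analysis Engine between the two phases might be kept from affecting the other components is to hide the engine behind a common interface. The sketch below assumes such a design; the proposal itself does not specify one. Every name in it (`PageAnalysisEngine`, `VendorEngineAdapter`, `InHouseEngine`, `build_engine`, `sdk_analyze`) is hypothetical, and no actual vendor SDK call is shown: the Phase I adapter simply wraps a callable that integration code would have to supply.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

Box = tuple[int, int, int, int]  # left, top, right, bottom, in pixels


@dataclass(frozen=True)
class Region:
    kind: str  # e.g. "text" or "picture"
    box: Box


class PageAnalysisEngine(Protocol):
    """Interface the downstream components rely on (an assumed design)."""

    def analyze(self, width: int, height: int) -> list[Region]: ...


class VendorEngineAdapter:
    """Phase I stand-in: wraps a third-party SDK behind the shared interface.

    `sdk_analyze` is a hypothetical callable supplied by integration code;
    no real vendor SDK function is named or called here.
    """

    def __init__(self, sdk_analyze: Callable[[int, int], list[tuple[str, Box]]]):
        self._sdk_analyze = sdk_analyze

    def analyze(self, width: int, height: int) -> list[Region]:
        return [Region(kind, box) for kind, box in self._sdk_analyze(width, height)]


class InHouseEngine:
    """Phase II stand-in: a placeholder that reports the whole page as text."""

    def analyze(self, width: int, height: int) -> list[Region]:
        return [Region("text", (0, 0, width, height))]


def build_engine(phase: str, sdk_analyze=None) -> PageAnalysisEngine:
    """Select the engine for a phase; callers never see which one they got."""
    if phase == "I":
        if sdk_analyze is None:
            raise ValueError("Phase I needs an SDK callable")
        return VendorEngineAdapter(sdk_analyze)
    if phase == "II":
        return InHouseEngine()
    raise ValueError(f"unknown phase: {phase!r}")


def text_boxes(engine: PageAnalysisEngine, width: int, height: int) -> list[Box]:
    """Downstream code: identical for both phases."""
    return [r.box for r in engine.analyze(width, height) if r.kind == "text"]
```

With this arrangement, moving from Phase I to Phase II would amount to changing the argument passed to `build_engine`, while the extraction, recognition and post-processing components stay untouched.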
6. Phase I
- User Friendly Graphical User Interface
- OCRs: Devanagari Script, Gurmukhi Script, Bangla Script, Tamil Script, Telugu Script
- ABBYY Software Ltd Fine Reader Engine 7.1 SDK: Document Layout Analysis, Layout Retention API Engine
- Respective modules will be developed
7. Phase II