1
CRBLP's (Center for Research on Bangla Language
Processing) Activities and Achievements on Bangla
Language Processing, January 2007
  • Naushad UzZaman
  • CRBLP, BRAC U, Bangladesh
  • http://www.naushadzaman.com

2
CRBLP's Activities
  • The Center for Research on Bangla Language
    Processing (CRBLP) has been working on Bangla
    language processing since 2004
  • 11 research staff (9 with a computer science
    background, 2 with a linguistics background)
  • Students working part-time or as interns
  • 13 Summer 2006 interns and 7 former members
  • Motivation of open source
  • Academic
  • Offered a course on language processing (CSE 431
    Natural Language Processing, offered in Spring
    2006 and Spring 2007 at BRAC U)
  • Thesis on NLP
  • Summer Internship

3
CRBLP Members (Full Time Staff Members)
  • Dr. Mumit Khan
  • Head, CRBLP and Associate Professor, CSE
    Department
  • Matin Saad Abdullah
  • Program Manager, CRBLP and Senior Lecturer, CSE
    Department
  • Naira Khan
  • Linguist, CRBLP and Lecturer, English and
    Humanities (On Leave)
  • Zahurul Islam
  • Research Programmer, CRBLP and Part-time Faculty
    Member, CSE
  • Naushad UzZaman
  • Research Programmer, CRBLP and Part-time Faculty
    Member, CSE
  • Md. Abul Hasnat, Research Programmer
  • S. M. Murtoza Habib, Research Programmer
  • Firoj Alam, Research Programmer

4
CRBLP Members (Part-time and Interns)
  • Part Time Staff Members
  • Kamrul Hayder, Language Consultant
  • M. Abdur Rahman, Research Assistant
  • Maruf Muqtadir, Research Assistant
  • Summer 2006 Research Interns
  • Fahim Muhammad Hasan
  • M. Hammad Ali
  • Ayesha Binte Mosaddeque
  • Nafid Haque
  • Yeasir Arafat
  • Nizam Uddin
  • M. Abdur Rahman
  • Fahim Tawfique Chowdhury
  • Munirul Mansur
  • Md. Jahangir Alam
  • Annajiat Alim Rasel
  • Munshi Asadullah
  • Salman Zaman

5
Areas of Research
  • Document Authoring
  • Information Retrieval
  • Optical Character Recognition
  • Pronunciation Generator
  • Speech Processing
  • Morphology
  • Parts of Speech Tagging
  • Syntax
  • And a few other small research projects

6
Document Authoring, BanglaPad
  • The current version of BanglaPad includes the
    following features:
  • 1. Platform independent (current version tested
    on Windows and Linux).
  • 2. Edit Bangla and English text in the same
    document.
  • 3. Rich text editing with pictures and tables.
  • 4. Export documents as HTML. (You can develop web
    content in Bangla using this feature!)
  • 5. Supports character encodings including UTF-8 and
    UTF-16.
  • 6. Bangla and English spell checking. (The Bangla
    spelling checker uses the Puspa speller.)
  • 7. Bangla and English search and replace.
  • 8. Printing of formatted documents.
  • 9. Three different skins for the editor.
  • 10. Built-in keyboard driver for easy Bangla
    typing. (No need to install a keyboard driver.)
  • 11. Customizable key maps for Bangla.
  • 12. Easy-to-use installer for Windows.

7
Spelling Checker in BanglaPad
8
Rich Text Editing in BanglaPad and exporting to
HTML
9
BanglaPad Download and Team Members
  • Download:
    http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180246
  • Developers
  • Zahurul Islam, August 2005 - December 2005
  • Naushad UzZaman, August 2005 - December 2005
  • Abdur Rahman, January 2006 - present
  • Maruf Muqtadir, January 2006 - present
  • Advisors
  • Matin Saad Abdullah, August 2005 - present
  • Mumit Khan, August 2005 - present

10
English to Bangla Transliteration
  • Type phonetically in English and get a
    similar-sounding dictionary word. Can be used for
    Bangla text input with an English keyboard.
  • Developed by Naushad UzZaman
  • Relevant Publications
  • 1. Naushad UzZaman, Arnab Zaheen and Mumit Khan,
    A Comprehensive Roman (English) to Bangla
    Transliteration Scheme, Proc. International
    Conference on Computer Processing on Bangla
    (ICCPB-2006), Dhaka, Bangladesh, 17 February,
    2006.
  • 2. Naushad UzZaman, Phonetic Encoding for Bangla
    and its Application to Spelling Checker, Name
    Searching, Transliteration and Cross Language
    Information Retrieval, Undergraduate Thesis
    (Computer Science), BRAC University, May 2005.

11
Pata, English to Bangla Transliteration
12
Spelling Checker
  • Bangla Speller Sandbox: Bangla Phonetic Speller
    (Puspa). Gives suggestions for misspelled words
    based on similarities in pronunciation.
    Implemented using Double Metaphone phonetic
    encoding.
  • Developed by Naushad UzZaman
  • Download:
    http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180247
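
The idea of suggesting words by pronunciation can be sketched as follows. This is only an illustration, not Puspa's actual encoding: the `SOUND_CLASSES` table is a hypothetical, much simplified stand-in for Double Metaphone, it works on romanized words, and the word list is invented for the example.

```python
from collections import defaultdict

# Hypothetical, highly simplified phonetic rules (NOT the real Double
# Metaphone table): letters that sound alike collapse to one code.
SOUND_CLASSES = {
    "b": "P", "p": "P", "f": "P", "v": "P",
    "c": "K", "k": "K", "g": "K", "q": "K",
    "d": "T", "t": "T",
    "j": "S", "s": "S", "z": "S", "x": "S",
    "m": "M", "n": "M",
    "l": "L", "r": "R",
}

def phonetic_key(word):
    """Map a romanized word to a coarse sound code, dropping vowels."""
    key = []
    for ch in word.lower():
        code = SOUND_CLASSES.get(ch)
        # Skip vowels/unmapped letters and collapse repeated codes.
        if code and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)

def build_index(dictionary_words):
    """Group dictionary words by their phonetic key."""
    index = defaultdict(list)
    for w in dictionary_words:
        index[phonetic_key(w)].append(w)
    return index

def suggest(word, index):
    """Return dictionary words that sound like the (misspelled) input."""
    return index.get(phonetic_key(word), [])

# Illustrative romanized word list (assumed for this example).
INDEX = build_index(["bangla", "kolom", "pata", "puspa"])
```

With this index, `suggest("bangala", INDEX)` and `suggest("colom", INDEX)` look up the same keys as "bangla" and "kolom". A real speller would also rank the candidates, for example by edit distance.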

13
Publications on Spelling Checker
  • 1. Naushad UzZaman and Mumit Khan, A Bangla
    Phonetic Encoding for Better Spelling
    Suggestions, Proc. 7th International Conference
    on Computer and Information Technology (ICCIT
    2004), Dhaka, Bangladesh, December 2004.
  • 2. Naushad UzZaman and Mumit Khan, A Double
    Metaphone Encoding for Bangla and its Application
    in Spelling Checker, Proc. 2005 IEEE
    International Conference on Natural Language
    Processing and Knowledge Engineering, pp.
    705-710, Wuhan, China, October 30 - November 1,
    2005.
  • 3. Naushad UzZaman and Mumit Khan, A
    Comprehensive Bangla Spelling Checker, Proc.
    International Conference on Computer Processing
    on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17
    February, 2006.
  • 4. Naushad UzZaman, Phonetic Encoding for Bangla
    and its Application to Spelling Checker, Name
    Searching, Transliteration and Cross Language
    Information Retrieval, Undergraduate Thesis
    (Computer Science), BRAC University, May 2005.
  • 5. Munshi Asadullah, Md. Zahurul Islam, and Mumit
    Khan, Error-tolerant Finite-state Recognizer and
    String Pattern Similarity Based Spell-Checker for
    Bengali, to appear in the Proc. of International
    Conference on Natural Language Processing, ICON
    2007, January 2007.

14
Puspa Spelling Checker
15
Search Engine
  • Bangla search engine based on the open-source search
    engine Nutch.
  • Developed by M. Hammad Ali and Nafid Haque
  • Relevant Publications
  • 1. Nafid Haque, Hammad Ali, Mumit Khan, and Matin
    Saad Abdullah, Infrastructure for Bangla
    Information retrieval in the context of ICT for
    Development, to appear in the Proc. of
    International Conference on Systems, Computing
    Sciences and Software Engineering (SCSS 06) of
    International Joint Conferences on Computer,
    Information, and Systems Sciences, and
    Engineering (CISSE 06), December 4 - 14, 2006.
  • 2. M. Hammad Ali, Nafid Haque, A Decentralised
    Approach to Information Retrieval for a
    developing country like Bangladesh, Education
    Without Borders 2007, Abu Dhabi, February 25 -
    27, 2007.

16
Search Engine example
17
Optical Character Recognition
  • BanglaOCR is the Optical Character Recognizer for
    Bangla script. It takes scanned images of a
    printed page or document as input and converts
    them into editable Unicode text. BanglaOCR lets
    users train the recognizer on data from any document and
    observe the recognition performance.
  • BanglaOCR was developed by Md. Abul Hasnat and S. M.
    Murtoza Habib.
  • Download:
    http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=215908
  • Another OCR, implemented using a Kohonen network,
    was developed by Shoeb Shatil.
  • Download:
    http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180249

18
OCR Status
  • OCR Application
  • Status: Version 0.1, Release candidate 1
  • Status of Different Segments of OCR
  • Document skew correction
  • Bangla document skew corrector based on the Radon
    transform. Status: Complete.
  • Segmentation
  • Bangla line segmentation. Status: Complete.
  • Bangla word segmentation. Status: Complete.
  • Bangla character segmentation. Status: Work in
    progress. The large number of combinations
    (consonant clusters and non-spacing marks)
    complicates this task. The OCR is omnifont, so it must
    work with any typeface.
  • Character/Symbol recognition
  • Neural net based recognizer. Status: Fairly complete for
    the basic alphabet and a subset of the consonant
    clusters. The non-spacing marks pose a
    significant challenge.
  • Hidden Markov Model (HMM) based recognizer.
    Status: First demo available.
  • Post Processing for OCR
  • Post-processing spelling checker for OCR that
    corrects spelling mistakes due to unsuccessful
    recognition. Status: First demo available.
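
Line segmentation is often done with a horizontal projection profile. The slides do not say which method BanglaOCR uses, so the sketch below is an assumption that only illustrates the general idea on a tiny hand-made binary image; `segment_lines` and `PAGE` are names introduced for this example.

```python
def segment_lines(image):
    """Split a binary page image into text lines.

    `image` is a list of rows; each row is a list of 0 (background)
    and 1 (ink). Returns (start_row, end_row) pairs, end exclusive.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                   # first inked row of a new line
        elif ink == 0 and start is not None:
            lines.append((start, y))    # blank row closes the line
            start = None
    if start is not None:               # line touching the bottom edge
        lines.append((start, len(image)))
    return lines

# Tiny hand-made page: two "text lines" separated by a blank row.
PAGE = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
```

The same profile idea, applied column-wise inside one line, is a common starting point for word segmentation. It only works well after skew correction, because tilted lines smear the blank gaps between them.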

19
BanglaOCR
20
OCR Related Publications
  • 1. Md. Abul Hasnat, S M Murtoza Habib and Mumit
    Khan, Segmentation free Bangla OCR using HMM
    Training and Recognition, to appear in the Proc.
    of 1st International Conference on Digital
    Communications and Computer Applications
    (DCCA2007), Irbid, Jordan, 2007.
  • 2. A. M. Shoeb Shatil and Mumit Khan, Minimally
    Segmenting High Performance Bangla OCR using
    Kohonen Network, to appear in the Proc. of 9th
    International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.
  • 3. S. M. Murtoza Habib, Nawsher Ahmed Noor and
    Mumit Khan, Skew correction of Bangla script
    using Radon Transform, to appear in the Proc. of
    9th International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.

21
Automated Pronunciation Generator
  • Pronunciation Generator: Given any Bangla word as input,
    this application outputs the pronunciation of
    that word in IPA (International Phonetic
    Alphabet).
  • Demo available online at
    http://student.bu.ac.bd/%7Eu02201011/g2pweb/g2p1.htm
  • Source code available online at
    http://student.bu.ac.bd/%7Eu02201011/g2pweb/
  • Developed by Ayesha Binte Mosaddeque
  • Relevant Publication
  • Ayesha Binte Mosaddeque, Naushad UzZaman and
    Mumit Khan, Rule based Automated Pronunciation
    Generator, to appear in the Proc. of 9th
    International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.

22
Bangla Pronunciation Generator
23
Speech Processing
  • Text-to-speech
  • Voice for Festival.
  • Status: First demo available. Developed by Firoj
    Alam
  • Automatic Speech Recognition
  • Isolated Speech Recognition. Developed by A K M
    Mahmudul Hoque
  • Continuous Speech Recognition. Status: First demo
    available. Developed by Md. Abul Hasnat

24
Speech Related Publications
  • 1. Firoj Alam and Promila Kanti Nath, Bangla Text
    to Speech using Festival, Undergraduate Thesis
    (Computer Science), BRAC University, May 2006.
    Supervisor Mumit Khan.
  • 2. A K M Mahmudul Hoque, Bangla Speech
    Recognition, Undergraduate Thesis (Computer
    Science), BRAC University, May 2006. Supervisor
    Mumit Khan.
  • 3. Firoj Alam, Promila Kanti Nath and Mumit Khan,
    Text To Speech for Bangla Language using
    Festival, to appear in the Proc. of 1st
    International Conference on Digital
    Communications and Computer Applications
    (DCCA2007), Irbid, Jordan, 2007.

25
Morphology
  • Morphology: the branch of grammar that studies
    the structure or forms of words.
  • Work done on Bangla Morphology
  • Generative verb morphology using two-level rules
  • Basic concatenative noun morphology with features
  • Software developed: Jkimmo, A Multilingual
    Computational Morphology Framework for PC-KIMMO.
    Developed by Md. Zahurul Islam.
  • Download Jkimmo:
    http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180248

26
Morphological Analyzer Jkimmo
27
Morphology Related Publications
  • 1. Sajib Dasgupta and Mumit Khan, Morphological
    Parsing of Bangla Words using PC-KIMMO, Proc. 7th
    International Conference on Computer and
    Information Technology, Dhaka, Bangladesh,
    December, 2004.
  • 2. Sajib Dasgupta and Mumit Khan, Feature
    Unification for Morphological Parsing in Bangla,
    Proc. 7th International Conference on Computer
    and Information Technology, Dhaka, Bangladesh,
    December, 2004.
  • 3. Sajib Dasgupta, Dewan Shahriar Hossain Pavel,
    Asif Iqbal Sarkar, Naira Khan and Mumit Khan,
    Morphological Analysis of Inflecting Compound
    Words in Bangla, Proc. 8th International
    Conference on Computer and Information Technology
    (ICCIT), Islamic University of Technology (IUT),
    Dhaka, Bangladesh, 2005.
  • 4. Md. Zahurul Islam and Mumit Khan, JKimmo: A
    Multilingual Computational Morphology Framework
    for PC-KIMMO, to appear in the Proc. of 9th
    International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.

28
Bangla Parts of Speech (POS) Tagging
  • This application tags each word in a sentence with
    its part of speech. HMM, n-gram and Brill's
    transformation-based POS taggers were implemented and
    compared for Bangla, Hindi and Telugu
    on corpora of different sizes. For Bangla, tagsets of
    different sizes were compared too.
  • Developed by Fahim Muhammad Hasan.
  • Relevant Publications
  • Fahim Muhammad Hasan, Naushad UzZaman and Mumit
    Khan, Comparison of different POS Tagging
    Techniques (n-gram, HMM and Brill's tagger) for
    Bangla, to appear in the Proc. of International
    Conference on Systems, Computing Sciences and
    Software Engineering (SCSS 06) of International
    Joint Conferences on Computer, Information, and
    Systems Sciences, and Engineering (CISSE 06),
    December 4 - 14, 2006.
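
As a point of reference for the taggers compared above, the simplest n-gram tagger (a unigram tagger) can be sketched as follows. This is not the project's implementation: the toy corpus is in English, the three-tag tagset is made up, and the fallback tag for unseen words is an assumption.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for every word in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default_tag="NOUN"):
    """Tag each word; unseen words fall back to `default_tag`."""
    return [(w, model.get(w, default_tag)) for w in words]

def accuracy(model, tagged_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        for (_, predicted), (_, gold) in zip(tag(words, model), sentence):
            total += 1
            correct += predicted == gold
    return correct / total if total else 0.0

# Invented toy corpus in English with a made-up three-tag tagset.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
MODEL = train_unigram_tagger(TRAIN)
```

Comparing taggers on corpora and tagsets of different sizes then amounts to calling `accuracy` on held-out sentences for each trained model. HMM and Brill taggers add context that this unigram baseline ignores.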

29
POS Tagging example
30
Syntax
  • Syntax: the grammatical arrangement of words in
    sentences
  • Bangla syntactic analysis using:
  • Lexical Functional Grammar (LFG) formalism
  • Head-driven Phrase Structure Grammar (HPSG)
    formalism
  • Work done by Naira Khan, Ayesha Binte Mosaddeque,
    M Hammad Ali and Nafid Haque.
  • Relevant Publications
  • 1. Md. Nasimul Haque and M. Khan, Parsing Bangla
    using LFG: An Introduction, BRAC University
    Journal, Vol 2, No. 2, 2005.
  • 2. Naira Khan and Mumit Khan, Developing a
    Computational Grammar for Bengali using the HPSG
    Formalism, to appear in the Proc. of 9th
    International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.
  • 3. Ayesha Binte Mosaddeque, M. Hammad Ali and
    Nafid Haque, Design of Head-Driven Phrase
    Structure Grammar for Bangla, Undergraduate
    Thesis (Computer Science), BRAC University,
    December 2006. Supervisor: Mumit Khan.

31
(No Transcript)
32
  • Small Research Projects

33
Bangla Grammar Checker
  • Implemented a statistical Bangla grammar checker
    based on n-gram analysis.
  • Developed by Md. Jahangir Alam.
  • Relevant Publications
  • Md. Jahangir Alam, Naushad UzZaman and Mumit
    Khan, N-gram based Statistical Grammar Checker
    for Bangla and English, to appear in the Proc. of
    9th International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.
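
One simple way to turn n-gram statistics into a grammar check is to accept a sentence only if every bigram in it was seen in training. The sketch below assumes that decision rule and works on sequences of made-up tags; the slides do not spell out the actual model, so treat the threshold and the data as hypothetical.

```python
from collections import Counter

def bigrams(tokens):
    """Pad a token list with boundary markers and return adjacent pairs."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))

def train(corpus):
    """Count every bigram seen in a corpus of tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(bigrams(sentence))
    return counts

def is_grammatical(tokens, counts, threshold=1):
    """Accept a sentence only if each of its bigrams was seen often enough."""
    return all(counts[pair] >= threshold for pair in bigrams(tokens))

# Invented training sentences, given as tag sequences for compactness.
COUNTS = train([
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
])
```

Here `is_grammatical(["NOUN", "DET", "VERB"], COUNTS)` is rejected because no training sentence starts with NOUN. A practical checker would need smoothing or a probability threshold so that rare but valid sequences are not flagged.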

34
Bangla Text Categorization
  • Implemented Bangla text categorization based on
    n-gram analysis. Trained on the Prothom Alo newspaper
    corpus with 6 different categories.
  • Developed by Munirul Mansur.
  • Relevant Publications
  • 1. Munirul Mansur, Naushad UzZaman and Mumit
    Khan, Analysis of N-gram based text
    categorization for Bangla in a newspaper corpus,
    to appear in the Proc. of 9th International
    Conference on Computer and Information Technology
    (ICCIT 2006), Dhaka, Bangladesh, December 2006.
  • 2. Munirul Mansur, Analysis of n-gram based text
    categorization for Bangla in a newspaper corpus,
    Undergraduate Thesis (Computer Science), BRAC
    University, August 2006. Supervisor Mumit Khan.
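
A well-known form of n-gram text categorization compares ranked character n-gram profiles with an "out-of-place" distance. Whether the project used exactly this measure is an assumption; the sketch below uses invented English training text and two categories instead of six.

```python
from collections import Counter

def ngram_profile(text, n=3, top_k=50):
    """Return the `top_k` most frequent character n-grams, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(grams, key=lambda g: (-grams[g], g))
    return ranked[:top_k]

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a max penalty."""
    rank = {g: i for i, g in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def categorize(text, category_profiles):
    """Pick the category whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(category_profiles,
               key=lambda c: out_of_place(doc, category_profiles[c]))

# Invented two-category training text (the project used six categories).
SPORTS_TEXT = "the team won the match and the player scored a goal"
BUSINESS_TEXT = "the bank raised the interest rate and the market fell"
PROFILES = {
    "sports": ngram_profile(SPORTS_TEXT),
    "business": ngram_profile(BUSINESS_TEXT),
}
```

Because the profiles are built from characters rather than words, the same code runs unchanged on Unicode Bangla text.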

35
Analysis of Prothom-Alo newspaper Corpus
  • Frequency analysis of a 1-year Prothom-Alo
    newspaper corpus.
  • Relevant Publications
  • 1. Yeasir Arafat, Analysis and Observations From
    a Bangla news corpus, Undergraduate Thesis
    (Computer Science), BRAC University, August 2006.
    Supervisor Mumit Khan.
  • 2. Yeasir Arafat, Md. Zahurul Islam and Mumit
    Khan, Analysis and Observations From a Bangla
    news corpus, to appear in the Proc. of 9th
    International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.

36
Language Modeling, forward and backward n-gram
  • Investigating the prospect of backward n-gram
    compared to forward n-gram for Bangla.
  • Relevant Publication
  • Naira Khan, Md. Tarek Habib, Md. Jahangir Alam,
    Rajib Rahman, Naushad UzZaman and Mumit Khan,
    History (forward n-gram) or Future (backward
    n-gram)? Which model to consider for n-gram
    analysis in Bangla?, to appear in the Proc. of
    9th International Conference on Computer and
    Information Technology (ICCIT 2006), Dhaka,
    Bangladesh, December 2006.
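
The difference between the two models is only the direction of the context: a forward bigram conditions a word on the previous word, a backward bigram on the next word. A minimal sketch with maximum-likelihood estimates on an invented toy corpus (`bigram_model` is a name introduced here, not taken from the paper):

```python
from collections import Counter

def bigram_model(sentences, backward=False):
    """Estimate P(word | context) by maximum likelihood.

    Forward: the context is the previous word (history).
    Backward: the context is the next word (future), obtained here
    by simply reversing every sentence before counting.
    """
    pair_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        if backward:
            tokens.reverse()
        for context, word in zip(tokens, tokens[1:]):
            pair_counts[(context, word)] += 1
            context_counts[context] += 1

    def prob(word, context):
        if context_counts[context] == 0:
            return 0.0
        return pair_counts[(context, word)] / context_counts[context]

    return prob

# Invented toy corpus.
CORPUS = [["a", "b", "c"], ["a", "b", "d"], ["x", "b", "c"]]
forward = bigram_model(CORPUS)
backward = bigram_model(CORPUS, backward=True)
```

In this corpus "b" always follows "a", so `forward("b", "a")` is 1.0, while the word before "b" varies, so `backward("a", "b")` is lower. Comparing the two directions on real text is the kind of question the paper investigates.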

37
Font Converter
  • Converts different TTF fonts to Unicode encoding.
    Status: Completed for Ullash, Prothoma, Bangsi
    Alpona fonts.
  • Developed by Md. Zahurul Islam.
  • Download:
    http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180250

38
Stemming
  • Stemming: an algorithm that reduces a search
    query to its stem or root form. In other words,
    variations of a word, such as past tense, plural
    and singular forms, are taken into account when
    performing a search. For example, applies,
    applying and applied all match apply.
  • Relevant Publications
  • Md. Zahurul Islam, Md. Nizam Uddin and Mumit
    Khan, A Light Weight Stemmer for Bengali and Its
    Use in Spelling Checker, to appear in the Proc.
    of 1st International Conference on Digital
    Communications and Computer Applications
    (DCCA2007), Irbid, Jordan, 2007.
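
The English example above can be reproduced with a small suffix-stripping sketch. The rule list is hypothetical and chosen only to cover that example; a Bengali light stemmer would use a list of Bengali inflectional suffixes instead.

```python
# Hypothetical English suffix rules, checked in order, chosen only to
# reproduce the example in the text.
SUFFIX_RULES = [
    ("ies", "y"),   # applies  -> apply
    ("ied", "y"),   # applied  -> apply
    ("ing", ""),    # applying -> apply
    ("s", ""),      # plural   -> singular
]

def light_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_length:
                return stem
    return word

def matches(query, document_word):
    """Two words match in search when they share the same stem."""
    return light_stem(query) == light_stem(document_word)
```

The `min_stem_length` guard keeps short words such as "is" from being reduced to a single letter.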

39
Text Summarization
  • Text summarization is a technique that
    automatically creates an abstract or summary of a
    text. This study surveys the work done in this
    area and implements an extraction-based text
    summarizer for the Bangla language.
  • Relevant work
  • Md. Nizam Uddin, "A Study on Text Summarization
    Techniques and an Approach for Bangla Text
    Summarization", Independent Study, Computer
    Science, BRAC University, December 2006,
    Supervisor Md. Zahurul Islam, Mumit Khan
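
Extraction-based summarization selects existing sentences instead of generating new ones. A minimal sketch, assuming a plain word-frequency score (the study's actual scoring method is not given in the slides):

```python
import re
from collections import Counter

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! and ? plus the Bangla danda (।).
    sentences = [s.strip() for s in re.split(r"[.!?।]+", text) if s.strip()]
    tokenized = [s.lower().split() for s in sentences]

    # Word frequencies over the whole document.
    freq = Counter(w for words in tokenized for w in words)

    # Score = average document frequency of the words in the sentence.
    scores = [sum(freq[w] for w in words) / len(words) for words in tokenized]

    # Indices of the best sentences, ties broken by earlier position.
    best = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = sorted(best[:max_sentences])
    return [sentences[i] for i in chosen]

TEXT = ("Bangla is a language. Bangla language processing needs a "
        "Bangla corpus. Rain fell today.")
```

Averaging the score keeps long sentences from winning on length alone. A fuller system would also drop stop words and stem the words before counting.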

40
Language Resources
  • Lexicon
  • Wordlist of 160 thousand words with first-step
    part-of-speech tags.
  • Corpus
  • 1-year Prothom Alo newspaper corpus
  • Charjapad and Boru Chandi Dash er kabbo corpus
    (Edited by Md. Abdul Hai and Anwar Pasha)

41
CRBLP Publications
  • 2004
  • ICCIT 2004 (Bangladesh): 3 (Morphology, Spelling
    Checker)
  • Total: 3
  • 2005
  • IASTED CI 2005 (Canada): 1 (Name Searching)
  • IEEE NLP KE 2005 (China): 1 (Spelling Checker)
  • IEE Mobility 2005 (China): 1 (Text Input System
    for Mobile)
  • ICCIT 2005: 2 (Morphology, Compiler)
  • BU Journal: 1 (Morphological Parsing)
  • Undergraduate Thesis: 1 (Phonetic Encoding)
  • Total: 7

42
CRBLP Publication cont.
  • 2006
  • ICCPB 2006 (Bangladesh): 4 (Corpus, Lexicon,
    Spelling Checker, Transliteration)
  • ICCIT 2006 (Bangladesh): 11 (HPSG, Corpus
    Analysis, Text Categorization, Pronunciation
    Generator, Backward n-gram, Grammar Checker, Skew
    Correction, Traveler Information System, OCR
    using Kohonen Network, Mobile Messaging,
    Morphology)
  • CISSE 2006 (Online): 2 (Comparison of POS
    Tagging, Bangla Information Retrieval)
  • Undergraduate Thesis: 9 (Skew Correction, Mobile
    Messaging, Speech Recognition, OCR using Kohonen
    Network, Text to Speech, Corpus Analysis, Text
    Categorization, POS Tagging, HPSG)
  • Total: 24
  • 2007
  • ICON 2007 (India): 1 (Spelling Checker)
  • DCCA 2007 (Jordan): 5 (Stemming, OCR, Text to
    Speech, Semantics, Wireless LAN)
  • EWB 2007 (Abu Dhabi): 2 (Information Retrieval,
    Localization)
  • Total: 8
  • As of January 2007

43
CRBLP website
  • http://www.bracu.ac.bd/research/crblp/