Transliteration based Gazetteer Preparation for Named Entity Recognition in Hindi - PowerPoint PPT Presentation

Description:

Intermediate alphabet - size 34 preserves phonetic properties. 11/5/09 ... Define the phonetic map-table. Maps English n-gram into intermediate character ... – PowerPoint PPT presentation

1
Transliteration based Gazetteer Preparation for
Named Entity Recognition in Hindi
Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India
2
Contents
  • Introduction
  • NER approaches
  • Gazetteer and application in NER
  • Gazetteers: the Indian language scenario
  • Transliteration: our approach to gazetteer
    preparation
  • MaxEnt based Hindi NER system
  • Use of prepared gazetteers in MaxEnt system

3
What is NER
  • Named Entity Recognition (NER): locate and
    classify the names in text
  • Example: Jawaharlal Nehru was the first prime
    minister of India
  • Tags: Jawaharlal/Per-beg, Nehru/Per-end,
    prime minister/Title, India/Loc
  • Importance:
  • Information extraction, question answering
  • Can help summarization, ASR and MT
  • Intelligent document access, etc.

4
NER Approaches
  • 1. Linguistic Approach
  • Rule-based models
  • Requires a lot of experience and grammatical
    knowledge
  • Examples: Grishman (1995), Wakao et al.
    (1996)
  • 2. Machine Learning based Approach
  • ML-based models: HMM, MaxEnt, CRF
  • Requires a huge amount of annotated data
  • Examples: Borthwick (1999), McCallum et al.
    (2003)
  • 3. Hybrid Approach
  • Several ML and rule-based systems are combined
  • Example: Srihari et al. (2000), ..

5
Gazetteer
  • Gazetteer: a list of names of a specific type;
    very helpful in identifying NEs
  • Examples:
  • Names of cities, countries, etc.
  • First name list, surname list
  • List of towns in India
  • Used in many NER systems
  • Gazetteers are used in both rule-based and ML
    approaches

6
Gazetteers in NER Systems: Examples
  • Grishman (1995) (rule-based)
  • Names of all countries, major cities, companies,
    common first names etc.
  • Wakao et al. (1996) (rule-based)
  • Names of locations, organizations, first names,
    human titles etc.
  • Borthwick (1999) (ML-based)
  • First names (1,245), corporate names (10,300),
    colleges and universities (1,225), corporate
    suffixes (244), date and time (51) etc.
  • Srihari et al. (2000) (ML-based)
  • Male names (3,000), female names (5,000), family
    names (14,000), location names (2,50,000) with
    category.

7
Gazetteer: Advantages and Disadvantages
  • Advantages
  • Simple
  • Fast
  • Language independent
  • Easy to retarget
  • Disadvantages
  • Impossible to enumerate all names
  • Collection and maintenance are difficult
  • Cannot deal with name variants
  • Cannot resolve ambiguity

8
Gazetteers: Indian Language Scenario
  • Resource-poor languages
  • No publicly available list of reasonable size in
    Indian languages
  • Web resources: Indian languages are used very
    little on the web compared to English
  • Relevant lists are available in English
  • Need transliteration from English to the Indian
    language (Hindi)

9
Transliteration
  • Direct transliteration (English to Hindi) is not
    easy
  • Alphabet size: 52 (Hindi) vs 26 (English)
  • Compound characters in Hindi
  • Availability of bilingual corpora
  • Many possible outcomes for an input string
  • Example: Bengali-English transliteration system
  • Joint source channel model
  • Accuracy:
  • 69.3% word agreement ratio (Bengali to English)
  • 67.9% word agreement ratio (English to
    Bengali)

10
Our Approach
  • Transliteration through an intermediate alphabet
  • English string (E) is transliterated to the
    intermediate string (S1)
  • Also Hindi string (H) is transliterated to the
    intermediate string (S2)
  • If S1 is equivalent to S2, then E and H are
    transliterations of each other
  • Intermediate alphabet: size 34, preserves
    phonetic properties

11
Transliteration - English to Intermediate
  • Define the phonetic map-table
  • Maps English n-grams to intermediate characters
  • Example: a few entries of the map-table

12
Procedure - Transliteration
  • Source: English, Output: Intermediate
  • 1. Scan the source string (S) from left to right
  • 2. Extract the first n-gram (G) from the string
    (n = 3)
  • 3. Find if it is in the map-table
  • 4. If yes, insert its corresponding intermediate
    state entity into the target string T, remove the
    n-gram from S (S = S - G), and go to step 2
  • 5. Else set n = n - 1 and go to step 3.
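
The scan-and-match procedure above can be sketched in Python as a greedy longest-match loop. This is only an illustration: the map-table entries below are hypothetical stand-ins (the real table from the slides is not reproduced in this transcript), and skipping a character that has no table entry is an assumption, since the slides do not say what happens in that case.

```python
# Hypothetical fragment of an English-to-intermediate map-table.
# The actual table (34 intermediate symbols) is not given in the transcript.
ENGLISH_TO_INTERMEDIATE = {
    "sh": "S", "s": "S",
    "d": "D", "e": "E", "b": "B", "a": "A", "i": "I",
}


def to_intermediate(source, table, max_n=3):
    """Greedy left-to-right transliteration: try the longest n-gram first."""
    source = source.lower()
    target = []
    pos = 0
    while pos < len(source):
        for n in range(max_n, 0, -1):          # n = 3, 2, 1
            gram = source[pos:pos + n]
            if gram in table:
                target.append(table[gram])     # emit the intermediate symbol
                pos += len(gram)               # remove the n-gram from S
                break
        else:
            # Assumption: a character with no table entry is skipped.
            pos += 1
    return "".join(target)


variants = ["Debasis", "Debashis", "Debasish"]
codes = {to_intermediate(v, ENGLISH_TO_INTERMEDIATE) for v in variants}
```

With this toy table the three spelling variants collapse to one intermediate string, which is the property the approach relies on when comparing S1 and S2.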

13
Transliteration - Hindi to Intermediate
  • Convert the Hindi strings to itrans
  • itrans is a representation of Indian language
    alphabets in terms of ASCII
  • The map-table is available at aczoom.com/itrans
  • Define a map-table for itrans to Intermediate
    transliteration
  • Follow similar procedure

14
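Transliteration Example
  • English spellings: Debasis, Debashis, Debasish
  • Hindi spellings: two Devanagari forms (not
    recoverable from this transcript)
  • All are mapped to a common intermediate string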
15
Gazetteer Lists
  • Prepared lists, with size and examples:
  • Common Location List: 70 (jila, nagar, road)
  • Location List: 13019 (Kolkata, New York)
  • Month names, days of week: 40 (ravivar, March)
  • First Name List: 9722 (Suman, Kunal)
  • Middle Name List: 35 (Kumar, Chandra)
  • Surname List: 1700 (Kundu, Chopra, Gates)
  • Organization List: 950 (Bhajapa, Microsoft)

16
Gazetteers and Transliteration
  • Collected English name lists are transliterated
    into Intermediate alphabet
  • These transliterated lists are used as gazetteers
    in Hindi NER development
  • Hindi strings are transliterated into
    Intermediate alphabet and searched in the
    gazetteers
  • Gazetteer-based NER system: if word w_i is in the
    list of category C_j, then w_i is an NE of
    category C_j
  • Accuracy: Person 43.1, Location 61.44,
    Organization 48.14
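
A minimal sketch of such a lookup-only tagger follows. The gazetteer contents and the `toy_key` normalizer are hypothetical stand-ins for the transliterated lists and the intermediate-alphabet mapping; letting the first matching category win is also an assumption, since the slides do not say how a word found in several lists is handled.

```python
def tag_with_gazetteers(words, gazetteers, to_key):
    """Tag each word with the first gazetteer category that contains it."""
    tags = []
    for word in words:
        key = to_key(word)                 # map the word to its lookup form
        tag = "O"                          # "O" = not a named entity
        for category, entries in gazetteers.items():
            if key in entries:
                tag = category
                break                      # assumption: first match wins
        tags.append(tag)
    return tags


# Toy stand-in for the intermediate-alphabet transliteration.
def toy_key(word):
    return word.lower().replace("sh", "s")


# Hypothetical gazetteers, already stored in the lookup form.
toy_gazetteers = {
    "Person": {"debasis"},
    "Location": {"kolkata"},
}

tags = tag_with_gazetteers(
    ["Debashis", "lives", "in", "Kolkata"], toy_gazetteers, toy_key
)
```

Because every match is accepted outright, a tagger like this cannot tell "Kharagpur" the town from "IIT Kharagpur" the organization, which is consistent with the modest accuracies reported above.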

17
MaxEnt Based Hindi NER
  • Computes P(t|h)
  • t: outcome, h: history, g_i: feature, a_i:
    weight of g_i
  • An OpenNLP MaxEnt tool (www.maxent.sourceforge.net)
    is used to compute the class-conditional
    probabilities P(t|h)
  • Beam Search Algorithm to find the most probable
    tag sequence
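
The two steps above can be sketched together, assuming the usual MaxEnt product form P(t|h) = (1/Z(h)) * prod_i a_i^g_i(h,t), which is consistent with the symbols listed. This is not the OpenNLP tool's API: the function names, the toy features, and the weights are all hypothetical, and in a real system the weights would come from training.

```python
import math


def maxent_prob(history, tags, features, weights):
    """Return {tag: P(tag | history)} under the product form, normalized by Z(h)."""
    scores = {}
    for tag in tags:
        score = 1.0
        for g, a in zip(features, weights):
            score *= a ** g(history, tag)   # a_i ** g_i(h, t), g_i is 0 or 1
        scores[tag] = score
    z = sum(scores.values())
    return {tag: s / z for tag, s in scores.items()}


def beam_search(words, tags, features, weights, beam_width=3):
    """Keep the beam_width best partial tag sequences at each position."""
    beam = [(0.0, [])]                      # (log-probability, tags so far)
    for i in range(len(words)):
        candidates = []
        for logp, seq in beam:
            history = {"words": words, "index": i, "prev_tags": seq}
            probs = maxent_prob(history, tags, features, weights)
            for tag, p in probs.items():
                candidates.append((logp + math.log(p), seq + [tag]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


# Hypothetical binary features and weights, for illustration only.
NAMES = {"Atal", "Bihari", "Vajpayee"}
toy_features = [
    lambda h, t: 1 if h["words"][h["index"]] in NAMES and t == "Per" else 0,
    lambda h, t: 1 if h["words"][h["index"]] not in NAMES and t == "O" else 0,
]
toy_weights = [5.0, 5.0]

best = beam_search(["Atal", "Bihari", "Vajpayee", "spoke"],
                   ["Per", "O"], toy_features, toy_weights)
```

Working in log-probabilities keeps long sequences from underflowing; the history dictionary also exposes the previously chosen tags, so dynamic features over earlier tags can be added in the same way.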

18
Training Data
  • Collected from Dainik Jagaran
  • 4 classes: Person, Location, Organization, Date
  • Each class contains 4 subclasses: Begin,
    Continue, End, Unique
  • Example: New/LocBeg Delhi/LocEnd, Atal/PerBeg
    Bihari/PerCon Vajpayee/PerEnd, Microsoft/OrgUniq
  • Corpus size: 2,43,341 words with 16482 NEs
  • 6298 Person, 4696 Location, 3652 Organization and
    1845 Date Entities
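
The Begin/Continue/End/Unique scheme can be illustrated with a small helper that tags the words of one entity. The function name is hypothetical; the tag suffixes follow the examples on this slide.

```python
def encode_entity(words, cls):
    """Tag the words of a single named entity of class cls (e.g. "Per")."""
    if len(words) == 1:
        return [(words[0], cls + "Uniq")]           # single-word entity
    tagged = [(words[0], cls + "Beg")]              # first word
    tagged += [(w, cls + "Con") for w in words[1:-1]]  # middle words, if any
    tagged.append((words[-1], cls + "End"))         # last word
    return tagged


person = encode_entity(["Atal", "Bihari", "Vajpayee"], "Per")
place = encode_entity(["New", "Delhi"], "Loc")
company = encode_entity(["Microsoft"], "Org")
```

With 4 classes and 4 subclasses each, the tagger chooses among 16 entity tags plus a tag for words outside any entity.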

19
Hindi NER Features
  • Static: word feature
  • Dynamic: NE tag of previous words
  • Contains digit
  • Made up of 4 digits
  • Numerical words
  • Word suffix: fixed length or binary
  • Word prefix
  • POS information: full or coarse-grained,
    binary POS features (nominal-PSP)
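
A few of these features can be sketched as a function that builds a feature dictionary for one position. The function name, the feature keys, and the affix length of 3 are assumptions for illustration; numerical-word and POS features are left out because they need external resources that the slides do not describe.

```python
def word_features(words, i, prev_tags, affix_len=3):
    """Build a small feature dictionary for the word at position i."""
    w = words[i]
    return {
        "word": w,                                          # static word feature
        "prev_tag": prev_tags[-1] if prev_tags else "START",  # dynamic feature
        "contains_digit": any(ch.isdigit() for ch in w),
        "four_digits": len(w) == 4 and w.isdigit(),         # e.g. a year
        "suffix": w[-affix_len:],                           # fixed-length suffix
        "prefix": w[:affix_len],                            # fixed-length prefix
    }


feats = word_features(["In", "1999"], 1, ["O"])
```

The previous-tag feature is "dynamic" because it is only known while decoding, once earlier words have been tagged.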

20
Gazetteer Features
  • A decision is not taken by direct matching
    alone, because that causes ambiguity
  • Kharagpur (Location) vs IIT Kharagpur
    (Organization)
  • Rabindranath Tagore (Person) vs Rabindranath
    Tagore Lane (Location)
  • Purba may be the name of a person or a building
  • Gazetteers are used as binary features of the
    MaxEnt system
  • If w_i is in the location list,
    location-gazetteer-feature(w_i) = 1, else
    location-gazetteer-feature(w_i) = 0
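
The binary gazetteer features can be sketched as one 0/1 value per list. The gazetteer contents below are hypothetical, and the word is assumed to be already in the same form as the list entries (in the described system, the intermediate alphabet).

```python
def gazetteer_features(word_key, gazetteers):
    """Return one binary feature per gazetteer: 1 if the word is listed, else 0."""
    return {
        f"{name}-gazetteer-feature": int(word_key in entries)
        for name, entries in gazetteers.items()
    }


# Hypothetical lists; "purba" is deliberately placed in two of them.
toy_lists = {
    "location": {"kharagpur", "kolkata"},
    "first-name": {"purba", "suman"},
    "building": {"purba"},
}

feats_kgp = gazetteer_features("kharagpur", toy_lists)
feats_purba = gazetteer_features("purba", toy_lists)
```

An ambiguous word such as "purba" simply fires several features at once, and the MaxEnt model weighs them against context features instead of committing to one list.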

21
Accuracy for Different Features
  • Accuracies of the MaxEnt-based NER system for
    different feature sets; all values are
    f-values
  • Without using gazetteer information, the highest
    accuracy obtained is an f-value of 75.89.

22
Accuracy for Different Gazetteers
  • The improvement of accuracies after adding
    gazetteer features
  • Individual gazetteers are capable of increasing
    the accuracy of the particular class

23
Accuracy: MaxEnt System with Gazetteers
  • The transliterated gazetteers are able to
    increase accuracy
  • The highest system accuracy is an f-value of
    81.12
24
References
  • Andrew Borthwick. 1999. A Maximum Entropy
    Approach to Named Entity Recognition. Ph.D.
    thesis, Computer Science Department, New York
    University.
  • Daniel M. Bikel, Scott Miller, Richard Schwartz,
    and Ralph Weischedel. 1997. Nymble: A
    High-Performance Learning Name-finder. In
    Proceedings of the Fifth Conference on Applied
    Natural Language Processing, pages 194-201.
  • Asif Ekbal, S. Naskar, S. Bandyopadhyay. 2006.
    A Modified Joint Source Channel Model for
    Transliteration. In Proceedings of COLING/ACL
    2006, Australia, pages 191-198.
  • Ralph Grishman. 1995. Where's the syntax? The New
    York University MUC-6 System. In Proceedings of
    the Sixth Message Understanding Conference.
  • R. Srihari, C. Niu, W. Li. 2000. A Hybrid Approach
    for Named Entity and Sub-Type Tagging. In
    Proceedings of the Sixth Conference on Applied
    Natural Language Processing.
  • T. Wakao, R. Gaizauskas and Y. Wilks. 1996.
    Evaluation of an Algorithm for the Recognition
    and Classification of Proper Names. In
    Proceedings of COLING-96.

25
Thank You