Title: Transliteration based Gazetteer Preparation for Named Entity Recognition in Hindi
1Transliteration based Gazetteer Preparation for
Named Entity Recognition in Hindi
Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India
2Contents
- Introduction
- NER approaches
- Gazetteers and their application in NER
- Gazetteers: the Indian language scenario
- Transliteration: our approach to gazetteer preparation
- MaxEnt based Hindi NER system
- Use of prepared gazetteers in the MaxEnt system
3What is NER
- Named Entity Recognition (NER): locate and classify the names in text
- Example
  - Jawaharlal Nehru was the first prime minister of India
  - Tags: Jawaharlal (Per-beg), Nehru (Per-end), prime minister (Title), India (Loc)
- Importance
  - Information Extraction, Question-Answering
  - Can help Summarization, ASR and MT
  - Intelligent document access
  - etc.
4NER Approaches
- 1. Linguistic Approach
  - Rule based models
  - Requires a lot of experience and grammatical knowledge
  - Example: Grishman (1995), Wakao et al. (1996)
- 2. Machine Learning based Approach
  - ML based models: HMM, MaxEnt, CRF
  - Requires a huge amount of annotated data
  - Example: Borthwick (1999), McCallum et al. (2003)
- 3. Hybrid Approach
  - Several ML and rule-based systems are combined
  - Example: Srihari et al. (2000), ...
5Gazetteer
- Lists of names of a specific type; very helpful in the identification of NEs
- Example
  - Names of cities, countries etc.
  - First name list, surnames
  - List of towns in India
- Used in many NER systems
  - Gazetteers are used in both rule-based and ML approaches
6Gazetteers in NER Systems Example
- Grishman (1995) (rule-based)
  - Names of all countries, major cities, companies, common first names etc.
- Wakao et al. (1996) (rule-based)
  - Names of locations, organizations, first names, human titles etc.
- Borthwick (1999) (ML-based)
  - First names (1,245), corporate names (10,300), colleges and universities (1,225), corporate suffixes (244), date and time (51) etc.
- Srihari et al. (2000) (ML-based)
  - Male names (3,000), female names (5,000), family names (14,000), location names (2,50,000) with category.
7Gazetteer Advantage Disadvantage
- Advantages
  - Simple
  - Fast
  - Language independent
  - Easy to retarget
- Disadvantages
  - Impossible to enumerate all names
  - Collection and maintenance are difficult
  - Can't deal with name variants
  - Can't resolve ambiguity
8Gazetteer Indian Language Scenario
- Resource-poor language
  - No publicly available list of reasonable size in Indian languages
  - Web resources: Indian languages are used far less on the web than English
- Relevant lists are available in English
- Need transliteration: English to Indian language (Hindi)
9Transliteration
- Direct transliteration (English to Hindi) is not easy
  - Alphabet size: 52 (Hindi) vs 26 (English)
  - Compound characters in Hindi
  - Availability of bilingual corpora
  - Many possible outcomes for an input string
- Example: Bengali-English transliteration system
  - Joint source channel model (Ekbal et al., 2006)
  - Accuracy
    - 69.3% word agreement ratio (Bengali to English)
    - 67.9% word agreement ratio (English to Bengali)
10Our Approach
- Transliteration through an intermediate alphabet
  - The English string (E) is transliterated to an intermediate string (S1)
  - The Hindi string (H) is also transliterated to an intermediate string (S2)
  - If S1 is equivalent to S2, then E and H are transliterations of each other
- Intermediate alphabet: size 34, preserves phonetic properties
11Transliteration - English to Intermediate
- Define the phonetic map-table
  - Maps an English n-gram to an intermediate character
- Example: a few entries of the map-table
12Procedure - Transliteration
- Source: English; Output: Intermediate
- 1. Scan the source string (S) from left to right
- 2. Extract the first n-gram (G) from the string (n = 3)
- 3. Find if it is in the map-table
- 4. If yes, insert its corresponding intermediate state entity into the target string T; remove the n-gram from S (S = S - G); go to step 2
- 5. Else set n = n - 1; go to step 3
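The procedure above can be sketched in Python as a greedy longest-match scan. This is only an illustration: the map-table entries and the intermediate symbols below are hypothetical placeholders (the actual table is not reproduced in these slides), and copying unmapped characters unchanged is an assumption as well.

```python
# Toy map-table: English n-grams -> intermediate symbols.
# Hypothetical values, not the authors' actual 34-symbol alphabet.
MAP_TABLE = {
    "sh": "S", "s": "S",
    "ee": "I", "i": "I",
    "d": "D", "e": "E", "b": "B", "a": "A",
}

MAX_N = 3  # the procedure starts from trigrams


def to_intermediate(source, table=MAP_TABLE, max_n=MAX_N):
    """Greedy left-to-right transliteration using the longest matching n-gram."""
    s = source.lower()
    target = []
    while s:
        n = min(max_n, len(s))
        while n > 0:
            gram = s[:n]
            if gram in table:
                target.append(table[gram])  # emit the intermediate symbol
                break
            n -= 1  # no match: retry with a shorter n-gram
        else:
            # Assumption: a character with no table entry is kept unchanged.
            target.append(s[0])
            n = 1
        s = s[n:]  # remove the consumed n-gram from the source
    return "".join(target)


variants = ["Debasis", "Debashis", "Debasish"]
print({v: to_intermediate(v) for v in variants})
```

With this toy table, the three spelling variants collapse to the same intermediate string, which is the property the equivalence test between S1 and S2 relies on.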
13Transliteration Hindi to Intermediate
- Convert the Hindi strings to itrans
  - itrans is a representation of Indian language alphabets in terms of ASCII
  - The map-table is available at aczoom.com/itrans
- Define a map-table for itrans to Intermediate transliteration
- Follow a similar procedure
14Transliteration Example
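- English variants: Debasis, Debashis, Debasish
- Hindi spellings: [Devanagari text not recoverable]
- Both sides are transliterated to the intermediate alphabet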
15Gazetteer Lists
- Prepared lists, with size and examples
  - Common Location List: 70 (jila, nagar, road)
  - Location List: 13019 (Kolkata, New York)
  - Month Names, Days of Week: 40 (ravivar, March)
  - First Name List: 9722 (Suman, Kunal)
  - Middle Name List: 35 (Kumar, Chandra)
  - Surname List: 1700 (Kundu, Chopra, Gates)
  - Organization List: 950 (Bhajapa, Microsoft)
16Gazetteers and Transliteration
- Collected English name lists are transliterated into the Intermediate alphabet
- These transliterated lists are used as gazetteers in Hindi NER development
- Hindi strings are transliterated into the Intermediate alphabet and searched in the gazetteers
- Gazetteer based NER system (if w_i is in the list of category C_j, then w_i is an NE)
  - Accuracy: Person 43.1, Location 61.44, Organization 48.14
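A minimal sketch of this lookup-only baseline is given below. The function names are illustrative, and `toy_to_intermediate` is a hypothetical stand-in for the real English-to-Intermediate transliterator; the Hindi word is assumed to have been converted to the same intermediate alphabet already (via itrans).

```python
from typing import Callable, Dict, Iterable, Optional, Set


def build_gazetteers(
    english_lists: Dict[str, Iterable[str]],
    to_intermediate: Callable[[str], str],
) -> Dict[str, Set[str]]:
    """Transliterate every English list entry into the intermediate alphabet."""
    return {
        category: {to_intermediate(name) for name in names}
        for category, names in english_lists.items()
    }


def gazetteer_tag(word_intermediate: str, gazetteers: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first category whose list contains the word, else None."""
    for category, entries in gazetteers.items():
        if word_intermediate in entries:
            return category
    return None


def toy_to_intermediate(name: str) -> str:
    # Hypothetical stand-in for the map-table based transliterator.
    return name.lower().replace("sh", "s")


gazetteers = build_gazetteers(
    {"Location": ["Kolkata", "New York"], "Surname": ["Kundu", "Chopra"]},
    toy_to_intermediate,
)
print(gazetteer_tag("kolkata", gazetteers))
```

Because the first matching list wins, a word that appears in several lists is tagged arbitrarily, which is one reason a pure lookup system is limited.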
17MaxEnt Based Hindi NER
- Computes $P(t \mid h)$:
  $P(t \mid h) = \frac{1}{Z(h)} \prod_i \alpha_i^{g_i(h, t)}$
- $t$: outcome, $h$: history, $g_i$: feature, $\alpha_i$: weight of $g_i$, $Z(h)$: normalization factor
- An open-nlp MaxEnt tool (www.maxent.sourceforge.net) is used to compute the class-conditional probabilities $P(t \mid h)$
- Beam search algorithm to find the most probable tag sequence
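The beam search step can be sketched as follows. This is a generic illustration, not the tool's own decoder: `prob` is a hypothetical callback standing in for the MaxEnt model's P(t|h), and the beam width of 3 is an arbitrary assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple


def beam_search(
    words: Sequence[str],
    tags: Sequence[str],
    prob: Callable[[Sequence[str], int, Tuple[str, ...], str], float],
    beam_width: int = 3,
) -> List[str]:
    """Return the highest-scoring tag sequence found within the beam."""
    beam = [(0.0, ())]  # (log-probability, tags assigned so far)
    for i in range(len(words)):
        candidates = []
        for logp, prev in beam:
            for t in tags:
                # prob(...) plays the role of P(t | h); the history h is
                # the sentence, the position and the previous tags.
                p = prob(words, i, prev, t)
                if p > 0.0:
                    candidates.append((logp + math.log(p), prev + (t,)))
        # Keep only the beam_width most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return list(beam[0][1]) if beam else []
```

Working in log-probabilities avoids numerical underflow on long sentences; a wider beam explores more sequences at a higher cost.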
18Training Data
- Collected from Dainik Jagaran
- 4 classes: Person, Location, Organization, Date
- Each class contains 4 subclasses: Begin, Continue, End, Unique
  - Example: New/LocBeg Delhi/LocEnd, Atal/PerBeg Bihari/PerCon Vajpayee/PerEnd, Microsoft/OrgUniq
- Corpus size: 2,43,341 words with 16482 NEs
  - 6298 Person, 4696 Location, 3652 Organization and 1845 Date entities
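The Begin/Continue/End/Unique scheme can be illustrated with a small helper that tags the words of one entity. The helper name is hypothetical; the class prefixes and suffixes (Per, Loc, Org; Beg, Con, End, Uniq) are taken from the examples above.

```python
def tag_entity(words, cls):
    """Tag the words of a single named entity of class `cls`."""
    if len(words) == 1:
        return [(words[0], cls + "Uniq")]  # single-word entity
    # First word is Begin, last is End, everything in between is Continue.
    tags = [cls + "Beg"] + [cls + "Con"] * (len(words) - 2) + [cls + "End"]
    return list(zip(words, tags))


print(tag_entity(["Atal", "Bihari", "Vajpayee"], "Per"))
print(tag_entity(["Microsoft"], "Org"))
```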
19Hindi NER Features
- Static: word feature
- Dynamic: NE tag of previous words
- Contains digit
- Made up of 4 digits
- Numerical words
- Word suffix: fixed length or binary
- Word prefix
- POS information: full or coarse grained, binary POS features (nominal-PSP)
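A few of these features can be sketched as a per-token feature dictionary. The feature names, the affix length of 2 and the "START" marker are assumptions made for illustration; numerical-word and POS features are left out because they need external resources.

```python
def word_features(words, i, prev_tags, suffix_len=2, prefix_len=2):
    """Build a feature dictionary for the word at position i."""
    w = words[i]
    return {
        "word": w,                                              # static word feature
        "prev_tag": prev_tags[-1] if prev_tags else "START",   # dynamic NE tag feature
        "contains_digit": any(ch.isdigit() for ch in w),
        "four_digits": len(w) == 4 and w.isdigit(),             # e.g. a year
        "suffix": w[-suffix_len:],                              # fixed-length suffix
        "prefix": w[:prefix_len],                               # fixed-length prefix
    }


print(word_features(["Delhi", "2007"], 1, ["LocUniq"]))
```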
20Gazetteer Features
- Decisions are not taken by direct matching, since it causes ambiguity
  - Kharagpur (Location) vs IIT Kharagpur (Organization)
  - Rabindranath Tagore (Person) vs Rabindranath Tagore Lane (Location)
  - Purba may be the name of a person or of a building
- Gazetteers are used as features of MaxEnt
  - Binary features: if w_i is in the location list, location-gazetteer-feature(w_i) = 1, else location-gazetteer-feature(w_i) = 0
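The binary gazetteer features can be sketched as below. The function name and the toy list contents are illustrative assumptions; the word is assumed to be already transliterated to the intermediate alphabet.

```python
def gazetteer_features(word_intermediate, gazetteers):
    """One binary feature per gazetteer: 1 if the word is in the list, else 0."""
    return {
        f"{name}-gazetteer-feature": int(word_intermediate in entries)
        for name, entries in gazetteers.items()
    }


toy_gazetteers = {"location": {"kharagpur"}, "surname": {"tagore"}}
print(gazetteer_features("kharagpur", toy_gazetteers))
```

Unlike a lookup decision, these values are only evidence: the MaxEnt model combines them with the word, context and POS features, so a match in the location list does not by itself force a Location tag.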
21Accuracy for Different Features
- Accuracies of the MaxEnt based NER system for different feature sets (all values are f-values)
- Without using gazetteer information
  - The highest accuracy obtained is an f-value of 75.89.
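The slides do not define the f-value. Assuming it is the usual balanced F-measure over precision $P$ and recall $R$ (an assumption, not stated in the source), it would be computed as:

```latex
F = \frac{2 \, P \, R}{P + R}, \qquad
P = \frac{\text{correctly tagged NEs}}{\text{NEs tagged by the system}}, \qquad
R = \frac{\text{correctly tagged NEs}}{\text{NEs in the gold annotation}}
```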
22Accuracy for Different Gazetteers
- Improvement in accuracy after adding gazetteer features
- Individual gazetteers are capable of increasing the accuracy of the particular class
23Accuracy MaxEnt System Gazetteer
- The transliterated gazetteers are able to increase accuracy
- The highest system accuracy is an f-value of 81.12
24References
- Andrew Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.
- Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. 1997. Nymble: A High Performance Learning Name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194-201.
- Asif Ekbal, S. Naskar, and S. Bandyopadhyay. 2006. A Modified Joint Source Channel Model for Transliteration. In Proceedings of the COLING/ACL 2006, Australia, pages 191-198.
- Ralph Grishman. 1995. Where's the Syntax? The New York University MUC-6 System. In Proceedings of the Sixth Message Understanding Conference.
- R. Srihari, C. Niu, and W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing.
- T. Wakao, R. Gaizauskas, and Y. Wilks. 1996. Evaluation of an Algorithm for the Recognition and Classification of Proper Names. In Proceedings of COLING-96.
25Thank You