Transliteration based Gazetteer Preparation for Named Entity Recognition in Hindi

1
Transliteration based Gazetteer Preparation for
Named Entity Recognition in Hindi
Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India
2
Contents

Introduction
NER approaches
Gazetteer and application in NER
Gazetteer: Indian language scenario
Transliteration: our approach and gazetteer
preparation
MaxEnt based Hindi NER system
Use of prepared gazetteers in MaxEnt system

3
What is NER?

Named Entity Recognition (NER): locate and
classify the names in text
Example:
Jawaharlal (Per-beg) Nehru (Per-end) was the first
prime minister (Title) of India (Loc)
Importance:
Information Extraction, Question-Answering
Can help Summarization, ASR and MT
Intelligent document access
etc.

4
NER Approaches

1. Linguistic Approach
Rule-based models
Requires a lot of experience and grammatical
knowledge
Examples: Grishman (1995), Wakao et al. (1996)
2. Machine Learning based Approach
ML-based models: HMM, MaxEnt, CRF
Requires a huge amount of annotated data
Examples: Borthwick (1999), McCallum et al.
(2003)
3. Hybrid Approach
Several ML and rule-based systems are combined
Example: Srihari et al. (2000), ..

5
Gazetteer

Name lists of a specific type; very helpful in
identifying NEs
Examples:
Names of cities, countries etc.
First name list, surnames
List of towns in India
Used in many NER systems
Gazetteers are used in both rule-based and ML
approaches

6
Gazetteers in NER Systems: Examples

Grishman (1995) (rule-based)
Names of all countries, major cities, companies,
common first names etc.
Wakao et al. (1996) (rule-based)
Names of locations, organizations, first names,
human titles etc.
Borthwick (1999) (ML-based)
First names (1,245), corporate names (10,300),
colleges and universities (1,225), corporate
suffixes (244), date and time (51) etc.
Srihari et al. (2000) (hybrid)
Male names (3,000), female names (5,000), family
names (14,000), location names (2,50,000) with
category.

7
Gazetteer: Advantages and Disadvantages

Advantages
Simple
Fast
Language independent
Easy to retarget
Disadvantages
Impossible to enumerate all names
Collection and maintenance are difficult
Can't deal with name variants
Can't resolve ambiguity

8
Gazetteer: Indian Language Scenario

Resource-poor language
No reasonably sized, publicly available lists in
Indian languages
Web resources: use of Indian languages on the web
is very low compared to English
Relevant lists are available in English
Need transliteration: English to Indian language
(Hindi)

9
Transliteration

Direct transliteration (English to Hindi) is not
easy
Alphabet size: 52 (Hindi) vs 26 (English)
Compound characters in Hindi
Availability of bilingual corpora
Many possible outcomes for an input string
Example: Bengali-English transliteration system
Joint source channel model
Accuracy:
69.3% word agreement ratio (Bengali to English)
67.9% word agreement ratio (English to Bengali)

10
Our Approach

Transliteration through an intermediate alphabet
The English string (E) is transliterated to the
intermediate string (S1)
The Hindi string (H) is also transliterated to an
intermediate string (S2)
If S1 is equivalent to S2, then E and H are
transliterations of each other
Intermediate alphabet: size 34, preserves
phonetic properties

11
Transliteration - English to Intermediate

Define the phonetic map-table
Maps English n-grams to intermediate characters
Example: a few entries of the map-table

12
Procedure - Transliteration

Source: English; output: Intermediate
1. Scan the source string (S) from left to right
2. Extract the first n-gram (G) from the string
(n = 3)
3. Find if it is in the map-table
4. If yes, insert its corresponding intermediate
state entity into target string T. Remove the
n-gram from S: S = S - G. Go to step 2
5. Else set n = n - 1. Go to step 3
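
The procedure above can be sketched in Python as a greedy longest-match scan. This is a minimal sketch, not the authors' implementation: the entries in MAP_TABLE are hypothetical (the real map-table is not reproduced in the slides), and copying unmapped characters unchanged is an assumption.

```python
# Hypothetical map-table: English n-grams -> intermediate symbols.
# The actual table (34 intermediate characters) is not given in the slides.
MAP_TABLE = {
    "sh": "S", "s": "S",
    "ee": "I", "i": "I",
    "d": "D", "e": "E", "b": "B", "a": "A",
}

def to_intermediate(source, table=MAP_TABLE, max_n=3):
    """Greedy left-to-right mapping: try the longest n-gram first
    (n = max_n), fall back to shorter ones, and consume what matched."""
    s = source.lower()
    target = []
    while s:
        for n in range(min(max_n, len(s)), 0, -1):
            gram = s[:n]
            if gram in table:
                target.append(table[gram])
                s = s[n:]  # S = S - G
                break
        else:
            # No n-gram matched: copy the character unchanged (an assumption;
            # the slides do not say how unmapped characters are handled).
            target.append(s[0])
            s = s[1:]
    return "".join(target)

for name in ["Debasis", "Debashis", "Debasish"]:
    print(name, "->", to_intermediate(name))
```

With this toy table the three spelling variants collapse to one intermediate string, which is the property the approach relies on for matching.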

13
Transliteration: Hindi to Intermediate

Convert the Hindi strings to itrans
itrans is a representation of Indian language
alphabets in terms of ASCII
The map-table is available at aczoom.com/itrans
Define a map-table for itrans to Intermediate
transliteration
Follow a procedure similar to the one for English

14
Transliteration Example
English: Debasis, Debashis, Debasish
Hindi: [Devanagari spellings lost in extraction]
Intermediate: [string not preserved]
15
Gazetteer Lists

Prepared lists with size and examples:
Common location list: 70 entries (jila, nagar, road)
Location list: 13019 entries (Kolkata, New York)
Month names, days of week: 40 entries (ravivar, March)
First name list: 9722 entries (Suman, Kunal)
Middle name list: 35 entries (Kumar, Chandra)
Surname list: 1700 entries (Kundu, Chopra, Gates)
Organization list: 950 entries (Bhajapa, Microsoft)

16
Gazetteers and Transliteration

Collected English name lists are transliterated
into the Intermediate alphabet
These transliterated lists are used as gazetteers
in Hindi NER development
Hindi strings are transliterated into the
Intermediate alphabet and searched in the
gazetteers
Gazetteer-based NER system: if w_i is in the list
of category C_j, then w_i is an NE
Accuracy: Person 43.1, Location 61.44,
Organization 48.14
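
The direct-lookup system described above can be illustrated with a short Python sketch. The names gazetteer_tag and toy_intermediate are hypothetical; toy_intermediate is only a stand-in for the real English and Hindi transliterators, and the input is shown in Roman script for readability.

```python
def gazetteer_tag(words, gazetteers, to_intermediate):
    """Tag each word with the first category whose list contains its
    intermediate form, or "O" if no list matches."""
    tags = []
    for w in words:
        key = to_intermediate(w)
        tag = "O"
        for category, entries in gazetteers.items():
            if key in entries:
                tag = category
                break
        tags.append(tag)
    return tags

# Toy stand-in for the transliterators: lowercase and merge "sh" with "s".
def toy_intermediate(word):
    return word.lower().replace("sh", "s")

# Gazetteers are stored in intermediate form, as on the slide.
gazetteers = {
    "Person": {toy_intermediate(n) for n in ["Suman", "Kunal", "Debasis"]},
    "Location": {toy_intermediate(n) for n in ["Kolkata", "Kharagpur"]},
}

print(gazetteer_tag(["Debashis", "lives", "in", "Kolkata"],
                    gazetteers, toy_intermediate))
# ['Person', 'O', 'O', 'Location']
```

Because the lookup ignores context, a word that appears in several lists always receives the first matching category.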

17
MaxEnt Based Hindi NER

Computes P(t|h), where t is the outcome, h the
history, g_i a feature, and a_i the weight of g_i
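
The formula itself is not preserved in the transcript. Assuming the slide used the standard conditional MaxEnt form, which is consistent with the symbols listed above, it would read as follows (Z(h) is a normalizing constant introduced here):

```latex
P(t \mid h) = \frac{1}{Z(h)} \prod_{i} a_i^{\,g_i(h,t)},
\qquad
Z(h) = \sum_{t'} \prod_{i} a_i^{\,g_i(h,t')}
```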

The OpenNLP MaxEnt tool
(www.maxent.sourceforge.net) is used to compute
the class-conditional probabilities P(t|h)
A beam search algorithm finds the most probable
tag sequence
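
As an illustration of the decoding step, here is a minimal beam search sketch in Python. It does not call the OpenNLP tool: toy_prob is a hypothetical stand-in for the trained model, and the beam width and tag set are assumptions.

```python
import math

def beam_search(words, tags, prob, beam_width=3):
    """Return the most probable tag sequence found by beam search.

    prob(words, i, prev_tags, tag) must return P(tag | history), where the
    history is the sentence, the position i, and the tags chosen so far.
    """
    beam = [(0.0, [])]  # (log-probability, tag sequence so far)
    for i in range(len(words)):
        candidates = []
        for score, seq in beam:
            for tag in tags:
                p = prob(words, i, seq, tag)
                if p > 0.0:
                    candidates.append((score + math.log(p), seq + [tag]))
        # Keep only the beam_width best partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1] if beam else []

def toy_prob(words, i, prev_tags, tag):
    """Hypothetical model: capitalized words tend to be person names."""
    if words[i][0].isupper():
        if prev_tags and prev_tags[-1] == "PerBeg":
            dist = {"PerEnd": 0.8, "PerBeg": 0.1, "O": 0.1}
        else:
            dist = {"PerBeg": 0.7, "PerEnd": 0.1, "O": 0.2}
    else:
        dist = {"O": 0.9, "PerBeg": 0.05, "PerEnd": 0.05}
    return dist[tag]

print(beam_search(["Atal", "Vajpayee", "spoke"],
                  ["PerBeg", "PerEnd", "O"], toy_prob, beam_width=2))
# ['PerBeg', 'PerEnd', 'O']
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sentences.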

18
Training Data

Collected from Dainik Jagaran
4 classes: Person, Location, Organization, Date
Each class contains 4 subclasses: Begin,
Continue, End, Unique
Example: New/LocBeg Delhi/LocEnd, Atal/PerBeg
Bihari/PerCon Vajpayee/PerEnd, Microsoft/OrgUniq
Corpus size: 2,43,341 words with 16482 NEs
6298 Person, 4696 Location, 3652 Organization and
1845 Date entities
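
The Begin/Continue/End/Unique scheme can be made concrete with a small Python sketch. The function name encode_entity is hypothetical; the tag spellings follow the example on the slide.

```python
def encode_entity(words, cls):
    """Tag the words of one entity of class cls (e.g. "Per", "Loc", "Org")."""
    if len(words) == 1:
        return [(words[0], cls + "Uniq")]
    tagged = [(words[0], cls + "Beg")]
    tagged += [(w, cls + "Con") for w in words[1:-1]]
    tagged.append((words[-1], cls + "End"))
    return tagged

print(encode_entity(["Atal", "Bihari", "Vajpayee"], "Per"))
# [('Atal', 'PerBeg'), ('Bihari', 'PerCon'), ('Vajpayee', 'PerEnd')]
print(encode_entity(["Microsoft"], "Org"))
# [('Microsoft', 'OrgUniq')]
```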

19
Hindi NER Features

Static: word feature
Dynamic: NE tags of previous words
Contains digit
Made up of 4 digits
Numerical words
Word suffix: fixed length or binary
Word prefix
POS information: full or coarse-grained,
binary POS features (nominal-PSP)
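
A rough Python sketch of how most of these features could be computed for one word. The function name, the affix length of 2, and the sample NUMERICAL_WORDS are assumptions; POS features are omitted because they require a tagger that the slides do not describe.

```python
# Hypothetical sample of Hindi number words (romanized).
NUMERICAL_WORDS = {"ek", "do", "teen"}

def word_features(words, i, prev_tags, affix_len=2):
    """Build the feature dictionary for the word at position i."""
    w = words[i]
    return {
        "word": w,                                            # static word feature
        "prev_tag": prev_tags[-1] if prev_tags else "START",  # dynamic NE tag
        "contains_digit": any(c.isdigit() for c in w),
        "four_digits": len(w) == 4 and w.isdigit(),
        "numerical_word": w.lower() in NUMERICAL_WORDS,
        "suffix": w[-affix_len:],
        "prefix": w[:affix_len],
    }

print(word_features(["varsh", "1999", "mein"], 1, ["O"]))
```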

20
Gazetteer Features

No decision is taken on direct matching alone
It causes ambiguity
Kharagpur (Location) vs IIT Kharagpur
(Organization)
Rabindranath Tagore (Person) vs Rabindranath
Tagore Lane (Location)
Purba may be the name of a person or a building
Gazetteers are used as features of MaxEnt
Binary features
If w_i is in the location list,
location-gazetteer-feature(w_i) = 1, else
location-gazetteer-feature(w_i) = 0
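
The binary gazetteer features can be sketched as one 0/1 value per list. This is an illustrative sketch: the function name and the toy lists are hypothetical, and the words are assumed to be already in intermediate form.

```python
def gazetteer_features(word, gazetteers):
    """One binary feature per list: 1 if the word is in that list, else 0."""
    return {
        name + "-gazetteer-feature": int(word in entries)
        for name, entries in gazetteers.items()
    }

# Toy lists; "purba" is deliberately placed in two of them.
toy_lists = {
    "location": {"kharagpur", "kolkata", "purba"},
    "first-name": {"purba", "suman"},
    "organization": {"microsoft"},
}

print(gazetteer_features("purba", toy_lists))
# {'location-gazetteer-feature': 1, 'first-name-gazetteer-feature': 1,
#  'organization-gazetteer-feature': 0}
```

An ambiguous word simply fires several features, and the MaxEnt model weighs them together with the contextual features instead of committing to the first match.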

21
Accuracy for Different Features

Accuracies of the MaxEnt-based NER system for
different feature sets, without using gazetteer
information. All values are f-values.

The highest accuracy obtained is an f-value of 75.89.
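
The slides do not define the f-value. Assuming it is the usual balanced F-measure over precision P and recall R, it is computed as:

```latex
F = \frac{2 \, P \, R}{P + R}
```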

22
Accuracy for Different Gazetteers

Improvement in accuracy after adding gazetteer
features
Each individual gazetteer increases the accuracy
of its corresponding class

23
Accuracy: MaxEnt System with Gazetteers
The transliterated gazetteers increase accuracy.
The highest system accuracy is an f-value of
81.12.
24
References

Andrew Borthwick. 1999. A Maximum Entropy
Approach to Named Entity Recognition. Ph.D.
thesis, Computer Science Department, New York
University.
Daniel M. Bikel, Scott Miller, Richard Schwartz,
and Ralph Weischedel. 1997. Nymble: a
High-Performance Learning Name-finder. In
Proceedings of the Fifth Conference on Applied
Natural Language Processing, pages 194-201.
Asif Ekbal, S. Naskar, and S. Bandyopadhyay. 2006.
A Modified Joint Source Channel Model for
Transliteration. In Proceedings of COLING/ACL
2006, Australia, pages 191-198.
Ralph Grishman. 1995. Where's the Syntax? The New
York University MUC-6 System. In Proceedings of
the Sixth Message Understanding Conference.
R. Srihari, C. Niu, and W. Li. 2000. A Hybrid
Approach for Named Entity and Sub-Type Tagging. In
Proceedings of the Sixth Conference on Applied
Natural Language Processing.
T. Wakao, R. Gaizauskas, and Y. Wilks. 1996.
Evaluation of an Algorithm for the Recognition
and Classification of Proper Names. In
Proceedings of COLING-96.