Named Entity Recognition and Transliteration for 50 Languages - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Named Entity Recognition and Transliteration for 50 Languages


1
Named Entity Recognition and Transliteration for
50 Languages
  • Richard Sproat, Dan Roth, ChengXiang Zhai,
    Elabbas Benmamoun,
  • Andrew Fister, Nadia Karlinsky, Alex Klementiev,
    Chongwon Park,
  • Vasin Punyakanok, Tao Tao, Su-youn Yoon
  • University of Illinois at Urbana-Champaign
  • http://compling.ai.uiuc.edu/reflex
  • The Second Midwest Computational Linguistics
    Colloquium
  • (MCLC-2005)
  • May 14-15
  • The Ohio State University

2
General Goals
  • Develop multilingual named entity recognition
    technology, with a focus on persons, places, and
    organizations
  • Produce seed rules and (small) corpora for
    several LCTLs (Less Commonly Taught Languages)
  • Develop methods for automatic named entity
    transliteration
  • Develop methods for tracking names in comparable
    corpora

3
Languages
  • Languages for seed rules: Chinese, English,
    Spanish, Arabic, Hindi, Portuguese, Russian,
    Japanese, German, Marathi, French, Korean, Urdu,
    Italian, Turkish, Thai, Polish, Farsi, Hausa,
    Burmese, Sindhi, Yoruba, Serbo-Croatian, Pashto,
    Amharic, Indonesian, Tagalog, Hungarian, Greek,
    Czech, Swahili, Somali, Zulu, Bulgarian, Quechua,
    Berber, Lingala, Catalan, Mongolian, Danish,
    Hebrew, Kashmiri, Norwegian, Wolof, Bamanankan,
    Twi, Basque.
  • Languages for (small) corpora: Chinese, Arabic,
    Hindi, Marathi, Thai, Farsi, Amharic, Indonesian,
    Swahili, Quechua.

4
Milestones
  • Resources for various languages
  • NER seed rules for Armenian, Persian, Swahili,
    Zulu, Hindi, Russian, Thai
  • Tagged corpora for Chinese, Arabic, Korean
  • Small tagged corpora for Armenian, Persian,
    Russian (10-20K words)
  • Named Entity recognition technology
  • Ported NER technology from English to Chinese,
    Arabic, Russian and German
  • Name transliteration: Chinese-English,
    Arabic-English, Korean-English

5
Linguistic/Orthographic Issues
  • Capitalization
  • Word boundaries
  • Phonetic vs. orthographic issues in
    transliteration

6
Named Entity Recognition
7
Multi-lingual Text Annotator
Annotate any word in a sentence by selecting the
word and an available category. It's also
possible to create new categories.
http://l2r.cs.uiuc.edu/cogcomp/ner_applet.php
8
Multi-lingual Text Annotator
View text in other encodings. New language
encodings are easily added through a simple
text-file mapping.
http://l2r.cs.uiuc.edu/cogcomp/ner_applet.php
9
Motivation for Seed Rules
  • The only supervision is in the form of 7 seed
    rules (namely, that New York, California and U.S.
    are locations; that any name containing Mr. is a
    person; that any name containing Incorporated is
    an organization; and that I.B.M. and Microsoft
    are organizations).
  • Collins and Singer, 1999

10
Seed Rules Thai
  • Something including and to the right of ??? is
    likely to be a person
  • Something including and to the right of ??? is
    likely to be a person
  • Something including and to the right of ?????? is
    likely to be a person
  • Something including and to the right of ?.?. is
    likely to be a person
  • Something including and to the right of ??? is
    likely to be a person
  • Something including and to the right of ???????? is
    likely to be a person
  • Something including and to the right of ?.?. is
    likely to be a person
  • Something including and to the right of ?.?.?. is
    likely to be a person
  • Something including and to the right of ??.?.?. is
    likely to be a person
  • Something including and to the right of ??.?.?. is
    likely to be a person
  • Something including and to the right of ??.?.?. is
    likely to be a person
  • Something including and to the right of ?.?. is
    likely to be a person
  • ?????? ??????? is a person
  • ?????? is likely a person
  • ??? ??????? is a person
  • ?????? ???????? is a person

11
Seed Rules Thai
  • Something including and in between ?????? and
    ????? is likely to be an organization
  • Something including and to the right of ???. is
    likely to be an organization
  • Something including and in between ?????? and
    ????? (?????) is likely to be an organization
  • Something including and in between ???. and
    (?????) is likely to be an organization
  • Something including and to the right of
    ????????????????? is likely to be an organization
  • Something including and to the right of ???. is
    likely to be an organization
  • ????????????????? is an organization
  • ??????? is an organization
  • ??????? is an organization
  • ??????? ?????? is an organization
  • ???????????????? is an organization
  • ??????????? is an organization
  • Something including and to the right of ???????
    is likely to be a location
  • Something including and to the right of ?. is
    likely to be a location
  • Something including and to the right of ????? is
    likely to be a location
  • Something including and to the right of ???? is
    likely to be a location
  • ????????????? is a location
  • ????????? is a location
  • ???????? is a location
  • ??????? is a location

12
Seed Rules Armenian
  • CityName CapWord  ????? ??????????
  • StateName CapWord ??????
  • CountryName1 CapWord ?????
  • PersonName1 TITLE? FirstName? LastName 
  • LastName ?-?.???
  • FirstName FirstName1 FirstName2
  • FirstName1 ?-?\.
  • FirstName2 ?-?.
  •   PersonNameForeign TITLE FirstName?
    CapWord? CapWord PersonAny PersonName1
    PersonNameForeign
  •  

13
Armenian Lexicon
  • Lexicon GEODESC
  • ?????????
  • ?????????
  • Lexicon PLACEDESC
  • ??????
  • ?????
  • Lexicon ORGDESC
  • ?????????
  • ?????
  • Lexicon COMPDESC
  • ???????????????
  • ????????????
  • Lexicon TITLE
  • ?????
  • ???

14
Seed Rules Persian
  • Lexicon TITLE?????????????????????????
  • Lexicon OrgDesc?????????????????????????????
    ?????
  • Lexicon POSITION???? ?????????
    ????????????????????
  • Descriptors for named entitiesLexicon
    PerDesc?????????Lexicon CityDesc????????????
    ?Lexicon CountryDesc????

15
Seed Rules Swahili
  • People Rules
  • Something including and to the right of Bw. is
    likely to be a person.
  • Something including and to the right of Bi. is
    likely to be a person.
  • A capitalized word to the right of bwana,
    together with the word bwana, is likely to be a
    person.
  • A capitalized word to the right of bibi, together
    with the word bibi, is likely to designate a
    person.
  • Place Rules
  • A capitalized word to the right of a word ending
    in -jini is likely to be a place.
  • A capitalized word starting with the letter U is
    likely to be a place.
  • A word ending in -ni is likely to be a place.
  • A sequence of words including and following the
    capitalized word Uwanja is likely a place.
    (See the sketch below.)
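To make the rule format concrete, the sketch below shows how a few of the Swahili person and place rules could be written as regular-expression patterns. The encodings, the sample sentence, and the label names are illustrative assumptions, not the project's actual rule files.

```python
import re

# Illustrative encodings of a few Swahili seed rules (assumed format).
SEED_RULES = [
    # Honorific Bw. or Bi. followed by capitalized words -> person
    (re.compile(r"\b(?:Bw\.|Bi\.)\s+[A-Z]\w+(?:\s+[A-Z]\w+)*"), "PER"),
    # bwana/bibi plus the following capitalized word (trigger included) -> person
    (re.compile(r"\b(?:bwana|bibi)\s+[A-Z]\w+"), "PER"),
    # A capitalized word starting with U -> place
    (re.compile(r"\bU\w+"), "LOC"),
]

def apply_seed_rules(text):
    """Return (matched span, label) pairs proposed by the seed rules."""
    return [(m.group(0), label)
            for pattern, label in SEED_RULES
            for m in pattern.finditer(text)]

# Hypothetical sentence for illustration.
print(apply_seed_rules("Bw. Mkapa alizungumza na bibi Amina huko Unguja."))
```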

16
Named Entity Recognition
  • Identify entities of specific types in text (e.g.
    people, locations, dates, organizations, etc.)
  • After receiving his M.B.A. from [ORG Harvard
    Business School], [PER Richard F. America]
    accepted a faculty position at the [ORG McDonough
    School of Business] in [LOC Washington].

17
Named Entity Recognition
  • Not an easy problem, since entities
  • Are inherently ambiguous (e.g. JFK can be both a
    location and a person depending on the context)
  • Can appear in various forms (e.g. abbreviations)
  • Can be nested, etc.
  • Are too numerous and constantly evolving

(cf. Baayen, H. 2000. Word Frequency
Distributions. Kluwer, Dordrecht.)
18
Named Entity Recognition
  • Two tasks (sometimes done simultaneously; see the
    sketch below)
  • Identify the named entity phrase boundaries
    (segmentation)
  • May need to respect constraints
  • Phrases do not overlap
  • Phrase order
  • Phrase length
  • Classify the phrases (classification)
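One common way to encode both tasks at once is BIO tagging: the segmentation is carried by the B-/I-/O prefixes and the classification by the type suffix. The sketch below is a made-up example of that scheme, not output of the system described here.

```python
# Made-up BIO-tagged sentence: B- starts a phrase, I- continues it, O is
# outside; the suffix (PER/ORG/LOC) carries the classification.
tokens = ["Richard", "F.", "America", "joined", "McDonough", "School",
          "of", "Business", "in", "Washington", "."]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG",
        "I-ORG", "I-ORG", "O", "B-LOC", "O"]

def bio_to_spans(tokens, tags):
    """Recover (phrase, label) pairs from a BIO-tagged token sequence."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

print(bio_to_spans(tokens, tags))
# [('Richard F. America', 'PER'), ('McDonough School of Business', 'ORG'),
#  ('Washington', 'LOC')]
```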

19
Identifying phrase properties with sequential
constraints
  • View as an inference-with-classifiers problem.
    Three models [Punyakanok & Roth, NIPS'01]:
    http://l2r.cs.uiuc.edu/danr/Papers/iwclong.pdf
  • HMMs
  • HMM with classifiers
  • Conditional Models
  • Projection based Markov model
  • Constraint Satisfaction Models
  • Constraint satisfaction with classifiers
  • Other models proposed
  • CRF
  • Structured perceptron
  • A model comparison in the context of the SRL
    problem [Punyakanok et al., IJCAI'05]

Most common
20
Adaptation
  • Most approaches in NER are targeted toward a
    specific setting: language, subject, set of tags,
    etc.
  • Labeled data may be hard to acquire for each
    particular setting
  • Trained classifiers tend to be brittle when moved
    even just to a related subject
  • We consider the problem of exploiting the
    hypothesis we learned in one setting to improve
    learning in another.
  • Kinds of adaptation that can be considered
  • Across corpora within a domain
  • Across domains
  • Across annotation methodologies
  • Across languages

21
Adaptation Example
Starting with a Reuters classifier is better than
starting from scratch
  • Train on
  • Reuters + increasing amounts of NYT
  • No Reuters, just increasing amounts of NYT
  • Test on NYT
  • Performance on NYT increases quickly as the
    classifier is trained on examples from NYT
  • Starting with an existing classifier trained on a
    related corpus is better than starting from
    scratch

[Chart legends: trained on Reuters + NYT, tested on NYT; trained on
Reuters only, tested on NYT]
22
Current Architecture - Training
Annotated Corpus
  • Pre-process annotated corpus
  • Extract features
  • Train classifier

[Diagram inputs: honorifics, feature-extraction script, gazetteers;
items in italics are setting-specific and optional]
23
Current Architecture - Tagging
Corpus
  • Pre-process corpus
  • Extract features
  • Run NER

[Diagram inputs: honorifics, feature-extraction script, gazetteers,
network file; output is an annotated corpus]
24
Extending Current Architecture to Multiple
Settings
[Diagram: input corpus drawn from one of several settings, e.g. Chinese
newswire, German biological text, English news]
  • Choose setting
  • Pre-process, extract features and run NER

[Diagram: FEX feature extractor feeding the SNoW-based NER]
25
Extending Current Architecture to Multiple
Settings Issues
  • For each setting, we need
  • Honorifics and gazetteers
  • Tuned sentence and word splitters
  • Types of features
  • Tagged training corpus
  • Work is being done to move tags across parallel
    corpora (if available)

26
Extending Current Architecture to Multiple
Settings Issues
  • If parallel corpora are available and one is
    annotated, we may be able to use Stochastic
    Inversion Transduction Grammars to move tags
    across corpora [Wu, Computational Linguistics,
    1997]
  • Generate bilingual parses of the (annotated and
    unannotated) parallel corpora
  • Use ITGs as a filter to deem sentence/phrase
    pairs as parallel enough
  • For those that are, simply move the label from
    the annotated to the unannotated phrase in the
    same parse-tree node
  • Use the now-tagged examples as a training corpus
    (see the sketch below)
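A minimal sketch of the label-transfer step alone, assuming the bilingual parsing and filtering have already produced phrase/word alignments; the sentence pair, alignment indices, and names are illustrative, not the actual ITG machinery.

```python
# Project NE labels from an annotated sentence onto its unannotated
# parallel sentence, given alignments judged "parallel enough"
# (here faked as index pairs).
annotated = {"tokens": ["Bush", "visited", "Paris"],
             "labels": ["B-PER", "O", "B-LOC"]}
unannotated_tokens = ["Bush", "besuchte", "Paris"]
alignment = [(0, 0), (1, 1), (2, 2)]   # (annotated index, unannotated index)

def project_labels(src, tgt_tokens, alignment):
    """Copy each source label onto its aligned target token."""
    tgt_labels = ["O"] * len(tgt_tokens)
    for i, j in alignment:
        tgt_labels[j] = src["labels"][i]
    return list(zip(tgt_tokens, tgt_labels))

# The projected pairs can then be added to the training corpus.
print(project_labels(annotated, unannotated_tokens, alignment))
```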

27
Extending Current Architecture to Multiple
Settings
  • Baseline experiments with Arabic, German, and
    Russian
  • E.g., for Russian with no honorifics or gazetteers,
    features tuned for English, and an imperfect
    sentence splitter, we still get about 77%
    precision and 36% recall
  • N.B.: used a small hand-constructed corpus of
    approx. 15K words, 1,300 NEs (80/20 split)

28
Summary
  • Seed rules and corpora for a subset of the 50
    languages
  • Adapted the NER system for English to other
    languages
  • Demonstrated adaptation of the NER system to
    other settings
  • Experimenting with ITG as a basis for annotation
    transplantation

29
Methods of Transliteration
30
Comparable Corpora
[Chinese newswire text; characters not preserved in this transcript]
In the day's other matches, second seed Zhou Mi
overwhelmed Ling Wan Ting of Hong Kong, China
11-4, 11-4, Zhang Ning defeated Judith Meulendijks
of the Netherlands 11-2, 11-9 and third seed Gong
Ruina took 21 minutes to eliminate Tine
Rasmussen of Denmark 11-1, 11-1, enabling China
to claim five quarterfinal places in the women's
singles.
31
Transliteration in Comparable Corpora
  • Take the newspapers for a day in any set of
    languages; a lot of them will have names in
    common.
  • Given a name in one language, find its
    transliteration in a similar text in another
    language.
  • How can we make use of
  • Linguistic factors such as similar pronunciations
  • Distributional factors
  • Right now we use partly supervised methods (e.g.
    we assume small training dictionaries)
  • We are aiming for largely unsupervised methods
    (in particular, no training dictionary)

32
Some Comparable Corpora
  • We have (from the LDC) comparable text corpora
    for
  • English (19M words)
  • Chinese (22M characters)
  • Arabic (8M words)
  • Many more such corpora can, in principle, be
    collected from the web

33
How Chinese Transliteration Works
  • About 500 characters tend to be used for foreign
    words
  • Attempt to mimic the pronunciation
  • But lots of alternative ways of doing it

34
Transliteration Problem
  • Many applications of transliteration have been in
    machine translation [Knight & Graehl, 1998;
    Al-Onaizan & Knight, 2002; Gao, 2004]
  • What's the best translation of this Chinese name?
  • Our problem is slightly different
  • Are these two names the same?
  • Want to be able to reject correspondences
  • Assign zero probability to some cases unseen in
    the training data

35
Approaches to Transliteration
  • Much work using the source-channel approach
  • Cast as a problem where you have a clean
    source (e.g. a Chinese name) and a noisy
    channel that corrupts the source into the
    observed form (e.g. an English name)
  • P(E|C)P(C) (expanded below)
  • E.g. P(f_{i,E} f_{i+1,E} f_{i+2,E} ... f_{i+n,E} | s_C)
  • Chinese characters represent syllables (s); we
    match these to sequences of English phonemes (f)
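Spelled out, the source-channel reading rests on the usual Bayes decomposition; this is the standard formulation, stated for clarity rather than taken from the slides:

```latex
% Recover the clean source C (e.g. a Chinese name) from the observed,
% channel-corrupted form E (e.g. an English name):
\hat{C} = \arg\max_{C} P(C \mid E) = \arg\max_{C} P(E \mid C)\, P(C)
```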

36
Resources
  • Small dictionary of 721 (mostly English) names
    and their Chinese transliterations
  • Large dictionary of about 1.6 million names from
    LDC

37
General Approach
  • Train a tight transliteration model from a
    dictionary of known transliterations
  • Identify names in English news text for a given
    day using an existing named entity recognizer
  • Process same day of Chinese text looking for
    sequences of characters used in foreign names
  • Do an all-pairs match using the transliteration
    model to find possible transliteration pairs (see
    the sketch below)
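A minimal sketch of that all-pairs matching step; translit_score, the candidate lists, and the threshold are placeholders standing in for the trained model and the extraction steps above.

```python
def find_transliteration_pairs(english_names, chinese_candidates,
                               translit_score, threshold=0.5):
    """Score every (English name, Chinese candidate) pair for one day of
    news and keep the pairs the transliteration model finds plausible."""
    pairs = []
    for e in english_names:
        for c in chinese_candidates:
            score = translit_score(e, c)   # e.g. a normalized log P(e | c)
            if score >= threshold:         # reject implausible correspondences
                pairs.append((e, c, score))
    return sorted(pairs, key=lambda t: -t[2])
```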

38
Model Estimation
  • Seek to estimate P(e|c), where e is a sequence of
    words in Roman script and c is a sequence of
    Chinese characters
  • We actually estimate P(e'|c'), where e' is the
    pronunciation of e and c' is the pronunciation of
    c.
  • We decompose the estimate of P(e'|c') as shown
    below
  • Chinese transliteration matches syllables to
    similar-sounding spans of foreign phones, so the
    c'_i are syllables and the e'_i are subsequences
    of the English phone string
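A plausible form of that decomposition, assuming each Chinese syllable generates its span of English phones independently once the phone string has been segmented:

```latex
P(e' \mid c') \;\approx\; \prod_{i=1}^{n} P(e'_i \mid c'_i)
```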

39
Model Estimation
  • Align phone strings using a modified
    Sankoff/Kruskal algorithm
  • For each Chinese syllable s, allow an English
    phone string f to correspond just in case the
    initial of s corresponds to the initial of f some
    minimum number of times in training
  • Smooth probabilities using Good-Turing
  • Distribute unseen probability mass over unseen
    cases non-uniformly according to a weighting
    scheme

40
Model Estimation
  • We estimate the probability for a given unseen
    case as follows (see below), where
  • P(n0) is the probability of unseen cases
    according to Good-Turing smoothing
  • P(len(e)=m | len(c)=n) is the probability of a
    Chinese syllable of length n corresponding to an
    English phone sequence of length m
  • count(len(e)=m) is the type count of phone
    sequences of length m (estimated from 194,000
    pronunciations produced by the Festival TTS
    system on the XTAG dictionary)
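Combining the quantities just listed, one reconstruction of the weighting, assuming the unseen mass is shared in proportion to the length-correspondence probability and split evenly among the phone-sequence types of that length, is:

```latex
% For an unseen pair with len(e) = m and len(c) = n
% (reconstruction under the stated assumptions):
P(e \mid c) \;=\; P(n_0)\,
  \frac{P\bigl(\mathrm{len}(e)=m \mid \mathrm{len}(c)=n\bigr)}
       {\mathrm{count}\bigl(\mathrm{len}(e)=m\bigr)}
```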

41
Some Automatically Found Pairs
Pairs found in same day of newswire text
42
Further Pairs
43
Time Correlations
  • When some major event happens (e.g., the tsunami
    disaster), it is very likely covered by news
    articles in multiple languages
  • Each event/topic tends to have its own
    associated vocabulary (e.g., names such as Sri
    Lanka, India may occur in recent news articles)
  • We thus will likely see that the frequency of a
    name such as Sri Lanka will peak as compared with
    other time periods, and the pattern is likely the
    same across languages (see the sketch below)
  • cf. Kay and Roscheisen, CL, 1993; Kupiec, ACL,
    1993; Rapp, ACL, 1995; Fung, WVLC, 1995
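A minimal sketch of the correlation computation, assuming daily frequency vectors have already been extracted for a name in each language; the counts below are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length frequency vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical daily frequencies of "Sri Lanka" in English news and of one
# candidate Chinese transliteration over the same ten days.
english_freq = [0, 1, 0, 2, 25, 30, 18, 9, 4, 2]
chinese_freq = [1, 0, 0, 3, 22, 27, 15, 8, 5, 1]
print(round(pearson(english_freq, chinese_freq), 3))   # high correlation
```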

44
Construct Term Distributions over Time
45
Measure Correlations of English and Chinese Word
Pairs
Bad correlation: corr = 0.0324
Good correlation: corr = 0.885
46
Chinese Transliteration
English term: Edmonton
[Diagram: candidate Chinese names extracted from the day's Chinese
documents and ranked with scores, e.g. 0.96, 0.91, 0.88, 0.75]
Rank Candidates
  • Methods
  • Phonetic approach
  • Frequency correlation
  • Combination

47
Evaluation
English term: Edmonton
MRR: Mean Reciprocal Rank
AllMRR: evaluation over all English names
CoreMRR: evaluation over just the names with a
found Chinese correspondence
(See the sketch below.)
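A minimal sketch of the MRR metric itself; the ranked candidate lists and gold transliterations are placeholders, not the actual evaluation data.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct transliteration over all test names."""
    total = 0.0
    for name, candidates in ranked_candidates.items():
        rank = next((i + 1 for i, c in enumerate(candidates)
                     if c == gold.get(name)), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_candidates)

ranked = {"Edmonton": ["cand1", "cand2", "cand3"], "Paris": ["cand4", "cand5"]}
gold = {"Edmonton": "cand2", "Paris": "cand4"}
print(mean_reciprocal_rank(ranked, gold))   # (1/2 + 1/1) / 2 = 0.75
```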
48
Summary and Future Work
  • So far
  • Phonetic transliteration models
  • Time correlation between name distributions
  • Work in progress
  • Linguistic models
  • Develop graphical model approach to
    transliteration
  • Semantic aspects of transliteration in Chinese:
    female names ending in -ia transliterated with ?
    ya rather than ? ya
  • Resource-poor transliteration for any pair of
    languages
  • Document alignment
  • Coordinated mixture models for document/word-level
    alignment

49
Graphical Models [Bilmes & Zweig, 2002]
50
Semantic Aspects of Transliteration
  • Phonological model doesn't capture
    semantic/orthographic features of
    transliteration
  • Saint, San, Sao use ? sheng 'holy'
  • Female names ending in -ia transliterated with ?
    ya rather than ? ya
  • Such information boosts evidence that two strings
    are transliterations of each other
  • Consider gender. For each character c
  • compute the log-likelihood ratio
    abs(log(P(f|c)/P(m|c)))
  • build a decision list ranked by decreasing LLR
    (see the sketch below)
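A minimal sketch of building such a decision list, assuming we already have per-character counts from known female and male transliterated names; the counts and the add-0.5 smoothing are illustrative choices.

```python
from math import log

# Hypothetical character counts in known female vs. male transliterated names.
female_counts = {"A": 120, "B": 3, "C": 40}
male_counts   = {"A": 5,   "B": 90, "C": 35}

def gender_decision_list(female_counts, male_counts, smoothing=0.5):
    entries = []
    for ch in set(female_counts) | set(male_counts):
        f = female_counts.get(ch, 0) + smoothing
        m = male_counts.get(ch, 0) + smoothing
        p_f = f / (f + m)                 # P(female | character)
        p_m = m / (f + m)                 # P(male | character)
        llr = abs(log(p_f / p_m))         # abs(log(P(f|c)/P(m|c)))
        entries.append((llr, ch, "female" if p_f > p_m else "male"))
    return sorted(entries, reverse=True)  # ranked by decreasing LLR

for llr, ch, label in gender_decision_list(female_counts, male_counts):
    print(f"{llr:.2f}\t{ch}\t{label}")
```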

51
Decision List for Gender Classification
41.5833898566  ?  male
40.8357601821  ?  female
39.064753687   ?  female
35.8207097407  ?  male
34.7980589928  ?  female
34.4008926875  ?  female
33.9871287766  ?  female
33.9225902555  ?  male
26.945842105   ?  male
52
Document Alignment
Basic idea: sum up correlations of all e-c pairs.
Use these to find documents paired by relevance.
Method 1: Expected correlation (ExpCorr)
[Diagram: English document E with words e1 ... e|E| linked to Chinese
document C with words c1 ... c|C|]
Matching two rare words is more surprising, so it
should count more.
Method 2: IDF-weighted correlation (IDFCorr)
Repeated occurrences of a word contribute less
than the first few occurrences.
IDF (for Inverse Doc Freq) penalizes common words.
Method 3: BM25 weighting (TF-IDF)
BM25 is a typical retrieval weighting function
(see the sketch below).
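A minimal sketch of the shared idea behind these scores: sum (optionally IDF-weighted) word-pair correlations over a document pair and rank the Chinese documents by the total. The function names and weighting are illustrative, not the exact ExpCorr, IDFCorr, or BM25 formulas.

```python
def doc_alignment_score(english_doc, chinese_doc, corr, idf=None):
    """Sum correlations over all e-c word pairs, optionally IDF-weighted so
    that matches between rare words count more."""
    score = 0.0
    for e in set(english_doc):
        for c in set(chinese_doc):
            weight = (idf.get(e, 1.0) * idf.get(c, 1.0)) if idf else 1.0
            score += weight * corr(e, c)
    return score

def rank_chinese_docs(english_doc, chinese_docs, corr, idf=None):
    """Rank candidate Chinese documents by their alignment score."""
    scored = [(doc_alignment_score(english_doc, d, corr, idf), i)
              for i, d in enumerate(chinese_docs)]
    return sorted(scored, reverse=True)
```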
53
Document Alignment Evaluation
Method 3: TF-IDF
  • Randomly pick 6 English documents
  • Retrieve 50 Chinese documents (out of approx.
    900) for each English document
  • Rank the 300 E-C pairs by each of 3 methods
  • Evaluate the relevance by standard precision
    metric

Method 2: IDF
Method 1: ExpCorr
About 80% of the top 100 pairs of documents are
correct.
54
Summary and Some Ongoing Work
  • Some seed rules and corpora; more in progress
  • NER techniques being adapted to other languages
  • Investigating ITG for annotation transplantation
  • What features to use for various languages
  • Combined phonetic and temporal information in
    transliteration
  • Semantic/orthographic aspects of transliteration
  • Resource poor transliteration
  • Document alignment
  • Coordinated mixture models

55
Acknowledgments
  • National Security Agency Contract NBCHC040176,
    REFLEX (Research on English and Foreign Language
    Exploitation)
  • Language experts thus far
  • Karine Megerdoomian (Persian, Armenian)
  • Alla Rozovskaya (Russian, Hebrew)
  • Archna Bhatia (Hindi)
  • Brent Henderson (Swahili)
  • Tholani Hlongwa (Zulu)
  • Karen Livescu for much help with GMTK

56
Model Estimation
  • Training data was small (721 names), so smoothing
    is essential
  • Align using a small hand-derived set of rules,
    plus the alignment algorithm of Sankoff & Kruskal.
    Some sample rules:

57
Model Estimation
  • For an English phone span to correspond to a
    Chinese syllable, the initial phone of the
    English span must have been seen in the training
    data as corresponding to the initial of the
    Chinese syllable some minimum number of times
  • For consonant-initial syllables we set the number
    to 4
  • For vowel-initial syllables (since these tend to
    be more variable in their correspondences) we set
    the minimum at 1
  • We then compute P(e'_i | c'_i) for the seen cases,
    and smooth for unseen correspondences using
    Sampson's Simple Good-Turing.

58
High Correlation Word-Character Pairs
[Table: English word / Chinese character pairs with their correlation
scores, in two columns]
59
Top-Ranked Chinese Characters (swimming, Afghan)
60
Next Steps
  • Coordinated mixture models for word/document
    alignment

61
A Coordinated Mixture Model
Generating a Chinese doc at time t
[Diagram: for each day (Day 1, Day 2, ...), k coordinated themes, each
with paired English and Chinese word distributions, plus background
English and Chinese distributions over the two collections]
Applications of the model
62
Graphical Models [Bilmes & Zweig, 2002]
63
Coordinated Mixture Model
k coordinated themes
English word dist.
Date (or Alignment) distribution
Chinese word dist.
Swimming 0.04, Medal 0.02, Men 0.01
? 0.05, ? 0.03, ? 0.01
Alignments A1, A2, ..., AM
Day 1, Day 2, ...
2 non-coordinated themes
English word dist. capturing unaligned English
topics
English date dist. capturing uneven topic
coverage over time
Chinese word dist. capturing unaligned Chinese
topics
Chinese date dist. capturing uneven topic
coverage over time
64
Details of the Mixture Model
Coordinated mixture model
Lexical translation; document alignment