Named Entity Recognition and Transliteration for 50 Languages - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Named Entity Recognition and Transliteration for 50 Languages


1
Named Entity Recognition and Transliteration for
50 Languages
  • Richard Sproat, Dan Roth, ChengXiang Zhai,
    Elabbas Benmamoun,
  • Andrew Fister, Nadia Karlinsky, Alex Klementiev,
    Chongwon Park,
  • Vasin Punyakanok, Tao Tao, Su-youn Yoon
  • University of Illinois at Urbana-Champaign
  • http://compling.ai.uiuc.edu/reflex
  • The Second Midwest Computational Linguistics
    Colloquium
  • (MCLC-2005)
  • May 14-15
  • The Ohio State University

2
General Goals
  • Develop multilingual named entity recognition
    technology, with a focus on persons, places, and
    organizations
  • Produce seed rules and (small) corpora for
    several LCTLs (Less Commonly Taught Languages)
  • Develop methods for automatic named entity
    transliteration
  • Develop methods for tracking names in comparable
    corpora

3
Languages
  • Languages for seed rules: Chinese, English,
    Spanish, Arabic, Hindi, Portuguese, Russian,
    Japanese, German, Marathi, French, Korean, Urdu,
    Italian, Turkish, Thai, Polish, Farsi, Hausa,
    Burmese, Sindhi, Yoruba, Serbo-Croatian, Pashto,
    Amharic, Indonesian, Tagalog, Hungarian, Greek,
    Czech, Swahili, Somali, Zulu, Bulgarian, Quechua,
    Berber, Lingala, Catalan, Mongolian, Danish,
    Hebrew, Kashmiri, Norwegian, Wolof, Bamanankan,
    Twi, Basque.
  • Languages for (small) corpora: Chinese, Arabic,
    Hindi, Marathi, Thai, Farsi, Amharic, Indonesian,
    Swahili, Quechua.

4
Milestones
  • Resources for various languages
  • NER seed rules for Armenian, Persian, Swahili,
    Zulu, Hindi, Russian, Thai
  • Tagged corpora for Chinese, Arabic, Korean
  • Small tagged corpora for Armenian, Persian,
    Russian (10-20K words)
  • Named Entity recognition technology
  • Ported NER technology from English to Chinese,
    Arabic, Russian and German
  • Name transliteration: Chinese-English,
    Arabic-English, Korean-English

5
Linguistic/Orthographic Issues
  • Capitalization
  • Word boundaries
  • Phonetic vs. orthographic issues in
    transliteration

6
Named Entity Recognition
7
Multi-lingual Text Annotator
Annotate any word in a sentence by selecting the
word and an available category. It's also
possible to create new categories.
http://l2r.cs.uiuc.edu/cogcomp/ner_applet.php
8
Multi-lingual Text Annotator
View text in other encodings. New language
encodings are easily added through a simple
text-file mapping.
http://l2r.cs.uiuc.edu/cogcomp/ner_applet.php
9
Motivation for Seed Rules
  • The only supervision is in the form of 7 seed
    rules (namely, that New York, California and U.S.
    are locations; that any name containing Mr. is a
    person; that any name containing Incorporated is
    an organization; and that I.B.M. and Microsoft
    are organizations).
  • Collins and Singer, 1999

10
Seed Rules Thai
  • Something including and to the right of ??? is
    likely to be a person
  • Something including and to the right of ??? is
    likely to be a person
  • Something including and to the right of ?????? is
    likely to be a person
  • Something including and to the right of ?.?. is
    likely to be a person
  • Something including and to the right of ??? is
    likely to be a person
  • Something including and to the right of ???????? is
    likely to be a person
  • Something including and to the right of ?.?. is
    likely to be a person
  • Something including and to the right of ?.?.?. is
    likely to be a person
  • Something including and to the right of ??.?.?. is
    likely to be a person
  • Something including and to the right of ??.?.?. is
    likely to be a person
  • Something including and to the right of ??.?.?. is
    likely to be a person
  • Something including and to the right of ?.?. is
    likely to be a person
  • ?????? ??????? is a person
  • ?????? is likely a person
  • ??? ??????? is a person
  • ?????? ???????? is a person

11
Seed Rules Thai
  • Something including and in between ?????? and
    ????? is likely to be an organization
  • Something including and to the right of ???. is
    likely to be an organization
  • Something including and in between ?????? and
    ????? (?????) is likely to be an organization
  • Something including and in between ???. and
    (?????) is likely to be an organization
  • Something including and to the right of
    ????????????????? is likely to be an organization
  • Something including and to the right of ???. is
    likely to be an organization
  • ????????????????? is an organization
  • ??????? is an organization
  • ??????? is an organization
  • ??????? ?????? is an organization
  • ???????????????? is an organization
  • ??????????? is an organization
  • Something including and to the right of ???????
    is likely to be a location
  • Something including and to the right of ?. is
    likely to be a location
  • Something including and to the right of ????? is
    likely to be a location
  • Something including and to the right of ???? is
    likely to be a location
  • ????????????? is a location
  • ????????? is a location
  • ???????? is a location
  • ??????? is a location

12
Seed Rules Armenian
  • CityName CapWord  ????? ??????????
  • StateName CapWord ??????
  • CountryName1 CapWord ?????
  • PersonName1 TITLE? FirstName? LastName 
  • LastName ?-?.???
  • FirstName FirstName1 FirstName2
  • FirstName1 ?-?\.
  • FirstName2 ?-?.
  •   PersonNameForeign TITLE FirstName?
    CapWord? CapWord PersonAny PersonName1
    PersonNameForeign
  •  

13
Armenian Lexicon
  • Lexicon GEODESC
  • ?????????
  • ?????????
  • Lexicon PLACEDESC
  • ??????
  • ?????
  • Lexicon ORGDESC
  • ?????????
  • ?????
  • Lexicon COMPDESC
  • ???????????????
  • ????????????
  • Lexicon TITLE
  • ?????
  • ???

14
Seed Rules Persian
  • Lexicon TITLE?????????????????????????
  • Lexicon OrgDesc?????????????????????????????
    ?????
  • Lexicon POSITION???? ?????????
    ????????????????????
  • Descriptors for named entitiesLexicon
    PerDesc?????????Lexicon CityDesc????????????
    ?Lexicon CountryDesc????

15
Seed Rules Swahili
  • People Rules
  • Something including and to the right of Bw. is
    likely to be a person.
  • Something including and to the right of Bi. is
    likely to be a person.
  • A capitalized word to the right of bwana,
    together with the word bwana, is likely to be a
    person.
  • A capitalized word to the right of bibi, together
    with the word bibi, is likely to designate a
    person.
  • Place Rules
  • A capitalized word to the right of a word ending
    in -jini is likely to be a place.
  • A capitalized word starting with the letter U is
    likely to be a place.
  • A word ending in -ni is likely to be a place.
  • A sequence of words including and following the
    capitalized word Uwanja is likely a place.
    (See the sketch below.)
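To make the rule format concrete, the sketch below shows how a few of the Swahili person and place rules could be written as regular-expression patterns. The encodings, the sample sentence, and the label names are illustrative assumptions, not the project's actual rule files.

```python
import re

# Illustrative encodings of a few Swahili seed rules (assumed format).
SEED_RULES = [
    # Honorific Bw. or Bi. followed by capitalized words -> person
    (re.compile(r"\b(?:Bw\.|Bi\.)\s+[A-Z]\w+(?:\s+[A-Z]\w+)*"), "PER"),
    # bwana/bibi plus the following capitalized word (trigger included) -> person
    (re.compile(r"\b(?:bwana|bibi)\s+[A-Z]\w+"), "PER"),
    # A capitalized word starting with U -> place
    (re.compile(r"\bU\w+"), "LOC"),
]

def apply_seed_rules(text):
    """Return (matched span, label) pairs proposed by the seed rules."""
    return [(m.group(0), label)
            for pattern, label in SEED_RULES
            for m in pattern.finditer(text)]

# Hypothetical sentence for illustration.
print(apply_seed_rules("Bw. Mkapa alizungumza na bibi Amina huko Unguja."))
```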

16
Named Entity Recognition
  • Identify entities of specific types in text (e.g.
    people, locations, dates, organizations, etc.)
  • After receiving his M.B.A. from [ORG Harvard
    Business School], [PER Richard F. America]
    accepted a faculty position at the [ORG McDonough
    School of Business] in [LOC Washington].

17
Named Entity Recognition
  • Not an easy problem, since entities
  • Are inherently ambiguous (e.g. JFK can be both a
    location and a person depending on the context)
  • Can appear in various forms (e.g. abbreviations)
  • Can be nested, etc.
  • Are too numerous and constantly evolving

(cf. Baayen, H. 2000. Word Frequency
Distributions. Kluwer, Dordrecht.)
18
Named Entity Recognition
  • Two tasks (sometimes done simultaneously; see the
    sketch below)
  • Identify the named entity phrase boundaries
    (segmentation)
  • May need to respect constraints
  • Phrases do not overlap
  • Phrase order
  • Phrase length
  • Classify the phrases (classification)
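One common way to encode both tasks at once is BIO tagging: the segmentation is carried by the B-/I-/O prefixes and the classification by the type suffix. The sketch below is a made-up example of that scheme, not output of the system described here.

```python
# Made-up BIO-tagged sentence: B- starts a phrase, I- continues it, O is
# outside; the suffix (PER/ORG/LOC) carries the classification.
tokens = ["Richard", "F.", "America", "joined", "McDonough", "School",
          "of", "Business", "in", "Washington", "."]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG",
        "I-ORG", "I-ORG", "O", "B-LOC", "O"]

def bio_to_spans(tokens, tags):
    """Recover (phrase, label) pairs from a BIO-tagged token sequence."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

print(bio_to_spans(tokens, tags))
# [('Richard F. America', 'PER'), ('McDonough School of Business', 'ORG'),
#  ('Washington', 'LOC')]
```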

19
Identifying phrase properties with sequential
constraints
  • View as an inference-with-classifiers problem.
    Three models [Punyakanok & Roth, NIPS'01]:
    http://l2r.cs.uiuc.edu/danr/Papers/iwclong.pdf
  • HMMs
  • HMM with classifiers
  • Conditional Models
  • Projection based Markov model
  • Constraint Satisfaction Models
  • Constraint satisfaction with classifiers
  • Other models proposed
  • CRF
  • Structured perceptron
  • A model comparison in the context of the SRL
    problem [Punyakanok et al., IJCAI'05]

Most common
20
Adaptation
  • Most approaches in NER are targeted toward a
    specific setting: language, subject, set of tags,
    etc.
  • Labeled data may be hard to acquire for each
    particular setting
  • Trained classifiers tend to be brittle when moved
    even just to a related subject
  • We consider the problem of exploiting the
    hypothesis we learned in one setting to improve
    learning in another.
  • Kinds of adaptation that can be considered
  • Across corpora within a domain
  • Across domains
  • Across annotation methodologies
  • Across languages

21
Adaptation Example
Starting with a Reuters classifier is better than
starting from scratch
  • Train on
  • Reuters + increasing amounts of NYT
  • No Reuters, just increasing amounts of NYT
  • Test on NYT
  • Performance on NYT increases quickly as the
    classifier is trained on examples from NYT
  • Starting with an existing classifier trained on a
    related corpus is better than starting from
    scratch

[Chart legends: trained on Reuters + NYT, tested on NYT; trained on
Reuters only, tested on NYT]
22
Current Architecture - Training
Annotated Corpus
  • Pre-process annotated corpus
  • Extract features
  • Train classifier

[Diagram inputs: honorifics, feature-extraction script, gazetteers;
items in italics are setting-specific and optional]
23
Current Architecture - Tagging
Corpus
  • Pre-process corpus
  • Extract features
  • Run NER

[Diagram inputs: honorifics, feature-extraction script, gazetteers,
network file; output is an annotated corpus]
24
Extending Current Architecture to Multiple
Settings
[Diagram: input corpus drawn from one of several settings, e.g. Chinese
newswire, German biological text, English news]
  • Choose setting
  • Pre-process, extract features and run NER

[Diagram: FEX feature extractor feeding the SNoW-based NER]
25
Extending Current Architecture to Multiple
Settings Issues
  • For each setting, we need
  • Honorifics and gazetteers
  • Tuned sentence and word splitters
  • Types of features
  • Tagged training corpus
  • Work is being done to move tags across parallel
    corpora (if available)

26
Extending Current Architecture to Multiple
Settings Issues
  • If parallel corpora are available and one is
    annotated, we may be able to use Stochastic
    Inversion Transduction Grammars to move tags
    across corpora [Wu, Computational Linguistics,
    1997]
  • Generate bilingual parses of the (annotated and
    unannotated) parallel corpora
  • Use ITGs as a filter to deem sentence/phrase
    pairs as parallel enough
  • For those that are, simply move the label from
    the annotated to the unannotated phrase in the
    same parse-tree node
  • Use the now-tagged examples as a training corpus
    (see the sketch below)
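A minimal sketch of the label-transfer step alone, assuming the bilingual parsing and filtering have already produced phrase/word alignments; the sentence pair, alignment indices, and names are illustrative, not the actual ITG machinery.

```python
# Project NE labels from an annotated sentence onto its unannotated
# parallel sentence, given alignments judged "parallel enough"
# (here faked as index pairs).
annotated = {"tokens": ["Bush", "visited", "Paris"],
             "labels": ["B-PER", "O", "B-LOC"]}
unannotated_tokens = ["Bush", "besuchte", "Paris"]
alignment = [(0, 0), (1, 1), (2, 2)]   # (annotated index, unannotated index)

def project_labels(src, tgt_tokens, alignment):
    """Copy each source label onto its aligned target token."""
    tgt_labels = ["O"] * len(tgt_tokens)
    for i, j in alignment:
        tgt_labels[j] = src["labels"][i]
    return list(zip(tgt_tokens, tgt_labels))

# The projected pairs can then be added to the training corpus.
print(project_labels(annotated, unannotated_tokens, alignment))
```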

27
Extending Current Architecture to Multiple
Settings
  • Baseline experiments with Arabic, German, and
    Russian
  • E.g., for Russian with no honorifics or gazetteers,
    features tuned for English, and an imperfect
    sentence splitter, we still get about 77%
    precision and 36% recall
  • N.B.: used a small hand-constructed corpus of
    approx. 15K words, 1,300 NEs (80/20 split)

28
Summary
  • Seed rules and corpora for a subset of the 50
    languages
  • Adapted the NER system for English to other
    languages
  • Demonstrated adaptation of the NER system to
    other settings
  • Experimenting with ITG as a basis for annotation
    transplantation

29
Methods of Transliteration
30
Comparable Corpora
[Chinese newswire text; characters not preserved in this transcript]
In the day's other matches, second seed Zhou Mi
overwhelmed Ling Wan Ting of Hong Kong, China
11-4, 11-4, Zhang Ning defeated Judith Meulendijks
of the Netherlands 11-2, 11-9 and third seed Gong
Ruina took 21 minutes to eliminate Tine
Rasmussen of Denmark 11-1, 11-1, enabling China
to claim five quarterfinal places in the women's
singles.
31
Transliteration in Comparable Corpora
  • Take the newspapers for a day in any set of
    languages; a lot of them will have names in
    common.
  • Given a name in one language, find its
    transliteration in a similar text in another
    language.
  • How can we make use of
  • Linguistic factors such as similar pronunciations
  • Distributional factors
  • Right now we use partly supervised methods (e.g.
    we assume small training dictionaries)
  • We are aiming for largely unsupervised methods
    (in particular, no training dictionary)

32
Some Comparable Corpora
  • We have (from the LDC) comparable text corpora
    for
  • English (19M words)
  • Chinese (22M characters)
  • Arabic (8M words)
  • Many more such corpora can, in principle, be
    collected from the web

33
How Chinese Transliteration Works
  • About 500 characters tend to be used for foreign
    words
  • Attempt to mimic the pronunciation
  • But lots of alternative ways of doing it

34
Transliteration Problem
  • Many applications of transliteration have been in
    machine translation [Knight & Graehl, 1998;
    Al-Onaizan & Knight, 2002; Gao, 2004]
  • What's the best translation of this Chinese name?
  • Our problem is slightly different
  • Are these two names the same?
  • Want to be able to reject correspondences
  • Assign zero probability to some cases unseen in
    the training data

35
Approaches to Transliteration
  • Much work using the source-channel approach
  • Cast as a problem where you have a clean
    source (e.g. a Chinese name) and a noisy
    channel that corrupts the source into the
    observed form (e.g. an English name)
  • P(E|C)P(C) (expanded below)
  • E.g. P(f_{i,E} f_{i+1,E} f_{i+2,E} ... f_{i+n,E} | s_C)
  • Chinese characters represent syllables (s); we
    match these to sequences of English phonemes (f)
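Spelled out, the source-channel reading rests on the usual Bayes decomposition; this is the standard formulation, stated for clarity rather than taken from the slides:

```latex
% Recover the clean source C (e.g. a Chinese name) from the observed,
% channel-corrupted form E (e.g. an English name):
\hat{C} = \arg\max_{C} P(C \mid E) = \arg\max_{C} P(E \mid C)\, P(C)
```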

36
Resources
  • Small dictionary of 721 (mostly English) names
    and their Chinese transliterations
  • Large dictionary of about 1.6 million names from
    LDC

37
General Approach
  • Train a tight transliteration model from a
    dictionary of known transliterations
  • Identify names in English news text for a given
    day using an existing named entity recognizer
  • Process same day of Chinese text looking for
    sequences of characters used in foreign names
  • Do an all-pairs match using the transliteration
    model to find possible transliteration pairs (see
    the sketch below)
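A minimal sketch of that all-pairs matching step; translit_score, the candidate lists, and the threshold are placeholders standing in for the trained model and the extraction steps above.

```python
def find_transliteration_pairs(english_names, chinese_candidates,
                               translit_score, threshold=0.5):
    """Score every (English name, Chinese candidate) pair for one day of
    news and keep the pairs the transliteration model finds plausible."""
    pairs = []
    for e in english_names:
        for c in chinese_candidates:
            score = translit_score(e, c)   # e.g. a normalized log P(e | c)
            if score >= threshold:         # reject implausible correspondences
                pairs.append((e, c, score))
    return sorted(pairs, key=lambda t: -t[2])
```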

38
Model Estimation
  • Seek to estimate P(e|c), where e is a sequence of
    words in Roman script and c is a sequence of
    Chinese characters
  • We actually estimate P(e'|c'), where e' is the
    pronunciation of e and c' is the pronunciation of
    c.
  • We decompose the estimate of P(e'|c') as shown
    below
  • Chinese transliteration matches syllables to
    similar-sounding spans of foreign phones, so the
    c'_i are syllables and the e'_i are subsequences
    of the English phone string
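A plausible form of that decomposition, assuming each Chinese syllable generates its span of English phones independently once the phone string has been segmented:

```latex
P(e' \mid c') \;\approx\; \prod_{i=1}^{n} P(e'_i \mid c'_i)
```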

39
Model Estimation
  • Align phone strings using a modified
    Sankoff/Kruskal algorithm
  • For each Chinese syllable s, allow an English
    phone string f to correspond just in case the
    initial of s corresponds to the initial of f some
    minimum number of times in training
  • Smooth probabilities using Good-Turing
  • Distribute unseen probability mass over unseen
    cases non-uniformly according to a weighting
    scheme

40
Model Estimation
  • We estimate the probability for a given unseen
    case as follows (see below), where
  • P(n0) is the probability of unseen cases
    according to Good-Turing smoothing
  • P(len(e)=m | len(c)=n) is the probability of a
    Chinese syllable of length n corresponding to an
    English phone sequence of length m
  • count(len(e)=m) is the type count of phone
    sequences of length m (estimated from 194,000
    pronunciations produced by the Festival TTS
    system on the XTAG dictionary)
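Combining the quantities just listed, one reconstruction of the weighting, assuming the unseen mass is shared in proportion to the length-correspondence probability and split evenly among the phone-sequence types of that length, is:

```latex
% For an unseen pair with len(e) = m and len(c) = n
% (reconstruction under the stated assumptions):
P(e \mid c) \;=\; P(n_0)\,
  \frac{P\bigl(\mathrm{len}(e)=m \mid \mathrm{len}(c)=n\bigr)}
       {\mathrm{count}\bigl(\mathrm{len}(e)=m\bigr)}
```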

41
Some Automatically Found Pairs
Pairs found in same day of newswire text
42
Further Pairs
43
Time Correlations
  • When some major event happens (e.g., the tsunami
    disaster), it is very likely covered by news
    articles in multiple languages
  • Each event/topic tends to have its own
    associated vocabulary (e.g., names such as Sri
    Lanka, India may occur in recent news articles)
  • We thus will likely see that the frequency of a
    name such as Sri Lanka will peak as compared with
    other time periods, and the pattern is likely the
    same across languages (see the sketch below)
  • cf. Kay and Roscheisen, CL, 1993; Kupiec, ACL,
    1993; Rapp, ACL, 1995; Fung, WVLC, 1995
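A minimal sketch of the correlation computation, assuming daily frequency vectors have already been extracted for a name in each language; the counts below are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length frequency vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical daily frequencies of "Sri Lanka" in English news and of one
# candidate Chinese transliteration over the same ten days.
english_freq = [0, 1, 0, 2, 25, 30, 18, 9, 4, 2]
chinese_freq = [1, 0, 0, 3, 22, 27, 15, 8, 5, 1]
print(round(pearson(english_freq, chinese_freq), 3))   # high correlation
```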

44
Construct Term Distributions over Time
45
Measure Correlations of English and Chinese Word
Pairs
Bad correlation: corr = 0.0324
Good correlation: corr = 0.885
46
Chinese Transliteration
English term: Edmonton
[Diagram: candidate Chinese names extracted from the day's Chinese
documents and ranked with scores, e.g. 0.96, 0.91, 0.88, 0.75]
Rank Candidates
  • Methods
  • Phonetic approach
  • Frequency correlation
  • Combination

47
Evaluation
English term: Edmonton
MRR: Mean Reciprocal Rank
AllMRR: evaluation over all English names
CoreMRR: evaluation over just the names with a
found Chinese correspondence
(See the sketch below.)
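A minimal sketch of the MRR metric itself; the ranked candidate lists and gold transliterations are placeholders, not the actual evaluation data.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct transliteration over all test names."""
    total = 0.0
    for name, candidates in ranked_candidates.items():
        rank = next((i + 1 for i, c in enumerate(candidates)
                     if c == gold.get(name)), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_candidates)

ranked = {"Edmonton": ["cand1", "cand2", "cand3"], "Paris": ["cand4", "cand5"]}
gold = {"Edmonton": "cand2", "Paris": "cand4"}
print(mean_reciprocal_rank(ranked, gold))   # (1/2 + 1/1) / 2 = 0.75
```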
48
Summary and Future Work
  • So far
  • Phonetic transliteration models
  • Time correlation between name distributions
  • Work in progress
  • Linguistic models
  • Develop graphical model approach to
    transliteration
  • Semantic aspects of transliteration in Chinese:
    female names ending in -ia transliterated with ?
    ya rather than ? ya
  • Resource-poor transliteration for any pair of
    languages
  • Document alignment
  • Coordinated mixture models for document/word-level
    alignment

49
Graphical Models [Bilmes & Zweig, 2002]
50
Semantic Aspects of Transliteration
  • Phonological model doesn't capture
    semantic/orthographic features of
    transliteration
  • Saint, San, Sao use ? sheng 'holy'
  • Female names ending in -ia transliterated with ?
    ya rather than ? ya
  • Such information boosts evidence that two strings
    are transliterations of each other
  • Consider gender. For each character c
  • compute the log-likelihood ratio
    abs(log(P(f|c)/P(m|c)))
  • build a decision list ranked by decreasing LLR
    (see the sketch below)
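A minimal sketch of building such a decision list, assuming we already have per-character counts from known female and male transliterated names; the counts and the add-0.5 smoothing are illustrative choices.

```python
from math import log

# Hypothetical character counts in known female vs. male transliterated names.
female_counts = {"A": 120, "B": 3, "C": 40}
male_counts   = {"A": 5,   "B": 90, "C": 35}

def gender_decision_list(female_counts, male_counts, smoothing=0.5):
    entries = []
    for ch in set(female_counts) | set(male_counts):
        f = female_counts.get(ch, 0) + smoothing
        m = male_counts.get(ch, 0) + smoothing
        p_f = f / (f + m)                 # P(female | character)
        p_m = m / (f + m)                 # P(male | character)
        llr = abs(log(p_f / p_m))         # abs(log(P(f|c)/P(m|c)))
        entries.append((llr, ch, "female" if p_f > p_m else "male"))
    return sorted(entries, reverse=True)  # ranked by decreasing LLR

for llr, ch, label in gender_decision_list(female_counts, male_counts):
    print(f"{llr:.2f}\t{ch}\t{label}")
```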

51
Decision List for Gender Classification
41.5833898566  ?  male
40.8357601821  ?  female
39.064753687   ?  female
35.8207097407  ?  male
34.7980589928  ?  female
34.4008926875  ?  female
33.9871287766  ?  female
33.9225902555  ?  male
26.945842105   ?  male
52
Document Alignment
Basic idea: sum up correlations of all e-c pairs.
Use these to find documents paired by relevance.
Method 1: Expected correlation (ExpCorr)
[Diagram: English document E with words e1 ... e|E| linked to Chinese
document C with words c1 ... c|C|]
Matching two rare words is more surprising, so it
should count more.
Method 2: IDF-weighted correlation (IDFCorr)
Repeated occurrences of a word contribute less
than the first few occurrences.
IDF (for Inverse Doc Freq) penalizes common words.
Method 3: BM25 weighting (TF-IDF)
BM25 is a typical retrieval weighting function
(see the sketch below).
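A minimal sketch of the shared idea behind these scores: sum (optionally IDF-weighted) word-pair correlations over a document pair and rank the Chinese documents by the total. The function names and weighting are illustrative, not the exact ExpCorr, IDFCorr, or BM25 formulas.

```python
def doc_alignment_score(english_doc, chinese_doc, corr, idf=None):
    """Sum correlations over all e-c word pairs, optionally IDF-weighted so
    that matches between rare words count more."""
    score = 0.0
    for e in set(english_doc):
        for c in set(chinese_doc):
            weight = (idf.get(e, 1.0) * idf.get(c, 1.0)) if idf else 1.0
            score += weight * corr(e, c)
    return score

def rank_chinese_docs(english_doc, chinese_docs, corr, idf=None):
    """Rank candidate Chinese documents by their alignment score."""
    scored = [(doc_alignment_score(english_doc, d, corr, idf), i)
              for i, d in enumerate(chinese_docs)]
    return sorted(scored, reverse=True)
```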
53
Document Alignment Evaluation
Method 3: TF-IDF
  • Randomly pick 6 English documents
  • Retrieve 50 Chinese documents (out of approx.
    900) for each English document
  • Rank the 300 E-C pairs by each of 3 methods
  • Evaluate the relevance by standard precision
    metric

Method 2: IDF
Method 1: ExpCorr
About 80% of the top 100 pairs of documents are
correct.
54
Summary and Some Ongoing Work
  • Some seed rules and corpora; more in progress
  • NER techniques being adapted to other languages
  • Investigating ITG for annotation transplantation
  • What features to use for various languages
  • Combined phonetic and temporal information in
    transliteration
  • Semantic/orthographic aspects of transliteration
  • Resource poor transliteration
  • Document alignment
  • Coordinated mixture models

55
Acknowledgments
  • National Security Agency Contract NBCHC040176,
    REFLEX (Research on English and Foreign Language
    Exploitation)
  • Language experts thus far
  • Karine Megerdoomian (Persian, Armenian)
  • Alla Rozovskaya (Russian, Hebrew)
  • Archna Bhatia (Hindi)
  • Brent Henderson (Swahili)
  • Tholani Hlongwa (Zulu)
  • Karen Livescu for much help with GMTK

56
Model Estimation
  • Training data was small (721 names), so smoothing
    is essential
  • Align using a small hand-derived set of rules,
    plus the alignment algorithm of Sankoff & Kruskal.
    Some sample rules:

57
Model Estimation
  • For an English phone span to correspond to a
    Chinese syllable, the initial phone of the
    English span must have been seen in the training
    data as corresponding to the initial of the
    Chinese syllable some minimum number of times
  • For consonant-initial syllables we set the number
    to 4
  • For vowel-initial syllables (since these tend to
    be more variable in their correspondences) we set
    the minimum at 1
  • We then compute P(e'_i | c'_i) for the seen cases,
    and smooth for unseen correspondences using
    Sampson's Simple Good-Turing.

58
High Correlation Word-Character Pairs
[Table: English word / Chinese character pairs with their correlation
scores, in two columns]
59
Top-Ranked Chinese Characters (swimming, Afghan)
60
Next Steps
  • Coordinated mixture models for word/document
    alignment

61
A Coordinated Mixture Model
Generating a Chinese doc at time t
[Diagram: for each day (Day 1, Day 2, ...), k coordinated themes, each
with paired English and Chinese word distributions, plus background
English and Chinese distributions over the two collections]
Applications of the model
62
Graphical Models [Bilmes & Zweig, 2002]
63
Coordinated Mixture Model
k coordinated themes
English word dist.
Date (or Alignment) distribution
Chinese word dist.
Swimming 0.04, Medal 0.02, Men 0.01
? 0.05, ? 0.03, ? 0.01
Alignments A1, A2, ..., AM
Day 1, Day 2, ...
2 non-coordinated themes
English word dist. capturing unaligned English
topics
English date dist. capturing uneven topic
coverage over time
Chinese word dist. capturing unaligned Chinese
topics
Chinese date dist. capturing uneven topic
coverage over time
64
Details of the Mixture Model
Coordinated mixture model
Lexical translation; document alignment