1
Cross-Lingual Named Entity Retrieval
  • ChengXiang Zhai, Tao Tao
  • Department of Computer Science
  • Univ. of Illinois at Urbana-Champaign
  • Jan. 4, 2005

2
Outline
  • Problem definition (unsupervised learning from
    comparable corpora)
  • General ideas (freq. correlation, iterative
    feedback, mixture models)
  • Preliminary results (lexical retrieval,
    transliteration)

3
Problem Definition
  • The general problem: unsupervised learning from
    comparable corpora
  • Given a set of comparable corpora (e.g., news
    articles published on the same day)
  • Assuming no additional resources (e.g., no
    bilingual dictionary)
  • How do we do:
  • Document alignment
  • Entity extraction
  • Transliteration
  • Different from most existing work in the emphasis
    on completely unsupervised learning and
    robustness of methods

4
A More Specific Problem: Cross-Lingual Named
Entity Retrieval
  • Given a name in English (e.g., Bush), how do we
    find its translation(s) in Chinese?
  • More generally, given a word/phrase in one
    language, how do we exploit comparable corpora to
    find related words/phrases in another language?
  • Challenges
  • No additional resources to leverage (completely
    unsupervised)
  • Need to figure out the word/phrase boundaries in
    languages such as Chinese

5
Our Basic Idea
  • Exploit the correlation between words in
    different languages that are about the same topic
  • Observations
  • When a major event happens (e.g., the recent
    tsunami disaster), it is very likely covered by
    news articles in multiple languages
  • Each event/topic tends to have its own
    associated vocabulary (e.g., names such as Sri
    Lanka and India may occur in recent news
    articles)
  • We are thus likely to see the frequency of a
    name such as Sri Lanka peak in recent days
    relative to other time periods, and this pattern
    should hold across languages

6
An Example of Frequency Correlation (swimming)
7
Unsupervised Cross-Lingual Lexical Retrieval
  • Represent each lexical unit with a frequency
    distribution over the dates
  • Given a lexical unit X in language A, compute the
    similarity between its freq. distribution and
    that of any unit Y in language B
  • Return the top ranked Ys in language B as
    possibly related units in B for X in A.
  • The top-ranked Ys may suggest transliterations
    of X in B if X is a name in A (see the sketch
    below)
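A minimal sketch of this retrieval step in Python, assuming per-day counts have already been collected into vectors. Cosine similarity stands in here for the scoring functions compared later (BM25, JS divergence, etc.), and all names are illustrative:

```python
import numpy as np

def retrieve(x_freq, candidates, top_k=10):
    """Rank language-B units by cosine similarity between their
    date-frequency vectors and that of the language-A query unit."""
    x = np.asarray(x_freq, dtype=float)
    nx = np.linalg.norm(x) or 1.0
    scored = []
    for unit, vec in candidates.items():
        y = np.asarray(vec, dtype=float)
        ny = np.linalg.norm(y) or 1.0
        scored.append((unit, float(x @ y) / (nx * ny)))
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_k]
```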

8
Top Words for Swimming (from both English and
Chinese Articles)
1. swimming 23.6389
2. ? 22.5897
3. Swimming 22.497
4. Russia 21.9056
5. record 21.7685
6. Medal 21.6855
7. won 21.6112
8. ? 21.5027
9. ? 21.3815
10. Hungary 21.2824
11. third 21.1592
12. took 21.1154
13. meters 21.087
14. Korea 21.0268
15. Women's 21.0154
16. She 21.0062
17. ? 20.9737
18. ? 20.9366
19. Japan 20.9341
20. Japan, 20.9245
21. half 20.8783
22. medals 20.8348
(? marks Chinese terms that did not survive transcription;
the slide's callouts identify them as correct translations,
including the correct Chinese translation of swimming)
A standard retrieval formula (BM25) is used
9
The Choice of Lexical Units
  • The frequency distribution clearly depends on the
    choice of lexical units
  • The method can be expected to work well for some
    unigrams
  • For many names, we will have to consider n-grams
    of Chinese characters (see the sketch below)
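Because Chinese text has no marked word boundaries, one simple way to generate candidate units is to enumerate all short character n-grams. A minimal sketch; the function name and the cap max_n = 4 are illustrative assumptions:

```python
def char_ngrams(text, max_n=4):
    """All character n-grams of length 1..max_n; every n-gram
    is a candidate lexical unit."""
    return [text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]

# e.g. char_ngrams("布什", 2) -> ['布', '什', '布什']  (Bu, Shi, Bu-Shi)
```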

10
Another Example (Bush) Showing the Need for
N-grams
  • Bu: correlation 0.286
  • Shi: correlation 0.386
  • Bu-Shi: correlation 0.448
11
However, even if we consider n-grams, the method
would still work well only in very special cases
  • Two additional questions:
  • How can we make use of such partially correct
    results?
  • How do we further improve the results?

12
Exploiting Cross-Lingual Lexical Retrieval
  • As additional bias/evidence for transliteration
  • E.g., take the top-k candidates from any
    preliminary transliteration results and rerank
    them
  • As a basis for document alignment
  • Define the similarity between two articles in
    different languages based on the similarities
    between the freq distributions of the words in
    each of them
  • Align a document in language A with the
    top-ranked document in language B
  • A possible similarity function (one assumed form
    is sketched below)
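The formula image did not survive transcription, so the exact form is unknown; one plausible instantiation (an assumption, not the authors' definition) averages a lexical similarity over all cross-lingual word pairs:

```python
import numpy as np

def doc_similarity(words_a, words_b, freq, lex_sim):
    """Average the lexical similarity over all cross-lingual word
    pairs; freq[w] is w's date-frequency vector and lex_sim
    compares two such vectors (e.g., cosine)."""
    scores = [lex_sim(freq[x], freq[y])
              for x in words_a for y in words_b]
    return float(np.mean(scores)) if scores else 0.0
```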

13
An Iterative Algorithm for Mapping Lexical Units
and Aligning Articles
  • Start with the most reliable mappings of lexical
    units
  • Align articles based on these reliable mappings
  • Re-compute the frequency distributions based on
    the aligned articles and generate a new
    generation of mappings
  • Re-align the articles using the new mappings
    (the loop is sketched below)
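A schematic of the feedback loop; the four callables are hypothetical stand-ins for the steps named above:

```python
def iterate(seed_mappings, align_docs, recompute_freqs,
            generate_mappings, n_rounds=5):
    """Alternate between aligning articles with the current lexical
    mappings and regenerating mappings from the aligned pairs."""
    mappings = seed_mappings
    alignment = align_docs(mappings)           # align with reliable seeds
    for _ in range(n_rounds):
        freqs = recompute_freqs(alignment)     # re-estimate freq. distributions
        mappings = generate_mappings(freqs)    # new generation of mappings
        alignment = align_docs(mappings)       # re-align the articles
    return mappings, alignment
```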

14
How Do We Do All of This in a Principled Way?
15
A Coordinated Mixture Model
[Diagram: comparable articles grouped by day (Day 1, Day 2, ..., Day n)]
16
Details of the Mixture Model
[Diagram: the coordinated mixture model, relating lexical translation and document alignment]
17
Results from a Similar Mixture Model for News
Article Comparison
18
Preliminary Experiments: Lexical Retrieval
  • Data set: Chinese-English comparable corpora
  • About 150 days, 87,000 English articles, 35,000
    Chinese articles
  • Task: Given an English word, retrieve the top-k
    most correlated Chinese n-grams
  • Research questions
  • How to efficiently compute the frequency of
    n-grams?
  • How to represent a lexical unit with freq.
    distribution?
  • How to measure lexical similarity?

19
Efficient N-gram Freq. Counting
  • Index: inverted index built with the Lemur
    toolkit (http://www-2.cs.cmu.edu/~lemur)
  • quickly access documents from a word
  • quickly access word position information
  • Single-word frequency vector: the Lemur toolkit
    provides this function
  • N-gram frequency vector: merge the single-word
    position lists (sketched in code below). For
    example:

t1: d1 -> 1, 3, 9, 34
t2: d1 -> 2, 7, 10, 18, 56
t1 t2: d1 -> 1, 9
t2 t1: d1 -> 2
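A minimal sketch of the merge that reproduces the example above (a production index would walk both sorted lists in a single pass):

```python
def merge_positions(pos_first, pos_second):
    """Positions where the bigram (first, second) occurs: every
    position p of the first word such that p + 1 is a position
    of the second word."""
    second = set(pos_second)
    return [p for p in pos_first if p + 1 in second]

t1 = [1, 3, 9, 34]
t2 = [2, 7, 10, 18, 56]
print(merge_positions(t1, t2))  # t1 t2 -> [1, 9]
print(merge_positions(t2, t1))  # t2 t1 -> [2]
```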
20
Frequency Vector Normalization
Example date-frequency vectors over day1 ... day156:
English term E:   2, 0, 4, ..., 7, 5, 8
Chinese n-gram C: 1, 2, 4, ..., 5, 6, 2
  • TF normalization: a count change 0 -> 1 is more
    important than 100 -> 101
    1) TF -> log(1 + TF), e.g., 4 -> log(5)
    2) Okapi
  • Normalize into a probability distribution: divide
    by the sum of all counts (see the sketch below)
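A minimal sketch of the log-TF damping followed by the probability normalization (the Okapi variant listed above is omitted):

```python
import numpy as np

def normalize(counts):
    """Dampen raw counts with log(1 + TF), so the change 0 -> 1
    matters more than 100 -> 101, then rescale into a probability
    distribution over days."""
    damped = np.log1p(np.asarray(counts, dtype=float))
    total = damped.sum()
    return damped / total if total > 0 else damped

print(normalize([2, 0, 4, 7]))  # uses the sample counts for term E
```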

21
Similarity Measures
X = (x1 = 0.02, x2 = 0.00, x3 = 0.04, x4 = 0.07)
Y = (y1 = 0.01, y2 = 0.02, y3 = 0.00, y4 = 0.05)
  • Mutual information: considers only whether counts
    are zero or non-zero
  • Cosine distance
  • JS divergence: H(p1 P(X) + p2 P(Y)) - p1 H(P(X))
    - p2 H(P(Y)) (sketched in code below)
  • Dynamic Time Warping
  • Considers time shifting
  • Similar to edit distance (dynamic programming)

[Diagram: DTW alignment of x1 ... xn with y1 ... yn under a time-shifting constraint]
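A minimal sketch of the JS divergence exactly as written above, applied to normalized date-frequency distributions:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return float(-(p * np.log(p)).sum())

def js_divergence(px, py, p1=0.5, p2=0.5):
    """H(p1 P(X) + p2 P(Y)) - p1 H(P(X)) - p2 H(P(Y))."""
    px, py = np.asarray(px, float), np.asarray(py, float)
    return entropy(p1 * px + p2 * py) - p1 * entropy(px) - p2 * entropy(py)

# Example with the slide's vectors, renormalized to sum to 1:
X = np.array([0.02, 0.00, 0.04, 0.07]); X /= X.sum()
Y = np.array([0.01, 0.02, 0.00, 0.05]); Y /= Y.sum()
print(js_divergence(X, Y))
```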
22
Comparison (bush)
  • Cosine dist.
  • ?? 0.835038
  • ?? 0.774567
  • ?? 0.769585
  • ?? 0.764102
  • ?? 0.76317
  • ?? 0.76276
  • ?? 0.758067
  • ?? 0.753101
  • ?? 0.745806
  • ?? 0.743761
  • JS Divergence
  • ?? 0.0635597
  • ?? 0.0658826
  • ?? 0.0681127
  • ?? 0.0689964
  • ?? 0.0717511
  • ?? 0.0729297
  • ?? 0.0748018
  • ?? 0.078494
  • ?? 0.097773 (21)

  • Mutual information
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.154381
  • ?? 0.0827685 (129)

(?? marks Chinese character bigrams that did not survive
transcription. Note the many ties under mutual information,
which distinguishes only zero from non-zero counts.)
23
Another example (blair)
Cosine distance
  • ?? 0.646626
  • ?? 0.641913
  • ?? 0.638979
  • ?? 0.629399
  • ?? 0.623625
  • ?? 0.621927

??? (the correct translation, highlighted on the slide)
24
Transliteration Prior
  • Intuition: the retrieval score can be used as the
    prior probability for transliteration (a sketch
    follows the examples)
  • Examples

Examples for palestine, kazakhstan, and colombia (each ?
stands in for a Chinese character lost in transcription):

???? 0.761795, ??? 0.443768, ??? 0.353061,
??? 0.310424, ??? 0.263955, ??? 0.259201,
??? 0.227172, ??? 0.202834, ???? 0.199544,
??? 0.196019, ??? 0.10867, ??? 0.106474

???? 0.62962, ??? 0.487948, ??? 0.231453,
???? 0.147453, ??? 0.139291, ??? 0.136079,
??? 0.0783344, ??? 0.0760304, ??? 0.0760304

????? 0.797614, ??? 0.403872, ??? 0.316003,
??? 0.175327, ???? 0.138001, ??? 0.131324,
??? 0.126798, ??? 0.0426488
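A sketch of how the retrieval score might serve as a prior when reranking transliteration candidates; the log-linear mix and the weight alpha are assumptions, not the slides' method:

```python
import math

def rerank(candidates, retrieval_prior, alpha=0.5, eps=1e-9):
    """Rerank (ngram, translit_score) pairs, mixing the
    transliteration model's score with the lexical-retrieval
    score used as a prior probability."""
    def combined(item):
        ngram, translit_score = item
        prior = retrieval_prior.get(ngram, eps)
        return (alpha * math.log(prior)
                + (1 - alpha) * math.log(max(translit_score, eps)))
    return sorted(candidates, key=combined, reverse=True)
```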
25
Next Steps
  • Further improve the similarity measures
  • Apply lexical retrieval to transliteration
  • Apply lexical retrieval to alignment
  • Explore the iterative feedback strategy
  • Explore mixture models

26
The End