Identifying Translations

1
Identifying Translations
  • Philip Resnik, Noah Smith
  • University of Maryland

2
Reasons to identify translations
  • Locating parallel text on the Web
  • Filtering out poor quality translations
  • Cross-language duplicate detection/caching

3
Identifying translations using structure
STRAND (Resnik, 1999)
4
Related Work
  • Web mining for parallel text (Nie et al. 1999)
  • Sentence alignment (Fluhr et al. 2000)
  • Duplicate detection (e.g. Broder et al. 1997)

5
Translational Equivalence as a Function over Sets
  • Broder et al. (1997): document representation as a set of shingles S(D)

r(D1, D2) = |S(D1) ∩ S(D2)| / |S(D1) ∪ S(D2)|

  • Cross-language generalization: partial equality with confidence value t(e,f)
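
The monolingual resemblance measure above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, and the shingle width w=4 are all assumptions made for the example.

```python
def shingles(tokens, w=4):
    """Return the set of contiguous w-token shingles of a token list.

    The width w=4 is an illustrative default, not a value from the slides.
    """
    if not tokens:
        return set()
    if len(tokens) < w:
        # Short documents yield a single shingle covering all tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens1, tokens2, w=4):
    """Resemblance r(D1, D2) = |S(D1) ∩ S(D2)| / |S(D1) ∪ S(D2)|."""
    s1, s2 = shingles(tokens1, w), shingles(tokens2, w)
    union = s1 | s2
    if not union:
        return 0.0  # both documents empty
    return len(s1 & s2) / len(union)


d1 = "a b c d e".split()
d2 = "a b c d f".split()
print(resemblance(d1, d2))  # one shared shingle out of three distinct ones
```

This version only handles exact shingle equality within one language; the cross-language generalization replaces that exact test with the graded score t(e,f).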
6
Ways of computing equivalence
  • Bilingual dictionaries
    • t(e,f) = 1 if (e,f) is present in the dictionary, 0 otherwise
  • Translation model (Melamed 2000, Model A)
    • t(e,f) = Pr(e,f)
  • String similarity for cognates
    • t(e,f) = longest common substring ratio (LCSR) variant
    • Trained on non-zero entries in the translation model
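
The three scoring options can be sketched as interchangeable functions. All names here are hypothetical, and the data structures (a set of pairs for the dictionary, a dict of probabilities for the translation model) are assumptions. The string-similarity function uses the longest common subsequence, which is how LCSR is usually defined. The slide says "substring" and mentions a trained variant that it does not specify, so only the plain, untrained ratio is shown.

```python
def t_dictionary(e, f, dictionary):
    """t(e,f) = 1 if (e,f) is in the bilingual dictionary, else 0.

    `dictionary` is assumed to be a set of (e, f) word pairs.
    """
    return 1.0 if (e, f) in dictionary else 0.0


def t_model(e, f, model):
    """t(e,f) = Pr(e,f), read from a table of joint probabilities.

    `model` is assumed to be a dict mapping (e, f) pairs to probabilities.
    """
    return model.get((e, f), 0.0)


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def t_lcsr(e, f):
    """Untrained LCSR: LCS length divided by the longer word's length."""
    if not e or not f:
        return 0.0
    return lcs_length(e, f) / max(len(e), len(f))


print(t_lcsr("government", "gouvernement"))
```

Because all three return a score for a word pair, any of them could be plugged in as t(e,f) when comparing segments.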

7
Evaluation task
  • Given segmented corpora C1 in language L1 and C2 in language L2
  • Assume each segment has 0 or 1 translation equivalents
  • Match up the equivalents
  • Equivalent to the maximum bipartite matching problem
    • Exhaustive solution available for small sets
    • Approximated using competitive linking (Melamed)
  • True equivalence pairs give a precision/recall curve
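
Competitive linking is a greedy approximation to the matching problem: repeatedly link the highest-scoring pair whose two segments are both still unlinked. The sketch below illustrates that idea and the precision/recall computation against known pairs. The function names, the score dictionary layout, and the `threshold` parameter are assumptions for the example, not details from the slides.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one matching.

    `scores` maps (i, j) segment-index pairs to similarity scores.
    Pairs are visited from highest to lowest score; a pair is linked only
    if neither segment is already linked and its score exceeds `threshold`.
    """
    links = []
    used_left, used_right = set(), set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), score in ranked:
        if score <= threshold:
            break  # all remaining pairs score too low
        if i in used_left or j in used_right:
            continue
        links.append((i, j, score))
        used_left.add(i)
        used_right.add(j)
    return links


def precision_recall(links, gold):
    """Precision and recall of the linked pairs against a set of true pairs."""
    predicted = {(i, j) for i, j, _ in links}
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.85, (1, 1): 0.1, (2, 2): 0.0}
links = competitive_linking(scores)
print(links)
print(precision_recall(links, gold={(0, 0), (1, 1), (2, 2)}))
```

In this toy input the greedy result links (0,0) and (1,1) for a total score of 1.0, while linking (0,1) and (1,0) would total 1.65, which shows why the method is an approximation rather than an exact solver. Raising `threshold` step by step and recomputing precision and recall at each value traces out a precision/recall curve.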

8
Some results: sentence matching
  • Task corpora
    • Chinese-English: Hong Kong Laws sentences
      • 5622 training sentences, 191 test sentences
    • Spanish-English: U.N. Parallel Corpus
      • 4695 training sentences, 200 test sentences

[Figure panels: English-Chinese, English-Spanish]
9
Some results: document matching
  • Task corpora
    • 232 English-French Web documents

10
New directions
  • Exploiting the Internet Archive
  • 100-200 million pages (4TB) on disk
  • Exhaustive URL matching within site
  • STRAND now adapted for disk-based access
  • Combining structure and content
  • Improving document-level matching
  • Selecting good chunks within documents