Automatic Identification of Cognates, False Friends, and Partial Cognates

Slides: 24
Provided by: oana2

1
Automatic Identification of Cognates, False
Friends, and Partial Cognates
  • University of Ottawa, Canada

2
Outline
  • Overview of the Thesis
  • Research Contribution
  • Cognate and False Friend Identification
  • Partial Cognate Disambiguation
  • CLPA: Cognate and False Friend Annotator
  • Conclusions and Future Work

3
Overview of the Thesis
  • Tasks
      • Automatic Identification of Cognates and False
        Friends
      • Automatic Disambiguation of Partial Cognates
  • Areas of Application
      • CALL (Computer-Assisted Language Learning),
        MT (Machine Translation), Word Alignment,
        Cross-Language Information Retrieval
  • CALL Tool: CLPA

4
Definitions
  • Cognates, or True Friends (Vrais Amis), are pairs
    of words that are perceived as similar and are
    mutual translations.
      • nature - nature, reconnaissance - recognition
  • False Friends (Faux Amis) are pairs of words in
    two languages that are perceived as similar but
    have different meanings.
      • main (hand) - main (principal, essential),
        blesser (to injure) - bless (bénir in French)
  • Partial Cognates are pairs of words that share
    the same meaning in two languages in some but not
    all contexts.
      • note - note, facteur - factor (or mailman,
        maker)

5
Research Contribution
  • Novel method based on ML algorithms to identify
    Cognates and False Friends
  • A method to create complete lists of Cognates and
    False Friends
  • Define a novel task, Partial Cognate
    Disambiguation, and solve it using a supervised
    and a semi-supervised method
  • Combine and use corpora from different domains
  • Implement a CALL tool, CLPA, to annotate Cognates
    and False Friends

6
Cognates and False Friends Identification
  • Our method
      • Machine Learning techniques with different
        algorithms
      • Instances: French-English pairs of words
      • Feature space: 13 orthographic similarity
        measures
      • Classes: Cog_FF and Unrelated
  • Experiments done for
      • Each measure separately
      • Average of all measures
      • All 13 measures
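
To make the feature space concrete, here is a minimal sketch of four of the measures named on the results slide (IDENT, PREFIX, DICE, LCSR). The formulas are the commonly cited definitions and are an assumption here; the thesis may define or normalize them differently. The helper `features` is a hypothetical name.

```python
from collections import Counter


def ident(a: str, b: str) -> float:
    """1.0 if the two words are spelled identically, else 0.0."""
    return 1.0 if a == b else 0.0


def prefix(a: str, b: str) -> float:
    """Length of the shared prefix divided by the longer word's length."""
    if not a or not b:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def dice(a: str, b: str) -> float:
    """Twice the number of shared letter bigrams over the total bigram count."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence length divided by the longer word's length."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / max(len(a), len(b))


def features(a: str, b: str) -> list[float]:
    """Hypothetical helper: one feature vector per French-English word pair."""
    return [ident(a, b), prefix(a, b), dice(a, b), lcsr(a, b)]
```

Each word pair becomes one instance whose attributes are the measure values; a classifier can then use a single attribute, their average, or all of them.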

7
Cognates and False Friends Identification
  • Data

                 Training set   Test set
Cognates         613 (73)       603 (178)
False-Friends    314 (135)      94 (46)
Unrelated        527 (0)        343 (0)
Total            1454           1040

8
Results for classification (COG_FF/UNREL)

Orthographic         Threshold   Accuracy (%) on   Accuracy (%) on
similarity measure               training set      test set
IDENT                1           43.90             55.00
PREFIX               0.03845     92.70             90.97
DICE                 0.29669     89.40             93.37
LCSR                 0.45800     92.91             94.24
NED                  0.34845     93.39             93.57
SOUNDEX              0.62500     85.28             84.54
TRI                  0.0476      88.30             92.13
XDICE                0.21825     92.84             94.52
XXDICE               0.12915     91.74             95.39
TRI-SIM              0.34845     95.66             93.28
TRI-DIST             0.34845     95.11             93.85
Average measure      0.14770     93.83             94.14

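The slide does not say how the per-measure thresholds were obtained. One plausible procedure, sketched here as an assumption, is to try every observed similarity value as a cut-off and keep the one that gives the highest accuracy on the training pairs. The function names and the toy numbers in the usage note are hypothetical.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs whose thresholded score matches the gold label.

    labels[i] is True for COG_FF and False for UNREL.
    """
    hits = sum((s >= threshold) == is_cog_ff for s, is_cog_ff in zip(scores, labels))
    return hits / len(labels)


def best_threshold(scores, labels):
    """Return the candidate cut-off with the highest training accuracy.

    Candidates are the observed scores plus +inf (predict UNREL for everything).
    Ties are resolved in favor of the lowest candidate.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    return max(candidates, key=lambda t: accuracy(scores, labels, t))
```

With made-up scores `[0.10, 0.20, 0.80, 0.90]` and labels `[False, False, True, True]`, `best_threshold` returns `0.80`, which separates the two classes perfectly on that toy sample.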
9
Results for classification (COG_FF/UNREL)

Classifier               Accuracy (%), cross-val.   Accuracy (%)
                         on training set            on test set
Baseline                 63.75                      66.98
OneRule                  95.66                      92.89
Naive Bayes              94.84                      94.62
Decision Tree            95.66                      92.08
Decision Tree (pruned)   95.66                      93.18
IBK                      93.81                      92.80
Ada Boost                95.66                      93.47
Perceptron               95.11                      91.55
SVM (SMO)                95.46                      93.76

10
Complete Lists of Cognates and False Friends
  • Method
      • Use the XXDICE orthographic similarity measure
      • Use lists of pairs of words in two languages
        (words that are translations of each other, or
        not, or monolingual lists of words)
      • Use a bilingual dictionary to determine whether
        the words in a pair are translations of each
        other
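
The method above can be sketched as a two-step decision: a similarity cut-off first, then a dictionary lookup. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for XXDICE, the threshold `0.6` is arbitrary, and the dictionary is a toy mapping from French words to sets of English translations.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT XXDICE, used only for illustration."""
    return SequenceMatcher(None, a, b).ratio()


def label_pair(french: str, english: str,
               dictionary: dict[str, set[str]],
               threshold: float = 0.6) -> str:
    """Label a word pair as COGNATE, FALSE_FRIEND, or UNRELATED."""
    # Step 1: pairs that are not orthographically similar are unrelated.
    if similarity(french, english) < threshold:
        return "UNRELATED"
    # Step 2: similar pairs are cognates only if the dictionary
    # lists the English word as a translation of the French word.
    if english in dictionary.get(french, set()):
        return "COGNATE"
    return "FALSE_FRIEND"
```

For example, with the toy dictionary `{"nature": {"nature"}, "blesser": {"injure", "hurt"}}`, the pair nature - nature is labeled `COGNATE` and blesser - bless is labeled `FALSE_FRIEND`.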

11
Complete Lists of Cognates and False Friends
  • Evaluation
      • On the entry list of a French-English bilingual
        dictionary
          • 55 - Cognates
          • 2 - False Friends (5,619,270 pairs)
      • We created pairs of words from two large
        monolingual lists of words in French and English
          • 11,469,662 orthographically similar pairs (0.8)
          • 3,496 Cognates (0.03%)
          • 3,767,435 False Friends (32%)

12
Cognates and False Friends Identification
  • Conclusion
      • We tested a number of orthographic similarity
        measures individually, and also combined them
        using different Machine Learning algorithms
      • We evaluated the methods using 10-fold
        cross-validation on a training set, and on a
        separate test set
      • We proposed an extension of the method to create
        complete lists of Cognates and False Friends
      • The results show that, for French and English,
        it is possible to achieve very good accuracy
        based on orthographic measures of word similarity
13
Partial Cognate Disambiguation
  • Task
      • To determine the sense (Cognate or False Friend
        with respect to the equivalent English word) of
        a Partial Cognate in a French context
  • Example: note
      • Cog
          • Le comité prend note de cette information.
          • The Committee takes note of this reply.
      • FF
          • Mais qui a dû payer la note?
          • So who got left holding the bill?

14
Data
  • Use a set of 10 Partial Cognates
  • Parallel sentences that have the French Partial
    Cognate on the French side and, on the English
    side, either the English Cognate (labeled COG)
    or the English False Friend (labeled FF)
  • Collected from Europarl and Hansard
  • 115 sentences per class for training
  • 60 sentences per class for testing

15
Supervised Method
  • Traditional ML algorithms
  • Features
      • Bag-of-words (BOW) model of the context, with
        binary feature values
      • Context words that appeared at least 3 times
        in the training sentences
  • Classes: COG and FF
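
The feature construction can be sketched as follows. Whitespace tokenization and lowercasing are assumptions; the slides do not describe the tokenizer or whether the partial cognate itself is kept as a feature. The function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=3):
    """Keep context words that occur at least min_count times in training."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)


def to_binary_vector(sentence, vocabulary):
    """1 if the vocabulary word occurs in the sentence, else 0."""
    present = set(sentence.lower().split())
    return [1 if w in present else 0 for w in vocabulary]
```

Each training or test sentence is then represented by one such vector, paired with its COG or FF label.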

16
Monolingual Bootstrapping
  • For each pair of partial cognates (PC)
  • 1. Train a classifier on the training seeds,
    using the BOW approach and an NB-K classifier
    with attribute selection on the features
  • 2. Apply the classifier to unlabeled data:
    sentences that contain the PC word, extracted
    from LeMonde (MB-F) or from the BNC (MB-E)
  • 3. Take the first k newly classified sentences
    from each of the COG and FF classes and add them
    to the training seeds (the most confident ones:
    prediction confidence greater than or equal to
    a threshold of 0.85)
  • 4. Rerun the experiments, training on the new
    training set
  • 5. Repeat steps 2 and 3 t times
  • endFor
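
The loop above is a form of self-training, and can be sketched in plain Python. This is a simplified illustration, not the thesis implementation: a small multinomial Naive Bayes with Laplace smoothing replaces NB-K, there is no attribute selection, and the names `NaiveBayes` and `bootstrap` and the defaults `k=2`, `rounds=3` are hypothetical. Only the 0.85 confidence threshold comes from the slide.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens (Laplace smoothing)."""

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(sentence.lower().split())
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, sentence):
        """Return {class: probability}; words outside the vocabulary are ignored."""
        log_scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / sum(self.prior.values()))
            for word in sentence.lower().split():
                if word in self.vocab:
                    score += math.log((self.counts[c][word] + 1) / total)
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def bootstrap(seeds, seed_labels, unlabeled, k=2, threshold=0.85, rounds=3):
    """Grow the training set with the k most confident sentences per class."""
    train_x, train_y = list(seeds), list(seed_labels)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = NaiveBayes().fit(train_x, train_y)
        scored = []
        for sentence in pool:
            proba = model.predict_proba(sentence)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                scored.append((proba[label], sentence, label))
        if not scored:
            break  # nothing confident enough is left to add
        scored.sort(key=lambda item: item[0], reverse=True)
        for label in model.classes:
            for _, sentence, _ in [s for s in scored if s[2] == label][:k]:
                train_x.append(sentence)
                train_y.append(label)
                pool.remove(sentence)
    return NaiveBayes().fit(train_x, train_y), train_x, train_y
```

Sentences whose top-class probability stays below the threshold are never added, so an uninformative sentence remains in the unlabeled pool however many rounds are run.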

17
Bilingual Bootstrapping
  • 1. Translate the English sentences that were
    collected in the MB-E step into French using an
    online MT tool, and add them to the French seed
    training data.
  • 2. Repeat the MB-F and MB-E steps T times.

18
Additional Data
  • LeMonde
      • An average of 250 sentences for each class
  • BNC
      • An average of 200 sentences for each class
  • Multi-Domain corpus
      • An average of 80 sentences for each class

19
Results
20
Partial Cognate Disambiguation
  • Conclusions
      • Simple methods and available tools are used
        successfully for a task that is hard even for
        humans
      • Additional unlabeled data improves the learning
        process for the Partial Cognate Disambiguation
        task
      • Semi-Supervised Learning proves to be as good
        as Supervised Learning

21
CLPA: Cross-Language Pair Annotator

22
Future Work
  • Apply the Cognate and False Friend Identification
    method, and create complete lists, for other pairs
    of languages
  • Improve the accuracy on the Partial Cognate
    Disambiguation task
  • Use lemmatization for French texts and human
    evaluation for CLPA

23
  • Thank you!