A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words

1
A Knowledge-Rich Approach to Measuring the
Similarity between Bulgarian and Russian Words
Workshop Multilingual Resources, Technologies
and Evaluation for Central and Eastern European
Languages, RANLP 2009
  • Preslav Nakov, Sofia University "St. Kliment
    Ohridski"
  • Elena Paskaleva, Bulgarian Academy of Sciences
  • Svetlin Nakov, Sofia University "St. Kliment
    Ohridski"

2
Introduction
  • Objective
  • Measure the extent to which a Bulgarian and a
    Russian word are perceived as similar by a person
    who is fluent in both languages
  • Orthographic similarity
  • Modified to account for typical cross-lingual
    correspondences between Bulgarian and Russian,
    e.g. transformations of inflections
  • Example
  • Bulgarian афектирахме and Russian аффектированный
    are orthographically different but perceived as
    similar

3
Orthographic Similarity
  • Minimum Edit Distance Ratio (MEDR)
  • MED(s1, s2) = the minimum number of INSERT /
    REPLACE / DELETE operations for transforming s1
    into s2 (Levenshtein distance)
  • MEDR is also known as normalized edit distance
    (NED)
  • Longest Common Subsequence Ratio (LCSR)
  • Length of the longest subsequence common to both
    words, normalized by the length of the longer word
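
The two baseline measures on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are made up here, and MEDR is assumed to be 1 minus MED divided by the length of the longer word, which is what the worked example later in the deck computes (1 − 7/15).

```python
def med(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of insert/replace/delete operations."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete c1
                curr[j - 1] + 1,             # insert c2
                prev[j - 1] + (c1 != c2),    # replace c1 by c2, or match for free
            ))
        prev = curr
    return prev[-1]


def medr(s1: str, s2: str) -> float:
    """Minimum edit distance ratio (assumed normalization: the longer word)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - med(s1, s2) / longest


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(s1: str, s2: str) -> float:
    """Longest common subsequence ratio, normalized by the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else lcs_length(s1, s2) / longest
```

For the English pair "kitten" / "sitting", med gives 3 and lcs_length gives 4, so MEDR = 1 − 3/7 and LCSR = 4/7.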

4
Modified Minimum Edit Distance Ratio (MMEDR)
  • Our MMEDR similarity algorithm
  • Reduces the Russian word to an intermediate
    Bulgarian-sounding form
  • Applies a set of linguistically motivated
    transformation rules
  • Compares orthographically the modified Russian
    word with the Bulgarian word
  • Calculates weighted Levenshtein distance

5
Linguistic Motivation behind the MMEDR Algorithm
6
Linguistic Motivation
  • Transliteration from Cyrillic to Cyrillic
  • Full coincidence (equality) of letters
  • Regular letter transitions
  • Transformations of n-grams
  • Lemmatization
  • Transformation Weights

7
Transliteration
  • What is transliteration?
  • Transition of sounds and their letter
    correspondences in one language to letters in
    another language
  • Russian → Bulgarian transliteration
  • Full coincidence (equality) of letters
  • E.g. a → a (?????? ??????)
  • Russian letters missing in Bulgarian
  • E.g. ? → ?, ? → ? (???? ????, ???? ????)
  • Removing a Russian letter
  • E.g. ?????? → ?????
  • Regular letter transitions
  • E.g. ??? → ???, ???? → ????, ??? → ???

8
Transformation of n-grams
  • Regular sound-letter transitions from Russian to
    Bulgarian
  • Transformations originating from spelling
  • Double consonants, e.g. ??????? → ??????
  • Voiceless to voiced consonants, e.g. ???????????
    → ??????????
  • Transformations of morphological origin
  • Removing agglutinative morphemes (-ся and -сь),
    e.g. ?????????? → ????????
  • Transforming endings, e.g. ??????? → ??????

9
Transformation of Russian Adjectives
10
Transformation of Russian Verbs
11
Lemmatization
  • Bulgarian and Russian are highly inflectional
    languages
  • A variety of endings expresses the different
    forms of the same word
  • What is lemmatization?
  • Replacement of inflected wordforms with their
    lemmata
  • E.g. ??????? → ????? (Bulgarian), ??????????? →
    ??????? (Russian)
  • Lemmatization can handle inflections

12
Transformation Weights
  • We use weights for letter substitutions when
    measuring Levenshtein distance
  • We account for regular phonetic and spelling
    letter correspondences
  • Some substitutions are unlikely
  • E.g. ? → ? is more likely than ? → ?
  • Replacing a letter with itself has cost 0
  • A regular letter substitution has cost 1
  • Consonants and vowels with similar sequences of
    distinctive phonetic features have a lower
    substitution cost (e.g. ? → ?)
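
A weighted Levenshtein distance of this kind could look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: insertions and deletions are assumed to cost 1, the weight table is assumed to be symmetric, and the example weight for the Latin pair z/s is a made-up placeholder rather than a value from the slides.

```python
from typing import Dict, Tuple

SubCosts = Dict[Tuple[str, str], float]


def weighted_med(s1: str, s2: str, sub_cost: SubCosts) -> float:
    """Levenshtein distance with per-pair substitution weights.

    Identical letters cost 0, listed pairs cost their weight, and every
    other substitution costs 1. Insert and delete are assumed to cost 1.
    """
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                cost = 0.0
            else:
                # Assumption: weights are symmetric, so try both orders.
                cost = sub_cost.get((c1, c2), sub_cost.get((c2, c1), 1.0))
            curr.append(min(
                prev[j] + 1.0,        # delete c1
                curr[j - 1] + 1.0,    # insert c2
                prev[j - 1] + cost,   # weighted substitution or free match
            ))
        prev = curr
    return prev[-1]


# Hypothetical weight: a voiced/voiceless pair is cheaper to substitute.
EXAMPLE_WEIGHTS: SubCosts = {("z", "s"): 0.5}
```

With EXAMPLE_WEIGHTS, weighted_med("zip", "sip", EXAMPLE_WEIGHTS) is 0.5, while an empty table gives the classical distance of 1.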

13
Transformation Weights
14
The MMEDR Algorithm in Detail
15
The MMEDR Algorithm
  • MMEDR algorithm steps (order is important)
  • Lemmatize the Bulgarian word
  • Lemmatize the Russian word
  • Transform the Russian word's ending
  • Transliterate the Russian word
  • Remove some double consonants in the Russian word
  • Calculate weighted Levenshtein distance
  • Normalize and calculate the MMEDR value
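
The ordered steps above can be read as a small pipeline applied to one pair of word forms. The sketch below is only a skeleton under assumptions: lemmatization is left to the caller, the Russian-side steps and the distance function are passed in as parameters (all names here are hypothetical), and the result is normalized by the longer of the Bulgarian word and the transformed Russian word, as the worked example later in the deck does.

```python
from typing import Callable, Sequence

Step = Callable[[str], str]
Distance = Callable[[str, str], float]


def mmedr_single(bg_form: str, ru_form: str,
                 ru_steps: Sequence[Step], distance: Distance) -> float:
    """Apply the Russian-side steps in order, then normalize the distance."""
    for step in ru_steps:            # order matters: endings, transliteration, doubles
        ru_form = step(ru_form)
    longest = max(len(bg_form), len(ru_form))
    if longest == 0:
        return 1.0
    return 1 - distance(bg_form, ru_form) / longest


# Toy usage with placeholder steps; real steps would be the ending,
# transliteration, and double-consonant transformations.
toy_steps = [lambda w: w.replace("ff", "f")]
toy_distance = lambda a, b: float(sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b)))
```

Passing the steps as a list keeps the order explicit, which the slide stresses is important.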

16
Lemmatizing Bulgarian and Russian Words
  • How to perform lemmatization?
  • Use of large morphological dictionaries
  • Wordforms are replaced with corresponding lemmata
  • Lemmatization is an optional step in MMEDR
  • For each word it is either performed or not
  • When multiple lemmata are found, all of them are
    considered
  • The highest MMEDR value is taken

17
Transforming the Russian Endings
  • The following endings are replaced in the Russian
    words
  • ???? → ???, ??? → ??, ???? → ???
  • ??? → ??, ?? → ?, ?? → ?, ???? → ???
  • ??? → ??, ?? → ?, ???? → ???, ??? → ?
  • ????? → ??, ??? → ?, ??? → ?
  • ??? → ??, ??? → ?, ??? → ??
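
Ending replacement of this kind is a suffix rewrite. The sketch below assumes that at most one rule fires per word and that longer endings are tried first; both are assumptions, not something the slides state. The two rules in ENDING_RULES are illustrative guesses only and do not reproduce the deck's list.

```python
from typing import Dict

# Illustrative guesses only; the deck's actual rule list is not reproduced here.
ENDING_RULES: Dict[str, str] = {
    "овать": "ам",
    "ать": "ам",
}


def transform_ending(word: str, rules: Dict[str, str] = ENDING_RULES) -> str:
    """Replace the longest matching Russian ending with its Bulgarian counterpart."""
    for ending in sorted(rules, key=len, reverse=True):   # longest ending first
        # Require a non-empty stem so a word is never reduced to a bare ending.
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)] + rules[ending]
    return word                                           # no rule applies
```

Trying longer endings first keeps a short rule from firing on a word that a more specific rule should handle.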

18
Removing Double Consonants
  • The following substitutions are performed in the
    Russian words
  • ?? → ?, ?? → ?, ?? → ?, ?? → ?
  • ?? → ?, ?? → ?, ?? → ?, ?? → ?
  • ?? → ?, ?? → ?
  • Note that not all double consonants are replaced,
    e.g. ?? is left as ??
  • E.g. ????????? → ????????
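
Selective removal of double consonants can be sketched as below. The set of reducible letters is a placeholder assumption (only a subset of doubles is reduced, and the deck's exact list is not reproduced here), and the function name is hypothetical.

```python
from typing import FrozenSet

# Placeholder assumption: which doubled consonants get reduced.
REDUCIBLE: FrozenSet[str] = frozenset("фл")


def remove_double_consonants(word: str, reducible: FrozenSet[str] = REDUCIBLE) -> str:
    """Collapse runs of a reducible consonant to a single letter."""
    out = []
    for ch in word:
        if out and out[-1] == ch and ch in reducible:
            continue                  # skip the repeated reducible consonant
        out.append(ch)
    return "".join(out)
```

Doubles that are not in the set pass through unchanged, which mirrors the note that some double consonants are left as they are.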

19
Calculating Weighted Levenshtein Distance
  • Starting from classical Levenshtein distance
    (MED) we modify it to use weights for letter
    substitutions (MMED)
  • We use the previously discussed linguistically
    motivated weights
  • We calculate MMEDR as follows
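
The formula itself did not survive in this transcript. A reconstruction consistent with the worked example later in the deck (1 − 2.3/7), assuming s1 is the Bulgarian word, s2 is the transformed Russian word, and normalization is by the longer of the two:

```latex
\[
\mathrm{MMEDR}(s_1, s_2) \;=\; 1 - \frac{\mathrm{MMED}(s_1, s_2)}{\max\bigl(|s_1|,\, |s_2|\bigr)}
\]
```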

20
Calculating the Final Result
  • The final MMEDR value is the maximum of the MMEDR
    values computed
  • with / without lemmatization of the Bulgarian
    word
  • with / without lemmatization of the Russian word
  • with / without transformation of the Russian word
    ending
  • Lemmatization sometimes produces multiple
    lemmata, so all of them are considered
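
Taking the maximum over all variants could be sketched as follows. This is an illustration under assumptions: the lemma lists, the ending transformation, and the similarity function are supplied by the caller, and all names are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable


def best_similarity(bg_word: str, ru_word: str,
                    bg_lemmas: Iterable[str], ru_lemmas: Iterable[str],
                    transform_ending: Callable[[str], str],
                    similarity: Callable[[str, str], float]) -> float:
    """Maximum similarity over all with/without combinations."""
    bg_variants = {bg_word, *bg_lemmas}            # with / without lemmatization
    ru_variants = set()
    for form in {ru_word, *ru_lemmas}:             # with / without lemmatization
        ru_variants.add(form)                      # without ending transformation
        ru_variants.add(transform_ending(form))    # with ending transformation
    return max(similarity(b, r) for b, r in product(bg_variants, ru_variants))


# Toy stand-ins for the real components, for illustration only.
strip_y = lambda w: w[:-1] if w.endswith("Y") else w
exact = lambda a, b: 1.0 if a == b else 0.0
```

With these stand-ins, best_similarity("abX", "abYY", ["ab"], ["abY"], strip_y, exact) is 1.0, because the lemma "ab" matches the transformed lemma "ab"; without the lemma lists the result is 0.0.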

21
MMEDR Algorithm Example
  • Bulgarian word афектирахме
  • Russian word аффектированный
  • Traditional MEDR similarity
  • MED(афектирахме, аффектированный) = 7
  • Apply normalization: MEDR = 1 − (7/15) = 8/15 ≈ 53%
  • This is low, even though the words "sound similar"
    to fluent speakers of Bulgarian and Russian

22
MMEDR Algorithm Example (2)
  • Our improved MMEDR similarity
  • Lemmatization produces афектирам and
    аффектировать
  • We replace the double Russian consonant -фф- by
    -ф-
  • We obtain афектирам and афектировать
  • We replace the Russian ending -овать by the
    Bulgarian ending -ам
  • We obtain identical words афектирам and
    афектирам
  • Thus our MMEDR similarity is 100%

23
Another MMEDR Example
  • Bulgarian word ??????? and the Russian word
    ???????? (both meaning to run out)
  • MED(???????, ????????) = 5
  • MEDR = 1 − (5/8) = 3/8 = 37.5%
  • MMEDR first transforms ???????? to ???????
  • MMED(???????, ???????) = 0.8 + 1 + 0.5 = 2.3
  • MMEDR = 1 − (2.3/7) = 47/70 ≈ 67%

24
Experiments and Evaluation
25
Experimental Setup
  • Model the problem as an information retrieval
    (IR) task
  • Retrieve all similar pairs of words from
    Bulgarian and Russian lists of words
  • Measure similarity between 200 × 200 = 40,000
    Bulgarian-Russian pairs of words
  • 163 pairs annotated as similar by a linguist
  • 39,837 considered unrelated
  • Rank the 40,000 pairs with the MMEDR algorithm
  • Evaluate the quality of the ranking with 11-point
    interpolated average precision
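
For reference, 11-point interpolated average precision over a ranked list can be computed as in the sketch below. It follows the standard IR definition (interpolated precision at recall levels 0.0, 0.1, ..., 1.0, averaged); whether the authors used exactly this variant is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def eleven_point_interpolated_ap(ranked_labels: Sequence[bool]) -> float:
    """ranked_labels[i] is True if the pair at rank i+1 is truly similar."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    # (hits so far, rank) after each position in the ranking
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        hits += bool(is_relevant)
        points.append((hits, rank))
    total = 0.0
    for level in range(11):                        # recall levels 0/10 .. 10/10
        # Interpolated precision: best precision at any recall >= level/10.
        # Integer comparison avoids floating-point issues at the boundaries.
        precisions = [h / r for h, r in points if h * 10 >= level * total_relevant]
        total += max(precisions, default=0.0)
    return total / 11
```

In the setup described on this slide, ranked_labels would hold 40,000 entries, 163 of them True.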

26
Resources
  • Textual resources
  • The first 200 words from the Russian novel The
    Lord of the World (Властелин мира) by Alexander
    Belyayev
  • The first 200 words from the Bulgarian
    translation of the novel
  • Grammatical resources (for lemmatization)
  • Grammatical dictionary of Bulgarian
  • 1M wordforms and 70,000 lemmata
  • Grammatical dictionary of Russian
  • 1.5M wordforms and 100,000 lemmata

27
Results
  • MMEDR significantly outperforms traditional
    orthographic similarity measures

28
Results: Produced Ranking
29
Conclusion
  • We proposed an orthographic similarity measure
    for Bulgarian / Russian
  • Outperforms traditional orthographic similarity
    measures
  • Accuracy is still far from 100%
  • Evaluation performed with stop words included
  • No publications on orthographic similarity for
    Bulgarian / Russian
  • Cannot compare the results with others

30
Future Work
  • Combine the ideas of MMEDR with machine learning
    techniques
  • Automatically learn transformation rules for
    n-gram correspondences
  • Perform evaluation with stop words excluded
  • Evaluation for different pairs of languages

31
Questions?
A Knowledge-Rich Approach to Measuring the
Similarity between Bulgarian and Russian Words