Title: A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words
1 A Knowledge-Rich Approach to Measuring the
Similarity between Bulgarian and Russian Words
Workshop Multilingual Resources, Technologies
and Evaluation for Central and Eastern European
Languages, RANLP 2009
- Preslav Nakov, Sofia University "St. Kliment
Ohridski"
- Elena Paskaleva, Bulgarian Academy of Sciences
- Svetlin Nakov, Sofia University "St. Kliment
Ohridski"
2 Introduction
- Objective
- Measure the extent to which a Bulgarian and a
Russian word are perceived as similar by a person
who is fluent in both languages
- Orthographic similarity
- Modified to account for typical cross-lingual
correspondences between Bulgarian and Russian,
e.g. transformations of inflections
- Example
- Bulgarian ??????????? and Russian ???????????????
are orthographically different but perceived as
similar
3 Orthographic Similarity
- Minimum Edit Distance Ratio (MEDR)
- MED(s1, s2) = the minimum number of INSERT /
REPLACE / DELETE operations for transforming s1
to s2 (Levenshtein distance)
- MEDR(s1, s2) = 1 − MED(s1, s2) / max(|s1|, |s2|)
- MEDR is also known as normalized edit distance
(NED)
- Longest Common Subsequence Ratio (LCSR)
- Length of the longest subsequence common to both words,
normalized by the length of the longer word
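The two baseline measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are ours, and returning 1.0 for two empty strings is an assumption the slides do not address.

```python
def med(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of insert/replace/delete operations."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # replace, or free match
            ))
        prev = curr
    return prev[-1]


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            curr.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def medr(s1: str, s2: str) -> float:
    """1 - MED / length of the longer word (assumed 1.0 for two empty strings)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - med(s1, s2) / longest


def lcsr(s1: str, s2: str) -> float:
    """LCS length / length of the longer word (assumed 1.0 for two empty strings)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else lcs_length(s1, s2) / longest
```

Both ratios lie in [0, 1], with 1 meaning identical strings.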
4 Modified Minimum Edit Distance Ratio (MMEDR)
- Our MMEDR similarity algorithm
- Reduces the Russian word to an intermediate
Bulgarian-sounding form
- Applies a set of linguistically motivated
transformation rules
- Compares orthographically the modified Russian
word with the Bulgarian word
- Calculates weighted Levenshtein distance
5 Linguistic Motivation behind the MMEDR Algorithm
6 Linguistic Motivation
- Transliteration from Cyrillic to Cyrillic
- Full coincidence (equality) of letters
- Regular letter transitions
- Transformations of n-grams
- Lemmatization
- Transformation Weights
7 Transliteration
- What is transliteration?
- Transition of sounds and their letter
correspondences in one language to letters in
another language
- Russian → Bulgarian transliteration
- Full coincidence (equality) of letters
- E.g. a → a (?????? ??????)
- Russian letters missing in Bulgarian
- E.g. ? → ?, ? → ? (???? ????, ???? ????)
- Removing a Russian letter
- E.g. ?????? → ?????
- Regular letter transitions
- E.g. ??? → ???, ???? → ????, ??? → ???
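A letter-level transliteration step of this kind can be sketched with Python's built-in `str.translate`. The concrete letters on the slide did not survive extraction, so the three mappings below (Russian letters with no Bulgarian counterpart mapped to a close Bulgarian letter) are illustrative assumptions rather than the authors' rule set.

```python
# Illustrative mapping only; not the rule set from the slides.
# Each key is a Russian letter absent from the Bulgarian alphabet.
LETTER_MAP = str.maketrans({
    "ы": "и",
    "э": "е",
    "ё": "е",
})


def transliterate(russian_word: str) -> str:
    """Map Russian-only letters to assumed Bulgarian counterparts; keep the rest."""
    return russian_word.lower().translate(LETTER_MAP)
```

Letters shared by both alphabets pass through unchanged, which corresponds to the "full coincidence" case. A letter could be removed by mapping it to the empty string.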
8 Transformation of n-grams
- Regular sound-letter transitions from Russian to
Bulgarian
- Transformations originating from spelling
- Double consonants, e.g. ??????? → ??????
- Voiceless to voiced consonants, e.g. ???????????
→ ??????????
- Transformations of morphological origin
- Removing agglutinative morphemes (?? and ??),
e.g. ?????????? → ????????
- Transforming endings, e.g. ??????? → ??????
9 Transformation of Russian Adjectives
10 Transformation of Russian Verbs
11 Lemmatization
- Bulgarian and Russian are highly inflectional
languages
- A variety of endings express the different forms of
the same word
- What is lemmatization?
- Replacement of inflected wordforms with their
lemmata
- E.g. ??????? → ????? (Bulgarian), ??????????? →
??????? (Russian)
- Lemmatization can handle inflections
12 Transformation Weights
- We use weights for letter substitutions when
measuring Levenshtein distance
- We account for regular phonetic and spelling letter
correspondences
- Some substitutions are unlikely
- E.g. ? → ? is more likely than ? → ?
- Replacing a letter with itself has cost 0
- Regular letter substitution cost is 1
- Consonants and vowels with similar sequences of
distinctive phonetic features have a lower
substitution cost (e.g. ? → ?)
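The cost scheme described above (0 for identity, 1 by default, less for phonetically close letters) can be expressed as a table lookup. The pairs and values below are placeholders chosen for illustration; the actual weight table from the slides is not reproduced here. Keying the table by ordered pair is also our assumption, so that asymmetric costs can be stored if needed.

```python
# Hypothetical reduced-cost table keyed by (russian_letter, bulgarian_letter).
# These pairs and values are placeholders, not the weights from the slides.
REDUCED_COSTS = {
    ("а", "о"): 0.5,
    ("о", "а"): 0.8,
    ("т", "д"): 0.5,
}


def substitution_cost(ru_letter: str, bg_letter: str) -> float:
    """Cost of replacing ru_letter with bg_letter in the weighted edit distance."""
    if ru_letter == bg_letter:
        return 0.0  # replacing a letter with itself is free
    # Fall back to the regular substitution cost of 1 for unlisted pairs.
    return REDUCED_COSTS.get((ru_letter, bg_letter), 1.0)
```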
13 Transformation Weights
14 The MMEDR Algorithm in Detail
15 The MMEDR Algorithm
- MMEDR algorithm steps (order is important)
- Lemmatize the Bulgarian word
- Lemmatize the Russian word
- Transform the Russian word's ending
- Transliterate the Russian word
- Remove some double consonants in the Russian word
- Calculate weighted Levenshtein distance
- Normalize and calculate the MMEDR value
16 Lemmatizing Bulgarian and Russian Words
- How to perform lemmatization?
- Use of large morphological dictionaries
- Wordforms are replaced with corresponding lemmata
- Lemmatization is an optional step in MMEDR
- For each word it is either performed or not
- When multiple lemmata are found, all of them are
considered
- The highest MMEDR value is taken
17 Transforming the Russian Endings
- The following endings are replaced in the Russian
words
- ???? → ???, ??? → ??, ???? → ???
- ??? → ??, ?? → ?, ?? → ?, ???? → ???
- ??? → ??, ?? → ?, ???? → ???, ??? → ?
- ????? → ??, ??? → ?, ??? → ?
- ??? → ??, ??? → ?, ??? → ??
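Ending replacement of this kind is a suffix-rewrite table. The sketch below assumes that at most one rule fires per word and that longer endings are tried first; the slides do not say how overlapping endings are resolved. The three rules are illustrative stand-ins, since the original endings were lost in extraction.

```python
# Illustrative (russian_ending, bulgarian_ending) rules; not the slide's table.
ENDING_RULES = [
    ("овать", "ам"),
    ("ный", "ен"),
    ("ий", "и"),
]


def transform_ending(word: str, rules=ENDING_RULES) -> str:
    """Replace the longest matching Russian ending; return the word unchanged otherwise."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        # Require a non-empty stem so a word is never reduced to a bare ending.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word
```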
18 Removing Double Consonants
- The following substitutions are performed in the
Russian words
- ?? → ?, ?? → ?, ?? → ?, ?? → ?
- ?? → ?, ?? → ?, ?? → ?, ?? → ?
- ?? → ?, ?? → ?
- Note that not all double consonants are replaced,
e.g. ?? is left as ??
- E.g. ????????? → ????????
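Collapsing only selected double consonants fits a regular expression with a backreference. The set of collapsible letters below is a guess for illustration (the slide's list is unreadable). It deliberately leaves out "н", so that at least one double consonant is kept, as the note above requires.

```python
import re

# Hypothetical set of consonants whose doubling is collapsed; "н" is left out
# on purpose so that "нн" survives. Not the list from the slides.
COLLAPSIBLE = "бдкглмпрстф"

# ([...])\1 matches one of the listed letters immediately followed by itself.
_DOUBLE = re.compile("([" + COLLAPSIBLE + r"])\1")


def remove_double_consonants(word: str) -> str:
    """Replace each listed doubled consonant with a single occurrence."""
    return _DOUBLE.sub(r"\1", word)
```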
19 Calculating Weighted Levenshtein Distance
- Starting from classical Levenshtein distance
(MED) we modify it to use weights for letter
substitutions (MMED)
- We use the previously discussed linguistically
motivated weights
- We calculate MMEDR as follows
- MMEDR(s1, s2) = 1 − MMED(s1, s2) / max(|s1|, |s2|)
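A weighted variant of the Levenshtein recurrence only changes the substitution term. In this sketch insertions and deletions keep unit cost, which is an assumption: the slides discuss weights for substitutions only. `toy_cost` is a made-up cost function for demonstration, and normalizing by the longer word follows the MEDR convention.

```python
from typing import Callable


def mmed(s1: str, s2: str, sub_cost: Callable[[str, str], float]) -> float:
    """Levenshtein distance with per-pair substitution weights."""
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1.0,                  # delete c1 (unit cost, assumed)
                curr[j - 1] + 1.0,              # insert c2 (unit cost, assumed)
                prev[j - 1] + sub_cost(c1, c2), # weighted substitution
            ))
        prev = curr
    return prev[-1]


def mmedr(s1: str, s2: str, sub_cost: Callable[[str, str], float]) -> float:
    """1 - MMED / length of the longer word (assumed 1.0 for two empty strings)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - mmed(s1, s2, sub_cost) / longest


def toy_cost(a: str, b: str) -> float:
    """Made-up weights: identity is free, о -> а is cheap, everything else costs 1."""
    if a == b:
        return 0.0
    return 0.5 if (a, b) == ("о", "а") else 1.0
```

With unit substitution costs, `mmed` reduces to the classical MED.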
20 Calculating the Final Result
- The final MMEDR value is the maximum of
all MMEDR values
- with / without lemmatization of the Bulgarian
word
- with / without lemmatization of the Russian word
- with / without transformation of the Russian word
ending
- Lemmatization sometimes produces multiple
lemmata, so all of them are considered
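Taking the maximum over all optional steps amounts to scoring every combination of word variants. In the sketch below, `bg_lemmas`, `ru_lemmas`, `transform_ending` and `score` are hypothetical callables supplied by the caller. `score` stands in for the remaining fixed steps (transliteration, double-consonant removal, weighted distance and normalization).

```python
from itertools import product
from typing import Callable, Iterable


def final_score(
    bg_word: str,
    ru_word: str,
    bg_lemmas: Callable[[str], Iterable[str]],
    ru_lemmas: Callable[[str], Iterable[str]],
    transform_ending: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Best score over all with/without choices for the optional steps."""
    # With and without lemmatization (a lemmatizer may return several lemmata).
    bg_variants = {bg_word} | set(bg_lemmas(bg_word))
    ru_variants = {ru_word} | set(ru_lemmas(ru_word))
    # With and without ending transformation, for every Russian variant.
    ru_variants |= {transform_ending(w) for w in ru_variants}
    return max(score(b, r) for b, r in product(bg_variants, ru_variants))
```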
21 MMEDR Algorithm Example
- Bulgarian word ???????????
- Russian word ???????????????
- Traditional MEDR similarity
- MED(???????????, ???????????????) = 7
- Apply normalization: MEDR = 1 − (7/15) = 8/15 ≈ 53%
- This is low even though these words "sound similar" to
fluent speakers of Bulgarian and Russian
22 MMEDR Algorithm Example (2)
- Our improved MMEDR similarity
- Lemmatization produces ????????? and
?????????????
- We replace the double Russian consonant -??- by
-?-
- We obtain ????????? and ????????????
- We replace the Russian ending -????? by the
Bulgarian ending -??
- We obtain identical words ????????? and
?????????
- Thus our MMEDR similarity is 100%
23 Another MMEDR Example
- Bulgarian word ??????? and the Russian word
???????? (both meaning "to run out")
- MED(???????, ????????) = 5
- MEDR = 1 − (5/8) = 3/8 = 37.5%
- MMEDR first transforms ???????? to ???????
- MMED(???????, ???????) = 0.8 + 1 + 0.5 = 2.3
- MMEDR = 1 − (2.3/7) = 47/70 ≈ 67%
24 Experiments and Evaluation
25 Experimental Setup
- Model the problem as an information retrieval (IR)
task
- Retrieve all similar pairs of words from
Bulgarian and Russian lists of words
- Measure similarity between 200 × 200 = 40,000
Bulgarian-Russian pairs of words
- 163 pairs annotated as similar by a linguist
- 39,837 considered unrelated
- Rank the 40,000 pairs by the MMEDR algorithm
- Evaluate the quality of the ranking with 11-point
interpolated average precision
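For reference, 11-point interpolated average precision can be computed from a ranked list of relevance flags as sketched below. It uses the standard IR definition: interpolated precision at recall level r is the highest precision observed at any recall ≥ r, averaged over r = 0.0, 0.1, ..., 1.0. The function name and signature are ours, not taken from the slides.

```python
from typing import Sequence


def eleven_point_interpolated_ap(ranked_relevance: Sequence[bool], total_relevant: int) -> float:
    """11-point interpolated average precision for one ranked list.

    ranked_relevance[i] is True if the item at rank i + 1 is a relevant pair;
    total_relevant is the number of relevant pairs overall (e.g. 163).
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # (recall, precision) measured right after each relevant item is retrieved.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at level r = max precision at any recall >= r.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)
```

In the setup above, the input would be the 40,000 pairs sorted by decreasing similarity, with `total_relevant=163`.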
26 Resources
- Textual resources
- The first 200 words from the Russian novel The
Lord of the World (????????? ????) by Alexander
Belyayev
- The first 200 words from the Bulgarian
translation of the novel
- Grammatical resources (for lemmatization)
- Grammatical dictionary of Bulgarian
- 1M wordforms and 70,000 lemmata
- Grammatical dictionary of Russian
- 1.5M wordforms and 100,000 lemmata
27 Results
- MMEDR significantly outperforms traditional
orthographic similarity measures
28 Results: Produced Ranking
29 Conclusion
- We proposed an orthographic similarity measure
for Bulgarian / Russian
- Outperforms traditional orthographic similarity
measures
- Accuracy is still far from 100%
- Evaluation performed with stop words included
- No publications on orthographic similarity for
Bulgarian / Russian
- Cannot compare the results with others
30 Future Work
- Combine the ideas of MMEDR with machine learning
techniques
- Automatically learn transformation rules for
n-gram correspondences
- Perform evaluation with stop words excluded
- Evaluate on different pairs of languages
31 Questions?