A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words

Transcript and Presenter's Notes

1
A Knowledge-Rich Approach to Measuring the
Similarity between Bulgarian and Russian Words
Workshop Multilingual Resources, Technologies
and Evaluation for Central and Eastern European
Languages, RANLP 2009

Preslav Nakov, Sofia University "St. Kliment
Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences
Svetlin Nakov, Sofia University "St. Kliment
Ohridski"

2
Introduction

Objective
Measure the extent to which a Bulgarian and a
Russian word are perceived as similar by a person
who is fluent in both languages
Orthographic similarity
Modified to account for typical cross-lingual
correspondences between Bulgarian and Russian,
e.g. transformations of inflections
Example
Bulgarian афектирахме and Russian аффектированный
are orthographically different but perceived as
similar

3
Orthographic Similarity

Minimum Edit Distance Ratio (MEDR)
MED(s1, s2) = the minimum number of INSERT /
REPLACE / DELETE operations for transforming s1
to s2 (Levenshtein distance)
MEDR(s1, s2) = 1 - MED(s1, s2) / max(|s1|, |s2|)
MEDR is also known as normalized edit distance
(NED)
Longest Common Subsequence Ratio (LCSR)
Maximal length subsequence common to both words,
normalized by the longer word
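
The two baseline measures above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and both ratios are normalized by the length of the longer word, as the slide states for LCSR.

```python
def med(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of insert/replace/delete edits."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,                # delete c1
                curr[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),   # replace, or free match
            ))
        prev = curr
    return prev[-1]


def lcs_len(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def medr(s1: str, s2: str) -> float:
    """Minimum edit distance ratio: 1 - MED / length of the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - med(s1, s2) / longest


def lcsr(s1: str, s2: str) -> float:
    """Longest common subsequence ratio, normalized by the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else lcs_len(s1, s2) / longest
```

Both functions return a value in [0, 1], where 1 means the two strings are identical.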

4
Modified Minimum Edit Distance Ratio (MMEDR)

Our MMEDR similarity algorithm
Reduces the Russian word to an intermediate
Bulgarian-sounding form
Applies a set of linguistically motivated
transformation rules
Compares orthographically the modified Russian
word with the Bulgarian word
Calculates weighted Levenshtein distance

5
Linguistic Motivation behind the MMEDR Algorithm
6
Linguistic Motivation

Transliteration from Cyrillic to Cyrillic
Full coincidence (equality) of letters
Regular letter transitions
Transformations of n-grams
Lemmatization
Transformation Weights

7
Transliteration

What is transliteration?
Transition of sounds and their letter
correspondences in one language to letters in
another language
Russian → Bulgarian transliteration
Full coincidence (equality) of letters
E.g. a → a (?????? ??????)
Russian letters missing in Bulgarian
E.g. ? → ?, ? → ? (???? ????, ???? ????)
Removing a Russian letter
E.g. ?????? → ?????
Regular letter transitions
E.g. ??? → ???, ???? → ????, ??? → ???

8
Transformation of n-grams

Regular sound-letter transitions from Russian to
Bulgarian
Transformations originating from spelling
Double consonants, e.g. ??????? → ??????
Voiceless to voiced consonants, e.g. ???????????
→ ??????????
Transformations of morphological origin
Removing agglutinative morphemes (ся and сь),
e.g. ?????????? → ????????
Transforming endings, e.g. ??????? → ??????

9
Transformation of Russian Adjectives
10
Transformation of Russian Verbs
11
Lemmatization

Bulgarian and Russian are highly inflectional
languages
A variety of endings expresses the different
forms of the same word
What is lemmatization?
Replacement of inflected wordforms with their
lemmata
E.g. ??????? → ????? (Bulgarian), ??????????? →
??????? (Russian)
Lemmatization can handle inflections

12
Transformation Weights

We use weights for letter substitutions when
measuring Levenshtein distance
We account for regular phonetic and spelling
letter correspondences
Some substitutions are unlikely
E.g. ? → ? is more likely than ? → ?
Replacing a letter with itself has cost 0
Regular letter substitution cost is 1
Consonants and vowels with similar sequences of
distinctive phonetic features have a lower
substitution cost (e.g. ? → ?)
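
A substitution-cost lookup following these three rules might look like the sketch below. The letter pairs and the value 0.5 are assumptions chosen for illustration (voiced/voiceless consonant pairs), not the weights used in the presentation, and treating the cost as symmetric is also an assumption.

```python
# Hypothetical reduced costs for phonetically close letters.
# These pairs and values are illustrative only, NOT the authors' weights.
SIMILAR_PAIRS = {
    frozenset("бп"): 0.5,
    frozenset("дт"): 0.5,
    frozenset("зс"): 0.5,
    frozenset("гк"): 0.5,
}


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting letter a with letter b."""
    if a == b:
        return 0.0  # replacing a letter with itself is free
    # Phonetically close pairs are cheaper; every other substitution costs 1.
    return SIMILAR_PAIRS.get(frozenset((a, b)), 1.0)
```

Keying the table on a frozenset makes the lookup order-independent; an asymmetric scheme would use ordered tuples instead.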

13
Transformation Weights
14
The MMEDR Algorithm in Detail
15
The MMEDR Algorithm

MMEDR algorithm steps (order is important)
Lemmatize the Bulgarian word
Lemmatize the Russian word
Transform the Russian word's ending
Transliterate the Russian word
Remove some double consonants in the Russian word
Calculate weighted Levenshtein distance
Normalize and calculate the MMEDR value

16
Lemmatizing Bulgarian and Russian Words

How to perform lemmatization?
Use of large morphological dictionaries
Wordforms are replaced with corresponding lemmata
Lemmatization is an optional step in MMEDR
For each word it is either performed or not
When multiple lemmata are found, all of them are
considered
Highest value of MMEDR is taken

17
Transforming the Russian Endings

The following endings are replaced in the Russian
words
???? → ???   ??? → ??   ???? → ???
??? → ??   ?? → ?   ?? → ?   ???? → ???
??? → ??   ?? → ?   ???? → ???   ??? → ?
????? → ??   ??? → ?   ??? → ?
??? → ??   ??? → ?   ??? → ??
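
Ending replacement of this kind can be sketched as a suffix-rewrite table tried longest suffix first, so that a longer ending is not shadowed by a shorter one it contains. The two rules below are assumptions used for illustration only (verb endings -овать and -ать mapped to -ам); they are not the presentation's full rule table, and the names `ENDING_RULES` and `transform_ending` are hypothetical.

```python
# Illustrative Russian -> Bulgarian ending rules (assumed, not the full table).
ENDING_RULES = [
    ("ать", "ам"),
    ("овать", "ам"),
]

# Try longer suffixes first so "овать" wins over "ать".
_RULES_LONGEST_FIRST = sorted(ENDING_RULES, key=lambda r: len(r[0]), reverse=True)


def transform_ending(word: str) -> str:
    """Replace the first (longest) matching Russian ending, if any."""
    for ru_ending, bg_ending in _RULES_LONGEST_FIRST:
        if word.endswith(ru_ending):
            return word[: len(word) - len(ru_ending)] + bg_ending
    return word  # no rule applies: leave the word unchanged
```

At most one rule fires per word, which keeps the transformation predictable when endings overlap.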

18
Removing Double Consonants

The following substitutions are performed in the
Russian words
?? → ?   ?? → ?   ?? → ?   ?? → ?
?? → ?   ?? → ?   ?? → ?   ?? → ?
?? → ?   ?? → ?
Note that not all double consonants are replaced,
e.g. ?? is left ??
E.g. ????????? → ????????
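
A selective double-consonant reduction can be sketched as follows. The set of collapsible letters is a hypothetical placeholder, not the presentation's list; the point is only that doubles outside the set are left untouched.

```python
# Hypothetical set of consonants whose doubled form is collapsed.
# This is NOT the presentation's list; it only illustrates the mechanism.
COLLAPSIBLE = ("ф", "с", "л", "м", "п")


def remove_double_consonants(word: str) -> str:
    """Collapse selected double consonants; leave all other doubles intact."""
    for letter in COLLAPSIBLE:
        word = word.replace(letter * 2, letter)
    return word
```

With this placeholder set, a doubled "ф" is reduced while a doubled "н" is kept, because "н" is deliberately not listed.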

19
Calculating Weighted Levenshtein Distance

Starting from classical Levenshtein distance
(MED) we modify it to use weights for letter
substitutions (MMED)
We use the previously discussed linguistically
motivated weights
We calculate MMEDR as follows
MMEDR(s1, s2) = 1 - MMED(s1, s2) / max(|s1|, |s2|)
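
A minimal sketch of the weighted distance and its normalized form is given below. It assumes that insertions and deletions keep cost 1 (the slides only describe weights for substitutions) and that the ratio is normalized by the longer word. `toy_cost` is a made-up cost function over Latin letters, used only to make the example runnable.

```python
from typing import Callable

CostFn = Callable[[str, str], float]


def mmed(s1: str, s2: str, sub_cost: CostFn) -> float:
    """Levenshtein distance with weighted substitutions.

    Insertions and deletions cost 1.0 (an assumption of this sketch).
    """
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1.0,                   # delete c1
                curr[j - 1] + 1.0,               # insert c2
                prev[j - 1] + sub_cost(c1, c2),  # weighted substitution
            ))
        prev = curr
    return prev[-1]


def mmedr(s1: str, s2: str, sub_cost: CostFn) -> float:
    """Similarity in [0, 1]: 1 - MMED / length of the longer word."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - mmed(s1, s2, sub_cost) / longest


def toy_cost(a: str, b: str) -> float:
    """Made-up costs: s/z are 'close' (0.5), everything else costs 1."""
    if a == b:
        return 0.0
    return 0.5 if {a, b} == {"s", "z"} else 1.0
```

With `toy_cost`, `mmed("zip", "sip", toy_cost)` is 0.5 instead of the unweighted distance of 1, so the pair scores higher under the weighted ratio than under plain MEDR.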

20
Calculating the Final Result

The final MMEDR value is the maximum of the
MMEDR values computed
with / without lemmatization of the Bulgarian
word
with / without lemmatization of the Russian word
with / without transformation of the Russian word
ending
Lemmatization sometimes produces multiple
lemmata, so all of them are considered
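
Taking the maximum over all optional variants can be sketched as below. Everything here is an assumption for illustration: the lemmata are passed in as plain lists (standing in for dictionary lookups), and `transform_ending` and `similarity` are caller-supplied functions. Transliteration and double-consonant removal are omitted to keep the sketch short.

```python
from itertools import product
from typing import Callable, Iterable


def best_similarity(
    bg_word: str,
    ru_word: str,
    bg_lemmata: Iterable[str],
    ru_lemmata: Iterable[str],
    transform_ending: Callable[[str], str],
    similarity: Callable[[str, str], float],
) -> float:
    """Maximum similarity over every with/without combination."""
    # With / without lemmatization of the Bulgarian word.
    bg_variants = {bg_word, *bg_lemmata}
    # With / without lemmatization of the Russian word.
    ru_base = {ru_word, *ru_lemmata}
    # With / without transformation of the Russian ending.
    ru_variants = ru_base | {transform_ending(w) for w in ru_base}
    return max(similarity(b, r) for b, r in product(bg_variants, ru_variants))
```

Because the original wordforms are always kept in the candidate sets, a missing or wrong lemma can never lower the score below the plain, unlemmatized comparison.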

21
MMEDR Algorithm Example

Bulgarian word: афектирахме
Russian word: аффектированный
Traditional MEDR similarity
MED(афектирахме, аффектированный) = 7
Apply normalization: MEDR = 1 - (7/15) = 8/15 ≈ 53%
The score is low even though these words "sound
similar" to fluent speakers of Bulgarian and Russian

22
MMEDR Algorithm Example (2)

Our improved MMEDR similarity
Lemmatization produces афектирам and
аффектировать
We replace the double Russian consonant -фф- by
-ф-
We obtain афектирам and афектировать
We replace the Russian ending -овать by the
Bulgarian ending -ам
We obtain identical words: афектирам and
афектирам
Thus our MMEDR similarity is 100%

23
Another MMEDR Example

Bulgarian word изтичам and the Russian word
истекать (both meaning to run out)
MED(изтичам, истекать) = 5
MEDR = 1 - (5/8) = 3/8 = 37.5%
MMEDR first transforms истекать to истекам
MMED(изтичам, истекам) = 0.8 + 1 + 0.5 = 2.3
MMEDR = 1 - (2.3/7) = 47/70 ≈ 67%

24
Experiments and Evaluation
25
Experimental Setup

Model the problem as an information retrieval
(IR) task
Retrieve all similar pairs of words from
Bulgarian and Russian lists of words
Measure similarity between 200 x 200 = 40,000
Bulgarian-Russian pairs of words
163 pairs annotated as similar by a linguist
39,837 considered unrelated
Rank the 40,000 pairs by the MMEDR algorithm
Evaluate the quality of the ranking with 11-point
interpolated average precision
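
The evaluation metric can be sketched as follows, assuming the standard definition: the interpolated precision at recall level r is the highest precision observed at any recall of at least r, averaged over the 11 levels 0.0, 0.1, ..., 1.0. The function name and the boolean-list input format are hypothetical.

```python
from typing import Optional, Sequence


def eleven_point_ap(ranked_relevance: Sequence[bool],
                    total_relevant: Optional[int] = None) -> float:
    """11-point interpolated average precision of a ranked list.

    ranked_relevance[i] is True if the item at rank i + 1 is relevant.
    total_relevant defaults to the number of relevant items in the list.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) measured at each rank holding a relevant item.
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at any recall >= level.
        reachable = [p for r, p in points if r >= level - 1e-12]
        total += max(reachable, default=0.0)
    return total / 11
```

In the setup described above, the input would be the 40,000 pairs sorted by decreasing similarity, with True marking the 163 pairs annotated as similar.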

26
Resources

Textual resources
The first 200 words from the Russian novel The
Lord of the World (Властелин мира) by Alexander
Belyayev
The first 200 words from the Bulgarian
translation of the novel
Grammatical resources (for lemmatization)
Grammatical dictionary of Bulgarian
1M wordforms and 70,000 lemmata
Grammatical dictionary of Russian
1.5M wordforms and 100,000 lemmata

27
Results

MMEDR significantly outperforms traditional
orthographic similarity measures

28
Results: Produced Ranking
29
Conclusion

We proposed an orthographic similarity measure
for Bulgarian / Russian
Outperforms traditional orthographic similarity
measures
Accuracy is still far from 100%
Evaluation performed with stop words included
No publications on orthographic similarity for
Bulgarian / Russian
Cannot compare the results with others

30
Future Work

Combine the ideas of MMEDR with machine learning
techniques
Automatically learn transformation rules for
n-gram correspondences
Perform evaluation with stop words excluded
Evaluation for different pairs of languages

31
Questions?
A Knowledge-Rich Approach to Measuring the
Similarity between Bulgarian and Russian Words
