Free construction of a free dictionary of synonyms using computer science - PowerPoint PPT Presentation

1
Free construction of a free dictionary of
synonyms using computer science
  • Viggo Kann and Magnus Rosell, KTH, Stockholm
  • Talk given by Viggo at Amherst College November
    11, 2006

2
Examples of English synonyms
  • Smith, A Dictionary of Synonymous Words in the
    English Language, 1889
  • CLASS. Order. Rank. Degree. Classification.
    Grade.
  • Webster's Dictionary of Synonyms, 1942
  • classify. Alphabetize, pigeonhole, assort, sort.
    Ana. Order, arrange, systematize, methodize,
    marshal.

3
Goals
  • To construct a Swedish dictionary of synonyms as
    a list of synonymous pairs
  • I don't want to work a lot
  • I don't want to pay anyone to work
  • The resulting list should be free

4
Ideas
  • Automatically construct a large set of word pairs
    that might be synonyms
  • Use tens of thousands of people, who are each willing
    to make a small contribution without payment, to
    check the word pairs

5
More ideas
  • Use the Lexin online Swedish-English dictionary
    website, which had 9 million (now 17 million)
    lookups each month
  • Users visit Lexin to translate words, and are
    thus probably motivated to help me
  • Each time a user makes a lookup, give her the
    opportunity to decide whether two words are
    synonyms or not

6
My plan
  1. Construct lots of possible synonyms
  2. Sort out bad synonym pairs automatically
  3. Ask lots of users if the rest of the pairs are
    good synonyms
  4. Analyze the gradings done by the users and decide
    which pairs to keep

7
Step 1: Construct lots of possible synonyms
  • If we have access to a Swedish-English dictionary
    SE and an English-Swedish dictionary ES, try to
    translate each word to English and back again to
    Swedish
  • Generate the pair (w,v) if ∃y: y ∈ SE(w) ∧ v ∈ ES(y),
    or if ∃y: y ∈ SE(w) ∧ y ∈ SE(v)
  • 616 000 word pairs were generated
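
The two generation rules above can be sketched in Python. This is only an illustration: it assumes each dictionary is a plain mapping from a word to a set of translations, and the function name candidate_pairs and the toy entries are hypothetical, not taken from Lexin.

```python
from itertools import combinations

def candidate_pairs(se, es):
    """Return unordered Swedish word pairs that might be synonyms.

    se: Swedish word -> set of English translations (assumed format)
    es: English word -> set of Swedish translations (assumed format)
    """
    pairs = set()
    # Rule 1: translate w to English (y) and back again to Swedish (v).
    for w, english in se.items():
        for y in english:
            for v in es.get(y, ()):
                if v != w:
                    pairs.add(frozenset((w, v)))
    # Rule 2: w and v share at least one English translation y.
    by_english = {}
    for w, english in se.items():
        for y in english:
            by_english.setdefault(y, set()).add(w)
    for words in by_english.values():
        for w, v in combinations(sorted(words), 2):
            pairs.add(frozenset((w, v)))
    return pairs

# Toy dictionaries, invented for illustration only.
se = {"klass": {"class", "grade"}, "grad": {"degree", "grade"}, "sort": {"kind"}}
es = {"class": {"klass", "kategori"}, "grade": {"årskurs"}, "kind": {"slag", "sort"}}
pairs = candidate_pairs(se, es)
```

Pairs are stored as frozensets so that (w,v) and (v,w) count once; whether the original list treated pairs as ordered is not stated in the slides.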

8
Step 2: Sort out bad synonym pairs automatically
  • Use RI (Random Indexing) [Kanerva, Kristoferson,
    Holst 2000] to measure the distance between words
    represented in a large vector space
  • Keep pairs that have small enough distance in the
    vector space
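
A minimal sketch of this filtering step, assuming every word already has a context vector and that closeness is measured by cosine similarity (so a small distance corresponds to a high cosine). The function names, the toy vectors, and the threshold of 0.5 are hypothetical; the slides do not state which threshold was used.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_close_pairs(pairs, vectors, threshold):
    """Keep pairs whose words both have vectors and whose cosine is high enough."""
    kept = []
    for w, v in pairs:
        if w in vectors and v in vectors:
            if cosine(vectors[w], vectors[v]) >= threshold:
                kept.append((w, v))
    return kept

# Toy vectors, invented for illustration only.
vectors = {
    "klass": [1.0, 0.0, 1.0],
    "kategori": [1.0, 0.1, 0.9],
    "uppdrag": [0.0, 1.0, 0.0],
}
candidates = [("klass", "kategori"), ("klass", "uppdrag"), ("klass", "okänd")]
kept = keep_close_pairs(candidates, vectors, threshold=0.5)
```

In this sketch a pair is dropped when either word has no vector; how pairs with words missing from the corpus were actually handled is not stated here.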

9
Random Indexing
  • Each word w is assigned a random label vector Lw
    with on the order of a thousand elements
  • For each word w, construct a context vector Cw by
    adding the random label vectors of the words appearing
    in the context of each occurrence of w in a large
    corpus

10
Random Indexing settings
  • Context: 4 words to the left and 4 to the
    right; stop words were removed
  • Dimensionality: 1800
  • 5 corpora from different domains were used, for
    example newspapers and medical texts
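
Random Indexing with these settings might look roughly like the following Python sketch. Several details are assumptions rather than facts from the talk: the label vectors are taken to be sparse with a few +1/-1 entries (the nonzeros parameter is hypothetical), stop words are removed before the context window is applied, and the one-sentence corpus is invented.

```python
import random

def random_label(dim, nonzeros, rng):
    """Sparse random label vector with a few +1/-1 entries (an assumption)."""
    vec = [0] * dim
    for i in rng.sample(range(dim), nonzeros):
        vec[i] = rng.choice((-1, 1))
    return vec

def context_vectors(tokens, stop_words=(), dim=1800, window=4, nonzeros=8, seed=0):
    """Build a context vector for every non-stop word in tokens."""
    rng = random.Random(seed)
    words = [t for t in tokens if t not in stop_words]  # drop stop words first
    labels = {w: random_label(dim, nonzeros, rng) for w in sorted(set(words))}
    contexts = {w: [0] * dim for w in labels}
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Add the label vector of each neighbour to the context vector of w.
            target = contexts[w]
            for k, x in enumerate(labels[words[j]]):
                if x:
                    target[k] += x
    return labels, contexts

# Toy corpus, invented for illustration only.
tokens = "the class has a high rank and the grade has a high rank".split()
labels, contexts = context_vectors(tokens, stop_words={"the", "a", "and"})
```

The resulting context vectors can then be compared with a cosine measure; words that occur in similar contexts accumulate similar sums of label vectors.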

11
Number of pairs for different cosine thresholds (435
000 of 616 000 pairs occurred in the corpus)
12
Step 3: Ask lots of users if the rest of the
pairs are good synonyms
  • When a user has sent a word to the Lexin
    dictionary, they receive the translation followed
    by a question like:
  • Are 'spread' and 'lengthen' synonyms? Answer
    using a scale from 0 to 5 where 0 means 'I don't
    agree' and 5 means 'I do fully agree', or answer
    'I don't know'

13
After answering, the user may
  • grade a new randomly chosen word pair
  • look up a word in the synonym dictionary
  • suggest a new synonymous word pair
  • download the synonym dictionary in XML

14
(No Transcript)
15
Step 4: Analyzing the gradings done by the users
  • 1.2 million gradings were made in less than 2
    months
  • Grading statistics were analyzed on several
    occasions
  • Some users sent comments

16
Keeping the users happy!
  • Many users said that there were too many bad
    pairs
  • Lots of pairs were graded 0 (not at all synonyms)
    by all users. After some weeks 25 000 such pairs
    were removed. Later 60 000 more pairs were
    removed, improving the quality of the remaining
    pairs considerably.
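
The pruning described above could be sketched as follows. This is a hypothetical illustration: gradings are assumed to arrive as (pair, grade) tuples with 'I don't know' answers already excluded, and the min_gradings cut-off is an invented parameter, since the slides do not say how many zero gradings were required before a pair was removed.

```python
from collections import defaultdict

def prune_all_zero(gradings, min_gradings=3):
    """Split graded pairs into (kept, removed).

    A pair is removed if it has at least min_gradings gradings and
    every one of them is 0. Kept pairs map to their mean grading.
    """
    by_pair = defaultdict(list)
    for pair, grade in gradings:
        by_pair[pair].append(grade)
    kept, removed = {}, set()
    for pair, grades in by_pair.items():
        if len(grades) >= min_gradings and max(grades) == 0:
            removed.add(pair)
        else:
            kept[pair] = sum(grades) / len(grades)
    return kept, removed

# Toy gradings, invented for illustration only.
gradings = [
    (("klass", "uppdrag"), 0), (("klass", "uppdrag"), 0), (("klass", "uppdrag"), 0),
    (("klass", "rang"), 5), (("klass", "rang"), 4),
    (("klass", "stadga"), 0),
]
kept, removed = prune_all_zero(gradings)
```

A pair with a single 0 grading stays in the list in this sketch, so one careless answer cannot remove a pair on its own.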

17
User gradings first two months
18
More interesting gradings 2006
19
Distribution of mean gradings of word pairs after
two months
20
Distribution of mean gradings of word pairs 2006
21
Analysis of the pairs graded 0: Distance (cosine)
in RI space
22
Some statistics (November 2006)
  • 2.5 M user gradings done
  • 67 000 pairs (graded 2) in dictionary
  • 90 000 pairs suggested by users
  • 50 000 unique pairs suggested
  • 14 000 of them have been accepted

23
Example: Synonyms of klass (class)
  • 5: rang (grade), rank (rank), slag (kind)
  • 4: kategori (category), stånd (social
    class), årskurs (grade)
  • 3: fack (sphere), grad (degree), grupp
    (group), kvalitet (quality), nivå (level), ordning
    (order)
  • 3: skikt (layer), sort (sort), standard
    (standard), stil (style)
  • 2: storleksordning (magnitude), typ (type)
  • 1: poäng (point), stadga (stability)
  • 0: uppdrag (mission), utbilda (educate)

24
How to prevent abuse?
  • Many gradings of a word pair are needed before
    it is considered to be good
  • The pair to be graded is randomly picked from a
    very large list
  • Word pairs suggested by users are spell checked
    before they are added to the very large list
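
The first two safeguards can be illustrated with a short Python sketch. The thresholds min_gradings and min_mean and both function names are hypothetical; the talk does not give the actual acceptance rule.

```python
import random

def pick_pair_to_grade(candidates, rng=random):
    """Pick the next pair uniformly at random, so a user cannot choose what to grade."""
    return rng.choice(candidates)

def is_accepted(grades, min_gradings=5, min_mean=3.0):
    """Accept a pair only after enough gradings with a high enough mean."""
    if len(grades) < min_gradings:
        return False
    return sum(grades) / len(grades) >= min_mean

# Toy data, invented for illustration only.
candidates = [("klass", "rang"), ("klass", "slag"), ("klass", "uppdrag")]
chosen = pick_pair_to_grade(candidates, random.Random(0))
few_high = is_accepted([5, 5])            # too few gradings so far
many_high = is_accepted([5, 4, 5, 3, 4])  # enough gradings, mean 4.2
many_low = is_accepted([0, 1, 0, 2, 1])   # enough gradings, mean 0.8
```

With a very large candidate list, the chance that one user is asked about a particular pair often enough to control its mean grading is small.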

25
People's definition of synonymy
  • The exact meaning of 'synonym' wasn't defined
  • Users will grade using their intuitive
    understanding of the concept of synonymy and the
    words in the pair
  • The produced dictionary will use the people's own
    definition of synonymy. Hopefully this is exactly
    what they want!

26
The people's synonym dictionary on the web
  • http://lexin.nada.kth.se/cgi-bin/synlex

27
Lessons learned
  • The list of suggested synonyms should be huge
  • Try to improve the quality of the list
    automatically as much as possible. Random
    indexing is useful for this; also try tagging and
    using other dictionaries
  • Use the 0 answers early to remove bad pairs that
    only irritate the users