Morphology 3 Unsupervised Morphology Induction - PowerPoint PPT Presentation

Description:

Unsupervised Learning of Natural Language Morphology Using MDL. John Goldsmith ... Work in progress with Mikhail Belkin (U of Chicago) ...

Slides: 88
Provided by: sudeshn
Learn more at: http://ltrc.iiit.ac.in
Transcript and Presenter's Notes



1
Morphology 3Unsupervised Morphology Induction
  • Sudeshna Sarkar
  • IIT Kharagpur

2
LinguisticaUnsupervised Learning of Natural
Language Morphology Using MDL
  • John Goldsmith
  • Department of Linguistics
  • The University of Chicago

3
Unsupervised learning
  • Input: untagged text in orthographic or phonetic
    form,
  • with spaces (or punctuation) separating words,
  • but no tagging or text preparation.
  • Output:
  • A list of stems, suffixes, and prefixes
  • A list of signatures.
  • A signature: a list of all suffixes (prefixes)
    appearing in a given corpus with a given stem.
  • Hence, a stem in a corpus has a unique signature.
  • A signature has a unique set of stems associated
    with it.

4
(example of signature in English)
  • NULL.ed.ing.s
  • ask call point
  • ask asked asking asks
  • call called calling calls
  • point pointed pointing points
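A signature extraction along these lines can be sketched in Python. This is a minimal sketch: the word list and the suffix inventory are supplied by hand here (Linguistica induces them), and NULL is written as the empty string.

```python
from collections import defaultdict

def signatures(words, suffixes):
    """Group stems by the set of suffixes they take in the word list.

    `suffixes` is a known suffix inventory, including "" for NULL;
    a stem is kept only if it occurs with at least two suffixes.
    """
    words = set(words)
    stem_sufs = defaultdict(set)
    for w in words:
        for suf in suffixes:
            if suf == "" or w.endswith(suf):
                stem = w[:len(w) - len(suf)] if suf else w
                stem_sufs[stem].add(suf)
    sig_stems = defaultdict(set)
    for stem, sufs in stem_sufs.items():
        if len(sufs) > 1:  # stems seen with only one suffix are not analyzed
            sig_stems[tuple(sorted(sufs))].add(stem)
    return sig_stems

words = ["ask", "asked", "asking", "asks",
         "call", "called", "calling", "calls",
         "point", "pointed", "pointing", "points"]
sigs = signatures(words, ["", "ed", "ing", "s"])
# One signature, NULL.ed.ing.s, shared by the stems ask, call, point.
```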

5
Output
  • Roots (stems of stems) and the inner structure
    of stems
  • Regular allomorphy of stems
  • e.g., it learns to delete stem-final e in English
    before -ing and -ed

6
Essence of Minimum Description Length (MDL)
  • Jorma Rissanen: Stochastic Complexity in
    Statistical Inquiry (1989)
  • Work by Michael Brent and Carl de Marcken on
    word-discovery using MDL
  • We are given
  • a corpus, and
  • a probabilistic morphology, which technically
    means that we are given a distribution over
    certain strings of stems and affixes.

7
  • The higher the probability that the morphology
    assigns to the (observed) corpus, the better that
    morphology is as a model of that data.
  • Better said:
  • −1 × log probability (corpus) is a measure of how
    well the morphology models the data: the smaller
    that number is, the better the morphology models
    the data.
  • This is known as the optimal compressed length of
    the data, given the model.
  • Using base-2 logs, this number is a measure in
    information-theoretic bits.

8
Essence of MDL
  • The goodness of the morphology is also measured
    by how compact the morphology is.
  • We can measure the compactness of a morphology in
    information theoretic bits.

9
How can we measure the compactness of a
morphology?
  • Let's consider a naïve version of description
    length: count the number of letters.
  • This naïve version is nonetheless helpful in
    seeing the intuition involved.

10
Naive Minimum Description Length
Corpus: jump, jumps, jumping, laugh, laughed,
laughing, sing, sang, singing, the, dog, dogs
(total: 61 letters)
Analysis: Stems: jump laugh sing sang dog (20
letters). Suffixes: s ing ed (6 letters). Unanalyzed:
the (3 letters). Total: 29 letters.
Notice that the description length goes UP if we
analyze sing into s-ing.
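The naive letter-count can be run directly; the corpus and the analysis below are the slide's, and the function simply totals letters on each side.

```python
def naive_dl(corpus_words, stems, suffixes, unanalyzed):
    """Naive description length: count letters in the raw corpus
    versus letters in the analyzed lists (stems, suffixes, leftovers)."""
    corpus_len = sum(len(w) for w in corpus_words)
    model_len = (sum(len(s) for s in stems)
                 + sum(len(s) for s in suffixes)
                 + sum(len(w) for w in unanalyzed))
    return corpus_len, model_len

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
corpus_len, model_len = naive_dl(
    corpus,
    stems=["jump", "laugh", "sing", "sang", "dog"],
    suffixes=["s", "ing", "ed"],
    unanalyzed=["the"])
# The analyzed description (29 letters) is far shorter than the raw corpus.
```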
11
Essence of MDL
  • The best overall theory of a corpus is the one
    for which the sum of
  • −log prob (corpus), plus
  • the length of the morphology
  • (that's the description length) is the smallest.

12
Essence of MDL
13
Overall logic
  • Search through morphology space for the
    morphology which provides the smallest
    description length.

14
  1. Application of MDL to iterative search of
    morphology-space, with successively finer-grained
    descriptions

15
Corpus
Pick a large corpus from a language -- 5,000 to
1,000,000 words.
16
Corpus
Feed it into the bootstrapping heuristic...
Bootstrap heuristic
17
Corpus
Bootstrap heuristic
Out of which comes a preliminary
morphology, which need not be superb.
Morphology
18
Corpus
Bootstrap heuristic
Feed it to the incremental heuristics...
Morphology
incremental heuristics
19
Corpus
Out comes a modified morphology.
Bootstrap heuristic
Morphology
modified morphology
incremental heuristics
20
Corpus
Is the modification an improvement? Ask MDL!
Bootstrap heuristic
Morphology
modified morphology
incremental heuristics
21
Corpus
If it is an improvement, replace the morphology...
Bootstrap heuristic
modified morphology
Morphology
Garbage
22
Corpus
Send it back to the incremental heuristics
again...
Bootstrap heuristic
modified morphology
incremental heuristics
23
Continue until there are no improvements to try.
Morphology
modified morphology
incremental heuristics
24
1. Bootstrap heuristic
  • A function that takes words as inputs and gives
    an initial hypothesis regarding what are stems
    and what are affixes.
  • In theory, the search space is enormous: each
    word w of length |w| has at least |w| analyses,
    so the search space has at least ∏ |w| members.

25
Better bootstrap heuristics
  • Heuristic, not perfection! Several good
    heuristics exist; the best is a modification of a
    good idea of Zellig Harris (1955).
  • Current variant:
  • Cut words at certain peaks of successor
    frequency.
  • Problems: it can over-cut, can under-cut, and can
    put cuts too far to the right (the aborti-
    problem). Not a problem!

26
Successor frequency
Empirically, only one letter (n) follows gover-.
27
Successor frequency
Empirically, six letters (e, i, m, o, s, …) follow govern-.
28
Successor frequency
Empirically, only one letter (e) follows governm-.
Successor frequencies: gover- 1, govern- 6, governm- 1;
the 6 is a peak of successor frequency.
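Successor frequency can be computed directly from a word list. The word list below is a toy sample of my own choosing, picked so that govern- shows the peak described on the slides.

```python
def successor_freq(words, prefix):
    """Number of distinct letters that immediately follow `prefix`
    among the words in the list."""
    nexts = {w[len(prefix)] for w in words
             if w.startswith(prefix) and len(w) > len(prefix)}
    return len(nexts)

# Toy word list (an assumption, not from the slides):
words = ["government", "governments", "governing", "governed",
         "governor", "governs", "governance"]

# Only "n" follows gover-; six letters follow govern-; only "e"
# follows governm-, so govern- is a peak and a plausible cut point.
```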
29
Lots of errors
Successor frequencies along c-o-n-s-e-r-v-a-t-i-v-e-s:
9 18 11 6 4 1 2 1 1 2 1 1
Of the resulting peaks, some cuts are wrong and one is right.
30
Even so
  • We set conditions:
  • Accept cuts with stems at least 5 letters in
    length.
  • Demand that successor frequency be a clear peak:
    1 … N … 1 (e.g., govern-ment).
  • Then, for each stem, collect all of its suffixes
    into a signature, and accept only signatures with
    at least 5 stems.

31
2. Incremental heuristics
  • Coarse-grained to fine-grained:
  • 1. Stems and suffixes to split:
  • Accept any analysis of a word if it consists of a
    known stem and a known suffix.
  • 2. Loose fit: suffixes and signatures to split.
    Collect any string that precedes a known suffix.
  • Find all of its apparent suffixes, and use MDL to
    decide if it's worth it to do the analysis. We'll
    return to this in a moment.

32
Incremental heuristic
  • 3. Slide the stem-suffix boundary to the left. Again,
    use MDL to decide.
  • How do we use MDL to decide?

33
Using MDL to judge a potential stem
  • act, acted, action, acts.
  • We have the suffixes NULL, ed, ion, and s, but no
    signature NULL.ed.ion.s.
  • Let's compute cost versus savings of the signature
    NULL.ed.ion.s.
  • Savings:
  • Stem savings: 3 copies of the stem act, that's
    3 × 4 = 12 letters, almost 60 bits.

34
Cost of NULL.ed.ion.s
  • A pointer to each suffix

To give a feel for this:
Total cost of the suffix list: about 30 bits. Cost of a
pointer to a signature: its total cost is shared, since all
the stems using it chip in to pay for it.
35
  • Cost of the signature: about 45 bits.
  • Savings: about 60 bits.
  • So MDL says: Do it! Analyze the words as stem +
    suffix.
  • Notice that the cost of the analysis would have
    been higher if one or more of the suffixes had
    not already existed.
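The cost/savings comparison comes down to one line of arithmetic. In this sketch the 12-letter savings and the 45-bit signature cost are the slide's estimates, taken as given.

```python
import math

LETTER_BITS = math.log2(26)   # about 4.7 bits per letter, as on the slides

# Savings: the slide counts 12 stem letters no longer stored redundantly.
savings = 12 * LETTER_BITS    # about 56 bits ("almost 60")

# Cost: pointers to the four suffixes plus signature bookkeeping;
# the slide estimates this at about 45 bits.
cost = 45.0

# MDL's verdict: adopt the analysis if it shortens the description.
adopt = savings > cost
```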

36
Today's presentation
  1. The task unsupervised learning
  2. Overview of program and output
  3. Overview of Minimum Description Length framework
  4. Application of MDL to iterative search of
    morphology-space, with successively finer-grained
    descriptions
  5. Mathematical model
  6. Current capabilities
  7. Current challenges

37
Model
  • A model to give us a probability of each word in
    the corpus (hence, its optimal compressed
    length) and
  • A morphology whose length we can measure.

38
Frequency of analyzed word
[x] means the count of x's in the corpus (token
count).
W is analyzed as belonging to signature σ, stem
T, and suffix F. Roughly, prob(W) = ([σ]/N) ×
([T]/[σ]) × ([F]/[σ]), where N is the total number
of words.
Actually, what we care about is the log of this.
39
(No Transcript)
40
Next, let's see how to measure the length of a
morphology
  • A morphology is a set of 3 things:
  • A list of stems
  • A list of suffixes
  • A list of signatures with the associated stems.
  • We'll make an effort to make our grammars consist
    primarily of lists, whose length is conceptually
    simple.

41
Length of a list
  • A header telling us how long the list is, of
    length (roughly) log2 N, where N is the length.
  • N entries. What's in an entry?
  • Raw lists: a list of strings of letters, where
    the length of each letter is log2(26), the
    information content of a letter (we can use a
    more accurate conditional probability).
  • Pointer lists:

42
Lists
  • Raw suffix list
  • ed
  • s
  • ing
  • ion
  • able
  • Signature 1
  • Suffixes
  • pointer to ing
  • pointer to ed
  • Signature 2
  • Suffixes
  • pointer to ing
  • pointer to ion

The length of each pointer to a suffix f is, roughly,
−log2 freq(f): usually cheaper than the letters themselves.
43
  • The fact that a pointer to a symbol has a length
    that shrinks as its frequency grows is the key:
  • We want the shortest overall grammar, so
  • that means maximizing the re-use of units (stems,
    affixes, signatures, etc.)
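The pointer-length claim can be illustrated directly; the counts below are invented for illustration, not taken from the slides.

```python
import math

def pointer_bits(count, total):
    """Bits needed to point at an item used `count` times out of
    `total` pointer uses: -log2 of its empirical frequency."""
    return -math.log2(count / total)

# A frequent suffix earns a short pointer; a rare one a long pointer.
# This is why re-using units shortens the grammar (toy counts):
frequent = pointer_bits(500, 1000)   # half of all uses: exactly 1 bit
rare = pointer_bits(1, 1000)         # one use in a thousand: ~10 bits
```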

44
The length of a morphology: its structure, plus the
number of letters, plus the signatures, which we'll
get to shortly.
45
Information contained in the signature component:
a list of pointers to signatures.
⟨X⟩ indicates the number of distinct elements in X.
46
Repair heuristics using MDL
  • We could compute the entire MDL in one state of
    the morphology; make a change; compute the whole
    MDL in the proposed (modified) state; and
    compare the two lengths.

Original morphology + compressed data  <  or  >
revised morphology + compressed data?
47
  • But it's better to have a more thoughtful
    approach.
  • Let's define:

Then the size of the punctuation for the 3 lists
is:
Then the change of the size of the punctuation in
the lists is:
48
Size of the suffix component (remember).
Change in its size when we consider a modification
to the morphology:
1. Global effects of the change in the number of suffixes
2. Effects of the change in the size of suffixes present in both states
3. Suffixes present only in state 1
4. Suffixes present only in state 2
49
Suffix component change
Suffixes whose counts change
Global effect of change on all suffixes
Contribution of suffixes that appear only in
State1
Contribution of suffixes that appear only in
State 2
50
Current research projects
  1. Allomorphy: automatic discovery of the relationship
    between stems (lov/love, win/winn)
  2. Use of syntax (automatic learning of syntactic
    categories)
  3. Rich morphology: other languages (e.g., Swahili),
    other sub-languages (e.g., the biochemistry
    sub-language), where the mean number of morphemes
    per word is much higher
  4. Ordering of morphemes

51
Allomorphy: automatic discovery of the relationship
between stems
  • Currently it learns (unfortunately, over-learns) how
    to delete stem-final letters in order to simplify
    signatures.
  • E.g., delete stem-final e in English before the
    suffixes -ing, -ed, -ion (etc.).

52
Automatic learning of syntactic categories
  • Work in progress with Mikhail Belkin (U of
    Chicago)
  • Pursuing Shi and Malik's 1997 application of
    spectral graph theory (vision)
  • Finding the eigenvector decomposition of a graph
    that represents bigrams and trigrams

53
Rich morphologies
  • A practical challenge for use in data-mining and
    information retrieval in patent applications
    (de-oxy-ribo-nucle-ic, etc.)
  • Swahili, Hungarian, Turkish, etc.

54
(No Transcript)
55
Unsupervised Knowledge-Free Morpheme Boundary
Detection
  • Stefan Bordag
  • University of Leipzig
  • Example
  • Related work
  • Part One: Generating training data
  • Part Two: Training and applying a classifier
  • Preliminary results
  • Further research

56
Example: clearly, early
  • The examples used throughout this presentation
    are clearly and early.
  • In one case the stem is clear; in the other,
    early.
  • Other word forms of the same lemmas:
  • clearly: clearest, clear, clearer, clearing
  • early: earlier, earliest
  • Semantically related words:
  • clearly: logically, really, totally, weakly, …
  • early: morning, noon, day, month, time, …
  • Correct morpheme boundary analysis:
  • clearly → clear-ly, but not clearl-y or
    clea-rly
  • early → early or earl-y, but not ear-ly

57
Three approaches to morpheme boundary detection
  • Genetic algorithms and the Minimum Description
    Length model:
  • (Kazakov 97, 01), (Goldsmith 01), (Creutz 03,
    05)
  • This approach uses only a word list, not the
    context information for each word from a corpus.
  • This possibly results in an upper limit on
    achievable performance (especially with regard
    to irregularities).
  • One advantage is that smaller corpora are sufficient.
  • Semantics-based:
  • (Schone & Jurafsky 01), (Baroni 03)
  • A general problem of this approach arises with
    examples like deeply and deepness, where semantic
    similarity is unlikely.
  • Letter Successor Variety (LSV) based:
  • (Harris 55); (Hafer & Weiss 74) first
    application, but low performance
  • Also applied only to a word list
  • Further hampered by noise in the data

58
2. New solution in two parts
[Diagram: from sentences (e.g. "The talk was very informative"),
extract cooccurrences (e.g. "The talk" 1, "Talk was" 1,
"Talk speech" 20, "Was is" 15) and similar words; compute the LSV
with final score s = LSV × freq × multiletter × bigram, yielding
training data (clear-ly, lately, early); train a classifier on it;
then apply the classifier, producing clear-ly, late-ly, early.]
59
2.1. First part: generating training data with
LSV and distributional semantics
  • Overview:
  • Use context information to gather common direct
    neighbors of the input word → they are most
    probably marked by the same grammatical
    information.
  • The frequencies of words A and B are nA and nB.
  • The frequency of cooccurrence of A with B is nAB.
  • The corpus size is n.
  • The significance computation is a Poisson approximation
    of the log-likelihood (Dunning 93), (Quasthoff &
    Wolff 02).
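The significance computation can be sketched as the negative log of the Poisson probability of the observed cooccurrence count; this is a minimal sketch of the core quantity, and the exact normalization used in (Quasthoff & Wolff 02) may differ.

```python
import math

def poisson_sig(n_a, n_b, n_ab, n):
    """Cooccurrence significance of words A and B: -ln of the Poisson
    probability of seeing n_ab joint occurrences when independence
    predicts lam = n_a * n_b / n of them."""
    lam = n_a * n_b / n
    # -ln( lam^k * e^-lam / k! )  with k = n_ab
    return lam - n_ab * math.log(lam) + math.lgamma(n_ab + 1)

# A pair seen far more often than chance predicts scores much higher
# than a pair seen once (all counts here are invented for illustration):
high = poisson_sig(1000, 800, 50, 1_000_000)
low = poisson_sig(1000, 800, 1, 1_000_000)
```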

60
Neighbors of clearly
  • Most significant left neighbors
  • very
  • quite
  • so
  • Its
  • most
  • its
  • shows
  • results
  • thats
  • stated
  • Quite
  • Most significant right neighbors
  • defined
  • written
  • labeled
  • marked
  • visible
  • demonstrated
  • superior
  • stated
  • shows
  • demonstrates
  • understood

Its clearly labeled
clearly
very clearly shows
61
2.2. New solution as a combination of two existing
approaches
  • Overview:
  • Use context information to gather common direct
    neighbors of the input word → they are most
    probably marked by the same grammatical
    information.
  • Use these neighbor cooccurrences to find words
    that have similar cooccurrence profiles → those
    that are surrounded by the same cooccurrences
    mostly bear the same grammatical marker.

62
Similar words to clearly
weakly legally closely clearly greatly linearly
really
63
2.3. New solution as combination of two existing
approaches
  • Overview:
  • Use context information to gather common direct
    neighbors of the input word → they are most
    probably marked by the same grammatical
    information.
  • Use these neighbor cooccurrences to find words
    that have similar cooccurrence profiles → those
    that are surrounded by the same cooccurrences
    mostly bear the same grammatical marker.
  • Sort those words by edit distance and keep the 150
    most similar → since further words only add
    random noise.

64
Similar words to clearly sorted by edit distance
Sorted list: clearly, closely, greatly, legally,
linearly, really, weakly

65
2.4. New solution as combination of two existing
approaches
  • Overview:
  • Use context information to gather common direct
    neighbors of the input word → they are most
    probably marked by the same grammatical
    information.
  • Use these neighbor cooccurrences to find words
    that have similar cooccurrence profiles → those
    that are surrounded by the same cooccurrences
    mostly bear the same grammatical marker.
  • Sort those words by edit distance and keep the 150
    most similar → since further words only add
    random noise.
  • Compute the letter successor variety for each
    transition between two characters of the input
    word.
  • Report boundaries where the LSV is above a
    threshold.

66
2.5. Letter successor variety
  • Letter successor variety: Harris (55), where
    word-splitting occurs if the number of distinct
    letters that follow a given sequence of
    characters surpasses a threshold.
  • Input: the 150 most similar words.
  • Observe how many different letters occur after
    a part of the string:
  • c-: in the given list, 5 letters follow c-
  • cl-: only 3 letters
  • cle-: only 1 letter
  • -ly, but reversed: before -ly, 16 different
    letters (16 different stems preceding the suffix
    -ly)
  • c l e a r l y
  • from left:  28 5 3 1 1 1 1 1  (thus after c-,
    5 various letters)
  • from right:  1 1 2 1 3 16 10 14  (thus before
    -y, 10 var. letters)
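The LSV computation can be sketched over a word list; the list below is a toy stand-in of my own choosing for the 150 most similar words, so the counts differ from the slide's.

```python
def lsv(words, prefix):
    """Letter successor variety: how many distinct letters follow
    `prefix` among the given (already similarity-filtered) words."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

# A tiny stand-in for the 150 most similar words to "clearly":
similar = ["clearly", "closely", "greatly", "legally",
           "linearly", "really", "weakly", "cleanly"]

forward = lsv(similar, "cl")                      # letters after cl-
# The suffix side is the same computation on reversed words:
backward = lsv([w[::-1] for w in similar], "yl")  # letters before -ly
```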

67
2.5.1. Balancing factors
  • The LSV score for each possible boundary is not
    normalized and needs to be weighted against
    several factors that otherwise add noise:
  • freq: frequency differences between the beginning and
    middle of the word
  • multiletter: representation of single phonemes
    with several letters
  • bigram: certain fixed combinations of letters
  • The final score s for each possible boundary is then:
  • s = LSV × freq × multiletter × bigram

68
2.5.2. Balancing factors: Frequency
  • LSV is not normalized against frequency:
  • 28 different first letters within 150 words
  • 5 different second letters within the 11 words
    beginning with c
  • 3 different third letters within the 4 words
    beginning with cl
  • Computing the frequency weight freq:
  • 4 out of 11 begin with cl-, so the weight is 4/11
  • c l e a r l y
  • counts: 150 11 4 1 1 1 1 1  (of the 11, 4
    begin with cl)
  • weights from left: 0.1 0.4 0.3 1 1 1 1 1

69
2.5.3. Balancing factors: Multiletter phonemes
  • Problem: two or more letters which together
    represent one phoneme carry away the numerator
    of the overlap-factor quotient.
  • Letter split variety for s c h l i m m e:
  • left:  7 1 7 2 1 1 2
  • right: 2 1 1 1 2 4 15
  • Computing the overlap factor:
  • left:  150 27 18 18 6 5 5 5
  • right: 2 2 2 2 3 7 105 150
  • Thus at this point the LSV of 7 is weighted 1
    (18/18); but since sch is one phoneme, it should
    have been 18/150!
  • Solution: rank bi- and trigrams; the highest
    receives a weight of 1.0.
  • The overlap factor is recomputed as a weighted average.
  • In this case that means 1.0 × 27/150, since sch
    is the highest trigram and has a weight of 1.0.

70
2.5.4. Balancing factors: Bigrams
  • It is obvious that th in English is almost
    never to be divided.
  • Compute a bigram ranking over all words in the
    word list and give a weight of 0.1 to the highest
    ranked and 1.0 to the lowest ranked.
  • The LSV score is then multiplied by the resulting weight.
  • Thus the German ch, which is the highest-ranked
    bigram, receives a penalty of 0.1, so it is
    nearly impossible for it to become a morpheme
    boundary.

71
2.5.5. Sample computation
  • Compute letter successor variety:
  • clear-ly: left 28 5 3 1 1 1 1 1; right 1 1 2 1 3 16 10 10
  • early: left 40 5 1 1 2 1; right 1 2 1 4 6 19
  • Balancing frequencies:
  • clear-ly: left 150 11 4 1 1 1 1 1; right 1 1 2 2 5 76 90 150
  • early: left 150 9 2 2 2 1; right 1 2 2 6 19 150
  • Balancing multiletter weights:
  • clear-ly: bi (left) 0.4 0.1 0.5 0.2 0.5 0.0;
    tri (right) 0.1 0.1 0.1 0.1 0.0 0.0
  • early: bi (left) 0.2 0.2 0.5 0.0; tri (right) 0.0 0.1 0.0
  • clear-ly: bi (right) 0.5 0.2 0.5 0.0 0.1 0.3;
    tri (left) 0.1 0.1 0.0 0.0 0.2
  • early: bi (right) 0.5 0.0 0.1 0.3; tri (left) 0.0 0.0 0.2
  • Balancing bigram weight:
  • clear-ly: 0.1 0.5 0.2 0.5 0.0 0.1; early: 0.2 0.5 0.0 0.1
  • Left and right LSV scores:
  • clear-ly: left 0.1 0.3 0.0 0.4 1.0 0.9; early: left 0.0 0.0 0.5 1.7
  • clear-ly: right 0.3 0.9 0.1 0.0 12.4 3.7; early: right 1.0 0.0 0.7 0.2
  • Computing the right score for clear-ly:
    16 … (76/90 … 0.1 … 76/150) / (1.0 … 0.1) … (1 − 0.0) = 12.4

72
Second Part: Training and applying the classifier
  • Any word list can be stored in a trie
    (Fredkin 60) or in a more efficient version of a
    trie, a PATRICIA compact tree (PCT)
    (Morrison 68).
  • Example:
  • clearly
  • early
  • lately
  • clear
  • late

[Figure: a trie over these five words, growing from the
root; a special marker denotes the end or beginning of a
word.]
73
3.1. The PCT as a classifier

[Figure: the trie is compressed into a PCT whose nodes
(cl, ear, late, ly, …) carry counts from the training
data, e.g. ly: 1.]
  • Apply: find the deepest matching node and
    retrieve the known information.
  • amazing?ly: the -ly node is known, so add that
    information: amazing-ly
  • dear?ly: the stored ending early (no boundary)
    matches: dearly
  • Training data: clear-ly, late-ly, early, Clear, late
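The "apply deepest found node" idea can be sketched with a plain (uncompressed) trie indexed from the word end; the PCT on the slide additionally compresses chains of nodes, and the annotations stored below are hypothetical stand-ins for the trained counts.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.info = None  # known analysis pattern stored at this node

def insert(root, word, analysis):
    """Store `word` (indexed from its end, where suffixes live)
    and annotate the final node with its known analysis."""
    node = root
    for ch in reversed(word):
        node = node.children.setdefault(ch, TrieNode())
    node.info = analysis

def classify(root, word):
    """Walk from the word end; return the analysis at the deepest
    annotated node reached ("apply deepest found node")."""
    node, found = root, None
    for ch in reversed(word):
        if ch not in node.children:
            break
        node = node.children[ch]
        if node.info is not None:
            found = node.info
    return found

root = TrieNode()
for w, a in [("clearly", "clear-ly"), ("lately", "late-ly"),
             ("early", "early"), ("ly", "-ly")]:
    insert(root, w, a)

# amazing?ly matches the -ly node; dear?ly matches the stored
# ending "early", which carries no boundary, so dearly stays whole.
```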
74
4. Evaluation
  • Boundary measuring: each boundary detected can be
    correct or wrong (precision), or boundaries can be
    missed (recall).
  • The first evaluation is global LSV with the proposed
    improvements.

75
Evaluating LSV Precision vs. Recall
76
Evaluating LSV F-measure
77
Evaluating combination Precision vs. Recall
78
Evaluating combination F-measure
79
Comparing combination with global LSV
80
4.1. Results
  • German newspaper corpus with 35 million sentences
  • English newspaper corpus with 13 million
    sentences

t5                     German   English
lsv precision           80.20     70.35
lsv recall              34.52     10.86
lsv F-measure           48.27     18.82
combined precision      68.77     52.87
combined recall         72.11     52.56
combined F-measure      70.40     55.09
81
4.2. Statistics
                            en lsv      en comb     tr lsv     tr comb    fi lsv      fi comb
Corpus size (sentences)     13 million  13 million  1 million  1 million  4 million   4 million
Number of word forms        167,377     167,377     582,923    582,923    1,636,336   1,636,336
Analysed words              49,159      94,237      26,307     460,791    68,840      1,380,841
Boundaries                  70,106      131,465     31,569     812,454    84,193      3,138,039
Morpheme length             2.60        2.56        2.29       3.03       2.32        3.73
Length of analysed words    8.97        8.91        9.75       10.62      11.94       13.34
Length of unanalysed words  7.56        6.77        10.12      8.15       12.91       10.47
Morphemes per word          2.43        2.40        2.20       2.76       2.22        3.27
82
Assessing true error rate
  • Typical sample list of words considered wrong
    due to CELEX (system output vs. CELEX analysis):
  • Tau-sende vs. Tausend-e
  • senegales-isch-e vs. senegalesisch-e
  • sensibelst-en vs. sens-ibel-sten
  • separat-ist-isch-e vs. separ-at-istisch-e
  • tris-t vs. trist
  • triump-hal vs. triumph-al
  • trock-en vs. trocken
  • unueber-troff-en vs. un-uebertroffen
  • trop-f-en vs. tropf-en
  • trotz-t-en vs. trotz-ten
  • ver-traeum-t-e vs. vertraeumt-e
  • Reasons:
  • Gender -e (in (Creutz & Lagus 05), for example,
    counted as correct)
  • Compounds (sometimes separated, sometimes not)
  • -t-en: error
  • With proper names, -isch is often not analyzed
  • Connecting elements

83
4.4. Real example
Ver-trau-enskrise, Ver-trau-ensleute, Ver-trau-ens-mann,
Ver-trau-ens-sache, Ver-trau-ensvorschuß, Ver-trau-ensvo-tum,
Ver-trau-ens-würd-igkeit, Ver-traut-es, Ver-trieb-en,
Ver-trieb-spartn-er, Ver-triebene, Ver-triebenenverbände,
Ver-triebs-beleg-e
  • Orien-tal
  • Orien-tal-ische
  • Orien-tal-ist
  • Orien-tal-ist-en
  • Orien-tal-ist-ik
  • Orien-tal-ist-in
  • Orient-ier-ung
  • Orient-ier-ungen
  • Orient-ier-ungs-hilf-e
  • Orient-ier-ungs-hilf-en
  • Orient-ier-ungs-los-igkeit
  • Orient-ier-ungs-punkt
  • Orient-ier-ungs-punkt-e
  • Orient-ier-ungs-stuf-e

84
5. Further research
  • Examine quality on various language types
  • Improve trie-based classificator
  • Possibly combine with other existing algorithms
  • Find out how to acquire morphology of
    non-concatenative languages
  • Deeper analysis
  • find deletions
  • alternations
  • insertions
  • morpheme classes etc.

85
References
  • (Argamon et al. 04) Shlomo Argamon, Navot Akiva,
    Amihood Amir, and Oren Kapah. Efficient
    unsupervised recursive word segmentation using
    minimum description length. In Proceedings of
    Coling 2004, Geneva, Switzerland, 2004.
  • (Baroni 03) Marco Baroni. Distribution-driven
    morpheme discovery: A computational/experimental
    study. Yearbook of Morphology, pages 213-248,
    2003.
  • (Creutz & Lagus 05) Mathias Creutz and Krista
    Lagus. Unsupervised morpheme segmentation and
    morphology induction from text corpora using
    Morfessor 1.0. In Publications in Computer and
    Information Science, Report A81. Helsinki
    University of Technology, March 2005.
  • (Déjean 98) Hervé Déjean. Morphemes as necessary
    concept for structures discovery from untagged
    corpora. In D.M.W. Powers, editor,
    NeMLaP3/CoNLL98 Workshop on Paradigms and
    Grounding in Natural Language Learning, ACL,
    pages 295-299, Adelaide, January 1998.
  • (Dunning 93) T. E. Dunning. Accurate methods for
    the statistics of surprise and coincidence.
    Computational Linguistics, 19(1):61-74, 1993.

86
6. References II
  • (Goldsmith 01) John Goldsmith. Unsupervised
    learning of the morphology of a natural language.
    Computational Linguistics, 27(2):153-198, 2001.
  • (Hafer & Weiss 74) Margaret A. Hafer and Stephen
    F. Weiss. Word segmentation by letter successor
    varieties. Information Storage and Retrieval,
    10:371-385, 1974.
  • (Harris 55) Zellig S. Harris. From phonemes to
    morphemes. Language, 31(2):190-222, 1955.
  • (Kazakov 97) Dimitar Kazakov. Unsupervised
    learning of naive morphology with genetic
    algorithms. In A. van den Bosch, W. Daelemans,
    and A. Weijters, editors, Workshop Notes of the
    ECML/MLnet Workshop on Empirical Learning of
    Natural Language Processing Tasks, pages 105-112,
    Prague, Czech Republic, April 1997.
  • (Quasthoff & Wolff 02) Uwe Quasthoff and
    Christian Wolff. The Poisson collocation measure
    and its applications. In Second International
    Workshop on Computational Approaches to
    Collocations, 2002.
  • (Schone & Jurafsky 01) Patrick Schone and Daniel
    Jurafsky. Language-independent induction of part
    of speech class labels using only language
    universals. In Workshop at IJCAI-2001, Seattle,
    WA, August 2001. Machine Learning: Beyond
    Supervision.

87
E. Gender-e vs. Frequency-e
vs. other-e: andere 8.4, keine 6.8, rote 11.6, stolze 8.0,
drehte 10.8, winzige 9.7, lustige 13.2, rufe 4.4, Dumme 12.6
vs. Gender-e: Schule 8.4, Devise 7.8, Sonne 4.5,
Abendsonne 5.3, Abende 5.5, Liste 6.5
Frequency-e: Affe 2.7, Junge 5.3, Knabe 4.6,
Bursche 2.4, Backstage 3.0