Learning linguistic structure

1
Learning linguistic structure
  • John Goldsmith
  • February 7, 2003

2
  • A large part of the field of computational
    linguistics has moved during the 1990s from
  • developing grammars, speech recognition engines,
    etc., that simply work, to
  • developing systems that learn language-specific
    parameters from large amounts of data.

3
  • Prima facie, this may appear to be a divergence
    between the work of linguists and computational
    linguists,
  • but I think it should be interpreted as quite the
    opposite.

4
Task of even traditional linguistics
  • Is to develop grammars of human languages and
    more generally
  • To understand the relationship between data and
    grammars.

[Diagram: Data, Linguistic theory, Grammar]
That's the goal, at least; we're exceedingly far
from it, though.
5
A bit more about the goal
  • What's the input?
  • Data which comes to the learner, in acoustic
    form, unsegmented
  • Sentences not broken up into words
  • Words not broken up into their components
    (morphemes).
  • Words not assigned to lexical categories (noun,
    verb, article, etc.)

With a meaning representation?
6
The relationship between data and grammar
  • Is the goal of the discipline
  • Is a reasonable characterization of what the
    child does
  • Can be accounted for by a theory of language
    learning.

7
Idealization of the language-learning scheme
  • Segment the soundstream into words; the words
    form the lexicon of the language.
  • Discover internal structure of words; this is the
    morphology of the language.
  • Infer a set of lexical categories for words; each
    word is assigned to (at least) one lexical
    category.
  • Infer a set of phrase-structure rules for the
    language.

8
Idealization?
  • While these tasks are individually coherent, we
    make no assumption that any one must be completed
    before another can be begun.

9
Today's task
  • Learning the morphology of a language, given
    knowledge of the words of the language, and of a
    large sample of utterances.

10
Goals
  • Given a corpus, learn
  • The set of word-roots, prefixes, and suffixes,
    and principles of combinations
  • Principles of automatic alternations (e.g., e
    drops before the suffixes -ing, -ity and
    -ed, but not before -s)
  • Some suffixes have one grammatical function
    (-ness) while others have more (e.g., -s: song-s
    versus sing-s).

11
Why?
  • Practical applications
  • Automatic stemming for multilingual information
    retrieval
  • A corpus broken into morphemes is far superior to
    a corpus broken into words for statistically driven
    machine translation
  • Develop morphologies for speech recognition
    automatically

12
Theoretically
  • There is a strong bias currently in linguistics
    to underestimate the difficulty of language
    learning
  • For example, to identify language learning with
    the selection of a phrase-structure grammar, or
    with the independent setting of a small number of
    parameters.

13
Morphology
  • The learning of morphology is a very difficult
    task, in the sense that every word W of length
    $L = |W|$ can potentially be divided into 1, 2, ..., L
    morphemes $m_i$, constrained only by $\sum_i |m_i| = |W|$,
    and that's ignoring labeling (which is the stem,
    which the affix).
  • The number of potential morphologies for a given
    corpus is enormous.
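
To give a feel for the size of this space: a word of L letters has L - 1 gaps between letters, and each gap either is or is not a morpheme boundary, so a single word already has 2^(L-1) segmentations before any labeling. A minimal Python sketch (the function name is illustrative, not taken from Linguistica):

```python
from itertools import combinations

def segmentations(word):
    """Yield every way of cutting `word` into contiguous, non-empty pieces."""
    gaps = range(1, len(word))
    for k in range(len(word)):              # k = number of cuts
        for cuts in combinations(gaps, k):
            bounds = (0, *cuts, len(word))
            yield [word[i:j] for i, j in zip(bounds, bounds[1:])]

all_cuts = list(segmentations("jumps"))
print(len(all_cuts))   # 2 ** (5 - 1) = 16
```

A morphology must pick one segmentation per word, so the count for a whole corpus is the product of these numbers over its words.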

14
So the task is a reality check for discussions of
language learning
15
Ideally
  • We would like to pose the problem of
    grammar-selection as an optimization problem, and
    cut our task into two parts
  • Specification of the objective function to be
    optimized, and
  • Development of practical search techniques to
    find optima in reasonable time.

16
Current status
  • Linguistica: a C++ Windows-based program
    available for download at
  • http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000
  • Technical discussion in
  • Computational Linguistics (June 2001)
  • Good results with 5,000 words, very fine-grained
    results with 500,000 words (corpus length, not
    lexicon count), especially in European languages.

17
Today's talk
  • 1. Specify the task in explicit terms.
  • 2. Minimum Description Length analysis: what it is,
    and why it is reasonable for this task; how it
    provides our optimization criteria.
  • 3. Search heuristics: (1) bootstrap heuristic, and
    (2) incremental heuristics.
  • 4. Morphology assigns a probability distribution
    over its words.
  • 5. Computing the length of the morphology.

18
Today's talk (continued)
  • 6. Results
  • 7. Some work in progress: learning syntax to learn
    about morphology
19
Given a text (but no prior knowledge of its
language), we want:
  • List of stems, suffixes, and prefixes
  • List of signatures.
  • A signature: a list of all suffixes (prefixes)
    appearing in a given corpus with a given stem.
  • Hence, a stem in a corpus has a unique signature.
  • A signature has a unique set of stems associated
    with it.

20
Example of signature in English
  • NULL.ed.ing.s
  • ask call point
  • summarizes
  • ask asked asking asks
  • call called calling calls
  • point pointed pointing points
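
The grouping on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the stem/suffix cuts are already given as pairs, and the function name `signatures` is made up for the example.

```python
from collections import defaultdict

def signatures(analyses):
    """Map each signature (dot-joined, alphabetized suffixes) to its stems.

    `analyses` is a list of (stem, suffix) pairs; "NULL" marks the empty suffix.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)
    stems_of = defaultdict(list)
    for stem, sufs in sorted(suffixes_of.items()):
        stems_of[".".join(sorted(sufs))].append(stem)
    return dict(stems_of)

pairs = [(stem, suffix)
         for stem in ("ask", "call", "point")
         for suffix in ("NULL", "ed", "ing", "s")]
print(signatures(pairs))   # {'NULL.ed.ing.s': ['ask', 'call', 'point']}
```

Each stem ends up under exactly one signature, and each signature carries one list of stems, as stated on the previous slide.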

21
We would like to characterize the discovery of a
signature as an optimization problem
  • Reasonable tack: formulate the problem in terms
    of Minimum Description Length (Rissanen, 1989)

22
Today's talk
  • 1. Specify the task in explicit terms.
  • 2. Minimum Description Length analysis: what it is,
    and why it is reasonable for this task; how it
    provides our optimization criteria.
  • 3. Search heuristics: (1) bootstrap heuristic, and
    (2) incremental heuristics.
  • 4. Morphology assigns a probability distribution
    over its words.
  • 5. Computing the length of the morphology.

23
Minimum Description Length (MDL)
  • Jorma Rissanen: Stochastic Complexity in
    Statistical Inquiry (1989)
  • Work by Michael Brent and Carl de Marcken on
    word-discovery using MDL in the mid-1990s.

24
Essence of MDL
  • If we are given
  • a corpus, and
  • a probabilistic morphology, which technically
    means that we are given a distribution over
    certain strings of stems and affixes.
  • Then we can compute an over-all measure
    (description length) which we can seek to
    minimize over the space of all possible analyses.

25
Description length of a corpus C, given a
morphology M
  • The length, in bits, of the shortest formulation
    of the morphology expressible on a given Turing
    machine, plus
  • the optimal compressed length of the corpus, using
    that morphology.

26
Probabilistic morphology
  • To serve this function, the morphology must
    assign a distribution over the set of words it
    generates, so that the optimal compressed length
    of an actual, occurring corpus (the one we're
    learning from) is -1 × the log of the probability
    the morphology assigns to it.

27
Essence of MDL
  • The goodness of the morphology is also measured
    by how compact the morphology is.
  • We can measure the compactness of a morphology in
    information theoretic bits.

28
How can we measure the compactness of a
morphology?
  • Let's consider a naïve version of description
    length: count the number of letters.
  • This naïve version is nonetheless helpful in
    seeing the intuition involved.

29
Naive Minimum Description Length
Corpus: jump, jumps, jumping; laugh, laughed,
laughing; sing, sang, singing; the, dog, dogs.
Total: 62 letters.
Analysis:
  • Stems: jump, laugh, sing, sang, dog (20 letters)
  • Suffixes: s, ing, ed (6 letters)
  • Unanalyzed: the (3 letters)
Total: 29 letters.
Notice that the description length goes UP if we
analyze sing into s + ing.
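
The letter counting on this slide is easy to reproduce. The sketch below simply sums word lengths; counting the twelve words exactly as transcribed here gives 61 for the raw corpus, against the 62 stated above, while the 29-letter analysis matches.

```python
corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(items):
    """Naive description length: total number of letters in a list."""
    return sum(len(item) for item in items)

raw_cost = letters(corpus)
analyzed_cost = letters(stems) + letters(suffixes) + letters(unanalyzed)
print(raw_cost, analyzed_cost)   # 61 29
```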
30
Essence of MDL
  • The best overall theory of a corpus is the one
    for which the sum of
  • -1 × log prob (corpus), and
  • the length of the morphology
  • (that's the description length) is the smallest.
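
As a minimal sketch of that sum, assuming the morphology's length is already known in bits and that corpus words are scored independently (both are simplifications for illustration, and the numbers are invented):

```python
import math

def description_length(word_probs, corpus, morphology_bits):
    """Compressed corpus length (-log2 probability) plus morphology length, in bits."""
    corpus_bits = -sum(math.log2(word_probs[w]) for w in corpus)
    return corpus_bits + morphology_bits

# Hypothetical word probabilities and a hypothetical 40-bit morphology.
probs = {"dog": 0.5, "dogs": 0.25, "the": 0.25}
total = description_length(probs, ["the", "dog", "dogs", "dog"], morphology_bits=40.0)
print(total)   # 2 + 1 + 2 + 1 + 40 = 46.0
```

A richer morphology raises the second term but can raise the word probabilities enough to lower the first; the best analysis is the one that minimizes the total.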

31
Essence of MDL
32
Overall logic
  • Search through morphology space for the
    morphology which provides the smallest
    description length.

33
Brief foreshadowing of our calculation of the
length of the morphology
  • A morphology is composed of three lists: a list
    of stems, a list of suffixes (say), and a list of
    ways in which the two can be combined
    (signatures).
  • Information content of a list

34
Stem list
35
Today's talk
  • 1. Specify the task in explicit terms.
  • 2. Minimum Description Length analysis: what it is,
    and why it is reasonable for this task; how it
    provides our optimization criteria.
  • 3. Search heuristics: (1) bootstrap heuristic, and
    (2) incremental heuristics.
  • 4. Morphology assigns a probability distribution
    over its words.
  • 5. Computing the length of the morphology.

36
Bootstrap heuristic
  • Find a method to locate likely places to cut a
    word.
  • Allow no more than 1 cut per word (i.e., maximum
    of 2 morphemes).
  • Assume this is stem + suffix.
  • Associate with each stem an alphabetized list of
    its suffixes; call this its signature.
  • Accept only those word analyses associated with
    robust signatures

37
  • where a robust signature is one with a minimum
    of 5 stems (and at least two suffixes).
  • Robust signatures are pieces of secure structure.
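
The robustness filter can be sketched as follows. The thresholds are the ones on the slide; the dictionary layout and the candidate data are assumptions made for the example.

```python
def robust_signatures(stems_of, min_stems=5, min_suffixes=2):
    """Keep signatures with enough stems and enough suffixes.

    `stems_of` maps a dot-joined signature such as "NULL.ed.ing" to its stems.
    """
    return {
        sig: stems
        for sig, stems in stems_of.items()
        if len(stems) >= min_stems and len(sig.split(".")) >= min_suffixes
    }

# Hypothetical output of the cutting step.
candidates = {
    "NULL.ed.ing.s": ["aid", "ask", "call", "claim", "help", "kick"],
    "NULL.ion": ["act"],                      # too few stems
    "ing": ["br", "k", "r", "s", "th"],       # only one suffix
}
print(list(robust_signatures(candidates)))    # ['NULL.ed.ing.s']
```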

38
Heuristic to find likely cuts
  • Best is a modification of a good idea of Zellig
    Harris (1955)
  • Current variant:
  • Cut words at certain peaks of successor
    frequency.
  • Problems: can over-cut; can under-cut; and can
    put cuts too far to the right (the "aborti-"
    problem). Not a problem!

39
Successor frequency
[Figure: g o v e r, followed by n]
Empirically, only one letter follows "gover": n
40
Successor frequency
[Figure: g o v e r n, followed by e, i, m, o, s]
Empirically, 6 letters follow "govern"
41
Successor frequency
[Figure: g o v e r n m, followed by e]
Empirically, 1 letter follows "governm": e
g o v e r (1) n (6) m (1) e: a peak of successor frequency after "govern"
42
Lots of errors
[Figure: successor frequencies along "conservatives"]
c  o  n  s  e  r  v  a  t  i  v  e  s
 9  18 11 6  4  1  2  1  1  2  1  1
Cuts proposed at the peaks, left to right: wrong, right, wrong
43
Even so
  • We set conditions:
  • Accept cuts with stems at least 5 letters in
    length
  • Demand that successor frequency be a clear peak:
    1 ... N ... 1 (e.g. govern-ment)
  • Then for each stem, collect all of its suffixes
    into a signature, and accept only signatures with
    at least 5 stems.
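
Successor frequency and the "clear peak" condition can be sketched as below. This is an illustration under stated assumptions, not the Linguistica code: the toy vocabulary is invented, and a clear peak is read as a count above 1 flanked by counts of exactly 1.

```python
def successor_frequencies(word, vocabulary):
    """For each proper prefix of `word`, count the distinct letters that
    follow that prefix somewhere in `vocabulary`."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        followers = {w[i] for w in vocabulary
                     if len(w) > i and w.startswith(prefix)}
        counts.append(len(followers))
    return counts

def clear_peak_cuts(word, vocabulary, min_stem=5):
    """Prefix lengths where the counts read 1, N > 1, 1 and the stem is long enough."""
    sf = successor_frequencies(word, vocabulary)
    return [i + 1 for i in range(1, len(sf) - 1)
            if i + 1 >= min_stem and sf[i - 1] == 1 and sf[i] > 1 and sf[i + 1] == 1]

vocab = ["govern", "governed", "governing", "government",
         "governor", "governs", "gave", "give", "go"]
print(successor_frequencies("government", vocab))  # [3, 1, 1, 1, 1, 5, 1, 1, 1]
print(clear_peak_cuts("government", vocab))        # [6], i.e. govern + ment
```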

44
2. Incremental heuristics
  • Enormous amount of detail being skipped; let's
    look at one simple case
  • Loose fit: suffixes and signatures to split.
    Collect any string that precedes a known suffix.
  • Find all of its apparent suffixes, and use MDL to
    decide if it's worth it to do the analysis.

45
Using MDL to judge a potential stem and
potential signature
  • Suppose we find act, acted, action, acts.
  • We have the suffixes NULL, ed, ion, and s, but
    not the signature NULL.ed.ion.s
  • Let's compute cost versus savings of signature
    NULL.ed.ion.s

46
Savings
  • Stem savings: 3 copies of the stem act; that's 3
    x 3 = 9 letters = 40.5 bits (taking 4.5
    bits/letter).
  • Suffix savings: ed, ion, s = 6 letters, another 27
    bits.
  • Total of 67.5 bits.
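
The arithmetic can be checked with a short sketch. The 4.5 bits per letter is the rough figure from the slide; the function name and argument layout are illustrative only.

```python
BITS_PER_LETTER = 4.5   # the rough per-letter cost used on the slide

def letter_savings(stem, words_sharing_stem, suffixes_already_listed):
    """Bits saved by storing the stem once and reusing already-listed suffixes."""
    stem_copies_saved = words_sharing_stem - 1
    stem_bits = stem_copies_saved * len(stem) * BITS_PER_LETTER
    suffix_bits = sum(len(s) for s in suffixes_already_listed) * BITS_PER_LETTER
    return stem_bits, suffix_bits

# act, acted, action, acts: four words share the stem "act".
stem_bits, suffix_bits = letter_savings("act", 4, ["ed", "ion", "s"])
print(stem_bits, suffix_bits, stem_bits + suffix_bits)   # 40.5 27.0 67.5
```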

47
Cost of NULL.ed.ion.s
  • A pointer to each suffix

To give a feel for this:
Total cost of suffix list: about 30 bits. Cost of
pointer to signature: total cost is -- all
the stems using it chip in to pay for its cost,
though.
48
  • Cost of signature about 43 bits
  • Savings about 67 bits
  • Slight worsening in the compressed length of
    these 4 words.
  • So MDL says: Do it! Analyze the words as stem +
    suffix.
  • Notice that the cost of the analysis would have
    been higher if one or more of the suffixes had
    not already existed.

49
Today's talk
  • 1. Specify the task in explicit terms.
  • 2. Minimum Description Length analysis: what it is,
    and why it is reasonable for this task; how it
    provides our optimization criteria.
  • 3. Search heuristics: (1) bootstrap heuristic, and
    (2) incremental heuristics.
  • 4. Morphology assigns a probability distribution
    over its words.
  • 5. Computing the length of the morphology.

50
Frequency of analyzed word
W is analyzed as belonging to Signature s, stem
T and suffix F.
[x] means the count of x's in the corpus (token
count), where [W] is the total number of words.
Actually what we care about is the log of this.
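
The formula itself did not survive in this transcript. The sketch below assumes one natural factorization, P(word) = P(signature) × P(stem | signature) × P(suffix | signature), each estimated from token counts; that factorization and all the counts are assumptions for illustration, not a quotation of the slide.

```python
import math

def word_log_prob(sig_count, stem_count, suffix_in_sig_count, total_words):
    """log2 of P(signature) * P(stem | signature) * P(suffix | signature).

    All arguments are token counts from the corpus.
    """
    p_sig = sig_count / total_words
    p_stem = stem_count / sig_count
    p_suffix = suffix_in_sig_count / sig_count
    return math.log2(p_sig) + math.log2(p_stem) + math.log2(p_suffix)

# Hypothetical counts: 1,024 word tokens, 64 in the signature,
# 16 with this stem, 8 with this suffix inside the signature.
bits = -word_log_prob(64, 16, 8, 1024)
print(bits)   # 4 + 2 + 3 = 9.0 bits
```

Taking the negative log turns the product of frequencies into a sum of pointer lengths, which is the compressed length of one occurrence of the word.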
52
Today's talk
  • 1. Specify the task in explicit terms.
  • 2. Minimum Description Length analysis: what it is,
    and why it is reasonable for this task; how it
    provides our optimization criteria.
  • 3. Search heuristics: (1) bootstrap heuristic, and
    (2) incremental heuristics.
  • 4. Morphology assigns a probability distribution
    over its words.
  • 5. Computing the length of the morphology.

53
The length of a morphology
  • A morphology is a set of 3 things
  • A list of stems
  • A list of suffixes
  • A list of signatures with the associated stems.
  • We'll make an effort to make our grammars consist
    primarily of lists, whose length is conceptually
    simple.

54
Length of a list
  • A header telling us how long the list is, of
    length (roughly) log2 N, where N is the length.
  • N entries. What's in an entry?
  • Raw lists: a list of strings of letters, where
    the length of each letter is log2 (26), the
    information content of a letter (we can use a
    more accurate conditional probability).
  • Pointer lists: a list of pointers to the entries.
  • Someday: the information contained in the meaning
    of each morpheme
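
A rough sketch of the raw-list cost described above: a header of about log2 N bits plus log2 26 bits per letter. It ignores end-of-entry markers and the more accurate conditional letter probabilities, and the function name is illustrative.

```python
import math

def raw_list_bits(entries, alphabet_size=26):
    """Header of about log2(N) bits plus log2(alphabet_size) bits per letter."""
    header = math.log2(len(entries))
    body = sum(len(e) for e in entries) * math.log2(alphabet_size)
    return header + body

suffixes = ["ed", "s", "ing", "ion", "able"]   # 13 letters in 5 entries
print(round(raw_list_bits(suffixes), 1))
```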

55
Connections across lists
  • Raw suffix list
  • ed
  • s
  • ing
  • ion
  • able
  • Signature 1
  • Suffixes
  • pointer to ing
  • pointer to ed
  • Signature 2
  • Suffixes
  • pointer to ing
  • pointer to ion

The length of each pointer is
-- usually cheaper than the letters themselves
56
  • The fact that a pointer to a symbol has a length
    that shrinks as the symbol's frequency grows
    (the log of its inverse frequency) is the key.
  • We want the shortest overall grammar, so
  • that means maximizing the re-use of units (stems,
    affixes, signatures, etc.)
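
The pointer-length formula is missing from this transcript; the sketch below assumes the standard information-theoretic choice, log2(total / count), and compares it with spelling the suffix out letter by letter. The usage counts are invented.

```python
import math

def pointer_bits(count, total):
    """Bits for a pointer to an item used `count` times out of `total` uses."""
    return math.log2(total / count)

LETTER_BITS = math.log2(26)

# Hypothetical usage counts for three suffixes among 1,000 suffix tokens.
for suffix, count in [("ing", 250), ("ion", 125), ("able", 10)]:
    print(suffix,
          round(pointer_bits(count, 1000), 2),      # cost of pointing to it
          round(len(suffix) * LETTER_BITS, 2))      # cost of spelling it out
```

Frequent units get short pointers, so every additional re-use of a stem, suffix or signature makes the whole grammar cheaper.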

57
structure
Number of letters
Signatures, which we'll get to shortly
58
Information contained in the Signature component
list of pointers to signatures
<X> indicates the number of distinct elements in X
59
Repair heuristics using MDL
  • We could compute the entire MDL in one state of
    the morphology; make a change; compute the whole
    MDL in the proposed (modified) state; and
    compare the two lengths.

Original morphology + compressed data
< or >
Revised morphology + compressed data
60
  • But it's better to have a more thoughtful
    approach.
  • Let's define

Then the size of the punctuation for the 3 lists
is
Then the change of the size of the punctuation in
the lists
61
Size of the suffix component, remember
Change in its size when we consider a
modification to the morphology:
  • 1. Global effects of change of number of suffixes
  • 2. Effects on change of size of suffixes in both states
  • 3. Suffixes present only in state 1
  • 4. Suffixes present only in state 2
62
Suffix component change
Suffixes whose counts change
Global effect of change on all suffixes
Contribution of suffixes that appear only in
State 1
Contribution of suffixes that appear only in
State 2
63
Digression on entropy, MDL, and morphology
  • Why using MDL is closely related to measuring the
    complexity of the space of possible vocabularies

You better save this for another day, John;
you've only got 15 minutes left.
64
Today's talk (continued)
  • 6. Results
  • 7. Some work in progress: learning syntax to learn
    about morphology
65
How good?
  • In practice, on a large naturally-occurring
    corpus of a European language: precision and
    recall in the low 80s (percent).
  • Precision: proportion of predicted cuts that are
    correct
  • Recall: proportion of actual cuts that are
    predicted.

66
  • These numbers go as high as 98% if we use an
    artificial corpus with all of the inflected forms
    of a word.

67
  • Real life challenges include
  • alumnus
  • Johnson, Acheson, Adrianople
  • adenomas
  • Adirondacks
  • Abolition
  • Los Angeles

68
Today's talk (continued)
  • 6. Results
  • 7. Some work in progress: learning syntax to learn
    about morphology
69
Current research projects
  • Allomorphy: automatic discovery of relationship
    between stems (lov~love, win~winn)
  • Use of syntax (automatic learning of syntactic
    categories)
  • Rich morphology: other languages (e.g., Swahili),
    other sub-languages (e.g., biochemistry
    sub-language) where the mean morphemes/word is
    much higher
  • Ordering of morphemes

70
Allomorphy: automatic discovery of relationship
between stems
  • Currently learns (unfortunately, over-learns) how
    to delete stem-final letters in order to simplify
    signatures.
  • E.g., delete stem-final e in English before
    suffixes -ing, -ed, -ion (etc.).

71
Automatic learning of syntactic categories
  • Work in progress with Misha Belkin
  • Finding eigenvector decomposition of a graph that
    represents word neighbors.

Using eigenvectors of the bigram graph to infer
morpheme identity. With Mikhail Belkin.
Proceedings of the Morphology/Phonology Learning
Workshop of ACL-02. Association for Computational
Linguistics.
72
Disambiguating morphs?
  • Automatic learning of morphology can provide us
    with a signature associated with a given stem
  • Signature: alphabetized list of affixes
    associated with a given stem in a corpus.

73
For example
  • Signature NULL.ed.ing.s:
  • aid, ask, call, claim, help, kick
  • Signature NULL.ed.ing:
  • add, assist, attend, consider
  • Signature NULL.s:
  • achievement, acre, action, administrator, affair

74
  • The signature NULL.ed.ing is much more a
    subsignature of NULL.ed.ing.s than NULL.s is,
    because of the ambiguity of -s (noun, verb).

75
How can we determine whether a given morph (ed,
s) represents more than 1 morpheme?
  • I don't think that we can do this on the basis of
    morphological information.

76
Goal: find a way of describing syntactic behavior
in a way that is dependent only on a corpus.
  • That is, in a fashion that is language-independent
    but corpus-dependent, though the global
    structure that is induced from 2 corpora from the
    same language will be very similar.

77
French
[Plot; labeled regions: finite verbs, plural nouns, fem. sg. nouns]
78
With such a method
  • We can look at words formed with the same
    suffix, putting words into buckets based on the
    signature their stem is in
  • Bucket 1 (NULL.ed.ing.s): aided, asked, called
  • Bucket 2 (NULL.ed.ing): added, assisted,
    attended.
  • Q: do the average positions from each of the
    buckets form a tight cluster?

79
  • If the average locations of each bucket of -ed
    words form a tight cluster, then -ed is not
    ambiguous.
  • If the average locations of each bucket (from
    distinct signatures) do not form a tight
    cluster, the morpheme is not the same across
    signatures.

80
Method
  • Not a clustering method: neither top-down nor
    bottom-up.
  • Two-step procedure:
  • 1. Construct a nearest-neighbor graph.
  • 2. Reduce the graph to 2-dimensions by means of
    eigenvector decomposition.

81
Nearest neighbors
  • Following a long list of researchers
  • We begin by assuming that a word W's distribution
    can be described by a vector L describing all of
    its left-hand neighbors and a vector R describing
    all of its right-hand neighbors.

82
  • $|V|$ = size of corpus vocabulary V
  • $L_w$, $R_w$ are vectors that live in $\mathbb{R}^{|V|}$.
  • If V is ordered alphabetically, then
  • $L_w = (4, 0, 0, 0, \ldots)$, with components:

# of occurrences of "a" before w
# of occurrences of "abandoned" before w
# of occurrences of "abatuna" before w
83
Similarity of syntactic behavior is modeled as
closeness of L-vectors
  • where closeness of 2 vectors is modeled as the
    angle between them.
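
A minimal sketch of L-vectors and the angle between them, using sparse counts rather than full vectors in a |V|-dimensional space. The toy token sequence is invented for illustration.

```python
import math
from collections import Counter

def left_vectors(tokens):
    """Map each word to a Counter of the words occurring immediately before it."""
    vectors = {}
    for left, word in zip(tokens, tokens[1:]):
        vectors.setdefault(word, Counter())[left] += 1
    return vectors

def angle(u, v):
    """Angle in radians between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return math.acos(min(1.0, dot / (norm_u * norm_v)))

tokens = "the dog ran and the cat ran and a dog sat".split()
L = left_vectors(tokens)
print(angle(L["dog"], L["cat"]))   # about pi/4: both follow "the"
print(angle(L["dog"], L["ran"]))   # pi/2: no shared left neighbors
```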

84
Construct a (non-directed) graph
  • Its vertices are the words W in V.
  • For each word W
  • Pick the K most-similar words (K = 20, 50) (by
    angle of L-vector)
  • Add an edge to the graph connecting W to each of
    those words.
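
The graph construction can be sketched as below, ranking by cosine (largest cosine means smallest angle). The four vectors are hypothetical left-neighbor counts, and K = 1 is used only to keep the example small.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def knn_graph(vectors, k):
    """Undirected graph: each word is linked to its k most similar words."""
    edges = set()
    for w, vec in vectors.items():
        others = sorted((x for x in vectors if x != w),
                        key=lambda x: cosine(vec, vectors[x]), reverse=True)
        for x in others[:k]:
            edges.add(frozenset((w, x)))
    return edges

# Hypothetical left-neighbor counts for four words.
vectors = {
    "dog": {"the": 4, "a": 2},
    "cat": {"the": 3, "a": 1},
    "ran": {"dog": 2, "cat": 2},
    "sat": {"dog": 1, "cat": 3},
}
graph = knn_graph(vectors, k=1)
print(sorted(sorted(edge) for edge in graph))   # [['cat', 'dog'], ['ran', 'sat']]
```

Because a word can be chosen as a neighbor by many other words, vertices can end up with more than K edges.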

85
Canonical matrix representation of a graph
  • M(i,j) = 1 iff there is an edge connecting wi and
    wj; that is,
  • iff wi and wj are similar words as regards how
    they interact with the word immediately to the
    left.

86
Where is this matrix M?
  • It's a point in a space of size V(V-1)/2. Not
    very helpful, really.
  • How can we optimally reduce it to a space of
    small dimension?
  • Find the eigenvectors of the normalized Laplacian
    of the graph.
  • See Chung; Malik and Shi; Belkin and Niyogi.

87
A graph and its matrix M
  • The degree of a vertex (= word) is the number of
    edges adjacent (linked) to it.
  • Notice that this is not fixed across words.
  • The degree of vertex vi is the sum of the entries
    of the ith row.

88
The Laplacian of the graph
  • Let D be the V x V diagonal matrix s.t.
  • diagonal entry D(i,i) = degree of vi
  • D - M is the Laplacian of the graph.
  • Its rows sum to 0.

89
Normalized Laplacian
  • For each i, divide all entries in the ith row by
    √d(i).
  • For each i, divide all entries in the ith column
    by √d(i).
  • Result: diagonal elements are all 1.
  • Generally: for i ≠ j, entry (i,j) is
    $-1/\sqrt{d(i)\,d(j)}$ if vi and vj are linked, and 0
    otherwise.
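
The construction can be checked on a tiny graph in plain Python. This sketch assumes a symmetric 0/1 matrix with no isolated vertices; it also verifies the standard property that the vector of √d(i) is an eigenvector with eigenvalue 0.

```python
import math

def normalized_laplacian(adj):
    """Return (D^(-1/2) (D - M) D^(-1/2), degrees) for a symmetric 0/1 matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[(deg[i] if i == j else 0) - adj[i][j] for j in range(n)]
           for i in range(n)]
    norm = [[lap[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    return norm, deg

# A path graph on three hypothetical words: w0 - w1 - w2.
M = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L_norm, deg = normalized_laplacian(M)
diagonal = [L_norm[i][i] for i in range(3)]
# Applying L_norm to the sqrt-degree vector should give (numerically) zero.
image = [sum(L_norm[i][j] * math.sqrt(deg[j]) for j in range(3)) for i in range(3)]
print(diagonal, image)
```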

90
Eigenvector decomposition
  • The eigenvectors form a spectrum, ranked by the
    value of their eigenvalues.
  • Eigenvalues run from 0 to 2 (L is positive
    semi-definite).
  • The eigenvector with 0 eigenvalue reflects word
    frequency.
  • But the next smallest gives us a good
    representation of the words

91
  • in the sense that the values associated with
    each word show how close the words are in the
    original graph.
  • We can graph the first two eigenvectors of the
    Left (or Right) graph; each word is located at
    the coordinates corresponding to it in the
    eigenvector(s)

92
[Plot: Spanish (left). Labeled regions: masculine plurals, fem. plurals, feminine sg. nouns, masc. sg. nouns, past participles, finite verbs]
93
[Plot: German (left). Labeled regions: neuter sg. nouns; numbers, centuries; fem. sg. nouns; names of places]
94
[Plot: English (right). Labeled regions: nouns, modals, prepositions, of, to]
95
[Plot: English (left). Labeled regions: infinitives, past verbs, the, modals]
96
Results of experiment
  • If we define the size of the minimal box that
    includes all of the vocabulary as being 1 by 1,
    then we find a small (< 0.10) average distance
    to mean for unambiguous suffixes (e.g., -ed
    (English), -ait (French)), and only for them.

97
Measure
  • To repeat: we find the virtual location of the
    conflation of all of the stems of a given
    signature, plus the suffix in question, e.g.,
    NULL.ed.ing_ed
  • We do this for all signatures containing -ed
  • We compute average distance to the mean.
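
The measure can be sketched as the mean Euclidean distance of the bucket locations from their centroid. The coordinates below are invented; only the 0.10 threshold comes from the slides.

```python
import math

def average_distance_to_mean(points):
    """Mean Euclidean distance from each 2-D point to the centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)

# Hypothetical bucket locations in the 1-by-1 box,
# one point per signature containing the suffix.
tight = [(0.40, 0.50), (0.42, 0.50), (0.41, 0.53)]
spread = [(0.10, 0.20), (0.80, 0.30), (0.50, 0.90)]
print(average_distance_to_mean(tight) < 0.10)    # True: looks unambiguous
print(average_distance_to_mean(spread) < 0.10)   # False: looks ambiguous
```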

98
Average < 0.10
Average > 0.10
99
Rich morphologies
  • A practical challenge for use in data-mining and
    information retrieval in patent applications
    (de-oxy-ribo-nucle-ic, etc.)
  • Swahili, Hungarian, Turkish, etc.

100
The End
101
Appendices
102
Corpus
Pick a large corpus from a language: 5,000 to
1,000,000 words.
103
Feed it into the bootstrapping heuristic...
[Diagram: Corpus, Bootstrap heuristic]
104
Out of which comes a preliminary
morphology, which need not be superb.
[Diagram: Corpus, Bootstrap heuristic, Morphology]
105
Feed it to the incremental heuristics...
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics]
106
Out comes a modified morphology.
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology]
107
Is the modification an improvement? Ask MDL.
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology]
108
If it is an improvement, replace the morphology...
[Diagram: Corpus, Bootstrap heuristic, modified morphology, Morphology, Garbage]
109
Send it back to the incremental heuristics
again...
[Diagram: Corpus, Bootstrap heuristic, modified morphology, incremental heuristics]
110
Continue until there are no improvements to try.
[Diagram: Morphology, modified morphology, incremental heuristics]
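
The loop in this appendix can be sketched as a greedy search. Everything passed in here is a stand-in: the real bootstrap, incremental heuristics and description-length computation are far more involved, and the toy "morphology" is just an integer.

```python
def search(corpus, bootstrap, incremental_heuristics, description_length):
    """Greedy MDL search: keep a modification only if it shortens the description."""
    morphology = bootstrap(corpus)
    improved = True
    while improved:
        improved = False
        for heuristic in incremental_heuristics:
            candidate = heuristic(morphology, corpus)
            if description_length(candidate, corpus) < description_length(morphology, corpus):
                morphology = candidate        # the old morphology is discarded
                improved = True
    return morphology

# Toy stand-ins (all hypothetical): the description length of an integer
# "morphology" is its distance from 7.
result = search(
    corpus=None,
    bootstrap=lambda corpus: 0,
    incremental_heuristics=[lambda m, c: m + 1, lambda m, c: m - 1],
    description_length=lambda m, c: abs(m - 7),
)
print(result)   # 7
```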
111
  • Consider the space of all words of length L,
    built from an alphabet of size b.
  • How many ways are there to build a vocabulary of
    size N? Call that U(b,L,N).
  • Clearly, $U(b,L,N) = \binom{b^L}{N}$.

112
  • Compare that operation (choosing a set
    of N words of length L, alphabet size b) with the
    operation of choosing a set of T stems (of length
    t) and a set of F suffixes (of length f), where
    t + f = L.
  • If we take the complexity of each task to be
    measured by the log of its size, then we're
    asking the size of
    $\log U(b,L,N) - \log U(b,t,T) - \log U(b,f,F)$

113
is easy to approximate, however.
remember
114
The number of bits needed to list all the
words: the analysis
The length of all the pointers to all the
words: the compressed corpus
Thus the log of the number of vocabularies =
description length of that vocabulary, in the
terms we've been using
115
That means that the difference in the sizes of
the spaces of possible vocabularies is equal to
the difference in the description length in the
two cases; hence,
Difference of complexity of "simplex word"
analysis and complexity of "analyzed word"
analysis = $\log U(b,L,N) - \log U(b,t,T) - \log U(b,f,F)$
= difference in size of morphologies
+ difference in size of compressed data
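
To make the comparison concrete, the sketch below counts vocabularies with a binomial coefficient, assuming U(b,L,N) is the number of ways to choose N distinct words out of the b^L possible ones. The sizes are invented, and every stem is assumed to combine with every suffix.

```python
import math

def log_U(b, L, N):
    """log2 of the number of N-word vocabularies over b**L possible words."""
    return math.log2(math.comb(b ** L, N))

# Hypothetical sizes: 26 letters, 100 stems of length 5 and 10 suffixes of
# length 2, versus the same 1,000 words of length 7 listed unanalyzed.
b, t, f = 26, 5, 2
T, F = 100, 10
simplex = log_U(b, t + f, T * F)
analyzed = log_U(b, t, T) + log_U(b, f, F)
print(round(simplex), round(analyzed), round(simplex - analyzed))
```

The gap between the two numbers is the saving, in bits, from choosing stems and suffixes separately instead of choosing whole words.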
116
  • But we've (over)simplified in this case by
    ignoring the frequencies inherent in real
    corpora. What's of great interest in real life is
    the fact that some suffixes are used often,
    others rarely, and similarly for stems.

117
  • We know something about the distribution of
    words, but nothing about distribution of stems
    and especially suffixes.
  • But suppose we wanted to think about the
    statistics of vocabulary choice in which words
    could be selected more than once.

118
  • We want to select N words of length L, and the
    same word can be selected. How many ways of doing
    this are there?
  • You can have any number of occurrences of a word,
    and 2 sets of the same number of them are
    indistinguishable. How many such vocabularies are
    there, then?

119
where Z(i) is the number of words of frequency i.
(Z stands for Zipf).
We don't know much about frequencies of
suffixes, but Zipf's law says that
hence for a morpheme set that obeyed the Zipf
distribution
120
End of digression