Title: Self-organizing word representations for fast sentence processing
1. Self-organizing word representations for fast sentence processing
Stefan Frank, Nijmegen Institute for Cognition and Information, Nijmegen, The Netherlands
2-3. Relations among words: syntagmatic and paradigmatic
- Horizontal: syntagmatic relations (between words that occur together in a sequence).
- Vertical: paradigmatic relations (between words that can substitute for one another).
4-6. Syntagmatic and paradigmatic relations
7. Representing words as vectors
- In several models, such as LSA (Landauer & Dumais, 1997) and HAL (Burgess, Livesay, & Lund, 1998), each word corresponds to a vector in a high-dimensional state space.
- Distances between vectors encode relations between the represented words (see the sketch below).
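To make the idea concrete, a minimal numpy sketch of distance-based word relations; the toy three-dimensional vectors and the Euclidean metric are illustrative assumptions, not values from LSA or HAL.

```python
import numpy as np

# Toy word vectors (real models use hundreds or thousands of dimensions).
words = {
    "dog": np.array([0.9, 0.1, 0.2]),
    "cat": np.array([0.8, 0.2, 0.1]),
    "ran": np.array([0.1, 0.9, 0.7]),
}

def distance(w1, w2):
    """Euclidean distance between word vectors: smaller means more closely related."""
    return np.linalg.norm(words[w1] - words[w2])

print(distance("dog", "cat"))   # small: related words lie near each other
print(distance("dog", "ran"))   # larger: unrelated words lie further apart
```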
8. Representing words as vectors
[Figure: projections of a small part of word space onto two dimensions (Burgess, Livesay, & Lund, 1998). The verbs listened, talked, crawled, ran, and walked cluster together, as do the animal nouns cat, wolf, and dog.]
9. Representing words as vectors
- Vectors are near each other in state space if the corresponding words belong to the same word class and/or have similar meaning.
- That is, words are organized paradigmatically.
- LSA and HAL account for some experimental findings, e.g., semantic priming and synonym judgement (and false memories?).
10. Paradigmatic versus syntagmatic organization
- Why would the mental lexicon be organized paradigmatically?
- It makes it easy to find similar words, because these are nearby in space.
- But we don't use words to do semantic priming or synonym judgement; we use words to make sentences.
- For fast speaking and understanding, we need fast access to words that are likely to occur next.
- This calls for a syntagmatic organization.
11. Linking the two types of organization
Hypothesis: a paradigmatic organization of words facilitates a syntagmatic organization of word sequences.
- Words are organized paradigmatically because this allows for fast sentence processing/production.
12. A computational modelling recipe
Ingredients: a simplified language, a measure of syntagmaticity, and a measure of paradigmaticity.
- Construct a dynamical system (or recurrent neural network).
- Feed the system sentences one word at a time. Its state at each moment represents the word sequence (the input) so far.
- Adjust the word representations to increase the syntagmatic organization of the system's states.
- Show that the resulting word representations are organized paradigmatically.
13. The language: lexicon (Farkaš & Crocker, 2006)
- Total: 72 words
- Nouns
  - Proper: John, Kate, Mary, Steve
  - Singular: boy, girl, cat, dog, ...
  - Plural: boys, girls, cats, dogs, ...
  - Mass: bread, meat, fish
- Verbs
  - Auxiliary: do(es), is, are, were
  - Transitive: eat(s), chase(s), like(s)
  - Intransitive: eat(s), bark(s), walk(s)
- Adjectives: crazy, good, happy, ...
- Function words: where, who, what, the, that, those, and the end-of-sentence marker "."
14. The language: example sentences (Farkaš & Crocker, 2006)
- Declaratives
- the good boy eats .
- smart Kate who eats meat feeds the dog .
- a girl is sleazy .
- those are women .
- Interrogatives
- where is the hungry cat .
- does Steve run .
- what does the man who the happy girls see do .
- does Mary wanna eat bread .
- Imperatives
- sing .
- walk the dog .
15. A dynamical connectionist model: network architecture
[Diagram: each input has connections to k < n of the network's units.]
16. A dynamical connectionist model: state-space trajectories
- A simple, discrete, linear dynamical system:
  x_{t+1} = W x_t + y_{t+1}
  where y_{t+1} is the input to the FRN at time step t+1, W holds the FRN connection weights, and x_t is the n-dimensional FRN state vector at time step t (x_0 = 1).
- The n × n matrix W has small random values.
- The sequence x_0, ..., x_t is the trajectory through state space resulting from the input sequence y_1, ..., y_t.
- Trajectories are syntagmatic if the distance between x_t and x_{t+1} reflects the unlikelihood of input y_{t+1} (see the sketch below).
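A minimal numpy sketch of this system, to make the recipe concrete. The weight range (±0.1), the random seed, and the reading of "x_0 = 1" as a vector of ones are assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                 # state dimensionality
W = rng.uniform(-0.1, 0.1, (n, n))      # small random, fixed connection weights

def trajectory(inputs, W):
    """Return the states x_0, ..., x_T of the system x_{t+1} = W x_t + y_{t+1}."""
    x = np.ones(W.shape[0])             # x_0 = 1
    states = [x]
    for y in inputs:                    # the input sequence y_1, ..., y_T
        x = W @ x + y
        states.append(x)
    return states
```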
17. A dynamical connectionist model: word representation
- Each word w is represented by a k-dimensional vector v_w (with k < n).
- If word w occurs at time step t, the input vector y_t equals v_w up to the k-th element; the rest is 0 (see the sketch below).
- The objective is to obtain a syntagmatic organization of the trajectories x_0, ..., x_t by adjusting the word vectors v.
- Matrix W is not adjusted.
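A sketch of this input scheme, continuing in the same hypothetical numpy style; the five-word lexicon stands in for the full 72 words.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 50                          # state and word-vector dimensionality, k < n

# One k-dimensional vector per word, initially random in [-1, 1] (see next slide).
lexicon = ["the", "girls", "are", "nice", "."]
v = {w: rng.uniform(-1, 1, k) for w in lexicon}

def input_vector(word):
    """Build y_t for a word: its vector v_w in the first k elements, zeros elsewhere."""
    y = np.zeros(n)
    y[:k] = v[word]
    return y
```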
18-20. A dynamical connectionist model: adapting word representations
- Initially, all vectors v are random between -1 and 1.
- 5000 training sentences, e.g., "The girls are nice ."
- A 2D example (n = 2, k = 1):
  - The system starts in state x_0 and receives y_1 = (v_the, 0)^T, giving x_1 = W x_0 + y_1.
  - v_the is adjusted (Δv_the) so that x_1 moves closer to x_0.
  - The next input is y_2 = (v_girls, 0)^T, giving x_2 = W x_1 + y_2, and v_girls is adjusted (Δv_girls) in the same way (see slide 31 for the exact update rule).
21. Measuring syntagmaticity
- Compare:
  - Grammatical test (i.e., non-training) sentences.
  - Pseudosentences: random word strings with the same length distribution, word frequencies, and number of word repetitions as the test sentences.
- If trajectories are organized syntagmatically, the total trajectory length for test sentences (L_test) should be shorter than for pseudosentences (L_pseudo).
- Syntagmaticity = L_pseudo / L_test (≈ 1 before training; see the sketch below).
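A sketch of this measure, assuming Euclidean distance in state space (the slides do not name the metric). Trajectories here are lists of state vectors, as produced by the trajectory sketch on slide 16.

```python
import numpy as np

def trajectory_length(states):
    """Total distance travelled through state space along x_0, ..., x_T."""
    return sum(np.linalg.norm(b - a) for a, b in zip(states, states[1:]))

def syntagmaticity(test_trajectories, pseudo_trajectories):
    """L_pseudo / L_test: about 1 before training, above 1 once grammatical
    test sentences yield shorter trajectories than matched pseudosentences."""
    L_test = sum(map(trajectory_length, test_trajectories))
    L_pseudo = sum(map(trajectory_length, pseudo_trajectories))
    return L_pseudo / L_test
```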
22. Measuring paradigmaticity
- If words are represented paradigmatically, similar words have similar vectors.
- Divide the words into groups according to word class and meaning:
  - Names: John, Kate, Mary, Steve
  - Mass nouns: bread, meat, fish
  - Articles: a, the
  - Singular auxiliary verbs: does, is, was
  - Plural auxiliary verbs: do, are, were
  - etcetera...
23. Measuring paradigmaticity
- If words are represented paradigmatically, similar words have similar vectors.
- Divide the words into groups according to word class and meaning.
- If words are organized paradigmatically, the average distance within groups (d_within) should be smaller than the average distance between groups (d_between).
- Paradigmaticity = d_between / d_within (≈ 1 before training; see the sketch below).
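A sketch of the paradigmaticity measure under the same Euclidean-distance assumption; `groups` maps a group label (e.g., "names") to the word vectors in that group.

```python
import numpy as np
from itertools import combinations

def paradigmaticity(groups):
    """d_between / d_within: about 1 before training, above 1 once words in
    the same group have more similar vectors than words in different groups."""
    within, between = [], []
    for g in groups:
        within += [np.linalg.norm(a - b) for a, b in combinations(groups[g], 2)]
    for g, h in combinations(groups, 2):
        between += [np.linalg.norm(a - b) for a in groups[g] for b in groups[h]]
    return np.mean(between) / np.mean(within)
```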
24. Results (n = 100, k = 50)
25. Word representations plotted by the first two principal components
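A sketch of how such a plot can be produced, assuming the trained word vectors are stacked as the rows of a matrix; PCA via numpy's SVD and matplotlib for plotting are my choices, not the slides'.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_first_two_pcs(vectors, labels):
    """Project word vectors onto their first two principal components and plot them."""
    X = vectors - vectors.mean(axis=0)          # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                       # coordinates on PC 1 and PC 2
    plt.scatter(coords[:, 0], coords[:, 1])
    for (px, py), word in zip(coords, labels):
        plt.annotate(word, (px, py))
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```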
27. Conclusion
- Word representations were adapted to decrease the lengths of the state-space trajectories resulting from training sentences.
- As a result, words that occur after similar contexts received similar representations.
- A paradigmatic organization of the mental lexicon can facilitate fast sentence processing.
28. Bonus slides: syntagmaticity and reading-time predictions
- In a really good syntagmatic organization, the distance travelled through state space at a time step should correspond to the improbability of that time step's input word.
- Word probabilities are known, because the sentences were produced by a known probabilistic grammar.
- The correlation between the amount of information (= -log_2 Pr) conveyed by the words of test sentences and the resulting state-space distances can serve as an alternative measure of syntagmaticity (see the sketch below).
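A sketch of this alternative measure; the function name and argument layout are assumptions. Word probabilities come from the known probabilistic grammar.

```python
import numpy as np

def information_correlation(step_distances, word_probs):
    """Correlate per-word state-space distances |x_{t+1} - x_t| with the
    information -log2(Pr) conveyed by the corresponding test-sentence words."""
    info = -np.log2(np.asarray(word_probs))
    return np.corrcoef(np.asarray(step_distances), info)[0, 1]
```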
29. Bonus slides: syntagmaticity and reading-time predictions
- Presumably, it takes more time to travel a larger distance through state space.
- More informative words take longer to process.
- Hale (2003): word-reading times reflect word information.
- If Hale is right, the model could make reading-time predictions when trained on a corpus of natural text.
30. Bonus slides: syntagmaticity and reading-time predictions
31. Bonus slides: adjusting word representations
- The current state vector is x_t and the next input is y_{t+1}.
- New state: x_{t+1} = W x_t + y_{t+1}.
- If there is nothing else to learn, the state vector can stay the same: x_{t+1} = x_t, which requires y_{t+1} = x_t - W x_t.
- But there are other things to learn, so take small steps: Δy_{t+1} = η (x_t - W x_t - y_{t+1}), with learning rate η = .001.
- If the input word was w, its representation v_w is adjusted to produce Δy_{t+1} (for the first k elements):
  Δv_w = η (x_t - W x_t - y_{t+1}),
  using only the first k elements of x_t and the first k rows of W (see the sketch below).
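A minimal numpy sketch of this update step under the slides' definitions; the function name and signature are mine.

```python
import numpy as np

def update_word_vector(v_w, x_t, y, W, k, eta=0.001):
    """One adjustment of word w's k-dimensional vector (x_t and y are
    n-dimensional; W, the fixed n-by-n weight matrix, is never adjusted).

    Delta v_w = eta * (x_t - W x_t - y), keeping only the first k elements,
    i.e., using only the first k elements of x_t and the first k rows of W.
    """
    residual = x_t - W @ x_t - y        # zero exactly when the state can stay put
    return v_w + eta * residual[:k]
```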