Title: ON THE USE OF LINGUISTIC CORPORA IN CONNECTIONIST MODELLING
1. ON THE USE OF LINGUISTIC CORPORA IN CONNECTIONIST MODELLING
- Kari Hiltula
- University of Tampere
- Finnish language
2. Outline of the presentation
- A focus on the data in modelling
- Corpus as a basis for the training environment of a model
- Training a connectionist model
- The relationship between the model and the real thing
- The learning situation redefined
- Towards modelling meaning-induced learning
3. A focus on the data in modelling
- Models of language learning are seen to partake in the debate between connectionist and symbolic theories of cognition (Pinker & Prince 1988)
- As a result, the discussion has focussed more on the appropriate mechanism(s) of a model than on the actual data
- What a connectionist model comes to represent depends mostly on the data it has been trained on
4. Corpus as a basis for the learning environment of a model (1)
- The training data of a connectionist model is often based on lexical and frequency data derived from a corpus, which could be a general written language corpus (e.g., the Brown Corpus), a particular literary text in an electronic format, a dictionary, etc.
- The training data may consist of simplified patterns to scale down the original problem for the purposes of modelling but preserve the relative frequency of the patterns in the chosen corpus
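The scaling-down step described above can be sketched as follows. This is only a minimal illustration, not a procedure taken from any of the cited models: the function name `scale_down`, the largest-remainder rounding, and the verb counts are all assumptions made up for the example.

```python
from collections import Counter


def scale_down(corpus_counts, total_tokens):
    """Scale raw corpus counts down to `total_tokens` training tokens
    while keeping each pattern's relative frequency as close as possible
    (largest-remainder rounding; an assumption of this sketch)."""
    corpus_total = sum(corpus_counts.values())
    exact = {p: c * total_tokens / corpus_total for p, c in corpus_counts.items()}
    scaled = {p: int(x) for p, x in exact.items()}
    # Give the tokens lost by rounding down to the patterns
    # with the largest fractional remainders.
    leftover = total_tokens - sum(scaled.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - scaled[p], reverse=True)
    for p in by_remainder[:leftover]:
        scaled[p] += 1
    return scaled


# Hypothetical counts, invented for illustration (not taken from the Brown Corpus).
corpus_counts = Counter({"go": 600, "walk": 300, "sing": 80, "strive": 20})
training_counts = scale_down(corpus_counts, total_tokens=100)
print(training_counts)
```

Whatever the rounding scheme, the point is that the training set inherits its frequency profile from the chosen corpus, so a different corpus would yield a different training set.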
5. Corpus as a basis for the learning environment of a model (2)
- A common conception of a connectionist model: an attempt to approximate or mimic the acquisitional situation of a young native learner (MacWhinney et al. 1989, p. 263)
- The training data or set can be regarded as a phenomenon-relevant (e.g., past tense learning) sample of the actual language environment confronted by the child
- So far, there are no exact criteria for choosing a representative corpus for that sample
6. Corpus as a basis for the learning environment of a model (3)
- The difficulty of measuring the match between the actual and model input
- "These numbers, although accurate, may justly be regarded with a certain degree of suspicion with regard to their appropriateness as a measure of a child's input, as the Brown corpus (from which the frequency data derives) is a sample not of children's (or child-directed) spontaneous speech, but of written, edited, adult-to-adult communication such as novels, magazines, and newspapers. On the other hand, measuring only child-directed or child-initiated speech could also be misleading as most children certainly listen to adult conversation (and even edited adult speech, e.g., on television)." (Plunkett & Juola 1999, pp. 467-468.)
7. Training a connectionist model (1)
- A training set is essentially an input for the learning model (here, a supervised neural network), together with the desired output
- During training, the chosen set of verbs, nouns, or other patterns relevant to the phenomenon being modelled is presented to the model several hundred times
- The output of the model is tested at various points during training, and at the end of training (e.g., with a new set of patterns)
8. Training a connectionist model (2)
- The model is trained until it has learned the particular input-output mapping (base form of a verb/noun → inflected form, e.g., past tense of a verb or plural form of a noun) to a certain criterion; what is often examined is the U-shaped learning curve
- To sum up: in order to train a model, the modeller has to define the network type and algorithm, the patterns that represent the mapping, the representational format of the patterns, and the training regime
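The modeller's decisions listed above can be made concrete in a small sketch. Everything in it is an assumption chosen for brevity and does not reproduce any published model: the network type is a single layer of logistic units, the algorithm is the delta rule, the patterns are toy binary vectors standing in for phonological representations, and the "inflected form" is simply the stem plus one suffix unit. The sketch does not attempt to produce a U-shaped learning curve; it only shows where each decision enters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, biases, pattern):
    """Output activations of a single-layer network for one input pattern."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, pattern)))
            for row, b in zip(weights, biases)]


def max_error(weights, biases, patterns):
    """Largest absolute output error over the whole pattern set."""
    return max(abs(t - o)
               for stem, target in patterns
               for t, o in zip(target, forward(weights, biases, stem)))


def train(patterns, n_in, n_out, epochs=500, rate=0.5, criterion=0.1, seed=0):
    """Training regime: present every pattern once per epoch in random order,
    test after each epoch, and stop when the error criterion is met."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    history = []  # error measured at the end of every epoch
    for _ in range(epochs):
        order = patterns[:]
        rng.shuffle(order)
        for stem, target in order:
            output = forward(weights, biases, stem)
            for j in range(n_out):
                delta = target[j] - output[j]  # delta rule error term
                biases[j] += rate * delta
                for i in range(n_in):
                    weights[j][i] += rate * delta * stem[i]
        history.append(max_error(weights, biases, patterns))
        if history[-1] < criterion:
            break
    return weights, biases, history


# Representational format (hypothetical): 3 binary "stem" features in,
# the same 3 features plus one always-on "suffix" unit out.
stems = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
patterns = [(stem, stem + [1]) for stem in stems]
weights, biases, history = train(patterns, n_in=3, n_out=4)
print(len(history), history[-1])
```

The `history` list corresponds to testing the model at various points during training; a held-out set of patterns could be passed to `max_error` in the same way to test generalization at the end.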
9. The relationship between the model and the real thing (1)
- The model is a hypothesis of how particular mental processes take place
- "It is useful here to recall the theoretical assumptions of the model, namely that children's overregularization errors can be explained in terms of their attempt to systematize the relationship between phonological representations of verb stems known to them and phonological representations of the past tense forms known to them." (Plunkett & Marchman 1996, p. 303, italics in original.)
10. The relationship between the model and the real thing (2)
- The mapping (e.g., base form → inflected form) represents the kind of environment under the influence of which the learning (e.g., of the English past tense) takes place
- Some unanswered questions:
- Why start with the base form?
- Are there other forms in the environment that may have an influence on learning? (gerund forms in English; cf. Finnish, with a common past tense and plural marker -i-)
11. The relationship between the model and the real thing (3)
- In the connectionist literature, the training set has been interpreted either as a) a (mutatis mutandis) actual input, or b) an already interpreted input for the learning model or agent
- A problem: it is difficult to conceive of the training set both as a sample of the actual learning environment (based on a corpus) and as to-be-internalized data processed by a putatively mental mechanism
12. The relationship between the model and the real thing (4)
- One solution is to distinguish between the uptake (internalized) and the input (actual) environment
- "In essence, the modeller specifies both the uptake and the input environment in the assessment of the degree to which absolute token frequencies influence the saliency of the training item. As a result, the incidence of low frequency forms in the uptake environment are inflated relative to the hypothesized input environment." (Plunkett & Marchman 1996, p. 303.)
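One way to picture the inflation of low-frequency forms is a compression of the token counts before they are turned into presentation probabilities. The sketch below uses a logarithmic compression purely as an assumption; the slides do not state which function the cited model uses, and the function names and token counts are hypothetical.

```python
import math


def relative(counts):
    """Relative frequencies in the hypothesized input environment."""
    total = sum(counts.values())
    return {form: count / total for form, count in counts.items()}


def uptake_frequencies(input_counts):
    """Compress absolute token counts with log(1 + count) and renormalize.
    The logarithm is an assumption of this sketch, not a documented choice."""
    compressed = {form: math.log(1 + count) for form, count in input_counts.items()}
    total = sum(compressed.values())
    return {form: value / total for form, value in compressed.items()}


# Hypothetical token counts, invented for illustration.
input_counts = {"went": 900, "walked": 90, "strove": 10}
input_share = relative(input_counts)
uptake_share = uptake_frequencies(input_counts)

for form in input_counts:
    print(form, round(input_share[form], 3), round(uptake_share[form], 3))
```

Under any such compression the rank order of the forms is kept, but the rare form takes up a larger share of the uptake environment than of the input environment, which is the effect the quotation describes.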
13. The relationship between the model and the real thing (5)
- The distinction creates a further problem: the training set is a hypothesis of the data to-be-internalized by the child, based on a hypothesis of the actual input data for the child, which in turn is most often based on the corpus or other data from which the absolute token frequencies derive
- As a consequence, the choice of the original corpus has considerable influence on the composition of the training set and thus on the possible hypotheses
14. The learning situation redefined (1)
- The samples of observations (a corpus, a child language study, etc.) that are used in the training data, together with other decisions on the construction of the model, are an example of (theoretical and contemplative) observer's knowledge (as defined by Itkonen 2005, p. 187)
- A question: can the leap from observer's knowledge to (practical) agent's knowledge in a fully trained model be justified, or is it only stipulated?
15. The learning situation redefined (2)
- The leap is not justified if the modeller simply claims that the learning properties stem from the model itself; on the contrary, the models have specifically been trained to accomplish certain tasks
- To interpret a model of language learning as characterizing agent's knowledge, some notion of the role of meaning in the formation of that knowledge should be considered
16. The learning situation redefined (3)
- Semantic cues eventually used as a guide for learning must first be recognized and distinguished from other equivalent cues by the learner, whereas the modeller has power over which cues to include in the training data of the model, e.g., cues on class membership, gender, etc.
- Paying attention to whatever relevant cues there are requires conceptualization (see, e.g., Mandler 2004, p. 188)
17. The learning situation redefined (4)
- The question of meaning hardly arises if the training set is seen as a learning environment external to the agent
- If the set is seen as a hypothesis of salient to-be-internalized data, the question of meaning is presupposed (by letting the model have access to crucial semantic cues) or simply ignored
- So far, actual production or comprehension data have not been used in training the models
18. Towards modelling meaning-induced learning
- Instead of seeing the training set as representing a certain grammatical domain as such (in the mind) of the learner, it should optimally focus on a particular setting under which that domain is active
- A comparison between a model accomplishing a certain task and human performance may call for a delimitation of the corpus base
- What is needed is a theory of pragmatics compatible with connectionist modelling
19. REFERENCES
- Esa Itkonen. 2005. Analogy as structure and process. John Benjamins, Amsterdam.
- Brian MacWhinney, Jared Leinbach, Roman Taraban, and Janet McDonald. 1989. Language learning: cues or rules? Journal of Memory and Language 28, 255-277.
- Jean Matter Mandler. 2004. The foundations of mind. Oxford University Press, Oxford.
- Steven Pinker and Alan Prince. 1988. On language and connectionism: analysis of a parallel distributed processing model of language acquisition. Cognition 28, 73-193.
- Kim Plunkett and Patrick Juola. 1999. A connectionist model of English past tense and plural morphology. Cognitive Science 23, 463-490.
- Kim Plunkett and Virginia Marchman. 1996. Learning from a connectionist model of the acquisition of the English past tense. Cognition 61, 299-308.