Simulation of Language Acquisition - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Simulation of Language Acquisition


1
Simulation of Language Acquisition
  • Walter Daelemans
  • (CNTS, University of Antwerp)
  • walter.daelemans@ua.ac.be
  • http://www.cnts.ua.ac.be/walter
  • EMLAR 2005 Utrecht

2
Overview
  • Theories, computational models and simulations
  • Machine Learning
  • Generalization versus abstraction
  • Eager versus lazy learning
  • Memory-based models of language acquisition and
    processing
  • Case Study 1: Stress acquisition
  • TiMBL crash course and demonstration
  • Case Study 2: German plural

3
Simulation (1)
  • Theory
  • Explains and predicts empirical data
    (observations, experimental results)
  • CogSci: in terms of a knowledge representation,
    acquisition, and processing framework
  • Problems
  • Verbal
  • Sometimes vague, underspecified
  • Every theoretical description, however exact,
    turns out to contain errors when you try to
    implement it (Hugo Brandt Corstius, second law
    of Computational Linguistics)

4
Simulation (2)
  • Computational Model
  • Translation of a theory into specific symbol
    representation and processing framework
    (algorithms and data structures)
  • Advantages
  • Precise formulation
  • Explicit in all details
  • Consistency and completeness can sometimes be
    proven
  • Falsifiable through simulations
  • Simulations
  • A computational model with specific parameter
    settings used to mimic specific empirical data

5
Machine Learning as a model for acquisition
  • Cognitive architecture
  • Competence (knowledge representation)
  • Performance (search)
  • Acquisition (search)
  • Bias
  • Restrictions on input and output representations
  • Restrictions on learning algorithm
  • Restrictions on knowledge representation
    formalism

6
[Diagram: a learning component, constrained by BIAS, searches through candidate
representations (Ri, Rj, Rk, Rl) on the basis of Experience; the selected
representation is then used by a performance component mapping Input to Output.]
7
Generalisation ≠ Abstraction
[Diagram: a two-by-two space spanned by +/- generalisation and +/- abstraction.
+ abstraction, + generalisation: Rule Induction, Connectionism, Statistics,
Handcrafting (fill in your most hated linguist here).
- abstraction, + generalisation: Memory-Based Learning.
- abstraction, - generalisation: Table Lookup.]
8
Nativism ≠ Rule-Based
[Diagram: a two-by-two space spanned by nativist/empiricist and +/- rule-based.
Nativist, + rule-based: innate mental rules.
Nativist, - rule-based: hard-wired neural networks, innate probabilities?, innate exemplars?
Empiricist, + rule-based: Rule Induction.
Empiricist, - rule-based: Connectionism, Statistics, Memory-Based Learning.]
9
Machine Learning crash course
  • The field of machine learning is concerned with
    the question of how to construct computer
    programs that automatically learn with
    experience. (Mitchell, 1997)
  • Dynamic process: learner L shows improvement on
    task T after learning.
  • Getting rid of programming.
  • Handcrafting versus learning.
  • Machine Learning is task-independent.

10
Machine Learning Roots
  • Information theory
  • Artificial intelligence
  • Pattern recognition
  • Took off during the 70s
  • Major algorithmic improvements during the 80s
  • Forking: neural networks, data mining

11
Machine Learning 2 types
  • Theoretical ML (what can be proven to be
    learnable by what?)
  • Gold: identification in the limit
  • Valiant: probably approximately correct learning
  • Empirical ML (on real or artificial data)
  • Evaluation Criteria
  • Accuracy
  • Quality of solutions
  • Time complexity
  • Space complexity
  • Noise resistance

12
Empirical ML Key Terms 1
  • Instances: individual examples of input-output
    mappings of a particular type
  • Input consists of features
  • Features have values
  • Values can be
  • Symbolic (e.g. letters, words, ...)
  • Binary (e.g. indicators)
  • Numeric (e.g. counts, signal measurements)
  • Output can be
  • Symbolic (classification: linguistic symbols, ...)
  • Binary (discrimination, detection, ...)
  • Numeric (regression)

13
Empirical ML Key Terms 2
  • A set of instances is an instance base
  • Instance bases come as labeled training sets or
    unlabeled test sets (you know the labeling, the
    learner does not)
  • A ML experiment consists of training on the
    training set, followed by testing on the disjoint
    test set
  • Generalization performance (accuracy, precision,
    recall, F-score) is measured on the output
    predicted on the test set
  • Splits in train and test sets should be
    systematic: n-fold cross-validation (a minimal
    sketch follows below)
  • 10-fold CV
  • Leave-one-out testing
  • Significance tests on pairs or sets of (average)
    CV outcomes

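A minimal Python sketch of the 10-fold cross-validation protocol above (illustration only: `instances` stands for the labeled instance base, and `train_and_test` is a placeholder for whatever learner is being evaluated):

    def ten_fold_cv(instances, train_and_test):
        # train_and_test(train_set, test_set) -> accuracy on the held-out test set
        folds = [instances[i::10] for i in range(10)]    # 10 systematic, disjoint splits
        accuracies = []
        for i, test_set in enumerate(folds):
            train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
            accuracies.append(train_and_test(train_set, test_set))
        return sum(accuracies) / len(accuracies)         # average generalization accuracy

Leave-one-out testing is the limiting case in which every instance in turn is its own test set.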
14
Empirical ML 2 Flavors
  • Eager
  • Learning
  • abstract model from data
  • Classification
  • apply abstracted model to new data
  • Lazy
  • Learning
  • store data in memory
  • Classification
  • compare new data to data in memory

15
Eager vs Lazy Learning
  • Eager
  • Decision tree induction
  • CART, C4.5
  • Rule induction
  • CN2, Ripper
  • Hyperplane discriminators
  • Winnow, perceptron, backprop, SVM
  • Probabilistic
  • Naïve Bayes, maximum entropy, HMM
  • (Hand-made rulesets)
  • Lazy
  • k-Nearest Neighbour
  • MBL, AM
  • Local regression

16
[Diagram: choosing the Dutch diminutive suffix (-etje vs. -kje) in a space spanned by
two features, the nucleus and the coda of the last syllable; Rule Induction abstracts
the training instances into a few decision boundaries.]
17
[Diagram: the same -etje/-kje space under MBL; all training instances are kept in
memory and a new item (marked '?') is classified by its nearest neighbours.]
18
Eager vs Lazy Learning
  • Decision trees keep the smallest set of
    informative decision boundaries (in the spirit of
    MDL, Rissanen, 1983)
  • Rule induction keeps smallest number of rules
    with highest coverage and accuracy (MDL)
  • Hyperplane discriminators keep just one
    hyperplane (or vectors that support it)
  • Probabilistic classifiers convert data to
    probability matrices
  • k-NN retains every piece of information available
    at training time

19
Eager vs Lazy Learning
  • Minimal Description Length principle
  • Ockham's razor
  • Length of abstracted model (covering core)
  • Length of productive exceptions not covered by
    core (periphery)
  • Sum of sizes of both should be minimal
  • More minimal models are better
  • 'Learning = compression' dogma
  • In ML, the length of the abstracted model has been
    the focus, not storing the periphery

20
Eager vs Lazy So?
  • Highly relevant to language modeling
  • In language data, what is core? What is
    periphery?
  • Often little or no noise, but productive exceptions
  • (Sub-)subregularities, pockets of exceptions
  • disjunctiveness and polymorphism
  • Some important elements of language are not
    normally distributed
  • E.g. word forms have a Zipfian distribution
  • Hard to distinguish noise from exceptions on the
    basis of
  • Frequency
  • Typicality

21
(No Transcript)
22
ML and Natural Language
  • Apparent conclusion: ML could be an interesting
    tool for psycholinguistic modeling
  • Next to probability theory, information theory,
    statistical analysis (natural allies)
  • More and more annotated data available
  • Skyrocketing computing power and memory

23
Case Study
  • Exemplar-based acquisition of Dutch Stress
  • (Durieux / Gillis / Daelemans)

24
MBL: Use memory traces of experiences as a basis
for analogical reasoning, rather than using rules
or other abstractions extracted from experience
and replacing the experiences.
"This rule of nearest neighbor has considerable
elementary intuitive appeal and probably
corresponds to practice in many situations. For
example, it is possible that much medical
diagnosis is influenced by the doctor's
recollection of the subsequent history of an
earlier patient whose symptoms resemble in some
way those of the current patient." (Fix and
Hodges, 1952, p.43)
25
MBL Acquisition
  • Language process is represented by a set of
    exemplars in memory
  • Exemplars act as models
  • Learning is incremental storage of exemplars
  • Compression and Metrics
  • Exemplar consists of set of (mostly symbolic)
    features

26
MBL Processing
  • New instances of a performance process are solved
    through
  • Memory retrieval
  • Analogical (Similarity-Based) Reasoning
  • Similarity metric
  • Language(-faculty)-independent
  • Adaptive (feature and exemplar weighting)

27
Operationalization
  • Basis: k-nearest neighbor algorithm (sketched in
    code below)
  • store all examples in memory
  • to classify a new instance X, look up the k
    examples in memory with the smallest distance
    D(X,Y) to X
  • let each nearest neighbor vote with its class
  • classify instance X with the class that has the
    most votes in the nearest neighbor set
  • Choices
  • similarity metric
  • number of nearest neighbors (k)
  • voting weights

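A minimal Python sketch of the basic k-NN procedure just listed (illustration only: `memory`, `query` and `distance` are placeholder names, and voting is plain majority; TiMBL adds the weighting options of the following slides):

    from collections import Counter

    def knn_classify(memory, query, distance, k=1):
        # memory: list of (feature_tuple, class_label) exemplars stored at training time
        # 1. find the k exemplars with the smallest distance to the query
        neighbours = sorted(memory, key=lambda exemplar: distance(exemplar[0], query))[:k]
        # 2. let each nearest neighbour vote with its class
        votes = Counter(label for _, label in neighbours)
        # 3. the query receives the class with the most votes
        return votes.most_common(1)[0][0]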
28
The Overlap distance function
  • Count the number of mismatching features (see the
    sketch below)

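A one-function sketch of the plain overlap distance (names are assumptions); plugged into knn_classify above it makes every feature count equally:

    def overlap(x, y):
        # number of feature positions on which the two instances mismatch
        return sum(1 for xi, yi in zip(x, y) if xi != yi)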
29
The MVDM distance function
  • Estimate a numeric distance between pairs of
    values (a sketch follows below)
  • 'e' is more like 'i' than like 'p' in a phonetic
    task
  • 'book' is more like 'document' than like 'the' in
    a parsing task

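A rough Python sketch of how such a value-to-value distance can be estimated (MVDM), assuming co-occurrence counts of feature values and classes have already been collected from the training data:

    def mvdm(value1, value2, class_counts, classes):
        # class_counts[value][c]: how often feature value `value` occurs with class c
        total1 = sum(class_counts[value1].values()) or 1
        total2 = sum(class_counts[value2].values()) or 1
        # distance = summed difference between the class distributions of the two values
        return sum(abs(class_counts[value1].get(c, 0) / total1 -
                       class_counts[value2].get(c, 0) / total2)
                   for c in classes)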
30
Feature weighting in the distance function
  • Mismatching on a more important feature gives a
    larger distance
  • Enters as a weighting factor in the distance
    function (see the sketch below)

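A sketch (assumed names) of factoring per-feature weights into the overlap distance, so that a mismatch on an informative feature adds more to the distance than a mismatch on an uninformative one:

    def weighted_overlap(x, y, weights):
        # weights[i]: e.g. the information gain or gain ratio of feature i (next slide)
        return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)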
31
Entropy & IG Formulas
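The formulas on this slide did not survive in the transcript; the standard definitions used for entropy-based feature weighting are, in LaTeX notation:

    H(C) = -\sum_{c \in C} P(c) \log_2 P(c)                     (class entropy)
    IG(f) = H(C) - \sum_{v \in V_f} P(v)\, H(C \mid v)          (information gain of feature f)
    GR(f) = IG(f) / ( -\sum_{v \in V_f} P(v) \log_2 P(v) )      (gain ratio: IG over split info)
    \Delta(X,Y) = \sum_{i=1}^{n} w_i\, \delta(x_i, y_i)         (feature-weighted distance)

where δ is the per-feature overlap or MVDM distance and w_i the weight of feature i.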
32
Exemplar weighting
  • Scale the distance of a memory instance by some
    externally computed factor (see the sketch below)
  • Smaller distance for good instances
  • Bigger distance for bad instances

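A tiny sketch of exemplar weighting (the exact scaling convention is an assumption): the raw distance to a stored exemplar is rescaled by an externally computed factor so that good exemplars end up closer and bad ones further away:

    def exemplar_weighted_distance(raw_distance, exemplar_weight):
        # exemplar_weight > 1 for reliable exemplars, < 1 for noisy or atypical ones
        return raw_distance / exemplar_weight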
33
Distance weighting
  • Relation between larger k and smoothing
  • Make more distant neighbors contribute less in
    the class vote (the three schemes are sketched below)
  • Linear inverse of distance (w.r.t. max)
  • Inverse of distance
  • Exponential decay

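A Python sketch of the three distance-weighting schemes listed above (parameter names are assumptions); the returned weight multiplies a neighbour's class vote:

    import math

    def vote_weight(d, scheme="ID", d_max=1.0, alpha=1.0):
        if scheme == "IL":                 # linear inverse of distance, w.r.t. the maximum
            return (d_max - d) / d_max if d_max > 0 else 1.0
        if scheme == "ID":                 # inverse of the distance
            return 1.0 / (d + 1e-9)        # small constant avoids division by zero
        if scheme == "ED":                 # exponential decay with factor alpha
            return math.exp(-alpha * d)
        return 1.0                         # default: every neighbour weighs the same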
34
Learning word stress: A case study
  • Learn primary stress
  • Compare MBL with PP/UG
  • Match acquisition and processing data
  • Durieux, G. (2003). Computermodellen en
    klemtoon. Fonologische Kruispunten, BICN.
  • Daelemans, W., Gillis, S., and Durieux, G.
    (1994). The acquisition of stress: A
    data-oriented approach. Computational
    Linguistics, 20, 421-451.
  • Daelemans, W., Gillis, S., Durieux, G., and Van
    den Bosch, A. (1993). Learnability and
    markedness: Dutch stress assignment. In T.M.
    Ellison and J.M. Scobbie (Eds.), Computational
    Phonology. Edinburgh Working Papers in Cognitive
    Science, 8, pp. 157-178.

35
MBL for psychology
  • Similarity metric
  • Analogy engine
  • Feature weighting
  • Relevance assignment
  • Information fusion
  • Value weighting
  • Implicit concept formation
  • Exemplar weighting
  • Recency, priming
  • Distance-weighted extrapolation
  • Distributions, probabilities
  • Local modeling
  • Heterogeneity and density

36
Dominant Linguistic Approach
  • Principles and Parameters, UG
  • Typology
  • Acquisition
  • Formalism: metrical trees, metrical grids
  • Stress: prominence relations between
    constituents in a hierarchical structure

37
YOUPIE (Dresher & Kaye, 1990)
  • Assumptions
  • 11 parameters (216 languages)
  • Task-specific system for learning stress (domain
    knowledge)
  • Core grammar only
  • Learning
  • Cue-based parameter setting results in a grammar
    of stress
  • Performance
  • Generate tree with grammar and algorithmically
    determine stress location

38
[Diagram: YOUPIE architecture. Primary Linguistic Data (PLD) feed a cue-based
learning component that sets a vector of binary parameters (e.g. 1 0 1 0 0 0 0 1 1 0 1)
in the UG stress grammar; the grammar plus assignment rules then assign stress to a word.]
39
Parameters (with setting for Dutch)
40
MBL
  • Assumptions
  • Lexical storage and generalization
  • Generic learning method, no task-specific
    linguistic knowledge
  • Core and periphery
  • Learning
  • Based on storage of exemplars
  • Performance
  • Similarity-based reasoning with feature weighting
    on stored exemplars

41
[Diagram: MBL architecture. Primary Linguistic Data (PLD) are stored as
syllable-structure representations; a word's stress pattern is produced by retrieval
or similarity-based reasoning over the stored exemplars.]
42
YOUPIE tested
  • Experimental design
  • 216 languages
  • 117 items per language generated by YOUPIE
    performance component (no exceptions, core only)
  • For each language, grammar learned with YOUPIE
    cue-based learning component
  • Results
  • For 60% of the languages, YOUPIE reconstructs the
    original parameter setting with which the words
    were generated
  • For 21%, convergence is to a compatible setting
  • For 19% of the languages, errors in one or more
    stress patterns
  • Upper Boundary!
  • Perfect input, no exceptions to be learned

43
MBLP vs. YOUPIE
44
Discussion
  • No significant quantitative difference in
    performance
  • Clear qualitative difference
  • YOUPIE: more languages perfectly learned
  • MBLP: fewer errors per language
  • Issues
  • Real language data
  • Core and periphery
  • Acquisition
  • Processing

45
Dutch stress
  • Stress on one of the last three syllables
  • Predictable, but not completely
  • E.g., py-a-ma, ca-na-da, pa-ra-plu
  • Words not covered by the parameter-configuration
    for Dutch need lexical marking with exception
    features (one, two or completely idiosyncratic)

46
MBLP on Dutch data
  • CELEX, 4868 monomorphemes
  • Exemplar encoding schemes
  • For each of the three final syllables
  • S1: syllable weight (SL, L, H, SH)
  • S2: nucleus and coda (complete rhymes, VC)
  • S3: nucleus and coda (separate features,
    phonemes)
  • S4: onset, nucleus, and coda (phonemes)
  • Class: final, penultimate, antepenultimate

47
Results
48
Language Acquisition
  • Learning rules or learning lexical items?
  • Rules (Hochberg '88: Spanish; Nouveau '93: Dutch)
  • Lexical learning lacks generalization capacity
  • Lexical learning incompatible with acquisition
    data
  • Imitation task
  • Errors increase with irregularity
  • Tendency to regularization (but irregularization
    occurs)
  • By stress shift
  • By changing structure of repeated word

49
Error Percentages
50
Discussion
  • MBLP error correlates with markedness, like
    children's errors
  • MBLP has a tendency towards regularization, like
    children
  • Direction of stress shifts
  • Structural changes from inspection of nearest
    neighbors
  • Irregularization, and differences between 3- and
    4-year-olds on marked patterns, are hard to explain
    in a rule-based context
  • Rule learning is not the only possible
    explanation for the language acquisition data

51
Adult processing
  • Rule-based stress grammar and set of irregular
    words, marked in the lexicon
  • Known words: rule application, except when blocked
    by the lexicon
  • Unknown words: rule application
  • MBLP: lexical storage and analogy
  • Known words: look-up
  • Unknown words: analogy

52
Experimental set-up
  • Stimuli
  • Create pseudo-words and transcribe them (encoding
    4)
  • Have a machine learner assign stress (regular or
    irregular)

53
Experimental set-up
  • Method
  • 18 adult participants
  • Reading task
  • 3 independent judges, consensus
  • Result
  • Main effect for the regularity variable (ANOVA, p <
    .001): regular stress only in regular conditions
  • In all conditions, participants do the same as the
    model predicts (ANOVA, p < .001)

54
Results
55
Results
56
Discussion
  • Adult speakers sometimes prefer marked stress
    patterns for non-words
  • These cases are partially predictable with an
    MBLP model and are problematic in a rule-based
    model (regularization only)
  • BUT
  • MBLP has a significantly better match with
    participant behavior in the regular conditions
  • Hypothesis: differences between the mental lexicon
    and CELEX
  • Using a set-up with a population of machine
    learners trained on different samples from CELEX
    explains the variability

57
Summary
  • Goal: put MBLP to the test on a concrete
    linguistic problem of sufficient complexity by
    comparing it to
  • Linguistic theory
  • Child language acquisition data
  • Adult processing data
  • Results
  • MBLP and YOUPIE (PP/UG) comparable
  • MBLP can learn core as well as periphery using
    superficial representations
  • MBLP shows same errors and tendencies as children
    learning stress placement
  • MBLP better predictor of human adult behaviour
    with non-words

58
Overall Conclusion
  • Exemplar-based models should be taken as a
    serious alternative for rule-based/PP/UG/dual
    route type theories
  • Workable operationalisation of analogy
  • Adequacy
  • Similar results in morphology and syntax
    (grammatical relations, chunking, pp-attachment)
  • We'll see

59
Simulation with TiMBL
  • Demonstration: German plural

60
TiMBL: http://ilk.uvt.nl/timbl
  • Tilburg Memory-Based Learner
  • Available for research and education
  • Lazy learning, extending k-NN and IB1
  • Optimized search for NN
  • Internal structure: tree, not flat instance base
  • Tree ordered by chosen feature weight
  • Many built-in optional metrics: feature weights,
    distance functions, distance weights, exemplar
    weights, ...

61
Current practice
  • Default TiMBL settings
  • k = 1, Overlap, GR, no distance weighting
  • Work well for some morpho-phonological tasks
  • Rules of thumb
  • Combine MVDM with bigger k
  • Combine distance weighting with bigger k
  • Very good bet: higher k, MVDM, GR, distance
    weighting (an example invocation is given below)
  • Especially for sentence- and text-level tasks

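An illustrative invocation (file names are placeholders) that combines the rules of thumb above with the options documented on the next slides: a larger k, MVDM as global metric, gain-ratio weighting, and inverse-distance vote weighting:

    Timbl -f train.data -t test.data -mM -w1 -k5 -dID

With -t leave_one_out instead of a test file, the same settings can be evaluated by leave-one-out testing.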
62
  • usage: Timbl -f data-file -t test-file [options]
  • Algorithm and Metric options
  • -a n algorithm.
  • 0 or IB1 IB1 (default)
  • 1 or IG IGTree
  • 2 or TRIBL TRIBL
  • 3 or IB2 IB2
  • 4 or TRIBL2 TRIBL2
  • -m s use feature metrics as specified in
    string s
  • format: GlobalMetric:MetricRange:MetricRange
  • e.g. -mO:N3:I2,5-7
  • D Dot product. (Global only. numeric features
    implied)
  • O weighted Overlap. (default)
  • M Modified value difference.
  • N numeric values.
  • I Ignore named values.

63
  • -w 0 No Weighting.
  • 1 Weight using GainRatio. (default)
  • 2 Weight using InfoGain
  • 3 Weight using Chi-square
  • 4 Weight using Shared Variance
  • f use Weights from file 'f'.
  • -b n number of lines used for bootstrapping
    (IB2 only).
  • -d val weight neighbors as function of their
    distance
  • Z all the same weight. (default)
  • ID Inverse Distance.
  • IL Inverse Linear.
  • ED:a Exponential Decay with factor a. (no
    whitespace!)
  • ED:a:b Exponential Decay with factor a and
    b. (no whitespace!)
  • -k n k nearest neighbors (default n = 1).

64
  • -q n TRIBL threshold at level n.
  • -L n MVDM threshold at level n.
  • -R n solve ties at random with seed n.
  • -t f test using file 'f'.
  • -t leave_one_out test with Leave One Out,using
    IB1.
  • -t cross_validate Cross Validate Test,using IB1.
  • @f test using files and options described
    in file 'f'.
  • Supported options d e F k m o p q R
    t u v w x -
  • -t <file> is mandatory

65
  • Input options
  • -f f read from Datafile 'f'.
  • -f f OR use filenames from 'f' for CV test
  • -F format Assume the specified inputformat.
  • (Compact, C4.5, ARFF, Columns, Binary,
    Sparse )
  • -l n length of Features (Compact format
    only).
  • -i f read the InstanceBase from file 'f'.
    (skips phase 1 2 )
  • -u f read value_class probabilities from
    file 'f'.
  • -P d read data using path 'd'.
  • -s use exemplar weights from the input
    file
  • -s0 silently ignore the exemplar weights
    from the input file

66
  • Output options
  • -e n estimate time until n patterns tested.
  • -I f dump the InstanceBase in file 'f'.
  • -n f create names file 'f'.
  • -p n show progress every n lines. (default
    p 100,000)
  • -U f save value_class probabilities in file
    'f'.
  • -V Show VERSION.
  • +v or -v level set or unset verbosity level,
    where level is
  • s work silently.
  • o show all options set.
  • f show Calculated Feature Weights.
    (default)
  • p show MVD matrices.
  • e show exact matches.
  • as show advanced statistics. (memory
    consuming)
  • cm show Confusion Matrix.
  • cs show per Class Statistics. (implies
    vas)
  • di add distance to output file.
  • db add distribution of best matched to
    output file
  • k add a summary for all k neighbors to
    output file (sets -x)

67
  • -W f save current Weights in file 'f'.
  • +% or -% do or don't save test result (%) to
    file.
  • -o s use s as output filename.
  • -O d save output using path 'd'.
  • Internal representation options
  • -B n number of bins used for
    discretization of numeric feature values
  • -c n clipping frequency for prestoring
    MVDM matrices
  • -D Don't store distributions.
  • (saves memory, but disables vDB
    option)
  H or -H write hashed trees (default +H)
  • -M n size of MaxBests Array
  • -N n Number of features (default 2500)
  • -T n ordering of the Tree
  • DO none.
  • GRO using GainRatio
  • IGO using InformationGain
  • ( and many others)
  • +x or -x Do or don't use the exact match
    shortcut.

68
Data Representation
  • Symbolic features
  • segmental information (syllable structure)
  • stress
  • gender
  • German Plural (ca. 25,000 from CELEX)
  • Vorlesung (lecture): l e - z U N F en
  • Classes: e, (e)n, s, er, -, U-, Uer, Ue

69
Cognitive Architectures of Inflectional Morphology
  • Dual Route (Pinker, Clahsen, Marcus, ...)
  • Rules for regular cases
  • (over)generalization
  • default behaviour
  • Associative memory for exceptions
  • irregularization / family effects
  • Single Route (RM, MacWhinney, Plunkett, Elman,
    ...)
  • Frequency-based regularity

[Diagram: the dual-route architecture. Input features go to an associative memory
(pattern associator) and to the rule route; when memory lookup fails, the default
rule supplies the suffix class.]
70
German Plural
  • Notoriously complex but routinely acquired (at
    age 5)
  • Evidence for Dual Route?
  • -s suffix is default/regular (novel words,
    surnames, acronyms, ...)
  • -s suffix is infrequent (least frequent of the
    five most important suffixes)

71
(No Transcript)
72
The default status of -s
  • Similar item missing: Fnöhk-s
  • Surname, product name: Mann-s
  • Borrowings: Kiosk-s
  • Acronyms: BMW-s
  • Lexicalized phrases: Vergissmeinnicht-s
  • Onomatopoeia, truncated roots, derived nouns, ...

73
(No Transcript)
74
Discussion
  • Three classes of plurals: ((-en, -)(-e, -er))(-s)
  • the first four suffixes seem regular and can be
    accurately learned using information from
    phonology and gender
  • -s is learned reasonably well but information is
    lacking
  • Hypothesis: more features (syntactic, semantic,
    meta-linguistic, ...) are needed to enrich the
    lexical similarity space
  • No difference in accuracy and speed of learning
    with and without Umlaut
  • Overall generalization accuracy very high: 95%
    (90%)
  • Schema-based learning (Köpcke).

75
(No Transcript)
76
(No Transcript)
77
Acquisition DataSummary of previous studies
  • Existing nouns
  • (Park '78; Veit '86; Mills '86; Schamer-Wolles '88;
    Clahsen et al. '93; Sedlak et al. '98)
  • Children mainly overapply -e or -(e)n
  • -s plurals are learned late
  • Novel words
  • (Mugdan '77; MacWhinney '78; Phillis & Bouma '80;
    Schöler & Kany '89)
  • Children inflect novel words with -e or -(e)n
  • More irregular plural forms produced than
    defaults

78
MBLP simulation
  • model overapplies mainly -en and -e
  • -s is learned late and imperfectly
  • Mainly but not completely parallel to input
    frequency (more -s overgeneralization than -er
    generalization)

79
Bartke, Marcus, Clahsen (1995)
  • 37 children, aged 3;6 to 6;6
  • pictures of imaginary things, presented as
    neologisms
  • names or roots
  • rhymes of existing words or not
  • choice: -en or -s
  • results
  • children are aware that unusual sounding words
    require the default
  • children are aware that names require the default

80
MBLP simulation
  • sort CELEX data according to rhyme
  • compare overgeneralization
  • to -en versus to -s
  • percentage of total number of errors
  • results
  • when new words don't rhyme, more errors are made
  • overgeneralization to -en drops below the level
    of overgeneralization to -s

81
Conclusions
  • Computational models in language acquisition
    shouldn't necessarily be connectionist
  • From rule induction to exemplar-based models
  • TiMBL may be useful as software for computational
    psycholinguistics