Machine Learning for (Psycho-)Linguistics

1
Machine Learning for (Psycho-)Linguistics
  • Walter Daelemans
  • daelem@uia.ua.ac.be
  • http://cnts.uia.ac.be
  • CNTS, University of Antwerp
  • ILK, Tilburg University
  • QITL-02

2
Outline
  • Machine Learning of Language
  • Induction of rules and classes
  • Learning by Analogy
  • Case Studies
  • Discovery of phonological categories and
    morphological rules
  • A single-route model of morphological processing
  • Issues
  • Probabilities versus symbolic structure induction
  • Nativism versus empiricism
  • Exemplar analogy versus rules

3
[Figure: learning system diagram. Labels: Experience, Bias, Learning Component, Search (over Ri, Rj, Rk, Rl), Performance Component, Input, Output]

4
Problems with Probabilities
  • Explanation
  • Also applies to neural networks
  • Event relevance
  • Especially in unsupervised learning (clustering)
  • Incorporation of linguistic knowledge
  • Smoothing zero-frequency events

5
(symbolic) machine learning
  • Rule induction (understandable induced theories)
  • Inductive Logic Programming (incorporating
    linguistic knowledge)
  • Memory-based learning (similarity-based smoothing
    of sparse data, feature weighting)

6
Common Fallacies
  • Rules = nativism
  • (and connections = empiricism)
  • Generalization = abstraction
  • (and memory = table-lookup)

7
Rule-Based ≠ Innate
  • Rules can be induced from primary linguistic data
    as well
  • Applications in Linguistics
  • Evaluation and comparison of linguistic
    hypotheses
  • Discovery of linguistic generalizations and
    categories

8
Allomorphy in Dutch Diminutive
  • "One of the more spectacular phenomena of modern
    Dutch morphophonemics" (Trommelen, 1983)
  • Base form of noun + -tje (5 variants)
  • Linguistic theory (from Te Winkel, 1862):
  • Rime of last syllable, stress, morphological
    structure, ...
  • Trommelen (1983):
  • Local phenomenon; stress and morphological
    structure do not play a role
  • CELEX data (3900 nouns)
  • - b i - z @ m A nt ? je

9
Allomorphs
10
Decision Tree Learning
  • Given a data set, construct a decision tree that
    reflects the structure of the domain
  • A decision tree is a tree where
  • non-leaf nodes represent features (tests)
  • branches leading out of a test represent possible
    values for the feature
  • leaf nodes represent outcomes (classes)
  • Decision Tree can be translated into a set of
    IF-THEN rules (with further optimization)
  • Value grouping

11
Decision Tree Construction
  • Given a set of examples T
  • If T contains one or more cases all belonging to
    the same class C, then the decision tree for T is
    a leaf node with category C.
  • If T contains different classes then
  • Choose a feature, and partition T into subsets
    that have the same value for the feature chosen.
    The decision tree consists of a node containing
    the feature name, and a branch for each value
    leading to a subset.
  • Apply the procedure recursively to subsets
    created this way.
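
The recursive procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the talk: the slide does not say how the feature is chosen, so information gain is assumed here, and the toy examples and feature names (`coda`, `nucleus`) are invented for the demonstration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, feature):
    """Entropy reduction obtained by partitioning the examples on one feature."""
    labels = [label for _, label in examples]
    partitions = {}
    for feats, label in examples:
        partitions.setdefault(feats[feature], []).append(label)
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def build_tree(examples, features):
    """examples: list of (feature_dict, class); features: names not yet tested."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: all cases belong to one class, or no feature is left to test.
    if len(set(labels)) == 1 or not features:
        return majority
    # Assumed selection criterion: the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(examples, f))
    remaining = [f for f in features if f != best]
    subsets = {}
    for feats, label in examples:
        subsets.setdefault(feats[best], []).append((feats, label))
    return {
        "feature": best,
        "default": majority,
        "branches": {value: build_tree(subset, remaining)
                     for value, subset in subsets.items()},
    }


def classify(tree, feats):
    """Follow branches to a leaf; unseen values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(feats[tree["feature"]], tree["default"])
    return tree


# Invented toy data, only to exercise the code.
TOY = [
    ({"coda": "obstruent", "nucleus": "short"}, "-je"),
    ({"coda": "obstruent", "nucleus": "long"}, "-je"),
    ({"coda": "m", "nucleus": "long"}, "-pje"),
    ({"coda": "m", "nucleus": "short"}, "-etje"),
    ({"coda": "none", "nucleus": "long"}, "-tje"),
]
tree = build_tree(TOY, ["coda", "nucleus"])
print(classify(tree, {"coda": "m", "nucleus": "short"}))
```

Each path from the root to a leaf reads as one IF-THEN rule, which is how a tree of this kind can be turned into a rule set.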

12
Induced rule set
  • Default class is -tje
  • IF coda last is /lm/ or /rm/ THEN -pje
  • IF nucleus last is bimoraic AND coda last is
    /m/ THEN -pje
  • IF coda last is /N/ THEN
  • IF nucleus penultimate is empty or schwa THEN
    -etje ELSE -kje
  • IF nucleus last is short and coda last is
    nas or liq THEN -etje
  • IF coda last is obstruent THEN -je
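
As a hedged sketch, the rule set above can be read as an ordered decision list. Two things are assumptions here, not statements from the slide: that the rules apply in the order listed with the first match winning, and the membership of the segment sets `NASALS`, `LIQUIDS` and `OBSTRUENTS`, which are illustrative only. Nucleus values are passed in as already-categorised symbols (`"short"`, `"bimoraic"`, `"schwa"`, `"empty"`, or anything else for a full vowel).

```python
# Illustrative segment classes (assumed; the induced categories are not listed in full).
NASALS = {"m", "n", "N"}
LIQUIDS = {"l", "r"}
OBSTRUENTS = {"p", "t", "k", "b", "d", "f", "s", "x", "v", "z"}


def diminutive_suffix(coda_last, nucleus_last, nucleus_penult):
    """Apply the induced rules in listed order; the first matching rule decides."""
    if coda_last in ("lm", "rm"):
        return "-pje"
    if nucleus_last == "bimoraic" and coda_last == "m":
        return "-pje"
    if coda_last == "N":
        # Nested rule: the penultimate nucleus decides between -etje and -kje.
        return "-etje" if nucleus_penult in ("empty", "schwa") else "-kje"
    if nucleus_last == "short" and coda_last in NASALS | LIQUIDS:
        return "-etje"
    if coda_last in OBSTRUENTS:
        return "-je"
    return "-tje"  # default class


print(diminutive_suffix("N", "short", "schwa"))
```

Writing the rules this way makes the ordering question explicit: a short nucleus followed by /lm/ is caught by the first rule before the nasal/liquid rule is ever consulted.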

13
Results
  • Problem is almost perfectly learnable (98.4%)
  • More than last syllable is needed for a full
    solution
  • Only rime of last syllable (not stress or onset)
    is relevant
  • Induced Categories
  • Nasals, liquids, obstruents, short vowels,
    bimoraic vowels (consisting of vowels, diphthongs,
    schwa)
  • Task-dependent categories? Category formation is
    dependent on the task to be learned, not
    absolute, not language-independent

14
Conclusions Rule Induction in Linguistics
  • Falsify existing linguistic theories
  • Evaluate role of linguistic information sources
  • (Re)discover interesting linguistic rules
    (supervised learning)
  • (Re)discover interesting linguistic categories
    (unsupervised learning)
  • Empiricist alternative for (mostly nativist)
    rule-based systems

15
There is one small problem
  • Current methodology for comparative machine
    learning experiments is not reliable (especially
    with small data)
  • Different runs of the algorithm provide different
    resulting rule sets
  • Algorithm can be tweaked to get high performance
    with any information source combination
  • Algorithm is highly sensitive to training data,
    feature selection, algorithm parameter settings, ...
  • Only to be used as a heuristic
  • As with your own rule induction module

16
Word Sense Disambiguation (do)
Similar: experience, material, say, then, ...
[Table; layout lost in extraction. Labels: keywords, Local Context, Default, Optimized parameters LC, Optimized parameters. Values in extracted order: 47.9, 49.0, 59.5, 60.8, 61.0, 60.8]

17
Generalisation ≠ Abstraction
[Figure: diagram with two axes, abstraction to - abstraction and generalisation to - generalisation. Labels: Rule Induction, Connectionism, Inductive Logic Programming, Statistics, Handcrafting, (Fill in your most hated linguist here), Memory-Based Learning, Table Lookup]

18
MBL: Use memory traces of experiences as a basis
for analogical reasoning, rather than using rules
or other abstractions extracted from experience
and replacing the experiences.
"This rule of nearest neighbor has considerable
elementary intuitive appeal and probably
corresponds to practice in many situations. For
example, it is possible that much medical
diagnosis is influenced by the doctor's
recollection of the subsequent history of an
earlier patient whose symptoms resemble in some
way those of the current patient." (Fix and
Hodges, 1952, p. 43)

19
[Figure: Rule Induction. Labels: -etje, -kje, Coda last syl, Nucleus last syl]

20
[Figure: MBL. Labels: -etje, -kje, ?, Coda last syl, Nucleus last syl]

21
Memory-Based Learning
  • Basis: k-nearest-neighbor algorithm
  • store all examples in memory
  • to classify a new instance X, look up the k
    examples in memory with the smallest distance
    D(X,Y) to X
  • let each nearest neighbor vote with its class
  • classify instance X with the class that has the
    most votes in the nearest neighbor set
  • Choices
  • similarity metric
  • number of nearest neighbors (k)
  • voting weights
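
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the distance D(X,Y) is taken to be a plain overlap count (the number of mismatching feature values), every neighbor gets an equal vote, and the stored instances are invented toy data.

```python
from collections import Counter


def overlap_distance(x, y):
    """Number of feature positions at which two symbolic instances differ."""
    return sum(a != b for a, b in zip(x, y))


def knn_classify(memory, x, k=1, distance=overlap_distance):
    """memory: list of (features, class). Returns the class with most votes among the k nearest."""
    neighbors = sorted(memory, key=lambda item: distance(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# "Learning" is just storing examples (invented toy instances).
memory = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("e", "f", "g"), "Y"),
]
print(knn_classify(memory, ("a", "b", "z"), k=1))
```

The three choices listed on the slide map directly onto the `distance` argument, the `k` argument, and the way `votes` is filled.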

22
Metrics
ib1
ib1-ig
ib1-mvdm
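
The formulas on this slide did not survive in the transcript. As a hedged sketch only, assuming ib1 denotes plain overlap distance and ib1-ig denotes overlap distance in which each feature is weighted by its information gain (the usual reading of these names in the memory-based learning literature), the weighting can be computed like this. The toy instances are invented; ib1-mvdm is not sketched.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain_weights(memory):
    """One weight per feature: class entropy minus expected entropy after splitting on it."""
    labels = [label for _, label in memory]
    base = entropy(labels)
    weights = []
    for i in range(len(memory[0][0])):
        by_value = {}
        for feats, label in memory:
            by_value.setdefault(feats[i], []).append(label)
        remainder = sum(len(part) / len(memory) * entropy(part)
                        for part in by_value.values())
        weights.append(base - remainder)
    return weights


def weighted_overlap(x, y, weights):
    """Sum of the weights of the features on which x and y disagree."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Invented toy data: feature 0 is uninformative, feature 1 predicts the class.
memory = [
    (("a", "p"), "X"),
    (("a", "q"), "Y"),
    (("b", "p"), "X"),
    (("b", "q"), "Y"),
]
weights = information_gain_weights(memory)
print(weights)
```

With weights like these, a mismatch on an uninformative feature costs nothing, so the nearest neighbors are chosen by the features that matter for the task.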
23
Metrics (2)
  • Voting options
  • Equal weight for each nearest neighbor
  • Distance weighted voting
  • Inverse distance 1/D(X,Y) (Wettschereck, 1994)
  • RBF-style gaussian voting function (Shepard,
    1987)
  • Linear voting function (Dudani, 1976)

(NB: the weighted NN distribution can be used as a
conditional probability)
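
A minimal sketch of the voting options, under stated assumptions: the inverse-distance weight adds a small `eps` to avoid division by zero, the Gaussian-style weight is written as exp(-alpha * d^beta) with `alpha` and `beta` as free parameters chosen here for illustration, and the linear weight follows Dudani's scheme in which the nearest neighbor gets weight 1 and the farthest gets weight 0. Normalising the summed weights gives the class distribution mentioned in the note above.

```python
from collections import defaultdict
from math import exp


def inverse_distance_weight(d, eps=1e-9):
    """1 / D(X,Y); eps (an assumption) guards against exact matches with d = 0."""
    return 1.0 / (d + eps)


def exponential_decay_weight(d, alpha=1.0, beta=2.0):
    """Gaussian-style decay; alpha and beta are illustrative parameter values."""
    return exp(-alpha * d ** beta)


def linear_weight(d, d_nearest, d_farthest):
    """Linear scheme: weight 1 at the nearest neighbor, 0 at the farthest."""
    if d_farthest == d_nearest:
        return 1.0
    return (d_farthest - d) / (d_farthest - d_nearest)


def weighted_vote(neighbors, weight):
    """neighbors: list of (distance, class). Returns a normalised class distribution."""
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += weight(d)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Invented neighbor set: one close X, two more distant Y.
neighbors = [(1.0, "X"), (2.0, "Y"), (3.0, "Y")]
equal = weighted_vote(neighbors, lambda d: 1.0)
inverse = weighted_vote(neighbors, inverse_distance_weight)
gauss = weighted_vote(neighbors, exponential_decay_weight)
linear = weighted_vote(neighbors, lambda d: linear_weight(d, 1.0, 3.0))
print(equal, inverse, gauss, linear)
```

In this toy neighbor set, equal voting favors the two distant Y neighbors, while all three distance-weighted schemes favor the single close X.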
24
MBL Acquisition
  • Inflectional process is represented by a set of
    exemplars in memory
  • Exemplars act as models
  • Learning is incremental storage of exemplars
  • Compression and Metrics
  • Exemplar consists of set of (mostly symbolic)
    features

25
MBL Processing
  • New instances of a performance process are solved
    through
  • Memory-lookup
  • Analogical (Similarity-Based) Reasoning
  • Similarity metric
  • Language (faculty) - independent
  • Adaptive (feature and exemplar weighting)

26
The properties of language processing tasks ...
  • Language processing tasks are mappings between
    linguistic representation levels that are
  • context-sensitive (but mostly local!)
  • complex (sub/ir/regularity), pockets of
    exceptions
  • Similar representations at one linguistic level
    correspond to similar representations at the
    other level
  • Several information sources interact in (often)
    unpredictable ways at the same level
  • Data is sparse

27
... fit the bias of MBL
  • Inference is based on Similarity-Based /
    Analogical Reasoning
  • Adaptive data fusion / relevance assignment is
    available through feature weighting
  • It is a non-parametric approach
  • Similarity-based smoothing is implicit
  • Regularities and subregularities / exceptions can
    be modeled uniformly

28
German and Dutch plurals
29
Data Representation
  • Symbolic features
  • segmental information (syllable structure)
  • stress
  • gender
  • German Plural (25,000 from CELEX)
  • Vorlesung (lecture): l e - z U N F en
  • Classes: e, (e)n, s, er, -, U-, Uer, Ue
  • Dutch Plural (62,000 from CELEX)
  • ontruiming (evacuation): 0 - O nt 1 r L - 0 m I N
    en
  • Classes: (e)n, s (-eren, -i, -a, ...)

30
Cognitive Architectures of Inflectional Morphology
Dual Route
  • Dual Route (Pinker, Clahsen, Marcus, ...)
  • Rules for regular cases
  • (over)generalization
  • default behaviour
  • Associative memory for exceptions
  • irregularization / family effects
  • Single Route (RM, MacWhinney, Plunkett, Elman,
    ...)
  • Frequency-based regularity

[Figure: dual-route architecture diagram. Labels: Input Features, Rule, Pattern Associator, Memory, Failure, Suffix-class]

31
German Plural
  • Notoriously complex but routinely acquired (at
    age 5)
  • Evidence for Dual Route ?
  • -s suffix is default/regular (novel words,
    surnames, acronyms, ...)
  • -s suffix is infrequent (least frequent of the
    five most important suffixes)

33
The default status of -s
  • Similar item missing: Fnöhk-s
  • Surname, product name: Mann-s
  • Borrowings: Kiosk-s
  • Acronyms: BMW-s
  • Lexicalized phrases: Vergissmeinnicht-s
  • Onomatopoeia, truncated roots, derived nouns, ...

35
Discussion
  • Three classes of plurals ((-en -)(-e -er))(s)
  • the former 4 suffixes seem regular, can be
    accurately learned using information from
    phonology and gender
  • -s is learned reasonably well but information is
    lacking
  • Hypothesis: more features are needed
    (syntactic, semantic, meta-linguistic, ...) to
    enrich the lexical similarity space
  • No difference in accuracy and speed of learning
    with and without Umlaut
  • Overall generalization accuracy very high: 95%
  • Schema-based learning (Köpcke).

38
Acquisition Data: Summary of previous studies
  • Existing nouns
  • (Park 78; Veit 86; Mills 86; Schaner-Wolles 88;
    Clahsen et al. 93; Sedlak et al. 98)
  • Children mainly overapply -e or -(e)n
  • -s plurals are learned late
  • Novel words
  • (Mugdan 77; MacWhinney 78; Phillips & Bouma 80;
    Schöler & Kany 89)
  • Children inflect novel words with -e or -(e)n
  • More irregular plural forms produced than
    defaults

39
MBL simulation
  • model overapplies mainly -en and -e
  • -s is learned late and imperfectly
  • Mainly but not completely parallel to input
    frequency (more -s overgeneralization than -er
    generalization)

40
Bartke, Marcus, Clahsen (1995)
  • 37 children age 3.6 to 6.6
  • pictures of imaginary things, presented as
    neologisms
  • names or roots
  • rhymes of existing words or not
  • choice -en or -s
  • results
  • children are aware that unusual sounding words
    require the default
  • children are aware that names require the default

41
MBL simulation
  • sort CELEX data according to rhyme
  • compare overgeneralization
  • to -en versus to -s
  • percentage of total number of errors
  • results
  • when new words don't rhyme, more errors are made
  • overgeneralization to -en drops below the level
    of overgeneralization to -s

42
Dutch Plural
  • Suffixes -en and -s are both defaults, and are in
    complementary distribution
  • Selection of -en or -s governed by
  • phonological structure of the base noun (stressed
    vs. unstressed last syllable)
  • morphological structure (suffix of the base noun)
  • loan word status
  • semantic feature person vs. thing
  • both are possible after /ə/
  • (Baayen et al. 2001)

43
Feature Relevance
44
Accuracy on CELEX
  • Methodology
  • Leave-one-out
  • Results
  • MBL: 94.9% accuracy
  • Per class: Prec / Rec / F
  • -(e)n: 95.8 / 97.2 / 96.4
  • -s: 93.8 / 91.4 / 92.6
  • -i: 82.0 / 77.2 / 79.5
  • without stress: 94.9% accuracy
  • last syllable with stress: 92.6% accuracy
  • last syllable without stress: 92.4% accuracy
  • rhyme last syllable: 89.6% accuracy
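
The leave-one-out methodology and the per-class scores can be sketched as follows. This is an illustration under stated assumptions, not the experimental setup of the talk: classification uses a single nearest neighbor with plain overlap distance, F is taken to be the harmonic mean of precision and recall, and the six instances are invented toy data.

```python
def overlap_distance(x, y):
    """Number of feature positions at which two symbolic instances differ."""
    return sum(a != b for a, b in zip(x, y))


def nearest_class(memory, x):
    """Class of the single nearest stored instance (first one wins on ties)."""
    return min(memory, key=lambda item: overlap_distance(x, item[0]))[1]


def leave_one_out(data):
    """Classify each item with all other items as memory; return (gold, predicted) pairs."""
    results = []
    for i, (features, gold) in enumerate(data):
        memory = data[:i] + data[i + 1:]
        results.append((gold, nearest_class(memory, features)))
    return results


def precision_recall_f(results, cls):
    """Per-class precision, recall and their harmonic mean."""
    tp = sum(1 for gold, pred in results if gold == cls and pred == cls)
    predicted = sum(1 for _, pred in results if pred == cls)
    actual = sum(1 for gold, _ in results if gold == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Invented toy lexicon; the last item is an exception among -en neighbors.
data = [
    (("a", "x", "1"), "-en"),
    (("a", "x", "2"), "-en"),
    (("a", "y", "1"), "-en"),
    (("b", "z", "3"), "-s"),
    (("b", "z", "4"), "-s"),
    (("a", "x", "3"), "-s"),
]
results = leave_one_out(data)
accuracy = sum(gold == pred for gold, pred in results) / len(results)
print(accuracy, precision_recall_f(results, "-en"), precision_recall_f(results, "-s"))
```

Leave-one-out suits memory-based learning because there is no training cost: holding out an item only means skipping it during lookup.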

45
Accuracy on pseudo-words
  • Methodology
  • Train: CELEX (all) and CELEX (1000 most frequent
    types)
  • Test: 8 × 10 pseudo-words (Baayen et al., 2001)
  • dreip - workel - bastus - bestroeting - kloertje
  • stape - stree - kadisme
  • Results: accuracy = number of decisions equal to
    subject majority for each item
  • Subjects: 87.5%
  • MBL (all): 83.8%
  • MBL (top 1000): 90.0%

46
muidus, muidi: nn modus, modi
Low-frequency and loan-word nearest neighbours
CELEX bias
47
Conclusions Memory-Based Single Route
  • MBLP picks up the main schemata of Dutch and
    German plural formation and their exceptions
    without recourse to explicit rules or a dual
    route architecture
  • MBLP trained on (part of) CELEX matches subject
    behavior on pseudo words and acquisition data
  • Segmental information suffices to reliably
    predict plural in Dutch and most plurals in
    German, additional information needed for German
    -s
  • Heterogeneity and density in lexical exemplar
    space as source of behavior predictions

48
Overall Conclusions
  • Advantages of symbolic machine learning methods
    over pure statistics
  • As a methodology for inducing interpretable
    linguistic generalizations and categories
  • As a way of introducing an operationalisation of
    analogy-based methods into (psycho)linguistics