Title: Adding typology to lexicostatistics: a combined approach to language classification
1Adding typology to lexicostatistics: a combined
approach to language classification
- ASJP Consortium
- <Dik Bakker et al. mult.>
2Overview
Project ASJP (Started January 2007) (Automated
Similarity Judgment Program)
Language Classification
2
3Overview
Project ASJP (Automated Similarity Judgment Program)
Overall goal: Automatic reconstruction of language relationships
4Overview
Project ASJP (Automated Similarity Judgment Program)
Overall goal: Automatic reconstruction of language relationships
Basis: Distance matrix between individual languages based on lexical elements
5Overview
Project ASJP (Automated Similarity Judgment Program)
Overall goal: Automatic reconstruction of language relationships
Basis: Distance matrix between individual languages
Method: Lexicostatistics, i.e. mass comparison of basic lexical items
6Overview
Project ASJP (Automated Similarity Judgment Program)
As in traditional lexicostatistics, but:
7Overview
Project ASJP (Automated Similarity Judgment Program)
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
8Overview
Project ASJP (Automated Similarity Judgment Program)
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
9Overview
Project ASJP (Automated Similarity Judgment Program)
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
3. extended by all relevant data available
10Caveat
ASJP goal: Reconstruction of relationships between languages
NOT better than experts in classification of areas/groups
11Caveat
ASJP goal: Reconstruction of relationships between languages
NOT better than experts in classification of areas/groups, BUT:
1. Optimize lexicostatistics on basis of expert knowledge on well-explored areas
12Caveat
ASJP goal: Reconstruction of relationships between languages
NOT better than experts in classification of areas/groups, BUT:
1. Optimize lexicostatistics on basis of expert knowledge
2. Provide method and tools to assess and improve classifications for un(der)explored areas
13Overview
Current collaborators: Dik Bakker, David Beck,
Oleg Belyaev, Cecil H. Brown, Pamela Brown,
Matthew Dryer, Dmitry Egorov, Pattie Epps,
Anthony Grant, Eric W. Holman,
Hagen Jung, Johann-Mattis List, Robert Mailhammer,
André Müller, Uri Tadmor, Matthias Urban, Viveka
Velupillai, Søren Wichmann, Kofi Yakpo
15Overview ASJP system
LEX
16Overview ASJP system
LEX
Method
ASJP software
17Overview ASJP system
LEX
ASJP software
distance matrix
18Overview ASJP system
LEX
ASJP software
distance matrix
19Overview ASJP system
LEX
ASJP software
distance matrix
CLASSIF software
20Existing Expert Classifications
LEX
ETHN WALS EXPRT
ASJP software
EVALUATION
distance matrix
CLASSIF software
STAT software
21Existing Expert Classifications
LEX
ETHN WALS EXPRT
Method
ASJP software
CALIBRATION
distance matrix
CLASSIF software
STAT software
22LEX
ETHN WALS EXPRT
GEO GRAPH
ASJP software
distance matrix
MAP software
CLASSIF software
STAT software
23LEX
ETHN WALS EXPRT
GEO GRAPH
HIST FACTS
ASJP software
distance matrix
CLASSIF software
STAT software
MAP software
24LEX
ETHN WALS EXPRT
GEO GRAPH
HIST FACTS
TYPOL DATA
ASJP software
distance matrix
CLASSIF software
STAT software
MAP software
25Today
LEX
ASJP software
distance matrix
CLASSIF software
26Today
LEX
TYPOL DATA
ASJP software
distance matrix
CLASSIF software
27List of basic lexical items
28Lexical items
Word list: Morris Swadesh (1955), 100 basic
meanings
30Lexical items
Swadesh list assumptions:
31Lexical items
Swadesh list:
- Word in most languages
32Lexical items
Swadesh list:
- Word in most languages
- Inherited rather than borrowed
33Lexical items
Swadesh list:
- Word in most languages
- Inherited rather than borrowed
- Relatively stable over time
34Lexical items
Swadesh list:
- Word in most languages
- Inherited rather than borrowed
- Relatively stable over time
- Easily accessible (fieldwork / grammars)
35Lexical items
Languages transcribed to date:
- Over 3500 languages (incl. dialects; around 45% of the languages of the world)
36Languages currently collected
37Lexical items further reduction
Reduction of the full Swadesh list
38Lexical items further reduction
Reduction of the full Swadesh list:
1. Not the complete list, only most stable items
39Lexical items further reduction
- Reduction of the full Swadesh list
- 1. Not the complete list, only most stable items
- 2. Not full IPA representation, but generalized
coding
40Lexical items further reduction
- 1. Not the complete list
- Most stable items: least formal variation in
  well-established genetic groups (Dryer's genera)
41Lexical items further reduction
- 1. Not the complete list
- Most stable items: least formal variation in
  well-established genetic groups (Dryer's genera)
- Nichols (1995): $\dfrac{\text{lg pairs}(\mathrm{word}_k = \mathrm{word}_k)}{\text{all pairs}}$
42Lexical items further reduction
- 1. Not the complete list
- Most stable items: least formal variation in
  well-established genetic groups (Dryer's genera)
- Nichols (1995): $\dfrac{\text{lg pairs}(\mathrm{word}_k = \mathrm{word}_k)}{\text{all pairs}}$
- → What is the optimal number?
43Ethnologue Classification
WALS Classification
[Plot built up over slides 43-49. Axis: Stability. Measures: Goodman-Kruskal, Pearson. Marked value: 40.]
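The plot names Goodman-Kruskal as one of the two measures used to compare ASJP results with the expert classifications. The slides do not say how it was computed. As a minimal sketch, assuming the standard definition of the Goodman-Kruskal gamma over two paired lists (for example, one ASJP distance and one expert-classification distance per language pair), it could look as follows. The function name and the toy numbers are illustrative only.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma for two equally long sequences of paired values.

    Counts concordant and discordant pairs of observations; tied pairs
    are ignored, as in the standard definition.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Toy data (invented): four language pairs, automatic distance vs. expert rank.
automatic = [1, 2, 3, 4]
expert = [1, 3, 2, 4]
print(goodman_kruskal_gamma(automatic, expert))
```

Gamma only uses the ordering of the values, which suits a comparison between continuous distances and the coarse levels of an expert classification.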
50 40 Most Stable
51Lexical items transcription
2. NOT full IPA but ASJPcode: 7 vowels, 34 consonants
All other phonemes mapped to the closest sound (automatic)
52Abaza (Caucasian)
Meaning   IPA
PERSON    ????'?????s
LEAF      b???
SKIN      ??az?
HORN      ?'????a
NOSE      p?n?'a
TOOTH     p??
53Abaza (Caucasian)
Meaning   IPA           ASJPcode
PERSON    ????'?????s   Xw3Cw"yXw3s
LEAF      b???          bxy3
SKIN      ??az?         Cwazy
HORN      ?'????a       Cw"3Xwa
NOSE      p?n?'a        p3nc"a
TOOTH     p??           p3c
54Loss of information?
- Shown for representative groups
- ASJP as good for separating language families as full IPA
55Loss of information?
- Shown for representative groups
- ASJP as good for separating language families as full IPA
- More accurate for precise genetic classification than IPA (under our current method)
56Comparing words and languages
57Comparing words
Most successful measure to date: Levenshtein
Distance
58Comparing words
Levenshtein Distance (LD): Number of
transformations (changes and additions) to get
from the shorter form to the longer form
59Comparing words
Levenshtein Distance (LD): Number of
transformations (changes and additions) to get
from the shorter form to the longer form
A L T
A S J P
60Comparing words
Levenshtein Distance (LD): Number of
transformations (changes and additions) to get
from the shorter form to the longer form
A L T
A S J P
  x x x   LD = 3
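The ALT / ASJP example can be checked with the textbook dynamic-programming form of the Levenshtein distance. This is a minimal sketch, not the ASJP software itself; the function name is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(levenshtein("ALT", "ASJP"))  # 3: L -> S, T -> J, add P
```

Going from the shorter to the longer form only needs changes and additions, as on the slide; the general algorithm also allows deletions, which gives the same count in the other direction.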
61Comparing words
- Levenshtein Distance (LD)
- Number of transformations (changes and additions)
  to get from the shorter form to the longer form
- 1. Normalization
- $\mathrm{LDN} = \mathrm{LD} / L_{\max}$, ranging from 0.0 to 1.0
62Comparing words
- Levenshtein Distance (LD)
- Number of transformations (changes and additions)
  to get from the shorter form to the longer form
- 1. Normalization
- $\mathrm{LDN} = \mathrm{LD} / L_{\max}$, ranging from 0.0 to 1.0
- 2. Eliminate background noise
- $\mathrm{LDND} = \mathrm{LDN} / \mathrm{LDN}_{\text{different pairs}}$
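The two normalization steps can be sketched for a pair of word lists. The sketch assumes that LDN is averaged over words with the same meaning, and that "different pairs" means all pairs of words with different meanings in the two lists; the slides do not spell this out. The function names and the toy word lists are invented for illustration.

```python
from itertools import product


def levenshtein(a, b):
    """Textbook Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ldn(a, b):
    """LD divided by the length of the longer word: 0.0 (identical) to 1.0."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a, list_b):
    """list_a, list_b: dicts mapping a meaning to one transcribed word."""
    shared = sorted(set(list_a) & set(list_b))
    same = [ldn(list_a[m], list_b[m]) for m in shared]
    # Background noise: similarity expected between words that are not
    # translations of each other (assumed reading of "different pairs").
    diff = [ldn(list_a[m], list_b[n])
            for m, n in product(shared, repeat=2) if m != n]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


# Invented toy lists, not ASJP data.
lang_a = {"I": "ik", "you": "du", "water": "vasa"}
lang_b = {"I": "ix", "you": "tu", "water": "vatn"}
print(ldn("ALT", "ASJP"))  # 0.75
print(ldnd(lang_a, lang_b))
```

Dividing by the different-meaning average corrects for chance similarity, for example between languages with small, overlapping sound inventories. The result can exceed 1.0 when translations are less similar than unrelated words.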
63Classifying languages
65Swadesh (3500)
ASJP
66Swadesh (3500)
ASJP
distance matrix
70 3500 languages: 240,000,000 comparisons
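The figure on this slide is not derived in the talk. One reading that reproduces its order of magnitude, assuming every language is compared with every other language on each of the 40 most stable items (same-meaning comparisons only), is sketched below. The factor of 40 is an assumption.

```python
languages = 3500
items = 40  # assumed: the reduced list of the 40 most stable items

# Number of unordered language pairs: n * (n - 1) / 2.
language_pairs = languages * (languages - 1) // 2
word_comparisons = language_pairs * items

print(language_pairs)    # 6123250
print(word_comparisons)  # 244930000
```

Under this reading the distance matrix has about 6.1 million cells, and the different-meaning comparisons needed for LDND would come on top of the total.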
71Processing problems
72Solution: parallel processing
73Swadesh (3500)
ASJP
distance matrix
MEGA4
http://www.megasoftware.net/ DNA patterns
74Swadesh (3500)
ASJP
distance matrix
Neighbour Joining
MEGA4
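The trees in the talk are produced by MEGA4, whose implementation is not shown. For orientation, a minimal sketch of the textbook Neighbour Joining procedure on a small distance matrix is given here. It returns only the grouping as nested tuples and omits branch lengths; the function name, the input format and the toy distances are assumptions for illustration.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """names: list of distinct labels.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the tree topology as nested tuples (no branch lengths)."""
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other current nodes.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (i, j)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (d[frozenset((i, k))]
                                                + d[frozenset((j, k))]
                                                - d[frozenset((i, j))])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)


# Invented toy distances between four languages A-D (not ASJP data).
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("C", "D"): 4,
    ("A", "C"): 8, ("A", "D"): 9, ("B", "C"): 9, ("B", "D"): 10}.items()}
tree = neighbour_joining(["A", "B", "C", "D"], toy)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

In the ASJP setting the input would be the LDND distance matrix. The resulting tree is unrooted, so the outermost split in the nested tuples carries no special meaning.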
75 SEE COMPLETE TREE-OF-THE-MONTH ON
email.eva.mpg.de/wichmann/ASJPHomePage
76Mayan (34 / 69 Ethn)
LDND Mega4
77Mayan (34 / 69) <all only>
LDND Mega4
78Mayan (34 / 69)
cholan
LDND Mega4
79Mayan (34 / 69)
tzeltalan
cholan
cholan
LDND Mega4
80Mayan (34 / 69)
tzeltalan
cholan
yucatecan
LDND Mega4
81Mayan (34 / 69)
Ethnologue/experts
yucatecan
tzeltalan
cholan
LDND Mega4
82ASJP and genetic classification
- Method works at a global level
83ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
84ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
- Refinement necessary at intermediate level
85Adding typological data
86Trying to improve the fit
Enrich lexical with typological data
Haspelmath, M., M. Dryer, D. Gil & B. Comrie
(eds) (2005). The World Atlas Of Language
Structures. Oxford: Oxford University Press
WALS Online
http://wals.info/
87Lexical plus typological data
Swadesh (3500)
WALS (2580)
<140 FEATURES>
ASJP
distance matrix
TREE SFTW
88SWALSH
ASJP
distance matrix
TREE SFTW
89Improving the fit
- Enrich lexical with typological data
- NOT 1:1 with ASJP languages
90SWALSH (1250)
ASJP
distance matrix
TREE SFTW
91Improving the fit
- Enrich lexical with typological data
- NOT 1:1 with ASJP languages
- WALS matrix very UNevenly filled (16%)
- cf. Cysouw (2008) STUF 61.3
92Improving the fit
- Enrich lexical with typological data
- NOT 1:1 with ASJP languages
- WALS features very unevenly filled
- → Determine most stable features
93Feature Stability
Nichols (1995) metric for $S(\mathrm{Ftr}_k)$ in $G_x$:
$S(\mathrm{Ftr}_k) = \dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$
94Feature Stability
ASJP metric for stability of $\mathrm{Ftr}_k$. For $G_x$:
$\dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$
95Feature Stability
ASJP metric for stability of $\mathrm{Ftr}_k$. For $G_x$:
$\dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$
all pairs: size differences between G
96Feature Stability
ASJP metric for stability of $\mathrm{Ftr}_k$:
$SF_k = \dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$ (over all pairs)
97Feature Stability
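ASJP metric for stability of $\mathrm{Ftr}_k$:
$SF_k = \dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}} - U$ (within G)
$U = \dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$ (over all language pairs)
Background noise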
98Feature Stability
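ASJP metric for stability of $\mathrm{Ftr}_k$:
$SF_k = \dfrac{\dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}} - U}{1 - U}$
$U = \dfrac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$ (over all language pairs)
Normalization: $SF_k$ comparable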
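The final form of the stability metric can be sketched in a few lines. The sketch assumes that the first ratio is the share of matching values among language pairs within the same genus, pooled over all genera, and that U is the same share over all language pairs, following the wording of the slides. How genera of different sizes are weighted is not shown and is left out here. Function names and toy data are invented.

```python
from itertools import combinations


def match_rate(value_pairs):
    """Share of pairs whose two feature values are identical."""
    value_pairs = list(value_pairs)
    return sum(a == b for a, b in value_pairs) / len(value_pairs)


def feature_stability(values, genus):
    """values: language -> value of one feature.
    genus: language -> genus label.
    Returns (R - U) / (1 - U); undefined if all languages share one value."""
    pairs = list(combinations(sorted(values), 2))
    within = [(values[a], values[b]) for a, b in pairs if genus[a] == genus[b]]
    background = [(values[a], values[b]) for a, b in pairs]
    r = match_rate(within)      # agreement inside genera
    u = match_rate(background)  # background noise
    return (r - u) / (1 - u)


# Invented toy data: two genera with two languages each.
genus = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
stable = {"a1": "x", "a2": "x", "b1": "y", "b2": "y"}
unstable = {"a1": "x", "a2": "y", "b1": "x", "b2": "y"}
print(feature_stability(stable, genus))    # close to 1.0
print(feature_stability(unstable, genus))  # close to -0.5
```

Subtracting U and dividing by 1 - U puts features with different numbers of values on one scale: 1.0 means perfect agreement within genera, 0.0 means no more agreement than the background, and negative values mean less.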
99Most stable WALS features
31.  Sex-based and Non-sex-based Gender Systems    0.81
118. Predicative Adjectives                        0.74
30.  Number of Genders                             0.73
119. Nominal and Locational Predication            0.71
29.  Syncretism in Verbal Person/Number Marking    0.71
100Most unstable WALS features
128. Utterance Complement Clauses                             0.07
115. Negative Indefinite Pronouns/Predicate Negation          0.07
59.  Possessive Classification                                0.01
135. Red and Yellow                                          -0.07
58.  Obligatory Possessive Inflection                        -0.25
101Correlation with Ethnologue
[Plot built up over slides 101-111. Series: Min ftrs 100, 80, 60, 40, 20. Axis: Stability. Marked values: 20, 40, 60, 85.]
112WALS
113Swadesh40
WALS
114Improving the fit
Typological variables do not perform better
than lexical ones to establish genetic
relationships
WALS!
115Improving the fit
Typological variables do not perform better
than lexical ones to establish genetic
relationships
What about a combination?
116Ftrs   Lgs
100    79
80     109
60     139
40     218
20     341
[Plot built up over slides 116-121. Axis from "Only Sw40" to "Only WALS". Marked ratios: 85:15, 70:30, 50:50 (value 0.91), 35:65.]
122Improving the fit
Typological variables do not perform better
than lexical ones to establish genetic
relationships
A combined, balanced approach is superior, but
123Improving the fit
Typological variables do not perform better
than lexical ones to establish genetic
relationships
A combined, balanced approach is superior, but
at a much higher cost per language than just
lexicostatistics: 84% of WALS to be filled in
124Improving the fit
Typological variables do not perform better
than lexical ones to establish genetic
relationships
A combined, balanced approach is superior, but
at a much higher cost per language
Continue extension/optimization of lexical method
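The talk does not say how the lexical and typological distances were combined. As a rough sketch, assuming a simple weighted sum of two distances on comparable scales, and a typological distance defined as the share of differing values among the features attested for both languages, the idea could be expressed like this. All function names, the `min_features` parameter and the feature values are illustrative.

```python
def typological_distance(feats_a, feats_b, min_features=1):
    """feats_a, feats_b: dicts mapping a feature id to its value.
    Returns the share of differing values among shared features, or None
    when fewer than min_features are attested for both languages."""
    shared = [f for f in feats_a if f in feats_b]
    if len(shared) < min_features:
        return None
    return sum(feats_a[f] != feats_b[f] for f in shared) / len(shared)


def combined_distance(lexical, typological, lexical_weight):
    """Weighted sum; lexical_weight=0.5 gives both sources equal weight."""
    if not 0.0 <= lexical_weight <= 1.0:
        raise ValueError("lexical_weight must lie between 0 and 1")
    return lexical_weight * lexical + (1.0 - lexical_weight) * typological


# Invented feature values for two languages (feature ids as strings).
lang_a = {"30": "two", "31": "sex-based"}
lang_b = {"30": "two", "31": "none", "58": "absent"}
typ = typological_distance(lang_a, lang_b)  # 0.5: one of two shared features differs
print(combined_distance(0.4, typ, lexical_weight=0.5))
```

Requiring a minimum number of shared features matters because the typological matrix is sparsely filled: a distance based on one or two features says little, but each raise of the threshold leaves fewer languages to compare.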
125Publications 2008 - 2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai (2008). Automated Classification of the World's languages: a description of the method and preliminary results. Sprachtypologie und Universalienforschung 61, 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Advances in automated language classification. In A. Arppe, K. Sinnemäki and U. Nikanne (eds), Quantitative Investigations in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Explorations in automated language classification. Folia Linguistica 42-2, 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant & E. W. Holman (2009). Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13, 167-179.
126?
127ASJP
Overall goal:
- Method and tools for reconstruction of language relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings