Adding typology to lexicostatistics: a combined approach to language classification

Title: Adding typology to lexicostatistics: a combined approach to language classification


1
Adding typology to lexicostatistics: a combined approach to language classification
  • ASJP Consortium
  • <Dik Bakker et al. mult.>

5
Overview
Project ASJP (Automated Similarity Judgment Program), started January 2007
Overall goal: automatic reconstruction of language relationships
Basis: distance matrix between individual languages, based on lexical elements
Method: lexicostatistics, i.e. mass comparison of basic lexical items
Language Classification
9
Overview
Project ASJP (Automated Similarity Judgment Program)
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
3. extended by all relevant data available
12
Caveat
ASJP goal: reconstruction of relationships between languages
NOT: better than experts in classification of areas/groups
BUT:
1. Optimize lexicostatistics on the basis of expert knowledge of well-explored areas
2. Provide method and tools to assess and improve classifications for un(der)explored areas
14
Overview
Current collaborators: Dik Bakker, David Beck, Oleg Belyaev, Cecil H. Brown, Pamela Brown, Matthew Dryer, Dmitry Egorov, Pattie Epps, Anthony Grant, Eric W. Holman, Hagen Jung, Johann-Mattis List, Robert Mailhammer, André Müller, Uri Tadmor, Matthias Urban, Viveka Velupillai, Søren Wichmann, Kofi Yakpo
24
Overview ASJP system
[Diagram, built up over slides 15-24. Data sources: LEX; existing expert classifications (ETHN, WALS, EXPRT); GEO GRAPH; HIST FACTS; TYPOL DATA. Components: ASJP software, distance matrix, CLASSIF software, STAT software, MAP software. Labels: Method, EVALUATION, CALIBRATION]
26
Today
[Diagram: LEX and TYPOL DATA; ASJP software; distance matrix; CLASSIF software]
27
List of basic lexical items
28
Lexical items
Word list: Morris Swadesh (1955), 100 basic meanings

34
Lexical items
Swadesh list assumptions:
- Word in most languages
- Inherited rather than borrowed
- Relatively stable over time
- Easily accessible (fieldwork / grammars)
35
Lexical items
Languages transcribed to date:
- Over 3500 languages (incl. dialects), around 45% of the languages of the world
36
Languages currently collected
39
Lexical items: further reduction
  • Reduction of the full Swadesh list
  • 1. Not the complete list, only most stable items
  • 2. Not full IPA representation, but generalized coding
42
Lexical items: further reduction
  • 1. Not the complete list
  • Most stable items: least formal variation in well-established genetic groups (Dryer's genera)
  • Nichols (1995): $\frac{\text{lg pairs}\,(word_k = word_k)}{\text{all pairs}}$
  • What is the optimal number?
49
[Plot, built up over slides 43-49. Curves: Ethnologue Classification and WALS Classification; measures: Goodman-Kruskal and Pearson; axis label: Stability; marked value: 40]
50
40 Most Stable
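The plot labels name Goodman-Kruskal and Pearson as the measures of agreement with the expert classifications. As a rough illustration only, the sketch below computes the Goodman-Kruskal gamma between two lists of scores for the same language pairs. The function name, the toy numbers and the coding of the classification as 1 (same genus), 2 (same family), 3 (unrelated) are assumptions made for this example; the slides do not say how the measure was applied.

```python
from itertools import combinations


def goodman_kruskal_gamma(xs, ys):
    """Gamma = (C - D) / (C + D), where C and D count concordant and
    discordant pairs of observations; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    # Raises ZeroDivisionError if every pair is tied.
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical data: one entry per language pair.
lexical_distance = [0.2, 0.5, 0.9, 0.95]
taxonomic_distance = [1, 2, 3, 3]  # assumed coding, see lead-in
print(goodman_kruskal_gamma(lexical_distance, taxonomic_distance))  # 1.0
```

Gamma only looks at rank order, which suits a comparison between continuous distances and a coarse taxonomic scale.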
51
Lexical items: transcription
2. NOT full IPA but ASJPcode: 7 vowels, 34 consonants
All other phonemes mapped to the closest sound (automatic)
53
Abaza (Caucasian)
Meaning   IPA           ASJPcode
PERSON    ????'?????s   Xw3Cw"yXw3s
LEAF      b???          bxy3
SKIN      ??az?         Cwazy
HORN      ?'????a       Cw"3Xwa
NOSE      p?n?'a        p3nc"a
TOOTH     p??           p3c
55
Loss of information?
  • Shown for representative groups:
  • ASJP as good for separating language families as full IPA
  • More accurate for precise genetic classification than IPA (under our current method)
56
Comparing words and languages
57
Comparing words
Most successful measure to date: Levenshtein Distance
60
Comparing words
Levenshtein Distance (LD): number of transformations (changes and additions) to get from the shorter form to the longer form
Example:
A L T
A S J P
  x x x    LD = 3
62
Comparing words
  • Levenshtein Distance (LD)
  • Number of transformations (changes and additions) to get from the shorter form to the longer form
  • 1. Normalization
  • $\mathrm{LDN} = \mathrm{LD} / L_{\max}$, with $0.0 \le \mathrm{LDN} \le 1.0$
  • 2. Eliminate background noise
  • $\mathrm{LDND} = \mathrm{LDN} / \mathrm{LDN}_{\text{different pairs}}$
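The three measures on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the ASJP software. The function names and the toy word lists are invented for the example, and `ldnd` assumes that LDN is averaged over same-meaning word pairs and over different-meaning word pairs before dividing.

```python
from itertools import product
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """LD divided by the length of the longer word, so 0.0 <= LDN <= 1.0."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(words_x: dict, words_y: dict) -> float:
    """Mean LDN of same-meaning pairs divided by mean LDN of different-meaning pairs.

    Needs at least two meanings attested in both languages.
    """
    shared = sorted(set(words_x) & set(words_y))
    same = mean(ldn(words_x[m], words_y[m]) for m in shared)
    different = mean(ldn(words_x[m], words_y[n])
                     for m, n in product(shared, repeat=2) if m != n)
    return same / different


print(levenshtein("ALT", "ASJP"))  # 3, the example from the slide
print(ldn("ALT", "ASJP"))          # 0.75

# Hypothetical two-item word lists, for illustration only.
lang_x = {"one": "ab", "two": "cd"}
lang_y = {"one": "ab", "two": "ce"}
print(ldnd(lang_x, lang_y))        # 0.25
```

Dividing by the different-meaning average corrects for chance resemblance, for instance between languages with small or similar sound inventories.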
63
Classifying languages
66
[Diagram: Swadesh (3500); ASJP; distance matrix]
70
3500 languages: 240,000,000 comparisons
71
Processing problems
72
Solution: parallel processing
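The order of magnitude on slide 70 can be checked with a short calculation, assuming (this is not stated on the slide) that every pair of languages is compared once on each item of a 40-item list. The second function is a hypothetical helper showing one simple way to cut the pair list into independent chunks that could be handed to parallel workers.

```python
from itertools import combinations, islice
from math import comb


def comparison_count(n_languages: int, n_items: int) -> int:
    """Word comparisons when every language pair is compared on every list item."""
    return comb(n_languages, 2) * n_items


def pair_chunks(languages, chunk_size):
    """Yield successive chunks of language pairs; each chunk is independent work."""
    pairs = combinations(languages, 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


print(comb(3500, 2))               # 6123250 language pairs
print(comparison_count(3500, 40))  # 244930000, roughly 240 million

# Toy example: 4 languages give 6 pairs, split into chunks of at most 4.
print([len(chunk) for chunk in pair_chunks("ABCD", 4)])  # [4, 2]
```

Because each cell of the distance matrix depends only on its own two word lists, the chunks need no communication with each other, which is what makes the problem easy to parallelize.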
74
[Diagram: Swadesh (3500); ASJP; distance matrix; Neighbour Joining; MEGA4 (http://www.megasoftware.net/, DNA patterns)]
75
SEE COMPLETE TREE-OF-THE-MONTH ON
email.eva.mpg.de/wichmann/ASJPHomePage
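The slides name Neighbour Joining, as implemented in MEGA4, as the step that turns the distance matrix into a tree. The sketch below is a bare-bones textbook version of the algorithm, not MEGA4's implementation. It returns only the branching order as nested tuples, without branch lengths. The function name, the input format and the four-taxon toy matrix are assumptions made for the example.

```python
def neighbour_joining(names, dist):
    """Return the tree topology as nested tuples.

    names: list of labels; dist: dict mapping frozenset({a, b}) to a distance.
    """
    d = dict(dist)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other current nodes.
        totals = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
                  for a in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - totals[p[0]] - totals[p[1]],
        )
        new = (a, b)
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Hypothetical distances: A and B are close, C and D are close.
toy = {
    frozenset("AB"): 2, frozenset("CD"): 2,
    frozenset("AC"): 3, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 3,
}
print(neighbour_joining(["A", "B", "C", "D"], toy))  # (('A', 'B'), ('C', 'D'))
```

Each round recomputes all pair scores, which is fine for a toy example; for thousands of languages a dedicated package is the practical choice.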
81
Mayan (34 / 69 Ethn)
[Tree, built up over slides 76-81: LDND, Mega4; annotation: <all & only>; highlighted groups: Cholan, Tzeltalan, Yucatecan; comparison with Ethnologue/experts]
84
ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
- Refinement necessary at intermediate level
85
Adding typological data
86
Trying to improve the fit
Enrich lexical with typological data
Haspelmath, M., M. Dryer, D. Gil & B. Comrie (eds) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press.
WALS Online: http://wals.info/

87
Lexical plus typological data
[Diagram: Swadesh (3500) and WALS (2580), <140 FEATURES>; ASJP; distance matrix; TREE SFTW]
90
[Diagram: SWALSH (1250); ASJP; distance matrix; TREE SFTW]
92
Improving the fit
  • Enrich lexical with typological data
  • NOT 1:1 with ASJP languages
  • WALS matrix very unevenly filled (16%), cf. Cysouw (2008), STUF 61.3
  • Determine most stable features
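The unevenly filled WALS matrix is the practical obstacle named on this slide. The sketch below illustrates the two bookkeeping steps it implies: measuring how much of a language-by-feature matrix is filled, and keeping only languages with at least a minimum number of attested features. The data layout (a dict of rows with `None` for gaps), the language names and the values are all hypothetical.

```python
def fill_rate(matrix):
    """Share of (language, feature) cells that hold a value; None marks a gap."""
    cells = [value for row in matrix.values() for value in row]
    return sum(value is not None for value in cells) / len(cells)


def languages_with_min_features(matrix, minimum):
    """Languages whose row has at least `minimum` attested feature values."""
    return sorted(lang for lang, row in matrix.items()
                  if sum(value is not None for value in row) >= minimum)


# Hypothetical matrix: 3 languages, 4 features.
toy_matrix = {
    "lang_a": [1, 2, None, 3],
    "lang_b": [None, None, None, 2],
    "lang_c": [1, None, None, None],
}
print(fill_rate(toy_matrix))                       # 5 of 12 cells are filled
print(languages_with_min_features(toy_matrix, 2))  # ['lang_a']
```

Raising the minimum gives better documented languages but fewer of them, which is the trade-off behind the feature thresholds used later in the deck.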


93
Feature Stability
Nichols (1995) metric for $S(Ftr_k)$ in $G_x$: $\frac{\text{pairs}\,(val_k = val_k)}{\text{all pairs}}$
95
Feature Stability
ASJP metric for stability of $Ftr_k$. For $G_x$: $\frac{\text{pairs}\,(val_k = val_k)}{\text{all pairs}}$
Size differences between G
98
Feature Stability
ASJP metric for stability of $Ftr_k$:
$SF_k = \dfrac{\frac{\text{pairs}\,(val_k = val_k)}{\text{all pairs}} - U}{1 - U}$, with $U = \frac{\text{pairs}\,(val_k = val_k)}{\text{all pairs}}$
U: background noise
Normalization by $(1 - U)$: $SF_k$ comparable
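A small sketch of the normalized stability metric may help. It assumes that the first proportion is taken over language pairs within the same group (written R in the code) and that the background noise U is the same proportion over pairs from different groups. The slides do not spell out which pairs enter U, so that reading, the function names and the toy data are assumptions.

```python
from itertools import combinations


def share_agreeing(pairs):
    """Proportion of value pairs in which both members are equal."""
    pairs = list(pairs)
    return sum(a == b for a, b in pairs) / len(pairs)


def feature_stability(values_by_group):
    """(R - U) / (1 - U) for one feature.

    values_by_group: {group name: [feature value of each language in the group]}
    """
    related = [pair
               for values in values_by_group.values()
               for pair in combinations(values, 2)]
    unrelated = [(a, b)
                 for g1, g2 in combinations(values_by_group, 2)
                 for a in values_by_group[g1]
                 for b in values_by_group[g2]]
    r = share_agreeing(related)
    u = share_agreeing(unrelated)
    return (r - u) / (1 - u)


# Hypothetical feature values for two groups of three languages each.
print(feature_stability({"G1": [1, 1, 1], "G2": [2, 2, 1]}))  # close to 0.5

# Agreement inside groups below the background level gives a negative score.
print(feature_stability({"G1": [1, 2], "G2": [1, 2]}))  # -1.0
```

Pooling all within-group pairs before dividing is one way to keep a single large group from being averaged on equal terms with a tiny one.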
99
Most stable WALS features
31. Sex-based and Non-sex-based Gender Systems: 0.81
118. Predicative Adjectives: 0.74
30. Number of Genders: 0.73
119. Nominal and Locational Predication: 0.71
29. Syncretism in Verbal Person/Number Marking: 0.71
100
Most unstable WALS features
128. Utterance Complement Clauses: 0.07
115. Negative Indefinite Pronouns and Predicate Negation: 0.07
59. Possessive Classification: 0.01
135. Red and Yellow: -0.07
58. Obligatory Possessive Inflection: -0.25
113
Correlation with Ethnologue
[Plots, built up over slides 101-113. Curves for Min ftrs: 100, 80, 60, 40, 20; axis label: Stability; marked values: 20, 40, 60, 85; final slides compare WALS with Swadesh40]
115
Improving the fit
Typological variables do not perform better than lexical ones to establish genetic relationships
WALS!
What about a combination?
121
Ftrs  Lgs
100   79
80    109
60    139
40    218
20    341
[Plots, built up over slides 116-121. Series: Only Sw40, Only WALS, and mixes labelled 85:15, 70:30, 50:50 and 35:65; the value 0.91 is marked on the 50:50 slide]
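The slides do not show how the lexical and typological measures were mixed, so the following is only one plausible reading: a typological distance defined as the share of differing values among the features attested in both languages, and a weighted sum of the two distances. Both definitions, the function names and the toy values are assumptions made for this illustration, including the premise that the two distances are on comparable scales.

```python
def typological_distance(features_x, features_y):
    """Share of differing values among features attested (not None) in both languages."""
    shared = [(a, b) for a, b in zip(features_x, features_y)
              if a is not None and b is not None]
    if not shared:
        raise ValueError("no feature attested in both languages")
    return sum(a != b for a, b in shared) / len(shared)


def combined_distance(lexical, typological, lexical_weight):
    """Weighted sum; lexical_weight = 0.5 weights both sources equally."""
    return lexical_weight * lexical + (1 - lexical_weight) * typological


# Hypothetical feature vectors with gaps, and a hypothetical lexical distance.
typ = typological_distance([1, 2, None, 4], [1, 3, 5, None])
print(typ)  # 0.5: one of the two shared features differs
print(combined_distance(lexical=0.8, typological=typ, lexical_weight=0.5))
```

Sweeping `lexical_weight` from 1.0 down to 0.0 would move from a lexical-only to a typological-only distance, with the mixed weightings in between.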
124
Improving the fit
Typological variables do not perform better than lexical ones to establish genetic relationships
A combined, balanced approach is superior, but at a much higher cost per language than just lexicostatistics: 84% of WALS still to be filled in
Continue extension/optimization of lexical method
125
Publications 2008 - 2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai (2008). Automated Classification of the World's languages: a description of the method and preliminary results. Sprachtypologie und Universalienforschung 61, 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Advances in automated language classification. In A. Arppe, K. Sinnemäki and U. Nikanne (eds), Quantitative Investigations in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Explorations in automated language classification. Folia Linguistica 42-2, 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant & E. W. Holman (2009). Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13, 167-179.
126
?
127
ASJP
Overall goal:
- Method & tools for reconstruction of language relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings