Title: New approaches to language and prehistory from typology, genetics, and quantitative linguistics
1. New approaches to language and prehistory from typology, genetics, and quantitative linguistics
Søren Wichmann, MPI-EVA & Leiden University
2. Lecture III: The utility of phylogenetic algorithms and software
3. Software used
- PHYLIP (http://evolution.genetics.washington.edu/phylip.html)
- SplitsTree (www.splitstree.org)
- PAUP (www.sinauer.com)
- MrBayes (www.mrbayes.net)
- TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
8. Selecting or weighting features
- Assumption 1: The feature value which is most favored in a given genus is the one that should be reconstructed for the proto-language of the genus.
- Assumption 2: The better represented the most favored feature value in a given genus is, the more stable that feature may be assumed to be.
- Strategy: Study the distribution of values of a given feature for each genus and then calculate an average of how well represented the best represented value is throughout all genera in the WALS sample.
- Problem: How are we to compare the stability of features when three variables are involved: the number of occurrences of the best represented feature value, the number of possible feature values, and the number of languages for which the feature is attested in the WALS sample?
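
The strategy on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `genus_share` and `feature_stability` and the toy data are hypothetical (they are not from the lecture or from WALS), and the sketch deliberately ignores the problem of differing numbers of values and languages raised in the last bullet.

```python
from collections import Counter

def genus_share(values):
    """Share of the best represented feature value within one genus."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

def feature_stability(values_by_genus):
    """Average, over all genera with data, of the best represented value's share."""
    shares = [genus_share(vals) for vals in values_by_genus.values() if vals]
    return sum(shares) / len(shares)

# Hypothetical toy data (NOT real WALS values): one feature, three genera.
toy = {
    "GenusA": ["a", "a", "a", "b"],   # best value covers 3 of 4 languages
    "GenusB": ["b", "b"],             # best value covers 2 of 2 languages
    "GenusC": ["a", "b", "c", "c"],   # best value covers 2 of 4 languages
}

print(feature_stability(toy))
```

A raw average like this treats a share of 2 out of 2 in a small genus the same way as it would 20 out of 20 in a large one, which is exactly the comparability problem the slide goes on to raise.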
9. Selecting or weighting features (cont.)
- Exemplification of problem:
- How do we compare the stability of the two following features in Germanic given the variables indicated?
11. Solving such problems by hand: a simple example
- k (number of possible values) = 2 (a and b)
- n (number of languages) = 4
- r = number of times that the best represented feature value occurs

Distributional possibilities:

distribution  r
aaaa          4
bbbb          4
aaab          3
aaba          3
abaa          3
baaa          3
abbb          3
babb          3
bbab          3
bbba          3
aabb          2
abab          2
abba          2
baab          2
baba          2
bbaa          2

Probabilities:

r  probability
4  2/16
3  8/16
2  6/16
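
The hand count above can be checked by brute force. The sketch below enumerates all k^n equally likely assignments of values to languages and tallies r for each; the function name `count_by_r` is a hypothetical helper, and the assumption that every value is equally probable in every language is the one implicit in the 1/16 weighting of the slide.

```python
from collections import Counter
from itertools import product

def count_by_r(k, n):
    """Tally how many of the k**n value assignments have each r.

    r is the number of occurrences of the best represented value.
    """
    values = "abcdefghijklmnopqrstuvwxyz"[:k]
    tally = Counter()
    for combo in product(values, repeat=n):
        r = max(Counter(combo).values())
        tally[r] += 1
    return dict(tally)

k, n = 2, 4
tally = count_by_r(k, n)
for r in sorted(tally, reverse=True):
    print(f"r = {r}: {tally[r]}/{k ** n}")
```

Enumeration grows as k^n, so it is only practical for small examples such as this one.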
12. A formula for calculating the probability or p-value for any set of (n, k, r)
- k = number of possible values
- n = number of languages
- r = number of times that the best represented feature value occurs
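
The formula itself is not preserved in this transcript. As a stand-in, the following sketch computes the same kind of quantity exactly, without enumerating all k^n cases. It is not the lecture's formula: it uses a standard counting argument (the coefficient of x^n in (sum over j up to m of x^j/j!)^k, multiplied by n!, counts the assignments in which no value occurs more than m times), it assumes all values are equally likely, and the function names are hypothetical. Whether "p-value" on the slide means the exact probability of r or the tail probability of r or more is also not recoverable, so both are given.

```python
from fractions import Fraction
from math import factorial

def count_max_at_most(n, k, m):
    """Number of length-n assignments over k values where no value occurs more than m times."""
    base = [Fraction(1, factorial(j)) for j in range(min(m, n) + 1)]
    poly = [Fraction(1)]
    for _ in range(k):  # multiply k copies of the truncated exponential series
        new = [Fraction(0)] * min(n + 1, len(poly) + len(base) - 1)
        for i, a in enumerate(poly):
            for j, b in enumerate(base):
                if i + j <= n:
                    new[i + j] += a * b
        poly = new
    coeff = poly[n] if n < len(poly) else Fraction(0)
    return int(coeff * factorial(n))

def prob_exact(n, k, r):
    """P(best represented value occurs exactly r times)."""
    return Fraction(count_max_at_most(n, k, r) - count_max_at_most(n, k, r - 1), k ** n)

def p_value(n, k, r):
    """P(best represented value occurs r times or more): an assumed reading of 'p-value'."""
    return 1 - Fraction(count_max_at_most(n, k, r - 1), k ** n)

# The n = 4, k = 2 example: fractions print in lowest terms (e.g. 8/16 as 1/2).
for r in (4, 3, 2):
    print(r, prob_exact(4, 2, r), p_value(4, 2, r))
```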
20. A sample of 5 pairs of related languages for testing different methods (selection criterion: best documented genealogical pairs in WALS dataset)

Athapaskan    Slave               Navajo
Chibchan      Ika                 Rama
Uto-Aztecan   Yaqui               Comanche
Oto-Manguean  Chalcatongo Mixtec  Lealao Chinantec
Carib         Hixkaryana          Macushi
21. Neighbour-Joining using the 17 highest-ranking features
22. Neighbour-Joining using the 17 lowest-ranking features
23. A Neighbour Net representation (using SplitsTree)
24. Maximum-Parsimony analysis using PAUP
25. Bayesian analysis using MrBayes
26. Tree generated by (presumed) knowledge of ancestral states, using 23 informative Native American founder features
28. Effects of using p-values for weighting
Method: Branch and Bound Bootstrap Search (Bootstrap 50% majority-rule consensus tree)

No Matrices, Equal Weighting

/----------------------------------------------- Slave(1)
|
|----------------------------------------------- Navajo(2)
|
|                      /------------------------ Yaqui(3)
|----------91----------+
|                      \------------------------ Comanche(4)
|
|----------------------------------------------- NezPerce(5)
|
|----------------------------------------------- HanisCoos(6)
|
|                      /------------------------ ChalMixtec(7)
|          /----85-----+
|          |           \------------------------ LealChinantec(8)
\----96----+
           |           /------------------------ Hixkaryana(9)
           \----54-----+
                       \------------------------ Macushi(10)

No Matrices, PValue

/----------------------------------------------- Slave(1)
|
|----------------------------------------------- Navajo(2)
|
|                      /------------------------ Yaqui(3)
|----------87----------+
|                      \------------------------ Comanche(4)
|
|----------------------------------------------- NezPerce(5)
|
|----------------------------------------------- HanisCoos(6)
|
|                      /------------------------ ChalMixtec(7)
|          /----90-----+
|          |           \------------------------ LealChinantec(8)
\----94----+
           |           /------------------------ Hixkaryana(9)
           \----60-----+
                       \------------------------ Macushi(10)
30. An additional method for enhancing phylogenetic signals: step matrices
Step matrices specify how many steps a language has to pass through to get from one feature value to another. The number of steps feeds into the calculation of the most parsimonious tree.
A simple example: THE VELAR NASAL (WALS feature no. 9)
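
To make concrete how a step matrix "feeds into" a parsimony score, here is a minimal sketch of Sankoff's algorithm, the standard dynamic-programming method for counting the minimum number of steps on a fixed tree under an arbitrary step matrix. It is an illustration only, not the procedure of PAUP or of the lecture: the four-language tree, the state coding 0/1/2 and both matrices are hypothetical, and nothing here asserts that the velar nasal feature was actually treated as ordered.

```python
from math import inf

def parsimony_steps(tree, leaf_states, step_matrix):
    """Minimum total steps on a fixed rooted tree (Sankoff's algorithm).

    tree: nested tuples whose leaves are language names (strings).
    leaf_states: dict mapping language name -> state index.
    step_matrix[s][t]: cost of changing from state s to state t.
    """
    k = len(step_matrix)

    def costs(node):
        if isinstance(node, str):  # leaf: only the observed state is allowed
            return [0 if s == leaf_states[node] else inf for s in range(k)]
        child_costs = [costs(child) for child in node]
        # For each possible state s at this node, add the cheapest way
        # to reach each child subtree from s.
        return [
            sum(min(step_matrix[s][t] + cc[t] for t in range(k)) for cc in child_costs)
            for s in range(k)
        ]

    return min(costs(tree))

# Hypothetical data: four languages, one three-valued feature.
tree = (("L1", "L2"), ("L3", "L4"))
states = {"L1": 0, "L2": 0, "L3": 2, "L4": 2}

unordered = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # any change costs one step
ordered = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]    # going 0 -> 2 passes through 1

print(parsimony_steps(tree, states, unordered))  # 1
print(parsimony_steps(tree, states, ordered))    # 2
```

The same tree and data thus receive different scores depending on the step matrix, which is how the matrix can shift the choice of most parsimonious tree.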
35. A consideration: one step is harder to take if the value is universally rare, and easier if the value is universally common.
- We stipulate for the extreme cases that
  - going to a feature value shared by 100% of all languages in the sample is a non-step, i.e. it should subtract one step from the step matrix
  - going to a value shared by (100/v)% of all languages (where v = the number of values) should neither add to nor detract from the number of steps
  - going to a value that none of the languages have adds one extra step to the matrix
- These stipulations allow us to set up a formula to modify the steps in the matrix, taking into account world-wide distributions.
36. The formula is of a polynomial nature and has the following shape:
- s = steps added or detracted (max 1, min -1)
- w = world-wide distribution (percent of all languages in the sample; entered in the formula as a proportion, e.g. 50% = 0.5)
- v = number of feature values

$$s = \frac{v(v-2)}{v-1}\,w^{2} - \frac{v^{2}-2}{v-1}\,w + 1$$
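
The coefficients of this polynomial follow from the three stipulated extreme cases. The short derivation below is a reconstruction, assuming a quadratic in w with w written as a proportion between 0 and 1; the coefficient names a, b, c are introduced here for illustration only.

```latex
\begin{align*}
s(w) &= a\,w^{2} + b\,w + c \\
s(0) = 1 &\;\Rightarrow\; c = 1
  && \text{(value found in no language: one extra step)} \\
s(1) = -1 &\;\Rightarrow\; a + b = -2
  && \text{(value found in all languages: one step fewer)} \\
s(1/v) = 0 &\;\Rightarrow\; a + b\,v = -v^{2}
  && \text{(value with an even share: no change)} \\[4pt]
\text{subtracting: } b\,(v-1) = 2 - v^{2}
  &\;\Rightarrow\; b = -\frac{v^{2}-2}{v-1}, \qquad
  a = -2 - b = \frac{v(v-2)}{v-1} \\[4pt]
s(w) &= \frac{v(v-2)}{v-1}\,w^{2} - \frac{v^{2}-2}{v-1}\,w + 1
\end{align*}
```

For v = 2 the quadratic term vanishes and the adjustment reduces to the straight line s = 1 - 2w.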
38. Returning to the example of the velar nasal, we find that the world-wide distribution in the WALS sample is:

Value                    Languages  w (%)  s
No velar nasal           234        50.0   -0.38
Non-initial only         88         18.8    0.39
Initial and non-initial  146        31.2    0.05

Thus the step matrix should be modified as follows:
Original
Modified
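
The two matrices shown on the slide are not preserved in this transcript. The sketch below recomputes the s values from the counts, reading the slide's formula as s = (v(v-2)/(v-1))w² - ((v²-2)/(v-1))w + 1 with w as a proportion, which reproduces -0.38, 0.39 and 0.05. It then applies them to a hypothetical original matrix. Two things are assumptions made here, not facts from the lecture: that the original matrix has one step between any two distinct values, and that the adjustment for a value is added to every step leading to that value.

```python
def step_adjustment(v, w):
    """Steps added (positive) or detracted (negative) for a target value.

    v: number of feature values; w: share of languages with the value (0..1).
    """
    a = v * (v - 2) / (v - 1)
    b = (v * v - 2) / (v - 1)
    return a * w * w - b * w + 1

counts = {
    "No velar nasal": 234,
    "Non-initial only": 88,
    "Initial and non-initial": 146,
}
names = list(counts)
total = sum(counts.values())
s = {name: step_adjustment(len(names), counts[name] / total) for name in names}

# Hypothetical original matrix (assumption): one step between distinct values.
original = [[0 if i == j else 1 for j in range(len(names))] for i in range(len(names))]

def modify(matrix, names, s):
    """Add the adjustment of the target value (column) to every off-diagonal step."""
    return [
        [cost if i == j else cost + s[names[j]] for j, cost in enumerate(row)]
        for i, row in enumerate(matrix)
    ]

modified = modify(original, names, s)
for name, row in zip(names, modified):
    print(f"{name:24s}", [round(x, 2) for x in row])
```

Rows are source values and columns are target values, so under these assumptions a change into the common value "No velar nasal" becomes cheaper than a change into the rarer "Non-initial only".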
39. Effects of using step matrices for computing genealogical trees
40. What happens when more data is added? (63 languages, 96 features)
41. Forcing known, shallow nodes by adding lexical data

'Acoma         rrcaaaqanrarn????aca...rcrc??r?rnn?aaa???aa
'Apurina'      arnaaaaanaaaaanraaaa...rcrrdrarraeq???d??aa
'Araona'       cadddaarnraaanga?aa?...?aaar?arqrar???d????
'Arawak'       ????????n????rgaa???...rcrra?a?r??q???drr??
'AwaPit'       aannaaadnrana????aan...rdraa?rrqncarradrraa
'Aymara'       dacaadrrn?araqga?aan...rcdc??arrrra????ar??
'BarasanoA     ????????na?????????n...rrqan?rraaernra?rraa
'Cahuilla'     nadaadarrranargeaaa?...rcdn?drrr??qnaa?aa??
'CanelaKra'    anaaaaara?anaega?aaa...?nraa?r?raeaaara??aa
'Carib'        rrrrraaanraraaqd?aa?...?crc??arararrrr?ar??
'Cayuvava'     rnrrdaaanaaaacgaaaad...rcnr??a?aae???????aa
'ChinantecB    nrnrraara?arnaea?aa?...rrra??ar?aea????rr??
'Comanche'     araaaaaanrarargaaaa?...ncrnr?arrnna???arr??
'CoosHani'     crcrrdrdnranaanacaaa...rcccn?arraeaaaa???aa
'CreePlains    aaraaaaanraracgaaaaa...nccnn?rrraeaaaa?rraa
'EpenaPedee    rnrrraaanaaraanr?dar...caaaa?rrarcraara????
...........................................
'UrubuKaap'    rrraaaaaraaraegaraa?...rrnaa?rrraeaaarcra??
'Warao'        arraaaannaaaaqgaaaar...?naaadararar???arraa
'Wari'         rrraaaaanarraegacnaa...dcrrndandraq??????ar
'Wichi'        nrnaarqrnrarargaaaan...?ccrndrrrnranaarrrar
'Wichita'      nacaaaqanra?aaccnraa...rccr?arrrnnanra?rraa
'Yagua'        araaaaaanaarr????nac...ncrrndrrraearrra??aa
'Yaqui'        rrnraaarnrarrrgeaaan...qaaandarraeaaard??ar
'Yuchi'        crdrraqdnaa?aaeqdaaa...rcrd??rnaaer??????aa
'Yurok'        drdaaaqdnrana????aaa...rcrc??a?rnqaaar???an
'ZoqueCop'     rrraaaarr?anaqgaaaan...rcrc??r?nrdraaa??raa
'Zuni'         nrnaaardn?ararga?aan...?aaa??arq???????araa
42. Forcing known, shallow nodes by adding lexical data

'Acoma         rrcaaaqanrarn????aca...rcrc??r?rnn?aaa???aa---------
'Apurina'      arnaaaaanaaaaanraaaa...rcrrdrarraeq???d??aa---------
'Araona'       cadddaarnraaanga?aa?...?aaar?arqrar???d????---------
'Arawak'       ????????n????rgaa???...rcrra?a?r??q???drr??---------
'AwaPit'       aannaaadnrana????aan...rdraa?rrqncarradrraa---------
'Aymara'       dacaadrrn?araqga?aan...rcdc??arrrra????ar??---------
'BarasanoA     ????????na?????????n...rrqan?rraaernra?rraa---------
'Cahuilla'     nadaadarrranargeaaa?...rcdn?drrr??qnaa?aa??aaaaaaaaa
'CanelaKra'    anaaaaara?anaega?aaa...?nraa?r?raeaaara??aa---------
'Carib'        rrrrraaanraraaqd?aa?...?crc??arararrrr?ar??---------
'Cayuvava'     rnrrdaaanaaaacgaaaad...rcnr??a?aae???????aa---------
'ChinantecB    nrnrraara?arnaea?aa?...rrra??ar?aea????rr??---------
'Comanche'     araaaaaanrarargaaaa?...ncrnr?arrnna???arr??aaaaaaaaa
'CoosHani'     crcrrdrdnranaanacaaa...rcccn?arraeaaaa???aa---------
'CreePlains    aaraaaaanraracgaaaaa...nccnn?rrraeaaaa?rraa---------
'EpenaPedee    rnrrraaanaaraanr?dar...caaaa?rrarcraara????---------
....................................................
'UrubuKaap'    rrraaaaaraaraegaraa?...rrnaa?rrraeaaarcra??---------
'Warao'        arraaaannaaaaqgaaaar...?naaadararar???arraa---------
'Wari'         rrraaaaanarraegacnaa...dcrrndandraq??????ar---------
'Wichi'        nrnaarqrnrarargaaaan...?ccrndrrrnranaarrrar---------
'Wichita'      nacaaaqanra?aaccnraa...rccr?arrrnnanra?rraa---------
'Yagua'        araaaaaanaarr????nac...ncrrndrrraearrra??aa---------
'Yaqui'        rrnraaarnrarrrgeaaan...qaaandarraeaaard??araaaaaaaaa
'Yuchi'        crdrraqdnaa?aaeqdaaa...rcrd??rnaaer??????aa---------
'Yurok'        drdaaaqdnrana????aaa...rcrc??a?rnqaaar???an---------
'ZoqueCop'     rrraaaarr?anaqgaaaan...rcrc??r?nrdraaa??raa---------
'Zuni'         nrnaaardn?ararga?aan...?aaa??arq???????araa---------
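
In the second matrix, nine extra columns are appended to every row: `aaaaaaaaa` for Cahuilla, Comanche and Yaqui (all Uto-Aztecan languages) and `---------` for every other language. A minimal sketch of that augmentation follows. The function name, the parameter names and the shortened toy sequences are hypothetical, and how the phylogenetic software interprets `-` (as a gap or as missing data) is not stated in the slides.

```python
def add_forcing_columns(matrix, group, n_cols=9, shared="a", gap="-"):
    """Append n_cols characters to each row of a character matrix.

    Members of the known group receive the shared state; all other
    languages receive the gap symbol.
    """
    return {
        name: seq + (shared if name in group else gap) * n_cols
        for name, seq in matrix.items()
    }

# Toy rows with shortened, made-up sequences (not the real data above).
toy_matrix = {
    "Cahuilla": "nada",
    "Comanche": "araa",
    "Yaqui": "rrnr",
    "Zuni": "nrna",
}
known_group = {"Cahuilla", "Comanche", "Yaqui"}

for name, seq in add_forcing_columns(toy_matrix, known_group).items():
    print(f"{name:10s}{seq}")
```

Because the group members share nine identical characters that no other language has, a parsimony or distance method has a strong incentive to join them in one node.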
43. The effect of forced nodes
Black dots: forced nodes; gray dots: correct nodes emerging from the data
44. The effect of known ancestral states for a large dataset
Ancestral states known
45. Ancestral states unknown
47. In the larger phylogenetic picture, the use of knowledge of founder effect values has a positive effect on the classification of languages known to be related.
48. Next step: verifying hypotheses by traditional methods
49. - Fin -
- Tomorrow: a critical evaluation of Dunn et al.'s recent paper in Science on Austronesian/Papuan and some vistas regarding the classification of the New World language family