Paleo-European languages

Richard W · Post by **Richard W** » Thu Oct 29, 2020 8:06 pm

zompist wrote: ↑Thu Oct 29, 2020 5:08 pm This is regrettable computolatry. Computers are not free of the "subjective element"; they reproduce the biases of their programmers— and in a more pernicious way because people don't realize that they can be biased.

Every decision made in programming such comparisons is subjective: what data to include and exclude, how the data is represented, how differences are measured, how much those differences are weighted.

The point is to make the procedures explicit, and ensure that they are followed. The difficulties are in capturing them, implementing them, and making the explicit processes comprehensible.

zompist wrote: ↑Thu Oct 29, 2020 5:08 pm And that's before even looking at other aspects of the methodology, such as the bad data that Nortaneous pointed out.

Do you mean the table of English words? I can only see two clear errors there, the entries for 'what' and 'white', which are represented having an onset denoted by the undefined sequence wh~. (May one use slashes to delimit folded phonemes?) These two entries are not included in the 40 actually used. Perhaps the entry mini for 'many' bothers you. If I've always rimed 'any' and 'many', then I used to pronounce that word that way. This entry wasn't used either.

zompist wrote: ↑Thu Oct 29, 2020 5:08 pm When computers are used to do things humans can't easily do, that's exciting. But to throw out 200 years of human work to replace it with a high-school-level understanding of language change is not "scientific".

As for "repeatable", why do you think historical linguistics is not repeatable? Do you think that if linguists started over, they'd randomly group Indic with Chinese or something? There's not many areas of inquiry where so many eyeballs have looked at such a quantity of data.

Don't you mean Chinese with Germanic? (Or is that non-random.)

One of the specific inconsistencies that ASJP was hoped to solve was in the scoring of cognates on Swadesh lists. It may even work in phonologically conservative families. In general though, I don't think the method adopted works, but I'm surprised that it seems to work as well as it does.

The assessment of whether two words are cognate is often subjective. The only hope I see of objectivity is a way of measuring plausibility, but we've a long way to go on that.

zompist wrote: ↑Thu Oct 29, 2020 5:08 pm There are, as these threads have demonstrated, lots of areas of disagreement. Not one of them would be improved by hiding the decisionmaking in a computer program, or pretending that the resulting program is infallible.

The decision making should be exposed in the program, not 'hidden' there. A calculator of random matches might be useful for this thread, but we're not seeing a testable volume of data. Are semantic databases yet ready to be harnessed?
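For what such a calculator of random matches might look like, here is a minimal sketch. It assumes (unrealistically) that every meaning on the list is an independent trial with the same probability `p` of a chance resemblance; the function name and the numbers in the usage line are made up for illustration and are not taken from any ASJP paper.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more chance matches among n meanings.

    Assumes independent trials with a fixed per-meaning match
    probability p, i.e. a plain binomial model.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs: a 40-meaning list and a 5% chance that an
# unrelated word pair looks like a match.
print(prob_at_least(5, 40, 0.05))
```

The hard part is of course estimating `p` for a given pair of phonologies and a given definition of "match"; the arithmetic afterwards is the easy bit.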

zompist wrote: ↑Thu Oct 29, 2020 5:08 pm The one thing I'll grant you is that the programmers have found a way to automate Joseph Greenberg. They don't seem to have improved on him, however.

So the accusations of his being misled by bad data are groundless? After all, it should be possible to rerun them with corrected data. (Though sometimes the methods papers make frightening mention of computing architectures and mention enormous run times.)

Richard W · Post by **Richard W** » Thu Oct 29, 2020 8:51 pm

WeepingElf wrote: ↑Thu Oct 29, 2020 7:30 pm
zompist wrote: ↑Thu Oct 29, 2020 5:08 pm This is regrettable computolatry. Computers are not free of the "subjective element"; they reproduce the biases of their programmers— and in a more pernicious way because people don't realize that they can be biased.
Also, doing something on a computer doesn't make it any more scientific. If the pot is cracked, no computer will ever remove that crack. Most astrologers today use computer programs to calculate horoscopes; that doesn't make astrology any less superstitious. And the ASJP folks practice precisely the kind of etymology about which Voltaire quipped that "consonants count for little and vowels for nothing at all". No regular sound correspondences; just inspectional similarities. No way to tell actual cognates from loanwords and chance resemblances. This is not even "proper" glottochronology, which counted cognates rather than "similarities" - and has been shown not to work because it was based on false assumptions. This is the kind of pseudo-glottochronology crackpots use to "demonstrate" fanciful language relationships. Only with computers - and misused academic credentials in fields that have actually nothing to do with comparative linguistics.

I think you're confusing sciencehood and the use of good theory. Is making well-defined artillery calculations ignoring air resistance 'scientific' or not? They're not practising science until they go back and refine their methods to fix the problems.

The scattergram of date of split against similarity in the ASJP dating paper looks interesting, but the points need a lot of scrutiny. There's no real justification for expecting a straight line. Reading the comments on it, I did wonder if they should instead be yielding dates in 'Austronesian years' - calibration data is generally hard to come by, and points on the graph should really be bars. I know there's a feeling that comparativists shouldn't do dates, but population geneticists aren't shy about publishing dates with enormous error bars.

Richard W · Post by **Richard W** » Thu Oct 29, 2020 10:02 pm

Richard W wrote: ↑Thu Oct 29, 2020 12:50 pm French's loss of final consonants may have helped isolate it; its possibly unique notation for nasalisation (/*/) and the undefined tilde probably don't help.

I'm wrong about the asterisk - it is defined as the nasalisation mark. However, it is also documented as being ignored in the analyses.

Post by **zompist** » Thu Oct 29, 2020 10:42 pm

Richard W wrote: ↑Thu Oct 29, 2020 8:06 pm
zompist wrote: ↑Thu Oct 29, 2020 5:08 pm And that's before even looking at other aspects of the methodology, such as the bad data that Nortaneous pointed out.
Do you mean the table of English words? I can only see two clear errors there, the entries for 'what' and 'white', which are represented having an onset denoted by the undefined sequence wh~. (May one use slashes to delimit folded phonemes?) These two entries are not included in the 40 actually used. Perhaps the entry mini for 'many' bothers you. If I've always rimed 'any' and 'many', then I used to pronounce that word that way. This entry wasn't used either.

* i and ɪ are conflated
* so are u and ʊ, and o and ɔ
* e and ɛ are sometimes conflated and sometimes not
* <E> is used sometimes for ɛ, sometimes for æ
* θ and ð are conflated
* "all" and "small" don't rhyme, and "small" is given the same vowel as "not"
* syllabic r is sometimes written with a shwa, sometimes just as r
* syllabic n always gets a shwa, syllabic l never
* au is sometimes aw, sometimes au

If they can't get their own language right, how much confidence should we have that they get other languages right?

For "fun", I checked their French wordlist. It's just as bad. "Graisse" is not [grais]; "tu" is not [ti]; "graine" is not [gran]; "cheveux" is not [ʃəve]; "plume" and "pied" do not have the same vowel; they can't decide whether to list verbs as infinitives or stems; "moon" is not [len] or [lɛn]; "montagne" and "rouge" do not end in the same consonant; final [j] is sometimes included and sometimes not...

Don't you mean Chinese with Germanic? (Or is that non-random.)

No. I'd still like an answer to the question: why do you think the comparative method is not repeatable?

If your answer is that, at some points, researchers make their own judgments... yes, they do, and so do programmers. They do not become more "scientific" because they are in a program.

The assessment of whether two words are cognate is often subjective.

Maybe so. But cognacy is not a fancy way of saying "similarity", so it cannot be replaced by a complicated way of measuring similarity.

People argue about cognacy because words have complicated histories, and ways of getting things wrong are legion. A computer does not rescue you from this.

The decision making should be exposed in the program, not 'hidden' there.

I don't know their program-- if they make use of neural nets like the cool kids do these days, that is precisely hiding the decisionmaking. If it's all C# or something, then stating your rules that way is far more obfuscatory than writing them down in a language other linguists can read.

Also note that a huge amount of their project is the database itself, full of decisions that are not laid out in code. Did a computer tell them to confuse their lax and tense vowels?

zompist wrote: ↑Thu Oct 29, 2020 5:08 pm The one thing I'll grant you is that the programmers have found a way to automate Joseph Greenberg. They don't seem to have improved on him, however.
So the accusations of his being misled by bad data are groundless? After all, it should be possible to rerun them with corrected data. (Though sometimes the methods papers make frightening mention of computing architectures and mention enormous run times.)

Greenberg's method is mass comparison without the comparative method. For why that is problematic even with good data, see a historical linguistics text.

Talskubilos · Post by **Talskubilos** » Fri Oct 30, 2020 7:39 am

WeepingElf wrote: ↑Thu Oct 29, 2020 7:30 pm Fair. But that is indeed an old article, and I have changed my opinion about this since then, and no longer assume a dental-palatal merger in Pre-PIE. It may have happened, but it may just as well have not. We don't know yet; probably, only comparison with external relatives could tell, but those external relatives have not been established yet.

The thing is there's some evidence of that, and I've provided a little bit.

WeepingElf wrote: ↑Thu Oct 29, 2020 7:30 pm Also, you had misrepresented the palatal consonants I wrote about as "dorsal affricates" such that I could not remember writing about them because I never did.

That's right.

WeepingElf · Post by **WeepingElf** » Fri Oct 30, 2020 12:14 pm

Perhaps it was unfair of me to claim that the ASJP people do etymology of the kind Voltaire criticized. I don't really know what they are doing, but it seems to me, at least, as if they assume that languages change the same way as DNA, namely by random alterations of segments, and on the ground of that assumption, apply procedures geneticists use to compute phylogenetic family trees to languages. Of course, this is not the way languages change, so one can expect their results to be erroneous. At any rate, dating the divergences is difficult, as languages do not "mutate" at a rate that is constant across time and languages (just consider Icelandic vs. English). And, as I have said (and Zompist has, too), adding computer power to a flawed procedure doesn't make it any more scientific.

Richard W · Post by **Richard W** » Fri Oct 30, 2020 12:27 pm

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
Richard W wrote: ↑Thu Oct 29, 2020 8:06 pm
zompist wrote: ↑Thu Oct 29, 2020 5:08 pm And that's before even looking at other aspects of the methodology, such as the bad data that Nortaneous pointed out.
Do you mean the table of English words? I can only see two clear errors there, the entries for 'what' and 'white', which are represented having an onset denoted by the undefined sequence wh~. (May one use slashes to delimit folded phonemes?) These two entries are not included in the 40 actually used. Perhaps the entry mini for 'many' bothers you. If I've always rimed 'any' and 'many', then I used to pronounce that word that way. This entry wasn't used either.
* i and ɪ are conflated

The words are recorded in ASJPcode, as defined in the second reference on the Wikipedia page, Brown et al 2008. The choice to use ASJPcode is regrettable - it greatly reduces the value of their database.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * so are u and ʊ, and o and ɔ

Them's the rules.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * e and ɛ are sometimes conflated and sometimes not

I think we're seeing a judgement call for English /e/ between [e] (> ASJP e) and [ɛ] (> ASJP E). In my speech, the vowel of 'breast' is opener than the vowel of 'egg', and for the transcriber they seem to have fallen either side of the boundary.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * <E> is used sometimes for ɛ, sometimes for æ
* θ and ð are conflated

Them's the rules.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * "all" and "small" don't rhyme, and "small" is given the same vowel as "not"

We do seem to have an accent mixture between one where the vowels are back and open or mid and one where the vowels are central and open. However, I'm not confident that we aren't seeing an accurate record of an idiolect.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * syllabic r is sometimes written with a shwa, sometimes just as r
* syllabic n always gets a shwa, syllabic l never

Given that these sequences are in free variation, this isn't entirely wrong, just unhelpful. Arguably both should have been given; alternatively, a decision should have been made for the case of r.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * au is sometimes aw, sometimes au

This may reflect an allophonic difference, as with the word 'egg'. There is something slightly different about the vowel of 'round', a long vowel arguably in a long sequence of three vocoids between consonants as opposed to a short sequence of three vocoids in 'mountain' or just two vocoids in 'crowd'. Fortunately, ROUND is not one of the 40 meanings used.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm If they can't get their own language right, how much confidence should we have that they get other languages right?

As I mentioned in a previous post, I was worried by their inability to apply their distance metric manually in a published paper.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm For "fun", I checked their French wordlist. It's just as bad. "Graisse" is not [grais]; "tu" is not [ti]; "graine" is not [gran]; "cheveux" is not [ʃəve]; "plume" and "pied" do not have the same vowel; they can't decide whether to list verbs as infinitives or stems; "moon" is not [len] or [lɛn]; "montagne" and "rouge" do not end in the same consonant; final [j] is sometimes included and sometimes not...

There do seem to be multitudinous errors in their transcription. According to their rules, [a] should be transcribed as E and [ɑ] as o; their list reads as though they have merged to ä, which then should be transcribed as a. Am I just behind the times?

They have transcribed "tu" and "cheveux" in accordance with the rules. However, although the rules may be "easy to type" (one of their defences of their system) if one understands the rules, "plume" exhibits interference from IPA - they should of course have entered it as plim, not plym.

There may be a grammatically conditioned sound change going on in French. I checked the infinitives on English Wikipedia, and found that final /ʁ/ was often missing from the sound clips, e.g. in "mourir" and "voir". On the other hand, it was very easy to hear the rhotic in -"tre" and -"dre".

I had evaluated their Thai word list. Thai's five tones are split into two sets of three, and the initial consonants are written differently for the two sets. (There are extra tone marks to cope with the cases where there aren't the alternative consonants.) Their transcriptions reflect the different ways of writing the onset consonants. The one word on the 100-word list that I know to be spelt the etymologically wrong way (ฆ่า 'to kill') is not on the 40-word list. However, for Lao, which has the same system in this respect, they don't distinguish the two sets of tones. The word list composers don't disobey the rules consistently. (Both Thai and Lao have a tone which appears in both sets, the tone for B4 in the Gedney tone box.)

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm I'd still like an answer to the question: why do you think the comparative method is not repeatable?

The obvious answer is the Ehret, Orel and Stolbova two-part refutation of the existence of the comparative method, demonstrated in good faith by application to Afroasiatic.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm If your answer is that, at some points, researchers make their own judgments... yes, they do, and so do programmers. They do not become more "scientific" because they are in a program.

Repeatability and documentation make things more 'scientific'. Think of the fun and games we can have with the English, Latin and Greek words for 'wolf'. For that matter, think of the fun and games we can have with the words for 'one' - do they go back to PIE? The *oi bit seems to go back to PIE, the termination bit possibly, but the middle bit is chaotic.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
The assessment of whether two words are cognate is often subjective.
Maybe so. But cognacy is not a fancy way of saying "similarity", so it cannot be replaced by a complicated way of measuring similarity.

People argue about cognacy because words have complicated histories, and ways of getting things wrong are legion. A computer does not rescue you from this.

If you are talking about ASJP, it does not have much reasonable application to mostly well-understood families like Indo-European. Where it may hope to have reasonable application is where there is little more than not very deeply analysed word lists to work on. At this level, cognacy judgements may very well be little more than similarity judgements forced to binary values. One then runs ASJP on Indo-European not to learn about Indo-European, but to learn about the emergent behaviour of ASJP and its uses.

When used for the relatedness of languages, the project's hope is that a comparison of similarity measures will give similar results to a comparison of cognacy judgements, and benefit from consistency. It might even help with semicognacy - the addition or, rarely, subtraction of affixes.

In general, a computer can rescue you from inconsistency.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
The decision making should be exposed in the program, not 'hidden' there.
I don't know their program-- if they make use of neural nets like the cool kids do these days, that is precisely hiding the decisionmaking. If it's all C# or something, then stating your rules that way is far more obfuscatory than writing them down in a language other linguists can read.

Also note that a huge amount of their project is the database itself, full of decisions that are not laid out in code. Did a computer tell them to confuse their lax and tense vowels?

Some people see a virtue in consistency. For example, quality requirements imply that it is better to be consistently mediocre than to be usually mediocre but occasionally excellent. I do demur.

However, there is the view that repeatability is desirable in science.

A neural net feels more like substituting the subjective opinion of the history of training of some algorithm for a person's. Its only advantages are speed of training, speed of application (one hopes) and internal consistency. I had in mind a set of defined processes, which probably requires an English statement, though might conceivably be satisfied by clear computer code. Executing a written process consistently is not something people are particularly good at.

ASJPcode makes coarse phonetic distinctions to reduce the extent to which a difference in accent or transcriber preferences gives misleading assessments of difference. I view that as a dirty quick fix, but better motivated solutions don't seem to have worked.

Raphael · Post by **Raphael** » Fri Oct 30, 2020 12:50 pm

Richard W wrote: ↑Fri Oct 30, 2020 12:27 pm
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * so are u and ʊ, and o and ɔ
Them's the rules.

zompist wrote: ↑Thu Oct 29, 2020 10:42 pm * <E> is used sometimes for ɛ, sometimes for æ
* θ and ð are conflated
Them's the rules.

Then the worse for the rules, of course.

If you are talking about ASJP, it does not have much reasonable application to mostly well-understood families like Indo-European. Where it may hope to have reasonable application is where there is little more than not very deeply analysed word lists to work on.

To me, this sounds as if you're basically saying, "We can't use this method for subjects about which we know a lot, because in those cases, what we already know about the subjects will tell us that the method's results are no good, but we can use this method for subjects about which we don't yet know much, so we don't have to worry about the method's results clashing with established knowledge".

Umh, no. If a method gives you bad results in cases where you can easily check the quality of the results, that's an argument against using the method in cases where you can't easily check the quality of the results.

At this level, cognacy judgements may very well be little more than similarity judgements forced to binary values. One then runs ASJP on Indo-European not to learn about Indo-European, but to learn about the emergent behaviour of ASJP and its uses.

Which is relevant for linguistic research because... ?

WeepingElf · Post by **WeepingElf** » Fri Oct 30, 2020 2:54 pm

Raphael wrote: ↑Fri Oct 30, 2020 12:50 pm [...]

If you are talking about ASJP, it does not have much reasonable application to mostly well-understood families like Indo-European. Where it may hope to have reasonable application is where there is little more than not very deeply analysed word lists to work on.
To me, this sounds as if you're basically saying, "We can't use this method for subjects about which we know a lot, because in those cases, what we already know about the subjects will tell us that the method's results are no good, but we can use this method for subjects about which we don't yet know much, so we don't have to worry about the method's results clashing with established knowledge".

Umh, no. If a method gives you bad results in cases where you can easily check the quality of the results, that's an argument against using the method in cases where you can't easily check the quality of the results.

Yes. If you have shown that a method gives wrong results where the right results are known, you not only can, you must discard it. It can't be expected to give useful results anywhere.

Raphael wrote: ↑Fri Oct 30, 2020 12:50 pm
At this level, cognacy judgements may very well be little more than similarity judgements forced to binary values. One then runs ASJP on Indo-European not to learn about Indo-European, but to learn about the emergent behaviour of ASJP and its uses.
Which is relevant for linguistic research because... ?

The statement "cognacy judgements may very well be little more than similarity judgements forced to binary values" reveals that whoever made it has no idea what cognacy means at all. To give just two famous examples: Greek theos and Latin deus look similar but aren't cognate; Armenian erku and English two look utterly different but are cognate. The ASJP algorithm would probably mark the former as "cognate" and the latter as "not cognate". Failure to realize this difference is precisely what I meant when I said that ASJP falls back into the kind of etymology about which Voltaire said that "consonants count for little and vowels for nothing at all". It is idle and meaningless, no matter how much computer power one invests in it.

Richard W · Post by **Richard W** » Fri Oct 30, 2020 4:03 pm

Raphael wrote: ↑Fri Oct 30, 2020 12:50 pm
Richard W wrote: ↑Fri Oct 30, 2020 12:27 pm At this level, cognacy judgements may very well be little more than similarity judgements forced to binary values. One then runs ASJP on Indo-European not to learn about Indo-European, but to learn about the emergent behaviour of ASJP and its uses.
Which is relevant for linguistic research because... ?

...it tells you how much credence to give results from the ASJP.

Ares Land · Post by **Ares Land** » Fri Oct 30, 2020 4:12 pm

Richard W wrote: ↑Fri Oct 30, 2020 12:27 pm The words are recorded in ASJPcode, as defined in the second reference on the Wikipedia page, Brown et al 2008. The choice to use ASJPcode is regrettable - it greatly reduces the value of their database.

That's quite the understatement!

ASJPcode really makes no sense. 1) Why reduce everything to ASCII? In this day and age? 2) Their metrics ignore correspondences. Even using features would improve things somewhat.

They have transcribed "tu" and "cheveux" in accordance with the rules. However, although the rules may be "easy to type" (one of their defences of their system) if one understands the rules, "plume" exhibits interference from IPA - they should of course have entered it as plim, not plym.

And the rules are poorly defined, I'm afraid. Using ti,S3ve instead of tu, S3vo artificially increases distance with Spanish and Italian.
And really, if they had to merge phonemes (I don't see why, but whatever) using the back vowels would make a lot more sense. (For starters, that's what non-native speakers do... And babies learning the language.)
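To put a number on the ti/tu point, here is a minimal sketch. It assumes the core of the metric is Levenshtein distance divided by the length of the longer word (as far as I understand the ASJP papers, that is the starting point, and any further normalisation they apply is left out here). The function names are mine, and "tu" for Spanish is simply my own transcription of "tú".

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalised_distance("ti", "tu"))  # French as coded vs. Spanish -> 0.5
print(normalised_distance("tu", "tu"))  # with /y/ merged into u instead -> 0.0
```

So on a two-segment word, the choice of which vowel to merge /y/ into is the difference between "identical" and "half different".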

There may be a grammatically conditioned sound change going on in French. I checked the infinitives on English Wikipedia, and found that final /ʁ/ was often missing from the sound clips, e.g. in "mourir" and "voir".

There's no such sound change. What you're hearing is an artifact of the recording, I'm afraid.

The obvious answer is the Ehret, Orel and Stolbova two-part refutation of the existence of the comparative method, demonstrated in good faith by application to Afroasiatic.

Sorry but... what? That study doesn't refute the comparative method. It shows that proto-Afroasiatic can't be reliably reconstructed, which is hardly surprising.

Repeatability and documentation make things more 'scientific'.

The comparative method is repeatable: you can certainly check the correspondences yourself! And the procedures are abundantly documented.

Think of the fun and games we can have with the English, Latin and Greek words for 'wolf'.

Most of these 'fun and games' were about making undemonstrated claims.
For that matter, they do illustrate the dangers of relying on resemblance alone.

ASJPcode makes coarse phonetic distinctions to reduce the extent to which a difference in accent or transcriber preferences gives misleading assessments of difference.

That's too bad, because they made wrong choices in those coarse distinctions, and they gave misleading assessments of difference.
Choosing to transcribe rounded front vowels as <e i> instead of <o u> is a very good example of that: it artificially inflates the differences between French and other Romance languages.

I do see the use of such a tool for giving some hints of what languages might be related (and certainly not classifications or anything). But it's not ready for production just yet. (There's also the matter of data: no matter how sound your procedures, garbage in, garbage out)

I'm honestly surprised the data and metrics are so simple-minded. Especially since, honestly, programming this is trivial.

Take IPA phonemic transcription. Transform these into series of features. /y/ > +Front +Rounded +High, for instance. Calculate the score based on features.
Loop over the word list, then move on to the next pair of languages. Honestly I could code something better in a week.

You could do even better by taking correspondences into account: word pairs get extra points if a correspondence reoccurs often enough in the corpus. Harder but by no means unfeasible.
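A rough sketch of the feature idea, just to show the shape of it. Everything here is an assumption for illustration: the tiny feature table is hand-made and nowhere near a real feature system, the segment distance (share of features not in common) is one arbitrary choice among many, and the correspondence bonus is left out entirely.

```python
# Hypothetical, deliberately tiny feature table; a real one would cover
# the whole IPA and use a principled feature set.
FEATURES = {
    "i": {"high", "front"},
    "y": {"high", "front", "rounded"},
    "u": {"high", "back", "rounded"},
    "e": {"mid", "front"},
    "o": {"mid", "back", "rounded"},
    "a": {"low", "central"},
    "t": {"consonant", "coronal", "stop"},
    "d": {"consonant", "coronal", "stop", "voiced"},
    "p": {"consonant", "labial", "stop"},
}


def segment_distance(x: str, y: str) -> float:
    """Share of features that the two segments do not have in common (0..1)."""
    fx, fy = FEATURES[x], FEATURES[y]
    return len(fx ^ fy) / len(fx | fy)


def word_distance(a: str, b: str) -> float:
    """Edit distance where a substitution costs the feature distance,
    insertions and deletions cost 1, normalised by the longer word."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,
                            curr[j - 1] + 1.0,
                            prev[j - 1] + segment_distance(ca, cb)))
        prev = curr
    longest = max(len(a), len(b))
    return prev[-1] / longest if longest else 0.0


print(word_distance("ty", "tu"))  # /y/ kept as a front rounded vowel
print(word_distance("ti", "tu"))  # /y/ collapsed to i
```

With this toy table, keeping /y/ as its own segment puts "ty" closer to "tu" than "ti" is, which is the point about merging towards the wrong vowel. Looping this over a word list and over language pairs is the easy part; getting decent data in is not.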

Richard W · Post by **Richard W** » Fri Oct 30, 2020 5:22 pm

WeepingElf wrote: ↑Fri Oct 30, 2020 12:14 pm At any rate, dating the divergences is difficult, as languages do not "mutate" at a rate that is constant across time and languages (just consider Icelandic vs. English).

While Icelandic vocabulary is conservative, its phonetics are not so conservative. I'm mildly* interested in how much the seeming flaws in the ASJP would compensate for the conservatism by treating Icelandic's phonetic changes like vocabulary loss.

*I.e. unlikely to actually do the run.

Post by **zompist** » Fri Oct 30, 2020 6:08 pm

Ares Land wrote: ↑Fri Oct 30, 2020 4:12 pm I do see the use of such a tool for giving some hints of what languages might be related (and certainly not classifications or anything). But it's not ready for production just yet. (There's also the matter of data: no matter how sound your procedures, garbage in, garbage out)

Ares has said what I'd say and more, so I won't add to that.

But this point is worth thinking about some more. As I said, ASJP has automated Greenberg. I don't agree with WeepingElf that this is of no use. Its advocates have said, rightly enough, that mass comparison is always the first step in analysis. The problem comes when its advocates think it can replace the later steps.

I don't know if ASJP thinks that way, but my main point is no one should hide bad data and poor methodology by acting as if the computer is Inherently Better.

It's worth asking: As an automated Greenberg, does this project do any better than Greenberg?

* Has anyone compared ASJP's results with Greenberg & Ruhlen? Which is better and why?
* Are there any putative relationships uncovered by ASJP that were not posited 40 years ago by Greenberg?

And maybe more importantly: is this a problem worth solving? To give them the maximum possible benefit of the doubt, the researchers seem to think that finding putative relationships (to test later) is some kind of huge unsolved problem. But... we have Greenberg. And Voegelin, and Ethnologue. And, you know, literally thousands of researchers who provided the data and most of whom are, I'd assume, very interested in finding relationships. The low-hanging fruit is already picked.

A lot of this work seems to me to prejudge the eventual results: the assumption is that huge families like IE and Bantu are "normal", and a hundred-odd families in the Americas must look the same if we looked deeper. Thus the attraction of pretty crappy analyses like Greenberg's work on the Americas. But we really don't know that IE is "normal" and the Americas are not, and we do have historical information that IE spread unusually far.

The sad truth is that doing things right takes decades, especially as none of it is viewed as very important by funding agencies.

A case in point: "Quechumaran", which I've looked at extensively. Greenberg posits this (he is not the first), and I'm sure ASJP does too. About 1/3 of Quechua roots are shared with Aymara. Oho, they're clearly related!!1! Only no, it's not clear at all. The shared roots are almost all identical; indeed, they're closer than most intra-Quechua relationships. This is exactly what we see in cases of massive borrowing: French into English, Latin into most European languages, Chinese into Japanese/Korean/Vietnamese, Arabic into Persian, Persian into Urdu, Spanish into a load of Native American languages. Using a "basic words list" helps, but doesn't solve the problem.

You can put those words aside and look for deeper resemblances. Only, then, the "family" dries up. I'm not aware of any convincing cognates in the remaining 2/3 of the language. (Campbell claims to have found some, but I haven't seen them, and even he doesn't think the case is overwhelming.) And these are neighboring languages, one of the better cases of putative relationships.

WeepingElf · Post by **WeepingElf** » Fri Oct 30, 2020 6:47 pm

Fair. ASJP may not be utterly useless; it may indeed throw up candidates for undetected relationships, which then would have to be examined by conventional means of historical linguistics. Many of the candidates found by means of ASJP, however, will probably turn out to be loanword relationships of the kind zompist mentioned. Loanwords are the likeliest explanation when two languages have a lot of words in common but do not resemble each other grammatically. Like, for instance, Uralic and Yukaghir, which is IMHO best explained by assuming a layer of Samoyedic loanwords in Yukaghir.

My conclusion: ASJP has a non-zero millinyland rating, though not a very high one, perhaps about 100 millinylands. There are many ideas in circulation that are much crazier!

Also, large language families like Indo-European are probably the exception rather than the rule. They emerge when a population has a considerable edge over its neighbours and can invade them (or when it has the occasion to expand into previously uninhabited territory, as with Polynesian). Typical scenarios are farmers displacing foragers (as probably with Austronesian), or militarily and economically superior conquerors (as certainly with Romance, and probably with Indo-European). Where such advantages do not exist, we get lots of small language families influencing each other in complex ways, making it hard, if not impossible, to untangle that mess.

To get back to the topic of Paleo-European languages, I can only repeat what I have already said here. Mesolithic Europe probably was full of small language communities, in a situation where it would be very difficult to draw up family trees. Something like the indigenous languages of Australia, New Guinea or North America. Almost certainly, linguistic diversity was highest in the Mediterranean countries where European humans had ridden out the Last Glacial Maximum, and lowest in Scandinavia which had only recently become habitable.

The Neolithic revolution proceeded, as paleogenetic studies have shown, by the expansion of farmers into forager territory, starting from Anatolia. This would have resulted in the establishment of a few larger language families, perhaps even just one. This language family, however, probably was not Indo-European, as there was a later genetic turnover indicating that some other group in turn displaced the Neolithic farmers. This other group probably were the PIE speakers, and their advantage probably was both military (horses) and economic (animal-drawn ploughs and wagons, allowing them to work much larger fields than the Neolithic farmers with their hoes and back baskets; also dairy farming).

And then, there even seem to have been two layers of IE in western Europe, one associated with the Bell Beaker culture in the Copper Age, and one associated with the Urnfield culture and related groups in the Bronze Age. The Bell Beaker languages would be lost, but may have left traces in the Old European Hydronymy (if that is a thing at all; this remains to be explored, and the first thing to do here is to map the darn thing, which AFAIK hasn't been done yet); the Italic, Celtic and Germanic branches would belong to the Urnfield & co. layer, but influenced by the Bell Beaker languages. However, I am not really sure about that, even if I use this hypothesis in my conlangs.

Nortaneous · Post by **Nortaneous** » Fri Oct 30, 2020 6:59 pm

zompist wrote: ↑Fri Oct 30, 2020 6:08 pm And maybe more importantly: is this a problem worth solving? To give them the maximum possible benefit of the doubt, the researchers seem to think that finding putative relationships (to test later) is some kind of huge unsolved problem. But... we have Greenberg. And Voegelin, and Ethnologue. And, you know, literally thousands of researchers who provided the data and most of which are, I'd assume, very interested in finding relationships. The low-hanging fruit is already picked.

Intra-family subgrouping is a huge unsolved problem - at least, unsolved enough that it's possible to make potentially original contributions (comprehensive literature reviews are boring, and difficult when the literature is extremely multilingual) with a laptop and a few weekends.

Ethnologue and Glottolog will give you the impression that the subgrouping of e.g. Sino-Tibetan is as settled as that of IE, or Uralic, but it really isn't.

Travis B. · Post by **Travis B.** » Fri Oct 30, 2020 8:28 pm

I imagine an at least marginally better algorithm would use words encoded as segments formed from features rather than a highly inferior ASCII encoding. Of course this would not solve the problem of not being able to discern cognates from non-cognates and the like.

Richard W · Post by **Richard W** » Fri Oct 30, 2020 9:19 pm

Ares Land wrote: ↑Wed Oct 28, 2020 4:30 am Well, of course, but that's a pretty big 'if'. I can't say I've read anything really conclusive on the matter. IIRC there are claims of significant pre-Clovis ancestry, especially in South America.

Nelson J.R. Fagundes et al. claim in "Mitochondrial Population Genomics Supports a Single Pre-Clovis Origin with a Coastal Route for the Peopling of the Americas" that a population with maybe 1,000 women populated the Americas with an expansion some time between 18k and 15k BP. They refer to other studies to argue that the Eskimo-Aleut and Na-Dene speakers also descend from this group, though possibly the evidence is only for the female line.

Znex · Post by **Znex** » Fri Oct 30, 2020 9:30 pm

When I was studying Levenshtein distances (I did a paper on the Romani dialect differences a few years ago for my Bachelor's), it occurred to me that Levenshtein distances are much better suited to establishing literal language distance or mutual intelligibility than to establishing any actual genetic distance.

This is clear in the case of French for instance; French is certainly genetically closer to the langues d'Oc than to Romanian or any other strange choices from Romance, but in terms of how much French has changed in actual form from the other Romance languages, it is certainly set apart.

Some linguists already recognise this and have been mainly using Levenshtein distances for the analysis of dialects or similarly closely related languages (e.g. for the Irish dialects, for North Germanic, for the Slavic languages); but as an indicator of genetic relation itself, if it is to be of any use at all, it certainly needs more tweaking and adjustment to account for systematic sound changes.
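
For reference, a minimal sketch of the plain (unweighted) Levenshtein distance under discussion, plus a length-normalised variant. The function names are mine, and the normalisation by the longer word is only one common choice, not necessarily what any particular study used.

```python
def levenshtein(a, b):
    """Minimum number of single-symbol insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]                       # deleting the first i symbols of a
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,            # delete x
                cur[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y)  # substitute, free if symbols match
            ))
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Distance divided by the length of the longer word (assumed convention)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Every symbol mismatch costs the same here, so the measure only sees surface difference. It has no notion of a regular sound change, which is why it tracks how much the forms have diverged rather than descent.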

Ares Land · Post by **Ares Land** » Sat Oct 31, 2020 5:52 am

Using the Levenshtein metric is a good idea, really. The thing is, it compares symbols, so you have to choose the symbols with care.

You could, say, transcribe mur and muro this way:
'bnouIrq / 'bnouUrtouO and calculate distance based on that.
(If you really want ASCII, fine! just use the letters for features. In addition 'myr > 'bnouIrq could be defined as a lookup table and obtained from IPA through string substitution, not left at the whim of the transcriber.)
To the ASJP choices, I add stress (it's really quite consistent across Romance, so it should get bonus points) and the uvular rhotic (it brings German and French closer together, so it should get bonus points too).
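
A rough sketch of the lookup-table idea, using string substitution from IPA to feature letters. The table entries are guessed by segmenting the two example strings above (for instance m as "bn"), so treat them as hypothetical; FEATURE_TABLE and encode are names made up for this illustration, and I write the French rhotic as ʁ so that it gets its own entry.

```python
# Hypothetical table: entries guessed from the two example strings,
# not a defined transcription scheme.
FEATURE_TABLE = {
    "'": "'",    # stress mark, copied through unchanged
    "m": "bn",
    "y": "ouI",
    "u": "ouU",
    "o": "ouO",
    "ʁ": "rq",   # uvular rhotic
    "r": "rt",   # other rhotic
}


def encode(ipa, table=FEATURE_TABLE):
    """Replace each IPA segment by its feature letters, longest key first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(ipa):
        for key in keys:
            if ipa.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # Refuse to guess: an unknown symbol must be added to the table.
            raise ValueError(f"no table entry for {ipa[i]!r} at position {i}")
    return "".join(out)


print(encode("'myʁ"))   # 'bnouIrq
print(encode("'muro"))  # 'bnouUrtouO
```

Raising on an unknown symbol rather than passing it through is deliberate: it keeps the transcription out of the transcriber's hands.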

The more I think about it, the more I think the problem of cognates isn't that hard to fix. Add each set of features to a lookup table any time we compare words. Then run through the lexicon again, and add bonus points based on how often a correspondence shows up in the table.
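
A minimal sketch of that two-pass idea, assuming the word pairs have already been aligned segment by segment (alignment is the hard part and is not shown). The function names, the min_count threshold and the toy data are all invented for illustration.

```python
from collections import Counter


def count_correspondences(aligned_pairs):
    """First pass: tally every aligned segment pair across the lexicon."""
    table = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("word pairs must be pre-aligned to equal length")
        table.update(zip(a, b))
    return table


def regularity_score(a, b, table, min_count=2):
    """Second pass: fraction of this word pair's segment pairs that
    recur at least min_count times in the whole lexicon."""
    pairs = list(zip(a, b))
    if not pairs:
        return 0.0
    recurring = sum(1 for pair in pairs if table[pair] >= min_count)
    return recurring / len(pairs)


# Made-up toy lexicon: p:f recurs, the third pair shares nothing regular.
lexicon = [
    (["p", "a", "t"], ["f", "a", "t"]),
    (["p", "e", "s"], ["f", "e", "s"]),
    (["k", "a", "n"], ["t", "o", "m"]),
]
table = count_correspondences(lexicon)
for a, b in lexicon:
    print("".join(a), "".join(b), regularity_score(a, b, table))
```

Note that a recurring p:f counts here exactly as a recurring identical pair would, which is the point: regularity is rewarded, not surface similarity.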

We'd have way better results with complete sentences and words in context. If we had verbs in the future tense, Romanian would get a bit further than the rest. Add plurals and we get Western and Eastern Romance.
With glossed sentences we could go further still. (So the articles in Romanian get bonus points for being cognate and a malus for being postposed.)

Really, bring me back a glossed IPA transcript of 300 identical sample sentences (that are not from the Bible) in all documented languages in the Americas. With the methods above, I'll give you putative language families and sprachbunds in two weeks.

In other words, as usual, the algorithms are trivial, the real problem is feeding them with enough quality data.

Creyeditor · Post by **Creyeditor** » Sat Oct 31, 2020 6:52 am

I just wanted to defend the ASJP a bit. A while ago, I heard a talk by someone working on it, and he said that they use weighted Levenshtein distances that take into account phonological similarity when comparing sounds via substitutions and that prioritize regular correspondences over non-systematic ones. One main problem pointed out in the talk was that when you apply such a hybrid approach to less well-known language families, you often do not have long enough word lists to compare. Also, semantic shift is very hard to implement, because longer word lists often include cognates that are rarer, and semantic shifts cannot easily be hard-coded then.
I also seem to recall that ASJP-like methods can recreate some established families with good accuracy with the right algorithms. Oh, and some insider news from the Max-Planck-Institutes: the algorithms are really expensive, because apparently you have to buy them.
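
To make the weighted variant concrete, here is a minimal sketch in which substituting similar sounds costs less than substituting dissimilar ones. It is not the algorithm from the talk, which I do not have: the feature sets, the Jaccard-based cost and the function names are assumptions of mine, and the prioritising of regular correspondences is left out.

```python
# Hypothetical, deliberately tiny feature inventory.
FEATURES = {
    "p": {"labial", "stop", "voiceless"},
    "b": {"labial", "stop", "voiced"},
    "m": {"labial", "nasal", "voiced"},
    "t": {"coronal", "stop", "voiceless"},
    "d": {"coronal", "stop", "voiced"},
    "a": {"vowel", "low"},
    "i": {"vowel", "high", "front"},
    "u": {"vowel", "high", "back"},
}


def sub_cost(x, y):
    """0 for identical sounds, 1 for sounds sharing no features,
    otherwise 1 minus the Jaccard similarity of the feature sets."""
    if x == y:
        return 0.0
    fx, fy = FEATURES.get(x), FEATURES.get(y)
    if fx is None or fy is None:
        return 1.0                      # unknown symbol: full cost
    return 1.0 - len(fx & fy) / len(fx | fy)


def weighted_levenshtein(a, b, indel=1.0):
    """Levenshtein distance with feature-based substitution costs."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel,                # delete x
                cur[j - 1] + indel,             # insert y
                prev[j - 1] + sub_cost(x, y),   # substitute x by y
            ))
        prev = cur
    return prev[-1]


print(weighted_levenshtein("pa", "ba"))  # 0.5: p and b differ only in voicing
print(weighted_levenshtein("pa", "ia"))  # 1.0: p and i share no features
```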
