Paleo-European languages

Richard W · Post by **Richard W** » Sat Oct 31, 2020 9:36 am

Ares Land wrote: ↑Sat Oct 31, 2020 5:52 am We'd have way better results with complete sentences and words in context. If we had verbs in the future tense, Romanian would get a bit further than the rest. Add plurals and we get Western and Eastern Romance.
With glossed sentences we could go further still. (So the articles in Romanian get bonus points for being cognate and a malus for being postposed.)

Really, bring me back a glossed IPA transcript of 300 identical sample sentences (that are not from the Bible) in all documented languages in the Americas. With the methods above, I'll give you putative language families and sprachbunds in two weeks.

I trust that last paragraph isn't meant to be of the form "if false then X". How would you compare the Western European definite article with the classical Greek, Arabic and Hebrew definite articles? For that matter, should we count the West Germanic and Greek definite articles as cognate? They have a common origin, but not as articles.

Kuchigakatai · Post by **Kuchigakatai** » Sat Oct 31, 2020 11:13 am

To this day, I don't understand why these computational linguists seemingly push this kind of stuff as a method rather than a tool of a larger toolset. Like, using something like Ares Land's scheme to find candidates, but otherwise getting down to the grunt work of sound changes vs. analogy vs. borrowings anyway. How is that 40-word ASJP thing supposed to be taken seriously?

Richard W · Post by **Richard W** » Sat Oct 31, 2020 12:53 pm

Kuchigakatai wrote: ↑Sat Oct 31, 2020 11:13 am To this day, I don't understand why these computational linguists seemingly push this kind of stuff as a method rather than a tool of a larger toolset. Like, using something like Ares Land's scheme to find candidates, but otherwise getting down to the grunt work of sound changes vs. analogy vs. borrowings anyway. How is that 40-word ASJP thing supposed to be taken seriously?

Salesmanship? Who's paying for this work, and how are they induced to pay for it?

There's a statistical problem: using more data entails adding in worse data. If you are short of data in the 40 words, then the next 60 words will be mostly noise. One needs some way of blending in the 60 appropriately, and of justifying that way of blending them in. If one is working with a similarity metric, and one's blending method works by extending the data set, how does one compare a distance based on the 40 stablest meanings with a distance based on the 50 stablest?
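One possible way of blending, sketched here purely as an illustration: take a weighted mean of per-meaning distances, so that less stable meanings count for less and the result stays on the same 0-to-1 scale whatever the list length. The function name `blended_distance`, the toy distances and the weights are all made up for this sketch; nothing here is taken from ASJP, and the weights themselves would still need the justification asked for above.

```python
def blended_distance(distances, weights):
    """Weighted mean of per-meaning distances, each assumed to lie in [0, 1].

    Dividing by the total weight keeps the result on the same scale
    regardless of how many meanings are included.
    """
    if len(distances) != len(weights):
        raise ValueError("need exactly one weight per meaning")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(d * w for d, w in zip(distances, weights)) / total


# Toy numbers only: two stable meanings, then two noisy extra meanings.
core = [0.2, 0.4]
extra = [1.0, 1.0]

core_only = blended_distance(core, [1, 1])
unweighted = blended_distance(core + extra, [1, 1, 1, 1])
downweighted = blended_distance(core + extra, [1, 1, 0.25, 0.25])
print(core_only, unweighted, downweighted)
```

With these toy values the noisy extras pull the unweighted figure well away from the core-only one, and down-weighting them reduces that pull without discarding them. It does not answer how the weights ought to be chosen.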

I can't help wondering if Mailhammer has been doing exactly what you suggest - use the tool to group, and then find and marshal the solid comparative evidence.

Ares Land · Post by **Ares Land** » Sat Oct 31, 2020 4:51 pm

Richard W wrote: ↑Sat Oct 31, 2020 9:36 am I trust that last paragraph isn't meant to be of the form "if false then X".

What does that mean?

How would you compare the Western European definite article with the classical Greek, Arabic and Hebrew definite articles? For that matter, should we count the West Germanic and Greek definite articles as cognate? They have a common origin, but not as articles.

In the same way. The point is that we don't know beforehand which are cognate or not (nor do we want to know which words are cognate); the idea would be to get a global resemblance score.

To be honest, with the data available, it's perhaps not worth bothering with better algorithms.

Creyeditor wrote: ↑Sat Oct 31, 2020 6:52 am I also seem to recall that ASJP-like methods can recreate some established families with good accuracy with the right algorithms. Oh, and some insider news from the Max-Planck-Institutes: the algorithms are really expensive, because apparently you have to buy them

*Insert insane screaming about open source and software patents here*

Salesmanship? Who's paying for this work, and how are they induced to pay for it?

Yes, that's most likely it. You probably have an easier time getting research grants if you claim to have a revolutionary new method. (They'd get even more dough if they claimed to use AI and blockchain.)

zompist · Post by **zompist** » Sat Oct 31, 2020 5:05 pm

Nortaneous wrote: ↑Fri Oct 30, 2020 6:59 pm Intra-family subgrouping is a huge unsolved problem - at least, unsolved enough that it's possible to make potentially original contributions (comprehensive literature reviews are boring, and difficult when the literature is extremely multilingual) with a laptop and a few weekends.

Ethnologue and Glottolog will give you the impression that the subgrouping of e.g. Sino-Tibetan is as settled as that of IE, or Uralic, but it really isn't.

True. I think there's several levels to this problem, and methods won't necessarily be the same.

If you're thinking of Tibeto-Karen (246 languages, probably more now) or Bantu (494), it's kind of a mess. Maybe computers can help out here just because the data is so huge. Or maybe researchers just need to chew on it for another century, I dunno.

With something like Romance (16), I'm not sure that The Correct Answer exists. I can't defend Jack Rea's opinion, and he can't either since he died a few years back, but as I understood it (explained over a long lunch), it was that the more you know about a language family, the harder it is to classify. To put it in computerese, you know all the factors rather than the superficial ones an outsider would, and there is no objective way to weight them. E.g.:

* French is pretty aberrant phonetically (though it shares nasalization with Portuguese)
* Lexically, my impression is that it's closer to Italian than to the Iberian languages
* If you look at negation, French's two-part negator groups it with Catalan
* Portuguese, unlike all its neighbors, escaped the breaking of o/e
* Sard is the only one to keep [k] before front vowels
* Romanian has postposed articles and retains case
* Portuguese has a personal infinitive, which defies the laws of man and God

And on and on. Which differences are the most important? And if you got a consensus within Romance ("phonology is most important, with features evaluated in this order..., followed by lexicon, ..."), would it make sense to apply the same rules to (say) Semitic?

Richard W · Post by **Richard W** » Sat Oct 31, 2020 5:37 pm

Ares Land wrote: ↑Sat Oct 31, 2020 4:51 pm
Richard W wrote: ↑Sat Oct 31, 2020 9:36 am I trust that last paragraph isn't meant to be of the form "if false then X".
What does that mean?

Your preconditions are very hard to satisfy if taken literally. Would failure to meet them all absolve you of your undertaking?

What does 'documented' mean? There are very sketchily recorded languages for which I am sure one could not supply such a corpus.
300 identical sample sentences. I'm not sure that exact equivalents are so easy to find outside a language area. Simple sentences with overlapping meanings can easily have different ranges of meaning. So what does 'identical' mean? Would having a common semantic overlap satisfy the requirement?
"Not from the Bible". A corpus of 300 sentences with the same meaning would probably have to be commissioned. What gets commissioned (or mandated) - translations of the Bible!

Ares Land wrote: ↑Sat Oct 31, 2020 4:51 pm
How would you compare the Western European definite article with the classical Greek, Arabic and Hebrew definite articles? For that matter, should we count the West Germanic and Greek definite articles as cognate? They have a common origin, but not as articles.
In the same way. The point is that we don't know beforehand which are cognate or not (nor do we want to know which words are cognate); the idea would be to get a global resemblance score.

For Western Europe, we have art-Adj-Noun, art-Noun-Adj or Adj-Noun-art. The other three have art-Adj-art-Noun or art-Noun-art-Adj. How would you match such structures up? Do you need to design comparisons so that the English article would match the Greek article?

Ares Land · Post by **Ares Land** » Sat Oct 31, 2020 7:00 pm

Richard W wrote: ↑Sat Oct 31, 2020 5:37 pm Your preconditions are very hard to satisfy if taken literally. Would failure to meet them all absolve you of your undertaking?

Yes, they're very hard and deliberately so. The kind of data required to build language trees based on family resemblances is very hard to collect. It's not much of a surprise they get wacky results.
(OK, I'll skip the no-Bible requirement. Bible translations are likely to be non-representative, but they'd be lots better than a restricted wordlist)

If we want to classify languages into families, we need a lot of good quality, structured data.
The boring steps of digging up correspondences and resemblances can be delegated to a computer or not. It doesn't really matter.

Richard W wrote: ↑Sat Oct 31, 2020 5:37 pm For Western Europe, we have art-Adj-Noun, art-Noun-Adj or Adj-Noun-art. The other three have art-Adj-art-Noun or art-Noun-art-Adj. How would you match such structures up? Do you need to design comparisons so that the English article would match the Greek article?
At a first approximation, words listed as ART would be compared against each other; them being inflected in the same way and placed in the same location would get extra points. (How many extra points? No idea. If this was a real project, I'd probably have a lot of test runs on these languages)
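A minimal sketch of that first approximation, with every specific an assumption: sentences are lists of (form, gloss tag) pairs, orthographic forms stand in for IPA, form similarity comes from Python's `difflib.SequenceMatcher`, and `POSITION_BONUS` is a placeholder, since how many extra points to give is exactly what's left open. Only placement is scored here; agreement in inflection is left out.

```python
from difflib import SequenceMatcher

POSITION_BONUS = 0.5  # placeholder weight; the right value is unknown


def article_position(sentence):
    """Return 'before' or 'after' for the first ART relative to the first N."""
    tags = [tag for _, tag in sentence]
    if "ART" not in tags or "N" not in tags:
        return None
    return "before" if tags.index("ART") < tags.index("N") else "after"


def article_score(sent_a, sent_b):
    """Form similarity of the first ART in each sentence, plus a placement bonus."""
    arts_a = [form for form, tag in sent_a if tag == "ART"]
    arts_b = [form for form, tag in sent_b if tag == "ART"]
    if not arts_a or not arts_b:
        return 0.0
    score = SequenceMatcher(None, arts_a[0], arts_b[0]).ratio()
    pos_a, pos_b = article_position(sent_a), article_position(sent_b)
    if pos_a is not None and pos_a == pos_b:
        score += POSITION_BONUS
    return score


# 'the wolf', glossed by hand for illustration only
french = [("le", "ART"), ("loup", "N")]
spanish = [("el", "ART"), ("lobo", "N")]
romanian = [("lup", "N"), ("ul", "ART")]

print(article_score(french, spanish))   # preposed in both: bonus applies
print(article_score(french, romanian))  # preposed vs postposed: no bonus
```

Nothing in this needs to know whether the articles are cognate; it only compares whatever is tagged ART and rewards matching placement.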

Though again, the point isn't to actually write the code; it's to show that the problem of designing the algorithms is trivial, compared to getting the requisite corpus.

Nortaneous · Post by **Nortaneous** » Sat Oct 31, 2020 8:09 pm

zompist wrote: ↑Sat Oct 31, 2020 5:05 pm
Nortaneous wrote: ↑Fri Oct 30, 2020 6:59 pm Intra-family subgrouping is a huge unsolved problem - at least, unsolved enough that it's possible to make potentially original contributions (comprehensive literature reviews are boring, and difficult when the literature is extremely multilingual) with a laptop and a few weekends.

Ethnologue and Glottolog will give you the impression that the subgrouping of e.g. Sino-Tibetan is as settled as that of IE, or Uralic, but it really isn't.
True. I think there's several levels to this problem, and methods won't necessarily be the same.

If you're thinking of Tibeto-Karen (246 languages, probably more now) or Bantu (494), it's kind of a mess. Maybe computers can help out here just because the data is so huge. Or maybe researchers just need to chew on it for another century, I dunno.

With something like Romance (16), I'm not sure that The Correct Answer exists. I can't defend Jack Rea's opinion, and he can't either since he died a few years back, but as I understood it (explained over a long lunch), it was that the more you know about a language family, the harder it is to classify. To put it in computerese, you know all the factors rather than the superficial ones an outsider would, and there is no objective way to weight them. E.g.:

* French is pretty aberrant phonetically (though it shares nasalization with Portuguese)
* Lexically, my impression is that it's closer to Italian than to the Iberian languages
* If you look at negation, French's two-part negator groups it with Catalan
* Portuguese, unlike all its neighbors, escaped the breaking of o/e
* Sard is the only one to keep [k] before front vowels
* Romanian has postposed articles and retains case
* Portuguese has a personal infinitive, which defies the laws of man and God

And on and on. Which differences are the most important? And if you got a consensus within Romance ("phonology is most important, with features evaluated in this order..., followed by lexicon, ..."), would it make sense to apply the same rules to (say) Semitic?

It seems reasonable to me that mutually exclusive developments - cases where all languages have undergone a shift in {A, B, C, ...} where each one necessarily precludes the others - should be given some amount of priority, modulo the probability of parallel developments. (Germanic and Tocharian have potentially identical treatment of syllabic resonants, although it's extremely hard to tell whether *R̥ > *ar or *ur - but given that there are so many IE languages, they all lost the syllabic resonants eventually, and there are only so many syllabic resonants, what are the odds of coincidental matches?) In the case of Romance, this would be the vowel system: if you have uncompensated loss of length, you can't have ē i > e̝, and so on. This gets you Western, Eastern (Romanian etc.), and Southern (African Romance and Sardinian) branches. But then you may want to subgroup those...

Richard W · Post by **Richard W** » Sun Nov 01, 2020 7:16 am

Richard W wrote: ↑Thu Oct 29, 2020 8:06 pm I can only see two clear errors there, the entries for 'what' and 'white', which are represented as having an onset denoted by the undefined sequence wh~.

I've been digging into old papers, and published ASJP code and data. It seems that tilde forms the two symbols before into a digraph, and there is or was a similar convention for dollar and trigraphs. In the old days, when comparison was done by looking for two identical phonemes in the words with limits on the amount of intervening material, both w and h would match wh~. Now, the downloadable database includes JSON data (a file form.csv) that splits words into individual phonemes written in IPA, so for example kh~ is translated into /kʰ/. Curiously, wh~ gets translated to /w/, which at least agrees with ENGLISH_2. I don't know what the matching rules are for Levenshtein-based distance matching. The Levenshtein distance calculation is coded in Python, and I don't speak parseltongue. However, the strings are passed in as arrays, and who knows how the '!=' operator gets interpreted.
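To make the concern concrete, here is a minimal sketch of the two pieces as I understand them. It is not the ASJP project's code: `tokenize_asjp` and `levenshtein` are names invented for this illustration, the tilde and dollar handling follows only the convention described above, and other ASJP modifiers are not handled.

```python
def tokenize_asjp(word):
    """Split an ASJP-style string into segments.

    Assumed convention: '~' fuses the two preceding symbols into one
    segment, and '$' fuses the three preceding symbols.
    """
    segments = []
    for ch in word:
        if ch == "~":
            n = 2
        elif ch == "$":
            n = 3
        else:
            segments.append(ch)
            continue
        if len(segments) < n:
            raise ValueError(f"{ch!r} needs {n} preceding symbols in {word!r}")
        fused = "".join(segments[-n:])
        del segments[-n:]
        segments.append(fused)
    return segments


def levenshtein(a, b):
    """Edit distance between two sequences of segments (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # plain equality of whole segments
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical transcriptions, chosen only to show the effect
print(tokenize_asjp("wh~at"))                                      # ['wh', 'a', 't']
print(levenshtein(tokenize_asjp("wh~at"), tokenize_asjp("wat")))   # 1
print(levenshtein(list("wh~at"), list("wat")))                     # 2
```

In this sketch the segments are ordinary Python strings, so the comparison is plain string equality: a fused `wh` matches neither `w` nor `h`, unlike the older matching described above. If the real code passes in objects of some other type, the comparison could behave differently, which is the inspection problem.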

Overloading works fine when the operations are trusted, but it does make code inspection harder. Obscure and proprietary code can fairly be said to make the process less scientific.

Looking at the FRENCH entry, I perceive further horrors such as kwa for WHAT but pw~aso* for FISH, the latter yielding (necessarily imperfect) IPA /pʷɐsõ/. I'm wondering if FRENCH has been kept for repeatability. There is also a FRENCH_2 that looks a lot better - though recording the rhotic as uvular will give forms that may seem bizarre after ASJP folding. This one does not record nasalisation - it seems that the revised ASJP ignores nasalisation, as documented in Wikipedia, and therefore it is not shown in a newer list.

Richard W · Post by **Richard W** » Sun Nov 01, 2020 7:32 am

Nortaneous wrote: ↑Sat Oct 31, 2020 8:09 pm This gets you Western, Eastern (Romanian etc.), and Southern (African Romance and Sardinian) branches. But then you may want to subgroup those...

Or later developments may be more important. Many classifications seem to group Sicilian with Tuscan rather than Romanian, and I suspect Dalmatian may more naturally group with the Italian dialects. Younger changes may dominate older changes. For example, although South Slavic seems to emerge as a natural group, the West v. East division was originally present in southern Slavic, but has been overlain by more recent developments.

As an extreme example, I recall reading (probably forty years ago) that some dialects of Lithuanian agreed with Slavic rather than standard Lithuanian in some satem developments. However, one would normally classify those dialects with other Lithuanian dialects rather than with Slavic dialects.

WeepingElf · Post by **WeepingElf** » Sun Nov 01, 2020 8:23 am

zompist wrote: ↑Sat Oct 31, 2020 5:05 pm I can't defend Jack Rea's opinion, and he can't either since he died a few years back, but as I understood it (explained over a long lunch), it was that the more you know about a language family, the harder it is to classify.

I think I have read this somewhere, too, perhaps from someone else, though. I have always wondered how for many less well known families, even doubtful ones such as Nilo-Saharan, people confidently draw up family trees with lots of neat binary divisions and internal nodes, while for the best-studied family of all, Indo-European, they cannot agree on any subgrouping above the level of the ten (or more) "primary branches" such as Germanic or Indo-Iranian. Doesn't this imply that the binary trees for the less well known families are premature?

Talskubilos · Post by **Talskubilos** » Sun Nov 01, 2020 8:51 am

WeepingElf wrote: ↑Sun Nov 01, 2020 8:23 am I think I have read this somewhere, too, perhaps from someone else, though. I have always wondered how for many less well known families, even doubtful ones such as Nilo-Saharan, people confidently draw up family trees with lots of neat binary divisions and internal nodes, while for the best-studied family of all, Indo-European, they cannot agree on any subgrouping above the level of the ten (or more) "primary branches" such as Germanic or Indo-Iranian. Doesn't this imply that the binary trees for the less well known families are premature?

The problem lies in the relative inadequacy of the genealogical tree model.

Richard W · Post by **Richard W** » Sun Nov 01, 2020 9:56 am

WeepingElf wrote: ↑Sun Nov 01, 2020 8:23 am Doesn't this imply that the binary trees for the less well known families are premature?

They're probably more maps of perceivable plausibly genetic connections than serious phylogenies. For Nilo-Saharan, it would seem we are not in a position where the tree model fails.

It's interesting to see the tension in biologists when they defend 'hard trichotomies'.

Travis B. · Post by **Travis B.** » Sun Nov 01, 2020 10:01 am

Talskubilos wrote: ↑Sun Nov 01, 2020 8:51 am The problem lies in the relative inadequacy of the genealogical tree model.

And your logic is that just because the means we have are imperfect, anything goes instead.

Talskubilos · Post by **Talskubilos** » Sun Nov 01, 2020 10:22 am

Travis B. wrote: ↑Sun Nov 01, 2020 10:01 am
Talskubilos wrote: ↑Sun Nov 01, 2020 8:51 am The problem lies in the relative inadequacy of the genealogical tree model.
And your logic is that just because the means we have are imperfect, anything goes instead.

Not really. The thing is the genealogical tree doesn't account for lateral relationships (substrates and adstrates), so we need better models.

WeepingElf · Post by **WeepingElf** » Sun Nov 01, 2020 11:25 am

Richard W wrote: ↑Sun Nov 01, 2020 9:56 am
WeepingElf wrote: ↑Sun Nov 01, 2020 8:23 am Doesn't this imply that the binary trees for the less well known families are premature?
They're probably more maps of perceivable plausibly genetic connections than serious phylogenies. For Nilo-Saharan, it would seem we are not in a position where the tree model fails.

It's interesting to see the tension in biologists when they defend 'hard trichotomies'.

AFAIK these trees are based on counts of lexical similarities between the branches - actual cognates where, as in Uralic, the sound correspondences are known, mere casual resemblances where, as in Nilo-Saharan, they are not. The Indo-Europeanists are beyond that stage of tree-drawing, which can be fouled by various factors that may distort the picture (basically the same problems as with glottochronology: the rate of lexical replacement is far from uniform across time and languages), and now attempt to draw trees based on structural shared innovations such as sound changes - and run into a thicket of intersecting isoglosses which shows that the wave model is more apt than the family tree model. The Uralicists are currently moving towards what the Indo-Europeanists have done for decades, and the new trees look quite different from the old ones (for instance, the "Ugric" languages - it is now doubted that these form a valid node at all - are now grouped with Samoyedic as "East Uralic").

Nortaneous · Post by **Nortaneous** » Sun Nov 01, 2020 12:04 pm

Talskubilos wrote: ↑Sun Nov 01, 2020 10:22 am
Travis B. wrote: ↑Sun Nov 01, 2020 10:01 am
Talskubilos wrote: ↑Sun Nov 01, 2020 8:51 am The problem lies in the relative inadequacy of the genealogical tree model.
And your logic is that just because the means we have are imperfect, anything goes instead.
Not really. The thing is the genealogical tree doesn't account for lateral relationships (substrates and adstrates), so we need better models.

It isn't the job of the genealogical tree to account for lateral relationships! If you think there ought to be a way of representing language history that represents lateral transfer, that's one thing - maybe the microbiologists have something figured out here - but genealogical trees are not trying to do that. And lateral transfer is in most cases readily distinguishable from and much weaker than genealogical inheritance, so this would only be relevant for, like, Bai and Wutun. It's also much harder to demonstrate: Tocharian probably has significant Uralic (or Yeniseian??) influence, but the developments could conceivably have been internal, and I don't know of any convincing Uralic loans in Tocharian. This was likely not out of any constant language-specific resistance to loaning - there are plenty of Indo-Iranian and Sinitic loans.

Talskubilos · Post by **Talskubilos** » Sun Nov 01, 2020 12:35 pm

Nortaneous wrote: ↑Sun Nov 01, 2020 12:04 pm It isn't the job of the genealogical tree to account for lateral relationships! If you think there ought to be a way of representing language history that represents lateral transfer, that's one thing - maybe the microbiologists have something figured out here - but genealogical trees are not trying to do that. And lateral transfer is in most cases readily distinguishable from and much weaker than genealogical inheritance, so this would only be relevant for, like, Bai and Wutun. It's also much harder to demonstrate: Tocharian probably has significant Uralic (or Yeniseian??) influence, but the developments could conceivably have been internal, and I don't know of any convincing Uralic loans in Tocharian. This was likely not out of any constant language-specific resistance to loaning - there are plenty of Indo-Iranian and Sinitic loans.

I'm afraid Indo-European can't be accounted for without lateral transfers, even at the "PIE" level. To quote an example, Germanic *xandu- 'hand' comes from the same lexeme *ḱmt- fossilized in some IE numerals.

Travis B. · Post by **Travis B.** » Sun Nov 01, 2020 12:39 pm

Talskubilos wrote: ↑Sun Nov 01, 2020 12:35 pm I'm afraid Indo-European can't be accounted for without lateral transfers, even at the "PIE" level. To quote an example, Germanic *xandu- 'hand' comes from the same lexeme *ḱmt- fossilized in some IE numerals.

You do know you are making an extraordinary claim - do you have extraordinary evidence to back it up?

Talskubilos · Post by **Talskubilos** » Sun Nov 01, 2020 2:01 pm

Travis B. wrote: ↑Sun Nov 01, 2020 12:39 pm
Talskubilos wrote: ↑Sun Nov 01, 2020 12:35 pm To quote an example, Germanic *xandu- 'hand' comes from the same lexeme *ḱmt- fossilized in some IE numerals.
You do know you are making an extraordinary claim - do you have extraordinary evidence to back it up?

I think the Germanic and the PIE lexemes are related to Semitic *χamʃ '5', and I bet their common ancestor would be *kamtʃ- or something like that.
