zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
Richard W wrote: ↑Thu Oct 29, 2020 8:06 pm
zompist wrote: ↑Thu Oct 29, 2020 5:08 pm
And that's before even looking at other aspects of the methodology, such as the bad data that Nortaneous pointed out.
Do you mean the table of English words? I can only see two clear errors there: the entries for 'what' and 'white', which are represented as having an onset denoted by the undefined sequence wh~. (May one use slashes to delimit folded phonemes?) These two entries are not included in the 40 actually used. Perhaps the entry mini for 'many' bothers you. If I've always rhymed 'any' and 'many', then I used to pronounce that word that way. This entry wasn't used either.
* i and ɪ are conflated
The words are recorded in ASJPcode, as defined in the second reference on the Wikipedia page, Brown et al. 2008. The choice to use ASJPcode is regrettable - it greatly reduces the value of their database.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
* so are u and ʊ, and o and ɔ
Them's the rules.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
* e and ɛ are sometimes conflated and sometimes not
I think we're seeing a judgement call for English /e/ between [e] (> ASJP e) and [ɛ] (> ASJP E). In my speech, the vowel of 'breast' is opener than the vowel of 'egg', and for the transcriber they seem to have fallen either side of the boundary.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
* <E> is used sometimes for ɛ, sometimes for æ
* θ and ð are conflated
Them's the rules.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
* "all" and "small" don't rhyme, and "small" is given the same vowel as "not"
We do seem to have a mixture of accents, between one where these vowels are back and open or mid and one where they are central and open. However, I'm not confident that we aren't seeing an accurate record of an idiolect.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
* syllabic r is sometimes written with a shwa, sometimes just as r
* syllabic n always gets a shwa, syllabic l never
Given that these sequences are in free variation, this isn't entirely wrong, just unhelpful. Arguably both should have been given; alternatively, a decision should have been made for the case of r.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
* au is sometimes aw, sometimes au
This may reflect an allophonic difference, as with the word 'egg'. There is something slightly different about the vowel of 'round': arguably a long vowel in a long sequence of three vocoids between consonants, as opposed to the short sequence of three vocoids in 'mountain' or just two vocoids in 'crowd'. Fortunately, ROUND is not one of the 40 meanings used.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
If they can't get their own language right, how much confidence should we have that they get other languages right?
As I mentioned in a previous post, I was worried by their inability to apply their distance metric manually in a published paper.
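For context, the metric in question is, as I understand it, LDN: the Levenshtein edit distance between two ASJP-coded words divided by the length of the longer word (LDND additionally divides by an average over non-matching meanings to correct for chance similarity). A minimal sketch of the word-level part, my own illustration rather than their code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[len(b)]

def ldn(a: str, b: str) -> float:
    """Levenshtein distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))

# e.g. English 'wolf' vs. an Italic-style form, in rough ASJP-like coding:
print(ldn("wulf", "lupo"))  # -> 0.75
```

Nothing here is hard, which is what makes a published failure to apply the metric by hand so striking.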
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
For "fun", I checked their French wordlist. It's just as bad. "Graisse" is not [grais]; "tu" is not [ti]; "graine" is not [gran]; "cheveux" is not [ʃəve]; "plume" and "pied" do not have the same vowel; they can't decide whether to list verbs as infinitives or stems; "moon" is not [len] or [lɛn];
"montagne" and "rouge" do not end in the same consonant; final [j] is sometimes included and sometimes not...
There do seem to be multitudinous errors in their transcription. According to their rules, [a] should be transcribed as E and [ɑ] as o; their list reads as though they have merged to ä, which then should be transcribed as a. Am I just behind the times?
They have transcribed "tu" and "cheveux" in accordance with the rules. However, although the rules may be "easy to type" (one of their defences of their system) if one understands them, "plume" exhibits interference from IPA - they should of course have entered it as plim, not plym.
There may be a grammatically conditioned sound change going on in French. I checked the infinitives on English Wikipedia, and found that final /ʁ/ was often missing from the sound clips, e.g. in "mourir" and "voir". On the other hand, it was very easy to hear the rhotic in -"tre" and -"dre".
I had evaluated their Thai word list. Thai's five tones are split into two sets of three, and the initial consonants are written differently for the two sets. (There are extra tone marks to cope with the cases where there aren't the alternative consonants.) Their transcriptions reflect the different ways of writing the onset consonants. The one word on the 100-word list that I know to be spelt the etymologically wrong way (ฆ่า 'to kill') is not on the 40-word list. However, for Lao, which has the same system in this respect, they don't distinguish the two sets of tones. The word list composers don't disobey the rules consistently. (Both Thai and Lao have a tone which appears in both sets, the tone for B4 in the Gedney tone box.)
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
I'd still like an answer to the question: why do you think the comparative method is not repeatable?
The obvious answer is the Ehret, Orel and Stolbova two-part refutation of the existence of the comparative method, demonstrated in good faith by application to Afroasiatic.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
If your answer is that, at some points, researchers make their own judgments... yes, they do,
and so do programmers. They do not become more "scientific" because they are in a program.
Repeatability and documentation make things more 'scientific'. Think of the fun and games we can have with the English, Latin and Greek words for 'wolf'. For that matter, think of the fun and games we can have with the words for 'one' - do they go back to PIE? The *oi bit seems to go back to PIE, the termination possibly, but the middle bit is chaotic.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
The assessment of whether two words are cognate is often subjective.
Maybe so. But cognacy is not a fancy way of saying "similarity", so it cannot be replaced by a complicated way of measuring similarity.
People argue about cognacy because words have complicated histories, and ways of getting things wrong are legion. A computer does not rescue you from this.
If you are talking about ASJP, it does not have much reasonable application to mostly well-understood families like Indo-European. Where it may hope to have reasonable application is where there is little more than not very deeply analysed word lists to work on. At this level, cognacy judgements may very well be little more than similarity judgements forced to binary values. One then runs ASJP on Indo-European not to learn about Indo-European, but to learn about the emergent behaviour of ASJP and its uses.
When used for the relatedness of languages, the project's hope is that a comparison of similarity measures will give similar results to a comparison of cognacy judgements, and benefit from consistency. It might even help with semicognacy - the addition or, rarely, subtraction of affixes.
In general, a computer can rescue you from inconsistency.
zompist wrote: ↑Thu Oct 29, 2020 10:42 pm
The decision making should be exposed in the program, not 'hidden' there.
I don't know their program-- if they make use of neural nets like the cool kids do these days, that is precisely hiding the decisionmaking. If it's all C# or something, then stating your rules that way is far more obfuscatory than writing them down in a language other linguists can read.
Also note that a huge amount of their project is the database itself, full of decisions that are not laid out in code. Did a computer tell them to confuse their lax and tense vowels?
Some people see a virtue in consistency. For example, quality requirements imply that it is better to be consistently mediocre than to be usually mediocre but occasionally excellent. I do demur.
However, there is the view that repeatability is desirable in science.
A neural net feels more like substituting the subjective opinion embodied in some algorithm's training history for a person's. Its only advantages are speed of training, speed of application (one hopes) and internal consistency. I had in mind a set of defined processes, which probably requires a statement in English, though it might conceivably be satisfied by clear computer code. Executing a written process consistently is not something people are particularly good at.
ASJPcode makes coarse phonetic distinctions to reduce the extent to which a difference in accent or in transcriber preferences gives misleading assessments of difference. I view that as a quick and dirty fix, but better-motivated solutions don't seem to have worked.
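The coarse folding amounts to a many-to-one mapping from fine transcription to ASJP symbols. A toy sketch, using the conflations complained about above (the few symbol choices shown, e.g. 8 for both dental fricatives, follow my reading of Brown et al. 2008, but this table is illustrative, not the full 41-symbol ASJPcode):

```python
# Illustrative partial folding in the spirit of ASJPcode: several
# finer distinctions collapse onto one coarse symbol each.
FOLD = {
    "i": "i", "ɪ": "i",   # tense/lax high front vowels conflated
    "u": "u", "ʊ": "u",   # tense/lax high back vowels conflated
    "o": "o", "ɔ": "o",   # mid back vowels conflated
    "θ": "8", "ð": "8",   # both dental fricatives -> one symbol
}

def fold(segments):
    """Map a sequence of IPA segments to coarse symbols; pass unknowns through."""
    return "".join(FOLD.get(seg, seg) for seg in segments)

print(fold(["ð", "ɪ", "s"]))  # -> "8is"
```

The mapping throws away exactly the information a dialect-sensitive comparison would want, which is the trade-off being made.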