Zompist Bboard Again

Posted: **Wed May 22, 2024 4:19 am**

bradrn wrote: ↑Tue May 21, 2024 12:39 pm Today the ID came up in the course of a discussion I had with Alexandre François

Examples — it would be nice to have examples for each sound change. This shouldn’t be too hard for any halfway reliable source, although it would make for more work. Given the hyperlinked nature of the new ID, he suggested that example words could e.g. be linked to the corresponding Wiktionary entry when present. (I think it shouldn’t be very hard to retrofit this into the existing data schema).

That sounds like a good idea. Please badger me to do this with all the families I've done so far.

The rest of the suggestions I fully support but they sound more like coding problems which are way beyond me :/

Posted: **Wed May 22, 2024 4:47 am**

Darren wrote: ↑Wed May 22, 2024 4:19 am That sounds like a good idea.

Great, then I’ll implement it when I get time.

Please badger me to do this with all the families I've done so far.

No need: I’m working from the same papers, so I can add the examples myself.

The rest of the suggestions I fully support but they sound more like coding problems which are way beyond me :/

Graphs are really just a data analysis problem. Maps are similar, with the added task of requiring location metadata. Neither is feasible with what we have right now, but once we have search capabilities they should be straightforward.

Improving the representation of suprasegmentals, on the other hand, is a bigger problem. It’s probably something which needs to be fixed within Brassica itself, not just in the ID. I’ve been thinking about it for quite some time, and like I said, I’m unsure how to solve it. (Probably I should make a dedicated discussion thread for the issue at some point.)

Posted: **Tue Aug 27, 2024 8:07 am**

Would be great to see Proto Korean to Modern Korean.

Posted: **Tue Aug 27, 2024 8:13 am**

Neonnaut wrote: ↑Tue Aug 27, 2024 8:07 am Would be great to see Proto Korean to Modern Korean.

If you know anything about it, please feel free to start a thread for Koreanic and start writing up changes!

(The website component of this project is on hold at the moment, pending improvements to Brassica… which I suspect will be particularly important here, given the tonal nature of Middle Korean (IIRC). I’ll resume transferring changes to the website once I release Brassica v1.0.0.)

Posted: **Tue Aug 27, 2024 6:19 pm**

Neonnaut wrote: ↑Tue Aug 27, 2024 8:07 am Would be great to see Proto Korean to Modern Korean.

I know professional Koreanists who feel the same way.

Posted: **Tue Aug 27, 2024 8:33 pm**

fusijui wrote: ↑Tue Aug 27, 2024 6:19 pm
Neonnaut wrote: ↑Tue Aug 27, 2024 8:07 am Would be great to see Proto Korean to Modern Korean.
I know professional Koreanists who feel the same way.

Though, more seriously… if you know any, could they perhaps be persuaded to get involved?

Posted: **Wed Aug 28, 2024 5:27 am**

I have tried pointing some likely pros in this direction, but it's been quite a while -- I should give it another go, as the opportunities present themselves. Thanks for the reminder!

Posted: **Wed Aug 28, 2024 5:29 am**

fusijui wrote: ↑Wed Aug 28, 2024 5:27 am I have tried pointing some likely pros in this direction, but it's been quite a while -- I should give it another go, as the opportunities present themselves. Thanks for the reminder!

That would be great if you could, thanks!

Posted: **Sun Nov 10, 2024 7:20 pm**

After a hiatus, I’m starting to get back into this project. Just now I finished updating the code to work with Brassica 1.0.0. From the user’s side that doesn’t yield any huge changes, but now it should be a bit easier to write changes involving tones and suchlike. Next, when I get the time (and the inclination), I’m going to go over some of the new entries in this subforum and start transferring them over to the website. I also want to experiment with writing some code for analysis, to prove that we can do something with this data more interesting than ‘format the sound changes on a website’.

In other news: in the background, myself and Man in Space have been talking to some linguists who are very enthusiastic about having a sound change database for their own purposes. We haven’t agreed on very much yet, but hopefully we can put together a design which will be useful for everyone involved.

(In the course of that discussion Man in Space linked this article. It’s really great, and I strongly suggest that everyone here read it.)

Posted: **Wed Nov 13, 2024 9:25 pm**

I think it’s time I wrote down the design principles I’ve been implicitly (or explicitly) assuming:

1. No original research. It could easily take a whole PhD to work out a canonical set of sound changes for a language family, and we have neither the time nor the manpower. So we should avoid doing original research by staying as close to the secondary sources as we can. (And if we ever do find ourselves conducting original research, that fact should be made very clear in the published version and in the data.)
2. Reference everything. A corollary of the above: every piece of data should be accompanied by a citation pointing back to where we got that data from. (This is of course standard academic practice. Also very similar to Wikipedia’s approach.)
3. Use a structured format. Currently, this is Brassica syntax for sound changes. This helps to keep everything consistent, and makes it possible to process sound changes on a computer.
4. Avoid editorialising. It’s easy to misunderstand papers, and that makes the data unreliable (see: WALS). The easiest way to avoid this problem is to write down the data as transparently as possible. Thus, if a source writes down a sound change in a certain way, we should strive to write it as similarly as possible, using the same symbols and expressing it the same way. If something can’t be exactly represented in our sound change format, we should get as close as possible, then note down the rest in English text (ideally using direct quotes from the source).
5. Map transcriptions to IPA. A corollary of the above is that inconsistencies in transcription between sources get carried over into our database. Mapping transcriptions to a common standard makes it possible to compare sound changes between sources, which is vital. It also exposes ambiguities where a symbol could refer to several different phones. If a source is unclear, we should note that fact but still make a best guess. (This is editorialising, but it’s fine as long as it’s identified as such.)

And, of course, the overall goal which drives all of these choices: to have a database of sound changes which is reliable and useful for as wide a range of people as possible.

I know that some of these principles have been controversial, especially (1). I’m writing them down here so that we can have a proper conversation about them. Feel free to disagree if you can think of a better approach — or conversely, mention if you have any more suggestions to add to the list.

bradrn wrote: ↑Sun Nov 10, 2024 7:20 pm I also want to experiment with writing some code for analysis, to prove that we can do something with this data more interesting than ‘format the sound changes on a website’.

Meanwhile, I’ve made a start on this. I wrote some code (in the ‘analysis’ branch on GitHub) to extract input→output phoneme pairs from our data so far. The output is easy enough to graph with Gephi:

[Attachment: diachronica-graph.png]

I colourised it using Gephi’s cluster analysis. The results aren’t very good, I think because there’s not much data. (I’ve gotten more interesting results using the data from the old Index Diachronica, but of course the quality of that data isn’t very good.)
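
For anyone who wants to try something similar, here is a minimal sketch of the general idea. It is not the code in the ‘analysis’ branch: the phoneme pairs are made up, and the function name `edge_list_csv` is just an assumption for illustration. It counts input→output pairs and writes them as an edge list with `Source`, `Target` and `Weight` columns, which, as far as I know, Gephi's spreadsheet importer will read as weighted edges.

```python
import csv
import io
from collections import Counter

# Hypothetical (input, output) phoneme pairs, as if already extracted
# from a set of sound changes. Illustrative only, not project data.
pairs = [
    ("p", "f"),
    ("p", "f"),
    ("k", "tʃ"),
    ("s", "h"),
    ("f", "h"),
]


def edge_list_csv(pairs):
    """Return a Source,Target,Weight edge list as CSV text.

    Each distinct (input, output) pair becomes one edge, weighted by
    how many times the pair occurs.
    """
    counts = Counter(pairs)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (src, tgt), weight in sorted(counts.items()):
        writer.writerow([src, tgt, weight])
    return buf.getvalue()


csv_text = edge_list_csv(pairs)
```

Saving `csv_text` to a file with UTF-8 encoding should give something that can be loaded as an edge table. With as little data as in this toy example, any clustering will of course be meaningless.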

Posted: **Thu Nov 14, 2024 6:28 pm**

I’m getting the hang of Brassica (or trying to, at least).

Two questions:

1. Will we be writing these lists inline in a Brassica document (presumably using the ; flag), or will the Brassica file be like a supplement to some sort of comprehensive page?
2. How do 4 and 5 interplay? I think I'm missing something because it’s not clicking. Is 4 kind of talking about describing the conditioning/making reference to features?

Posted: **Thu Nov 14, 2024 8:42 pm**

Man in Space wrote: ↑Thu Nov 14, 2024 6:28 pm 1. Will we be writing these lists inline in a Brassica document (presumably using the ; flag), or will the Brassica file be like a supplement to some sort of comprehensive page?

So far I’ve been using a half-baked custom format (sample) which mixes Brassica sound changes with references and other metadata. I considered using comments as you say, but unfortunately the very first thing the Brassica parser does is to strip them out, so it would require some thinking about how to do that best…

2. How do 4 and 5 interplay? I think I'm missing something because it’s not clicking. Is 4 kind of talking about describing the conditioning/making reference to features?

I’m not quite sure what you’re asking here, sorry. My thought was that, to ensure accuracy, we want to preserve the original data as it was presented (4), but we also want sound changes to be comparable with each other so we need to map those source-specific conventions to something more universal (5). The alternative would be writing everything as IPA directly, which is possible but I think obscures the nature of reconstructed phonemes. (The old ID is a bit inconsistent but it basically does this, IIRC.)

Posted: **Fri Nov 15, 2024 7:18 pm**

Just remembered that fusijui made this pertinent comment a few months ago:

fusijui wrote: ↑Sun Sep 01, 2024 10:49 pm For most if not all of the language families/groupings I personally know much about, what it sounds like you're expecting simply doesn't exist. There is not the documentation of sound changes that's published and also (plausibly) comprehensive, let alone also uncontroversial and widely accepted.

Additionally, those who have access to the kinds of material you want may volunteer to transcribe the data you want into the database structure you want, but that in itself is an ask, even if the actual goodies are even there in the first place.

Meaningful/usable results; freedom from editing + high verifiability; volunteer engagement: pick two at most.

My response to that reads as, essentially, an earlier version of those design principles I set out above. Now that I’ve written them up properly, I’d be very interested to hear if fusijui has any further thoughts.

Posted: **Sat Nov 16, 2024 11:51 am**

Re "5. Map transcriptions to IPA.", I presume original notation would be presented in parallel to IPA?

I have one question, though. Say there's e → i / _Cj:

1. Does the language have /ɛ ɔ/ in the first place, or just /e o/?
2. What do we do if phonetic nature of the phonemes is not or hardly discussed?
3. Would sound change sections have some metadata, e.g. how many and what phonemes the language has at the start and end of its historical / reconstructed development?

These are considerations if we want to be able to study the typology of sound change on the basis of the new ID.

Posted: **Sat Nov 16, 2024 1:18 pm**

Based on the examples (there are more, but this is the one I could find quickly), the answer for 3. is yes, a lot of it. Presumably any ambiguities in the source would be discussed in the notes of the mapping to IPA.

Edit: I would like to help with the production of the index, but I don't currently have enough time to spare to do everything I want to. If I come across a good one I'd be glad to do it (if I remember), and I'd be happy to review someone else's transcription, or if someone has a paper they don't want to transcribe for whatever reason, I'd gladly do it, but I'm not going to spend my time hunting for papers to transcribe.

Posted: **Sat Nov 16, 2024 6:36 pm**

Zju wrote: ↑Sat Nov 16, 2024 11:51 am Re "5. Map transcriptions to IPA.", I presume original notation would be presented in parallel to IPA?

Indeed, this is the whole point. Lērisama linked an early draft, but you can find outputs from the actual project here: https://bradrn.com/index-diachronica/

I have one question, though. Say there's e → i / _Cj:

1. Does the language have /ɛ ɔ/ in the first place, or just /e o/?
[…]
3. Would sound change sections have some metadata, e.g. how many and what phonemes the language has at the start and end of its historical / reconstructed development?

The database (at least as it is currently designed) does indeed list phoneme inventories, when they are provided by a source. So this information should be easy enough to see.

(This was also a key point of the article I mentioned above — there is sampling bias in the sound changes reported by the literature. Providing phoneme inventories should help this somewhat, but we should continue thinking about ways to ameliorate the problem.)

2. What do we do if phonetic nature of the phonemes is not or hardly discussed?

I’ve been sort of handling this case in two ways:

If the transcription seems likely to be IPA (or consistent with IPA), then I’ve just been using it as-is with no special note.
If the transcription seems non-IPA, I’ve had to guess what the IPA could be. For this case I added a field to the metadata to mark when the IPA is a guess vs being explicitly specified in the source.
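
To make the second case concrete, here is a rough sketch of what such a record could look like. This is not the project's actual schema: the class `Symbol`, its field names and the example symbols are all assumptions for illustration. The point is only that the source's own notation and the IPA value sit side by side, with a flag saying whether the IPA value is a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Symbol:
    """One transcription symbol (hypothetical schema, not the real one)."""

    original: str            # the symbol exactly as the source writes it
    ipa: str                 # the IPA value we map it to
    ipa_is_guess: bool       # True when the source does not specify the IPA
    note: Optional[str] = None


# Made-up examples of the two cases described above.
symbols = [
    Symbol("e", "e", ipa_is_guess=False),
    Symbol("č", "tʃ", ipa_is_guess=True,
           note="source does not describe the phonetics"),
]


def needs_review(symbols):
    """Return the original symbols whose IPA value was guessed."""
    return [s.original for s in symbols if s.ipa_is_guess]


flagged = needs_review(symbols)
```

Keeping the flag as data rather than as free text means guessed values can be listed or filtered out mechanically later on.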

These are considerations if we want to be able to study the typology of sound change on the basis of the new ID.

Yes indeed. (For more on this I will refer you to that linked article, which is really very good.)

Posted: **Sat Nov 16, 2024 7:58 pm**

Zju pretty much asked the question I was getting at in my earlier post in a better manner than I did.

Posted: **Sun Nov 17, 2024 3:59 am**

bradrn wrote: ↑Sat Nov 16, 2024 6:36 pm Indeed, this is the whole point. Lērisama linked an early draft, but you can find outputs from the actual project here: https://bradrn.com/index-diachronica/

I somehow missed that. Good to know

I’ve been sort of handling this case in two ways:

If the transcription seems likely to be IPA (or consistent with IPA), then I’ve just been using it as-is with no special note.

If the transcription seems non-IPA, I’ve had to guess what the IPA could be. For this case I added a field to the metadata to mark when the IPA is a guess vs being explicitly specified in the source.

This seems unsatisfactory, but is probably the best that can be done. Maybe guesses (not just for IPA values, but sound changes the source notates ambiguously etc.) could be held to a higher standard of review?

Posted: **Sun Nov 17, 2024 5:05 am**

Lērisama wrote: ↑Sun Nov 17, 2024 3:59 am
I’ve been sort of handling this case in two ways:

If the transcription seems likely to be IPA (or consistent with IPA), then I’ve just been using it as-is with no special note.

If the transcription seems non-IPA, I’ve had to guess what the IPA could be. For this case I added a field to the metadata to mark when the IPA is a guess vs being explicitly specified in the source.

This seems unsatisfactory, but is probably the best that can be done. Maybe guesses (not just for IPA values, but sound changes the source notates ambiguously etc.) could be held to a higher standard of review?

Honestly, the whole review system needs further working-out. This could certainly be a part of it.

Posted: **Fri Nov 29, 2024 9:06 am**

bradrn wrote: ↑Sun Nov 10, 2024 7:20 pm In other news: in the background, myself and Man in Space have been talking to some linguists who are very enthusiastic about having a sound change database for their own purposes. We haven’t agreed on very much yet, but hopefully we can put together a design which will be useful for everyone involved.

This has progressed! Today we held an initial online meeting to discuss the design of a future sound change database (whether a direct successor of the ID, or of some other design). On our side, myself and Man in Space were there; the other attendees were Alex François (who I mentioned earlier), as well as Charles Zhang, a student whose research involves sound change simulation.

The meeting was long and productive, and covered a lot of ground. But to me the central question which came out of the meeting was one of empiricism: to what extent should the database be grounded in solidly attested linguistic data, as opposed to speculation?

Our current approach is determinedly anti-empirical: aside from a small amount of directly attested linguistic history (from Romance and suchlike), all our data comes from reconstructions and the comparative method. Or, to put it another way, it’s all just the personal opinions of the linguist(s) who happened to write the articles we use. We’ve contemplated working on our own reconstructions, but similarly that would just be our own opinion. We can work to make our database an accurate reflection of the literature, but by its nature, it would still be just a pile of speculations.

By contrast, Alex suggested a fundamentally different approach, based on his EvoSem database (which is genuinely really useful, go check it out!). In brief, the idea would be to take synchronic cognate sets, and compile sets of phoneme correspondences from those, completely ignoring how some linguist or another may have reconstructed the original situation. Of course, this would lose a large amount of information about the precise nature of diachronic sound change. In exchange, we get far more raw data, including from language families where reconstructions are poor or absent. And of course, that data would be far more reliable and empirically grounded.
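
As a very rough illustration of what compiling correspondences could mean in practice, here is a toy sketch. It says nothing about how EvoSem actually works. It assumes the cognates have already been aligned segment by segment (which is the hard part), and the languages "A" and "B", the word forms and the function name `correspondences` are all invented.

```python
from collections import Counter

# Hypothetical, pre-aligned cognate sets: each word is a list of segments,
# with "-" marking a gap. Languages and forms are invented for illustration.
cognate_sets = [
    {"A": ["p", "a", "t"], "B": ["f", "a", "t"]},
    {"A": ["p", "i", "k"], "B": ["f", "i", "-"]},
    {"A": ["t", "a", "p"], "B": ["t", "a", "f"]},
]


def correspondences(cognate_sets, lang_a, lang_b):
    """Count segment correspondences between two languages.

    Walks each aligned cognate pair position by position and tallies
    how often each (segment in lang_a, segment in lang_b) pair occurs.
    No proto-form or direction of change is assumed.
    """
    counts = Counter()
    for cset in cognate_sets:
        for seg_a, seg_b in zip(cset[lang_a], cset[lang_b]):
            counts[(seg_a, seg_b)] += 1
    return counts


counts = correspondences(cognate_sets, "A", "B")
```

Here `counts[("p", "f")]` comes out as 3: the data records that A's p lines up with B's f, without claiming which one is original.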

The question is then: which of these approaches would be more useful for actual linguists? Perhaps a pile of speculative reconstructions would be intrinsically unreliable, and thus not as helpful as we’re assuming it would be. Conversely, maybe a purely empirical approach would lose too much vital information. Of course, the possibilities are not either/or. We could combine both datasets in one database, for instance. Or there’s Charles’s suggestion that sound changes could be coded with their degree of confidence (from ‘directly attested’ to ‘completely speculative’). And, of course, there could be other possibilities entirely which we haven’t thought of.
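
To sketch how the confidence idea might look in data: only the two endpoints, ‘directly attested’ and ‘completely speculative’, come from the discussion. The intermediate level, the names `Confidence` and `at_least`, and the example changes are assumptions made up for illustration.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Ordered confidence levels for a sound change (hypothetical scale)."""

    COMPLETELY_SPECULATIVE = 0
    RECONSTRUCTED = 1        # hypothetical intermediate level
    DIRECTLY_ATTESTED = 2


# Made-up sound changes, each tagged with a confidence level.
changes = [
    ("p → f", Confidence.DIRECTLY_ATTESTED),
    ("k → tʃ", Confidence.RECONSTRUCTED),
    ("s → h", Confidence.COMPLETELY_SPECULATIVE),
]


def at_least(changes, minimum):
    """Return the changes whose confidence is at or above `minimum`."""
    return [rule for rule, conf in changes if conf >= minimum]


solid = at_least(changes, Confidence.RECONSTRUCTED)
```

An ordered scale like this would let someone who wants only the empirically grounded subset filter for it, while the speculative material stays available to everyone else.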

So — what do you all think about this question? Is there anything that we’ve missed here?

The Index Diachronica

Re: The Index Diachronica
