I think it’s time I wrote down the design principles I’ve been implicitly (or explicitly) assuming:
- No original research. It could easily take a whole PhD to work out a canonical set of sound changes for a language family, and we have neither the time nor the manpower for that. So we should avoid doing original research by staying as close to the secondary sources as we can. (And if we ever do find ourselves conducting original research, that fact should be made very clear in the published version and in the data.)
- Reference everything. A corollary of the above: every piece of data should be accompanied by a citation pointing back to where we got that data from. (This is of course standard academic practice, and also very similar to Wikipedia’s approach.)
- Use a structured format. Currently, this is Brassica syntax for sound changes. This helps to keep everything consistent, and makes it possible to process sound changes on a computer.
- Avoid editorialising. It’s easy to misunderstand papers, and that makes the data unreliable (see: WALS). The easiest way to avoid this problem is to write down the data as transparently as possible. Thus, if a source writes down a sound change in a certain way, we should strive to write it as similarly as possible, using the same symbols and expressing it the same way. If something can’t be exactly represented in our sound change format, we should get as close as possible, then note down the rest in English text (ideally using direct quotes from the source).
- Map transcriptions to IPA. A corollary of the above is that inconsistencies in transcription between sources get carried over into our database. Mapping transcriptions to a common standard makes it possible to compare sound changes between sources, which is vital. It also exposes ambiguities where a symbol could refer to several different phones. If a source is unclear, we should note that fact but still make a best guess. (This is editorialising, but it’s fine as long as it’s identified as such.)
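To make the last two principles a bit more concrete, here's a minimal Python sketch of how a symbol could be stored exactly as the source prints it, alongside its IPA mapping. Everything in it is hypothetical: the `IpaMapping` class, the `map_symbol` helper and the example symbol table are just illustrations, not our actual data format or any real source's conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IpaMapping:
    """One source symbol and the IPA phones it might stand for (hypothetical format)."""
    source_symbol: str            # the symbol exactly as printed in the source
    candidates: tuple[str, ...]   # every IPA phone the symbol could plausibly denote
    best_guess: str               # our editorial choice, always recorded as a guess
    note: str = ""                # free-text explanation, ideally quoting the source

    @property
    def ambiguous(self) -> bool:
        # More than one candidate means the source was unclear.
        return len(self.candidates) > 1


# Illustrative table for one imaginary source; the entries are assumptions.
SOURCE_TABLE = {
    "y": IpaMapping("y", ("j",), "j"),
    "c": IpaMapping(
        "c",
        ("t͡s", "t͡ʃ", "c"),
        "t͡s",
        note="Source does not define the symbol; best guess only.",
    ),
}


def map_symbol(symbol: str, table: dict[str, IpaMapping]) -> IpaMapping:
    """Look up a source symbol, failing loudly if no mapping was ever recorded."""
    try:
        return table[symbol]
    except KeyError:
        raise KeyError(f"no IPA mapping recorded for {symbol!r}") from None


if __name__ == "__main__":
    for sym in ("y", "c"):
        m = map_symbol(sym, SOURCE_TABLE)
        flag = " (ambiguous, best guess)" if m.ambiguous else ""
        print(f"{m.source_symbol} -> {m.best_guess}{flag}")
```

The point of keeping `source_symbol`, `candidates` and `best_guess` as separate fields is that the original transcription is never overwritten, and any editorialising stays visible in the data.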
And, of course, the overall goal which drives all of these choices: to have a database of sound changes which is (a) reliable and (b) useful for as wide a range of people as possible.
I know that some of these principles have been controversial, especially the first (no original research). I’m writing them down here so that we can have a proper conversation about them. Feel free to disagree if you can think of a better approach, or conversely, to suggest additions to the list.
bradrn wrote: ↑Sun Nov 10, 2024 7:20 pm
I also want to experiment with writing some code for analysis, to prove that we can do something with this data more interesting than ‘format the sound changes on a website’.
Meanwhile, I’ve made a start on this. I wrote some code (in the ‘analysis’ branch on GitHub) to extract input→output phoneme pairs from our data so far. The output is easy enough to graph with Gephi:
[Attachment: diachronica-graph.png]
I colourised it using Gephi’s cluster analysis. The results aren’t very good, I think because there’s not much data. (I’ve gotten more interesting results using the data from the old Index Diachronica, but of course the quality of that data isn’t very good.)
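For anyone who wants to try something similar without reading the branch, here's a rough Python sketch of the general idea. It is not the code from the ‘analysis’ branch: the `count_pairs` and `write_edge_csv` helpers and the sample pairs are made up for illustration, and it assumes the input→output pairs have already been extracted. It counts how often each pair occurs and writes an edge table with `Source`, `Target` and `Weight` columns, which Gephi's spreadsheet importer can read.

```python
import csv
import io
from collections import Counter


def count_pairs(pairs):
    """Count how many times each (input, output) phoneme pair occurs."""
    return Counter(pairs)


def write_edge_csv(edge_weights, out):
    """Write a weighted edge table to a text stream, one row per distinct pair."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    # Sort so the output is stable between runs.
    for (source, target), weight in sorted(edge_weights.items()):
        writer.writerow([source, target, weight])


# Invented sample data, NOT taken from the real database.
sample_pairs = [("p", "f"), ("p", "f"), ("k", "x"), ("s", "h")]

buffer = io.StringIO()
write_edge_csv(count_pairs(sample_pairs), buffer)
edge_csv = buffer.getvalue()

if __name__ == "__main__":
    print(edge_csv)
```

To use it for real, swap the `io.StringIO` buffer for a file opened with `newline=""` and `encoding="utf-8"`, so that IPA characters survive the round trip into Gephi.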