Man in Space wrote: ↑Tue Jul 18, 2023 3:30 pm
As linguists (I say this more in the academic sense, for instance I have a degree in the field), we have no standing to oppose data scraping for AI. Most linguistic data is scraped from somewhere or other—corpus linguistics is the obvious offender here but even acquiring languages by immersion is arguably scraping the data from the conversations going on around you in real space[....] Data scraping is kind of what we do.
I've heard this argument a lot: AI techbros can do what they want because hey, artists just reprocess other art. But it's a flawed argument.
One, you simply cannot get from "humans learn languages" or "artists study other artists" to "techbros can make their billions by putting artists out of work." The pro-AI side always wants to move the argument from the real-world enshittification that they're doing to the wonders or dangers of AI itself. What AI can or can't do is not that important; it's what the techbros intend to do with it.
Two, it's both reductive and disingenuous to reduce everything to "data scraping." I'll grant you that human cognition is something like an LLM, but not that it is an LLM. LLMs are pretty amazing, but it's been amply demonstrated that they are not AIs, and pretending that they are is not helping anyone but said techbros. Even AI research is weakened by acting as if the problem of AI is already solved. It's nice, from a research point of view, when a new milestone is reached, but actual AI people shouldn't swallow the marketing hype.
Three, there are notions like copyright and fair use and public domain and plagiarism and misrepresentation for a reason. Erasing distinctions by calling everything "data scraping" is like saying that gifts, sales, and robbery at gunpoint are all perfectly valid forms of "property transfer."
Finally, linguists (and others) should take the opportunity to make sure that their use of corpora is in fact ethical. Many speakers of minority languages believe quite strongly that researchers do not have the right to freely use their language, to say nothing of their artistic productions like songs and myths. Already some websites are declaring that their content is not authorized for use in LLMs. Is it linguists' right to use that data for their own purposes?
I'd also add that LLMs have the potential to harm corpus linguistics, precisely because uncounted numbers of unscrupulous people are going to flood the Web with bogus LLM-generated text. LLMs have an uncanny ability to produce plausible text; they do not generate correct text. You cannot trust LLM-generated text in the details, and that goes for trying to deduce linguistic facts from it.