the delicious irony: creators of industrial language models are now worried about no longer being able to use the web as their "commons" (i.e. other people's labor that they appropriate and commercialize) because their own outputs are "polluting" it (via

the year is 2025. openai lobbies congress to force websites that publish user-generated content to guarantee content is free of synthetic data (and mark it as such w/metadata in the html). google lobbies for compulsory no-cost licensing of all content published to the web, unless the site owner follows [proprietary standard that costs millions to implement]. facebook pays below poverty wages to thousands of contractors in locked, device-free rooms to type sentences, any sentences as LM fodder


"Here's how to tag your website as containing machine-generated text."


Meanwhile, every niche community of experts having real conversations on forums moved to Discord, and the forums were shut down.

Forums of voluntary experts are so massively underrated. They are massive wells of knowledge conveyed in the highly trustworthy form of human conversation. Yah, you can fake identities and such, but compared to other mediums, fake conversations and beliefs stick out like a sore thumb.

@aparrish Let’s make all web content automatically AGPL and see Google squirm as they realize that then they also have to release everything into the commons. If my content is part of their program (that’s machine learning), then their program must be free.

I bet that there’s GPL content via that route in their programs.

I want a commons that takes back what’s taken out.

@ArneBab @aparrish It seems the current legal consensus is that hiding the source data in a machine learning network erases authorship?

@clacke @aparrish in the EU not really — there’s an allowance for research in universities, the rest is between gray and copyright infringement. But almost impossible to prove.

@clacke @aparrish @ArneBab yeah this, the ML model publishers argue that training a model doesn't require permission from the authors of the original content. See GitHub Copilot.

@ArneBab @clacke @aparrish iirc yeah, it hasn't actually been tested in court

@aparrish So basically they admit to being parasitic on a data-host they are consequently destroying...


So the future of the web is a permanent white noise?

@aparrish They are kind of getting ahead of themselves here. Even GPT-3 (which cost millions to train) sucks. It's hard to imagine it polluting the web.

