the delicious irony: creators of industrial language models are now worried about no longer being able to use the web as their "commons" (i.e. other people's labor that they appropriate and commercialize) because their own outputs are "polluting" it (via https://mailchi.mp/jack-clark/import-ai-266-deepmind-looks-at-toxic-language-models-how-translation-systems-can-pollute-the-internet-why-ai-can-make-local-councils-better)
the year is 2025. openai lobbies congress to force websites that publish user-generated content to guarantee that content is free of synthetic data (and to mark it as such w/metadata in the html). google lobbies for compulsory no-cost licensing of all content published to the web, unless the site owner follows [proprietary standard that costs millions to implement]. facebook pays below-poverty wages to thousands of contractors in locked, device-free rooms to type sentences, any sentences, as LM fodder
@aparrish i am thinking also about the feedback loop from the penalty on "complex text". Penalizing complexity is now so embedded in editors and SEO tooling everywhere (since it supposedly makes for less "engagement") that it's driving a collective downhill spiral of ever more simplified expression
Meanwhile, every niche community of experts having real conversations on forums moved to Discord, and the forums were shut down.
Forums of volunteer experts are massively underrated: they are deep wells of knowledge, conveyed in the highly trustworthy form of human conversation. Yeah, you can fake identities and such, but compared to other mediums, fake conversation and fake beliefs stick out like a sore thumb.
@aparrish Let’s make all web content automatically AGPL and see Google squirm as they realize that then they also have to release everything into the commons. If my content is part of their program (that’s machine learning), then their program must be free.
I bet that there’s GPL content via that route in their programs.
I want commons that takes back what’s taken out.
@aparrish So basically they admit to being parasitic on a data host that they are consequently destroying...
@aparrish this has already happened for some languages
i think there's more automatically translated Belarusian on the web than Belarusian written by actual people
because Google ranks unique content higher, some SEO people publish automatic translations as unique (albeit useless) content. created by robots for robots, these texts make it harder for people to find anything useful
i feel a kind of Schadenfreude since English speakers will finally encounter what we have been encountering for years
@aparrish They are kind of getting ahead of themselves here. Even GPT-3 (which cost millions to train) sucks. It's hard to imagine it polluting the web.