Could Reddit's data be "poisoned" to prevent its use in training AI?

nodsocket@lemmy.world · edit-2 8 months ago


Lvxferre@mander.xyz · 8 months ago

I don’t think that you can prevent Reddit data from being used for AI training, but you could reduce its value. Based on that, I’d probably:

1. Generate low-quality text that machines would have a hard time sorting out.
2. Replace your current Reddit content with said gibberish.

I’m saying this based on the following:

1. I don’t think that Reddit has any sort of complex content versioning system; at most, I think that it keeps your deleted posts/comments.
2. Odds are that the data is filtered before being used for “training”, and both user karma + content score play a role in that. As such, it would be pointless to add nonsense content that humans will downvote.
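
To make the second point concrete, here is a rough sketch of what such a pre-training filter could look like. Everything in it is an assumption on my part: the `Comment` fields, the `keep_for_training` helper and the thresholds are made up for illustration, and I have no idea what filter (if any) is actually used.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    body: str
    score: int          # net upvotes on the comment (hypothetical field)
    author_karma: int   # total karma of the author (hypothetical field)


def keep_for_training(comment, min_score=2, min_karma=100):
    """Return True if the comment passes this made-up quality filter."""
    return comment.score >= min_score and comment.author_karma >= min_karma


comments = [
    Comment("Helpful answer about routers.", score=57, author_karma=12000),
    # Obvious gibberish gets downvoted, so the score check drops it.
    Comment("asdf qwerty zxcv", score=-8, author_karma=12000),
    # Fluent nonsense that nobody downvoted slips through.
    Comment("The purple teapot sang opera.", score=3, author_karma=450),
]

kept = [c.body for c in comments if keep_for_training(c)]
print(kept)
```

The downvoted gibberish is filtered out while the fluent nonsense is kept, which is why the replacement text would need to look passable to human voters.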

Funnily enough, AI might be a good way to generate this poisoning data against AI. For example, I asked Gemini “Generate three paragraphs of nonsense text, containing three sentences each.” and here’s the output:

Gemini output

The purple teapot sang opera while the moonbeams juggled kumquats. A nearby rhinoceros tap-danced with a flock of butterflies, their laughter echoing through the forest of lollipops. Suddenly, a gust of wind carrying the scent of chocolate chip cookies swept through, carrying the teapot and rhinoceros off on a whimsical adventure.

The clock chimed a symphony of kazoo sounds as the grumpy gnome polished his collection of spoons. Outside, a flock of polka-dotted penguins argued passionately about the best flavor of bubblegum. Meanwhile, a mischievous squirrel knitted a tiny sweater for a bumblebee, humming happily as it buzzed around the garden.

The gigantic rubber ducky sailed across the sea of marshmallow fluff, its captain a pirate with a monocle and a penchant for pickles. In the distance, a mermaid with hair made of spaghetti twirled underwater, chasing after a school of goldfish wearing tiny tutus. On an island of cheese, a group of singing cacti serenaded the sun with their off-key melodies.

You could tweak the prompt to get something even more nonsensical or even more passable, but you get the idea.

FaceDeer@kbin.social · 8 months ago

Reddit’s surely got a copy of the PushShift archives; it’ll have all the pre-sabotage versions of those comments.

Lvxferre@mander.xyz · 8 months ago

The PS archives are publicly available. If either OpenAI or Google were to use them, they wouldn’t pay Reddit Inc. a single penny; and yet Google is paying it 60 million dollars to do so. This means that there’s content that they cannot retrieve through the PS archives that would still be valuable as LLM data.

FaceDeer@kbin.social · 8 months ago

They’re paying Reddit to not sue them.

Regardless, the content that’s available through PS is the content that people are talking about overwriting or deleting. They can’t edit or delete stuff that PushShift couldn’t see in the first place.

Lvxferre@mander.xyz · 8 months ago

> They’re paying Reddit to not sue them.

Given how many defences Google would have against that ant called Reddit suing it, ranging from actual fair points to “ackshyually”, I find it unlikely.

> Regardless, the content that’s available through PS is the content that people are talking about overwriting or deleting. They can’t edit or delete stuff that PushShift couldn’t see in the first place.

Emphasis mine. Can you back up this claim?

I’m asking this because the content from PS only goes up to March 2023; it’s literally a year old. There was a lot of activity on Reddit in the meantime, and my impression is that the people talking about this are the ones who already erased their content in the APIcalypse, but kept using Reddit because there’s some subject “stuck” there that they’d like to use.

FaceDeer@kbin.social · 8 months ago

Academic Torrents has Reddit data up to December 2023. This data isn’t live-updated; my understanding is that it’s scraped when it’s first posted. That’s how services like removeddit worked: they would show the “original” version of a post or comment from when it was scraped rather than the edited or deleted version that Reddit shows now.
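
A toy sketch of why later edits don’t help against that kind of archive. It assumes the archive stores each comment once, at first sight, which is my understanding rather than a documented guarantee; `SnapshotArchive` and the comment ID are invented for the example and aren’t any real PushShift interface.

```python
class SnapshotArchive:
    """Toy archive that stores each comment once, the first time it is seen."""

    def __init__(self):
        self._store = {}

    def ingest(self, comment_id, body):
        # setdefault keeps an existing entry, so later edits are ignored.
        self._store.setdefault(comment_id, body)

    def get(self, comment_id):
        return self._store.get(comment_id)


live_site = {"c1": "Here is how I fixed my router."}
archive = SnapshotArchive()
archive.ingest("c1", live_site["c1"])   # scraped when first posted

# The user later overwrites the comment with nonsense...
live_site["c1"] = "The purple teapot sang opera."
archive.ingest("c1", live_site["c1"])   # a re-scrape does not overwrite

# ...and then deletes it from the live site.
del live_site["c1"]

original = archive.get("c1")
print(original)
```

The live site no longer has the comment, but the archive still returns the pre-edit text.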

The age isn’t really the most important thing when it comes to training a base AI model. If you want to teach it about current events there are better ways to do that than social media scrapes. Stuff like Reddit is good for teaching an AI about how people talk to each other.