I don't see how you can stop LLMs from ingesting poison either, because they're filling the internet with low-value crap as fast as they possibly can. All that junk is poisonous to training new models. The wellspring of value once provided by sites like Stack Overflow is now all but dried up. AI culture is devaluing at an incredible rate as it churns out copies of copies of copies of the same worthless junk.




The big labs spend a ton of effort on dataset curation, precisely to keep their models from ingesting poison, as you put it.

It goes further than that: they do lots of testing on the dataset to find the incremental data that produces the best improvements in model performance, and they even train proxy models that predict whether a given piece of data will improve performance or not (a toy version of that idea is sketched below).

“Data Quality” is usually a huge division with a big budget.
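To make the proxy-model idea concrete, here is a minimal sketch, assuming you have a small hand-labeled seed set of "worth training on" vs. "filter out" documents. Everything here (the `docs`, `labels`, and `score_candidates` names, the TF-IDF plus logistic regression choice, the 0.5 threshold) is hypothetical and illustrative; real lab pipelines are far larger and not public.

  # Hypothetical proxy "data quality" scorer. All names and the model
  # choice are illustrative; real pipelines are much more elaborate.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Tiny labeled seed set: 1 = kept by past curation, 0 = filtered out.
  docs = [
      "Detailed walkthrough of fixing a race condition in asyncio...",
      "Top 10 AMAZING facts about Python you won't BELIEVE...",
  ]
  labels = [1, 0]

  # Cheap proxy: predicting a document's value is far cheaper than
  # retraining the LLM to measure its marginal contribution directly.
  proxy = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
  proxy.fit(docs, labels)

  def score_candidates(candidates, threshold=0.5):
      """Keep only documents the proxy predicts are worth training on."""
      probs = proxy.predict_proba(candidates)[:, 1]
      return [doc for doc, p in zip(candidates, probs) if p >= threshold]

The design point is the asymmetry: the proxy is orders of magnitude cheaper to run than the model it is curating data for, so it can be applied to every candidate document.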


Jeez, why can't I have a data quality team filtering out AI slop!

You can... you just need to make about $100,000,000,000 in profits each year, that's all.

Just look at the domains. Obviously this gets harder to do with social media, but maybe that's okay. I think a simple criterion can be used: could the pre-trained LLM have come up with this text itself? If so, it probably has no training value (a rough sketch of that test follows below).
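One way to operationalize "could the model have come up with this itself" is perplexity filtering: score each document by how predictable it is under an existing pre-trained model, and drop documents the model already finds easy. A minimal sketch, assuming Hugging Face `transformers` with `gpt2` as a stand-in reference model; the `worth_training_on` helper and the `min_ppl` threshold of 15.0 are hypothetical, not tuned values.

  # Hypothetical perplexity-based novelty filter. The threshold and the
  # choice of gpt2 as the reference model are illustrative assumptions.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  model.eval()

  def perplexity(text: str) -> float:
      """Average per-token perplexity of `text` under the reference model."""
      ids = tokenizer(text, return_tensors="pt").input_ids
      with torch.no_grad():
          # Passing labels=input_ids makes the model return the mean
          # cross-entropy loss over the sequence.
          loss = model(ids, labels=ids).loss
      return torch.exp(loss).item()

  def worth_training_on(text: str, min_ppl: float = 15.0) -> bool:
      # Below the threshold, the model already predicts the text easily,
      # so by the criterion above it carries little new training signal.
      return perplexity(text) >= min_ppl

A caveat worth noting: very high perplexity can also mean garbage rather than novelty, so in practice this would likely be one signal combined with others (domain lists, quality classifiers) rather than a standalone filter.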


