yurlungur on Feb 7, 2025, on: Meta torrented & seeded 81.7 TB dataset containing...

I think the difference may be that LLMs won't be laundered clean of copyrighted data anytime soon. Even if ChatGPT got big and profitable, it's not clear that it won't contain copyrighted data, as that may simply be necessary to train the best models.

Most of the web is copyrighted.