Model collapse

Model collapse, also known as AI collapse or Habsburg AI, refers to a phenomenon in which machine learning models gradually degrade due to errors arising from uncurated training on synthetic data, that is, on the outputs of another model, including prior versions of the model itself.[1][2][3]

Shumailov et al.[1] coined the term and described two stages of the degradation: early model collapse and late model collapse. In early model collapse, the model begins losing information about the tails of the distribution, which mostly affects minority data. Later work highlighted that early model collapse is hard to notice, since overall performance may appear to improve while the model loses performance on minority data.[4] In late model collapse, the model loses a significant proportion of its performance, confusing concepts and losing most of its variance.
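The two stages can be illustrated with a toy simulation. The sketch below is not the experimental setup of Shumailov et al.;[1] the distribution, sample size, and library choice are assumptions made only for illustration. Each generation fits a two-component Gaussian mixture to data sampled from the previous generation's fit; the small "minority" mode tends to vanish first, and over many further generations the spread of the generated data drifts downward as well.

    # Illustrative toy simulation of recursive training on synthetic data.
    # Not the setup of Shumailov et al. [1]; distribution and sample size are
    # chosen only to make the effect visible in a short run.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    n = 200  # samples per generation; small samples make sampling error large

    # Generation 0: "real" data, 95% majority mode at 0, 5% minority mode at 4.
    is_minority = rng.random(n) < 0.05
    data = np.where(is_minority, rng.normal(4.0, 0.5, n), rng.normal(0.0, 1.0, n))

    for generation in range(101):
        minority_frac = float(np.mean(data > 2.5))   # mass near the minority mode
        if generation % 20 == 0:
            print(f"gen {generation:3d}: minority fraction {minority_frac:.3f}, std {data.std():.3f}")
        model = GaussianMixture(n_components=2, random_state=generation)
        model.fit(data.reshape(-1, 1))
        data = model.sample(n)[0].ravel()            # next generation sees only synthetic data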

Why does Model Collapse happen?

Synthetic data, although theoretically indistinguishable from real data, is almost always biased, inaccurate, poorly representative of the real data, harmful, or presented out of context.[5][6] Using such outputs as training data leads to issues with the quality and reliability of the trained model.[7][8]

Model Collapse occurs for three main reasons: functional approximation errors, sampling errors, and learning errors.[1] Importantly, it happens even in the simplest of models, where not all of these error sources are present. In more complex models the errors often compound, leading to faster collapse.

Figure: Model collapse in generative models can be avoided by accumulating data.
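In the simplest setting mentioned above, a single Gaussian repeatedly re-fitted to its own samples, sampling error alone is enough to drive collapse. The following minimal sketch (sample size and generation count are arbitrary choices, not taken from [1]) tracks the fitted standard deviation as each generation trains only on data generated by the previous one.

    # Minimal sketch: recursive maximum-likelihood fitting of a single Gaussian,
    # so that only sampling error is present.  The fitted standard deviation
    # performs a downward-drifting random walk and collapses toward zero.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50                               # small sample per generation
    mu_hat, sigma_hat = 0.0, 1.0         # generation 0 is the true distribution

    for generation in range(201):
        data = rng.normal(mu_hat, sigma_hat, n)       # synthetic data from the current fit
        mu_hat, sigma_hat = data.mean(), data.std()   # refit on synthetic data only
        if generation % 40 == 0:
            print(f"gen {generation:3d}: sigma_hat = {sigma_hat:.3f}")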

Is Model Collapse inevitable?

Even from the simplest models it becomes clear that model collapse is not inevitable. For example, in the single-Gaussian model analysed by Shumailov et al.,[1] a superlinearly increasing amount of data over generations is needed to bound the error. Later work[9] highlighted that collapse can also be bounded in other settings, though at a significant training cost: it requires accumulating and keeping track of data over time.
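The difference between replacing and accumulating data can be seen by extending the sketch above. This is again a toy illustration with arbitrary sample sizes rather than the experiments of [9]: when every generation's synthetic data is added to a growing pool that still contains the original real data, the fitted standard deviation stays close to its starting value instead of collapsing.

    # Toy comparison of two data regimes in the single-Gaussian setting:
    # "replace"    - each generation trains only on fresh synthetic data,
    # "accumulate" - each generation trains on all data seen so far,
    #                real and synthetic, as studied in [9].
    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    real = rng.normal(0.0, 1.0, n)       # the only real data ever collected

    def final_sigma(accumulate, generations=200):
        pool = real.copy()
        mu_hat, sigma_hat = pool.mean(), pool.std()
        for _ in range(generations):
            synthetic = rng.normal(mu_hat, sigma_hat, n)
            pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
            mu_hat, sigma_hat = pool.mean(), pool.std()
        return sigma_hat

    print(f"replace:    final sigma_hat = {final_sigma(accumulate=False):.3f}")
    print(f"accumulate: final sigma_hat = {final_sigma(accumulate=True):.3f}")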

An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model-generated data and filter it out.[10]
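Schematically, such a pipeline scores each candidate training document with a detector and keeps only documents scored as likely human-written. The function name, signature, and threshold below are hypothetical placeholders, not the API of any real detector or of the watermarking scheme in [10]; real detectors are imperfect, so filtering is only partial.

    # Schematic data-curation filter built around a hypothetical detector.
    # `synthetic_score` is a placeholder for a watermark detector or a trained
    # classifier returning a score in [0, 1], higher meaning "more likely
    # machine-generated"; the threshold is arbitrary.
    from typing import Callable, Iterable, List

    def filter_corpus(documents: Iterable[str],
                      synthetic_score: Callable[[str], float],
                      threshold: float = 0.5) -> List[str]:
        """Keep only documents the detector considers likely human-written."""
        return [doc for doc in documents if synthetic_score(doc) < threshold]

    # Example with a dummy scorer that flags documents containing a marker token.
    corpus = ["hand-written field notes", "output <generated> by a model"]
    kept = filter_corpus(corpus, lambda doc: 1.0 if "<generated>" in doc else 0.0)
    print(kept)   # ['hand-written field notes']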

Impact on large language models

In the context of large language models, research has found that training LLMs on predecessor-generated text (that is, on synthetic data produced by previous models) causes a consistent decrease in the lexical, syntactic, and semantic diversity of the model outputs through successive iterations, an effect that is especially pronounced for tasks demanding high levels of creativity.[11]
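Lexical diversity of this kind can be quantified, for example, with distinct-n style metrics (the ratio of unique n-grams to total n-grams). The sketch below uses invented example outputs and a deliberately simple metric; the measures used in [11] are more elaborate.

    # Minimal distinct-n lexical-diversity metric: unique n-grams / total n-grams.
    # The example outputs are invented purely to illustrate the measurement.
    def distinct_n(texts, n):
        ngrams = []
        for text in texts:
            tokens = text.lower().split()
            ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

    generations = {
        "generation 0 (human-written)": ["the cat sat on the mat",
                                         "a dog chased the red ball"],
        "generation 3 (synthetic)":     ["the cat sat on the mat",
                                         "the cat sat on the rug"],
    }
    for name, outputs in generations.items():
        print(name, "distinct-1:", round(distinct_n(outputs, 1), 2),
                    "distinct-2:", round(distinct_n(outputs, 2), 2))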

Data poisoning for artists

Data poisoning is a form of adversarial machine learning in which the data of an image or text is altered so that a model cannot be accurately trained on it. There are two main types of data poisoning: defensive, where an image's data is altered to protect the integrity of the work by preventing copying and look-alikes, and offensive, where an image is altered to reduce the reliability of generative artificial intelligence image generation.[12] However, it is unknown how much data poisoning affects training data and generative artificial intelligence at a large scale.
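At a schematic level, such alterations are small, budgeted perturbations of the image's pixel data. The sketch below only shows the general form of a bounded (L-infinity) perturbation and uses random noise, so it does not reproduce how Nightshade or any other tool actually computes its perturbations.

    # Highly simplified illustration of a bounded pixel-level perturbation
    # (an L-infinity budget), the general form used in adversarial machine
    # learning.  Real poisoning tools optimise the perturbation against the
    # feature extractors of image generators; random noise, as used here,
    # does not actually poison anything.
    import numpy as np

    def perturb(image, epsilon=4.0, seed=0):
        """Add a random perturbation with per-pixel magnitude at most epsilon."""
        rng = np.random.default_rng(seed)
        delta = rng.uniform(-epsilon, epsilon, size=image.shape)
        return np.clip(image.astype(float) + delta, 0, 255).astype(np.uint8)

    original = np.full((64, 64, 3), 128, dtype=np.uint8)   # placeholder grey image
    poisoned = perturb(original)
    print("max pixel change:", int(np.abs(poisoned.astype(int) - original.astype(int)).max()))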

References

  1. ^ a b c d Shumailov, Ilia; Shumaylov, Zakhar; Zhao, Yiren; Gal, Yarin; Papernot, Nicolas; Anderson, Ross (2023-05-31). "The Curse of Recursion: Training on Generated Data Makes Models Forget". arXiv:2305.17493 [cs.LG].
  2. ^ Ozsevim, Ilkhan (2023-06-20). "Research finds ChatGPT & Bard headed for 'Model Collapse'". Retrieved 2024-03-06.
  3. ^ Mok, Aaron. "A disturbing AI phenomenon could completely upend the internet as we know it". Business Insider. Retrieved 2024-03-06.
  4. ^ Wyllie, Sierra; Shumailov, Ilia; Papernot, Nicolas (2024-06-05). "Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias". Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. FAccT '24. New York, NY, USA: Association for Computing Machinery: 2113–2147. doi:10.1145/3630106.3659029. ISBN 979-8-4007-0450-5.
  5. ^ De Rosa, Nicholas (May 31, 2024). "How the new version of ChatGPT generates hate and disinformation on command". CBC. Retrieved June 13, 2024.
  6. ^ Orland, Kyle (May 24, 2024). "Google's "AI Overview" can give false, misleading, and dangerous answers". Ars Technica. Retrieved June 13, 2024.
  7. ^ Alemohammad, Sina; Casco-Rodriguez, Josue; Luzi, Lorenzo; Humayun, Ahmed Imtiaz; Babaei, Hossein; LeJeune, Daniel; Siahkoohi, Ali; Baraniuk, Richard G. (July 4, 2023). "Self-Consuming Generative Models Go MAD". arXiv:2307.01850 [cs.LG].
  8. ^ Self-Consuming Generative Models Go MAD. The Twelfth International Conference on Learning Representations.
  9. ^ Gerstgrasser, Matthias; Schaeffer, Rylan; Dey, Apratim; Rafailov, Rafael; Sleight, Henry; Hughes, John; Korbak, Tomasz; Agrawal, Rajashree; Pai, Dhruv; Gromov, Andrey; Roberts, Daniel A.; Yang, Diyi; Donoho, David L.; Koyejo, Sanmi (2024-04-01). "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data". arXiv:2404.01413 [cs.LG].
  10. ^ Kirchenbauer, John; Geiping, Jonas; Wen, Yuxin; Katz, Jonathan; Miers, Ian; Goldstein, Tom (2023-07-03). "A Watermark for Large Language Models". Proceedings of the 40th International Conference on Machine Learning. PMLR: 17061–17084.
  11. ^ Guo, Yanzhu; Shang, Guokan; Vazirgiannis, Michalis; Clavel, Chloé (2024-04-16). "The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text". arXiv:2311.09807 [cs.CL].
  12. ^ The Nightshade Team. "What is Nightshade". Nightshade. University of Chicago. Retrieved June 13, 2024.
