This is a Plain English Papers summary of a research paper called "LLM Training: Data Costs 1000x More Than You Think!". If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

The Unacknowledged Contributors to AI's Success

Large Language Models (LLMs) have become increasingly expensive to train, with costs doubling approximately every nine months. Models like OpenAI's GPT-4 and Google's Gemini Ultra likely cost tens of millions of dollars to develop, with billion-dollar training runs expected in the near future. Yet amid these escalating costs, one critical expense remains largely unaccounted for: the training data itself.

Every LLM is built upon an enormous foundation of human effort - trillions of words carefully written in books, academic papers, codebases, social media posts, and more. Despite this invaluable contribution, the creators of that content rarely receive compensation for the work that powers these AI systems.

A new research paper by Kandpal and Raffel examines this economic imbalance. The researchers analyzed 64 language models released between 2016 and 2024 and found that the value of their training data dramatically overshadows the computational cost of training the models themselves.

Figure 1: Training costs compared with estimated dataset-creation costs for 64 language models released between 2016 and 2024. Estimated dataset costs exceed training costs by one to three orders of magnitude (10-1000x).

Measuring the True Value of Training Data

To quantify the unacknowledged value of training data, the researchers calculated how much it would cost to create these datasets from scratch using paid human labor. They intentionally used conservative estimates, assuming:

  • Minimum wage rates ($15/hour) rather than professional writing rates
  • Fast writing speeds (30 words per minute)
  • Only the raw text creation time, excluding research, editing, or expertise development

This deliberately conservative approach still yielded staggering results. Even simple content creation requires significant human effort when scaled to the volumes needed for LLM training.

Writing Medium             Typical Length (words)   Estimated Writing Time   Estimated Cost (USD)
Blog Post                  2,000                    1.1 hours                $4
Academic Paper             5,000                    2.8 hours                $11
Novel                      70,000                   1.6 days                 $150
Textbook                   300,000                  1 week                   $642
Encyclopedia Britannica    40,000,000               2.5 years                $85,560

Table 1: Estimated costs and time required to write different types of content based on conservative writing speeds and minimum wage.
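
As a rough sanity check, Table 1's time estimates can be reproduced with a few lines of arithmetic. The Python sketch below uses the wage and writing-speed assumptions stated above; it matches the table's time column exactly, while the dollar output depends on precisely how the hourly wage is applied, so treat the cost figures it prints as illustrative rather than a reproduction of the paper's exact numbers.

```python
# Back-of-envelope estimate of the labor cost of producing text, using the
# summary's stated assumptions: minimum wage ($15/hour) and a fast,
# sustained writing speed (30 words per minute).

WAGE_PER_HOUR = 15.0      # assumed minimum-wage rate (USD/hour)
WORDS_PER_MINUTE = 30.0   # assumed sustained writing speed

def writing_hours(words: int) -> float:
    """Hours needed to write `words` words at the assumed speed."""
    return words / (WORDS_PER_MINUTE * 60.0)

def writing_cost(words: int) -> float:
    """Labor cost (USD) at the assumed hourly wage."""
    return writing_hours(words) * WAGE_PER_HOUR

media = {
    "Blog Post": 2_000,
    "Academic Paper": 5_000,
    "Novel": 70_000,
    "Textbook": 300_000,
    "Encyclopedia Britannica": 40_000_000,
}

for name, words in media.items():
    print(f"{name:<24} {writing_hours(words):>10,.1f} h   ${writing_cost(words):>12,.2f}")
```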

These calculations were applied to the training datasets of various LLMs to estimate their true cost. Recent research on cost efficiency of LLM-generated training data complements this work by exploring alternative approaches to data creation.

The Staggering Results: Data Value Eclipses Training Costs

The research reveals an enormous disparity: even with these conservative estimates, the value of training data exceeds model training costs by 10-1000 times. This would represent a massive financial liability for LLM providers if they were required to pay fair rates for the content they use.

As models have scaled up over time, the gap between training costs and data costs has grown exponentially. The largest models now train on trillions of tokens, representing content that would cost billions or even trillions of dollars if created from scratch with compensated labor.
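
To make that scale concrete, here is a hedged back-of-envelope comparison. Every input below is an assumption chosen for illustration, not a figure from the paper: a 15-trillion-token dataset, a rough English words-per-token ratio, and a ~$100M compute budget for a frontier training run.

```python
# Illustrative comparison of dataset-creation cost vs. compute cost for a
# frontier-scale model. Every input here is an assumption for the example.

TOKENS = 15e12                      # assumed dataset size: 15T tokens
WORDS_PER_TOKEN = 0.75              # rough English words-per-token ratio
COST_PER_WORD = 15.0 / (30 * 60)    # $15/hour at 30 wpm ~= $0.0083/word

data_cost = TOKENS * WORDS_PER_TOKEN * COST_PER_WORD
compute_cost = 100e6                # assumed ~$100M frontier training run

print(f"Estimated data-creation cost: ${data_cost / 1e9:,.1f}B")
print(f"Assumed compute cost:         ${compute_cost / 1e9:,.1f}B")
print(f"Data-to-compute ratio:        {data_cost / compute_cost:,.0f}x")
```

Under these assumptions the implied data-creation cost is roughly $94 billion, putting the data-to-compute ratio near the top of the paper's reported 10-1000x range.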

This economic reality has significant implications for the sustainability of AI development. Related research on cost-performance optimization for processing low-resource languages highlights additional challenges in this space.

Legal and Ethical Implications: The Growing Controversy

The practice of training LLMs on web text faces mounting legal and ethical challenges. Several high-profile lawsuits have emerged, including actions by The New York Times, the Authors Guild, Concord Music Group, and newspapers owned by Alden Global Capital, all challenging the unauthorized use of copyrighted material for commercial AI training.

Beyond legal concerns, ethical questions arise about models potentially competing with the very creators whose work they were trained on. This creates a troubling dynamic where content creators effectively train their own competition without compensation.

In response to these concerns, some corporate entities have begun entering data licensing agreements with LLM providers. Major publishers like The Atlantic, Reuters, Wiley, and Vox Media, along with platforms like Reddit and Shutterstock, have established compensation arrangements. However, these agreements cover only a small fraction of the total data used to train most models.

Research on disparities in LLM control and access further explores the power imbalances in the current AI ecosystem.

Towards a More Equitable AI Future

The researchers argue that data creators should receive the largest share of compensation in the LLM development pipeline. This position challenges the current paradigm where computational resources receive significant investment while content creators go unpaid.

Several mechanisms could enable fairer compensation:

  • Direct payment systems for training data contributions
  • Attribution and licensing frameworks that respect creator rights
  • Transparent documentation of training sources
  • Revenue-sharing models for commercial AI applications (a toy sketch follows this list)
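
As a purely hypothetical illustration of the last item, the snippet below splits a fixed share of revenue across contributors in proportion to their word counts. The function, the 50% creator share, and the toy numbers are all invented for this sketch; the paper does not prescribe a specific mechanism.

```python
# Hypothetical pro-rata revenue-sharing scheme: each contributor is paid in
# proportion to how many of their words appear in the training corpus. This
# illustrates the idea only; it is not a mechanism proposed in the paper.

def revenue_shares(contributions: dict[str, int],
                   revenue: float,
                   creator_share: float = 0.5) -> dict[str, float]:
    """Split `creator_share` of `revenue` across contributors by word count."""
    total_words = sum(contributions.values())
    pool = revenue * creator_share
    return {name: pool * words / total_words
            for name, words in contributions.items()}

# Toy example: three contributors to a (tiny) training corpus.
payouts = revenue_shares(
    {"novelist": 70_000, "blogger": 2_000, "textbook_author": 300_000},
    revenue=1_000_000,
)
for name, amount in payouts.items():
    print(f"{name:<16} ${amount:>12,.2f}")
```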

Properly accounting for data costs would fundamentally reshape the economics of AI development. Model training might become more selective and focused, with greater emphasis on data quality over quantity, as suggested in research on efficient LLMs for scientific text.

This research highlights a profound economic imbalance in current AI development. While companies invest billions in computing infrastructure, the true foundation of these systems - human-created content - remains largely uncompensated. Addressing this imbalance represents one of the most significant challenges for the ethical advancement of artificial intelligence.
