Analysis

A group of AI researchers has demonstrated that large language models can be built using only openly licensed or public domain text, challenging the industry's claim that scraping copyrighted material is necessary. The researchers created an eight-terabyte dataset and trained a 7 billion parameter language model that performed comparably to Meta's Llama 2-7B. While the process was painstaking and not fully automatable, this effort could influence the policy debate around AI and copyright, particularly amid ongoing lawsuits and legislative developments concerning AI training data.

A recent research initiative has demonstrated the feasibility of constructing a 7 billion parameter large language model, comparable in performance to Meta's Llama 2-7B, using an eight-terabyte dataset composed exclusively of openly licensed or public domain text. This development challenges the prevailing industry stance, notably from entities like OpenAI, Anthropic, and venture capital firm Andreessen Horowitz, which have argued that training leading AI models necessitates the use of copyrighted materials scraped from the internet. While the researchers from Eleuther AI and other institutions acknowledge their process was painstaking, arduous, and not fully automatable, their findings introduce a significant counterpoint in the escalating policy and legal debates surrounding AI and copyright. This is particularly relevant amidst current events such as Reddit's (RDDT) lawsuit against Anthropic for alleged unlicensed data use, ongoing discussions around UK copyright law amendments, and the US Copyright Office's report casting doubt on fair use applicability for generative AI. Although the ethically sourced model is smaller than current leading models like OpenAI's ChatGPT or Google's (GOOGL, GOOG) Gemini, and its scalability to match these titans remains uncertain, this effort underscores the potential for alternative, more transparent AI development pathways and may intensify calls for AI companies to disclose more about their training data.

AllMind

AllMind

Analysis | AI firms say they can’t respect copyright. These researchers tried.

Analysis

AllMind AI Terminal

Market Sentiment

Key Decisions for Investors