EleutherAI has released The Common Pile v0.1, an 8-terabyte dataset of licensed and open-domain text, intended to address concerns around copyright infringement in AI model training. The organization claims that its two new AI models, Comma v0.1-1T and Comma v0.1-2T, trained on a fraction of this dataset, perform comparably to models trained on unlicensed, copyrighted data, suggesting that open-source data can achieve similar results. This release aims to increase transparency in AI development and counter the trend of companies reducing research releases due to copyright lawsuits.
EleutherAI has launched The Common Pile v0.1, an 8-terabyte collection of licensed and open-domain text intended to address copyright infringement concerns in AI model training. The dataset, a two-year collaborative effort with partners including Poolside and Hugging Face, was used to train EleutherAI's new 7-billion-parameter models, Comma v0.1-1T and Comma v0.1-2T. According to EleutherAI, these models perform on par with models trained on unlicensed, copyrighted data, rivaling models such as Meta's Llama 1 on benchmarks for coding and math. The release responds to a problem described by EleutherAI's executive director, Stella Biderman: ongoing copyright lawsuits against major AI developers have mainly reduced transparency and hampered the sharing of research, rather than fundamentally changing how training data is sourced. By offering a large-scale, legally vetted dataset drawn from sources such as the Library of Congress and the Internet Archive, EleutherAI seeks to promote greater openness and provide a viable alternative to proprietary or ethically ambiguous training data. The initiative also responds to criticism of EleutherAI's earlier dataset, "The Pile," and signals a commitment to more frequent open dataset releases and a broader push toward more ethical and transparent AI development practices.
Overall Sentiment: moderately positive
Sentiment Score: 0.60
Ticker Sentiment