A U.S. court has ordered OpenAI to turn over internal communications about why it deleted two massive datasets of pirated books used to train its models.
Those datasets (the definitely-not-hiding-anything sets labeled “books 1” and “books 2”) contained 100,000+ scraped, copyrighted books (novels, non-fiction, and more). Without consent or compensation for their authors, of course.
At this point, we’ve known for years that tech companies have trained their AI models on pirated, copyrighted work, along with vast quantities of online writing (fanfic, blog posts, journalism, and more). Without that work, these LLMs would effectively not exist.
But if the court finds OpenAI knowingly trained on pirated books and deleted the evidence to avoid scrutiny, that could mean actual consequences to the tune of billions of dollars. Earlier this year, Anthropic agreed to pay authors $1.5 billion to settle claims over... similarly sus training.
A ruling against OpenAI could ripple across the entire industry, and deal a major blow to Big Tech’s scrape-first, apologize-never strategy.
We'll be keeping a close eye on any developments.
- the Ellipsus Team xo