

“There’s no more free lunch. You can’t scrape a web-scale data set anymore. You have to go and purchase it or produce it. That’s the frontier we’re at now,” said Alex Ratner, co-founder of Snorkel AI, which builds and labels data sets for companies.
Top artificial intelligence companies face a growing number of copyright lawsuits and accusations of scraping data from the web, especially as they reach a “data frontier” that is slowing technological advances. Recently, authors sued Anthropic for using copyrighted books without permission, adding to existing lawsuits, including one from The New York Times against OpenAI and Microsoft for copyright infringement. AI companies need large data sets to train their models, but they now have to purchase or create that data instead of freely scraping the web. The shift has led to disputes with publishers over AI companies’ use of their content without proper compensation.
Some AI companies are striking deals with publishers to access data, as OpenAI has done with Condé Nast and The Financial Times, while others, like Anthropic, are still negotiating such partnerships. Google, which won a fair use case in 2015, faces criticism from publishers as well. The outcome of these cases could set legal precedents for how AI companies interact with content creators and could shape the broader AI landscape.