AI Training Data Under Fire – Lawsuits Over Unlawful Use of Content

The Legal Challenges of AI’s Web Scraping Habit

Generative AI systems like ChatGPT are fueled by massive datasets scraped from the internet. This treasure trove of training data often includes copyrighted books, articles, and other media – sometimes used without the creators’ consent. As a result, serious legal and ethical questions have arisen about whether using someone’s content to train an AI model is fair game or a form of digital piracy. A growing number of authors and content creators are pushing back, concerned that their work has been ingested into AI models without permission or compensation. It’s a stark reminder that once data is posted publicly online, controlling where it ends up can be nearly impossible.

Authors and Publishers Fight Back

In mid-2023, a pair of novelists made headlines by suing OpenAI for allegedly “ingesting” their books to train ChatGPT without authorization. Authors Mona Awad and Paul Tremblay filed a class-action complaint after noticing the chatbot could produce detailed summaries of their novels – evidence, they argued, that the AI had been fed their copyrighted text. The lawsuit claims OpenAI “unfairly” profits from “stolen writing and ideas,” and it seeks damages on behalf of all affected writers. This was among the first copyright suits against OpenAI, but it was far from the last: by the end of that year, at least three similar class actions had been filed by groups of U.S. authors, each alleging that OpenAI copied writers’ works without permission to teach its models.

And it’s not just book authors. News publishers are also taking action: The New York Times, for example, has sued OpenAI (and its partner Microsoft) over the unpermitted use of Times articles to train large language models. Visual creators have raised alarms too – Stability AI, maker of the Stable Diffusion image generator, has been sued by copyright owners (including the photo agency Getty Images) for using millions of online images without authorization. In each case, content owners argue that AI firms are building lucrative products off the back of pirated material, while the AI companies contend that training on publicly available data qualifies as fair use. These legal battles are poised to test the boundaries of copyright law in the AI era, potentially setting precedents for how training data can be collected and used.

Fair Use or “Unauthorized Copying”?

Notably, AI developers have grown increasingly secretive about exactly what data they use. OpenAI has described its sources only in broad strokes – for instance, acknowledging a dataset of “internet-based books” nicknamed “Books2,” which outside analysts estimate contained roughly 294,000 titles, likely obtained from shadow libraries such as LibGen or Z-Library. If so, a vast number of books – many still under copyright – were pulled from illicit online archives to train the model. The crux of the debate is whether such uncompensated use of protected material is legally permissible. Experts note the outcome may hinge on how courts interpret fair use: is training an AI on someone’s work a transformative, fair use of the material, or simply “unauthorized copying” outside the bounds of the law? Until clear jurisprudence emerges, AI companies and creators alike operate in a legal gray area.

Implications for Your Data Online

These controversies highlight a broader truth: when you put information on the internet, you often lose control over how it might be used down the line. AI models are hungry for data, and they don’t distinguish between a classic novel, a personal blog post, or a random forum comment – if it’s publicly accessible, it can be scooped into a training set. In some cases, data that people shared for one purpose ends up repurposed in AI systems without their knowledge or consent. For example, users recently discovered that LinkedIn had automatically opted them into allowing their profile information to train generative AI models, sparking backlash when people realized they hadn’t explicitly agreed to that use. The difficulty of tracking or preventing such usage is exactly why authors and artists are sounding alarms. Once content is out in the open, it may quietly find its way into countless AI datasets or other unintended contexts.

Protecting Your Work and Privacy

So what can creators and regular users do? One takeaway is to be mindful of what you share publicly, especially if it’s something you may want to protect. Consider using watermarks or posting only excerpts instead of full works, so it’s harder for AI systems to ingest useful portions of your content. Some experts suggest opting for more secure or closed platforms for sensitive material, where web scrapers and bots can’t easily harvest the data. When working with important documents, consider using tools that prioritize privacy – for instance, a client-side PDF editor that processes files directly in your browser without uploading them to any server. Solutions like SecurePDFEditor.com exemplify this approach, ensuring your documents remain on your local machine and aren’t inadvertently fed into someone’s AI model. Until laws and norms catch up with technology, exercising caution and using privacy-focused services can help you retain a bit more control over where your data ends up.
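To make the client-side idea concrete, here is a minimal sketch of the pattern: the document’s bytes are read and transformed entirely in local memory, and nothing is sent over the network. The `stripAuthorMetadata` helper is hypothetical, written for illustration only – it is not an actual SecurePDFEditor.com API, and real PDF editing requires a proper parser rather than this naive text substitution.

```javascript
// Hypothetical illustration of client-side document editing:
// the file never leaves the user's machine.
function stripAuthorMetadata(pdfBytes) {
  // Naive sketch: blank out any "/Author (...)" entry so the edited
  // file no longer carries the creator's name. Real PDFs need a real
  // parser; this only demonstrates the local-processing principle.
  const text = Buffer.from(pdfBytes).toString("latin1");
  const cleaned = text.replace(/\/Author \([^)]*\)/g, "/Author ()");
  return Buffer.from(cleaned, "latin1");
}

// In a browser, the same idea would look like:
//   const bytes = await file.arrayBuffer();          // local read, no upload
//   const edited = stripAuthorMetadata(Buffer.from(bytes));
//   // ...then offer `edited` back to the user as a download blob.

const sample = Buffer.from("%PDF-1.4 ... /Author (Jane Doe) ...", "latin1");
console.log(stripAuthorMetadata(sample).toString("latin1").includes("Jane Doe"));
```

The key design point is that every step operates on an in-memory buffer: there is no `fetch` or form upload anywhere in the flow, so the file cannot end up in a server-side log or dataset.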