
Many companies won’t say whether they will comply with California’s AI training data transparency law.

On Sunday, California Governor Gavin Newsom signed AB-2013, a bill requiring companies that develop generative AI systems to publish high-level summaries of the data they used to train those systems. Among other things, the summaries must disclose who owns the data, how it was procured or licensed, and whether it includes copyrighted or personal information.
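To make that concrete, here is a minimal sketch of what one entry in such a summary could look like if expressed as structured data, folding in the other items the law asks for, such as when a dataset was first used and whether collection is ongoing. Everything here is illustrative: AB-2013 dictates what must be disclosed, not the format, so the DatasetDisclosure class and its field names are hypothetical.

```python
# Hypothetical sketch of an AB-2013-style training data summary entry.
# The law specifies what must be disclosed, not how; these field names
# are illustrative assumptions, not a mandated schema.
from dataclasses import dataclass

@dataclass
class DatasetDisclosure:
    name: str                        # dataset name or description
    owner: str                       # who owns the data
    procurement: str                 # how it was procured or licensed
    has_copyrighted_material: bool   # contains copyrighted works?
    has_personal_info: bool          # contains personal information?
    first_used: str                  # when the dataset was first used
    collection_ongoing: bool         # whether collection is ongoing

# Example entry for a hypothetical web-scraped dataset.
summary = DatasetDisclosure(
    name="ExampleWebCrawl-2022",
    owner="Example AI Inc.",
    procurement="Scraped from the public web; portions licensed",
    has_copyrighted_material=True,
    has_personal_info=True,
    first_used="2022-03",
    collection_ongoing=True,
)
print(summary)
```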

Few AI companies are willing to say whether they will comply.

TechCrunch reached out to key players in the AI space, including OpenAI, Anthropic, Microsoft, Google, Amazon, Meta, and startups Stability AI, Midjourney, Udio, Suno, Runway, and Luma Labs. Fewer than half responded, and one vendor, Microsoft, explicitly declined to comment.

Only Stability, Runway, and OpenAI told TechCrunch they would comply with AB-2013.

“OpenAI complies with the laws of the jurisdictions in which we operate, including this one,” an OpenAI spokesperson said. A Stability spokesperson said the company “supports thoughtful regulation that protects the public while not stifling innovation.”

To be fair, AB-2013’s disclosure requirements don’t take effect immediately. The law applies to systems released on or after January 2022, ChatGPT and Stable Diffusion among them, but companies have until January 2026 to begin publishing training data summaries. It also applies only to systems made available to Californians, which leaves some wiggle room.

But there may be another reason why vendors are silent on this issue, and it has to do with how most generative AI systems are trained.

Training data often comes from the web. Vendors scrape massive amounts of images, songs, videos, and text from websites and train their systems on them.

Until a few years ago, it was standard practice for AI developers to list the sources of their training data in the technical documentation accompanying a model’s release. Google, for example, said it trained an early version of Imagen, its family of image generation models, on the public LAION dataset. Many older papers mention The Pile, an open-source collection of text datasets that includes academic papers and codebases.

In today’s competitive market, the composition of a training dataset is considered a competitive advantage, and companies cite this as one of the main reasons for non-disclosure. But training data details can also paint a legal target on developers’ backs. LAION links to copyrighted and privacy-infringing images, and The Pile includes Books3, a library of pirated works by Stephen King and other authors.

There are already plenty of lawsuits over the misuse of training data, and more are filed every month.

Authors and publishers claim that OpenAI, Anthropic, and Meta used copyrighted books (some of them from Books3) to train their models. Record labels have taken Udio and Suno to court for allegedly training on songs without compensating the musicians behind them. And artists have filed a class action lawsuit against Stability and Midjourney, alleging that their data scraping practices amount to theft.

It’s not hard to see how AB-2013 could be problematic for vendors trying to stave off legal battles. The law requires disclosing various potentially incriminating details about training datasets, including when each set was first used and whether data collection is ongoing.

AB-2013 is also quite broad in scope. Any entity that “substantially modifies” an AI system, that is, fine-tunes or retrains it, must likewise publish information about the training data it used to do so. The law carves out a few exceptions, but they mostly cover AI systems used in cybersecurity and defense, such as those used “for the operation of aircraft in the national airspace.”

Of course, many vendors believe the fair use doctrine provides legal cover, and they have asserted as much in court and in public statements. Some, such as Meta and Google, have changed their platforms’ settings and terms of service to let them tap more user data for training.

Driven by competitive pressures and a belief that fair use defenses will ultimately prevail, some companies have liberally trained on IP-protected data. A Reuters report revealed that Meta at one point used copyrighted books for AI training despite its own lawyers’ warnings. There is evidence that Runway sourced Netflix and Disney movies to train its video generation systems. And OpenAI reportedly transcribed YouTube videos without creators’ knowledge to develop models, including GPT-4.

As we’ve written before, there’s an outcome in which generative AI vendors get away scot-free regardless of whether they disclose their training data. Courts may end up siding with fair use advocates and decide that generative AI is sufficiently transformative, and not the plagiarism engine The New York Times and other plaintiffs claim it is.

In a more dramatic scenario, AB-2013 could lead vendors to withhold certain models from California, or to release versions of their models for Californians trained only on fair use and licensed datasets. Some vendors may decide that the safest course of action under AB-2013 is the one that avoids compromising disclosures, and the litigation that could come with them.

Assuming the law isn’t challenged or stayed, we’ll have a clear picture by AB-2013’s deadline, a little over a year from now.
