
Don’t worry: the recently reported breach of OpenAI’s systems did not leak your secret ChatGPT conversations. The hack itself, while troubling, appears to have been superficial, but it is a reminder that AI companies have quickly made themselves some of the most tempting targets out there for hackers.
The New York Times reported more about the hack after former OpenAI employee Leopold Aschenbrenner hinted at it in a recent podcast. He called it a “major security breach,” but an unnamed company source told the Times that the hackers only had access to employee discussion forums. (I have reached out to OpenAI for confirmation and comment.)
Security breaches shouldn’t be trivialized, and eavesdropping on OpenAI’s internal development conversations certainly has value to an attacker. But that’s a far cry from hackers gaining access to internal systems, work-in-progress models, and secret roadmaps.
Even so, it should worry us, and not just because China or some other adversary might overtake us in the AI arms race. The simple fact is that these AI companies have become gatekeepers to vast amounts of very valuable data.
Let’s talk about three types of data that OpenAI and (to a lesser extent) other AI companies generate or have access to: high-quality training data, large amounts of user interactions, and customer data.
The companies are incredibly secretive about their stockpiles of data, so it’s unclear exactly what kind of training data they have. But it’s a mistake to think of it as just a big pile of scraped web data. Yes, they use web scrapers and data sets like the Pile, but turning that raw data into something that can train a model like GPT-4o is an enormous task. It takes a huge amount of human labor and can only be partially automated.
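To make the “partially automated” part concrete, here is a minimal, hypothetical sketch in Python of two routine cleanup steps, exact deduplication and a crude quality filter. The function names and thresholds are illustrative assumptions, not a description of any company’s actual pipeline.

```python
# A minimal, illustrative sketch of the kind of cleanup that can be automated
# when turning scraped web text into training data: exact deduplication plus
# crude quality filters. Thresholds and structure here are assumptions for
# illustration only; real pipelines involve many more stages and human review.
import hashlib
import re


def basic_quality_ok(text: str, min_words: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Reject documents that are too short or mostly non-alphanumeric noise."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = len(re.findall(r"[^\w\s]", text))
    return symbols / max(len(text), 1) <= max_symbol_ratio


def dedupe_and_filter(docs):
    """Yield documents that pass the filters, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if basic_quality_ok(doc):
            yield doc


if __name__ == "__main__":
    scraped = [
        "A long enough paragraph of reasonably clean prose " * 5,
        "A long enough paragraph of reasonably clean prose " * 5,  # exact duplicate
        "!!! ???",  # too short, mostly symbols
    ]
    cleaned = list(dedupe_and_filter(scraped))
    print(f"{len(scraped)} scraped docs -> {len(cleaned)} kept")
```

Steps like these are the easy part; the judgment calls about source quality, balance, and licensing are where the human labor the paragraph describes tends to go.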
Some machine learning engineers have speculated that the most important factor in building a large language model (or perhaps any transformer-based system) is the quality of the data set. That’s why a model trained on Twitter and Reddit will never be as eloquent as one trained on every published work of the last century. (And it’s probably why OpenAI reportedly used legally questionable sources, like copyrighted books, in its training data, a practice it claims to have abandoned.)
So the training data sets OpenAI has built are incredibly valuable, to rival companies, to hostile nations, and even to U.S. regulators. Wouldn’t the FTC or the courts want to know exactly what data was used, and whether OpenAI was telling the truth about it?
But perhaps even more valuable is OpenAI’s vast user data: billions of conversations with ChatGPT on hundreds of thousands of topics. Just as search data was once key to understanding the collective psychology of the web, ChatGPT is taking the pulse of a population that, while not as expansive as Google’s user universe, offers far more depth. (As you may know, conversations are used for training data unless you opt out.)
For Google, an increase in “air conditioner” searches tells you the market is heating up a bit. But those searchers aren’t having a full conversation about what they want, how much they’re willing to spend, what their home is like, which manufacturers they’re avoiding, and so on. You know this data is valuable because Google itself is trying to replace search with AI interactions precisely so it can capture that information!
Think about how many conversations people have had with ChatGPT, and how useful that information is not only for AI developers, but also for marketing teams, consultants, analysts, etc. It’s a real goldmine.
The last category of data is probably the most valuable in the open market: how customers actually use AI and the data they feed directly into their models.
Hundreds of large companies and countless smaller ones use APIs like OpenAI’s and Anthropic’s for a wide variety of tasks, and for a language model to be useful to them, it usually has to be fine-tuned on or given access to their internal databases.
That could be something as mundane as old budgets or personnel records (to make them more easily searchable, say) or as valuable as the code of an unreleased piece of software. What they do with the AI’s capabilities (and whether those capabilities are actually useful) is their business, but the simple fact is that the AI provider has privileged access, just as any other SaaS vendor does.
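As a rough sketch of what that privileged access looks like in practice, here is a hypothetical retrieval-augmented call using OpenAI’s Python SDK. The document names, the toy keyword retrieval, and the model choice are assumptions for illustration, not any customer’s real setup; the point is simply that whatever internal text is retrieved transits the provider’s API with every request.

```python
# Hedged illustration of the access pattern described above: an application
# retrieves snippets from its own internal documents and sends them to a
# hosted model alongside the user's question. The documents and retrieval
# logic are invented for this sketch.
from openai import OpenAI  # pip install openai

INTERNAL_DOCS = {
    "budget-2023.txt": "FY2023 marketing budget was reduced by 12% versus FY2022.",
    "roadmap.txt": "Unreleased feature X is targeted for a Q3 beta.",
}


def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank internal documents by query-word overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        INTERNAL_DOCS.items(),
        key=lambda item: sum(t in item[1].lower() for t in terms),
        reverse=True,
    )
    return [text for _, text in scored[:k]]


def answer(question: str) -> str:
    """Send the question plus retrieved internal context to the hosted model."""
    context = "\n".join(retrieve(question))
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": f"Answer using this internal context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(answer("What happened to the 2023 marketing budget?"))
```

Whether the internal data is wired in this way or through fine-tuning, the provider ends up handling it, which is exactly the exposure the next paragraph is about.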
These are trade secrets, and AI companies are suddenly at the center of a great many of them. The newness of this side of the industry carries a special risk, since AI processes are not yet standardized or fully understood.
Like any SaaS provider, AI companies can offer industry-standard levels of security, privacy, on-premises options, and generally responsible service. I have no doubt that the private databases and API calls of OpenAI’s Fortune 500 customers are locked down very tightly! They are certainly as aware as anyone, or should be, of the risks inherent in handling confidential data in an AI context. (It was OpenAI’s choice not to report this attack, but it doesn’t inspire trust in a company that desperately needs it.)
But good security practices don’t change the value of what they’re meant to protect, or the fact that malicious actors and adversaries of all kinds are knocking at the door. Security isn’t just about choosing the right settings or keeping your software up to date, though of course the basics matter too. It’s a never-ending cat-and-mouse game that, ironically, is now being supercharged by AI itself, with agents and attack automation probing every nook and cranny of these companies’ attack surfaces.
There’s no reason to panic. Companies with access to lots of valuable personal or commercial data have faced and managed similar risks for years. But AI companies are newer, younger, and potentially juicier targets than your garden-variety, poorly configured enterprise server or irresponsible data broker. Even a hack like the one reported above, with no serious exfiltration that we know of, should worry anyone who does business with an AI company. These companies have painted targets on their backs. Don’t be surprised when anyone and everyone takes a shot.