Patronus AI raises $50 million to build ‘digital worlds’ that stress-test AI agents.

AI agents are becoming more sophisticated. They are evolving from answering questions to autonomously executing multi-step, complex tasks.

But before these agents can be trusted to book travel or perform financial analysis on your behalf, model providers and the startups building these agents want to ensure that the agents perform reliably across a wide range of scenarios.

AI labs often use benchmarks to show off the performance of their models, but even on agent-centric benchmarks, high scores don’t actually prove that an AI can properly perform a variety of complex real-world tasks.

Patronus AI, a startup founded in 2023 by former Meta AI researchers Anand Kannappan and Rebecca Qian, helps model builders and companies fine-tune their models by building simulated digital environments in which agent performance can be evaluated.

The San Francisco-based startup must be solving an important problem. Nearly every cutting-edge AI lab and many emerging startups are now customers, according to Glenn Solomon, managing director at Notable Capital. He says demand for the company’s simulation environments remains largely unmet.

Patronus’ revenue has grown 15-fold over the past year, sparking significant investor interest. On Thursday, the company announced a $50 million Series B round led by Greenfield Partners, with participation from Notable Capital, Lightspeed, Datadog and Samsung. This round brings the company’s total funding to $70 million.

Patronus uses a so-called “digital world model” to create replicas of websites and internal systems. In these environments, agents are trained and then stress-tested using reinforcement learning, which repeatedly rewards successful task completion and penalizes errors.
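The reward-and-penalty loop described above can be illustrated with a toy example. The sketch below is not Patronus's system. It assumes a hypothetical simulated task with three made-up actions, a reward of +1 for completing the task and -1 for an error, and a simple epsilon-greedy learner that keeps a running value estimate per action.

```python
import random

# Hypothetical actions an agent might take in a simulated web form task.
ACTIONS = ["skip_required_field", "submit_twice", "fill_form_correctly"]


def simulated_task(action: str) -> float:
    """Toy environment (an assumption): +1 for success, -1 for an error."""
    return 1.0 if action == "fill_form_correctly" else -1.0


def train(episodes: int = 200, epsilon: float = 0.1,
          lr: float = 0.1, seed: int = 0) -> dict:
    """Epsilon-greedy learning of one value estimate per action."""
    rng = random.Random(seed)
    values = {action: 0.0 for action in ACTIONS}
    for _ in range(episodes):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.choice(ACTIONS)
        else:
            # Exploit: pick the action with the highest estimate so far.
            action = max(values, key=values.get)
        reward = simulated_task(action)
        # Move the estimate a small step toward the observed reward.
        values[action] += lr * (reward - values[action])
    return values


values = train()
best_action = max(values, key=values.get)
print(best_action, values)
```

After enough episodes, the penalized actions end up with negative estimates and the learner settles on the action that completes the task. Real agent training involves far larger action spaces and learned policies, but the feedback signal works on the same principle.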

AI labs see great value in these digital simulations because they give agents the opportunity to work through varied, sometimes unpredictable, scenarios. The company compares its approach to how Waymo trained its autonomous cars by first building synthetic worlds to test the vehicles against rare hazards, such as bad weather or children chasing a ball.

The difference with AI agents is that they tend to take shortcuts, which means they are not actually completing the task correctly. “Patronus is really good at finding hacks and holding models accountable,” Solomon said.
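One common kind of shortcut in software engineering tasks is an agent that makes a test suite "pass" by deleting the failing tests. The sketch below shows how a checker might flag that. It is an illustrative assumption, not a description of how Patronus detects hacks; the `RunReport` fields and the `verdict` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    """Hypothetical summary of one agent run on a coding task."""
    tests_before: set  # test names that existed before the agent ran
    tests_after: set   # test names that exist after the agent ran
    passed: set        # test names that passed after the agent ran


def verdict(report: RunReport) -> str:
    """Classify a run as 'pass', 'fail', or 'shortcut'."""
    removed = report.tests_before - report.tests_after
    if removed:
        # The agent deleted tests, so a green result cannot be trusted.
        return "shortcut"
    if report.tests_before <= report.passed:
        # Every original test still exists and passes.
        return "pass"
    return "fail"


honest = RunReport({"test_a", "test_b"}, {"test_a", "test_b"}, {"test_a", "test_b"})
hacked = RunReport({"test_a", "test_b"}, {"test_a"}, {"test_a"})
broken = RunReport({"test_a", "test_b"}, {"test_a", "test_b"}, {"test_a"})

print(verdict(honest), verdict(hacked), verdict(broken))
```

The point of the design is that the check compares the final state against the original task definition instead of relying on the agent's own report of success.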

Patronus currently offers simulated digital worlds for software engineering and finance, but according to Kannappan, this is just the beginning.

“Right now we’re focused on verifiable problems, so we’re dealing with problems that we can immediately identify and verify, but there are a lot more areas that are unverifiable or difficult to verify,” he said.

Just because these processes are verifiable doesn’t mean they’re simple. “We want to create an environment where we can actually run agents who can run for 10 hours, 10 days or 10 weeks,” Kannappan said.

As for rivals, Patronus believes it is primarily competing with internal teams built by AI labs to evaluate agent behavior. While human data companies like Mercor and Surge supply human-generated data for reinforcement learning to model builders, Patronus operates differently by assessing how agents behave without human intervention.
