Many safety assessments of AI models have significant limitations.

Today’s tests and benchmarks may fall short despite growing demands for AI safety and accountability, according to a new report.

Generative AI models that can analyze and output text, images, music, videos, and more are increasingly under scrutiny for their tendency to make mistakes and generally behave unpredictably. Now, organizations ranging from public sector agencies to big tech companies are proposing new benchmarks to test the safety of these models.

Late last year, startup Scale AI created a lab dedicated to assessing how well models align with safety guidelines. This month, NIST and the UK AI Safety Institute released a tool designed to assess model risk.

However, these model-probing tests and methods may be inadequate.

The Ada Lovelace Institute (ALI), a UK-based non-profit AI research organization, conducted a study in which researchers interviewed experts from academic labs, civil society, and model vendors, and audited recent research on AI safety assessments. The co-authors found that while current assessments can be useful, they are far from exhaustive, can be easily gamed, and do not necessarily indicate how models will behave in real-world scenarios.

“Whether it’s our smartphones, prescription drugs, or cars, we expect the products we use to be safe and reliable. In these areas, products undergo rigorous testing before they’re deployed to ensure they’re safe,” Elliot Jones, a senior researcher at ALI and co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety assessment, evaluate how assessments are currently being used, and explore their use as a tool for policymakers and regulators.”

Benchmarks and Red Teams

The study’s co-authors first examined the academic literature to establish an overview of the harms and risks posed by today’s models, as well as the state of existing AI model evaluation. They then interviewed 16 experts, including four from unnamed technology companies that develop generative AI systems.

This study found sharp disagreement within the AI industry about the most appropriate methods and classifications for model evaluation.

Some evaluations tested only how models aligned with benchmarks in the lab, not how those models might affect real-world users. Others drew on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using them in production anyway.

We’ve written about the problems with AI benchmarks before, and this study highlights all of these issues and more.

Experts cited in the study noted that it is difficult to extrapolate model performance from benchmark results, and it is unclear whether benchmarks can demonstrate that a model has a particular competency. For example, a model may perform well on the state bar exam, but that does not mean it can solve more open-ended legal questions.

Experts also pointed to the issue of data contamination, where benchmark results can overestimate a model’s performance if the model was trained on the same data on which it is being tested. In many cases, benchmarks are chosen by organizations for convenience and ease of use, not because they are the best tools for evaluation, the experts said.

“Benchmarks run the risk of being manipulated by developers who can train their models on the same datasets that will be used to evaluate them. This is like looking at a test paper before an exam, or strategically choosing which assessments to use,” Mahi Hardalupas, a researcher at ALI and co-author of the study, told TechCrunch. “It also matters what version of the model is being evaluated. Minor changes can change its behavior unpredictably and bypass built-in safety features.”

The ALI study also found problems with “red teaming,” the practice of tasking individuals or groups with “attacking” a model to identify vulnerabilities and flaws. Many companies, including AI startups OpenAI and Anthropic, use red teaming to evaluate their models, but there are few agreed-upon standards for the practice, making it difficult to judge the effectiveness of any given effort.

Experts told the study’s co-authors that it can be difficult to find people with the skills and expertise needed to build a red team, and the manual nature of red teaming makes it expensive and tedious. This poses a barrier for smaller organizations that don’t have the resources to do so.

Possible solutions

Pressure to release models quickly and a reluctance to conduct tests that could surface problems before release are the main reasons AI evaluations have not improved.

“One person we spoke to who worked at a company developing foundation models felt that the pressure to release models quickly within the company was increasing, making it harder to push back and take evaluations seriously,” Jones said. “The major AI labs are releasing models at a rate that is outpacing their own or society’s ability to ensure that the models are safe and reliable.”

One interviewee in the ALI study described evaluating models for safety as a “non-fixable” problem. So what hope do the industry and regulators have of finding a solution?

Hardalupas believes there is a way forward, but says it requires greater involvement from public sector entities.

“Regulators and policymakers need to be clear about what they want from assessments,” he said. “At the same time, the assessment community needs to be transparent about the current limitations and potential of assessments.”

Hardalupas suggests that governments implement measures to support an “ecosystem” of third-party testing, including mandates for greater public participation in assessment development and programs to ensure regular access to necessary models and datasets.

Jones believes it may be necessary to develop “contextual” evaluations that go beyond simply testing how a model responds to a prompt and instead look at the types of users a model might affect (such as people of a particular background, gender, or ethnicity) and the ways in which attacks on the model could defeat its safeguards.

“This will require investment in the basic science of assessment to develop more robust and repeatable assessments that are based on an understanding of how AI models work,” she added.

Even so, no evaluation can guarantee that a model is safe.

“As others have pointed out, ‘safe’ is not a property of a model,” Hardalupas said. “Determining whether a model is ‘safe’ requires understanding the context in which the model will be used, to whom it will be sold or made accessible, and whether the safeguards in place to mitigate those risks are adequate and robust. While assessments of foundation models can serve an exploratory purpose in identifying potential risks, they cannot guarantee that a model is safe, much less that it is ‘completely safe.’ Many of our interviewees agreed that assessments cannot prove that a model is safe, but can only indicate that a model is not safe.”