Most Deployed AI Agents Have No Published Safety Evaluations

A multi-institution research team has found that the overwhelming majority of commercially deployed AI agents publish no safety evaluation results, according to the 2025 AI Agent Index, presented this week at the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2026).

The index was authored by researchers at MIT, Stanford, Harvard Law School, the University of Cambridge, the University of Washington, and partner institutions. It surveyed 30 prominent agents across three categories — 12 chat assistants, 5 browser agents, and 13 enterprise tools — coding 1,350 data points using public sources and developer correspondence.

A Pervasive Transparency Gap

The headline finding is stark: 25 of 30 agents disclose no internal safety evaluation results, and 23 of 30 have no publicly available third-party testing information. Of the 1,350 data fields, researchers could find no public information for 227. Safety-related fields fared the worst: 135 of 240 safety data points (56%) were blank. Enterprise agents were missing safety information in 66% of relevant fields; browser agents, in 60%.
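For readers who want to check the arithmetic, the reported counts can be turned into whole-number percentages with a few lines of Python. This is a minimal sketch, not the researchers' method: the dictionary labels and the `share` helper are illustrative names, and the counts are copied from the figures quoted above rather than from the index's own data.

```python
# Counts as quoted in this article (count, total); labels are illustrative,
# not field names from the index itself.
REPORTED = {
    "no internal safety evaluation results": (25, 30),
    "no public third-party testing information": (23, 30),
    "fields with no public information": (227, 1350),
    "blank safety fields": (135, 240),
}


def share(count: int, total: int) -> int:
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


# Percentage for each reported count.
SHARES = {label: share(count, total) for label, (count, total) in REPORTED.items()}

for label, pct in SHARES.items():
    print(f"{label}: {pct}%")
```

The last entry reproduces the 56% figure given for safety fields. The other three percentages are derived here from the quoted counts and do not appear in the index summary itself.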

Four agents did disclose agent-specific safety evaluations: ChatGPT Agent, OpenAI Codex, Claude Code, and Gemini 2.5 Computer Use. All are products of the three companies whose underlying models power nearly every other agent in the index.

Capabilities Publicized; Risks Less So

The researchers identify a revealing asymmetry. Nine of 30 agents publish capability benchmarks — results on coding tasks or GUI navigation — while rarely pairing them with corresponding safety data. Only 15 of 30 developers maintain any published AI safety framework, and 10 have no framework documentation at all. Just 3 of 30 agents support watermarking of AI-generated media.

A Concentrated, Hard-to-Audit Ecosystem

The index flags a structural concern: virtually all of the 30 agents are built on one of three foundation model families (GPT, Claude, or Gemini). A vulnerability in any of those base models could propagate across much of the deployed agent landscape simultaneously.

The researchers also note that accountability is diffuse by design. Because agents sit at the intersection of model providers, platform developers, and enterprise operators, it is often unclear who bears responsibility for a given failure.

Why It Matters Now

The findings arrive as AI agents are being integrated into legal research, healthcare workflows, enterprise software, and government operations — at autonomy levels where agents can book travel, execute code, or send emails without human confirmation at each step.

The authors call for standardized disclosure requirements comparable to those governing pharmaceutical trials or financial products, arguing that the current voluntary approach leaves users, organizations, and policymakers unable to assess deployment risk.

The full index is publicly available at aiagentindex.mit.edu.