OpenAI on June 30 introduced GeneBench-Pro, a research-level benchmark designed to test how well AI agents can handle computational biology, not by recalling facts but by reasoning through ambiguous, noisy data the way working scientists do.
What the benchmark tests
GeneBench-Pro comprises 129 synthetic problems spanning genomics, quantitative biology, and translational medicine. Each problem pairs a deliberately noisy dataset with a specific downstream question, requiring an agent to choose the right analysis, revise assumptions mid-course, and judge whether results are reliable enough to act on. OpenAI calls this capacity “research taste.”
Problems are generated synthetically with known causal structures, which allows deterministic scoring without sacrificing complexity. External domain experts vetted 82 of the 129 problems for realism and solvability. Human experts estimated that each problem would take a person 20 to 40 hours to complete, or thousands of dollars in labor, while AI inference on the same task costs only a few dollars.
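To illustrate why a known causal structure makes deterministic scoring possible, here is a minimal Python sketch. It is not OpenAI's generator or grader, whose details have not been described here. The function names, the single confounder, the coefficients, and the 10% tolerance are all assumptions chosen for illustration.

```python
import random

def make_problem(seed, true_effect=2.0, n=2000):
    """Hypothetical generator: y depends on x and on a confounder z.

    Because true_effect is planted by the generator, the correct
    answer is known exactly, however noisy the data looks.
    """
    rng = random.Random(seed)  # seeded, so the dataset is reproducible
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # confounder
        x = 0.8 * z + rng.gauss(0, 1)            # exposure, driven partly by z
        y = true_effect * x + 1.5 * z + rng.gauss(0, 1)
        rows.append((x, z, y))
    return rows, true_effect

def score(answer, true_effect, rel_tol=0.10):
    """Deterministic pass/fail (assumed rule): within 10% of the planted effect."""
    return abs(answer - true_effect) <= rel_tol * abs(true_effect)

def naive_slope(rows):
    """Regress y on x alone, ignoring the confounder (a biased analysis)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    sxy = sum((r[0] - mx) * (r[2] - my) for r in rows)
    sxx = sum((r[0] - mx) ** 2 for r in rows)
    return sxy / sxx

def adjusted_slope(rows):
    """Regress y on x and z together; returns the coefficient on x."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    mz = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows)
    szz = sum((r[1] - mz) ** 2 for r in rows)
    sxz = sum((r[0] - mx) * (r[1] - mz) for r in rows)
    sxy = sum((r[0] - mx) * (r[2] - my) for r in rows)
    szy = sum((r[1] - mz) * (r[2] - my) for r in rows)
    # Closed-form solution of the two-predictor normal equations.
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)

if __name__ == "__main__":
    rows, truth = make_problem(seed=7)
    for name, fn in [("naive", naive_slope), ("adjusted", adjusted_slope)]:
        estimate = fn(rows)
        print(name, round(estimate, 2), "pass" if score(estimate, truth) else "fail")
```

In this toy setup the choice of analysis decides the outcome: ignoring the confounder overstates the effect and fails the check, while adjusting for it recovers a value close to the planted one. The grader never needs a human judgment call, because the right answer was fixed when the data was generated.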
Model results
GPT-5.6 Sol, OpenAI’s most capable current model, achieved a 28.7% pass rate at its highest reasoning setting and 31.5% in Pro mode. That represents a sharp rise from the original GeneBench, where GPT-5 scored below 5%. OpenAI acknowledged that even its best model remains far short of expert-level reliability on research-grade tasks, but said partial automation of lengthy analytical workflows could still deliver meaningful value.
To support independent evaluation, OpenAI is releasing ten representative problems on Hugging Face and providing a 50-question subset to Artificial Analysis for third-party benchmarking.