AI Is Getting Better at Science. OpenAI Is Testing How Far It Can Go

OpenAI has published FrontierScience, a text-only benchmark of physics, chemistry and biology problems in two tiers: Olympiad (100 questions) and Research (60 questions). The results show rapid recent gains in LLM reasoning, with GPT-5.2 scoring 77.1% on the Olympiad tier and 25.3% on the Research tier. They also expose major evaluation limits (small sample sizes, no human baseline, and no way to test experimental work or image and video analysis) and high annotation costs, with benchmark construction relying on expert-data firms valued at over $10bn. The piece sets these results against concrete wins for narrow AI, e.g. DeepMind’s AlphaFold with ~200 million predicted protein structures and AI projects on plasma control and weather forecasting, and against mixed reactions from practitioners: real productivity gains in coding and math are offset by frequent hallucinations and a glut of low-quality AI-assisted submissions to journals. For investors, the implication is a credible path toward LLMs as research accelerants that could multiply scientific productivity, but significant reliability, benchmarking and deployment risks leave the timing and breadth of commercial impact uncertain.

Analysis

OpenAI's FrontierScience is a text-only benchmark for measuring scientific reasoning, split into two tiers: Olympiad (100 questions) and Research (60 questions). GPT-5.2 scored 77.1% on Olympiad and 25.3% on Research. Scores have improved rapidly over the past year, but the Research gain over GPT-5 is negligible. The paper and commentary stress severe evaluation limits: small sample sizes, no human-baseline comparison, no way to test experimental, image or video competencies, and the high cost of sourcing domain experts to write and grade questions. Expert-annotation firms such as Mercor and Surge AI (both cited as valued at over $10 billion) are integral to benchmark construction, which points to an ecosystem of third-party vendors that capture value while core model capabilities are still being validated. Real-world evidence is mixed. Narrow AI tools like DeepMind’s AlphaFold (≈200 million protein structures) and documented productivity gains in coding and math coexist with systemic risks from hallucinations and a flood of low-quality AI-assisted research submissions. As a result, the timing and breadth of commercial research-assistant adoption remain uncertain, and near-term market impact looks modest.
