Market Impact: 0.35

AI Is Getting Better at Science. OpenAI Is Testing How Far It Can Go

GOOGL, GOOG
Artificial Intelligence · Technology & Innovation · Healthcare & Biotech

OpenAI published FrontierScience, a benchmark of text-only physics, chemistry and biology problems in two tiers (Olympiad: 100 questions; Research: 60 questions). The results show rapid recent gains in LLM reasoning: GPT-5.2 scored 77.1% on the Olympiad tier and 25.3% on Research. They also expose major evaluation limits (small sample sizes, no human baseline, no way to test experimental work or image and video analysis) and high annotation costs, since benchmark construction relies on expert-data firms valued at over $10bn. The piece sets these results against concrete narrow AI wins (e.g., DeepMind’s AlphaFold, with ~200 million predicted protein structures, and AI projects on plasma control and weather forecasting) and mixed practitioner reactions: real productivity gains in coding and math, counterbalanced by frequent hallucinations and a glut of low-quality AI-assisted submissions to journals. For investors, the implication is a credible pathway toward LLMs as research accelerants that could multiply scientific productivity, but with significant reliability, benchmarking and deployment risks that leave the timing and breadth of commercial impact uncertain.

Analysis

OpenAI published FrontierScience, a text-only benchmark with two tiers, Olympiad (100 questions) and Research (60 questions), to measure scientific reasoning. GPT-5.2 scored 77.1% on Olympiad and 25.3% on Research; improvements over the past year were rapid overall, but Research-tier gains versus GPT-5 were negligible. The paper and commentary stress severe evaluation limits: small sample sizes, no human-baseline comparison, no way to test experimental, image or video competencies, and the high cost of sourcing domain experts for question creation and grading. Expert-annotation firms such as Mercor and Surge AI (each cited as valued at over $10 billion) are integral to benchmark construction, which points to an ecosystem of third-party vendors that capture value even while core model capabilities are still being validated. Real-world evidence is mixed. Narrow AI tools like DeepMind’s AlphaFold (≈200 million protein structures) and documented productivity gains in coding and math coexist with systemic risks from hallucinations and a flood of low-quality AI-assisted research submissions. That leaves the timing and breadth of commercial research-assistant adoption uncertain, and market impact modest in the near term.
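
To make the small-sample caveat concrete, the sketch below estimates how much statistical uncertainty tier sizes of 100 and 60 questions leave around the reported scores. It is a rough illustration, not the benchmark's own methodology: it assumes each score can be treated as a simple pass rate over independent questions, which may not hold (the reported percentages are not whole-number fractions of 100 or 60, so they may be averages over several runs). The function name `wilson_interval` and the 95% level are choices made for this example.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a proportion p_hat seen on n items.

    z = 1.96 corresponds to roughly 95% coverage.
    ASSUMPTION: items are independent and the score is a plain pass rate.
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Scores and tier sizes as reported in the article.
scores = {"Olympiad": (0.771, 100), "Research": (0.253, 60)}

for tier, (p, n) in scores.items():
    lo, hi = wilson_interval(p, n)
    print(f"{tier}: {p:.1%} on {n} questions -> roughly {lo:.1%} to {hi:.1%}")
```

Under these assumptions the intervals span on the order of 8 percentage points either side of the Olympiad score and about 11 either side of the Research score, so a version-to-version change of a few points on the Research tier would be hard to tell apart from noise.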

Market Sentiment

Overall Sentiment: mixed

Sentiment Score: 0.05

Ticker Sentiment

GOOG: 0.40
GOOGL: 0.40

Key Decisions for Investors

  • Consider selective, staged exposure to large AI platform leaders such as GOOGL/GOOG that own proven narrow scientific assets and infrastructure rather than betting on broad, unproven research-assistant claims
  • Avoid committing large capital to pure-play startups that market LLMs as end-to-end scientific discovery tools until Research-tier performance, multimodal benchmarks, and independent human baselines materially improve
  • Monitor leading indicators—Research-tier score progression toward parity, expansion to multimodal evaluation, independent human baselines, and credible commercial case studies—and use those data points to scale exposure
  • Watch the expert-annotation and data-infrastructure vendors as potential beneficiaries of benchmarking demand, and manage position sizing to hedge against near-term reliability, reproducibility and regulatory risks