Acing this new AI exam — which its creators say is the toughest in the world — might point to the first signs of AGI | AllMind AI News

Researchers from the Center for AI Safety and Scale AI published Humanity’s Last Exam (HLE), a PhD-level benchmark of 2,500 non-searchable questions across 100+ subjects intended to measure expert-level AI reasoning. As of Feb. 12, 2026, Google’s Gemini 3 Deep Think leads with a score of 48.4% (humans score ~90%), while earlier top performers included OpenAI’s o1 at 8.3%. The study stresses that high HLE accuracy would indicate expert-level performance on closed-ended questions, but not autonomous research capability or AGI. The test’s strict vetting and non-memorization design make it a new standard for assessing large models’ progress, but the authors warn that results aren’t definitive evidence of general intelligence.

Analysis

Market structure: The jump in HLE scores (Gemini at 48.4% vs. prior single-digit results) reinforces scale advantages for hyperscalers and vertically integrated chip/software vendors (GOOGL, NVDA, AMZN, MSFT). Expect higher infrastructure spend over the next 6–18 months: demand for leading-model training could lift GPU/cloud billings by ~20–50% YoY for early adopters, while compressing margins for smaller AI service providers and legacy software vendors.

Risk assessment: Key tail risks are regulatory clampdowns or liability events that could remove 5–20% of addressable monetization in the near term, and a sudden capex surge that inflates costs (e.g., GPU spot shortages). Immediate (days) market moves will likely be muted; the short term (weeks to months) will see M&A and speculation; over the long term (1–3 years), winners that control both model and stack can compound revenue at a 10–30% CAGR, but this requires sustained access to talent and datasets.
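
To put the long-term range in concrete terms, the short Python sketch below compounds a revenue index at the low and high ends of the quoted 10–30% CAGR over the 1–3 year horizon. The starting index of 100 and the helper name compound are illustrative assumptions, not figures or methods from the analysis.

```python
def compound(base: float, cagr: float, years: int) -> float:
    """Return base grown at a constant annual rate for the given number of years."""
    return base * (1.0 + cagr) ** years


# Hypothetical starting revenue index of 100 (an assumption for illustration).
# The 10% and 30% rates and the 1-3 year horizon are the range quoted in the text.
for years in (1, 2, 3):
    low = compound(100.0, 0.10, years)
    high = compound(100.0, 0.30, years)
    print(f"{years}y: {low:.1f} to {high:.1f}")
```

Under these assumptions, three years of compounding takes the index to roughly 133 at the low end and roughly 220 at the high end, which shows how wide the quoted range becomes over the full horizon.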

Trade implications: Favor equities tied to the compute and software stack; expect NVDA and GOOGL to exhibit asymmetric upside but also event-driven volatility. Use relative-value and option structures to capture headlines about rapid model improvement (buy LEAPS or call spreads; hedge with index tail puts), and rotate out of small-cap AI consultancies that lack IP or proprietary models.

Contrarian angles: Consensus equates HLE progress with near-term monetization, which is likely overstated: a monetization lag of 6–18 months and rising price competition can compress margins. The market may underprice regulatory and legal friction and overprice immediate cash flows from improved benchmarks. A historical parallel: the deep-learning breakthroughs of 2012 led to a multi-year maturation, not instant profit realization.
