Market Impact: 0.5

Why language models hallucinate

Artificial Intelligence, Technology & Innovation
OpenAI attributes the persistence of AI model hallucinations, in which a model confidently generates false information, primarily to current evaluation methods that reward guessing over acknowledging uncertainty. Under such scoring, models, including the company's latest GPT-5, are pushed to make confident errors rather than admit 'I don't know.' The company argues that a fundamental shift is required in how models are trained and benchmarked, and advocates metrics that penalize confident errors more severely than expressions of uncertainty in order to improve reliability and trustworthiness for critical applications.
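The incentive described above can be illustrated with a little arithmetic. The following is a minimal sketch, not OpenAI's actual scoring rule: it assumes a correct answer earns 1 point, an abstention earns 0, and a wrong answer costs a hypothetical `wrong_penalty`. The function names are illustrative only.

```python
def expected_guess_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the model is right with probability p_correct.

    Assumed scoring (illustrative): +1 if correct, -wrong_penalty if wrong.
    """
    return p_correct - wrong_penalty * (1.0 - p_correct)


def should_guess(p_correct: float, wrong_penalty: float) -> bool:
    """Guessing pays off only if it beats the assumed abstention score of 0."""
    return expected_guess_score(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence above which guessing beats abstaining.

    Solves p - wrong_penalty * (1 - p) = 0 for p.
    """
    return wrong_penalty / (1.0 + wrong_penalty)


# Accuracy-only scoring is the special case wrong_penalty = 0:
# even a 10%-confidence guess has positive expected score, so abstaining never wins.
print(should_guess(0.10, wrong_penalty=0.0))

# With a penalty of 1 point per wrong answer, the same guess loses on average.
print(should_guess(0.10, wrong_penalty=1.0))
```

Under these assumptions, a larger penalty raises the confidence a model needs before answering is worthwhile, which is the behavior the proposed metrics are meant to encourage.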

Analysis

OpenAI's research addresses a fundamental challenge in the development of Large Language Models (LLMs): the persistence of 'hallucinations,' or confidently stated falsehoods. It attributes that persistence not to a model's inherent capability but to the industry's standard evaluation procedures, which reward guessing because a guess can raise an accuracy score while an abstention cannot. A comparison on the SimpleQA eval quantifies the effect: the newer `gpt-5-thinking-mini` scored slightly lower on accuracy than the older `o4-mini` (22% versus 24%) but had a far lower error rate (26% versus 75%), because it abstained on 52% of questions where `o4-mini` abstained on 1%. As for where hallucinations come from, the paper argues that they arise from the statistical nature of pretraining on unlabeled text, which makes it difficult for models to distinguish arbitrary, low-frequency facts from patterned information. OpenAI proposes a systemic fix: updating the primary evaluation scoreboards so that confident errors are penalized more than expressions of uncertainty. The move away from accuracy-only leaderboards is presented as essential for developing more reliable and trustworthy AI. The research concludes that perfect accuracy is unattainable, but that hallucinations can be mitigated if models are trained and rewarded for acknowledging their own limits.
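To see how the choice of scoreboard changes the ranking, the SimpleQA figures quoted above can be run through two scoring rules. This is a rough sketch under stated assumptions: the `penalized` rule (one point lost per wrong answer) and all names in the code are hypothetical illustrations, not the metric OpenAI proposes. Only the percentages come from the comparison above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalResult:
    model: str
    correct_pct: int    # answered correctly
    wrong_pct: int      # answered incorrectly (confident errors)
    abstained_pct: int  # declined to answer


# SimpleQA figures as reported in the analysis above.
RESULTS = [
    EvalResult("o4-mini", correct_pct=24, wrong_pct=75, abstained_pct=1),
    EvalResult("gpt-5-thinking-mini", correct_pct=22, wrong_pct=26, abstained_pct=52),
]


def accuracy_only(r: EvalResult) -> float:
    """Conventional leaderboard score: wrong answers and abstentions both count as 0."""
    return float(r.correct_pct)


def penalized(r: EvalResult, wrong_penalty: float = 1.0) -> float:
    """Hypothetical score: each wrong answer costs wrong_penalty points; abstentions cost 0."""
    return r.correct_pct - wrong_penalty * r.wrong_pct


def rank(results: List[EvalResult], score: Callable[[EvalResult], float]) -> List[str]:
    """Model names ordered from best to worst under the given scoring function."""
    return [r.model for r in sorted(results, key=score, reverse=True)]


print(rank(RESULTS, accuracy_only))  # o4-mini ranks first (24 vs 22)
print(rank(RESULTS, penalized))      # gpt-5-thinking-mini ranks first (-4.0 vs -51.0)
```

The ordering flips as soon as wrong answers carry any meaningful cost, which is why the choice of primary metric matters more here than the two-point accuracy gap.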

Market Sentiment

Overall Sentiment: moderately positive

Sentiment Score: 0.40

Key Decisions for Investors

  • Investors should prioritize AI companies that demonstrate and report on hallucination rates and uncertainty calibration, as these metrics are better indicators of enterprise-readiness and long-term value than simple accuracy benchmarks.
  • The persistence of hallucinations, even in advanced models like GPT-5, represents a material risk for businesses deploying LLMs in critical functions. Investors should therefore assess the robustness of human-in-the-loop systems and validation processes at companies that build on or rely heavily on this technology.
  • Monitor the adoption of revised evaluation standards across the AI industry, as a shift towards rewarding model 'honesty' over raw performance could become a key competitive differentiator and signal a new phase of market maturity focused on reliability.