Market Impact: 0.65

‘A complete accuracy collapse’: Apple throws cold water on the potential of AI reasoning – and it's a huge blow for the likes of OpenAI, Google, and Anthropic

AAPL, GOOGL, GOOG, DBX, ZM, MSFT, CSCO, SAS
Artificial Intelligence, Technology & Innovation, Company Fundamentals
Apple's research indicates that AI reasoning models from companies like OpenAI, Google, and Anthropic face significant limitations in solving complex logic problems, experiencing a complete accuracy collapse beyond a certain level of difficulty. The study challenges the prevailing narrative that these models are highly effective for tasks requiring human-like reasoning, and argues that current benchmarks are flawed and susceptible to data contamination. The findings cast doubt on the generalizability and real-world applicability of large reasoning models (LRMs) and may lead customers to question the capabilities of AI 'thinking' models.

Analysis

Apple's recent research paper, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," presents a significant challenge to the prevailing optimism surrounding the capabilities of advanced AI reasoning models. The study reveals that leading Large Reasoning Models (LRMs) from prominent developers, including OpenAI’s o1 and o3, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet, and Google’s latest Gemini, experience a "complete accuracy collapse" when confronted with logic puzzles exceeding a certain complexity threshold. This occurs even though the models handle low-complexity tasks efficiently.

Apple's researchers contend that current LRM benchmarks, which often center on coding and mathematical problems, are flawed: they are susceptible to data contamination and give researchers too little control over experimental variables. To counter this, Apple devised new puzzles emphasizing logical reasoning. These showed that beyond a specific complexity, model accuracy plummeted to zero, even when the solution algorithm was provided in the prompt, which suggests an inherent limitation. The models also exhibited 'overthinking' on simpler problems, wasting computational resources by continuing to work after they had found a solution.

These findings cast serious doubt on the generalizability of LRMs and on developers' claims that these models can 'think' or 'reason' like humans on complex tasks. They represent a substantial setback for the narrative that these AIs are ready for sophisticated enterprise applications. The research suggests a potential fundamental flaw in current LRM algorithms that causes them to lose focus or cease effective computation when challenges intensify. That could compel developers to re-evaluate their training methodologies and architectures, and may lead customers to question the true capabilities of these 'thinking' AI systems.