Market Impact: 0.6

When billion-dollar AIs break down over puzzles a child can do, it's time to rethink the hype | Gary Marcus

AAPL
Artificial Intelligence, Technology & Innovation, Product Launches, Company Fundamentals
An Apple research paper challenges the notion that large language models (LLMs) possess reliable reasoning abilities, demonstrating that models like ChatGPT and Claude struggle with complex tasks such as the Tower of Hanoi puzzle, despite being designed for reasoning. The paper reveals that LLMs excel at pattern recognition but falter when encountering novelty beyond their training data, suggesting that scaling these models may not solve their fundamental limitations. This finding implies that businesses cannot reliably deploy LLMs for complex problem-solving and that society should remain cautious about fully trusting generative AI outputs without human oversight.

Analysis

A research paper from Apple Inc. (AAPL) challenges the perceived reasoning capabilities of leading large language models (LLMs) such as ChatGPT, Claude, and DeepSeek. It finds that these models are proficient at pattern recognition but fail critically on novel tasks that require complex reasoning beyond their training data. The paper's empirical evidence includes the Tower of Hanoi puzzle, where models struggled with seven-disc scenarios (achieving less than 80% accuracy) and largely failed with eight discs, and the paper takes this as evidence that simply scaling these models does not resolve their fundamental limitations. This finding, which carries a "strongly negative" general sentiment (-0.8) and a market impact score of 0.6, suggests that current LLMs are not a direct path to artificial general intelligence (AGI) and cannot be reliably deployed for complex, autonomous problem-solving in business contexts. While the research is critical of the broader LLM field, the sentiment specifically for Apple (AAPL) is neutral (0.5), potentially indicating that the market views Apple's grounded assessment or its underlying R&D capabilities positively. The paper underscores that LLMs, while useful for tasks like coding and brainstorming with human oversight, are no substitute for well-specified conventional algorithms when complex computation and reasoning must be reliable.
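For context on the contrast with conventional algorithms, the Tower of Hanoi has a well-known recursive solution that is correct for any number of discs. The sketch below is illustrative only and is not taken from the Apple paper; the function names `hanoi` and `is_valid_solution` and the peg labels "A", "B", and "C" are assumptions chosen for this example.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield (from_peg, to_peg) moves that transfer n discs from source to target."""
    if n == 0:
        return
    # Move the top n-1 discs out of the way, onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest remaining disc directly to the target peg.
    yield (source, target)
    # Move the n-1 discs from the spare peg onto the target peg.
    yield from hanoi(n - 1, spare, target, source)


def is_valid_solution(n, moves):
    """Replay the moves and check that every move is legal and the puzzle ends solved."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # tried to move from an empty peg
        disc = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disc:
            return False  # tried to place a larger disc on a smaller one
        pegs[dst].append(disc)
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for discs in (7, 8):
        moves = list(hanoi(discs))
        print(discs, len(moves), is_valid_solution(discs, moves))
```

The recursion produces 2^n - 1 moves for n discs, so the seven- and eight-disc cases mentioned above need 127 and 255 moves respectively, and the checker confirms that each generated sequence is legal.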