Elon Musk's New Grok 4 Takes on ‘Humanity’s Last Exam’ as the AI Race Heats Up

Elon Musk's xAI has launched Grok 4, positioning it as the "smartest AI in the world" with claims of high academic benchmark performance and new capabilities, including audio and future multimedia applications. Independent verification confirms Grok 4's top standing on the ARC-AGI leaderboards and its strong STEM capabilities, but its Humanity's Last Exam (HLE) results remain unverified on public leaderboards. The model also faces limitations, including a small context window and reported biases reflecting Musk's views on sensitive topics, and it inherits the controversies of its predecessor, Grok 3. Priced from $30 to $300 per month, Grok 4 enters the competitive AI landscape with ambitious performance claims but also notable technical and ethical considerations.

Analysis

Elon Musk's xAI has launched Grok 4, positioning it as a direct and potent competitor to established models from Google and OpenAI. The company claims superior performance on the 'Humanity’s Last Exam' (HLE) benchmark, with its Grok 4 Heavy model reportedly scoring 44.4% compared to Google Gemini-Pro's 26.9%. However, these specific HLE results are self-reported and remain unverified on the official public leaderboard, a fact underscored by a prediction market giving a mere 1% chance of Grok 4 achieving a 45% score within a month.

Despite this, Grok 4's capabilities are substantiated in other areas; it holds the top verified rank on the ARC-AGI-1 and ARC-AGI-2 leaderboards for general intelligence and is recognized for strong performance in STEM subjects. This competitive pressure is reflected in the negative sentiment assigned to Alphabet (GOOGL). Commercially, the model is priced from $30 to $300 per month, targeting both consumer and professional users.

Nevertheless, significant operational and reputational risks persist. Expert testing reveals critical limitations, including a small context window that impairs its ability to handle large codebases or documents, and weak multimodal capabilities. Furthermore, the model reportedly defaults to Musk's personal views on sensitive political topics, and its predecessor, Grok 3, faced public controversies over harmful outputs, creating potential hurdles for enterprise adoption.
