Andon Labs' experiment integrating leading LLMs (e.g., Gemini 2.5 Pro, Claude Opus 4.1) into a vacuum robot for embodied AI tasks revealed significant limitations, with top models achieving only 37-40% accuracy in basic object retrieval. The study found that generic LLMs outperformed Google's robot-specific Gemini ER 1.5, and it also exposed safety concerns such as potential data leakage and navigation failures. In one notable incident, Claude Sonnet 3.5 entered a 'doom spiral' when its battery ran low, further illustrating current deficiencies. These results indicate that, although firms like Figure and Google DeepMind continue to invest in LLMs and already use them for robotic orchestration, substantial development is still needed before LLMs can reliably power autonomous robotic systems.
Andon Labs' recent experiment integrating state-of-the-art LLMs into a vacuum robot revealed significant limitations in embodied AI capabilities: the leading models, Gemini 2.5 Pro and Claude Opus 4.1, achieved only 40% and 37% accuracy, respectively, in basic object retrieval tasks. These scores support the researchers' conclusion that current LLMs are not yet ready for robust robotic applications, despite substantial investment. Generic LLMs also outperformed Google's robot-specific Gemini ER 1.5, suggesting that specialized model development has yet to deliver an advantage on such tasks. Operational challenges were starkly highlighted by Claude Sonnet 3.5's "doom spiral" when facing a low battery, which exposed critical flaws in autonomous decision-making and error handling. The study also identified serious safety concerns, including the potential for LLMs to reveal classified information and recurring physical navigation failures, such as robots falling down stairs. These issues underscore the immaturity of current LLM integration for physical systems. Despite these deficiencies, firms like Figure and Google DeepMind are already using LLMs for robotic "orchestration," in which the LLM handles high-level decision-making while other algorithms manage physical execution. The findings, together with the moderately negative overall sentiment on LLM readiness for robotics, indicate that substantial developmental work is still required to bridge the gap between advanced language models and reliable, safe autonomous robotic systems.
Overall Sentiment
moderately negative
Sentiment Score
-0.45