Apple’s latest research has delivered a stark challenge to the AI industry: leading AI models can lose as much as 65% of their accuracy on basic reasoning tests when irrelevant information is added to a problem. This finding directly contradicts OpenAI’s claims that their latest systems demonstrate genuine logical thinking. OpenAI’s o1 model achieves 83% accuracy on advanced mathematics competitions compared to just 13% for previous versions, yet Apple researchers argue this represents “sophisticated pattern matching” rather than true reasoning capabilities. The disagreement between two leading AI research organisations has triggered one of the most significant debates in contemporary artificial intelligence research, with implications extending far beyond academic circles into healthcare, education, and automated decision-making systems that millions rely on daily.
Apple’s challenge: exposing the fragility of machine reasoning
Apple’s research team developed GSM-Symbolic, a new benchmark that modifies the widely used GSM8K mathematical reasoning dataset by changing names and numbers and adding irrelevant information, to test whether models truly understand problems or simply pattern-match. The results were consistent and concerning across multiple leading models.
The most revealing example involves a simple maths problem about collecting fruit. When researchers added the irrelevant detail that “five of the kiwis were smaller than average,” multiple AI models, including OpenAI’s advanced systems, incorrectly subtracted these kiwis from the final count despite size having no bearing on the total number. This fundamental error highlights what Apple researchers describe as critical limitations in how these systems process information.
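The perturbation recipe is easy to picture in code. The sketch below builds clean and distractor-laden variants of a fruit-counting template in the spirit of GSM-Symbolic; the template wording, names, and no-op clause are illustrative stand-ins rather than items from Apple’s released dataset.

```python
import random

# Illustrative template in the spirit of GSM-Symbolic / GSM-NoOp
# (not taken from Apple's actual dataset).
TEMPLATE = (
    "{name} picks {x} kiwis on Friday and {y} kiwis on Saturday. "
    "On Sunday {name} picks {z} kiwis{noop}. "
    "How many kiwis does {name} have in total?"
)

NAMES = ["Oliver", "Sophie", "Liam"]
NOOP = ", but five of them are a bit smaller than average"  # irrelevant to the count

def make_variant(add_noop: bool) -> tuple[str, int]:
    """Build one problem variant and its ground-truth answer."""
    x, y, z = (random.randint(20, 60) for _ in range(3))
    question = TEMPLATE.format(
        name=random.choice(NAMES), x=x, y=y, z=z,
        noop=NOOP if add_noop else "",
    )
    # The no-op clause never changes the correct answer.
    return question, x + y + z

clean_q, clean_ans = make_variant(add_noop=False)
noop_q, noop_ans = make_variant(add_noop=True)
```

Comparing a model’s accuracy on the clean and no-op variants of the same template yields exactly the kind of gap the Apple team reports.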
“We found no evidence of formal reasoning in language models,” concluded the Apple research team led by Mehrdad Farajtabar. “Their behaviour is better explained by sophisticated pattern matching – so fragile, in fact, that changing names can alter results by approximately 10%”.
- Performance drops ranged from 17.5% for OpenAI’s o1-preview to 65% for Microsoft’s Phi-3 model when irrelevant clauses were added
- Models showed significant variance when only numerical values were changed, suggesting memorisation rather than understanding
- The study tested over 20 models including OpenAI’s o1 and GPT-4o, Google’s Gemma 2, and Meta’s Llama 3, all demonstrating similar fragilities

OpenAI’s response: defending the breakthrough
OpenAI maintains that their o1 model represents a genuine breakthrough in AI reasoning through reinforcement learning techniques that teach models to “think before they answer” using extended chain-of-thought processing. Their evidence centres on benchmark performance that rivals human experts in specialised domains.
On the American Invitational Mathematics Examination (AIME), designed to challenge the brightest high school mathematics students, o1 achieved 74% accuracy with a single attempt per problem and 93% when aggregating many sampled attempts, placing it amongst the top 500 students nationally. The model also demonstrated expert-level performance on GPQA Diamond, a benchmark requiring PhD-level knowledge in chemistry, physics, and biology.
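The jump from the single-attempt figure to the multiple-attempt figure comes from sampling the model many times and aggregating the candidate answers; OpenAI describes using consensus and re-ranking over sampled solutions. A minimal majority-vote aggregation might look like the sketch below, where `sample_answer` is a hypothetical callable standing in for one model call.

```python
from collections import Counter
from typing import Callable

def consensus_answer(
    sample_answer: Callable[[str], str],  # hypothetical: one model call -> one final answer
    problem: str,
    n_samples: int = 64,
) -> str:
    """Sample the model n_samples times and return the most common final answer."""
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```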
OpenAI researchers argue that the fragility identified by Apple can be addressed through improved prompting techniques, though they acknowledge this may require exponentially more contextual data for complex problems. This response highlights a key philosophical divide: whether true reasoning should be robust to distracting information without extensive guidance.
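In practice, such prompting fixes often amount to warning the model about distractors before it answers. The sketch below contrasts a bare prompt with a distractor-aware one; the instruction wording and the `query_model` helper are assumptions for illustration, not a technique documented by OpenAI.

```python
BARE = "{question}\nGive only the final number."

GUARDED = (
    "Some details in the problem may be irrelevant to the answer. "
    "Identify the quantities you actually need before calculating.\n"
    "{question}\nGive only the final number."
)

def robustness_gap(query_model, question: str, expected: str) -> dict:
    """Compare a bare prompt with a distractor-aware prompt on one question.

    `query_model` is a hypothetical callable (prompt string -> answer string).
    """
    return {
        "bare_correct": query_model(BARE.format(question=question)).strip() == expected,
        "guarded_correct": query_model(GUARDED.format(question=question)).strip() == expected,
    }
```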
The company emphasises that o1’s performance improves with both training time and inference time, suggesting scalable pathways towards more sophisticated reasoning capabilities. They point to examples where the model demonstrates multi-step problem-solving, error correction, and strategy refinement — behaviours that align with human-like reasoning processes.
The emergence debate: breakthrough or measurement artefact?
The reasoning controversy intersects with a broader debate about “emergent abilities” in large language models. Research from Google and others has documented how capabilities like few-shot learning and complex reasoning appear to emerge suddenly as models reach certain scales, but this phenomenon has faced significant scrutiny.
Stanford researchers published influential work titled “Are Emergent Abilities of Large Language Models a Mirage?”, arguing that apparent emergence results from poorly chosen evaluation metrics rather than genuine capability leaps. They found that over 92% of supposedly emergent abilities reported on BIG-Bench tasks appeared under discontinuous, all-or-nothing metrics such as exact string match, and that switching to continuous metrics eliminated the appearance of emergence.
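The metric argument can be reproduced on a toy example: score the same predictions with an all-or-nothing metric and with a continuous one. The digits below are invented purely to illustrate the scoring difference and are not drawn from the Stanford study.

```python
def exact_match(pred: str, target: str) -> float:
    """All-or-nothing metric: 1.0 only if every character is right."""
    return float(pred == target)

def per_token_accuracy(pred: str, target: str) -> float:
    """Continuous metric: fraction of positions predicted correctly."""
    return sum(p == t for p, t in zip(pred, target)) / max(len(target), 1)

# Hypothetical predictions from a weaker and a stronger model on a 5-digit sum.
target = "73514"
weak, strong = "73512", "73514"

print(exact_match(weak, target), exact_match(strong, target))                # 0.0 -> 1.0 (looks like a jump)
print(per_token_accuracy(weak, target), per_token_accuracy(strong, target))  # 0.8 -> 1.0 (gradual)
```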
AI researcher Melanie Mitchell, analysing recent studies, notes that “the distinction between artificial narrow intelligence and artificial general intelligence remains significant, with current LLMs landing firmly in the former category despite impressive performance improvements”.
- Critics argue that logarithmic scaling charts mask gradual improvements, making steady progress appear as sudden breakthroughs
- Proponents maintain that capabilities like zero-shot learning and creative problem-solving cannot be easily explained by simple pattern matching
- Some researchers propose that emergence may be real but continuous rather than discontinuous, with evaluation methods creating artificial thresholds
Expert perspectives: a divided scientific community
IBM researchers have provided detailed analysis supporting Apple’s position. “This paper has fundamentally proven that LLMs can’t reason,” states Ash Minhas, IBM Technical Content Manager, emphasising that current benchmark problems are flawed because models can solve them through pattern matching rather than actual reasoning.
Gary Marcus, a prominent AI researcher, views Apple’s findings as validation of long-standing concerns: “The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001”.
However, the scientific community remains divided. Some researchers argue that whilst current LLMs may not demonstrate formal logical reasoning, they exhibit “probabilistic, memorisation-influenced noisy reasoning” that still provides substantial practical value.
Independent research has demonstrated that LLMs struggle to defend their initial reasoning when challenged by invalid arguments, suggesting they may not grasp the fundamental principles underlying their responses. Studies found that “despite impressive performance at generating correct step-by-step solutions initially, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples when challenged by oftentimes absurdly invalid arguments”.
Figure: How leading AI researchers and institutions align on the spectrum of LLM reasoning capabilities.
Real-world implications: beyond academic debate
The reasoning debate carries significant implications for how AI systems are deployed in critical applications. Research on human-AI collaboration emphasises that “variability, prompt brittleness, and inconsistencies in LLM outputs across different conditions pose significant challenges for ensuring effective interaction with humans”.
Healthcare applications present particularly high stakes. Studies reveal that “users increase their trust in LLM responses when these are accompanied by explanations, even if the responses are deceptive,” highlighting the danger of overreliance on AI reasoning in medical diagnosis and treatment decisions.
Many researchers advocate for hybrid approaches that combine LLM capabilities with traditional rule-based systems. As one researcher noted: “Maybe the future is a combination of different compute types, some intuitive pattern recognition from LLMs, coupled with hard logical rules from traditional programming”.
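A minimal form of that hybrid lets the model propose a worked solution and then has deterministic code re-check every arithmetic step before the answer is accepted. The `propose_solution` callable and the simple `a op b = c` step format below are illustrative assumptions rather than a published design.

```python
import re

STEP = re.compile(r"^(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)$")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def verify_steps(steps: list[str]) -> bool:
    """Deterministically re-check each 'a op b = c' step emitted by the model."""
    for step in steps:
        m = STEP.match(step.strip())
        if not m:
            return False  # step not in the expected format -> reject
        a, op, b, c = m.groups()
        if OPS[op](int(a), int(b)) != int(c):
            return False  # arithmetic error -> reject
    return True

def hybrid_answer(propose_solution, question: str):
    """`propose_solution` is a hypothetical callable: question -> (steps, final_answer)."""
    steps, answer = propose_solution(question)
    return answer if verify_steps(steps) else None  # fall back to retry or human review
```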
- Financial applications require careful consideration of algorithmic reliability and regulatory compliance
- Educational systems must balance AI assistance with maintaining students’ critical thinking skills
- Legal and safety-critical applications may need external verification systems regardless of claimed reasoning capabilities
The evaluation challenge: measuring machine intelligence
At the heart of the reasoning debate lies a fundamental challenge: how to properly evaluate AI capabilities. Apple researchers point to widespread data contamination in existing benchmarks, noting that “the GSM-8K dataset is such an industry benchmark that there are bits and pieces of it all over the training data that all models know about”.
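Contamination of this kind is typically screened for with simple string statistics, for instance by checking whether long n-grams from a benchmark item also appear verbatim in the training corpus. The sketch below assumes the training data has already been reduced to a set of n-grams; the 13-token window is a common but arbitrary choice.

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """All n-token windows of a whitespace-tokenised string."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_ngrams: set[str], n: int = 13) -> bool:
    """Flag the item if any of its long n-grams appears verbatim in the training data."""
    return not ngrams(benchmark_item, n).isdisjoint(training_ngrams)
```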
Companies like Gretel AI have responded by developing synthetic datasets that “surpass the quality of both the OpenAI GSM8K and Apple GSM-Symbolic datasets” through advanced generation techniques. These efforts aim to create evaluation frameworks that can differentiate between memorisation and genuine understanding.
Critics of Apple’s methodology argue that their studies lack human control groups, noting that the sorts of changes implemented, such as altering word meanings and inserting distracting statements, “also increase the rate of humans making errors”. This raises important questions about whether the fragility Apple identifies is unique to artificial systems.
Melanie Mitchell highlights the definitional challenge: “Reasoning is one of those overburdened terms that can mean quite different things. The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems”.
Looking ahead: the future of AI reasoning research
The field is rapidly evolving with new approaches to inference-time scaling, reinforcement learning, and hybrid architectures that combine multiple reasoning strategies. Large Reasoning Models (LRMs) like OpenAI’s upcoming o3 and DeepSeek-R1 represent the latest attempts to address limitations identified in traditional language models.
Research directions include developing better evaluation frameworks, exploring neurosymbolic approaches that combine neural networks with formal logic systems, and investigating how to make reasoning more robust and interpretable.
As Mitchell observes: “If robust general-purpose reasoning abilities have emerged in LLMs, this bolsters the claim that such systems are an important step on the way to trustworthy general intelligence. On the other hand, if LLMs rely primarily on memorisation and pattern-matching rather than true reasoning, then they will not be generalisable”.
Apple researchers emphasise the broader stakes: “Understanding the true reasoning capabilities of LLMs is crucial for their use in real-world scenarios where accuracy and consistency are essential, specifically in AI safety, alignment, education, healthcare, and decision-making systems”.
Key takeaways: navigating the reasoning controversy
The Apple-OpenAI reasoning debate reflects deeper questions about the nature of intelligence itself. Whilst both sides present compelling evidence, the practical implications suggest a nuanced approach to AI deployment that acknowledges both capabilities and limitations.
Current evidence suggests that large language models excel at sophisticated pattern matching and can solve complex problems through learned associations, but struggle with genuine logical reasoning when faced with novel variations or distracting information. This doesn’t diminish their practical value but does inform how they should be integrated into critical systems.
As Gary Marcus concludes: “Nothing that I have read, verified, or done gives me any compelling reason to believe that LLMs do reasoning/planning, as normally understood. What they do instead, armed with web-scale training, is a form of universal approximate retrieval”. Yet this “approximate retrieval” has proven remarkably powerful for many applications, even if it falls short of human-like reasoning.
The debate continues to evolve as new models, evaluation methods, and research findings emerge. For practitioners and policymakers, the key is maintaining healthy scepticism whilst recognising the genuine capabilities these systems demonstrate, and preparing for a future where AI reasoning may be powerful but fundamentally different from human cognition.
Sources and references
Primary research papers
Apple GSM-Symbolic Research:
Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” arXiv preprint arXiv:2410.05229. Available at: https://arxiv.org/abs/2410.05229
OpenAI GSM8K Dataset:
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). “Training Verifiers to Solve Math Word Problems.” arXiv preprint arXiv:2110.14168. Available at: https://github.com/openai/grade-school-math
Emergent Abilities Research:
Schaeffer, R., Miranda, B., & Koyejo, S. (2023). “Are Emergent Abilities of Large Language Models a Mirage?” Advances in Neural Information Processing Systems. Available at: https://arxiv.org/abs/2304.15004
Human-AI Collaboration and LLM Explanation Research:
Findings from research on human-AI collaboration and LLM explanation reliability. Available at: https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2024.1464690/full
Official sources and documentation
Apple Machine Learning Research:
Official research page for GSM-Symbolic study. Available at: https://machinelearning.apple.com/research/gsm-symbolic
OpenAI o1 Model:
“Learning to reason with LLMs.” OpenAI Official Documentation. Available at: https://openai.com/index/learning-to-reason-with-llms/
AIME Competition:
Mathematical Association of America. “American Invitational Mathematics Examination.” Available at: https://www.maa.org/math-competitions/aime
GPQA Diamond Benchmark:
Graduate-level Google-Proof Q&A benchmark for testing expertise-level knowledge. Available at: https://github.com/idavidrein/gpqa
Expert analysis and commentary
Gary Marcus Analysis:
Marcus, G. (2024). “LLMs don’t do formal reasoning – and that is a HUGE problem.” Marcus on AI. Available at: https://garymarcus.substack.com/p/llms-dont-do-formal-reasoning-and
Melanie Mitchell Analysis:
Mitchell, M. (2024). “The LLM Reasoning Debate Heats Up.” AI Guide. Available at: https://aiguide.substack.com/p/the-llm-reasoning-debate-heats-up
IBM Research Analysis:
Minhas, A. (2024). “AI’s mathematical mirage: Apple study challenges notion of AI reasoning.” IBM Think. Available at: https://www.ibm.com/think/news/apple-llm-reasoning
Gretel AI Analysis:
Watson, A., Meyer, Y., Corneil, D., & Van Segbroeck, M. (2024). “GSM-Symbolic: Analyzing LLM Limitations in Mathematical Reasoning and Potential Solutions.” Gretel AI Blog. Available at: https://gretel.ai/blog/gsm-symbolic-analyzing-llm-limitations-in-mathematical-reasoning
Additional supporting research
BIG-Bench Evaluation:
Large-scale benchmark for testing language model capabilities. Available at: https://github.com/google/BIG-bench
LLM Reasoning Defense Study:
Research on LLM response defense capabilities and reasoning maintenance. Available at: https://aclanthology.org/2023.findings-emnlp.795/
Apple Future Research Direction:
Large Reasoning Models (LRMs) and future AI reasoning approaches. Available at: https://machinelearning.apple.com/research/illusion-of-thinking
Key quotes and attributions
Mehrdad Farajtabar (Apple Research Team Lead): “We found no evidence of formal reasoning in language models. Their behaviour is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by approximately 10%”
Ash Minhas (IBM Technical Content Manager): “This paper has fundamentally proven that LLMs can’t reason. They’re just pattern matching.”
Gary Marcus (AI Researcher): “Nothing that I have read, verified, or done gives me any compelling reason to believe that LLMs do reasoning/planning, as normally understood. What they do instead, armed with web-scale training, is a form of universal approximate retrieval.”
Melanie Mitchell (AI Researcher): “If robust general-purpose reasoning abilities have emerged in LLMs, this bolsters the claim that such systems are an important step on the way to trustworthy general intelligence. On the other hand, if LLMs rely primarily on memorisation and pattern-matching rather than true reasoning, then they will not be generalisable.”
Note: All links verified as of June 2025. This research represents the current state of the LLM reasoning debate and will continue to evolve as new studies and models are released.