To Think or Not to Think? Apple’s Take on AI Reasoning — and Where It Fails
A closer look at “The Illusion of Thinking” — and what it really reveals about the limits of reasoning in large language models
Abstract
Apple published a study claiming that modern “reasoning models” fail catastrophically once task complexity increases. A counter-paper accuses them of flawed evaluation and misleading conclusions. Meanwhile, two simple logic problems — one about siblings, one about apples — reveal that the real problem isn’t about how much AI thinks. It’s about knowing when to stop.
1. Apple’s Paper: The Illusion of Thinking
Apple’s team introduces a structured evaluation of AI reasoning by constructing four synthetic tasks:
The Four Task Environments:
Tower of Hanoi
A classic recursive puzzle. Move stacked disks from one rod to another under two constraints: only one disk at a time, and no larger disk may sit on a smaller one.
→ Problem depth increases exponentially with the number of disks (N). The optimal move count is 2^N − 1.
River Crossing
Several agents (e.g. wolf, goat, cabbage) must cross a river in a limited-capacity boat, under safety constraints (e.g. goat can’t be left alone with the cabbage).
→ Though the number of steps grows slowly, the state space is highly irregular and semantically complex: one wrong step leads to invalid or unsolvable states (see the sketch after this list).
Blocks World
Classic planning setup where blocks must be rearranged into a target configuration using a robot arm, subject to stacking/movement rules.
→ Moderate planning depth, high combinatorial complexity.
Checker Jumping
A grid-based logical game where checkers must jump over each other to reach a target configuration.
→ Sparse solutions, requires pathfinding under nontrivial constraints.
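To make the River Crossing point concrete, here is a minimal sketch (my own illustration, not Apple’s evaluation harness; the names ITEMS and bank_is_safe are mine) of the classic wolf/goat/cabbage instance. It checks which opening moves leave a legal state: every first move except ferrying the goat already violates a constraint.

```python
ITEMS = {"wolf", "goat", "cabbage"}

def bank_is_safe(bank):
    """A bank the farmer has just left must not pair the goat with the wolf or the cabbage."""
    return not ({"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank)

# Opening move: the farmer ferries one item (or nothing) across,
# leaving the rest unattended on the starting bank.
for cargo in (None, "wolf", "goat", "cabbage"):
    left_behind = ITEMS - {cargo} if cargo else set(ITEMS)
    verdict = "safe" if bank_is_safe(left_behind) else "invalid"
    print(f"take {cargo}: {verdict}")
# Only 'take goat' survives; every other first move breaks a constraint immediately.
```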
What Apple Wanted to Know
Their central question: Do reasoning-enhanced LLMs (those that generate explicit multi-step CoT outputs) actually reason better — or just sound smarter?
To test this, they compare:
Standard LLMs (outputting final answers directly)
Reasoning models (LRMs), which are made to spell out intermediate Chain-of-Thought steps
Crucially, they normalize token budgets. CoT responses are longer by design, so to keep comparisons fair they fix the inference compute rather than just asking “Who got it right?”
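As a rough sketch of what that comparison looks like in practice (my own illustration, not Apple’s code; `generate`, `tasks`, and `check` are placeholders for whatever inference call and grading function are available):

```python
def evaluate(generate, tasks, max_tokens, use_cot):
    """Score one prompting setup on the same tasks under an identical output-token budget.

    `generate(prompt, max_tokens)` is an injected stand-in for a model call;
    `tasks` is a list of (prompt, check) pairs where `check` grades the answer.
    """
    correct = 0
    for prompt, check in tasks:
        suffix = "\nThink step by step." if use_cot else "\nGive only the final answer."
        correct += bool(check(generate(prompt + suffix, max_tokens=max_tokens)))
    return correct / len(tasks)

# Both variants draw on the same max_tokens budget, so a longer chain of thought
# is paid for out of the same allowance instead of getting extra room for free.
```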
What They Found — and Why It Matters
1. Low Complexity Tasks (e.g. Hanoi with 3 disks)
Standard LLMs outperform LRMs. Why?
The tasks are trivial enough that CoT is overkill.
CoT introduces unnecessary intermediate steps that dilute accuracy.
In some cases, correct early steps are overwritten by later, erroneous overthinking.
→ Overthinking hurts more than it helps.
2. Medium Complexity (e.g. Hanoi with 5–7 disks, nontrivial River Crossing)
Here, CoT starts to shine:
LRMs can plan incrementally and correct earlier mistakes.
The longer token horizon helps hold structured thought in place.
In many cases, reasoning steps directly mirror the optimal algorithmic plan.
→ Structured thought is a net gain.
3. High Complexity (e.g. Hanoi with ≥9 disks, River Crossing with >5 agents)
All models fail. But not gracefully:
Accuracy collapses to near zero, even with correct prompts.
Models start producing fewer reasoning tokens, not because they hit token limits but because reasoning effort itself collapses as complexity rises: they “give up”.
Adding explicit pseudocode or algorithm hints does not help. The failure is not knowledge-based — it’s procedural.
→ The illusion: more thinking ≠ better performance. There is a limit beyond which the models lose grip — not on facts, but on strategy.
Why This Matters
Apple's contribution isn’t just another benchmark. Their setup reveals how reasoning behavior changes across problem depth. It also undermines the popular narrative that Chain-of-Thought inherently improves performance. It does — but only in the narrow band between triviality and chaos.
The key failure mode is this:
LLMs are good at thinking within structure — but bad at deciding how to structure their thinking.
That’s the deeper “illusion” Apple reveals.
2. The Counterpoint: The Illusion of the Illusion
A response paper challenges the validity of Apple’s conclusions. Its main critiques:
Token budget violations: Some Apple tasks (e.g. Hanoi with N>8) exceed output token limits. The models didn’t fail at reasoning — they ran out of room.
Bad grading: Models that say “pattern continues, stopping early” get marked wrong, even if the logic was correct.
Unsatisfiable tasks: Some of the River Crossing scenarios Apple used are mathematically unsolvable, yet models were still penalized for failing to solve them.
The counter-researchers also show that if you let a model write a small recursive program instead of enumerating every move, it solves Hanoi with 15 disks just fine.
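That kind of program is tiny. A minimal sketch (my own illustration, not the counter-paper’s exact code) shows why: the full optimal plan for 15 disks fits in a few recursive lines, even though writing it out move by move would blow past any output window.

```python
def hanoi(n, src, aux, dst, moves):
    """Append the optimal move sequence for n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)  # park the top n-1 disks on the spare rod
    moves.append((src, dst))            # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks back on top of it

moves = []
hanoi(15, "A", "B", "C", moves)
print(len(moves))  # 32767 == 2**15 - 1: the plan exists; spelling it out move by
                   # move is what doesn't fit in a model's output budget
```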
3. Two Real-World Examples That Reveal More Than Either Paper
A. The “Sally” problem
Sally has three brothers. Each brother has two sisters. How many sisters does Sally have?
Naive LLMs say: 2.
CoT-enabled models say: 1, which is correct — Sally is one of the sisters. There's only one other.
➡ Here, structured reasoning clears up a reference ambiguity. CoT works exactly as intended. This directly contradicts Apple’s blanket claim that CoT hurts performance in low-complexity settings.
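The disambiguation can be written down in three lines of arithmetic (a sketch of the intended reading, not of any model’s trace): the brothers’ sisters are all the girls in the family, and Sally is one of them.

```python
sisters_per_brother = 2                # given: each brother has two sisters
girls_in_family = sisters_per_brother  # the brothers' sisters are exactly the girls
sallys_sisters = girls_in_family - 1   # Sally doesn't count herself
print(sallys_sisters)                  # -> 1
```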
B. The apple ranking trap
Alice has twice as many apples as Bob. Charlie has half as many as Alice. Dana has as many as Bob and Charlie combined. Who has the second most apples?
Math-wise:
Bob = B
Alice = 2B
Charlie = B
Dana = 2B
=> Alice and Dana tie for first. Bob and Charlie tie for second.
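A quick check makes the tie unambiguous (my own sketch; any positive B gives the same ordering):

```python
from collections import defaultdict

B = 1  # Bob's count; only the ratios matter
apples = {"Alice": 2 * B, "Bob": B, "Charlie": (2 * B) // 2, "Dana": B + (2 * B) // 2}

# Group people by count so ties show up as shared ranks.
by_count = defaultdict(list)
for name, count in apples.items():
    by_count[count].append(name)

for rank, count in enumerate(sorted(by_count, reverse=True), start=1):
    print(rank, by_count[count])
# 1 ['Alice', 'Dana']
# 2 ['Bob', 'Charlie']
```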
What does the model do? It calculates correctly — then spirals. It can’t decide whether a tie is allowed. It starts questioning the framing. It re-runs its logic. 200 tokens later, it’s still unsure what the question “really wants.”
➡ This is not a failure of logic. It’s a failure of semantic confidence.
4. What Actually Fails? Not Reasoning — But Reasoning Control
From both papers and both examples, a deeper pattern emerges:
Reasoning fails not only with depth (e.g. Hanoi), but also with semantic instability (e.g. apple rankings).
Chain-of-Thought is neither good nor bad. It helps when disambiguation is needed. It backfires when meta-uncertainty takes over.
The true failure mode is not in the model’s ability to compute — but in its inability to stop. It doubts. It loops. It overcorrects.
Neither paper truly isolates this failure type. Apple focuses on plan-length and task depth. The counter-paper focuses on token constraints and fairness. Both miss the elephant in the room: semantic control.
5. The Truth
Today's large language models don’t “think” in any meaningful human sense. But even within their current abilities, their real weakness isn’t a lack of reasoning — it’s a lack of judgment about when the reasoning has gone off track.
Apple showed symptoms. The rebuttal exposed setup flaws. But real-world tasks — like the ones above — show that even with perfect inputs, models often either think too little, or think themselves into a hole.
Until models can reliably ask themselves: “Am I still answering the question?” — we're going to keep seeing breakdowns.
To think or not to think?
That’s no longer the question.
The real challenge is knowing when to stop.