AI Is Not AGI: 100% of Today’s Models Fail the Hardest Real Algorithmic Coding Tests
Behind the code lies a void: Why today’s AIs can’t navigate the logic they imitate.
In recent months, a familiar story has echoed across tech media: GPT-4o, Claude, Gemini — they’re not just answering trivia anymore, they’re beating top humans at coding. Headlines claim that these models now rival, or even surpass, elite competitive programmers. But a new benchmark tells a very different story.
LiveCodeBench Pro, released last week, is a rigorously designed evaluation that pits today’s best language models against real-world algorithmic programming challenges — the kind used in Codeforces, ICPC, and IOI contests. Not toy examples, not textbook exercises, and certainly not the curated prompts in HumanEval or LeetCode.
The result? Every single model failed the hardest problems. Even the best-performing model — OpenAI’s o4-mini-high — solved 0% of hard tasks. No exceptions. No workarounds. AGI this is not.
1. The Benchmark: Finally, a Real Test
LiveCodeBench Pro is more than just a set of coding questions. It’s a carefully constructed benchmark with:
584 problems from top-tier competitions (Codeforces, ICPC, IOI)
Strict curation to avoid data contamination (problems are added before solutions appear online, so they can’t leak into training data)
Full test coverage, real-time updates, and expert annotation by Olympiad medalists
Each problem is tagged not just by algorithmic domain (e.g., dynamic programming, graph theory), but also by cognitive category:
Knowledge-heavy (template-based, brute-force solvable)
Logic-heavy (requires stepwise derivation, state-space modeling)
Observation-heavy (requires creative insight, “aha” moments)
This benchmark isn’t asking “Can a model write syntactically valid Python?” It’s asking: Can it think algorithmically under pressure, like a human would in a real contest?
2. The Outcome: Total Failure on Hard Problems
The top-performing models (o4-mini-high, Gemini 2.5 Pro, Claude 3.7 Sonnet) handle easy, template-friendly problems reasonably well, with success rates reaching roughly 83% for the best of them. But as difficulty rises, their success rates collapse:
o4-mini-high
Hard: 0.0% – Medium: 53.5% – Easy: 83.1%
Elo: 2116 | Cost: $0.10
Gemini 2.5 Pro
Hard: 0.0% – Medium: 25.4% – Easy: 70.4%
Elo: 1992 | Cost: $0.30
DeepSeek R1
Hard: 0.0% – Medium: 9.9% – Easy: 56.3%
Elo: 1442 | Cost: $0.04
Claude 3.7 Sonnet (Reasoning)
Hard: 0.0% – Medium: 1.4% – Easy: 36.6%
Elo: 992 | Cost: $0.29
GPT-4.1
Hard: 0.0% – Medium: 0.0% – Easy: 23.9%
Elo: 889 | Cost: $0.02
The hard tier — problems rated above Elo 3000 — mirrors the kind of thinking only the top 0.1% of competitive coders can manage. And yet, all models without exception failed completely.
Zero. Not one successful pass.
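To put that rating gap in perspective: under the standard Elo expected-score formula that contest-style ratings are built on, a 2116-rated solver facing a 3000-rated problem is a massive underdog. A back-of-the-envelope sketch, using the numbers from the table above (illustrative only; the benchmark’s exact rating model may differ):

```python
# Back-of-the-envelope illustration using the standard Elo expected-score formula.
# Numbers come from the results above; the benchmark's exact rating model may differ.

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against an opponent (or problem) rated B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

model_elo = 2116     # o4-mini-high, the strongest model in the table above
problem_elo = 3000   # lower bound of the "hard" tier

print(f"{elo_expected_score(model_elo, problem_elo):.1%}")
# ~0.6% expected solve rate per hard problem, so a 0% observed score is no surprise.
```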
3. What the Models Actually Do Well
To be fair: LLMs do shine in narrow areas.
Implementation tasks? Excellent.
Template-heavy problems? Solid.
Problems that map to known algorithms? They can stitch them together with remarkable fluency.
Why? Because these models excel at recall and recombination. When the solution pattern already exists in the training corpus — and can be retrieved or approximated — they perform well.
In this sense, they’re impressive pattern repeaters, not problem solvers.
4. What They Consistently Fail At
But outside those comfort zones, their limits become obvious — and consistent.
They struggle with:
Observation-heavy tasks (where one key insight collapses the solution space)
Interactive problems (that require back-and-forth state management)
Case analysis (handling edge conditions, fallthrough logic)
Abstract modeling (building an algorithm from first principles)
Even more telling: many failures happened on the sample inputs. That is, the generated code couldn’t even pass the example cases given in the problem statement, a check every human contestant runs before submitting.
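That check is trivial to automate. A minimal sketch of what “run it on the samples first” looks like in practice, assuming the sample cases are saved as hypothetical *.in / *.out files next to the solution:

```python
# Minimal sketch: verify a candidate solution against the problem's sample cases
# before submitting. The file layout (samples/*.in, samples/*.out) is hypothetical.
import subprocess
from pathlib import Path

def passes_samples(solution: str, sample_dir: str = "samples") -> bool:
    for case in sorted(Path(sample_dir).glob("*.in")):
        expected = case.with_suffix(".out").read_text().strip()
        result = subprocess.run(
            ["python", solution],
            input=case.read_text(),
            capture_output=True, text=True, timeout=5,
        )
        if result.stdout.strip() != expected:
            print(f"Sample {case.name} failed")
            return False
    return True

if __name__ == "__main__":
    print("OK to submit" if passes_samples("solution.py") else "Do not submit")
```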
5. The Dangerous Illusion of “Confidence”
One of the study’s most striking findings is how the models fail.
They don’t crash. They don’t refuse. They don’t hedge.
They return confident, plausible, well-structured code — that’s just totally wrong.
The researchers describe it as:
“Confidently wrong and logically invalid.”
This is what makes LLMs dangerous in professional contexts: their output looks right. It feels like someone solved the problem. But under the surface, it’s pure hallucination — no real reasoning, no true understanding.
If a human programmer submitted code like this, you'd say: “This person knows how to write code, but not how to solve problems.”
6. Chain-of-Thought Doesn't Save Them
A natural question: does reasoning mode (step-by-step thinking, CoT prompting) help?
Yes, in some places. Models showed significant gains in logic-heavy domains such as combinatorics and dynamic programming.
But on observation-heavy problems, where human insight is key, reasoning adds little to nothing. In some cases, it even hurts.
Reasoning is not understanding. It’s just longer output.
The benchmark shows that even with step-by-step formats, models are often just guessing — more verbosely.
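For readers who haven’t seen the technique, chain-of-thought prompting simply asks the model to write out its intermediate reasoning before the final answer. The two prompt styles being compared look roughly like this (illustrative wording only, not the benchmark’s actual prompts):

```python
# Illustrative prompt templates only; LiveCodeBench Pro's real prompts may differ.

DIRECT_PROMPT = (
    "Solve the following competitive programming problem.\n"
    "Output only the final code.\n\n"
    "{problem_statement}"
)

CHAIN_OF_THOUGHT_PROMPT = (
    "Solve the following competitive programming problem.\n"
    "First reason step by step: restate the constraints, derive the key\n"
    "observations, and outline the algorithm. Then write the final code.\n\n"
    "{problem_statement}"
)
```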
7. Multiple Attempts Help — But Not Enough
OpenAI has claimed Elo ratings above 2700 for o4-mini under real-world conditions with tool access and multiple attempts. The benchmark tests this directly.
With multiple attempts (pass@k), performance does improve, especially on medium-difficulty tasks. But even at pass@10, success on hard problems remains 0%.
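For reference, pass@k is typically computed with the unbiased estimator from the original HumanEval paper: generate n candidate solutions per problem, count how many (c) pass, and estimate the chance that at least one of k randomly chosen candidates would have passed. A small sketch, assuming the benchmark uses that standard definition:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021), shown for reference.
# Assumes LiveCodeBench Pro reports pass@k in this conventional way.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct,
    given c correct samples out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidates generated per problem, 2 of them correct.
print(pass_at_k(10, 2, 1))   # 0.2   -> pass@1
print(pass_at_k(10, 2, 5))   # ~0.78 -> pass@5
```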
Tool access (e.g., terminals, web search) adds more gains, but only when used effectively — and most models aren’t yet doing that reliably.
So yes, volume helps. But it's not solving the core problem.
8. The Deeper Truth: LLMs Don’t Know What They’re Doing
The most important insight from this benchmark isn’t about scores. It’s structural.
These models are not thinkers. They are:
Syntax-aware
Pattern-rich
Context-sensitive
…but ultimately shallow.
They lack:
Internal verification
Mental modeling
Conceptual understanding
Real error correction
What they generate is not grounded in an internal logic engine. It's grounded in surface-level associations. They don't "solve" problems. They approximate what a solution might look like.
Conclusion: AGI Is Still Miles Away
If your notion of AGI includes the ability to reason, model, adapt, and solve new problems, then this benchmark is a reality check.
Despite all the noise, the best models today:
Can’t solve problems the best high schoolers can
Fail basic logical consistency
Hallucinate structure in the absence of understanding
That doesn’t mean LLMs are useless. Far from it. They’re phenomenal assistants, code synthesizers, template matchers.
But they’re not general problem solvers. And they’re certainly not on the verge of replacing creative human programmers.
LiveCodeBench Pro is more than a benchmark. It's a mirror — showing us how far we've come, and how far we have left to go.