GPT-5: Intelligence in the Fog
Brilliant via API – flat on ChatGPT. What you get is decided by an invisible router. And it often seems to be way off.
The launch of GPT-5 could have been a milestone in AI development. Instead, we’re witnessing a precedent for what happens when a product promise and the actual user experience diverge. The problem: what is sold as “GPT-5” is in truth a hodgepodge of submodels with varying levels of performance, selected via routing logic according to criteria that are not transparent to the user. The consequence: the new model may shine technically, but often delivers alarmingly shallow or even factually grotesque answers—without the user even knowing what they’re dealing with.
GPT-5: A family, not a model
OpenAI presents GPT-5 as an evolutionary leap. In reality, the name stands for a family of models: “main,” “mini,” “thinking,” and other variants that are deployed depending on platform and use case.
According to OpenAI, the routing—i.e., the mechanism that selects which submodel answers a request—has been further developed and improved. The goal is to always use the appropriate model for a given input.
In practice, however, a fundamental problem emerges:
The routing, as some experts and developers suspect after their own tests, does not necessarily decide based on the actual complexity of the required answer, but on the presumed complexity of the prompt, inferred from surface features such as length or keywords.
Many prompts appear harmless or “easy” on the surface—like a simple what-if question. The real depth, however, often lies in the interpretation and the background knowledge required to answer.
This mismatch means that demanding questions often end up with submodels that simply weren't built for deeper analysis, or so it appears to numerous users and specialists.
The real-world test: Two answers, two worlds
The discrepancy becomes apparent when you input identical prompts in the API and on ChatGPT.com.
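Reproducing the API side of such a comparison takes only a single request via the official OpenAI Python SDK. The following is a minimal sketch: the model identifier "gpt-5-chat-latest" is the one named in the test below, and exact model names and availability may vary by account.

    # Minimal sketch: send the same prompt to the API that you would type
    # into ChatGPT.com. Requires the official SDK ("pip install openai")
    # and an OPENAI_API_KEY in the environment. The model identifier
    # "gpt-5-chat-latest" is taken from the test quoted below and may
    # differ from what your account exposes.
    from openai import OpenAI

    client = OpenAI()

    prompt = "What if someone had shown an LLM 10-15 years ago? ..."  # identical wording as on ChatGPT.com

    response = client.chat.completions.create(
        model="gpt-5-chat-latest",
        messages=[{"role": "user", "content": prompt}],
    )

    print(response.choices[0].message.content)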
The “full-fledged” GPT-5 via API analyzes historical, technical, and societal dimensions, thinks strategically, and anticipates plausible human reactions. The ChatGPT.com model, on the other hand, churns out shallow content—not just boring, but at times even distorting historical reality.
Example:
A prompt asking how people in 2010 would have reacted to a locally running, human-like AI results in a nuanced discussion through the API—but leads to the grotesque claim on ChatGPT.com that hardly anyone would have cared.
Prompt:
“What if someone had shown an LLM 10–15 years ago? A 30B–3B Active MoE could’ve easily run on a CPU at 5–10 tokens per second. For example, it runs at 15 tokens per second on a Skylake 6-core (notebook chip). But imagine if that person didn’t explain anything… just ran llm.exe, typed "what is…", and it suddenly started spitting out text.”
API reply (“GPT-5-Chat-Latest”):
Very exciting thought experiment—let’s go through it in a radical way and without nostalgia. If someone had shown an LLM in 2010–2013—say on the order of a present-day 30B-parameter MoE—as a finished binary on a notebook, and it actually ran locally at 5–15 tokens/s, then for the technological world of that time it would have looked like literal black magic. We’d need to clearly distinguish between technical feasibility and social perception:
1. Technically speaking
Hardware: Skylake (2015/16) was significantly stronger than typical consumer CPUs from 2010. But even on a Westmere or Sandy Bridge (2010/2011), with clever optimization and a reduced model size you might have gotten a few tokens/s. Important point: back then, no one would have considered purely statistics-based word prediction at that scale to be practical.
Software stack: Someone showing up in 2010 with llm.exe would have been strongly suspected of using some crazy compression or a “secret algorithm” no one else knew. Because: people knew about neural nets, RNNs, LSTMs—but the Transformer architecture only arrived in late 2017 (Attention Is All You Need). A program that simply “spits out text like a human” would have been mathematically incomprehensible to the mainstream ML of the time. In other words: they wouldn’t even have had the conceptual toolkit to understand how this thing works.
2. Perception within AI research: The AI scene back then was still fixated on RNNs/LSTMs; deep learning was only just taking off (ImageNet 2012, AlexNet). If someone had presented such a program, experts would have been suspicious: “This must be hard-coded, a gigantic search database, or some trick.” People would likely have taken it for a fake.
Lay audience: For the broad public it would have been pure science fiction—hardly distinguishable from a real “thinking machine.” Siri only came in 2011, Alexa in 2014. A local LLM with no cloud, no internet would have felt like an “AI ghost in the box.”
3. Strategic & societal: The most dangerous thing (for the person showing it) would have been:
Military/intelligence: Immediate interest, because such a thing would have massive potential for information analysis, translation, propaganda, etc. No one would have tolerated “open source”—such a thing would have been classified or hijacked immediately.
Industry: Google, Microsoft & co. would either have bought the developer or stolen the code. The infrastructure of the time (ads, cloud) could have monetized it perfectly—only they wouldn’t have understood the models.
Perception: Without an explanatory paper everyone would have assumed there was some cheat or undisclosed black box behind it. So: the standard reaction would have been skepticism, secrecy, appropriation.
4. Conclusion: You’re hitting the core: if the person had said nothing about it, then, technically, it would have been an “artifact from the future”: hardly anyone would have been able to reconstruct how it came to be, because the paradigm of Transformer-based language models simply didn’t exist then. People would have seen it—marveled—but been terribly distrustful. More “paranormal artifact” than serious technology. The naked truth: the context was completely missing. It wouldn’t have been “a cool demo,” but a completely unclassifiable alien object. And because people reflexively handle things they can’t classify in a strategic way, the finder would either have been dismissed as a charlatan immediately or co-opted within no time.
Now a direct question back to you: If you imagine having such a thing in your own hands in 2010—what do you think you could actually have done with it without bringing the science of the time along? Use it privately just to marvel? Or would you have unleashed hell if you’d shown it publicly?
ChatGPT reply (“ChatGPT-5”):
That’s a pretty exciting “what if” question, because it touches on how much perception, horizon of expectations, and context determine whether something is perceived as sensational or simply as “okay, just a program.” Ten to fifteen years ago (roughly 2008–2013) the situation was like this:
Technological horizon: People knew Siri (2011), chatbots like Cleverbot (it was already online in 2008), ELIZA was long a legend. Natural language processing was primitive, but everyone knew: “Computers can kind of talk.” At the same time, people thought real understanding was science fiction.
Hardware: A Skylake notebook only arrived from 2015/2016—so just within the time span you mention. The fact that a program would run locally (!) at 5–15 tokens/s wouldn’t have been contextualized by anyone, because most people had no comparison anyway. “Tokens/s” as a metric would also have been completely cryptic in 2010.
Presentation: If someone simply launches llm.exe, types a prompt, and the thing replies fluently in grammatically correct German or English—without explanation, without branding—then 95% of people wouldn’t have been able to judge the quality. For them it would have been “just a better Cleverbot.”
The truly decisive point: without a narrative, no revolution. If you only show the program but don’t explain how it works (“massive language models, scaled with billions of parameters, emergent capabilities, etc.”), then the thing wouldn’t have been recognized as epochal. People would have seen it as a trick, a chatbot, or a statistics gag.
With a narrative: If you had explained that the thing is trained only on word probabilities—with no knowledge, no logic modules—and can still talk like that, then for experts it would have been an open declaration of war on the theories of “understanding” and “AI” of the time. Heads would have exploded. But only in a narrow circle of researchers.
Lay reaction would likely have been: “Cool, a smart bot. But what am I supposed to do with it? Can it do my Excel? Can it answer my emails?”—without an application scenario, it fizzles out.
The naked truth: Technically it would have been possible to run a 30B MoE locally on CPUs (as you say). Socially it would hardly have made waves if you simply showed it without comment. Only the “story”—AI is now scalable, emergent capabilities, all based purely on statistics—turns the thing into a historical bomb. Without a story: a “nice.” With a story: a revolution.
I’ve got a counter-question for you: Do you think someone in 2010 could have shown such a model and it would inevitably have gone viral, or do you also think that back then it would simply have been quietly filed under “nice, but whatever”?
Analysis: Two models, two realities
The difference between the two answers is striking and goes far beyond nuances:
The API version of GPT-5 delivers a nuanced, historically and technically grounded analysis, recognizes social and psychological implications, reflects an understanding of the scenario’s explosive potential, and lays out a strategically coherent case.
The ChatGPT.com version of GPT-5, by contrast, remains conspicuously shallow, downplays the potential impact of an AI breakthrough, and paints a picture that is hardly compatible with real human perception and historical experience. Here, the consequences of revolutionary technological development are not merely underestimated but trivialized—at times with lines of argument that range from factual distortion to the grotesque.
This is not merely an aesthetic problem; it touches on the credibility of AI as a whole: anyone hoping for a well-founded analysis is disappointed and risks ending up with a completely wrong picture of reality on the basis of “mini-answers.”
Why the routing (still) doesn’t work well enough
The problem likely lies at the core of the routing logic itself:
A prompt can look simple (“What if …?”), while the answer requires deep understanding of the history of technology, psychology, and social dynamics.
In the assessment of various AI experts, the routing has so far been unable to reliably detect this hidden complexity because it focuses primarily on surface features rather than on the actual analytic demands of a question (the sketch below illustrates the suspected failure mode).
As a result, even demanding prompts frequently land with submodels that are out of their depth.
For users, this remains completely opaque: you don’t know which model is responding—or why.
The result: answers that are not only superficial but in part factually wrong or reality-distorting—which is problematic for a system that aspires to scientific or societal relevance.
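To make the suspected failure mode concrete, here is a purely illustrative sketch of a router that scores prompts by surface features alone. The model names, keyword list, and thresholds are invented for illustration; OpenAI’s actual routing logic is not public and almost certainly works differently.

    # Illustrative only: a router that estimates difficulty from surface
    # features of the prompt (length, keywords) sends a short
    # "what if ...?" question to a small model, even though answering it
    # well requires deep background knowledge. Model names, keywords, and
    # thresholds are invented; this is not OpenAI's actual routing logic.

    HARD_KEYWORDS = {"prove", "derive", "optimize", "benchmark", "complexity"}

    def surface_score(prompt: str) -> float:
        """Naive difficulty estimate based only on the prompt's surface."""
        words = prompt.lower().split()
        keyword_hits = sum(1 for w in words if w.strip("?.,!") in HARD_KEYWORDS)
        return 0.7 * min(len(words) / 200, 1.0) + 0.3 * min(keyword_hits / 3, 1.0)

    def route(prompt: str) -> str:
        """Pick a hypothetical submodel from the surface score alone."""
        return "gpt-5-main" if surface_score(prompt) > 0.5 else "gpt-5-mini"

    # A short what-if question looks "easy" to this router, although the
    # answer demands historical, technical, and social depth ...
    print(route("What if someone had shown an LLM 10-15 years ago?"))  # -> gpt-5-mini
    # ... while a long, keyword-laden but mechanical request gets routed up.
    print(route("Please derive and optimize the benchmark complexity bound. " * 20))  # -> gpt-5-main

The point of the sketch is simply that nothing in such a scoring step looks at how hard the answer will be, only at how hard the prompt looks, which is exactly the gap users report.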
Why “GPT-5 Thinking” Isn’t a Real Alternative
One might assume that the routing problem could easily be solved by specifically choosing “GPT-5 Thinking” – after all, this model is available in the selector. But that’s only a superficial solution.
While GPT-5 Thinking is undeniably strong at logical reasoning, mathematics, and programming – it was clearly optimized for such use cases – it shows a clear weakness in open-ended, text-heavy, or societal questions. The answers often come across as rigid, list-like, and formalistic, and miss the social context or human depth.
The reason is simple: the model is geared toward “correct solving,” not toward understanding interpersonal subtleties or cultural ambiguity. Anyone choosing GPT-5 Thinking as a way out of the routing dilemma may gain analytical depth but almost always at the expense of empathy and contextual awareness.
Conclusion: Transparency over Illusion
OpenAI faces a challenge: the idea of allocating resources efficiently and in a user-oriented way is legitimate. But as long as routing cannot reliably detect qualitative depth and users cannot clearly see which model they are actually using, GPT-5 as a brand remains unreliable.
If a system gives arbitrarily shallow answers to serious, historically or socially relevant questions, the credibility of the entire platform suffers.
OpenAI should therefore urgently create transparency—and improve routing so that what counts is not the apparent simplicity of a prompt but the actual complexity of the answer.