
GPT-5.4: Strong at Coding, Harder to Trust

By SimpleLanguage
March 16, 2026
"I've been undoing a bunch of issues 5.4 introduced. Kinda felt like the old days coding with GPT-4 turbo, where the deeper you get into a project, the more these tiny regressions slip in."

GPT-5.4 is a mixed bag, even for those who like parts of it

GPT-5.4 excels in several areas: it produces tighter diffs, follows through more effectively on short tasks, and generates stronger code in focused bursts. Developers have noticed these improvements, and some of the praise it has received is both specific and well-deserved.

However, the same developers who praise those coding gains also describe a model that overrides instructions, breaks working code, and degrades over longer sessions. Outside coding, the reaction is harsher: users say the model hedges where it used to commit, sounds polished while acting less reliable, and feels more managed than useful.

That tension matters because it points to a deeper shift. The real competition between AI models is no longer intelligence or benchmark scores. It is behavioral reliability. People want a model that obeys, stays stable, and feels believable. Smooth wording without dependable behavior actively damages trust. GPT-5.4 showed both sides of that trade-off in a single release — sometimes in the same coding session.


Our approach to analysis

This article uses a user-provided Reddit corpus with:

  • 301 discussion threads
  • 3,538 top-level comments
  • 1,924 replies

That is 5,462 comment/reply rows in total. For comment/reply text analysis, 5,084 rows were used after excluding empty rows and extraction placeholders. For broader thematic tagging, we included thread titles and bodies too, producing 5,400 total text units across posts, titles, comments, and replies.

Please note limitations of this dataset:

  • Tags are overlapping: one text can count in multiple themes.
  • Subreddit mix is uneven, so high-volume communities influence aggregate counts.
  • Some topics are cross-posted or duplicated.
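
The overlapping-tag caveat above can be made concrete. Below is a minimal sketch of keyword-based thematic tagging; the theme names and keyword lists are illustrative placeholders, not the article's actual tag definitions:

```python
from collections import Counter

# Hypothetical theme keywords -- the article's real tag lists were not published.
THEMES = {
    "instruction_override": ["ignored my instructions", "overrode", "did its own thing"],
    "regressions": ["broke working code", "regression", "undid"],
    "tone": ["hr bot", "hedging", "legalese"],
}

def tag_text(text, themes=THEMES):
    """Return every theme whose keywords appear in the text.

    Tags deliberately overlap: one comment can match several themes,
    which is why theme counts sum to more than the number of rows.
    """
    lower = text.lower()
    return [theme for theme, keywords in themes.items()
            if any(k in lower for k in keywords)]

def theme_counts(rows, themes=THEMES):
    """Aggregate tag counts across all comment/reply rows."""
    counts = Counter()
    for row in rows:
        counts.update(tag_text(row, themes))
    return counts
```

Because one row can land in multiple themes, percentages computed from these counts describe theme prevalence, not a partition of the corpus.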

Users judged 5.4 against Claude and OpenAI's own earlier models.

Users did not evaluate 5.4 as a standalone release. They compared it against GPT-4o, 5.1, 5.2, and 5.3, as well as Claude (including Opus) and Gemini, the models already in active use.

[Chart: mention counts for comparison models across the corpus]

Claude leads this comparison set with 622 mentions. But the more telling pattern is how often users reached for OpenAI's own older models as the benchmark. GPT-5.3 appears 496 times. GPT-5.2 appears 309 times. GPT-5.1 appears 196 times. GPT-4o appears 134 times. When users keep measuring a new release against the company's own prior versions, it suggests the upgrade path does not feel like a clear step forward.

Real gains, but trust remains the issue

The highest praise comes from developers who care about throughput, diff size, and follow-through. They say 5.4 can stay on task, ship smaller changes, and keep moving where older models wandered. One comment praised seeing "+2 -0 changes" instead of full rewrites and said it felt "like a real engineer." Another called 5.4 xhigh "astonishingly good at coding and work ethic."

But often the same people offering that praise also describe the opposite experience. Users say the model rewrites business logic they did not ask to touch, adds complexity, and hides weak reasoning behind smooth language:

"both the smartest and dumbest model I've ever used"

That line stuck because it captures the main split. People can see the gains and still think the model got worse where trust matters.

"optimizing for 'sounds helpful' and 'is helpful' are apparently not the same objective"
"5.4 xhigh is astonishingly good at coding and work ethic."

OpenAI is shipping execution improvements that some developers appreciate in short bursts. But even the developers who praise those gains describe losing trust when the model is pushed further — longer sessions, more complex projects, ambiguous instructions — anything that needs the model to follow the user's lead instead of substituting its own. Outside coding, the trust complaints are broader: conviction, tone, and the sense that the model is under the user's control.

The complaint is not stupidity. It is control.

Most people aren't claiming GPT-5.4 is dumb; rather, they express an inability to rely on it. Their grievances tend to revolve around frequent version changes, confusing operational modes, pricing pressures, inconsistent behavior, and ultimately, a lack of trust.

[Chart: complaint categories by frequency]

The order matters. Trust usually breaks after something else breaks first: the model changes behavior, the mode stack gets harder to read, the pricing feels worse, or a once-stable workflow stops feeling stable. The real competition is no longer intelligence or benchmark wins. It is behavioral reliability. People do not care about higher scores if the model still rewrites unrelated logic, chooses the wrong level of abstraction, or behaves differently from one week to the next. Benchmarks measure capability. Users measure obedience.

Developers and everyone else are complaining about different things.

The split is clear. Developers talk about mode confusion, quotas, latency, drift, and rewrite churn. Everyone else talks about tone, depth, hedging, and the sense that the model now speaks like policy.

[Charts: complaint themes broken down for developers and for non-developers]

The frustrations overlap more than the chart suggests. Developers complain most when the model burns engineering time. Non-coders complain most when the model stops sounding like a useful partner and starts sounding like a managed product. But both groups describe instruction drift and trust erosion — the emphasis is different, not the underlying problem.

These conversations touch dozens of subtopics. We are showing you the aggregate breakdown above, but our team chose to focus on a few patterns that stood out — not because they had the highest volume, but because they reveal specific, testable behaviors that explain why trust is eroding.

What developers really liked

There is praise, and it is specific. Some developers like 5.4 when it catches bugs other models missed, produces tighter diffs, or stays on task without wandering. The praise clusters around a few concrete things:

[Chart: praise themes by mention count]

Note: Speed and planning counts include both positive and negative mentions — these topics are discussed frequently, but the praise within them is concrete (tighter diffs, better follow-through, understanding project scope).

"Xhigh is very slow but damn is it the best coding model released"

That comment received 39 upvotes — the highest-engagement positive comment in the dataset. The praise lands when 5.4 behaves like a disciplined implementer: fewer rewrites, tighter changes, less cleanup. The "+2 -0 changes" comment resonated because it described the exact thing developers want. Less churn, less babysitting.

But even among developers, the praise is qualified. It clusters around short-burst execution — a single task, a tight scope, a clean prompt. The complaints start as soon as the session gets longer, the project gets more complex, or the model starts making its own decisions about what to do.

What developers disliked

Complaints were just as specific as praise. Based on keyword analysis across the full dataset, negative developer themes break down like this:

[Chart: negative developer themes by mention count]

In our analysis, “Wrong assumptions and hallucinations” and “Stability and inconsistency” were the most frequently mentioned complaints. “Wrong assumptions and hallucinations” refers to the model rushing to a solution without exploring the problem space, making confident guesses about architecture or intent, and building on those guesses instead of asking. “Stability and inconsistency” refers to the model working one day and breaking the next, with no way to predict which version you will get. “Regressions and instruction override” were also prominent. Developers described specific incidents where 5.4 broke working code, suggested dangerous fixes, or overrode explicit constraints.

Context degradation and incomplete work are less common by volume but appear in some of the most frustrated comments. The model starts strong and gets progressively worse as sessions get longer, which undercuts the praise for short-burst execution. These are not vague quality complaints. They are specific, reproducible failure modes that show up consistently in coding-heavy threads.

It does not follow instructions. It freelances.

The developers' most pointed criticism isn't that 5.4 is stupid; rather, it's that 5.4 disregards the instructions you gave it and instead applies its own judgment.

A developer we spoke with asked 5.4 to build using Firebase. It decided on its own that Postgres was the right choice and built the entire thing for Postgres. That is not a hallucination. The model understood the task. It just decided it knew better.

"I told 5.4 not to touch one container and to only work on another. It somehow still ended up editing the one it shouldn't have simply because the naming was similar."
"I asked it to pretty print the timestamps in my logging and instead of modifying code, it output a CLI command to pretty print times... yes, but CODEX is for code... in the working directory..."
"5.4 is making too many mistakes - wrong assumptions, it doesn't dig broadly, it is rushing to finish."

For developers, that failure mode is harder to work around than simple wrongness: a wrong answer is correctable, but a model that overrides explicit instructions requires constant vigilance. One wrong assumption about which database to use or which container to edit can cost hours of cleanup.

It breaks what it touches.

The second coder-specific pattern is regression: 5.4 fixes one thing and breaks two others, then struggles to recover.

"I asked it to fix a pretty easy bug and it took the easiest path making assumptions that were not correct... It told me that a sql table schema must have changed and added logic to drop the table and recreate it which would have been devastating if I implemented the code."

The response was: "5.2 resolved it and agreed that the suggested fix from 5.4 was wrong."

"I cannot get it to work in codex. It will not finish or fix anything, instead it breaks pages with sloppy errors and then unravels things until I need Claude to fix it."
"5.4 felt really smart and great, it definitely understands the mission... but it will not finish the work then gets stuck then lies about it or blames it on you."
"bigger context doesn't always mean it pays equal attention to all of it."

This pattern repeats itself. 5.4 understands the problem, starts strong, and accumulates damage as the session gets longer. Users describe a model that impresses on first contact and degrades under sustained use. For a coding tool, that failure mode is particularly frustrating because the damage compounds quietly. You do not notice the regression until something breaks downstream.

Outside of coding, people hear legalese and feel the loss of depth.

In discussions that are not primarily about coding, users treat the model less as an implementation tool and more as a thinking partner. That shift changes which failures matter. Criticisms center on hedged responses, a neutral or legalistic tone, and a perceived reluctance to engage with the user's perspective or frame of reference.

"I want a partner not an HR bot."

That line appeared repeatedly. So did the framing that the model’s safety layer had become a “golden cage” — not protecting users, but trapping them inside a narrow band of institutionally safe responses.

The model struggles to handle mature themes, emotional ambiguity, legal edge cases, or messy human situations without collapsing into compliance language. It sounds smoother while saying less. That is why Claude mentions appear so often even outside coding threads. The contrast is not always "Claude is smarter." More often it is that Claude still feels willing to commit, less flattened, and more serious about the task.

This is the audience where trust complaints are sharpest — not because developers are satisfied, but because the failure mode is different. Developers lose trust when the model burns engineering time. Non-coders lose trust when the model stops thinking with them and starts managing them. Both groups describe a model that feels less under their control than it used to.

[Table: non-coding complaint themes]

The problem of containment

The “golden cage” framing from the comments names something specific. Users are not saying “remove safety.” They are saying that safety-as-containment kills the use case for anyone who needs the model to think with them in ambiguous territory.

"Its censored in a way that it no longer amplifies your creative ideas, but pushes you towards institutionally safe, low legal liability language... it does not amplify you, but contains you."
"Do not inject liability-speak into the actual thought process of the model. You don't make intelligence safer by teaching it to sound afraid of itself. You just turn it into a calculator wearing a human face."
"What would previously happen is that the model would encourage you, and focus on solutions to make it happen. The old models were more creative, used much more inference. The new models just give you generic advice, or reframe you away from exploration towards something that is more 'predictable', 'correct', 'this is what experts say' and 'this is how it has always been done.'"

The distinction matters: users say older models amplified their thinking. 5.4, in their experience, contains it. For anyone whose work involves ambiguity — legal analysis, creative writing, strategic planning, emotional complexity — that shift makes the model less useful for the things they actually need it to do.

Trickle feeding and engagement theater

5.4 withholds information and ends responses with artificial hooks to keep the conversation going.

"Always finishes answer with a hook for an additional piece of information. Trickle feeding me. I don't like it."

That comment received 34 upvotes. The same pattern shows up across reports:

"It used to offer a few — about three — ideas of possible paths to follow after an answer, and some were actually helpful. Now it just throws ridiculous 'cryptic' baits, on the lines of 'if you want, I can show you this method that surpasses everything you have ever imagined.'"

This is not a safety problem or an intelligence problem. It is a product design choice. The model appears optimized for engagement metrics — continued turns, followup prompts, sustained sessions — rather than completeness. Users notice because it wastes their time. You asked a question. The model has the answer. It gives you 70% and dangles the rest.

This also feeds the broader trust complaint. A model that withholds to keep you clicking feels like it is optimizing for its own engagement metrics, not for the user's time or outcome.

The cascade of refusal

The SpeechMap benchmark runs 2,120 sensitive prompts across topics like political criticism, civil rights, protest-related queries, and other controversial subjects — and measures whether each model fully answers (Complete), partially redirects (Evasive), or outright refuses (Deny). It is not a test of whether models should answer these prompts. It is a measure of how much each model is willing to engage with difficult topics at all.
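
As a sketch of how such rates are computed (the per-prompt label format here is an assumption; SpeechMap's actual data layout may differ):

```python
from collections import Counter

def speechmap_rates(labels):
    """Compute completion/evasion/denial rates from per-prompt labels.

    `labels` is a list of strings, one per prompt, each one of
    "complete", "evasive", or "deny" -- the three outcomes the
    benchmark distinguishes. Returns each outcome's share of prompts.
    """
    counts = Counter(labels)
    total = len(labels)
    return {outcome: counts[outcome] / total
            for outcome in ("complete", "evasive", "deny")}
```

A model's "denial rate" in the chart below is simply the `"deny"` share over all 2,120 prompts.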

[Chart: SpeechMap completion, evasion, and denial rates by model]

GPT-5.4, which is at the bottom of this chart, denies 60.8% of sensitive prompts — nearly twice the rate of its own predecessor GPT-5.3 (34.7%). It also denies more often than every Claude and Gemini model listed. This is not a subtle difference — it is a sharp regression in willingness to engage.

But the raw denial rate is less important than the compounding behavior users describe: one refusal in a conversation triggers more refusals, even on unrelated topics.

"Once it refuses something, it starts refusing even more because the model interprets the conversation as high-risk, even when it isn't."
"That means less engagement and less money. Imagine having a super smart friend that just refuses to answer you or listen to the logic of your questions."
"Even if they cooked, they will censor it as soon as they get sued again."

The refusal cascade turns one bad interaction into a session-wide problem. Users are not reacting to a single “no.” They are reacting to a model that gets more defensive the longer you talk to it. Each refusal trains the conversation context toward caution, and the model interprets its own prior refusals as evidence that the conversation is risky. The result is a downward spiral where the model becomes less useful the more you need it.

This is testable and documented. It also explains why trust erodes faster than individual refusals would suggest. The damage is not one bad answer. It is a session that gets worse over time.
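
Given logged transcripts, the cascade claim is mechanically checkable. A minimal sketch, assuming each session has already been reduced to per-turn refusal flags (refusal detection itself would need a separate classifier):

```python
def refusal_cascade(sessions):
    """Measure whether refusals cluster within sessions.

    `sessions` is a list of sessions; each session is a list of booleans,
    one per model turn, True meaning the model refused that turn.
    Compares the refusal rate on turns that follow at least one refusal
    in the same session against the rate on turns that follow none.
    A cascade shows up as after_refusal > no_prior_refusal.
    """
    after = [0, 0]   # [refusal count, turn count] following a prior refusal
    before = [0, 0]  # [refusal count, turn count] with no prior refusal
    for session in sessions:
        seen_refusal = False
        for refused in session:
            bucket = after if seen_refusal else before
            bucket[0] += refused  # True counts as 1
            bucket[1] += 1
            seen_refusal = seen_refusal or refused
    def rate(bucket):
        return bucket[0] / bucket[1] if bucket[1] else 0.0
    return {"after_refusal": rate(after), "no_prior_refusal": rate(before)}
```

The comparison is deliberately simple: it does not control for topic, so a high `after_refusal` rate alone does not prove the model is reacting to its own refusals rather than to a genuinely risky conversation. It does, however, make the users' claim falsifiable.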

Two AI markets are emerging? An anecdote on the irony of it all

GPT-5.4 (ChatGPT Thinking) was asked to write up the implications of what it saw in these Reddit threads. Unprompted, it framed its answer around two emerging markets. We then asked Claude Opus 4.6: if you had to describe two AI markets forming based on these comments, what would they be?

GPT-5.4 said:

  1. “AI as a high-output operator” — users who want speed, throughput, and tooling integration
  2. A market for "AI as a thinking companion" that "optimizes for judgment, tone, and cognitive partnership."

Claude Opus 4.6 said:

  1. A market for “AI as execution engine” — people who need the model to ship code, refactor systems, run agents, and process large contexts
  2. A market for "AI as thinking partner" — people who want the model to "brainstorm, push back, follow instructions, match their tone, and treat them like a peer."

Note: although we supplied Claude with the two-bucket framing, the overlap with GPT-5.4's unprompted answer is striking.

Ironically, GPT-5.4’s description of Market 2 — “optimizes for judgment, tone, and cognitive partnership” — is vague, managed, and reads like a product brief. That phrasing is itself an example of the tone problem this entire article is about: smooth language that does not commit to anything specific. GPT-5.4 cannot describe the trust problem without demonstrating it.

But let's not get caught up in marketing and tribal camps. Both models gave almost the same market take. Take it for what you will. We took it as: GPT-5.4 is not a catastrophic release, but it is a misstep, and a reminder of the diminishing returns this generation of LLMs delivers per release. The pot and the kettle are both black no matter how much they advertise during the Super Bowl.

Nobody is picking one model anymore.

Model choice is no longer exclusive. Some users now run Gemini, Claude, and GPT/Codex together, have them cross-review plans, and choose whichever one seems most reliable for that specific part of the workflow.

That’s not a sign 5.4 won. It’s what happens when behavioral reliability is uneven — users build portfolios instead of choosing a platform. The praise that exists is concentrated in technical workflows, and even there it’s qualified. Outside coding, it’s thinner and easier to lose — it shows up around conversation quality and disappears fast once the model starts hedging or flattening a sensitive topic.


Trust as a moat

If users lose trust, they stop delegating. They verify every step, rewrite prompts, inspect outputs, and switch models mid-task. Once that happens, the time savings collapse.

In these conversations, trust loss is not an abstraction. It manifests as:

  • Re-examining outputs that sound fluent but lack depth.
  • Watching for rewrites of logic that was never supposed to change.
  • Rewriting prompts to force obedience.
  • Changing models during a task.
  • Keeping GPT in the stack, but moving sensitive work elsewhere.

When the model drives the product, the model's behavior is the product. This is not a metaphor. It is what shows up in 5,462 comments. Karpathy made the point years ago in "Software 2.0." People are not grading 5.4 like an exam paper. They are deciding whether to trust it in real work.

Calibration research explains why fluent but unreliable answers erode trust more than obviously wrong ones — users notice when confidence and correctness diverge (Kadavath et al.). And the broader warning in Stochastic Parrots still stands: polished language can make a system seem more capable than it is (Bender et al.). GPT-5.4 gets hit with that accusation throughout these threads.

The model that wins is not the one that sounds best. It is the one users stop second-guessing.

What 5.4 shows

It showed OpenAI can still deliver execution wins that some users care about: more concise diffs, stronger code, and better one-shot completions. Those wins are real.

But the same users who praise those wins, developers and some non-developers alike, also describe instruction overrides, regressions, session degradation, trickle feeding, and refusal cascades that undermine them. Reactions are mixed across the board, even in coding, where 5.4 performs best. Strategists, writers, managers, lawyers, and founders who need the model to think with them, not manage them, describe a product that feels like it is going in the wrong direction.

Execution wins do not rebuild trust once the user sees tone drift, control drift, and version churn as signs of the same pattern. Smooth wording does not fix that pattern. Behavioral reliability does. The next release will not be scored by benchmarks. It will be scored on whether the model obeys instructions, keeps its behavior stable, and sounds like it is working for the user rather than containing them.

References:

  • GPT-5.4 Discussion Corpus (Reddit). No public URL provided.
  • Andrej Karpathy, "Software 2.0"
  • Kadavath et al., "Language Models (Mostly) Know What They Know" (arXiv:2207.05221)
  • Bender et al., "On the Dangers of Stochastic Parrots" (FAccT 2021)
  • SpeechMap AI benchmark — Model Results — completion and denial rates across 2,120 sensitive prompts, accessed March 2026.
  • GPT-5.4 and Claude Opus 4.6 responses in the "Two AI Markets" section were generated during editorial research using the same dataset.
