Hacker Newsnew | past | comments | ask | show | jobs | submit | gizmodo59's commentslogin

Codex 5.3 is hands down the best model for coding as of today

That’s gpt-5.3-codex released last week

It’s not a crime if you do something for money. Those who comment are likely doing the same and they couldn’t get into a company like OpenAI and hence the hatred! Keep doing the great work you always did! Excited to see what you ll do with all the resources in the world.


5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!


Dumb question. Can these benchmarks be trusted when the model performance tends to vary depending on the hours and load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time. Or even, are the models best after launch then slowly eroded away at to more economical settings after the hype wears off?


We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.

(I'm from OpenAI.)


Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than google in most case, often it is on point, but also it seems off at times, inaccurate or shallow maybe. In some cases I just end the session.


Usually I find this kind of variation is due to context management.

Accuracy can decreases at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.

If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.


I don't think so. I am aware that large contexts impacts performance. In long chats an old topic will someone be brought up in new responses, and the direction of the mode is not as focused.

Regardless I tend to use new chats often.


This is called context rot


I thought context rot was only for long distance queries.


Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.


I appreciate you taking the time to respond to these kinds of questions the last few days.


Can you be more specific than this? does it vary in time from launch of a model to the next few months, beyond tinkering and optimization?


Yeah, happy to be more specific. No intention of making any technically true but misleading statements.

The following are true:

- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)

- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.

- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.

ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Codex changelog: https://developers.openai.com/codex/changelog/

Codex CLI commit history: https://github.com/openai/codex/commits/main/


I ask then unironically then, am I imagining that models are great when they start and degrade over time?

I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seem so in your face.

I don't discard being novelty plus getting used to it, plus psychological factors, do you have any takes on this?


You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.

Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.

Just a guess! Not trying to psychoanalyze anyone.


I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.



Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.

If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).


Isn’t that just how many steps at most a reasoning model should do?


>there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware

Maybe a dumb question but does this mean model quality may vary based on which hardware your request gets routed to?


Thank you for saying this publically.

I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.


Do you ever replace ChatGPT models with cheaper, distilled, quantized, etc ones to save cost?


We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.

That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).

I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.


> we never made GPT-4.5 the default model in ChatGPT

Just wondering: Why was it never made available via API? You can just charge whatever per token to make sure it's profitable like o1-pro.

I use it via my ChatGPT-Pro subscription, but I still find the API omission weird.


It was available in the API from Feb 2025 to July 2025, I believe. There's probably another world where we could have kept it around longer, but there's a surprising amount of fixed cost in maintaining / optimizing / serving models, so we made the call to focus our resources on accelerating the next gen instead. A bit of a bummer, as it had some unique qualities.

He literally said no to this in his GP post


My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.

It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.

If you make raw API calls and see behavioural changes over time, that would be another concern.


It will give the user lower quality if it finds them “distressed” however, choosing paternalistic safety over epistemic accuracy. As a user gets more frustrated with the system, it will pick up the distress signal even more so, a kind of feedback loop toward degraded service quality. In my experience.


Specifically including routing (i.e. which model you route to based on load/ToD)?

PS - I appreciate you coming here and commenting!


There is no routing with API, or when you choose a specific model in chatGPT.


In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?


Has this always been the case?


I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.


I do wonder about reasoning effort.


Reasoning effort is denominated in tokens, not time, so no difference beyond slowness at heavy load

(I work at OpenAI)


Hi Ted! Small world to see you here!


sure. we believe you


It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.

However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.


> I'd expect the numbers are all real.

I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).

It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.

And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating those two things remain in sync - they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are little lower than expected, no one's getting a late night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.


On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better


There's an extended thinking mode for GPT 5.2 i forget the name of it right at this minute. It's super slow - a 3 minute opus 4.5 prompt is circa 12 minutes to complete in 5.2 on that super extended thinking mode but it is not a close race in terms of results - GPT 5.2 wins by a handy margin in that mode. It's just too slow to be useable interactively though.


Interesting, sounds like I definitely need to give the GPT models another proper go based on this discussion


I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better.

Curious to see how things will be with 5.3 and 4.6


Interesting. Everyone in my circle said the opposite.


My experience is that Codex follows directions better but Claude writes better code.

ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task and to keep it updated almost to a fault. Claude-Opus-4.5 with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple functions, it was documented in a few places including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as was documented. Claude decided it was easier to do the exact opposite, rewrote the function, the comments, and the documentation to saynit now did the opposite of what was previously there.

If I believed a LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:

  // Invariant regardless of the value of X, this function cannot return Y
And it turned it into:

  // Returns Y if X is true


That's so strange. I found GPT to be abysmal at following instructions to the point of unusability for any direction-heavy role. I have a common workflow that involves an orchestrator that pretty much does nothing but follow some simple directions [1]. GPT flat-out cannot do this most basic task.

[1]: https://github.com/Vibecodelicious/llm-conductor/blob/main/O...


Strange behaviour and LLMs are the iconic duo of the decade. They've definitley multiplied my productivity, since now instead of putting off writing boring code or getting stuck on details till I get frustrated and give up I just give it to an agent to figure out.

I don't thing my ability read, understand, and write code is going anywhere though.

Neat tool BTW, I'm in the market for something like that.


I've found this orchestrator+reviewer+judge setup to yield much better results than anything else I've tried. And it's such a simple setup - a few markdown files.

I'm also creating one that is similar, but purpose-built for making the plans that this setup can orchestrate. It still needs some tweaking to get agents to follow it better - it still takes additional prompting to nudge it down the proper path. But I've had similar benefits - sending plans through this adversarial review loop has yielded significant improvements in final output.

https://github.com/Vibecodelicious/llm-conductor/blob/main/p...


Unrelated but this just happened and I thought of you ;-)

  › commit this, **SKIP BEADS**


  • Ran git status -sb
    └ ## main...origin/main [ahead 4]
       M containers/frigate.nix
      ?? .beads/bd.sock.startlock

  • I see an untracked .beads/bd.sock.startlock file that might be normal but needs clarification. I also note the requirement to include a bead and confirm bead readiness
    before proceeding, so I'll ask whether to create the bead and if the untracked file should be included in the commit.

  ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  • I can’t skip beads for this repo. The AGENTS.md rules are explicit: no work (including commits) without an open bead. Please confirm you want me to create a bead for
    this commit.
I don't know what's wrong with your Codex, but mine can't bring itself to break the rules.


It probably depends on programming language and expectations.


This is mostly Python/TS for me... what Jonathan Blow would probably call not "real programming" but it pays the bills

They can both write fairly good idiomatic code but in my experience opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly first time more often than codex. I still don't trust it obviously but out of all LLMs it's the closest to actually starting to earn my trust


Even for the same language it depends on domain.


I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development.


Codex is also much less transparent about its reasoning. With Claude, you see a fairly detailed chain-of-thought, so you can intervene early if you notice the model veering in the wrong direction or going in circles.


I don't think much from OpenAI can be trusted tbh.


When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?

I definitely suspect all these models are being degraded during heavy loads.


This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance.


At the end of the day you test it for your use cases anyway but it makes it a great initial hint if it's worth it to test out.


We know Open AI got caught getting benchmark data and tuning their models to it already. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.


Are you referring to FrontierMath?

We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.


No one believes you.


If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:

- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow

- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval

- Epoch made a private held-out set (albeit with a different difficulty); OpenAI performance on that set doesn't suggest any cheating/overfitting

- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not evidence of cheating with the private set

- The vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI along with everyone else has access to their problems and the opportunity to cheat, so FrontierMath isn't even unique in that respect


The same thing was done with Meta researchers with Llama 4 and what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]

You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.

Which is why it must be independent.

[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...


The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out


The -codex models are only for 'agentic coding', nothing else.


Anthropic models generally are right first time for me. Chatgpt and Gemini are often way, way out with some fundamental misunderstanding of the task at hand.


That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...


claude swe-bench is 80.8 and codex is 56.8

Seems like 4.6 is still all-around better?


Its SWE bench pro not swe bench verified. The verified benchmark has stagnated


Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.


it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

swe bench pro public is newer, but its not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private


You're comparing two different benchmarks. Pro vs Verified.


> "What a time to be alive!"

fuck off.


Codex seems to give compensation tokens whenever this happens! Hope Claude gives too.


Nvda is not the only exception. Private big names are losing money but there are so many public companies seeing the time of their life. Power, materials, dram, storage to name a few. The demand is truly high.

What we can argue about is if AI is truly transforming lives of everyone, the answer is a no. There is a massive exaggeration of benefits. The value is not ZERO. It’s not 100. It’s somewhere in between.


The opportunity cost of the billions invested in LLMs could lead one to argue that the benefits are negative.

Think of all the scientific experiments we could've had with the hundreds of billions being spent on AI. We need a lot more data on what's happening in space, in the sea, in tiny bits of matter, inside the earth. We need billions of people to learn a lot more things and think hard based on those axioms and the data we could gather exploring what I mention above to discover new ones. I hypothesize that investing there would have more benefit than a bunch of companies buying server farms to predict text.

CERN cost about 6 billions. Total MIT operations cost 4.7 billions a year. We could be allocating capital a lot more efficiently.


I believe that eventually the AI bubble will evolve in a simple scheme to corner the compute market. If no one can afford high-end hardware anymore then the companies who hoarded all the DRAM and GPUs can simply go rent seeking by selling the computer back to us at exorbitant prices.


The demand for memory is going to result in more factories and production. As long as demand is high, there's still money to be made in going wide to the consumer market with thinner margins.

What I predict is that we won't advance in memory technology on the consumer side as quickly. For instance, a huge number of basic consumer use cases would be totally fine on DDR3 for the next decade. Older equipment can produce this; so it has value, and we may see platforms come out with newer designs on older fabs.

Chiplets are a huge sign of growth in that direction - you end up with multiple components fabbed on different processes coming together inside one processor. That lets older equipment still have a long life and gives the final SoC assembler the ability to select from a wide range of components.

https://www.openchipletatlas.org/


That makes no sense. If the bubble bursts, there will be a huge oversupply and the prices will fall. Unless all Micron, Samsung, Nvidia, AMD, etc all go bankrupt overnight, the prices won't go up when demand vanishes.


That assumes the bubble will burst, which it won't if they succesfully corner the high-end compute market.

It doesn't matter if the AI is any good, you will still pay for it because it's the only way to access more compute power than consumer hardware offers.


There is a massive undersupply of compute right now for the current level of AI. The bubble bursting doesn't fix that.


There is a massive over-buying of compute, much beyond what is actually needed for the current level of AI development and products, paid for by investor money. When the bubble pops the investor money will dry up, and the extra demand will vanish. OpenAI buys memory chips to stop competitors from getting them, and Amazon owns datacenters they can't power.

https://www.bloomberg.com/news/articles/2025-11-10/data-cent...


For every sensational article of AI was useless, there is plenty of examples where using ChatGPT to find out what else could be happening and then having a conversation with doctor has helped many that I know of anecdotally and many such reports online as well.

At the end of the day, it’s yet another tool that people can use to help their lives. They have to use their brain. The culture of seeing doctor as a god doesn’t hold up anymore. So many people have had bad experiences when the entire health care industry at least in US is primarily a business than helping society get healthy.


Not sure if this correct. Codex was one of the first research projects long before Anthropic was started as a company. May be they did not see it as a path to AGI. It seems like coding is seen by few companies as the path to general intelligence (almost like Matrix where everything is code).


Only in HN and some reddit subs I even see the name claude. In many countries AI=ChatGPT.


And at one point in the 90s, Internet=Netscape Navigator.

I see Google doing to OpenAI today what Microsoft did to Netscape back then, using their dominant position across multiple channels (browser, search, Android) to leverage their way ahead of the first mover.


That's funny, the way I see it is Microsoft put tens of billions of dollars behind an effort to catch Google on the wrong foot, or at least make Google look bad, but they backed the wrong guy and it isn't quite going to make it to orbit.


From what I've seen, 99% of people are using the free version of ChatGPT. Those who are using Claude are on the subscription, very often the $100/month one.


Yeah. 50% of the time to throw away expensive tokens and limits is not ideal. But I bet by this time next year OSS models will be at that capability!


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: