They’re pretty upfront in their release post that they took an open source model and improved it with their own coding data. They mention “continued pretraining” (on top of the base model) and RL. Cursor never claimed to have done a full pretraining run.
More to the point, beating Opus 4.6 at coding and coming within striking distance of gpt-5.4 is impressive! On the benchmarks it outperforms raw Kimi K2.5.
It’s particularly impressive given larger labs like Meta are struggling to catch up to OpenAI/Anthropic.
More competition among model vendors is great for developers!
Has HN really stooped so low that we are upvoting unsourced AI slop? This “article” is sourced to a random Reddit thread and was clearly written by an LLM.
> A Reddit researcher just exposed
>The technical reality hits harder than policy abstractions.
It is actually not argumentum ad hominem, not least because this author is clearly not a person. It is extremely relevant to the substance of this post that it was written by an LLM based on an anonymous Reddit comment (based on "reporting" itself written by Claude).
>If it’s clearly wrong then demonstrate.
Sorry, this does not work in the age of AI. If you don't bother writing your own words, then no one should bother responding to them.
It has been demonstrated as being wrong throughout this thread and the original threads about it.
The original report was AI slop from Claude Code. If you go to the repo, it doesn't even claim that Meta spent $2B; that's just the sum of a lot of numbers Claude could find, not what Meta actually spent lobbying for this.
>Surely if OpenAI had insisted upon the same things that Anthropic had, the government would not have signed this agreement.
Have we been watching the same Trump admin for the last year? That sounds exactly like something the government would do: pointlessly throw a fit and end up signing a worse deal after blowing all its political capital.
While that thought crossed my mind, someone in a subthread of the parent comment made a point: OpenAI made a statement along the lines of "We insisted this not be used in those ways and the DoD totally says they won't." Which sounds to me like they ceded any hard terms and conditions and are letting the DoD use it under "any lawful means", which is what Anthropic didn't stand for.
You cannot out-astroturf Claude in this forum, it is impossible.
Anyways, do you get shitty results with the $20/month plan? So did I but then I switched to the $200/month plan and all my problems went away! AI is great now, I have instructed it to fire 5 people while I'm writing this!
Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.
I might flip that, given how hard it's been for Claude to deal with longer-context tasks like a coding session with iterations vs. a single top-down diff review.
I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will sometimes go through 3 or 4 back-and-forth iterations before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.
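For anyone curious, a skill like that can be pretty small. Here's a minimal sketch of the idea; the commenter describes a shell script, but this hypothetical version is in Python, and the prompt wording plus the `codex exec` invocation style are my assumptions, not their actual setup:

```python
#!/usr/bin/env python3
"""Hypothetical 'codex-review' helper: pipe a diff in, get Codex's review out.
Assumes the Codex CLI's non-interactive mode accepts a prompt as an argument
(`codex exec "<prompt>"`); wording and structure are illustrative only."""
import subprocess
import sys

REVIEW_PROMPT = (
    "You are a review partner. Review the following diff for correctness "
    "and design issues; skip style nits. Push back if you disagree."
)

def codex_review(diff: str) -> str:
    # Run the Codex CLI non-interactively and capture its final answer.
    result = subprocess.run(
        ["codex", "exec", f"{REVIEW_PROMPT}\n\n{diff}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Usage: git diff | ./codex_review.py
    print(codex_review(sys.stdin.read()))
```

The back-and-forth then falls out of the skill instructions: Claude runs this, reads the review, and either fixes the code or argues its case and re-runs it.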
I don’t use OpenAI too much, but I follow a similar workflow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has their own style and way of solving the problem, just like a human team. It’s kind of awesome and crazy and a bit scary all at once.
The way "Phases" are handled is incredible with research then planning, then execution and no context rot because behind the scenes everything is being saved in a State.md file...
I'm on Phase 41 of my own project and the reliability and almost absence of any error is amazing. Investigate and see if its a fit for you. The PAL MCP you can setup to have Gemini with its large context review what Claude codes.
5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.
But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
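To make the "use it once" point concrete, here's a toy simulation: score a hundred coin-flip "models" against one fixed test set and keep the best result. No model is better than chance, yet the winning number comes out well above 50% purely from reusing the set.

```python
import random

random.seed(0)

N_TEST = 200     # size of the fixed test set
N_MODELS = 100   # number of times the test set gets reused

labels = [random.randint(0, 1) for _ in range(N_TEST)]

best = 0.0
for _ in range(N_MODELS):
    # Each "model" guesses at random, so its true accuracy is exactly 50%.
    preds = [random.randint(0, 1) for _ in range(N_TEST)]
    accuracy = sum(p == y for p, y in zip(preds, labels)) / N_TEST
    best = max(best, accuracy)

# Prints something around 57-59%: pure selection bias, no actual skill.
print(f"best observed accuracy: {best:.2%}")
```

Every repeated submission against a shared leaderboard buys a little of that same inflation.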
Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.
Yes, because there’s value in a common reference for comparison. It helps to shed light on different models’ relative strengths and weaknesses. And, just like with performance benchmarks, you can learn to spot and read past the ways that people game their results. The danger is really more in when people who are less versed in the subject matter take what is ultimately just a semi-tamed genre of sales pitch at face value.
When such benchmarks aren’t available, what you often get instead is teams creating their own benchmark datasets and then testing both their own and existing models’ performance against them. Which is even worse, because they probably still run the test multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter-tune their own model for the dataset while reusing previously published hyperparameters for the other models. Which gives them an unfair advantage, because those hyperparameters were tuned to a different dataset and may not even have been optimizing for the same task.
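For contrast, the fair version of that comparison gives every model the same tuning budget on the new dataset. A rough sketch (sklearn, with purely illustrative models, grids, and data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-ins for "our model" and a previously published baseline.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The unfair setup tunes only one side; a fair one runs the same
# cross-validated search for *both* models on the *new* dataset.
searches = {
    "ours": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 10]},
    ),
    "baseline": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
}
for name, search in searches.items():
    search.fit(X_train, y_train)  # tuning sees training data only
    print(name, search.best_params_, search.score(X_test, y_test))
```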
It's not just overfitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
Codex 5.3 seems to be a lot chattier, as in it comments in the chat about things it has done or is about to do. These don't show up as "thinking" CoT blocks but as regular outputs. Overall the experience is somewhat more like Claude's, in that you can spot the problems in the model's reasoning much earlier if you keep an eye on it as it works, and steer it away.
Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.
What amazes me the most is the speed at which things are advancing. Go back a year, or even a year before that, and all these incremental improvements have compounded. Things that used to require real effort to solve consistently, whether with RAG or context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.
The denial of this overall trend on here and in other internet spaces is starting to really bother me. People need to have sober conversations about the speed of this increase and what kind of effects it's going to have on the world.
Yeah, I really didn't believe in agentic coding until December; that was when it went from being slightly more useful than hand-crafting code to extremely powerful.
And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...
> All anonymous as well. Who are making these claims? script kiddies? sr devs? Altman?
You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.
It's hardly tinfoil to understand that companies riding a multi-trillion dollar funding wave would spend a few pennies astroturfing their shit on hn. Or overfit to benchmarks that people take as objective measurements.
Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" assumptions are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3.
It's relatively easy for people to grok, if a bit niche. Just sometimes confuses LLMs. Humans are much better at holding space for rare exceptions to usual rules than LLMs are.
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.
I think for many/most programmers, "great coding" == speed + output, and webdev is the yardstick.
Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.
But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.
This is the way. People are unfortunately starting to divide themselves into camps on this (it’s human nature, we’re tribal), but we should try to avoid turning this into a Yankees vs. Red Sox thing.
Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.
GPT 5.2 codex plans well but fucks off a lot, goes in circles (more than opus 4.5) and really just lacks the breadth of integrated knowledge that makes opus feel so powerful.
Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.
I'd say that GPT 5.2 did slightly better on the stuff that I'm working on currently compared to Opus 4.5, but it's rather niche (a fancy Lojban parser in Haskell). However Opus is much easier to steer interactively because you can see what it's doing in more detail (although 5.3 is much improved in that regard!). I wouldn't feel empty-handed with either model, and both wrote large chunks of code for this project.
All that said, the single biggest reason why I use Codex a lot more is because the $200 plan for it is so much more generous. With Claude, I very quickly burn through the quota and then have to wait for several days or else buy more credit. With Codex, running in High reasoning mode as standard with occasional use of XHigh to write specs or debug gnarly issues, and having agents run almost around the clock in the background, I have hit the limit exactly once so far.
Didn't make a difference for me. Though I will say, so far 4.6 is really pissing me off and I might downgrade back to 4.5. It just refuses to listen to what I say, the steering is awful.
How many people are building the same thing multiple times to compare model performance? I'm much more interested in getting the thing I'm building built than in comparing AIs to each other.
Opus was quite useless today. It created lots of globals, statics, forward declarations, and hidden implementations in cpp files with no testable interface, erased types, and cast void pointers. I had to fix quite a lot and decouple the entangled mess.
Hopefully performance will pick up after the rollout.
ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.
A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.
The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.
> The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly
Not very accurate. For each of ARC-AGI-1 and ARC-AGI-2 there is a training set and three eval sets: public, semi-private, and private. The ARC foundation runs frontier LLMs on the semi-private set, and the labs give them pre-release API access so they can report release-day evals. They mostly don't allow anyone else to access the semi-private set (except for live Kaggle leaderboards, which use it), so you see independent researchers report on the public eval set instead, often very dubiously. The private set is for Kaggle competitions only; no frontier LLM evals are possible on it.
(ARC-AGI-1 results are now largely useless because most of its eval tasks became the ARC-2 training set. However some labs have said they don't train LLMs on the training sets anyway.)
More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.
As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.
> It'll be noteworthy to see the cost-per-task on ARC AGI v2.
Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).
The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).
I’ve gone through this process before and while it was more work it did not take 30 minutes.
I presented a student ID and was escorted through the security line. My baggage was selected for additional screening and I received a pat down search.
I went through an identical procedure on the return flight, right down to the exact words the TSA agent spoke to me while conducting the pat down.
I've also gone through this process, it did take about 30 minutes in my case. That also included waiting for a TSA agent to be available to even start the process. So YMMV, perhaps based on how busy the airport is at the time.
They had me answer a series of questions about past addresses etc, it wasn't just an extra pat down in my case. After answering all the questions correctly they allowed me to continue.
Nested scrollbars! Horizontal and vertical scroll!