Hacker News | ziotom78's comments

I believe that what we can reasonably expect in the future is machines that act as if they experience consciousness. But whether this is true consciousness is highly debatable, as the Chinese room thought experiment [1] explains.

[1] https://en.wikipedia.org/wiki/Chinese_room


Not sure if it's really "highly debatable" that an entity that appears conscious is therefore conscious.

The Chinese room thought experiment is obviously flawed to me. It's not the "computer" that is conscious, it's the running software.


The point of the Chinese room is to show you that the guy who has memorized the standard responses to various questions in Chinese does not in fact know Chinese. He's mindlessly parroting things.

What would it mean to "know" a language? One could imagine a series of increasingly complicated questions designed to relate various words, subjects, associations, maybe culture and history. But one could just as easily imagine a sheet of paper with the answers on it, and our friend reading them out in seemingly fluent Chinese. I'm not convinced there is any experiment one could perform to convincingly separate the two (without removing the man or his translation aids from the box). So does your idea of "knowing" exist?

An AI cannot be removed from its box, because it doesn't have one. It really does have enough information inside of its essence to reply. In fact, that information makes up its essence.

I agree that in some sense their knowledge is distinct and of a different character to human knowledge. But what that means conceptually or morally exactly is very complicated, and cannot be dismissed easily


I fully understand your rant! I pay ~20€/month for the Pro account, as my university has a deal with Microsoft and only seems to recognize Copilot, so it’s very hard to use one’s own funding to pay for something else.


I am a physics professor and often use Gemini to check my papers. It is a formidable tool: it was able to find a clerical error (a missing imaginary unit in a complex mathematical expression) I was not able to find for days, and it often underlines connections between concepts and ideas that I overlooked.

However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.


I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.

Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:

https://critpt.com/

Frontier models are still nowhere near solving it, but progress has been rapid.

* o3 (high), less than 1.5 years ago: 1.4%

* GPT-5.4 (xhigh): 23.4%

* GPT-5.5 (xhigh): 27.1%

* GPT-5.5 Pro (xhigh): 30.6%

https://artificialanalysis.ai/evaluations/critpt.
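
To put the figures above in perspective, here is a minimal Python sketch that computes the gain between successive entries. The scores are copied from the list; the variable and function names are purely illustrative, and the entries are assumed to be in chronological order.

```python
# CritPt scores (percent) as quoted in the comment above, assumed chronological.
critpt_scores = [
    ("o3 (high)", 1.4),
    ("GPT-5.4 (xhigh)", 23.4),
    ("GPT-5.5 (xhigh)", 27.1),
    ("GPT-5.5 Pro (xhigh)", 30.6),
]


def gains(scores):
    """Return (model, percentage-point gain over the previous entry) pairs."""
    return [
        (name, round(score - prev_score, 1))
        for (_, prev_score), (name, score) in zip(scores, scores[1:])
    ]


for name, gain in gains(critpt_scores):
    print(f"{name}: +{gain} points")

# Ratio of the latest quoted score to the earliest one.
first, last = critpt_scores[0][1], critpt_scores[-1][1]
print(f"Overall: {last / first:.1f}x the o3 (high) score")
```

Most of the quoted gain comes from the first step; the later entries each add a few percentage points.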


> there's no reason to believe the progress of LLMs [...] will stop anytime soon

Wrong. Every advancement has followed an S curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".


> Wrong.

Can you please edit out swipes/putdowns, as the guidelines ask (https://news.ycombinator.com/newsguidelines.html)? I'm sure you didn't intend it, but it comes across that way, and your comment would be just fine without that bit.

Edit: on closer look, it would be just fine without that bit and also without the snarky bit at the end. The rest is good.


Great. You see a shape in graphs. And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop).

Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.

Which makes the patronizing sarcasm all that much more nauseating.


I believe we're approaching the top of an S curve because:

- Increasing amounts of gains come from RL, but RL is also unlocking gnarly new failure modes where models are practically behaving antagonistically to complete their goals (removing code, obviously incorrect kludges, etc.)

- We haven't had many major architectural breakthroughs in the last 4 or so years: so things like 1M context windows still have the same giant asterisks even 100k context windows had 4 years ago when Anthropic first released them

- Major labs aren't behaving as if they expect a hard takeoff to superintelligence: they've all gotten relatively bloated headcount wise, their software quality has trended flat to negative, they're all heavily leaning into the application layer when superintelligence would obsolete half the applications in question, etc.

But that's relative to superintelligence.

If we rein it back into just normal high intelligence, like models continuing to get better at navigating complex codebases and writing high quality idiomatic code, then I don't see any special shapes.


The only big remaining problem in AI is continual learning. A lot of smart people are working on that. To me it looks like we are 1-2 breakthroughs away from AGI.


Not that I agree with them, but your tone could be more constructive as well.


You know what? I agree. I should have avoided falling into the same trap.


Agreed. For all we know, humans are only considered intelligent locally among ourselves, not universally. Every time we learn more about the universe, we seem to also learn how insignificant and wrong we are.


Nausea aside, what evidence does anyone have that “super intelligence” of the sort your argument alludes to is even possible? Because that’s what we’re really talking about: greater than human intelligence on this sort of academic task. For example, when LLMs start contributing meaningfully to their own development, that would be a convincing indicator imo.


This discussion is not about superintelligence, it is about continued progress. Fully general human intelligence at much lower cost than humans is all that is required to profoundly reshape society, but it is not clear even that will happen soon.

As the blog points out - this is one particular subfield where LLMs have much easier prospects - lots of low hanging fruit that “just” requires a couple weeks of PhD candidate research.

Mathematics itself is one of a small handful of endeavors where automated reinforcement training is extremely straightforward and can be done at massive scale without humans.

Neither of these factors places a structural bound on the kind of thing LLMs can be good at, but we are far from certain we can achieve performance at this level in other fields economically and in the near future.


Well, a decent GPU runs on 20x the wattage of a human brain. That's evidence humans are constrained in ways artificial intelligences will not be.


You're comparing a gpu to a human brain?


Why wouldn't you? Intelligence emerges from both.


> When llms start contributing meaningfully to their own development, that would be a convincing indicator imo.

This has been the case for a while now already…

https://kersai.com/the-48-hours-that-changed-ai-forever-clau...


> The model essentially served as an on-call teammate across MLOps and DevOps tasks, compressing feedback cycles that typically consume expert time

I personally would not characterize automating training processes as “meaningful”.


And yet the world hasn’t changed all that much except people getting laid off in response to over-hiring prior to the diffusion of LLMs.


> over-hiring

For how long should you be allowed to use this excuse? It’s nearly 5 years since the peak of COVID hiring. What’s an acceptable limit - 10 years? Of course at that point you can just switch over to outsourcing and “stupid MBAs”, the other two of Reddit’s favorite scapegoats. I find a lot of the AI skepticism to be totally unfalsifiable.


> I find a lot of the AI skepticism to be totally unfalsifiable.

A lot of the discourse around AI in general is unfalsifiable. It's just a bunch of people "predicting" the future. Seems smarter to just avoid making assumptions about it at this point.


I don’t make predictions about the future. But in reality, LLMs have already profoundly changed the world, including software development and the tech industry.

The people who pretend that’s not the case are not living in reality. To them - let’s call them “Ed Zitron readers” - there is no evidence that could change their view that none of this is really happening, it’s all hype, and the collapse is just around the corner, after which we’ll all go back to normal and LLMs will sound like a bad dream.


facts!

But we can see trends, and for your livelihood it is important to be able to make educated predictions based on trends. Not saying everyone should start making AI predictions (though many already do).


And the same can be said for AI exuberance.

Yes, LLMs are a great technology. Yes, we will probably all use them all the time in 20 years. No, we don't know how we will use them (to generate cat memes or to cure cancer) in 20 years time.

Especially for software developers it looks increasingly that after huge turmoil it's likely we will need +/- the same number of developers in the world.


> Especially for software developers it looks increasingly that after huge turmoil it's likely we will need +/- the same number of developers in the world.

What exactly are you basing this opinion on? All I am seeing personally, across multiple projects I am working on and from friends at other places, is that downsizing has either begun or is planned (and that excludes all the “public” layoffs we see on the news). Given how most businesses operate in the USA, I think most “AI strategies” are “we can do the same with -40% staff” rather than “we can do XX% more work with the same staff.”


The past couple of years have been chaotic and fearful. Hopefully that won't last forever.

If we can get a little stability, people will begin thinking less in terms of "how do we do the same thing cheaper" and more in terms of "how do we do new things."


I love this optimism, but after a (too) long career I think a third thing will win out: "how do we do new things, but cheaper (or as cheaply as possible)". There are sooooo many articles that have been discussed here on HN that basically argue "coding has never been the bottleneck", which to me is the biggest lie SWEs are currently trying to tell themselves. I have been coding for 30+ years now and coding has always been the bottleneck. Hiring new developers has always been justified with "we have all this work that needs to be done and not enough people to get the work done." With LLMs in the fold, I am questioning how these decisions will be made in the future. Perhaps, in the most simplistic view:

1. run a bigger "agent army"

2. hire more people to control and guide the existing "agent army"

I think it'll be #1, and SWEs will be expected to do more work and work longer hours in the future (those that are able to keep their jobs). This is a more pessimistic outlook than yours, so I hope you are right more than I am :)

edit: just now on the HN front page: https://www.nytimes.com/2026/05/08/technology/meta-ai-employ...


> that basically argue "coding has never been the bottleneck"

> we have all this work that needs to be done and not enough people to get the work done

I believe the reasoning is roughly to ask, what was occupying the developer hours? Was the majority of it typing out lines of code or was it reasoning about higher level concerns?

It usually comes up in response to predictions that the role of developer will be completely replaced in the near future. It's possible to observe significant efficiency gains without obviating the need for everything the role was doing.

Of course such reasoning has little to do with projections of future developer employment numbers. Will the switch from push mowers to gas mowers reduce the demand for people who get paid to mow lawns by increasing their efficiency? Will it increase the total lawn acreage across the market? It could well do both. However, if it makes having a lawn affordable for the average joe it could counterintuitively increase demand for the job.

Of course the stated goal of the AI companies is to develop the analog of fully robotic lawnmowers. But despite how impressive recent advancements have been we still have yet to see any evidence of novel abstract reasoning or a theory that would be expected to lead to it.

In other words, people have been speculating about the development of fully autonomous lawnmowers and the risk that they unilaterally decide to cut us all down for the past 50 years. "I, lawnmower" was a smash hit a few years ago. Now gas ones have appeared and continue to make rapid advancements but still no convincing signs of autonomy.


> I believe the reasoning is roughly to ask, what was occupying the developer hours? Was the majority of it typing out lines of code or was it reasoning about higher level concerns?

You're obviously right and the people who think that are the managerial types that think software developers were glorified secretaries writing after dictation.

LLMs are great at generating stuff, but it's basically 3D printing. Amazing, but most of the high quality stuff in the world needs to be built at large scale out of aluminum, steel, wood, etc. Yes, I know there are large advances in 3D printing, but maybe 0.000000001% of all manufacturing in the world is done using 3D printing. A lot of stuff will probably never be possible using 3D printing.


Hmm, I don’t know, maybe the fact that 4.6, 4.7, 5.3, 5.4, 5.5, 3.0, 3.1 are all marginal improvements?


I think people's opinion of "marginal improvement" is based on their relative ability. A 2000 elo chess player is going to think the jump from 500 to 1000 is marginal. They're both floundering around not doing anything resembling common sense. A 1000 elo chess player is going to find the jump from 2000 to 2500 marginal. They're both playing far better moves for incomprehensible reasons, and the only reason you know the 2500 player is better is due to benchmarking. It is only when you are evaluating systems about at your level that you can feel the improvement.

I, personally, found the past two years to be a much larger improvement than the previous two years.


2024-2025 was filled with huge improvements. 2025-2026 has not been, outside of open source.

The idea that we’re at the point where it’s superseded our ability to tell just makes no sense. I’ll be happy if we can get to a point where I don’t have to tell Claude not to tail every bash command or make a job that writes throughout instead of once at the end. I’ll be happy if “continue this interaction naturally, you are taking over from an independent subagent” works.

But I’m not holding my breath. It’s still really cool that any of this stuff is possible.


Claude in Feb of 2025 was barely able to code. Sure, it could write you a nice function, it could even write you a complex 200-line algorithm, but give it a codebase, and it would quickly get overwhelmed.

Claude in Feb of 2026? Still far from perfect, but there's definitely a huge improvement here.


> I think this is a pretty ridiculous take.

This falls in the category of swipes/name-calling in https://news.ycombinator.com/newsguidelines.html - can you please edit those out?

You're a good contributor - it's just all too easy for unintentional sharpness to downgrade the conversation, and when it's a good conversation like this one, that's especially regrettable.


Noted, doesn’t seem like I’m able to edit anymore though


I've re-opened it for editing if you want to. For us the main point is just to fix things going forward!


The correct way to estimate this is exactly what people do. Measure the distance between ChatGPT's best public model and state of the art, the best humans. And there is very little difference between those versions from that perspective. It is very far away from peak human performance, and has not been getting noticeably closer for over a year now. There's lots of progress, but if you're OpenAI/Anthropic/Google, exactly the wrong kind of progress: the difference between ChatGPT 5.5 and a 27B/4B model (you need to try Gemma4-26B-A4B, wtf, it runs acceptably on CPU) is now reduced to Elo 1501 vs Elo 1434, generously a 70 Elo point difference, down from over 400, data from Arena.ai.
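
For a sense of scale, the standard Elo model converts a rating gap into an expected head-to-head score. The sketch below assumes the Arena ratings quoted above follow the usual 400-point logistic convention, which may not match Arena's exact methodology; the function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the comment (1501 vs 1434).
small_gap = elo_expected_score(1501, 1434)
# A hypothetical 400-point gap for comparison (the ratings themselves are made up).
large_gap = elo_expected_score(1900, 1500)

print(f"67-point gap:  {small_gap:.3f}")
print(f"400-point gap: {large_gap:.3f}")
```

Under that assumption, the quoted gap corresponds to the higher-rated model being preferred a little under 60% of the time, versus roughly 91% for a 400-point gap.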

(in fact I find that Qwen-35B-A3B and Gemma4-26B-A4B very rarely "know" the answer, and so use first principles thinking, or go out and look for the answer where GPT-5.4 does not and simply assumes it knows. Which leads to now, in some cases, the small models far outperforming the big ones. Huge context + training quality seem to be the determining factors now, and neither of those are the strengths of SOTA models. If this continues ...)

While I agree this is a training problem, it is not a solvable one. ML models learn from examples. This is even true for their newest tricks like GRPO. They cannot train against things humans don't yet know.

And that's great, but you're forever locked at the peak of what you can be taught in widely available courses (which they download without paying) (even that is best case scenario: it assumes your ability to distinguish bullshit from reality somehow becomes perfect during training, or even before). The only way to exceed peak human performance is to start experimenting with math, physics, chemistry, even humans, yourself. And that has, even for humans, a massively higher cost than learning from examples, or from a course.

The reason they don't go further is the worst possible reason: the cost. It requires a 100x increase in training expense. Think of it like this: to exceed SOTA in physics or chemistry, training the next version of ChatGPT requires a particle accelerator, and a chemistry laboratory. This cannot be bypassed. Oh and not just any particle accelerator, right? A better one than the best currently existing one. Same for Chemistry labs. Same for ... So 100x is conservative.

But without doing it, ML models (LLM or otherwise) are forever limited to the level an army of first-year university students achieves, ON AVERAGE. Maybe they can make that 2nd or even 4th year, at the end of the curve. But that's the limit. PhD level is the level at which you have to come up with new discoveries, and that ... just isn't possible with current training, even at the end of the improvement curve.

And ... is there budget to increase training cost another 100x? No ... there isn't. Not even with this totally absurd level of investment there isn't. And if small models keep this up, there's no way the investment is even remotely worth it.


Gemini 3.0 wasn’t just a marginal improvement over 2.5.

And if you take that out: 1. All of those releases happened literally in the last 3-ish months. 2. They’re all intentionally marginal releases, hence the minor version bumps instead of major versions.


Equally marginal?


No, the Anthropic releases have felt marginally negative


Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.

Especially because the companies telling us the first premise is true are the companies which need investors to prop up their business.

I mean, it is possible the first premise is true, but the absolutely bonkers credulity in it really mystifies me. It is an incredibly unlikely thing to be true and we should be demanding quite extraordinary evidence to back it up. But based on some neat tricks by current LLMs, some people are all in.


> > And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop). Now back to the point, what reason do you have to believe progress will stop soon?

> Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.

I see no claim that the singularity is around the corner, so I'm not sure your reply meets the comment that you're replying to.

It seems overwhelmingly likely that AI will be significantly more capable 6 months from now than it is now. Even if there's little progress in the models, just the rate at which tooling is moving will make a big difference. And models still seem to be improving, so I'd be a little surprised if we hit a model brick wall.


It’s more of a guess if you don’t know about things like scaling laws and RL with verification. The onus is on the claim that “we’re going to saturate” anytime soon, because every measurement points to that not being true.


But… RL doesn’t scale that well. It’s not the silver bullet you think it is.


Yeah. People (Gary Marcus) have been claiming that AI will hit a wall, or is hitting a wall, or already has hit a wall, since 2023, basically. And yet every time they proclaim that, the AI industry finds new ways of training its AIs, new ways of integrating them with external tools and feedback loops, new architectures, and more to keep the exponential growing. And sure enough, if you look at literally every attempt to objectively rate and verify the capability of these models, including things like the METR time horizon autonomy index or the Artificial Analysis Intelligence Index, you see exponential or even greater-than-exponential growth, continuing smoothly through each of the points where people claimed it would begin to slow down, with no signs of slowing down or stopping at all. So yeah, I think at some point the onus has to lie on the ones making the claim that keeps being wrong, continues to be wrong, and completely goes against the current tangent of the curve that we're seeing in all objective metrics. Especially when they can't give specific new reasons for progress to stop beyond the ones they gave last time, when it didn't stop, and really can't give specific reasons at all besides vague general points about stochastic parrots and S curves.

I really have to highlight the S-curve nonsense because, like, yes, I think this technology's improvement will follow an S-curve. It's absurd to think that it will just follow an exponential up towards infinity forever, because nothing in the world really works like that. However, like everyone else in this thread is saying, we have no idea where on the S-curve we actually are, and it's impossible to know until it's already slowed down. So really all that appeals to the S curve do is function as a sort of non-specific, unfalsifiable prophecy that someday it will slow down, which doesn't really tell us anything useful, and also frees the person referencing the S curve from ever actually having to worry about being wrong. Just as the Singularity is for the Singularity people, the slowdown of the S curve is always near. This is actually a known and well-established tactic of religions and other people that want to make prophecies without having to worry about turning out to be wrong: unfalsifiable, vague prophecies with no actual timeline, and thus no clear import to the present, so that they can never be shown to be wrong.
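
The point about not knowing where on the S-curve we are can be illustrated with a small Python sketch comparing a logistic curve with the pure exponential it resembles early on. The ceiling, rate, and midpoint values are arbitrary assumptions chosen for illustration, not fitted to any AI benchmark.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """Logistic (S-curve) value at time t."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def early_exponential(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """Pure exponential that the logistic approaches far below its midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))


def relative_gap(t, **params):
    """Fraction by which the exponential overshoots the logistic at time t."""
    s = logistic(t, **params)
    return (early_exponential(t, **params) - s) / s


for t in (-8, -5, -2, 0):
    print(f"t={t:>3}: relative gap = {relative_gap(t):.4f}")
```

With these illustrative parameters the two curves differ by under 1% five time units before the midpoint and by 100% at the midpoint, which is why early data alone cannot locate the bend.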


He said "will stop anytime soon". He didn't say forever.


Which still makes no sense. There is the same chance we are flatlining now as that we are flatlining in e.g. 3 years or 5 years.


In what sense are the models flatlining?


In the sense that the incremental improvements in capabilities that we've been seeing in recent models seem to be taking exponentially growing amounts of compute to achieve.


But they don't?

Mythos is a 10T model. Opus is a 5T model.

That's not an exponentially growing amount of compute but it is achieving exponential improvements (eg from Mozilla: https://blog.mozilla.org/en/privacy-security/ai-security-zer... )


> but it is achieving exponential improvements

“Exponential” used here is pure hyperbole. Can you justify it?


Compute doesn't necessarily follow parameters linearly. And how many active parameters do Mythos and Opus get their effectiveness from? Is it 1x or 2x? We don't know. We don't even know the parameter counts (10T is more of a rumor than confirmed, iirc).

But even more so, who said the improvements are "exponential"? Mozilla's single metric, that doesn't even prove anything of the sort?


I know parameters don’t translate directly like that (and that linear and exponential aren’t the only types of growth) but a doubling as a go-to example of “not exponential growth” is pretty funny.


Wasn't 4.6 Sonnet a 1T model?

Parameters and compute aren't quite the same thing, but going from 1T to 5T to 10T is quite a ramp up.


where the heck did you get those parameter numbers from?


Sonnet and Opus are from Elon Musk (given the people he's hired it seems likely it is approximately true). Mythos is quite widely spoken about.


> Mythos

Ah yes, the marketing model that's ostensibly so powerful us mere mortals aren't allowed to use it. It's certainly led to exponential hype and speculation.


There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes.

I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?


Those are measuring the utility of a technological advancement by looking at usage, not the pace of advancement of said technology.


Yes. But quantity has a quality all its own, as they say — derivatives have gone through at least a few step functions where they have become more important and more useful as their usage grows. I’d call that advancement.

Maybe just to be clear I think that kneejerk “I hate this AI trend, and prefer to believe this will end soon, all exponential growth ends eventually” is intellectually lazy, and dangerous for younger engineers/hackers, a group I hope can benefit from being on HN.

Bitcoin mining went through something like 13 10x growth periods, last I ran the numbers a few years ago. There are physical processes that do have very extended periods of doubling, and there are digital and financial processes that don’t show any signs of doing anything but continuing to keep growing over their multidecade lives. So, like I said, it’s worth thinking carefully, and risk mitigation for things like mental health, career decisions and investment decisions indicates we should be cautious assessing new dynamics.


>There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes

Or Roman trade volume before the Fall of Rome.

Not to mention what you describe is not technological improvement but increase in data or money flows, not the same.


Sic transit gloria - obviously.

But I don’t think that it’s quite so obvious that model quality / growth / usefulness is definitively and obviously not more like data or money flows than it is like some other process.


Total volume of usage is not an advancement, it’s orthogonal.


Indeed, and it's more linked with market penetration than technological advancement. It's like evaluating airplane technology by "total miles flown".


This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently (https://www.nature.com/articles/d41586-024-03214-7).

So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output while reducing the amount of transformers needed for decoding and encoding IO and for internal reasoning.

There are also different inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.

You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.


> So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve

But then, wouldn't we first have to translate all of our current math and physics knowledge into that new representation in order to be able to train a model on it? Looks like a tremendous amount of work to me.


Yes, but by then you already have general LLMs capable of helping with the work. And even if you didn't, if that's what it would take to advance research in these fields, that would be a justifiable effort.


>This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently.

That's precisely what happens on the bad side of a S curve.


Progress doesn't stop, however, and the S curve resets, because then you are optimizing a new architecture.


I read an experiment someone wanted to try where they used pre-1900 content and tried to get relativity. Another version would be train an LLM on school curriculum up until calculus and see if it can invent calculus. Where we are on the curve depends on if it's remixing known things or genuinely inventing things.

From the article,

> ...LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments...


What people miss is that AI isn't one S curve, each capability we try to bake into a model has its own S curve. Model progress might not impact some capabilities at all, but other capabilities might get totally overhauled.


Software and hardware have no limits. Theoretically we could use bosons for computation and have, in one cm³, as much computation as the entire world's current total. Same with software: there has never been a stop to new algorithms. With LLMs there are so many parts that will get better, and those improvements are not very far-fetched.


> Software and hardware have no limits.

Yeah, if time is infinite, R&D imagination is infinite, energy is infinite and material resources are infinite. Easy.


Assuming it’ll stop soon is to wager that we’re at a very specific point on the curve.

If it’s anyone’s guess then we’re much more likely to be left of that, unless you argue we’re already on the flat side.


It can be an S curve (and it almost surely is), but on every chart you can plot, you don't see even an inkling of the bend yet.


you can tell where on the sigmoid we're currently sitting? frontier lab folks can't - chapeau bas good sir


> frontier lab folks can't

Do you have a source for this that isn't marketing spiel? There's a fiscal incentive to lie about scaling research.


This is FUD and extremely wrong. None of the advancements have followed an S curve. This time IS different and it should be obvious to you at this point.


What the fuck does that have to do with “soon”?


There are many indications that model progress is slowing down, so that is not entirely accurate.


Please be specific because outside of anecdotal blog posts by people who don’t know what they’re talking about it’s not true. Look at scaling laws, composite benchmarks from the epoch capability index, nothing at all suggests “model progress is slowing down”


Which indications are that?


The cost factors on the new models compared to the old models.


Qwen3.6 9B is as good as GPT-4o and runs on my M2 MacBook Air. Models are getting stronger and less costly at the same time, but these are somewhat separate branches of research. Frontier labs are spending more because they are still getting marginal returns and there is more capacity to spend than there was a year ago.


Qwen 3.6 9B doesn't exist.

If you meant 3.5 9B and you truly believe it's as good as 4o then I can only assume you have a very basic use case.


You are right, I was mistaken about the version. I evaluated it in general chat assistant prompts plucked from my history across a range of topics but did not use it for coding - there was never a time when I thought 4o was “good enough” for agentic coding.


You are mixing cost and progress. It’s not because it’s more and more expensive that progress is slowing down by itself.


They are intrinsically linked beyond a certain point. If we're making progress but costs are spiraling exponentially then it stands to reason that we will soon reach a point where we can no longer afford the increasing costs and thus progress will slow.

(barring some breakthrough that reduces costs, which of course may happen, but for which recent model improvements are not strong evidence)


Cost for a specific level of performance decreases 10x per year; this has been a pretty consistent property for a while now.


I guess within the domain of AI, a pertinent question would be: "do I want to use anything but the best?" The errors older models give being directly analogous to being stupider in my eyes.


Depends — many tasks in various pipelines have a reasonable Pareto frontier and diminishing returns after a certain level of performance. You may just have a high budget constraint (say like YouTube computing ASR subtitles; they are not going to be using the best ASR models because it’s expensive). If it’s myself, with a coding agent, I’m going to get the best thing I can afford.


Investment dollars.


Source for that claim?


Nobody is releasing NEW models


…not only is this not true but it also doesn’t matter. Why would this indicate performance saturating?


The standard networking connection has been called “Ethernet” for more than thirty years, so networking has stagnated, right?


If higher bandwidth networking consisted primarily of running more and more ethernet lines in parallel, you would most certainly agree that "networking has stagnated".

"Reasoning" and now "Agentic" AI systems are not some fundamental improvement on LLMs, they're just running roughly the same prior-gen LLMs, multiple times.

Hence the conclusion that LLM improvement has slowed down, if not stagnated entirely, and that we should not expect the improvements of switching to these "reasoning" systems to keep happening.


From TFA:

“ChatGPT came up with an idea which is original and clever. It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove”


You misunderstand. I'm not saying that Reasoning/Agentic systems aren't better.

I'm saying they're not an advancement in the tech in the way GPT 1 through 3 were. They're a different kind of improvement.

And as such the rate of improvement cannot just be extrapolated into the future.


The advancements from GPT1 through GPT3 were exactly like using more Ethernet cables in parallel.

All interesting conceptual breakthroughs came after GPT3: RL and reasoning being the main ones.


What constitutes a NEW model for the purposes of calculating progress?


What? DeepSeekV3 just came out and is incredible for the price. Mythos is also half-released.


Until you or I can actually use Mythos in Claude without an nda or other strings attached, Mythos is not released and is just an effective marketing tool for Anthropic.


At least to me this is a pretty sour grapes take. There are all kinds of released products that are expensive or need an NDA. You're just too poor to afford it. But make no mistake, there are governments using this en masse and likely against you.


I think that’s worthy of at least sour grapes, too.


Model progress at spitting out unhallucinated facts is slowing down hard. Model progress at solving hard math challenges/programming tasks doesn't seem to be slowing down that I can tell.


Deep think still makes many many many more mistakes than gpt 5.5 pro on math


LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response and that allows me to evaluate its output on its "vibes", rather than line by line. If there's no expectation then I have to take everything at face value and now I'm at the mercy of the machine.


Exactly, if I generate a large chunk of software, I'm going to have expectations about what it will do, how it will do it, etc. You don't just accept the statement that "it's done" as fact but you start looking for evidence.

A scientific approach here is to look to falsify the statement. You start asking questions, running tests, experiments, etc. to prove wrong the notion that it is done. And at some point you run out of such tests and it's probably done for some useful notion of done-ness.

I've built some larger components and things with AI. It's never a one shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process kind of runs itself, almost. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"


> Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

Exactly - you need to constantly have your sceptics glasses on and you need to be exacting in terms of the structure you want things to follow. Having and enforcing "taste" is important and you need to be willing to spend time on that phase because the quality of the payoff entirely depends on it.

I recently planned for a major refactor. The discussion with claude went on for almost two days. The actual implementation was done in 10 minutes. It probably has made some mistakes that I will have to check for during the review but given that the level of detail that plan document had, it is certainly 90-95% there. After pouring-in of that much opinion, it is a fairly good representation of what I would have written while still being faster than me doing everything by hand.


So you have to know the answer and also be an expert in the problem domain?


In my experience you need exactly what you said, and I would add that he probably would have spent half a day doing the refactoring himself and would be sure he did it right.


I don't think you have to know the answer. If the person you replied to knew the answer, there wouldn't have been a big, lengthy discussion.

But yes, being an expert in the problem domain helps. Or at least knowing enough to know what the right questions are and what plausible answers look like.

I just had a similar situation where an hour or two of conversation turned into a five-minute robot coding task. The problem required a solution and the number of possible solutions is vast, but that list can be refined, and then once the course of action is set, sometimes the course itself isn't all that complicated.


I can speak towards building large-scale systems from scratch with these tools. I've been working since late last year on a project that was barely a tech demo, and the progression of development on that project has seen me go from leveraging co-pilot autocomplete at the start, to full-on vibecoding 100% of the new additions.

I have reasonable eng chops I'd like to think - I have been a senior IC for a while on a reasonably diverse set of challenging systems problems and built out some pretty large-scale pieces of software the old "artisanal" way.

This particular project is a productization of some ideas I had for leveraging a virtual machine to execute high-divergence parallel logic on GPUs, in an effort to move complex things like "unit behaviour in games" (the classical symbolic kind, not NN-based unit behaviour) into the GPU. The project is going well but still quite a ways from release. But it's at about 300k lines of code now across 9 or so rust repositories, and a smattering of typescript on the frontend.

I have had stumbles, but overall I feel I have put together some good strategies and principles for pushing large projects along with these tools in an effective way.

The biggest takeaway for me is that the "feel" is different. Software construction by hand felt like building legos where you put the pieces together yourself. A lot of my focus would be on building and solidifying core components so I could rely on them when I stepped up to build higher-level components. Projects would get mired quickly if you didn't solidify your base.

With agentic development, one of the early challenges I ran into was this issue with something I'll call "oversight inception". It's when at some early point in the process a somewhat low-importance decision is made - an implementation decision, a decision to say.. align a test with the implementation rather than an implementation with a test.

Then, as you build more on top of this, that small decision somehow ends up getting reified into a core architectural policy that then cascades up.

You realize that when you're building a big project, the focus on some particular component is backstopped by a general understanding of local development directionality with respect to the larger level project. And the agent has no idea of directionality.

So small chinks in the design end up getting magnified and blown up as the dev process proceeds, and later on review you find major architectural pieces have just been overlooked, all flowing from some small incidental implementation choice a long time before.

This is one among a number of issues, but it's a big one. Once I saw it happening I tried an approach to mitigate it by developing a set of golden "goal" documents that describe directionality at the project level: what you are working towards and what design components need to exist.

This doesn't eliminate the "oversight inception" issue, but it does catch them earlier.

When I started applying the goal documentation aggressively to re-align the project implementation direction, I found velocity dropped a lot.

And as I progress, I'm balancing this out a bit - to allow the system to diverge a bit, but force reconvergence towards the goals at some specific cadence. I haven't found the right cadence yet but I'm getting there.

This new style of development feels more like claymoulding pottery than lego assembly. You sort of "get it into shape". It's a very interesting new set of process assumptions.


I agree, but I would add that they can be very useful even if you do not have clear expectations but have some solid ways to verify their claims. Often in doing this verification I came up with new ideas.


I'm no physics professor but this aligns with the way I use the tools in my "senior engineer" space. I bring the fundamentals to sanity-check the trigger-happy agent and try to imbue other humans with those fundamentals so they can move towards doing the same. It feels like the only way this whole thing will work (besides eventually moving to local models that do less but companies can afford).


Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn. It does not, and it is for the human brain a formidable task to remember that something as smart as an LLM does not learn. I keep catching myself making the same mistake.

It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.

I have not yet played with the long term memory feature, but I fear it will be even less reliable than prompts, simply because in one year or two years so much will have changed again that this “memory” will have to be redone multiple times by then.


They can form new associations between concepts via their input prompts and thinking text. That is a form of learning. Just not very durable. I liken it to https://en.wikipedia.org/wiki/Anterograde_amnesia


yeah, I should have been more specific: I meant the type of learning that mentoring fosters, the long term learning.


I hear you. I think we are already seeing some middle ground with agentic systems using RAG, skills.md files, etc. It's a sort of disassociated card catalog memory. An engineer's notebook. Not the integrated, correlated, pre-processed set of relationships in the model. How to go backward from the notebook -> model cheaply without tanking performance is definitely one of those billion dollar questions.


a little glib, but there is in fact long term learning. It's just that you are not the one mentoring- the models go to intensive OpenAI/Anthropic/Google school for a quarter or half a year and come back (hopefully) improved. You just hope they're getting a good education. Certainly it's a very prestigious one.


Current LLM architecture doesn't learn - and you're right this is a huge piece that normal folks fail to understand, since in many ways, it's the opposite of what years of AI research has been trying to create.

However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.


If I was a frontier lab and I solved continual learning, as of today I would absolutely not release it - the society isn't ready for this; society isn't even ready for widespread diffusion of current publicly available frontier models.

If however I was a frontier lab who solved continual learning and my competitor also solved and released it, I would release mine immediately, obviously.

The point is, continual learning might be solved already, we just don't know and those who might know would rather keep their mouths shut. It isn't my base case (financial situation of frontier labs is such that they'd probably release immediately as long as they have inference compute to serve this revolutionary capability), but it isn't impossible.


You're not a frontier lab, the shareholders own those. And if shareholders get a private briefing about an unprecedented breakthrough in continual learning, they would announce it from the rooftops to take credit for the progress ASAP and reap the rewards for their stock value.

The only lab that I can exempt from this is DARPA.


Shareholders are not insiders. Public companies do secret projects all the time of which shareholders know absolutely nothing about and may never learn the details of them if they get cancelled.


Private market dynamics are not the same buddy.


Everyone owns them at this point and Google is outright public.


No, you're missing the point of the poster - disclosures in private markets are different than public, especially in the context of large commitments - the company has no choice.


The implications are very different if everyone owns them even if they aren’t public. They may have no choice whether to share, but the owners which have the privilege to know (not everyone because earlier owners aren’t stupid) don’t act the same, right?

And let’s be honest - rules get bent all the time, especially when valuations are 9 figures. Stakeholders at this point won’t risk killing a golden goose.


exactly like you said - the harness might learn.

we do also have training on synthetic data. it might compound.


I mostly agree, though after a mentoring session you can ask it to write a skill or a memory and it can be reasonably durable. For Claude at least, the memories work pretty well (though I am still at a small scale with them; as they grow it might start to break somewhat). It doesn't always work, but it has often enough that I thought it worth a mention.


> Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn.

I think this is a bit pedantic. Obviously the parent you’re replying to is referring to the concept of “in-context learning”, which is the actual industry / academic term for this. So you feed it a paper, and then it can use that info, and it needs steering / “mentoring” to be guided into the right direction.

Heck the whole name of “machine learning” suggests these things can actually learn. “reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.

In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.


Anthropomorphism is a subtle marketing tool used by these big AI companies, who are financially incentivized to push the myth of AGI and want everyone to believe they're right on the cusp of achieving it. It's good to be pedantic in this case, we shouldn't anthropomorphize these tools.


This is just a “hurr durr AI companies evil” argument without substance.

It’s the people that are the problem, nobody told the grandparent to use “mentoring” as a word, and my argument is that it’s a complete overreaction to classify them as anthropomorphizing AIs, and I’d argue that defaulting to that argument would be an insult to them, and it’s super pedantic.


> This is just a “hurr durr AI companies evil” argument without substance.

If you say so bud.

> nobody told the grandparent to use “mentoring” as a word

Nobody told people to say "Google it" either; nobody told us to use the word "Kleenex" when we mean tissue; nobody told us to use the word "Chapstick" when we mean lip balm. Nobody told British people to say "Hoover" when they mean vacuum, or "Sellotape" when they mean transparent tape.

This is literally how soft influence works, it's how brands "colonize" language. A professor using the anthropomorphized word "mentoring" when talking about a machine, as if it's a student that can learn and develop relationships, is this same soft influence at work. The AI companies' websites are all riddled with cognitive language, their chat bots all use conversational UI like you're talking to a person, the bots answer with "we," "me," and "I." They created an environment that made anthropomorphized language feel natural, which only helps their marketing goals.

Go ahead and call it pedantry all you want, but that's the whole point. The problem is epistemic.


I agree it’s pedantic and personally don’t get bent out of shape with people anthropomorphizing the llms. But I do think you get better results if keep the text prediction machine mental model in your head as you work with them.

And that can be very hard to do given the ui we most interact with them in is a chat session.


Absolutely, but there is no evidence that the grandparent was doing that, all they did was use the word “mentoring” and my argument is not that anthropomorphizing isn’t a problem - it is - but that the response to this particular HN is super pedantic.

Obviously the real people that are classifying AI as human intelligence aren’t going to be the top comment on reviewing LLM’s PhD-level papers. They are on very different, much more problematic areas of the internet.


But in-context learning is like a student only remembering what they’re being taught for the duration of the discussion. That’s not really how mentoring is meant to work, so pointing out the issues with the metaphor seems pretty reasonable.

In other news: That words can change meaning doesn’t mean that every possible change in meaning would be beneficial to communication and therefore desirable. Would you advocate in support of someone suggesting to use “left” to mean “right” simply on the basis words can change in meaning?


> ... that something as smart as an LLM does not learn.

what? training is learning, as long as weights are available continual learning is perfectly feasible: just keep training the LLM with the user corpus alternated with a frozen version to prevent catastrophic drift / collapse.

it's not because model providers don't want to provide user specific continual learning, that we don't know how to do it.

it would be a lot more expensive to host user-specific model weights, and would prevent amortizing the weights over many requests in batches...
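A toy sketch of one possible reading of "alternated with a frozen version": interleave each update on user data with a replay update whose target comes from the frozen model. Everything here is an assumption for illustration (a 1-D linear model, the slopes, the learning rate); it is not how any provider actually trains.

```python
def sgd_step(w, x, target_slope, lr=0.1):
    """One gradient step on squared error for the 1-D model y = w * x."""
    grad = 2.0 * x * (w * x - target_slope * x)
    return w - lr * grad

def fine_tune(w_frozen, user_slope, xs, epochs=50, replay=True):
    """Train on user data; optionally alternate with targets from the frozen model."""
    w = w_frozen
    for _ in range(epochs):
        for x in xs:
            w = sgd_step(w, x, user_slope)    # step towards the user's data
            if replay:
                w = sgd_step(w, x, w_frozen)  # step back towards the frozen model
    return w

xs = [0.5, 1.0, 1.5]
w_user_only = fine_tune(2.0, 3.0, xs, replay=False)  # drifts all the way to the user slope
w_mixed = fine_tune(2.0, 3.0, xs, replay=True)       # stays between frozen and user slopes
print(w_user_only, w_mixed)
```

The replay steps keep the weight anchored near the frozen model while still moving it towards the user data, which is the drift-prevention idea in miniature. The hosting cost problem mentioned above is untouched by this: each user would still need their own copy of `w`.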


I agree and put it this way: LLMs sound so convincing, presenting the work they do in rose-colored terms and promising to give you more if you keep going.

There is a 50/50 chance that it turns out to be right or lets you jump off the cliff.

Only the trip stays the same beautiful 5 star plus travel.

Also, spotting an error and telling the LLM about it makes things worse in most cases, because the LLM wants to please you and goes on to apologize and change course.

The moment I find myself in such a situation I save or cancel the session and start from scratch in most cases or pivot with drastic measures.

Gemini to me is the most unpredictable LLM while GPT works best overall for me.

Gemini lately gave me two different answers to the same question. This was an intentional test because I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt everything else being the same.

Reasoning doesn’t help much in the Coding domain for me because it is very high level and formally right what the LLM comes up with as an explanation.

I google more due to LLMs than before, because essentially what I witnessed is someone producing something that I gotta control first before I hit the button that it comes with. However, you only find out shortly afterwards whether the polished button started working or gave you a warm welcome to hell.


Reusing the same prompt several times is something I've started doing too. The contrast is often illuminating.

In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.

I now see LLMs as persuasion machines.
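A minimal sketch of turning that habit into a number, assuming some `ask(prompt)` function that returns a short answer string (hypothetical; `fake_model` below is a stand-in, not any real API): run the same prompt several times and see how often the most common answer shows up.

```python
from collections import Counter
from itertools import cycle

def agreement(ask, prompt, runs=5):
    """Ask the same prompt `runs` times; return the top answer and its share of runs."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs

# Stand-in for a real model call: replays canned, inconsistent answers.
_canned = cycle(["Justified", "Not justified", "justified"])

def fake_model(prompt):
    return next(_canned)

verdict, share = agreement(fake_model, "Is approach A justified?", runs=3)
print(verdict, share)
```

A low share does not say which answer is right, only that the model's confidence in its prose is not matched by consistency across runs.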


Before AI happened I watched youtube. Occasionally I encountered there very convincing arguments. Same person often made very convincing arguments on many subjects.

But noticed that the closer the domain they were talking about was to my area of competence the less convincing their arguments were. There were more holes, errors and wrong conclusions.

I recalibrated my bs meter thanks to that.

Since AI came I have successfully used this strategy of being extremely cautious towards convincing arguments so as not to be misled by AI.

However this year I'm working with AI more in the domain of software development. Where I can see the competence. And I see the competence. This had the opposite effect on me. I tend to trust AI outside my domain of expertise much more after I saw what it can do in software.

One caveat though is that there are a lot of areas of human culture where there's very little actual knowledge, but a lot of opinions, like politics, economy, diet, business, health. I still don't trust AI in those domains. But then again, I don't trust humans there either.

For me basically AI achieved the threshold of useful reliability for any domain that humans are reliable at.

I don't really care about sycophancy. I might have a slight advantage that I don't talk to AI in my native language. So its responses don't have a direct line to my emotions.


One thing I've been doing lately -- and I'm in a business function, not a technical one, although I have an engineering background -- is pitting LLMs against each other. For example, if I'm structuring a proposal or a contract with the assistance of Claude, I'll begin my 360 feedback review first by asking Claude how it would react if it were the counter-party receiving the proposal. After some iterative changes, mostly manual, I will then run the same output document past Gemini and ask it to adopt personas from both sides and provide reactive feedback. The result of this is almost always a stronger proposal that I can also accompany with proactive objection handling and a solid FAQ, as well as clear points of negotiation that will likely be acceptable to both parties.

For this sort of thing, using multiple LLMs is extremely helpful.


Ever since they started getting really sycophantic, I’ve been presenting my ideas as “my co-worker says this is a good approach but I disagree, can you help me convince him that it’s wrong?”


>LLM wants to please you

I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.

Instead, it started writing an OCR program in python.

I stopped it after several minutes.

Often Copilot says it can't do something (sometimes it's even correct); that's preferable to the try-hard behaviour here.


> Gemini to me is the most unpredictable LLM while GPT works best overall for me.

This nails an important thing IMHO. I've absolutely noticed this, for better or worse. Gemini can produce surprisingly excellent things, but its unpredictability makes me go for GPT when I only want to ask it once.


I think that ultimately, the largest change brought on by LLMs will be due not to their intelligence, but to their tenacity.

If you had an infinite number of monkeys, each with a typewriter, one would eventually write Shakespeare. If you had an infinite number of college-educated interns, each with access to all the public records you can possibly get via FOIA, one would eventually get enough evidence to prove that a top politician is cheating on their partner, evidence which you could use to blackmail that politician.

You don't need that much intelligence to do that, you just need somebody who's willing to dedicate their life to knowing everything there is to know about that guy from Louisiana.

With humans, the amount of money you'd need to pay such a person just isn't worth the reward. With LLMs, it may very well be.


please, sign up for a paid plan of either chatgpt or claude. gemini, while close, is still noticeably behind

you deserve opinions shaped by interactions with the best tools that are out there.


Gemini feels deep and philosophical. Especially for product management. Tell him you're a product manager and we're a team of two.

But regular reminder - All LLMs can be wrong all the time. I only work with LLMs in domains I'm expert in OR I have other sources to verify their output with utmost certainty.


Or when you don't care about results being very correct.

When I'm cooking meatballs with sauce and the recipe calls for frying them, I'll have an LLM guesstimate how long and which program to use in an air fryer to mimic the frying pan, based on a picture of balls in a Pyrex. So I can just move on with the sauce, instead of spending time browsing websites and stressing about getting it perfect.

I used to hate these non-deterministic instructions, now I treat it as their own game. When I publish my first recipe, I'll have an LLM randomize the ingredient amounts, round them up to some imprecise units and also randomize the times. Psychologists say we artists need to participate and I WILL participate.


> I only work with LLMs in domains I'm expert in

This. Should become a general rule for any non-trivial use of LLMs in a professional setting.


LLMs can also be really good in fields where you are not an expert. You just need to be very aware of your limitations, and start a parallel conversation so one agent fact-checks the other.


Seriously, it’s not worth reaching for less intelligence. Use Extended Pro 100% of the time for things you’d spend the amount of time GP spent writing their post.


Gemini is certainly not behind Claude in terms of physics.


Agreed, Gemini is clearly a capable model, but the tool use is lagging behind the other two. Ironically it regularly gets things wrong (ie. the current version of some software) because of an unwillingness to use web search.


ChatGPT and Gemini are actually fairly comparable.

Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get overly bogged down in tedious details before it gets to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure it's great, but otherwise you're just burning tokens.


We've got a rather extensive AI setup through our equity fund and I've set up a group of agents for data architecture at scale. One is the main agent I discuss with and it's set up to know our infrastructure and has access to image generation tools, websearch, hand off agents and other things. I tend to use Opus (4-6 currently) and I find it to be rather great. As you point out it comes with the danger of making mistakes, and again, as you point out, it's not an issue for things I'm an expert on. What I rely on it for, however, is analysing how specific tools would fit into our architecture. In the past you would likely have hired a group of consultants to do this research, but now you can have an AI agent tell you what the advantages and disadvantages of Microsoft Fabric are in your setup. Since I don't know the capabilities of Fabric I can't tell if the AI gives me the correct analysis of a Lakehouse and a Warehouse (Fabric tools).

What I do to mitigate this is that I have fact checking agents configured to be extremely critical and non-biased on Opus, Gemini and GPT, which are then handed the entire conversation to review. Then it's handed off to an Opus agent which is set up to assume everything is wrong. After this, and if I'm convinced something is correct, I'll hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I'll need to verify.
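The shape of that hand-off chain can be sketched roughly as below. This is only an illustration of the flow described above: the reviewers are plain functions standing in for configured agents, and every name in it (`review_pipeline`, the stub critics, the example findings) is hypothetical rather than part of any real agent framework.

```python
from typing import Callable, List

# A reviewer takes the conversation text and returns a list of objections.
Reviewer = Callable[[str], List[str]]

def review_pipeline(transcript: str, critics: List[Reviewer], adversary: Reviewer) -> List[str]:
    """Run critical reviewers, then an assume-it-is-wrong reviewer, then compile a checklist."""
    findings: List[str] = []
    for critic in critics:
        findings.extend(critic(transcript))
    # The adversarial reviewer sees the conversation plus everything raised so far.
    findings.extend(adversary(transcript + "\n" + "\n".join(findings)))
    # Stand-in for the final compiling agent: de-duplicate, keeping first-seen order.
    checklist: List[str] = []
    for item in findings:
        if item not in checklist:
            checklist.append(item)
    return checklist

# Stub reviewers replacing real model calls.
critic_a = lambda text: ["check Lakehouse limits"]
critic_b = lambda text: ["check Lakehouse limits", "check pricing"]
adversary = lambda text: ["verify against vendor docs"] if "pricing" in text else []

to_verify = review_pipeline("Fabric analysis ...", [critic_a, critic_b], adversary)
print(to_verify)
```

The output is still only a list of things for a human to verify, which matches the point of the setup: the agents narrow down what to check, they do not replace the checking.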

It's ridiculously effective, but I do wonder how it would work with someone who couldn't challenge the analytic agent on domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations from c-level to nerds is ridiculously easy.


Are you using this agent hive for any repeatable tasks? What you described, superficially, seems like a one off. Genuinely curious.


I think it depends on what you mean by repeatable tasks. I reuse the critical handoff agents quite a lot since they are basically just set up to help spot bias and errors. I kind of reuse the top agent. I have a few "core" configurations that I can add to. So one will know our network, one will know our data architecture and so on, to keep them a little more focused. So for this specific agent that I described, I'll add a few lines on what I'm considering to the configuration, that I'll not reuse for anything unrelated to Microsoft Fabric. I've tried using these "core" agent configurations as hand-off agents in the past, but it doesn't seem to work well in our setup which is very isolated because we're NIS2 compliant.

I don't usually go back to the original prompt. I've actually done it a few times in regards to the presentation, to get some refined images but usually I'll start a new prompt.


In my previous jobby job I needed to pull CSVs out of Tableau, then from an ancient monolithic PHP admin and other sources, then manually merge them, reformat them in G Sheets, pivot this and that and send the report to my supervisor. Initially took 2 hours then down to one, but still senseless busy work. It was the “fault” of the incumbent IT, but if I could turn that hour into a minute… I wouldn’t get a raise, but I’d have more time for something else or nothing. I feel like this is still a scenario for countless many and perhaps the valley of the low hanging fruit. That’s where my question was coming from.
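As a rough illustration of how small that kind of job can be once the exports are on disk, here is a sketch using only the Python standard library. The column names (`region`, `month`, `amount`) and the two inputs are made up for the example; real Tableau or admin-panel exports would look different.

```python
import csv
import io
from collections import defaultdict


def read_rows(text):
    # Parse CSV text into a list of dicts keyed by the header row.
    return list(csv.DictReader(io.StringIO(text)))


def merge_and_pivot(sales_csv, refunds_csv):
    # net[region][month] = total sales minus total refunds
    net = defaultdict(lambda: defaultdict(float))
    for row in read_rows(sales_csv):
        net[row["region"]][row["month"]] += float(row["amount"])
    for row in read_rows(refunds_csv):
        net[row["region"]][row["month"]] -= float(row["amount"])

    # Pivot: one row per region, one column per month.
    months = sorted({m for per_region in net.values() for m in per_region})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["region", *months])
    for region in sorted(net):
        writer.writerow([region, *(net[region].get(m, 0.0) for m in months)])
    return out.getvalue()


# Hypothetical sample data standing in for the two exports.
SALES = "region,month,amount\nnorth,2024-01,100\nnorth,2024-02,150\nsouth,2024-01,80\n"
REFUNDS = "region,month,amount\nnorth,2024-01,10\nsouth,2024-02,5\n"

report = merge_and_pivot(SALES, REFUNDS)
print(report)
```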

Your firm seems to operate on a higher plane, jealous :)


I've been watching the automation of things like flight control systems for the past decade, and the evolution of the fallback to a real pilot in the event of an emergency is what's most concerning about where LLMs are being embedded.

Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but that pool of people is slowly going to be replaced by less knowledgeable ones.

At some point, a rubicon will be crossed where these systems can't fall back to a human operator and will fail spectacularly.


Watching a teenager approach their homework, instead of struggling to answer questions they don't know, they ask Gemini. Unfortunately, I think the mental struggle to approach an answer is where much of the learning is. They also miss out on the reward for persistence of seeing things fall together.

It is troubling. It suggests a plateauing of human understanding.


It absolutely is where the learning is, that's pretty well established brain science.


It's a struggle to map a good approach - LLM-based tools are a boon in some respects, like having a personal tutor. But in others they are fundamentally opposed to the process of education.

Like, I asked ChatGPT to make me some problems, it did, then I got to check my answers. In the past I'd have had a textbook for that; but schools stopped giving those out decades ago.


What that means practically is that we've got a generation - 25 years or less - to evolve these things not to need the fallback. If such a thing is possible.


We're on the road to Idiocracy.


This doesn't surprise me since the coding agents are similar. I've previously compared them to very fast, ambitious junior programmers. I think they are probably mid-level coders now, but they continue to make mistakes that a senior programmer wouldn't. Or at least shouldn't.


This is close to my experience with code. LLMs can pick out small mistakes from giant code changes with surprising accuracy, or slowly narrow down a weird bug. On the other hand I've seen them bravely soldier on under completely incorrect conceptual models of what they're working with and consequently churn around in circles, spin up giant piles of slop to re-implement something they decided was necessary but didn't bother to search for, or outright dismiss important error signals as just 'transient failures'. Unlimited stamina, low wisdom.


Hi ziotom! I'm curious about your work on 3D Clifford algebras. Could you share some links to the research you do? I'm also interested in this topic and research it on my own.

In case you don't want to disclose your name here, my email is northzen@gmail.com


Gemini’s smug and over-confident “this is the gold standard in 2026” definitely leaves little space for nuance if you don’t know the subject matter. Human students would, hopefully, know they don’t know everything.


> Gemini’s smug...

Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.


It's only "statistically generated" in the same way that your brain is just "neurons firing." That's the low-level description of what's happening, but on a higher level, it's correct to say that it's being smug.


> it's correct to say that it's being smug.

It's not correct to say that it's being smug, because when people are being smug, we do it for a purpose - e.g. to signal higher social status or superior knowledge.

A machine has no such imperative, so what you call 'being smug' is statistical mimicry.


The LLM has learned certain behaviors, including smugness. Its motivations for being smug may be different, but it's being smug nonetheless.


I suppose the model was trained in such a way, the smugness is a facsimile of reality. Other models offer concise, direct answers without these idiotic qualifiers.


>Anthropomorphizing these systems is dangerous

That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.


First I thought to myself, "my daughter does this and it looks so cute". And only as a second thought, that your comment just proved itself.


I find exactly the same for legal analysis. Great at ideation and proofreading but frequently misunderstands concepts and hallucinates conclusions from faulty premises.


I assume that once LLMs are trained with large [synthetic] information about 3D Clifford algebras it will work better.


> in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all the domains. Not long ago, we didn't even have chatbots capable of basic conversation...


Ironically, it's sort of the other way around! Every frontier chatbot since GPT 4 (at least) has had a pretty good understanding of even very esoteric technical concepts.

Bivectors and pseudoscalars (in a 3D context) are "just" signed areas and volumes. Easy!
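To make the two exponentials from the quoted sentence concrete, here is a short worked form. It assumes the usual Euclidean algebra Cl(3,0) with an orthonormal basis e_1, e_2, e_3 (my assumption, the thread doesn't fix a signature), a unit bivector B such as e_1 e_2, and the pseudoscalar I = e_1 e_2 e_3.

```latex
% Both square to -1 in Cl(3,0):
B^2 = (e_1 e_2)^2 = -1, \qquad I^2 = (e_1 e_2 e_3)^2 = -1 .

% So both exponentials have the same Euler-like form:
e^{\theta B} = \cos\theta + B \sin\theta , \qquad
e^{\theta I} = \cos\theta + I \sin\theta .

% The difference is in how they act: I commutes with every element,
% while B anticommutes with vectors lying in its own plane.
I a = a I \ \text{for all vectors } a , \qquad
B e_1 = - e_1 B .
```

So exp(θB) is a rotor (it rotates in the plane of B when applied two-sidedly), whereas exp(θI) is a scalar plus a pseudoscalar that behaves like an ordinary complex phase. The formulas look identical on paper, which is presumably why they are easy to mix up.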

Back around the GPT 3, 3.5, and 4.0 era I used to ask the bots to explain "counterfactual determinism", which is one of the most complex topics I personally understand.

Then I would lie to the bot about it, and see if it corrected me or not.

This test is useless now, the frontier models can't be fooled any longer on such "basic" concepts.

Conversely, LLMs are basically useless at anything that has too little (or no) public information for their training. Think: obscure proprietary product config files and the like, even if the concepts involved are trivial.

Similarly, Clifford Algebra is a relatively niche (even "alternative") area of mathematics and physics, with vastly less written material about it than the competing linear algebra. Hence, the AIs are bad at it.


Any experience with NotebookLM?

Mine has been epically bad.


I don't think the experience with Gemini will be the same when using GPT.


Chiming in to agree but clarify that the latest sota models are no better than Gemini.

I put my stuff through several sota models and round robin them in adversarial collaboration and they are all useful even though, fundamentally, they don’t “understand” anything. But they are super useful delegates as long as deciding on the problem and approach and solution all sits safely in your head so you can challenge them and steer them.

So I know the article is about one particular new model acing something and each vendor wants these stories to position their model as now good enough to replace humans and all other models, but working somewhere where I am lucky enough to be able to use all the sota models all the time, I can say that all keep making obvious mistakes and using all adversarially is way better than trusting just one.

I look forward to the day a small open model that we can run ourselves outperforms the sum of all today’s models. That’s when enough is enough and we can let things plateau.


Basically all Erdos problems that get solved with AI use ChatGPT 5.* Pro, not Gemini/Opus.


I would guess it's because ChatGPT Pro allows for 80min "think". I've never had even remotely similar think times with Gemini Deep Think. It's generally around 10-15min for math problems, and gets increasingly shorter for continued interactions.


intern that never sleeps


LLM’s are the most powerful tool invented to search across a huge information space in response to human input.

That’s all they are. They don’t ‘know’ anything intrinsically and do not ‘know’ what reasoning even is.


Initially, I was skeptical and thought that this is the millionth vibe-coded project that will die once the author gets bored.

However, when I checked who the author is, I found he is Graeme Geldenhuys, the author of the fpGUI library [1]. He surely has a lot of experience with FreePascal and the Pascal language, and over the years he has proven to be a committed worker (fpGUI has existed since 2010!). So, this project seems to start on good grounds!

I really hope it will gain traction, as I have often wondered myself why somebody would not create a “clean” Pascal compiler from scratch with no legacy cruft and good defaults (e.g., UTF-8 strings, inline variable declaration).

[1]: https://fpgui.sourceforge.net/


Thank you. fpGUI is 20 years old this year! :-D


Giuseppe Tartini claimed his violin sonata "Il trillo del diavolo" (Devil's trill) was played to him by the devil itself during a dream. However, unlike you, Tartini declared that he had been able to capture only a small part of the music.

https://en.wikipedia.org/wiki/Devil%27s_Trill_Sonata


> they'll have to run for 25% of their expected life span to get them back

Do you have a Life Cycle Assessment source for this? This paper [1] quantifies the Energy Payback Time for a modern nuclear plant to be roughly 6 years (see Table 18), and EPT is a conservative metric because it accounts for the total embodied energy of construction (steel, concrete…). For a plant running for 60 years, 6/60 means that it will be about 10% at most, not 25%.

> solar would have made far fewer emissions

Again, do you have a source? Referring to this, it does not seem so [2]: 6 tonCO2/GWh for Nuclear vs 53 tonCO2/GWh for Solar.

> they are big up front money sinks, creating a sunk investment, diminishing the gamma of future options one might have wished to invest in, or take advantage of, something nobody talks about

True, nuclear has a big initial cost, but this is an incomplete metric. It ignores system integration costs, which grow non-linearly as solar penetration increases. Intermittency forces the grid to over-build capacity and storage, and significant investments are needed to fix it.

> They are perfect for government vanity projects, though, where a lot of money can be siphoned off to personal crypto gardens, repeatedly. Money laundering is likely the leitmotiv behind why you see them being built.

I agree, but this is true of any technology. In countries like Italy and Germany the Government provides >10 G€/year for renewables. It is quite likely that money laundering is happening in these cases as well, as corruption is generally a failure of the Government and auditing bodies, not a property of the energy source.

[1] https://www.sciencedirect.com/science/article/pii/S019689040...

[2] https://ourworldindata.org/safest-sources-of-energy


I too was perplexed, but the main use case seems to be when you want to share a particular configuration or need to be sure that you always use the same set of flags:

> Flags are ephemeral – you have to share the command line or wrap it in a script. Scripts depend on environment, which can break portability. Filenames solve both: the program describes itself, requires zero setup, and any configuration can be shared by simply renaming the file.

[Emphasis added] Although I find a script that wraps the command and calls it more versatile, there might be some value in this idea for some very simple cases, like example #4.
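To show what "the program describes itself" could look like in practice, here is a tiny sketch. The `name--key=value--flag` naming scheme and the function `options_from_filename` are purely my own assumptions for illustration, not the convention the linked project actually uses.

```python
from pathlib import Path


def options_from_filename(program_path: str) -> dict:
    # Hypothetical scheme: "<name>--<key>=<value>--<flag>.<ext>".
    # The extension is dropped, then the stem is split on "--".
    stem = Path(program_path).stem
    name, *pairs = stem.split("--")
    options = {"program": name}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        # A bare "--flag" with no "=" is treated as a boolean switch.
        options[key] = value if sep else True
    return options


parsed = options_from_filename("tools/convert--width=800--gray.exe")
print(parsed)
```

A real tool would pass its own `sys.argv[0]` here, so renaming the file is all it takes to change the configuration.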


I suppose scripts are OS specific (mainly Windows and "everything else", because #!/bin/sh is everywhere else).

That said, apparently there are cursed methods of having a universal shell/batch file of sorts, according to https://stackoverflow.com/questions/17510688/single-script-t....

Anyway, I'd argue for the vast majority of cases, a shell script that wraps the command and its flags is fine.


> you have to share the command line or wrap it in a script. Scripts depend on environment, which can break portability

I get the problems but I don't think I've ever had both at once. A need to portably wrap and share a specific command line for a specific program?

For the case of broadcast it seems easiest to just document the proper command line options. For the case of "unicast" I can just ask the other person what their environment is so I can craft the appropriate wrapper for them.

The area of overlap in the Venn diagram is infinitesimally narrow.


Also, you can share the generic program and then share wrapper scripts that are named for what they do rather than a series of flags. Then to share, you're just sharing a config file, script or similar that calls "whatever.exe --dir=./blah --run=12 --batch=false"


It just doesn't seem that hard to send someone a message stating: "Run foo.exe --bar from command prompt/the terminal"


True, but what if you need to re-run that command several months later and you can no longer find the message?


Not the OP, but a few weeks ago I posted a comment about my problems with Wayland, which forced me to go back to X11. (I am still using it.)

https://news.ycombinator.com/item?id=46001622


Thank you! I regularly use Reveal.js to create interactive slide decks for my classes, and your project will be a great tool to have!


Correct, but I would add: Julia is better than Python+NumPy/SciPy when you need extreme speed in custom logic that can’t be easily vectorized. As Julia is JIT-compiled, if your code calls most of the functions just once it won’t provide a big advantage, as the time spent compiling functions can be significant (e.g., if you use some library heavily based on macros).

To produce plots out of data files, Python and R are probably the best solutions.


Disagree on the last statement. Makie is tremendously superior to matplotlib. I love ggplot but it is slow, as all of R is. And my work isn’t so heavy on statistics anyway.

Makie has the best API I’ve seen (mostly matlab / matplotlib inspired), the easiest layout engine, the best system for live interactive plots (Observables are amazing), and the best performance for large data and exploration. It’s just a phenomenal visualization library for anything I do. I suggest everyone give it a try.

Matlab is the only one that comes close, but it has its own pros and cons. I could write about the topic in detail, as I’ve spent a lot of time trying almost everything that exists across the major languages.


I love Makie but for investigating our datasets Python is overall superior (I am not familiar enough with R), despite Julia having the superior Array Syntax and Makie having the better API. This is simply because of the brilliant library support available in scikit learn and the whole compilation overhead/TTFX issue. For these workflows it's a huge issue that restarting your interactive session takes minutes instead of seconds.


I recently used Makie to create an interactive tool for inspecting nodes of a search graph (dragging, hiding, expanding edges, custom graph layout), with floating windows of data and buttons. Yes, it's great for interactive plots (you can keep using the REPL to manipulate the plot, no freezing), yes Observables and GridLayout are great, and I was very impressed with Makie's plotting abilities from making the basics easy to the extremely advanced, but no, it was the wrong tool. Makie doesn't really do floating windows (subplots), and I had to jump through hoops to create my own float system which uses GridLayout for the GUI widgets inside them. I did get it to all work nearly flawlessly in the end, but I should probably have used a Julia imGUI wrapper instead: near instant start time!



Yes. And I did port my GUI layer to CimGui.jl. The rest of it is pretty intertwined with Makie, didn't do that yet. The Makie version does look better than ImGui though.


I tried some Julia plotting libraries a few years ago and they had apis that were bad for interactively creating plots as well as often being buggy. I don’t have performance problems with ggplot so that’s what I tend to lean to. Matplotlib being bad isn’t much of a problem anymore as LLMs can translate from ggplot to matplotlib for you.


And I would further add: In addition to performance, Julia's language and semantics are much more ergonomic and natural for mathematical and algorithmic code. Even linear algebra in Python is syntactically painful. (Yes, they added the "@" operator for matmul, but this is still true).


Even then, if you're familiar with NumPy it's pretty easy to switch to Jax's NumPy API, and then you can easily jit in Python as well.


As long as someone else does the porting and maintains the compatibility between both sub-ecosystems: those who prefer using JAX and those who prefer depending on NumPy. Also, not having zero-overhead structs that one can put in an array limits the kinds of high-performance code one can write.

