Hacker News | naasking's comments

> Once again LLM defenders fall back on "lots of AI" as a success metric.

That's not implied by anything he said. He simply said that it was fascinating, and he's right.


I think it's correct to say that LLMs have word models, and since words are correlated with the world, they also have degenerate world models, just with lots of inconsistencies and holes. Tokenization issues aside, LLMs will likely also have some limitations because of this. Multimodality should address many of these holes.

(editor here) Yes, a central nuance I try to communicate is not that LLMs cannot have world models (in fact they've improved a lot) - it's that they build them so inefficiently as to be impractical to scale. We'd have to grow them by many trillions more parameters, whereas human brains manage very good multiplayer adversarial world models on about 20W of power and roughly 100T synapses.

I agree LLMs are inefficient, but I don't think they are as inefficient as you imply. Human brains use a lot less power, sure, but they're also a lot slower and worse at parallelism. An LLM can write an essay in a few minutes that would take a human days. If you aggregate all the energy used by the human, you're looking at kWh, far more than the LLM used (an order of magnitude or more). And this doesn't even consider batch parallelism, which can further reduce energy use per request.
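A very rough back-of-envelope comparison, where every number is an illustrative assumption (whole-body power rather than just the brain, a couple of working days for the essay, and a guessed per-generation inference figure):

    # Back-of-envelope energy comparison; all figures are rough assumptions.
    human_power_w = 100            # whole-body power while working, not just the ~20W brain
    human_hours = 16               # roughly two working days to draft a long essay
    human_kwh = human_power_w * human_hours / 1000       # = 1.6 kWh

    llm_wh_per_generation = 10     # assumed inference energy for one long generation
    llm_kwh = llm_wh_per_generation / 1000               # = 0.01 kWh

    print(f"ratio: {human_kwh / llm_kwh:.0f}x")          # ~160x in the human's disfavour

Training energy and datacentre overhead aren't counted, so treat the ratio as directional rather than precise.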

But I do think that there is further underlying structure that can be exploited. Recent work on geometric and latent interpretations of reasoning, geometric approaches to accelerating grokking, and geometry-based linear replacements for attention are all promising directions, and multimodal training will further improve semantic synthesis.


It's also important to handle cases where the word patterns (or token patterns, rather) have a negative correlation with the patterns in reality. There are some domains where the majority of content on the internet is actually just wrong, or where different approaches lead to contradictory conclusions.

E.g. syllogistic arguments based on linguistic semantics can lead you deeply astray if those arguments don't properly measure and quantify at each step.

I ran into this in a somewhat trivial case recently, trying to get ChatGPT to tell me whether washing mushrooms actually matters in practice (anyone who cooks and has tested it knows that a quick wash has essentially no impact for any conceivable cooking method, except perhaps if you wash after cutting and are immediately serving them raw).

Until I forced it to cite respectable sources, it just repeated the usual (false) advice about not washing (i.e. most of the training data is wrong and repeats a myth), and it even pushed back with nonsense arguments about water percentages and the thermal energy required to evaporate even small amounts of surface water (i.e. theory that just isn't relevant once you actually quantify things). It also made up claims about surface moisture interfering with breading (when any competent breading has a dredging step that won't work if the surface is bone dry anyway...). Only after a lot of prompts and demands to make only claims supported by reputable sources did it finally find McGee's and Kenji López-Alt's actual empirical tests showing that it just doesn't matter in practice.

So because the training data for cooking is utterly polluted, because the model has no ACTUAL understanding or model of how cooking works, and because first-principles physics and chemistry aren't very useful in the messy reality of a kitchen, LLMs really fail quite horribly at producing useful cooking advice.


So you think that enough of the complexity of the universe we live in is faithfully represented in the products of language and culture?

People won’t even admit their sexual desires to themselves and yet they keep shaping the world. Can ChatGPT access that information somehow?


The amount of faith a person has in LLMs getting us to e.g. AGI is a good implicit test of how much a person (incorrectly) thinks most thinking is linguistic (and to some degree, conscious).

Or at least, this is the case if we mean LLM in the classic sense, where the "language" in the middle L refers to natural language. Also note GP carefully mentioned the importance of multimodality, which, if you include e.g. images, audio, and video, starts to look much closer to the majority of the same kinds of inputs humans learn from. LLMs can't go too far, for sure, but VLMs could conceivably go much, much farther.


And the latest large models are predominantly LMMs (large multimodal models).

Sort of, but the images, video, and audio they have available are far more limited in range and depth than the textual sources, and it isn't clear that most LLM textual outputs draw much on anything learned from these other modalities. Most VLM setups work the other way around, using textual information to augment their vision capabilities, and most aren't truly multimodal anyway: they bolt different backbones onto the different modalities, or are separate models switched between by a broader dispatch model. There are exceptions, of course, but it's still an accurate generalization that the multimodality of these models is one-way and limited at this point.
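A caricature of that "separate backbones behind a dispatcher" pattern, as opposed to a single jointly trained model. All class and method names here are made up purely for illustration:

    # Hypothetical sketch: modality-specific backbones behind a dispatch model.
    # Cross-modal "understanding" only flows through text the backbones emit.
    class DispatchModel:
        def __init__(self, text_backbone, vision_backbone, audio_backbone):
            self.backbones = {
                "text": text_backbone,
                "image": vision_backbone,
                "audio": audio_backbone,
            }

        def respond(self, inputs):
            # Route each input to its own backbone, then hand text captions
            # to the language model; nothing is learned jointly end-to-end.
            captions = [self.backbones[kind].describe(data) for kind, data in inputs]
            return self.backbones["text"].generate("\n".join(captions))

A jointly trained multimodal model would instead share one representation space across modalities, which is where most of the hoped-for gains live.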

So right now the limitation is that an LMM is probably not trained on any images or audio that are going to be helpful outside specific tasks. E.g. I'm sure years of recorded customer service calls might make LMMs good at replacing a lot of call-centre work, but the relative absence of, say, unedited videos of people cooking means these models fall back to mostly text when it comes to providing cooking advice (and this is why they so often fail there).

But yes, that's why the modality caveat is so important. We're still nowhere close to the ceiling for LMMs.


> So you think that enough of the complexity of the universe we live in is faithfully represented in the products of language and culture?

Math is language, and we've modelled a lot of the universe with math. I think there's still a lot of synthesis needed to bridge visual, auditory and linguistic modalities though.


> you think that enough of the complexity of the universe we live in is faithfully represented in the products of language and culture?

Absolutely. There is only one kind of model that can consistently produce novel sentences that aren't absurd, and that is a world model.

> People won’t even admit their sexual desires to themselves and yet they keep shaping the world

How do you know about other people's sexual desires then, if not through language? (excluding a very limited first hand experience)


> Can ChatGPT access that information somehow?

Sure. Just like any other information. The system makes a prediction. If the prediction does not use sexual desires as a factor, it's more likely to be wrong. Backpropagation deals with it.


Intelligence must be sigmoid of course, but it may not saturate until well past human intelligence.

Intelligence might be more like an optimization problem, fitting inputs to optimal outputs. Sometimes reality is simply too chaotic to model precisely, so there is a limit to how good that optimization can be.

It would be like distance to the top of a mountain. Even if someone is 10x closer, they could still only be within arm's reach.


Unless I'm misunderstanding what they are, planners seem kind of important.

As you mentioned, that depends on what you mean by planners.

An LLM will implicitly decompose a prompt into tasks and then sequentially execute them, calling the appropriate tools. The architecture diagram helpfully visualizes this [0].

Here, though, "planners" means autonomous planners that exist as higher-level infrastructure: external task decomposition, persistent state, tool scheduling, error recovery/replanning, and branching/search. Think of a prompt like “Scan repo for auth bugs, run tests, open PR with fixes, notify Slack” that just runs continuously 24/7; that would be beyond what nanobot could do. However, something like “find all the receipts in my emails for this year, then zip and email them to my accountant for my tax return” is something nanobot could do. (A rough sketch of that kind of outer planner loop follows the link below.)

[0] https://github.com/HKUDS/nanobot/blob/main/nanobot_arch.png
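To make that concrete, here is a minimal sketch of what such an external planner loop might look like. It assumes hypothetical llm() and run_tool() helpers (llm() returning parsed JSON, run_tool() executing a named tool); none of this is nanobot's actual API, it just illustrates decomposition, persisted state, and replanning on failure:

    # Hypothetical external planner loop (not nanobot's actual API).
    # llm() and run_tool() are assumed helpers: llm() returns parsed JSON,
    # run_tool() executes a named tool and returns its output.
    import json

    def plan(goal):
        # Ask the model to decompose the goal into discrete, tool-level steps.
        return llm(f"Decompose into a JSON list of steps with 'tool' and 'args': {goal}")

    def run(goal, state_path="planner_state.json"):
        state = {"goal": goal, "steps": plan(goal), "done": []}
        while state["steps"]:
            step = state["steps"][0]
            try:
                result = run_tool(step["tool"], step["args"])
                state["done"].append({"step": step, "result": result})
                state["steps"].pop(0)
            except Exception as err:
                # Error recovery: ask the model to replan the remaining work.
                state["steps"] = llm(
                    f"Step {step} failed with {err}. Replan the remaining steps "
                    f"for goal {goal}, given completed work {state['done']}."
                )
            # Persist after every step so a crash or restart can resume.
            with open(state_path, "w") as f:
                json.dump(state, f)

The point is just that the decomposition, the retry/replan logic, and the persisted state live outside the model, rather than being implicit in one long chat context.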


Sure, instruction tuned models implicitly plan, but they can easily lose the plot on long contexts. If you're going to have an agent running continuously and accumulating memory (parsing results from tool use, web fetches, previous history, etc.), then plan decomposition, persistence and error recovery seem like a good idea, so you can start subagents with fresh contexts for task items and they stay on task, or can recover without starting everything over again. It also seems better for cost, since input and output contexts are more bounded.
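As a sketch of that fresh-context idea (again using the hypothetical llm() helper and the plan-item shape from the loop above), each subagent gets only its own task plus a short summary of prior results, never the whole accumulated history:

    # Hypothetical sketch: one fresh, bounded context per plan item.
    def run_subagent(step, summary_of_prior_work):
        # The subagent never sees the full transcript, only its own task plus
        # a compact summary, which keeps it on task and bounds token cost.
        context = (
            f"Overall progress so far: {summary_of_prior_work}\n"
            f"Your task: {step['tool']} with {step['args']}"
        )
        for attempt in range(3):
            try:
                return llm(context)   # or a tool-using agent loop seeded with this context
            except Exception as err:
                context += f"\nAttempt {attempt + 1} failed: {err}. Try a different approach."
        raise RuntimeError("subagent exhausted retries")

Since each call starts from a small prompt rather than the full history, input tokens stay roughly constant per step instead of growing with the length of the run.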

I don’t know what these planners do, but I’ve had reasonably good luck asking a coding agent to write a design doc and then reviewing it a few times.

Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.

But language does have structure, as does logic and reasoning. Universal approximation is great when you don't know the structure and want to brute force search to find an approximate solution. That's not optimal by any stretch of the imagination though.

Neural nets are structured as matrix multiplication, yet they are universal approximators.

You're missing the non-linear activations.
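A quick illustration of why the nonlinearity matters: stacking linear layers without an activation collapses to a single linear map, while adding a ReLU in between lets even a tiny network represent something no linear map can, e.g. abs(x). (Toy example; the random weights are just there to show the algebra.)

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.standard_normal((4, 1)), rng.standard_normal((1, 4))
    x = rng.standard_normal((1, 3))

    # Two stacked linear layers are still just one linear layer:
    assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

    # With a ReLU in between, the composition is no longer linear, and a
    # 2-unit network represents abs(x) exactly: abs(x) = relu(x) + relu(-x).
    relu = lambda z: np.maximum(z, 0)
    xs = np.linspace(-2, 2, 9)
    assert np.allclose(relu(xs) + relu(-xs), np.abs(xs))

That's the sense in which matmul structure alone isn't what buys universal approximation; the pointwise nonlinearity is doing the essential work.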

I think any kind of innovation here will have to take advantage of some structure inherent to the problem, such as eliminating attention in favour of geometric structures like Grassmann flows [1].

[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428


Right - e.g., if you're modeling a physical system it makes sense to bake in some physics - like symmetry.

Indeed, and I think natural language and reasoning will have some kind of geometric properties as well. Attention is just a sledgehammer that lets us brute force our way around not understanding that structure well. I think the next step change in AI/LLM abilities will be exploiting this geometry somehow [1,2].

[1] GrokAlign: Geometric Characterisation and Acceleration of Grokking, https://arxiv.org/abs/2510.09782

[2] The Geometry of Reasoning: Flowing Logics in Representation Space, https://arxiv.org/abs/2506.12284


The LLM confusion is just the latest incarnation of the Confused Deputy problem. It's in the same class of vulnerabilities as CSRF.

The difference being that deterministic Confused Deputies can be fixed; LLMs cannot.

Calling it peer review suggests gatekeeping. I suggest no gatekeeping: just let any academic post a review, add upvotes/downvotes, and let crowdsourcing handle the rest.

While I appreciate no gatekeeping, the other side of the coin is gatekeeping via bots (vote manipulation).

Something like Rotten Tomatoes could be useful: have "verified" users (critic score) vote in a separate column from anonymous users (audience score).

This would often prove useful in highly controversial situations for parsing the common narratives.


I'm not sure anonymous users should be able to join. arXiv's system of only allowing academic users seems fine for this, although exceptions could be made for industry researchers.

> AI software engineering in a nutshell. Leaving the human artisan era of code behind. Function over form. Substance over style. Getting stuff done

The invention of calculators and computers also left behind the human artisan era of slide rules, calculation charts and manual accounting. If that's really what you care about, what are you even doing here?


> This is not correct. There are clear instructions on how a 4-way stop should operate and its yielding to the right

Yielding to the right only applies if you stop at roughly the same time; otherwise, the first to stop goes first. It's the first bullet point in your link.

