>> LLMs are inherently statistical interpolators. They operate beautifully in an Open World (where missing information is just "unknown" and can be guessed or left vague) and they use Non-Monotonic Reasoning (where new information can invalidate previous conclusions).
I think LLM reasoning is not so much non-monotonic as unsound, in the sense that conclusions do not necessarily follow from the premises. New information may change conclusions but how that happens is anyone's guess. There's some scholarship on that, e.g. there's a series of papers by Subbarao Kambhampati and his students showing that reasoning models' "thinking" tokens don't really correspond to sound reasoning chains, even if they seem to improve performance overall [1].
But it is difficult to tell what reasoning really means in LLMs. I believe the charitable interpretation of claims about LLM reasoning is that it is supposed to be informal. There is evidence both for and against it (e.g. much testing is in fact on formal reasoning problems, like math exam questions or Sokoban, but there's tests of informal reasoning also, e.g. on the bar exam). However, different interpretations are hard to square with the claims that "we don't understand reasoning"; not a direct quote, but I'm aware of many claims like that by people whose job it is to develop LLMs and that were made at the height of activity around reasoning models (which seems now to have been superseded by activity around "world models") [1].
If LLMs are really capable of informal reasoning (I'm not necessarily opposed to that idea) then we really don't understand what that reasoning is, but it seems we're a bit stuck because to really understand it, we have to, well, formalise it.
That said, non-monotonic reasoning is supposed to be closer to the way humans do informal reasoning in the real world, compared to classical logic, even though classical logic started entirely as an effort to formalise human reasoning; I mean, with Aristotle's Syllogisms (literally "reasonings" in Greek).
My claim was not that an LLM was a formal, mathematically sound non-monotonic logic engine, but that the problem space is "non-monotonic" and "open world". The fact that an LLM is "unsound" and "informal" is the exact reason why my approach is necessary. Because LLMs are unsound, informal, and probabilistic, as you say, forcing them to interface with Lean 4 is a disaster. Lean 4 demands 100% mathematical soundness, totality, and closed-world perfection at every step. An LLM will just hit a brick wall. Methods like Event-B (which I suggest in my article), however, are designed to tolerate under-specification. It allows the LLM to provide an "unsound" or incomplete sketch, and uses the Proof Obligations to guide the LLM into a sound state via refinement.
Reasoning is a pattern embedded within the token patterns, but LLMs are imitating reasoning by learning symbolic reasoning patterns.
The very fact that it memorized the Caesar cipher ROT13 pattern is because ROT13 is a Linux command, so the training data had plenty of examples of 13-shifted letters. If you asked it to figure out a different shift, it struggled.
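For reference, a general Caesar shift (ROT13 is just the shift-13 special case) is purely mechanical; a minimal Python sketch:

```python
# General Caesar shift over ASCII letters; ROT13 is caesar(text, 13).
def caesar(text, shift):
    out = []
    for c in text:
        if c.isalpha():
            base = ord('A') if c.isupper() else ord('a')
            out.append(chr((ord(c) - base + shift) % 26 + base))
        else:
            out.append(c)  # non-letters pass through unchanged
    return "".join(out)

print(caesar("Hello", 13))  # -> Uryyb
print(caesar("Hello", 3))   # -> Khoor
```

The point is that the algorithm is identical for every shift; only a system that memorised the shift-13 pattern specifically, rather than the algorithm, would do better on 13 than on 3.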
Now compound that across all intelligent reasoning problems in the entirety of human existence and you'll see why we will never have enough data to make AGI with this architecture and training paradigm.
But we will have higher and higher fidelity maps of symbolic reasoning patterns as they suck up all the agent usage data for knowledge work tasks. Hopefully your tasks fall outside the distribution of the median training data.
I think this is absolute madness. I disabled most of Windows' scheduled tasks because I don't want automation messing up my system, and now I'm supposed to let LLM agents go wild on my data?
That's just insane. Insanity.
Edit: I mean, it's hard to believe that people who consider themselves tech savvy (as I assume most HN users do; it's "Hacker" News, after all) are fine with that sort of thing. What is a personal computer? A machine that someone else administers and that you just log in to, to look at what they did? What's happening to computer nerds?
Bath salts. Ever seen an alpha-PVP user with eyes out of their orbits, sitting through the night in front of basically a random string generator, sending you snippets of its output and firehosing with monologues about how they're right at the verge of discovering an epically groundbreaking correlation in it?
That is what's happening to nerds right now. Some next-level mind-boggling psychosis-inducing shit has to do with it.
Either this or a completely different substance: AI propaganda.
> And it's not that hard to just run it in docker if you're so worried
There is risk of damage to one's local machine and data, as well as reputational risk if it has access to outside services. Imagine your socials filled with hate, à la Microsoft Tay, because it got red-pilled.
Though given the current cultural winds perhaps that could be seen as a positive?
The computer nerds understand how to isolate this stuff to mitigate the risk. I’m not in on openclaw just yet but I do know it’s got isolation options to run in a vm. I’m curious to see how they handle controls on “write” operations to everyday life.
I could see something like having a very isolated process that can, for example, send email, which the claw can invoke, but the isolated process has sanity controls such as human intervention or whitelists. And this isolated process could be LLM-driven also (so it could make more sophisticated decisions about “is this ok”) but never exposed to untrusted input.
I don’t understand how “running it in a VM” or a Docker container prevents the majority of problems. It’s an agent interacting with your bank, your calendar, your email, your home security system, and every subscription you have: DoorDash, Spotify, Netflix, etc. Maybe your BTC wallet.
What protection is offered by running it in a Docker container? OK, it won’t overwrite local files. Is that the major concern?
It’s a matter of giving the system shims instead of direct access to “write” ops. Those shims have controls in place. Their only job is to examine the context and decide whether the (email|purchase|etc.) is acceptable, either by static rules, human intervention, or, if you’re really getting spicy, a separate-LLM-model-that-isn’t-polluted-by-untrusted-data.
Edit: I actually wrote such a thing over the weekend as a toy PoC. It uses the LLM to generate a list of proposed operations, then you use a separate tool to iterate though them and approve/reject/skip each one. The only thing the LLM can do is suggest things from a modest set of capabilities with a fairly locked-down schema. Even if I were to automate the approvals, it’s far from able to run amok.
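A minimal sketch of that propose-then-approve loop (illustrative names only, not my actual PoC code):

```python
# Toy sketch of the propose-then-approve pattern: the LLM can only emit
# Proposals from a fixed schema; a separate review step gates execution.
from dataclasses import dataclass, field

@dataclass
class Proposal:
    action: str                          # e.g. "send_email", "purchase"
    details: dict = field(default_factory=dict)

def review(proposals, decide):
    # decide(p) returns "approve", "reject", or "skip";
    # only approved proposals survive to be executed.
    return [p for p in proposals if decide(p) == "approve"]

# Example decision function: a static whitelist of allowed actions.
ALLOWED = {"send_email"}
def whitelist(p):
    return "approve" if p.action in ALLOWED else "reject"
```

The decision function is where you'd plug in human intervention or a second, untainted model; the LLM that generates proposals never touches the "write" operation directly.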
The idea that the majority of computer nerds are any more security conscious than the average normie has long been dispelled.
They run everything as root, they curl scripts into shells, they npx typos, and they give random internet apps "permission to act on your behalf" on repos millions of people depend on.
L-Systems are not phrase-structure grammars. Some are similar to Regular, others to Context-Free grammars, but they're a different grammar formalism.
The difference is that L-system rules are all applied simultaneously and consume an entire string before repeating the process in steps called "generations", where each generation takes as input the output of the previous one. Generations continue indefinitely, generating ever-longer strings, up to a limit set by the user.
For instance, take the Dragon Curve L-System,
Axiom: f
Constants: +,-
f -> f+g
g -> f-g
Suppose we wanted to execute two generations of this L-System. Starting with the axiom, "f", the first generation replaces "f" with "f+g". The next generation operates on this new string. The "f" in "f+g" is replaced with "f+g", the "+" stays the same because it's a constant and the "g" is replaced with "f-g". This creates the string f+g+f-g. And so on.
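The per-generation rewriting is easy to mechanise; a minimal Python sketch of the rules above:

```python
# Minimal L-system interpreter for the Dragon Curve rules above.
# Symbols with no rule ("+", "-") are constants and pass through unchanged.
rules = {"f": "f+g", "g": "f-g"}

def generation(s):
    # One generation: rewrite every symbol of the string simultaneously.
    return "".join(rules.get(c, c) for c in s)

s = "f"  # axiom
for _ in range(2):
    s = generation(s)
print(s)  # -> f+g+f-g
```

Note that the interpreter's outer loop, not the rules themselves, is what carries the notion of "generations".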
If the same L-System was given as a phrase-structure grammar, given the axiom, "f", as input, the "f" would be expanded to f+g, and the process would stop there. No concept of recursive "generations" in phrase structure grammars! If we wanted to generate the string "f+g+f-g" with a phrase structure grammar we'd need to give the input "f+g", i.e. the L-System output of the first generation.
More to the point, the execution of L-Systems in successive generations limits the strings that can be generated by a set of rules, if we take those rules as an L-System grammar. For example, the Dragon Curve rules above, interpreted as an L-System, can only generate the strings:
f --> f+g
f+g --> f+g+f-g
f+g+f-g --> f+g+f-g+f+g-f-g
Where the strings before the "-->" are inputs and the ones after, outputs, and each string is the output of one generation.
Whereas the same rules, interpreted as a Regular, phrase-structure, grammar, accept all the following one- to three-character strings:
And so on up to 229 strings of one to six characters (as in the input of the third-generation Dragon Curve above).
But none of those strings, other than the penultimate one, are correct Dragon Curve strings, meaning that they can't be interpreted so as to draw a Dragon Curve fractal, e.g. with a Turtle interpreter as in the article above.
To be fair to your comment, the first time I read the claim that L-Systems are different in that way I scoffed and wrote it off as an attempt to justify an unnecessary new notation. I put the insistence on a new notation down to vanity and forgot about it.
I only realised that L-Systems are really a different grammar formalism when I tried to implement a simple L-System as a Regular grammar, in Prolog's Definite Clause Grammars (DCG) notation. DCGs are syntactic sugar for ordinary Prolog definite clauses, so a "grammar" in DCG notation is at the same time a parser that can also be executed as a generator without any changes, a nifty trick that is well suited to the role of L-Systems as generators, rather than recognisers. I won't bore you with the details, but suffice it to say that this natural style of parsing simply doesn't work with L-System grammars; you need a special interpreter to count the generations and pass the output of one to the next. Not complicated, but the point is that just parsing an L-System string as a phrase-structure grammar string doesn't give you the same results.
Bottom line: L-Systems aren't just trying to be different; they are. They're not phrase-structure grammars. For real.
P.S. And for our shameless plug today, my paper on learning grammars, both phrase-structure grammars and L-Systems:
> No concept of recursive "generations" in phrase structure grammars!
Not sure what you're getting at here. Generation in a CFG proceeds recursively until no more terminals can be rewritten. If "f" is a nonterminal then you'd keep rewriting just as in an L-system and you'd generate the strings you want just fine. Or is what is distinctive the fact that you can limit the number of "generations"? It seems you could simulate this in a CFG too.
Yes, that's what I mean by "recursive generations". L-System strings are interpreted in successive generations where the output of generation n becomes the input of generation n+1. The recursion I refer to here is at the level of generation steps.
You can simulate this with a parser for e.g. CFGs like you say, but then you need an outer loop that feeds the parsed strings back to the parser.
There's also a subtlety that I omitted in my earlier comment (apologies! but it's really a bit subtle). In L-Systems, strings are made up of "variables" and "constants", which are closely related to nonterminals and terminals but are not exactly the same: in phrase-structure grammars, non-terminal symbols cannot appear in strings, only in grammar rules.
So for instance, the Dragon Curve rule I give above, "f -> f+g", means "wherever 'f' appears in a string, replace it with 'f+g'". To get the same result in a phrase-structure grammar you need to further define "f" as a pre-terminal, expanding to a single "f" character.
So for a CFG-parser-with-an-outer-loop to work on an L-System one would also need to modify the L-System to be a full CFG. To make it clear, here's the definition of the Dragon Curve L-System, from my example above:
Axiom: f
Constants: +,-
f -> f+g
g -> f-g
And here is its redefinition as a CFG, with nonterminals capitalised:
F -> F+G
G -> F-G
F -> f
G -> g
I think it's easier to see that these are different objects. I bet we'd find that, for every L-System, there's a phrase-structure grammar that accepts the same strings (though it doesn't necessarily generate only the same strings) and that is a simple re-write of the L-System with variables replaced by pre-terminals, as I do above, kind of like a standard form (not a normal form, because not fully equivalent). That may even be something well-known by L-Systems folks, I don't know.
Btw, the Dragon Curve L-system is described as an OL-System, that matches a Regular language. The grammar above is context-free but that's OK, a CFG can parse strings of a Regular language. I believe L-System languages go up to context-free expressivity, but I'm not sure, there are also parameterised L-Systems that may go beyond that.
Interesting. What I was thinking was you could get the bounded-number-of-generations behavior by defining nonterminals with indices. Like this CFG which would rewrite for K steps.
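For concreteness, here is one way such an indexed CFG might look for K = 2 generations, with F2 as the start symbol (my reconstruction of the idea, not necessarily the original formulation):

F2 -> F1+G1
F1 -> F0+G0
G1 -> F0-G0
F0 -> f
G0 -> g

Starting from F2, the only derivable terminal string is f+g+f-g, i.e. exactly the second-generation Dragon Curve string.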
I believe that would work too, and you could convert to this subscripted format with a simple pre-processor. I prefer the simpler and clearer notation that leaves the counting of generations to an interpreter because it's easier to read, but that's personal taste.
Thanks for the link to the paper which looks interesting. I'll have to read it more carefully.
Something I realised a while ago: everyone can write, and that makes it very hard to stand out as a writer and make a career out of writing. Same thing with singing, more or less (although it's harder to sing well than it is to write passably).
I think I realised that while reading Harry Potter. To be fair, the writing in the books is abysmally bad. It's written by an adult woman but it comes across as the writing of a 14-year-old, and that's being charitable.
And it doesn't matter one bit. It still became the best-selling book series in history, with 600 million copies sold worldwide (as Wikipedia tells me). That's not to say that there aren't many hundreds, possibly even thousands, of better written series, even in the Young Adult space. There are. But they're not that successful.
Why? I guess because good writing doesn't matter so much as what's being written. And I guess that also doesn't matter that much. You just have to connect somehow, be in the right place at the right time, when the need to read a certain piece of writing sort of emerges naturally as a result of whatever forces shape ambient taste.
Who knows. But most people wouldn't know what good writing looks like any more than they could write well themselves, so it's obvious that the ability to write well is over-rated.
And so now we have LLMs generating prose, and that's what we'll be reading henceforth. I think it will be gradual, but it's unavoidable. One day nobody will read anything anyone else has written anymore. Why do that, if you can just ask an LLM to generate whatever you want to read?
That's great, but I wonder whether SE Asian people are any better at telling e.g. what European country someone is from, or what African country etc. It's a bit shit but we all look the same if you look from far enough away. Like, I'm Greek but I have fair hair, skin and eyes despite most people in the UK expecting me to be "olive skinned" and I have friends who could easily pass for Swedish, or, conversely, Libyan or Pakistani. You just can't tell.
E.g. one of the Mongol invaders from Medieval II: Total War had a soundbite that said "You all look the same to me!". I guess we all do.
Can't we just accept that some kids do cool stuff that many adults can't even think of? Note that this kid was distinguished in a competition for STEM projects by kids, so he's not unique, but that doesn't reduce the fact that he's uncommon to say the least.
I'm trying to think of what I did when I was 14 that was as cool as that and I can't. I had an imagination and I created stuff all the time, but nothing I'd brag about today. The Space Pistol with Disintegrator Ray was only a prototype that I never managed to really put into production.
It's really unbelievable how much data most people put online about themselves. "Valentina" has probably shared all the information about her that the alleged system dashboard showed. Any interested party would only have to search the open internet (and some walled gardens like Facebook) and aggregate the information found there.
Spy agencies and spyware companies don't have some magickal tech nobody else knows anything about. They take advantage of people's careless style of interacting online.
I've known people who were manually stalked through just information they posted to the internet. It really doesn't take anything more than a name and a few usernames.
>> I've learned from a former college colleague that got into cyber security that Israeli intelligence facial recognition is virtually error free.
What does "virtually error free" mean? There's no "error free" in facial recognition, or any other application of machine learning.
More to the point, who says all this, besides yourself in this thread? Why should anyone believe that "virtually error free" is a factual description of real technological capabilities rather than state propaganda?
Those are numbers claimed by the UK; the company behind it claims an order of magnitude over it with proper data (airport-level full face scans).
Even 89% isn't that bad imho; recognizing the overwhelming majority of a population with random cameras that don't require the subject to pose or assume specific positions is... quite something.
It's kind of bad when you factor in the consequences of being misidentified. Getting misidentified can cost you your life. It can cost you months of time and legal fees to disprove. It can waste police resources and erode public trust.
And I trust the UK police data far far more than the company. Every company says they are 99% accurate.
* The National Physical Laboratory (NPL) tested the algorithm South Wales Police and the Metropolitan Police Service have been using for LFR.
* At the settings police use, the NPL found that for LFR there were no statistically significant differences in performance based on age, gender or ethnicity.
* There was an 89% chance of identifying someone on the specific watchlist of people wanted by the police, and at worst a 1 in 6,000 chance of incorrectly identifying someone on a watchlist with 10,000 images (known as a false alert). In practice, the false alert rate has been far better than this.
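Worth putting those two figures together; a back-of-envelope sketch, where the crowd sizes are my own hypothetical numbers, not the NPL's:

```python
# Back-of-envelope check of the quoted figures.
tpr = 0.89          # 89% chance of identifying someone on the watchlist
fpr = 1 / 6000      # worst-case false alert rate quoted above

faces_scanned = 100_000   # hypothetical number of faces passing the cameras
on_watchlist = 10         # hypothetical true matches among them

true_alerts = on_watchlist * tpr
false_alerts = (faces_scanned - on_watchlist) * fpr
# With these assumptions, false alerts (~16.7) outnumber true alerts (~8.9):
# the base rate of watchlisted people in a crowd dominates the error rates.
```

Under those (assumed) crowd sizes, most alerts are false even with the quoted accuracy, which is the usual base-rate problem with screening rare targets.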
It's a big report and it'd take some time to go through it, but it's clear that laboratory testing of a system deployed in the wild is not going to give accurate results, meaning the claimed 89% is going to be significantly worse in reality. Anyway, there are obvious limitations to the testing, e.g. (from the report):
Large demographically balanced datasets: The testing of low error rates in a statistically significant manner requires large datasets. To achieve the required scale, the evaluation uses a supplementary reference image dataset of 178,000 face images (Filler dataset). This is an order of magnitude larger than the typical watchlist size of an operational Live Facial Recognition deployment. To avoid introducing a demographic bias due to reference dataset composition, a demographically balanced reference dataset was used, with equal numbers in each demographic category. For assessment of equitability under operational settings, the results from the large dataset are appropriately scaled to the size and composition of watchlist or reference image database of the operational deployment.
I'd say "uh-oh" to that. Unbalanced classes are a perennial source of error in evaluations. "Equal numbers in each demographic category" is an obvious source of unrealistic bias.
Anyway, I don't have the time to go through it with a fine-toothed comb, but just the fact that they report a 100% False Positive Rate for "operator initiated facial recognition" is another big, hot, red flag.
Also, from the UK gov link above:
* The 10 LFR vans rolled out in August 2025 are using the same algorithm that was tested by the NPL.
There's a bit of ambiguity there. The police are using "the same algorithm" tested by the NPL, but are they using the same settings? The report uses specific settings to come up with its conclusions (e.g. a "face match" setting of 0.6 for LFR), but there's nothing to say the police stick to the same. Lots of room for manoeuvring left there, I'd say.
>> The company between Oosto claims 99%.
We can easily dismiss this just by looking at the two digits preceding the "%".
________________
[1] Happy to dig up links if needed.