grey-area's comments | Hacker News

And yet LLMs still fail on simple questions of logic like ‘should I take the car to the car wash or walk?’

Generative AI is not making judgements or reasoning here; it is reproducing the most likely conclusions from its training data. I guess that might be useful for something, but it is not judgement or reasoning.

What consideration was given to the original experiment and others like it being in the training set data?


No, it’s like having a calculator which is unable to perform simple arithmetic, but lots of people think it is amazing and sentient and want to talk about that instead of why it can’t add 2 + 2.

We know why it's not going to do precise math, and why you'll have a better experience asking it for an app that solves the math problem you want. There's no point talking about it - it's documented in many places for people who are actually interested.

Maybe he prefers to burn someone else’s money.

Natural language is in fact a terrible way to express goals: it is imprecise, contradictory, subjective, full of redundancies and constantly changing. So it is possibly the worst format in which to record business rules and logic.

This lesson has been learned over and over (see AppleScript) but it seems people need to keep learning it.

We use simple programming languages composed of logic and maths not just to talk to the machine, but to codify our thoughts within a strict, internally consistent and deterministic system.

So in no sense are the vague imprecise instructions fed to LLMs the true source code.


Before LLMs (and still now) a human will often write a doc explaining the desired UX and user journeys that a product needs to support. That doc gets provided to engineers to build.

I agree - at least with the thesis - that the more we "encode" the fuzzy ideas (as translated by an engineer) into the codebase the better. This isn't the same thing as an "English compiler". It'd be closer to the git commit messages, understanding why a change was happening, and what product decisions and compromises were being designed against.


I think I’d rather have the why in English and the how in code, i.e. keep both; keeping just the English instructions is nowhere near enough to fully specify what was done and why. These things evolve as they are produced, and English is too fuzzy.

Delete the llmism after the dash and the title is correct.

Of course it didn't. Not sure you really can do that - LLMs are a collection of weights from the training set; take away the training set and they don't really exist. You'd have to train one from scratch, excluding these books and all excerpts and articles about them somehow, which would be very expensive, and I'm pretty sure the OP didn't do that.

So the test seems like a nonsensical test to me.


Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:

https://www.wizardemporium.com/blog/complete-list-of-harry-p...

Why is this impressive?

Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.


Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.

> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data

No need for surprises! It is publicly known that the corpora of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3], etc.

[1] https://news.ycombinator.com/item?id=46572846

[2] https://www.theguardian.com/technology/2023/apr/20/fresh-con...

[3] https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...


also:

"Researchers Extract Nearly Entire Harry Potter Book From Commercial LLMs"

https://www.aitechsuite.com/ai-news/ai-shock-researchers-ext...


The big AI houses are all involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least have some level of filtering for their training data to keep them legally somewhat compliant. But considering how much copyrighted material is blissfully spread around online, it is probably not enough to filter out the actual ebooks of certain publishers.

> I think they at least have some level of filtering for their training data to keep them legally somewhat compliant.

So far, courts are siding with the "fair use" argument. No need to exclude any data.

https://natlawreview.com/article/anthropic-and-meta-fair-use...

"Even if LLM training is fair use, AI companies face potential liability for unauthorized copying and distribution. The extent of that liability and any damages remain unresolved."

https://www.whitecase.com/insight-alert/two-california-distr...


> even proprietary content like the books themselves

This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. What came immediately to mind were House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.

Do LLMs have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?


> It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files

Anna's Archive alone claims to currently publicly host 61,654,285 books, more than 1PB in total.


Maybe y’all missed this?

https://www.washingtonpost.com/technology/2026/01/27/anthrop...

Anthropic, specifically, ingested libraries of books by scanning and then disposing of them.


> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.

The plot of Good Will Hunting would like a word.


So a good test would be replacing the spell names in the books with made-up spells. If a "real" spell name was still given, that would also test whether it "cheated".

A real test would be synthesizing 100,000 sentences, selecting random ones, and injecting the traits you want the LLM to detect and describe, e.g. have a set of words or phrases that could represent spells and have them used so that they do something. Then have the LLM find these random spells in the random corpus.
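A rough sketch of what that kind of synthetic-corpus test could look like (the spell names, filler sentences and injection rate below are all invented for illustration):

    import random

    # Made-up spell names, so the model cannot lean on pre-training knowledge.
    FAKE_SPELLS = ["Vexillium Orbis", "Canderol Mistram", "Quellara Novit"]

    FILLER = [
        "The corridor smelled of old parchment and rain.",
        "Nobody noticed the portrait blink twice.",
        "A kettle whistled somewhere below the staircase.",
        "The lesson dragged on until the candles guttered.",
    ]

    def build_corpus(n_sentences=100_000, seed=0):
        """Synthesize filler sentences with spells injected at random positions,
        returning the corpus text plus the ground-truth injection sites."""
        rng = random.Random(seed)
        sentences, ground_truth = [], []
        for i in range(n_sentences):
            if rng.random() < 0.0005:  # roughly one spell per ~2,000 sentences
                spell = rng.choice(FAKE_SPELLS)
                sentences.append(f'She raised her wand, cried "{spell}", and the door sealed itself.')
                ground_truth.append((i, spell))
            else:
                sentences.append(rng.choice(FILLER))
        return "\n".join(sentences), ground_truth

    corpus, truth = build_corpus()
    print(len(truth), "spells injected")
    # The test: feed `corpus` to the model, ask it to list every spell and what
    # it does, then score the answer against `truth`.

Since none of those strings appear anywhere in the training data, a correct answer would have to come from the provided context rather than from memorised lists.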

I've run that experiment now; spoiler: it cheated with its pre-training knowledge: https://georggrab.net/content/opus46retrieval.html

It could still remember where each spell is mentioned. I think the only way to properly test this would be to run it against an unpublished manuscript.

Any obscure work of fiction or fanfiction would likely be fine as a casual test.

If you ask a model to discuss an obscure work it'll have no clue what it's about.

This is very different than asking about Harry Potter.


Yeah, that's what I've been doing as well, and at least Gemini 3 Pro did not fare very well.

For fun I've asked Gemini Pro to answer open-ended questions about obscure books, like "Read this novel and tell me what the hell is this book, do a deep reading and analyze", and I've gotten insightful/enjoyable answers, but I've never asked it to make lists of spells or anything like that.

It's impressive, even if the books and the posts you're talking about were both key parts of the training data.

There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).

This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.

(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)


I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context, instead of regurgitating it from the training weights. A better test would be to subtly modify the books fed in as input to the model so that there were actually 51 spells, and see if it pulls out the extra spell, or to modify the names of some spells, etc.

In your example, it might be the case that the model simply spits out the consensus view, rather than actually finding/constructing this information on its own.


Ah, that's a good point.

Since it got 49 of 50 right, it's worse than what you would get using a simple Google search. People would immediately disregard a conventional source that only listed 49 out of 50.

The poster you reply to works in AI. The marketing strategy is to always have a cute Pelican or Harry Potter comment as the top comment for positive associations.

The poster knows all of that, this is plain marketing.


This sounds compelling, but also like something that an armchair marketer would have theorycrafted without any real-world experience or evidence that it actually works - and I searched online and can't find any references to anything like it.

Do you have a citation for this?


They should try the same thing but replace the original spell names with something else.

Why don’t you ask it and find out ;)

Because the model doesn't know but will happily tell a convincing lie about how it works.

Well there are lots and lots of examples that don't end in bankruptcy, just a very large loss of capital for investors. The majority of the stars of the dotcom bubble just as one example: Qualcomm, pets.com, Yahoo!, MicroStrategy etc etc.

Uber, which you cite as a success, is only just starting to make any money, and any original investors are very unlikely to see a return given the huge amounts ploughed in.

MicroStrategy has transformed itself, same company, same founder, similar scam 20 years later, only this time they're peddling bitcoin as the bright new future. I'm surprised they didn't move on to GAI.

Qualcomm is now selling itself as an AI-first company - is it one, or is it just trying to ride the next bubble?

Even if GAI becomes a roaring success, the prominent companies now are unlikely to be those with lasting success.


Isn't the AI basing what it does heavily on the publicly available source code of C compilers though? Without that work it would not be able to generate this, would it? Or, in your opinion, is it sufficiently different from the work people like you did to be classed as a unique creation?

I'm curious on your take on the references the GAI might have used to create such a project and whether this matters.


I think things can only be called revolutions in hindsight - while they are going on it's hard to tell if they are a true revolution, an evolution or a dead-end. So I think it's a little premature to call Generative AI a revolution.

AI will get there and replace humans at many tasks; machine learning already has. I'm not completely sure that generative AI will be the route we take - it is certainly superficially convincing, but those three years have not in fact seen huge progress IMO: huge amounts of churn and marketing versions, yes, but not huge amounts of concrete progress or upheaval. Lots of money has been spent for sure! It is telling for me that many of the real founders at OpenAI stepped away - and I don't think that's just Altman - they're skeptical of the current approach.

PS Superseded.

