> I would also welcome someone writing a short takedown where they fix the prompts and get better-than-2022 results from nbp
NBP (and the new ChatGPT generator) are integrated with LLMs to various degrees, so it seems like the obvious starting point is a reverse approach: ask them to describe the old images which have the esthetics that Fernando Borretti likes, and start generating from those prompts. If you can recover the old images, then it was just a prompting issue. ("Sampling can show the presence of knowledge but not the absence.") If you can't, even with their own 'native' descriptions, then that points to mode-collapse (especially all of the 'esthetic tuning' like DPO everyone does now) as being the biggest problem.
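A minimal sketch of that round-trip test, in Python; describe_image, generate_image, and similarity are hypothetical placeholders for whatever captioning, generation, and image-comparison calls you'd actually use:

    # Round-trip test: can the model's own description of a 2022-era image
    # recover that image's esthetics when fed back in as a prompt?
    # describe_image / generate_image / similarity are hypothetical placeholders.
    def round_trip(old_images, describe_image, generate_image, similarity):
        results = []
        for img in old_images:
            prompt = describe_image(img)    # the model's 'native' description
            regen = generate_image(prompt)  # regenerate from that description
            results.append((img, prompt, regen, similarity(img, regen)))
        return results

    # High similarity scores -> it was a prompting issue.
    # Low scores even from the model's own descriptions -> evidence of mode-collapse.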
There's a vein of research which interprets self-attention as a kind of gradient descent and says that LLMs have essentially pre-solved indefinitely large 'families' or 'classes' of tasks, and the 'learning' they do at runtime is simply gradient descent (possibly Newton) using the 'observations' to figure out which pre-solved instance they are now encountering; this explains why they fail in such strange ways, especially in agentic scenarios - because if the true task is not inside those pre-learned classes, no amount of additional descent can find it after you've found the 'closest' pre-learned task to the true task. (Some links: https://gwern.net/doc/ai/nn/transformer/attention/meta-desce... )
I wonder if this can be interpreted as consistent with that 'meta-learned descent' PoV? If the system is fixed and is just cycling through fixed strategies, that is what you'd expect from that: the descent will thrash around the nearest pre-learned tasks but won't change the overall system or create new solved tasks.
There's also the efficiency argument from new capability: even a tiny bit better weather forecast is highly economically valuable (and saves a lot of wasted energy) if it means that 1 city doesn't have to evacuate because of an erroneous hurricane forecast, say. But how much would it cost to do that with the rivals? I don't know but I would guess quite a lot.
And one of the biggest ironies of AI scaling is that where scaling succeeds the most in improving efficiency, we realize it the least, because we don't even think of it as an option. An example: a Transformer (or RNN) is not the only way to predict text. We have scaling laws for n-grams and text perplexity (most famously, from Jeff Dean et al at Google back in the 2000s), so you can actually ask the question, 'how much would I have to scale up n-grams to achieve the necessary perplexity for a useful code writer competitive with Claude Code, say?' This is a perfectly reasonable, well-defined question, as high-order n-grams could in theory write code given enough data and big enough lookup tables, and so it can be answered. The answer will look something like 'if we turned the whole earth into computronium, it still wouldn't be remotely enough'. The efficiency ratio is not 10:1 or 100:1 but closer to ∞:1. The efficiency gain is so big no one even thinks of it as an efficiency gain, because you just couldn't do it before using AI! You would have humans do it, or not do it at all.
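For a sense of the scale involved, a crude back-of-the-envelope in Python (the vocabulary size, context length, and bytes-per-entry are illustrative assumptions, not the fitted constants from the actual n-gram scaling-law papers):

    # Naive upper bound on the lookup table for a high-order n-gram code model.
    # All three constants are illustrative assumptions, not fitted values.
    VOCAB = 50_000        # rough BPE-scale vocabulary
    CONTEXT = 32          # tokens of context a useful code model would plausibly need
    BYTES_PER_ENTRY = 8   # one stored count/probability per n-gram

    possible_ngrams = VOCAB ** CONTEXT           # ~2e150 distinct contexts
    naive_bytes = possible_ngrams * BYTES_PER_ENTRY

    ATOMS_ON_EARTH = 1.3e50                      # rough estimate
    print(f"{possible_ngrams:.2e} possible {CONTEXT}-grams")
    print(f"~{naive_bytes / ATOMS_ON_EARTH:.2e} bytes needed per atom of Earth")

Obviously a real table would only store observed n-grams, so this deliberately overcounts; it's just to show why the ratio comes out closer to ∞:1 than to 100:1.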
> even a tiny bit better weather forecast is highly economically valuable (and saves a lot of wasted energy) if it means that 1 city doesn't have to evacuate because of an erroneous hurricane forecast
Here is the NOAA on the improvements:
> 8% better predictions for track, and 10% better predictions for intensity, especially at longer forecast lead times — with overall improvements of four to five days.(1)
I’d love someone to explain what these measurements mean though. Does better track mean 8% narrower angle? Something else? Compared to what baseline?
And am I reading this right that that improvement is measured at the point 4-5 days out from landfall? What’s the typical lead time for calling an evacuation, more or less than four days?
To have a competitive code writer with n-grams, you need more than to "scale up the n-grams": you need a corpus that includes all possible code that someone would want to write. And at that point you'd be better off with a lossless full-text index like an r-index. But the lack of any generalizability in this approach, coupled with its Markovian features, would make this kind of model extremely brittle. Although it would be efficient. You just need to somehow compute all possible language beforehand. tl;dr: language models really are reasoning and generalizing over the domain they're trained on.
No, the search space is tiny: you can just attack 1 BPE at a time! Stuff like password guessing is almost trivial when you get to do a timing attack on each successive character. So that lets you quickly exfiltrate arbitrary numbers of prompts, especially if you have any idea what you are looking for. (Note that a lot of prompts are already public information, or you can already exfiltrate prompts quite easily from services and start attacking from there...)
Hill climbing a password would only be possible if intermediate KV cache entries were stored. To hillclimb "hunter2", you're going to try "a", "b", "c", etc, until you notice that "h" comes back faster. Then you try "ha", "hb" and so on.
But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"
If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.
Sharing caches between orgs would be an incredible misstep.
Right, you can’t actually guess a letter (byte) at a time, but you can guess a token at a time (I believe the vocabulary is ~200,000 possible tokens in GPT-5).
So you could send each of the 200,000 possible tokens, see which one is cached, and then send 200,000 more to find the next cached token.
Certainly less efficient, but well within the realm of a feasible attack.
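For a sense of scale, a back-of-the-envelope (the vocabulary figure is the one quoted above; the prompt length and request rate are made-up numbers, and this still assumes per-token cache granularity):

    # Rough query cost of token-at-a-time probing of a cached prompt.
    VOCAB_SIZE = 200_000       # tokenizer size quoted above
    PROMPT_TOKENS = 500        # hypothetical length of the target prompt
    REQUESTS_PER_SECOND = 50   # hypothetical sustained probe rate

    total_probes = VOCAB_SIZE * PROMPT_TOKENS
    print(f"{total_probes:,} probes")                        # 100,000,000
    print(f"~{total_probes / REQUESTS_PER_SECOND / 86_400:.1f} days "
          f"at {REQUESTS_PER_SECOND} req/s")                 # ~23 days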
It's a good call out re: tokens vs letters, but I think you might have misunderstood my point - you can't do it a token at a time unless the intermediate KV cache is stored after each token is generated.
This won't be the case in any non-toy implementation, as it would be unnecessary and slow.
Ah, fair enough. Anthropic caches at a block level (basically a single message), so for non-trivial messages this is really less of a concern, although I definitely understand why they still scope the cache to a single tenant.
While we're at it: my own work from 2 years ago, creating an entire workflow for turning Midjourney or DALL-E dropcaps into attractive, lightweight, easy-to-create dropcaps for web pages (https://gwern.net/dropcap). We use it for the cat, Gene Wolfe, and holiday pages.
No. I think the need for adversarial losses in order to distill diffusion models into one-step forward passes has provided additional evidence that GANs were much more viable than the diffusion maximalists loudly insisted.
(Although I'm not really current on where image generation is these days or who is using GAN-like approaches under the hood or what are the current theoretical understandings of GAN vs AR vs diffusion, so if you have some specific reason I should have "caved", feel free to mention it - I may well just be unaware of it.)
"SotA diffusion uses adversarial methods anyways" seems like a bit of a departure from the case you make in the blog post.
edit: For what it's worth - I agree. At least some auto-encoders (which will produce latents for diffusion models) use some form of adversarial method.
Still, I'm curious if you think GAN models in their more familiar form are going to eventually take on LCM/diffusion models?
This works surprisingly well. If you look into enough dark corners of Unicode, it turns out that you can do a shocking amount of typography, going far beyond the obvious italics and bolds: https://gwern.net/utext
In fact, I found that writing as much math as possible in Unicode makes for the best HTML reading experience: it's fast, simple, and looks more natural (avoids font inconsistency and line-height jagginess, among other things). https://gwern.net/design-graveyard#mathjax
Regarding typing LaTeX vs Unicode: I use WinCompose/XCompose with a list of bindings that includes most LaTeX symbols. So instead of \cup I'd type <compose>cup.
This is the epitome of patching symptoms rather than treating the disease. Even if you suppress the obvious syntactic slop like 'it's not X but Y', you have no reason to believe you've fixed mode-collapse on higher, more important levels like semantics and creativity. (For example, Claude LLMs have always struck me as mode-collapsed on a semantic level: they don't have the blatant verbal tics of 4o but somehow they still 'go in circles'.) Which will potentially severely hinder the truly high-value applications of LLMs to creative work like frontier research. To the extent that this succeeds in hiding the brain damage in contemporary LLMs, it arguably is a cure worse than the disease.
Those higher-level kinds of mode collapse are hard to quantify in an automated way. To fix that, you would need interventions upstream, at pre- & post-training.
This approach is targeted at the kinds of mode collapse that we can meaningfully measure and fix after the fact, which is constrained to these verbal tics. That doesn't fix the higher-level mode collapse on semantics & creativity that you're identifying -- but I think fixing the verbal tics is still important and useful.
> but I think fixing the verbal tics is still important and useful.
I don't. I think they're useful for flagging the existence of mode-collapse and also providing convenient tracers for AI-written prose. Erasing only the verbal tics with the equivalent of 's/ - /; /g' (look ma! no more 4o em dashes!) is about the worst solution you could come up with and if adopted would lead to a kind of global gaslighting. The equivalent of a vaccine for COVID which only suppresses coughing but doesn't change R, or fixing a compiler warning by disabling the check.
If you wanted to do useful research here, you'd be doing the opposite. You'd be figuring out how to make the verbal expressions even more sensitive to the underlying mode-collapse, to help research into fixing it and raising awareness. (This would be useful even on the released models, to more precisely quantify their overall mode-collapse, which is poorly captured by existing creative writing benchmarks, I think, and one reason I've had a hard time believing things like Eqbench rankings.)
You're correct, but when the worst the ChatGPTisms get is turns of phrase like "LeetCode youth finally paid off: turns out all those 'rebalance a binary search tree' problems were preparing me for salami, not FAANG interviews." or "Designing software for things that rot means optimising for variance, memory, and timing–not perfection. It turns out the hardest part of software isn't keeping things alive. It's knowing when to let them age.", then I'm inclined to forgive it compared to how many far more egregious offenders are at the top of HN these days. This is a rather mild use of ChatGPT for copyediting, and at least I feel like I can trust OP to factcheck everything and not put in any confabulations.
If you were talking about some essays I wrote in the early 2000s, you’d be buttering your Stetson. It’s hilarious to me that several of my blog posts from 20 years ago have been called out as AI generated lol.
I agree. I've written like this too, but these days when you see it it's more likely to be AI.
I actually think if I were writing blog posts these days I'd deliberately avoid these kinds of cliches for that reason. I'd try to write something no LLM is likely to spit out, even if it ends up weird.