
> Note again that a residual connection is not just an arbitrary shortcut connection or skip connection (e.g., 1988)[LA88][SEG1-3] from one layer to another! No, its weight must be 1.0, like in the 1997 LSTM, or in the 1999 initialized LSTM, or the initialized Highway Net, or the ResNet. If the weight had some other arbitrary real value far from 1.0, then the vanishing/exploding gradient problem[VAN1] would raise its ugly head, unless it was under control by an initially open gate that learns when to keep or temporarily remove the connection's residual property, like in the 1999 initialized LSTM, or the initialized Highway Net.

After reading Lang & Witbrock 1988 https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf I'm not sure how convincing I find this explanation.


That's a cool paper. Super interesting to see how work was progressing at the time, when Convex was the machine everybody wanted on (or rather next to) their desks.


For residual networks with an infinite number of layers it is absolutely correct. For a residual network with finitely many layers, you can get away with any non-zero constant weight, as long as the weight is chosen appropriately for the fixed network depth. The problem is simply that c^n gives you very big or very small numbers for large n and large deviations from 1.
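To make that concrete, a toy calculation (illustrative only) of how c^n behaves for a few constant skip weights c:

    # Illustrative sketch: a constant skip weight c applied across n layers
    # scales the signal (and the gradient) by c**n, which is only well-behaved
    # for c very close to 1.
    for c in (0.9, 1.0, 1.1):
        for n in (10, 100, 1000):
            print(f"c={c}, n={n}: c**n = {c ** n:.3e}")
    # c=0.9 vanishes (~1.7e-46 at n=1000), c=1.0 stays exactly 1,
    # c=1.1 explodes (~2.5e+41 at n=1000).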

Now let me address the other possibility that you are talking about: what if residual connections aren't necessary? What if there is another way? What are the criteria necessary to avoid exploding or vanishing gradient or slow learning in the absence of both?

For that we need to first know why residual connections work. There is no way around calculating the back propagation formula by hand, but there is an easy trick to make it simple. We don't care about the number of parameters in the network, we only care about the flow of the gradient. So just have a single input and output with hidden size 1 and two hidden layers.

Each layer has a bias and a single weight and an activation function.

Let's assume you initialize each weight and bias with zero. The forward pass returns zero for any input, and the weight gradients are all zero. In this artificial scenario the gradient starts vanished and stays vanished. The reason is pretty obvious when you apply backpropagation: the second layer's zero weight zeroes out the gradient of the first layer. If there were only a single layer, the gradient would be non-zero and yield a non-zero update, rescuing the network from the vanishing gradient.

Now what if you add residual connections? The forward pass stays the same, but the backward pass changes for two layers and beyond. The gradient for the second layer is just the derivative of the second layer's activation multiplied by the first layer's activation from the forward pass. The first layer's gradient consists of the second layer's gradient where the first layer's activation is substituted by the gradient of the first layer; but because it is a residual net, you also add the gradient of just the first layer on its own.

In other words, the first layer is trained independently of the layers that come after it, but also gets feedback from the higher layers on top. This allows it to become non-zero, which then lets the second layer become non-zero, which lets the third become non-zero, and so on.
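Here is a toy sketch of that zero-initialized, width-1, two-hidden-layer case (illustrative only; tanh activations and a squared loss are assumed, since nothing above fixes them):

    import math

    # Plain net:    h1 = f(w1*x + b1),       h2 = f(w2*h1 + b2)
    # Residual net: h1 = x + f(w1*x + b1),   h2 = h1 + f(w2*h1 + b2)
    # Loss L = (h2 - t)**2; all weights and biases start at zero; f = tanh.
    f = math.tanh
    df = lambda z: 1 - math.tanh(z) ** 2
    x, t = 1.0, 2.0
    w1 = b1 = w2 = b2 = 0.0

    # Plain net: the chain rule for w1 carries a factor of w2 (= 0),
    # so the first layer receives no learning signal at all.
    h1 = f(w1 * x + b1)                                   # = 0
    h2 = f(w2 * h1 + b2)                                  # = 0
    dL_dh2 = 2 * (h2 - t)
    dL_dw1 = dL_dh2 * df(w2 * h1 + b2) * w2 * df(w1 * x + b1) * x
    print("plain    dL/dw1 =", dL_dw1)                    # 0.0

    # Residual net: the skip contributes a '+1' term, so the zero w2
    # no longer kills the gradient flowing back to the first layer.
    h1 = x + f(w1 * x + b1)                               # = x
    h2 = h1 + f(w2 * h1 + b2)                             # = x
    dL_dh2 = 2 * (h2 - t)
    dL_dh1 = dL_dh2 * (1 + df(w2 * h1 + b2) * w2)         # the '1 +' is the residual
    dL_dw1 = dL_dh1 * df(w1 * x + b1) * x
    print("residual dL/dw1 =", dL_dw1)                    # non-zero (-2.0 here)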

Since the degenerate case of a zero initialized network makes things easy to conceptualise, it should help you figure out what other ways there are to accomplish the same task.

For example, what if we apply the loss to every layer's output as a regularizer? That is essentially doing the same thing as a residual, but with skip connections that sum up the outputs. You could replace the sum with a weighted sum where the weights are not equal to 1.0.
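A sketch of that per-layer-loss idea (illustrative only; the same toy network as above, with made-up mixing weights a1, a2 that need not equal 1.0):

    import math

    f = math.tanh
    x, t = 1.0, 2.0
    w1 = b1 = w2 = b2 = 0.0
    a1, a2 = 0.7, 0.3                         # arbitrary non-1.0 skip weights

    h1 = f(w1 * x + b1)
    h2 = f(w2 * h1 + b2)
    y = a1 * h1 + a2 * h2                     # weighted sum of layer outputs
    L = (y - t) ** 2 + (h1 - t) ** 2 + (h2 - t) ** 2   # loss applied to every layer's output

    # Gradient contribution to w1 from the per-layer (h1 - t)**2 term alone:
    # it reaches w1 without passing through layer 2 at all.
    dL_dw1_direct = 2 * (h1 - t) * (1 - math.tanh(w1 * x + b1) ** 2) * x
    print(dL_dw1_direct)                      # -4.0: non-zero even though w2 = 0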

But what if you don't want skip connections either, because they are too similar to residual networks? A residual network has one skip connection already and summing up in a different way is uninteresting. It is also too reliant on each layer being encouraged to produce an output that is matched against the label.

In other words, what if we wanted to let the inner layers not be subject to any correlation with the output data? You would need something that forces the gradients away from zero but also away from excessively high numbers, i.e. weight regularization or layer normalisation with a fixed non-zero bias.

Predictive coding and especially batched predictive coding could also be a solution to this.

Predictive coding predicts the input of the next layer, so the only requirement is that the forward pass produces a non-zero output. There is no requirement for the gradient to flow through the entire network.


My point is more that Schmidhuber is saying that the gates or the initialization are the innovation solely because they produce well-behaved gradients, which is why Hochreiter's 1991 thing is where he starts and nothing before that counts. But it's not clear to me why we should define it like that when you can solve the gradient misbehavior other ways, which is why https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf#pa... works and doesn't diverge: if I'm understanding them right, they did warmup, so the gradients don't explode or vanish. So why doesn't that count? They have shortcut layers and a solution to exploding/vanishing gradients and it works to solve their problem. Is it literally 'well, you didn't use a gate neuron or fancy initialization to train your shortcuts stably, therefore it doesn't count'? Such an argument seems carefully tailored to exclude all prior work...


This apparently doesn't apply here, but in fact, pixels can be generated independently of each other. There are architectures where you can generate an arbitrary pixel or element of the image without generating the others; they are just implicit. See NeRFs or 'single-pixel GANs' or MAEs: eg https://arxiv.org/abs/2003.08934 https://arxiv.org/abs/2011.13775 https://arxiv.org/abs/2401.14391

Why is this possible? I tend to think of it as reflecting the ability to 'memorize' all possible data, and the independent generation is just when you 'remember' a specific part of a memory. The latent space is a Platonic object which doesn't change, so why should your generative process for materializing any specific point in the latent space have to? It's not surprising if you could generate arbitrary points from a function like 'y = mx + b' without generating every other point, right? It's just an atemporal mathematical object. Similarly with 'generating images from a random seed'. They too are just (complicated) functions mapping one number to another number.
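A toy sketch of that 'just a function' framing (illustrative only; an untrained coordinate-conditioned MLP in the loose spirit of the linked papers, not any of their actual architectures):

    import numpy as np

    rng = np.random.default_rng(0)
    latent = rng.normal(size=16)                 # the 'random seed' / latent code
    W1 = rng.normal(size=(32, 16 + 2))
    W2 = rng.normal(size=(3, 32))

    def pixel(px, py):
        """Evaluate the RGB color at coordinate (px, py) without touching any other pixel."""
        inp = np.concatenate([latent, [px, py]])
        return np.tanh(W2 @ np.tanh(W1 @ inp))   # untrained toy network, shapes only

    print(pixel(0.25, 0.75))                     # one pixel of the 'image', generated alone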

(You might wonder if this is limited to images? It is not. In fact, you can generate even natural language like this to some degree: https://github.com/ethan-w-roland/AUNN based on my proposal for taking the 'independent generation' idea to a pathological extreme: https://gwern.net/aunn )


Requires a login?



I think OP is an instance that proves the point

This is a lazy, clumsy editing attempt, done through a document registration service which exists to prevent exactly this, and yet you have to be an experienced nerd (eg https://en.wikipedia.org/wiki/Matthew_Garrett has a doctorate and decades of software development experience) who will jump through a bunch of hoops to even begin to build a case beyond he-said-she-said. And he still doesn't have a settlement or criminal conviction in hand, so he's not even half done... Or look at the extensive forensics in the Craig Wright case just to establish simple things like that the documents were edited or backdated to a legally acceptable level.

Meanwhile, the original PDF edit in question took maybe 5 minutes with entry-level PDF tools.


One of the most surprising Gwern.net bug reports was from a compulsive highlighter who noted that the skip-ink implementation (which uses the old text-shadow trick, because frustratingly, the recently standardized skip-ink CSS still manages to fail at its only job and it looks awful) looked bad because of how browsers handle shadows and highlighting.

We had known about that (and it can't be fixed because browsers don't let you control the highlighting), but we had never imagined it'd be a problem because you'd only see it briefly when once in a while copy-pasting some text for a quote - right? I mean, why else would anyone be highlighting text? You'd only highlight small bits of text you had already read, so if it looked bad in places, that was fine, surely.

(Narrator: "It was not fine.")

Just another instance of Hyrum's law, I guess...

We decided to WONTFIX it, because we can't easily fix it without making things uglier for users who don't abuse highlighting and are reading normally - which is almost everyone else.


Demonstrating that Rich Sutton was never really on the 'LLM bus' in the first place. Note the remarkable absence of language models & large language models from that essay, despite BERT and GPT-2 and 'unreasonable effectiveness of data' etc. He only briefly mentions speech recognition. (Note also Sutton's general absence from LLM research, the Alberta Plan, the switch from DeepMind to Keen Technologies as DeepMind was forced into LLM-centric research, and the emphasis of his published research since 2019 on small models and trying to fix their pathologies like catastrophic forgetting.)

> The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

You could easily see most LLM work as a dead end because it is about 'building knowledge into your agents' (eg. by paying data labelers billions of dollars total to supplement your scrapes), and not about 'search' (still a major open problem for LLMs - o1-style serial reasoning traces are obviously inadequate) or 'learning' (LLMs depend so heavily on the knowledge already encoded in so much data for them).


It is especially not obvious because this was written using ChatGPT-5. One appreciates the (deliberate?) irony, at least. (Or at least, surely if they had asymptoted, OP should've been able to write this upvoted HN article with an old GPT-4, say...)


> this was written using

How do you know?


It is lacking in URLs or references. (The systematic error in the self-reference blog post URLs is also suspicious: outdated system prompt? If nothing else, it shows the human involved is sloppy when every link is broken.) The assertions are broadly cliches and truisms, and the solutions are trendy buzzwords from a year ago or more (consistent with knowledge cutoffs and emphasizing mainstream sources/opinions). The tricolon and unordered bolded triplet lists are ChatGPT. The em dashes (which you should not need to be told about at this point) and it's-not-x-but-y formulation are extremely blatant, if not 4o-level, and lacking emoji or hyperbolic language; hence, it's probably GPT-5. (Sub-GPT-5 ChatGPTs would also generally balk at talking about a 'GPT-5' because they think it doesn't exist yet.) I don't know if it was 100% GPT-5-written, but I do note that when I try the intro thesis paragraph on GPT-5-Pro, it dislikes it, and identifies several stupid assertions (eg. the claim that power law scaling has now hit 'diminishing returns', which is meaningless because all log or power laws always have diminishing returns), so probably not completely-GPT-5-written (or at least, sub-Pro).


> when I try the intro thesis paragraph on GPT-5-Pro, it dislikes it

I don't know about GPT-5-Pro, but LLMs can dislike their own output (when they work well...).


They can, but they are known to have a self-favoring bias, and in this case, the error is so easily identified that it raises the question of why GPT-5 would both come up with it & preserve it when it can so easily identify it; while if that was part of OP's original inputs (whatever those were) it is much less surprising (because it is a common human error and mindlessly parroted in a lot of the 'scaling has hit a wall' human journalism).


do you have a source?

when i’ve done toy demos where GPT5, sonnet 4 and gemini 2.5 pro critique/vote on various docs (eg PRDs) they did not choose their own material more often than not.

my setup wasn’t intended to benchmark though so could be wrong over enough iterations.


I don't have any particularly canonical reference I'd cite here, but self-preference bias in LLMs is well-established. (Just search Arxiv.)


My favorite tell-tale sign:

> The gap isn’t just quantitative—it’s qualitative.

> LLMs don’t have memory—they engage in elaborate methods to fake it...

> This isn’t just database persistence—it’s building memory systems that evolve the way human memory does...

> The future isn’t one model to rule them all—it’s hundreds or thousands of specialized models working together in orchestrated workflows...

> The future of AGI is architectural, not algorithmic.



That is unintentional and a bug in the dark-mode.

For dark-mode, we rely on https://invertornot.com/ to dynamically decide whether to fade or negate/invert. (Background: https://gwern.net/invertornot ) The service uses a small NN and is not always correct, as in these cases. Sorry.

I have filed them as errors with InvertOrNot, and will manually set them to invert.


Ah, lovely! Not a critique, as I enjoyed the rest of the article; just wanted to comment in case it helped resolve it!


The final section pounding the desk about how terrible ending the program was seems like it is oddly at variance with all the evidence OP had just laid out about how the program wasn't working well anymore and so wasn't actually financially a good idea. It's weird to quote a bunch of things like studies showing that 'internal R&D spending works worse than external for ROI' and then write a big moralizing sermonizing conclusion about how ending internal R&D is bad for profits and how terrible it is there's no 'patient capital' (capital which was plenty available before - what's the theory, investors stopped liking making money? insurance companies with century-long investment horizons ceased to exist? etc).

