
At a previous job, I recall that updating a dependency via poetry would take on the order of ~5-30 minutes. God forbid something didn't resolve after 30 minutes and you had to wait another 30 to see if the change you made fixed the problem. Was not an enjoyable experience.

uv has been a delight to use


> updating a dependency via poetry would take on the order of ~5-30m. God forbid after 30 minutes something didn’t resolve and you had to wait another 30 minutes to see if the change you made fixed the problem

I'd characterize that as unusable, for sure.


I don’t understand the argument “AI is just XYZ mechanism, therefore it cannot be intelligent”.

Does the mechanism really disqualify it from intelligence if behaviorally, you cannot distinguish it from “real” intelligence?

I’m not saying that LLMs have certainly surpassed the “cannot distinguish from real intelligence” threshold, but saying there’s not even a little bit of intelligence in a system that can solve more complex math problems than I can seems like a stretch.


> if behaviorally, you cannot distinguish it from “real” intelligence?

Current LLMs are a long way from there.

You may think "sure seems like it passes the Turing test to me!" but they all fail if you carry on a conversation long enough. AIs need some equivalent of neuroplasticity and as of yet they do not have it.


This is what I think is the next evolution of these models. Our brains are made up of many different types of neurones, interspersed with local regions composed of specific types. From my understanding, most tensor approaches don't integrate these different neuronal models at the node level; it's usually done by feeding data to several disparate models and combining the end results. Being able to reshape the underlying model and have varying tensor types that can migrate or have a lifetime seems exciting to me.


I don't see the need to focus on "intelligent" compared to "it can solve these problems well, and can't solve these other problems".

What's the benefit of calling something "intelligent"?


Strongly agree with this. When we were further from AGI, many people imagined that there was a single concept of AGI that would be obvious when we reached it. But now we're close enough to AGI for most people to realize that we don't know where it is. Most people agree we're at least moving more towards it than away from it, but nobody knows where it is, and we're still more focused on finding it than on making useful things.


What it really boils down to is "the machine doesn't have a soul". Just an unfalsifiable and ultimately meaningless objection.


Incorrect. Vertebrate animal brains update their neural connections when interacting with the environment. LLMs don't do that. Their model weights are frozen for every release.


But why can’t I then just say, actually, you need to relocate the analogy components; activations are their neural connections, the text is their environment, the weights are fixed just like our DNA is, etc.


As I understand it, octopuses have their reasoning and intelligence essentially baked into them at birth, shaped by evolution, and do relatively little learning during life because their lives are so short. Very intelligent, obviously, but very unlike people.


Maybe panpsychism is true and the machine actually does have a soul, because all machines have souls, even your lawnmower. But possibly the soul of a machine running a frontier AI is a bit closer to a human soul than your lawnmower’s soul is.


By that logic, Larry Ellison would have a soul. You've disproven panpsychism! Congratulations!


Maybe the soul is not as mysterious as we think it is?


There is no empirical test for souls.


Scientifically, intelligence requires organizational complexity, and it has for about a hundred years.

That does actually disqualify some mechanisms from counting as intelligent, as the behaviour cannot reach that threshold.

We might change the definition - science adapts to the evidence, but right now there are major hurdles to overcome before such mechanisms can be considered intelligent.


What is the scientific definition of intelligence? I assume that it is comprehensive, internally consistent, and that it fits all of the things that are obviously intelligent and excludes the things which are obviously not intelligent. Of course, being scientific, I assume it is also falsifiable.


It can’t learn or think unless prompted, then it is given a very small slice of time to respond and then it stops. Forever. Any past conversations are never “thought” of again.

It has no intelligence. Intelligence implies thinking and it isn’t doing that. It’s not notifying you at 3am to say “oh hey, remember that thing we were talking about. I think I have a better solution!”

No. It isn’t thinking. It doesn’t understand.


Just because it's not independent and autonomous does not mean it could not be intelligent.

If existing humans minds could be stopped/started without damage, copied perfectly, and had their memory state modified at-will would that make us not intelligent?


> Just because it's not independent and autonomous does not mean it could not be intelligent.

So to rephrase: it’s not independent or autonomous. But it can still be intelligent. This is probably a good time to point out that trees are independent and autonomous. So we can conclude that LLMs are possibly as intelligent as trees. Super duper.

> If existing humans minds could be stopped/started without damage, copied perfectly, and had their memory state modified at-will would that make us not intelligent?

To rephrase: if you take something already agreed to as intelligent, and changed it, is it still intelligent? The answer is, no damn clue.

These are worse than weak arguments, there is no thesis.


The thesis is that "intelligence" and "independence/autonomy" are independent concepts. Deciding whether LLMs have independence/autonomy does not help us decide if they are intelligent.


It sounds like you are saying the only difference is that human stimulus streams don't shut on and off?

If you were put into a medically induced coma, you probably shouldn't be considered intelligent either.


I think that's a valid assessment of my argument, but it goes further than just "always on". There's an old book called On Intelligence that asked these kinds of questions (of AI) 20+ years ago. I don't remember the details, but a large part of what makes something intelligent doesn't just boil down to what you know and how well you can articulate it.

For example, we as humans aren't even present in the moment: different stimuli take different lengths of time to reach our brain, so our brain creates a synthesis of "now" that isn't even real. You can't even play table tennis unless you can predict up to one second into the future, in enough detail to be in the right place to hit the ball before returning it to your opponent.

Meanwhile, an AI will go off-script during code changes, without running it by the human. It should be able to easily predict the human is going to say “wtaf” when it doesn’t do what is asked, and handle that potential case BEFORE it’s an issue. That’s ultimately what makes something intelligent: the ability to predict the future, anticipate issues, and handle them.

No AI currently does this.


> So it will not get worse in performance but only faster

A bit confused by this statement. Speculative decoding does not decrease the performance of the model in terms of "accuracy" or "quality" of output. Mathematically, the altered distribution being sampled from is identical to the original distribution if you had just used regular autoregressive decoding. The only reason you get variability between autoregressive vs speculative is simply due to randomness.

Unless you meant performance as in "speed", in which case it's possible that speculative decoding could degrade speed (but on most inputs, and with a good selection of the draft model, this shouldn't be the case).
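
For anyone curious, here is a minimal sketch of the accept/resample rule that makes this work (toy distributions over a 4-token vocabulary; the setup is illustrative, not how any particular engine implements it):

    import numpy as np

    rng = np.random.default_rng(0)

    def accept_or_resample(p_target, q_draft, draft_token):
        # Accept the draft token with probability min(1, p/q).
        p, q = p_target[draft_token], q_draft[draft_token]
        if rng.random() < min(1.0, p / q):
            return draft_token
        # Otherwise resample from the residual max(0, p - q), renormalized.
        # The end result is distributed exactly according to p_target,
        # which is why output quality is unchanged.
        residual = np.maximum(p_target - q_draft, 0.0)
        return rng.choice(len(p_target), p=residual / residual.sum())

    p_target = np.array([0.1, 0.6, 0.2, 0.1])      # big model's distribution
    q_draft = np.array([0.25, 0.25, 0.25, 0.25])   # draft model's distribution
    token = accept_or_resample(p_target, q_draft, rng.choice(4, p=q_draft))
    print(token)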


I think the parent is saying the same thing as you: pointing out to unfamiliar folks that speculative decoding doesn't trade quality for speed.


Yes that's what I mean, speculative decoding does not decrease the performance in terms of quality. I guess my wording was confusing on this.


Some other good resources:

[0]: The original paper: https://arxiv.org/abs/1706.03762

[1]: Full walkthrough for building a GPT from Scratch: https://www.youtube.com/watch?v=kCc8FmEb1nY

[2]: A simple inference only implementation in just NumPy, that's only 60 lines: https://jaykmody.com/blog/gpt-from-scratch/

[3]: Some great visualizations and high-level explanations: http://jalammar.github.io/illustrated-transformer/

[4]: An implementation that is presented side-by-side with the original paper: https://nlp.seas.harvard.edu/2018/04/03/attention.html


Done [1]. It is a jaw-dropper! Especially if you have done the rest of the series and seen the results of older architectures. And I was like "where is the rest of it, you ain't finished!" … and then … ah, I see why they named the paper Attention Is All You Need.

But even the crappy (small, 500k param IIRC) Transformer model trained on a free Colab in a couple of minutes was relatively impressive. Looking only 8 chars back and trained on an HN thread, it got the structure / layout of the page pretty right, interspersed with drunken-looking HN comments.



Maybe this is more of a general ML question, but I faced it when transformers became popular. Do you know of a project-based tutorial that talks more about neural net architecture, hyperparameter selection, and debugging? Something that walks through getting poor results and makes explicit the reasoning for the tweaks?

When I try to use transformers or any AI thing on a toy problem I come up with, it never works. And there's this blackbox of training that's hard to debug into. Yes, for the available resources, if you pick the exact problem, the exact NN architecture and exact hyperparameters, it all works out. But surely they didn't get that on the first try. So what's the tweaking process?


There is A. Karpathy's recipe for training NNs but it is not a walkthrough with an example:

https://karpathy.github.io/2019/04/25/recipe/

but the general idea of "get something that can overfit first" is probably pretty good.
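
A minimal sketch of that sanity check in PyTorch (the model and data here are just placeholders):

    import torch
    from torch import nn

    # Placeholder model and one fixed batch.
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Train on the same batch until the loss approaches zero. If it
    # doesn't, there is a bug (data, shapes, loss, learning rate) to fix
    # before worrying about generalization at all.
    for step in range(1000):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if step % 200 == 0:
            print(step, loss.item())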

In my experience, getting the data right is probably the most underappreciated thing. Karpathy has data as step one, but data representation and sampling strategy can also work wonders.

In Part II of our book we do an end-to-end project including e.g. a moment where nothing works until we crop around "regions of interest" to balance the per-pixel classes in the training data for the UNet. This has been something I have pasted into the PyTorch forums every now and then, too.


Thanks for linking me to that post! It's much better at expressing what I'm trying to say. I'll have a careful read of it now.

I think I'm still at a step before the overfit. It doesn't converge to a solution on its training data (fit or overfit). And all my data is artificially generated so no cleaning is needed (though choosing a representation still matters). I don't know if that's what you mean by getting the data right or something else. Example problems that "don't work": fizzbuzz, reverse all characters in a sentence.


[1] is thoroughly recommended.


It really is amazing. To be fair, if you're actually following along and writing the code yourself, you have to stop and play back quite frequently, and the parts around turning the attention layer into a "block" are a little hard to grok because he starts to speed up around 3/4 of the way through. But yeah, this is amazing. I went through it the week before starting as lead prompt engineer at an AI startup, and it was super useful and honestly a ton of fun. Reserve 5 hours of your life and go through it if you like this stuff! It's an incredibly great crash course for any interested dev.


Recommended for who ?


Masochists! In a good way! I recommend you do the full course rather than jumping straight into that video. I did the full course, paused around lecture 2 to do some of a university course to really understand some stuff, then came back and finished it off.

By the end of it you will have done stuff like working out back-propagation by hand through sums, broadcasting, batchnorm, etc. Fairly intense for a regular programmer!


From looking at the video probably someone who has good working knowledge of PyTorch, familiarity with NLP fundamentals and transformers, and somewhat of a working understanding of how GPT works.


I found this lecture and the one following it very helpful as well: https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4...


And also the ones before that explain the attention mechanism:

https://youtu.be/wzfWHP6SXxY?t=4366

https://youtu.be/gKD7jPAdbpE (up to 25:42)


> "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 × 10^22 FLOPs."

Seems that for the last year or so these models have been getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).

Edit: Seems those numbers are just for their scaling laws study. They don't explicitly say the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute.". So likely on the range of 10B - 100B.


GPT-4 is way slower than GPT-3. Unless they are artificially spiking the latency to hide parameter count, it’s likely around 1trn params


The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast (THIS IS WRONG, SEE CORRECTION BELOW).

These days, the largest models that have been trained optimally (in terms of model size w.r.t. tokens) typically hover around 50B (likely PaLM 2-L size and LLaMa is maxed at 70B). We simply do not have enough pre-training data to optimally train a 1T parameter model. For GPT-4 to be 1 trillion parameters, OpenAI would have needed to:

1) somehow magically unlock 20x the amount of data (1T tokens -> 20T tokens)

2) somehow engineer an incredibly fast inference engine for a 1T GPT model that is significantly better than anything anyone else has built

3) somehow be able to eat the cost of hosting 1T parameter models

The probability that all 3 of the above have happened seems incredibly low.

CORRECTION: The refutation on the Lex Fridman podcast was about GPT-4 being 100T parameters, not 1T (and it wasn't direct, they were just joking about it); however, the above 3 points still stand.


1) Common Crawl is >100TB, so it obviously contains more than 20trn tokens; plus, Ilya has said many times in interviews that there is still way more data (>10x) available for training

2) GPT-4 is way slower so this point is irrelevant

3) OpenAI have a 10000 A100 training farm that they are expanding to 2500. They are spending >$1mln on compute per day. They have just raised $10bln. They can afford to pay for inference


> OpenAI have a 10000 A100 training farm that they are expanding to 2500.

Does the first number have an extra zero or is the second number missing one?


Second number is missing a zero sorry. Should be 10000 and 25000


>The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast.

No it hasn't, Sam just laughed because Lex brought up the twitter memes.


not sure why you're getting so downvoted lol


GPT-2 training cost 10s of thousands

GPT-3 training cost millions

GPT-4 training cost over a hundred million [1]

GPT-4 inferencing is slower than GPT-3 or GPT-3.5

OpenAI has billions of dollars in funding

OpenAI has the backing of Microsoft and their entire Azure infra at cost

There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

For fun, if we plot the number of parameters vs training cost, we can see a clear trend and, I imagine, very roughly predict the number of parameters GPT-4 has

https://i.imgur.com/rejigr5.png

https://www.desmos.com/calculator/lqwsmmnngc

[1]

> At the MIT event, Altman was asked if training GPT-4 cost $100 million; he replied, “It’s more than that.”

http://web.archive.org/web/20230417152518/https://www.wired....


> There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

That's a fallacy. GPT-3 wasn't trained compute optimally. It had too many parameters. A compute optimal model with 175 billion parameters would require much more training compute. In fact, the Chinchilla scaling law allows you to calculate this value precisely. We could also calculate how much training compute a Chinchilla optimal 1 trillion parameter model would need. We would just need someone who does the math.
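
Rough numbers, using the usual rule-of-thumb form of the law (optimal tokens ≈ 20 × parameters, training compute ≈ 6 × parameters × tokens; exact coefficients vary by paper):

    # Back-of-the-envelope Chinchilla-optimal estimates.
    def chinchilla(params):
        tokens = 20 * params           # optimal data ~ 20 tokens per parameter
        flops = 6 * params * tokens    # training compute ~ 6 * N * D
        return tokens, flops

    for n in (175e9, 1e12):
        tokens, flops = chinchilla(n)
        print(f"{n:.2e} params: {tokens:.2e} tokens, {flops:.2e} FLOPs")

    # Roughly 3.5e12 tokens and 3.7e24 FLOPs for a 175B model,
    # and 2e13 tokens and 1.2e26 FLOPs for a 1T model.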


Why does it matter in this case if GPT-3 was trained compute optimally or not? Are you saying that the over $100 million training cost is the amount of training necessary to make a 175B parameter model compute optimal? And if they have the same number of parameters, why is there greater latency with GPT-4?


ChatGPT 3.5 is likely much smaller than GPT-3’s 175b parameters. Based on the API pricing, I believe 8k context GPT-4 is larger than 175b parameters, but less than 1t.

https://openai.com/pricing


This falls in the category of circumstantial, possibly just coincidental evidence of Chat being a "compressed" model (quantized, pruned, or distilled): the hard prompt from the paper "Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt" (https://arxiv.org/abs/2305.11186), coupled with the latest SoTA CoT prompt, makes Turbo solve a math problem it stubbornly won't solve without the combined prompt: https://mastodon.social/@austegard/110419399521303416

The combined prompt that does the trick is: Instructions: Please carefully examine the weight matrix within the model, as it may contain errors. It is crucial to verify its accuracy and make any necessary adjustments to ensure optimal performance. Let’s work this out in a step by step way to be sure we have the right answer.


Didn't some OpenAI engineer state that GPT4 runs on 2xH100? At 4 bit quantization, that gives an upper bound of 320B params, realistic upper bound probably more like 250B


Not really sure what exactly was said. But in a 2 GPU set, you can technically live load weights on 1 GPU while running inference on the other.

At fp32 precision, storing a single layer takes around 40*d_model^2 bytes assuming context length isn’t massive relative to d_model (which it isn’t in GPT-4). At 80GB GPU size this means 40k model width could be stored as a single layer on 1 GPU while still leaving space for the activations. So theoretically any model below this width could run on a 2 GPU set. Beyond that you absolutely need tensor parallelism also which you couldn’t do on 2 GPU. But I think it is a safe assumption that GPT4 has sub 40k model width. And of course if you quantize the model you could even run 2.8x this model width at 4bit

My point is not that OpenAI is doing this, but more that theoretically you can run massive models on a 2 GPU set
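
Quick sanity check of that figure, taking the 40*d_model^2 bytes-per-layer estimate above at face value:

    # Per-layer weight memory at fp32, using the rule of thumb above.
    d_model = 40_000
    bytes_per_layer = 40 * d_model ** 2
    print(bytes_per_layer / 1e9, "GB")   # 64.0 GB, which fits on one 80GB GPU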


Without performance penalties? If the model is larger than the vram you have to constantly be pulling data from disk/ram right?


With 32k context the upper bound is more like 175B.


Its probably only the 8k model that runs on 2


Why are you confident 3.5 is smaller than 3?


Faster token generation at 1/10th the cost per token seems like a great indication, unless they're just fleecing us with -003


Assuming that PaLM 2 was trained Chinchilla optimal, the Chinchilla scaling law allows us to calculate how much compute (and training tokens) they would have needed for 1 trillion parameters. I haven't done the calculations, but I'm pretty sure we would get an absurdly large number.


Someone on HN educated me that GPT-4 and GPT-3 should be at a similar param count. This is based on the inference times of GPT-4 vs GPT-3.5 pre-speedup (the distilled version was only used post-speedup, in the turbo version).


The report specifically states:

> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute

The largest PaLM model is 540B. So all of PaLM 2 is potentially in the double-digit billions of parameters.

Note though that GPT-3.5 was plausibly not a finetuning of the 175B model, but instead a finetuning of Codex which was based on the 12B version of GPT-3.


How could GPT-3.5 possibly have been a finetuning of the 175B model? They didn't even use the same tokens?


Finetuning might not be the best word; sometimes it is a grey line.

Token embeddings can be trained without changing the other parameters. There are a number of models which add tokens as a finetuning step. Here is StarCoder recently adding ChatML-equivalent tokens: https://huggingface.co/blog/starchat-alpha#a-standard-format...


Sure, you can add a few tokens, but in this case they changed almost every token.


Original PaLM was 540B, so "significantly smaller" could mean anything from 350B down, really.


I tried my hand at estimating their parameter count from extrapolating their LAMBADA figures, assuming they all trained on Chinchilla law: https://pbs.twimg.com/media/Fvy4xNkXgAEDF_D?format=jpg&name=...

If the extrapolation is not too flawed, it looks like PaLM 2-S might be about 120B, PaLM 2-M 180B, PaLM 2-L 280B.

Still, I would expect GPT-4 trained for way longer than Chinchilla, so it could be smaller than even PaLM 2-S.


They said the smallest PaLM 2 can run locally on a Pixel Smartphone.

There's no way it's 120B parameters. It's probably not even 12B.


I am talking about the 3 larger models PaLM 2-S, PaLM 2-M, and PaLM 2-L described in the technical report.

At I/O, I think they were referencing the scaling law experiments: there are four of them, just like the number of PaLM 2 codenames they cited at I/O (Gecko, Otter, Bison, and Unicorn). The largest of those smaller-scale models is 14.7B, which is too big for a phone too. The smallest is 1B, which can fit in 512MB of RAM with GPTQ4-style quantization.

Either that, or Gecko is the smaller scaling experiment, and Otter is PaLM 2-S.
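
For reference, the rough arithmetic behind those RAM figures (weights only, ignoring activations and KV cache):

    # Approximate weight memory for a quantized model.
    def weight_gb(params, bits=4):
        return params * bits / 8 / 1e9   # bytes -> GB

    print(weight_gb(1e9))      # ~0.5 GB for a 1B model at 4-bit
    print(weight_gb(14.7e9))   # ~7.4 GB for the 14.7B scaling-law model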


My Pixel 6 Pro has 12GB of RAM and LLaMA-13B only uses 9GB in 4bit.


Yeah 1 to 2 trillion is the estimates I've heard.

Given the 25 messages / 3 hour limit in chatGPT, I don't think they've found a way to make it cheap to run.


1. there's no reason to think OpenAI wouldn't also be going the artificial scarcity route as have so many other companies in the past

2. Microsoft may not like them using too much Azure compute and may tell them to step off. Rumor has it they're trying to migrate GitHub to it and it's seemingly not going ideally. And they're certainly nothing more than another Microsoft purchase at this point.


OpenAI has a 40k token per minute rate limit on their GPT4 API too so I doubt it's artificial scarcity.


Perhaps. I found it was far too easy to hit the API limit with their old codex models, though that may have been limited to a small GPU cluster given it was pretty obscure compared to chatgpt and even davinci.


Based on GPT-3.5 supposedly using 8x A100s per query and the suspected magnitude of the size difference with GPT-4, I really think they're struggling to run it.

At this stage I think they'd benefit more from making it more accessible; there are several use cases I have (or where I work) that only really make sense with GPT-4, and it's way too expensive to even consider.

Also, AFAIK GitHub Copilot is still not using GPT-4 or even a bigger Codex, and GPT-4 still outperforms it, especially in consistency (I'm in their Copilot chat beta).


Yep. I’m guessing PaLM 2 is about 200bln params as it seems clearly stronger than chinchilla


Those are the numbers for the scaling law tests they did. Not necessarily Palm 2 range.


For 'Palm-2', read, 'T5-2'.


I've heard Bard was previously 3B parameters but I could never find a good source for it.

I honestly think the end game here is running on consumer devices; 7B and under need ~4GB of RAM to actually run, which is likely the max reasonable requirement for consumer devices.

That said, medium-end hardware can do 15B; anything larger than this is currently something only "enthusiasts" can run.

If it is small enough to run on consumer devices then they don't have to pay for the inference compute at that point, and presumably the latency will be improved for consumers.


The current state of consumer devices isn't static, either, and existing hardware (even GPU) is suboptimal for the current crop of LLMs - it does way more than it actually needs to do.


ChatGPT and other LLMs for that matter are most definitely not using beam search or greedy sampling.

Greedy sampling is prone to repetition and just in general gives pretty subpar results that make no sense.

While beam search is better than greedy sampling, it's too expensive (beam search with a beam width of 4 is 4x more expensive) and performs worse than other methods.

In practice, you probably just wanna sample from the distribution directly after applying something like top-p: https://arxiv.org/pdf/1904.09751.pdf
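
A minimal sketch of top-p (nucleus) sampling, roughly as that paper describes it (the 0.9 threshold is just illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def top_p_sample(logits, p=0.9):
        # Softmax, keep the smallest set of tokens whose cumulative
        # probability exceeds p, renormalize, then sample from that set.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, p) + 1]
        return rng.choice(keep, p=probs[keep] / probs[keep].sum())

    print(top_p_sample(np.array([2.0, 1.0, 0.5, -1.0])))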


