Hacker News | CuriouslyC's comments

#1 isn't going to happen because we're actually data limited, not compute limited. You can throw all the compute in the world at bad data and it won't make a difference, but an undertrained model with perfect training data will absolutely slay.

#2 isn't going to happen, because these labs have shown they have limited app/design sense, and they also lack the industry connections and domain wisdom to execute.

The way things are actually going to go is that these labs will set up partnerships with huge biotech/engineering/etc firms, and do custom training/inference on specific tasks that promise to be wildly profitable with them, then take royalties on the creation in perpetuity. Why sell inference when you can partner with Pfizer to make a version of Ozempic that also makes people freaky jacked, or partner with Bechtel to make a radically safer, more efficient nuclear power plant?


I don't think "data limited" is true anymore outside of very specialized cases (for instance: https://arxiv.org/abs/2510.01631). As weird as it sounds, training improves a lot with synthetic data.

You do need business development to create those relationships. Saying they "have limited ___" mostly means they "haven't yet hired people who are good at ___". That's been changing already; the Claude app is steadily improving and handling more use cases simply through understanding which tools to use, Anthropic is building more relationships to create more tools, and all the frontier model companies are building relationships with companies that have specialized data and want specialized solutions.

I think we're also seeing the frontier model companies offer partners their own ability to run RL on their own data, and then retrain new models on the same data. That's going to make those relationships VERY sticky in ways that won't be obvious from the outside.


Can you point me to the parts of that paper that support those claims? I am reading something different and want to know what I am missing.

This study seems to show that there are places where synthetic data is useful, especially related to Common Crawl.

> Pure synthetic data remains non-advantageous over CC; notably, models trained on pure rephrased synthetic data will underperform those trained on CC at larger models.

But the tradeoffs seem to be different at large scale.

> Overall, these model scaling results suggest synthetic data appears comparably less favorable for pre-training larger LMs relative to its utility in data scaling scenarios. Despite outperforming training on CC, larger models are not as tolerant to a higher ratio synthetic data as larger data budgets. This observation aligns with practices where synthetic data is effective for smaller LMs or specific pre-training phases, but less predominantly used for the largest models.

The way I read it, there are places where it is useful:

> Notably, any mixture involving synthetic data, or pure synthetic data (except pure QA), is projected to achieve a lower irreducible loss than training only on CommonCrawl.

But it also seems that on textbook-style synthetic data, they did show model collapse, unlike with rephrased data.

> These results contribute mixed evidence on “model collapse" during large-scale single-round (n=1) model training on synthetic data–training on rephrased synthetic data shows no degradation in performance in foreseeable scales whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by “model collapse".

IMHO there are some very specific areas where we aren't "data limited", like math, but as your reference states: "Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance."

Note the cost when 30% of the total dataset is synthetic: the model starts amplifying the generator's biases, which leads to a permanent degradation in downstream zero-shot capability on unseen out-of-domain natural tasks.

My takeaway is that there is nuance in where synthetic data is an amplifier and where it is a problem, and in my mind that paper demonstrates it will not solve the data problem in general.


> we're actually data limited

Correction: public text data limited.

There's a ridiculous amount of proprietary text and non-text data out there that much of society is run on.


What counts as 'bad data' and 'perfect data' according to you?

Worst possible bad data is where the data is orthogonal to the task, so increasing the data never provides information on the task. Perfect data is where the data exactly encapsulates the task being trained.
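To make the "orthogonal" case concrete, here is a toy sketch in plain Python. Everything in it is an illustrative assumption, not something from the thread: the task is a no-intercept least-squares fit, the "orthogonal" feature is built so its dot product with the target is exactly zero, and the "perfect" feature is simply the target itself. The names `fit_slope`, `mse` and `make_dataset` are made up for the example.

```python
def fit_slope(xs, ys):
    """Least-squares slope w for the no-intercept model y ~ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mse(xs, ys, w):
    """Mean squared error of the prediction w * x against y."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(ys)


def make_dataset(n_blocks):
    """Toy data: a target plus one 'orthogonal' and one 'perfect' feature."""
    # Within each block of four samples the orthogonal feature's dot
    # product with the target is 1 - 1 - 1 + 1 = 0 (assumed construction).
    target = [1.0, -1.0, 1.0, -1.0] * n_blocks
    orthogonal = [1.0, 1.0, -1.0, -1.0] * n_blocks
    perfect = list(target)  # the feature encodes the task exactly
    return target, orthogonal, perfect


for n_blocks in (1, 100, 10_000):
    target, orthogonal, perfect = make_dataset(n_blocks)
    w_orth = fit_slope(orthogonal, target)
    w_perf = fit_slope(perfect, target)
    # Columns: sample count, error with orthogonal data, error with perfect data.
    print(len(target), mse(orthogonal, target, w_orth), mse(perfect, target, w_perf))
```

In this toy setup the error on the orthogonal feature stays at 1.0 no matter how many samples are added, while the perfect feature reaches zero error with only four samples. Real training data sits somewhere between those two extremes.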

The government is going to ban foreign models and foreign inference providers, without question. The US govt is going to dig its dirty little fingers into OAI/Anthropic/Oracle/(probably)SpaceX and end up taking some stock for a sovereign wealth fund (probably timed to prop up flagging share prices, and with the promise of sweet government grift down the line), and at that point the bans will be framed as protecting that investment.

The AI writing perspective is ok; it at least feels honest. It doesn't sell me on reading the article as well as a human intro would, though, so I wouldn't go all in that way going forward. I like the idea of an article being a human/AI dialog, echoing what you might expect if Picard were giving a talk where he pulled Data in to cover details.

The issue wasn't artistic blindness; the id art was solid. The issue was a lack of game design. Doom worked because it was crazy fast and you could have a lot of sprites on the screen at once, so it had this crazy hectic quality. Quake's 3D engine meant you couldn't be as fast and couldn't have anywhere close to the same number of enemies on screen at once, so the game wanted a more soulslike design. Instead they stuck with the run-and-gun design but with spongier enemies, which just didn't feel great.

Only if the primary school is exclusively for the children of oligarchs.

The training costs are very likely the reason. Dario has talked about how each individual model is profitable, but how the expenditure training the next generation of models makes it look like they're not profitable at any given moment in time, and I believe he's being honest about that.

The market for open weight model hosting gives you an idea of the profitable price floor, and it's pretty clear there's markup baked into OAI/Anthropic's APIs.


The whole hidden plateau hypothesis is kinda bunk, because we're already pretty far in a plateau for general knowledge/question answering, but there are many subdomains where we can push model capabilities, and as we saturate one subdomain we can just shift to another economically valuable one.

There isn't one AI intelligence S curve, there are thousands of them, and they're mostly invisible in the major benchmarks, but for someone trying to do work in that specific area of capability, the progress is transformative.


I'm skeptical of a hidden plateau, but I really think it's overconfident to assume there's not one. Remember that it doesn't even have to be a technical plateau; the effective plateau of e.g. car speeds is determined by regulations and road conditions, and far below what "frontier cars" are capable of on a controlled racetrack.

100%. There will be strict quotas on the expensive models and day to day work will be done on the cheap models that are "good enough" with escalation to the metered models when the cheaper options are spinning their wheels. Eventually the US frontier lab APIs will only get the most heavily triaged work that multiple tiers of cheaper Chinese open weight models have failed on.

And of course the C-suite will have unlimited access to Mythos tier models, which they'll use to summarize reports, while passing down mandates to rank and file to increase usage of less expensive models.
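The escalation pattern described above could be sketched roughly like this. It is only an illustration under stated assumptions: the tier names, the cost units, the budget check, and the stub solvers (which "give up" based on task length) are all hypothetical stand-ins, not any provider's real API or pricing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tier:
    name: str                              # hypothetical tier label
    cost: int                              # made-up cost units per attempt
    solve: Callable[[str], Optional[str]]  # returns None when the tier gives up


def escalate(task: str, tiers: List[Tier], budget: int) -> Tuple[Optional[str], Optional[str], int]:
    """Try tiers from cheapest to most expensive, stopping at the quota."""
    spent = 0
    for tier in tiers:
        if spent + tier.cost > budget:
            break  # quota would be exceeded, so stop escalating
        spent += tier.cost
        answer = tier.solve(task)
        if answer is not None:
            return tier.name, answer, spent
    return None, None, spent


def stub(max_len: int) -> Callable[[str], Optional[str]]:
    """Fake solver: succeeds only on tasks up to max_len characters."""
    return lambda task: f"done: {task}" if len(task) <= max_len else None


TIERS = [
    Tier("cheap-open-weights", 1, stub(10)),
    Tier("mid-open-weights", 10, stub(30)),
    Tier("metered-frontier", 100, stub(10_000)),
]

print(escalate("fix typo", TIERS, budget=200))
print(escalate("refactor the billing module", TIERS, budget=200))
print(escalate("x" * 40, TIERS, budget=50))
```

With these stubs, the short task is handled by the cheapest tier, the medium one escalates once, and the long task under a budget of 50 never reaches the metered tier at all, which is the "strict quota" case.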


Honestly operator overloading is almost always a bad choice IMO. There are cases (e.g. matrix math) where it's the right choice because of semantic clarity, but the indirection it creates in exchange for readability is costly. It's always obvious when a function is being called, and how to navigate to the implementation, but the same is not true of overloaded operators.

Well, it's like most tools. For this particular use, does it give more than it takes? If so, use it.

Matrix multiplication? Yeah, everybody knows there's a function being called. And, if it was implemented right, the users almost never have to look at the implementation.

Many other possible uses? Nope. Just nope. Not worth it.
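For the matrix case, a minimal Python sketch of the tradeoff might look like the following. The `Matrix` class here is a hypothetical toy, not any library's type; the point is only that the `@` operator is thin sugar over a named method that is easy to find and navigate to.

```python
class Matrix:
    """Toy dense matrix stored as a list of rows (illustrative only)."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def matmul(self, other):
        """Explicit, named matrix product: easy to grep for and jump to."""
        if len(self.rows[0]) != len(other.rows):
            raise ValueError("inner dimensions must match")
        cols = list(zip(*other.rows))  # columns of the right-hand matrix
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in cols]
             for row in self.rows]
        )

    def __matmul__(self, other):
        """Overloaded `@`: same behavior, delegates to the named method."""
        return self.matmul(other)

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows


a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])
identity = Matrix([[1, 0], [0, 1]])

explicit = a.matmul(b).matmul(identity)  # function-call style
infix = a @ b @ identity                 # operator style, reads like the math
print(explicit.rows, infix.rows)
```

Both forms produce the same result; the operator version wins on readability for chained products, which is the narrow case where the indirection seems worth paying for.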


Is the selling point for this vs e.g. Plotly just the ggplot style semantics?

The selling point is the grammar of graphics. See Wilkinson on the subject.
