Suppose you, the human, are working on a clean-room implementation of a C compiler. How do you go about doing it? Will you need to know about a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?
It doesn't matter how you gain general knowledge of compiler techniques, as long as you don't have specific knowledge of the implementation of the compiler you are reverse engineering.
If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.
Claude was not reverse engineering here. By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.
No good companies for you, yet you bet on Chinese labs! Even if you have no moral problems at all with China's authoritarianism, Chinese companies are as morally trustworthy as American ones. That much is clear.
As it's often said: there is no such thing as a free product; you are the product. AI training is expensive, even for Chinese companies.
I expect that, to some degree, the Chinese models don't need immediate profits, because having them as a show of capability for the state already meets a goal? They're probably getting at least some support from the state.
> Even if you have no moral problems at all with China's authoritarianism
It's funny how you framed your sentence. Let's unpack it.
1. I didn't say Chinese companies are good; I said my hope is in open models, and only Chinese labs are doing well on that front
2. A Chinese company doesn't automatically mean it's about the regime. Maybe that's true in the US with the current admin; see how Meta, Google, and Microsoft immediately aligned themselves with it
3. Even when a company is associated with the Chinese regime, I don't remember the Chinese authoritarian regime kidnapping the head of another state, invading a bunch of countries in the Middle East, supporting states committing genocide and ethnic cleansing (Israel in Gaza, the UAE in Sudan, and many more small militant groups across Africa and the ME), or backing authoritarian regimes like Saudi Arabia.
If you ask me to rate them by evil level, I would give the US 80/100 and China 25/100: no invasions, no kidnapping of heads of state, no obvious terror acts, but an unfortunate situation with the Uyghurs.
What do you mean, "are you sure"? I literally saw, and still see, it happen in front of my eyes. I just now tested it with slight variations of "ideal temperature waterfowl cooking", "best temperature waterfowl roasting", etc., and all these questions yield different answers, with temperatures ranging from 47°C to 57°C (ignoring the 74°C food-safety ones).
That's my entire point. Even adding an "is" or "the" can get you way different advice. No human would give you different info when you ask "what's the waterfowl's best cooking temperature" vs "what is waterfowl's best roasting temperature".
Did you point that out to one of them… like “hey bro, I’ve asked y’all this question in multiple threads and get wildly different answers. Why?”
And the answer is probably that there is no such thing as an ideal temperature for waterfowl: the real answer is "it depends", and you didn't give it enough context to answer your question any better.
Context is everything. Give it poor prompts, you’ll get poor answers. LLMs are no different than programming a computer or anything else in this domain.
And learning how to give good context is a skill. One we all need to learn.
But that isn't how normal people interact with search engines. Which is the whole argument everyone here is making: that LLMs are now better 'correct answer generators' than search engines. They're not. My mother directly experienced that. Her food would have come out much better if she had completely ignored Gemini and checked a site.
One of the best things LLM providers could do (and no one seems to be doing) is let the model admit uncertainty. If the average probability of the tokens in a response drops below some threshold X, it should just say "I don't know, you should check a different source."
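Something like this minimal sketch, assuming the API exposes per-token log-probabilities (some do, via a logprobs option); the function name and the 0.6 threshold are purely illustrative, not a tuned recommendation:

    import math

    def answer_or_abstain(text, token_logprobs, threshold=0.6):
        """Return the model's answer only if its average per-token
        confidence clears an (arbitrary, illustrative) threshold;
        otherwise abstain.

        token_logprobs: log-probability of each generated token, as
        exposed by APIs that support a logprobs option (assumed here).
        """
        if not token_logprobs:
            return "I don't know, you should check a different source."
        # Mean log-probability, converted back to a geometric-mean probability.
        avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
        if avg_prob < threshold:
            return "I don't know, you should check a different source."
        return text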
At any rate, if my mother has to figure out some stunted 10-sentence multi-part question for the LLM to finally give a good, consistent answer, or can just type "best Indian restaurant in Brooklyn" (maybe even with "site:restaurantreviews.com"), which experience is superior?
> LLMs are no different than programming a computer or anything else in this domain.
Just feel like reiterating against this: virtually no one programs their search queries or "query engineers" a 10-sentence search query.
If I made a new, non-AI tool called 'correct answer provider' that gave definitive, incorrect answers, you'd call it bad software. But because it is AI, we're going to blame the user for not second-guessing the answers or for holding it wrong, i.e. bad prompting.
Which spec? Is there a spec that says if you use a particular set of libraries you'd get a less-than-10-millisecond response? You can't even know that for sure if you roll your own code, with no third-party libraries.
Bugs are, by definition, issues that arise when developers expect their code to do one thing but it does another, because of an unforeseen combination of factors. Yet we are all OK with that. That's why we accept AI code: it works well enough.
For OSes: POSIX, or the MSDN documentation for Windows.
Compiler bugs and OS bugs are extremely rare so we can rely on them to follow their spec.
AI bugs are very much expected even when the "spec" (the prompt) is correct, and since the prompt is written in imprecise human language, likely by people who are not used to writing precise specifications, the prompt is likely either mistaken or insufficiently specified.
> Is there a spec that says if you use a particular set of libraries you'd get a less-than-10-millisecond response?
There can be. But you'd have to map the library calls to opcodes and then count the cycles. That's what people do when they care about that particular optimization. They measure and make guarantees.
Assuming also that you are not running on top of an operating system, running in a VM with “noisy neighbors”…
I haven't counted cycles since programming assembly on a 65C02, where you could save a clock cycle by accessing memory in the first page of memory (the zero page): a two-byte LDA $02 instead of a three-byte LDA $0201.
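Out of curiosity, here's a toy sketch of what that kind of cycle counting looks like, using the published 6502/65C02 base cycle counts for those two addressing modes (the instruction sequence itself is made up purely for illustration):

    # Base cycle counts for a couple of 6502/65C02 instructions,
    # keyed by (mnemonic, addressing mode). Illustrative subset only.
    CYCLES = {
        ("LDA", "zeropage"): 3,   # LDA $02
        ("LDA", "absolute"): 4,   # LDA $0201
        ("STA", "zeropage"): 3,
        ("STA", "absolute"): 4,
    }

    def total_cycles(program):
        """Sum the base cycle cost of a straight-line instruction sequence."""
        return sum(CYCLES[instr] for instr in program)

    # Keeping a hot variable in the zero page saves a cycle (and a byte)
    # on every access:
    print(total_cycles([("LDA", "absolute"), ("STA", "absolute")]))  # 8
    print(total_cycles([("LDA", "zeropage"), ("STA", "zeropage")]))  # 6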
Huh? How can one possibly generalize whatever experience they have not only to one country but to "other countries", i.e. to the world? I've taken taxis in many countries, on all continents, and my experience has been that the drivers are generally helpful. There are scams and bad experiences, but those are the minority. That applies to any country, the US included.
1) The CEO said there was a JS engine, but it didn't work.
2) It didn't build when they published the blog post.
Therefore it lacks integrity! Except that it did build (I took Simon's word for it), and building a browser is beside the point; there are a few other big projects listed (Java LSP, Windows 7 emulator, Excel, etc.)
The blog stated:
"Our goal is to understand how far we can push the frontier of agentic coding for projects that typically take human teams months to complete.
This post describes what we've learned from running hundreds of concurrent agents on a single project, coordinating their work, and watching them write over a million lines of code and trillions of tokens."
They didn't set the goal of building a browser. It's an experiment about coordinating AI agents in the context of a complex software project, yet you complain that they exaggerated about a JS engine?
The blog post itself is one of the first to describe a large-scale experiment with agents: what works, what doesn't. There is very little hype. They didn't say it's game-changing or that Cursor is the best AI tool.
> The blog post itself is one of the first to describe a large-scale experiment with agents: what works, what doesn't. There is very little hype. They didn't say it's game-changing or that Cursor is the best AI tool.
To test this system, we pointed it at an ambitious goal: building a web browser from scratch. The agents ran for close to a week, writing over 1 million lines of code across 1,000 files [...]
Despite the codebase size, new agents can still understand it and make meaningful progress. Hundreds of workers run concurrently, pushing to the same branch with minimal conflicts.
The point is that the agents can comprehend the huge amount of code generated and continue to meaningfully contribute to the goal of the project. We didn't know if that was possible. They wanted to find out. Now we have a data point.
Also, a popular opinion in any vibecoding discussion is that AI can help, but only on greenfield, toy, personal projects. This experiment shows that AI agents can work together on a very complex codebase with ambitious goals. It looks like there was one human plus 2,000 agents, over two months. How much progress do you think a project with 2,000 engineers would make in its first two months?
> What matters is whether those lines build, function as expected (especially in edge cases) and perform decently. As far as I can tell, AI has not been demonstrated to be useful yet at those three things.
They did build. You can give it a try. They did function as expected. How many edge cases would you like it to pass? Perform decently? How could you tell if you didn't try?
Yes of course. These are calculators - they are meant to reliably calculate things.
I think the difference is that building 60 interactive calculators manually would force you to do a lot of manual testing. If someone built that many interactive calculators by hand, I would imagine a lot of attention had gone into each one. Why would they spend so much time on something and not test it?
Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.