> I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.
The trick with Mythos wasn't that it didn't hallucinate nonsense vulnerabilities; it absolutely did. It was able to verify that some were real, though, by testing them.
The question is whether smaller models can verify and test the vulnerabilities too, and whether it can be done more cheaply than these Mythos experiments.
People often undervalue scaffolding. I was looking at a bug yesterday, reported by a tester. He has access to Opus through Amazon Q, but he's looking through a single repo. It provided some useful information, but the scaffolding wasn't good enough.
I took its preliminary findings into Claude Code with the same model. But in mine it knows where every adjacent system is, the entire git history, the deployment history, and the state of the feature flags. So instead of pointing at a vague problem, it knew which flag had been flipped in a different service, could see how that changed behavior, how flipping the flag in prod would make the service under test cry, and which code change would make sure it works both ways.
It's not as if a modern Opus is a small model, but the difference here wasn't the model: just a stronger scaffold, along with more CLI tools available in the context.
The issue here in the security testing is knowing exactly what was visible, and how much the model failed, because it makes a huge difference. A middling chess player can find amazing combinations at good speed when playing puzzle rush: you are handed a position where you know a decisive combination exists, and that it works. The same combination, however, might be really hard to find over the board, because in a typical chess game it's rare for those combinations to exist, and thoroughly checking for them, calculating all the way through every possibility, costs real energy. This is why chess grandmasters would consider just being able to see the computer score for a position to be massive cheating: just knowing that the last move was a blunder would be a decisive advantage.
When we ask a cheap model to look for a vulnerability and hand it the right context to actually find it, we are already priming it, versus asking it to find one when there may be nothing there.
Calling it “expert orchestration” is misleading when they were pointing it at the vulnerable functions and giving it hints about what to look for, because they already knew the vulnerability.
You know for loops exist and you can run opencode against any section of code with just a small amount of templating, right? There's nothing stopping you from writing a harness that does what you're saying.
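A minimal sketch of that harness idea: template a prompt per file and hand each one to an agent CLI. The `opencode run <prompt>` invocation and the prompt wording are assumptions here, not a documented interface; adapt them to whatever agent CLI you actually use.

```javascript
// Sketch of a "for loop" harness: build one templated audit command per
// source file. The `opencode run` invocation is an assumption; swap in
// your agent CLI's real interface.

// Hypothetical per-file prompt template.
const auditPrompt = (file) =>
  `Audit ${file} for vulnerabilities; only report findings you can reproduce.`;

// Build one [command, args] pair per matching file.
function buildCommands(files) {
  return files
    .filter((f) => f.endsWith('.js'))
    .map((f) => ['opencode', ['run', auditPrompt(f)]]);
}

// Wiring, commented out so the sketch runs without the CLI installed:
// const { execFileSync } = require('node:child_process');
// for (const [cmd, args] of buildCommands(['src/auth.js'])) {
//   execFileSync(cmd, args, { stdio: 'inherit' });
// }

console.log(buildCommands(['src/auth.js', 'README.md']));
```

The interesting part is the prompt template, not the loop: that's where you inject the "right context" the thread is arguing about.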
The argument against rejecting on cancellation seems like a stretch to me. It's completely fine to view cancellation as an error condition: it allows you to recover from a cancellation if you want (swallow the error with a catch) or to propagate it.
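A small sketch of cancellation-as-error using the real `AbortController`/`AbortError` pattern (the `sleep` helper is illustrative, not a built-in): the caller decides in the catch whether to swallow the cancellation or rethrow.

```javascript
// A cancellable sleep that rejects with an AbortError when the signal
// fires, i.e. cancellation modeled as an error condition.
function sleep(ms, signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener(
      'abort',
      () => {
        clearTimeout(timer);
        reject(new DOMException('Aborted', 'AbortError'));
      },
      { once: true }
    );
  });
}

async function run() {
  const controller = new AbortController();
  const pending = sleep(1000, controller.signal);
  controller.abort(); // cancel the pending operation

  try {
    await pending;
    return 'completed';
  } catch (err) {
    if (err.name === 'AbortError') return 'cancelled'; // recover: swallow it
    throw err; // propagate genuine failures
  }
}

run().then(console.log); // prints "cancelled"
```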
Hard disagree, TC39 has done great work over the last 10 years. To name a few:
- Async/await
- Rest/spread
- Async iterators
- WeakRefs
- Explicit Resource Management
- Temporal
Its decisions are much more well thought out than WHATWG standards. AbortSignal extending EventTarget was a terrible call.
More good work from the last 10 years includes .at(), optional chaining and nullish coalescing, BigInt, etc.
But most of what you mentioned is closing in on 10 years in the standard (async/await landed in 2017), meaning the bulk of the work was done 10 or more years ago.
The failure of AbortSignal is exactly the kind of failure TC39 has been producing in bulk lately. I have been following the proposal to add Observables to the language, which is a stage 1 proposal (and has been for over 10 years!!!). There were talks 5 years ago (!) about aligning the API with AbortSignal[1], which I think really exemplifies the inability of TC39 to reach a workable decision (at least as it operates now).
Another example I like to bring up is the failure of the pipeline operator[2], which was advanced to stage 2 four years ago and has been on hiatus ever since, with very little work to show for it. After years of deliberation they advanced a very controversial version of the operator, to massive community backlash. Before they advanced it, it was one of the more popular proposals; now, not so much, and personally I sense any enthusiasm for the feature has pretty much vanished. In other words, I think they took half a decade to make the obviously wrong decision, and have since given up.
From the failure of the pipeline operator followed a bunch of half-measures, such as array grouping, iterator helpers, etc., which could easily have been implemented in userland libraries if the more functional version of the pipeline operator had advanced.
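To illustrate the userland claim: with a functional pipeline, or even a tiny `pipe` helper, grouping composes fine without a new built-in. `pipe` and `groupBy` here are illustrative names, not standard library APIs.

```javascript
// Userland sketch: a tiny pipe helper plus a curried groupBy, standing in
// for what a functional pipeline operator would have made ergonomic.
const pipe = (value, ...fns) => fns.reduce((v, f) => f(v), value);

const groupBy = (keyFn) => (arr) =>
  arr.reduce((acc, item) => {
    const key = keyFn(item);
    (acc[key] ??= []).push(item);
    return acc;
  }, {});

const grouped = pipe(
  [1, 2, 3, 4, 5],
  groupBy((n) => (n % 2 === 0 ? 'even' : 'odd'))
);

console.log(grouped); // { odd: [ 1, 3, 5 ], even: [ 2, 4 ] }
```

With an F#-style `|>` this would read `[1, 2, 3, 4, 5] |> groupBy(...)`, which is exactly why the half-measure built-ins feel redundant.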
The goal of web hosting is to provide low latency and wide availability to many users.
AI in this context has a very different goal as a tool for individual users.
You wouldn't say that hosting instances of Photoshop on servers and charging for usage is a long-term viable business, would you? Even if current consumer computers struggled to run Photoshop.
I don't see an issue with the comparison, I don't think it is meant to be a 1 to 1 or anything, just an illustration of how consumers are overwhelmingly lazy.
I'd take issue with the statement that it is for the paranoid, but I guess that might be a defense mechanism, because of course I am interested in local models. If my new workflow is going to be dependent on three companies, I'd prefer there to be a light at the end of the tunnel that breaks us free.
It's interesting to see so many people agree with this perspective when it comes to articles yet disagree when it comes to writing software.
Perhaps it's some form of Gell-Mann amnesia: people are better at recognizing good articles than they are at recognizing good software. Combine that with the vibe-coding habit of never actually reading the source, and thus never recognizing how bad it is.
We need to let the AI as a service businesses fail.