SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and it costs a lot of money to run continuously.
Most "live" benchmarks aren't run often enough against recent models to give you a good picture of which models win.
The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.
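A minimal sketch of where such tasks could come from, assuming the public GitHub search API and its `linked:pr` qualifier; the cutoff date and fields below are placeholders, not an actual benchmark pipeline:

```python
# Sketch: harvest recently closed issues that were resolved via a PR,
# as candidate tasks for a "live" SWE benchmark. Assumes the public
# GitHub search API; query qualifiers and the date are illustrative.
import requests

def recent_resolved_issues(since="2024-06-01", per_page=50):
    query = f"is:issue is:closed linked:pr closed:>{since}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": per_page, "sort": "updated"},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {"repo": item["repository_url"], "title": item["title"], "url": item["html_url"]}
        for item in resp.json()["items"]
    ]

if __name__ == "__main__":
    for issue in recent_resolved_issues()[:10]:
        print(issue["repo"], "-", issue["title"])
```

Each hit would presumably still need heavy filtering (tests, reproducibility, licensing) before it could become a benchmark task.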
I'm thinking about the same things and landed on Rust. I think we're at a very critical point in software development and would love to chat with you and share/learn ideas. Please let me know if you're interested.
I am planning to add similar concepts to Yek. Either tree-sitter or ast-grep. Your work here and Aider's work would be my guiding prior art. Thank you for sharing!
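Not Yek's actual design; just a rough sketch of what the tree-sitter side could look like, assuming recent py-tree-sitter bindings plus the tree_sitter_python grammar package (function/class outlining in the spirit of Aider's repo map):

```python
# Rough sketch: parse a Python file with tree-sitter and pull out
# top-level function/class names, the kind of skeleton a repo-map
# style tool might feed to a model. Assumes the py-tree-sitter
# bindings and the tree_sitter_python grammar wheel; not Yek's code.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def outline(source: bytes):
    tree = parser.parse(source)
    names = []
    for node in tree.root_node.children:
        if node.type in ("function_definition", "class_definition"):
            name_node = node.child_by_field_name("name")
            if name_node is not None:
                names.append((node.type, name_node.text.decode()))
    return names

print(outline(b"def foo():\n    pass\n\nclass Bar:\n    pass\n"))
```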
Hah. "If it's not too much trouble, would you mind if we disable the rimraf root feature?"
Gotta bully that thing man. There's probably room in the market for a local tool that strips the superfluous niceties from instructions. Probably gonna save a material amount of tokens in aggregate.
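Tongue in cheek, but the idea is simple enough to sketch; the phrase list below is invented purely for illustration:

```python
# Toy sketch of the "strip the niceties" idea: delete polite filler
# before a prompt is sent so fewer tokens are spent on pleasantries.
# The phrase list is made up; a real tool would need a far better one.
import re

FILLER = [
    r"if it'?s not too much trouble,?\s*",
    r"would you mind\s+",
    r"please\s+",
    r"thank you( so much)?[.!]?\s*",
    r"when you get a chance,?\s*",
]

def strip_niceties(prompt: str) -> str:
    for pattern in FILLER:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return prompt.strip()

print(strip_niceties(
    "If it's not too much trouble, would you mind if we disable the rimraf root feature?"
))
# -> "if we disable the rimraf root feature?"
```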
I am thinking about this a lot right now. Pretty existential stuff.
I think builders are gonna be fine. The type of programmer that people would put up with just because they could really go into their cave for a few days and come out with a bug fix that nobody else on the team could figure out is going to have a hard time.
Interestingly, AI coding is really good at that sort of thing and less good at fully grasping user requirements or big-picture systems. Basically the things we had to sit in a lot of meetings for.
This has been my experience too. That insane race condition inside the language runtime that is completely inscrutable? Claude one-shots it. Ask it to work on that same logic to add features and it will happily introduce race conditions that are obvious to an engineer but that a local test will never uncover.
I'm not convinced. That sort of thing usually depends on some very specific arcana or a weird interaction between systems that isn't in the code. It usually requires either external knowledge or deep investigation and compiling evidence from multiple sources. I haven't seen AI do much of that.
Look at recent examples with browsers and Matrix servers. AI can't even follow extremely detailed specs with extensive test suites.
If anything, nice and friendly but mediocre devs are in more immediate danger than rough but extremely competent devs.
But we've seen C-suites shed institutional knowledge at the drop of a hat for decades, so who knows? Maybe knowledge and skill aren't that valued.
> The type of programmer that people would put up with just because they could really go into their cave for a few days and come out with a bug fix that nobody else on the team could figure out is going to have a hard time.
Meetings hardly get anywhere. Most of the details are eventually figured out by developers while interacting with the code. If every idea from the PMs were implemented in the software, it would turn into bloatware before even reaching the MVP stage.
Not really, in my experience you still have to be good at solving problems to use it effectively. Claude (and other AI) can help folks find a "fix", but a lot of times it's a band-aid if the user doesn't understand how to debug / solve things themselves.
So the type of programmers you're talking about, who could solve complex problems, are actually just enhanced by it.
> The type of programmer that people would put up with just because they could really go into their cave for a few days and come out with a bug fix that nobody else on the team could figure out is going to have a hard time.
This is the exact type of programmer that isn't going to have any issues - the ones who actually know what they're doing and aren't just going to vibecode React slop.
GLM is OK (I haven't used it heavily, but it seems alright so far). It's a bit slow on ZAI's coding plan and amazingly fast on Cerebras, but Cerebras's coding plan is sold out.