it is very hard for me to take seriously any system that hasn't been proven shipping production code in complex codebases that have been around for a while.
I've been down the "don't read the code" path and I can say it leads nowhere good.
I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in a large 10-year-old codebase".
I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.
#1 rejection reason: missing context. 80% needed human fixes. Agents can write code fine. They just don't know what "done" looks like in your codebase.
Count successful merges into repos with real history instead of LOC, and you'll find the hard part is specification, not execution.
I personally care deeply when content intended as communication is AI generated (much more so than if code is generated).
On the surface level, I find it a bit disrespectful when I'm communicating with someone who's just using an LLM to generate their responses. Imagine if you were talking to someone in person and they pulled out a phone, generated a response, then read it back out to you?
On a deeper level, if someone's generated a bunch of text and clearly hasn't devoted the time to generating/editing it that they're expecting me to invest while reading it, I'm just not going to read it.
Yes, it's very obviously written by AI and made me immediately close the tab. Not gonna read a self-promotional piece written by an LLM that someone probably only gave a one-sentence prompt: "merge these ideas".
I don't think anyone serious would recommend it for serious production systems. I respect the Ralph technique as a fascinating learning exercise in understanding LLM context windows and how to squeeze more performance (read: quality) from today's models.
Even if the ceiling remains low in absolute terms, it's interesting to see the degree to which good context engineering raises it.
How is it a "fascinating learning exercise" when the intention is to run the model in a closed loop with zero transparency? Running a black box in a black box to learn? What signals are you even listening to in order to determine whether your context engineering is good or whether the quality has improved, aside from a brief glimpse at the final product? So essentially every time I want to test a prompt I waste $100 on Claude and have it build an entire project for me?
I'm all for AI, and it's evident that the future of AI is more transparency (MLOps, tracing, mech interp, AI safety), not less.
there is the theoretical "how the world should be" and there is the practical "what's working today" - decry the latter and wait around for the former at your peril
I read it. I agree this is out of touch. Not because the things it's saying are wrong, but because the things it's saying have been true for almost a year now. They are not "getting worse"; they "have been bad". I am staggered to find this article qualifies as "news".
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that's been deep-fried with RL to always succeed but has subtle problems that someone will LGTM :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures - like, you'd just skim some code because you can't believe anyone could do it wrong, and it turns out they did.
Well, for some reason it doesn't let me respond to the child comments :(
The problem (which should be obvious) is that with a and b real you can't construct an exhaustive input/output set. A test case can only prove the presence of a bug, not its absence.
Another category of problems that you can't just test for and instead have to prove is concurrency problems.
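As a rough illustration (a hypothetical Python sketch, names invented, not from the article): a racy counter whose unit test will usually pass even though the bug is always there.

    import threading

    class Counter:
        def __init__(self):
            self.value = 0

        def increment(self):
            # read-modify-write with no lock: a textbook data race
            current = self.value
            self.value = current + 1

    def test_increment_from_many_threads():
        c = Counter()
        threads = [threading.Thread(target=c.increment) for _ in range(100)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # usually green: losing an update takes an unlucky interleaving,
        # so a passing run says nothing about the interleavings you didn't hit
        assert c.value == 100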
Of course you can. You can write test cases for anything.
Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
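A minimal sketch of that add_numbers case (hypothetical code, assuming pytest; only the function name comes from the comment above):

    import pytest

    def add_numbers(a, b):
        # guard the loosely-typed footgun: "1" + "2" would happily return "12"
        if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
            raise TypeError("add_numbers expects numeric arguments")
        return a + b

    def test_adds():
        assert add_numbers(2, 3) == 5

    def test_rejects_strings():
        with pytest.raises(TypeError):
            add_numbers("2", "3")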
engineers always want to rewrite from scratch and it never works.
a tale as old as time - at my second job out of college, back in like 2016, I landed at the tail end of a 3-month feature-freeze refactor project. It was pitched to the CEO as 1 month, sprawled out to 3 months, and still wasn't finished. Non-technical teams were pissed, technical teams were exhausted, all hope was lost. We ended up cutting a bunch of scope and slopping out a bunch of bugs anyway.
I had the privilege of working with some incredible eng leaders at my previous gig - they were very good at working both upwards and downwards to execute against the "50/50" rule: half of any given sprint's work is focused on new features, and half is focused on bug fixes, chores, and things that improve team velocity.
I think the key here is the "if X then Y" syntax - it seems to be quite effective at piercing through the "probably ignore this" system message by highlighting WHEN a given instruction is "highly relevant".
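For example, something like this in the agent instructions (wording invented for illustration):

    If you touch anything under src/billing/, then run the contract tests before committing.
    If a change needs a schema migration, then generate it and include it in the same change.

versus a blanket "always run the contract tests" that just sits in the pile of rules the model half-ignores.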