Hacker News

> It's worth pointing out that on their eval set for "issues resolved" they are getting 13.86%. While visually this looks impressive compared to the others, anything that only really works 13.86% of the time, when the verification of the work takes nearly as much time as the work would have anyway, isn't useful.

Yeah, I remember speech recognition taking decades to improve, and being more of a novelty - not useful at all - even when it was at 95% accuracy (1 word in 20 wrong). It really had to get almost perfect before it was a time saver.

As far as coding goes, it'd be faster to write it yourself and get it right the first time than to have an LLM write something you can't trust and still have to check yourself.



You can't compare the accuracy of speech recognition to LLM task completion rates. A nearly-there yet incomplete solution to a Github issue is still valuable to an engineer who knows how to debug it.


Sure, and no doubt people paying for speech recognition 25 years ago were finding uses for it too. It depends on your use case.

A 13% success rate is both wildly impressive and also WAY below the level where I would personally find something like this useful. I can't even see reaching for a tool that I knew would fail 90% of the time, unless I was desperate and out of ideas.


I disagree. I think about this a bit as having a developer intern, on whom I can't rely to take much of a workload, and definitely nothing on the critical path, but I could say to them "Take a look at these particular well-defined tasks on the backlog and see which ones you could make some progress on" - I feel there's good value in that.

And the nice thing about an AI here is that I think it will actually find a different subset of these tasks to be easy than a human would.


Yeah, but a developer intern already has human-level AGI to support the on-the-job developer training you're going to help give them. Any LLM available today, or probably in the next 5-10 years for that matter, has neither AGI nor the ability to learn on the job.

My experience of working with interns, or low-skill developers, is that the benefit normally flows one way. You are taking time out from completing the project to help them learn. Someone/something of low capability isn't going to be relieving you of the large or complex tasks that would actually be useful, and be a time saver - they are going to try to do the small/simple tasks you could have breezed through, and suck up a lot of your time having to find out and explain to them how they messed up. Of course Devin doesn't even have online learning, so he'd be making the same mistakes over and over.


> A nearly-there yet incomplete solution to a Github issue is still valuable to an engineer who knows how to debug it.

Not sure I can agree. There would definitely be value in looking at what libraries the solution uses, but otherwise it may be easier to write it oneself, especially when the mistakes are not humanlike.


I can see this being useful already (assuming context length is not an issue) as some sort of GitHub service trying to solve issues throughout the day.

Or, for example, if you commit TODOs in your code, the AI will pick up on them and give you some options later on.

If the success rate is 14%, just let it try a bunch of times. (half joking here)
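Half joking or not, the retry math is worth a look: if each attempt succeeds independently with probability around 0.14 (the "issues resolved" rate quoted above), the chance of at least one success climbs quickly with retries. A quick sketch, assuming attempts are independent - which is a big assumption, since a model tends to fail the same issues the same way:

```python
def p_any_success(p: float, n: int) -> float:
    """Probability of at least one success in n independent attempts,
    each succeeding with probability p."""
    return 1 - (1 - p) ** n

p = 0.1386  # per-attempt "issues resolved" rate from the eval
for n in (1, 5, 10, 20):
    print(f"{n:>2} attempts: {p_any_success(p, n):.1%}")
```

Under the independence assumption, roughly ten attempts would push the odds of at least one success above 75% - though in practice correlated failures would make the real number much worse.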

The way I see it, at least the project issues are getting some attention, which is arguably better than no attention. If it can fix just the simple things, at least you can focus on the complex things and not worry about postponing the low-hanging fruit.


What about "You're Holding It Wrong"?

I.e.: maybe rather than throwing it at finding problems, also use it to just build things from scratch. As an example:

The side hustle: make that a product:

Let someone like me who is not a coder have access to Devin for a month with the only goal of building a side hustle that brings a solo person a monthly income.

Then sell that, so that the millions of people who have a solo idea and just need that "technical co-founder" can use it to build. And limit it to one Devin instance per person to start...

I don't want it to do it in one fell swoop - I'd like to say "build this module...

--

Have a contest: what can you build with Devin in TOPIC in 30 minutes?


I could perhaps see more value, at this level of capability, in writing test cases, where the project is set up in a way that lets them be run and give feedback.

This would be useful where test coverage is incomplete, maybe for auto-discovering or confirming bugs, and would really be needed if it's trying to fix bugs itself. Especially if one dared let it commit bug fixes, you'd want to know that the fix worked and didn't break anything else (i.e. run the other test cases too, as a regression test).


Even now, automatic speech recognition is a big timesaver, but you _need_ a human to look through the transcript to pick out the obviously wrong stuff, let alone the stuff that's wrong but could be right in context.



