
The agent specifies which function, class, or method to replace, along with its full source. It's more costly, but I believe it leads to fewer hallucinations, since the model generates one coherent piece of code.

But it requires parsing the AST and language-specific handling, and things like metaprogramming or macros can cause some hairy confusion.

None of these factors hurts my use case.
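The whole-definition approach can be sketched roughly as follows: parse the file, find the definition by name, and splice in the replacement over its line span. This is a minimal Python sketch, not the commenter's actual implementation; `replace_function` and the top-level-only matching are illustrative assumptions.

```python
import ast

def replace_function(source: str, new_func_source: str) -> str:
    """Replace a top-level function in `source` with `new_func_source`,
    matched by name. A minimal sketch of whole-definition replacement."""
    new_func = ast.parse(new_func_source).body[0]  # the replacement def
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == new_func.name:
            # Splice the new source over the old definition's line span.
            start, end = node.lineno - 1, node.end_lineno
            return "\n".join(
                lines[:start] + new_func_source.splitlines() + lines[end:]
            )
    raise ValueError(f"function {new_func.name!r} not found")

old = "def add(a, b):\n    return a - b  # bug\n\nprint(add(2, 3))"
fixed = replace_function(old, "def add(a, b):\n    return a + b")
```

Nested definitions, decorators that move code around, and macros (in other languages) are exactly where this kind of span-based splicing gets confusing, as the comment notes.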



We have a similar method under the hood, except it's purely text-based search-and-replace. The model decides what to replace. It seems to be consistent and is easy to implement.
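A purely text-based scheme avoids the AST entirely: the model emits a search block and a replacement block, and the tool substitutes one for the other. A minimal sketch, assuming a single search/replace pair per edit (the exact format here is hypothetical, not this commenter's):

```python
def apply_edit(text: str, search: str, replace: str) -> str:
    """Apply one search/replace pair. Fails loudly if the search block
    is missing or ambiguous, so a hallucinated edit is caught early."""
    count = text.count(search)
    if count != 1:
        raise ValueError(f"search block matched {count} times, expected 1")
    return text.replace(search, replace, 1)

src = "x = 1\ny = 2\n"
out = apply_edit(src, "y = 2\n", "y = 3\n")
```

Requiring exactly one match is the simplest guard: a zero count usually means the model misquoted the file, and multiple counts mean the edit is ambiguous.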


My gut feeling, based on my experience over the last couple of months, is that substituting an entire function is more reliable than substituting a few lines of one. The surrounding context reduces the chance of hallucinations.

Gut feeling doesn't count for much though - I'm working on an evals system to quantify performance. It won't be cheap to run.

It could easily be that your method is superior.


From our experience, single- or few-line replacements work fine, since many changes are small edits spread across multiple spots in multiple files. We also provide surrounding context for the search-and-replace pairs, which helps the model. Beyond 10 lines, the model also usually includes the function header, which helps with the code generation.
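The context idea above can be sketched as a search/replace pair that carries a few anchoring lines on each side: the context must match but is left untouched, which disambiguates a snippet that appears in several places. The function name and format here are illustrative assumptions, not the commenter's actual scheme.

```python
def apply_with_context(text: str, before: str, target: str,
                       replace: str, after: str) -> str:
    """Replace `target` only where it is preceded by `before` and
    followed by `after`. The context pins down which occurrence of a
    repeated snippet the edit applies to."""
    needle = before + target + after
    if text.count(needle) != 1:
        raise ValueError("context did not pin a unique match")
    # Context lines are preserved; only the inner target is rewritten.
    return text.replace(needle, before + replace + after, 1)

doc = "a\nfoo\nb\nfoo\nc\n"
# Plain "foo\n" occurs twice; the b/c context selects the second one.
out = apply_with_context(doc, "b\n", "foo\n", "bar\n", "c\n")
```

This is the same guard as plain search-and-replace, just with a larger needle; the trade-off is that the model must quote the context verbatim, which longer snippets make harder.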

I'm also curious, how are you guys evaluating the performance of your models?


There's no systematic evaluation yet; that's the next step. The system successfully bootstraps itself, which is a fairly high bar, but quantitative performance measurements are becoming more important as the project progresses.


I feel the same - benchmarking in general is a pain, but a good benchmark could go a long way for us.



