
Am I correct in assuming that accuracy is capped by the rate of using the correct edit format? I.e., it made mistakes on 27% of the problems, 11% of which were due to (at least) messing up the diff format?

In which case, Google should be working on better adherence to the output format, since Claude and R1 are able to hit nearly 100% accuracy on the format.



It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.

Aider automatically asks models to retry malformed edits, so it recovers and goes on to produce a SOTA score.
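
To illustrate the retry idea, here is a rough sketch in Python. It is not Aider's actual code: `parse_edit`, `request_edit`, the `ask_model` callback, the retry limit, and the simplified SEARCH/REPLACE pattern are all assumptions made up for this example.

```python
import re
from typing import Callable, Optional, Tuple

# Simplified, hypothetical edit format: one SEARCH/REPLACE block per reply.
EDIT_BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

Edit = Tuple[str, str]  # (text to find, text to put in its place)


def parse_edit(reply: str) -> Optional[Edit]:
    """Return the (search, replace) pair, or None if the reply is malformed."""
    match = EDIT_BLOCK.search(reply)
    return (match.group(1), match.group(2)) if match else None


def request_edit(
    ask_model: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
) -> Tuple[Optional[Edit], int]:
    """Ask for an edit, re-asking whenever the reply does not parse.

    Returns the parsed edit (or None if every attempt was malformed)
    together with the number of attempts used.
    """
    message = prompt
    for attempt in range(1, max_retries + 1):
        edit = parse_edit(ask_model(message))
        if edit is not None:
            return edit, attempt
        # Malformed reply: repeat the task and remind the model of the format.
        message = (
            prompt
            + "\n\nYour last reply did not follow the edit format. "
            + "Please resend it as a SEARCH/REPLACE block."
        )
    return None, max_retries
```

With a loop like this, a malformed first reply costs an extra round trip but does not by itself fail the problem, which is how format adherence can be lower than the other frontier models while the final score still ends up high.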


Ok, thanks for clearing that up.



