Hacker News

I do agree with his identification of the problem: sometimes agents fail because of the tooling around them and not because of the model's reasoning. However, for the failing tests, I don't think he distinguishes between tests that fail due to a harness failure and tests that fail due to a reasoning failure. It would be nice if someone analyzed that from the data set.