
It's almost certain that it was, but the point of this puzzle benchmark is that it shouldn't really be possible to just memorize it, given the number of variations that can be created and the other criteria detailed in it.


Sure, but the types of pattern in these problems do repeat, so I don't think it'd be too hard to RL train on these, whether public samples, or a privately generated more-of-the-same dataset, to improve performance a lot.

Every company releasing new models leads with benchmark numbers, so it's hard to imagine they are not all putting a lot of effort into benchmark-maxxing.


Yes, everyone is doing that on benchmarks, but they are still somewhat useful, and the likes of ARC-AGI even more so. Though we are not able to quantify exactly how much better the models are getting, benchmarks are still necessary. For ARC-AGI these are some big gains, whichever way they went about it, since everyone has also been trying to max it for the last 3 years. But we do need to come up with better benchmarks/evals, like ARC tried.



