It's almost certain that it was, but the purpose of this puzzle benchmark is that...

HarHarVeryFunny · 2025-11-19T17:42:19 1763574139

Sure, but the types of pattern in these problems do repeat, so I don't think it'd be too hard to RL train on these, whether on public samples or on a privately generated, more-of-the-same dataset, to improve performance a lot.

Every company releasing new models leads with benchmark numbers, so it's hard to imagine they are not all putting a lot of effort into benchmark-maxxing.

ld4nt3 · 2025-11-19T22:50:04 1763592604

Yes, everyone is doing that with benchmarks, but they are still somewhat useful, and the likes of ARC-AGI even more so. Though we can't quantify exactly how much better models are getting, benchmarks are still necessary. For ARC-AGI these are some big gains, whichever way they went about it, since everyone has also been trying to max it for the last 3 years. But we do need to come up with better benchmarks/evals, like ARC tried to.