Yes, but then the benchmarks need to be presented as "this verifies whether the model can recall this exact situation, and does not actually benchmark any reasoning at all".
That is not the case: they're being presented as "how good is the model at software engineering". E.g. the benchmark in question says this:
"Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically."
But when your benchmark is embedded this thoroughly in the training data, you're actually just benchmarking "how well do you remember what sqlite looks like" rather than "do you understand all the tradeoffs, risks, and design decisions that go into building a bespoke database from scratch".
This is a VERY big caveat that, to me, explains a decent part of the discrepancy between benchmarks and reality.