They cannot be detected as long as there is no better evaluation. Sure, they would be caught if they produced obvious nonsense, but the whole point of a systematic evaluation is to overcome subjective impressions based on individual examples as a measure of quality.
Also, you are right that excluding test data from the training data would make the evaluation of your model more meaningful. However, given the enormous amounts of training data, this requires significant effort. And if that effort additionally leads to your model scoring worse on existing leaderboards, I doubt that (commercial) organizations would pay for it.
And again, as long as there is no better evaluation method, you still won't know how much it really helps.
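To make the "significant effort" concrete: a minimal sketch of what such decontamination might look like, assuming a naive word-level n-gram overlap check (real pipelines operate on hashed shingles at vastly larger scale; all names here are illustrative):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs, test_docs, n=8):
    """Keep only training docs sharing no n-gram with any test doc."""
    test_grams = set()
    for doc in test_docs:
        test_grams |= ngrams(doc, n)
    return [doc for doc in train_docs if not (ngrams(doc, n) & test_grams)]
```

Even this toy version must compare every training document against the benchmark, and paraphrased or translated contamination slips straight through it, which is part of why the effort is so costly in practice.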