Stein's paradox is bogus. Somebody needs to say that.
Here's one wikipedia example:
> Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.
Here's what's bogus about this: the "better estimate (on average)" is mathematically true ... for a certain definition of "better estimate". But whatever that definition is, it is irrelevant to the real world. If you believe you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop, then you probably believe in telepathy and astrology too.
(Disclaimer: stats noob here.) I thought the point was that you have a better chance of being -overall- closer to the true values (i.e., the 3D Euclidean distance between your guess and the vector of the three means would be smaller, on average), even though you may not necessarily have improved your odds of guessing any single individual mean.
So it's not that "you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop"; it's simply that you get a better estimate of the combined vector of the three means. (In this case the vector of the three means is probably meaningless, since the three data sets are entirely unrelated, but we could also imagine scenarios where that vector is meaningful.)
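For anyone who wants to see the distinction concretely, here is a minimal simulation sketch; the toy parameter values and all variable names are mine, and the measurements are assumed to be rescaled to unit noise variance. It compares the raw measurements against the positive-part James-Stein shrinkage of the 3-vector and prints both the total (vector) squared error and the per-coordinate errors.

    # Minimal sketch (my own toy numbers): raw measurements vs. positive-part
    # James-Stein shrinkage for a 3-vector of unrelated means, unit-variance noise.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.array([2.0, -1.0, 0.5])   # three arbitrary "true" parameters
    n_trials = 200_000

    x = theta + rng.standard_normal((n_trials, 3))   # one noisy measurement of each
    shrink = np.maximum(0.0, 1.0 - (3 - 2) / np.sum(x**2, axis=1, keepdims=True))
    js = shrink * x                                  # positive-part James-Stein estimate

    # Total squared error of the whole vector: James-Stein is lower on average.
    print(np.mean(np.sum((x - theta) ** 2, axis=1)),
          np.mean(np.sum((js - theta) ** 2, axis=1)))

    # Per-coordinate squared error: no guarantee of improvement for any single coordinate.
    print(np.mean((x - theta) ** 2, axis=0),
          np.mean((js - theta) ** 2, axis=0))

The guarantee only covers the first pair of numbers (the total over the vector); the per-coordinate numbers can go either way depending on the true values, which is exactly the distinction being made here.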
I am personally bothered by the way it is presented as a "paradox", with the implication that it would have real world applications.
I have zero doubt that you can't improve the estimate of the US wheat yield by looking at some other unrelated things, like candy bars. Presenting the result as if it were a real "improvement" is false advertising.
On the other hand, if we look at related observations, then the improvement is not a paradox at all. Let's say I want to estimate the average temperature in the US and in Europe. They are related, and combining the observations will give a better estimate, to nobody's surprise.
You are correct in that the combined estimator can actually be worse at estimating an individual value. It's only better if you specifically care about the combination (which you probably don't in this contrived example).
Right. The question is when (if ever) you would actually want to be minimizing the rms of the vector error. For most of us, the answer is "never".
I remember back in 7th or 8th grade I asked my math teacher why we want to minimize the rms error rather than the sum of the absolute values of the errors. She couldn't give me a good answer, but the book All of Statistics does answer why (and under what circumstances) that is the right thing to do.
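For what it's worth, one standard piece of that answer is that minimizing squared error amounts to reporting the mean (and coincides with maximum likelihood under Gaussian noise), while minimizing the sum of absolute errors amounts to reporting the median. A tiny numeric illustration, with the script and its names being my own:

    # Over a skewed sample, the constant that minimizes mean squared error is the
    # sample mean, while the constant that minimizes mean absolute error is the median.
    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.exponential(scale=2.0, size=10_000)   # skewed, so mean != median

    grid = np.linspace(0.0, 5.0, 2001)
    mse = [np.mean((data - c) ** 2) for c in grid]
    mae = [np.mean(np.abs(data - c)) for c in grid]

    print(grid[np.argmin(mse)], np.mean(data))       # roughly equal
    print(grid[np.argmin(mae)], np.median(data))     # roughly equal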
So this is just showing a bit of your ignorance of stats.
The general notion of compound risk is not specific to MSE loss. You can formulate it for any loss function, including the L1 loss you seem to prefer.
Stein's paradox and the James-Stein estimator are just the special case, for normal random variables and MSE loss, of the more general theory of compound estimation, which tries to find an estimator that leverages all the data to reduce overall error.
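To spell that out a bit (notation mine, measurement variance taken as known): compound risk just sums a per-coordinate loss, and James-Stein is the normal/MSE special case.

    % Compound risk for a generic per-coordinate loss L:
    R(\hat\theta, \theta) = \mathbb{E}\Big[ \sum_{i=1}^{p} L(\hat\theta_i, \theta_i) \Big]
    % James-Stein special case: X_i ~ N(\theta_i, \sigma^2), L(a,b) = (a-b)^2, p >= 3:
    \hat\theta^{\mathrm{JS}} = \Big( 1 - \frac{(p-2)\,\sigma^2}{\lVert X \rVert^2} \Big) X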
This idea, compound estimation via James-Stein, is by now somewhat outdated. Later came empirical Bayes estimation, and eventually modern Bayesian hierarchical modelling, once we had the compute for it.
One thing you can recover from EB is the James-Stein estimator, as a special case; in fact, you can design much better families of estimators that are optimal with respect to Bayes risk in compound estimation settings.
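For instance, the usual normal-normal empirical Bayes sketch (notation mine): put a normal prior on each mean, write down the posterior mean, and estimate the unknown shrinkage factor from the data itself; the natural plug-in estimate reproduces the James-Stein factor.

    % Normal-normal empirical Bayes sketch:
    \theta_i \sim \mathcal{N}(0, \tau^2), \qquad X_i \mid \theta_i \sim \mathcal{N}(\theta_i, \sigma^2)
    \;\Longrightarrow\;
    \mathbb{E}[\theta_i \mid X_i] = \Big( 1 - \frac{\sigma^2}{\sigma^2 + \tau^2} \Big) X_i
    % Marginally X_i ~ N(0, \sigma^2 + \tau^2), so (p-2)\sigma^2 / \lVert X \rVert^2 is an
    % unbiased estimate of \sigma^2 / (\sigma^2 + \tau^2); plugging it in gives James-Stein.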
This is broadly useful in pretty much any situation where you have a large-scale experiment in which many small samples are drawn and similar statistics are computed in parallel, or where the data has a natural hierarchical structure: biostatistics, for example, but also various internet data applications.
So yeah, I'd suggest being a bit more open to ideas you don't know anything about. @zeroonetwothree is not agreeing with you here; they're pointing out that you cooked up an irrelevant "example" and then claimed the technique doesn't make sense there. Of course it doesn't, but that's not because the idea of JS isn't broadly useful.
----
Another thing is that the JS estimator can be viewed as an example of improving the overall bias-variance tradeoff by regularization, although the connection to regularization as most people in ML use the term is maybe less obvious. If you think regularization isn't broadly applicable and very important... I've got some news for you.
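To make that connection a little more concrete, here is a rough sketch of my own (the design, the fixed penalty value, and all names are mine, and the setup is deliberately favorable to shrinkage): ridge regression pulls the least-squares coefficients toward zero in much the same spirit as James-Stein shrinks the raw measurements, accepting a bit of bias for a bigger drop in variance. Unlike James-Stein, ridge with a fixed penalty has no dominance guarantee; the point is only the bias-variance trade.

    # Ridge regression as shrinkage (my own construction, setup chosen so shrinkage helps).
    import numpy as np

    rng = np.random.default_rng(2)
    n, p, lam = 50, 40, 4.0
    beta_true = rng.normal(0.0, 0.5, size=p)   # many modest true coefficients
    X = rng.standard_normal((n, p))

    err_ols, err_ridge = [], []
    for _ in range(2000):
        y = X @ beta_true + rng.standard_normal(n)
        b_ols = np.linalg.solve(X.T @ X, X.T @ y)                      # unbiased, high variance
        b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # biased, lower variance
        err_ols.append(np.sum((b_ols - beta_true) ** 2))
        err_ridge.append(np.sum((b_ridge - beta_true) ** 2))

    # Total squared error over the coefficient vector: the shrunken estimator wins here.
    print(np.mean(err_ols), np.mean(err_ridge))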