Thanks for not just looking for an answer to your own question but also sharing it!
> So how does mean shift come into the picture? Mean shift exploits this KDE idea by imagining what the points would do if they all climbed up hill to the nearest peak on the KDE surface. It does so by iteratively shifting each point uphill until it reaches a peak.
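For anyone who finds that easier to read as code, here's a minimal sketch of the hill-climbing idea, assuming a Gaussian kernel and a fixed, user-chosen bandwidth (the function name and tolerance are just illustrative):

    import numpy as np

    def mean_shift(points, bandwidth, tol=1e-3, max_iter=300):
        # Shift every point uphill on the KDE surface until it stops moving.
        # points: (n, d) array; bandwidth: the kernel width, chosen by the user.
        shifted = points.copy().astype(float)
        for _ in range(max_iter):
            biggest_move = 0.0
            for i, x in enumerate(shifted):
                # Gaussian kernel weight of every original point, seen from x
                d2 = np.sum((points - x) ** 2, axis=1)
                w = np.exp(-d2 / (2 * bandwidth ** 2))
                # The mean-shift step: move x to the weighted mean of its neighbours
                new_x = (w[:, None] * points).sum(axis=0) / w.sum()
                biggest_move = max(biggest_move, np.linalg.norm(new_x - x))
                shifted[i] = new_x
            if biggest_move < tol:
                break
        return shifted

Points whose converged positions (roughly) coincide are then grouped into one cluster, which is where the "number of clusters = number of peaks" property comes from.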
I wonder if this would still work with high-dimensional data. Distance and space start to act weird in higher dimensions, right?
Without looking much further into it, I can't imagine it going very wrong. My intuition says that the density function will be bounded, since you're working with a finite number of points, so you'll always be able to climb to a local maximum.
The problem I see is the word "nearest" in "climb to the nearest peak". To quote the curse of dimensionality wiki page:
> The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance.
Are you concerned that in a very high dimensional space you'd need an absurd number of points in order to produce a meaningful KDE, since otherwise the points will be too sparse? If so, I think I'm with you... you'll probably just end up with a very large number of small, weak clusters.
Although I wonder if you can't account for that by simply increasing the kernel bandwidth. Perhaps not. Perhaps this is why mean-shift seems to be mostly used for computer vision and image processing, where (I guess?) the number of dimensions is low.
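If someone wants to poke at that intuition, scikit-learn's MeanShift makes it cheap to watch how the cluster count reacts to bandwidth as the dimensionality grows. A quick sketch (the sample sizes, dimensions and bandwidth values are arbitrary, and note scikit-learn uses a flat kernel rather than a Gaussian one):

    from sklearn.cluster import MeanShift
    from sklearn.datasets import make_blobs

    # Fixed number of samples spread over more and more dimensions,
    # so the same amount of data gets sparser as the dimensionality grows.
    for dim in (2, 10, 50):
        X, _ = make_blobs(n_samples=300, n_features=dim, centers=5, random_state=0)
        for bw in (1.0, 3.0, 10.0):
            labels = MeanShift(bandwidth=bw).fit(X).labels_
            print(f"dim={dim:3d}  bandwidth={bw:4.1f}  ->  {len(set(labels))} clusters")

My guess is that the bandwidth "sweet spot" gets harder to hit as the dimensionality goes up, but that's exactly the sort of thing this makes easy to check.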
https://en.wikipedia.org/wiki/Mean_shift#Clustering
but I don't get how C and r are chosen, nor how the method is supposed to be "non-parametric" given that you need such parameters to begin with.
Also, how are you supposed to assign all points to separate clusters, and how would you determine how many clusters to use in the first place?
Having looked at that page I also definitely don't think it's simpler than K-means!
Edit: This article gives a much more accessible description than wikipedia:
https://spin.atomicobject.com/2015/05/26/mean-shift-clusteri...
To answer my own questions, the number of clusters is determined by the number of local maxima (modes) of the kernel density estimate, but the kernel bandwidth does have to be specified.
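For what it's worth, that's exactly how it looks in scikit-learn: you never pass a cluster count, only a bandwidth (or let estimate_bandwidth pick one from the data), and the number of clusters is just however many peaks the points converge to. A small sketch with made-up data:

    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)

    # The only real knob is the bandwidth; estimate_bandwidth guesses one
    # from the pairwise distances if you'd rather not pick it by hand.
    bw = estimate_bandwidth(X, quantile=0.2)
    ms = MeanShift(bandwidth=bw).fit(X)

    # No "k" anywhere: the cluster count is simply the number of modes found.
    print(f"estimated bandwidth: {bw:.2f}")
    print(f"clusters found: {len(ms.cluster_centers_)}")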