K - means clustering, which I learned about recently. For non-machine learning f...

stevesimmons · on Oct 17, 2018

The scikit-learn page on clustering [1] is a really great way to see K-means and other algorithms side-by-side. The pictures really make it easy to see how all the algorithms perform on various data distributions.

[1] http://scikit-learn.org/stable/modules/clustering.html

failsafe · on Oct 17, 2018

The interesting part is solving the "closest pair problem" which is a part of the clustering algorithm. Hurts my head just thinking about it, god knows how someone came up with a solution like this. https://www.geeksforgeeks.org/closest-pair-of-points-using-d...

n4r9 · on Oct 17, 2018

The closest-pair algorithm is something you might use in single-linkage clustering, not K-means clustering:

https://en.wikipedia.org/wiki/Single-linkage_clustering

rripken · on Oct 17, 2018

Lesson two of this course does a great job explaining K-means and its connection to Expectation Maximization. https://www.udacity.com/course/machine-learning-unsupervised...

jacknews · on Oct 17, 2018

I prefer mean-shift for it's simplicity, as it even creates it's own clusters.

n4r9 · on Oct 17, 2018

Could you explain what you mean by that a little more? I looked at the Clustering section of the Wikipedia page for Mean-Shift:

https://en.wikipedia.org/wiki/Mean_shift#Clustering

but I don't get how C and r are chosen, nor how the method is supposed to be "non-parametric" given that you need such parameters to begin with.

Also, how are you supposed to assign all points to separate clusters, and how would you determine how many clusters to use in the first place?

Having looked at the that page I also definitely don't think it's simpler than K-means!

Edit: This article gives a much more accessible descriptions than wikipedia:

https://spin.atomicobject.com/2015/05/26/mean-shift-clusteri...

To answer my own questions, the number of clusters is determined by the number of max points of the kernel density function, but the kernel bandwidth does have to be specified.

vanderZwan · on Oct 17, 2018

Thanks for not just looking for an answer to your own question but also sharing it!

> So how does mean shift come into the picture? Mean shift exploits this KDE idea by imagining what the points would do if they all climbed up hill to the nearest peak on the KDE surface. It does so by iteratively shifting each point uphill until it reaches a peak.

I wonder if this would still work with high-dimensional data. Distance and space start to act weird in higher dimensions, right?

n4r9 · on Oct 17, 2018

Without looking much further into it, I can't imagine it going very wrong. My intuition says that the density function will be bounded, since you're working with a finite number of points, therefore you'll always be able to climb to a local maximum.

vanderZwan · on Oct 17, 2018

The problem I see is the word "near" in "climb to nearest peak". To quote the curse of dimensionality wiki page:

> The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance.

https://en.wikipedia.org/wiki/Curse_of_dimensionality

n4r9 · on Oct 17, 2018

Are you concerned that in a very high dimensional space you'd need an absurd number of points in order to produce a meaninful KDE, since otherwise the points will be too sparse? If so I think I'm with you... you'll probably just end up with a very large number of small, weak clusters.

Although I wonder if you can't account for that by simply increasing the kernel bandwidth. Perhaps not. Perhaps this is why mean-shift seems to be mostly used for computer vision and image processing, where (I guess?) the number of dimensions is low.

another-cuppa · on Oct 17, 2018

K-means is not an algorithm, it's a heuristic for an Np-hard problem.

n4r9 · on Oct 17, 2018

It is absolutely an algorithm in the sense of "a set of rules to be followed". I think you mean that it doesn't guarantee an optimal solution. That just means it's a heuristic algorithm, same as simulated annealing is a heuristic algorithm for solving optimisation problems.

another-cuppa · on Oct 17, 2018

Nope. An algorithm has to be effective. You can find pathological cases for k-means such that it will never converge on anything useful. So if you set your termination case to be convergence it will never terminate and if you don't then it will never be effective.

zaphar · on Oct 17, 2018

I think you might be in the minority in this opinion. Many algorithms have pathological cases but are still considered algorithms

another-cuppa · on Oct 17, 2018

Minority? This is directly from Knuth.

bstamour · on Oct 17, 2018

Knuth defines effectiveness as: "... all of the operations to be performed in the algorithm must be sufficiently basic that they can in principle be done exactly and in a finite length of time by a man using paper and pencil."

K-means and other heuristic algorithms fit that description.

skykooler · on Oct 17, 2018

BogoSort is an algorithm. Not a very good algorithm, but an algorithm nevertheless.

another-cuppa · on Oct 17, 2018

No it absolutely is not.

The lack of fundamental computer science knowledge in this thread is alarming.

another-cuppa · on Oct 17, 2018

Negative 2 points on a post saying that a computational method that possibly never terminates is not an algorithm... Oh dear...

wnoise · on Oct 17, 2018

It never terminates with probability 0.

n4r9 · on Oct 17, 2018

As far as I can tell you're only arguing against poor implementations of K-means. If you demand that the score strictly improves at each iteration then the algorithm must terminate.

another-cuppa · on Oct 17, 2018

And how do you "demand that the score strictly improves"? It's an NP-hard problem.

n4r9 · on Oct 17, 2018

K-means implementations generally terminate once there's an iteration where the score doesn't improve. This happens when there is convergence to a local minimum or - less likely - the state hops between two nearby local minima with the same score. But it will terminate on something, and most of the time that something will be pretty good.

I saw your mention of Knuth elsewhere, I looked it up and he demanded that

> An algorithm must always terminate after a finite number of steps ... a very finite number, a reasonable number

This is a pretty niche characterization and almost certainly not what the original post was asking for. However, I concur that there is no guarantee on how quickly K-means terminates or on how good the output will be,. But... if you're going to be that strict about it you would even have to rule out the Simplex Algorithm, which everyone I've ever spoken to thinks of as an algorithm.

billfruit · on Oct 17, 2018

In that sense kmeans may be better referred to as a 'computational method' rather than an algorithm.

another-cuppa · on Oct 17, 2018

Indeed.

wnkrshm · on Oct 17, 2018

Isn't a method that gives an approximate or best-fit estimate to a problem still an algorithm, if it terminates?

another-cuppa · on Oct 17, 2018

No. You can't prove that k-means does anything useful.

alanbernstein · on Oct 17, 2018

Is the definition of "algorithm" that you're using here useful?

another-cuppa · on Oct 17, 2018

It's one of the most fundamental concepts in computer science and underpins decades of research. You can decide if it's useful.

mindcrime · on Oct 17, 2018

This isn't a classroom, and your pedantry isn't adding anything useful to the conversation. We all understand these pedantic quibbles you're arguing about... and what the community is more or less collectively saying is "in this context, we don't care about the distinction between an 'algorithm' in the textbook sense, and a 'heuristic' in the textbook sense".

another-cuppa · on Oct 17, 2018

Nah. Most of them don't understand the difference. If you did you wouldn't can it pedantry.

I personally don't find heuristics beautiful. That's why I commented.

n4r9 · on Oct 18, 2018

To be fair, you haven't explained at all clearly why you don't think k-means adheres to Knuth's notion of an algorithm.

Your objection seems to be

> You can find pathological cases for k-means such that it will never converge on anything useful

As has been pointed out more than once, a good implementation of k-means is guaranteed to terminate in a finite time. And whatever you mean by "useful" doesn't seem to appear in Knuth's definition of an algorithm.