In MATLAB, there is a command kmeans() that divides an array into $k$ clusters and calculates the centroid of each cluster. Is there any command in Mathematica to perform the same action?
For example:
x = {{1,2,3,4,5}, {6,7,8,9,10}, {11,12,13,14,15}, {16,17,18,19,20}, {21,22,23,24,25}}
The MATLAB kmeans command does this:
[b, c] = kmeans(x, 2)
It divides x into two clusters, calculates the centroids of these two clusters, and indicates for each row of the array which cluster it belongs to:
b = {1,1,2,2,2}
and
c={{3.5000,4.5000,5.5000,6.5000,7.5000},{16.0000,17.0000,18.0000,19.0000,20.0000}}
Answer
ClusteringComponents is indeed the function to go for. To get the same results as MATLAB you need to do the following:
x = {{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, {11, 12, 13, 14, 15},
     {16, 17, 18, 19, 20}, {21, 22, 23, 24, 25}};
cc = ClusteringComponents[x, 2, 1, Method -> "KMeans",
  "DistanceFunction" -> SquaredEuclideanDistance, "RandomSeed" -> 1]
{1, 1, 2, 2, 2}
The arguments x and 2 are the same as in MATLAB. The 1 specifies the level of the nested array at which the data points are found. Here that is the top level, so each of the five rows is treated as a 5D point.
Mathematica offers several clustering methods. "KMeans" is the default, so specifying it here isn't strictly necessary. The default "DistanceFunction" is EuclideanDistance, whereas MATLAB's kmeans defaults to squared Euclidean distance, so we have to set SquaredEuclideanDistance explicitly.
Since k-means starts from a random initialization, the results may differ depending on the RNG state. I used "RandomSeed" -> 1 to put the RNG in a state that yields the results you showed; {1,1,1,2,2} is a possible output too.
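To see how the starting point alone can produce either labeling, here is a minimal sketch of the textbook k-means iteration (Lloyd's algorithm). It is not the actual implementation behind ClusteringComponents; the helper names assign, update, and lloyd are made up for this illustration, and the sketch assumes that no cluster ever becomes empty.

```mathematica
x = {{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, {11, 12, 13, 14, 15},
     {16, 17, 18, 19, 20}, {21, 22, 23, 24, 25}};

(* Hypothetical helper: label each point with the index of its nearest centroid *)
assign[pts_, cents_] :=
  Table[First@Ordering[SquaredEuclideanDistance[p, #] & /@ cents, 1], {p, pts}];

(* Hypothetical helper: recompute each centroid as the mean of its points *)
(* Assumes every label 1..k is used at least once *)
update[pts_, labels_, k_] := Table[Mean@Pick[pts, labels, i], {i, k}];

(* Alternate the two steps until the centroids stop changing *)
lloyd[pts_, init_] := Module[{cents},
  cents = FixedPoint[update[pts, assign[pts, #], Length[init]] &, init];
  {assign[pts, cents], cents}];

(* Two different initial centroid choices, both taken from the data *)
lloyd[x, x[[{1, 2}]]]
lloyd[x, x[[{4, 5}]]]
```

Starting from the first two rows, the iteration settles on the labels {1, 1, 2, 2, 2} with the centroids MATLAB reported; starting from the last two rows, it settles on {1, 1, 1, 2, 2}. By symmetry both are equally good solutions for this data set, so which one you see depends only on the initialization.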
Now to the centroids:
Mean /@ {Pick[x, cc, 1], Pick[x, cc, 2]} // N
{{3.5, 4.5, 5.5, 6.5, 7.5}, {16., 17., 18., 19., 20.}}
I added //N since you seemed to want machine-precision results. Omit it for exact results.
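For more than two clusters, the same idea can be written without listing each label by hand. The sketch below assumes the labels in cc run from 1 to Max[cc] with no gaps. The names centroids and sumd are only illustrative; sumd is meant to mirror the within-cluster sums of squared distances that MATLAB's kmeans can return as a third output.

```mathematica
x = {{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, {11, 12, 13, 14, 15},
     {16, 17, 18, 19, 20}, {21, 22, 23, 24, 25}};
cc = {1, 1, 2, 2, 2};  (* the labels obtained above *)

(* One centroid per label, for any number of clusters *)
centroids = Table[Mean@Pick[x, cc, i], {i, Max[cc]}]
(* {{7/2, 9/2, 11/2, 13/2, 15/2}, {16, 17, 18, 19, 20}} *)

(* Within-cluster sum of squared distances to each centroid *)
sumd = Table[
  Total[SquaredEuclideanDistance[#, centroids[[i]]] & /@ Pick[x, cc, i]],
  {i, Max[cc]}]
(* {125/2, 250} *)
```

Apply N to either result if you prefer machine-precision numbers.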
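On a side note: k-means can give very poor results on some data sets, which is a well-known weakness of the algorithm. Consider two elongated clusters whose centers are 6 apart in x but whose variance in y is 20: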
x1 = RandomVariate[MultinormalDistribution[{0, 0}, {{1, 0}, {0, 20}}], 500];
x2 = RandomVariate[MultinormalDistribution[{6, 0}, {{1, 0}, {0, 20}}], 500];
Graphics@{Red, Point@x1, Green, Point@x2}

xx = Join[x1, x2];
cc = ClusteringComponents[xx, 2, 1, Method -> "KMeans",
  "DistanceFunction" -> SquaredEuclideanDistance, "RandomSeed" -> 1];
{c1, c2} = {Pick[xx, cc, 1], Pick[xx, cc, 2]};
Graphics@{Red, Point@c1, Green, Point@c2}
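A rough calculation suggests why k-means struggles with this data. This is a back-of-the-envelope sketch in the large-sample limit, assuming the algorithm ends up cutting the data along the line $y = 0$ instead of between the two sources; the symbols $W_{\text{true}}$ and $W_{y}$ are introduced only for this illustration. k-means minimizes the mean squared distance of the points to their cluster centroid. Grouping the points by their true source gives, per point,

$$W_{\text{true}} = \sigma_x^2 + \sigma_y^2 = 1 + 20 = 21,$$

whereas cutting at $y = 0$ keeps the full spread of the mixture in $x$ (variance $1 + 3^2 = 10$, since the two means lie 3 on either side of their midpoint) but leaves only the variance of a half-normal distribution in $y$:

$$W_{y} = \left(1 + 3^2\right) + 20\left(1 - \frac{2}{\pi}\right) \approx 10 + 7.27 = 17.27.$$

Since $W_{y} < W_{\text{true}}$, the k-means objective scores the "wrong" split as better than the true grouping, so the poor result is a property of the objective and not just of an unlucky random start.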

In this case, one of Mathematica's three other clustering methods, Method -> "PAM", works wonders.