Hi,
I am working on implementing a variant of the k-means algorithm, namely
Bisecting K-means [1].
The basic premise is to run the original k-means algorithm on
increasingly smaller subsets of the original input data.
In each step of the outer loop, it splits the current cluster into two
smaller clusters and deletes the corresponding parent cluster.
I am currently using a modified version of the existing k-means
implementation from the Flink examples.
Pseudocode:
while currentClusterNumber < finalClusterNumber:
    currentCluster = pick the current largest cluster
    for i = 1 to innerIterations:
        pick 2 random starting centroids
        run k-means on currentCluster with these centroids
        store the output and compute the similarity of this temporary result
    pick the innerIteration result with the highest similarity
    replace currentCluster with its two smaller subsets
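For reference, here is a minimal, Flink-free Python sketch of the loop above, just to make the control flow concrete. It uses total within-cluster SSE as the "similarity" criterion (lower SSE = more cohesive split); the function names and that particular criterion are my own choices for illustration, not taken from the paper or the Flink examples:

```python
import random

def kmeans2(points, iters=20, seed=None):
    # One run of plain 2-means (Lloyd's algorithm) on a list of tuples.
    # Returns (sse, cluster_a, cluster_b).
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)  # 2 random starting centroids
    clusters = ([], [])
    for _ in range(iters):
        clusters = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[0 if d[0] <= d[1] else 1].append(p)
        for i in (0, 1):
            if clusters[i]:
                n = len(clusters[i])
                centroids[i] = tuple(sum(xs) / n for xs in zip(*clusters[i]))
    # Total within-cluster sum of squared errors of the final assignment.
    sse = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
              for i in (0, 1) for p in clusters[i])
    return sse, clusters[0], clusters[1]

def bisecting_kmeans(points, final_k, inner_iterations=5):
    # Outer loop: repeatedly split the largest cluster in two, keeping
    # the trial split with the lowest total SSE (highest cohesion).
    clusters = [list(points)]
    while len(clusters) < final_k:
        current = max(clusters, key=len)   # pick current largest cluster
        clusters.remove(current)           # delete the parent cluster
        best = None
        for trial in range(inner_iterations):
            sse, a, b = kmeans2(current, seed=trial)
            if best is None or sse < best[0]:
                best = (sse, a, b)
        clusters.extend([best[1], best[2]])
    return clusters
```

The point of the sketch is that only the inner kmeans2 call is a data-parallel Flink iteration; the outer while loop could in principle live in the driver program, resubmitting one bisection job per split, at the cost of one job per outer step.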
It all comes down to nested iterations, which are not supported by Flink
at the moment.
Does anyone have experience with this, or know a workaround to avoid
nesting iterations?
Best,
Adrian
----
[1] A Comparison of Document Clustering Techniques -
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.9225