(DEPRECATED) Apache Flink User Mailing List archive.

multiple k-means in parallel

Lydia Ickler

multiple k-means in parallel

Hi,

I want to run k-means with different values of k in parallel, so that each worker computes its own k-means run. Is that possible?

If I map over a list of integers and apply k-means inside the map function, I get the following error:
Task not serializable

I am looking forward to your answers!
Lydia

Fabian Hueske-2

Re: multiple k-means in parallel

Hi Lydia,

That is certainly possible; however, you need to adapt the algorithm a bit.

The straightforward approach would be to replicate the input data and assign an ID to each k-means run.

If you have a data point (1, 2, 3), you could replicate it into three data points (10, 1, 2, 3), (15, 1, 2, 3), and (20, 1, 2, 3), where the first field identifies the run by its number of centers (here k = 10, 15, and 20).
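
A minimal sketch of the replication step, written in plain Python rather than against the Flink API. The function name `replicate` and the list `ks` are made up for illustration; in a Flink job this would presumably be a flat-map style operation that emits one tagged copy of each record per run.

```python
from typing import Iterable, List, Tuple

def replicate(points: Iterable[Tuple], ks: List[int]) -> List[Tuple]:
    """Emit one copy of every point per run, prefixed with that run's k.

    `replicate` and `ks` are hypothetical names, not part of any Flink API.
    """
    return [(k, *point) for point in points for k in ks]

tagged = replicate([(1, 2, 3)], [10, 15, 20])
print(tagged)  # [(10, 1, 2, 3), (15, 1, 2, 3), (20, 1, 2, 3)]
```

Note that the data volume grows with the number of runs: every point is shipped once per value of k.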

From there you need a bit of custom partitioning and composite keys to shuffle the data to the right workers.
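
One way to read the composite-key idea is sketched below, again in plain Python and not as Flink code. It assumes the key is the pair (run ID, index of the nearest centroid), so that points from different runs never end up in the same group. The function `kmeans_step` and the dictionary layout of `centroids` are assumptions made for this example; the thread does not spell out the exact operators.

```python
from collections import defaultdict
from math import dist

def kmeans_step(tagged_points, centroids):
    """One k-means iteration for all runs at once (illustrative only).

    tagged_points: iterable of (run_id, point) pairs
    centroids:     dict mapping run_id -> list of centroid tuples
    Returns a dict of the same shape with updated centroids.
    """
    # Group points by the composite key (run_id, nearest centroid index).
    groups = defaultdict(list)
    for run_id, point in tagged_points:
        run_centroids = centroids[run_id]
        nearest = min(range(len(run_centroids)),
                      key=lambda i: dist(point, run_centroids[i]))
        groups[(run_id, nearest)].append(point)

    # Recompute each centroid as the mean of its group; centroids that
    # received no points keep their previous position.
    updated = {run_id: list(cs) for run_id, cs in centroids.items()}
    for (run_id, idx), members in groups.items():
        updated[run_id][idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
tagged_runs = [(k, p) for k in (1, 2) for p in points]  # run ID = k
start = {1: [(0.0, 0.0)], 2: [(0.0, 0.0), (9.0, 9.0)]}
print(kmeans_step(tagged_runs, start)[2])  # [(0.0, 1.0), (10.0, 10.0)]
```

Because the run ID is part of every key, a distributed engine can partition on it and process all runs in the same job instead of nesting one job inside a map function.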

Hope that helps,

Fabian

In reply to Lydia Ickler's message of 2016-11-27 11:48 GMT+01:00 (quoted in full above).