(DEPRECATED) Apache Flink User Mailing List archive.

How to prepare data for K means clustering

Ashutosh Kumar

How to prepare data for K means clustering

I saw the example code for K-means clustering. It takes input data points as pairs of double values (1.2 2.3\n5.3 7.2\n...). My question is: how do I convert my business data to this format? I have customer data with columns such as household income, education, and several others. I want to cluster on multiple columns, something like Nielsen segments.

Thanks

Ashutosh

Chiwan Park-2

Re: How to prepare data for K means clustering

Hi Ashutosh,

You can use basic Flink DataSet operations such as map and filter to transform your data. Essentially, you need to define a distance metric between the records in your data. The example uses Euclidean distance (see the euclideanDistance method in the Point class).
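As a rough illustration of the transformation step, here is a minimal plain-Java sketch (no Flink dependency) that turns a customer record into a numeric vector. The CSV layout "income,education", the education levels, the ordinal encoding, and the names CustomerFeatures, toRawVector and minMaxScale are all assumptions made for illustration; they are not part of the Flink example. Scaling is included because Euclidean distance would otherwise be dominated by the column with the largest range, such as income.

```java
import java.util.Arrays;
import java.util.List;

public class CustomerFeatures {

    // Hypothetical ordinal encoding for the education column (assumed levels).
    static final List<String> EDUCATION_LEVELS =
            Arrays.asList("none", "highschool", "bachelor", "master", "phd");

    /** Parses one line of the assumed form "income,education" into a raw numeric vector. */
    static double[] toRawVector(String csvLine) {
        String[] fields = csvLine.split(",");
        double income = Double.parseDouble(fields[0].trim());
        int level = EDUCATION_LEVELS.indexOf(fields[1].trim().toLowerCase());
        if (level < 0) {
            throw new IllegalArgumentException("Unknown education level: " + fields[1]);
        }
        return new double[] {income, level};
    }

    /** Min-max scales every column to [0, 1]; a constant column is mapped to 0. */
    static double[][] minMaxScale(double[][] rows) {
        int dims = rows[0].length;
        double[] min = new double[dims];
        double[] max = new double[dims];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : rows) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }
        double[][] scaled = new double[rows.length][dims];
        for (int i = 0; i < rows.length; i++) {
            for (int d = 0; d < dims; d++) {
                double range = max[d] - min[d];
                scaled[i][d] = range == 0 ? 0.0 : (rows[i][d] - min[d]) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        String[] lines = {"30000, highschool", "85000, master", "120000, phd"};
        double[][] raw = new double[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            raw[i] = toRawVector(lines[i]);
        }
        for (double[] row : minMaxScale(raw)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a Flink job, the per-record parsing would typically go inside a map function, while the column minimum and maximum would have to be computed over the whole DataSet first. Whether an ordinal encoding is appropriate for a column like education depends on your data.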

In the map method of the SelectNearestCenter class, the euclideanDistance method measures the distance between a point and each centroid. For your implementation, replace the point type with your own data type (a custom class or a Flink-provided Tuple) and change the distance metric to suit your data.
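To make the substitution concrete, here is a minimal plain-Java sketch of Euclidean distance and nearest-center selection generalized from two coordinates to any number of columns. The class name NearestCenter and the use of double[] as the record type are assumptions for illustration; the actual Flink example uses its own Point and Centroid classes.

```java
public class NearestCenter {

    /** Euclidean distance between two vectors of equal length. */
    static double euclideanDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the center closest to the given point. */
    static int nearestCenter(double[] point, double[][] centers) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double distance = euclideanDistance(point, centers[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three scaled features per customer (hypothetical values).
        double[][] centers = {{0.0, 0.0, 0.0}, {1.0, 1.0, 1.0}};
        double[] customer = {0.9, 0.8, 0.7};
        System.out.println("Nearest center index: " + nearestCenter(customer, centers));
    }
}
```

If some columns are categorical rather than numeric, a different metric may fit better than Euclidean distance; only the euclideanDistance method would need to change in this sketch.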

Regards,
Chiwan Park

> On Jan 21, 2016, at 4:14 PM, Ashutosh Kumar <[hidden email]> wrote:
>
> I saw the example code for K-means clustering. It takes input data points as pairs of double values (1.2 2.3\n5.3 7.2\n...). My question is: how do I convert my business data to this format? I have customer data with columns such as household income, education, and several others. I want to cluster on multiple columns, something like Nielsen segments.
>
> Thanks
> Ashutosh