How to prepare data for K means clustering

Ashutosh Kumar
I saw the example code for k-means clustering. It takes input data points as pairs of double values (for example, 1.2 2.3\n5.3 7.2). My question is: how do I convert my business data to this format? I have customer data with columns such as household income, education, and several others. I want to cluster on multiple columns, something like Nielsen segments.

Thanks
Ashutosh

Re: How to prepare data for K means clustering

Chiwan Park-2
Hi Ashutosh,

You can use basic Flink DataSet operations such as map and filter to transform your data. Essentially, you have to define a distance metric between the records in your data. The example uses Euclidean distance (see the euclideanDistance method in the Point class).
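As a rough illustration of the idea, here is a minimal plain-Java sketch (no Flink dependency) that turns one customer record into a numeric vector and computes Euclidean distance over any number of dimensions. The class name CustomerFeatures, the two columns, and the scaling bounds are all hypothetical; the scaling step is an assumption added so that a large-valued column such as income does not dominate the distance.

```java
// Hypothetical sketch: CustomerFeatures and its constants are illustrative, not part of Flink.
final class CustomerFeatures {
    // Assumed upper bounds used to bring each column into [0, 1]; pick values that fit your data.
    static final double MAX_INCOME = 200_000.0;
    static final double MAX_EDUCATION_YEARS = 20.0;

    // Turn one business record into a numeric feature vector.
    static double[] toVector(double householdIncome, double educationYears) {
        return new double[] {
            Math.min(householdIncome / MAX_INCOME, 1.0),
            Math.min(educationYears / MAX_EDUCATION_YEARS, 1.0)
        };
    }

    // Euclidean distance generalized from 2 dimensions to n dimensions.
    static double euclideanDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] a = toVector(50_000, 12);
        double[] b = toVector(150_000, 16);
        System.out.println(euclideanDistance(a, b));
    }
}
```

In a Flink job, a conversion like toVector would typically sit inside a map over the parsed input records.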

In the map method of the SelectNearestCenter class, the euclideanDistance method measures the distance between each point and each cluster center. For your implementation, replace the Point type with your own data type (a custom class or a Flink-provided Tuple) and change the distance metric to suit your data.
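The nearest-center step can be sketched in plain Java as follows. This is only an approximation of the kind of logic such a map method performs, not the example's actual code: the class NearestCenter and the method nearestCenterIndex are hypothetical names, and records are assumed to be already encoded as double[] vectors of equal length.

```java
// Hypothetical sketch: shows the nearest-center selection without any Flink classes.
final class NearestCenter {
    // Returns the index of the center closest to the point, or -1 if there are no centers.
    static int nearestCenterIndex(double[] point, double[][] centers) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            // Euclidean distance; swap in another metric here if it suits the data better.
            double sum = 0.0;
            for (int j = 0; j < point.length; j++) {
                double d = point[j] - centers[i][j];
                sum += d * d;
            }
            double distance = Math.sqrt(sum);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }
}
```

Changing the metric only requires replacing the inner loop; the surrounding selection logic stays the same.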

Regards,
Chiwan Park

> On Jan 21, 2016, at 4:14 PM, Ashutosh Kumar <[hidden email]> wrote:
>
> I saw the example code for k-means clustering. It takes input data points as pairs of double values (for example, 1.2 2.3\n5.3 7.2). My question is: how do I convert my business data to this format? I have customer data with columns such as household income, education, and several others. I want to cluster on multiple columns, something like Nielsen segments.
>
> Thanks
> Ashutosh