Distinct lines in a file

Flavio Pompermaier
Hi to all,
I'd like to get the unique lines of a file with Flink. Do I really need to map from String to Tuple1<String>, call distinct(), and then map from Tuple1 back to String before output?
Is there a smarter way to do it?

Best,
Flavio

Re: Distinct lines in a file

Fabian Hueske-2
Hi Flavio,

I agree, distinct() is a bit limited right now and in fact, there is no good reason for that except nobody found time to improve it.
You can use distinct(KeySelector k) to work directly on DataSet<String> but that's not very convenient either:

DataSet<String> strings = env.fromElements("Hello", "Hello", "World", "Hello");

strings.distinct(new KeySelector<String, String>() {
   @Override
   public String getKey(String value) throws Exception {
      return value;
   }
}).print();
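
To make the idea behind a key-based distinct concrete, here is a minimal plain-Java sketch that does not use Flink at all. The class name KeyedDistinctSketch and the helper distinctByKey are hypothetical names chosen for illustration, not part of any Flink API. The sketch assumes a single-threaded, in-memory list and keeps the first element seen for each key; a distributed engine may not give the same guarantee about which element is kept or about output order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of key-based deduplication (not Flink code).
public class KeyedDistinctSketch {

    // Keeps the first element encountered for each distinct key.
    static <T, K> List<T> distinctByKey(List<T> input, Function<T, K> keySelector) {
        Set<K> seenKeys = new HashSet<>();
        List<T> result = new ArrayList<>();
        for (T value : input) {
            // Set.add returns false if the key was already present.
            if (seenKeys.add(keySelector.apply(value))) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> strings = Arrays.asList("Hello", "Hello", "World", "Hello");
        // Identity key: the element itself is the key, as in the snippet above.
        System.out.println(distinctByKey(strings, value -> value)); // prints [Hello, World]
    }
}
```

With the identity function as the key, this reduces to plain deduplication of strings. A different key, for example value -> value.toLowerCase(), would treat lines that differ only in case as duplicates.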

Making distinct() more generic should not take long.
I'll open a JIRA and might eventually fix it, if nobody picks it up.

Cheers, Fabian

In reply to Flavio Pompermaier's message of 2015-04-30 12:05 GMT+02:00 (shown above).