Hello, I would like to a way to count a dataset to check if it is empty or not.. But .count() throw an execution and I do not want to do separe job execution plan, as hthis will trigger multiple reading.. I would like to have something like.. Source -> map -> count -> if 0 -> do someting if not -> do something More concrete i would like to check if one of my dataset is empty before doing a cross operation.. Thanks, Bastien |
Hi, Counting always requires a job to be executed. Not sure if this is what you want to do, but if you want to prevent to get an empty result due to an empty cross input, you can use a mapPartition() with parallelism 1 to emit a special record, in case the MapPartitionFunction didn't see any data. Best, Fabian Am Mi., 7. Nov. 2018 um 14:02 Uhr schrieb bastien dine <[hidden email]>:
|
Another option for certain tasks is to work with broadcast variables [1]. The value could be use to configure two filters. DataSet<Data> input = .... DataSet<Long> count = input.map(-> 1L).sum() DataSet<OutCntZero> input.filter(if cnt == 0).withBroadcastSet("cnt", count).doSomething DataSet<OutCntNonZero> input.filter(if cnt != 0).withBroadcastSet("cnt", count).doSomethingElse Fabian Am Mi., 7. Nov. 2018 um 14:10 Uhr schrieb Fabian Hueske <[hidden email]>:
|
Hi Fabian, Thanks for the response, I am going to use the second solution ! Regards, Bastien Le mer. 7 nov. 2018 à 14:16, Fabian Hueske <[hidden email]> a écrit :
|
Free forum by Nabble | Edit this page |