Hi everyone,
In a step function (bulk iteration) I'd like to join the working set W with another data set T. The join field of T depends on the current superstep. Unfortunately, W has no access to the iteration runtime context.

I tried to extract the current superstep at the beginning of the step function and broadcast it to a UDF applied on T (which sets the correct value of the join field), so that the join is always performed on the same fields. Unfortunately, this does not seem to work either.

I could work around that by replicating the elements of T and joining multiple times, but this does not scale very well.

Any suggestion would be appreciated.

Cheers and thank you,
Martin
Hi again,
I had a bug in my logic. It works as expected (which is perfect).

So maybe for others:

Problem:
- execute superstep-dependent UDFs on datasets which do not have access to the iteration context

Solution:
- add a dummy element to the working set (W) at the beginning of the step function
- extract the dummy from W using a filter function
- convert the dummy into a DataSet<Integer> (superstep) using a map function
- broadcast that 1-element dataset to the UDFs applied on the "external" datasets
- filter non-dummy elements (if necessary) and continue the step function

Note that it should also work with cross instead of broadcasting; I have not yet tested which way is faster.

Apologies if anyone thought about this when it was my error in the end :)

Cheers,
Martin

On 29.05.2016 14:05, Martin Junghanns wrote:
> Hi everyone,
> [...]
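The solution steps listed in this message can be modelled roughly as follows. This is only a plain-Java sketch in which lists stand in for DataSets; it is not Flink API code and not Martin's actual implementation. The class name, `DUMMY_ID`, the `WElem`/`TElem` types and the per-superstep key array are all assumptions made for illustration. The comments name the Flink calls each step would presumably map to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SuperstepBroadcastSketch {

    static final long DUMMY_ID = -1L; // hypothetical marker id for the dummy element

    /** Working-set element; for the dummy, `value` carries the superstep number. */
    static final class WElem {
        final long id;
        final int value;
        WElem(long id, int value) { this.id = id; this.value = value; }
    }

    /** Element of the external set T: one candidate join key per superstep (assumption). */
    static final class TElem {
        final String name;
        final int[] keys;
        TElem(String name, int... keys) { this.name = name; this.keys = keys; }
    }

    // Step 1: append a dummy that records the superstep. In Flink this would happen in a
    // rich function on W, which can call getIterationRuntimeContext().getSuperstepNumber().
    static List<WElem> addDummy(List<WElem> w, int superstep) {
        List<WElem> out = new ArrayList<>(w);
        out.add(new WElem(DUMMY_ID, superstep));
        return out;
    }

    // Steps 2 and 3: filter the dummy out of W and map it to a one-element Integer list.
    static List<Integer> extractSuperstep(List<WElem> w) {
        List<Integer> out = new ArrayList<>();
        for (WElem e : w) {
            if (e.id == DUMMY_ID) out.add(e.value);
        }
        return out;
    }

    // Step 4: the UDF on T reads the "broadcast" superstep (withBroadcastSet /
    // getBroadcastVariable in Flink) and picks the join key for this superstep.
    static int joinKey(TElem t, List<Integer> broadcastSuperstep) {
        int superstep = broadcastSuperstep.get(0);
        return t.keys[superstep - 1]; // supersteps are counted from 1
    }

    // Step 5: skip the dummy again and join W.value with the selected key of T.
    static List<String> step(List<WElem> w, List<TElem> t, int superstep) {
        List<WElem> withDummy = addDummy(w, superstep);
        List<Integer> broadcast = extractSuperstep(withDummy);
        List<String> joined = new ArrayList<>();
        for (WElem e : withDummy) {
            if (e.id == DUMMY_ID) continue;
            for (TElem x : t) {
                if (e.value == joinKey(x, broadcast)) joined.add(e.id + "-" + x.name);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<WElem> w = Arrays.asList(new WElem(1, 10), new WElem(2, 20));
        List<TElem> t = Arrays.asList(new TElem("a", 10, 20), new TElem("b", 20, 10));
        System.out.println(step(w, t, 1)); // join partners in superstep 1
        System.out.println(step(w, t, 2)); // different partners in superstep 2
    }
}
```

The point of the model is that T never sees the iteration context directly: it only sees the one-element list derived from the dummy, which is what makes the join field superstep-dependent while the join itself always uses the same field.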
Hi Martin,
No worries. Thanks for letting us know!

Cheers,
Max

On Mon, May 30, 2016 at 9:17 AM, Martin Junghanns wrote:
> Hi again,
> [...]