Context-specific step function in Iteration

Martin Junghanns-2
Hi everyone,

In a (bulk) step function I'd like to join the working set W
with another data set T. The join field of T depends on
the current superstep. Unfortunately, W has no access
to the iteration runtime context.

I tried to extract the current superstep at the beginning of
the step function and broadcast it to a UDF applied to T
(which sets the correct value in the join field), so that the
join is always performed on the same fields. Unfortunately,
this does not seem to work either.

I could work around that by replicating the elements of T and
joining multiple times, but this does not scale very well.

Any suggestions would be appreciated.

Cheers and thank you,

Martin

Re: Context-specific step function in Iteration

Martin Junghanns-2
Hi again,

I had a bug in my logic. It works as expected (which is perfect).

For others who run into the same issue:

Problem:
- execute superstep-dependent UDFs on datasets which do not have access
to the iteration context

Solution:
- add a dummy element to the working set (W) at the beginning of the step
function
- extract the dummy from W using a filter function
- convert the dummy into a DataSet<Integer> (the superstep) using a map function
- broadcast that 1-element dataset to the UDFs applied to the "external"
datasets
- keep only the non-dummy elements of W (if necessary) and continue with the
step function
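
A minimal sketch of the steps above, written in plain Java with lists standing in for the datasets, so it is a model of the data flow and not Flink code. The class and method names, and the rule "even supersteps join on one field, odd supersteps on the other", are assumptions made up for illustration. In Flink the superstep passed to addDummy would presumably come from getIterationRuntimeContext().getSuperstepNumber() in a rich function on W, and the one-element list would be handed to the UDF on T via withBroadcastSet(...) and getBroadcastVariable(...).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the dummy-element workaround; Lists stand in for Flink DataSets. */
final class SuperstepDummySketch {

    /** Working-set element. The dummy carries the superstep; regular elements carry a key. */
    static final class WElem {
        final long key;
        final boolean dummy;
        final int superstep;
        WElem(long key, boolean dummy, int superstep) {
            this.key = key;
            this.dummy = dummy;
            this.superstep = superstep;
        }
    }

    /** Element of the external data set T with two candidate join fields (hypothetical). */
    static final class TElem {
        final long evenKey;
        final long oddKey;
        TElem(long evenKey, long oddKey) {
            this.evenKey = evenKey;
            this.oddKey = oddKey;
        }
    }

    /** Step 1: a UDF on W that can see the iteration context appends the dummy. */
    static List<WElem> addDummy(List<WElem> w, int superstep) {
        List<WElem> out = new ArrayList<>(w);
        out.add(new WElem(-1L, true, superstep));
        return out;
    }

    /** Steps 2 and 3: filter the dummy out of W and map it to a one-element superstep list. */
    static List<Integer> extractSuperstep(List<WElem> w) {
        List<Integer> out = new ArrayList<>();
        for (WElem e : w) {
            if (e.dummy) {
                out.add(e.superstep);
            }
        }
        return out;
    }

    /** Step 4: the UDF on T reads the "broadcast" superstep and picks the join key. */
    static List<Long> selectJoinKeys(List<TElem> t, List<Integer> broadcastSuperstep) {
        int superstep = broadcastSuperstep.get(0);
        List<Long> keys = new ArrayList<>();
        for (TElem e : t) {
            // Assumed rule: even supersteps use evenKey, odd supersteps use oddKey.
            keys.add(superstep % 2 == 0 ? e.evenKey : e.oddKey);
        }
        return keys;
    }

    /** Step 5 plus the join: skip the dummy, keep W keys that match a selected T key. */
    static List<Long> step(List<WElem> wWithDummy, List<TElem> t) {
        List<Long> tKeys = selectJoinKeys(t, extractSuperstep(wWithDummy));
        List<Long> joined = new ArrayList<>();
        for (WElem e : wWithDummy) {
            if (!e.dummy && tKeys.contains(e.key)) {
                joined.add(e.key);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<WElem> w = List.of(new WElem(1L, false, 0), new WElem(2L, false, 0));
        List<TElem> t = List.of(new TElem(1L, 2L));
        System.out.println(step(addDummy(w, 1), t)); // superstep 1 joins on oddKey
        System.out.println(step(addDummy(w, 2), t)); // superstep 2 joins on evenKey
    }
}
```

The step method only ever sees W and T, which mirrors the constraint in the problem statement: the superstep reaches the UDF on T solely through the dummy element.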

Note that it should also work with cross instead of broadcasting; I have
not yet tested which way is faster.
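
To illustrate the cross variant, here is a small plain-Java model (again lists instead of datasets, with hypothetical names and values): crossing T with the one-element superstep dataset yields exactly one pair per element of T, each tagged with the current superstep, so a following UDF can pick the join field from the pair instead of from a broadcast variable. In Flink this would presumably correspond to DataSet.cross.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the cross variant; Lists stand in for Flink DataSets. */
final class SuperstepCrossSketch {

    /** Pairs every element of T with every element of the superstep list. */
    static List<Map.Entry<String, Integer>> cross(List<String> t, List<Integer> superstep) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String tElem : t) {
            for (Integer s : superstep) {
                out.add(Map.entry(tElem, s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical T elements and a one-element superstep list.
        List<Map.Entry<String, Integer>> tagged = cross(List.of("t1", "t2"), List.of(3));
        System.out.println(tagged);
    }
}
```

Because the superstep list has a single element, the result has the same size as T.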

Apologies if anyone thought about this when it was my error in the end :)

Cheers,
Martin


On 29.05.2016 14:05, Martin Junghanns wrote:

> Hi everyone,
>
> In a step-function (bulk) I'd like to join the working set W
> with another data set T. The join field of T depends on
> the current super step. Unfortunately, W has no access
> to the iteration runtime context.
>
> I tried to extract the current superstep at the beginning of
> the step function and broadcasted it to a UDF applied on T
> (which sets the correct value join field) and perform the join
> always on the same fields. Unfortunately, this does not seem
> to work either.
>
> I could work around that by replicating the elements of T and
> join multiple times but this does not scale very well.
>
> Any suggestion would be appreciated.
>
> Cheers and thank you,
>
> Martin
>
Re: Context-specific step function in Iteration

Maximilian Michels
Hi Martin,

No worries. Thanks for letting us know!

Cheers,
Max

On Mon, May 30, 2016 at 9:17 AM, Martin Junghanns
<[hidden email]> wrote:

> Hi again,
>
> I had a bug in my logic. It works as expected (which is perfect).
>
> So maybe for others:
>
> Problem:
> - execute superstep-dependent UDFs on datasets which do not have access to
> the iteration context
>
> Solution:
> - add dummy element to the working set (W) at the beginning of the step
> function
> - extract dummy from W using a filter function
> - convert dummy into DataSet<Integer> (superstep) using a map function
> - broadcast that 1-element dataset to the UDFs applied on the "external"
> datasets
> - filter non-dummy elements (if necessary) and continue step function
>
> Note, that it should also work with cross instead of broadcasting, I did not
> test which way works faster, yet.
>
> Apologies if anyone thought about this when it was my error in the end :)
>
> Cheers,
> Martin
>