Maintaining Stream Partitioning after Mapping?

Maintaining Stream Partitioning after Mapping?

Ryan Conway
Greetings,

Is there a means of maintaining a stream's partitioning after running it through an operation such as map or filter?

I have a pipeline stage S that operates on a stream partitioned by an ID field. S flat maps objects of type A to type B, which both have an "ID" field, and where each instance of B that S outputs has the same ID as its input instance of A. I hope to add a pipeline stage T immediately after S that operates using the same partitioning as S, so that I can avoid the expense of re-keying the instances of type B.

If I am understanding the DataStream API correctly this is not feasible with Flink, as map(), filter() etc. all output SingleOutputStreamOperator. But I am hoping that I am missing something.
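
For concreteness, the shape I have in mind looks roughly like this (placeholder POJOs and no-op stage bodies, just to show the structure; assumes the Flink DataStream dependencies on the classpath; the second keyBy() is the step I was hoping to avoid):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class Pipeline {
    // Placeholder POJOs standing in for my real types A and B.
    public static class A { public String id; public A() {} public A(String id) { this.id = id; } }
    public static class B { public String id; public B() {} public B(String id) { this.id = id; } }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<A> as = env.fromElements(new A("x"), new A("y"));

        // Stage S: partitioned by ID, flat maps A -> B, each B keeping its input A's ID.
        DataStream<B> bs = as
                .keyBy("id")
                .flatMap(new FlatMapFunction<A, B>() {
                    @Override
                    public void flatMap(A a, Collector<B> out) {
                        out.collect(new B(a.id));
                    }
                });

        // Stage T: ideally this would reuse S's partitioning directly; as far
        // as I can tell the API forces a fresh keyBy() here.
        bs.keyBy("id")
          .map(new MapFunction<B, B>() {
              @Override
              public B map(B b) { return b; } // placeholder for T's real logic
          });

        env.execute("sketch");
    }
}
```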

Thank you,
Ryan

Re: Maintaining Stream Partitioning after Mapping?

Chesnay Schepler
Hello,

I think if you have multiple keyBy() transformations with identical parallelism, the partitioning should be "preserved". The second keyBy() will still go through the partitioning process, but since both the key and the parallelism are identical, the resulting partition should be identical as well, so no data is shuffled around. We aren't really preserving the partitioning, but re-creating the original one.
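
To illustrate the argument with a simplified model (plain hash-modulo channel assignment; Flink's actual routing goes through key groups, but it is equally deterministic in the key and the parallelism):

```java
import java.util.List;
import java.util.Objects;

public class RepartitionDemo {
    // Simplified partition assignment: hash the key into one of `parallelism`
    // channels. (Flink's real scheme routes through key groups, but the
    // stability argument is the same.)
    static int partition(String key, int parallelism) {
        return Math.floorMod(Objects.hashCode(key), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        for (String id : List.of("a", "b", "c", "d", "e")) {
            int before = partition(id, parallelism); // channel chosen by the first keyBy()
            int after = partition(id, parallelism);  // channel chosen by the second keyBy()
            // Same key and same parallelism => same channel: nothing moves.
            if (before != after) {
                throw new AssertionError("key " + id + " moved");
            }
            System.out.println(id + " -> " + before);
        }
    }
}
```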

Regards,
Chesnay

On 12.04.2017 21:37, Ryan Conway wrote:


Re: Maintaining Stream Partitioning after Mapping?

Ryan Conway
Thank you, Chesnay. My goal is to keep things computationally inexpensive, and if I understand you correctly, that holds even with the second keyBy().

Ryan

On Sat, Apr 15, 2017 at 4:22 AM, Chesnay Schepler <[hidden email]> wrote:
