Maintaining Stream Partitioning after Mapping?

Maintaining Stream Partitioning after Mapping?

Ryan Conway
Greetings,

Is there a means of maintaining a stream's partitioning after running it through an operation such as map or filter?

I have a pipeline stage S that operates on a stream partitioned by an ID field. S flat maps objects of type A to type B, which both have an "ID" field, and where each instance of B that S outputs has the same ID as its input instance of A. I hope to add a pipeline stage T immediately after S that operates using the same partitioning as S, so that I can avoid the expense of re-keying the instances of type B.

If I am understanding the DataStream API correctly this is not feasible with Flink, as map(), filter() etc. all output SingleOutputStreamOperator. But I am hoping that I am missing something.
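
For concreteness, the shape I have in mind looks roughly like this (placeholder POJOs and no-op stage bodies, just to show the structure; assumes the Flink DataStream dependencies on the classpath; the second keyBy() is the step I was hoping to avoid):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class Pipeline {
    // Placeholder POJOs standing in for my real types A and B.
    public static class A { public String id; public A() {} public A(String id) { this.id = id; } }
    public static class B { public String id; public B() {} public B(String id) { this.id = id; } }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<A> as = env.fromElements(new A("x"), new A("y"));

        // Stage S: partitioned by ID, flat maps A -> B, each B keeping its input A's ID.
        DataStream<B> bs = as
                .keyBy("id")
                .flatMap(new FlatMapFunction<A, B>() {
                    @Override
                    public void flatMap(A a, Collector<B> out) {
                        out.collect(new B(a.id));
                    }
                });

        // Stage T: ideally this would reuse S's partitioning directly; as far
        // as I can tell the API forces a fresh keyBy() here.
        bs.keyBy("id")
          .map(new MapFunction<B, B>() {
              @Override
              public B map(B b) { return b; } // placeholder for T's real logic
          });

        env.execute("sketch");
    }
}
```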

Thank you,
Ryan

Re: Maintaining Stream Partitioning after Mapping?

Chesnay Schepler
Hello,

I think if you have multiple keyBy() transformations with identical parallelism, the partitioning should be "preserved". The second keyBy() will still go through the partitioning process, but since both the key and the parallelism are identical, the resulting partition should be identical as well, so no data is shuffled around. We aren't really preserving the partitioning, but re-creating the original one.
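
To illustrate the argument with a simplified model (plain hash-modulo channel assignment; Flink's actual routing goes through key groups, but it is equally deterministic in the key and the parallelism):

```java
import java.util.List;
import java.util.Objects;

public class RepartitionDemo {
    // Simplified partition assignment: hash the key into one of `parallelism`
    // channels. (Flink's real scheme routes through key groups, but the
    // stability argument is the same.)
    static int partition(String key, int parallelism) {
        return Math.floorMod(Objects.hashCode(key), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        for (String id : List.of("a", "b", "c", "d", "e")) {
            int before = partition(id, parallelism); // channel chosen by the first keyBy()
            int after = partition(id, parallelism);  // channel chosen by the second keyBy()
            // Same key and same parallelism => same channel: nothing moves.
            if (before != after) {
                throw new AssertionError("key " + id + " moved");
            }
            System.out.println(id + " -> " + before);
        }
    }
}
```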

Regards,
Chesnay

On 12.04.2017 21:37, Ryan Conway wrote:


Re: Maintaining Stream Partitioning after Mapping?

Ryan Conway
Thank you, Chesnay. My goal is to keep things computationally inexpensive, and if I understand you correctly, that holds even with the second keyBy().

Ryan

On Sat, Apr 15, 2017 at 4:22 AM, Chesnay Schepler <[hidden email]> wrote:
