[VOTE] How to Deal with Split/Select in DataStream API

Xingcan Cui
Hi folks,

Two weeks ago, I started a thread [1] discussing whether we should discard the split/select methods (which have been marked as deprecated since v1.7) in the DataStream API.

The fact is, these methods cause "unexpected" results when used consecutively (e.g., ds.split(a).select(b).split(c).select(d)) or multiple times on the same target (e.g., ds.split(a).select(b), ds.split(c).select(d)). The reason is that, following the initial design, new split/select logic always overrides the existing logic on the same target operator rather than being appended to it. Some users may not be aware of that, but if you are, the current workaround is to use the more powerful side output feature [2].
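
To make the difference between "override" and "append" concrete, here is a minimal sketch in plain Java. It is a toy model under stated assumptions, not Flink code and not a reproduction of Flink's internals: the class `SplitSelectModel`, its `Step` type, and the even/odd and small/large selectors are all hypothetical names chosen for illustration. Each `Step` stands for one split(...).select(...) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model only: this is NOT Flink code and does not reproduce Flink's internals.
final class SplitSelectModel {

    /** One split/select pair: a selector that names each element, plus the name to keep. */
    static final class Step {
        final Function<Integer, String> selector;
        final String selected;

        Step(Function<Integer, String> selector, String selected) {
            this.selector = selector;
            this.selected = selected;
        }

        boolean accepts(int value) {
            return selected.equals(selector.apply(value));
        }
    }

    /** "Override": only the most recently registered step decides (steps must be non-empty). */
    static List<Integer> runOverride(List<Integer> input, List<Step> steps) {
        Step last = steps.get(steps.size() - 1);
        List<Integer> out = new ArrayList<>();
        for (int v : input) {
            if (last.accepts(v)) {
                out.add(v);
            }
        }
        return out;
    }

    /** "Append": every step is applied, so each one narrows the result further. */
    static List<Integer> runAppend(List<Integer> input, List<Step> steps) {
        List<Integer> out = new ArrayList<>();
        for (int v : input) {
            boolean keep = true;
            for (Step s : steps) {
                keep = keep && s.accepts(v);
            }
            if (keep) {
                out.add(v);
            }
        }
        return out;
    }

    /** Hypothetical stand-in for ds.split(a).select("even").split(c).select("small"). */
    static List<Step> exampleSteps() {
        Step evenOnly = new Step(v -> v % 2 == 0 ? "even" : "odd", "even");
        Step smallOnly = new Step(v -> v <= 3 ? "small" : "large", "small");
        return Arrays.asList(evenOnly, smallOnly);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println("override: " + runOverride(input, exampleSteps())); // override: [1, 2, 3]
        System.out.println("append:   " + runAppend(input, exampleSteps()));   // append:   [2]
    }
}
```

In this model, a user who expects the two steps to compose would expect only the even and small element, while the override rule lets the odd elements 1 and 3 through because the first step is silently dropped.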

FLINK-11084 added some restrictions to the existing split/select logic and suggested replacing it with side output in the future. However, considering that side output is currently only available in the process function layer and that split/select may already be widely used in many real-world applications, we'd like to start a vote and listen to the community on how to deal with them.

In the discussion thread [1], we proposed three solutions as follows. All of them are feasible but have different impacts on the public API.

1) Port the side output feature to the DataStream API's flatMap and replace split/select with it.

2) Introduce a dedicated function in the DataStream API (with the "correct" behavior but a different name) that can be used to replace the existing split/select.

3) Keep split/select but change the behavior/semantics to be "correct".

Note that this is just a vote for gathering information, so feel free to participate and share your opinions.

The voting time will end on July 7th 17:00 EDT.

Thanks,
Xingcan

[1] https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html

Re: [VOTE] How to Deal with Split/Select in DataStream API

Hao Sun
Personally I prefer 3) to keep split/select and correct the behavior. I feel side output is kind of overkill for such a primitive function, and I prefer simple APIs like split/select.

Hao Sun



Re: [VOTE] How to Deal with Split/Select in DataStream API

Xingcan Cui
Hi all,

Thanks for your participation.

In this thread, we got one +1 each for option 1 and option 3. In the original thread [1], we got two +1s for option 1, one +1 for option 2, and five +1s and one -1 for option 3.

To summarize,

Option 1 (port side output to flatMap and deprecate split/select): three +1
Option 2 (introduce a new split/select and deprecate existing one): one +1
Option 3 ("correct" the existing split/select): six +1 and one -1

It seems that most people involved are in favor of "correcting" the existing split/select. However, this will definitely break API compatibility in a subtle way.

IMO, the real behavior of consecutive split/select calls has never been thoroughly clarified. Even within the community, it is hard to say that we have reached a consensus on its real semantics [2-4]. Though the initial design is not ambiguous, there's no doubt that its concept has drifted.

As split/select is quite an ancient API, I cc'ed this to more members. It would be great if you could share your opinions on this.

Thanks,
Xingcan



On Jul 5, 2019, at 12:04 AM, 杨力 <[hidden email]> wrote:

I prefer approach 1). I used to carry fields that are needed only for splitting in the outputs of flatMap functions. Replacing them with outputTags would simplify the data structures.
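
The trade-off described here can be sketched in plain Java. This is only an illustration under stated assumptions: `TaggedOutputModel`, `RoutedRecord`, `emitWithRouteField`, and `emitWithTags` are hypothetical names, and none of this is Flink's API. Style A carries a routing field inside every output record; style B keeps the tag next to the record, which is roughly the idea behind output tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name here is hypothetical and none of it is a Flink API.
final class TaggedOutputModel {

    /** A record that carries an extra "route" field whose only purpose is later splitting. */
    static final class RoutedRecord {
        final String route;
        final String payload;

        RoutedRecord(String route, String payload) {
            this.route = route;
            this.payload = payload;
        }
    }

    /** Shared routing rule, chosen arbitrarily for the example. */
    static String routeOf(String line) {
        return line.startsWith("ERROR") ? "errors" : "normal";
    }

    /** Style A: the routing decision is embedded in the output type itself. */
    static List<RoutedRecord> emitWithRouteField(List<String> lines) {
        List<RoutedRecord> out = new ArrayList<>();
        for (String line : lines) {
            out.add(new RoutedRecord(routeOf(line), line));
        }
        return out;
    }

    /** Style B: the tag travels next to the record, so the payload type stays plain. */
    static Map<String, List<String>> emitWithTags(List<String> lines) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        for (String line : lines) {
            outputs.computeIfAbsent(routeOf(line), tag -> new ArrayList<>()).add(line);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ERROR disk full", "request served", "ERROR timeout");
        for (RoutedRecord r : emitWithRouteField(lines)) {
            System.out.println(r.route + " <- " + r.payload);
        }
        System.out.println(emitWithTags(lines));
    }
}
```

With style A, every downstream consumer sees the extra `route` field and has to strip it; with style B, each tagged output is already a plain list of payloads.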


Re: [VOTE] How to Deal with Split/Select in DataStream API

Aljoscha Krettek
I think this would benefit from a FLIP that neatly sums up the options and gives us a point where we can vote on and ratify a decision.

As a gut feeling, I like option 3) most. Initially I would have preferred option 1) (because of a sense of API purity), but by now I think it’s good that users have this simpler option.

Aljoscha 

On 8. Jul 2019, at 06:39, Xingcan Cui <[hidden email]> wrote:

Hi all,

Thanks for your participation.

In this thread, we got one +1 each for option 1 and option 3. In the original thread [1], we got two +1s for option 1, one +1 for option 2, and five +1s and one -1 for option 3.

To summarize,

Option 1 (port side output to flatMap and deprecate split/select): three +1
Option 2 (introduce a new split/select and deprecate existing one): one +1
Option 3 ("correct" the existing split/select): six +1 and one -1

It seems that most people involved are in favor of "correcting" the existing split/select. However, this will definitely break API compatibility in a subtle way.

IMO, the real behavior of consecutive split/select calls has never been thoroughly clarified. Even within the community, it is hard to say that we have reached a consensus on its real semantics [2-4]. Though the initial design is not ambiguous, there's no doubt that its concept has drifted.

As split/select is quite an ancient API, I cc'ed this to more members. It would be great if you could share your opinions on this.

Thanks,
Xingcan

[1] https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/FLINK-1772
[3] https://issues.apache.org/jira/browse/FLINK-5031
[4] https://issues.apache.org/jira/browse/FLINK-11084



Re: [VOTE] How to Deal with Split/Select in DataStream API

Xingcan Cui
Hi Aljoscha,

Thanks for your response.

With all this preliminary information collected, I’ll start a formal process.

Thanks, everybody, for your attention.

Best,
Xingcan
