(DEPRECATED) Apache Flink User Mailing List archive.

How can I find out which key group belongs to which subtask

sanmutongzi

How can I find out which key group belongs to which subtask

Hi, I'm trying to optimize a Flink keyBy process function. Is there a way to find out which key group a given key belongs to, and in turn which subtask that key group is assigned to?
The motivation is that we want records from an upstream subtask to go to a downstream subtask in the same TaskManager. In other words, even though we use a keyed stream function, we want no cross-JVM communication at runtime.
And if we can achieve that, can we also avoid the expensive record serialization, since the data is only transferred within the same TaskManager JVM instance?

Thanks.

Congxian Qiu

Re: How can I find out which key group belongs to which subtask

If you just want to make sure certain keys go to the same subtask, would a custom key selector [1] help?

For the key group and subtask information, you can refer to KeyGroupRangeAssignment [2]; the max parallelism logic is described in the docs [3].

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/api_concepts.html#define-keys-using-key-selector-functions

[2] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java

[3] https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html#setting-the-maximum-parallelism

Best,

Congxian
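
To make the key -> key group -> subtask mapping concrete, here is a minimal standalone sketch of the arithmetic as I understand it from KeyGroupRangeAssignment [2]. It does not depend on Flink. The class name, the method names, and the values 128 and 4 are hypothetical and chosen for illustration, and the bit-mixing function is assumed to mirror Flink's MathUtils.murmurHash, so please check it against the linked source before relying on the exact numbers.

```java
/** Standalone sketch: key -> key group -> subtask index (names are illustrative). */
final class KeyGroupSketch {

    // Assumption: this mirrors the mixing done by Flink's MathUtils.murmurHash.
    // It always returns a non-negative int.
    static int murmurHash(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4;
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        if (code >= 0) {
            return code;
        }
        return code != Integer.MIN_VALUE ? -code : 0;
    }

    // A key is hashed into one of maxParallelism key groups.
    static int keyGroupFor(Object key, int maxParallelism) {
        return murmurHash(key.hashCode()) % maxParallelism;
    }

    // A key group is owned by exactly one of the `parallelism` subtasks.
    static int subtaskForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // illustrative value
        int parallelism = 4;      // illustrative value
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            int keyGroup = keyGroupFor(key, maxParallelism);
            int subtask = subtaskForKeyGroup(maxParallelism, parallelism, keyGroup);
            System.out.println(key + " -> key group " + keyGroup + " -> subtask " + subtask);
        }
    }
}
```

In a real job you would call the static methods on KeyGroupRangeAssignment itself rather than copy the arithmetic, since that class is the one the runtime uses.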

sanmutongzi

Re: How can I find out which key group belongs to which subtask

Thanks Congxian!

My goal is not only to send data to the same subtask, but to the specific subtask that runs in the same TaskManager as the upstream record. The key idea is to avoid shuffling between TaskManagers.

I think KeyGroupRangeAssignment.java explains a lot about how to get the key group and subtask information needed to make that happen.

Is there still serialization when data is transferred between operators in the same TaskManager?
Thanks.
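
As a rough illustration of the subtask side of this, the sketch below computes which contiguous range of key groups each subtask owns. It assumes the forward mapping is subtask = keyGroup * parallelism / maxParallelism (integer division), which is my reading of KeyGroupRangeAssignment, and simply inverts it. The class and method names and the values 128 and 3 are hypothetical; it is a standalone sketch, not Flink's own code.

```java
/** Standalone sketch: key group range owned by each subtask (names are illustrative). */
final class KeyGroupRangeSketch {

    // Smallest key group g with g * parallelism / maxParallelism == subtask,
    // i.e. ceil(subtask * maxParallelism / parallelism).
    static int rangeStart(int maxParallelism, int parallelism, int subtask) {
        return (subtask * maxParallelism + parallelism - 1) / parallelism;
    }

    // Largest key group g with g * parallelism / maxParallelism == subtask.
    static int rangeEnd(int maxParallelism, int parallelism, int subtask) {
        return ((subtask + 1) * maxParallelism - 1) / parallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // illustrative value
        int parallelism = 3;      // illustrative value
        for (int subtask = 0; subtask < parallelism; subtask++) {
            System.out.println("subtask " + subtask + " owns key groups ["
                    + rangeStart(maxParallelism, parallelism, subtask) + ", "
                    + rangeEnd(maxParallelism, parallelism, subtask) + "]");
        }
    }
}
```

With these illustrative values the three subtasks own key groups [0, 42], [43, 85] and [86, 127]. Note that knowing the range only tells you where a key will land; it does not by itself place that subtask in the same TaskManager as the upstream one.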

Zhijiang(wangzhijiang999)

Re: How can I find out which key group belongs to which subtask

Only chained operators avoid the record serialization cost, and operators cannot be chained across a keyed stream.

If you deploy the downstream tasks in the same TaskManager as the upstream tasks, you avoid the network shuffle cost, which still gives a performance benefit.

As far as I know, @Till Rohrmann has implemented some enhancements in the scheduler layer to support this requirement in release-1.10. You can give it a try when the release candidate is ready.

Best,

Zhijiang

Till Rohrmann

Re: How can I find out which key group belongs to which subtask

Hi,

You would need to set the co-location constraint to ensure that the subtasks of the operators are deployed to the same machine. It effectively means that subtasks a_i and b_i of operators a and b will be deployed to the same slot. This feature is not well exposed, but you can take a look at [1] to see how it can be used.

[1] https://issues.apache.org/jira/browse/FLINK-9809

Cheers,

Till

sanmutongzi

Re: How can I find out which key group belongs to which subtask

In reply to this post by Zhijiang(wangzhijiang999)

Thanks Zhijiang, it looks like serialization will always be there in a keyed stream.

sanmutongzi

Re: How can I find out which key group belongs to which subtask

In reply to this post by Till Rohrmann

Thanks Till, I will run some tests on this. Will this become a public feature in the next release or later?

Till Rohrmann

Re: How can I find out which key group belongs to which subtask

This feature won't be more public than it is today.

Cheers,

Till
