(DEPRECATED) Apache Flink User Mailing List archive.

Apache Flink - Uid and name for Flink sources and sinks

Classic

List

Threaded

6 messages Options

M Singh

Apache Flink - Uid and name for Flink sources and sinks

Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too ?

2. What are the pros and cons of adding uid to sources and sinks ?

3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?

5. Can the sinks and sources without uid restart from savepoints ?

6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?

Thanks for your help.

Mans

M Singh

Re: Apache Flink - Uid and name for Flink sources and sinks

Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks. Thanks. Mans

On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh <[hidden email]> wrote:

Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too ?

2. What are the pros and cons of adding uid to sources and sinks ?

3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?

5. Can the sinks and sources without uid restart from savepoints ?

6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?

Thanks for your help.

Mans

Dian Fu

Re: Apache Flink - Uid and name for Flink sources and sinks

1. Should we assign uid and name to the sources and sinks too ?
>> If the sources/sinks have used state, you should assign uid for them. This is usually true for sources.

2. What are the pros and cons of adding uid to sources and sinks ?
>> I'm not seeing the cons for assigning uid to sources and sinks. So I guess assigning the uids for sources/sinks is always a good practice.

3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
>> Could you see if this could answer you question: https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?

>> It depends on whether the sources/sinks uses state. I think most sources use state to maintaining the read offset.

5. Can the sinks and sources without uid restart from savepoints ?
>> The same as above.

6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?
>> Not sure what do you mean by "attribute id". Could you give some more detailed information about it?

Regards,
Dian

On Fri, Nov 22, 2019 at 6:27 PM M Singh <[hidden email]> wrote:

Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks. Thanks. Mans

On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh <[hidden email]> wrote:

Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too ?
2. What are the pros and cons of adding uid to sources and sinks ?
3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
5. Can the sinks and sources without uid restart from savepoints ?
6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?

Thanks for your help.

Mans

M Singh

Re: Apache Flink - Uid and name for Flink sources and sinks

Thanks Dian for your answers.

A few more questions:

1. If I do not assign uids to operators/sources and sinks - I am assuming the framework assigns it one. Now how does another run of the the same application using the previous runs savepoint/checkpoint match it's tasks/operators to the savepoint/checkpoint state of the application ?

2. Is the operatorID in the checkpoint state the same as uid ?

3. Do you have any pointer as to how an operatorID is generated for the checkpoint and who can it be mapped to back to the operator for troubleshooting purposes ?

Regarding id attribute - I meant the following:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139

However, I realized that this is not unique across applications runs and so not a good candidate.

Thanks again for your help.

On Sunday, November 24, 2019, 04:55:55 AM EST, Dian Fu <[hidden email]> wrote:

1. Should we assign uid and name to the sources and sinks too ?
>> If the sources/sinks have used state, you should assign uid for them. This is usually true for sources.

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?

>> It depends on whether the sources/sinks uses state. I think most sources use state to maintaining the read offset.

5. Can the sinks and sources without uid restart from savepoints ?
>> The same as above.

On Fri, Nov 22, 2019 at 6:27 PM M Singh <[hidden email]> wrote:

Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks. Thanks. Mans

On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh <[hidden email]> wrote:

Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too ?
2. What are the pros and cons of adding uid to sources and sinks ?
3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
5. Can the sinks and sources without uid restart from savepoints ?
6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?

Thanks for your help.

Mans

Dian Fu

Re: Apache Flink - Uid and name for Flink sources and sinks

Hi Mans,

Please see my reply inline below.

在 2019年11月25日，上午5:42，M Singh <[hidden email]> 写道：

Thanks Dian for your answers.

A few more questions:

1. If I do not assign uids to operators/sources and sinks - I am assuming the framework assigns it one. Now how does another run of the the same application using the previous runs savepoint/checkpoint match it's tasks/operators to the savepoint/checkpoint state of the application ?

You are right that the framework will generate an uid for an operator if it's not assigned. The uid is generated in a deterministic way to ensure that the uid for the same operator remains the same as previous runs(under certain conditions).

The uid generation algorithm:

https://github.com/apache/flink/blob/fd511c345eac31f03b801ff19dbf1f8c86aae760/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java#L78

2. Is the operatorID in the checkpoint state the same as uid ?

3. Do you have any pointer as to how an operatorID is generated for the checkpoint and who can it be mapped to back to the operator for troubleshooting purposes ?

The OperatorID is constructed from the uid and they are the same:

https://github.com/apache/flink/blob/66b979efc7786edb1a207339b8670d0e82c459a7/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java#L292

Regarding id attribute - I meant the following:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139

However, I realized that this is not unique across applications runs and so not a good candidate.

Thanks again for your help.

On Sunday, November 24, 2019, 04:55:55 AM EST, Dian Fu <[hidden email]> wrote:

1. Should we assign uid and name to the sources and sinks too ?
>> If the sources/sinks have used state, you should assign uid for them. This is usually true for sources.

2. What are the pros and cons of adding uid to sources and sinks ?
>> I'm not seeing the cons for assigning uid to sources and sinks. So I guess assigning the uids for sources/sinks is always a good practice.

3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
>> Could you see if this could answer you question: https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
>> It depends on whether the sources/sinks uses state. I think most sources use state to maintaining the read offset.

5. Can the sinks and sources without uid restart from savepoints ?
>> The same as above.

6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?
>> Not sure what do you mean by "attribute id". Could you give some more detailed information about it?

Regards,
Dian

On Fri, Nov 22, 2019 at 6:27 PM M Singh <[hidden email]> wrote:

Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks. Thanks. Mans

On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh <[hidden email]> wrote:

Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too ?
2. What are the pros and cons of adding uid to sources and sinks ?
3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
5. Can the sinks and sources without uid restart from savepoints ?
6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?

Thanks for your help.

Mans

M Singh

Re: Apache Flink - Uid and name for Flink sources and sinks

Thanks DIan for your pointers. Mans

On Sunday, November 24, 2019, 08:57:53 PM EST, Dian Fu <[hidden email]> wrote:

Hi Mans,

Please see my reply inline below.

在 2019年11月25日，上午5:42，M Singh <[hidden email]> 写道：

Thanks Dian for your answers.

A few more questions:

1. If I do not assign uids to operators/sources and sinks - I am assuming the framework assigns it one. Now how does another run of the the same application using the previous runs savepoint/checkpoint match it's tasks/operators to the savepoint/checkpoint state of the application ?

The uid generation algorithm:

https://github.com/apache/flink/blob/fd511c345eac31f03b801ff19dbf1f8c86aae760/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamGraphHasherV2.java#L78

2. Is the operatorID in the checkpoint state the same as uid ?

3. Do you have any pointer as to how an operatorID is generated for the checkpoint and who can it be mapped to back to the operator for troubleshooting purposes ?

The OperatorID is constructed from the uid and they are the same:

https://github.com/apache/flink/blob/66b979efc7786edb1a207339b8670d0e82c459a7/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java#L292

Regarding id attribute - I meant the following:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139

However, I realized that this is not unique across applications runs and so not a good candidate.

Thanks again for your help.

On Sunday, November 24, 2019, 04:55:55 AM EST, Dian Fu <[hidden email]> wrote:

1. Should we assign uid and name to the sources and sinks too ?
>> If the sources/sinks have used state, you should assign uid for them. This is usually true for sources.

2. What are the pros and cons of adding uid to sources and sinks ?
>> I'm not seeing the cons for assigning uid to sources and sinks. So I guess assigning the uids for sources/sinks is always a good practice.

3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
>> Could you see if this could answer you question: https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
>> It depends on whether the sources/sinks uses state. I think most sources use state to maintaining the read offset.

5. Can the sinks and sources without uid restart from savepoints ?
>> The same as above.

6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?
>> Not sure what do you mean by "attribute id". Could you give some more detailed information about it?

Regards,
Dian

On Fri, Nov 22, 2019 at 6:27 PM M Singh <[hidden email]> wrote:

Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks. Thanks. Mans

On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh <[hidden email]> wrote:

Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too ?
2. What are the pros and cons of adding uid to sources and sinks ?
3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
5. Can the sinks and sources without uid restart from savepoints ?
6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?

Thanks for your help.

Mans