Apache Flink - Uid and name for Flink sources and sinks

M Singh
Hi Folks:

I am assigning a uid and name to all stateful processors in our application and wanted to find out the following:

1. Should we assign a uid and name to the sources and sinks too?
2. What are the pros and cons of adding a uid to sources and sinks?
3. The sinks have uid and uidHash. Which is the preferred attribute to use for allowing job restarts?
4. If uids are not provided for sinks and sources in the application, can they still maintain state across job restarts from checkpoints?
5. Can sinks and sources without a uid restart from savepoints?
6. The data streams have an attribute id. How is this generated, and can it be used to create a uid for the sink?

Thanks for your help.

Mans

Re: Apache Flink - Uid and name for Flink sources and sinks

M Singh

Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks.  Thanks.  Mans
(Following up on my post of Thursday, November 21, 2019, 10:10:49 PM EST.)

Re: Apache Flink - Uid and name for Flink sources and sinks

Dian Fu
1. Should we assign a uid and name to the sources and sinks too?
>> If the sources/sinks use state, you should assign a uid to them. This is usually the case for sources.

2. What are the pros and cons of adding a uid to sources and sinks?
>> I don't see any cons to assigning a uid to sources and sinks, so I'd say assigning uids to sources/sinks is always good practice.

3. The sinks have uid and uidHash. Which is the preferred attribute to use for allowing job restarts?
>> Could you see if this answers your question: https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash

4. If uids are not provided for sinks and sources in the application, can they still maintain state across job restarts from checkpoints?
>> It depends on whether the sources/sinks use state. I think most sources use state to maintain the read offset.

5. Can sinks and sources without a uid restart from savepoints?
>> The same as above.

6. The data streams have an attribute id. How is this generated, and can it be used to create a uid for the sink?
>> Not sure what you mean by "attribute id". Could you give some more detailed information about it?
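
For reference, here is a minimal sketch of what assigning a uid and name to a source, an operator, and a sink could look like with the Java DataStream API. The class name, the uid strings, the display names, and the job name are made up for illustration, and `fromElements` / `print` are only placeholders for a real source and sink connector.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidNamingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; a real job would usually read from a connector
        // that keeps its read offsets in state.
        DataStream<String> lines = env
                .fromElements("a", "b", "c")
                .uid("example-source")     // hypothetical uid, must be unique within the job
                .name("Example source");   // display name only

        // An ordinary operator, given a uid and name in the same way.
        DataStream<String> upper = lines
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .uid("uppercase-map")
                .name("Uppercase map");

        // Placeholder sink; sinks accept uid(...) and name(...) as well.
        upper.print()
                .uid("example-sink")
                .name("Example sink");

        env.execute("uid-naming-sketch");
    }
}
```

The uid is what ties an operator to its state across submissions, so it should stay stable between versions of the job; the name is only used for display and can be changed freely.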

Regards,
Dian

(In reply to M Singh's message of Fri, Nov 22, 2019 at 6:27 PM.)

Re: Apache Flink - Uid and name for Flink sources and sinks

M Singh
Thanks Dian for your answers.

A few more questions:

1. If I do not assign uids to operators, sources, and sinks, I assume the framework assigns one. How, then, does another run of the same application, starting from the previous run's savepoint/checkpoint, match its tasks/operators to the state in that savepoint/checkpoint?
2. Is the operatorID in the checkpoint state the same as the uid?
3. Do you have any pointers on how an operatorID is generated for the checkpoint, and how it can be mapped back to the operator for troubleshooting purposes?


Regarding the id attribute, I meant the following:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139

However, I realized that this is not unique across application runs, so it is not a good candidate.

Thanks again for your help.





(In reply to Dian Fu's message of Sunday, November 24, 2019, 04:55:55 AM EST.)

Re: Apache Flink - Uid and name for Flink sources and sinks

Dian Fu
Hi Mans,

Please see my reply inline below.


On Nov 25, 2019, at 5:42 AM, M Singh <[hidden email]> wrote:

Thanks Dian for your answers.

A few more questions:

1. If I do not assign uids to operators, sources, and sinks, I assume the framework assigns one. How, then, does another run of the same application, starting from the previous run's savepoint/checkpoint, match its tasks/operators to the state in that savepoint/checkpoint?

You are right that the framework generates a uid for an operator if one is not assigned. The uid is generated deterministically so that, under certain conditions, the uid for the same operator remains the same as in previous runs.
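
To make the "under certain conditions" part concrete, here is a simplified, self-contained sketch. It is not Flink's actual algorithm: MD5 stands in for whatever hash Flink uses, and a generated ID is modelled as depending only on the operator's position in the pipeline, which is an assumption made purely for illustration. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OperatorIdSketch {

    /** Hypothetical stand-in for an operator: a label plus an optional user-assigned uid. */
    static final class Op {
        final String label;
        final String uid; // null means no uid was assigned

        Op(String label, String uid) {
            this.label = label;
            this.uid = uid;
        }
    }

    /** Hex-encoded 128-bit digest; MD5 is only a stand-in, not the hash Flink uses. */
    static String hash(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Derives one ID per operator: from the uid if present, otherwise from its position. */
    static Map<String, String> operatorIds(List<Op> pipeline) {
        Map<String, String> ids = new LinkedHashMap<>();
        for (int i = 0; i < pipeline.size(); i++) {
            Op op = pipeline.get(i);
            String seed = (op.uid != null) ? "uid:" + op.uid : "position:" + i;
            ids.put(op.label, hash(seed));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Version 1 of a job, and version 2 with a filter inserted before the map.
        Map<String, String> autoV1 = operatorIds(Arrays.asList(
                new Op("source", null), new Op("map", null)));
        Map<String, String> autoV2 = operatorIds(Arrays.asList(
                new Op("source", null), new Op("filter", null), new Op("map", null)));
        // Without uids, the map's generated ID changes because its position changed.
        System.out.println("auto IDs match for map: "
                + autoV1.get("map").equals(autoV2.get("map")));

        Map<String, String> uidV1 = operatorIds(Arrays.asList(
                new Op("source", "src"), new Op("map", "my-map")));
        Map<String, String> uidV2 = operatorIds(Arrays.asList(
                new Op("source", "src"), new Op("filter", "my-filter"), new Op("map", "my-map")));
        // With explicit uids, the map's ID is the same in both versions.
        System.out.println("uid IDs match for map: "
                + uidV1.get("map").equals(uidV2.get("map")));
    }
}
```

In this toy model, state stored under an ID in a savepoint can only be handed back to an operator that produces the same ID in the new run, which is why explicit uids make restores robust when the job is modified.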

The uid generation algorithm:


2. Is the operatorID in the checkpoint state the same as the uid?
3. Do you have any pointers on how an operatorID is generated for the checkpoint, and how it can be mapped back to the operator for troubleshooting purposes?

The OperatorID is constructed from the uid and they are the same:


However, I realized that this [the DataStream id] is not unique across application runs, so it is not a good candidate.

Thanks again for your help.







Re: Apache Flink - Uid and name for Flink sources and sinks

M Singh

Thanks Dian for your pointers. Mans
(In reply to Dian Fu's message of Sunday, November 24, 2019, 08:57:53 PM EST.)