Hi Folks:

I am assigning uid and name for all stateful processors in our application and wanted to find out the following:

1. Should we assign uid and name to the sources and sinks too?
2. What are the pros and cons of adding uid to sources and sinks?
3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts?
4. If sink and source uids are not provided in the application, can they still maintain state across job restarts from checkpoints?
5. Can the sinks and sources without uid restart from savepoints?
6. The data streams have an attribute id - how is this generated, and can it be used for creating a uid for the sink?

Thanks for your help.

Mans
Hi Folks - Please let me know if you have any advice on the best practices for setting uid for sources and sinks. Thanks. Mans
On Thursday, November 21, 2019, 10:10:49 PM EST, M Singh <[hidden email]> wrote:
[...]
1. Should we assign uid and name to the sources and sinks too ?
>> If the sources/sinks use state, you should assign a uid to them. This is usually true for sources.

2. What are the pros and cons of adding uid to sources and sinks ?
>> I don't see any cons to assigning a uid to sources and sinks, so I would say assigning uids to sources/sinks is always good practice.

3. The sinks have uid and hashUid - which is the preferred attribute to use for allowing job restarts ?
>> Could you see if this answers your question: https://stackoverflow.com/questions/46112142/apache-flink-set-operator-uid-vs-uidhash

4. If sink and sources uid are not provided in the application, can they still maintain state across job restarts from checkpoints ?
>> It depends on whether the sources/sinks use state. I think most sources use state to maintain the read offset.

5. Can the sinks and sources without uid restart from savepoints ?
>> The same as above.

6. The data streams have an attribute id - How is this generated and can this be used for creating a uid for the sink ?
>> Not sure what you mean by "attribute id". Could you give some more detailed information about it?

Regards,
Dian

On Fri, Nov 22, 2019 at 6:27 PM M Singh <[hidden email]> wrote:
[...]
Thanks Dian for your answers. A few more questions:

1. If I do not assign uids to operators/sources and sinks, I am assuming the framework assigns one. How does another run of the same application, using the previous run's savepoint/checkpoint, match its tasks/operators to the state in that savepoint/checkpoint?
2. Is the operatorID in the checkpoint state the same as the uid?
3. Do you have any pointers as to how an operatorID is generated for the checkpoint and how it can be mapped back to the operator for troubleshooting purposes?

Regarding the id attribute, I meant the following: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java#L139

However, I realized that this is not unique across application runs and so is not a good candidate.

Thanks again for your help.
On Sunday, November 24, 2019, 04:55:55 AM EST, Dian Fu <[hidden email]> wrote:
[...]
Hi Mans, Please see my reply inline below.
You are right that the framework will generate a uid for an operator if one is not assigned. The uid is generated in a deterministic way to ensure that the uid for the same operator remains the same as in previous runs (under certain conditions). The uid generation algorithm:
The OperatorID is constructed from the uid and they are the same:
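To illustrate why "under certain conditions" matters, here is a minimal, self-contained sketch. It is not Flink's actual algorithm: MD5 is only a stand-in hash, and the class and method names (`OperatorIdSketch`, `idFromUid`, `autoId`) are hypothetical. It assumes, for illustration, that an auto-generated ID depends on the operator's position in the job graph and on the IDs of its inputs, while an ID derived from an explicit uid depends on the uid string alone.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OperatorIdSketch {

    // Hex-encoded MD5 of the input. MD5 is a stand-in here, not the hash Flink uses.
    static String hash(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Explicit uid: the ID depends only on the uid string.
    static String idFromUid(String uid) {
        return hash("uid:" + uid);
    }

    // No uid (assumed model): the ID depends on graph position and upstream IDs.
    static String autoId(int position, List<String> upstreamIds) {
        return hash("auto:" + position + ":" + String.join(",", upstreamIds));
    }

    public static void main(String[] args) {
        // Run 1: source -> sink
        String source1 = autoId(0, Collections.<String>emptyList());
        String sink1 = autoId(1, Arrays.asList(source1));

        // Run 2: a map operator is inserted: source -> map -> sink
        String source2 = autoId(0, Collections.<String>emptyList());
        String map2 = autoId(1, Arrays.asList(source2));
        String sink2 = autoId(2, Arrays.asList(map2));

        System.out.println("auto source stable: " + source1.equals(source2)); // true
        System.out.println("auto sink stable:   " + sink1.equals(sink2));     // false
        System.out.println("uid sink stable:    "
                + idFromUid("my-sink").equals(idFromUid("my-sink")));         // true
    }
}
```

In this model the sink's generated ID changes as soon as an operator is added upstream, so its state from the earlier run could no longer be matched, whereas the ID derived from an explicit uid survives the topology change.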
Thanks Dian for your pointers. Mans
On Sunday, November 24, 2019, 08:57:53 PM EST, Dian Fu <[hidden email]> wrote:
[...]