SP with Drain and Cancel hangs after take a SP


Vishal Santoshi
Is this a known issue? We do a stop + savepoint with drain. I see no back pressure on our operators. It takes the SP, and then the sink (StreamingFileSink to S3) just stays in the RUNNING state.

Without drain, stop + savepoint works fine. I would imagine drain is important (flushing the buffers etc.), but why does it hang? (I tried it 3 times and waited 15 minutes each time.)

Regards. 

Re: SP with Drain and Cancel hangs after take a SP

Vishal Santoshi
I am more interested in whether using a StreamingFileSink without a drain negatively affects its exactly-once semantics, given that the state in the SP would have the offsets from Kafka + the valid lengths of the part files at SP time. To be honest, I am not sure whether the flushed buffers on the sink are included in that length, or whether this is not an issue with StreamingFileSink at all. If it is the former, then I would assume it should be documented, and we would then have to look at why this hang happens.

In reply to Vishal Santoshi's message of Mon, Mar 29, 2021 at 4:08 PM.

Re: SP with Drain and Cancel hangs after take a SP

Till Rohrmann
Hi Vishal,

The difference between stop-with-savepoint and stop-with-savepoint-with-drain is that the latter emits a max watermark before taking the snapshot. The idea is to trigger all pending timers and flush the content of some buffering operations like windowing. Semantically, you should use the first option if you want to stop the job and resume it at a later point in time. Stop-with-savepoint-with-drain should only be used if you want to terminate your job and don't intend to resume it because the max watermark destroys the correctness of results which are generated after the job is resumed.
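To illustrate the point about the max watermark, here is a toy model in Python. It is not Flink code: the TumblingCount class, the run helper, and the assumption that the watermark starts over after a resume are all made up for this sketch. It only shows why firing every pending window before the savepoint can split a result once the job is resumed.

```python
MAX_WATERMARK = 2**63 - 1     # same value as Java's Long.MAX_VALUE
MIN_WATERMARK = -(2**63)      # assumed starting watermark of a (re)started job


class TumblingCount:
    """Toy event-time tumbling window counter (hypothetical, not a Flink API)."""

    def __init__(self, size):
        self.size = size
        self.counts = {}              # window start -> buffered element count
        self.watermark = MIN_WATERMARK
        self.emitted = []             # (window start, count) results sent downstream

    def on_event(self, ts):
        # Assign the event to the window [start, start + size).
        start = ts - ts % self.size
        self.counts[start] = self.counts.get(start, 0) + 1

    def on_watermark(self, wm):
        # Fire every window whose last timestamp is covered by the watermark.
        self.watermark = max(self.watermark, wm)
        for start in sorted(self.counts):
            if start + self.size - 1 <= self.watermark:
                self.emitted.append((start, self.counts.pop(start)))


def run(drain):
    op = TumblingCount(size=10)
    op.on_event(1)
    op.on_event(2)
    if drain:
        # Drain: the max watermark fires the half-filled window [0, 10).
        op.on_watermark(MAX_WATERMARK)
    # "Savepoint and resume": buffered counts survive, and in this sketch
    # the watermark is assumed to start over.
    op.watermark = MIN_WATERMARK
    op.on_event(5)            # belongs to the same window [0, 10)
    op.on_watermark(20)       # regular watermark progress after the resume
    return op.emitted


print(run(drain=False))
print(run(drain=True))
```

Without drain, the window is emitted once with all three elements. With drain, the same window is emitted twice with partial counts, which is the kind of incorrect result to expect when a drained job is resumed.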

For the concrete problem at hand, it is difficult to say why it does not stop. It would be helpful if you could provide us with the debug logs of such a run. I am also pulling in Arvid, who works on Flink's connector ecosystem.

Cheers,
Till

In reply to Vishal Santoshi's message of Mon, Mar 29, 2021 at 11:08 PM.

Re: SP with Drain and Cancel hangs after take a SP

Vishal Santoshi
Got it. Is it possible to add this very important note to the documentation? Our case is the former: this is an infinite pipeline, and we were establishing the CI/CD release process for when non-breaking changes (DAG-compatible changes) are made to a running pipeline.
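For a release step like this, a minimal sketch of the two CLI calls could look as follows. The helper names, job ID, paths, and jar name are hypothetical placeholders; the only parts taken from the Flink CLI are `flink stop` with `--savepointPath` and the optional `--drain` flag, and `flink run` with `--fromSavepoint`. Drain is off by default here because the job is meant to be resumed.

```python
def stop_with_savepoint_cmd(job_id, savepoint_dir, drain=False):
    """Build the argv for stopping a job with a savepoint (hypothetical helper)."""
    cmd = ["flink", "stop", "--savepointPath", savepoint_dir]
    if drain:
        # Only for jobs that will never be resumed: emits the max watermark first.
        cmd.append("--drain")
    cmd.append(job_id)
    return cmd


def resume_cmd(savepoint_path, jar):
    """Build the argv for starting the new jar from a savepoint (hypothetical helper)."""
    return ["flink", "run", "--fromSavepoint", savepoint_path, jar]


# Placeholder values, purely for illustration.
print(stop_with_savepoint_cmd("<job-id>", "s3://<bucket>/savepoints"))
print(resume_cmd("s3://<bucket>/savepoints/<savepoint>", "pipeline.jar"))
```

The lists can be passed to subprocess.run in a deployment script; the savepoint path to resume from is the one reported when the stop command completes.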

Regards

In reply to Till Rohrmann's message of Tue, Mar 30, 2021 at 8:14 AM.

Re: SP with Drain and Cancel hangs after take a SP

Till Rohrmann
This is a good idea. I will add it to the section here [1].


Cheers,
Till

In reply to Vishal Santoshi's message of Tue, Mar 30, 2021 at 2:46 PM.

Re: SP with Drain and Cancel hangs after take a SP

Vishal Santoshi
Great, thanks! 

In reply to Till Rohrmann's message of Tue, Mar 30, 2021 at 11:00 AM.