(DEPRECATED) Apache Flink User Mailing List archive.

High back-pressure after recovering from a save point

Kien Truong

High back-pressure after recovering from a save point

Hi all,

I have one job where back-pressure is significantly higher after
resuming from a save point.

Because that job makes heavy use of stateful functions with RocksDBStateBackend, I suspect that this is the cause of the performance degradation.

Has anyone encountered similar issues, or does anyone have tips for debugging?

I'm using Flink 1.3.2 with YARN in detached mode.

Regards,

Kien

Fabian Hueske-2

Re: High back-pressure after recovering from a save point

I would guess that this is quite usual because the job has to catch up on work.

For example, if you took a save point two days ago and restore the job today, the input data of the last two days has been written to Kafka (assuming Kafka as the source) and needs to be processed.

The job will now read from Kafka as fast as possible to catch up to the present. This means the data is ingested much faster (as fast as Kafka can read and ship it) than during regular processing (as fast as your sources produce it).

The processing speed is bounded by your Flink job, which means there will be backpressure.

Once the job has caught up, the backpressure should disappear.

Best, Fabian
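
As a rough illustration of the catch-up effect described above: if records are produced at rate r and the job can process at most c > r records per second, a gap of d seconds leaves a backlog of r * d records that drains at c - r records per second. A minimal sketch in Java, where the class name and all rates are made-up assumptions rather than figures from this thread:

```java
// Back-of-the-envelope catch-up estimate. All numbers are illustrative assumptions.
final class CatchUpEstimate {

    /**
     * Seconds needed to drain the backlog that accumulated while the job was down.
     *
     * @param produceRate  records per second written to the source topic
     * @param processRate  records per second the job can sustain at full speed
     * @param downtimeSecs seconds between taking the savepoint and restoring from it
     */
    static double catchUpSeconds(double produceRate, double processRate, double downtimeSecs) {
        if (processRate <= produceRate) {
            // The backlog never shrinks, so the job cannot catch up.
            return Double.POSITIVE_INFINITY;
        }
        double backlog = produceRate * downtimeSecs;  // records waiting in the source
        double drainRate = processRate - produceRate; // net records per second removed from the backlog
        return backlog / drainRate;
    }

    public static void main(String[] args) {
        // Hypothetical job: 10,000 records/s produced, 15,000 records/s maximum throughput.
        System.out.println(catchUpSeconds(10_000, 15_000, 5 * 60));        // 5-minute gap
        System.out.println(catchUpSeconds(10_000, 15_000, 2 * 24 * 3600)); // 2-day gap
    }
}
```

With these assumed rates, a 5-minute gap takes about 10 minutes to drain and a 2-day gap takes about 4 days, so back-pressure can last noticeably longer than the gap itself.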

Kien Truong

Re: High back-pressure after recovering from a save point

Hi Fabian,

This happens to me even when the restore is immediate, so there's not much data in Kafka to catch up on (5 minutes at most).

Regards

Kien

Stephan Ewen

Re: High back-pressure after recovering from a save point

Hi!

Flink 1.3.2 does not yet exist. Do you mean 1.3.1 or latest master?

Can you tell us whether this occurs only in 1.3.x and worked well in 1.2.x?

If you just keep the job running without savepoint/restore, you do not get into backpressure situations?

Thanks,

Stephan

Kien Truong

Re: High back-pressure after recovering from a save point

Hi,

Sorry for the version typo, I'm running 1.3.1. I did not test with 1.2.x.

The job runs fine with almost zero back-pressure if it's started from scratch, or if I reuse the Kafka consumer group id without specifying the savepoint.

Best regards,

Kien

Stephan Ewen

Re: High back-pressure after recovering from a save point

Can you try starting from the savepoint, but telling Kafka to start from the latest offset?

(@gordon: Is that possible in Flink 1.3.1 or only in 1.4-SNAPSHOT?)

Tzu-Li (Gordon) Tai

Re: High back-pressure after recovering from a save point

This is already possible in Flink 1.3.x. `FlinkKafkaConsumer#setStartFromLatest()` would be it.

Tzu-Li (Gordon) Tai

Re: High back-pressure after recovering from a save point

One thing: do note that `FlinkKafkaConsumer#setStartFromLatest()` does not have any effect when starting from a savepoint; the consumer will still start from whatever offsets are written in the savepoint.

Gyula Fóra

Re: High back-pressure after recovering from a save point

It will work if you assign a new uid to the Kafka source.

Gyula
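
The last two replies can be pictured with a toy model: on restore, state is looked up by operator uid. If offsets are found under that uid, they take precedence over the configured start position; if the uid is new, nothing is found and the source falls back to its configured start position. The sketch below is plain Java, not Flink code. `RestoreModel`, `StartupMode`, and `startPosition` are hypothetical names, and a single offset per source stands in for the per-partition offsets a real savepoint would hold.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the restore rule discussed in this thread. NOT Flink code:
// every name here is invented for illustration.
final class RestoreModel {

    enum StartupMode { GROUP_OFFSETS, EARLIEST, LATEST }

    /** Describes where a source begins reading after the job is (re)started. */
    static String startPosition(Map<String, Long> savepointOffsetsByUid,
                                String sourceUid,
                                StartupMode configuredMode) {
        Long restored = savepointOffsetsByUid.get(sourceUid);
        if (restored != null) {
            // State exists under this uid: the restored offset wins and the configured mode is ignored.
            return "offset " + restored;
        }
        // No state under this uid: the source behaves as on a fresh start.
        return configuredMode.name();
    }

    public static void main(String[] args) {
        Map<String, Long> savepoint = new HashMap<>();
        savepoint.put("kafka-source", 42_000L); // hypothetical uid and offset

        // Same uid as in the savepoint: the configured LATEST mode has no effect.
        System.out.println(startPosition(savepoint, "kafka-source", StartupMode.LATEST));
        // New uid: no matching state, so the configured LATEST mode applies.
        System.out.println(startPosition(savepoint, "kafka-source-v2", StartupMode.LATEST));
    }
}
```

In this model the first call yields the stored offset and the second yields `LATEST`, which mirrors why changing the uid lets `setStartFromLatest()` take effect while the rest of the job's state is still restored.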

Kien Truong

Re: High back-pressure after recovering from a save point

Hi,

We have been testing with the FsStateBackend for the last few days and have not encountered this issue again.

However, we will evaluate the RocksDB backend again soon because we want incremental checkpoints. I will report back if I have more updates.

Best regards,

Kien
