Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

agnelo.dcosta

Hi,

What is the difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

table.exec.source.idle-timeout: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-source-idle-timeout

 

setIdleStateRetentionTime: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time

 

Some context:
Hi we are using flink 1.12.
Our checkpoint size is constantly increasing once app is deployed.
After performing a restart, checkpoint size goes back to expected size.
Looking at actual checkpoint files generated, it seems our app is holding on to state/events since the time app started up.
Based on our sql, the maximum time we would need to hold state is 10 minutes.

 

This e-mail is intended only for the use of the addressees. Any copying, forwarding, printing or other use of this e-mail by persons other than the addressees is not authorized. This e-mail may contain information that is privileged, confidential and exempt from disclosure. If you are not the intended recipient, please notify us immediately by return e-mail (including the original message in your reply) and then delete and discard all copies of the e-mail. Thank you.
HB75
Reply | Threaded
Open this post in threaded view
|

Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

Dawid Wysakowicz-2

Hi,

The difference is that the table.exec.source.idle-timeout is used for dealing with source idleness[1]. It is a problem that a watermark cannot advance if some of the partition become idle (do not produce any records). Watermark is always the minimum of watermarks of all input partitions. The setting makes flink ignore certain partitions in the calculation after the time threshold is reached.

The IdleStateRetention is Table API specific. As described in the link you provided it removes entries from a state for keys that were not seen for a given time threshold.

As for your issue, I'd recommend first investigating what is causing the state to grow. Is it ever growing keyspace? Is it that a watermark does not progress (this should manifest in results as well). Or is it something else.

Best,

Dawid


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#dealing-with-idle-sources

On 25/01/2021 20:12, Dcosta, Agnelo (HBO) wrote:

Hi,

What is the difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

table.exec.source.idle-timeout: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-source-idle-timeout

 

setIdleStateRetentionTime: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time

 

Some context:
Hi we are using flink 1.12.
Our checkpoint size is constantly increasing once app is deployed.
After performing a restart, checkpoint size goes back to expected size.
Looking at actual checkpoint files generated, it seems our app is holding on to state/events since the time app started up.
Based on our sql, the maximum time we would need to hold state is 10 minutes.

 

This e-mail is intended only for the use of the addressees. Any copying, forwarding, printing or other use of this e-mail by persons other than the addressees is not authorized. This e-mail may contain information that is privileged, confidential and exempt from disclosure. If you are not the intended recipient, please notify us immediately by return e-mail (including the original message in your reply) and then delete and discard all copies of the e-mail. Thank you.
HB75

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

agnelo.dcosta

Hi Dawid, thanks for the clarification and it helps a lot.
Reply to couple of points :

what is causing the state to grow?
We are using flink SQL and have 5 pattern match queries , 3 group by tumble windows. State growth over time is primarily coming from pattern match queries.

Is it ever growing keyspace?
Yes. By design our keyspace is ever growing. The expectation is that messages for one key will come in for couple of hours, then stop coming in. We would never see messages from that key again. New keys are constantly coming in.

Is it that a watermark does not progress?
Watermark on the subtask level is constantly updating and is in sync with other subtasks. We have not seen any issues with watermark updating as such.

Looking through mailing list archive, our problem seems similar to
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Memory-in-CEP-queries-keep-increasing-td31045.html
https://issues.apache.org/jira/browse/FLINK-15160 : Clean up is not applied if there are no incoming events for a key.

By design we can have partial matched states/matches in pattern match queries. And key space is such that no new event comes in for those partial matches.

thanks.

 

From: Dawid Wysakowicz <[hidden email]>
Date: Tuesday, January 26, 2021 at 3:14 AM
To: Dcosta, Agnelo (HBO) <[hidden email]>, [hidden email] <[hidden email]>
Subject: Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

**External Email received from: [hidden email] **

 

Hi,

The difference is that the table.exec.source.idle-timeout is used for dealing with source idleness[1]. It is a problem that a watermark cannot advance if some of the partition become idle (do not produce any records). Watermark is always the minimum of watermarks of all input partitions. The setting makes flink ignore certain partitions in the calculation after the time threshold is reached.

The IdleStateRetention is Table API specific. As described in the link you provided it removes entries from a state for keys that were not seen for a given time threshold.

As for your issue, I'd recommend first investigating what is causing the state to grow. Is it ever growing keyspace? Is it that a watermark does not progress (this should manifest in results as well). Or is it something else.

Best,

Dawid

 

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#dealing-with-idle-sources

On 25/01/2021 20:12, Dcosta, Agnelo (HBO) wrote:

Hi,

What is the difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

table.exec.source.idle-timeout: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-source-idle-timeout

 

setIdleStateRetentionTime: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time

 

Some context:
Hi we are using flink 1.12.
Our checkpoint size is constantly increasing once app is deployed.
After performing a restart, checkpoint size goes back to expected size.
Looking at actual checkpoint files generated, it seems our app is holding on to state/events since the time app started up.
Based on our sql, the maximum time we would need to hold state is 10 minutes.

 

This e-mail is intended only for the use of the addressees. Any copying, forwarding, printing or other use of this e-mail by persons other than the addressees is not authorized. This e-mail may contain information that is privileged, confidential and exempt from disclosure. If you are not the intended recipient, please notify us immediately by return e-mail (including the original message in your reply) and then delete and discard all copies of the e-mail. Thank you.
HB75

Reply | Threaded
Open this post in threaded view
|

Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

Dawid Wysakowicz-2

Hey,

As for the MATCH_RECOGNIZE clause, I highly recommend applying a time constraint[1]. The idle state retention time does not apply to the MATCH_RECOGNIZE, but you can think of the time constraint as something similar, but it is closer to the actual query logic.

If you are hitting FLINK-15160 unfortunately I don't have a good solution for it. The only thing that comes to my mind is adding a heartbeat event to the event stream to prune the partial matches, but I understand it is quite invasive.

If you would be willing to help fixing the problem in FLINK, I could also help review it and give pointers how it could be done.

Best,

Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/streaming/match_recognize.html#time-constraint

On 26/01/2021 17:39, Dcosta, Agnelo (HBO) wrote:

Hi Dawid, thanks for the clarification and it helps a lot.
Reply to couple of points :

what is causing the state to grow?
We are using flink SQL and have 5 pattern match queries , 3 group by tumble windows. State growth over time is primarily coming from pattern match queries.

Is it ever growing keyspace?
Yes. By design our keyspace is ever growing. The expectation is that messages for one key will come in for couple of hours, then stop coming in. We would never see messages from that key again. New keys are constantly coming in.

Is it that a watermark does not progress?
Watermark on the subtask level is constantly updating and is in sync with other subtasks. We have not seen any issues with watermark updating as such.

Looking through mailing list archive, our problem seems similar to
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Memory-in-CEP-queries-keep-increasing-td31045.html
https://issues.apache.org/jira/browse/FLINK-15160 : Clean up is not applied if there are no incoming events for a key.

By design we can have partial matched states/matches in pattern match queries. And key space is such that no new event comes in for those partial matches.

thanks.

 

From: Dawid Wysakowicz [hidden email]
Date: Tuesday, January 26, 2021 at 3:14 AM
To: Dcosta, Agnelo (HBO) [hidden email], [hidden email] [hidden email]
Subject: Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

**External Email received from: [hidden email] **

 

Hi,

The difference is that the table.exec.source.idle-timeout is used for dealing with source idleness[1]. It is a problem that a watermark cannot advance if some of the partition become idle (do not produce any records). Watermark is always the minimum of watermarks of all input partitions. The setting makes flink ignore certain partitions in the calculation after the time threshold is reached.

The IdleStateRetention is Table API specific. As described in the link you provided it removes entries from a state for keys that were not seen for a given time threshold.

As for your issue, I'd recommend first investigating what is causing the state to grow. Is it ever growing keyspace? Is it that a watermark does not progress (this should manifest in results as well). Or is it something else.

Best,

Dawid

 

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#dealing-with-idle-sources

On 25/01/2021 20:12, Dcosta, Agnelo (HBO) wrote:

Hi,

What is the difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

table.exec.source.idle-timeout: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-source-idle-timeout

 

setIdleStateRetentionTime: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time

 

Some context:
Hi we are using flink 1.12.
Our checkpoint size is constantly increasing once app is deployed.
After performing a restart, checkpoint size goes back to expected size.
Looking at actual checkpoint files generated, it seems our app is holding on to state/events since the time app started up.
Based on our sql, the maximum time we would need to hold state is 10 minutes.

 

This e-mail is intended only for the use of the addressees. Any copying, forwarding, printing or other use of this e-mail by persons other than the addressees is not authorized. This e-mail may contain information that is privileged, confidential and exempt from disclosure. If you are not the intended recipient, please notify us immediately by return e-mail (including the original message in your reply) and then delete and discard all copies of the e-mail. Thank you.
HB75


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

agnelo.dcosta

Hi Dawid,

Thanks for the tip on time constraint. We are using within in our MATCH_RECOGNIZE clause. It set to 3 minutes.
Increase in checkpoint size problem still persists.

Thanks for adding comments to FLINK-15160. I will take a look at changes you suggested.

P.S. :
I initially meant to ask what is the difference between
table.exec.state.ttl https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-state-ttl

And
setIdleStateRetentionTime: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time

And if table.exec.state.ttl makes any difference to match_recognize state ?

 

From: Dawid Wysakowicz <[hidden email]>
Date: Wednesday, January 27, 2021 at 12:41 AM
To: Dcosta, Agnelo (HBO) <[hidden email]>, [hidden email] <[hidden email]>
Subject: Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

**External Email received from: [hidden email] **

 

Hey,

As for the MATCH_RECOGNIZE clause, I highly recommend applying a time constraint[1]. The idle state retention time does not apply to the MATCH_RECOGNIZE, but you can think of the time constraint as something similar, but it is closer to the actual query logic.

If you are hitting FLINK-15160 unfortunately I don't have a good solution for it. The only thing that comes to my mind is adding a heartbeat event to the event stream to prune the partial matches, but I understand it is quite invasive.

If you would be willing to help fixing the problem in FLINK, I could also help review it and give pointers how it could be done.

Best,

Dawid

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/streaming/match_recognize.html#time-constraint

On 26/01/2021 17:39, Dcosta, Agnelo (HBO) wrote:

Hi Dawid, thanks for the clarification and it helps a lot.
Reply to couple of points :

what is causing the state to grow?
We are using flink SQL and have 5 pattern match queries , 3 group by tumble windows. State growth over time is primarily coming from pattern match queries.

Is it ever growing keyspace?
Yes. By design our keyspace is ever growing. The expectation is that messages for one key will come in for couple of hours, then stop coming in. We would never see messages from that key again. New keys are constantly coming in.

Is it that a watermark does not progress?
Watermark on the subtask level is constantly updating and is in sync with other subtasks. We have not seen any issues with watermark updating as such.

Looking through mailing list archive, our problem seems similar to
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Memory-in-CEP-queries-keep-increasing-td31045.html
https://issues.apache.org/jira/browse/FLINK-15160 : Clean up is not applied if there are no incoming events for a key.

By design we can have partial matched states/matches in pattern match queries. And key space is such that no new event comes in for those partial matches.

thanks.

 

From: Dawid Wysakowicz [hidden email]
Date: Tuesday, January 26, 2021 at 3:14 AM
To: Dcosta, Agnelo (HBO) [hidden email], [hidden email] [hidden email]
Subject: Re: Difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

**External Email received from: [hidden email] **

 

Hi,

The difference is that the table.exec.source.idle-timeout is used for dealing with source idleness[1]. It is a problem that a watermark cannot advance if some of the partition become idle (do not produce any records). Watermark is always the minimum of watermarks of all input partitions. The setting makes flink ignore certain partitions in the calculation after the time threshold is reached.

The IdleStateRetention is Table API specific. As described in the link you provided it removes entries from a state for keys that were not seen for a given time threshold.

As for your issue, I'd recommend first investigating what is causing the state to grow. Is it ever growing keyspace? Is it that a watermark does not progress (this should manifest in results as well). Or is it something else.

Best,

Dawid

 

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#dealing-with-idle-sources

On 25/01/2021 20:12, Dcosta, Agnelo (HBO) wrote:

Hi,

What is the difference between table.exec.source.idle-timeout and setIdleStateRetentionTime ?

table.exec.source.idle-timeout: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/config.html#table-exec-source-idle-timeout

 

setIdleStateRetentionTime: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/query_configuration.html#idle-state-retention-time

 

Some context:
Hi we are using flink 1.12.
Our checkpoint size is constantly increasing once app is deployed.
After performing a restart, checkpoint size goes back to expected size.
Looking at actual checkpoint files generated, it seems our app is holding on to state/events since the time app started up.
Based on our sql, the maximum time we would need to hold state is 10 minutes.

 

This e-mail is intended only for the use of the addressees. Any copying, forwarding, printing or other use of this e-mail by persons other than the addressees is not authorized. This e-mail may contain information that is privileged, confidential and exempt from disclosure. If you are not the intended recipient, please notify us immediately by return e-mail (including the original message in your reply) and then delete and discard all copies of the e-mail. Thank you.
HB75