Hi all,
We have an Apache Flink application which generates player sessions based on player events keyed by playerId. Sessions are based on EventTime. A session is created on first event event for that player and closes if there are 30 mins of inactivity. Events are merged in our custom PlayerSessionAggregator implements AggregateFunction.
We deployed this application on a Flink dev cluster (having checkpoints enabled), however we noted that the state keeps growing until we end up with an out of memory as shown in the attached file flink_oom_exception.txt
We tried the using the PurgingTrigger together CountTrigger however since it uses FIRE_AND_PURGE we were ending up with a session per event i.e. event were not being merged.
Using an Evictor we ended up with same situation because events were being removed from the window. Hence we resorted to using State TTL:
Our questions are:
|
Added flink_oom_exception.txt
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2426/flink_oom_exception.txt> as originally forgot to attach it -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Hey Karl, sorry for the late reply! Let me first quickly answer your questions:
Yes, they should get removed automatically.
You can monitor the state size in the Flink web ui (there's a "Checkpointing" tab for each job) Or through Flink's metrics system: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#checkpointing
The window operator does not expose any metrics. I also have some questions :) - Have you considered using the RocksDB statebackend to mitigate the out of memory issues? - Why are you disabling the operator chaining? - Did you validate that the "TimeZone.setDefault(...)" setting ends up at the worker JVMs executing your code? (I suspect that you are only setting the TimeZone in the JVM executing the main() method) Best, Robert On Tue, Mar 3, 2020 at 7:23 PM karl.pullicino <[hidden email]> wrote: Added flink_oom_exception.txt |
Sorry, I pressed the send button too fast. You also attached an exception to the email, which reads as follows: Caused by: org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException: Could not fulfill slot request c28970b7cd4f68383e242703bdac81ca. Requested resource profile (ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1}) is unfulfillable.This exception does not indicate that you are running out of memory at runtime, rather that your operator can not be scheduled anymore. Can you see if enabling the operator chaining again solves the problem? Also, how are you deploying Flink? On Mon, Mar 9, 2020 at 3:30 PM Robert Metzger <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |