On 23 Jun 2015, at 13:53, Stephan Ewen <[hidden email]> wrote:
> Currently, Flink does not cache anything across runs, except JAR files on the workers.
>
> The reason the first run is slower may be:
> - Because in the first run, code is distributed in the cluster. In subsequent runs, the JAR files need not be redistributed.
> - Because the JIT takes a bit to kick in and compile code in the first run. In subsequent runs, the code is already JIT-ted.
>
>
> The system should not freeze after 100 runs. Can you tell us a bit more about what you see? Can you identify which process hangs and send us a stack trace of that one? Then we could look into this...
If you have access to the task manager instances, you can run `jps` to get the PID of the task manager and then run `jstack PID` to print its stack trace:
$ jps
16242 Jps
89107 TaskManager
$ jstack 89107
[stack trace]
Would be great if you could share this after the task managers freeze.
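If there are several task managers on a machine, the two steps above could be scripted roughly like this. This is only a sketch: the function names `pids_for_class` and `dump_stacks` and the output file naming are made up for illustration, and it assumes the process shows up in `jps` as `TaskManager`, as in the sample output above.

```shell
#!/bin/sh
# Sketch only: function names and output file names are hypothetical.

# Read `jps` output ("<pid> <main class>") on stdin and print the PID
# of every JVM whose main class name equals $1.
pids_for_class() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Write one stack dump file per matching JVM into the current directory.
# Assumes `jps` and `jstack` from the JDK are on the PATH.
dump_stacks() {
    name="${1:-TaskManager}"
    for pid in $(jps | pids_for_class "$name"); do
        out="jstack-$name-$pid-$(date +%Y%m%d-%H%M%S).txt"
        jstack "$pid" > "$out" && echo "wrote $out"
    done
}
```

After sourcing the file, `dump_stacks TaskManager` on a frozen worker should leave one text file per task manager that you can attach to your reply.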
Can you also provide some information on your setup (what job? how many task managers? etc.) so that I can try to reproduce this?