Hi, I'm running 1.10.0 with 3 Zookeepers, 3 Job Nodes, and 3 Task Nodes.

Yesterday my task nodes failed with a metaspace error. I increased the metaspace a bit to be sure and restarted the 3 task nodes. But none of the jobs recovered, and no jobs are running. Should they not recover from the job and ZooKeeper state? It's as if no jobs ever ran.

P.S.: I'm not running the history server.
|
Hi John, did you also restart the JobManager, or just the TaskManagers? In either case, they should recover. Do you still have the JobManager logs around, so that we can analyze them?

On Thu, Jun 25, 2020 at 6:40 PM John Smith <[hidden email]> wrote:
|
I didn't restart the job manager. Let me see if I can dig up the logs... Also I just realised it's possible that the retry attempts to recover may have been exhausted.
|
Here is one log: https://www.dropbox.com/s/s8uom5uto708izf/flink-job-001.log?dl=0

If I understand correctly, on June 23rd it suspended the jobs? So at that point they would no longer show in the UI or be restarted?

On Fri, 3 Jul 2020 at 12:05, John Smith <[hidden email]> wrote:
|
Hi Robert, is my assumption correct?

On Fri., Jul. 3, 2020, 12:42 p.m. John Smith, <[hidden email]> wrote:
|
On 03.07.20 18:42, John Smith wrote:
> If I understand correctly on June 23rd it suspended the jobs? So at that
> point they would no longer show in the UI or be restarted?

Yes, that is correct, though in the logs it seems the jobs failed terminally on June 22nd:

2020-06-22 23:30:22,130 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job ba50a77608992097a98b250b87a08da0 reached globally terminal state FAILED.

What you can do in that case is restore the jobs from a savepoint or from a retained checkpoint. See
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
Note that you need to manually enable checkpoint retention.

I hope that helps.

Best,
Aljoscha
|
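Side note for anyone hitting the same situation without a history server: the Dispatcher message quoted above carries the job ID, so the JobManager log can be used to recover the IDs of jobs that failed terminally. Below is a minimal sketch that assumes the log lines look like the one quoted in this thread; the function name `terminal_jobs` and the regular expression are illustrative and not part of Flink.

```python
import re

# Matches lines shaped like the Dispatcher message quoted in this thread:
# "<timestamp> ... - Job <32 hex chars> reached globally terminal state <STATE>."
# The pattern is an assumption based on that single sample line.
TERMINAL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"Job (?P<job_id>[0-9a-f]{32}) reached globally terminal state (?P<state>[A-Z]+)"
)


def terminal_jobs(lines):
    """Return (timestamp, job_id, state) for every terminal-state log line."""
    found = []
    for line in lines:
        match = TERMINAL_RE.search(line)
        if match:
            found.append(
                (match.group("ts"), match.group("job_id"), match.group("state"))
            )
    return found


# The log line quoted in the reply above, used here as sample input.
sample = [
    "2020-06-22 23:30:22,130 INFO org.apache.flink.runtime.dispatcher."
    "StandaloneDispatcher - Job ba50a77608992097a98b250b87a08da0 "
    "reached globally terminal state FAILED.",
]
```

To run it against a real log, something like `terminal_jobs(open("flink-job-001.log"))` should work, where the file name is taken from the link posted earlier and stands in for wherever your JobManager log actually lives.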
Yeah, it's fine. The thing is, because I don't have the history server and the UI wasn't showing any jobs, I didn't have any job ID to go and look for the checkpoints. I restarted the jobs, but instead of restoring from a checkpoint I played back from a few days earlier just to be sure. All my jobs also have a Kafka start time.

On Fri, 10 Jul 2020 at 09:31, Aljoscha Krettek <[hidden email]> wrote:
On 03.07.20 18:42, John Smith wrote:
|
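For reference, if checkpoint retention had been enabled, the retained checkpoints could have been located even without a job ID by walking the checkpoint directory. The sketch below assumes the layout described in the Flink checkpoint documentation (`<checkpoint dir>/<job-id>/chk-<n>/_metadata`) and assumes the directory is reachable as a local or mounted path rather than HDFS or S3. The function name `latest_retained_checkpoints` is hypothetical.

```python
from pathlib import Path


def latest_retained_checkpoints(checkpoint_root):
    """Map each job ID to the directory of its highest-numbered checkpoint.

    Only checkpoint directories that contain a _metadata file are considered.
    """
    latest = {}
    for meta in Path(checkpoint_root).glob("*/chk-*/_metadata"):
        chk_dir = meta.parent            # .../<job-id>/chk-<n>
        job_id = chk_dir.parent.name     # <job-id>
        try:
            number = int(chk_dir.name.split("-", 1)[1])
        except ValueError:
            continue  # skip directories that are not chk-<integer>
        if job_id not in latest or number > latest[job_id][0]:
            latest[job_id] = (number, chk_dir)
    return {job_id: path for job_id, (number, path) in latest.items()}
```

Each returned path is the kind of checkpoint location that can be passed to `bin/flink run -s <path>` when resubmitting a job. If retention was not enabled, the scan will most likely come back empty, since the checkpoints are cleaned up once the job reaches a terminal state.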