http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Failed-job-restart-flink-on-yarn-tp7764p7784.html
operators will be in an empty state. It should be possible to add a
the container memory. But I'm afraid that this won't make it into the
next release. I will open an issue for it, though.
and just shut down the YARN containers without cancelling the job. The
when you restart the cluster. It's really more a hack/abuse of HA
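The HA-based restart trick alluded to above can be sketched roughly as follows. This is only a sketch under assumptions: the YARN application ID, container count, and memory size are all hypothetical placeholders, and the commands are assembled and echoed rather than executed.

```shell
# Rough sketch of the HA restart hack (Flink 1.0-era CLI).
# The application ID and the session sizing below are placeholders,
# not values taken from this thread.
APP_ID="application_1467000000000_0001"   # hypothetical YARN application ID

# 1. Kill the YARN application WITHOUT cancelling the Flink job, so the
#    job's state handles stay registered in ZooKeeper (HA mode).
KILL_CMD="yarn application -kill $APP_ID"

# 2. Restart the YARN session with larger TaskManager containers; with HA
#    enabled, the new JobManager should recover the job from ZooKeeper.
RESTART_CMD="./bin/yarn-session.sh -n 4 -tm 20480"

echo "$KILL_CMD"
echo "$RESTART_CMD"
```

As the reply says, this is more a hack/abuse of HA than a supported workflow, since it relies on the job never being cleanly cancelled.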
> Hi Jamie,
>
> Thanks for the reply.
>
> Yeah, I looked at savepoints. I want to start my job only from the last
> checkpoint, which means I would have to keep track of when the checkpoint was
> taken and then trigger a savepoint. I am not sure this is the way to go. My state
> backend is HDFS and I can see that the checkpoint path has the data that has
> been buffered in the window.
>
> I want to start the job in a way such that it will read the checkpointed
> data before the failure and continue processing.
>
> I realise that the checkpoints are used whenever there is a container
> failure and a new container is obtained. In my case the job failed because
> containers failed the maximum allowed number of times.
>
> Thanks,
> Prabhu
>
> On Fri, Jul 1, 2016 at 3:54 PM, Jamie Grier [via Apache Flink User Mailing
> List archive.] <[hidden email]> wrote:
>>
>> Hi Prabhu,
>>
>> Have you taken a look at Flink's savepoints feature? This allows you to
>> make snapshots of your job's state on demand and then at any time restart
>> your job from that point:
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/savepoints.html
>>
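For reference, the savepoint workflow described here looks roughly like this with the Flink 1.0-era CLI. The job ID, savepoint path, and jar name are hypothetical placeholders; the commands are only assembled and echoed, not executed.

```shell
# Sketch of the Flink 1.0 savepoint workflow; all IDs and paths are placeholders.
JOB_ID="0123456789abcdef0123456789abcdef"   # hypothetical, as shown by `flink list`

# 1. Trigger a savepoint for the running job; the CLI prints the savepoint path.
TRIGGER_CMD="./bin/flink savepoint $JOB_ID"

# 2. Resubmit the job from that savepoint with -s (path is hypothetical).
RESUME_CMD="./bin/flink run -s hdfs:///flink/savepoints/savepoint-0123 myJob.jar"

echo "$TRIGGER_CMD"
echo "$RESUME_CMD"
```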
>> Also note that you can use Flink's disk-backed state backend as well if
>> your job state is larger than fits in memory. See
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/state_backends.html#the-rocksdbstatebackend
>>
>>
>> -Jamie
>>
>>
>> On Fri, Jul 1, 2016 at 1:34 PM, [hidden email] <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>> I have a Flink streaming job that reads from Kafka and performs an
>>> aggregation in a window. It ran fine for a while; however, when the number
>>> of events in a window crossed a certain limit, the YARN containers failed
>>> with OutOfMemory errors. The job was running with 10G containers.
>>>
>>> We have about 64G memory on the machine and now I want to restart the job
>>> with a 20G container (we ran some tests and 20G should be good enough to
>>> accommodate all the elements from the window).
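A per-job Flink-on-YARN submission with larger TaskManager containers can be sketched as below. The flags are from the Flink 1.0-era CLI (`-yn` container count, `-yjm`/`-ytm` JobManager/TaskManager memory in MB); the sizes and jar name are placeholders, and the command is only assembled and echoed rather than run.

```shell
# Sketch: one-shot Flink-on-YARN submission with 20 GB TaskManager containers.
# Container count, memory sizes, and jar name are placeholders.
RUN_CMD="./bin/flink run -m yarn-cluster -yn 4 -yjm 2048 -ytm 20480 myJob.jar"
echo "$RUN_CMD"
```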
>>>
>>> Is there a way to restart the job from the last checkpoint?
>>>
>>> When I resubmit the job, it starts from the last committed offsets;
>>> however, the events that were held in the window at the time of
>>> checkpointing seem to get lost. Is there a way to recover the events that
>>> were buffered within the window and checkpointed before the failure?
>>>
>>> Thanks,
>>> Prabhu
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Failed-job-restart-flink-on-yarn-tp7764.html
>>> Sent from the Apache Flink User Mailing List archive at Nabble.com.
>>
>>
>>
>>
>> --
>>
>> Jamie Grier
>> data Artisans, Director of Applications Engineering
>> @jamiegrier
>> [hidden email]
>>
>>
>>
>
>
>