(DEPRECATED) Apache Flink User Mailing List archive.

Re: Failed job restart - flink on yarn

Posted by vprabhu@gmail.com on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Failed-job-restart-flink-on-yarn-tp7764p7771.html

Hi Jamie,

Thanks for the reply.

Yeah, I looked at savepoints. I want to start my job only from the last checkpoint, which means I would have to keep track of when the checkpoint was taken and then trigger a savepoint. I am not sure this is the way to go. My state backend is HDFS, and I can see that the checkpoint path contains the data that was buffered in the window.

I want to start the job in a way such that it will read the checkpointed data before the failure and continue processing.

I realise that checkpoints are used whenever a container fails and a new container is obtained. In my case the job failed because container failures reached the maximum allowed number of failures.

Thanks,

Prabhu

On Fri, Jul 1, 2016 at 3:54 PM, Jamie Grier [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:

Hi Prabhu,

Have you taken a look at Flink's savepoints feature? This allows you to make snapshots of your job's state on demand and then at any time restart your job from that point: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/savepoints.html

Also note that you can use Flink's disk-backed state backend if your job state is larger than fits in memory. See https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/state_backends.html#the-rocksdbstatebackend

-Jamie

On Fri, Jul 1, 2016 at 1:34 PM, [hidden email] <[hidden email]> wrote:
Hi,

I have a Flink streaming job that reads from Kafka and performs an aggregation
in a window. It ran fine for a while; however, when the number of events in a
window crossed a certain limit, the YARN containers failed with Out Of
Memory. The job was running with 10G containers.

We have about 64G memory on the machine and now I want to restart the job
with a 20G container (we ran some tests and 20G should be good enough to
accommodate all the elements from the window).

Is there a way to restart the job from the last checkpoint?

When I resubmit the job, it starts from the last committed offsets; however,
the events that were held in the window at the time of checkpointing seem to
get lost. Is there a way to recover the events that were buffered within the
window and checkpointed before the failure?
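
[Editor's illustration] The behavior described above can be sketched with a toy model in plain Java. This is not Flink's API: the class WindowRecoveryDemo, the Checkpoint type, and the sample log are all hypothetical, and the "window" is just a list that is summed. The sketch only assumes that a checkpoint holds both the source offset and the events buffered in the open window, so a resubmitted job that resumes from the offset alone starts with an empty window.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a windowed sum over a replayable log. NOT Flink's API. */
public class WindowRecoveryDemo {

    /** Snapshot of the source position plus the events buffered in the open window. */
    static final class Checkpoint {
        final int offset;            // next log position to read
        final List<Integer> buffer;  // events already read into the open window

        Checkpoint(int offset, List<Integer> buffer) {
            this.offset = offset;
            this.buffer = new ArrayList<>(buffer); // defensive copy
        }
    }

    /** Reads log[from..to) into the given buffer and returns that buffer. */
    static List<Integer> consume(int[] log, int from, int to, List<Integer> buffer) {
        for (int i = from; i < to; i++) {
            buffer.add(log[i]);
        }
        return buffer;
    }

    /** The "aggregation": sum of everything buffered in the window. */
    static int sum(List<Integer> buffer) {
        int total = 0;
        for (int v : buffer) {
            total += v;
        }
        return total;
    }

    /** Fresh resubmit: resume at the committed offset with EMPTY window state. */
    static int resumeFromOffsetOnly(int[] log, Checkpoint cp) {
        return sum(consume(log, cp.offset, log.length, new ArrayList<>()));
    }

    /** Checkpoint restore: resume at the offset AND reload the buffered events. */
    static int resumeFromCheckpoint(int[] log, Checkpoint cp) {
        return sum(consume(log, cp.offset, log.length, new ArrayList<>(cp.buffer)));
    }

    public static void main(String[] args) {
        int[] log = {1, 2, 3, 4, 5}; // one window's worth of events
        // Checkpoint taken after the first three events were read.
        Checkpoint cp = new Checkpoint(3, consume(log, 0, 3, new ArrayList<>()));

        System.out.println("offset only:     " + resumeFromOffsetOnly(log, cp)); // 4 + 5 = 9
        System.out.println("from checkpoint: " + resumeFromCheckpoint(log, cp)); // 1 + 2 + 3 + 4 + 5 = 15
    }
}
```

In this model the offset-only restart silently drops the three events that were buffered before the failure, which matches the symptom reported here; restoring the buffer together with the offset gives the full window result.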

Thanks,
Prabhu


--

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier
[hidden email]
