Checkpointing to gcs taking too long

I am running Flink on Kubernetes and writing checkpoints to Google Cloud
Storage (GCS). Below is the Dockerfile:

`FROM flink:1.6.2-hadoop28-scala_2.11-alpine

RUN wget -O lib/gcs-connector-latest-hadoop2.jar

RUN wget -O lib/gcs-connector-latest-hadoop2.jar && \
    wget && \
    tar xf flink-1.6.2-bin-hadoop28-scala_2.11.tgz && \
    mv flink-1.6.2/lib/flink-shaded-hadoop2* lib/ && \
    rm -r flink-1.6.2*`

However, checkpoints take around 2-3 seconds on average and up to 25
seconds at most, even though the state size is only around 100 KB.

The jobs are also being restarted with the error
`AsynchronousException{java.lang.Exception: Could not materialize checkpoint
1640 for operator groupBy`, and they sometimes lose their connection to the
task managers.

Currently, I have set the heap size to 4096 MB.


Re: Checkpointing to gcs taking too long

Chesnay Schepler
Please provide the full Exception stack trace and the configuration of
your job (parallelism, number of stateful operators).
Have you tried using the gcs-connector in isolation? This may not be an
issue with Flink.
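
One way to check the storage path outside Flink is to time plain writes of a checkpoint-sized payload (about 100 KB, as in the report). The following is a minimal sketch rather than a definitive benchmark. `time_writes` and `make_local_writer` are hypothetical helper names, and the local-disk writer is only a baseline. To probe GCS you would pass in your own writer that goes through the same gcs-connector setup as the job, which is not shown here because it depends on your environment.

```python
import os
import statistics
import tempfile
import time


def time_writes(write_fn, payload_size=100 * 1024, runs=20):
    """Call write_fn(name, payload) `runs` times; return latency stats in seconds."""
    payload = os.urandom(payload_size)  # random bytes, roughly checkpoint-sized
    latencies = []
    for i in range(runs):
        start = time.perf_counter()
        write_fn(f"probe-{i}", payload)
        latencies.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(latencies),
        "max": max(latencies),
    }


def make_local_writer(directory):
    """Baseline writer (assumption): one fsynced file per payload in `directory`."""
    def write(name, payload):
        with open(os.path.join(directory, name), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
    return write


if __name__ == "__main__":
    # Local baseline only; replace the writer with a GCS-backed one to compare.
    with tempfile.TemporaryDirectory() as tmp:
        print(time_writes(make_local_writer(tmp)))
```

If a GCS-backed writer shows the same multi-second mean and maximum latencies for 100 KB writes, the slowdown is more likely in the connector or the network path than in Flink's checkpointing.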

On 28.11.2018 07:01, prakhar_mathur wrote:

> I am trying to run flink on kubernetes, and trying to push checkpoints to
> Google Cloud Storage. Below is the docker file
> `FROM flink:1.6.2-hadoop28-scala_2.11-alpine
> RUN wget -O lib/gcs-connector-latest-hadoop2.jar
> RUN wget -O lib/gcs-connector-latest-hadoop2.jar
> && \
> wget
> && \
> tar xf flink-1.6.2-bin-hadoop28-scala_2.11.tgz && \
> mv flink-1.6.2/lib/flink-shaded-hadoop2* lib/  && \
> rm -r flink-1.6.2*`
> But the checkpoints are taking around 2-3 seconds on average and around 25
> seconds at max, even the state size is around 100 KB.
> Even the jobs are getting restarted with the error
> `AsynchronousException{java.lang.Exception: Could not materialize checkpoint
> 1640 for operator groupBy` and sometimes losing connections with task
> managers.
> Currently, I have given the heap size of 4096 MB.