Checkpointing to gcs taking too long


Checkpointing to gcs taking too long

prakhar_mathur
I am trying to run Flink on Kubernetes and push checkpoints to
Google Cloud Storage. Below is the Dockerfile:

`FROM flink:1.6.2-hadoop28-scala_2.11-alpine

RUN wget -O lib/gcs-connector-latest-hadoop2.jar \
      https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar && \
    wget http://ftp.fau.de/apache/flink/flink-1.6.2/flink-1.6.2-bin-hadoop28-scala_2.11.tgz && \
    tar xf flink-1.6.2-bin-hadoop28-scala_2.11.tgz && \
    mv flink-1.6.2/lib/flink-shaded-hadoop2* lib/ && \
    rm -r flink-1.6.2*`
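
Checkpoints are pointed at GCS through flink-conf.yaml and a Hadoop core-site.xml; the relevant settings look roughly like this (bucket name, key-file path, and config directory are placeholders, not my exact values):

`state.backend: filesystem
state.checkpoints.dir: gs://<my-bucket>/flink/checkpoints
fs.hdfs.hadoopconf: /etc/hadoop/conf`

and in core-site.xml (under the directory above):

`<property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value></property>
<property><name>fs.AbstractFileSystem.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value></property>
<property><name>google.cloud.auth.service.account.enable</name><value>true</value></property>
<property><name>google.cloud.auth.service.account.json.keyfile</name><value>/path/to/key.json</value></property>`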

But the checkpoints are taking around 2-3 seconds on average and up to 25
seconds at the maximum, even though the state size is only around 100 KB.

The jobs are also getting restarted with the error
`AsynchronousException{java.lang.Exception: Could not materialize checkpoint
1640 for operator groupBy`, and connections to the task managers are
sometimes lost.

Currently, I have set the heap size to 4096 MB.
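
(In Flink 1.6 that kind of heap setting lives in flink-conf.yaml, e.g. `taskmanager.heap.size: 4096m` for the task managers or `jobmanager.heap.size: 4096m` for the job manager; shown here only to indicate the keys, not my exact config.)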



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Checkpointing to gcs taking too long

Chesnay Schepler
Please provide the full Exception stack trace and the configuration of
your job (parallelism, number of stateful operators).
Have you tried using the gcs-connector in isolation? This may not be an
issue with Flink.
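
For example, a small standalone program along these lines (bucket name and
key-file path are placeholders; the gcs-connector and Hadoop client jars need
to be on the classpath) would let you time a ~100 KB write through the Hadoop
FileSystem API, outside of Flink:

`import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsWriteTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same connector settings the cluster uses; the values here are placeholders.
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("google.cloud.auth.service.account.enable", "true");
        conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json");

        Path target = new Path("gs://my-bucket/gcs-write-test");
        FileSystem fs = target.getFileSystem(conf);

        long start = System.currentTimeMillis();
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write(new byte[100 * 1024]); // roughly the reported state size
        }
        System.out.println("Wrote ~100 KB in " + (System.currentTimeMillis() - start) + " ms");
    }
}`

If that alone already shows multi-second latencies, the problem is between the
connector/network and GCS rather than in Flink's checkpointing.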
