checkpoint failed due to s3 exception: request timeout


checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi,

I ran into a checkpoint failure caused by an S3 exception:

org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID: ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=

The full stack trace and a screenshot are provided in the attachments.

My settings for the Flink cluster and job (a minimal flink-conf.yaml sketch follows the list):
  • Flink version 1.4.0
  • standalone mode
  • 4 slots per TM
  • Presto S3 filesystem
  • RocksDB state backend
  • local SSD
  • incremental checkpoints enabled
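For context, here is a minimal flink-conf.yaml sketch of such a setup. The bucket and local path are hypothetical placeholders, the keys are the 1.4-era names, and in 1.4 incremental checkpointing may also need to be enabled programmatically via new RocksDBStateBackend(uri, true):

    # minimal sketch, assuming 1.4-era config keys; bucket and paths are hypothetical
    state.backend: rocksdb
    state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
    state.backend.rocksdb.localdir: /mnt/ssd/rocksdb
    taskmanager.numberOfTaskSlots: 4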
There is no unusual message besides the exception in the log file, and no high GC ratio during the
checkpoint procedure. Moreover, 3 of the 4 parts on that TM were still uploaded successfully. I
couldn't find anything that would relate to this failure. Has anyone met this problem before?

Besides, I also found an issue in another AWS SDK [1] that mentions this S3 exception as well. One
reply said you can work around the problem by raising the max-client-retries config, and I found
that config in Presto [2]. Can I just add s3.max-client-retries: xxx to flink-conf.yaml to configure
it? If not, how should I override the default value of this configuration? Thanks in advance.

Best,
Tony Wei


Attachments: checkpoint_failure.png (221K), error.log (11K)

Re: checkpoint failed due to s3 exception: request timeout

vino yang
Hi Tony,

A while ago I answered a similar question. [1]

You can try increasing that value appropriately. You can't put this configuration in flink-conf.yaml; you can pass it in the job submission command [2], or in a configuration file that you specify.


Thanks, vino.



Re: checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi Vino,

Thanks for your quick reply, but I think these two problems are different. The checkpoint in that
question eventually finished, while mine failed due to an S3 client timeout; you can see from my
screenshot that the checkpoint failed after only a short time.

Regarding the configuration, do you mean passing it as the program's input arguments? I don't think
that will work; at the very least I would need a way to pass it to the S3 filesystem builder in my
program. Instead, I will ask for help with passing it via flink-conf.yaml, because that is where I
configure the global settings for the S3 filesystem, and I expect there is a simple way to support
this setting like the other s3.xxx configs.

I very much appreciate your answer and help.

Best,
Tony Wei




Re: checkpoint failed due to s3 exception: request timeout

vino yang
Hi Tony,

Sorry, I only just noticed the timeout; I thought the two problems were similar because they both happened on AWS S3.
Regarding this setting, isn't "s3.max-client-retries: xxx" set on the client?

Thanks, vino.




Re: checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi Vino,

I think this config is for the AWS S3 client, but that client lives inside flink-s3-fs-presto,
so I guess I need to find a way to pass the config through to that library.

Best,
Tony Wei





Re: checkpoint failed due to s3 exception: request timeout

vino yang
Hi Tony,

Maybe you can look at the documentation for this class; it comes from flink-s3-fs-presto. [1]
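For reference, here is a hedged sample of the client-related keys that PrestoS3FileSystem exposes in that era (names recalled from Presto's Hive S3 configuration, worth double-checking against your exact version):

    presto.s3.max-client-retries   # request-level retries in PrestoS3FileSystem
    presto.s3.max-error-retries    # error retries inside the underlying AWS client
    presto.s3.connect-timeout
    presto.s3.socket-timeout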


Thanks, vino.





Re: checkpoint failed due to s3 exception: request timeout

Andrey Zagrebin
Hi,

The current Flink 1.6.0 release uses the Presto Hive S3 connector 0.185 [1], which has this option:
S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";

If you add "s3.max-client-retries" to the Flink configuration, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.
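For example, a minimal sketch of that entry in flink-conf.yaml (the value 10 is an arbitrary illustration, not a recommendation):

    # flink-s3-fs-presto mirrors s3.* keys to presto.s3.* for PrestoS3FileSystem
    s3.max-client-retries: 10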

Cheers,
Andrey








Re: checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering whether anyone is familiar with this
problem or has any idea how to find the root cause. Thanks.

Best,
Tony Wei
