Hi,

I hit a checkpoint failure caused by an S3 exception:

org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID: ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=

The full stack trace and a screenshot are provided in the attachment. My settings for the Flink cluster and job:
There are no unusual messages besides the exception in the log file, and no high GC ratio during the checkpoint procedure. Moreover, 3 of the 4 parts still uploaded successfully on that TM. I couldn't find anything that would relate to this failure. Has anyone met this problem before? Besides, I also found an issue in another AWS SDK [1] that mentions this S3 exception. One reply said you can passively avoid the problem by raising the client's max-retries config, and I found that config in Presto [2]. Can I just add s3.max-client-retries: xxx to flink-conf.yaml to configure it? If not, how can I override the default value of this configuration? Thanks in advance.

Best,
Tony Wei
Hi Tony,

A while ago, I answered a similar question. [1] You can try increasing this value appropriately. You can't put this configuration in flink-conf.yaml, but you can put it in the job's submit command [2], or in a configuration file you specify.

Thanks,
vino.

On Wed, Aug 29, 2018 at 11:36 AM, Tony Wei <[hidden email]> wrote:
Hi Vino,

Thanks for your quick reply, but I think these two questions are different. The checkpoint in that question eventually finished, while mine failed due to an S3 client timeout; you can see from my screenshot that the checkpoint failed within a short time. Regarding the configuration, do you mean passing it as the program's input arguments? I don't think that will work; at the very least I would need a way to pass it on to the S3 filesystem builder in my program. However, I will ask for help with passing it via flink-conf.yaml, because I use that file for the global S3 filesystem settings, and I thought there might be a simple way to support this setting like the other s3.xxx configs. I very much appreciate your answer and help.

Best,
Tony Wei

2018-08-29 11:51 GMT+08:00 vino yang <[hidden email]>:
Hi Tony,

Sorry, I only just noticed the timeout; I thought the questions were similar because they both happened on AWS S3. Regarding this setting, isn't "s3.max-client-retries: xxx" set on the client?

Thanks,
vino.

On Wed, Aug 29, 2018 at 1:17 PM, Tony Wei <[hidden email]> wrote:
Hi Vino,

I thought this config was for the AWS S3 client, but that client is internal to flink-s3-fs-presto. So I guessed I should find a way to pass this config through to that library.

Best,
Tony Wei

2018-08-29 14:13 GMT+08:00 vino yang <[hidden email]>:
Hi Tony,

Maybe you can look at the documentation for this class; it comes from flink-s3-fs-presto. [1]

Thanks,
vino.

On Wed, Aug 29, 2018 at 2:18 PM, Tony Wei <[hidden email]> wrote:
Hi,
The current Flink 1.6.0 release uses the Presto Hive S3 connector 0.185 [1], which has this option: S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries". If you add "s3.max-client-retries" to flink-conf.yaml, flink-s3-fs-presto [2] should automatically add the "presto." prefix and configure PrestoS3FileSystem correctly.

Cheers,
Andrey
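If that key-forwarding behavior holds, the flink-conf.yaml entry might look like the sketch below. The retry count and timeout are illustrative values only, not recommendations, and the s3.socket-timeout key is mentioned on the assumption that it is forwarded the same way as the other s3.* options:

```yaml
# flink-conf.yaml (sketch)
# flink-s3-fs-presto should forward s3.* keys to Presto as presto.s3.*,
# so this would become presto.s3.max-client-retries for PrestoS3FileSystem.
s3.max-client-retries: 10

# Assumed to be forwarded the same way; a longer socket timeout may also
# mitigate RequestTimeout errors (value is illustrative):
s3.socket-timeout: 2m
```

After changing flink-conf.yaml, the cluster (or at least the TaskManagers) needs to be restarted for the filesystem configuration to take effect.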
Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering whether anyone is familiar with this problem or has any idea how to find the root cause. Thanks.

Best,
Tony Wei

2018-08-29 16:20 GMT+08:00 Andrey Zagrebin <[hidden email]>: