checkpoint failed due to s3 exception: request timeout


checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi,

I ran into a checkpoint failure caused by an S3 exception:

org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: B8BE8978D3EFF3F5), S3 Extended Request ID: ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=

The full stack trace and a screenshot are provided in the attachments.

My settings for the Flink cluster and job (a minimal flink-conf.yaml sketch follows the list):
  • Flink version 1.4.0
  • standalone mode
  • 4 slots per TM
  • Presto S3 filesystem
  • RocksDB state backend
  • local SSD
  • incremental checkpoints enabled
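For context, here is a minimal flink-conf.yaml sketch of such a setup. The bucket and local path are hypothetical placeholders, the keys are the 1.4-era names, and in 1.4 incremental checkpointing may also need to be enabled programmatically via new RocksDBStateBackend(uri, true):

    # minimal sketch, assuming 1.4-era config keys; bucket and paths are hypothetical
    state.backend: rocksdb
    state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
    state.backend.rocksdb.localdir: /mnt/ssd/rocksdb
    taskmanager.numberOfTaskSlots: 4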
There is no unusual message besides the exception in the log file, and no high GC ratio during the
checkpoint procedure. Moreover, 3 of the 4 parts on that TM were still uploaded successfully. I
couldn't find anything that would relate to this failure. Has anyone met this problem before?

Besides, I also found an issue in another AWS SDK [1] that mentions this S3 exception as well. One
reply said you can work around the problem by raising the max-client-retries config, and I found
that config in Presto [2]. Can I just add s3.max-client-retries: xxx to flink-conf.yaml to configure
it? If not, how should I override the default value of this configuration? Thanks in advance.

Best,
Tony Wei


Attachments: checkpoint_failure.png (221K), error.log (11K)

Re: checkpoint failed due to s3 exception: request timeout

vino yang
Hi Tony,

A while ago I answered a similar question. [1]

You can try increasing that value appropriately. You can't put this configuration in flink-conf.yaml; you can pass it in the job submission command [2], or in a configuration file that you specify.


Thanks, vino.



Re: checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi Vino,

Thanks for your quick reply, but I think these two problems are different. The checkpoint in that
question eventually finished, while mine failed due to an S3 client timeout; you can see from my
screenshot that the checkpoint failed after only a short time.

Regarding the configuration, do you mean passing it as the program's input arguments? I don't think
that will work; at the very least I would need a way to pass it to the S3 filesystem builder in my
program. Instead, I will ask for help with passing it via flink-conf.yaml, because that is where I
configure the global settings for the S3 filesystem, and I expect there is a simple way to support
this setting like the other s3.xxx configs.

I very much appreciate your answer and help.

Best,
Tony Wei




Re: checkpoint failed due to s3 exception: request timeout

vino yang
Hi Tony,

Sorry, I only just noticed the timeout; I thought the two problems were similar because they both happened on AWS S3.
Regarding this setting, isn't "s3.max-client-retries: xxx" set on the client?

Thanks, vino.




Re: checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi Vino,

I think this config is for the AWS S3 client, but that client lives inside flink-s3-fs-presto,
so I guess I need to find a way to pass the config through to that library.

Best,
Tony Wei





Re: checkpoint failed due to s3 exception: request timeout

vino yang
Hi Tony,

Maybe you can look at the documentation for this class; it comes from flink-s3-fs-presto. [1]
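For reference, here is a hedged sample of the client-related keys that PrestoS3FileSystem exposes in that era (names recalled from Presto's Hive S3 configuration, worth double-checking against your exact version):

    presto.s3.max-client-retries   # request-level retries in PrestoS3FileSystem
    presto.s3.max-error-retries    # error retries inside the underlying AWS client
    presto.s3.connect-timeout
    presto.s3.socket-timeout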


Thanks, vino.





Re: checkpoint failed due to s3 exception: request timeout

Andrey Zagrebin
Hi,

The current Flink 1.6.0 release uses the Presto Hive S3 connector 0.185 [1], which has this option:
S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";

If you add "s3.max-client-retries" to the Flink configuration, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.
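For example, a minimal sketch of that entry in flink-conf.yaml (the value 10 is an arbitrary illustration, not a recommendation):

    # flink-s3-fs-presto mirrors s3.* keys to presto.s3.* for PrestoS3FileSystem
    s3.max-client-retries: 10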

Cheers,
Andrey








Re: checkpoint failed due to s3 exception: request timeout

Tony Wei
Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering whether anyone is familiar with this
problem or has any idea how to find the root cause. Thanks.

Best,
Tony Wei
