StreamingFileSink to S3 failure to complete multipart upload


Harrison Xu
Hello,
I'm seeing the following behavior in the StreamingFileSink (Flink 1.9.1) when uploading to S3.

2019-11-06 15:50:58,081 INFO  com.quora.dataInfra.s3connector.flink.filesystem.Buckets      - Subtask 1 checkpointing for checkpoint with id=5025 (max part counter=3406).
2019-11-06 15:50:58,448 INFO  org.apache.flink.streaming.api.operators.AbstractStreamOperator  - Could not complete snapshot 5025 for operator Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18).
java.io.IOException: Uploading parts failed
at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartUploadToComplete(RecoverableMultiPartUploadImpl.java:231)
at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartsUpload(RecoverableMultiPartUploadImpl.java:215)
at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:151)
at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:56)
...12 more
Caused by: java.io.FileNotFoundException: upload part on tmp/kafka/meta/auction_ads/dt=2019-11-06T15/partition_7/part-1-3403: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload; Request ID: 6D4B335FE7687B51; S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=), S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=:NoSuchUpload
... 10 more
...
2019-11-06 15:50:58,476 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18) (060d4deed87f3be96f3704474a5dc3e9).
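For context, the sink is set up roughly as below. This is a minimal sketch of a stock Flink 1.9 StreamingFileSink; our actual job uses a customized Buckets implementation, and the bucket path, source, and encoder here are placeholders, not our real configuration:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class S3SinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Part files are finalized on checkpoints, which is where the failure above occurs.
        env.enableCheckpointing(60_000);

        // Stand-in for the Kafka source in the real job.
        DataStream<String> records = env.socketTextStream("localhost", 9999);

        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3://my-bucket/tmp/kafka/meta"), // hypothetical bucket
                        new SimpleStringEncoder<String>("UTF-8"))
                // Produces the dt=yyyy-MM-dd'T'HH buckets seen in the paths above.
                .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd'T'HH"))
                .build();

        records.addSink(sink);
        env.execute("s3 sink sketch");
    }
}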

Via the S3 console, the finalized file in question (part-1-3403) does NOT exist, but its in-progress part file does:
_part-1-3402_tmp_38cbdecf-e5b5-4649-9754-bb7aa008f373
_part-1-3403_tmp_73e2a73b-0bac-46e8-8fdf-9455903d9da0
part-1-3395
part-1-3396
...
part-1-3401

The bucket's multipart upload (MPU) lifecycle policy is configured to delete incomplete uploads after 3 days, so it should not be a factor here.
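For reference, a rule like that looks roughly as follows when set through the AWS SDK for Java v1; the bucket name, rule ID, and prefix here are placeholders, not our exact policy:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AbortIncompleteMultipartUpload;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

public class MpuLifecycleSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Abort (and clean up) multipart uploads left incomplete for 3 days.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("abort-stale-mpus")
                .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate("tmp/")))
                .withAbortIncompleteMultipartUpload(
                        new AbortIncompleteMultipartUpload().withDaysAfterInitiation(3))
                .withStatus(BucketLifecycleConfiguration.ENABLED);

        s3.setBucketLifecycleConfiguration("my-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
    }
}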

Attempting to restore from the most recent checkpoint (5025) results in similar issues for different topics. What I am seeing in S3 is essentially two incomplete part files, such as:
_part-4-3441_tmp_da13ceba-a284-4353-bdd6-ef4005d382fc
_part-4-3442_tmp_fe0c0e00-c7f7-462f-a99f-464b2851a4cb
And the checkpoint restore operation fails with:
upload part on tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist.
(The upload referenced there indeed does not exist in S3.)
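One can confirm what S3 still tracks by listing the open multipart uploads under the prefix; a minimal sketch with the AWS SDK for Java v1, with placeholder bucket name and prefix (truncated listings are not handled here):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;

public class ListOpenMpus {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        ListMultipartUploadsRequest req =
                new ListMultipartUploadsRequest("my-bucket") // hypothetical bucket
                        .withPrefix("tmp/kafka/meta/");

        // Any upload ID held in the checkpoint but missing from this listing
        // will produce the NoSuchUpload error above on restore.
        for (MultipartUpload mpu : s3.listMultipartUploads(req).getMultipartUploads()) {
            System.out.printf("%s  uploadId=%s  initiated=%s%n",
                    mpu.getKey(), mpu.getUploadId(), mpu.getInitiated());
        }
    }
}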

Any ideas? 
As it stands, this job is basically unrecoverable because of this error.
Thank you




Re: StreamingFileSink to S3 failure to complete multipart upload

Harrison Xu
To add to this, attempting to restore from the most recent manually triggered savepoint results in a similar, yet slightly different error:

java.io.FileNotFoundException: upload part on tmp/kafka/meta/ads_action_log_kafka_uncounted/dt=2019-11-06T00/partition_6/part-4-2158: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload

Looking in S3, I see that two files with the same part number exist:
_part-4-2158_tmp_03c7ebaa-a9e5-455a-b501-731badc36765
part-4-2158
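Listing the partition prefix makes the duplicate easy to spot; a minimal sketch with the AWS SDK for Java v1 (bucket name is a placeholder, prefix taken from the error above):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class FindDuplicateParts {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Both the committed "part-4-2158" and the leftover
        // "_part-4-2158_tmp_..." should show up under this prefix.
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName("my-bucket") // hypothetical bucket
                .withPrefix("tmp/kafka/meta/ads_action_log_kafka_uncounted/dt=2019-11-06T00/partition_6/");

        for (S3ObjectSummary obj : s3.listObjectsV2(req).getObjectSummaries()) {
            System.out.printf("%s  size=%d%n", obj.getKey(), obj.getSize());
        }
    }
}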

And again, I cannot recover the job from this prior state.
Thanks so much for any input; I would love to understand what is going on. Happy to provide full logs if needed.

