Hello,
I'm seeing the following behavior in StreamingFileSink (1.9.1) uploading to S3:

2019-11-06 15:50:58,081 INFO com.quora.dataInfra.s3connector.flink.filesystem.Buckets - Subtask 1 checkpointing for checkpoint with id=5025 (max part counter=3406).
2019-11-06 15:50:58,448 INFO org.apache.flink.streaming.api.operators.AbstractStreamOperator - Could not complete snapshot 5025 for operator Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18).
java.io.IOException: Uploading parts failed
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartUploadToComplete(RecoverableMultiPartUploadImpl.java:231)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartsUpload(RecoverableMultiPartUploadImpl.java:215)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:151)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:56)
    ... 12 more
Caused by: java.io.FileNotFoundException: upload part on tmp/kafka/meta/auction_ads/dt=2019-11-06T15/partition_7/part-1-3403: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload; Request ID: 6D4B335FE7687B51; S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=), S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=:NoSuchUpload
    ... 10 more
...
2019-11-06 15:50:58,476 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18) (060d4deed87f3be96f3704474a5dc3e9).

Via the S3 console, the finalized file in question (part-1-3403) does NOT exist, but its in-progress temp part file does:

_part-1-3402_tmp_38cbdecf-e5b5-4649-9754-bb7aa008f373
_part-1-3403_tmp_73e2a73b-0bac-46e8-8fdf-9455903d9da0
part-1-3395
part-1-3396
...
part-1-3401

The MPU lifecycle policy is configured to delete incomplete uploads after 3 days, which should not be affecting this.

Attempting to restore from the most recent checkpoint, 5025, results in similar issues for different topics. What I am seeing in S3 is essentially two incomplete part files, such as:

_part-4-3441_tmp_da13ceba-a284-4353-bdd6-ef4005d382fc
_part-4-3442_tmp_fe0c0e00-c7f7-462f-a99f-464b2851a4cb

and the checkpoint restore operation fails with:

upload part on tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist.

(It does indeed not exist in S3.)

Any ideas? As it stands, this job is basically unrecoverable right now because of this error.

Thank you
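In case it's relevant: the lifecycle policy I mentioned is just the standard "abort incomplete multipart uploads" rule on the bucket, roughly the following (rule ID, prefix, and bucket name below are placeholders, not our exact config):

    {
      "Rules": [
        {
          "ID": "abort-incomplete-mpu",
          "Status": "Enabled",
          "Filter": { "Prefix": "" },
          "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 3 }
        }
      ]
    }

applied with something like:

    aws s3api put-bucket-lifecycle-configuration --bucket <our-bucket> --lifecycle-configuration file://lifecycle.json

Since the failing uploads are from the same hour, a 3-day abort window shouldn't be the thing deleting them.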
To add to this, attempting to restore from the most recent manually triggered savepoint results in a similar, yet slightly different, error:

java.io.FileNotFoundException: upload part on tmp/kafka/meta/ads_action_log_kafka_uncounted/dt=2019-11-06T00/partition_6/part-4-2158: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload

Looking in S3, I see two files with the same part number:

_part-4-2158_tmp_03c7ebaa-a9e5-455a-b501-731badc36765
part-4-2158

And again, I cannot recover the job from this prior state.

Thanks so much for any input - I would love to understand what is going on. Happy to provide full logs if needed.
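For completeness, the restore attempts are nothing unusual - just the standard CLI restore, roughly the following (jar name and savepoint/checkpoint paths are placeholders, not our exact paths):

    flink run -s s3://<state-bucket>/savepoints/savepoint-xxxxxx-yyyyyyyy -d our-job.jar

with the same failure when pointing -s at the retained checkpoint 5025 directory instead of the savepoint.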