Hi,
I'm on Flink 1.11.0, with a streaming job running in a YARN session and reading from Kinesis. I tried to stop the job via the REST API with "drain": false. The POST request returned a request_id (I'm not sure what I should use it for). In the GUI I could see that a savepoint completed successfully (my job has next to no state, so that was very quick). The watermark stopped increasing, and there were no more checkpoints after that.

However, the job's status is still ACTIVE. Querying the job details via REST showed that the job is not stoppable (I guess this is misleading information), and timestamps.RUNNING is not increasing:

    "name": "MyFlinkJob",
    "isStoppable": false,
    "state": "RUNNING",
    "start-time": 1604016319260,
    "end-time": -1,
    "duration": 1166337471,
    "now": 1605182656731,
    "timestamps": {
        "CANCELLING": 0,
        "FAILING": 0,
        "CANCELED": 0,
        "FINISHED": 0,
        "RUNNING": 1604016319495,
        "FAILED": 0,
        "RESTARTING": 0,
        "CREATED": 1604016319260,
        "RECONCILING": 0,
        "SUSPENDED": 0
    },

Is this a known bug, or is it expected behaviour?

Thanks and best regards,
Averell

-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
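On the request_id returned by the stop call: as far as the Flink REST API documentation describes it, that ID can be used to query the asynchronous operation at `/jobs/:jobid/savepoints/:triggerid`, whose response carries a `status.id` (`IN_PROGRESS` or `COMPLETED`) and an `operation` object with either a `location` or a `failure-cause`. The sketch below polls that endpoint. The endpoint path and response fields are taken from memory of the documentation and should be checked against the 1.11 docs; the function names are hypothetical helpers, not part of Flink.

```python
import json
import time
import urllib.request


def savepoint_status_url(base_url, job_id, request_id):
    """Build the status URL for an asynchronous stop/savepoint operation.

    Assumption: the status lives under /jobs/:jobid/savepoints/:triggerid.
    """
    return f"{base_url.rstrip('/')}/jobs/{job_id}/savepoints/{request_id}"


def summarize(status_body):
    """Reduce a status response to (state, savepoint location, failure cause)."""
    state = status_body.get("status", {}).get("id")
    operation = status_body.get("operation") or {}
    return state, operation.get("location"), operation.get("failure-cause")


def wait_for_savepoint(base_url, job_id, request_id, poll_seconds=2.0, max_polls=300):
    """Poll until the operation leaves IN_PROGRESS, then return its summary."""
    url = savepoint_status_url(base_url, job_id, request_id)
    for _ in range(max_polls):
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        state, location, failure = summarize(body)
        if state != "IN_PROGRESS":
            return state, location, failure
        time.sleep(poll_seconds)
    raise TimeoutError(f"operation {request_id} still in progress after {max_polls} polls")
```

A call such as `wait_for_savepoint("http://myip:45507", "5871af88ff279f30ebcc49ce741c2d75", request_id)` would then report where the savepoint was written or why it failed, which may be more informative than the job overview alone.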
I have some updates. Some weird behaviours were found. Please refer to the attached photo. All requests were sent via the REST API.

The status of the savepoint triggered by that stop request (ID 11018) is "COMPLETED [Savepoint]"; however, no checkpoint data has been persisted in S3. The folder savepoint-5871af-c0f2d2334501/_metadata/ has been created in S3, but there are no files in it. This was the command I used to send the first stop request:

    curl -s -d '{"drain": false, "targetDirectory":"s3://mybucket/savepoint"}' -H 'Content-Type: application/json' -X POST http://myip:45507/jobs/5871af88ff279f30ebcc49ce741c2d75/stop

Suspecting that the s3:// scheme might be the issue, I sent another stop request (ID 11020), but mistakenly wrote the path as s3s://, so it failed.

Another stop request was sent (ID 11021). This one failed after the timeout (10 minutes). The GUI says the checkpoint failed with "Checkpoint expired before completing".

    curl -s -d '{"drain": false, "targetDirectory":"s3a://mybucket/savepoint"}' -H 'Content-Type: application/json' -X POST http://myip:45507/jobs/5871af88ff279f30ebcc49ce741c2d75/stop

I then sent a create-savepoint request (ID 11023), and this time it completed successfully, with files persisted to S3. Checking the Flink GUI, I could see that the job had actually resumed before that savepoint request (checkpoint ID 11022 was created just 30 seconds after 11021 expired).

    curl -s -d '{"target-directory":"s3a://mybucket/savepoint", "cancel-job": false}' -H 'Content-Type: application/json' -X POST http://myip:45507/jobs/5871af88ff279f30ebcc49ce741c2d75/savepoints

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Screen_Shot_2020-11-13_at_11.png>
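One detail that is easy to miss in the commands above is that the two endpoints spell their JSON fields differently: the stop request uses `targetDirectory` and `drain`, while the savepoint request uses `target-directory` and `cancel-job`. A minimal sketch that builds both request bodies, so the two spellings are not mixed up, might look as follows. The field names are copied from the commands in this thread; the helper names are hypothetical.

```python
import json


def stop_with_savepoint_request(job_id, target_directory, drain=False):
    """Return (path, JSON body) for POST /jobs/:jobid/stop."""
    path = f"/jobs/{job_id}/stop"
    body = {"drain": drain, "targetDirectory": target_directory}
    return path, json.dumps(body)


def trigger_savepoint_request(job_id, target_directory, cancel_job=False):
    """Return (path, JSON body) for POST /jobs/:jobid/savepoints."""
    path = f"/jobs/{job_id}/savepoints"
    body = {"target-directory": target_directory, "cancel-job": cancel_job}
    return path, json.dumps(body)
```

The returned body can be passed to `curl -d` or to any HTTP client together with the `Content-Type: application/json` header used above.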
Hi Averell,

thanks for sharing this with the Flink community. Is there anything suspicious in the logs which you could share?

Best,
Matthias

On Fri, Nov 13, 2020 at 2:27 AM Averell <[hidden email]> wrote:
> I have some updates. Some weird behaviours were found. Please refer to the