Job is still in ACTIVE state after /jobs/:jobid/stop

Averell
Hi,

I'm on 1.11.0, with a streaming job running on a YARN session, reading from
Kinesis.
I tried to stop the job via REST, with "drain": false. The POST request
returned a request_id (I'm not sure what to use it for).
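
For reference, a minimal sketch of how that ID could be used, assuming it is the trigger ID of the asynchronous stop operation (the `request-id` field of the response) and that its status is served at `/jobs/:jobid/savepoints/:triggerid` as a single-line JSON of the shape `{"status":{"id":"..."}}`. The function names are hypothetical helpers, not part of Flink.

```shell
# Hypothetical helpers; host, port, and IDs are placeholders.

# Build the URL for polling an asynchronous savepoint/stop operation.
# $1 = REST base URL, $2 = job ID, $3 = request ID returned by the POST.
savepoint_status_url() {
  printf '%s/jobs/%s/savepoints/%s\n' "$1" "$2" "$3"
}

# Read a status response on stdin and print its status id
# (for example IN_PROGRESS or COMPLETED). Uses sed rather than jq to
# avoid an extra dependency; assumes compact, single-line JSON.
savepoint_status_id() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*{[[:space:]]*"id"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p'
}

# Usage against a live cluster (not run here):
#   curl -s "$(savepoint_status_url http://myip:45507 "$JOB_ID" "$REQUEST_ID")" \
#     | savepoint_status_id
```

A COMPLETED status only says that the asynchronous operation has finished; the same response would also need to be checked for a failure cause before treating the stop as successful.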

Checking the job in the GUI, I could see that a savepoint had completed
successfully (my job has next to no state, so that was very quick). The
watermark stopped advancing, and no more checkpoints were taken after that.
However, the job's status is still ACTIVE.

Querying the job details via REST showed that the job is not stoppable (I
guess this is misleading information), and that timestamps.RUNNING is not
increasing.

  "name": "MyFlinkJob",
  "isStoppable": false,
  "state": "RUNNING",
  "start-time": 1604016319260,
  "end-time": -1,
  "duration": 1166337471,
  "now": 1605182656731,
  "timestamps": {
    "CANCELLING": 0,
    "FAILING": 0,
    "CANCELED": 0,
    "FINISHED": 0,
    "RUNNING": 1604016319495,
    "FAILED": 0,
    "RESTARTING": 0,
    "CREATED": 1604016319260,
    "RECONCILING": 0,
    "SUSPENDED": 0
  },
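
The check described above can be scripted. The following is a rough sketch, assuming the job-details response carries a `"state"` field as in the excerpt, and that a job which has really stopped reports one of Flink's globally terminal states (FINISHED, CANCELED, or FAILED). Both function names are hypothetical.

```shell
# Hypothetical helpers for inspecting a job-details response.

# Read job details on stdin and print the first "state" value found.
# Assumes the job-level "state" field appears before any nested one.
job_state() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"[A-Z_]*"' \
    | head -n 1 \
    | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Succeed (exit status 0) if the given state is globally terminal.
is_terminal_state() {
  case "$1" in
    FINISHED|CANCELED|FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live cluster (not run here):
#   state=$(curl -s http://myip:45507/jobs/"$JOB_ID" | job_state)
#   is_terminal_state "$state" || echo "job still in state $state"
```

With the excerpt above, `job_state` would print RUNNING, which `is_terminal_state` rejects.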

Is this a known bug? Or is it an expected behaviour?
Thanks and best regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Job is still in ACTIVE state after /jobs/:jobid/stop

Averell
I have some updates: I found some odd behaviour. Please refer to the
attached screenshot.

All requests were sent via the REST API.

The status of the savepoint triggered by that stop request (ID 11018) is
"COMPLETED [Savepoint]"; however, no checkpoint data has been persisted (in
S3).
The folder savepoint-5871af-c0f2d2334501/_metadata/ has been created in
S3, but it contains no files.
This was the command I used to send the first stop request:
curl -s -d '{"drain": false, "targetDirectory":"s3://mybucket/savepoint"}' \
  -H 'Content-Type: application/json' -X POST \
  http://myip:45507/jobs/5871af88ff279f30ebcc49ce741c2d75/stop

Suspecting that the s3:// scheme might be the issue, I sent another stop
request (ID 11020), but mistakenly wrote the path as s3s://, so it failed.
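
A typo like this can be caught before the request is sent. Below is a small sketch of such a guard. The list of accepted schemes is an assumption: s3:// and s3a:// are the ones used in this thread, and s3p:// is included on the assumption that the Presto-based S3 filesystem plugin may be in use. Adjust it to whichever filesystem plugins the cluster actually has installed. The function name is hypothetical.

```shell
# Hypothetical guard: succeed only if the target directory uses an
# S3 scheme from the assumed allow-list.
has_allowed_scheme() {
  case "$1" in
    s3://*|s3a://*|s3p://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   target="s3a://mybucket/savepoint"
#   has_allowed_scheme "$target" || { echo "unexpected scheme: $target" >&2; exit 1; }
```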

Another stop request was sent (ID 11021). This one failed after the timeout
(10 minutes). The GUI says the checkpoint failed with "Checkpoint expired
before completing".
curl -s -d '{"drain": false, "targetDirectory":"s3a://mybucket/savepoint"}' \
  -H 'Content-Type: application/json' -X POST \
  http://myip:45507/jobs/5871af88ff279f30ebcc49ce741c2d75/stop

I then sent a create-savepoint request (ID 11023), and this time it
completed successfully, with files persisted to S3. Checking the Flink GUI, I
could see that the job had actually resumed before that savepoint request
(checkpoint ID 11022 was created just 30 seconds after 11021 expired).
curl -s -d '{"target-directory":"s3a://mybucket/savepoint", "cancel-job": false}' \
  -H 'Content-Type: application/json' -X POST \
  http://myip:45507/jobs/5871af88ff279f30ebcc49ce741c2d75/savepoints

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Screen_Shot_2020-11-13_at_11.png>

Re: Job is still in ACTIVE state after /jobs/:jobid/stop

Matthias
Hi Averell,
Thanks for sharing this with the Flink community. Is there anything suspicious in the logs that you could share?

Best,
Matthias

On Fri, Nov 13, 2020 at 2:27 AM Averell <[hidden email]> wrote: