Sink not switched to "RUNUNG" even though a task slot is available

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Sink not switched to "RUNUNG" even though a task slot is available

Yassine MARZOUGUI
Hi all,

My batch job has the follwoing plan in the end (figure attached):



I have a total of 32 task slots, and I have set the parallelism of the last two operators before the sink to 31. The sink parallelism is 1. The last operator before the sink is a MapOperator, so it doesn't need to buffer all elements before emitting the output.

Looking at the dashboard, I see that one task slot is available, but the sink subtask stays in the state "CREATED".

Any explanation for this behaviour? Thank you.

Best,
Yassine
Reply | Threaded
Open this post in threaded view
|

Re: Sink not switched to "RUNUNG" even though a task slot is available

rmetzger0
Hi Yassine,

you don't necessarily need to set the parallelism of the last two operators of 31, the sink with parallelism 1 will fit still into the slots.
A task slot can, by default, hold an entire "slice" or parallel instance of a job.

The reason why the sink stays in state CREATE in the beginning is because it didn't receive any data yet. In batch mode, an operator is switching to RUNNING once it received the first record. In your case, there are some operations (reduce, join) before the sink that first need to consume all data and process it.
If you wait long enough, you should see the sink to become active.

Regards,
Robert



On Wed, Nov 23, 2016 at 2:03 PM, Yassine MARZOUGUI <[hidden email]> wrote:
Hi all,

My batch job has the follwoing plan in the end (figure attached):



I have a total of 32 task slots, and I have set the parallelism of the last two operators before the sink to 31. The sink parallelism is 1. The last operator before the sink is a MapOperator, so it doesn't need to buffer all elements before emitting the output.

Looking at the dashboard, I see that one task slot is available, but the sink subtask stays in the state "CREATED".

Any explanation for this behaviour? Thank you.

Best,
Yassine

Reply | Threaded
Open this post in threaded view
|

Re: Sink not switched to "RUNUNG" even though a task slot is available

Yassine MARZOUGUI

Hi Robert,

I see, so the join needs to consume all data first and process it.
In my case, I couldn't wait long because the first join quickly generated a lot of data that can't fit in the memory or in the disk. The solution was then to manually specify a JoinHint and broadcast the small dataset, that way I was able to get an output of the join as soon as data is received.

Thanks a lot for the clarification!

Best,
Yassine


2016-11-23 17:35 GMT+01:00 Robert Metzger <[hidden email]>:
Hi Yassine,

you don't necessarily need to set the parallelism of the last two operators of 31, the sink with parallelism 1 will fit still into the slots.
A task slot can, by default, hold an entire "slice" or parallel instance of a job.

The reason why the sink stays in state CREATE in the beginning is because it didn't receive any data yet. In batch mode, an operator is switching to RUNNING once it received the first record. In your case, there are some operations (reduce, join) before the sink that first need to consume all data and process it.
If you wait long enough, you should see the sink to become active.

Regards,
Robert



On Wed, Nov 23, 2016 at 2:03 PM, Yassine MARZOUGUI <[hidden email]> wrote:
Hi all,

My batch job has the follwoing plan in the end (figure attached):



I have a total of 32 task slots, and I have set the parallelism of the last two operators before the sink to 31. The sink parallelism is 1. The last operator before the sink is a MapOperator, so it doesn't need to buffer all elements before emitting the output.

Looking at the dashboard, I see that one task slot is available, but the sink subtask stays in the state "CREATED".

Any explanation for this behaviour? Thank you.

Best,
Yassine