Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Cliff Resnick
I'm running a YARN cluster of 8 * 4-core instances = 32 cores, configured with 3 slots per TM. The cluster is dedicated to a single job that runs at full capacity in "FLIP6" mode, so the parallelism is 21 (7 TMs * 3 slots, with one container dedicated to the Job Manager).

When I run the job in 1.6.0, seven Task Managers are spun up as expected. But if I run with 1.6.2, only four Task Managers spin up and the job hangs waiting for more resources.

Our Flink distribution is set up by script after building from source, so aside from the Flink jars, the 1.6.0 and 1.6.2 directories are identical. The job is the same, restarting from a savepoint. The problem is repeatable.

Has something changed in 1.6.2, and if so can it be remedied with a config change?
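
For reference, the setup boils down to roughly the following (this is a paraphrase from memory, not the exact command we run, and "our-job.jar" is just a placeholder):

    # conf/flink-conf.yaml
    taskmanager.numberOfTaskSlots: 3

    # per-job submission against YARN, parallelism 21
    flink run -m yarn-cluster -ys 3 -p 21 our-job.jar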





Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Till Rohrmann
Hi Cliff,

this does not sound right. Could you share the logs of the YARN cluster entrypoint with the community for further debugging, ideally at DEBUG level? The YARN logs would also be helpful to fully understand the problem. Thanks a lot!
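
In case it helps: with the default distribution, DEBUG logging can usually be switched on by editing conf/log4j.properties before starting the cluster (a minimal sketch, assuming the stock log4j setup that ships with Flink):

    # conf/log4j.properties -- change the root logger from INFO to DEBUG
    log4j.rootLogger=DEBUG, file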

Cheers,
Till


Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Cliff Resnick
Hi Till,

Here are the Job Manager logs for the same job in both 1.6.0 and 1.6.2, at DEBUG level. I saw several errors in 1.6.2; I hope it's informative!

Cliff

Attachment: logs.tar.gz (317K)

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Till Rohrmann
Hi Cliff,

The TaskManagers fail to start with exit code 31, which indicates an initialization error on startup. If you check the TaskManager logs via `yarn logs -applicationId <APP_ID>`, you should see why the TMs don't start up.
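
For example, something along these lines (the grep filter is just a suggestion on my side, adjust as needed):

    yarn logs -applicationId <APP_ID> > taskmanager-logs.txt
    grep -iE 'error|exception' taskmanager-logs.txt | less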

Cheers,
Till


Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Cliff Resnick
Hi Till,

Yes, it turns out the problem was having flink-queryable-state-runtime_2.11-1.6.2.jar in flink/lib. I guess Queryable State bootstraps itself and, in my situation, it brought the Task Manager down when it found no available port. What's a little troubling is that I had not configured Queryable State at all, so I would not expect it to get in the way. I haven't looked further into it, but I think that if Queryable State wants to enable itself, then it should at worst take an unused port by default, especially since many people will be running in shared environments like YARN.
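
For anyone who hits the same thing, this is what I ended up doing, plus an alternative I have not tried myself (the port options below are my reading of the docs, so treat them as unverified):

    # what fixed it for us: don't ship the queryable state runtime at all
    rm lib/flink-queryable-state-runtime_2.11-1.6.2.jar

    # alternative (untested): spread the ports over a range in conf/flink-conf.yaml
    # so instances don't collide on the single default port
    #   queryable-state.server.ports: 50100-50200
    #   queryable-state.proxy.ports: 50300-50400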

But anyway, thanks for that! I'm now up and running on 1.6.2.

Cliff


Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Till Rohrmann
Good to hear, Cliff.

You're right that it's not a nice user experience. The problem with queryable state is that one would need to look at the actual user job to decide whether it uses queryable state or not, but by then it is already too late to start the infrastructure needed for querying the state. You're right, though, that we should at least pick a random port by default. I've created a corresponding issue for this: https://issues.apache.org/jira/browse/FLINK-10866.

Cheers,
Till
