Reducing parallelism leads to NoResourceAvailableException


Reducing parallelism leads to NoResourceAvailableException

Ken Krugler
Hi all,

While trying out different settings for performance, I ran into a job failure case that puzzles me.

I’d done a run with a parallelism of 20 (-p 20 via the CLI), and the job ran successfully on a cluster with 40 slots.

I then tried with -p 15, and it failed with:

NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism…

But the change was to reduce parallelism, so why would that now cause this problem?

Thanks,

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




Re: Reducing parallelism leads to NoResourceAvailableException

Aljoscha Krettek
Hi,
is this a streaming or batch job? If it is a batch job, are you using either collect() or print() on a DataSet?

Cheers,
Aljoscha
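
For readers following along, a minimal sketch of why this question matters (illustrative code only, not the job from this thread): in the batch DataSet API, collect() and print() are eager operations, each submitting and running a job of its own, so a single program can end up executing several jobs, with slot demands that differ from what a single `-p` setting suggests.

```java
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class EagerSinksSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> data = env.fromElements("a", "b", "c");

        // collect() submits a job immediately and ships the result
        // back to the client program.
        List<String> local = data.collect();

        // print() also triggers its own job execution (it collects the
        // DataSet to the client and prints it there).
        data.print();
    }
}
```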





Re: Reducing parallelism leads to NoResourceAvailableException

Ufuk Celebi
In reply to this post by Ken Krugler
Hey Ken!

That should not happen. Can you check two things in the web interface:

- How many available slots are advertised on the landing page
(localhost:8081) when you submit your job?
- What is the actual parallelism of the submitted job (it should
appear as a FAILED job in the web frontend)? Is it really 15?

– Ufuk
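
A possible way to capture the same numbers without the browser (a sketch; host and port are assumptions, 8081 being the default web UI port): the JobManager serves a monitoring REST API alongside the web frontend.

```shell
# Sketch: query the JobManager's monitoring REST API (same port as the
# web UI; localhost:8081 is an assumption -- adjust for your setup).

# Cluster overview, including "slots-total" and "slots-available":
curl -s http://localhost:8081/overview

# Overview of running and finished jobs (including FAILED ones):
curl -s http://localhost:8081/joboverview
```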


Re: Reducing parallelism leads to NoResourceAvailableException

Ken Krugler
In reply to this post by Aljoscha Krettek

On Apr 28, 2016, at 1:32am, Aljoscha Krettek <[hidden email]> wrote:

> Hi,
> is this a streaming or batch job?

Batch.

> If it is a batch job, are you using either collect() or print() on a DataSet?

Definitely not a print(). Don’t know about collect(), since the job is created via the Cascading-Flink planner. Fabian would know best.

— Ken






Re: Reducing parallelism leads to NoResourceAvailableException

Ken Krugler
In reply to this post by Ufuk Celebi
Hi Ufuk,

On Apr 28, 2016, at 1:32am, Ufuk Celebi <[hidden email]> wrote:

> Hey Ken!
>
> That should not happen. Can you check the web interface for two things:
>
> - How many available slots are advertised on the landing page
> (localhost:8081) when you submit your job?

I’m running this on YARN, so I don’t believe the web UI shows up until the Flink ApplicationMaster has started, which means I can’t see the advertised number of available slots before the job is running.

> - Can you check the actual parallelism of the submitted job (it should
> appear as a FAILED job in the web frontend). Is it really 15?

Same as above: the Flink web UI is gone once the job has failed.

Any suggestions for how to check the actual parallelism in this type of transient YARN environment?

Thanks,

— Ken
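
One possibility, sketched under the assumption that YARN log aggregation is enabled (yarn.log-aggregation-enable=true): pull the JobManager logs after the application has finished and search them for the lines that record slot counts and the requested parallelism.

```shell
# Sketch: list past applications to find the id (placeholder below),
# then fetch the aggregated logs and search them.
yarn application -list -appStates ALL

# Replace the application id with the real one from the listing above.
yarn logs -applicationId application_XXXXXXXXXX_XXXX | grep -iE "slot|parallelism"
```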

