Batch jobs stalling after initial progress

Marko Mušnjak
Hi,

I'm running Flink batch jobs on EMR 5.21, and I'm seeing many (>50%) of the jobs stall and make no further progress after an initial period of activity. I had seen this behaviour earlier (on EMR 5.17), but not nearly as often as now.

The job is a fairly simple enrichment job: it loads an Avro metadata file, creates several datasets from it, and broadcasts them. These are later joined with the dataset of input events, which is also read from Avro files. There are no shuffles or keyBy operations.
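To make the shape of the job concrete, here is a minimal DataSet API sketch of that kind of pipeline (the record types, field names, and paths are placeholders, not the actual job code):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EnrichmentJobSketch {

    // Placeholder POJOs standing in for the Avro-generated record classes.
    public static class MetadataRecord { public String key; public String value; }
    public static class Event { public String key; public String payload; }
    public static class EnrichedEvent { public String key; public String payload; public String metadataValue; }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Both the metadata and the input events are read from Avro files.
        DataSet<MetadataRecord> metadata = env.createInput(
                new AvroInputFormat<>(new Path("s3://bucket/metadata.avro"), MetadataRecord.class));
        DataSet<Event> events = env.createInput(
                new AvroInputFormat<>(new Path("s3://bucket/events/"), Event.class));

        // Broadcast the small metadata set and enrich the events in a map,
        // so there is no shuffle / keyBy in the plan.
        DataSet<EnrichedEvent> enriched = events
                .map(new RichMapFunction<Event, EnrichedEvent>() {
                    private final Map<String, MetadataRecord> byKey = new HashMap<>();

                    @Override
                    public void open(Configuration parameters) {
                        List<MetadataRecord> records =
                                getRuntimeContext().getBroadcastVariable("metadata");
                        for (MetadataRecord r : records) {
                            byKey.put(r.key, r);
                        }
                    }

                    @Override
                    public EnrichedEvent map(Event e) {
                        EnrichedEvent out = new EnrichedEvent();
                        out.key = e.key;
                        out.payload = e.payload;
                        MetadataRecord m = byKey.get(e.key);
                        out.metadataValue = (m == null) ? null : m.value;
                        return out;
                    }
                })
                .withBroadcastSet(metadata, "metadata");

        // The real job writes Parquet here; a plain text sink keeps the sketch short.
        enriched.writeAsText("s3://bucket/output/");

        env.execute("enrichment-sketch");
    }
}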

I see nothing in the logs at INFO level, and the UI for the stalled jobs shows the following:
* metadata loading tasks are finished.
* all other tasks are running, except the Parquet output task, which is in state "created"
* the task immediately upstream of the Parquet output task shows back pressure status "OK", while the task one step further upstream shows back pressure status "High"

Are there any specific logs I should enable to get more information on this? Has anyone else seen this behaviour?
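For reference, Flink's logging is configured through log4j.properties on the JobManager and TaskManager nodes; raising specific runtime loggers to DEBUG would look roughly like this (package names given purely as an illustration, not a confirmed fix):

# Illustrative only: more verbosity for scheduling and network data exchange.
log4j.logger.org.apache.flink.runtime.executiongraph=DEBUG
log4j.logger.org.apache.flink.runtime.io.network=DEBUG
log4j.logger.org.apache.flink.runtime.taskmanager=DEBUG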

Kind regards,
Marko 

Re: Batch jobs stalling after initial progress

Ken Krugler
Hi Marko,

Some things that have caused my jobs to run very slowly (though not to stall completely):

1. Cross-joins generating huge result sets.

2. Joins causing very large spills to disk (see the join-hint sketch after this list).

3. Slow external API access.
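On point 2, one way (offered only as an illustration, not something from this job) to keep a join from spilling is to hint that one side fits in memory. With hypothetical Event/MetadataRecord types this might look like:

// Fragment only; assumes DataSet<Event> events and DataSet<MetadataRecord> metadata
// with a public "key" field, plus imports for DataSet, Tuple2 and
// org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint.

// Hint that the second input is small enough to broadcast and hash in memory,
// so the large events input is not sorted or spilled to disk.
DataSet<Tuple2<Event, MetadataRecord>> joined =
        events.join(metadata, JoinHint.BROADCAST_HASH_SECOND)
              .where("key")
              .equalTo("key");

// Equivalent shortcut when the right-hand side is known to be tiny:
DataSet<Tuple2<Event, MetadataRecord>> joinedTiny =
        events.joinWithTiny(metadata)
              .where("key")
              .equalTo("key");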

With streaming, iterations can cause stalls, but I don’t think that’s true for batch (I haven’t tried it, though).

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra