Batch jobs stalling after initial progress

Marko Mušnjak
Hi,

I'm running Flink batch jobs on EMR 5.21, and I'm seeing many (>50%) of the jobs stall and make no further progress after an initial period of activity. I had seen this behaviour earlier (on EMR 5.17), but not nearly as often as now.

The job is a fairly simple enrichment job: it loads an Avro metadata file, creates several datasets from it, and broadcasts them. These are later joined with the dataset of input events, which is also read from Avro files. There are no shuffles or keyBy operations.
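To make the shape of the job concrete, here is a minimal DataSet API sketch of that kind of pipeline (the record types, field names, and paths are placeholders, not the actual job code):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EnrichmentJobSketch {

    // Placeholder POJOs standing in for the Avro-generated record classes.
    public static class MetadataRecord { public String key; public String value; }
    public static class Event { public String key; public String payload; }
    public static class EnrichedEvent { public String key; public String payload; public String metadataValue; }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Both the metadata and the input events are read from Avro files.
        DataSet<MetadataRecord> metadata = env.createInput(
                new AvroInputFormat<>(new Path("s3://bucket/metadata.avro"), MetadataRecord.class));
        DataSet<Event> events = env.createInput(
                new AvroInputFormat<>(new Path("s3://bucket/events/"), Event.class));

        // Broadcast the small metadata set and enrich the events in a map,
        // so there is no shuffle / keyBy in the plan.
        DataSet<EnrichedEvent> enriched = events
                .map(new RichMapFunction<Event, EnrichedEvent>() {
                    private final Map<String, MetadataRecord> byKey = new HashMap<>();

                    @Override
                    public void open(Configuration parameters) {
                        List<MetadataRecord> records =
                                getRuntimeContext().getBroadcastVariable("metadata");
                        for (MetadataRecord r : records) {
                            byKey.put(r.key, r);
                        }
                    }

                    @Override
                    public EnrichedEvent map(Event e) {
                        EnrichedEvent out = new EnrichedEvent();
                        out.key = e.key;
                        out.payload = e.payload;
                        MetadataRecord m = byKey.get(e.key);
                        out.metadataValue = (m == null) ? null : m.value;
                        return out;
                    }
                })
                .withBroadcastSet(metadata, "metadata");

        // The real job writes Parquet here; a plain text sink keeps the sketch short.
        enriched.writeAsText("s3://bucket/output/");

        env.execute("enrichment-sketch");
    }
}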

I see nothing in the logs at INFO level, and the UI for the stalled jobs shows the following:
* metadata loading tasks are finished.
* all other tasks are running, except the Parquet output task, which is in state "created"
* the task immediately upstream of the Parquet output task shows back pressure status "OK", while the task one step further upstream shows back pressure status "High"

Are there any specific logs I should enable to get more information on this? Has anyone else seen this behaviour?
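For reference, Flink's logging is configured through log4j.properties on the JobManager and TaskManager nodes; raising specific runtime loggers to DEBUG would look roughly like this (package names given purely as an illustration, not a confirmed fix):

# Illustrative only: more verbosity for scheduling and network data exchange.
log4j.logger.org.apache.flink.runtime.executiongraph=DEBUG
log4j.logger.org.apache.flink.runtime.io.network=DEBUG
log4j.logger.org.apache.flink.runtime.taskmanager=DEBUG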

Kind regards,
Marko 

Re: Batch jobs stalling after initial progress

Ken Krugler
Hi Marko,

Some things that have caused my jobs to run very slowly (though not to stall completely):

1. Cross-joins generating huge result sets.

2. Joins causing very large spills to disk (see the join-hint sketch after this list).

3. Slow external API access.
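On point 2, one way (offered only as an illustration, not something from this job) to keep a join from spilling is to hint that one side fits in memory. With hypothetical Event/MetadataRecord types this might look like:

// Fragment only; assumes DataSet<Event> events and DataSet<MetadataRecord> metadata
// with a public "key" field, plus imports for DataSet, Tuple2 and
// org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint.

// Hint that the second input is small enough to broadcast and hash in memory,
// so the large events input is not sorted or spilled to disk.
DataSet<Tuple2<Event, MetadataRecord>> joined =
        events.join(metadata, JoinHint.BROADCAST_HASH_SECOND)
              .where("key")
              .equalTo("key");

// Equivalent shortcut when the right-hand side is known to be tiny:
DataSet<Tuple2<Event, MetadataRecord>> joinedTiny =
        events.joinWithTiny(metadata)
              .where("key")
              .equalTo("key");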

With streaming, iterations can cause stalls, but I don’t think that’s true for batch (I haven’t tried it, though).

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra