Hi,
I'm running Flink batch jobs on EMR 5.21, and more than 50% of them stall and make no progress after some initial period. I saw the same behaviour on an earlier release (5.17), but not nearly as often as now.
The job is a fairly simple enrichment job: it loads an Avro metadata file, creates several datasets from that file, and broadcasts them. The broadcast datasets are later used in joins with the dataset of input events, which is also read from Avro files. There are no shuffles or keyBy operations.
I see nothing in the logs at INFO level, and the UI for the stalled jobs shows the following:
* metadata loading tasks are finished.
* all other tasks are running, except the Parquet output task, which is in state "created"
* the task immediately upstream of the Parquet output task shows back pressure status "OK"; the task one step further upstream shows back pressure status "High"
Are there any specific logs I should enable to get more information on this? Has anyone else seen this behaviour?
Kind regards,
Marko