Flink batch app occasionally hang


Caio Aoque
Hi, I've been running some Flink Scala applications on an AWS EMR cluster (version 5.26.0, with Flink 1.8.0 for Scala 2.11) for a while, and I have recently started to have some issues.

I have a Flink app that reads some files from S3, processes them, and saves some files to S3 and some records to a database.

The application is not very complex: it has one source that reads a directory (multiple files) and another that reads a single file, followed by some grouping and mapping and a left outer join between these two sources.
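For readers trying to picture the topology, here is a minimal sketch of a job with that shape, written against the Flink 1.8 DataSet API in Scala. It is not the actual application: the S3 paths, the `Event` and `Ref` record types, the CSV parsing, the sum aggregation, and the "unknown" default are all hypothetical stand-ins, and the database sink is left out.

```scala
import org.apache.flink.api.scala._

// Hypothetical record types; the real schemas are not given in this thread.
case class Event(key: String, value: Long)
case class Ref(key: String, label: String)

object BatchJobSketch {
  // Assumed input format: comma-separated "key,value" lines.
  def parseEvent(line: String): Event = {
    val cols = line.split(',')
    Event(cols(0), cols(1).toLong)
  }

  // Assumed input format: comma-separated "key,label" lines.
  def parseRef(line: String): Ref = {
    val cols = line.split(',')
    Ref(cols(0), cols(1))
  }

  // In a left outer join the right side is null when no record matches.
  def joinOrDefault(left: Event, right: Ref): (String, Long, String) =
    (left.key, left.value, if (right == null) "unknown" else right.label)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Source 1: a directory containing multiple files (placeholder path).
    val events: DataSet[Event] = env
      .readTextFile("s3://example-bucket/input-dir/")
      .map(line => parseEvent(line))

    // Source 2: a single file (placeholder path).
    val refs: DataSet[Ref] = env
      .readTextFile("s3://example-bucket/reference.csv")
      .map(line => parseRef(line))

    // Grouping step that sits between source 1 and the join.
    val totals: DataSet[Event] = events
      .groupBy(_.key)
      .reduce((a, b) => Event(a.key, a.value + b.value))

    // Left outer join of the grouped source 1 data with source 2.
    val joined: DataSet[(String, Long, String)] = totals
      .leftOuterJoin(refs)
      .where("key")
      .equalTo("key")
      .apply((left, right) => joinOrDefault(left, right))

    joined.writeAsCsv("s3://example-bucket/output/")
    env.execute("batch-job-sketch")
  }
}
```

In a plan like this, the grouping operators sit between source 1 and the join, which corresponds to the intermediate tasks mentioned in the description.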

The issue is that occasionally the application gets stuck with only two tasks running, one finished, and the rest never started. The two tasks that keep running forever are source 1 (the directory with multiple files) and the left outer join; source 2 (the single-file input) is the one that finishes. One interesting thing is that there should be several tasks between source 1 and the left outer join, but they remain in the CREATED state. When the app gets stuck, I usually just kill it and run it again, which works. The issue used to be infrequent but is becoming more and more frequent; it happens almost every day now.

I also have a DEBUG log from a job that didn't work and another one from a job that worked.

Thanks.
Re: Flink batch app occasionally hang

vino yang
Hi Caio,

Because this involves interaction with external systems, it would be better if you could provide the full logs.

Best,
Vino

On Wed, Oct 30, 2019 at 8:31 AM, Caio Aoque <[hidden email]> wrote:
Hi, I've been running some Flink Scala applications on an AWS EMR cluster [...]
Re: Flink batch app occasionally hang

Zhu Zhu
In reply to this post by Caio Aoque
Hi Caio,

Did you check whether there are enough resources to launch the other nodes?
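One way to check the slot side of this, assuming the JobManager REST endpoint is reachable, is to read the `slots-total` and `slots-available` fields from the `/overview` response. The sketch below is only an illustration: the base URL is a placeholder, the `SlotCheck` object and its helpers are hypothetical names, and a regex stands in for a proper JSON library to keep it dependency-free. On YARN, Flink can request containers on demand, so zero free slots is only a hint and the YARN side is worth checking as well.

```scala
import scala.io.Source
import java.util.regex.Pattern

object SlotCheck {
  // Extracts a non-negative integer field such as "slots-available" from JSON text.
  // Regex-based on purpose; a real tool would use a JSON parser.
  def intField(json: String, name: String): Option[Int] = {
    val pattern = ("\"" + Pattern.quote(name) + "\"\\s*:\\s*(\\d+)").r
    pattern.findFirstMatchIn(json).map(_.group(1).toInt)
  }

  // Returns (free, total) slots if both fields are present in the response.
  def freeSlots(overviewJson: String): Option[(Int, Int)] =
    for {
      total <- intField(overviewJson, "slots-total")
      free  <- intField(overviewJson, "slots-available")
    } yield (free, total)

  def main(args: Array[String]): Unit = {
    // Placeholder default; pass the real JobManager REST address as the first argument.
    val baseUrl = args.headOption.getOrElse("http://localhost:8081")
    val src = Source.fromURL(baseUrl + "/overview")
    try {
      freeSlots(src.mkString) match {
        case Some((free, total)) => println(s"free slots: $free of $total")
        case None                => println("slot counts not found in the response")
      }
    } finally src.close()
  }
}
```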

Could you attach the logs you mentioned, and elaborate on how the tasks are connected in the topology?


Thanks,
Zhu Zhu
