Stream job getting Failed

Stream job getting Failed

anuj.aj07
I have a Flink streaming job that reads data from Kafka and writes it to S3. The job keeps failing after running for 2-3 days.
I cannot find anything in the logs that explains why it fails. Can somebody help me figure out how to find the cause of the failure?

This is all I can see in the logs:

 org.apache.flink.streaming.api.functions.sink.filesystem.Buckets [] - Subtask 7 received completion notification for checkpoint with id=608.
2020-12-09 16:41:56,110 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                 [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2020-12-09 16:41:56,111 INFO  org.apache.flink.runtime.blob.TransientBlobCache             [] - Shutting down BLOB cache
2020-12-09 16:41:56,111 INFO  org.apache.flink.runtime.blob.PermanentBlobCache             [] - Shutting down BLOB cache
2020-12-09 16:41:56,111 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /mnt1/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-fd5d7eae-bff7-4d74-89d8-0a40f174b7b8
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /mnt2/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-c5833412-5944-4b41-a502-5d952f5156af
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt1/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-e290b3fd-9110-47c4-9463-1bd08003afc9
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /mnt3/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-bf0d69fe-0f00-4483-8b20-0056a049f86b
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt2/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-55b8467d-8c16-441a-83d6-393462a0b4ca
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.filecache.FileCache                 [] - removed file cache directory /mnt/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-8bc77a7c-f62b-4f06-b963-41f174a0db8e
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt3/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-bf57f8db-0152-4697-b743-d07b4e46c9d7
2020-12-09 16:41:56,115 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-d650ed68-9c44-45b9-9b41-d501152b3f0f
2020-12-09 16:41:56,120 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt1/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-9311e006-fee0-4317-9355-5d981c558a08
2020-12-09 16:41:56,120 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt2/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-c633cf9f-8220-433a-8f3e-04d45e81efde
2020-12-09 16:41:56,120 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt3/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-671eda78-9981-4f6d-bff4-25cca973d76d
2020-12-09 16:41:56,120 INFO  org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - FileChannelManager removed spill file directory /mnt/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-5efa7701-72da-4d91-b9f7-7e6963ffefdb

End of LogType:taskmanager.log
********************************************************************************


End of LogType:taskmanager.out
********************************************************************************


--
Thanks & Regards,
Anuj Jain



Re: Stream job getting Failed

Arvid Heise-3
Hi Anuj,

Signal 15 is SIGTERM, which means the TaskManager was terminated by an external process rather than failing on its own. Check the YARN logs for the specific reason the container was killed.
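
As a rough aid for that search, here is a minimal Python sketch that filters log text for lines hinting at a container kill. The function name find_kill_hints and the pattern list are my own, and the patterns only approximate the wording YARN typically uses, which may differ between Hadoop versions, so treat them as assumptions and adjust as needed.

```python
import re
from typing import List

# Assumed patterns: approximate wording of YARN kill diagnostics.
# The exact text can vary by Hadoop version, so extend this list as needed.
KILL_PATTERNS = [
    re.compile(r"running beyond (physical|virtual) memory limits", re.IGNORECASE),
    re.compile(r"Container killed on request", re.IGNORECASE),
    re.compile(r"Exit code is 143"),
]


def find_kill_hints(log_text: str) -> List[str]:
    """Return every line of log_text that matches one of KILL_PATTERNS."""
    hits = []
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in KILL_PATTERNS):
            hits.append(line.strip())
    return hits
```

One way to use it: save the aggregated logs with `yarn logs -applicationId application_1603267081962_94843 > app.log`, then call `find_kill_hints(open("app.log").read())`. The kill reason may only show up in the NodeManager logs or the application diagnostics, so an empty result here does not rule out a memory kill.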

Usually, YARN kills a container with exit code 143 (128 + 15, i.e. termination by SIGTERM) when it exceeds its memory limits. The community is constantly improving this, but it can still happen because of the various types of memory that are allocated (in particular native memory). Please recheck [1] for how you can increase some safety margins.
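
To give a feeling for what such a safety margin looks like, here is a small Python sketch of how the JVM overhead portion of a TaskManager's memory is sized. It assumes the unified memory model of Flink 1.10 and later, where the overhead is a fraction of the total process size clamped between a minimum and a maximum (taskmanager.memory.jvm-overhead.fraction, .min and .max). The defaults below (0.1, 192 MB, 1024 MB) are the ones I remember from the documentation, so please verify them for your version. This is only one possible margin and not necessarily the one [1] describes.

```python
def jvm_overhead_mb(process_size_mb: float,
                    fraction: float = 0.1,
                    min_mb: float = 192.0,
                    max_mb: float = 1024.0) -> float:
    """Estimate the JVM overhead in MB: fraction of the process size,
    clamped to the range [min_mb, max_mb].

    The default values are assumptions taken from memory of the Flink docs.
    """
    return min(max_mb, max(min_mb, process_size_mb * fraction))


# Compare the assumed default fraction with a larger one for a
# hypothetical 4096 MB TaskManager process size.
default_margin = jvm_overhead_mb(4096)
larger_margin = jvm_overhead_mb(4096, fraction=0.2)
print(f"default: {default_margin:.1f} MB, fraction=0.2: {larger_margin:.1f} MB")
```

Raising the fraction leaves more of the container for native allocations that Flink does not track, at the cost of less memory for heap and managed memory within the same taskmanager.memory.process.size.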


--

Arvid Heise | Senior Java Developer

