I have a Flink stream job that reads data from Kafka and writes it to S3. This job keeps failing after running for 2-3 days.
I can't find anything in the logs that explains why it fails. Can somebody help me find the cause of the failure? This is all I can see in the logs:

org.apache.flink.streaming.api.functions.sink.filesystem.Buckets [] - Subtask 7 received completion notification for checkpoint with id=608.
2020-12-09 16:41:56,110 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2020-12-09 16:41:56,111 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2020-12-09 16:41:56,111 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2020-12-09 16:41:56,111 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /mnt1/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-fd5d7eae-bff7-4d74-89d8-0a40f174b7b8
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /mnt2/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-c5833412-5944-4b41-a502-5d952f5156af
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt1/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-e290b3fd-9110-47c4-9463-1bd08003afc9
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /mnt3/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-bf0d69fe-0f00-4483-8b20-0056a049f86b
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt2/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-55b8467d-8c16-441a-83d6-393462a0b4ca
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /mnt/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-dist-cache-8bc77a7c-f62b-4f06-b963-41f174a0db8e
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt3/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-bf57f8db-0152-4697-b743-d07b4e46c9d7
2020-12-09 16:41:56,115 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-io-d650ed68-9c44-45b9-9b41-d501152b3f0f
2020-12-09 16:41:56,120 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt1/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-9311e006-fee0-4317-9355-5d981c558a08
2020-12-09 16:41:56,120 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt2/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-c633cf9f-8220-433a-8f3e-04d45e81efde
2020-12-09 16:41:56,120 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt3/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-671eda78-9981-4f6d-bff4-25cca973d76d
2020-12-09 16:41:56,120 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /mnt/yarn/usercache/hadoop/appcache/application_1603267081962_94843/flink-netty-shuffle-5efa7701-72da-4d91-b9f7-7e6963ffefdb
End of LogType:taskmanager.log
********************************************************************************
End of LogType:taskmanager.out
********************************************************************************
Hi Anuj,

SIGNAL 15 (SIGTERM) in the TaskManager log means that the process was killed by something external to Flink. Check the YARN logs for a specific error. Usually, YARN kills a container (exit code 143) when it goes over its memory limits. This is something the community is constantly improving, but it may still happen because of the various types of memory that are allocated (in particular native memory). Please recheck [1] to see how you can increase some safety margins.

On Wed, Dec 9, 2020 at 6:25 PM aj <[hidden email]> wrote:
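As a small illustration of why exit code 143 and SIGNAL 15 point at the same event: by the usual POSIX shell convention, a process terminated by signal N is reported with exit code 128 + N. The sketch below decodes such a code; the helper name describe_exit_code is hypothetical and not part of Flink or YARN, and it only explains the number, it does not tell you who sent the signal.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention.

    Hypothetical helper for illustration only; not a Flink or YARN API.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name, e.g. 15 -> SIGTERM.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    # 143 - 128 = 15, the SIGTERM seen in the TaskManager log.
    print(describe_exit_code(143))
```

The signal itself says nothing about the reason for the kill, so the YARN logs for the application are still the place to look for it.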
-- Arvid Heise | Senior Java Developer | Ververica GmbH