Hi everyone,
I have a strange issue with Flink logging. I use a pretty much standard log4j config, which writes to standard output so that I can see the logs in the Flink GUI. Deployment is on YARN in job mode. I can see the logs in the UI, no problem.

On the servers where the Flink YARN containers run, there is a disk quota on the partition where YARN normally creates logs. I see no specific files in the application_xx directory, but the free space on the disk keeps decreasing over time, and after several weeks we eventually hit the quota. It looks as if some file or pipe is created and never closed, so it still reserves the space. When I restart the Flink job, the space is returned immediately. I'm sure the Flink job is the problem: I have reproduced the issue on a cluster where only one Flink job was running.

Below is my log4j config. Any help or idea is appreciated.

Thanks in advance,
Maxim.

-------------------------------------------
# This affects logging for both user code and Flink
log4j.rootLogger=INFO, file, stderr

# Uncomment this if you want to _only_ change Flink's logging
#log4j.logger.org.apache.flink=INFO

# The following lines keep the log level of common libraries/connectors on
# log level INFO. The root logger does not override this. You have to manually
# change the log levels here.
log4j.logger.akka=INFO
log4j.logger.org.apache.kafka=INFO
log4j.logger.org.apache.hadoop=INFO
log4j.logger.org.apache.zookeeper=INFO

# Log all infos in the given file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

# Suppress the irrelevant (wrong) warnings from the Netty channel handler
log4j.logger.org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file
|
Hi Maxim,

First, I want to confirm that you have checked all of the directories in "yarn.nodemanager.log-dirs". If you can access the logs in the Flink web UI, the log files (e.g. taskmanager.log, taskmanager.out, taskmanager.err) must exist somewhere, so I suggest double-checking each of the log-dirs.

The *.out and *.err files do not roll, so if you print user logs to stdout/stderr, those two files will grow over time. When you stop the Flink application, YARN cleans up all the jars and logs, which is why the disk space comes back.

Best,
Yang

On Thu, Jul 30, 2020 at 10:00 PM Maxim Parkachov <[hidden email]> wrote:
|
Hi Yang,

You are right. Since then I have looked for open files and found *.out/*.err files on that partition, and as you mentioned, they don't roll. I could implement a workaround and restart the streaming job every week or so, but I really don't want to go that way. I tried forwarding the logs to files, which I could then roll, but then I don't see the logs in the GUI. So my question is: how can I make them roll?

Regards,
Maxim.

On Tue, Aug 4, 2020 at 4:48 AM Yang Wang <[hidden email]> wrote:
|
AFAIK, there is no way to roll the *.out/*.err files unless we hijack stdout/stderr in the Flink code, and that would only be a temporary hack. A better way is to write your logs to separate files that can roll via log4j. If you want to access them in the Flink web UI, upgrade to version 1.11; you will then find a "Log List" tab under the JobManager sidebar.

Best,
Yang

On Tue, Aug 4, 2020 at 2:52 PM Maxim Parkachov <[hidden email]> wrote:
|
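For reference, Yang's suggestion above (write logs to separate files that roll via log4j) could look roughly like the fragment below. This is only a sketch, assuming the log4j 1.x properties format used in the original post. The 100MB size cap and the backup count of 10 are illustrative values that do not come from this thread and should be sized against the disk quota. Flink 1.11 uses Log4j 2 by default, so the property syntax will differ after the upgrade.

```properties
# Send log4j output to the rolling file only (no stderr appender),
# so the non-rolling *.err file does not grow from log4j output.
log4j.rootLogger=INFO, file

# Roll the log file by size instead of letting it grow without bound.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.append=true
# Illustrative limits (assumptions, not from the thread).
log4j.appender.file.MaxFileSize=100MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```

With these values the appender keeps at most the active file plus 10 backups, so roughly 1.1 GB per container. This only bounds what goes through log4j; anything user code prints directly to stdout/stderr still lands in the *.out/*.err files, which do not roll.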
Hi Yang,

Thanks for your advice, now I have a good reason to upgrade to 1.11.

Regards,
Maxim.

On Tue, Aug 4, 2020 at 9:39 AM Yang Wang <[hidden email]> wrote: