Flink streaming job logging reserves space

Maxim Parkachov
Hi everyone,

I have a strange issue with Flink logging. I use a fairly standard log4j config that writes to standard output so that I can see the logs in the Flink GUI. The deployment is on YARN in job mode. I can see the logs in the UI without problems. On the servers where the Flink YARN containers run, there is a disk quota on the partition where YARN normally creates logs. I see no specific files in the application_xx directory, but the free space on the disk decreases over time, and after several weeks we eventually hit the quota. It looks as if some file or pipe is created and never closed, so it keeps reserving space. After I restart the Flink job, the space is returned immediately. I'm sure the Flink job is the cause: I have reproduced the issue on a cluster where only one Flink job was running. Below is my log4j config. Any help or ideas are appreciated.

Thanks in advance,
Maxim.
-------------------------------------------
# This affects logging for both user code and Flink
log4j.rootLogger=INFO, file, stderr

# Uncomment this if you want to _only_ change Flink's logging
#log4j.logger.org.apache.flink=INFO

# The following lines keep the log level of common libraries/connectors on
# log level INFO. The root logger does not override this. You have to manually
# change the log levels here.
log4j.logger.akka=INFO
log4j.logger.org.apache.kafka=INFO
log4j.logger.org.apache.hadoop=INFO
log4j.logger.org.apache.zookeeper=INFO

# Log all infos in the given file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

# Suppress the irrelevant (wrong) warnings from the Netty channel handler
log4j.logger.org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file

Re: Flink streaming job logging reserves space

Yang Wang
Hi Maxim,

First, I want to confirm that you have checked all of the directories in "yarn.nodemanager.log-dirs". If you
can access the logs in the Flink web UI, the log files (e.g. taskmanager.log, taskmanager.out, taskmanager.err)
should exist. I suggest double-checking each of the log-dirs.

Since the *.out/*.err files do not roll, if you print user logs to stdout/stderr, these two files will grow
over time.

When you stop the Flink application, YARN cleans up all the jars and logs, which is why the disk space is freed again.
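
To make the stdout/stderr mechanism concrete: the posted config lists a `stderr` appender in `log4j.rootLogger` but does not show its definition. The fragment below is a hypothetical log4j 1.x definition of such an appender, not the poster's actual one. Everything a console appender like this emits ends up in the container's non-rolling *.err file. The `Threshold` line is an added assumption showing one way to slow that growth.

```properties
# Hypothetical definition of the "stderr" appender named in
# log4j.rootLogger=INFO, file, stderr (it is not shown in the original post).
log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.Target=System.err
# Assumption for illustration: pass only WARN and above to stderr, so the
# non-rolling *.err file grows more slowly than with INFO.
log4j.appender.stderr.Threshold=WARN
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```

With a threshold like this, INFO-level messages would still reach the `file` appender but no longer the *.err file. The file keeps growing, only more slowly, so this is a mitigation rather than a fix.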


Best,
Yang

In reply to Maxim Parkachov (Thu, Jul 30, 2020 at 10:00 PM)

Re: Flink streaming job logging reserves space

Maxim Parkachov
Hi Yang,

You are right. Since then, I looked for open files and found *.out/*.err files on that partition, and as you mentioned, they don't roll.
I could implement a workaround that restarts the streaming job every week or so, but I really don't want to go that way.

I tried forwarding the logs to files, which I could then roll, but then I don't see the logs in the GUI.

So my question is: how can I make them roll?

Regards,
Maxim.

In reply to Yang Wang (Tue, Aug 4, 2020 at 4:48 AM)

Re: Flink streaming job logging reserves space

Yang Wang
AFAIK, there is no way to roll the *.out/*.err files unless we hijack stdout/stderr in the Flink code, and that would only be a temporary hack.

A better way is to write your logs to separate files that can roll via log4j. If you want to access them in the Flink web UI,
upgrade to version 1.11. You will then find a "Log List" tab under the JobManager sidebar.
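
As a rough sketch of the "separate files that can roll via log4j" suggestion, the fragment below swaps the `FileAppender` from the original post for log4j 1.x's `RollingFileAppender` and drops the console appender from the root logger. The size and backup count are illustrative assumptions, not recommended values. Flink 1.11 appears to ship Log4j 2 by default, so after an upgrade the same idea would likely need to be written in Log4j 2 syntax.

```properties
# Sketch only (log4j 1.x syntax): size-based rolling instead of a single
# ever-growing file. Nothing is routed to stdout/stderr any more.
log4j.rootLogger=INFO, file

log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=${log.file}
log4j.appender.file.Append=true
# Roll once the active file reaches this size (illustrative value)
log4j.appender.file.MaxFileSize=100MB
# Keep at most this many rolled files: ${log.file}.1 ... ${log.file}.5
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```

With these example values, the log files of one container would be bounded at roughly (5 + 1) x 100 MB, which makes it possible to size the appender against the disk quota.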


Best,
Yang

In reply to Maxim Parkachov (Tue, Aug 4, 2020 at 2:52 PM)

Re: Flink streaming job logging reserves space

Maxim Parkachov
Hi Yang,

Thanks for your advice, now I have a good reason to upgrade to 1.11.

Regards,
Maxim.

In reply to Yang Wang (Tue, Aug 4, 2020 at 9:39 AM)