Hi,

After running for a while, my job manager holds thousands of CLOSE_WAIT TCP connections to HDFS datanodes. The number grows slowly and will likely hit the max open file limit. My jobs checkpoint to HDFS every minute.

If I run lsof -i -a -p $JMPID, I get tons of output like the following:

java 9433 iot 408u IPv4 4060901898 0t0 TCP jmHost:17922->datanode:50010 (CLOSE_WAIT)
java 9433 iot 409u IPv4 4061478455 0t0 TCP jmHost:52854->datanode:50010 (CLOSE_WAIT)
java 9433 iot 410r IPv4 4063170767 0t0 TCP jmHost:49384->datanode:50010 (CLOSE_WAIT)
java 9433 iot 411w IPv4 4063188376 0t0 TCP jmHost:50516->datanode:50010 (CLOSE_WAIT)
java 9433 iot 412u IPv4 4061459881 0t0 TCP jmHost:51651->datanode:50010 (CLOSE_WAIT)
java 9433 iot 413u IPv4 4063737603 0t0 TCP jmHost:31318->datanode:50010 (CLOSE_WAIT)
java 9433 iot 414w IPv4 4062030625 0t0 TCP jmHost:34033->datanode:50010 (CLOSE_WAIT)
java 9433 iot 415u IPv4 4062049134 0t0 TCP jmHost:35156->datanode:50010 (CLOSE_WAIT)
java 9433 iot 416u IPv4 4062615550 0t0 TCP jmHost:16962->datanode:50010 (CLOSE_WAIT)
java 9433 iot 417r IPv4 4063757056 0t0 TCP jmHost:32553->datanode:50010 (CLOSE_WAIT)
java 9433 iot 418w IPv4 4064304789 0t0 TCP jmHost:13375->datanode:50010 (CLOSE_WAIT)
java 9433 iot 419u IPv4 4062599328 0t0 TCP jmHost:15915->datanode:50010 (CLOSE_WAIT)
java 9433 iot 420w IPv4 4065462963 0t0 TCP jmHost:30432->datanode:50010 (CLOSE_WAIT)
java 9433 iot 421u IPv4 4067178257 0t0 TCP jmHost:28334->datanode:50010 (CLOSE_WAIT)
java 9433 iot 422u IPv4 4066022066 0t0 TCP jmHost:11843->datanode:50010 (CLOSE_WAIT)

I know restarting the job manager should clean up those connections, but I wonder if there is any better solution? By the way, I am using Flink 1.4.0 and running a standalone cluster.

Thanks,
Youjun
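A rough way to confirm the leak is still growing is to poll the count of CLOSE_WAIT sockets held by the JobManager process, for example (a sketch, assuming $JMPID holds the JobManager PID and the datanodes listen on port 50010 as in the output above):

    # count CLOSE_WAIT sockets held by the JobManager process
    lsof -i -a -p $JMPID | grep -c CLOSE_WAIT
    # alternative count with ss, filtered to the datanode port
    ss -tan '( dport = :50010 )' | grep -c CLOSE-WAIT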
Hi Youjun,

How long has your job been running? As far as I know, checkpointing alone should not cause the JobManager to open so many connections to HDFS in a short time. What is your Flink cluster environment, standalone or Flink on YARN? In addition, does the JM log show any timeout information? Have any checkpoints timed out? If you can provide more information, it will help locate the problem.

Thanks,
vino.
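One way to check for timed-out checkpoints is to search the JobManager log, for example (a sketch, assuming a standalone setup with logs under log/; the exact message wording can differ between Flink versions):

    # checkpoints that hit their timeout are reported by the CheckpointCoordinator
    grep -i "expired before completing" log/flink-*-jobmanager-*.log
    # broader sweep for checkpoint-related warnings
    grep -i "checkpoint" log/flink-*-jobmanager-*.log | grep -iE "timeout|expire|decline|fail"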
Hi Youjun,

You can check whether there is any real data transfer on these connections. My guess is that there is a connection leak here, and if so, it is a bug. On the other hand, version 1.4 is a bit old; can you check whether the same problem exists on 1.5 or 1.6? I suggest you create an issue in JIRA, where you may get more feedback.

Regarding your question about how to force these connections to be closed: if you have configured HA mode and checkpoints are enabled for the job, you can try shutting down the JM leader, letting ZooKeeper run a new leader election so that the JM switches over. But please be cautious with this process. A safer approach is to first execute cancel with savepoint on all jobs, and then switch the JM.

Thanks,
vino.
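To see whether any of these sockets still carry data, one option is to look at the per-socket queue counters, for example (a sketch; in CLOSE_WAIT a non-zero Recv-Q generally means data was left unread when the remote side closed):

    # Recv-Q / Send-Q per connection to the datanode port
    ss -tan '( dport = :50010 )'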
> A safer approach is to first execute cancel with savepoint on all jobs, and then switch the JM.

This sounds great!

Thanks,
Youjun
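For reference, the cancel-with-savepoint step in the 1.4 CLI looks roughly like this (a sketch; the HDFS savepoint directory, job id, and job jar are example placeholders to adjust for your setup):

    # take a savepoint and cancel the job in one step
    bin/flink cancel -s hdfs:///flink/savepoints <jobId>
    # after switching the JM, resume the job from the reported savepoint path
    bin/flink run -s <savepointPath> your-job.jar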