Hello,
I am struggling with a standalone HA Flink configuration (on real servers, not on localhost).

Setup:
3x flink-masters [flinkmaster1, flinkmaster2, flinkmaster3]
3x flink-workers [flinkworker1, flinkworker2, flinkworker3]
_______________
flink-conf.yaml:
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 6
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
env.java.opts: "-Djava.library.path=/usr/local/lib"
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs://hdnn01:8020/flink/flink-checkpoints
fs.hdfs.hadoopconf: /etc/hadoop/conf/
recovery.mode: zookeeper
recovery.zookeeper.quorum: zoo01:2181,zoo02:2181,zoo03:2181,zoo04:2181,zoo05:2181
recovery.zookeeper.storageDir: hdfs://hdnn01:8020/flink/flink-recovery

The cluster started without errors and I ran sample jobs seamlessly. Then I tested HA (I rebooted the leader flinkmaster) and ... all of the jobs froze (probably because I had assumed that HA mode ignores the jobmanager.rpc.address option, which at that time pointed to flinkmaster1). So I changed it to localhost and started the Flink cluster again. Now the cluster starts without errors, but when I try to run the same sample jobs, a few of them crash with DataStreamer Exception errors:

2016-09-19 13:00:37,941 WARN org.apache.hadoop.hdfs.DFSClient - DataStreamer Exception
java.io.FileNotFoundException: ID mismatch. Request id and saved id: 1403418 , 1403419 for file /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
....
2016-09-19 13:00:37,951 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.FileNotFoundException: ID mismatch. Request id and saved id: 1403418 , 1403419 for file /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
...
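As a sanity check on the HA-related settings in the flink-conf.yaml above, a small script along the following lines could confirm that the three recovery.* keys used in this setup are present and non-empty in each node's copy of the file. This is only a sketch: the names `parse_flink_conf`, `missing_ha_keys` and `REQUIRED_HA_KEYS` are hypothetical, and the parser assumes the flat `key: value` layout shown in this post rather than full YAML.

```python
# Hypothetical helper, not part of Flink. Assumes flat "key: value" lines
# as in the flink-conf.yaml quoted in this post (no nested YAML).
REQUIRED_HA_KEYS = (
    "recovery.mode",
    "recovery.zookeeper.quorum",
    "recovery.zookeeper.storageDir",
)


def parse_flink_conf(text):
    """Return a dict of key -> value, skipping blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first ':' only, so values such as
        # "zoo01:2181,zoo02:2181" or "hdfs://host:8020/path" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """List the recovery.* keys that are absent or empty in the parsed config."""
    return [key for key in REQUIRED_HA_KEYS if not conf.get(key)]
```

Running `missing_ha_keys(parse_flink_conf(open("conf/flink-conf.yaml").read()))` on every master and worker (the path is an assumption) and comparing the parsed values between hosts would show whether all nodes point to the same ZooKeeper quorum and storage directory.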
Please find more here: http://pastebin.com/AzYwbmyS

I use a snakebite HDFS client (stat, tail, ls, touchz) to check whether the HDFS path is writable:
______________________________________
HDFS access check from flinkmasters:
[flink@flinkmaster01 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r-- 3 flink hdfs 10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
[flink@flinkmaster02 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r-- 3 flink hdfs 10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
[flink@flinkmaster03 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r-- 3 flink hdfs 10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
______________________________________
HDFS access check from flinkworkers:
[flink@flinkworker01 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r-- 3 flink hdfs 10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
[flink@flinkworker02 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r-- 3 flink hdfs 10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
[flink@flinkworker03 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r-- 3 flink hdfs 10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

1) Why does it need the file blob_ad21c112f0f6698cf976002afd75a44c9d48e68f, even though I restarted the Flink cluster with a new HDFS path, /flink/flink-recovery2?
2) Why is there nothing in:
[zk: localhost:2181(CONNECTED) 42] ls /flink/default/leader
[]
3) How do I add a failed flinkmasterX back to the HA cluster (example: after a reboot of the flinkmasterX server, which was part of the HA cluster)?
4) And basically, what is wrong? :)

Of course, in a standalone Flink cluster without HA (single master) everything works OK.
_________
Regards,
Jan