Standalone HA cluster - 1.1.2


jkogut
Hello,

I am struggling with a standalone HA Flink configuration (on real servers, not on localhost).

Setup:

3x flink-masters [flinkmaster1, flinkmaster2, flinkmaster3]
3x flink-workers [flinkworker1, flinkworker2, flinkworker3]

_______________
flink-conf.yaml:

jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 6
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
env.java.opts: "-Djava.library.path=/usr/local/lib"

state.backend: filesystem
state.backend.fs.checkpointdir: hdfs://hdnn01:8020/flink/flink-checkpoints
fs.hdfs.hadoopconf: /etc/hadoop/conf/

recovery.mode: zookeeper
recovery.zookeeper.quorum: zoo01:2181,zoo02:2181,zoo03:2181,zoo04:2181,zoo05:2181
recovery.zookeeper.storageDir: hdfs://hdnn01:8020/flink/flink-recovery
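
For completeness: as I understand it, standalone HA also requires the conf/masters file listing every JobManager, and the startup scripts use it (rather than jobmanager.rpc.address) to launch the masters. Assuming the default web UI port 8081 on each host, for this setup it would look like:

_______________
conf/masters:

flinkmaster1:8081
flinkmaster2:8081
flinkmaster3:8081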


The cluster started without errors and I ran sample jobs seamlessly.
Then I conducted HA tests (I rebooted the leading flinkmaster) and ... all of the jobs froze (probably because of jobmanager.rpc.address, which then pointed to flinkmaster1; I had assumed that in HA mode this option is ignored). So I changed it to localhost and started the flink cluster again.
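As far as I understand, instead of rebooting the whole server one can also force a failover by stopping and later restarting just the JobManager process on the leader; a sketch, assuming the standard scripts in Flink's bin/ directory (exact arguments may differ between versions):

[flink@flinkmaster01 ~]$ bin/jobmanager.sh stop
[flink@flinkmaster01 ~]$ bin/jobmanager.sh start cluster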

Now the flink cluster starts without errors, but when I try to run the same sample jobs, a few of them crash with DataStreamer exceptions:
 
2016-09-19 13:00:37,941 WARN  org.apache.hadoop.hdfs.DFSClient                              - DataStreamer Exception
java.io.FileNotFoundException: ID mismatch. Request id and saved id: 1403418 , 1403419 for file /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

....

2016-09-19 13:00:37,951 ERROR org.apache.flink.runtime.blob.BlobServerConnection            - PUT operation failed
java.io.FileNotFoundException: ID mismatch. Request id and saved id: 1403418 , 1403419 for file /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

...

Please find more here: http://pastebin.com/AzYwbmyS


I use the snakebite HDFS client (stat, tail, ls, touchz) to check whether the HDFS path is accessible and writable (ls checks from all masters and workers below; the write test follows them):
______________________________________
HDFS access check from flinkmasters:

[flink@flinkmaster01 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r--   3 flink      hdfs            10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

[flink@flinkmaster02 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r--   3 flink      hdfs            10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

[flink@flinkmaster03 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r--   3 flink      hdfs            10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

______________________________________
HDFS access check from flinkworkers:

[flink@flinkworker01 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r--   3 flink      hdfs            10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

[flink@flinkworker02 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r--   3 flink      hdfs            10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f

[flink@flinkworker03 ~]$ snakebite ls /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
Found 1 items
-rw-r--r--   3 flink      hdfs            10030 2016-09-19 13:00 /flink/flink-recovery/blob/cache/blob_ad21c112f0f6698cf976002afd75a44c9d48e68f
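
The write test itself was along these lines (write_test is just a scratch file name for illustration; snakebite's touchz/rm pair confirms create and delete permissions):

______________________________________
HDFS write check:

[flink@flinkmaster01 ~]$ snakebite touchz /flink/flink-recovery/write_test
[flink@flinkmaster01 ~]$ snakebite rm /flink/flink-recovery/write_test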



1) Why does it need the file blob_ad21c112f0f6698cf976002afd75a44c9d48e68f at all, even though I restarted the flink cluster with a new HDFS path, /flink/flink-recovery2?
2) Why is there nothing under the leader znode in ZooKeeper (see the zkCli sketch after this list)?
[zk: localhost:2181(CONNECTED) 42] ls /flink/default/leader                                                            
[]
3) How do I add a failed flinkmasterX back to the HA cluster? (example: after a reboot of a flinkmasterX server that was part of the HA cluster)

4) And basically, what is wrong? :)
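
For question 2, what I mean is inspecting the znodes around the leader path with zkCli, along these lines (the exact znode layout under the root may differ per setup):

[zk: localhost:2181(CONNECTED) 43] ls /flink
[zk: localhost:2181(CONNECTED) 44] ls /flink/default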


Of course, in a standalone flink cluster without HA (a single master) everything works OK.

_________
Regards,
Jan