Unable to start Flink HA cluster with Zookeeper

mozer
I am trying to install a Flink HA cluster (Zookeeper mode) but the task
manager cannot find the job manager.

Here is the architecture:

    - Machine 1 : Job Manager + Zookeeper
    - Machine 2 : Task Manager

masters:

    Machine1

slaves:

    Machine2

flink-conf.yaml:

    #jobmanager.rpc.address: localhost
    jobmanager.rpc.port: 6123
    blob.server.port: 50100-50200
    taskmanager.data.port: 6121
    high-availability: zookeeper
    high-availability.zookeeper.quorum: Machine1:2181
    high-availability.zookeeper.path.root: /flink-1.5.1
    high-availability.cluster-id: /default_b
    high-availability.storageDir: file:///shareflink/recovery

Here is the Task Manager log; it tries to connect to localhost
instead of Machine1:

    2018-08-17 10:46:44,875 INFO
org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to
select the network interface and address to use by connecting to the leading
JobManager.
    2018-08-17 10:46:44,876 INFO
org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager
will try to connect for 10000 milliseconds before falling back to heuristics
    2018-08-17 10:46:44,966 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved
new target address /127.0.0.1:37133.
    2018-08-17 10:46:45,324 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Trying to
connect to address /127.0.0.1:37133
    2018-08-17 10:46:45,325 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address 'Machine2/IP-Machine2': Connection refused
    2018-08-17 10:46:45,325 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address '/127.0.0.1': Connection refused
    2018-08-17 10:46:45,325 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address '/IP_Machine2': Connection refused
    2018-08-17 10:46:45,325 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address '/127.0.0.1': Connection refused
    2018-08-17 10:46:45,326 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address '/IP_Machine2': Connection refused
    2018-08-17 10:46:45,326 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address '/127.0.0.1': Connection refused
    2018-08-17 10:46:45,726 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Trying to
connect to address /127.0.0.1:37133
    2018-08-17 10:46:45,727 INFO
org.apache.flink.runtime.net.ConnectionUtils                  - Failed to
connect from address 'Machine2/IP-Machine2
   
    2018-08-17 10:47:22,022 WARN  akka.remote.ReliableDeliverySupervisor                      
- Association with remote system [akka.tcp://flink@127.0.0.1:36515] has
failed, address is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@127.0.0.1:36515]] Caused by: [Connection refused:
/127.0.0.1:36515]

    2018-08-17 10:47:22,022 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
resolve ResourceManager address
akka.tcp://flink@127.0.0.1:36515/user/resourcemanager, retrying in 10000 ms:
Could not connect to rpc endpoint under address
akka.tcp://flink@127.0.0.1:36515/user/resourcemanager..
    2018-08-17 10:47:32,037 WARN  akka.remote.transport.netty.NettyTransport                  
- Remote connection to [null] failed with java.net.ConnectException:
Connection refused: /127.0.0.1:36515



P.S.: **/etc/hosts** contains entries for **localhost, Machine1 and Machine2**.


Can you please tell me how the Task Manager can connect to the Job Manager?

Regards





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Unable to start Flink HA cluster with Zookeeper

miki haiat
First of all, try with the FQDN or the full IP.
Also, to run an HA cluster you need to make sure that you have passwordless SSH access between the master and the slaves.

On Tue, Aug 21, 2018 at 4:15 PM mozer <[hidden email]> wrote:

Re: Unable to start Flink HA cluster with Zookeeper

mozer
FQDN or full IP: I tried both, still no change.
As for SSH, I can connect to each machine without a password.


Do you think the problem could come from:

*high-availability.storageDir: file:///shareflink/recovery* ?

I don't use HDFS storage but a NAS file system that is shared by the two
machines.

I also added:

    state.backend: filesystem
    state.checkpoints.fs.dir: file:///shareflink/recovery/checkpoint
    blob.storage.directory: file:///shareflink/recovery/blob

ZooKeeper log:

2018-08-21 14:59:32,652 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer
- tickTime set to 2000
2018-08-21 14:59:32,653 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer
- minSessionTimeout set to -1
2018-08-21 14:59:32,653 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer
- maxSessionTimeout set to -1
2018-08-21 14:59:32,661 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory
- binding to port 0.0.0.0/0.0.0.0:2181
2018-08-21 14:59:39,940 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory
- Accepted socket connection from /Machine1:60186
2018-08-21 14:59:40,015 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory
- Accepted socket connection from /Machine2:54466
2018-08-21 14:59:40,017 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer
- Client attempting to establish new session at /Machine1:60186
2018-08-21 14:59:40,017 INFO
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer
- Client attempting to establish new session at /Machine2:54466

Job Manager log:

2018-08-21 14:59:39,327 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to
start actor system at 127.0.0.1:50101
2018-08-21 14:59:39,723 INFO  akka.event.slf4j.Slf4jLogger                                
- Slf4jLogger started
2018-08-21 14:59:39,766 INFO  akka.remote.Remoting                                        
- Starting remoting
2018-08-21 14:59:39,859 INFO  akka.remote.Remoting                                        
- Remoting started; listening on addresses
:[akka.tcp://flink@127.0.0.1:50101]
2018-08-21 14:59:39,865 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system
started at akka.tcp://flink@127.0.0.1:50101
2018-08-21 14:59:39,872 INFO
org.apache.flink.runtime.blob.FileSystemBlobStore             - Creating
highly available BLOB storage directory at
file:///shareflink/recovery///blob
2018-08-21 14:59:39,876 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                
- Enforcing default ACL for ZK connections
2018-08-21 14:59:39,876 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                
- Using '/usr/flink-1.5.1/' as Zookeeper namespace.
2018-08-21 14:59:39,919 INFO
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
- Starting






Re: Unable to start Flink HA cluster with Zookeeper

Dawid Wysakowicz
Hi,
In your case the jobmanager binds itself to localhost, and that is the address it writes to ZooKeeper. Try starting the jobmanager manually with jobmanager.rpc.address set to the IP of the machine the jobmanager runs on. In other words, make sure the jobmanager binds itself to the right IP.

Regards
Dawid
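
A minimal sketch of what this suggestion might look like in flink-conf.yaml on Machine1, reusing the HA settings from the original post. It assumes that Machine1 resolves to a non-loopback address on both machines; the machine's IP could be used in place of the hostname.

    # Address the jobmanager binds to and publishes to ZooKeeper
    # (assumption: Machine1 does not resolve to 127.0.0.1)
    jobmanager.rpc.address: Machine1
    high-availability: zookeeper
    high-availability.zookeeper.quorum: Machine1:2181
    high-availability.zookeeper.path.root: /flink-1.5.1
    high-availability.cluster-id: /default_b
    high-availability.storageDir: file:///shareflink/recovery

If the setting takes effect, the jobmanager log should report the actor system starting on Machine1's address rather than on 127.0.0.1.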

On Tue, 21 Aug 2018 at 15:32, mozer <[hidden email]> wrote:

Re: Unable to start Flink HA cluster with Zookeeper

mozer
Yeah, you are right. I have already tried setting jobmanager.rpc.address and
it works in that case, but if I use this setting I will not be able to use
HA, am I right?
How can the job manager register with ZooKeeper using the right address
instead of localhost?






Re: Unable to start Flink HA cluster with Zookeeper

Dawid Wysakowicz
Hi,
It will use the HA settings as long as you specify high-availability: zookeeper. The jobmanager.rpc.address is used by the jobmanager as its binding address. You can verify this by starting two jobmanagers and then killing the leader.
Best,
Dawid
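
A sketch of how the two-jobmanager check could be laid out. Machine3 is a hypothetical second jobmanager host that does not appear in this thread; each jobmanager gets its own jobmanager.rpc.address, while the high-availability settings are assumed to be identical on both machines.

    # flink-conf.yaml on Machine1
    jobmanager.rpc.address: Machine1
    high-availability: zookeeper
    high-availability.zookeeper.quorum: Machine1:2181
    high-availability.storageDir: file:///shareflink/recovery

    # flink-conf.yaml on Machine3 (hypothetical second jobmanager)
    jobmanager.rpc.address: Machine3
    high-availability: zookeeper
    high-availability.zookeeper.quorum: Machine1:2181
    high-availability.storageDir: file:///shareflink/recovery

With both jobmanagers running, killing the leader should make the task manager reconnect to the remaining jobmanager once it is elected, because the leader address is retrieved from ZooKeeper rather than from jobmanager.rpc.address.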

On Tue, 21 Aug 2018 at 17:46, mozer <[hidden email]> wrote:

Re: Unable to start Flink HA cluster with Zookeeper

mozer
Thanks for the info. I have managed to launch an HA cluster by adding
jobmanager.rpc.address for all job managers.
But it did not work with start-cluster.sh; I had to add them one by one.




