ResourceManager not using correct akka URI in standalone cluster (?)


AJ Heller
I'm running a standalone cluster on Amazon EC2. According to the logs, leader election is happening, and the Flink Dashboard is up, running, and accessible remotely. The issue I'm having is that the SocketWordCount example is not working: the local connection is being refused!

In the Flink Dashboard, 0 task managers are being reported, and in the jobmanager logs, the last line indicates "leader session null". All other akka URIs in the log file begin with "akka.tcp://flink@PUBLIC_IP/...", but the ResourceManager URI is "akka://flink/...".
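
For readers comparing the two URI forms: in Akka's actor path syntax, a path that carries an `@host` part is a remote address, while a path without it refers to an actor inside the local actor system, so the shorter form may not be an error in itself. The sketch below only illustrates that distinction. The function name, the actor name `example`, and the IP `203.0.113.10` (a documentation placeholder) are assumptions, not values taken from the logs.

```python
from urllib.parse import urlsplit


def describe_akka_path(path: str) -> dict:
    """Split an Akka actor path into protocol, actor system, and optional host/port."""
    parts = urlsplit(path)
    # Remote form: akka.tcp://system@host:port/...   Local form: akka://system/...
    system, _, address = parts.netloc.rpartition("@")
    if not system:
        # No "@" present: the whole authority is the actor system name (local path).
        system, address = address, ""
    host, _, port = address.partition(":")
    return {
        "protocol": parts.scheme,
        "system": system,
        "host": host or None,
        "port": int(port) if port else None,
        "remote": bool(host),
    }


if __name__ == "__main__":
    # Hypothetical paths, shaped like the two forms seen in the jobmanager log.
    print(describe_akka_path("akka.tcp://flink@203.0.113.10:6123/user/example"))
    print(describe_akka_path("akka://flink/user/example"))
```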


jobmanager log:
http://pastebin.com/VWJM8XvW

client log:
http://pastebin.com/ZrWsbcwa

The master and slave files are populated with public IPs as well.

Re: ResourceManager not using correct akka URI in standalone cluster (?)

AJ Heller
More information:

From the master node, I can neither `telnet localhost 6123` nor `telnet <PUBLIC IP> 6123` while the cluster is apparently running: the connection is refused immediately. `netstat -n | grep 6123` is empty. There's no server listening, but the processes are running on all machines.
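
The `telnet` check above can also be scripted. This is a minimal sketch, assuming only the Python standard library; the function name `port_is_open` and the 3-second timeout are illustrative choices, and port 6123 is the one mentioned in this thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the same TCP handshake that telnet attempts.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("localhost", 6123)` on the master node would be expected to return `False` in the situation described here, since the connection is refused.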

Does it matter that I don't have Hadoop or HDFS installed? It is optional, right? To be clear, this fails at startup, long before I'm able to run any job.

On Amazon EC2, the machines know of their private IPs, but not their public IPs. I've instructed the cluster to operate over the public network because I couldn't get the private IP scenario working.

Running `./bin/start-local.sh` shows non-zero counts in the Flink Dashboard. Cluster setups show zero-counts all around.

-aj

On Thu, Sep 15, 2016 at 12:41 PM, AJ Heller <[hidden email]> wrote:
> [...]


Re: ResourceManager not using correct akka URI in standalone cluster (?)

Till Rohrmann
Hi,

could you check what happened to your TaskManagers in the logs? There seems to be a problem with the connection of the TMs to the JM.

You're right that you don't strictly need HDFS to run a Flink job as long as you don't want to access HDFS data or write to HDFS.

`netstat -atn` should list all TCP sockets currently in use. A socket bound to port 6123 should be among them.
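
Note that without `-a`, `netstat` leaves out listening sockets, so `netstat -n | grep 6123` can come back empty even when a server is bound to the port. As a rough sketch, the `-atn` output can be filtered for listeners as below. It assumes the column layout printed by the Linux net-tools `netstat` (protocol, two queue columns, local address, foreign address, state); both function names are hypothetical.

```python
import subprocess
from typing import List


def listening_addresses(netstat_output: str, port: int) -> List[str]:
    """Return local addresses in LISTEN state on the given port."""
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local address, foreign address, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips header lines and non-TCP entries
        local, state = fields[3], fields[5]
        # rsplit on the last ":" also handles IPv6 forms such as ":::6123".
        if state == "LISTEN" and local.rsplit(":", 1)[-1] == str(port):
            matches.append(local)
    return matches


def netstat_listeners(port: int) -> List[str]:
    """Run `netstat -atn` (must be installed) and filter its output."""
    result = subprocess.run(["netstat", "-atn"], capture_output=True, text=True)
    return listening_addresses(result.stdout, port)
```

An empty list from `netstat_listeners(6123)` on the master would suggest that nothing is bound to the port.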

Cheers,
Till

On Thu, Sep 15, 2016 at 11:20 PM, AJ Heller <[hidden email]> wrote:
> [...]


Re: ResourceManager not using correct akka URI in standalone cluster (?)

AJ Heller
Thank you, Till. I was in a time crunch and rebuilt my cluster from the ground up with Hadoop installed. Everything works fine now: `netstat -pn | grep 6123` shows Flink's PID. Hadoop may be irrelevant; I can't rule out PEBKAC yet :-). Sorry, when I have time I'll attempt to reproduce the scenario, on the off chance there's a bug in there I can help dig up.

Best,
aj