Flink with parallelism 3 is running locally but not on cluster

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink with parallelism 3 is running locally but not on cluster

zavalit
This post was updated on .
Hi,
may be i just missing smth, but i just have no more ideas where to look.

here is an screen of the failed state
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1383/Bildschirmfoto_2018-11-12_um_16.png

i read messages from 2 sources, make a join based on a common key and sink
it all in a kafka.

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(3)
  ...
  source1
     .keyBy(_.searchId)
     .connect(source2.keyBy(_.searchId))
     .process(new SearchResultsJoinFunction)
     .addSink(KafkaSink.sink)

so it perfectly works when i launch it locally. it also works on cluster with Parallelism set to 1, but not with 3 any more.

When i deploy it to 1 job manager and 3 taskmanagers and get every Task in "RUNNING" state, after 2
minutes (when nothing is comming to sink) one of the taskmanagers gets
following in log:

 Flat Map (1/3) (9598c11996f4b52a2e2f9f532f91ff66) switched from RUNNING to
FAILED.
java.io.IOException: Connecting the channel failed: Connecting to remote
task manager + 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed.
This might indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:133)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
        at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
        at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:166)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:494)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:525)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
        at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
        at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
        at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connecting to remote task manager +
'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed. This might
indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:133)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:125)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
        ... 1 more
Caused by:
org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
connection timed out: flink-taskmanager-11-dn9cj/10.81.27.84:37708
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
        ... 7 more
2018-11-12 15:47:57,198 INFO  org.apache.flink.runtime.taskmanager.Task                    
- Flat Map (1/3) (171e84d98f94a83e1f3e7cd598c7dbbc) switched from RUNNING to
FAILED.


i would appreciate any hint.

thx a lot.


 





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Flink with parallelism 3 is running locally but not on cluster

Dominik Wosiński
Hey,

DId You try to run any other job on your setup? Also, could You please tell what are the sources you are trying to use, do all messages come from Kafka?? 
From the first look, it seems that the JobManager can't connect to one of the TaskManagers.


Best Regards,
Dom.

pon., 12 lis 2018 o 17:12 zavalit <[hidden email]> napisał(a):
Hi,
may be i just missing smth, but i just have no more ideas where to look.

here is an screen of the failed state
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1383/Bildschirmfoto_2018-11-12_um_16.png>

i read messages from 2 sources, make a join based on a common key and sink
it all in a kafka.

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(3)
  ...
  source1
     .keyBy(_.searchId)
     .connect(source2.keyBy(_.searchId))
     .process(new SearchResultsJoinFunction)
     .addSink(KafkaSink.sink)

so it perfectly works when launch it locally. when i deploy it to 1 job
manager and 3 taskmanagers and get every Task in "RUNNING" state, after 2
minutes (when nothing is comming to sink) one of the taskmanagers gets
following in log:

 Flat Map (1/3) (9598c11996f4b52a2e2f9f532f91ff66) switched from RUNNING to
FAILED.
java.io.IOException: Connecting the channel failed: Connecting to remote
task manager + 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed.
This might indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:133)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
        at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
        at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:166)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:494)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:525)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
        at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
        at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
        at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connecting to remote task manager +
'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed. This might
indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:133)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:125)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
        ... 1 more
Caused by:
org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
connection timed out: flink-taskmanager-11-dn9cj/10.81.27.84:37708
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
        ... 7 more
2018-11-12 15:47:57,198 INFO  org.apache.flink.runtime.taskmanager.Task                   
- Flat Map (1/3) (171e84d98f94a83e1f3e7cd598c7dbbc) switched from RUNNING to
FAILED.


i would appreciate any hint.

thx a lot.








--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Flink with parallelism 3 is running locally but not on cluster

Dominik Wosiński
PS.
Could You also post the whole log for the application run ?? 

Best Regards,
Dom.

czw., 15 lis 2018 o 11:04 Dominik Wosiński <[hidden email]> napisał(a):
Hey,

DId You try to run any other job on your setup? Also, could You please tell what are the sources you are trying to use, do all messages come from Kafka?? 
From the first look, it seems that the JobManager can't connect to one of the TaskManagers.


Best Regards,
Dom.

pon., 12 lis 2018 o 17:12 zavalit <[hidden email]> napisał(a):
Hi,
may be i just missing smth, but i just have no more ideas where to look.

here is an screen of the failed state
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1383/Bildschirmfoto_2018-11-12_um_16.png>

i read messages from 2 sources, make a join based on a common key and sink
it all in a kafka.

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(3)
  ...
  source1
     .keyBy(_.searchId)
     .connect(source2.keyBy(_.searchId))
     .process(new SearchResultsJoinFunction)
     .addSink(KafkaSink.sink)

so it perfectly works when launch it locally. when i deploy it to 1 job
manager and 3 taskmanagers and get every Task in "RUNNING" state, after 2
minutes (when nothing is comming to sink) one of the taskmanagers gets
following in log:

 Flat Map (1/3) (9598c11996f4b52a2e2f9f532f91ff66) switched from RUNNING to
FAILED.
java.io.IOException: Connecting the channel failed: Connecting to remote
task manager + 'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed.
This might indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:133)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:85)
        at
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
        at
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:166)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:494)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:525)
        at
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
        at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
        at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
        at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connecting to remote task manager +
'flink-taskmanager-11-dn9cj/10.81.27.84:37708' has failed. This might
indicate that the remote task manager has been lost.
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
        at
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:133)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:269)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:125)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
        ... 1 more
Caused by:
org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException:
connection timed out: flink-taskmanager-11-dn9cj/10.81.27.84:37708
        at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
        ... 7 more
2018-11-12 15:47:57,198 INFO  org.apache.flink.runtime.taskmanager.Task                   
- Flat Map (1/3) (171e84d98f94a83e1f3e7cd598c7dbbc) switched from RUNNING to
FAILED.


i would appreciate any hint.

thx a lot.








--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Flink with parallelism 3 is running locally but not on cluster

zavalit
Hey, Dominik,
tnx for getting back.
i've posted also by stackoverflow and David Anderson gave a good tipp where
to look.
https://stackoverflow.com/questions/53282967/run-flink-with-parallelism-more-than-1/53289840
issues is resolved, everything is running.

thx. again



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/