Posted by
Ufuk Celebi-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/PartitionNotFoundException-after-deployment-tp19942p19952.html
Hey Gyula!
I'm including Piotr and Nico (cc'd) who have worked on the network
stack in the last releases.
Registering the network structures including the intermediate results
actually happens **before** any state is restored. I'm not sure why
this reproducibly happens when you restore state. @Nico, Piotr: any
ideas here?
In general I think what happens here is the following:
- a task requests the result of a local upstream producer, but that
one has not registered its intermediate result yet
- this should result in a retry of the request with some backoff
(controlled via the config params you mention
taskmanager.network.request-backoff.max,
taskmanager.network.request-backoff.initial)
As a first step I would set logging to DEBUG and check the TM logs for
messages like "Retriggering partition request {}:{}."
You can also check the SingleInputGate code which has the logic for
retriggering requests.
– Ufuk
On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <
[hidden email]> wrote:
> Hi Ufuk,
>
> Do you have any quick idea what could cause this problems in flink 1.4.2?
> Seems like one operator takes too long to deploy and downstream tasks error
> out on partition not found. This only seems to happen when the job is
> restored from state and in fact that operator has some keyed and operator
> state as well.
>
> Deploying the same job from empty state works well. We tried increasing the
> taskmanager.network.request-backoff.max that didnt help.
>
> It would be great if you have some pointers where to look further, I havent
> seen this happening before.
>
> Thank you!
> Gyula
>
> The errror:
> org.apache.flink.runtime.io.network.partition.: Partition
> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
> at
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
> at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
> at
> org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
--
Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
[hidden email]
+49-30-43208879
Registered at Amtsgericht Charlottenburg - HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen