> It looks pretty clear that one operator takes too long to start (even
> the UI shows it in the CREATED state for far too long). Any idea what
> might cause this delay? It actually often crashes with an Akka ask
> timeout while scheduling the node.
>
> Gyula
>
> Piotr Nowojski <[hidden email]> wrote (on Fri, 4 May 2018, 15:33):
>
> Ufuk: I don’t know why.
>
> +1 for your other suggestions.
>
> Piotrek
>
> > On 4 May 2018, at 14:52, Ufuk Celebi <[hidden email]> wrote:
> >
> > Hey Gyula!
> >
> > I'm including Piotr and Nico (cc'd) who have worked on the network
> > stack in the last releases.
> >
> > Registering the network structures including the intermediate results
> > actually happens **before** any state is restored. I'm not sure why
> > this reproducibly happens when you restore state. @Nico, Piotr: any
> > ideas here?
> >
> > In general I think what happens here is the following:
> > - a task requests the result of a local upstream producer, but that
> > one has not registered its intermediate result yet
> > - this should result in a retry of the request with some backoff
> > (controlled via the config params you mention
> > taskmanager.network.request-backoff.max,
> > taskmanager.network.request-backoff.initial)
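
The retry-with-backoff behaviour described above can be sketched as follows. This is a minimal illustration, not Flink's actual code: it assumes the wait time starts at the `initial` value, doubles on each retry, is capped at `max`, and that the request is given up once the cap has been used. The class name `BackoffSketch`, the method `schedule`, and the values 100 ms and 10000 ms are hypothetical and chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a capped exponential backoff; not Flink's implementation.
final class BackoffSketch {

    /** Returns the successive wait times (in ms) used before the request is given up. */
    static List<Long> schedule(long initialMs, long maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initial <= max");
        }
        List<Long> delays = new ArrayList<>();
        long current = 0;
        while (true) {
            if (current == 0) {
                current = initialMs;                     // first retry uses the initial backoff
            } else if (current < maxMs) {
                current = Math.min(current * 2, maxMs);  // double, but never exceed the cap
            } else {
                break;                                   // cap already used: give up
            }
            delays.add(current);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumed, not taken from the thread).
        List<Long> delays = schedule(100, 10000);
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        // prints: [100, 200, 400, 800, 1600, 3200, 6400, 10000] total=22700 ms
        System.out.println(delays + " total=" + total + " ms");
    }
}
```

Under these assumptions the consumer waits about 22.7 seconds in total before failing. If the upstream operator stays undeployed for longer than that budget, the request would still end in "partition not found".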
> >
> > As a first step I would set logging to DEBUG and check the TM logs for
> > messages like "Retriggering partition request {}:{}."
> >
> > You can also check the SingleInputGate code which has the logic for
> > retriggering requests.
> >
> > – Ufuk
> >
> >
> > On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <[hidden email]> wrote:
> >> Hi Ufuk,
> >>
> >> Do you have any quick idea what could cause this problem in Flink 1.4.2?
> >> It seems that one operator takes too long to deploy, and downstream tasks
> >> error out with "partition not found". This only seems to happen when the
> >> job is restored from state; that operator does in fact have some keyed
> >> and operator state as well.
> >>
> >> Deploying the same job from empty state works well. We tried increasing
> >> taskmanager.network.request-backoff.max, but that didn't help.
> >>
> >> It would be great if you have some pointers on where to look further; I
> >> haven't seen this happen before.
> >>
> >> Thank you!
> >> Gyula
> >>
> >> The error:
> >> org.apache.flink.runtime.io.network.partition.: Partition
> >> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
> >> at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
> >> at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
> >> at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
> >> at java.util.TimerThread.mainLoop(Timer.java:555)
> >> at java.util.TimerThread.run(Timer.java:505)
> >
> >
> >
> > --
> > Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
> >
> >
> > [hidden email]
> > +49-30-43208879
> >
> > Registered at Amtsgericht Charlottenburg - HRB 158244 B
> > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>