Hi all,
I was debugging a curious problem with a streaming job that contained an iteration and several AsynFunctions. The entire job would stall out, with no progress being made. But when I checked back pressure, only one function showed it as being high - everything else was OK. And when I dumped threads, the only bit of my code that was running was indeed that one function w/high back pressure, stuck while making a collect() call. There were two issues here…. 1. A downstream function in the iteration was (significantly) increasing the number of tuples - it would get one in, and sometimes emit 100+. The output would loop back as input via the iteration. This eventually caused the network buffers to fill up, and that’s why the job got stuck. I had to add my own tracking/throttling in one of my custom function, to avoid having too many “active” tuples. So maybe something to note in documentation on iterations, if it’s not there already. 2. The back pressure calculation doesn’t take into account AsyncIO When I double-checked the thread dump, there were actually a number of threads (one for each of my AsyncFunctions) that were stuck calling collect(). These all were named "AsyncIO-Emitter-Thread (<name of AsyncFunction>…). For example:
I’m assuming that when my AsyncFunction calls collect(), this hands off the tuple to this AsyncIO-Emitter-Thread thread, which is why none of my code (either AsyncFunctions or threads in my pool doing async stuff) shows up in the thread dump. And I’m assuming that the back pressure calculation isn’t associating these threads with the source function, which is why they don’t show up in the GUI. I’m hoping someone can confirm the above. If so, I’ll file an issue. Thanks, — Ken -------------------------- Ken Krugler custom big data solutions & training Hadoop, Cascading, Cassandra & Solr |
Hey Ken,
thanks for your message. Both your comments are correct (see inline). On Fri, Nov 10, 2017 at 10:31 PM, Ken Krugler <[hidden email]> wrote: > 1. A downstream function in the iteration was (significantly) increasing the > number of tuples - it would get one in, and sometimes emit 100+. > > The output would loop back as input via the iteration. > > This eventually caused the network buffers to fill up, and that’s why the > job got stuck. > > I had to add my own tracking/throttling in one of my custom function, to > avoid having too many “active” tuples. > > So maybe something to note in documentation on iterations, if it’s not there > already. Yes, iterations are prone to deadlock due to the way that data is exchanged between the sink and head nodes. There have been multiple attempts to fix these shortcomings, but I don't know what the latest state is. Maybe Aljoscha (CC'd) has some input... > 2. The back pressure calculation doesn’t take into account AsyncIO Correct, the back pressure monitoring only takes the main task thread into account. Every operator that uses a separate thread to emit records (like Async I/O oder Kafka source) is therefore not covered by the back pressure monitoring. – Ufuk |
Hi,
Unfortunately, I don't have anything to add. Yes, back pressure doesn't work correctly for functions that do work outside the main thread and iterations currently don't work well and can lead to deadlocks. Did you already open issues for those by now? Best, Aljoscha > On 10. Nov 2017, at 22:46, Ufuk Celebi <[hidden email]> wrote: > > Hey Ken, > > thanks for your message. Both your comments are correct (see inline). > > On Fri, Nov 10, 2017 at 10:31 PM, Ken Krugler > <[hidden email]> wrote: >> 1. A downstream function in the iteration was (significantly) increasing the >> number of tuples - it would get one in, and sometimes emit 100+. >> >> The output would loop back as input via the iteration. >> >> This eventually caused the network buffers to fill up, and that’s why the >> job got stuck. >> >> I had to add my own tracking/throttling in one of my custom function, to >> avoid having too many “active” tuples. >> >> So maybe something to note in documentation on iterations, if it’s not there >> already. > > Yes, iterations are prone to deadlock due to the way that data is > exchanged between the sink and head nodes. There have been multiple > attempts to fix these shortcomings, but I don't know what the latest > state is. Maybe Aljoscha (CC'd) has some input... > >> 2. The back pressure calculation doesn’t take into account AsyncIO > > Correct, the back pressure monitoring only takes the main task thread > into account. Every operator that uses a separate thread to emit > records (like Async I/O oder Kafka source) is therefore not covered by > the back pressure monitoring. > > – Ufuk |
Free forum by Nabble | Edit this page |