I have been trying to send data from several processes to my Flink application. I want to use a single port that will receive data from multiple clients. I have implemented my own SourceFunction, but I have two questions.
First, my TCP server has multiple threads receiving data from multiple clients. Is calling ctx.collect() without synchronization safe? Second, when I have a source function, how is the processing of arriving messages handled? Is a source function parallelized? For instance, with the simple SocketTextStream, are messages received in parallel, or are they received sequentially and then processed in parallel by mappers and so on? Sorry if my questions seem too obvious, but I'm kind of new to the Streaming API. Thank you in advance!
Hi,
regarding your first question: I think it is in general not safe to call ctx.collect() from threads other than the thread that is invoking the run() method of your SourceFunction. What I would suggest is to have a queue that your reader threads put data into, and then read from that queue in the main thread to call ctx.collect().
For your second question: it depends on the source. Sources that implement the SourceFunction interface are not parallelized. One instance of the source will send data (round-robin) to downstream parallel mappers (or other operations). If you implement ParallelSourceFunction or RichParallelSourceFunction, then your source will be parallelized.
Cheers, Aljoscha
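The queue hand-off suggested above can be sketched roughly as follows. This is a minimal, Flink-free illustration under stated assumptions: `Emitter` is a hypothetical stand-in for the collect() side of the source context, and `QueueHandoffSource`, `offerFromReader`, `demo`, and the queue capacity of 1024 are illustrative names and values, not part of any Flink API. In a real source, the body of `run` would live in SourceFunction.run() and the reader threads would be the TCP connection handlers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the collect() side of a source context.
interface Emitter<T> {
    void collect(T element);
}

class QueueHandoffSource {
    // Bounded queue: reader threads block when the emitting thread falls behind.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
    private volatile boolean running = true;

    // Called by reader threads; they never touch the emitter directly.
    void offerFromReader(String line) {
        try {
            queue.put(line);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // The only thread that calls collect() is the one executing run().
    void run(Emitter<String> out) {
        try {
            // Keep draining after cancel() until the queue is empty.
            while (running || !queue.isEmpty()) {
                String line = queue.poll(100, TimeUnit.MILLISECONDS);
                if (line != null) {
                    out.collect(line);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void cancel() {
        running = false;
    }

    // Small driver: several simulated clients feed one emitting thread.
    static List<String> demo(int readers, int perReader) {
        final QueueHandoffSource source = new QueueHandoffSource();
        final List<Thread> readerThreads = new ArrayList<>();
        for (int r = 0; r < readers; r++) {
            final int id = r;
            Thread t = new Thread(() -> {
                for (int i = 0; i < perReader; i++) {
                    source.offerFromReader("client-" + id + "-msg-" + i);
                }
            });
            readerThreads.add(t);
            t.start();
        }
        // Cancel the source once every reader has finished producing.
        Thread watcher = new Thread(() -> {
            for (Thread t : readerThreads) {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            source.cancel();
        });
        watcher.start();

        // Plain ArrayList is fine here: only the calling thread writes to it.
        final List<String> collected = new ArrayList<>();
        source.run(collected::add);
        return collected;
    }
}
```

The bounded queue gives simple back-pressure: if the emitting thread is slow, the reader threads block in put() instead of buffering without limit.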
Hi,
Thank you very much for your answer. There is one more doubt in my mind. How are non-parallel source functions processed? For instance, let's say I have four streams that implement SourceFunction: will they be placed on different parallel instances, or will they be processed sequentially by the same instance? Could you direct me to an example of a ParallelSourceFunction? Looking at the code, I don't see an interface or any implementation example in the documentation. Best, Omar.
Hi!
A typical example of a parallel source is the Kafka source.
Actually, threads other than the main run() thread can call ctx.collect(), provided they use the checkpoint lock properly. The Kafka source does that.
Stephan
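As a rough sketch of what using the checkpoint lock looks like: each thread synchronizes on the lock object while it emits. The snippet below does not depend on Flink. `Ctx` is a hypothetical stand-in that mirrors two methods of Flink's SourceFunction.SourceContext, collect() and getCheckpointLock(), and `LockedEmitDemo` and its message format are made up for illustration. It is not how the Kafka source is implemented, only the locking pattern in isolation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring collect() and getCheckpointLock()
// of Flink's SourceFunction.SourceContext.
interface Ctx<T> {
    void collect(T element);
    Object getCheckpointLock();
}

class LockedEmitDemo {
    // Several threads emit concurrently, each holding the lock around collect().
    static List<String> demo(int threads, int perThread) {
        final Object lock = new Object();
        // Deliberately not thread-safe: correctness relies on the lock alone.
        final List<String> sink = new ArrayList<>();
        final Ctx<String> ctx = new Ctx<String>() {
            @Override
            public void collect(String element) {
                sink.add(element);
            }

            @Override
            public Object getCheckpointLock() {
                return lock;
            }
        };

        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // Hold the checkpoint lock for the duration of the emit.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect("client-" + id + "-msg-" + i);
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sink;
    }
}
```

In a source that also keeps state (for example, read offsets), the state update would normally go inside the same synchronized block as the emit, so that a checkpoint never observes one without the other.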
Thanks for your answers. This makes the implementation way easier; I don't have to worry about queues.
I will take a look at the Kafka connector. So my only remaining question is how serial stream sources are handled. If I have four independent streams, will the sources be handled by different parallel instances?
Hi, yes, if you have 4 independent non-parallel sources they will be executed independently in different threads. Cheers, Aljoscha
Thanks to everybody, all my doubts are solved. I gotta give it to you guys, the answers were really fast!
Cheers, Omar.