Hey,
I’m a new user to Flink and I’m trying to figure out if I can build a pipeline I’m working on using Flink. I have a data source that sends out a continues data stream at a bandwidth of anywhere between 45MB/s to 600MB/s (yes, that’s MiB/s, not Mib/s, and NOT a series of individual messages but an actual continues stream of data where some data may depend on previous or future data to be fully deciphered). I need to be able to pass the data through several processing stages (that manipulate the data but still produce the same order of magnitude output at each stage) and I need processing to be done with low-latency. The data itself CAN be segmented but the segments will be some HUGE (~100MB – 250MB) and I would like to be able to stream data in and out of the processors ASAP instead of waiting for full segments to be complete at each stage (so bytes will flow in/out as soon as they are available).
The obvious solution would be to split the data into very small buffers, but since each segment would have to be sent completely to the same processor node (and not split between several nodes), doing such micro-batching would be a bad idea as it would spread a single segment’s buffers between multiple nodes.
Is there any way to accomplish this with Flink? Or is Flink the wrong platform for that type of processing?
Any help would be greatly appreciated!
Thanks,
Tal |
I think with sufficient processing power flink can do the above mentioned task using the stream api.
Thanks, Ritesh Kumar Singh, On Wed, Jan 20, 2016 at 11:18 AM, Tal Maoz <[hidden email]> wrote:
|
Hi Tal, that sounds like an interesting use case. I think I need a bit more details about your use case to see how it can be done with Flink. You said you need low latency, what latency is acceptable for you? Also, I was wondering how are you going to feed the input data into Flink? If the data is coming from multiple sources, maybe everything can be done completely parallel. Do you need any fault tolerance guarantees? You can use Flink's DataStream abstraction with different data types, and you could create a DataStream<byte>. Flink would internally still send multiple of those records in one buffer. I think the more efficient approach is, as you suggested, to use a DataStream<byte[]> of larger chunks. What kind of transformations are you planning to do on the stream? Regarding the amount of data we are talking about here: Flink is certainly able to handle those loads. I recently did some tests with our KafkaConsumer and I was able to read 390 megabytes/second on my laptop, using a parallelism of one (so only one reading thread). My SSD has a read rate of 530 MBs/. With sufficiently fast hardware, a few Flink TaskManagers will be able to read 600MB/s. On Wed, Jan 20, 2016 at 1:39 PM, Ritesh Kumar Singh <[hidden email]> wrote:
|
Hey Robert, Thanks for responding! The latency I'm talking about would be no more than 1 second from input to output (meaning, bytes should flow immediately through the pipline and get to the other side after going through the processing). You can assume the processors have enough power to work in real-time. The processors would be, for the most part, running external processes (binary executables) and will feed them the incoming data, and then pass along their stdout to the next stage. Simply put, I have several existing 'blackbox' utilities that I need to run on the data in sequence and each of which is a CPU hog... Regarding fault tolerance, no data should be lost and each processor should get the data ONCE and in the correct order (when data is supposed to flow to the same processor). If a node crashes, a new one will take it's place and the data that was sent to the crashed node and was not processed should be sent to the new one, while the output should flow transparently to the next node as if no crashes happened. I know this is a very complicated demand but it is a must in my case. Finally, I'm talking about multiple pipelines running, where each node in a pipeline will be pre-configured before data starts flowing. Each pipeline will read data from a socket or from an MQ if such an MQ exists and is able to handle the load with the required low-latency. Each pipeline's source could be at the range of 45-600MB/s (this can't be split into multiple sources) and eventually, with enough resources and scaling, the system should support hundreds of such pipelines, each with it's own source! Also, at some point, 2 or more sources could be joined with some transformation into a single data stream... Assume the network fabric itself is capable of moving those amounts of data... If i use DataStream<byte[]> where i divide a single segment into very small buffers for low-latency, how can ensure that, on the one hand the data for a single segments flows entirely to the same processor while on the other, different segments can be balanced between several processors? Tal On Wed, Jan 20, 2016 at 3:02 PM, Robert Metzger <[hidden email]> wrote:
|
This sounds quite feasible, actually, though it is a pretty unique use case. Like Robert said, you can write map() and flatMap() function on byte[] arrays. Make sure that the byte[] that the sources produce are not super small and not too large (I would start with 1-4K or so). You can control how data flows pretty well. It flows 1:1 from produce to consumer, if you simply chain these function calls after another. To balance bytes across receivers, use rebalance(), broadcast(), or partitionCustom(). Streams maintain order of elements, unless steams get split/merged by operations like rebalance(), partition() / broadcast / keyBy() or similar. To union multiple streams with control over how the result stream gets pieced together, I would try to connect streams and use a CoMapFunction / CoFlatMapFunction to stitch the result stream together form the two input streams. To get exactly-once processing guarantees, activate checkpointing and use a source that supports that. If you use a custom source, you may need a few lines to integrate it with the checkpoint mechanism, but that is very doable. Hope that helps! Greetings, Stephan On Wed, Jan 20, 2016 at 2:20 PM, Tal Maoz <[hidden email]> wrote:
|
Hi Tal, you said that most processing will be done in external processes. If these processes are stateful, this might be hard to integrate with Flink's fault-tolerance mechanism. In principle, Flink requires two things to achieve exactly-once processing: 1) A data source that can be replayed from a certain point 2) User-functions that can checkpoint and restore their complete state. (The actual checkpointing is done by Flink, but the user code must expose its state through Flink's APIs). In case of a failure, the data sources are set back to a committed point and the corresponding state of all user functions is restored. You would need to expose the state of your external functions to Flink and have some way to reinitialize this state in your external function. Best, Fabian 2016-01-20 22:09 GMT+01:00 Stephan Ewen <[hidden email]>:
|
Thanks Stephan and Fabian! You make very valuable points! This really helps steer me in the right direction! It would take some more careful planning and implementing the components you suggested but hopefully it will work in the end... Thanks, Tal On Thu, Jan 21, 2016 at 11:20 AM, Fabian Hueske <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |