Parallel execution but keep order of kafka messages

Posted by BenReissaus on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Parallel-execution-but-keep-order-of-kafka-messages-tp12619.html

Hi everybody,

 

I have the following flink/kafka setup:

 

I have 3 kafka “input” topics and 3 “output” topics with each 1 partition (only 1 partition because the order of the messages is important). I also have 1 master and 2 flink slave nodes with a total of 16 task slots.

In my flink program I have created 3 consumers - each for one of the input topics.

On each of the datastreams I run a query that generates statistics over a window of 1 second and I write the result to the corresponding output topic. You can find the execution plan with parallelism set to 2 attached.

 

This setup with parallelism=2 sometimes seems to give me the wrong order of the statistics results. I assume it is because of the rebalancing before the last map which leads to a race condition when writing to kafka.

 

If I set parallelism to 1 no rebalancing will be done but only one task slot is used.

 

This has led me to the following questions:

 

Why is only 1 task slot used with my 3 pipelines when parallelism is set to 1? As far as I understand, the parallelism refers to the number of parallel instances a task can be split into. Therefore, I would assume that I could still run multiple different tasks (e.g. different maps or window functions on different streams) in different task slots, right?

 

And to come back to my requirement: Is it not possible to run all 3 pipelines in parallel and still keep the order of the messages and results?

 

I also asked these questions on stackoverflow. And it seems that I have similar trouble understanding the terms “task slot”, “subtasks” etc. like Flavio mentioning in this flink mail thread.

 

Thank you and I would appreciate any input!

 

Best regards,

Ben


Flink_Execution_Plan.png (453K) Download Attachment