Flink streaming Parallelism


Jakes John
       I am coming from the Apache Storm world. I am planning to switch from Storm to Flink. I was reading the Flink documentation, but I couldn't find some features in Flink that are present in Storm.

I need to build a streaming pipeline: Kafka -> Flink -> Elasticsearch. In Storm, I can specify the number of tasks per bolt. Databases are typically slow at writes, so I need more writers to the database. Reading from Kafka is fast compared to Elasticsearch writes, which means I need more Elasticsearch writer tasks than Kafka consumers. How can I achieve that in Flink? What are the Flink concepts that correspond to Storm's parallelism concepts such as workers, executors, and tasks?
        I saw that the Elasticsearch sink implementation in Flink can batch messages before writing. How can I batch data based on custom logic, for example batch writes grouped on one of the message keys? This is possible in Storm via FieldGrouping, but I couldn't find an equivalent way to do grouping in Flink and control the overall number of writes to Elasticsearch.

Please help me with the above questions and some pointers to Flink parallelism.




Re: Flink streaming Parallelism

Tzu-Li (Gordon) Tai
Hi,

The equivalent would be setting a parallelism on your sink operator, e.g. stream.addSink(…).setParallelism(…).
By default, every operator in the pipeline uses the parallelism set for the whole job, unless a parallelism is explicitly set for a specific operator. For more details on the distributed runtime concepts, take a look at [1].
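As a rough sketch of what that looks like in a job (the topic name, connector classes, parallelism values, and sink function here are illustrative assumptions, not from this thread; adapt them to your Flink release and connectors):

```java
// Sketch: Kafka -> Flink -> Elasticsearch with more ES writers than
// Kafka consumers. Connector class names are assumptions for the
// Flink 1.3-era connectors.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // default parallelism for all operators in this job

DataStream<String> messages = env
    .addSource(new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), kafkaProps))
    .setParallelism(2);   // e.g. match the number of Kafka partitions

messages
    .addSink(new ElasticsearchSink<>(esConfig, transportAddresses, new MyEsSinkFunction()))
    .setParallelism(8);   // more Elasticsearch writer subtasks than consumers

env.execute("kafka-to-es");
```

Each operator's explicit setParallelism(…) overrides the job-wide default, so the source and sink can run with different numbers of parallel subtasks.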

        I saw that the Elasticsearch sink implementation in Flink can batch messages before writing. How can I batch data based on custom logic, for example batch writes grouped on one of the message keys? This is possible in Storm via FieldGrouping.

The equivalent of partitioning streams in Flink is `stream.keyBy(…)`. All messages with the same key then go to the same parallel downstream operator instance. If that downstream operator is an ElasticsearchSink, then after a keyBy all messages with the same key will be batched by the same Elasticsearch writer.
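As a simplified mental model of that routing (an assumption for illustration: Flink actually murmur-hashes keys into key groups, so the real subtask assignment differs, but the key -> subtask mapping is deterministic in the same way):

```java
public class KeyRouting {
    // Simplified model of keyBy routing: every record with the same key
    // lands on the same downstream subtask. (Assumption: this uses a plain
    // hashCode-modulo mapping; real Flink hashes keys into key groups.)
    static int subtaskFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int sinkParallelism = 8;
        // All records for key "user-42" go to the same writer subtask:
        int first = subtaskFor("user-42", sinkParallelism);
        int second = subtaskFor("user-42", sinkParallelism);
        System.out.println(first == second);                        // same subtask
        System.out.println(first >= 0 && first < sinkParallelism);  // valid index
    }
}
```

Because the mapping is deterministic, keying the stream before the sink both groups related messages onto one writer and caps the number of concurrent writers at the sink's parallelism.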

Hope this helps!

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html


On 8 August 2017 at 8:58:30 AM, Jakes John ([hidden email]) wrote: