Flink streaming Parallelism


Jakes John
       I am coming from the Apache Storm world. I am planning to switch from Storm to Flink. I was reading the Flink documentation, but I couldn't find some features in Flink that are present in Storm.

I need to build a streaming pipeline: Kafka -> Flink -> Elasticsearch. In Storm, I can specify the number of tasks per bolt. Databases are typically slow at writes, so I need more writers to the database. Reading from Kafka is fast compared to Elasticsearch writes, which means I need more Elasticsearch writer tasks than Kafka consumers. How can I achieve that in Flink? What are the Flink concepts that correspond to Storm's parallelism concepts such as workers, executors, and tasks?
        I saw that the Elasticsearch sink implementation in Flink can batch messages before writing. How can I batch data based on custom logic, for example batch writes grouped on one of the message keys? This is possible in Storm via FieldGrouping, but I couldn't find an equivalent way to do grouping in Flink and control the overall number of writes to Elasticsearch.

Please help me with the above questions and some pointers to Flink parallelism.




Re: Flink streaming Parallelism

Tzu-Li (Gordon) Tai
Hi,

The equivalent would be setting a parallelism on your sink operator, e.g. stream.addSink(…).setParallelism(…).
By default, every operator in the pipeline uses the parallelism set for the whole job, unless a parallelism is explicitly set for a specific operator. For more details on the distributed runtime concepts, take a look at [1].
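As a rough sketch of what that looks like in a job (the topic name, connector classes, parallelism values, and sink function here are illustrative assumptions, not from this thread; adapt them to your Flink release and connectors):

```java
// Sketch: Kafka -> Flink -> Elasticsearch with more ES writers than
// Kafka consumers. Connector class names are assumptions for the
// Flink 1.3-era connectors.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // default parallelism for all operators in this job

DataStream<String> messages = env
    .addSource(new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), kafkaProps))
    .setParallelism(2);   // e.g. match the number of Kafka partitions

messages
    .addSink(new ElasticsearchSink<>(esConfig, transportAddresses, new MyEsSinkFunction()))
    .setParallelism(8);   // more Elasticsearch writer subtasks than consumers

env.execute("kafka-to-es");
```

Each operator's explicit setParallelism(…) overrides the job-wide default, so the source and sink can run with different numbers of parallel subtasks.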

        I saw that the Elasticsearch sink implementation in Flink can batch messages before writing. How can I batch data based on custom logic, for example batch writes grouped on one of the message keys? This is possible in Storm via FieldGrouping.

The equivalent of partitioning streams in Flink is `stream.keyBy(…)`. All messages with the same key then go to the same parallel downstream operator instance. If that downstream operator is an ElasticsearchSink, then after a keyBy all messages with the same key will be batched by the same Elasticsearch writer.
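As a simplified mental model of that routing (an assumption for illustration: Flink actually murmur-hashes keys into key groups, so the real subtask assignment differs, but the key -> subtask mapping is deterministic in the same way):

```java
public class KeyRouting {
    // Simplified model of keyBy routing: every record with the same key
    // lands on the same downstream subtask. (Assumption: this uses a plain
    // hashCode-modulo mapping; real Flink hashes keys into key groups.)
    static int subtaskFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int sinkParallelism = 8;
        // All records for key "user-42" go to the same writer subtask:
        int first = subtaskFor("user-42", sinkParallelism);
        int second = subtaskFor("user-42", sinkParallelism);
        System.out.println(first == second);                        // same subtask
        System.out.println(first >= 0 && first < sinkParallelism);  // valid index
    }
}
```

Because the mapping is deterministic, keying the stream before the sink both groups related messages onto one writer and caps the number of concurrent writers at the sink's parallelism.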

Hope this helps!

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html


On 8 August 2017 at 8:58:30 AM, Jakes John ([hidden email]) wrote: