(DEPRECATED) Apache Flink User Mailing List archive.

Large number of sources in Flink Job


chiggi_dev

Large number of sources in Flink Job

Hi,

I am working on a use case where my Flink job needs to collect data from thousands of sources.

As an example, I want to collect data from more than 2000 file directories, process (filter, transform) the data, and distribute the processed data streams to 200 different directories.

Are there any caveats I should know about with such a large number of sources, also taking per-operator parallelism into account?

Regards,

Chirag

Fabian Hueske-2

Re: Large number of sources in Flink Job

Hi Chirag,

There have been some issues with very large execution graphs.

You might need to adjust the default configuration and configure larger Akka buffers and/or timeouts.
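As a rough sketch, the relevant entries in flink-conf.yaml could look like the fragment below. The keys `akka.framesize` and `akka.ask.timeout` are the ones to look at first; the values are illustrative assumptions only and need to be tuned for the actual job.

```yaml
# flink-conf.yaml (fragment); values are illustrative assumptions, not recommendations
akka.framesize: 20971520b   # maximum size of messages exchanged between JobManager and TaskManagers
akka.ask.timeout: 60 s      # timeout for blocking Akka calls and futures
```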

Also, 2000 sources means that you run at least 2000 threads at once.

The FileInputFormat (and most of its sub-classes) in Flink 1.5.0 can be configured to accept multiple directories.

This would be preferable to creating one source per directory.
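To illustrate the idea, here is a minimal plain-Java sketch (no Flink dependency) that splits a long list of directories into a small number of groups, so that each group could back one multi-path source instead of one source per directory. The class name `DirectoryGrouping`, the method `group`, and the paths are hypothetical. Each resulting group would then be handed to an input format that accepts multiple paths; I believe `FileInputFormat.setFilePaths(...)` is the relevant call, but please check the Javadoc of your Flink version.

```java
import java.util.ArrayList;
import java.util.List;

public class DirectoryGrouping {

    // Distributes the directories round-robin over at most `groupCount` groups.
    // Each group is meant to be read by a single multi-path file source.
    public static List<List<String>> group(List<String> directories, int groupCount) {
        if (groupCount <= 0) {
            throw new IllegalArgumentException("groupCount must be positive");
        }
        // Never create more groups than there are directories.
        int n = Math.min(groupCount, directories.size());
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < directories.size(); i++) {
            groups.get(i % n).add(directories.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            dirs.add("/data/in/dir-" + i); // hypothetical input paths
        }
        List<List<String>> groups = group(dirs, 8);
        System.out.println(groups.size() + " sources, "
                + groups.get(0).size() + " directories in the first one");
    }
}
```

With 8 groups, the job graph contains 8 source operators rather than 2000, which keeps the execution graph and the number of source threads much smaller. The group count of 8 is arbitrary here and would depend on the cluster.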

Best, Fabian

(In reply to Chirag Dewan <[hidden email]>, 2018-05-28 6:35 GMT+02:00.)