Large number of sources in Flink Job

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Large number of sources in Flink Job

chiggi_dev
Hi,

I am working on a use case where my Flink job needs to collect data from thousands of sources. 

As an example, I want to collect data from more than 2000 File Directories, process(filter, transform) the data and distribute the processed data streams to 200 different directories.

Are there any caveats I should know with such large number of sources, also taking into account per operator parallelism? 

Regards,

Chirag
  
Reply | Threaded
Open this post in threaded view
|

Re: Large number of sources in Flink Job

Fabian Hueske-2
Hi Chirag,

There have been some issue with very large execution graphs.
You might need to adjust the default configuration and configure larger Akka buffers and/or timeouts.

Also, 2000 sources means that you run at least 2000 threads at once.

The FileInputFormat (and most of its sub-classes) in Flink 1.5.0 can be configured to accept multiple directories.
This would be a preferred approach to creating one source per directory.

Best, Fabian

2018-05-28 6:35 GMT+02:00 Chirag Dewan <[hidden email]>:
Hi,

I am working on a use case where my Flink job needs to collect data from thousands of sources. 

As an example, I want to collect data from more than 2000 File Directories, process(filter, transform) the data and distribute the processed data streams to 200 different directories.

Are there any caveats I should know with such large number of sources, also taking into account per operator parallelism? 

Regards,

Chirag