Batch source improvement

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Batch source improvement

Flavio Pompermaier
Hi to all,
we're still using Flink as a batch processor and despite not very advertised is still doing great.
However there's one thing I always wanted to ask: when reading data from a source the job manager computes the splits and assigns a set of them to every instance of the InputFormat. This works fine until the data is pefectly balanced but in my experience most of the times this is not true and some of them completes very quickly while some of them continue to read data (also for a long time).

Couldn't this be enhanced buffering splits in a shared place so that tasks could ask for a "free" split as soon as they complete to read their assigned split? Would it be complicated to implement such a logic?

Best,
Flavio
Reply | Threaded
Open this post in threaded view
|

Re: Batch source improvement

Fabian Hueske-2
Hi Flavio,

actually, Flink did always lazily assign input splits. The JM gets the list of IS from the InputFormat.
Parallel source instances (with an input format) request an input split from the JM whenever they do not have anything to do.
This should actually balance some of the data skew in input splits.

Best, Fabian

2017-04-29 11:36 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi to all,
we're still using Flink as a batch processor and despite not very advertised is still doing great.
However there's one thing I always wanted to ask: when reading data from a source the job manager computes the splits and assigns a set of them to every instance of the InputFormat. This works fine until the data is pefectly balanced but in my experience most of the times this is not true and some of them completes very quickly while some of them continue to read data (also for a long time).

Couldn't this be enhanced buffering splits in a shared place so that tasks could ask for a "free" split as soon as they complete to read their assigned split? Would it be complicated to implement such a logic?

Best,
Flavio