elasticsearch-hadoop creates one Hadoop InputSplit (task) per Elasticsearch
shard, so if my index has 20 shards, it will be split into 20 InputSplits.
/My question is:/
What will happen if my job restarts (failover) after finishing half of the
InputSplits?
Does hadoopInputFormat remember which InputSplits are finished and know how
to continue from where it stopped (maybe by reading from the beginning of the
unfinished InputSplits), or does it start from the beginning?
At the moment, if the processing of any input split fails,
Flink restarts the batch job completely from scratch, so splits that had
already finished are read again.
There is an ongoing effort to improve fine-grained recovery in FLINK-4256.
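
To make the cost of a full restart concrete, here is a minimal plain-Java
sketch. It does not use Flink or elasticsearch-hadoop; the class and method
names (FullRestartSketch, runJob) are hypothetical, and it reads splits
sequentially, whereas a real job reads them in parallel. It only counts how
many split reads happen when one split fails once and the whole job starts
over.

```java
public class FullRestartSketch {

    // Hypothetical model of a batch job over numSplits input splits.
    // The split with index failingSplit fails once, on the first attempt only.
    // Pass -1 as failingSplit to model a run without any failure.
    // Returns the total number of completed split reads across all attempts.
    static int runJob(int numSplits, int failingSplit) {
        int splitReads = 0;
        boolean failedOnce = false;
        while (true) {
            boolean attemptFailed = false;
            // Every attempt starts at split 0: no per-split progress is kept.
            for (int split = 0; split < numSplits; split++) {
                if (split == failingSplit && !failedOnce) {
                    failedOnce = true;
                    attemptFailed = true;
                    break; // the whole attempt is abandoned
                }
                splitReads++;
            }
            if (!attemptFailed) {
                return splitReads;
            }
        }
    }

    public static void main(String[] args) {
        // 20 splits, failure while processing split 10 on the first attempt.
        System.out.println(runJob(20, 10)); // prints 30
    }
}
```

With 20 splits and a failure after 10 of them have finished, the sketch
counts 30 split reads: the 10 finished before the failure plus all 20 again
after the restart.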
Best,
Andrey
> On 2 Oct 2018, at 13:52, aviad <[hidden email]> wrote:
>
> Hi,
>
> I want to write a batch job that reads data from *elasticsearch* using
> *elasticsearch-hadoop* (https://github.com/elastic/elasticsearch-hadoop/)
> and *hadoopInputFormat*.
>
> example code (from
> https://github.com/genged/flink-playground/blob/master/src/main/java/com/mic/flink/FlinkMain.java):
>
>
> elasticsearch-hadoop creates one Hadoop InputSplit (task) per Elasticsearch
> shard, so if my index has 20 shards, it will be split into 20 InputSplits.
>
>
> /My question is:/
> What will happen if my job restarts (failover) after finishing half of the
> InputSplits?
> Does hadoopInputFormat remember which InputSplits are finished and know how
> to continue from where it stopped (maybe by reading from the beginning of the
> unfinished InputSplits), or does it start from the beginning?
>
> thanks
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/