Failures on DataSet programs

Failures on DataSet programs

Paulo Cezar
Hi Folks,

I was wondering if it's possible to keep partial outputs from DataSet programs.
I have a batch pipeline that writes its output to HDFS using writeAsFormattedText. When it fails, the output file is deleted, but I would like to keep it so that I can generate new inputs for the pipeline and avoid reprocessing.

[]'s
Paulo Cezar

Re: Failures on DataSet programs

Ufuk Celebi
Hey Paulo! I think it's not possible out of the box at the moment, but
you can try the following as a workaround:

1) Create a custom OutputFormat that extends TextOutputFormat and
overrides the cleanup method:

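import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.core.fs.Path;

public class NoCleanupTextOutputFormat<T> extends TextOutputFormat<T> {

    // TextOutputFormat has no default constructor, so pass the path through.
    public NoCleanupTextOutputFormat(Path outputPath) {
        super(outputPath);
    }

    @Override
    public void tryCleanupOnError() {
        // Intentionally empty: keep the partial output file when the job fails.
    }

}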

2) writeAsFormattedText is actually a map + writeAsText (if you look
into DataSet.java). Instead of calling it, you should do the two steps
manually:

dataSet
    .map(new FormattingMapper<>(clean(formatter)))
    .output(new NoCleanupTextOutputFormat<>(..));
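
The effect of the two steps above can be tried out without a cluster. The following is only a minimal, dependency-free sketch of the idea: SimpleTextOutput, NoCleanupTextOutput and run are hypothetical stand-ins written for illustration, not Flink classes, and the "failure" is simulated by a null record. It shows that overriding the cleanup hook with a no-op leaves the partially written file in place.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class CleanupSketch {

    /** Hypothetical stand-in for a file-writing output format (NOT Flink's class). */
    static class SimpleTextOutput {
        protected final Path file;

        SimpleTextOutput(Path file) {
            this.file = file;
        }

        /** Appends one record as a line, creating the file if needed. */
        void writeRecord(String record) throws IOException {
            Files.write(file, (record + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        /** Default behavior in this sketch: delete the partial file on failure. */
        void tryCleanupOnError() throws IOException {
            Files.deleteIfExists(file);
        }
    }

    /** Overrides the cleanup hook with a no-op so the partial output survives. */
    static class NoCleanupTextOutput extends SimpleTextOutput {
        NoCleanupTextOutput(Path file) {
            super(file);
        }

        @Override
        void tryCleanupOnError() {
            // Intentionally empty: ignore cleanup on error.
        }
    }

    /**
     * Writes records in order. A null record simulates a task failure, after
     * which the cleanup hook is called. Returns true if every record was written.
     */
    static boolean run(SimpleTextOutput out, List<String> records) throws IOException {
        try {
            for (String record : records) {
                if (record == null) {
                    throw new IllegalStateException("simulated failure");
                }
                out.writeRecord(record);
            }
            return true;
        } catch (IllegalStateException e) {
            out.tryCleanupOnError();
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("cleanup-sketch");
        List<String> records = Arrays.asList("a", "b", null, "c");

        Path deleted = dir.resolve("default.txt");
        Path kept = dir.resolve("no-cleanup.txt");
        run(new SimpleTextOutput(deleted), records);
        run(new NoCleanupTextOutput(kept), records);

        System.out.println("default output exists: " + Files.exists(deleted));
        System.out.println("no-cleanup output exists: " + Files.exists(kept));
        System.out.println("kept lines: " + Files.readAllLines(kept));
    }
}
```

With the no-op override, the records written before the simulated failure stay on disk, which is what makes it possible to work out which inputs still need processing. In a real job the surviving file may be incomplete or end mid-record, so anything derived from it should be treated with care.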


This should work as expected. You can furthermore open an issue with a
feature request to allow configuring Flink's TextOutputFormat to
ignore cleanup.

Best,

Ufuk


(In reply to Paulo Cezar's message of Tue, Sep 27, 2016, 10:42 PM.)