Batch version of StreamingFileSink.forRowFormat(...)

Dan
Hi. I have a streaming job that writes to StreamingFileSink.forRowFormat(...) with an encoder that converts protocol buffers to byte arrays.
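
Roughly, the sink side looks like this (Event stands in for the real message type, and the exact framing is simplified; I'm using protobuf's length-delimited encoding here as an example):

    import org.apache.flink.api.common.serialization.Encoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    // events is the DataStream<Event> produced earlier in the job.
    StreamingFileSink<Event> sink = StreamingFileSink
        .forRowFormat(
            new Path("s3://bucket/raw-logs"), // example path
            (Encoder<Event>) (element, stream) -> element.writeDelimitedTo(stream))
        .build();

    events.addSink(sink);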

How do I read this data back in during a batch pipeline (using the DataSet API)?  Do I use env.readFile with a custom DelimitedInputFormat?  The StreamingFileSink documentation (https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html) is a bit vague.

These files are used as raw logs.  They're processed offline, and each record is read and consumed whole.

Thanks!
- Dan

Re: Batch version of StreamingFileSink.forRowFormat(...)

Timo Walther
Hi Dan,

InputFormats are the connectors of the DataSet API. Yes, you can use
readFile, readCsvFile, readFileOfPrimitives, etc., or plug in your own
InputFormat (see the sketch below). However, I would also recommend
giving the Table API a try: the unified TableEnvironment can perform
batch processing and is integrated with a number of connectors, such as
the filesystem connector [1] and the Hive abstractions [2] (a small
example follows the links below).
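
For your protobuf files, a custom InputFormat could look roughly like
this. This is an untested sketch: it assumes your encoder wrote the
records length-delimited (e.g. via writeDelimitedTo), and Event is a
placeholder for your generated message class:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.flink.api.common.io.FileInputFormat;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.core.fs.Path;

    public class ProtoEventInputFormat extends FileInputFormat<Event> {

        private transient InputStream in;
        private transient Event next;

        public ProtoEventInputFormat(Path filePath) {
            super(filePath);
            // The protobuf framing has no delimiter to split on,
            // so read each file as a whole.
            this.unsplittable = true;
        }

        @Override
        public void open(FileInputSplit split) throws IOException {
            super.open(split); // opens this.stream
            in = new BufferedInputStream(stream);
            next = Event.parseDelimitedFrom(in); // null for an empty file
        }

        @Override
        public boolean reachedEnd() {
            return next == null;
        }

        @Override
        public Event nextRecord(Event reuse) throws IOException {
            Event current = next;
            next = Event.parseDelimitedFrom(in); // null at end of file
            return current;
        }
    }

You would then use it like (path made up):

    DataSet<Event> events = env.createInput(
        new ProtoEventInputFormat(new Path("s3://bucket/raw-logs")));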

I hope this helps.

Regards,
Timo

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/filesystem.html
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/hive_read_write.html
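
To give an idea of the route in [1] in batch mode (schema, path, and
format are made up for the example; your raw protobuf bytes would still
need a format that understands them):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    EnvironmentSettings settings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inBatchMode()
        .build();
    TableEnvironment tEnv = TableEnvironment.create(settings);

    // Hypothetical table over the sink's output directory; CSV is just
    // one of the formats the filesystem connector ships with.
    tEnv.executeSql(
        "CREATE TABLE raw_logs (" +
        "  line STRING" +
        ") WITH (" +
        "  'connector' = 'filesystem'," +
        "  'path' = 's3://bucket/raw-logs'," +
        "  'format' = 'csv'" +
        ")");

    tEnv.executeSql("SELECT * FROM raw_logs").print();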

Re: Batch version of StreamingFileSink.forRowFormat(...)

Dan
Thanks!  I'll take a look.
