Hi. I have a streaming job that writes to StreamingFileSink.forRowFormat(...) with an encoder that converts protocol buffers to byte arrays.
How do I read this data back in during a batch pipeline (using the DataSet API)? Do I use env.readFile with a custom DelimitedInputFormat? The StreamingFileSink documentation <https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html> is a bit vague.

These files are used as raw logs: they're processed offline, and each record is read and consumed in its entirety.

Thanks!
- Dan
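For reference, a minimal sketch of the kind of sink described above might look like the following. MyProto, the output path, and the length-prefixed framing via writeDelimitedTo are assumptions for illustration; the actual encoder may differ, but some framing like this is needed so record boundaries can be recovered when the files are read back.

    import org.apache.flink.api.common.serialization.Encoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    // Encoder that writes each protobuf message with a length prefix
    // (writeDelimitedTo), so record boundaries survive in the raw files.
    // MyProto is a placeholder for the generated protobuf message class.
    Encoder<MyProto> encoder = (record, out) -> record.writeDelimitedTo(out);

    StreamingFileSink<MyProto> sink = StreamingFileSink
        .forRowFormat(new Path("s3://logs/raw"), encoder)  // hypothetical path
        .build();

    stream.addSink(sink);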
Hi Dan,
InputFormats are the connectors of the DataSet API. Yes, you can use readFile with a custom input format (there are also readCsvFile, readFileOfPrimitives, etc.). However, I would also recommend giving the Table API a try. The unified TableEnvironment is able to perform batch processing and is integrated with a number of connectors, such as the filesystem connector [1] and the Hive abstractions [2].

I hope this helps.

Regards,
Timo

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/connectors/filesystem.html
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/hive_read_write.html
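A sketch of the DataSet route, under the assumption that the records were written length-prefixed with writeDelimitedTo as in the sink sketch above. Note that DelimitedInputFormat splits records on a fixed delimiter byte sequence, which is unsafe for arbitrary binary payloads (the delimiter can occur inside a message), so this extends FileInputFormat directly and frames records with parseDelimitedFrom. MyProto is again a placeholder for the generated protobuf class; this is an untested illustration, not an official connector.

    import java.io.IOException;

    import org.apache.flink.api.common.io.FileInputFormat;
    import org.apache.flink.core.fs.FileInputSplit;

    // Reads back length-prefixed protobuf records written by the
    // row-format StreamingFileSink.
    public class ProtoInputFormat extends FileInputFormat<MyProto> {

        private boolean reachedEnd;

        public ProtoInputFormat() {
            // Record boundaries cannot be detected at arbitrary byte
            // offsets, so each file must be read as a whole.
            this.unsplittable = true;
        }

        @Override
        public void open(FileInputSplit split) throws IOException {
            super.open(split);
            reachedEnd = false;
        }

        @Override
        public boolean reachedEnd() {
            return reachedEnd;
        }

        @Override
        public MyProto nextRecord(MyProto reuse) throws IOException {
            // parseDelimitedFrom returns null once the stream is exhausted.
            MyProto record = MyProto.parseDelimitedFrom(stream);
            if (record == null) {
                reachedEnd = true;
            }
            return record;
        }
    }

It could then be used from the batch environment like this (the path is again hypothetical):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<MyProto> logs = env.readFile(new ProtoInputFormat(), "s3://logs/raw");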
Thanks! I'll take a look.