We have an input file that is tarred and compressed to 12 GB. It is about 50 GB uncompressed. With readTextFile(), I see Flink decompress the file, but it doesn't seem to handle the untar step. The archive contains just a single file. (We don't control the input format.)

foo.tar.gz 12 GB
foo.tar 50 GB
Untarring it yields valid JSONL.

When reading, we get this exception:

Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'playstore': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: UNKNOWN; line: 1, column: 10]
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)

The process is seeing the tar header and rightly complaining that it is not valid JSON.

Is it possible to untar this file using Flink?
Hi Billy,

I suspect that it's not possible in Flink as is. A tar file acts as a directory containing an arbitrary number of files. Afaik, Flink assumes that all compressed files are just single files, like gz without tar. That is effectively your case too, since the archive holds only one file, but then the tar layer doesn't add much.

Since you cannot control the input, you have two options:
* External process that unpacks the file and then calls Flink.
* Implement your own input format similar to [1].

On Mon, Dec 28, 2020 at 2:41 PM Billy Bain <[hidden email]> wrote:
-- Arvid Heise | Senior Java Developer
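
For the first option (an external process that unpacks the file before Flink runs), a minimal sketch in Python might look like the following. It is only an illustration, not something from the thread: the function name extract_jsonl and the output file name are hypothetical, and it assumes the archive holds regular files whose contents can simply be written out one after another. It reads the archive as a stream, so neither the 12 GB archive nor the 50 GB payload needs to fit in memory.

```python
import tarfile


def extract_jsonl(archive_path, out_path):
    """Stream every regular file in a .tar.gz archive into one plain file.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    written = 0
    # Mode "r|gz" treats the archive as a forward-only gzip stream, so
    # members are read in order without seeking or loading them fully.
    with tarfile.open(archive_path, mode="r|gz") as tar, open(out_path, "wb") as out:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            src = tar.extractfile(member)
            while True:
                chunk = src.read(1 << 20)  # copy in 1 MiB chunks
                if not chunk:
                    break
                out.write(chunk)
                written += len(chunk)
    return written
```

Calling extract_jsonl("foo.tar.gz", "foo.jsonl") before submitting the job would leave a plain JSONL file that readTextFile() can be pointed at. If the archive ever held several files, their contents would be concatenated as-is, so each would need to end with a newline for the result to stay valid JSONL.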