read a tarred + gzipped file flink 1.12

Billy Bain
We have an input file that is tarred and gzipped down to 12 GB. It is about 50 GB uncompressed.

With readTextFile(), I can see Flink decompress the file, but it doesn't seem to handle the untar step. The archive contains just a single file. (We don't control the input format.)

foo.tar.gz  12 GB
foo.tar     50 GB
Untarring foo.tar yields valid JSONL.

When reading, we get this exception:

Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'playstore': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: UNKNOWN; line: 1, column: 10]
at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)

The process is seeing the header in the tar format and rightly complaining about the JSON format. 

Is it possible to untar this file using Flink? 

--
Wayne D. Young
aka Billy Bob Bain
[hidden email]
Re: read a tarred + gzipped file flink 1.12

Arvid Heise-3
Hi Billy,

I suspect this isn't possible in Flink as is. A tar file acts as a directory containing an arbitrary number of files. As far as I know, Flink assumes that every compressed input is just a single file, like gz without tar. That is effectively your case, since the archive holds only one file, but then the tar layer doesn't add much.

Since you cannot control the input, you have two options:
* Run an external process that unpacks the file and then calls Flink.
* Implement your own input format similar to [1].
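
For the first option, a minimal sketch of such an unpacking step is shown here in Python, using only the standard library. It is an illustration under assumptions, not something Flink ships: the function name unpack_tar_gz and the output layout are hypothetical. It reads the archive as a forward-only stream, so the 50 GB tar is never held in memory, and it writes each regular file member to an output directory that a Flink job could then read.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_tar_gz(archive_path, output_dir):
    """Stream a .tar.gz archive and write each regular file member to output_dir.

    Hypothetical helper for illustration. Returns the paths written, in
    archive order.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # "r|gz" treats the archive as a forward-only gzip stream, so members are
    # processed one at a time without seeking or loading the archive in memory.
    with tarfile.open(archive_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Keep only the base name so a member path cannot escape output_dir.
            # Assumption: member base names are unique; duplicates would overwrite.
            target = out / Path(member.name).name
            source = tar.extractfile(member)
            with source, open(target, "wb") as sink:
                shutil.copyfileobj(source, sink)
            written.append(target)
    return written
```

After a call such as unpack_tar_gz("foo.tar.gz", "unpacked"), the job can be started with readTextFile() pointed at the unpacked directory instead of the archive. Both paths here are placeholders.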


In reply to Billy Bain's message of Mon, Dec 28, 2020 at 2:41 PM.


--

Arvid Heise | Senior Java Developer

