Re: Decompressing Tar Files for Batch Processing
Posted by Chesnay Schepler
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Decompressing-Tar-Files-for-Batch-Processing-tp36436p36453.html
I would probably go with a separate process (see the sketch at the end
of this post).
Downloading the file with Flink could work if it is already stored in a
supported filesystem. Decompressing the file is supported for selected
formats (deflate, gzip, bz2, xz), but this seems to be an undocumented
feature, so I'm not sure how usable it is in practice.
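
As a minimal sketch of that transparent decompression (assuming the
DataSet API of that era and a made-up path and schema, not anything
from Austin's setup), the input format picks a decompressor based on
the file extension, so a gzipped CSV can be read like a plain one:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class GzipCsvRead {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // FileInputFormat chooses a decompressor from the extension
        // (.gz, .gzip, .bz2, .xz, .deflate); compressed inputs are
        // read as a single split, i.e. not in parallel per file.
        DataSet<Tuple2<String, Integer>> rows = env
            .readCsvFile("file:///data/input.csv.gz")
            .types(String.class, Integer.class);

        rows.first(10).print();
    }
}

As far as I can tell this only covers the compression codec; it does
not unpack the tar archive itself, so the individual CSVs would still
have to be extracted beforehand.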
On 07/07/2020 01:30, Austin Cawley-Edwards wrote:
> Hey all,
>
> I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
> The data is fairly connected and needs some cleaning, which I'd like
> to do with the Batch Table API + SQL (which I haven't used before).
> I've got a small prototype loading the uncompressed CSVs and applying
> the necessary SQL, which works well.
>
> I'm wondering about the task of downloading the tar file and extracting
> it into the CSVs. Does this sound like something I can/should do in
> Flink, or should I set up another process to download, extract, and
> store the files in a filesystem to then read with the Flink Batch job?
> My research is leading me towards doing it separately, but I'd like to
> do it all in the same job if there's a creative way.
>
> Thanks!
> Austin
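
For the separate download-and-extract step discussed above, a rough
sketch (assuming Apache Commons Compress is on the classpath; the URL
and target directory are placeholders, not anything from Austin's
setup) could write each CSV member of the tar into a directory that
the Flink batch job then reads:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class FetchAndUntar {
    public static void main(String[] args) throws Exception {
        // Placeholder source URL and target directory.
        URL source = new URL("https://example.com/data.tar");
        Path targetDir = Paths.get("/data/csv");
        Files.createDirectories(targetDir);

        // Stream the archive and write each CSV entry to the target
        // directory without materializing the whole tar on disk.
        try (InputStream in = source.openStream();
             TarArchiveInputStream tar = new TarArchiveInputStream(in)) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isFile() && entry.getName().endsWith(".csv")) {
                    Path out = targetDir.resolve(
                            Paths.get(entry.getName()).getFileName());
                    // Reads on the archive stream are bounded to the
                    // current entry, so this copies exactly one CSV.
                    Files.copy(tar, out);
                }
            }
        }
    }
}

If the archive is actually a .tar.gz, wrapping the input stream in a
GzipCompressorInputStream (also from Commons Compress) before the tar
stream would handle the decompression.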