I looked into this briefly a while ago out of interest and read about how Beam handles this. I've never actually implemented it, but the concept sounds reasonable to me.
What I read from their code is that Beam exports the BigQuery data to Google Cloud Storage. This export shards the data into files with a maximum size of 1 GB, and these files are then processed by the 'source functions' in Beam.
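Just to illustrate, here is a rough (untested) sketch of what that export step could look like with the Google Cloud BigQuery and Storage Java clients. The dataset, table and bucket names are made up, and the wildcard in the destination URI is what makes BigQuery split the export into multiple files:

> import com.google.cloud.bigquery.BigQuery;
> import com.google.cloud.bigquery.BigQueryOptions;
> import com.google.cloud.bigquery.Job;
> import com.google.cloud.bigquery.TableId;
> import com.google.cloud.storage.Blob;
> import com.google.cloud.storage.Storage;
> import com.google.cloud.storage.StorageOptions;
> import java.util.ArrayList;
> import java.util.List;
>
> public class BigQueryExport {
>
>   // Export a BigQuery table to GCS and return the list of exported files.
>   // Dataset, table and bucket names below are placeholders.
>   public static List<String> doBigQueryExportJob() throws InterruptedException {
>     BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
>
>     // The wildcard in the destination URI lets BigQuery shard the export
>     // into multiple files (each at most 1 GB).
>     Job job = bigquery
>         .getTable(TableId.of("my_dataset", "my_table"))
>         .extract("NEWLINE_DELIMITED_JSON", "gs://my-bucket/export/shard-*.json");
>     job = job.waitFor();
>     if (job == null || job.getStatus().getError() != null) {
>       throw new RuntimeException("BigQuery export failed");
>     }
>
>     // List the exported shards so Flink can read them in parallel.
>     Storage storage = StorageOptions.getDefaultInstance().getService();
>     List<String> files = new ArrayList<>();
>     for (Blob blob : storage.list("my-bucket", Storage.BlobListOption.prefix("export/")).iterateAll()) {
>       files.add("gs://my-bucket/" + blob.getName());
>     }
>     return files;
>   }
> }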
I think implementing this in Flink would require the following: trigger a BigQuery export job to Google Cloud Storage, collect the list of exported files, and then read and process those files in parallel.
So the job might look something like this:
> List<String> files = doBigQueryExportJob();
> DataSet<String> records = environment.fromCollection(files)
>     .flatMap(new ReadFromFile())
>     .map(new DoWork());
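ReadFromFile would then fetch a single exported shard from GCS and emit its records. Again just a rough, untested sketch with made-up names, using the GCS client directly inside a RichFlatMapFunction:

> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.URI;
> import java.nio.channels.Channels;
> import java.nio.charset.StandardCharsets;
> import org.apache.flink.api.common.functions.RichFlatMapFunction;
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.util.Collector;
> import com.google.cloud.storage.Blob;
> import com.google.cloud.storage.BlobId;
> import com.google.cloud.storage.Storage;
> import com.google.cloud.storage.StorageOptions;
>
> // Reads one exported shard (given as a gs:// URI) and emits each line as a record.
> public class ReadFromFile extends RichFlatMapFunction<String, String> {
>
>   private transient Storage storage;
>
>   @Override
>   public void open(Configuration parameters) {
>     // One GCS client per parallel task instance.
>     storage = StorageOptions.getDefaultInstance().getService();
>   }
>
>   @Override
>   public void flatMap(String gcsUri, Collector<String> out) throws Exception {
>     // gs://bucket/path/shard.json  ->  bucket + object name
>     URI uri = URI.create(gcsUri);
>     Blob blob = storage.get(BlobId.of(uri.getHost(), uri.getPath().substring(1)));
>
>     try (BufferedReader reader = new BufferedReader(
>         new InputStreamReader(Channels.newInputStream(blob.reader()), StandardCharsets.UTF_8))) {
>       String line;
>       while ((line = reader.readLine()) != null) {
>         out.collect(line); // one JSON record per line for NEWLINE_DELIMITED_JSON exports
>       }
>     }
>   }
> }

Note that fromCollection() runs with parallelism 1, so I would expect you need a .rebalance() between the source and the flatMap to actually spread the files over the cluster.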
Hi,
Has anyone created a source to READ from BigQuery into Flink yet (we
have Flink running on K8s in the Google cloud)?
I would like to retrieve a DataSet in a distributed way (the data ...
it's kinda big) and process it with the Flink cluster we already have
running.
So far I have not been able to find anything yet.
Any pointers/hints/code fragments are welcome.
Thanks
--
Best regards / Met vriendelijke groeten,
Niels Basjes