BigQuery source ?

Niels Basjes
Hi,

Has anyone created a source to read from BigQuery into Flink yet?
We already have Flink running on Kubernetes in Google Cloud, and I would
like to retrieve a DataSet from BigQuery in a distributed way (the data
is kinda big) and process it with that Flink setup.

So far I have not been able to find anything.
Any pointers/hints/code fragments are welcome.

Thanks

--
Best regards / Met vriendelijke groeten,

Niels Basjes
Re: BigQuery source ?

Richard Deurwaarder
I looked into this briefly a while ago out of interest and read up on how Apache Beam handles it. I never actually implemented it myself, but the concept sounds reasonable to me.

What I read from their code is that Beam exports the BigQuery data to Google Cloud Storage. This export shards the data into files with a maximum size of 1GB, and these files are then processed by the 'source functions' in Beam.
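To make the sharding concrete, here is a minimal sketch of how the exported file names could be enumerated. It assumes the export was written to a destination URI containing a single `*` wildcard, which BigQuery documents as being replaced by a 12-digit, zero-padded counter starting at 000000000000. `ExportShards`, `expand`, and the bucket name are made up for illustration, and the file count is assumed to be known already (for example from the finished export job's statistics, or by listing the bucket).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the object URIs produced by a wildcard BigQuery export. */
final class ExportShards {

    /**
     * Expands a destination URI containing exactly one '*' into one URI per shard.
     * Assumes BigQuery's documented naming: '*' becomes a 12-digit, zero-padded
     * counter starting at 000000000000.
     */
    static List<String> expand(String wildcardUri, int fileCount) {
        int star = wildcardUri.indexOf('*');
        if (star < 0 || star != wildcardUri.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*' in " + wildcardUri);
        }
        if (fileCount < 0) {
            throw new IllegalArgumentException("fileCount must not be negative");
        }
        String prefix = wildcardUri.substring(0, star);
        String suffix = wildcardUri.substring(star + 1);
        List<String> uris = new ArrayList<>(fileCount);
        for (int i = 0; i < fileCount; i++) {
            uris.add(prefix + String.format("%012d", i) + suffix);
        }
        return uris;
    }

    public static void main(String[] args) {
        // Bucket and object prefix are invented for this example.
        expand("gs://my-bucket/export/part-*.json", 3).forEach(System.out::println);
    }
}
```

A list like this is what could be handed to the Flink job as its input collection, one element per exported file.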

I think implementing this in Flink would require the following:

* Before starting the Flink job, run the BigQuery export to Google Cloud Storage (https://cloud.google.com/bigquery/docs/exporting-data)
* Start the Flink job and point it at the exported Google Cloud Storage files (using https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage to easily read from Google Cloud Storage buckets)

So the job might look something like this:

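> // doBigQueryExportJob(), ReadFromFile and doWork() are placeholders, not existing Flink APIs.
> List<String> files = doBigQueryExportJob();      // URIs of the exported files
> DataSet<String> records = environment.fromCollection(files)
>                 .flatMap(new ReadFromFile())     // emits the records found in each file
>                 .map(doWork());                  // returns the MapFunction with the actual processing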
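The shape of that pipeline can be tried out without a cluster. The following is a plain-Java stand-in built on `java.util.stream`, not Flink code: it only shows what a `ReadFromFile`-style flatMap would do, namely turn each file into its records. It assumes one record per line (as in a newline-delimited JSON or CSV export) and reads local temp files instead of `gs://` objects. All names in it are hypothetical, and `String::trim` merely takes the place of `doWork()`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Plain-Java stand-in for the pipeline above; no Flink classes are used. */
final class ShardPipelineSketch {

    /** What a ReadFromFile flatMap could do: one exported file in, its lines out. */
    static Stream<String> readFromFile(Path file) {
        try {
            return Files.lines(file); // lazily reads the file as UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Mirrors fromCollection(files).flatMap(new ReadFromFile()).map(doWork()). */
    static List<String> run(List<Path> files) {
        return files.stream()
                .flatMap(ShardPipelineSketch::readFromFile) // flatMap closes each file's stream when done
                .map(String::trim)                          // placeholder for doWork()
                .collect(Collectors.toList());
    }

    /** Demo helper: writes the given lines to a temp file standing in for one export shard. */
    static Path tempShard(String... lines) {
        try {
            Path file = Files.createTempFile("shard-", ".json");
            Files.write(file, Arrays.asList(lines));
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Path> files = Arrays.asList(
                tempShard("{\"id\": 1} ", "{\"id\": 2}"),
                tempShard("{\"id\": 3}"));
        run(files).forEach(System.out::println);
    }
}
```

In a real job the file-reading part would live inside a Flink flatMap function and go through the Cloud Storage connector, so that each parallel subtask reads its own subset of the exported files.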

In reply to Niels Basjes's message of Fri, May 31, 2019, 10:15 AM.