Guys, How to use JDBC connection in Flink using Scala


ram kumar-2
Hi Team,

I am wondering whether it is possible to use a JDBC connection or URL as a source or target in Flink using Scala.
Could someone kindly help me with this? If you have any sample code, please share it here.


Thanks
Ram


Re: Guys, How to use JDBC connection in Flink using Scala

Felix Dreissig
Hi Ram,

On 24 Sep 2016, at 16:08, ram kumar <[hidden email]> wrote:
> I am wondering whether it is possible to use a JDBC connection or URL as a source or target in Flink using Scala.
> Could someone kindly help me with this? If you have any sample code, please share it here.

What’s your intended use case? Getting changes from a database or REST API into a data stream for processing in Flink?

If so, you could use a change data capture (CDC) tool to write your changes to Kafka and then let Flink receive them from there. Examples are Bottled Water [1] for Postgres, and Maxwell [2] and Debezium [3] for MySQL.
For REST, I suppose you’d have to periodically query the API and determine the changes yourself. I don’t know whether there are any tools to help you with that.
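For illustration, here is a minimal, untested Scala sketch of the Flink side, assuming the Kafka 0.9 connector (flink-connector-kafka-0.9); the broker address, group id and topic name are placeholders for your own setup:

import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object CdcFromKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Kafka connection settings; all values here are placeholders.
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "flink-cdc-consumer")

    // Each record is one change event as written to Kafka by the CDC tool.
    val changes: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer09[String]("db-changes", new SimpleStringSchema(), props))

    changes.print()
    env.execute("Consume CDC events from Kafka")
  }
}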

Regards,
Felix

[1] https://github.com/confluentinc/bottledwater-pg
[2] http://maxwells-daemon.io/
[3] http://debezium.io/

Re: Guys, How to use JDBC connection in Flink using Scala

ram kumar-2
Many thanks, Felix.

Flink use case:

Extract data from the source (Kafka) and load it into the targets (AWS S3 and Redshift).

We use SCD2 (slowly changing dimensions, type 2) in Redshift, since data changes need to be captured in the Redshift target.

To connect to Redshift (for the staging and production databases), I need to set up a JDBC connection in Flink using Scala.

Kafka (Source) ------------> Flink (JDBC) -----------> AWS (S3 and Redshift) Target

Could you please suggest the best approach for this use case?


Regards

Ram.





Re: Guys, How to use JDBC connection in Flink using Scala

Sameer Wadkar
Using plain JDBC against Redshift will be slow for any reasonable data volume, but if you need to do it, you can open a connection from a RichFunction's open() method.
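As a minimal sketch of such a sink: the connection is opened once per parallel instance in open() and reused by every invoke() call. The JDBC URL, credentials, table name and tuple type are placeholders.

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class RedshiftJdbcSink(url: String, user: String, password: String)
    extends RichSinkFunction[(Int, String)] {

  @transient private var conn: Connection = _
  @transient private var stmt: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    // Opened once per parallel sink instance, not per record.
    conn = DriverManager.getConnection(url, user, password)
    stmt = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
  }

  override def invoke(value: (Int, String)): Unit = {
    stmt.setInt(1, value._1)
    stmt.setString(2, value._2)
    stmt.executeUpdate()
  }

  override def close(): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

You would attach it with stream.addSink(new RedshiftJdbcSink(url, user, password)).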

I wrote a blog article a while back on how the Spark-Redshift package works: https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html

This image captures the internal read path in Spark-Redshift: https://databricks.com/wp-content/uploads/2015/10/image01.gif, and this one captures the write path: https://databricks.com/wp-content/uploads/2015/10/image00.gif

In your case, you can read from the Kafka sources, partition the data appropriately (based on your Redshift layout), write the partitions to an S3 bucket, and then invoke the COPY command in Redshift to load the data from that bucket. This is the same process the blog article describes, written out explicitly.
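As a sketch of the load step, assuming the partitions have already been written to an S3 bucket (e.g. via a Flink file sink pointed at an s3:// path) and that the cluster can assume an IAM role with read access to the bucket; the JDBC URL, credentials, table, bucket and role ARN are all placeholders:

import java.sql.DriverManager

object RedshiftCopy {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:redshift://example-cluster:5439/mydb", "user", "password")
    try {
      // Bulk-load the staged S3 files into Redshift with a single COPY.
      val copySql =
        """COPY staging.events
          |FROM 's3://my-bucket/flink-output/'
          |IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
          |FORMAT AS CSV""".stripMargin
      conn.createStatement().execute(copySql)
    } finally {
      conn.close()
    }
  }
}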

Sameer







Re: Guys, How to use JDBC connection in Flink using Scala

Felix Dreissig
Hi Ram,

On 24 Sep 2016, at 18:32, ram kumar <[hidden email]> wrote:
> To connect to Redshift (for the staging and production databases), I need to set up a JDBC connection in Flink using Scala.
>
> Kafka (Source) ------------> Flink (JDBC) -----------> AWS (S3 and Redshift) Target
>
> Could you please suggest the best approach for this use case?

If you get your stream from somewhere else (i.e. Kafka) and the database (i.e. Redshift) is your target only, forget what I said in my previous mail.
You could just write a data sink that uses JDBC and connects to your target, though I don't have any battle-tested code for that.
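As a rough, untested sketch: Flink's flink-jdbc module provides a JDBCOutputFormat that can be attached to a DataStream via writeUsingOutputFormat. The driver class, URL, credentials and query below are placeholders, and the exact package names (e.g. of Row) vary across Flink versions:

import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat
import org.apache.flink.streaming.api.scala._
import org.apache.flink.types.Row

object JdbcSinkExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Map each record into a Row matching the INSERT statement below.
    val rows: DataStream[Row] = env.fromElements((1, "a"), (2, "b")).map { t =>
      val row = new Row(2)
      row.setField(0, t._1)
      row.setField(1, t._2)
      row
    }

    // JDBCOutputFormat issues the INSERTs over a plain JDBC connection.
    val jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
      .setDrivername("com.amazon.redshift.jdbc.Driver")
      .setDBUrl("jdbc:redshift://example-cluster:5439/mydb?user=u&password=p")
      .setQuery("INSERT INTO events (id, payload) VALUES (?, ?)")
      .finish()

    rows.writeUsingOutputFormat(jdbcOutput)
    env.execute("JDBC sink example")
  }
}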

Regards,
Felix