Flink Redshift table lookup and updates

Flink Redshift table lookup and updates

Harshith Chennamaneni

Hi,

I've very recently come across Flink and I'm trying to use it to solve a problem that I have.

I have a stream of user-settings updates coming through a Kafka topic. I need to store each user's most recent settings, along with a history of their settings, in Redshift, which then feeds into analytics dashboards.

I've been contemplating using Flink for this problem. I wanted some guidance from people experienced with Flink to help me decide whether Flink is suited to this problem and, if so, which approach might work best. I am considering the following approaches:

1. Create a secondary key-value database holding each user's latest settings, and after grouping the stream with keyBy(userId), look up these settings to check whether a setting has changed and, if so, create a history record. I came across this Stack Overflow thread to help with this approach: http://stackoverflow.com/questions/38866078/how-to-look-up-and-update-the-state-of-a-record-from-a-database-in-apache-flink

2. Pull the current snapshot of users from Redshift at program start and keep it as state in the Flink program (the snapshot isn't huge, ~1GB). Subsequently look up settings in that state and update it when processing events (see the sketch after this list).
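
Roughly what I'm imagining for approach 2, as a minimal sketch (Java DataStream API; SettingsEvent and HistoryRecord are placeholder types of mine, not anything from Flink):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class SettingsDiffer extends RichFlatMapFunction<SettingsEvent, HistoryRecord> {

    // Latest settings for the current user; Flink scopes this state to the key (userId).
    private transient ValueState<String> latestSettings;

    @Override
    public void open(Configuration parameters) {
        latestSettings = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-settings", String.class));
    }

    @Override
    public void flatMap(SettingsEvent event, Collector<HistoryRecord> out) throws Exception {
        String previous = latestSettings.value();
        // Emit a history record only when the settings actually changed.
        if (previous == null || !previous.equals(event.settings)) {
            latestSettings.update(event.settings);
            out.collect(new HistoryRecord(event.userId, previous, event.settings));
        }
    }
}

// wiring: settingsStream.keyBy(e -> e.userId).flatMap(new SettingsDiffer())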

In both cases I plan to create a Redshift sink that batches the updates to the history as well as to the latest state and persists them to Redshift in batches (via S3 and the COPY command for the history table, and via an UPDATE with a join for the snapshot table), as in the sketch below.
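
A very rough sketch of the batching sink (uploadToS3AndCopy is a hypothetical helper of mine that would stage the batch on S3 as CSV and run the COPY over JDBC; for exactly-once I'd probably need to tie flushing to Flink's checkpoints rather than a size threshold):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BatchingRedshiftSink extends RichSinkFunction<HistoryRecord> {

    private static final int BATCH_SIZE = 5000;
    private transient List<HistoryRecord> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = new ArrayList<>(BATCH_SIZE);
    }

    @Override
    public void invoke(HistoryRecord record) throws Exception {
        buffer.add(record);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    @Override
    public void close() throws Exception {
        flush();  // drain whatever is left on shutdown
    }

    private void flush() throws Exception {
        if (buffer.isEmpty()) {
            return;
        }
        uploadToS3AndCopy(buffer);  // hypothetical: S3 put, then COPY via JDBC
        buffer.clear();
    }

    private void uploadToS3AndCopy(List<HistoryRecord> batch) {
        // omitted: write the batch as CSV to S3, then run
        // "COPY settings_history FROM 's3://...' ..." over a JDBC connection
    }
}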

Is one of these designs better suited to Flink? Is there an alternative I should consider?

Thanks!

-H

Re: Flink Redshift table lookup and updates

rmetzger0
Hi Harshith,

Welcome to the Flink community ;)

I would recommend approach 2. Keeping the state in Flink and just sending updates to the dashboard store should give you better performance and consistency.
I don't know whether it's better to download the full state snapshot from Redshift at the beginning, or to lazily load the required data once you need it (and then use the state afterwards).
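
The lazy-loading variant could look roughly like this, building on the keyed-state sketch from your mail (fetchSettingsFromRedshift would be your own JDBC lookup, not a Flink API):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class LazyLoadingSettingsDiffer extends RichFlatMapFunction<SettingsEvent, HistoryRecord> {

    private transient ValueState<String> latestSettings;

    @Override
    public void open(Configuration parameters) {
        latestSettings = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-settings", String.class));
    }

    @Override
    public void flatMap(SettingsEvent event, Collector<HistoryRecord> out) throws Exception {
        String previous = latestSettings.value();
        if (previous == null) {
            // First event for this user since the job started: fall back to Redshift once.
            previous = fetchSettingsFromRedshift(event.userId);  // hypothetical JDBC lookup
            if (previous != null) {
                latestSettings.update(previous);
            }
        }
        if (previous == null || !previous.equals(event.settings)) {
            latestSettings.update(event.settings);
            out.collect(new HistoryRecord(event.userId, previous, event.settings));
        }
    }

    private String fetchSettingsFromRedshift(String userId) {
        // omitted: SELECT the user's latest settings over a JDBC connection
        return null;
    }
}

Keep in mind that a blocking per-user lookup stalls the operator while the query runs, so this pays off mainly when most users are seen early and then served from state.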

Regards,
Robert
