External DB as sink - with processing guarantees

External DB as sink - with processing guarantees

Josh
Hi all,

I want to use an external data store (DynamoDB) as a sink with Flink. It looks like there's no connector for Dynamo at the moment, so I have two questions:

1. Is it easy to write my own sink for Flink and are there any docs around how to do this?
2. If I do this, will I still get Flink's processing guarantees? I.e., can I be sure that every tuple contributes to the DynamoDB state either at-least-once or exactly-once?

Thanks for any advice,
Josh

Re: External DB as sink - with processing guarantees

Nick Dimiduk
Pretty much anything you can write to from a Hadoop MapReduce program can be a Flink destination. Just plug in the OutputFormat and go.

Re: output semantics, your mileage may vary. Flink should do you fine for at-least-once.
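
For concreteness, here is a minimal sketch of that wiring using Flink's Hadoop compatibility wrapper (it needs the flink-hadoop-compatibility module on the classpath). Hadoop's TextOutputFormat stands in for a DynamoDB OutputFormat, since the exact connector class depends on which library you pick, and the output path is made up:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopSinkExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The Hadoop wrapper expects Tuple2<Key, Value> records.
        DataSet<Tuple2<Text, LongWritable>> data = env.fromElements(
            Tuple2.of(new Text("a"), new LongWritable(1L)),
            Tuple2.of(new Text("b"), new LongWritable(2L)));

        Job job = Job.getInstance();
        TextOutputFormat.setOutputPath(job, new Path("/tmp/flink-hadoop-out"));

        // Wrap any Hadoop OutputFormat (a DynamoDB one slots in the same
        // way) and hand it to Flink as a sink.
        data.output(new HadoopOutputFormat<>(
            new TextOutputFormat<Text, LongWritable>(), job));

        env.execute("Hadoop OutputFormat as a Flink sink");
    }
}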

Re: External DB as sink - with processing guarantees

Josh
Thanks Nick, that sounds good. I'd still like to understand what determines the processing guarantee, though. Say I use a DynamoDB Hadoop OutputFormat with Flink: how do I know what guarantee I have? And if it's at-least-once, is there a way to adapt it to achieve exactly-once?

Thanks,
Josh

Re: External DB as sink - with processing guarantees

Fabian Hueske
Hi Josh,

Flink can guarantee exactly-once processing within its data flow, given that the data sources allow replaying data from a specific position in the stream. For example, Flink's Kafka Consumer supports exactly-once.

Flink achieves exactly-once processing by resetting operator state to a consistent state and replaying data. This means that data might actually be processed more than once, but the operator state will reflect exactly-once semantics because it was reset. Ensuring exactly-once end-to-end is difficult, because Flink does not control (and cannot reset) the state of the sinks. By default, data can be sent more than once to a sink, resulting in at-least-once semantics at the sink.
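
To make that concrete, here is a minimal sketch of turning the mechanism on; the checkpoint interval and the stand-in pipeline are just illustrative:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 5 seconds. On failure, Flink
        // restores the latest snapshot and a replayable source (e.g. the
        // Kafka consumer) re-reads from the offsets stored in it, so
        // operator state reflects exactly-once semantics even though some
        // records may be processed twice.
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3).print(); // stand-in pipeline

        env.execute("checkpointing example");
    }
}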

This issue can be addressed if the sink provides transactional writes (previous writes can be undone) or if the writes are idempotent (applying them several times does not change the result). Transactional support would need to be integrated with Flink's SinkFunction; this is not the case for Hadoop OutputFormats. I am not familiar with the details of DynamoDB, but you would need to implement a SinkFunction with transactional support or use idempotent writes if you want to achieve exactly-once results.
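
For illustration, a rough sketch of the idempotent route as a custom SinkFunction, using the AWS SDK for Java. The table name ("my-table") and the attribute layout are made up, and error handling, batching, and credentials configuration are omitted:

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Idempotent DynamoDB sink: PutItem fully overwrites the item with the
// same key, so replaying a record after a failure produces the same table
// state. At-least-once delivery then yields exactly-once *results*.
public class DynamoDBSink extends RichSinkFunction<Tuple2<String, Long>> {

    private transient AmazonDynamoDBClient client;

    @Override
    public void open(Configuration parameters) {
        client = new AmazonDynamoDBClient(); // default credential chain
    }

    @Override
    public void invoke(Tuple2<String, Long> value) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("id", new AttributeValue(value.f0)); // partition key
        item.put("count", new AttributeValue().withN(value.f1.toString()));

        // Writing the same record twice leaves the table unchanged,
        // which is what makes this write idempotent.
        client.putItem(new PutItemRequest("my-table", item));
    }

    @Override
    public void close() {
        if (client != null) {
            client.shutdown();
        }
    }
}

You would attach it with stream.addSink(new DynamoDBSink()). Because PutItem overwrites the item with identical content on replay, the table converges to the same state after a recovery.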

Best, Fabian

Re: External DB as sink - with processing guarantees

Josh
Hi Fabian,

Thanks, that's very helpful. Actually, most of my writes will be idempotent, so I guess that means I'll get the exactly-once guarantee using the Hadoop output format!

Thanks,
Josh
