Flink DynamoDB stream connector losing records

Jiawei Wu
Hi,

I'm using an AWS Kinesis Analytics application with Flink 1.8, and I am using the FlinkDynamoDBStreamsConsumer to consume DynamoDB stream records. Recently I found that my internal state is wrong.

After printing some logs I found that some DynamoDB stream records are skipped and never consumed by Flink. Has anyone encountered the same issue before? Or is it a known issue in Flink 1.8?
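
For reference, the job wires the consumer up roughly like this (a simplified sketch, not my actual code; the region, stream ARN, and schema are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkDynamoDBStreamsConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class DdbStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties config = new Properties();
        config.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
        config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "TRIM_HORIZON");

        // The source is identified by the DynamoDB stream ARN (placeholder here).
        DataStream<String> records = env.addSource(
                new FlinkDynamoDBStreamsConsumer<>(
                        "arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/...",
                        new SimpleStringSchema(),
                        config));

        records.print();
        env.execute("ddb-stream-consumer");
    }
}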

Thanks,
Jiawei

Re: Flink DynamoDB stream connector losing records

Andrey Zagrebin-5
Hi Jiawei,

Could you try the latest Flink release, 1.11?
1.8 will probably not get bugfix releases.
I will cc Ying Xu, who might have a better idea about the DynamoDB source.

Best,
Andrey


Re: Flink DynamoDB stream connector losing records

Jiawei Wu
Hi Andrey,

Thanks for your suggestion, but I'm using a Kinesis Analytics application, which supports only Flink 1.8.

Regards,
Jiawei


Re: Flink DynamoDB stream connector losing records

Jiawei Wu
And I suspect I am being throttled by the DynamoDB stream. I contacted AWS support but got no response other than a suggestion to increase WCU and RCU.

Is it possible that Flink loses exactly-once semantics when throttled?


Re: Flink DynamoDB stream connector losing records

Andrey Zagrebin-5
Generally speaking, this should not be a problem for exactly-once, but I am not familiar with DynamoDB and its Flink connector.
Did you observe any failover in the Flink logs?


Re: Flink DynamoDB stream connector losing records

Ying Xu
Hi Jiawei,

Sorry for the delayed reply. When you mention certain records getting skipped, is it within the same run or across different runs? Do you have any more specific details on how/when records are lost?
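
If it helps to narrow this down, one option (a rough sketch on my side; the wrapper class name is mine, and I'm assuming UTF-8 payloads) is to log every record's shard id and sequence number through a custom KinesisDeserializationSchema and then look for gaps:

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kinesis.serialization.KinesisDeserializationSchema;

public class SequenceLoggingSchema implements KinesisDeserializationSchema<String> {

    @Override
    public String deserialize(byte[] recordValue, String partitionKey, String seqNum,
                              long approxArrivalTimestamp, String stream, String shardId) {
        // Comparing these logs against the table's actual writes should show
        // whether a range of sequence numbers was never delivered to Flink.
        System.out.printf("stream=%s shard=%s seq=%s%n", stream, shardId, seqNum);
        return new String(recordValue, StandardCharsets.UTF_8);
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}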

FlinkDynamoDBStreamsConsumer is built on top of FlinkKinesisConsumer, with a similar offset management mechanism. In theory it shouldn't lose exactly-once semantics when throttled. We haven't run it in an AWS Kinesis Analytics environment, though.

Thanks. 



Re: Flink DynamoDB stream connector losing records

Cranmer, Danny

Hi Jiawei,

I agree that the offset management mechanism uses the same code as the Kinesis stream consumer, and in theory it should not lose exactly-once semantics. As Ying alludes to, if your application is restarted and you have snapshotting disabled in AWS, there is a chance that records are lost between runs. However, if you have snapshotting enabled, the application should continue consuming from the last processed sequence number.
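
For a self-managed Flink job the equivalent safety net is enabling checkpointing; in Kinesis Data Analytics this is driven by the service's snapshot/checkpoint settings rather than job code, so the sketch below is for illustration only:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // With checkpointing on, the consumer snapshots its per-shard sequence
        // numbers and restores them after a failure, so a restart resumes from
        // the last processed position instead of skipping ahead.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE); // every 60s
    }
}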

I am happy to take a deeper look if you can provide more information/logs/code.

Thanks,

Re: Flink DynamoDB stream connector losing records

Jiawei Wu
Hi Ying and Danny,

Sorry for the late reply, I just got back from vacation.

Yes, I'm running Flink 1.8 in Kinesis Data Analytics, and checkpointing is enabled. This fully managed solution limits my access to the Flink logs; so far I haven't found any logs related to throttling or failover. The reason I suspect throttling as the root cause is that an AWS Lambda connected to the same DynamoDB stream shows higher throttling right after Flink starts consuming the stream, so I believe the throttling also happens on the Flink side. I'm actively working with AWS support to try to find logs for this.

At the same time, when you say it "in theory should not lose exactly-once semantics", does that mean Flink will retry when throttled? I noticed there is a parameter "flink.shard.getrecords.maxretries" whose default value is 3. Will Flink skip the record when all retry attempts have failed?
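
In case it's relevant, while we investigate I'm considering raising the retry budget and backoff on the consumer (a sketch; the property keys are the ConsumerConfigConstants ones, the values are just illustrative):

import java.util.Properties;

import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class RetryTuning {
    // Consumer properties with a larger GetRecords retry/backoff budget.
    static Properties consumerConfig() {
        Properties config = new Properties();
        config.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_RETRIES, "10");        // default is 3
        config.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_BASE, "1000"); // ms
        config.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_MAX, "10000"); // ms
        return config;
    }
}

My (unverified) reading of the Kinesis connector is that exhausting these retries fails the source task, so the job restarts from the last checkpoint rather than silently skipping the record, but I'd appreciate confirmation.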

Thanks,
Jiawei


