Just curious, has anybody had success running on Amazon EMR with RocksDB and checkpointing in S3?

That's the configuration I am trying to set up, but my system is running more slowly than expected.
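For context, here is a minimal sketch of the kind of setup being discussed, assuming a Flink 1.12-era job with the RocksDB state backend and an S3 checkpoint directory. The bucket name, checkpoint interval, and placeholder pipeline are illustrative, not taken from this thread.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbS3CheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled ("true"),
        // so only changed SST files are uploaded to S3 instead of the full state.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink-checkpoints", true));

        // Trigger a checkpoint every 60 seconds (placeholder interval).
        env.enableCheckpointing(60_000);

        // Placeholder pipeline so the sketch is self-contained.
        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .print();

        env.execute("rocksdb-s3-checkpoint-example");
    }
}
```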
Hi,

Yes, it's working. You would need to analyse what's running more slowly than expected. Checkpointing times? (Async duration? Sync duration? Start delay/back pressure?) Throughput? Recovery/startup? Are you being rate limited by Amazon?

Piotrek

On Thu, Jan 28, 2021 at 03:46 Marco Villalobos <[hidden email]> wrote:
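The checkpoint statistics Piotr mentions (end-to-end duration, sync/async duration, alignment/start delay) are visible in the Flink web UI's checkpoints tab and over the REST API. Below is a small sketch of pulling them programmatically; the JobManager address and job id are placeholders you would substitute yourself.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CheckpointStatsProbe {
    public static void main(String[] args) throws Exception {
        String jobManager = "http://localhost:8081";   // JobManager REST endpoint (placeholder)
        String jobId = "<your-job-id>";                // from the web UI or GET /jobs (placeholder)

        // GET /jobs/:jobid/checkpoints returns counts, durations, and size statistics.
        URL url = new URL(jobManager + "/jobs/" + jobId + "/checkpoints");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Raw JSON; look at fields such as end_to_end_duration and state_size.
                System.out.println(line);
            }
        }
    }
}
```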
Thank you. I need to look into whether I am being rate-limited by Amazon. I assumed that a rate-limiting error would have bubbled up as an error in the logs; I will find a way to ensure that error is logged or captured somehow.

How would backpressure come into play during checkpointing? I would expect Amazon to have enough resources. When I turn my sink (the next operator) into a print, it fails during checkpointing as well.

I will explore what you mentioned though. Thank you.

On Mon, Feb 1, 2021 at 6:53 AM Piotr Nowojski <[hidden email]> wrote:
Hi Marco,

> Is this assumption correct?

Yes. More or less, each operator first creates a copy of its state locally and then uploads that whole file to S3 at once. Please first take a look at which part of checkpointing is taking so long.

Re backpressure: keep in mind that checkpoint barriers need to travel through the job graph. If your job is very heavily backpressured with low record throughput, checkpoints might be timing out because the checkpoint barriers do not manage to propagate through the job graph quickly enough. For example, take a look at my response earlier today. [1]

Best,
Piotrek

On Mon, Feb 1, 2021 at 17:16 Marco Villalobos <[hidden email]> wrote:
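As a hedged side note (not something Piotr suggests explicitly in this thread), two settings that are often brought up when barriers are held back by backpressure are a longer checkpoint timeout and unaligned checkpoints, both available around Flink 1.11+. A minimal sketch:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackpressureTolerantCheckpoints {
    public static void configure(StreamExecutionEnvironment env) {
        env.enableCheckpointing(60_000);
        CheckpointConfig cfg = env.getCheckpointConfig();

        // Give slow checkpoints more time before they are declared failed.
        cfg.setCheckpointTimeout(10 * 60 * 1000);

        // Unaligned checkpoints let barriers overtake buffered records, so a
        // backpressured job can still complete checkpoints, at the cost of
        // larger checkpoints because in-flight data is persisted as well.
        cfg.enableUnalignedCheckpoints();
    }
}
```

These only ease the symptom; the underlying backpressure and any S3 rate limiting would still need to be diagnosed as described above.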