Key by Kafka partition / Kinesis shard


Key by Kafka partition / Kinesis shard

Yegor Roganov
Hello

To learn Flink I'm trying to build a simple application where I want to save events coming from Kinesis to S3.
I want to subscribe to each shard, and within each shard I want to batch for 30 seconds, or until 1000 events are observed. These batches should then be uploaded to S3.
What I don't understand is how to key my source on shard id, and do it in a way that doesn't induce unnecessary shuffling.
Is this possible with Flink?

Re: Key by Kafka partition / Kinesis shard

Till Rohrmann
Hi Yegor,

If you want to use Flink's keyed windowing logic, then you need to insert a keyBy/shuffle operation because Flink currently cannot simply use the partitioning of the Kinesis shards. The reason is that Flink needs to group the keys into the correct key groups in order to support rescaling of the state.
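To make the key-group argument concrete, here is a minimal sketch of how a key could map to a key group and how a key group could map to a parallel subtask. It is an illustrative model, not Flink's code: the class and method names are hypothetical, and the plain `hashCode` modulo is a stand-in for the murmur hash that Flink applies internally. The point is that the owning subtask is derived from the key group, independently of which source subtask happens to read a given shard, which is why a keyBy has to redistribute the records.

```java
/** Illustrative model of key-group assignment; names are hypothetical, not Flink's API. */
final class KeyGroupSketch {

    /**
     * Maps a key to one of maxParallelism key groups.
     * Assumption: a plain hashCode modulo stands in for Flink's internal hash.
     */
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Maps a key group to the subtask that owns it at the given parallelism. */
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }
}
```

With 128 key groups, key group 64 belongs to subtask 1 at parallelism 2 and to subtask 2 at parallelism 4. Rescaling moves whole key groups between subtasks, so keyed state can follow its keys.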

What you can do, though, is to create a custom operator or use a flatMap to build your own windowing operator. This operator could then use the partitioning of the Kinesis shards by simply collecting the events until either 30 seconds or 1000 events are observed.
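The buffering logic such an operator would need can be sketched in plain Java, independent of Flink. `ShardBatcher`, `add` and `onTimer` are hypothetical names chosen for illustration; wiring this into a flatMap or a custom operator, and checkpointing the buffer, are left out. Timestamps are passed in explicitly so the logic does not depend on a particular clock.

```java
import java.util.ArrayList;
import java.util.List;

/** Releases a batch after maxCount events or maxAgeMillis, whichever comes first. */
final class ShardBatcher<T> {
    private final int maxCount;
    private final long maxAgeMillis;
    private final List<T> buffer = new ArrayList<>();
    private long firstEventTime; // arrival time of the oldest buffered event

    ShardBatcher(int maxCount, long maxAgeMillis) {
        this.maxCount = maxCount;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Adds an event; returns the batch if a limit was reached, else an empty list. */
    List<T> add(T event, long nowMillis) {
        if (buffer.isEmpty()) {
            firstEventTime = nowMillis;
        }
        buffer.add(event);
        boolean full = buffer.size() >= maxCount;
        boolean expired = nowMillis - firstEventTime >= maxAgeMillis;
        return (full || expired) ? drain() : List.of();
    }

    /** Called periodically; returns the batch if the oldest event has waited long enough. */
    List<T> onTimer(long nowMillis) {
        boolean expired = !buffer.isEmpty() && nowMillis - firstEventTime >= maxAgeMillis;
        return expired ? drain() : List.of();
    }

    /** Copies the buffered events into a new list and clears the buffer. */
    private List<T> drain() {
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        return batch;
    }
}
```

For the use case in this thread one would create the batcher as `new ShardBatcher<>(1000, 30_000L)` and upload each non-empty batch to S3. Without a timer callback, the time limit is only checked when the next event arrives, so a quiet shard could hold a partial batch longer than 30 seconds.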

Cheers,
Till

On Wed, Apr 28, 2021 at 11:12 AM Yegor Roganov <[hidden email]> wrote:
Hello

To learn Flink I'm trying to build a simple application where I want to save events coming from Kinesis to S3.
I want to subscribe to each shard, and within each shard I want to batch for 30 seconds, or until 1000 events are observed. These batches should then be uploaded to S3.
What I don't understand is how to key my source on shard id, and do it in a way that doesn't induce unnecessary shuffling.
Is this possible with Flink?

Re: Key by Kafka partition / Kinesis shard

raghav280392
Hi Yegor

Flink's built-in triggers do not support firing on event count and elapsed time together. You can adapt the existing CountTrigger implementation to add this behavior.
You can use the attached CustomTrigger.java (a minor enhancement of CountTrigger) as is. TestWindow is the window function that receives the grouped events. Compare CountTrigger and CustomTrigger to see what changed.

Usage
stream.timeWindow(Time.seconds(10)).trigger(CustomTrigger.of(3)).apply(new TestWindow());
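The attached CustomTrigger.java is not reproduced in this thread, so the following is only a plain-Java model of the decision logic a count-or-time trigger needs. It does not use Flink's Trigger API; `CountOrTimeTriggerModel`, `Decision`, `onElement` and `onWindowEnd` are hypothetical names. In a real trigger the counter would live in Flink-managed state rather than in a field.

```java
/** Plain-Java model of a trigger that fires on element count or at window end. */
final class CountOrTimeTriggerModel {

    /** Hypothetical stand-in for a trigger result. */
    enum Decision { CONTINUE, FIRE_AND_PURGE }

    private final long maxCount;
    private long count; // elements seen since the last firing

    CountOrTimeTriggerModel(long maxCount) {
        this.maxCount = maxCount;
    }

    /** Called per element: fire once maxCount elements have accumulated. */
    Decision onElement() {
        count++;
        if (count >= maxCount) {
            count = 0;
            return Decision.FIRE_AND_PURGE;
        }
        return Decision.CONTINUE;
    }

    /** Called when the window's time is up: flush whatever is left, if anything. */
    Decision onWindowEnd() {
        boolean hasElements = count > 0;
        count = 0;
        return hasElements ? Decision.FIRE_AND_PURGE : Decision.CONTINUE;
    }
}
```

With `maxCount` set to 3, as in the usage line above, every third element fires the window early, and any remainder is emitted when the 10-second window ends.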

Thank you
Raghavendar T S

On Thu, Apr 29, 2021 at 1:04 PM Till Rohrmann <[hidden email]> wrote:
Hi Yegor,

If you want to use Flink's keyed windowing logic, then you need to insert a keyBy/shuffle operation because Flink currently cannot simply use the partitioning of the Kinesis shards. The reason is that Flink needs to group the keys into the correct key groups in order to support rescaling of the state.

What you can do, though, is to create a custom operator or use a flatMap to build your own windowing operator. This operator could then use the partitioning of the Kinesis shards by simply collecting the events until either 30 seconds or 1000 events are observed.

Cheers,
Till

Attachments: CustomTrigger.java (3K), TestWindow.java (1K)

Re: Key by Kafka partition / Kinesis shard

Yegor Roganov
In reply to this post by Till Rohrmann
Hi Till, thank you for your reply.

> What you can do, though, is to create a custom operator or use a flatMap to build your own windowing operator.
Since my stream wouldn't be keyed, does this mean that I would need to use "Managed Operator State" (i.e. non-keyed state)?


Re: Key by Kafka partition / Kinesis shard

Yegor Roganov
In reply to this post by raghav280392
Hi Raghavendar, thank you for your reply.

> stream.timeWindow(Time.seconds(10)).trigger(CustomTrigger.of(3)).apply(new TestWindow());
What would this stream be keyed on?

On Thu, Apr 29, 2021 at 11:58 AM Raghavendar T S <[hidden email]> wrote:
Hi Yegor

Flink's built-in triggers do not support firing on event count and elapsed time together. You can adapt the existing CountTrigger implementation to add this behavior.
You can use the attached CustomTrigger.java (a minor enhancement of CountTrigger) as is. TestWindow is the window function that receives the grouped events. Compare CountTrigger and CustomTrigger to see what changed.

Usage
stream.timeWindow(Time.seconds(10)).trigger(CustomTrigger.of(3)).apply(new TestWindow());

Thank you
Raghavendar T S

Re: Key by Kafka partition / Kinesis shard

Till Rohrmann
Yes, you would have to use operator state for this. The limitation is that rescaling would probably not work properly. Also, if the assignment of shards to operators changes upon failure recovery, the job can produce incorrect results (for example, some buffered elements from shard 1 might end up on an operator that now consumes shard 2).
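The rescaling problem can be illustrated with a small model of how list-style operator state might be redistributed on restore. This is a simplified sketch under the assumption that the per-subtask lists are concatenated and split evenly across the new subtasks; the class and method names are hypothetical, and Flink's exact split may differ. What matters is that the split knows nothing about which shard each buffered element came from.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of even-split redistribution of list-style operator state. */
final class ListStateRedistribution {

    /** Concatenates the per-subtask lists and splits them evenly among newParallelism subtasks. */
    static <T> List<List<T>> redistribute(List<List<T>> perSubtask, int newParallelism) {
        List<T> all = new ArrayList<>();
        for (List<T> part : perSubtask) {
            all.addAll(part);
        }
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            // Contiguous slice for subtask i; slices cover the whole list without overlap.
            int from = (int) ((long) all.size() * i / newParallelism);
            int to = (int) ((long) all.size() * (i + 1) / newParallelism);
            result.add(new ArrayList<>(all.subList(from, to)));
        }
        return result;
    }
}
```

If two subtasks each hold two buffered events from their own shard and the job is restored with three subtasks, the model hands one event from the first shard to each of the first two subtasks and both events from the second shard to the third. The buffered events no longer line up with whichever shards those subtasks read after the restore.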

Cheers,
Till

On Thu, Apr 29, 2021 at 2:51 PM Yegor Roganov <[hidden email]> wrote:
Hi Raghavendar, thank you for your reply.

> stream.timeWindow(Time.seconds(10)).trigger(CustomTrigger.of(3)).apply(new TestWindow());
What would this stream be keyed on?
