Keyed stream, parallelism, load balancing and ensuring that the same key goes to the same Task Manager and task slot


Keyed stream, parallelism, load balancing and ensuring that the same key goes to the same Task Manager and task slot

KristoffSC
Hi community,
I'm trying to build a PoC pipeline for my project and I have a few questions
regarding load balancing between task managers and ensuring that keyed
stream events for the same key will go to the same Task Manager (hence the
same task slot).

Let's assume that we have 3 task managers with 3 task slots each, which gives
us 9 task slots in total.
The source is a Kafka topic with N partitions. Events are "linked" with each
other by a transactionId (long) field, so they can be keyed by this field.
Events for a particular transactionId can be spread across many partitions
(we don't have control over this).

The pipeline is:
1. Kafka Source -> produces RawEvents (map operator).
2. Enrichment with an AsyncFunction (simple DB/cache call) that produces
EnrichedEvents (map operator).
3. Key EnrichedEvents by tradeId, buffer events for some time, sort them by
sequenceNumber (window aggregation) and emit a new event based on those:
N sorted EnrichedEvents produce one TransactionEvent for this
transactionId.
4. Sink TransactionEvents
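
To make it a bit more concrete, this is roughly what I have in mind (just a
sketch, not tested; the event classes and the mapper/enrichment/window/sink
functions are placeholders for my actual code):

Properties kafkaProps = new Properties();   // bootstrap.servers, group.id, ...

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// 1. Kafka source -> RawEvents (map operator)
DataStream<RawEvent> rawEvents = env
        .addSource(new FlinkKafkaConsumer<>("transactions", new SimpleStringSchema(), kafkaProps))
        .map(new ToRawEventMapper());                    // MapFunction<String, RawEvent>

// 2. Async enrichment (simple DB/cache call)
DataStream<EnrichedEvent> enrichedEvents = AsyncDataStream.unorderedWait(
        rawEvents,
        new EnrichmentAsyncFunction(),                   // AsyncFunction<RawEvent, EnrichedEvent>
        1, TimeUnit.SECONDS,                             // timeout per async request
        100);                                            // max in-flight requests

// 3. Key by tradeId, buffer for some time, sort by sequenceNumber,
//    emit one TransactionEvent per key and window
DataStream<TransactionEvent> transactionEvents = enrichedEvents
        .keyBy(e -> e.getTradeId())
        .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))   // example buffer time
        .process(new SortBySequenceAndMergeFunction());                // a ProcessWindowFunction

// 4. Sink
transactionEvents.addSink(new TransactionEventSink());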

Requirements:
1. Have high task slot utilization (a low number of idle/unused task slots).
2. EnrichedEvents for the same transactionId should go to the same task slot
(hence the same Task Manager).

Questions:
How can this be achieved?
How should the parallelism value for each operator be set?

Note:
I could probably already key the original RawEvents by transactionId.

Thanks,
Krzysztof



Re: Keyed stream, parallelism, load balancing and ensuring that the same key goes to the same Task Manager and task slot

KristoffSC
Hi :)
Any thoughts about this?



RE: Keyed stream, parallelism, load balancing and ensuring that the same key goes to the same Task Manager and task slot

Theo
In reply to this post by KristoffSC
Hi Krzysztof,

You can just key your stream by transaction id. If you have lots of
different transaction ids, you can expect the load to be evenly
distributed. All events with the same key (==transaction id) will be
processed by the same task slot.

If you only have a few Kafka partitions, you could key by transaction id
as early as possible in order to fully utilize your cluster. Remember,
however, that each keyBy will cause a network shuffle, so it's probably
not worth it to first key by transaction id, then by tradeId, and afterwards
again by transaction id.
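
In code it could boil down to a single keyBy on the transaction id, something
like this (just a sketch reusing the placeholder names from your mail, not
tested):

// One keyBy is enough: it shuffles the EnrichedEvents so that all events with
// the same transactionId end up in the same task slot, and with many distinct
// transaction ids the load spreads roughly evenly over the parallel subtasks.
DataStream<TransactionEvent> transactionEvents = enrichedEvents
        .keyBy(e -> e.getTransactionId())
        .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
        .process(new SortBySequenceAndMergeFunction());

// If the topic has only a few partitions, the keyBy could instead be moved in
// front of the enrichment, so that the (expensive) async calls also run on all
// task slots, at the cost of doing the shuffle earlier in the pipeline.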

Best regards
Theo
