Hi community,
I'm trying to build a PoC pipeline for my project and I have few questions regarding load balancing between task managers and ensuring that keyed stream events for the same key will go to the same Task Manager (hence the same task slot). Lets assume that we have 3 task managers, 3 task slot each. So it gives us 9 task slots in total. The source is a Kafka topic with N partitions. Events are "linked" with each other by transactionId (long) field. So they can be keyed by this field. Events for particular transactionId can be spanned across many partitions (we don't have control over this). The pipeline is: 1. Kafka Source -> produces RawEvents (map operator). 2. Enrichment with AsuncFuntion(simple DB/cache call) produces EnrichedEvents with map operator. 3. Key EnrichedEvents by tradeId, buffer events for some time, sort them by sequenceNumber (Window aggregation) and emit a new event based on those. N sorted EnrichedEvents produces one TransactionEvent for this transactionId. 4. Sink TransactionEvents Requirements: 1. Have high task slot utilization (Low number of idle/un-addressed task slots). 2. EnrichedEvents for the same transactionId should go to the same TaskSlot (hence the same TaskManager). Question: How this can be achieved? How parallelism value for each operator should be set? Note: Probably I can already key the original RawEvents on transactionId. Thanks, Krzysztof -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Hi :)
Any thoughts about this? -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
In reply to this post by KristoffSC
Hi Krzysztof,
You can just key your stream by transaction id. If you have lots of different transaction ids, you can expect the load to be evenly distributed. All events with the same key (==transaction id) will be processed by the same task slot. If you only have a few kafka partitions, you could key by transaction id as early as possible in order to fully utilize your cluster. Remember, however, that each keyby will cause a network shuffle, so it's probably not worth it to fist key by transaction id, then by traded, and afterwards again by transaction id. Best regards Theo -----Original Message----- From: KristoffSC <[hidden email]> Sent: Dienstag, 17. Dezember 2019 23:35 To: [hidden email] Subject: Keyed stream, parallelism, load balancing and ensuring that the same key go to the same Task Manager and task slot Hi community, I'm trying to build a PoC pipeline for my project and I have few questions regarding load balancing between task managers and ensuring that keyed stream events for the same key will go to the same Task Manager (hence the same task slot). Lets assume that we have 3 task managers, 3 task slot each. So it gives us 9 task slots in total. The source is a Kafka topic with N partitions. Events are "linked" with each other by transactionId (long) field. So they can be keyed by this field. Events for particular transactionId can be spanned across many partitions (we don't have control over this). The pipeline is: 1. Kafka Source -> produces RawEvents (map operator). 2. Enrichment with AsuncFuntion(simple DB/cache call) produces EnrichedEvents with map operator. 3. Key EnrichedEvents by tradeId, buffer events for some time, sort them by sequenceNumber (Window aggregation) and emit a new event based on those. N sorted EnrichedEvents produces one TransactionEvent for this transactionId. 4. Sink TransactionEvents Requirements: 1. Have high task slot utilization (Low number of idle/un-addressed task slots). 2. EnrichedEvents for the same transactionId should go to the same TaskSlot (hence the same TaskManager). Question: How this can be achieved? How parallelism value for each operator should be set? Note: Probably I can already key the original RawEvents on transactionId. Thanks, Krzysztof -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ |
Free forum by Nabble | Edit this page |