Hi All, I have a job as attached.Since the numberof slots is set to 1 per TM, I would assume that the job would be processed in parallel in 2 different TMs and that each consumer in each TM is connected to 1 partition of the topic. This therefore should have kept the overall processing time the same or less !!! The co-flatmap connects the 2 streams & uses ValueState (checkpointed in FS). I think this is distributed among the TMs. My understanding is that the search of values state could be costly between TMs. Do you sense something wrong here? Best Regards CVP plan.jpg (15K) Download Attachment |
Hi All, Any updates on this?On Thu, Jan 5, 2017 at 1:21 PM, Chakravarthy varaga <[hidden email]> wrote:
|
Just noticed there are only two partitions per topic. Regardless of how large parallelism set. Only two of those will get partition assigned at most. Sent from my iPhone
|
Hi! You are right, parallelism 2 should be faster than parallelism 1 ;-) As ChenQin pointed out, having only 2 Kafka Partitions may prevent further scaleout. Few things to check: - How are you connecting the FlatMap and CoFlatMap? Default, keyBy, broadcast? - Broadcast for example would multiply the data based on parallelism, can lead to slowdown when saturating the network. - Are you using the standard Kafka Source (which Kafka version)? - Is there any part in the program that multiplies data/effort with higher parallelism (does the FlatMap explode data based on parallelism)? Stephan On Fri, Jan 6, 2017 at 7:27 PM, Chen Qin <[hidden email]> wrote:
|
Hi Stephen . Kafka version is: 0.9.0.1 the connector is flinkconsumer09 . The flatmap n coflatmap are connected by keyBy . No data is broadcasted and the data is not exploded based on the parallelism Cvp On 6 Jan 2017 20:16, "Stephan Ewen" <[hidden email]> wrote:
|
Anything that I could check or collect for you for investigation ? On Sat, Jan 7, 2017 at 1:35 PM, Chakravarthy varaga <[hidden email]> wrote:
|
Hi Guys, I understand that you are extremely busy but any pointers here is highly appreciated. I can proceed forward towards concluding the activity !On Mon, Jan 9, 2017 at 11:43 AM, Chakravarthy varaga <[hidden email]> wrote:
|
Hi CVP, changing the parallelism from 1 to 2 with every TM having only one slot will inevitably introduce another network shuffle operation between the sources and the keyed co flat map. This might be the source of your slow down, because before everything was running on one machine without any network communication (apart from reading from Kafka). Do you also observe a further degradation when increasing the parallelism from 2 to 4, for example (given that you've increased the number of topic partitions to at least the maximum parallelism in your topology)? Cheers, Till On Tue, Jan 10, 2017 at 11:37 AM, Chakravarthy varaga <[hidden email]> wrote:
|
Hi Tim, Thanks for your response. The results are the same. kafka partitions = 4 per topic parallesim for job = 3 I'm wondering if parallelism is enabled on multiple sources irrespective of the partition size. What I did is to enable 1 partition for the 2nd topic (1 event/hour) and 4 partitions for 100K events topic. And deployed a 3 parallelism job and the results are the same... On Wed, Jan 11, 2017 at 1:11 PM, Till Rohrmann <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |