Unbalanced job scheduling


AndreaKinn
Hi all,
I'd like to describe my program flow.

I have the following operators:

kafka-source -> timestamp-extractor -> map -> keyBy -> window -> apply ->
LEARN -> SELECT -> process -> cassandra-sink

The LEARN and SELECT operators belong to an external library supported by
Flink. LEARN is a very heavy operation compared to the other operators.

Unfortunately LEARN has a max parallelism of 1, so if I have a cluster of 2
TMs with 1 slot each and I set parallelism = 2, one TM runs a parallel
instance of every operator plus the single instance of LEARN, while the
other TM runs just the second parallel instance of every operator (clearly
there are no more instances of LEARN).
That's OK and I have no problem understanding it.

*** The problem:
Actually I have 2 identical flows like this, because I have two sensor
streams, so I really have 2 LEARN operators corresponding to two independent
streams.

However, I noticed that even in this case one TM takes the parallel
instances of all the operators AND the single instances of both LEARN-1 and
LEARN-2, while the other TM runs just the second parallel instances of all
the operators (no LEARN instances there).
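
The placement described above can be reproduced with a small toy model. This
is only a sketch under a loud assumption: it is NOT Flink's scheduler, it
simply assumes that all operators share one slot sharing group and that
subtask i of every operator lands in shared slot i. The class name, the
method name and the operator names "flow-1" / "flow-2" (standing for the
parallelism-2 operators of each flow) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Flink code: every operator is in the same slot sharing
// group, and subtask i of each operator is assumed to go to shared slot i.
class DefaultSlotSharingSketch {

    /** Returns, for each shared slot, the list of subtasks placed in it. */
    static List<List<String>> place(Map<String, Integer> parallelismByOperator) {
        // With a single sharing group, the number of slots needed equals
        // the highest operator parallelism.
        int slotCount = 0;
        for (int parallelism : parallelismByOperator.values()) {
            slotCount = Math.max(slotCount, parallelism);
        }

        List<List<String>> slots = new ArrayList<>();
        for (int i = 0; i < slotCount; i++) {
            slots.add(new ArrayList<>());
        }

        // Assumption: subtask i of every operator goes to slot i.
        for (Map.Entry<String, Integer> op : parallelismByOperator.entrySet()) {
            for (int i = 0; i < op.getValue(); i++) {
                slots.get(i).add(op.getKey() + "[" + i + "]");
            }
        }
        return slots;
    }

    public static void main(String[] args) {
        Map<String, Integer> job = new LinkedHashMap<>();
        job.put("flow-1", 2);   // all parallelism-2 operators of the first flow
        job.put("LEARN-1", 1);  // max parallelism 1
        job.put("flow-2", 2);   // all parallelism-2 operators of the second flow
        job.put("LEARN-2", 1);  // max parallelism 1

        List<List<String>> slots = place(job);
        for (int i = 0; i < slots.size(); i++) {
            System.out.println("slot " + i + ": " + slots.get(i));
        }
    }
}
```

Under this assumption both LEARN subtasks have index 0, so both end up in
slot 0 together with the first subtask of everything else, and slot 1 only
gets the second subtasks. With one slot per TM, that is the imbalance
described above.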

Since LEARN is a heavy operator, this leads to a very unbalanced load on the
cluster, so much so that the first TM is killed during execution (judging
from the logs, it probably runs out of memory; the sink is also very slow,
so LEARN seems to be a bottleneck).

Honestly, I can't understand why Flink doesn't assign one LEARN operator to
one TM and the other LEARN operator to the other TM.
As it is, I can't stress the cluster properly, because I will always have
one TM super busy and the other one quite "free" and unstressed.

Bye,
Andrea



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Unbalanced job scheduling

Fabian Hueske-2
Hi Andrea,

Have you looked into assigning slot sharing groups [1]?

Best, Fabian

2017-10-16 18:01 GMT+02:00 AndreaKinn <[hidden email]>:


Re: Unbalanced job scheduling

AndreaKinn
Yes, I considered them, but unfortunately I can't call the
setSlotSharingGroup method on the LEARN and SELECT operators (it doesn't
appear in the method list).

I can call it on the other operators, but this means that the two LEARN
operators will still be confined to the same "unnamed" slot.




Re: Unbalanced job scheduling

Fabian Hueske-2
Setting the slot sharing group is Flink's mechanism to solve this issue.
I'd consider this a limitation of the library that provides LEARN and SELECT.

Did you consider opening an issue with (or contributing to) the library to support setting the slot sharing group?
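
To make the effect of slot sharing groups concrete, here is a rough sketch of how the number of required slots changes once the LEARN operators get groups of their own. In the DataStream API the group is set with slotSharingGroup(String) on an operator, and operators without an explicit group end up in the group named "default". The sketch itself is plain Java, not Flink code: the class, the Op type and the group names "learn-1" / "learn-2" are hypothetical, and it relies on the rule that a job needs, per sharing group, as many slots as the highest parallelism in that group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Flink code: counts the slots a job needs when each slot
// sharing group requires as many slots as its highest operator parallelism.
class SlotSharingGroupSketch {

    /** Hypothetical operator description: name, parallelism, sharing group. */
    static final class Op {
        final String name;
        final int parallelism;
        final String group;

        Op(String name, int parallelism, String group) {
            this.name = name;
            this.parallelism = parallelism;
            this.group = group;
        }
    }

    /** Sum over all groups of the maximum parallelism within each group. */
    static int requiredSlots(List<Op> ops) {
        Map<String, Integer> maxPerGroup = new HashMap<>();
        for (Op op : ops) {
            maxPerGroup.merge(op.group, op.parallelism, Integer::max);
        }
        int total = 0;
        for (int max : maxPerGroup.values()) {
            total += max;
        }
        return total;
    }

    public static void main(String[] args) {
        // Everything in the default group, as in the original job.
        List<Op> shared = new ArrayList<>();
        shared.add(new Op("flow-1", 2, "default"));
        shared.add(new Op("LEARN-1", 1, "default"));
        shared.add(new Op("flow-2", 2, "default"));
        shared.add(new Op("LEARN-2", 1, "default"));
        System.out.println(requiredSlots(shared));    // prints 2

        // Each LEARN operator isolated in its own (hypothetical) group.
        List<Op> isolated = new ArrayList<>();
        isolated.add(new Op("flow-1", 2, "default"));
        isolated.add(new Op("LEARN-1", 1, "learn-1"));
        isolated.add(new Op("flow-2", 2, "default"));
        isolated.add(new Op("LEARN-2", 1, "learn-2"));
        System.out.println(requiredSlots(isolated));  // prints 4
    }
}
```

Two consequences follow from this sketch. A cluster of 2 TMs with 1 slot each would no longer be enough once the LEARN operators are isolated, so each TM would need more slots. Also, separate groups only keep the two LEARN operators out of the same slot; whether they land on different TMs still depends on how the slots are distributed.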

2017-10-17 9:38 GMT+02:00 AndreaKinn <[hidden email]>:


Re: Unbalanced job scheduling

AndreaKinn
I'm in contact with the founder of the library to deal with the problem. I'm
also trying to understand how to implement slotSharingGroups myself.


