Flink 1.3 -> 1.4 Kafka state migration issue

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink 1.3 -> 1.4 Kafka state migration issue

Gyula Fóra
Hi,

Is it possible that the Kafka partition assignment logic has changed between Flink 1.3  and 1.4? I am trying to migrate some jobs using Kafka 0.8 sources and about half my jobs lost offset state for some partitions (but not all partitions). Jobs with parallelism 1 dont seem to be affected...

Any ideas?

Gyula
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.3 -> 1.4 Kafka state migration issue

Gyula Fóra
Migrating the jobs by setting the sources to parallelism = 1 and then scale back up after migration seems to be a good workaround, but I am wondering if something I do made this happen or this is a bug.

Gyula Fóra <[hidden email]> ezt írta (időpont: 2018. jan. 8., H, 14:46):
Hi,

Is it possible that the Kafka partition assignment logic has changed between Flink 1.3  and 1.4? I am trying to migrate some jobs using Kafka 0.8 sources and about half my jobs lost offset state for some partitions (but not all partitions). Jobs with parallelism 1 dont seem to be affected...

Any ideas?

Gyula
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.3 -> 1.4 Kafka state migration issue

Tzu-Li (Gordon) Tai
Hi Gyula,

Is your 1.3 savepoint from Flink 1.3.1 or 1.3.0?
In those versions, we had a critical bug that caused duplicate partition assignments in corner cases, so the assignment logic was altered from 1.3.1 to 1.3.2 (and therefore also 1.4.0).

If you indeed was using 1.3.1 or 1.3.0, and you are certain that the savepoint does not contain duplicate partition assignments caused by the bug, then yes restoring with DOP 1 and then rescaling again is a good workaround.

Please see the 1.3.2 release announcement [1] for details.

Best,
Gordon

[1] http://flink.apache.org/news/2017/08/05/release-1.3.2.html


On Jan 8, 2018 6:57 AM, "Gyula Fóra" <[hidden email]> wrote:
Migrating the jobs by setting the sources to parallelism = 1 and then scale back up after migration seems to be a good workaround, but I am wondering if something I do made this happen or this is a bug.

Gyula Fóra <[hidden email]> ezt írta (időpont: 2018. jan. 8., H, 14:46):
Hi,

Is it possible that the Kafka partition assignment logic has changed between Flink 1.3  and 1.4? I am trying to migrate some jobs using Kafka 0.8 sources and about half my jobs lost offset state for some partitions (but not all partitions). Jobs with parallelism 1 dont seem to be affected...

Any ideas?

Gyula

Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.3 -> 1.4 Kafka state migration issue

Gyula Fóra
Hi,
Thanks Gordon, should have read the announcement :)

This might indeed be the case here, I will just use the workaround. At least this is a known issue, almost got a heart attack today :D

Cheers,
Gyula

Tzu-Li (Gordon) Tai <[hidden email]> ezt írta (időpont: 2018. jan. 8., H, 17:56):
Hi Gyula,

Is your 1.3 savepoint from Flink 1.3.1 or 1.3.0?
In those versions, we had a critical bug that caused duplicate partition assignments in corner cases, so the assignment logic was altered from 1.3.1 to 1.3.2 (and therefore also 1.4.0).

If you indeed was using 1.3.1 or 1.3.0, and you are certain that the savepoint does not contain duplicate partition assignments caused by the bug, then yes restoring with DOP 1 and then rescaling again is a good workaround.

Please see the 1.3.2 release announcement [1] for details.

Best,
Gordon




On Jan 8, 2018 6:57 AM, "Gyula Fóra" <[hidden email]> wrote:
Migrating the jobs by setting the sources to parallelism = 1 and then scale back up after migration seems to be a good workaround, but I am wondering if something I do made this happen or this is a bug.

Gyula Fóra <[hidden email]> ezt írta (időpont: 2018. jan. 8., H, 14:46):
Hi,

Is it possible that the Kafka partition assignment logic has changed between Flink 1.3  and 1.4? I am trying to migrate some jobs using Kafka 0.8 sources and about half my jobs lost offset state for some partitions (but not all partitions). Jobs with parallelism 1 dont seem to be affected...

Any ideas?

Gyula