Flink Migration

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink Migration

Navneeth Krishnan
Hi All,

We are currently on a very old version of flink 1.4.0 and it has worked pretty well. But lately we have been facing checkpoint timeout issues. We would like to minimize any changes to the current pipelines and go ahead with the migration. With that said our first pick was to migrate to 1.5.6 and later migrate to a newer version.

Do you guys think a more recent version like 1.6 or 1.7 might work? We did try 1.8 but it requires some changes in the pipelines. 

When we tried 1.5.6 with docker compose we were unable to get the task manager attached to jobmanager. Are there some specific configurations required for newer versions?

Logs:

8-28 07:36:30.834 [main] INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics

2020-08-28 07:36:30.853 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Retrieved new target address jobmanager/172.21.0.8:6123.

2020-08-28 07:36:31.279 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:31.280 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.281 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.281 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.282 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.283 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.284 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.684 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:31.686 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.687 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.688 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.688 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.689 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.690 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:32.490 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:32.491 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.493 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.494 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.495 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:32.496 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.497 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:34.099 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:34.100 [main] INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - TaskManager will use hostname/address 'e6f9104cdc61' (172.21.0.9) for communication.


Flink Conf

jobmanager.rpc.address: jobmanager

rest.address: jobmanager


Thanks

Reply | Threaded
Open this post in threaded view
|

Re: Flink Migration

Yun Tang
Hi Navneeth

First of all, I suggest to upgrade Flink version to latest version.
And you could refer here [1] for the savepoint compatibility when upgrading Flink.

For the problem that cannot connect address, you could login your pod and run 'nslookup jobmanager' to see whether the host could be resolved.
You can also check the service of 'jobmanager' whether work as expected via 'kubectl get svc' .


Best
Yun Tang


From: Navneeth Krishnan <[hidden email]>
Sent: Friday, August 28, 2020 17:00
To: user <[hidden email]>
Subject: Flink Migration
 
Hi All,

We are currently on a very old version of flink 1.4.0 and it has worked pretty well. But lately we have been facing checkpoint timeout issues. We would like to minimize any changes to the current pipelines and go ahead with the migration. With that said our first pick was to migrate to 1.5.6 and later migrate to a newer version.

Do you guys think a more recent version like 1.6 or 1.7 might work? We did try 1.8 but it requires some changes in the pipelines. 

When we tried 1.5.6 with docker compose we were unable to get the task manager attached to jobmanager. Are there some specific configurations required for newer versions?

Logs:

8-28 07:36:30.834 [main] INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics

2020-08-28 07:36:30.853 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Retrieved new target address jobmanager/172.21.0.8:6123.

2020-08-28 07:36:31.279 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:31.280 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.281 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.281 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.282 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.283 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.284 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.684 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:31.686 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.687 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.688 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.688 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.689 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.690 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:32.490 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:32.491 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.493 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.494 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.495 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:32.496 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.497 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:34.099 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:34.100 [main] INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - TaskManager will use hostname/address 'e6f9104cdc61' (172.21.0.9) for communication.


Flink Conf

jobmanager.rpc.address: jobmanager

rest.address: jobmanager


Thanks

Reply | Threaded
Open this post in threaded view
|

Re: Flink Migration

Arvid Heise-3
Hi Navneeth,

if everything worked before and you just experience later issues, it would be interesting to know if your state size grew over time. An application over time usually needs gradually more resources. If the user base of your company grows, so does the amount of messages (be it click stream, page impressions, or any kind of transactions). Often time, also the operator state grows. Sometimes, it's just that the events themselves become more complex and thus you need more overall bandwidth. This means that from time to time, you need to increase the memory of Flink (for state) or the number of compute nodes (to handle more events). In the same way, you need to make sure that your sink scales as well.

If you fail to keep up with the demand, the application gradually becomes more unstable (for example by running out of memory repeatedly). I'm suspecting that this may happen in your case.

First, it's important to understand what the bottleneck is. Web UI should help to narrow it down quickly. You can also share your insights and we can discuss further strategies.

If nothing works out, I also recommend an upgrade. Your best migration path would be to use Flink 1.7, which should allow a smoother transition for state [1]. I'd guess that afterwards, you should be able to migrate to 1.11 with almost no code changes.


On Fri, Aug 28, 2020 at 1:43 PM Yun Tang <[hidden email]> wrote:
Hi Navneeth

First of all, I suggest to upgrade Flink version to latest version.
And you could refer here [1] for the savepoint compatibility when upgrading Flink.

For the problem that cannot connect address, you could login your pod and run 'nslookup jobmanager' to see whether the host could be resolved.
You can also check the service of 'jobmanager' whether work as expected via 'kubectl get svc' .


Best
Yun Tang


From: Navneeth Krishnan <[hidden email]>
Sent: Friday, August 28, 2020 17:00
To: user <[hidden email]>
Subject: Flink Migration
 
Hi All,

We are currently on a very old version of flink 1.4.0 and it has worked pretty well. But lately we have been facing checkpoint timeout issues. We would like to minimize any changes to the current pipelines and go ahead with the migration. With that said our first pick was to migrate to 1.5.6 and later migrate to a newer version.

Do you guys think a more recent version like 1.6 or 1.7 might work? We did try 1.8 but it requires some changes in the pipelines. 

When we tried 1.5.6 with docker compose we were unable to get the task manager attached to jobmanager. Are there some specific configurations required for newer versions?

Logs:

8-28 07:36:30.834 [main] INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics

2020-08-28 07:36:30.853 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Retrieved new target address jobmanager/172.21.0.8:6123.

2020-08-28 07:36:31.279 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:31.280 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.281 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.281 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.282 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.283 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.284 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.684 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:31.686 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.687 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.688 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.688 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:31.689 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:31.690 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:32.490 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:32.491 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.493 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.494 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.495 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:32.496 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)

2020-08-28 07:36:32.497 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)

2020-08-28 07:36:34.099 [main] INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address jobmanager/172.21.0.8:6123

2020-08-28 07:36:34.100 [main] INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - TaskManager will use hostname/address 'e6f9104cdc61' (172.21.0.9) for communication.


Flink Conf

jobmanager.rpc.address: jobmanager

rest.address: jobmanager


Thanks



--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng